CHAPTER 1: Test Measurement Principles

This chapter explains basic test measurement standards and related educational principles for determining whether tests used as part of making high-stakes decisions for students provide accurate and fair information. As explained in Chapter Two below, federal court decisions have been informed and guided by professional test measurement standards and principles. Understanding professional test measurement standards can assist in efforts to use tests wisely and to comply with federal nondiscrimination laws.66 This chapter is intended as a helpful discussion of how to understand test measurement concepts and their use. These are not specific legal requirements, but rather are foundations for understanding appropriate test use.

Educational institutions use tests to accomplish specific purposes based on their educational goals, including making placement, promotion, graduation, admissions, and other decisions. It is only after educational institutions have determined the underlying goal they want to accomplish that they can identify the types of information that will best inform their decision-making. That information may include test results as well as other relevant measures that can effectively, accurately, and fairly address the purposes and goals specified by the institutions.67 As stated in the Joint Standards, "When interpreting and using scores about individuals or groups of students, consideration of relevant collateral information can enhance the validity of the interpretation, by providing corroborating evidence or evidence that helps explain student performance. . . . As the stakes of testing increase for individual students, the importance of considering additional evidence to document the validity of score interpretations and the fairness in testing increases accordingly."68

Although this guide focuses on the use of tests, policy-makers and educators need to consider the soundness and relevance of the entire high-stakes decision-making process, including other information used in conjunction with test results.69

In using tests as part of high-stakes decision-making, educational institutions should ensure that the test will provide accurate results that are valid, reliable, and fair for all test takers. This includes obtaining adequate evidence of test quality about the current test being proposed and its use, evaluating the evidence, and ensuring that appropriate test use is based on adequate evidence.70 When test results are used to make high-stakes decisions about student promotion or graduation, educational institutions should provide students with a reasonable number of opportunities to demonstrate mastery and ensure that there is evidence available that students have had an adequate opportunity to learn the material being tested.71

I. Key Considerations in Test Use

This section addresses the fundamental concepts of test validity and reliability. It also discusses issues associated with ensuring fairness in the meaning of test scores, and issues related to using appropriate cut scores. Test developers and users, as appropriate, determine adequate validity and reliability, ensure fairness, and determine where to set and how to use cut scores appropriately for all students by accumulating evidence of test quality from relevant groups of test takers.

A. Validity

Test validity refers to a determination of how well a test actually measures what it says it measures. The Joint Standards defines validity as "[t]he degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test."72 The demonstration of validity is multifaceted and must always be determined within the context of the specific use of a test. In order to promote readability, the discussion on validity presented here is meant to reflect this complex topic in an accurate, but concise and user-friendly way. The Joint Standards identifies and discusses in detail principles related to determining the validity of test results within the context of their use, and readers are encouraged to review the Joint Standards, Chapter 1, Validity, for additional, relevant discussion.73

There are three central points to keep in mind regarding validity:

  • The focus of validity is not really on the test itself, but on the validity of the inferences drawn from the test results for a given use.
  • All validity is really a form of "construct validity."
  • In validating the inferences of the test results, it is important to consider the consequences of the test's interpretation and use.

1. Validity of the Inferences Drawn from the Scores

It is not the test that is validated per se, but the inferences or meaning derived from the test scores for a given use; that is, for a specific type of purpose, in a specific type of situation, and with specific groups of students. The meaning of test scores will differ based on such factors as how the test is designed, the types of questions that are asked, and the documentation that supports how all groups of students are interpreting what the test is asking and how effectively their performance can be generalized beyond the test.

For instance, in one case, the educational institution may want to evaluate how well students can analyze complex issues and evaluate implications in history. For a given amount of test time, it would want to use a test that measures the ability of students to think deeply about a few selected history topics. The meaning of the scores should reflect this purpose and the limits of the range of topics being measured on the test. In another case, the institution may want to assess how well students know a range of facts about a wide variety of historical events. The institution would want to use a test that measures a broad range of knowledge about many different occurrences in history. The inferences drawn from the scores should be validated to determine how well they measure students' knowledge of a broad range of historical facts, but not necessarily how well students analyze complex issues in history.

2. Construct Validity

Construct validity refers to the degree to which the scores of test takers accurately reflect the constructs a test is attempting to measure. The Joint Standards defines a construct as "the concept or the characteristic that a test is designed to measure."74 Test scores and their inferences are validated to measure one or more constructs, which together comprise a particular content domain.75 In K-12 education, these domains are often codified in state or district content standards covering various subject areas. For instance, the domain of mathematics as described in the state's elementary mathematics content standards may involve the constructs of mathematical problem-solving and knowledge of number systems. Items may be selected for a test that sample from this domain, and should be properly representative of the constructs identified within it. In that way, the meaning of the test scores should accurately reflect the knowledge and skills defined in the mathematics content standards domain.

Validity should be viewed as the overarching, integrative evaluation of the degree to which all accumulated evidence supports the intended interpretation of the test scores for a proposed purpose.76 This unitary and comprehensive concept of validity is referred to as "construct validity." Different sources of validity evidence may illuminate different aspects of validity, but they do not represent distinct types of validity.77

Therefore, "construct validity" is not just one of the many types of validity; it is validity. The process of test validation "logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation for the proposed use."78 Demonstrating construct validity then means gathering a variety of types of evidence to support the intended interpretations and uses of test scores. "The decision about what types of evidence are important for validation in each instance can be clarified by developing a set of propositions that support the proposed interpretation for the particular purpose of testing."79 These propositions provide details that support the claims that, for a proposed use, the test validly measures particular skills and knowledge of the students being tested. For instance, if a test is designed to measure students' learning of material described in a district's science content standards, evidence that the test is properly aligned with these standards for the types of students taking the test would be a crucial component of the test's validity. When such evidence is in place, users of the test can correctly interpret high scores as indicators that students have learned the designated material and low scores as evidence that they have not.

All validity evidence and the interpretation of the evidence are focused on the basic question: Is the test measuring the concept, skill, or trait in question? Is it, for example, really measuring mathematical reasoning or reading comprehension for the types of students that are being tested? A variety of types of evidence can be used to answer this question, none of which provides a simple yes or no answer. The exact nature of the types of evidence that need to be accumulated is directly related to the intended use of the test, which includes evidence regarding the skills and knowledge being measured, evidence documenting validity for the stated purpose, and evidence of validity for all groups of students taking the test.80

For instance, an educational institution may want to use a test to help make promotion decisions. It may also want to use a test to place students in the appropriate sequence of courses. In each situation, the types of validity evidence an institution would expect to see would depend on how the test is being used.

In making promotion decisions, the test should reflect content the student has learned. Appropriate validation would include adequate evidence that the test is measuring the constructs identified in the curriculum, and that the inferences of the scores accurately reflect the intended constructs for all test takers. Validation of the decision process involving the use of the test would include adequate evidence that low scores reflect lack of knowledge of students after they have been taught the material, rather than lack of exposure to the curriculum in the first place.

In making placement decisions, on the other hand, the test may not need to measure content that the student has already learned. Rather, at least in part, the educational institution may want the test to measure aptitude for the future learning of knowledge or skills that have been identified as necessary to complete a course sequence. Appropriate validation would include documentation of the relationship between what constructs are being measured in the test and what knowledge and skills are actually needed in the future placements. Evidence should also provide documentation that scores are not significantly confounded by other factors irrelevant to the knowledge and skills the test is intending to measure.
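To make the idea of criterion-related evidence for a placement use concrete, the short sketch below, written in Python with the numpy library (neither is part of the original guide), computes the correlation between hypothetical placement test scores and a later outcome such as end-of-course grades. All numbers are invented purely for illustration; an actual validation study would rely on documented samples, multiple criteria, and the subgroup analyses discussed later in this chapter.

import numpy as np

# Hypothetical placement test scores and later end-of-course grades for the
# same ten students (illustrative numbers only).
placement_scores = np.array([420, 455, 470, 500, 515, 530, 560, 585, 600, 640], dtype=float)
course_grades = np.array([1.8, 2.1, 2.0, 2.5, 2.4, 2.9, 3.1, 3.0, 3.4, 3.7])

# One piece of criterion-related validity evidence: the test-criterion correlation.
r = np.corrcoef(placement_scores, course_grades)[0, 1]
print(f"Observed test-criterion correlation: r = {r:.2f}")

A single correlation of this kind is never sufficient by itself; it is one element of the accumulated evidence described in the text and says nothing, on its own, about whether the scores are confounded by construct irrelevant factors.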

Institutions often think about using the same test for two or more purposes. This is appropriate as long as the validity evidence properly supports the use of the test for each purpose, and properly supports that the inferences of the results accurately reflect what the test is measuring for all students taking the test.81

The empirical evidence related to the various aspects of construct validity is collected throughout test development, during test construction, and after the test is completed. It is important for educators and policy-makers to understand and expect that the accumulated evidence spans the range of test development and implementation. There is not just one set of documentation collected at one point in time.82

When the empirical database is large and includes results from a number of studies related to a given purpose, situation, and type of test takers, it may be appropriate to generalize validity findings beyond validity data gathered for one particular test use. That is, it may be appropriate to use evidence collected in one setting when determining the validity of the meaning of the test scores for a similar use. If the accumulated validity evidence for a particular purpose, situation, or subgroup is small, or features of the proposed use of the test differ markedly from an adequate amount of validity evidence already collected, evidence from this particular type of test use will generally need to be compiled.83 Regardless of where the evidence is collected, educational institutions should expect adequate documentation of construct validity based on needs defined by the particular purposes and populations for which a test is being used.

When considering the types of construct validity evidence to collect, the Joint Standards emphasizes that it is important to guard against the two major sources of validity error. This error can distort the intended meaning of scores for particular groups of students, situations, or purposes.84

One potential source of error is the omission of some important aspects of the intended construct being tested. This is called construct underrepresentation.85 An example would be a test that is being used to measure English language proficiency. When the institution has defined English language proficiency as including specific skills in listening, speaking, reading, and writing the English language, and wants to use a test which measures these aspects, construct underrepresentation would occur if the test only measured the reading skills.

The other potential source of error occurs when a test measures material that is extraneous to the intended construct, confounding the ability of the test to measure the construct that it intends to measure. This source of error is called construct irrelevance.86 For instance, how well a student reads a mathematics test may influence the student's subtest score in mathematics computation. In this case, the student's reading skills may be irrelevant when the skill of mathematics computation is what is being measured by the subtest.87 Thus, in order to address considerations of construct underrepresentation and construct irrelevance, it is important to collect evidence not only about what a test measures in particular types of situations or for particular groups of students, but also evidence that seeks to document that the intended meaning of the test scores is not unduly influenced by either of the two sources of validity error.

3. Considering the Consequences of Test Use

Evidence about the intended and unintended consequences of test use can provide important information about the validity of the inferences to be drawn from the test results, or it can raise concerns about an inappropriate use of a test where the inferences may be valid for other uses.

For instance, significant differences in placement test scores based on race, gender, or national origin may trigger a further inquiry about the test and how it is being used to make placement decisions.88 The validity of the test scores would be called into question if the test scores are substantially affected by irrelevant factors that are not related to the academic knowledge and skills that the test is supposed to measure.89

Standard 13.1

When educational testing programs are mandated by school, district, state, or other authorities, the ways in which test results are intended to be used should be clearly described. It is the responsibility of those who mandate the use of tests to monitor their impact and to identify and minimize potential negative consequences. Consequences resulting from the uses of the test, both intended and unintended, should also be examined by the test user.

On the other hand, a test may accurately measure differences in the level of students' academic achievement. That is, low scores may accurately reflect that some students do not know the content. However, test users should ensure that they interpret those scores correctly in the context of their high-stakes decisions.90 For instance, test users could incorrectly conclude that the scores reflect lack of ability to master the content for some students when, in fact, the low test scores reflect the limited educational opportunities that the students have received. In this case, it would be inappropriate to use the test scores to place low-performing students in a special services program for students who have trouble learning and processing academic content.91 It would be appropriate to use the test to evaluate program effectiveness, however.92

B. Reliability

Reliability refers to the degree of consistency of test results over test administrations, forms, items, scorers, and/or other facets of testing.93 All indices of reliability are estimates of consistency, and all the estimates contain some error, since no test or other source of information is ever an "error-free" measure of student performance.94 An example of reliability of test results over test administrations is when the same students, taking the test multiple times, receive similar scores. Consistency over parallel forms of a test occurs when forms are developed to be equivalent in content and technical characteristics. Reliability can also include estimates of a high degree of relationship across similar items within a single test or subtest that are intended to measure the same knowledge or skill. For judgmentally scored tests, such as essays, another widely used index of reliability addresses stability across raters or scorers. In each case, reliability can be estimated in different ways, using one of several statistical procedures.95 Different kinds of reliability estimates vary in degree and nature of generalization. Readers are encouraged to review Chapter 2, Reliability and Errors of Measurement, in the Joint Standards for additional, relevant information.96
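As a minimal illustration of two of the statistical procedures mentioned above, the Python sketch below (using the numpy library, with data invented solely for illustration) computes an internal-consistency coefficient (Cronbach's alpha) from item-level scores and a test-retest correlation from two administrations of the same test to the same hypothetical students.

import numpy as np

def cronbach_alpha(item_scores):
    # Internal-consistency estimate; item_scores has shape (examinees, items).
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of six examinees to four items scored 0-3.
responses = [[2, 3, 2, 3],
             [1, 1, 2, 1],
             [3, 3, 3, 2],
             [0, 1, 1, 0],
             [2, 2, 3, 3],
             [1, 2, 1, 1]]
print(f"Internal consistency (alpha) = {cronbach_alpha(responses):.2f}")

# Test-retest estimate: correlation between two administrations to the same
# hypothetical students.
first_administration = np.array([12, 18, 25, 31, 36, 40], dtype=float)
second_administration = np.array([14, 17, 27, 30, 38, 39], dtype=float)
r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability estimate: r = {r:.2f}")

Different estimates answer different consistency questions (across items, across occasions, across raters), which is why the type of estimate reported should match the kind of generalization the test user intends to make.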

C. Fairness

Fairness, like validity, cannot be properly addressed as an afterthought.... It must be confronted throughout the interconnected phases of the testing process, from test design and development to administration, scoring, interpretation, and use

National Research Council, High Stakes: Testing for Tracking, Promotion and Graduation, 80-81 (Jay P. Heubert & Robert M. Hauser eds., 1999).

Tests are fair when they yield score interpretations that are valid and reliable for all groups of students who take the tests. That is, the tests must measure the same academic constructs (knowledge and skills) for all groups of students who take them, regardless of race, national origin, gender, or disability. Similarly, it is important that the scores not substantially and systematically underestimate or overestimate the knowledge or skills of members of a particular group. The Joint Standards discusses fairness in testing in terms of lack of bias, equitable treatment in the testing process, equal scores for students who have equal standing on the tested constructs, and, depending on the purpose, equity in opportunity to learn the material being tested.97 In order to promote readability, the discussion on fairness presented here is meant to reflect this complex topic in an accurate, but concise and user-friendly way. Readers are encouraged to review Chapter 7, Fairness in Testing and Test Use, in the Joint Standards for additional, relevant information.98

1. Fairness in Validity

Demonstrating fairness in the validation of test score inferences focuses primarily on making sure that the scores reflect the same intended knowledge and skills for all students taking the test. For the most part this means that the test should minimize the measurement of material that is extraneous to the intended constructs and that confounds the ability of the test to accurately measure the constructs that it intends to measure. A test score should accurately reflect how well each student has mastered the intended constructs. The score should not be significantly impacted by construct irrelevant influences.

The Joint Standards identifies a number of standards that outline important considerations related to fairness in validity throughout test development, test implementation, and the proper use of reported test results.99

Documenting fairness during test development involves gathering adequate evidence that items and test scores are constructed so that the inferences validly reflect what is intended. For all groups of test takers, evidence should support that valid inferences can be drawn from the scores.100 The Joint Standards states that when credible research reports that item and test results differ in meaning across examinee subgroups, then, to the extent feasible, separate validity evidence should be collected for each relevant subgroup.101 When items function differently across relevant subgroups, appropriate studies should be conducted, when feasible, so that bias in items due to test design, content, and format is detected and eliminated.102 Developers should strive to identify and eliminate language, form, and content in tests that have a different meaning in one subgroup than in others, or that generally have sensitive connotations, except when judged to be necessary for adequate representation of the intended constructs.103 Adequate subgroup analyses should be conducted when evaluating the validity of scores for prediction purposes.104
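The studies of items that "function differently across relevant subgroups" referred to above are commonly carried out with screening statistics such as the Mantel-Haenszel procedure. The Python sketch below is only a conceptual illustration with entirely hypothetical data; a genuine differential item functioning analysis would use much larger documented samples and report accompanying significance tests and effect-size classifications.

from collections import defaultdict

def mantel_haenszel_odds_ratio(item_correct, matching_score, group):
    # item_correct: 1/0 on the studied item; matching_score: total test score
    # used to match examinees; group: "reference" or "focal".
    strata = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])  # A, B, C, D per score level
    for correct, score, g in zip(item_correct, matching_score, group):
        cell = strata[score]
        if g == "reference":
            cell[0] += correct          # reference group, item correct
            cell[1] += 1 - correct      # reference group, item incorrect
        else:
            cell[2] += correct          # focal group, item correct
            cell[3] += 1 - correct      # focal group, item incorrect
    numerator = denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        if n > 0:
            numerator += a * d / n
            denominator += b * c / n
    return numerator / denominator if denominator > 0 else float("nan")

# Hypothetical responses of twelve examinees on one item.
item_correct = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0]
matching_score = [5, 4, 4, 3, 3, 5, 4, 4, 3, 5, 3, 2]
group = ["reference"] * 5 + ["focal"] * 7

odds_ratio = mantel_haenszel_odds_ratio(item_correct, matching_score, group)
print(f"Mantel-Haenszel common odds ratio = {odds_ratio:.2f}")
# Values far from 1.0 flag the item for expert review of possible bias.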

Adequate evidence should document the fair implementation of tests for all test takers. The testing process should reflect equitable treatment for all examinees.105 The Joint Standards states, "In testing applications where the level of linguistic or reading ability is not part of the construct of interest, the linguistic or reading demands of the test should be kept to the minimum necessary for the valid assessment of the intended construct."106 Documentation of appropriate reporting and test use should be available. Reported data should be clear and accurate, especially when there are high-stakes consequences for students.107 When tests are used as part of decision-making that has high-stakes consequences for students, evidence of mean score differences between relevant subgroups should be examined, where feasible. When mean differences are found between subgroups, investigations should be undertaken to determine that such differences are not attributable to construct underrepresentation or construct irrelevant error.108 Evidence about differences in mean scores and the significance of the validity errors should also be considered when deciding which test to use.109 In using test results for purposes other than selection, a test taker's score should not be accepted as a reflection of standing on the intended constructs without consideration of alternative explanations for the test taker's performance.110 Explanations might reflect limitations of the test; for instance, construct irrelevant factors may have significantly impacted the student's score. Explanations may also reflect schooling factors external to the test, for instance, lack of instructional opportunities.
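At the level of total scores, the examination of mean score differences mentioned above can begin with a simple descriptive comparison such as the one sketched below in Python with the numpy library (all numbers are hypothetical). As the text stresses, an observed gap is not by itself evidence of unfairness; it is a signal to investigate construct underrepresentation, construct irrelevant error, and differences in opportunity to learn.

import numpy as np

# Hypothetical total scores for two subgroups of examinees.
group_a = np.array([68, 74, 71, 80, 77, 83, 79, 72], dtype=float)
group_b = np.array([61, 66, 70, 64, 73, 69, 67, 62], dtype=float)

mean_gap = group_a.mean() - group_b.mean()
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
standardized_gap = mean_gap / pooled_sd

print(f"Mean difference = {mean_gap:.1f} score points")
print(f"Standardized difference = {standardized_gap:.2f} pooled standard deviations")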

The issue of feasibility in collecting validity evidence is discussed in a few of the standards summarized above. In the comments associated with these standards, feasibility is generally addressed in terms of adequate sample size, with continued operational use of a test as a way of accumulating adequate numbers of subgroup results over administrations. When credible research reports that results differ in meaning across subgroups, collecting separate and parallel types of validity data verifies that the same knowledge and skills are being measured for all groups of test takers. Particularly in high-stakes situations, it is important that all feasibility considerations include the potential costs to students of using information where the validity of the scores has not been verified.111

2. Fairness in Reliability

Fairness in reliability focuses on making sure that scores are stable and consistently accurate for all groups of students. Two key standards address this issue. First, when there are reasons for expecting that test reliability analyses might differ substantially for different subpopulations, reliability data should be presented as soon as feasible for each major population for which the test is recommended.112 Second, "[w]hen significant variations are permitted in test administration procedures, separate reliability analyses should be provided for scores produced under each major variation if adequate sample sizes are available."113 Often, continued operational use of a test is a way to accumulate an adequate sample size over administrations.
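As a minimal sketch of the first standard noted above, the Python fragment below computes the same test-retest reliability estimate separately for two hypothetical subgroups; the same logic applies to scores produced under major variations in administration procedures. The data and group labels are invented for illustration only.

import numpy as np

# Hypothetical paired administrations for two subgroups (illustrative only).
administrations_by_group = {
    "subgroup 1": (np.array([15.0, 22.0, 28.0, 33.0, 41.0]),
                   np.array([17.0, 21.0, 30.0, 31.0, 43.0])),
    "subgroup 2": (np.array([12.0, 19.0, 26.0, 35.0, 40.0]),
                   np.array([16.0, 18.0, 24.0, 37.0, 38.0])),
}
for name, (first, second) in administrations_by_group.items():
    r = np.corrcoef(first, second)[0, 1]
    print(f"{name}: test-retest reliability estimate r = {r:.2f}")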

D. Cut Scores

Where the results of the [cutscore] setting process have highly significant consequences, . . . those responsible for establishing cutscores should be concerned that the process . . . [is] clearly documented and defensible.

Joint Standards, Introduction to Chapter 4, p. 54

The same principles regarding validity, reliability, and fairness apply generally to the establishment and use of cut scores for the purpose of making high-stakes educational decisions. Cut scores, also known as cut points or cutoff scores, are specific points on the test or scale where test results are used to divide levels of knowledge, skill, or ability. Cut scores are used in a variety of contexts, including decisions for placement purposes or for other specific outcomes, such as graduation, promotion, or admissions.114 A cut score may divide the demonstration of acceptable and unacceptable skills, as in placement in gifted and talented programs where students are accepted or rejected. There may be multiple cut scores that identify qualitatively distinct levels of performance. In order to promote readability, the discussion on cut scores presented here is meant to reflect this complex topic in an accurate, but concise and user-friendly way. Readers are encouraged to review Chapter 4, Scales, Norms, and Score Comparability, in the Joint Standards for additional, relevant information about cut scores, particularly pages 53-54.

Many of the concepts regarding test validity apply to cut scores; that is, the cut points themselves, like all scores, must be accurate representations of the knowledge and skills of students.115 Further, "[w]hen feasible, cut scores defining categories with distinct substantive interpretations should be established on the basis of sound empirical data concerning the relation of test performance to relevant criteria."116 Validity evidence should generally be able to demonstrate that students above the cut score represent or demonstrate a qualitatively greater degree or different type of skills and knowledge than those below the cut score, whenever these types of inferences are made. In high-stakes situations, it is important to examine the validity of the inferences that underlie the specific decisions being made on the basis of the cut scores. In other words, what must be validated is the specific use of the test based on how the scores of students above and below the cut score are being interpreted.

Reliability of the cut scores is also important. The Joint Standards states that where cut scores are specified for selection or placement, the degree of measurement error around each cut score should be reported.117 Evidence should also indicate the misclassification rates, or percentage of error in classifying students, that are likely to occur among students with comparable knowledge and skills.118 This information should be available by group as soon as feasible if there is a prior probability that the misclassification rates may differ substantially by group.119 Misclassification of students above or below the cut points can result in both false positive and false negative classifications.120 As an example of false negative misclassification, one might ask: what percentage of students who should be allowed to graduate would not be allowed to do so because of error due to the test rather than differences in their actual knowledge and skills? The Joint Standards states, "Adequate precision in regions of score scales where cut points are established is prerequisite to reliable classification of examinees into categories."121
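To illustrate the relationship among measurement error, cut scores, and misclassification, the Python simulation below uses the classical test theory relationship SEM = SD * sqrt(1 - reliability) with invented values. It is a conceptual sketch only, not a substitute for the reliability and classification analyses the Joint Standards calls for.

import numpy as np

rng = np.random.default_rng(0)

cut_score = 70.0
score_sd, reliability = 10.0, 0.90
sem = score_sd * np.sqrt(1 - reliability)   # standard error of measurement

# Simulate hypothetical examinees: true standing plus random measurement error.
true_scores = rng.normal(72.0, score_sd, size=100_000)
observed_scores = true_scores + rng.normal(0.0, sem, size=true_scores.size)

truly_above_cut = true_scores >= cut_score
classified_above_cut = observed_scores >= cut_score

false_negatives = np.mean(truly_above_cut & ~classified_above_cut)
false_positives = np.mean(~truly_above_cut & classified_above_cut)

print(f"SEM = {sem:.1f} score points")
print(f"False negative rate: {false_negatives:.1%} of all examinees")
print(f"False positive rate: {false_positives:.1%} of all examinees")

Even with a fairly reliable test, examinees whose true standing lies near the cut score are the most likely to be misclassified, which is why the standards emphasize precision in the region of the cut point and the reporting of misclassification rates by group.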

There is no single right answer to the questions of when, where and how cut scores should be set on a test with high-stakes consequences for students.122 Some experts suggest, however, that multiple standard-setting methods of determining cut scores should be used when determining a final cut score.123 Further, the reasonableness of the standard setting process and the consequences for students should be clearly and specifically documented for a given use.124 Both the Joint Standards and High Stakes repeatedly state that decisions should not be made solely or automatically on the basis of a single test score, and that other relevant information should be taken into account if it will enhance the overall validity of the decision.125

Test Measurement Principles:
Questions about Appropriate Test Use

In order to determine if a test is being used appropriately to make high-stakes decisions about students, considerations about the context of the test use need to be addressed, as well as the validity, reliability, and fairness of the score interpretations from the current test being proposed.

  1. What is the purpose for which the test is being used?
  2. What information, besides the test, is being collected to inform this purpose?
  3. What are the particular propositions that need to be true to support the inferences drawn from the test scores for a given use?
  4. Based on how the test results are to be used, is there adequate evidence of the propositions to document the validity of the inferences for students taking the test? For example:
  • Does the evidence support the proposition that the test measures the specific knowledge and skills the test developers say that it measures?
  • Does the evidence support the proposition that the interpretation of the test scores is valid for the stated purpose for which the test is being proposed?
  • Does the evidence support the proposition that the interpretation of the test scores is valid in the particular type of situation where the test is to be administered?
  • Does the evidence support the proposition that the interpretation of the test scores is valid for the specific groups of students who are taking the test?
  5. Is there adequate evidence of reliability of the test scores for the proposed use?
  6. Is there adequate evidence of fairness in validity and reliability to document that the test score inferences are accurate and meaningful for all groups of students taking the test? That is:
  • Does the evidence support the inference that the test is measuring the same constructs for all groups of students?
  • Does the evidence support that the scores do not systematically underestimate or overestimate the knowledge or skills of members of any particular group?
  7. Is there adequate evidence that cut scores have been properly established and that they will be used in ways that will provide accurate and meaningful information for all test takers?

 

II. The Testing of All Students: Issues of Intervention and Inclusion

All aspects of validity, reliability, fairness, and cut scores discussed above are applicable to the measurement of knowledge and skills of all students, including limited English proficient students126 and students with disabilities. This section addresses additional issues related to accurately measuring the knowledge and skills of these two populations in selected situations. Issues affecting limited English proficient students and students with disabilities are addressed separately below, following a discussion of general considerations about the selection and use of accommodations.

Whenever tests are intended to evaluate the knowledge and skills of different groups of students, ensuring that test score inferences accurately reflect the intended constructs for all students is a complex task. It involves several aspects of test construction, pilot testing, implementation, analysis, and reporting. For limited English proficient students and students with disabilities, the appropriate inclusion of students from these groups in validation and norming samples, and the meaningful inclusion of limited English proficient and disability experts throughout the test development process, are necessary to ensure suitable test quality for these groups of test takers.

The proper inclusion of diverse groups of students in the same academic achievement testing program helps to ensure that high-stakes decisions are made on the basis of test results that are as comparable as possible across all groups of test takers.127 If different tests are used as part of the testing program, it is important to ensure that they measure the same content standards. The appropriate inclusion of students can also help to ensure that educational benefits attributable to the high-stakes decisions will be available to all. In some cases, it is appropriate to test limited English proficient students and students with disabilities under standardized conditions, as long as the evidence supports the validity of the results in a given situation for these students. In other cases, the conditions may have to be accommodated to assure that the inferences of the scores validly reflect the students' mastery of the intended constructs.128 The use of multiple measures generally enhances the accuracy of the educational decisions, and these measures can be used to confirm the validity of the test results. The use of multiple measures is particularly relevant for limited English proficient students and students with disabilities in cases where technical data are in the process of being collected on the proper use of accommodations and the proper interpretation of test results when testing conditions are accommodated.

A. General Considerations about Accommodations

Standard 10.1

In testing individuals with disabilities, test developers, test administrators, and test users should take steps to ensure that the test score inferences accurately reflect the intended construct rather than any disabilities and their associated characteristics extraneous to the intent of the measurement.

Making similar inferences about scores from academic achievement tests for all test takers, and making appropriate decisions when using these scores, requires accurately measuring the same academic constructs (knowledge and skills in specific subject areas) across groups and contexts. In measuring the knowledge and skills of limited English proficient students and students with disabilities, it is particularly important that the tests actually measure the intended knowledge and skills and not factors that are extraneous to the intended construct.129 For instance, impaired visual capacity may influence a student's test score in science when the student must sight read a typical paper and pencil science test. In measuring science skills, the student's sight likely is not relevant to the student's knowledge of science. Similarly, how well a limited English proficient student reads English may influence the student's test score in mathematics when the student must read the test. In this case, the student's reading skills likely are not relevant when the skills of mathematics computation are to be measured. The proper selection of accommodations for individual students and the determination of technical quality associated with accommodated test scores are complex and challenging issues that need to be addressed by educators, policy-makers, and test developers.

Typically, accommodations to established conditions are found in three main phases of testing: 1) the administration of tests, 2) how students are allowed to respond to the items, and 3) the presentation of the tests (how the items are presented to the students on the test instrument). Administration accommodations involve setting and timing, and can include extended time to counteract the increased literacy demands for English language learners or fatigue for a student with sensory disabilities. Response accommodations allow students to demonstrate what they know in different ways, such as responding on a computer rather than in a test booklet. Presentation accommodations can include format variations such as fewer items per page, large print, and plain language editing procedures, which use short sentences, common words, and active voice. There is wide variation in the types of accommodations used across states and school districts. (Appendix C lists many of the accommodations used in large-scale testing for limited English proficient students and students with disabilities. The list is not meant to be exhaustive, and its use in this document should not be seen as an endorsement of any specific accommodations. Rather, the Appendix is meant to provide examples of the types of accommodations that are being used with limited English proficient students and students with disabilities.)

Issues regarding the use of accommodations are complex. When the possible use of an accommodation for a student is being considered, two questions should be examined: 1) What is being measured if conditions are accommodated? 2) What is being measured if the conditions remain the same? The decision to use an accommodation or not should be grounded in the ultimate goal of collecting test information that accurately and fairly represents the knowledge and skills of the individual student on the intended constructs. The overarching concern should be that test score inferences accurately reflect the intended constructs rather than factors extraneous to the intent of the measurement.130

B. Testing of Limited English Proficient Students

The Joint Standards and several recent measurement publications discuss the population of limited English proficient students and how test publishers and users have handled inclusion in tests to date.131 This section briefly outlines principles derived from the Joint Standards and these publications. It addresses two types of testing situations especially relevant for limited English proficient students: the assessment of English language proficiency and the assessment of academic educational achievement.

1. Assessing English Language Proficiency

Standard 9.10

Inferences about test takers' general language proficiency should be based on tests that measure a range of language features, and not on a single linguistic skill.

Issues of validity, reliability, and fairness apply to tests and other relevant assessments that measure English language proficiency. English language proficiency is typically defined as proficiency in listening, speaking, reading, and writing English.132 Assessments that measure English language proficiency are generally used to make decisions about who should receive English language acquisition services, the type of programs in which these students are placed, and the progress of students in the appropriate programs. They are also used to evaluate the English proficiency of students when exiting from a program or services, to ensure that they can successfully participate in the regular school curriculum. In making decisions about which tests are appropriate, it is particularly important to make sure that the tests accurately and completely reflect the intended English language proficiency constructs so that the students are not misclassified. It is generally accepted that a range of communicative abilities will typically need to be assessed when placement decisions are being made.133

2. Assessing the Academic Educational Achievement of Limited English Proficient Students

Several factors typically affect how well the educational achievement of limited English proficient students is measured on standardized academic achievement tests. Technical issues associated with developing meaningful achievement tests for limited English proficient students can be complex and challenging. For all test takers, any test that employs written or oral skills in English or in another language is, in part, a measure of those skills in the particular language. Test use with individuals who have not sufficiently acquired the literacy or fluency skills in the language of the test may introduce construct-irrelevant components to the testing process. Further, issues related to differences in the experiences of students may substantially affect how test items are interpreted by different groups of students. In both instances, test scores may not accurately reflect the qualities and competencies that the test intends to measure.134

a. Background Factors for Limited English Proficient Students

The background factors particularly salient in ensuring accuracy in testing for students with limited English proficiency tend to relate to language proficiency, culture, and schooling.135

Limited English proficient students often bring varying levels of English and home-language fluency and literacy skills to the testing situation. These students may be adept in conversing orally in their home language, but unless they have had formal schooling in their home language, they may not have a corresponding level of literacy. Also, while students with limited English proficiency may acquire a degree of fluency in English, literacy in English for many students comes later. To add to the complexity, proficiency in fluency and literacy in either the home language or English involves both social and academic components. Thus, a student may be able to write a well-organized social letter in his or her home language, but may not be able to orally explain adequately in that language how to solve a mathematics problem that includes the knowledge of concepts and words endemic to the field of mathematics. The same phenomena may occur in English as well.136

Factors Related to Accurately Testing LEP Students

Language Proficiency

  • The student's level of oral and written proficiency in English
  • The student's proficiency in his or her home language
  • The language of instruction

Cultural Issues

  • Background experiences
  • Perceptions of prior experiences
  • Value systems

Schooling Issues

  • The amount of formal elementary and secondary schooling in the student's home country, if applicable, and in U.S. schools
  • Consistency of schooling
  • Instructional practices in the classroom

Therefore, in determining how to effectively measure the academic knowledge and skills of limited English proficient students, educators and policy-makers should consider how to minimize the influence of literacy issues, except when these constructs are explicitly being measured. The levels of proficiency of limited English proficient students in their home language and in English, as well as the language of instruction, are important in determining in which language an achievement test should be administered, and which accommodations to standardized testing conditions, if any, might be most useful for which students.137

Additionally, diverse cultural and other background experiences, including variations in amount, type and location (home country and United States) of formal elementary and secondary schooling, as well as interrupted and multi-location schooling of students (of the type frequently experienced by children of migrant workers), affect language literacy, the contextual content of items, and the academic foundational knowledge base that can be assumed in appropriately interpreting the results of educational achievement tests. The format and procedures involved in testing can also affect accuracy in test scores, particularly if the test practices differ substantially from ongoing instructional practices in classrooms, including which accommodations are used in the classroom and how they are used.138

b. Including Limited English Proficient Students in Large-Scale Standardized Achievement Tests

The Joint Standards recognizes the complexity of developing educational achievement tests that are appropriate for a range of test takers, including those who are limited English proficient. Overall, "testing practice should be designed to reduce threats to the reliability and validity of test score inferences that may arise from language differences."139 When credible research evidence reports that scores may differ in meaning across subgroups of linguistically diverse test takers, then, to the extent feasible, the same form of validity evidence should be collected for each relevant subgroup as for the examinee population as a whole.140 The Joint Standards states, "When a test is recommended for use with linguistically diverse test takers, test developers and publishers should provide the information necessary for appropriate test use and interpretation."141 Furthermore, "when testing an examinee proficient in two or more languages for which the test is available, the examinee's relative language proficiencies should be determined. The test generally should be administered in the test taker's most proficient language, unless proficiency in the less proficient language is part of the assessment."142 Recommended accommodations should be used appropriately and described in detail in the test manual;143 translation methods and interpreter expertise should be clearly described;144 evidence of test comparability should be reported when multiple language versions of a test are intended to be comparable;145 and evidence of the score reliability and the validity of the translated test's score inferences should be provided for the intended uses and linguistic groups.146

Providing accommodations to established testing conditions for some students with limited English proficiency may be appropriate when their use would yield the most valid scores on the intended academic achievement constructs. Deciding which accommodations to use for which students usually involves an understanding of which construct irrelevant background factors would substantially influence the measurement of intended knowledge and skills for individual students, and if the accommodations would enhance the validity of the test score interpretations for these students.147 In collecting evidence to support the technical quality of a test for limited English proficient students, the accumulation of data may need to occur over several test administrations to ensure sufficient sample sizes. Educators and policy-makers need to understand that the proper use of accommodations for limited English proficient students and the determination of technical quality are complex and challenging endeavors.

Appendix C lists various test presentation, administration, and response accommodations that states and districts generally employ when testing limited English proficient students. Examples of accommodations in the presentation of the test include editing text so the items are in plain language, or providing page formats which minimize confusion by limiting use of columns and the number of items per page. Presenting the test in the student's native language is an accommodation to a test written in English when the same constructs are being measured on both the English- and native-language versions. It is essential that translations accurately convey the meaning of the test items; poor translations can prove more harmful than helpful.148 Administration accommodations include extending the length of the testing period, permitting breaks, administering tests in small groups or in separate rooms, and allowing English or native-language glossaries or dictionaries as appropriate. Response accommodations include oral response and permitting students to respond in their native language.

C. Testing of Students with Disabilities

The Joint Standards and several recent measurement publications discuss the population of students with disabilities and how test publishers and users have handled inclusion in tests to date.149 This section briefly outlines principles derived from the Joint Standards and these publications. It addresses three types of testing situations especially relevant for students with disabilities: tests used for diagnostic and intervention purposes, the assessment of academic educational achievement, and alternate assessments for elementary and secondary school students with disabilities who cannot participate in districtwide academic achievement tests.

1. Tests Used for Diagnostic and Intervention Purposes

Standard 10.12

In testing individuals with disabilities for diagnostic and intervention purposes, the test should not be used as the sole indicator of the test taker's functioning. Instead, multiple sources of information should be used.

All issues of validity, reliability, and fairness apply to tests and other assessments used to make diagnostic and intervention decisions for students with disabilities. Tests that yield diagnostic information typically focus in great detail on identifying the specific challenges and strengths of a student.150 These diagnostic tests are often administered in one-to-one situations (test taker and examiner) rather than in a group situation. In many cases, they have been designed with standardized adaptations to fit the needs of individual examinees. In making decisions about which tests are appropriate to use, it is important to make sure that the tests accurately and completely reflect the intended constructs, so that the interventions are appropriate and beneficial for the individual students. Proper analyses should be conducted to yield correct interpretations of results when differential prediction for different groups is likely.151

2. Assessing the Academic Educational Achievement of Students with Disabilities

Several factors affect how well the educational achievement of students with disabilities is measured on standardized academic achievement tests. Test scores should accurately measure the students' knowledge and skills in academic achievement rather than factors irrelevant to the intended constructs of the test.152 The technical issues associated with developing meaningful achievement tests for students with disabilities can be complex and challenging. Under federal law, students with disabilities must be included in statewide or districtwide assessment programs and provided with appropriate accommodations if necessary. Guidance about testing elementary and secondary school students with disabilities is addressed by the individualized education program (IEP) process or other applicable evaluation procedures. The IEP or Section 504 plan addresses how a student should be tested, and identifies testing accommodations that would be appropriate for the individual student. The Individuals with Disabilities Education Act (IDEA) also requires state or local education agencies to develop guidelines for participation in alternate assessments by the relatively small number of students with disabilities who cannot take part in statewide or districtwide tests. The Joint Standards emphasizes that people who make decisions about accommodations for students with disabilities should be knowledgeable about the effects of the disabilities on test performance.153

a. Background Factors for Students with Disabilities

The background factors particularly important to students with disabilities are generally related to the nature of the disabilities or to the schooling experiences of these students.154 Within any disability category, the type, number, and severity of impairments vary greatly.155 For instance, some students with learning disabilities have a processing disability in only one subject, such as mathematics, while others experience accessing, retrieving, and processing impairments that affect a broad number of school subjects and contexts. For many of these students, one or more of the impairments may be relatively mild, while for others one or more can be significant. Further, different types of disabilities yield significantly different constellations of issues. For instance, the considerations surrounding students with hearing impairments or deafness may overlap significantly with limited English proficient students in some ways and with other students with disabilities in other respects. The Joint Standards discusses provisions regarding the testing and validation of tests for limited English proficient students that apply to students who have hearing impairments or deafness, as well.156 This complexity poses a challenge not only to educators, but also to test administrators and developers. In general, in determining how to use academic tests appropriately for students with disabilities, educators and policy-makers should consider how to minimize the influence of the impairments in measuring the intended constructs.

Factors Related to Accurately Testing Students with Disabilities

Disability Issues

  • Types of impairments
  • Severity of impairments

Schooling Experiences

  • Overlap of individualized educational goals and general education curricula in elementary and secondary schooling
  • Pace of schooling
  • Instructional practices in the classroom

Educating One and All explains that the schooling experiences of students with disabilities vary greatly as a function of their disability, the severity of impairments, and expectations of their capabilities.157 Two sets of educational experiences, in particular, affect how educators and policy-makers accommodate tests and use them appropriately for this population. First, the IEP teams identify individual educational plans for students with disabilities that have different degrees of overlap with the general education curricula. This alignment will affect what opportunities students with disabilities will have to master the material being tested on the schoolwide academic achievement tests. Second, the IEP team also recommends appropriate accommodations for students, and these accommodations are usually consistent with classroom accommodation techniques. However, while special educators have a long history of accommodating instruction and evaluation to fit student strengths, not all the instructional or testing practices in the classroom are appropriate in large-scale testing. Additionally, some students may not have been exposed routinely to the types of accommodations that would be possible in large-scale testing.158

b. Including Students with Disabilities in Large-Scale Standardized Achievement Tests

The Joint Standards recognizes the complexity of developing educational achievement tests that are appropriate for a range of test takers, including students with disabilities. The interpretation of the scores of students with disabilities should accurately and fairly reflect the academic knowledge, skills, or abilities that the test intends to measure. The interpretation should not be confounded by those challenges students face that are extraneous to the intent of the measurement.159 Rather, validity evidence should document that the inferences of the scores of students with disabilities are accurate. Pilot testing and other technical investigations should be conducted where feasible to ensure the validity of the test inferences when accommodations have been allowed.160 While feasibility is a consideration, the Joint Standards comments that "the costs of obtaining validity evidence should be considered in light of the consequences of not having usable information regarding the meanings of scores for people with disabilities."161

Providing accommodations to established testing conditions for some students with disabilities may be appropriate when their use would yield the most valid scores on the intended academic achievement constructs. Deciding which accommodations to use for which students usually involves an understanding of which construct irrelevant background factors would substantially influence the measurement of intended knowledge and skills for individual students, and if the accommodations would enhance the validity of the test score interpretations for these students.162 In collecting evidence to support the technical quality of the test results for students with disabilities, the accumulation of data may need to occur over several administrations to ensure sufficient sample sizes. Educators and policy-makers need to understand that the proper use of accommodations for students with disabilities and the determination of technical quality are complex and challenging endeavors.

Appendix C lists various presentation, administration, and response accommodations that states and districts generally employ when testing students with disabilities. Examples of presentation accommodations are the use of Braille, large print, oral reading, or providing page formats that minimize confusion by limiting use of columns and the number of items per page. Administration accommodations in setting include allowing students to take the test at home or in a small group, and accommodations in timing include extended time and frequent breaks. Variations in response formats include allowing students to respond orally, point, or use a computer.

3. Alternate Assessments

Alternate assessments are assessments for those elementary and secondary school students with disabilities who cannot participate in state or districtwide standardized assessments, even with the use of appropriate accommodations and modifications.163 For the constructs being measured, the considerations with respect to validity, reliability, and fairness apply to alternate assessments, as well. Appropriate content needs to be identified, and procedures need to be designed to ensure technical rigor.164 In addition, evidence should show that the test measures the knowledge and skills it intends to measure, and that the measurement is a valid reflection of mastery in a range of contextual situations.


66. See, e.g., High Stakes, supra note 11, at pp. 59-60. BACK

67. Among other considerations, institutions will determine if they want test score interpretations that are norm-referenced or criterion-referenced, or both. Norm-referenced means that the performances of students are compared to the performances of other students in a specified reference population; criterion-referenced indicates the extent to which students have mastered specific knowledge and skills. BACK
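
The distinction can be made concrete with a small Python sketch. All of the numbers below, including the reference scores and the 70-percent mastery criterion, are hypothetical and are used only to illustrate the two kinds of interpretation.

    from bisect import bisect_left

    # Hypothetical reference population and student score (illustration only).
    reference_scores = sorted([48, 52, 55, 60, 61, 63, 67, 70, 74, 81])
    student_score = 63
    max_score = 100

    # Norm-referenced interpretation: the student's standing relative to the
    # reference population, expressed here as a simple percentile rank.
    percentile_rank = 100 * bisect_left(reference_scores, student_score) / len(reference_scores)

    # Criterion-referenced interpretation: whether the student has mastered the
    # specified content, defined here (hypothetically) as at least 70% correct.
    has_mastered = (student_score / max_score) >= 0.70

    print(f"Percentile rank (norm-referenced): {percentile_rank:.0f}")
    print(f"Meets mastery criterion (criterion-referenced): {has_mastered}")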

68. Joint Standards, supra note 3, at p. 141; see also Standard 13.7 (n.8) in Joint Standards, supra note 3, at p. 146. BACK

69. Joint Standards, supra note 3, at p. 141. BACK

70. In order to provide educational institutions with tests that are accurate and fair, test developers should develop tests in accordance with professionally recognized standards, and provide educational institutions with adequate evidence of test quality.

Standard 1.4 states, "If a test is used in a way that has not been validated, it is incumbent on the user to justify the new use, collecting new evidence if necessary." Joint Standards, supra note 3, at p. 18.

Standard 11.2 states, "When a test is to be used for a purpose for which little or no documentation is available, the user is responsible for obtaining evidence of the test's validity and reliability for this purpose." Joint Standards, supra note 3, at p. 113. BACK

71. See Standard 7.5, 13.5 (n.22) and 13.6 (n.21) in Joint Standards, supra note 3, at pp. 82, 146.

Standard 7.5 states, "In testing applications involving individualized interpretations of test scores other than selection, a test taker's score should not be accepted as a reflection of standing on the characteristic being assessed without consideration of alternate explanations for the test taker's performance on that test at that time." Joint Standards, supra note 3, at p. 82. BACK

72. Joint Standards, supra note 3, at pp. 9, 184. BACK

73. Joint Standards, supra note 3, at pp. 9-24. BACK

74. Joint Standards, supra note 3, at p. 173. BACK

75. The Joint Standards defines a content domain as "the set of behaviors, knowledge, skills, abilities, attitudes or other characteristics to be measured by a test, represented in a detailed specification, and often organized into categories by which items are classified." Joint Standards, supra note 3, at p. 174. A domain, then, represents a definition of a content area for the purposes of a particular test. Other tests will likely have a different definition of what knowledge and skills a particular content area entails. BACK

76. See Joint Standards, supra note 3, at pp. 9-11, 184. BACK

77. Therefore, construct validity can be seen as an umbrella that encompasses what has previously been described as predictive validity, content validity, criterion validity, discriminant validity, etc. Under this view, these terms refer to types or sources of evidence that can be accumulated to support the validity argument. Definitions of these terms can be found in Appendix B, Measurement Glossary. BACK

78. Joint Standards, supra note 3, at p. 9. BACK

79. Joint Standards, supra note 3, at p. 9. BACK

80. Rather than follow the traditional nomenclature (e.g. predictive validity, content validity, criterion validity, discriminant validity, etc.), the Joint Standards defines sources of validity evidence as evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and evidence based on consequences of testing. See Joint Standards, supra note 3, at pp. 11-17. BACK

81. See Joint Standards, supra note 3, at pp. 9-24 (Chapter 1, Validity). BACK

82. Standard 3.6 states, "The type of items, the response formats, scoring procedures, and test administration procedures should be selected based on the purposes of the test, the domain to be measured, and the intended test takers. To the extent possible, test content should be chosen to ensure that intended inferences from test scores are equally valid for members of different groups of test takers. The test review process should include empirical analyses and, when appropriate, the use of expert judges to review items and response formats. The qualifications, relevant experiences, and demographic characteristics of expert judges should also be documented." Joint Standards, supra note 3, at p. 44. BACK

83. As indicated in the Joint Standards, "The extent to which predictive or concurrent evidence of validity generalization can be used in new situations is in large measure a function of accumulated research. Although evidence of generalization can often help to support a claim of validity in a new situation, the extent of available data limits the extent to which the claim can be sustained." Joint Standards, supra note 3, at pp. 15-16. BACK

84. Joint Standards, supra note 3, at p. 10. BACK

85. Samuel Messick, Validity, in Educational Measurement, pp. 13-103 (Robert L. Linn ed., 3rd ed. 1989) (hereinafter Messick, Validity); Samuel Messick, Validity of Psychological Assessment: Validations of Inferences from Persons' Responses and Performances as Scientific Inquiry into Score Meaning, American Psychologist 50(9), pp. 741-749 (September 1995) (hereinafter Messick, Validity of Psychological Assessment). BACK

86. Messick, Validity, supra note 85; Messick, Validity of Psychological Assessment, supra note 85. BACK

87. On the other hand, if an item is measuring the student's ability to apply mathematical skills in a written format (for instance, when an item requires students to fill out an order form), then writing skills may not be extraneous to the construct being measured in this item. BACK

88. See Joint Committee on Testing Practices, Code of Fair Testing Practices in Education (1988). BACK

89. See Standard 1.24, 7.5 (n.71) and 7.6 in Joint Standards, supra note 3, at pp. 23-24, 82.

Standard 1.24 states, "When unintended consequences result from test use, an attempt should be made to investigate whether such consequences arise from the test's sensitivity to characteristics other than those it is intended to assess or to the test's failure fully to represent the intended construct." Joint Standards, supra note 3, at p. 23.

Standard 7.6 states, "When empirical studies of differential prediction of a criterion for members of different subgroups are conducted, they should include regression equations (or an appropriate equivalent) computed separately for each group or treatment under consideration or an analysis in which the group or treatment variables are entered as moderator variables." Joint Standards, supra note 3, at p. 82. BACK
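
As a rough, non-authoritative sketch of the kind of analysis Standard 7.6 describes, the following Python fragment fits the regression of a criterion measure on test scores separately for two subgroups and compares the resulting slopes and intercepts. The data, group labels, and parameter values are simulated for illustration and do not come from the Joint Standards or any actual testing program.

    import numpy as np

    rng = np.random.default_rng(0)

    def fit_line(test_scores, criterion):
        """Ordinary least-squares slope and intercept of criterion on test score."""
        X = np.column_stack([test_scores, np.ones_like(test_scores)])
        (slope, intercept), *_ = np.linalg.lstsq(X, criterion, rcond=None)
        return slope, intercept

    # Simulated subgroups with the same slope but different intercepts.
    results = {}
    for name, (true_slope, true_intercept) in {"group_a": (0.50, 1.0), "group_b": (0.50, 0.4)}.items():
        scores = rng.normal(50, 10, size=200)                                      # test scores
        criterion = true_slope * scores + true_intercept + rng.normal(0, 2, 200)   # e.g., later course grades
        results[name] = fit_line(scores, criterion)

    for name, (slope, intercept) in results.items():
        print(f"{name}: slope = {slope:.2f}, intercept = {intercept:.2f}")
    # Markedly different slopes or intercepts across groups would signal possible
    # differential prediction and call for closer review of how scores are used.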

90. See Standard 1.22, 1.23, 7.5 (n.71), 7.10 (n.33) and 13.9 (n.23) in Joint Standards, supra note 3, at pp. 23, 82, 83, 147.

Standard 1.22 states, "When it is clearly stated or implied that recommended test use will result in a specific outcome, the basis for expecting that outcome should be presented, together with relevant evidence." Joint Standards, supra note 3, at p. 23.

Standard 1.23 states, "When a test use or score interpretation is recommended on the grounds that testing or the testing program per se will result in some indirect benefit in addition to the utility of information from the test scores themselves, the rationale for anticipating the indirect benefit should be made explicit. Logical or theoretical arguments and empirical evidence for the indirect benefit should be provided. Due weight should be given to any contradictory findings in the scientific literature, including findings suggesting important indirect outcomes other than those predicted." Joint Standards, supra note 3, at p. 23. BACK

91. The Comment under Standard 13.1 states, "Mandated testing programs are often justified in terms of their potential benefits for teaching and learning. Concerns have been raised about the potential negative impact of mandated testing programs, particularly when they result directly in important decisions for individuals or institutions. Frequent concerns include narrowing the curriculum to focus only on the objectives tested, increasing the number of dropouts among students who do not pass the test, or encouraging other instructional or administrative practices simply designed to raise test scores rather than to affect the quality of education." Joint Standards, supra note 3, at p. 145. BACK

92. High Stakes, supra note 11, at pp. 247-272. BACK

93. Evaluating the reliability of test results includes identifying the major sources of measurement error, the size of the errors resulting from these sources, the degree of reliability to be expected, and the generalizability of results across items, forms, raters, sampling, administrations, and other measurement facets. BACK

94. All sources of assessment information, including test results, include some degree of error. There are two types of error. The first is random error, which affects scores in such a way that sometimes students will score lower and sometimes higher than their "true" score (the actual mastery level of the students' knowledge and skills). This type of error, also known as measurement error, particularly affects the reliability of scores. Therefore, test scores are considered reliable when evidence demonstrates that there is a minimum amount of random measurement error in the test scores for a given group.

The second type of error that affects test results is systematic error. Systematic error consistently affects scores in one direction; that is, this type of error causes some students to consistently score lower or consistently score higher than their "true" (or actual) level of mastery. For instance, visually impaired students will consistently score lower than they should on a test that has not been administered to them in Braille or large print, because their difficulty in reading the items on the page will negatively affect their score. This type of error generally affects the validity of the interpretation of the test results and is discussed in the validity section above. Systematic error should also be minimized in a test for all test takers.

When educators and policy-makers are evaluating the adequacy of a test for their local population of students, it is important to consider evidence concerning both types of error. BACK
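
The two types of error described in this note can be illustrated with a short simulation in Python. The true score, the size of the random error, and the source of the systematic error below are all hypothetical.

    import numpy as np

    rng = np.random.default_rng(42)
    true_score = 75.0        # the student's actual level of mastery (hypothetical)
    n_administrations = 10_000

    random_error = rng.normal(loc=0.0, scale=3.0, size=n_administrations)
    systematic_error = -5.0  # e.g., an unaddressed access barrier lowers every score

    observed_random_only = true_score + random_error
    observed_with_bias = true_score + random_error + systematic_error

    # Random error averages out across administrations; systematic error does not.
    print(f"Mean with random error only: {observed_random_only.mean():.1f}")   # about 75
    print(f"Mean with systematic error:  {observed_with_bias.mean():.1f}")     # about 70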

95. These types of reliability estimates are known as test-retest, alternate forms, internal consistency, and inter-rater estimates, respectively. Joint Standards, supra note 3, at pp. 25-31. BACK
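
As a rough illustration only, the following Python sketch computes two of the estimates named in this note, a test-retest (or alternate-forms) correlation and an internal consistency coefficient (Cronbach's alpha), from simulated item responses. The data-generating assumptions are invented; real reliability evidence must come from actual administrations.

    import numpy as np

    rng = np.random.default_rng(7)
    n_students, n_items = 500, 20
    ability = rng.normal(0, 1, n_students)

    def administer():
        # Each item score = ability signal plus item-specific random noise.
        return ability[:, None] + rng.normal(0, 1, (n_students, n_items))

    form_1, form_2 = administer(), administer()
    total_1, total_2 = form_1.sum(axis=1), form_2.sum(axis=1)

    # Test-retest / alternate-forms estimate: correlation between total scores.
    test_retest_r = np.corrcoef(total_1, total_2)[0, 1]

    # Internal consistency estimate from a single administration (Cronbach's alpha).
    item_variances = form_1.var(axis=0, ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_1.var(ddof=1))

    print(f"Test-retest correlation: {test_retest_r:.2f}")
    print(f"Cronbach's alpha:        {alpha:.2f}")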

96. Joint Standards, supra note 3, at pp. 25-36. BACK

97. Joint Standards, supra note 3, at pp. 74-80. In test measurement, the term fairness has a specific set of technical interpretations. Four of these interpretations are discussed in the Joint Standards. For instance, bias is discussed in relation to fairness and is defined in the Joint Standards in two ways: "In a statistical context, (bias refers to) a systematic error in a test score. In discussing test fairness, bias (also) may refer to construct underrepresentation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers." Joint Standards, supra note 3, at p. 172. Fairness as equitable treatment in the testing process "requires consideration not only of the test itself, but also the context and purpose of testing, and the manner in which test scores are used." Joint Standards, supra note 3, at p. 74. Equal scores for students of equal standing reflects that "examinees of equal standing with respect to the construct the test is intended to measure should on average earn the same test score, irrespective of group membership." Joint Standards, supra note 3, at p. 74. For purposes such as promotion and graduation, "[w]hen some test takers have not had the opportunity to learn the subject matter covered by the test content, they are likely to get low scores . . . low scores may have resulted in part from not having had the opportunity to learn the material tested as well as from having had the opportunity and failed to learn." Joint Standards, supra note 3, at p. 76. BACK

98. Joint Standards, supra note 3, at pp. 73-84. BACK

99. Joint Standards, supra note 3, at pp. 80-84. BACK

100. Standard 7.2 states, "When credible research reports differences in the effects of construct-irrelevant variance across subgroups of test takers on performance of some part of the test, the test should be used if at all only for those subgroups for which evidence indicates that valid inferences can be drawn from test scores." Joint Standards, supra note 3, at p. 81. BACK

101. See Standard 7.1 and 7.3 in Joint Standards, supra note 3, at pp. 80-81.

Standard 7.1 states, "When credible research reports that test scores differ in meaning across examinee subgroups for the type of test in question, then to the extent feasible, the same forms of validity evidence collected for the examinee population as a whole should also be collected for each relevant subgroup. Subgroups may be found to differ with respect to appropriateness of test content, internal structure of test responses, the relation of test scores to other variables, or the response processes employed by individual examinees. Any such findings should receive due consideration in the interpretation and use of scores as well as in subsequent test revisions." Joint Standards, supra note 3, at p. 80.

Standard 7.3 states, "When credible research reports that differential item functioning exists across age, gender, racial/ethnic, cultural, disability and/or linguistic groups in the population of test takers in the content domain measured by the test, test developers should conduct appropriate studies when feasible. Such research should seek to detect and eliminate aspects of test design, content, and format that might bias test scores for particular groups." Joint Standards, supra note 3, at p. 81. BACK

102. Standard 7.3 (n.101) in Joint Standards, supra note 3, at p. 81. BACK

103. See Standard 7.3 (n.101) and 7.4 in Joint Standards, supra note 3, at pp. 81-82.

Standard 7.4 states, "Test developers should strive to identify and eliminate language, symbols, words, phrases, and content that are generally regarded as offensive by members of racial, ethnic, gender, or other groups, except when judged to be necessary for adequate representation of the domain." Joint Standards, supra note 3, at p. 82.

The Comment to Standard 7.4 states, "Two issues are involved. The first deals with the inadvertent use of language that, unknown to the test developer, has a different meaning or connotation in one subgroup than in others. Test publishers often conduct sensitivity reviews of all test material to detect and remove sensitive material from the test. The second deals with settings in which sensitive material is essential for validity. For example, history tests may appropriately include material on slavery or Nazis. Tests on subjects from life sciences may appropriately include material on evolution. A test of understanding of an organization's sexual harassment policy may require employees to evaluate examples of potentially offensive behavior." Joint Standards, supra note 3, at p. 82. BACK

104. See Standard 7.6 (n.89) in Joint Standards, supra note 3, at p. 82. BACK

105. Standard 7.12 states, "The testing or assessment process should be carried out so that test takers receive comparable and equitable treatment during all phases of the testing or assessment process." Joint Standards, supra note 3, at p. 84. BACK

106. Standard 7.7 in Joint Standards, supra note 3, at p. 82. BACK

107. See Standard 1.24 (n.89), 7.8, 7.9 and 7.10 (n.33) in Joint Standards, supra note 3, at pp. 23, 83.

Standard 7.8 states, "When scores are disaggregated and publicly reported for groups identified by characteristics such as gender, ethnicity, age, language proficiency, or disability, cautionary statements should be included whenever credible research reports that test scores may not have comparable meaning across these different groups." Joint Standards, supra note 3, at p. 83.

Standard 7.9 states, "When tests or assessments are proposed for use as instruments of social, educational, or public policy, the test developers or users proposing the test should fully and accurately inform policy-makers of the characteristics of the tests as well as any relevant and credible information that may be available concerning the likely consequences of test use." Joint Standards, supra note 3, at p. 83. BACK

108. Standard 7.10 (n.33) in Joint Standards, supra note 3, at p. 83. BACK

109. Standard 7.11 states, "When a construct can be measured in different ways that are approximately equal in their degree of construct representation and freedom from construct-irrelevant variance, evidence of mean score differences across relevant subgroups of examinees should be considered in deciding which test to use." Joint Standards, supra note 3, at p. 83. BACK

110. Standard 7.5 (n.71) in Joint Standards, supra note 3, at p. 82. BACK

111. The Comment to Standard 10.7 states, "In addition to modifying tests and test administration procedures for people who have disabilities, evidence of validity for inferences drawn from these tests is needed. Validation is the only way to amass knowledge about the usefulness of modified tests for people with disabilities. The costs of obtaining validity evidence should be considered in light of the consequences of not having usable information regarding the meanings of scores for people with disabilities. This standard is feasible in the limited circumstances where a sufficient number of individuals with the same level or degree of a given disability is available." Joint Standards, supra note 3, at p. 107 (emphasis added). BACK

112. Standard 2.11 states, "If there are generally accepted theoretical or empirical reasons for expecting that reliability coefficients, standard errors of measurement, or test information functions will differ substantially for various subpopulations, publishers should provide reliability data as soon as feasible for each major population for which the test is recommended." Joint Standards, supra note 3, at p. 34.

It should be noted that reliability estimates may differ simply because of limited variance within a group. This is not a flaw in the test leading to unfairness, but rather a function of the statistical methodologies used in calculating the estimates. BACK
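
The effect of limited variance on a correlation-based reliability estimate can be shown with a short, purely hypothetical simulation in Python: the same test, with the same measurement error, yields a much lower estimate when the group of examinees is restricted to a narrow range of ability.

    import numpy as np

    rng = np.random.default_rng(3)

    def test_retest_reliability(true_scores, error_sd=4.0):
        """Correlation between two administrations with identical measurement error."""
        obs_1 = true_scores + rng.normal(0, error_sd, true_scores.size)
        obs_2 = true_scores + rng.normal(0, error_sd, true_scores.size)
        return np.corrcoef(obs_1, obs_2)[0, 1]

    full_population = rng.normal(100, 15, 5_000)                            # wide range of ability
    restricted_group = full_population[np.abs(full_population - 100) < 5]   # narrow slice of the same population

    print(f"Reliability estimate, full population:  {test_retest_reliability(full_population):.2f}")
    print(f"Reliability estimate, restricted group: {test_retest_reliability(restricted_group):.2f}")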

113. Standard 2.18 in Joint Standards, supra note 3, at p. 36. BACK

114. See also Standard 1.19 and 13.9 (n.23) in Joint Standards, supra note 3, at pp. 22, 147.

Standard 1.19 states, "If a test is recommended for use in assigning persons to alternative treatments or is likely to be so used, and if outcomes from those treatments can reasonably be compared on a common criterion, then, whenever feasible, supporting evidence of differential outcomes should be provided." Joint Standards, supra note 3, at p. 22. BACK

115. See Joint Standards, supra note 3, at pp. 9-16 (Chapter 1, Validity, discusses that the interpretation of all scores should be an accurate representation of what is being measured). BACK

116. Standard 4.20 in Joint Standards, supra note 3, at p. 60. BACK

117. Standard 2.14 states, "Conditional standard errors of measurement should be reported at several score levels if constancy cannot be assumed. Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score." Joint Standards, supra note 3, at p. 35. BACK

118. "Where the purpose of measurement is classification, some measurement errors are more serious than others. An individual who is far above or far below the value established for pass/fail or for eligibility for a special program can be mismeasured without serious consequences. Mismeasurement of examinees whose true scores are close to the cut score is a more serious concern. . . . The term classification consistency or inter-rater agreement, rather than reliability, would be used in discussions of consistency of classification. Adoption of such usage would make it clear that the importance of an error of any given size depends on the proximity of the examinee's score to the cut score." Joint Standards, supra note 3, at p. 30. BACK
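
A small Python sketch, using a hypothetical cut score of 70 and a hypothetical standard error of measurement of 3 points, illustrates why misclassification is chiefly a concern for examinees whose true scores lie near the cut score.

    from statistics import NormalDist

    cut_score = 70
    sem = 3.0  # hypothetical standard error of measurement near the cut score

    def misclassification_probability(true_score):
        """Probability that the observed score falls on the wrong side of the cut."""
        dist = NormalDist(mu=true_score, sigma=sem)
        if true_score >= cut_score:       # should pass; error = observed score below the cut
            return dist.cdf(cut_score)
        return 1 - dist.cdf(cut_score)    # should fail; error = observed score at or above the cut

    for true_score in (55, 65, 69, 71, 75, 85):
        p = misclassification_probability(true_score)
        print(f"true score {true_score}: P(misclassified) = {p:.3f}")
    # Examinees far from the cut (55 or 85) are almost never misclassified, while
    # those just below or above it (69 or 71) are misclassified about a third of the time.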

119. Standard 2.11 (n.112) in Joint Standards, supra note 3, at p. 34. BACK

120. Joint Standards, supra note 3, at p. 30. BACK

121. Joint Standards, supra note 3, at p. 59. BACK

122. High Stakes, supra note 11, at p. 168. BACK

123. High Stakes, supra note 11, at p. 169. BACK

124. See Standard 4.19, 4.21 and their Comments in Joint Standards, supra note 3, at pp. 59-60; see also High Stakes, supra note 11, at pp. 89-187 (Chapters 5, 6, and 7).

Standard 4.19 states, "When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be clearly documented." Joint Standards, supra note 3, at p. 59.

Standard 4.21 states, "When cut scores defining pass-fail or proficiency categories are based on direct judgments about the adequacy of item or test performances or performance levels, the judgmental process should be designed so that judges can bring their knowledge and experience to bear in a reasonable way." Joint Standards, supra note 3, at p. 60. BACK

125. See High Stakes, supra note 11, at pp. 89-187 (Chapters 5, 6, and 7); Standard 13.7 (n.8) in Joint Standards, supra note 3, at p. 146. BACK

126. These are students who are learning English as a second language; the same population sometimes also is referred to as English language learners. BACK

127. See High Stakes, supra note 11, at pp. 7, 80. BACK

128. See Joint Standards, supra note 3, at pp. 71-80, 91-97, 101-106 (Chapters 7, 9, and 10). BACK

129. This is known as construct irrelevance. See discussion supra Chapter 1 Part (I)(A)(3) (Sources of Validity Error); Joint Standards, supra note 3, at pp. 173-174. BACK

130. See Standard 9.1 and 10.1 in Joint Standards, supra note 3, at pp. 97, 106; Messick, Validity, supra note 85.

Standard 9.1 states, "Testing practice should be designed to reduce threats to the reliability and validity of test score inferences that may arise from language differences." Joint Standards, supra note 3, at p. 97.

Standard 10.1 states, "In testing individuals with disabilities, test developers, test administrators, and test users should take steps to ensure that the test score inferences accurately reflect the intended construct rather than any disabilities and their associated characteristics extraneous to the intent of the measurement." Joint Standards, supra note 3, at p. 106. BACK

131. E.g., Joint Standards, supra note 3, at pp. 91-97 (Chapter 9); High Stakes, supra note 11, at pp. 211-237 (Chapter 9); National Research Council, Improving America's Schooling for Language Minority Children: A Research Agenda (Diane August & Kenji Hakuta eds., 1997) (hereinafter Improving America's Schooling for Language Minority Children); Rebecca J. Kopriva, Council of Chief State School Officers, Ensuring Accuracy in Testing for English Language Learners (2000) (hereinafter Kopriva, Ensuring Accuracy in Testing). BACK

132. Improving America's Schooling for Language Minority Children, supra note 131, at pp. 116-118. BACK

133. Standard 9.10 and Comment in Joint Standards, supra note 3, at pp. 99-100.

Standard 9.10 states, "Inferences about test takers' general language proficiency should be based on tests that measure a range of language features, and not on a single linguistic skill." Joint Standards, supra note 3, at pp. 99-100. BACK

134. Joint Standards, supra note 3, at pp. 91-97. BACK

135. See Joint Standards, supra note 3, at pp. 91-100 (Chapter 9); Improving Schooling for Language Minority Children, supra note 131; Kopriva, Ensuring Accuracy in Testing, supra note 131, at pp. 9-11 (Introduction). BACK

136. Improving America's Schooling for Language Minority Children, supra note 131, at pp. 113-137. BACK

137. Improving America's Schooling for Language Minority Children, supra note 131, at pp. 113-137. BACK

138. Kopriva, Ensuring Accuracy in Testing, supra note 131, at pp. 29-48, 61-70, 95-98. BACK

139. Standard 9.1 in Joint Standards, supra note 3, at p. 97. BACK

140. Standard 9.2 states, "When credible research evidence reports that test scores differ in meaning across subgroups of linguistically diverse test takers, then to the extent feasible, test developers should collect for each linguistic subgroup studied the same form of validity evidence collected for the examinee population as a whole." Joint Standards, supra note 3, at p. 97. BACK

141. Standard 9.6 in Joint Standards, supra note 3, at p. 99. BACK

142. Standard 9.3 in Joint Standards, supra note 3, at p. 98. BACK

143. See Standard 9.4 and 9.5 in Joint Standards, supra note 3, at p. 98.

Standard 9.4 states, "Linguistic modifications recommended by test publishers, as well as the rationale for the modifications, should be described in detail in the test manual." Joint Standards, supra note 3, at p. 98.

Standard 9.5 states, "When there is credible evidence of score comparability across regular and modified tests or administrations, no flag should be attached to a score. When such evidence is lacking, specific information about the nature of the modification should be provided, if permitted by law, to assist test users properly to interpret and act on test scores." Joint Standards, supra note 3, at p. 98. BACK

144. See Standard 9.7 and 9.11 in Joint Standards, supra note 3, at pp. 99-100.

Standard 9.7 states, "When a test is translated from one language to another, the methods used in establishing the adequacy of the translation should be described, and empirical and logical evidence should be provided for score reliability and the validity of the translated test's score inferences for the uses intended in the linguistic groups to be tested." Joint Standards, supra note 3, at p. 99.

Standard 9.11 states, "When an interpreter is used in testing, the interpreter should be fluent in both the language of the test and the examinee's native language, should have expertise in translating, and should have a basic understanding of the assessment process." Joint Standards, supra note 3, at p. 100. BACK

145. Standard 9.9 states, "When multiple language versions of a test are intended to be comparable, test developers should report evidence of test comparability." Joint Standards, supra note 3, at p. 99. BACK

146. Standard 9.7 (n.144) and Comment in Joint Standards, supra note 3, at p. 99.

The Comment to Standard 9.7 states, "[f]or example, if a test is translated into Spanish for use with Mexican, Puerto Rican, Cuban, Central American, and Spanish populations, score reliability and the validity of the test score inferences should be established with members of each of these groups separately where feasible. In addition, the test translation methods used need to be described in detail." Joint Standards, supra note 3, at p. 99. BACK

147. Kopriva, Ensuring Accuracy in Testing, supra note 131, at pp. 49-66, 71-76 (discussing which accommodations might be most beneficial for students with various background factors). BACK

148. President's Advisory Commission on Educational Excellence for Hispanic Americans, Testing Hispanic Students in the United States: Technical and Policy Issues, Executive Summary, p. 8 (2000). BACK

149. E.g., Joint Standards, supra note 3, at pp. 101-106 (Chapter 10); High Stakes, supra note 11, at pp. 188-210 (Chapter 8); National Research Council, Educating One and All: Students with Disabilities and Standards-Based Reform (Lorraine M. McDonnell, Margaret J. McLaughlin & Patricia Morison eds., 1997) (hereinafter Educating One and All); Martha Thurlow, Judy Elliott & Jim Ysseldyke, Testing Students with Disabilities (1998) (hereinafter Thurlow et al., Testing Students with Disabilities). BACK

150. Joint Standards, supra note 3, at pp. 101-106, 119-145 (Chapters 10, 12, and 13); High Stakes, supra note 11, at pp. 13-28 (Chapter 1). BACK

151. See Standard 7.6 (n.89) in Joint Standards, supra note 3, at p. 82. BACK

152. Standard 10.1 (n.130) in Joint Standards, supra note 3, at p. 106. BACK

153. Standard 10.2 states, "People who make decisions about accommodations and test modification for individuals with disabilities should be knowledgeable of existing research on the effects of the disabilities in question on test performance. Those who modify tests should also have access to psychometric expertise for so doing." Joint Standards, supra note 3, at p. 106. BACK

154. See Joint Standards, supra note 3, at pp. 101-108 (Chapter 10); Educating One and All, supra note 149. BACK

155. Thurlow et al., Testing Students with Disabilities, supra note 149. BACK

156. See Standard 9.2 (n.140) and 9.10 (n.133) in Joint Standards, supra note 3, at pp. 97, 99-100. BACK

157. Educating One and All, supra note 149, at Chapter 3. BACK

158. Educating One and All, supra note 149, at Chapter 5. BACK

159. See Standard 10.1 (n.130) and 10.10 in Joint Standards, supra note 3, at pp. 106, 107-108.

Standard 10.10 states, "Any test modifications adopted should be appropriate for the individual test taker, while maintaining all feasible standardized features. A test professional needs to consider reasonably available information about each test taker's experiences, characteristics, and capabilities that might impact test performance, and document the grounds for the modification." Joint Standards, supra note 3, at pp. 107-108. BACK

160. Several standards discuss the appropriate types of validity evidence, including Standards 10.3, 10.5, 10.6, 10.7, 10.8, and 10.11. Because of the low-incidence nature of several of the disability groups, such as hearing loss, vision loss, or concomitant hearing and vision loss, especially when different severity levels and combinations of impairments are considered, this type of evidence will probably need to be accumulated over time in order to have a large enough sample size.

Standard 10.3 states, "Where feasible, tests that have been modified for use with individuals with disabilities should be pilot tested on individuals who have similar disabilities to investigate the appropriateness and feasibility of the modifications." Joint Standards, supra note 3, at p. 106.

Standard 10.5 states, "Technical material and manuals that accompany modified tests should include a careful statement of the steps taken to modify the test to alert users to changes that are likely to alter the validity of inferences drawn from the test scores." Joint Standards, supra note 3, at p. 106.

Standard 10.6 states, "If a test developer recommends specific time limits for people with disabilities, empirical procedures should be used, whenever possible, to establish time limits for modified forms of timed tests rather than simply allowing test takers with disabilities a multiple of the standard time. When possible, fatigue should be investigated as a potentially important factor when time limits are extended." Joint Standards, supra note 3, at p. 107.

Standard 10.7 states, "When sample sizes permit, the validity of inferences made from test scores and the reliability of scores on tests administered to individuals with various disabilities should be investigated and reported by the agency or publisher that makes the modification. Such investigations should examine the effects of modifications made for people with various disabilities on resulting scores, as well as the effects of administering standard unmodified tests to them." Joint Standards, supra note 3, at p. 107.

Standard 10.8 states, "Those responsible for decisions about test use with potential test takers who may need or may request specific accommodations should (a) possess the information necessary to make an appropriate selection of measures, (b) have current information regarding the availability of modified forms of the test in question, (c) inform individuals, when appropriate, about the existence of modified forms, and (d) make these forms available to test takers when appropriate and feasible." Joint Standards, supra note 3, at p. 107.

Standard 10.11 states, "When there is credible evidence of score comparability across regular and modified administrations, no flag should be attached to a score. When such evidence is lacking, specific information about the nature of the modification should be provided, if permitted by law, to assist test users properly to interpret and act on test scores." Joint Standards, supra note 3, at p. 108. BACK

161. See Comment to Standard 10.7 (n.111) in Joint Standards, supra note 3, at p. 106. BACK

162. See Thurlow et al., Testing Students with Disabilities, supra note 149, for a discussion of which accommodations might be most beneficial for students with various impairments and other background factors. BACK

163. The IDEA requires use of alternate assessments in certain areas. See 34 C.F.R. § 300.138. These assessments may or may not be used in decisions that have high-stakes consequences for students. BACK

164. See Educating One and All, supra note 149, at Chapter 5, and Thurlow et al., Testing Students with Disabilities, supra note 149, for a discussion of the issues and processes involved in developing and implementing alternate assessments. BACK
