Appendix B: Glossary of Test Measurement Terms
This glossary is a plain-language reference to help readers understand test measurement terms that are used in this guide or that are relevant to issues the guide discusses. For additional information, readers are encouraged to review the Glossary in the Joint Standards, as well as the appropriate chapters in the Joint Standards.
Accommodation - A change in how a test is presented, in how a test is administered, or in how the test taker is allowed to respond. This term generally refers to changes that do not substantially alter what the test measures. The proper use of accommodations does not substantially change academic level or performance criteria. Appropriate accommodations are made in order to level the playing field, i.e., to provide equal opportunity to demonstrate knowledge.
Achievement levels/proficiency levels - Descriptions of a test taker's competency in a particular area of knowledge or skill, usually defined as ordered categories on a continuum, often labeled from "basic" to "advanced," that constitute broad ranges for classifying performance.
Alternate Assessment - An assessment designed for those students with disabilities who are unable to participate in general large-scale assessments used by a school district or state, even when accommodations or modifications are provided. The alternate assessment provides a mechanism for students with even the most significant disabilities to be included in the assessment system.
Assessment - Any systematic method of obtaining information from tests or other sources, used to draw inferences about characteristics of people, objects, or programs.
Bias - In a statistical context, a systematic error in a test score. In discussing test fairness, bias may refer to construct underrepresentation or construct irrelevant components of test scores. Bias usually favors one group of test takers over another.
Bilingual - The characteristic of being relatively proficient in two languages.
Classification accuracy - The degree to which neither false positive nor false negative categorizations and diagnoses occur when a test is used to classify an individual or event.
Composite score - A score that combines several scores according to a specified formula.
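As an illustration of a composite score, the sketch below uses a weighted sum as the specified formula. The component names, scores, and weights are hypothetical and chosen only for the example; an actual testing program would define its own formula.

```python
def composite_score(scores, weights):
    """Combine component scores with a weighted sum.

    A weighted sum is only one possible "specified formula"; it is
    assumed here for illustration.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same components")
    return sum(scores[name] * weights[name] for name in scores)


# Hypothetical component scores and weights.
scores = {"reading": 80.0, "writing": 70.0, "math": 90.0}
weights = {"reading": 0.5, "writing": 0.25, "math": 0.25}

total = composite_score(scores, weights)
print(total)
```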
Content areas - Specified subjects in education, such as language arts, science, mathematics, or history.
Content domain - The set of behaviors, knowledge, skills, abilities, attitudes or other characteristics to be measured by a test, represented in a detailed specification, and often organized into categories by which items are classified.
Content standard - Statements which describe expectations for students in a subject matter at a particular grade or at the completion of a level of schooling.
Content validity - Validity evidence which analyzes the relationship between a test's content and the construct it is intended to measure. Evidence based on test content includes logical and empirical analyses of the relevance and representativeness of the test content to the defined domain of the test and the proposed interpretations of test scores.
Construct - The concept or the characteristic that a test is designed to measure.
Construct equivalence - 1. The extent to which the construct measured by one test is essentially the same as the construct measured by another test. 2. The degree to which a construct measured by a test in one cultural or linguistic group is comparable to the construct measured by the same test in a different cultural or linguistic group.
Construct irrelevance - The extent to which test scores are influenced by factors that are irrelevant to the construct that the test is intended to measure. Such extraneous factors distort the meaning of test scores from what is implied in the proposed interpretation.
Constructed response item - An exercise for which examinees must create their own responses or products rather than choose a response from an enumerated set. Short-answer items require a few words or a number as an answer, whereas extended-response items require at least a few sentences.
Construct underrepresentation - The extent to which a test fails to capture important aspects of the construct that the test is intended to measure. In this situation, the meaning of test scores is narrower than the proposed interpretation implies.
Criterion validity - Validity evidence which analyzes the relationship of test scores to variables external to the test. External variables may include criteria that the test is expected to be associated with, as well as relationships to other tests hypothesized to measure the same constructs and tests measuring related constructs. Evidence based on relationships with other variables addresses questions about the degree to which these relationships are consistent with the construct underlying the proposed test interpretations. See Predictive validity.
Criterion-referenced - Scores of students referenced to a criterion. For instance, a criterion may be specific, identified knowledge and skills that students are expected to master. Academic content standards in various subject areas are examples of this type of criterion.
Criterion-referenced test - A test that allows its users to make score interpretations in relation to a functional performance level, as distinguished from those interpretations that are made in relation to the performance of others. Examples of criterion-referenced interpretations include comparison to cut scores, interpretations based on expectancy tables, and domain-referenced score interpretations.
Cut score - A specified point on a score scale, such that scores at or above that point are interpreted or acted upon differently from scores below that point. See Performance standard.
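The "at or above" rule in this definition can be sketched in a few lines. The cut points and level labels below are hypothetical; the only behavior taken from the definition is that a score equal to a cut point is treated like the scores above it.

```python
from bisect import bisect_right

# Hypothetical cut points (ascending) and level labels, for illustration only.
CUT_POINTS = [200, 250]
LEVELS = ["basic", "proficient", "advanced"]


def performance_level(score, cut_points=CUT_POINTS, levels=LEVELS):
    """Return the level for a score.

    bisect_right counts how many cut points are less than or equal to the
    score, so a score exactly at a cut point lands in the higher level.
    """
    if len(levels) != len(cut_points) + 1:
        raise ValueError("need exactly one more level than cut points")
    return levels[bisect_right(cut_points, score)]


for s in (199, 200, 249, 250):
    print(s, performance_level(s))
```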
Discriminant validity - Validity evidence based on the relationship between test scores and measures of different constructs.
Error of measurement - The difference between an observed score and the corresponding true score or proficiency. This unintended variation in scores is assumed to be random and unpredictable and impacts the estimate of reliability of a test.
False negative - In classification, diagnosis, or selection, an error in which an individual is assessed or predicted not to meet the criteria for inclusion in a particular group but in truth does (or would) meet these criteria.
False positive - In classification, diagnosis, or selection, an error in which an individual is assessed or predicted to meet the criteria for inclusion in a particular group but in truth does not (or would not) meet these criteria.
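False positives, false negatives, and overall classification accuracy can be tallied once the test-based classification and the true status are both known. The following is a minimal sketch; the function name and the data are hypothetical, and in practice the true status is often unknown or is observed only later.

```python
def classification_summary(predicted, actual):
    """Count false positives and false negatives and compute accuracy.

    predicted[i] is True when the test classifies person i into the group;
    actual[i] is True when person i truly belongs to the group.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predicted, actual))
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    correct = len(pairs) - false_pos - false_neg
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "accuracy": correct / len(pairs),
    }


# Hypothetical classifications for eight test takers.
predicted = [True, True, False, False, True, False, True, False]
actual = [True, False, False, True, True, False, True, False]

summary = classification_summary(predicted, actual)
print(summary)
```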
Field test - A test administration used to check the adequacy of testing procedures, generally including test administration, test responding, test scoring, and test reporting. A field test is generally more extensive than a pilot test. See Pilot test.
High-stakes decision for students - A decision whose result has important consequences for students.
Internal consistency estimate of reliability - An index of the reliability of test scores derived from the statistical interrelationships among item responses or among scores on separate parts of a test.
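One widely used internal consistency index is Cronbach's coefficient alpha. It is not named in this glossary and is shown here only as an example of such an index. The sketch assumes a small, hypothetical matrix of item scores with one row per test taker and one column per item.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of rows (one row of item scores per test taker).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical right/wrong (1/0) responses: four test takers, three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```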
Inter-rater agreement - The consistency with which two or more judges rate the work or performance of test takers; sometimes referred to as inter-rater reliability.
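Two common ways to quantify inter-rater agreement are the proportion of exact agreement and Cohen's kappa, which adjusts that proportion for agreement expected by chance. Neither index is prescribed by this glossary; the sketch below is illustrative, and the ratings are hypothetical rubric scores from two judges.

```python
from collections import Counter


def exact_agreement(rater_a, rater_b):
    """Proportion of performances that both raters scored identically."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(rater_a)
    observed = exact_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category
    # if each rated independently at their own observed category rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical scores on a 1-4 rubric for eight performances.
rater_a = [1, 2, 3, 3, 2, 1, 4, 4]
rater_b = [1, 2, 3, 2, 2, 1, 4, 3]

agreement = exact_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement, round(kappa, 3))
```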
Local evidence - Evidence (usually related to reliability or validity) collected for a specific and particular set of test takers in a single institution, district, or state, or at a specific location.
Local norms - Norms by which test scores are referred to a specific, limited reference population of particular interest to the test user (such as institution, district, or state); local norms are not intended as representative of populations beyond that setting.
Norm-referenced - Scores of students compared to a specified reference population.
Norm-referenced test - A test that allows its users to make score interpretations of a test taker's performance in relation to the performance of other people in a specified reference population.
Norms - Statistics or tabular data that summarize the distribution of test performance for one or more specified groups, such as test takers of various ages or grades. The group of examinees represented by the norms is referred to as the reference population. A reference population can be a local population of test takers (e.g., from a school, district, or state), or it can represent a larger population, such as test takers from several states or throughout the country.
Percentile rank - Most commonly, the percentage of scores in a specified distribution that fall below the point at which a given score lies. Sometimes the percentage is defined to include scores that fall at the point; sometimes the percentage is defined to include half of the scores at the point.
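The three conventions described in this definition give different answers when several test takers share the same score. The sketch below computes all three; the function name, the `kind` labels, and the score distribution are hypothetical.

```python
def percentile_rank(scores, value, kind="below"):
    """Percentile rank of `value` within `scores` under one of three conventions.

    kind="below"        counts only scores below the value (most common)
    kind="at_or_below"  also counts scores equal to the value
    kind="midpoint"     counts half of the scores equal to the value
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    below = sum(1 for s in scores if s < value)
    at = sum(1 for s in scores if s == value)
    if kind == "below":
        counted = below
    elif kind == "at_or_below":
        counted = below + at
    elif kind == "midpoint":
        counted = below + 0.5 * at
    else:
        raise ValueError("unknown kind: " + kind)
    return 100.0 * counted / len(scores)


# Hypothetical distribution of ten scores; three test takers scored 50.
scores = [10, 20, 20, 30, 40, 50, 50, 50, 60, 70]

for kind in ("below", "at_or_below", "midpoint"):
    print(kind, percentile_rank(scores, 50, kind))
```

Because the conventions can differ noticeably when ties are common, a score report is easier to interpret when it states which one was used.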
Performance assessments - Product- and behavior-based measurements based on settings designed to emulate real-life contexts or conditions in which specific knowledge or skills are actually applied.
Performance standard - 1. An objective definition of a certain level of performance in some domain in terms of a cut score or a range of scores on the score scale of a test measuring proficiency in that domain. 2. A statement or description of a set of operational tasks exemplifying a level of performance associated with a more general content standard; the statement may be used to guide judgements about the location of a cut score on a score scale. The term often implies a desired level of performance. See Cut score.
Pilot test - A test administered to a representative sample of test takers to try out some aspects of the test or test items, such as instructions, time limits, item response formats, or item response options. See Field test.
Portfolio assessments - A systematic collection of educational or work products that have been compiled or accumulated over time, according to a specific set of principles.
Precision of measurement - A general term that refers to a measure's sensitivity to error of measurement.
Predictive validity - Validity evidence that analyzes the relationship of test scores to variables external to the test that the test is expected to predict. Predictive evidence indicates how accurately test data can predict criterion scores that are obtained or outcomes that occur at a later time. See Criterion validity; False positive; False negative.
Random error - An unsystematic error; a quantity (often observed indirectly) that appears to have no relationship to any other variable.
Reference population - The population of test takers represented by test norms. The sample on which the test norms are based must permit accurate estimation of the test score distribution for the reference population. The reference population may be defined in terms of size of the population (local or larger), examinee age, grade, or clinical status at time of testing, or other characteristics.
Reliability - The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and repeatable for an individual test taker; the degree to which scores are free of errors of measurement for a given group.
Sample - A selection of a specified number of entities called sampling units (test takers, items, schools, etc.) from a large specified set of possible entities, called the population. A random sample is a selection according to a random process, with the selection of each entity in no way dependent on the selection of other entities. A stratified random sample is a set of random samples, each of a specified size, from several different sets, which are viewed as strata of the population.
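A stratified random sample can be sketched by drawing a separate random sample of a specified size from each stratum. The stratum names, unit identifiers, and sample sizes below are hypothetical, and sampling without replacement within each stratum is an assumption of this sketch.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a random sample of the requested size from each stratum.

    strata maps a stratum name to its list of sampling units;
    sizes maps the same names to the number of units to draw.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}


# Hypothetical population of schools grouped into two strata.
strata = {
    "urban": ["U1", "U2", "U3", "U4"],
    "rural": ["R1", "R2", "R3"],
}
sizes = {"urban": 2, "rural": 1}

picked = stratified_sample(strata, sizes, seed=0)
print(picked)
```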
Sampling from a domain - The process of selecting test items to represent a specified universe of performance.
Score - Any specific number resulting from the assessment of an individual; a generic term applied for convenience to such diverse measures as test scores, absence records, course grades, ratings, and so forth.
Scoring rubric - The established criteria, including rules, principles, and illustrations, used in scoring responses to individual items and clusters of items. The term usually refers to the scoring procedures for assessment tasks that do not provide enumerated responses from which test takers make a choice. Scoring rubrics vary in the degree of judgement entailed, in the number of distinct score levels defined, in the latitude given scorers for assigning intermediate or fractional score values, and in other ways.
Selection - A purpose for testing that results in the acceptance or rejection of applicants for a particular educational opportunity.
Sole criterion - When only one standard (such as a test score) is used to make a judgement or a decision. This can include a step-wise decision-making procedure where students must reach or exceed one criterion (such as a cut score of a test) independent of or before other criteria can be considered.
Speed test - A test in which performance is measured primarily or exclusively by the time to perform a specified task, or the number of tasks performed in a given time, such as tests of typing speed and reading speed.
Standards-based assessment - Assessments intended to represent systematically described content and performance standards.
Systematic error - A score component (often observed indirectly), not related to the test performance, that appears to be related to some salient variable or sub-grouping of cases in empirical analyses. This type of error tends to increase or decrease observed scores consistently in members of the subgroup or levels of the salient variable. See Bias.
Technical manual - A publication prepared by test authors and publishers to provide technical and psychometric information on a test.
Test - An evaluative device or procedure in which a sample of an examinee's behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process.
Test developer - The person(s) or agency responsible for the construction of a test and for the documentation regarding its technical quality for an intended purpose.
Test development - The process through which a test is planned, constructed, evaluated and modified, including consideration of content, format, administration, scoring, item properties, scaling, and technical quality for its intended purpose.
Test documents - Publications such as test manuals, technical manuals, user's guides, specimen sets, and directions for test administrators and scorers that provide information for evaluating the appropriateness and technical adequacy of a test for its intended purpose.
Test manual - A publication prepared by test developers and publishers to provide information on test administration, scoring, and interpretation and to provide technical data on test characteristics.
Validation - The process through which the validity of the proposed interpretation of test scores is evaluated.
Validity - The degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test.
Validity argument - An explicit justification of the degree to which accumulated evidence and theory support the proposed interpretation(s) of test scores.
Validity evidence - Systematic documentation that empirically or theoretically demonstrates, under the specific conditions of the individual analysis, to which extent, for whom, and in which situations test score inferences are valid. No single piece of evidence is sufficient to document validity of test scores; rather, aspects of validity evidence must be accumulated to support specific interpretations of scores.
Validity evidence for relevant subgroups - Validity results disaggregated by subgroups, such as by race/ethnicity, or by disability or limited English proficiency status. This type of evidence is appropriate generally when credible research suggests that interpretations of the test scores may differ by subgroup. For instance, if a test will be used to predict future performance, validity evidence should document that the scores are as valid a predictor of the intended performance for one subgroup as for another.