As described in the preceding chapter, assessment reform is taking place at multiple levels (national, state, district, and school) with the goal of meeting several purposes:

- Monitoring student progress
- Holding schools and districts accountable for student performance
- Aligning curriculum, instruction, and assessment
- Certifying student achievement
- Influencing instructional practices
These purposes are not mutually exclusive, and any one performance assessment system may be intended to target several purposes at once.
Data collected during our study reveal that several factors function as facilitators or as barriers in the assessment reform process and, thus, as facilitators of or barriers to achieving the stated purposes of performance assessment systems. For example, if a state develops and implements an assessment system for the purpose of certifying student achievement, then a reliable scoring procedure is a facilitator for the intended use of the system. However, if there are technical problems (e.g., low interrater reliability) with the system, then those problems serve as a barrier to using the system to certify student achievement; no system that is technically unsound can be justifiably used for certifying student capabilities.
Our analysis of facilitators and barriers in assessment reform is complicated by the fact that many performance assessment systems (particularly those established at the state level) are intended to achieve multiple purposes, and factors that facilitate the achievement of one purpose may serve as a barrier to the achievement of a second, equally desirable purpose. A high degree of standardization and technical perfection (i.e., reliability and validity), for instance, may facilitate the gathering of reliable student data for monitoring student progress, but the rigidity of the system may serve as a barrier to adapting the system to inform and guide everyday instructional practices.
Facilitators and Barriers: Achieving the Stated Purposes of Assessment Reform
This chapter examines the facilitators and barriers in assessment reform vis-à-vis the purposes of assessment reform. The facilitators and barriers in achieving the first four purposes listed above are considered in the analysis:

- Monitoring student progress
- Holding schools and districts accountable for student performance
- Aligning curriculum, instruction, and assessment
- Certifying student achievement
The final stated purpose of performance assessment systems (to influence instructional practices) is left for consideration in subsequent chapters of this report. While the first four listed purposes of assessment reform can be conceived of and met outside of the classroom, this last purpose, as well as the ultimate objective of assessment reform (improved student learning), cannot. For this reason, the facilitators and barriers in bringing about instructional change are taken up separately in Chapter 6, and the observed impacts of assessment reform on teaching and learning are explicated in Chapter 7.
Finally, the analysis in this chapter is organized by the level of initiation of the assessment: state, school and national (considered together because the schools in this study participating in national reform efforts do so because of their own school-level purposes for assessment reform), and district. The analysis is divided according to the level of initiation of the performance assessment because, although some facilitators and barriers in reform are the same at the different levels of initiation, they tend to vary across the different levels in their manifestation and in their impact.
Limitations of the Analysis
The analysis is constrained by several limitations that should be noted at the outset. The first, and perhaps most important, limitation stems from the school-site level emphasis of the study. Though information was collected about the extent of success in achieving the stated purposes of assessment systems, this information tends to be more detailed for school- and district-initiated assessments than for state-initiated assessments: information about progress toward attaining state objectives tended to come from documents and general, as opposed to detailed and probing, interviews. Therefore, detailed information pertaining to achievement of purposes may be more complete and reliable for assessments initiated by schools and districts than for those initiated by states.
A second limitation is inherent in the timing and time frame of our study. Because even performance assessments that are being fully implemented are still relatively new, the current study can identify only those barriers and facilitators that have emerged during the early and intermediate stages of assessment development and implementation. Facilitators and potential barriers to sustained reform and to achieving all stated purposes must be identified through further studies.
Third, barriers imposed by the financial costs associated with developing and implementing performance assessment systems can be substantial. However, data concerning this particular barrier at the 16 sites were not uniformly available, and those cost data that were available encompassed different aspects (e.g., costs of assessment development, costs of professional development activities, costs of scoring) of performance assessment systems. This unevenness in data makes it difficult to identify common themes across sites.
Finally, an important caveat must be reiterated at this juncture. Our analyses of facilitators and barriers in the implementation of assessment systems initiated at the state and district levels are based, in part, upon documents provided by and interviews with state and district officials, and upon the experiences, opinions, and perceptions of individuals associated with a single school within the district or state developing or using the assessment. While information from state- and district-level documents and officials pertains to the state or district as a whole, the experiences and perceptions of individuals at the school level cannot be taken as representative of the experiences and perceptions of their counterparts in other schools within the state or district. Still, by comparing data across sites, we are able to identify themes and to formulate understandings of potential facilitators and barriers in assessment reform, even though we cannot generalize the reactions of individuals at a single school to their peers across schools and districts. In other words, this study is not an evaluation of the status of reforms in any one state, district, or school; it is focused instead upon gaining an understanding of cross-site issues that arise in developing and in implementing performance assessment systems.
The six state-initiated performance assessments included in this study (Arizona, Kentucky, Maryland, New York, Oregon, and Vermont) aim to achieve, in some combination, all four purposes under consideration in this section: monitoring student progress; holding schools and districts accountable for student performance; aligning curriculum, instruction, and assessment (this is not an explicitly stated purpose of any of the six state assessment reform initiatives, but it is an implicit purpose of several of them); and certifying student achievement. We identified the following facilitators and barriers (depending upon their presence or absence) to achieving or making progress toward these purposes:

- Utilization of outside sources of information and expert help
- Technical soundness of the assessment
- Coordination with associated reforms
- Actual or perceived fairness of the assessment system
Exhibit 5-1 summarizes the role played by each of these factors in facilitating or obstructing assessment reform in each state-initiated performance assessment system.
Below, each factor is considered in terms of its impact upon attainment of the stated purposes of performance assessments. Note, however, that the impact of each facilitator and barrier can be relatively strong or weak depending upon whether or not other facilitators or barriers are present. In other words, the impacts of these factors, as they serve to facilitate or impede assessment reform, are interdependent. In addition, certain factors serve as facilitators (or barriers) to the achievement of all purposes of assessment reform, while certain factors may affect only one or some of the stated purposes.
Utilization of Outside Sources of Information and Expert Help
Outside sources of information and expert help have facilitated the development and implementation of assessment systems (regardless of the purposes of the assessment systems). States have frequently found the legwork previously done by others to be useful in their conceptualization and development of performance-based assessments. In addition, the utilization of expert help in developing and scoring assessments or in evaluating assessment systems has enhanced states' capacity to develop, implement, and track the quality of their assessment systems.
All states in our sample have drawn upon extant information resources in both conceptualizing and designing their performance assessments (see Exhibit 4-13). The usefulness of doing so is clear: extant information eliminates the need to "reinvent the wheel" at every juncture. For example, four of the five states assessing mathematics outcomes (Arizona, Maryland, Oregon, and Vermont) turned to the National Council of Teachers of Mathematics (NCTM) for assistance in formulating their mathematics curriculum frameworks and in designing assessments in mathematics. Officials and teachers in these states consider NCTM's standards to be at the forefront of standards documents published by professional educators' associations, and these states wanted to benefit from the work already conducted by NCTM.
Similarly, three of the four states using performance assessments in science (Kentucky, Maryland, and Oregon) used the American Association for the Advancement of Science (AAAS) guidelines to conceptualize and design their science performance assessments. In the fourth state, New York, the extant science curriculum framework draws upon the AAAS guidelines, but teachers at our New York site did not necessarily utilize new sources of information to determine the content of their assessments.
In addition to utilizing information from professional organizations, several states have contracted with experts in education and measurement to help with the development, evaluation, and scoring of assessments. Arizona contracted with a private test developer, Riverside Publishing Company, to design its performance assessments, thereby easing the development process. Kentucky, Maryland, Oregon, and Vermont hired consultants to help teams of educators and policy makers develop their assessment systems, including developing tasks, scoring rubrics, and standards of performance. Both Arizona and Kentucky continue to contract with testing and measurement firms to score the assessment tasks.
New York's experimentation with waivers allowing schools to develop performance-based assessments to substitute for portions of the Regents Examinations clearly emerges as the outlier in this group of six state-initiated assessments. At this point in time, the New York Department of Education works with schools to ensure that these performance assessments are sufficiently rigorous to serve as substitutes for the Regents, but schools pursuing these waivers are otherwise left to their own devices to seek assistance in the conceptualization, development, and implementation of the assessments. In this regard, New York's initiative behaves as a small, local-level reform rather than a massive state-level change in the state's assessment system.
In sum, the existence of standards and guidelines developed by professional associations has facilitated states' efforts to develop new curriculum and assessment frameworks. The new frameworks, in turn, have provided the basis for designing and developing performance assessments that are consonant with up-to-date understanding of the subject matter. Furthermore, states have enhanced their ability to develop and design large-scale assessments by seeking help with designing and scoring assessments from testing and measurement companies and from experts in education and measurement.
Technical Soundness of the Assessment
Another important facilitator (in its presence) or barrier (in its absence) to the state's ability to achieve the intended purposes of assessment reform is the technical soundness of the assessment system. This factor is clearly important to a state's ability to achieve at least three of the four purposes under consideration. Technical soundness of an assessment is crucial to a state's endeavor to:

- Monitor student progress
- Hold schools and districts accountable for student performance
- Certify student achievement
The interdependence of these three purposes is clear: that for which schools are held accountable and those skills for which student mastery is certified clearly must be monitored in some way. As the technical soundness of assessment systems is discussed in Chapter 4, we touch upon it only briefly here, focusing specifically upon the implications of the public's perceptions (the public both inside and outside the education system) of the technical aspects of performance assessment systems.
The importance of technical soundness is twofold. First, in the absence of construct validity and scoring reliability, student progress toward desired outcomes is not adequately measured. Data generated from the use of these assessments cannot legitimately be used for holding schools accountable or for certifying students. Second, if the state does not establish the assessment system's construct validity and interrater reliability, public confidence in the assessment may deteriorate, thereby derailing the assessment reform process.
In essence, the first of these issues is the problem educators perceive with standardized, norm-referenced, multiple-choice tests: performance assessments are believed to be a more valid method for assessing the types of skills and competencies educators want students to learn and demonstrate. Whether or not performance assessments are, indeed, valid and reliable must be determined, and it must be determined with each administration of the assessment. Kentucky and Maryland are using their systems for high-stakes purposes (holding schools accountable for their students' performance) and have instituted measures to ensure the validity of their assessment systems. Oregon and Vermont are at an earlier developmental stage and have not yet established the validity of their systems, although Vermont has instituted measures to improve its scoring reliability. Arizona, on the other hand, instituted validity measures but had its performance assessment system rejected due to the perception (and apparent reality) that it was a technically unsound system.
The second issue is that of convincing stakeholders that a compromise must be made (at least to date and certainly for the foreseeable future) in turning to performance assessments. Stakeholders must believe (1) that all performance assessment systems face the barrier posed by imperfect interrater reliability but that interrater differences in scoring can be minimized, and (2) that the assessments reveal valuable information about student learning despite problems in standardizing scoring. (Issues of equity, which proponents of performance assessment must also address, are discussed later in this chapter.)
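To make the notion of interrater reliability concrete, the sketch below shows one common way of quantifying agreement between two raters who score the same set of student responses: Cohen's kappa, which discounts raw percent agreement by the agreement expected by chance. This is an illustrative sketch only; the 1-4 rubric scale and the raters' scores are hypothetical and are not drawn from any assessment system in this study.

    # Illustrative sketch: quantifying interrater agreement with Cohen's kappa.
    # The 1-4 rubric scale and both raters' scores below are hypothetical.
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        n = len(rater_a)
        # Proportion of responses on which the two raters agree exactly.
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Agreement expected by chance, given each rater's score distribution.
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
        # Kappa rescales agreement beyond chance against the maximum possible.
        return (observed - expected) / (1 - expected)

    # Two raters score the same ten student responses on a 1-4 rubric.
    rater_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
    rater_b = [3, 2, 3, 3, 1, 2, 4, 4, 2, 3]
    print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.71 (raw agreement: 0.80)

The gap between raw agreement (0.80) and kappa (0.71) in this invented example reflects the agreement two raters would reach by chance alone; professional development sessions of the kind described below aim to push both figures toward 1.0.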
The state-initiated assessment system most hampered by perceived problems with the assessment's technical quality is the Arizona Student Assessment Program (ASAP). ASAP has been hindered in its efforts to monitor student progress toward attainment of the state's Essential Skills because of perceived technical problems with the assessment instruments. Teachers interviewed for this study suggested that the performance assessment contained several technical problems. Specifically, teachers said: (1) the tasks were not valid measures of student abilities; (2) designing the assessment to audit only a subset of the Essential Skills, a subset which would change yearly, led to incomparable results across years; (3) standards of satisfactory performance on the assessment had not been established; and (4) interrater reliability in scoring tasks was inadequate. In response to such objections, the state superintendent suspended the program indefinitely in January of 1995.
In contrast to ASAP, other state-initiated systems (including those in Maryland, Kentucky, and Vermont) are considered by teachers and other education constituents to be relatively sound technically, and they have encountered less opposition. Though the performance assessments in these states have been controversial in the past and have attracted some challenges to their technical soundness (particularly in Kentucky and Vermont, where interrater reliability has posed a problem), these states have, to date, successfully addressed concerns by instituting professional development sessions to improve interrater reliability and to convince teachers of the utility of the assessment system. This greater demonstration and public perception of validity, reliability, and utility has, so far, led to the continued implementation and acceptance of Kentucky's and Vermont's assessment systems. In addition, Kentucky's public relations campaign during the 1994-95 academic year has led to greater public acceptance of the system.
It should be noted that the perceived utility of performance assessments in Kentucky and Vermont (teachers participating in this study find that the use of portfolios has benefited their classroom practice) can counterbalance the problems associated with interrater reliability. In Kentucky and Vermont, where teachers find value in the state's performance assessment, teachers are more accepting of technical problems than are teachers in Arizona, who find little to value in ASAP (again, in the cases of the three schools participating in this study).
The performance assessments in Oregon and New York are still in the developmental phase, and the procedures to ensure their technical soundness have, therefore, not yet been fully instituted. However, teachers in Oregon, in particular, expressed concerns about the content validity of the assessment tasks the state intends to adopt. These concerns are driven by the fact that the state initiated the development of assessment tasks before revising curriculum guidelines to better support the newly articulated student outcomes. Thus, teachers were being asked to develop tasks coordinated with a curriculum that did not yet exist. The anxiety about the technical soundness of the assessment system was particularly acute in Oregon because the assessment system eventually was to be used to certify student achievement.1
In the case of New York, each school is piloting its own portfolio. Hence, those teachers (as well as others associated with the school) who are developing the assessments believe the assessments to be valid and do not appear to be concerned about interrater reliability.
Coordination with Associated Reforms
Ensuring the compatibility of assessment reform with other related reforms can serve as a potential facilitator in assessment reform. This factor is particularly important when the objective (explicit or implicit) is to align curriculum, instruction, and assessment. It also is important when the objective of the assessment system is to hold schools accountable for student performance.
In theory, coordinated curriculum and assessment reforms should facilitate the implementation of both. At this point in time, coordinated efforts to introduce curriculum guidelines have been most successful in those states in which the efforts clearly reinforce each other and are visible at the local level. They have been less successful when the timing of the reforms has not been in sync or when there has been a lack of linkages with content or performance standards.
The reform efforts of two states in our sample illustrate how coordination of initiatives can facilitate reform: in both Maryland and Kentucky, coordination of performance assessments with associated reforms has served as a facilitator of assessment reform. The Kentucky Education Reform Act of 1990 established six broad learning goals and a set of academic expectations for all students, and the assessment system is based upon these academic expectations. This integration of reforms has, arguably, facilitated the state's introduction of its accountability measures: because school administrators and teachers know where the state is going, their acceptance of the Kentucky Instructional Results Information System (KIRIS) is enhanced. Similarly, "Maryland Learning Outcomes" are the bedrock of the assessment system in Maryland. These two states use the results of their assessment systems for accountability purposes because each assessment system is integrated with a set of articulated outcomes. Coordination of reforms, however, is not in and of itself a facilitator; teachers' and other constituents' acceptance of other aspects of the reform efforts is an equally important criterion.
Two counterexamples show the tenuousness of coordinated reforms. In Oregon, the state introduced Foundation Skills and Core Applications for Living to guide both curriculum and assessment for the state's Certificate of Initial Mastery. However, the articulation of the skills and applications and the initial progress toward developing performance assessments met with opposition in the state legislature, as the legislature wanted more rigorous content standards infused into the two reforms. Similarly, in Arizona, performance assessments were introduced to audit students' progress toward the state's newly adopted Essential Skills, representing a coordination between elements of reform. However, the technical problems with ASAP rendered moot any advantage of coordinating the reforms.
Actual or Perceived Fairness of the Assessment System
Actual fairness (as determined by objective criteria) and public perceptions of the fairness or lack of fairness of the assessment can also serve as facilitators or barriers in initiating and sustaining assessment reform and, hence, in meeting its purposes. (Clearly, the issue of fairness is closely related to perceived technical soundness of the assessment system.) The fairness factor has two dimensions:

- Fairness of the accountability mechanisms to schools
- Fairness of the assessment to individual students
Fairness of Accountability Mechanisms
When accountability systems are "high stakes," the fairness of those mechanisms will come under close scrutiny: critics will examine the system, against some set of criteria, to judge whether or not it deals fairly with the schools it covers. In addition, public perceptions of fairness may be based upon criteria that are not the same as those employed by objective evaluators or by the education authority itself, but which may nonetheless pose roadblocks to the implementation of performance assessments for accountability purposes.
Two states in our sample, Kentucky and Maryland, have built high-stakes accountability mechanisms into their performance-based assessment systems. Both states have established performance goals for schools to strive for, though Kentucky's goals are based on gain scores and differ across schools, while Maryland's standards are uniform across all schools. The two accountability systems are summarized in Exhibit 5-2.
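To illustrate the distinction summarized in Exhibit 5-2, the minimal sketch below contrasts a gain-score goal (each school judged against its own baseline) with a uniform absolute standard (one fixed bar for all schools). All index values and thresholds are invented for illustration and do not reflect actual Kentucky or Maryland figures.

    # Illustrative sketch: gain-score accountability versus a uniform standard.
    # All school index values and thresholds below are hypothetical.

    def meets_gain_goal(baseline, current, required_gain=5.0):
        # Kentucky-style: each school is judged against its own prior scores.
        return (current - baseline) >= required_gain

    def meets_absolute_standard(current, statewide_standard=65.0):
        # Maryland-style: every school is judged against one fixed standard.
        return current >= statewide_standard

    schools = {
        "School A": (35.0, 42.0),  # low-scoring school with a large gain
        "School B": (68.0, 70.0),  # high-scoring school with a small gain
    }
    for name, (baseline, current) in schools.items():
        print(f"{name}: gain goal met = {meets_gain_goal(baseline, current)}, "
              f"absolute standard met = {meets_absolute_standard(current)}")

In this invented example, the low-scoring school meets the gain goal but not the absolute standard, while the high-scoring school meets the absolute standard but not the gain goal; this is why the two mechanisms invite different criticisms of fairness.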
To date, the Kentucky system (arguably the "fairer" of the two systems because school performance is measured according to gains rather than absolute scores) has drawn more criticism than Maryland's. However, the state has taken a proactive approach toward identifying any weaknesses in the system: an evaluation of the system commissioned by the Kentucky Institute for Education Research identified several potential problems.2
By taking steps to identify potential problems, the Kentucky Department of Education (KDE) may move to remedy those problems and to explain its actions adequately to the public. The ultimate success of these actions remains to be seen. However, parents interviewed for this study said that, though they harbor concerns about the equity of the accountability system, they believe that some such system is necessary.
Maryland's accountability system also has drawn some criticism. District administrators interviewed for this study believe that it is premature to use the MSPAP for accountability purposes, because ". . . the state has yet to pull off a test that is fully comparable one year to the next." (Note that the issue of "fairness" here affects all schools in the state equally.) However, overall complaints about Maryland's intended use of MSPAP for accountability purposes seem to have been few. In both cases, the fairness of the system for accountability purposes has come under criticism, but in neither case has the barrier been strong enough to derail the assessment system or to dissuade the state from refining the system and continuing to use it for accountability purposes.
Fairness to Students
Perceptions of fairness to individual students also serve as a facilitator or barrier, affecting the adequate monitoring of student progress and accountability in particular, and the other purposes in general. This may be the case especially when certification of student skills is an objective of the assessment, but fairness to students is also an important factor in winning parents' support for a new assessment, regardless of its purposes.
This issue was salient in Maryland and in Kentucky. In Maryland, teachers participating in this study said that the MSPAP presented too great a challenge for students with learning disabilities and that too many of these students experienced frustration and failure during the assessment. Similarly, in Kentucky, all students, regardless of disability, are included in KIRIS, and their work is scored according to the same criteria as the work of their nondisabled peers. Teachers suggested not only that this situation was unfair to students, but also that the undifferentiated standards led some teachers, in effect, to do some assignments for their students with disabilities.
Beyond inadequately monitoring student performance, assessments that do not take students' special needs into consideration can provoke opposition from teachers and parents who believe the assessments treat some students unfairly. In particular, the inclusion of students with disabilities and the accommodations made (or not made) for their participation can lead to dissatisfaction on the part of parents and teachers. Such sentiments were evident at the schools we visited in Maryland, Kentucky, and Vermont. Again, in none of these three cases has this potential barrier been strong enough to derail the assessment reform process.
1 As stated earlier, Oregon's plan was substantially revised in late summer, 1995.
2 The Evaluation Center, Western Michigan University, for the Kentucky Institute for Education Research, An Independent Evaluation of the Kentucky Instructional Results Information System (KIRIS), Frankfort, Kentucky, January 1995.