Evaluating Online Learning: Challenges and Strategies for Success
July 2008

Finding Appropriate Comparison Groups

When evaluators have research questions about an online program's impact on student achievement, their best strategy for answering those questions is often an experimental design, such as a randomized controlled trial, or a quasi-experimental design that requires them to find matched comparison groups. These two designs are the most widely accepted methods for determining program effects (see table 3, p. 28).

Evaluators who use these methods must proceed carefully. They must first ensure that comparisons are appropriate, taking into consideration the population served and the program's goals and structure. For example, online programs dedicated to credit recovery* would want to compare their student outcomes with those of other credit recovery programs because they are serving similar populations. Evaluators must also be scrupulous in their efforts to find good comparison groups, which is often a challenging task. For a particular online class, there may be no corresponding face-to-face class. Or it may be difficult to avoid a self-selection bias if students (or teachers) have chosen to participate in an online program. Or the online program might serve a wide range of students, possibly from multiple states, with a broad range of ages, or with different levels of preparation, making it difficult to compare to a traditional setting or another online program. Evaluators attempting to conduct a randomized controlled trial can encounter an even greater challenge in devising a way to randomly assign students to receive a treatment (see Glossary of Common Evaluation Terms, p. 65) online.

How might evaluators work around these comparison group difficulties?

Several of the evaluations featured in this guide sought to compare distance learning programs with face-to-face learning settings, and they took various approaches to the inherent technical challenges of doing so. Despite some difficulties along the way, the evaluators of online projects in Louisiana, Alabama, and Maryland all successfully conducted comparative studies that yielded important findings for their programs.

Identify Well-matched Control Groups for Quasi-experimental Studies

In designing the evaluation for Louisiana's Algebra I Online, program leaders in the Louisiana Department of Education wanted to address the bottom line they knew policymakers cared about most: whether students in the program's online courses were performing as well as students studying algebra in a traditional classroom.

To implement the quasi-experimental design of their evaluation (see Glossary of Common Evaluation Terms, p. 65), the program's administrators needed to identify traditional algebra classrooms to use as controls. The idea was to identify standard algebra courses serving students similar to those participating in the Algebra I Online program and then give pre- and post-course tests to both groups to compare how much each group learned, on average. Of course, in a design such as this, a number of factors besides the class format (online or face-to-face) could affect students' performance: student-level factors, such as individual ability and home environment; teacher-level factors, such as experience and skill in teaching algebra; and school-level factors, such as out-of-class academic supports for students. Clearly, it was important to the evaluation to find the closest matches possible.
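The pre- and post-course comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluators' actual analysis: the function name `mean_gain` and all of the scores are hypothetical, invented only to show the arithmetic of comparing average gains.

```python
from statistics import mean


def mean_gain(pre, post):
    """Average post-minus-pre gain for one group of students."""
    if len(pre) != len(post):
        raise ValueError("each student needs both a pretest and a posttest score")
    return mean(after - before for before, after in zip(pre, post))


# Hypothetical scores, invented for illustration only.
online_pre = [52, 60, 47, 71, 65]
online_post = [68, 74, 63, 80, 79]
traditional_pre = [55, 58, 49, 70, 63]
traditional_post = [66, 70, 61, 79, 74]

online_gain = mean_gain(online_pre, online_post)
traditional_gain = mean_gain(traditional_pre, traditional_post)

# A positive difference means the online group gained more, on average.
gain_difference = online_gain - traditional_gain
print(f"online: {online_gain:.1f}  traditional: {traditional_gain:.1f}  "
      f"difference: {gain_difference:.1f}")
```

A raw difference in gains like this one says nothing about the student-, teacher-, and school-level factors listed above, which is why the quality of the matches matters so much.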

Under a tight timeline, program administrators worked with all their participating districts to identify traditional classrooms that were matched demographically to the online classes and then administered a pretest of general mathematics ability to students in both sets of classes. External evaluators from the Education Development Center (EDC) joined the effort at this point. Although impressed by the work that the program staff had accomplished given time and budget constraints, the EDC evaluators were concerned about the quality of matches between the treatment and control classes. Finding good matches is a difficult task under the best of circumstances and, in this case, it proved even more difficult for the nonevaluators (program and school administrators) who did the matching. The quality of the matches was especially problematic in small districts or nonpublic schools that had fewer control classrooms from which to choose. EDC evaluator Rebecca Carey says, "To their credit, [the program's administrators] did the best they could in the amount of time they had." Still, she adds, "the matches weren't necessarily that great across the control and the experimental.… Had we been involved from the beginning, we might have been a little bit more stringent about how the control schools would match to the intervention schools and, maybe, have made the selection process a little bit more rigorous."

Table 3. Evaluation Design Characteristics

Experimental design
Characteristics: Incorporates random assignment of participants to treatment and control groups. The purpose of randomization is to ensure that all possible explanations for changes (measured and unmeasurable) in outcomes are taken into account, randomly distributing participants in both the treatment and control groups so there should be no systematic baseline differences. Treatment and control groups are compared on outcome measures. Any differences in outcomes may be assumed to be attributable to the intervention.
Advantages: Most sound or valid study design available. Most accepted in scientific community.
Disadvantages: Institutional policy guidelines may make random assignment impossible.

Quasi-experimental design
Characteristics: Involves developing a treatment group and a carefully matched comparison group (or groups). Differences in outcomes between the treatment and comparison groups are analyzed, controlling for baseline differences between them on background characteristics and variables of interest.
Advantages: More practical in most educational settings. Widely accepted in scientific community.
Disadvantages: Finding and choosing suitable treatment and comparison groups can be difficult. Because of nonrandom group assignment, the outcomes of interest in the study may have been influenced not only by the treatment but also by variables not studied.

Source: Adapted from U.S. Department of Education, Mobilizing for Evidence-Based Character Education (2007). Available from http://www.ed.gov/programs/charactered/mobilizing.pdf.

Ultimately, the evaluation team used students' pretest scores to gauge whether the treatment and control group students started the course with comparable skills and knowledge and employed advanced statistical techniques to help control for some of the poor matches.
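One common way to gauge whether two groups started out comparable is a standardized mean difference on the pretest. The sketch below assumes that approach for illustration; the source does not say which statistic the evaluation team used. The scores and the 0.25 cutoff are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treatment, control):
    """Difference in group means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_variance = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_variance)


# Hypothetical pretest scores, invented for illustration only.
treatment_pretest = [10, 12, 14, 16, 18]
control_pretest = [11, 13, 15, 17, 19]

smd = standardized_mean_difference(treatment_pretest, control_pretest)

# Illustrative cutoff (an assumption, not taken from the source):
# flag the match if the groups differ by more than a quarter of a
# standard deviation at baseline.
needs_adjustment = abs(smd) > 0.25
print(f"baseline difference: {smd:.2f} SD, needs adjustment: {needs_adjustment}")
```

A flagged difference does not rule out a comparison, but it signals that the analysis should control for baseline scores rather than compare raw outcomes.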

In their report, the evaluators also provided data on differences between the control and treatment groups (e.g., student characteristics, state test scores in math, size of the school), and they drew from other data sources (including surveys and observations) to triangulate their findings (see Common Problems When Comparing Online Programs to Face-to-Face Programs, p. 30). To other programs considering a comparative study, the Algebra I Online evaluators recommend involving the evaluation team early in the planning process and having them supervise the matching of treatment and control groups.

The evaluators of Alabama's Alabama Connecting Classrooms, Educators, & Students Statewide Distance Learning (ACCESS) initiative similarly planned a quasi-experimental design and needed traditional classes to use as matches. ACCESS provides a wide range of distance courses, including core courses, electives, remedial courses, and advanced courses, which are either Web-based, utilize interactive videoconferencing (IVC) platforms, or use a combination of both technologies. In the case of IVC courses, distance learners receive instruction from a teacher who is delivering a face-to-face class at one location while the distance learners participate from afar.

When external evaluators set out to compare the achievement of ACCESS's IVC students to that of students in traditional classrooms, they decided to take advantage of the program's distinctive format. As controls, they used the classrooms where the instruction was delivered live by the same instructor. In other words, the students at the site receiving the IVC feed were considered the treatment group, and students at the sending site were the control group. This design helped evaluators to isolate the effect of the class format (IVC or face-to-face) and to avoid capturing the effects of differences in style and skill among teachers, a problem they would have had if the treatment and control classes were taught by different people. Martha Donaldson, ACCESS's lead program administrator, says, "We were looking to see if it makes a difference whether students are face-to-face with the teacher or if they're receiving instruction in another part of the state via the distance learning equipment." To compare performance between the two groups of students, evaluators gathered a range of data, including grades, scores on Advanced Placement tests, if relevant, and enrollment and dropout data. The design had some added logistical benefits for the evaluators: it was easier to have the control classrooms come from schools participating in ACCESS, rather than having to collect data from people who were unfamiliar with the program.

Despite these benefits, the strategy of using IVC sending sites as control groups did have a few drawbacks. For instance, the evaluators were not able to match treatment and control groups on characteristics that might be important, such as student- and school-level factors. It is possible that students in the receiving sites attended schools with fewer resources, for example, and the comparison had no way to control for that. For these reasons, the ACCESS evaluators ultimately chose not to repeat the comparison between IVC sending and receiving sites the following year. They did, however, suggest that such a comparison could be strengthened by collecting data that gives some indication of students' pretreatment ability level (GPA, for example) and using statistical techniques to control for differences. Another strategy, they propose, might be to limit the study to a subject area such as math or foreign language, where courses follow a sequence and students in the same course (whether online or traditional) would have roughly similar levels of ability.
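To make the idea of controlling for pretreatment ability concrete, the sketch below adjusts a raw difference in outcomes for one covariate, using the pooled within-group slope (a single-covariate analysis of covariance). It is an assumed technique chosen for illustration, not the method the ACCESS evaluators proposed in detail. The function name `adjusted_difference` and all GPA and grade values are hypothetical.

```python
from statistics import mean


def adjusted_difference(x_t, y_t, x_c, y_c):
    """Return (raw, adjusted) treatment-minus-control outcome differences.

    x_* holds the covariate (e.g., prior GPA) and y_* the outcome.
    The adjustment uses the pooled within-group slope of y on x.
    """
    mx_t, my_t = mean(x_t), mean(y_t)
    mx_c, my_c = mean(x_c), mean(y_c)
    sxy = sum((x - mx_t) * (y - my_t) for x, y in zip(x_t, y_t))
    sxy += sum((x - mx_c) * (y - my_c) for x, y in zip(x_c, y_c))
    sxx = sum((x - mx_t) ** 2 for x in x_t) + sum((x - mx_c) ** 2 for x in x_c)
    slope = sxy / sxx
    raw = my_t - my_c
    # Remove the part of the gap explained by the covariate imbalance.
    return raw, raw - slope * (mx_t - mx_c)


# Hypothetical data: receiving-site students start with lower GPAs.
receiving_gpa, receiving_grade = [1.0, 2.0, 3.0], [55, 65, 75]
sending_gpa, sending_grade = [2.0, 3.0, 4.0], [60, 70, 80]

raw, adjusted = adjusted_difference(
    receiving_gpa, receiving_grade, sending_gpa, sending_grade
)
print(f"raw difference: {raw:+.1f}, GPA-adjusted difference: {adjusted:+.1f}")
```

In this invented data set the receiving site looks worse on raw grades but better once prior GPA is taken into account, which is the kind of reversal that unadjusted comparisons can hide.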

Common Problems When Comparing Online Programs to Face-to-Face Programs

Evaluators need to consider both student- and school- or classroom-level variables when they compare online and face-to-face programs. At the student level, many online program participants enroll because of a particular circumstance or attribute, and thus they cannot be randomly assigned; a student who takes an online course over the summer to catch up on credits is one example. The inherent selection bias makes it problematic to compare the results of online and face-to-face students. Evaluators' best response is to find control groups that are matched as closely as possible to the treatment groups, including on student demographic characteristics, reason for taking the course (e.g., credit recovery), and achievement level. Classroom- and school-level factors complicate comparisons as well. If the online program is more prevalent in certain types of schools (e.g., rural schools) or classrooms (e.g., those lacking a fully certified teacher), then the comparison can unintentionally capture the effects of these differences. Evaluators need to understand and account for these factors when selecting control groups.
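The matching described above can be sketched as a simple one-to-one procedure: for each online student, pick an unused comparison student with the same reason for enrolling and the closest prior achievement score. This is a minimal sketch under stated assumptions; the field names (`id`, `reason`, `prior_score`) and all student records are hypothetical, and a real study would usually match on more characteristics.

```python
def match_controls(treated, pool):
    """Greedy 1:1 matching without replacement.

    Each treated student is paired with the unused control student who
    has the same enrollment reason and the closest prior score.
    Students with no eligible control are paired with None.
    """
    available = list(pool)
    pairs = []
    for student in treated:
        candidates = [c for c in available if c["reason"] == student["reason"]]
        if not candidates:
            pairs.append((student["id"], None))
            continue
        best = min(
            candidates,
            key=lambda c: abs(c["prior_score"] - student["prior_score"]),
        )
        available.remove(best)  # each control is used at most once
        pairs.append((student["id"], best["id"]))
    return pairs


# Hypothetical records, invented for illustration only.
online_students = [
    {"id": "T1", "reason": "credit_recovery", "prior_score": 62},
    {"id": "T2", "reason": "acceleration", "prior_score": 88},
    {"id": "T3", "reason": "credit_recovery", "prior_score": 55},
]
comparison_pool = [
    {"id": "C1", "reason": "credit_recovery", "prior_score": 60},
    {"id": "C2", "reason": "credit_recovery", "prior_score": 70},
    {"id": "C3", "reason": "acceleration", "prior_score": 91},
    {"id": "C4", "reason": "acceleration", "prior_score": 75},
]

matches = match_controls(online_students, comparison_pool)
print(matches)
```

Greedy matching depends on the order in which students are processed, so it is best read as a starting point. Note how T3 ends up with a fairly distant match (C2) because the closer one was already taken, which mirrors the problem of small pools of control classrooms.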

Anticipate the Challenges of Conducting a Randomized Controlled Trial

Evaluators in Maryland had a different challenge on their hands when they were engaged to study Thinkport and its wide-ranging education offerings, including lesson plans, student activities, podcasts,* video clips, blogs,** learning games, and information about how all of these things can be used effectively in classrooms. When asked by a key funder to evaluate the program's impact on student learning, the evaluation team chose to study one of Thinkport's most popular features, its collection of "electronic field trips," each one a self-contained curricular unit that includes rich multimedia content (delivered online) and accompanying teacher support materials that assist with standards alignment and lesson planning.

In collaboration with its evaluation partner, Macro International, the program's parent organization, Maryland Public Television, set out to study how the Pathways to Freedom electronic field trip impacted student learning in the classroom and whether it added value. Rather than conducting a quasi-experimental study in which the evaluator would have to find control groups that demographically matched existing groups that were receiving a treatment, the Thinkport evaluators wanted an experimental design study in which students were assigned randomly to either treatment or control groups.

Although they knew it would require some extra planning and coordination, the evaluators chose to conduct a randomized controlled trial. This design could provide them with the strongest and most reliable measure of the program's effects. But first, there were challenges to overcome. If students in the treatment and control groups were in the same classroom, evaluators thought, they might share information about the field trip and "contaminate" the experiment. Even having treatment and control groups in the same school could cause problems: In addition to the possibility of contamination, the evaluators were concerned that teachers and students in control classrooms would feel cheated by not having access to the field trip and would complain to administrators.

To overcome these challenges and maintain the rigor of the experimental design, program leaders decided to randomize at the school level. They recruited nine schools in two districts and involved all eighth-grade social studies teachers in each school, a total of 23 teachers. The evaluators then matched the schools based on student demographics, teacher data, and student scores on the state assessment. (One small school was coupled with another that matched it demographically, and the two schools were counted as one.) The evaluators then randomly identified one school in each pair as a treatment school and one as a control school. Teachers did not know until training day whether they had been selected as a treatment or control. (The control group teachers were told that they would be given an orientation and would be able to use the electronic field trip after the study.)
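School-level randomization within matched pairs, as described above, can be sketched as follows. The school names are hypothetical placeholders and the pairs are assumed to have been formed already; the sketch only shows the random assignment step, not how the Thinkport evaluators actually carried it out.

```python
import random


def assign_within_pairs(pairs, seed=None):
    """Randomly label one unit in each matched pair 'treatment'
    and the other 'control'."""
    rng = random.Random(seed)
    assignment = {}
    for first, second in pairs:
        treatment, control = rng.sample([first, second], 2)
        assignment[treatment] = "treatment"
        assignment[control] = "control"
    return assignment


# Hypothetical matched pairs. Nine schools form eight units because one
# small school is combined with a demographically similar school.
matched_pairs = [
    ("School A", "School B"),
    ("School C", "School D"),
    ("School E", "School F"),
    ("School G", "School H + School I"),
]

assignment = assign_within_pairs(matched_pairs, seed=2008)
for school, group in sorted(assignment.items()):
    print(f"{school}: {group}")
```

Passing a seed makes the draw reproducible for auditing. Because assignment happens at the school level, every teacher and student in a school shares the same condition, which addresses the contamination concern raised earlier.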

A second challenge for the evaluation team was ensuring that teachers in the control classrooms covered the same content as the teachers who were using the electronic field trip—a problem that might also be found in quasi-experimental designs that require matched comparison groups. They were concerned because the electronic field trip devotes six class periods to the topic of slavery and the Underground Railroad—perhaps more time than is typical in a regular classroom. To ensure that students in both groups would spend a similar amount of time on the Underground Railroad unit and have varied resources to use, the evaluators provided additional curricular materials to the control teachers, including books, DVDs, and other supplemental materials. On each of the six days they delivered the unit, all teachers in the study completed forms to identify the standards they were covering, and they also completed a form at the end of the study to provide general information about their lessons. During the course of the unit, the evaluators found that the control teachers began working together to pool their resources and develop lesson plans. The evaluators did not discourage this interaction, believing that it increased the control teachers' ability to deliver the content effectively and, ultimately, added credibility to the study. To compare how well students learned the content of the unit, the evaluators assessed students' knowledge of slavery and the Underground Railroad before and after the instructional unit was delivered.

The evaluators' approach to these challenges was thoughtful and effective. By balancing the experimental design concept with practical considerations, the evaluation team was able to get the information they wanted and successfully complete the study.


The evaluations of Algebra I Online, ACCESS, and Thinkport illustrate a variety of approaches to constructing comparative studies. For program leaders who are considering an evaluation that will compare the performance of online and traditional students, there are several important considerations. First, program leaders should work with an evaluator to ensure that comparisons are appropriate. Together, they will want to take into account the program's goals, the student population served, and the program's structure.

Highlights From the Three Comparative Analyses Featured in This Section

The comparative analyses described in this section produced a number of important findings for program staff and leaders to consider.

Algebra I Online (Louisiana). A quasi-experimental study that compared students' performance on a posttest of algebra content knowledge showed that students who participated in the Algebra I Online course performed at least as well as those who participated in the traditional algebra I course, and on average outscored them on 18 of the 25 test items. In addition, the data suggested that students in the online program tended to do better than control students on those items that required them to create an algebraic expression from a real-world example. A majority of students in both groups reported having a good or satisfactory learning experience in their algebra course, but online students were more likely to report not having a good experience and were less likely to report feeling confident in their algebra skills. Online students reported spending more time interacting with other students about the math content of the course or working together on course activities than their peers in traditional algebra classrooms; the amount of time they spent socializing, interacting to understand assignment directions, and working together on in-class assignments or homework was about the same. The evaluation also suggested that teacher teams that used small group work and had frequent communication with each other were the most successful.

ACCESS (Alabama). Overall, the evaluations of the program found that ACCESS was succeeding in expanding access to a range of courses and was generally well received by users. The comparative analyses suggested that students taking Advanced Placement (AP) courses from a distance were almost as likely to receive a passing course grade as those students who received instruction in person, and showed that both students and faculty found the educational experience in the distance courses was equal to or better than that of traditional, face-to-face courses. The evaluation also found that in the fall semester of 2006, the distance course dropout rate was significantly lower than nationally reported averages.

Thinkport (Maryland). The randomized controlled trial initially revealed that, compared to traditional instruction, the online field trip did not have a significant positive or negative impact on student learning. However, further analysis revealed that teachers using the electronic field trip for the first time actually had less impact on student learning than those using traditional instruction, while teachers who had used the electronic field trip before had a significantly more positive impact. In a second phase of the study, the evaluators confirmed the importance of experience with the product: when teachers in one district used the electronic field trip again, they were much more successful on their second try, and their students were found to have learned far more than students receiving traditional instruction.

Second, program leaders should clearly articulate the purpose of the comparison. Is the evaluation seeking to find out if the online program is just as effective as the traditional one, or more effective? In some instances, when online programs are being used to expand access to courses or teachers, for example, a finding of "no significant difference" between online and traditional formats can be acceptable. In these cases, being clear about the purpose of the evaluation ahead of time will help manage stakeholders' expectations.

If considering a quasi-experimental design, evaluators will want to plan carefully for what classes will be used as control groups, and assess the ways they are different from the treatment classes. They will want to consider what kinds of students the class serves, whether the students (or teachers) chose to participate in the treatment or control class, whether students are taking the treatment and control classes for similar reasons (e.g., credit recovery), and whether the students in the treatment and control classes began the class at a similar achievement level. If a randomized controlled trial is desired, evaluators will need to consider how feasible it is for the particular program. Is it possible to randomly assign students either to receive the treatment or be in the control group? Can the control group students receive the treatment at a future date? Will control and treatment students be in the same classroom or school, and if so, might this cause "contamination" of data?

Finally, there are a host of practical considerations if an evaluation will require collecting data from control groups. As we describe further in the next section, program leaders and evaluators need to work together to communicate the importance of the study to anyone who will collect data from control group participants, and to provide appropriate incentives to both the data collectors and the participants. The importance of these tasks can hardly be overstated: The success of a comparative study hinges on having adequate sets of data to compare.

* Podcasts are audio files that are distributed via the Internet, which can be played back on computers to augment classroom lessons.

** Blogs are regularly updated Web sites that usually provide ongoing information on a particular topic or serve as personal diaries, and allow readers to leave their own comments.

Last Modified: 10/20/2009