An Educators' Guide to Schoolwide Reform

APPENDIX A
CRITERIA USED TO EVALUATE EVIDENCE OF POSITIVE EFFECTS ON STUDENT ACHIEVEMENT


The review of the research on the approaches' effects on students was conducted in two stages. First, AIR collected all available studies that reported student achievement effects (e.g., test scores, grades, dropout rates, graduation rates) and critically reviewed them for methodological rigor. At this stage of the process, the research was rated based on important distinctions among studies, such as scope (e.g., number of students and schools, period over which data were collected), quality and objectivity of the measurement instruments, and affiliation of the researcher (i.e., did the researcher have a vested interest in the outcomes).

Second, AIR assigned to each approach an overall rating for evidence of effects on student achievement, based on the number of studies that met the criteria for methodological rigor and the strength of the data showing positive student achievement effects reported in those studies. Because important methodological issues were addressed in the first stage, AIR did not consider the methodological rigor of individual studies in the second stage.

Reviewing Studies for Methodological Rigor

Review Process

AIR made an extensive effort to gather and review all relevant material about each approach. Altogether, AIR reviewed over 130 student achievement studies, as well as numerous items that were not, ultimately, deemed appropriate for inclusion.1 In a few instances, developers recommended additional studies late in the review process, after they read the first draft of the profiles of their approaches. AIR attempted to acquire and review these materials. Fewer than five studies, identified very late in the process (i.e., the final two or three days of the project), were not included in the review.

All studies that reported student achievement effects were reviewed and rated using an instrument developed for this type of research review and tailored to this project, the Evaluation of Research on Educational Approaches (EREA).2

One of seven trained researchers reviewed each study individually using the EREA. The training process involved the researchers independently rating a sample study and collectively discussing the rationale behind their ratings. In areas where discrepancies occurred, standard processes were developed, recorded, and distributed to all researchers.

Each researcher reviewed all of the studies for several reform approaches. Each researcher also reviewed a sample of studies being reviewed by fellow researchers. One out of every five studies was reviewed by two or more staff. The overlapping studies were used to maintain inter-rater reliability. The project director compared ratings for these overlapping studies and, in cases of discrepancies, retrained the researchers and clarified the issues for all raters. For example, early reviews revealed different approaches to rating areas where information from the developers or studies reviewed was unclear. AIR developed a standard process, retrained researchers, and revisited ratings for all studies previously reviewed.

Review Criteria

Guided by the criteria in the EREA, we assigned each study a rating based on its overall methodology. The EREA contains a total of seven sections. The questions used to calculate the methodology rating are found in two sections ("levels") of the EREA, Level 3 and Level 4. The other five sections capture information on implementation (Levels 1 and 2), or are used by the reviewer to summarize the findings from Level 3 and Level 4 (Levels 5, 6 and 7).3

Across Levels 3 and 4, each study was rated in 10 categories: 1) construct validity; 2) the higher of two ratings (internal validity or study design); 3) duration; 4) sample bias; 5) external validity; 6) statistical conclusions; 7) measures; 8) sample description; and 9) and 10) study clarity (rated separately at Level 3 and at Level 4). Each of the 10 categories carried equal weight, to a maximum of four points. The 10 individual ratings then were averaged to form a final methodology rating for the study; studies with an average rating of 3.00 or above met the criteria for rigor.
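The averaging step can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the dictionary keys, the function names, and the example ratings are hypothetical and are not taken from the EREA instrument itself.

```python
from statistics import mean

# Studies averaging 3.00 or above met the criteria for rigor.
RIGOR_THRESHOLD = 3.0

# The nine categories that are always counted (hypothetical key names).
ALWAYS_COUNTED = (
    "construct_validity", "duration", "sample_bias", "external_validity",
    "statistical_conclusions", "measures", "sample_description",
    "clarity_level3", "clarity_level4",
)

def methodology_rating(ratings):
    """Average 10 equally weighted ratings on a four-point scale.

    Only the higher of internal validity (Level 3) and study design
    (Level 4) is counted, so the pair contributes a single rating.
    """
    design = max(ratings["internal_validity"], ratings["study_design"])
    return mean([design] + [ratings[key] for key in ALWAYS_COUNTED])

def meets_rigor(ratings):
    return methodology_rating(ratings) >= RIGOR_THRESHOLD

# Invented ratings for a longitudinal case study.
example = {
    "construct_validity": 4, "internal_validity": 3, "study_design": 2,
    "duration": 4, "sample_bias": 3, "external_validity": 2,
    "statistical_conclusions": 3, "measures": 3.0, "sample_description": 4,
    "clarity_level3": 3.5, "clarity_level4": 3.0,
}
print(methodology_rating(example), meets_rigor(example))
```

In this invented example the study design rating of two is ignored because the internal validity rating of three is higher, and the study averages 3.25.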

This approach was deliberately chosen to accommodate varied study designs by focusing on the overall methodology rather than a single critical element of methodology. For example, a highly quantitative analysis of test scores might provide a limited description of the sample studied, but compensate by including a very large number of subjects and using random assignment to a control group. A longitudinal case study might use a small sample, but compensate by collecting data over a long period of time and providing rich descriptions of the treatment and sample. The questions used to rate studies are described below.

Level 3: Does the study satisfy minimal validity criteria? AIR assessed the degree to which each study satisfied minimal validity standards related to the following six categories:

  • Construct validity: Did the study focus on the construct (e.g., mathematics achievement) germane to the analysis? Did the study include measurable dependent variables? Did the dependent variable measure the construct under analysis? Did the study report the effects of the approach? If the answers to these questions were "yes," the study earned the highest rating, four points. If one answer was "not clear," the study earned three points. If there were multiple "not clear" responses, the study earned two points. If any of the questions were answered "no," the study earned one point. Because AIR sought to review studies that reported measurable student achievement outcomes, very few studies did not earn four points in this area.

  • Internal validity: What research methodology was used to assess the approach (e.g., true experimental group design, case study, quantitative synthesis)? Was it cross-sectional or longitudinal? If the methodology was a true or quasi-experimental group design or quantitative synthesis, the study earned four points. If the methodology was any other design, but the study was longitudinal (i.e., at least three years of data), the study earned three points. If the methodology was multiple-baseline or narrative synthesis and the study was cross-sectional, the study earned two points. Any other design earned one point.

  • Duration: What was the duration for data collection? If data were collected over at least three years, the study earned four points. If the duration was between one and three years, the study earned three points. If the duration was between six months and one year, the study earned two points. If the duration was less, the study earned one point.

  • Sample bias: Were students kept in the study regardless of low performance? Were students' results reported in the findings regardless of low performance? Was the attrition rate below 20 percent? Were both experimental and control sample selections a priori rather than post hoc? If the answer to these questions was positive, the study earned four points. If there were one or two "not clear" responses, the study earned three points. If there were three "not clear" responses, the study earned two points. If any of the answers was "no," the study earned one point. In general, studies tended to keep students in the study and findings, regardless of performance, but many studies suffered from high attrition rates (especially longitudinal studies) or post hoc sample selection.

  • External validity: How many students were in each condition? How many classes? How many schools? If the study involved at least 50 students, at least five classes, and at least five schools per condition, it earned four points. There was some flexibility on any one of these points. Fewer students, classes, and schools resulted in fewer points on external validity.

  • Statistical conclusions: Did the study provide sufficient quantitative information to permit calculation of statistical effects? Were appropriate statistical tests used to analyze data? If the answer to both questions was "yes," the study earned four points. If one was "not clear," the study earned three points. If both were "not clear," the study earned two points. If either answer was "no," the study earned one point. Many studies provided some quantitative information but not enough to calculate effect sizes. For example, some studies provided means but not standard deviations, or percentiles but not number of participants.
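Several of the Level 3 categories above share the same yes / no / "not clear" pattern, and duration is a simple threshold rule. A minimal sketch follows. The function names and the parameter max_unclear_for_three are hypothetical. Two points are assumptions because the text does not settle them: a study lasting exactly one year is given three points, and any number of "not clear" answers beyond the three-point limit is given two points.

```python
def rate_answers(answers, max_unclear_for_three=1):
    """Rate a category from its 'yes' / 'no' / 'not clear' answers.

    Any 'no' gives one point; all 'yes' gives four points. Otherwise the
    count of 'not clear' answers decides between three and two points.
    Construct validity and statistical conclusions allow one 'not clear'
    for three points; sample bias allows up to two.
    """
    if "no" in answers:
        return 1
    unclear = answers.count("not clear")
    if unclear == 0:
        return 4
    # Assumption: every count above the limit earns two points.
    return 3 if unclear <= max_unclear_for_three else 2

def rate_duration(years):
    """Rate the data-collection period, given in years."""
    if years >= 3:
        return 4
    if years >= 1:      # assumption: exactly one year earns three points
        return 3
    if years >= 0.5:
        return 2
    return 1

# Sample bias with two unclear answers still earns three points.
print(rate_answers(["yes", "not clear", "not clear", "yes"], max_unclear_for_three=2))
print(rate_duration(0.75))
```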

Level 4: Are differences between groups attributable to the approach? AIR assessed the degree to which each study satisfied internal validity standards in three areas:

  • Study design: What type of comparison or controls did the study use? Of the 10 experimental designs described in this section, the designs warranting four points were: randomly assigned subjects, stratified sampling, randomly assigned intact groups, and stratified randomly assigned intact groups. Designs that earned three points were: a priori match on demographic and achievement characteristics, group comparability at pretest on critical measures, a priori match on demographic characteristics, or statistical adjustment for small a priori differences. Studies using pre-post designs, including case studies, earned two points; other designs involving controls earned one point. This category, study design, is very similar to the internal validity category in Level 3. However, the ratings in the study design category tend to favor quantitative studies, while the ratings in the internal validity category tend to favor longitudinal case studies. The final methodology rating was calculated with only the higher of the two ratings—internal validity or study design—in order to give strong quantitative and strong qualitative studies similar weight.

  • Measures: Were measures adequately described or commonly recognized? Were they reliable (r > .75)? Did they assess skills taught in both experimental and control conditions? Was more than one measure of outcome used? Were some measures developed by someone other than the experimenter? Were data collected and analyzed by researchers other than the approach developer? Was adequate information available to assess degree of implementation? Did the study provide information on materials, roles, participants, and length of intervention? Were differences between conditions limited to the approach? To calculate the rating for this category, total earned points were divided by total possible points (generally excluding questions that are marked "not clear" unless there is a substantive reason to include them), and this ratio was multiplied by four to create a four-point scale.

  • Sample description: Did the study indicate that the approach was implemented in settings representative of actual instructional conditions? Were other instructional differences between groups (e.g., age, ethnicity, setting) described and adequately controlled? To calculate the rating for this category, total earned points were divided by total possible points (excluding questions that were marked "not clear"), and this ratio was multiplied by four to create a four-point scale. Very few studies earned fewer than four points in this category, for two reasons: 1) very few studies were not set in representative conditions, and 2) very few studies identified differences between comparison groups and then failed to control for those differences. Many studies did not identify differences, and so received "not clear" ratings, for which they were not penalized in the sample description rating. However, this lack of information was captured in the study clarity rating described below.

Study clarity. In addition to the substantive areas listed above, studies were rated on the clarity of information in Level 3 and Level 4. The intent here was to identify studies that systematically provided inadequate data to understand the methodology or replicate the study. Within each category listed above (e.g., sample bias, statistical conclusions), studies were penalized minimally for one or two "not clear" responses; however, the study clarity rating targeted studies with patterns of frequent "not clear" responses. Within Level 3 and Level 4, the proportion of "clear" responses was standardized to a four-point scale to calculate a clarity rating, for a total of two study clarity ratings.
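The two ratio-based calculations (the Level 4 category ratings and the study clarity ratings) can be illustrated as follows. This sketch assumes that every question is worth one point and that a "yes" earns it; the source does not state the point values, and the function names are hypothetical.

```python
def ratio_rating(answers):
    """Earned points / possible points, scaled to four.

    'not clear' answers are left out of both the numerator and the
    denominator, so they do not lower the category rating.
    """
    scored = [a for a in answers if a != "not clear"]
    if not scored:
        return None  # assumption: the source does not cover this case
    return 4 * scored.count("yes") / len(scored)

def clarity_rating(answers):
    """Proportion of answers that were not 'not clear', scaled to four."""
    if not answers:
        raise ValueError("at least one answer is required")
    clear = sum(a != "not clear" for a in answers)
    return 4 * clear / len(answers)

answers = ["yes", "yes", "no", "not clear"]
print(ratio_rating(answers))    # 'not clear' excluded: 4 * 2 / 3
print(clarity_rating(answers))  # three of four answers were clear: 3.0
```

The same "not clear" answer is ignored by the first function and counted against the study by the second, which is how missing information goes unpenalized in a category rating yet is still captured by study clarity.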

Assigning Ratings for Evidence of Effects on Student Achievement

Review Process

Next, AIR summarized the strength of the research for each approach, with an emphasis on findings from studies with a methodological rating of 3.0 or above, using the rating criteria presented below. Because there was limited research on the effects of the approaches, a difference of one study could be quite meaningful. For example, an approach with a marginal research base might have had only one study; a single additional study would have doubled the information available.4

AIR researchers used Levels 5, 6, and 7 of the EREA to summarize findings from each study (Level 5), make conclusions about the methodological strength of the study (Level 6), and rate the evidence of effects of the approach overall (Level 7).

Level 5: Is the approach effective as determined by scientifically valid research methods? AIR reported all statistical information that could be used to calculate effects (e.g., number and percentile, effect size) in Level 5. Descriptive information about the measures (e.g., measure name, statistical tests used) also was recorded in Level 5. For studies that met methodological standards (rating of 3.0 or above), we used the information reported in Level 5 to complete the findings tables in Appendix C.

Level 6: What is the quality of the research base underlying the approach? AIR researchers summarized methodology ratings within and across studies in Level 6, and entered the final methodology rating (i.e., an average of the 10 ratings in Levels 3 and 4). Descriptive information about the study (e.g., publication information, names of schools and districts in the study) also was recorded in Level 6. Only studies that earned a methodology rating of 3.0 or above were considered sufficiently rigorous to report their findings. AIR used the information summarized in Level 6 to complete the research tables in Appendix B.

Level 7: What is the overall efficacy of the approach? This level synthesized information from the most rigorous studies (those earning a research rating of 3.0 or above) in terms of reported effects of the approaches. Researchers rated the research base as a whole, using information on the number of studies that met the minimum criteria and the findings of these studies.

Rating Criteria

The rating criteria draw on multiple sources, including Stringfield (1998), National Center to Improve the Tools of Educators (1998), and the U.S. Department of Education (1998). The rating criteria were reviewed by the project's scientific advisors as well as other experts in educational evaluation. The final rating criteria reflect their comments and suggestions.

[full symbol] = Strong evidence of positive effects on student achievement

At least four studies (or two studies and one research review/meta-analysis) that use a rigorous methodology and show positive effects on student achievement.

At least three of these studies that show statistically or educationally significant positive effects on students (i.e., effect size of at least .25, statistically significant at the p<.01 level, or gains greater than 10 percentiles).

No more than 20 percent of studies that use a rigorous methodology show negative or no effects5 on students.

To ensure that there is enough information to replicate any particular approach, at least one study must be available that provides information on implementation of the approach (high methodology rating not required).

[half symbol] = Promising evidence of positive effects on student achievement

At least three studies (or two studies and one research review/meta-analysis) that use a rigorous methodology and show positive effects OR a combination of one such study and at least six longitudinal (i.e., three years or longer) case studies (rigorous methodology not required) that show positive effects.

At least one of these studies that shows statistically or educationally significant positive effects (i.e., effect size of at least .25, statistically significant at the p<.01 level, or gains greater than 10 percentiles).

No more than 30 percent of studies that use rigorous methodologies OR are longitudinal case studies show negative or no effects on students.

At least one study provides information on implementation of the approach (high methodology rating not required).

[quarter symbol] = Marginal evidence of positive effects on student achievement

At least one study that uses a rigorous methodology OR four longitudinal case studies (high methodology rating not required).

No more than 50 percent of studies that use rigorous methodology OR are longitudinal case studies show negative or no effects on students.

[empty symbol] = Evidence of mixed, weak, or no effects on student achievement

At least one study that uses a rigorous methodology OR two longitudinal case studies (high methodology rating not required) that show inconsistent, mostly negative, or no effects on students.

? = No research on effects on student achievement

Insufficient data on student outcomes: no studies use rigorous methodology AND there are fewer than two longitudinal case studies.

The criteria evaluate two dimensions of the evidence of positive effects on student achievement: size of the research base, and strength of the findings. The highest-rated approaches must have multiple studies that meet the EREA criteria for rigorous research (methodology rating of 3.0 or above), the vast majority of the research must show positive effects, and a majority of the findings must be statistically or educationally significant.

Overall, the ratings flow from more to less research and from stronger to weaker positive findings. The fourth rating, evidence of mixed, weak, or no effects on student achievement, is an exception. This rating is used for approaches that may have the number of studies required for a marginal rating, but the studies show inconclusive or negative effects on students. Thus, an approach with one rigorous study that shows positive effects would be rated marginal, while an approach with at least one rigorous study that shows ambivalent or negative effects would be rated mixed, weak, or no effects, and an approach with no rigorous studies at all would be rated no research.
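The way the ratings flow from one level to the next can be sketched as a cascade of checks. The sketch below is a simplification under stated assumptions, not AIR's procedure: it considers only studies that met the rigor criteria, it omits the alternatives involving research reviews or longitudinal case studies, and it reads the marginal rating as requiring at least one rigorous study with positive effects. The function and parameter names are hypothetical.

```python
def evidence_rating(n_rigorous, n_positive, n_significant, n_negative,
                    has_implementation_study):
    """Simplified overall rating based on rigorous studies only.

    n_rigorous:    studies with a methodology rating of 3.0 or above
    n_positive:    of those, studies showing positive effects
    n_significant: of those, studies with significant positive effects
    n_negative:    of those, studies showing negative or no effects
    """
    if n_rigorous == 0:
        return "no research"

    def negative_share_at_most(percent):
        # Integer comparison avoids floating-point rounding at the cutoff.
        return 100 * n_negative <= percent * n_rigorous

    if (n_positive >= 4 and n_significant >= 3
            and negative_share_at_most(20) and has_implementation_study):
        return "strong"
    if (n_positive >= 3 and n_significant >= 1
            and negative_share_at_most(30) and has_implementation_study):
        return "promising"
    if n_positive >= 1 and negative_share_at_most(50):
        return "marginal"
    return "mixed, weak, or no effects"

print(evidence_rating(5, 5, 3, 1, True))    # strong
print(evidence_rating(1, 1, 0, 0, False))   # marginal
print(evidence_rating(2, 0, 0, 2, True))    # mixed, weak, or no effects
```

Because a study with mixed results can count as both positive and negative, n_positive and n_negative are passed separately rather than derived from one another.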

Several conditions in these rating criteria have been tailored for this guide. First, the criteria incorporate both quantitative and qualitative research. Since the guide reports on measurable improvements in student achievement, quantitative results are necessary for a high rating. However, the sponsoring organizations and AIR recognized the need to include a variety of research designs, including well-conducted qualitative studies. Therefore, in certain cases, the rating criteria permit a large number of longitudinal case studies to be substituted for a smaller number of studies that use a rigorous methodology. This was done to compensate for the quantitative bias of the EREA, as high-quality qualitative studies (e.g., well-conducted longitudinal case studies) are less likely to meet the criteria of the EREA than high-quality quantitative studies. That particular case never occurred during AIR's review, in part because the EREA was successfully adapted to include qualitative research. However, AIR kept the condition in the rating criteria to emphasize the intent to incorporate a variety of research designs.

Second, the rating criteria make a distinction between positive findings and significantly positive findings. To earn a strong or promising rating, an approach must have studies with positive findings; further, some, but not all, of these studies must report findings that are educationally or statistically significant. For example, strong evidence of positive effects calls for four or more studies with positive student achievement findings, of which at least three must have significantly positive findings. Again, this condition is intended to incorporate findings from qualitative studies, which are unlikely to report effects in terms of measurable significance, as well as quantitative studies, which are likely to report significance levels. Further, some quantitative studies provide ample evidence of strong positive effects (e.g., a large rise in test scores across the school), but neglect to include some piece of information (e.g., the exact number of students tested) that would be necessary to calculate significance levels. However, if studies passed the EREA criteria for rigorous methodology on other counts, AIR considered them when rating an approach.

Third, the criteria accommodate studies that report mixed effects on achievement. For example, a study that shows a positive effect on reading and a negative effect on mathematics test scores would be credited both as a study with positive effects and a study with negative effects.6 Each study, rather than each outcome, is considered equally for the rating, so that both positive and negative outcomes are recognized.
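How a single study can be credited in both directions can be shown with a small sketch. The one-third cutoff for "negative or no effects" comes from the guide's footnote on that term. The rule used here for crediting a study as positive (any positive finding) is an assumption, since the source gives no threshold, and the function name and labels are hypothetical.

```python
def study_credits(findings):
    """Credit one study as positive, as negative/no effects, or both.

    findings: list of 'positive', 'negative', or 'ambiguous' labels,
    one per reported outcome.
    """
    total = len(findings)
    not_positive = sum(f in ("negative", "ambiguous") for f in findings)
    return {
        # Assumption: one positive finding is enough for positive credit.
        "positive": "positive" in findings,
        # At least one-third of findings are negative or ambiguous.
        "negative_or_none": total > 0 and 3 * not_positive >= total,
    }

# Positive effect on reading, negative effect on mathematics.
print(study_credits(["positive", "negative"]))
```

The reading and mathematics example above is credited both ways, while a study with three positive findings and one ambiguous finding would be credited only as positive.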

Fourth, the rating criteria consider information on implementation. To earn the highest ratings, an approach must have research that reports both implementation and effects. This condition ensures that statements can be made about the interaction between level of implementation and effects on students. For example, although a particular developer might not require or provide extensive staff development, it would be important to know that all schools that made significant student achievement gains, using the approach, chose to contract for such services. In addition, researchers need information on the implementation level if they want to accurately replicate the research.

The reported analyses treat all studies equally. In fact, studies differ on a number of dimensions, including the number of schools studied, the number of grade levels tested, the number of outcome measures used, and the number of students included.

To examine the sensitivity of the results to alternative assumptions about the studies, a supplementary analysis was conducted based on one assumption: that the number of schools in a study matters. Studies with one to 20 schools were assigned a weight of one, and studies with more than 20 schools were assigned a weight of two. Weighting the analyses had minimal impact: it raised the ratings for three borderline cases and lowered the rating for a fourth borderline case. We retained the original rating strategy, in which studies were the unit of analysis, for three reasons. First, the number of studies reflects the potential breadth and variety of the research. Second, the limited information available on the studies would not support a truly accurate weighting scheme. Third, the supplementary analysis suggested that weighting had minimal impact.
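The weighting rule can be illustrated with a short sketch. The source gives the weights but does not describe how they entered the rating criteria; applying them to the share of studies showing negative or no effects is an assumption made here for illustration, and the function names and example data are invented.

```python
def school_weight(n_schools):
    """Weight of one for 1 to 20 schools, two for more than 20."""
    if n_schools < 1:
        raise ValueError("a study must include at least one school")
    return 2 if n_schools > 20 else 1

def negative_share(studies, weighted):
    """Percent of (optionally weighted) studies with negative or no effects.

    studies: list of (n_schools, shows_negative_or_no_effects) pairs.
    """
    def weight(n_schools):
        return school_weight(n_schools) if weighted else 1

    total = sum(weight(n) for n, _ in studies)
    negative = sum(weight(n) for n, is_negative in studies if is_negative)
    return 100 * negative / total

# Invented example: three studies, one of them large.
studies = [(5, False), (25, False), (12, True)]
print(negative_share(studies, weighted=False))  # one of three studies
print(negative_share(studies, weighted=True))   # one of four weighted units
```

In this invented case the unweighted share is about 33 percent and the weighted share is 25 percent, which shows how weighting could move a borderline approach across a cutoff such as 30 percent.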


1 For example, some materials originally classified as studies were, upon closer inspection, promotional materials reporting anecdotes of successful reform.

2 One of the most critical changes to the EREA was to change the ratings system from exclusionary to summative. In the original version, studies were automatically dropped if they did not meet certain criteria. In the revised version, studies were rated on a number of criteria; the average rating across criteria was used to determine whether the study should be used or dropped. A second critical change was in the rating criteria used to summarize research across an approach. The revised version incorporates the original criteria, those recommended by Stringfield (1998), and those identified in the guidance to the Comprehensive School Reform Demonstration Act.

3 AIR reviewed available information on implementation but did not include it in the methodology rating. There are three reasons for this decision—two methodological and one an artifact of the available research. First, most studies did not provide adequate information on implementation, and so ratings in that area would be suspect. Second, the methodology rating is intended to reflect the quality of the study, rather than quality of implementation of an approach at a particular site. We discuss implementation in other sections of the report. (See Appendix E for a description of implementation data collected and reported.) Third, including both well and poorly implemented studies in the discussion of effects allowed AIR to look at data and make statements about the relationship between implementation level and effects of the approach. If studies had been systematically excluded from analysis because the approaches were poorly implemented at the sites studied, AIR could not have addressed this question.

4 Because the number of studies matters, a study that appears as a paper and, in modified format, as a journal article, was reviewed and reported only once, using the most recent version. Some longitudinal studies involved multiple reports over the course of data collection. In such cases, we reviewed the most comprehensive report. If other reports from this study provided unique data or analysis (e.g., an implementation report in year one and an outcome report in year three), we also reviewed those reports.

5 For a study to show "negative or no effects," at least one-third of its findings must be negative or ambiguous.

6 Ratings are not separated by subject area, as the limited data would not support this level of analysis.