








An Educators' Guide to Schoolwide Reform
APPENDIX A
CRITERIA USED TO EVALUATE EVIDENCE OF POSITIVE EFFECTS ON STUDENT ACHIEVEMENT
The review of the research on the approaches' effects on students was
conducted in two stages. First, AIR collected all available studies that reported student
achievement effects (e.g., test scores, grades, dropout rates, graduation rates) and
critically reviewed them for methodological rigor. At this stage of the process, the
research was rated based on important distinctions among studies, such as scope (e.g.,
number of students and schools, period over which data were collected), quality and
objectivity of the measurement instruments, and affiliation of the researcher (i.e., did
the researcher have a vested interest in the outcomes).
Second, AIR assigned to each approach an overall rating for evidence of
effects on student achievement, based on the number of studies that met the criteria for
methodological rigor and the strength of the data showing positive student achievement
effects reported in those studies. Because important methodological issues were addressed
in the first stage, AIR did not consider the methodological rigor of individual studies in
the second stage.
Reviewing Studies for Methodological Rigor
Review Process
AIR made an extensive effort to gather and review all relevant material
about each approach. Altogether, AIR reviewed over 130 student achievement studies, as
well as numerous items that were not, ultimately, deemed appropriate for inclusion.1
In a few instances, developers recommended additional studies late in the review process,
after they read the first draft of the profiles of their approaches. AIR attempted to
acquire and review these materials. Fewer than five studies, identified very late in the
process (i.e., the final two or three days of the project), were not included in the
review.
All studies that reported student achievement effects were reviewed and
rated using an instrument developed for this type of research review and tailored to this
project, the Evaluation of Research on Educational Approaches (EREA).2
One of seven trained researchers reviewed each study individually using
the EREA. The training process involved the researchers independently rating a sample
study and collectively discussing the rationale behind their ratings. In areas where
discrepancies occurred, standard processes were developed, recorded, and distributed to
all researchers.
Each researcher reviewed all of the studies for several reform
approaches. Each researcher also reviewed a sample of studies being reviewed by fellow
researchers. One out of every five studies was reviewed by two or more staff. The
overlapping studies were used to maintain inter-rater reliability. The project director
compared ratings for these overlapping studies and, in cases of discrepancies, retrained
the researchers and clarified the issues for all raters. For example, early reviews
revealed different approaches to rating areas where information from the developers or
studies reviewed was unclear. AIR developed a standard process, retrained researchers, and
revisited ratings for all studies previously reviewed.
Review Criteria
Guided by the criteria in the EREA, we assigned each study a rating
based on its overall methodology. The EREA contains a total of seven sections. The
questions used to calculate the methodology rating are found in two sections
("levels") of the EREA, Level 3 and Level 4. The other five sections capture
information on implementation (Levels 1 and 2), or are used by the reviewer to summarize
the findings from Level 3 and Level 4 (Levels 5, 6 and 7).3
Across Levels 3 and 4, each study was rated in 10 categories: 1)
construct validity; 2) the higher of two ratings, internal validity or study design;
3) duration; 4) sample bias; 5) external validity; 6) statistical conclusions; 7)
measures; 8) sample description; and 9) and 10) study clarity (rated at both Levels 3 and 4).
Each of the 10 categories carried equal weight, to a maximum of four points. The 10
individual ratings then were averaged to form a final methodology rating for the study;
studies with an average rating of 3.00 or above met the criteria for rigor.
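The averaging rule described above can be sketched in Python. This is an illustrative sketch only, not AIR's actual scoring tool: the function name `methodology_rating`, the dictionary keys, and the sample scores are hypothetical.

```python
RIGOR_THRESHOLD = 3.0  # studies averaging 3.00 or above met the criteria for rigor


def methodology_rating(scores):
    """Average the 10 category ratings, each on a four-point scale.

    `scores` is a hypothetical dict of category name -> rating. Internal
    validity (Level 3) and study design (Level 4) count as one category:
    only the higher of the two ratings is kept.
    """
    scores = dict(scores)  # copy so the caller's dict is not modified
    design = max(scores.pop("internal_validity"), scores.pop("study_design"))
    ratings = list(scores.values()) + [design]
    if len(ratings) != 10:
        raise ValueError("expected 10 categories after combining the design ratings")
    return sum(ratings) / len(ratings)


# Made-up scores for one study (not taken from the review).
example = {
    "construct_validity": 4, "internal_validity": 3, "study_design": 2,
    "duration": 4, "sample_bias": 1, "external_validity": 3,
    "statistical_conclusions": 3, "measures": 3.2, "sample_description": 4,
    "clarity_level_3": 3.5, "clarity_level_4": 3.0,
}
rating = methodology_rating(example)
meets_rigor = rating >= RIGOR_THRESHOLD
```

In this made-up example the study averages about 3.17 and meets the threshold despite earning only one point for sample bias, which illustrates how the summative rule lets strengths in some categories offset a weakness in another.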
This approach was deliberately chosen to accommodate varied study
designs by focusing on the overall methodology rather than a single critical element of
methodology. For example, a highly quantitative analysis of test scores might provide a
limited description of the sample studied, but compensate by including a very large number
of subjects and using random assignment to a control group. A longitudinal case study
might use a small sample, but compensate by collecting data over a long period of time and
providing rich descriptions of the treatment and sample. The questions used to rate
studies are described below.
Level 3: Does the study satisfy minimal validity criteria?
AIR assessed the degree to which each study satisfied minimal validity standards related
to the following six categories:
Construct validity: Did the study focus on the construct
(e.g., mathematics achievement) germane to the analysis? Did the study include measurable
dependent variables? Did the dependent variable measure the construct under analysis? Did
the study report the effects of the approach? If the answers to all of these questions were
"yes," the study earned the highest rating, four points. If one answer was
"not clear," the study earned three points. If there were multiple "not
clear" responses, the study earned two points. If any of the questions were answered
"no," the study earned one point. Because AIR sought to review studies that
reported measurable student achievement outcomes, very few studies did not earn four
points in this area.
Internal validity: What research methodology was used to
assess the approach (e.g., true experimental group design, case study, quantitative
synthesis)? Was it cross-sectional or longitudinal? If the methodology was a true or
quasi-experimental group design or quantitative synthesis, the study earned four points.
If the methodology was any other design, but the study was longitudinal (i.e., at least
three years of data), the study earned three points. If the methodology was
multiple-baseline or narrative synthesis and the study was cross-sectional, the study
earned two points. Any other design earned one point.
Duration: What was the duration for data collection? If data
were collected over at least three years, the study earned four points. If the duration
was between one and three years, the study earned three points. If the duration was
between six months and one year, the study earned two points. If the duration was less,
the study earned one point.
Sample bias: Were students kept in the study regardless of
low performance? Were students' results reported in the findings regardless of low
performance? Was the attrition rate below 20 percent? Were both experimental and control
sample selections a priori rather than post hoc? If the answers to all of these questions were
positive, the study earned four points. If there were one or two "not clear"
responses, the study earned three points. If there were three "not clear"
responses, the study earned two points. If any of the answers was "no," the
study earned one point. In general, studies tended to keep students in the study and
findings, regardless of performance, but many studies suffered from high attrition rates
(especially longitudinal studies) or post hoc sample selection.
External validity: How many students were in each condition?
How many classes? How many schools? If the study involved at least 50 students, at least
five classes, and at least five schools per condition, it earned four points. There was
some flexibility on any one of these points. Fewer students, classes, and schools resulted
in fewer points on external validity.
Statistical conclusions: Did the study provide sufficient
quantitative information to permit calculation of statistical effects? Were appropriate
statistical tests used to analyze data? If the answer to both questions was
"yes," the study earned four points. If one was "not clear," the study
earned three points. If both were "not clear," the study earned two points. If
either answer was "no," the study earned one point. Many studies provided some
quantitative information but not enough to calculate effect sizes. For example, some
studies provided means but not standard deviations, or percentiles but not number of
participants.
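Three of the Level 3 categories above (construct validity, sample bias, and statistical conclusions) share the same yes / "not clear" / no pattern and differ only in how many "not clear" responses still earn three points. A minimal Python sketch of that pattern follows. The function name `checklist_points` and its parameter are hypothetical, and the sketch assumes that a "no" answer takes precedence over any "not clear" answers and that more "not clear" answers than the text enumerates still earn two points.

```python
def checklist_points(answers, max_unclear_for_three=1):
    """Score a list of "yes" / "not clear" / "no" answers on the 1-4 scale.

    max_unclear_for_three is the largest number of "not clear" answers that
    still earns three points: 1 for construct validity and statistical
    conclusions, 2 for sample bias.
    """
    if "no" in answers:
        return 1  # assumption: any "no" overrides "not clear" answers
    unclear = answers.count("not clear")
    if unclear == 0:
        return 4
    if unclear <= max_unclear_for_three:
        return 3
    return 2


# Made-up answer sets for one study.
construct = checklist_points(["yes", "yes", "not clear", "yes"])
sample_bias = checklist_points(
    ["yes", "not clear", "not clear", "yes"], max_unclear_for_three=2
)
statistical = checklist_points(["not clear", "not clear"])
```

Here `construct` and `sample_bias` both come out at three points and `statistical` at two.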
Level 4: Are differences between groups attributable to the
approach? AIR assessed the degree to which each study satisfied internal validity
standards in three areas:
Study design: What type of comparison or controls did the
study use? Of the 10 experimental designs described in this section, the designs
warranting four points were: randomly assigned subjects, stratified sampling, randomly
assigned intact groups, and stratified randomly assigned intact groups. Designs that
earned three points were: a priori match on demographic and achievement characteristics,
group comparability at pretest on critical measures, a priori match on demographic
characteristics, or statistical adjustment for small a priori differences. Studies using
pre-post designs, including case studies, earned two points; other designs involving
controls earned one point. This category, study design, is very similar to the internal
validity category in Level 3. However, the ratings in the study design category tend to
favor quantitative studies, while the ratings in the internal validity category tend to
favor longitudinal case studies. The final methodology rating was calculated with only the
higher of the two ratings (internal validity or study design) in order to give
strong quantitative and strong qualitative studies similar weight.
Measures: Were measures adequately described or commonly
recognized? Were they reliable (r > .75)? Did they assess skills taught in both
experimental and control conditions? Was more than one measure of outcome used? Were some
measures developed by someone other than the experimenter? Were data collected and
analyzed by researchers other than the approach developer? Was adequate information
available to assess degree of implementation? Did the study provide information on
materials, roles, participants, and length of intervention? Were differences between
conditions limited to the approach? To calculate the rating for this category, total
earned points were divided by total possible points (generally excluding questions that
are marked "not clear" unless there is a substantive reason to include them),
and this ratio was multiplied by four to create a four-point scale.
Sample description: Did the study indicate that the approach
was implemented in settings representative of actual instructional conditions? Were other
instructional differences between groups (e.g., age, ethnicity, setting) described and
adequately controlled? To calculate the rating for this category, total earned points were
divided by total possible points (excluding questions that are marked "not
clear"), and this ratio was multiplied by four to create a four-point scale. Very few
studies earned below four points on this category for two reasons: 1) very few studies
were not set in representative conditions and 2) very few studies identified differences
between comparison groups and then failed to control for those differences. Many studies did not
identify differences, and so received "not clear" ratings, for which they were
not penalized in the sample description rating. However, this lack of information was
captured in the study clarity rating described below.
Study clarity. In addition to the substantive areas
listed above, studies were rated on the clarity of information in Level 3 and Level 4. The
intent here was to identify studies that systematically provided inadequate data to
understand the methodology or replicate the study. Within each category listed above
(e.g., sample bias, statistical conclusions), studies were penalized minimally for one or
two "not clear" responses; however, the study clarity rating targeted studies
with patterns of frequent "not clear" responses. Within Level 3 and Level 4, the
proportion of "clear" responses was standardized to a four-point scale to
calculate a clarity rating, for a total of two study clarity ratings.
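The difference between the ratio-based ratings (measures, sample description) and the study clarity rating can be sketched in Python as follows. This is a sketch under stated assumptions rather than the EREA's actual arithmetic: it assumes every question is worth one point, that "yes" is the favorable answer, and that a "clear" response is any answer other than "not clear". The function names are hypothetical.

```python
def ratio_rating(responses):
    """Earned points / possible points * 4, ignoring "not clear" responses."""
    scored = [r for r in responses if r != "not clear"]
    if not scored:
        return None  # the source does not say how an all-"not clear" case was handled
    return 4 * scored.count("yes") / len(scored)


def clarity_rating(responses):
    """Proportion of clear responses, rescaled to a four-point scale."""
    if not responses:
        return None
    clear = sum(1 for r in responses if r != "not clear")
    return 4 * clear / len(responses)


# Made-up responses: three "yes", one "no", four "not clear".
responses = ["yes", "yes", "yes", "no"] + ["not clear"] * 4
substantive = ratio_rating(responses)  # "not clear" answers are excluded
clarity = clarity_rating(responses)    # "not clear" answers lower the rating
```

With these made-up responses the substantive rating is 3.0 while the clarity rating is 2.0, so a study that omits a great deal of information is not penalized twice within a category but is still marked down once, in the clarity rating.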
Assigning Ratings for Evidence of Effects on Student Achievement
Review Process
Next, AIR summarized the strength of the research for each
approach, with an emphasis on findings from studies with a methodological rating of
3.0 or above, using the rating criteria presented below. Because there was limited
research on the effects of the approaches, a difference of one study could be quite
meaningful. For example, an approach with a marginal research rating might have had one study; a
single additional study would have doubled the information available.4
AIR researchers used Levels 5, 6, and 7 of the EREA to summarize
findings from each study (Level 5), make conclusions about the methodological strength of
the study (Level 6), and rate the evidence of effects of the approach overall (Level 7).
Level 5: Is the approach effective as determined by
scientifically valid research methods? AIR reported all statistical information
that could be used to calculate effects (e.g., number and percentile, effect size) in
Level 5. Descriptive information about the measures (e.g., measure name, statistical tests
used) also was recorded in Level 5. For studies that met methodological standards (rating
of 3.0 or above), we used the information reported in Level 5 to complete the findings
tables in Appendix C.
Level 6: What is the quality of the research base underlying
the approach? AIR researchers summarized methodology ratings within and across
studies in Level 6, and entered the final methodology rating (i.e., an average of the 10
ratings in Levels 3 and 4). Descriptive information about the study (e.g., publication
information, names of schools and districts in the study) also was recorded in Level 6.
Only studies that earned a methodology rating of 3.0 or above were considered sufficiently
rigorous to report their findings. AIR used the information summarized in Level 6 to
complete the research tables in Appendix B.
Level 7: What is the overall efficacy of the approach?
This level synthesized information from the most rigorous studies (those earning a
research rating of 3.0 or above) in terms of reported effects of the approaches.
Researchers rated the research base as a whole, using information on the number of studies
that met the minimum criteria and the findings of these studies.
Rating Criteria
The rating criteria draw on multiple sources, including Stringfield
(1998), National Center to Improve the Tools of Educators (1998), and the U.S. Department
of Education (1998). The rating criteria were reviewed by the project's scientific
advisors as well as other experts in educational evaluation. The final rating criteria
reflect their comments and suggestions.
Strong evidence of positive effects on student achievement
1) At least four studies (or two studies and one research review/meta-analysis) that use
a rigorous methodology and show positive effects on student achievement.
2) At least three of these studies that show statistically or educationally significant
positive effects on students (i.e., effect size of at least .25, statistically significant
at the p<.01 level, or gains greater than 10 percentiles).
3) No more than 20 percent of studies that use a rigorous methodology show negative or no
effects5 on students.
4) To ensure that there is enough information to replicate any particular approach, at
least one study must be available that provides information on implementation of the
approach (high methodology rating not required).
Promising evidence of positive effects on student achievement
1) At least three studies (or two studies and one research review/meta-analysis) that use
a rigorous methodology and show positive effects OR a combination of one such study and at
least six longitudinal (i.e., three years or longer) case studies (rigorous methodology not
required) that show positive effects.
2) At least one of these studies that shows statistically or educationally significant
positive effects (i.e., effect size of at least .25, statistically significant at the
p<.01 level, or gains greater than 10 percentiles).
3) No more than 30 percent of studies that use rigorous methodologies OR are longitudinal
case studies show negative or no effects on students.
4) At least one study provides information on implementation of the approach (high
methodology rating not required).
Marginal evidence of positive effects on student achievement
1) At least one study that uses a rigorous methodology OR four longitudinal case studies
(high methodology rating not required).
2) No more than 50 percent of studies that use rigorous methodology OR are longitudinal
case studies show negative or no effects on students.
Evidence of mixed, weak, or no effects on student achievement
1) At least one study that uses a rigorous methodology OR two longitudinal case studies
(high methodology rating not required) that show inconsistent, mostly negative, or no
effects on students.
No research on effects on student achievement (marked "?")
1) Insufficient data on student outcomes: no studies use rigorous methodology AND there
are fewer than two longitudinal case studies.
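To show how the rating criteria above combine, the following Python sketch walks through them from the strongest rating downward. It is a simplified illustration, not the procedure AIR applied: the `Study` fields and the function name `evidence_rating` are hypothetical, significance is counted only among rigorous studies, and any combination the criteria do not explicitly cover falls through to the "mixed, weak, or no effects" category, which is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Study:
    rigorous: bool = False           # methodology rating of 3.0 or above
    longitudinal_case: bool = False  # case study covering three years or longer
    positive: bool = False           # shows positive effects
    significant: bool = False        # statistically or educationally significant
    negative_or_none: bool = False   # shows negative or no effects
    is_review: bool = False          # research review or meta-analysis
    implementation: bool = False     # provides implementation information


def evidence_rating(studies):
    rigorous = [s for s in studies if s.rigorous]
    cases = [s for s in studies if s.longitudinal_case]
    if not rigorous and len(cases) < 2:
        return "no research"

    pos = [s for s in rigorous if s.positive]
    pos_studies = sum(1 for s in pos if not s.is_review)
    pos_reviews = sum(1 for s in pos if s.is_review)
    pos_cases = sum(1 for s in cases if s.positive)
    significant = sum(1 for s in pos if s.significant)
    has_impl = any(s.implementation for s in studies)
    pool = [s for s in studies if s.rigorous or s.longitudinal_case]
    two_plus_review = pos_studies >= 2 and pos_reviews >= 1

    def negative_share(group):
        return sum(1 for s in group if s.negative_or_none) / len(group) if group else 0.0

    if ((pos_studies >= 4 or two_plus_review) and significant >= 3
            and negative_share(rigorous) <= 0.20 and has_impl):
        return "strong"
    if ((pos_studies >= 3 or two_plus_review or (len(pos) >= 1 and pos_cases >= 6))
            and significant >= 1 and negative_share(pool) <= 0.30 and has_impl):
        return "promising"
    if (bool(rigorous) or len(cases) >= 4) and negative_share(pool) <= 0.50:
        return "marginal"
    return "mixed, weak, or no effects"


# Made-up research bases.
one_positive = evidence_rating([Study(rigorous=True, positive=True)])
one_negative = evidence_rating([Study(rigorous=True, negative_or_none=True)])
```

In this sketch a single rigorous study with positive effects yields "marginal", while a single rigorous study with negative effects yields "mixed, weak, or no effects".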
The criteria evaluate two dimensions of the evidence of positive
effects on student achievement: size of the research base, and strength of the findings.
The highest-rated approaches must have multiple studies that meet the EREA criteria for
rigorous research (methodology rating of 3.0 or above), the vast majority of the research
must show positive effects, and a majority of the findings must be statistically or
educationally significant.
Overall, the ratings flow from more to less research and from stronger
to weaker positive findings. The fourth rating, "evidence of mixed, weak, or no effects
on student achievement," is an exception. This rating is used for approaches that may
have the number of studies required for a "marginal" rating, but the studies show
inconclusive or negative effects on students. Thus, an approach with one rigorous study
that shows positive effects would be rated "marginal," while an approach with at
least one rigorous study that shows ambivalent or negative effects would be rated "mixed,
weak, or no effects," and an approach with no rigorous studies at all would be rated
"no research."
Several conditions in these rating criteria have been tailored for this
guide. First, the criteria incorporate both quantitative and qualitative research. Since
the guide reports on measurable improvements in student achievement, quantitative results
are necessary for a high rating. However, the sponsoring organizations and AIR recognized
the need to include a variety of research designs, including well-conducted qualitative
studies. Therefore, in certain cases, the rating criteria permit a large number of
longitudinal case studies to be substituted for a smaller number of studies that use a
rigorous methodology. This was done to compensate for the quantitative bias of the EREA,
as high-quality qualitative studies (e.g., well-conducted longitudinal case studies) are
less likely to meet the criteria of the EREA than high-quality quantitative studies. That
particular case never occurred during AIR's review, in part because the EREA was
successfully adapted to include qualitative research. However, AIR kept the condition in
the rating criteria to emphasize the intent to incorporate a variety of research designs.
Second, the rating criteria make a distinction between positive
findings and significantly positive findings. To earn a "strong" or "promising"
rating, an approach must have studies with positive findings; further, some (but not
all) of these studies must report findings that are educationally or statistically
significant. For example, "strong evidence of positive effects" calls for four or
more studies with positive student achievement findings, of which at least three must have
significantly positive findings. Again, this condition is intended to incorporate findings
from qualitative studies, which are unlikely to report effects in terms of measurable
significance, as well as quantitative studies, which are likely to report significance
levels. Further, some quantitative studies provide ample evidence of strong positive
effects (e.g., a large rise in test scores across the school), but neglect to include some
piece of information (e.g., the exact number of students tested) that would be necessary
to calculate significance levels. However, if studies passed the EREA criteria for
rigorous methodology on other counts, AIR considered them when rating an approach.
Third, the criteria accommodate studies that report mixed effects on
achievement. For example, a study that shows a positive effect on reading and a negative
effect on mathematics test scores would be credited both as a study with positive effects
and a study with negative effects.6 Each study, rather than each outcome, is
considered equally for the rating, so that both positive and negative outcomes are
recognized.
Fourth, the rating criteria consider information on implementation. To
earn the highest ratings, an approach must have research that reports both implementation
and effects. This condition ensures that statements can be made about the interaction
between level of implementation and effects on students. For example, although a
particular developer might not require or provide extensive staff development, it would be
important to know that all schools that made significant student achievement gains, using
the approach, chose to contract for such services. In addition, researchers need
information on the implementation level if they want to accurately replicate the research.
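The third condition above, together with note 5, implies a simple study-level classification, sketched here in Python. The function name `classify_study` and the outcome labels are hypothetical. The one-third threshold for "negative or no effects" comes from note 5; treating a study as positive when it has at least one positive finding is an assumption, since the guide does not state a threshold for that side.

```python
def classify_study(findings):
    """Classify one study from its list of outcome-level findings.

    Each finding is labeled "positive", "negative", or "ambiguous".
    A study can be credited as both positive and negative.
    """
    if not findings:
        raise ValueError("a study needs at least one finding")
    non_positive = sum(1 for f in findings if f in ("negative", "ambiguous"))
    return {
        # assumption: one positive finding is enough to credit the study as positive
        "positive": "positive" in findings,
        # note 5: at least one-third of findings negative or ambiguous;
        # compare 3 * count with the total to avoid fractional thirds
        "negative_or_none": 3 * non_positive >= len(findings),
    }


# The reading/mathematics example from the text: one positive, one negative finding.
mixed = classify_study(["positive", "negative"])
```

For the reading and mathematics example, both flags are true, so the study is counted once on each side of the tally.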
The reported analyses treat all studies equally. In fact, studies
differ on a number of dimensions, including the number of schools studied, the number of
grade levels tested, the number of outcome measures used, and the number of students
included.
To examine the sensitivity of the results to alternative assumptions
about the studies, a supplementary analysis was conducted based on one assumption: that
the number of schools in the study matters. Studies with one to 20 schools were assigned a
weight of one, and studies with more than 20 schools were assigned a weight of two.
Weighting the analyses had minimal impact: it raised the ratings for three borderline
cases and lowered the rating for a fourth borderline case. We retained the original rating
strategy, in which studies were the unit of analysis, for three reasons. First, the number
of studies reflects the potential breadth and variety of the research. Second, the limited
information available on the studies would not support a truly accurate weighting scheme.
Third, the supplementary analysis suggested that weighting had minimal impact.
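The weighting scheme used in the supplementary analysis can be illustrated with a short Python sketch. The weights (one for studies with one to 20 schools, two for larger studies) come from the text; how AIR applied them to the rating criteria is not described, so applying them to the share of studies showing negative or no effects is an assumption made here for illustration. The function names and the sample data are hypothetical.

```python
def school_weight(n_schools):
    """Weight of one for 1-20 schools, two for more than 20 schools."""
    if n_schools < 1:
        raise ValueError("a study must include at least one school")
    return 2 if n_schools > 20 else 1


def weighted_negative_share(studies):
    """Weighted share of studies showing negative or no effects.

    `studies` is a list of (n_schools, shows_negative_or_no_effects) pairs.
    """
    if not studies:
        raise ValueError("no studies to weight")
    total = sum(school_weight(n) for n, _ in studies)
    negative = sum(school_weight(n) for n, is_negative in studies if is_negative)
    return negative / total


# Made-up research base: three studies, one large study with negative findings.
studies = [(3, False), (25, True), (10, False)]
unweighted = sum(1 for _, is_negative in studies if is_negative) / len(studies)
weighted = weighted_negative_share(studies)
```

In this made-up case the share of negative studies rises from one-third unweighted to one-half weighted, the kind of shift that could move a borderline approach across a rating threshold.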
1 For example, some materials originally
classified as studies were, upon closer inspection, promotional materials reporting
anecdotes of successful reform.
2 One of the most critical changes to the
EREA was to change the ratings system from exclusionary to summative. In the original
version, studies were automatically dropped if they did not meet certain criteria. In the
revised version, studies were rated on a number of criteria; the average rating across
criteria was used to determine whether the study should be used or dropped. A second
critical change was in the rating criteria used to summarize research across an approach.
The revised version incorporates the original criteria, those recommended by Stringfield
(1998), and those identified in the guidance to the Comprehensive School Reform
Demonstration Act.
3 AIR reviewed available information on
implementation but did not include it in the methodology rating. There are three reasons
for this decision: two methodological and one an artifact of the available research.
First, most studies did not provide adequate information on implementation, and so ratings
in that area would be suspect. Second, the methodology rating is intended to reflect the
quality of the study, rather than quality of implementation of an approach at a particular
site. We discuss implementation in other sections of the report. (See Appendix E for a
description of implementation data collected and reported.) Third, including both well and
poorly implemented studies in the discussion of effects allowed AIR to look at data and
make statements about the relationship between implementation level and effects of the
approach. If studies had been systematically excluded from analysis because the approaches
were poorly implemented at the sites studied, AIR could not have addressed this question.
4 Because the number of studies matters, a
study that appears as a paper and, in modified format, as a journal article, was reviewed
and reported only once, using the most recent version. Some longitudinal studies involved
multiple reports over the course of data collection. In such cases, we reviewed the most
comprehensive report. If other reports from this study provided unique data or analysis
(e.g., an implementation report in year one and an outcome report in year three), we also
reviewed those reports.
5 For a study to show "negative or no
effects," at least one-third of its findings must be negative or ambiguous.
6 Ratings are not separated by subject area,
as the limited data would not support this level of analysis.