Last May, the seven most recent New York State Teacher of the Year award winners joined the 1993 recipient in sounding the alarm bell over the state’s new plan to stake 40 percent of teachers’ performance evaluations on students’ standardized test scores.
In their letter to Chancellor Merryl Tisch and the state Board of Regents, the teachers said the “politically popular” changes to the state education department’s performance review system “will neither improve schools nor increase student learning; rather, they will cause tangible harm to students and teachers alike.” While strongly supporting the need to “develop rigorous systems to evaluate teachers and support professional growth,” the teachers set out seven distinct scenarios in which teacher evaluations based on student test scores could backfire.
Elaine Weiss coordinates the campaign for a Broader Bolder Approach to Education, which promotes holistic school accountability systems. Photo by Michael Pitch
These teachers are not alone. California, Florida and Tennessee are among an increasing number of states using student test scores — often through so-called “value-added measures,” or VAMs — to assess the classroom performance of teachers. The rise in VAMs is raising both questions about their utility and concerns over their potential to do serious damage.
The 2001 reauthorization of the Elementary and Secondary Education Act, No Child Left Behind, brought with it a host of new measures and mandates designed to improve the academic achievement, in particular, of the most at-risk students. Among these is a requirement that every classroom have a highly qualified teacher by 2014. This prompted a search for ways to evaluate teacher effectiveness, which was intensified by the Obama administration’s Race to the Top initiative. Philanthropists, high-profile school chancellors and business leaders have joined the clamor of voices demanding teacher accountability based on test scores.
But criticism of initial techniques, such as assessing performance based on the percentage of students reaching a “cut score,” led to the development of more sophisticated methodologies, namely the value-added measures currently in vogue. These measures purport to capture student growth over time rather than simply the students’ skill level at a given point, and some versions attempt to control for non-teacher-based contributors to student learning.
“Despite the enthusiasm these models have generated among many policymakers, several technical reviews of VAMs have revealed a number of serious concerns,” noted Henry Braun, who directs the Center for the Study of Testing, Evaluation, and Educational Policy at Boston College’s Lynch School of Education, in his 2005 primer on value-added models, published by the Educational Testing Service. “Indeed, the implementation of such models and the proposed uses of the results raise a host of practical, technical and even philosophical issues.”
Experts across the political spectrum who have evaluated value-added measurements point to a range of technical challenges, many driven by lack of randomization. Ideally, data that are intended to estimate causality (the effect of teacher quality on learning) would be drawn from a randomly assigned sample of students, teachers and classrooms. Schools, of course, do not work this way at all, so test score data are anything but randomly distributed.
Parents devote substantial effort and money to buying homes that feed into particular districts and schools. They also may work to get their child into a particular teacher’s class. Teachers may request assignments to schools and even classrooms they prefer. And principals place students in classrooms based on their ability, disruptiveness and other characteristics. In short, the system creates data that reflect a host of choices and biases that must be, but in reality cannot be, controlled for in order to estimate causality. And because value-added analyses draw on data from a relatively small number of classes, a few disruptive students can seriously skew the results. As every teacher is aware, the odds of having such a class, or more than one, vary from year to year. When test scores reflect other students’ reactions to that behavior, the teacher can be rated “ineffective.” Likewise, if a few students slept particularly badly the night before the test, that would bias the results, as would students’ stress over a violent storm or a shooting involving a classmate.
Another major, related problem is the propensity to mistakenly attribute certain factors that influence student test scores to teachers. These include within-school factors such as other teachers (previous years, other subjects), school conditions (curriculum quality, tutoring supports, class size), policies (pulling students out for special sessions, team teaching, block scheduling) and attendance.
Student test scores also are affected by a range of student-specific, out-of-school factors found by researchers to have a larger impact on test scores than all within-school effects combined. These include home environment; health and nutrition; residential mobility; peers; trips to museums, libraries and parks; and help or lack of help with homework. Erosion of school-year gains over the summer presents a particularly major obstacle to achievement for low-income children. A 2007 study by three researchers at Johns Hopkins University found that two-thirds of the difference between 9th-grade test scores of high- and low-income students can be traced to elementary school summer learning differences.
These technical obstacles translate into two related problems with VAM scores: They are unreliable, and they are biased in favor of teachers of higher-performing students. Both defeat the purpose of consistently identifying good teachers on the basis of teacher, not student, ability.
A Research Snapshot
Scholars from a variety of institutions have voiced concerns about the use of value-added measurements to evaluate teachers. Education experts convened by the Economic Policy Institute documented their lack of reliability in a 2010 report, “Problems with the Use of Student Test Scores to Evaluate Teachers.” It stated: “VAM estimates have proven to be unstable across statistical models, years and classes that teachers teach. One study found that across five large urban districts, among teachers who were ranked in the top 20 percent of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40 percent. Another found that teachers’ effectiveness one year could only predict from 4 percent to 16 percent of the variation in such ratings in the following year.”
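To give a rough sense of what the last figure in that passage implies, the sketch below converts "4 percent to 16 percent of the variation" into a year-to-year correlation (the square root, or 0.2 to 0.4) and simulates how many top-quintile teachers would stay in the top quintile the next year. It is a toy model, not the method of any study cited here: the normally distributed ratings, the function name `top_group_persistence` and its parameters are all assumptions made for illustration.

```python
import math
import random


def top_group_persistence(r_squared, n_teachers=20000, top_share=0.20, seed=0):
    """Share of year-1 top-group teachers who are also in the year-2 top group.

    Hypothetical model: ratings in both years are standard normal, and
    year-1 ratings explain `r_squared` of the variance in year-2 ratings.
    """
    rng = random.Random(seed)
    r = math.sqrt(r_squared)  # correlation implied by the variance explained
    year1, year2 = [], []
    for _ in range(n_teachers):
        a = rng.gauss(0, 1)
        # Year-2 rating: partly carried over from year 1, partly fresh noise.
        b = r * a + math.sqrt(1 - r * r) * rng.gauss(0, 1)
        year1.append(a)
        year2.append(b)
    k = int(n_teachers * top_share)
    top1 = set(sorted(range(n_teachers), key=lambda i: year1[i], reverse=True)[:k])
    top2 = set(sorted(range(n_teachers), key=lambda i: year2[i], reverse=True)[:k])
    return len(top1 & top2) / k


if __name__ == "__main__":
    for r2 in (0.04, 0.16):
        print(f"variance explained {r2:.2f}: "
              f"share staying in top group = {top_group_persistence(r2):.2f}")
```

Under these assumptions, fewer than half of the teachers rated in the top group one year remain there the next, even at the upper end of the 4-to-16-percent range.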
On the matter of bias toward high-achieving students, the report said: “[A]t least one study has examined the same teachers with different populations of students, showing that these same teachers consistently appeared to be more effective when they taught more academically advanced students, fewer English language learners, and fewer low-income students. This finding suggests that VAM cannot control completely for differences in students’ characteristics or starting points.”
As such, RAND Corporation researchers asserted that the research base is insufficient to support using value-added measures for high-stakes decisions about individual teachers or schools.
Additionally, increasing reliability over time requires more years of data, which runs counter to the goal of timely information for schools and districts. One study conservatively estimated an error rate of 36 percent in identifying high-, low- and average-performing teachers based on one year of data. If the data cover three years, the error rate falls to only 26 percent; it would take 10 years of data to reduce the error rate to 12 percent. Even worse, instability is most severe at the points in the test score range with the biggest consequences: at the top, which leads to teacher rewards, and at the bottom, which can result in teachers being placed on probation or even fired.
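The trade-off between reliability and timeliness can be seen in a simple simulation. The sketch below is only an illustration under stated assumptions and does not reproduce the study's 36, 26 and 12 percent figures: each hypothetical teacher has a fixed "true" effect, each year's score adds independent noise (the `noise_sd` value is an arbitrary choice), and teachers are sorted into low, average and high groups using the bottom and top 20 percent as cutoffs.

```python
import random


def label(values, low_share=0.2, high_share=0.2):
    """Label each value 'low', 'average' or 'high' by its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = ["average"] * n
    n_low = int(n * low_share)
    n_high = int(n * high_share)
    for i in order[:n_low]:
        labels[i] = "low"
    for i in order[n - n_high:]:
        labels[i] = "high"
    return labels


def error_rate(years, n_teachers=5000, noise_sd=2.0, seed=0):
    """Share of teachers whose group based on averaged scores is wrong.

    Assumed model: true effect ~ N(0, 1); each yearly score is the true
    effect plus N(0, noise_sd) noise; the rating averages `years` scores.
    """
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    observed = [
        t + sum(rng.gauss(0, noise_sd) for _ in range(years)) / years
        for t in true_effect
    ]
    true_labels = label(true_effect)
    observed_labels = label(observed)
    wrong = sum(a != b for a, b in zip(true_labels, observed_labels))
    return wrong / n_teachers


if __name__ == "__main__":
    for years in (1, 3, 10):
        print(f"{years:>2} year(s) of data: error rate = {error_rate(years):.2f}")
```

In this toy setup the error rate falls as more years are averaged, but it does not reach zero even with 10 years of data, which is the same pattern the study describes.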
Aside from the technical difficulties, as the New York Teacher of the Year winners emphasized in their letter, using value-added measurements to evaluate teacher effectiveness has a host of negative consequences for students, teachers and schools.
First, a narrow focus on basic reading and math scores inevitably steers instruction toward those subjects. Important topics that make for well-rounded students and thoughtful citizens, such as science, history and the arts, and important skills, such as writing, research and complex problem solving, are neglected or even omitted. Students in schools at highest risk of being sanctioned lose out the most. Drilling poor and minority children on the basics is an increasingly common practice.
Second, VAM leads to problems with test-score validity and to artificially inflated scores. It becomes hard to determine whether students have gained knowledge or skills in the subject or merely in test taking. In addition, attaching high stakes to tests creates strong incentives to game the system and, in the extreme, to cheat. In high-poverty schools, teachers logically focus on the most stable students at the expense of mobile children, who most need support but whose scores may not count in evaluations. Teachers also have little incentive to focus on students who are far behind or on promising students with little room for test-score gains. Again, students lose.
While it might have been possible a year ago to contend that altering test scores was an aberration and the work of a few problem teachers, public investigations into widespread cheating in the District of Columbia and Atlanta and more recently in dozens of schools across Pennsylvania suggest systemic incentives are at work. Indeed, Duke University economist Dan Ariely responded to the slew of recent cheating revelations with a Washington Post commentary that asks “Want to stop teachers from cheating?” He asserts that reducing “something as broad as education” to a single test score, rather than teacher bonuses, is the main problem and suggests that broader measures of teacher effectiveness would help a lot.
Ironically, value-added measures actually hinder schools’ ability to secure strong teachers where they are needed most. Wrongheaded assumptions about individual teacher impact that are embedded in VAM discourage the very collaboration among teachers that has been shown to dramatically improve instruction and learning. Pressure to raise test scores of low-income students, combined with a bias against teachers who serve such children, limits the ability of schools in high-poverty areas to recruit and retain strong teachers. As the New York teachers wrote, “Under [the proposed test score-heavy evaluation system], what motivation will teachers have to take on the most challenging students?”
Perhaps most troubling, given the already high rate of turnover, teachers’ understanding of how poorly value-added scores reflect their whole effort is a major blow to morale. Indeed, as the Economic Policy Institute study noted, “[R]ecent survey data reveal that accountability pressures are associated with higher attrition and reduced morale, especially among teachers in high-need schools.” Less critical issues include lack of funds to pay teacher bonuses associated with excellence — some big foundations currently foot those bills — and the inability to use growth models for new teachers.
These many problems notwithstanding, value-added measures represent a clear improvement over the use of raw test-score data. Moreover, little doubt remains that states and districts will increase their use for a number of purposes. That said, given the serious technical and practical concerns that they present, it is critical that:
the individuals and groups tasked with designing and implementing the tests and interpreting the data are trained experts who take into account weaknesses and assumptions;
test scores not be used as the sole, primary or “trigger” measure of a teacher’s effectiveness;
schools and states employ a range of other tools, such as broader measures of student well-being, teacher observations and peer reviews, and student and parent surveys, to evaluate teachers; and
tests serve their key purpose of informing teachers’ understanding of student knowledge and needs.
It is also important to note that good alternatives exist. Montgomery County, Md., one of the country’s highest-ranked school districts, rejected test-based evaluations in favor of a “Teacher Professional Growth System” that takes a qualitative approach. The system’s six core elements are clear performance standards with descriptive examples of observable teaching behaviors; evaluator and teacher training that fosters skill development in analysis and critique; a multiyear professional growth cycle grounded in collegial interaction; formal evaluations that combine narrative assessments with qualitative feedback; a peer assistance and review program to support novice and struggling teachers; and collaborative, data-based professional development that is built into the school year.
Similar peer assistance and review systems are also in place in Toledo and Cincinnati, Ohio; Syracuse and Rochester, N.Y.; Minneapolis, Minn.; and San Juan, Calif. While more expensive than test-based systems, well-implemented peer review processes generate high payoffs, according to a recent study by the Harvard Graduate School of Education. Not only do they better improve teaching and foster trust and collaboration, but in those rare instances in which support fails to produce improvement, they also avert due-process complaints that can distract from more important educational matters.
Elaine Weiss is national coordinator of the Broader Bolder Approach to Education at the Economic Policy Institute in Washington, D.C. E-mail: firstname.lastname@example.org. Contributing to this article were two co-chairs of the BBA campaign: Tom Payzant, a professor at the Harvard University Graduate School of Education, and Helen Ladd, a professor of public policy at Duke University.