Feature

Rewards & Supports

Fair and accurate teacher evaluations are at the heart of pay-for-performance systems in schools

By THEODORE HERSHBERG AND CLAIRE ROBERTSON-KRAFT

Pay-for-performance systems in public schools have long been burdened with controversy. But we now are at a watershed moment, with a Democratic president challenging the educational status quo.

For states and school districts to secure monies from the $4.3 billion Race to the Top Fund, they must “use data effectively to reward effective teachers, to support teachers who are struggling, and when necessary, to replace teachers who aren’t up to the job.”

Ted Hershberg is director of Operation Public Education in Philadelphia. Claire Robertson-Kraft is a doctoral student in education policy at the University of Pennsylvania.


Critics of performance pay systems contend that because teachers’ impact cannot be measured without error, it is impossible to create fair and accurate systems for evaluating and rewarding performance. By this standard, however, current practice fails on both counts.

Paying teachers based on longevity and academic credentials ignores compelling research cited in the 2004 report “Increasing the Odds,” from the National Council on Teacher Quality, showing that teachers improve only during the first three to five years of their careers and that master’s and doctoral degrees in education and graduate-level courses beyond the master’s level have no bearing on student learning.

Moreover, K-12 personnel evaluation systems have long failed “to distinguish great teaching from good, good from fair, and fair from poor,” according to “The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness,” a report by the New Teacher Project. Despite overwhelming evidence that quality instruction is the most important factor in improving student achievement, a teacher’s effectiveness “is not measured, recorded or used to inform decision making in any meaningful way,” the report adds.

New proposals for evaluation and compensation systems may not be perfect, but they will be vastly less imperfect than what school districts have been using for decades.

A Balanced Process
The new administration has helped to reinvigorate conversations around merit pay, and an increasing number of states and school districts are developing new performance pay initiatives. Past efforts to implement pay-for-performance systems failed for good reasons: they either used achievement scores, which are deeply biased by family income, or relied entirely on principal evaluations, which were perceived as subjective.

Overcoming these shortcomings will require metrics that produce results teachers trust and sufficient training to help them interpret and apply data to improve their instruction. The recommendations below are drawn from the experience of existing programs and the Operation Public Education framework discussed in A Grand Bargain for Education Reform: New Rewards and Supports for New Accountability, edited by the two of us.

To begin, evaluation systems should take a balanced approach, using multiple sources of data to gauge teacher effectiveness. We recommend systems use both outputs (empirical data from value-added assessment) and inputs (observational data from sophisticated performance frameworks).

Incorporating Outputs
Aware that family background is the primary factor explaining levels of student achievement, many educators believe no fair way exists to evaluate teachers based on student learning results. But if the measure is growth — the progress students make over the course of the year — then the effectiveness of instruction can make an important empirical contribution to teacher evaluation.

Used by a growing number of states and districts, value-added assessment is a new way to measure teaching and learning. Based on a review of students’ test-score gains from previous grades, researchers can predict the amount of growth students are likely to make in a given year.

The results of value-added assessment can be used to create three levels of teacher, school and district effectiveness: (1) highly effective: classrooms, schools or districts in which students exceed expected growth; (2) effective: those where students on average achieve the expected growth in a year; and (3) ineffective: those where students' results fall below what would have been expected.
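The three-level classification can be illustrated with a minimal sketch. It assumes each student already has an expected-growth figure from some prediction model, and it uses a hypothetical `tolerance` band to decide what counts as meeting expected growth. The function name, the band and the score units are illustrative assumptions, not part of any particular value-added model, which would typically rely on statistical confidence intervals instead.

```python
from statistics import mean


def classify_effectiveness(actual_growth, expected_growth, tolerance=1.0):
    """Label a classroom by comparing average actual growth with expected growth.

    actual_growth, expected_growth: per-student growth figures, in the same order.
    tolerance: hypothetical band (in test-score points) treated as "expected growth".
    """
    if not actual_growth or len(actual_growth) != len(expected_growth):
        raise ValueError("need matching, non-empty lists of growth figures")

    # Average gap between what students gained and what was predicted for them.
    gap = mean(a - e for a, e in zip(actual_growth, expected_growth))

    if gap > tolerance:
        return "highly effective"
    if gap < -tolerance:
        return "ineffective"
    return "effective"


# Hypothetical classroom of three students, each expected to gain 10 points.
label = classify_effectiveness([12, 15, 11], [10, 10, 10])
```

The same comparison could be run at the school or district level by pooling the students of all classrooms involved.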

The growing national consensus for internationally benchmarked standards and more rigorous assessments is needed and welcome, but research makes clear that robust value-added models can be used with confidence, even with existing tests, to identify the most and least effective performers.

Value-added assessment has sometimes been confused with simple growth, where a student's score on last year's test is subtracted from this year's score and the difference is attributed to the teacher as the "value" that was "added." These simple models are an improvement on using absolute achievement to gauge a teacher's impact on student learning, but they incorrectly assume all of a student's growth can be attributed to the teacher and will inevitably lead to erratic and specious results.

To increase the accuracy of the measure, policymakers should employ rigorous value-added models that:

•  Use multiple years of data. Accurate conclusions about teacher effectiveness cannot be drawn from a single test. Using multiple years of data can help ensure the estimate of a teacher’s impact on student learning is as precise and stable as possible. 

•  Include students with incomplete records. Because of absences or mobility, many students have incomplete records. Because these students are not missing at random, they cannot be dropped from the analyses. Sophisticated value-added models use all of the available data for all of the teachers’ students in the estimate to maximize the accuracy of the results.

•  Account for the contributions of various teachers. The model needs to account for the fact that multiple teachers may instruct the same student in a given year. Even though we do not yet know the impact of weighting the estimates in this fashion, the political value of including this information is clear. Simply put, educators will be more likely to embrace the new methodology if they feel they are being held accountable only for those students for whom they were responsible.

•  Use assessments meeting specific criteria. Policymakers should ensure the assessments used as the basis for value-added analysis are closely aligned with the standards and curricula, have appropriate stretch at the ends of the distribution and are available in fresh, nonredundant and equivalent forms.
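Two of the criteria above, pooling multiple years of data and crediting each teacher only for the instruction he or she provided, can be illustrated with a short sketch. It is a simplified weighted average, not the method of any actual value-added model. The function name, the `(residual, share)` record layout and the sample figures are all hypothetical.

```python
def teacher_value_added(records):
    """Pool growth residuals across years, weighting each by the teacher's share.

    records: iterable of (residual, share) pairs, where
      residual = a student's actual growth minus expected growth (hypothetical units)
      share    = fraction of that student's instruction this teacher provided (0 to 1)
    """
    records = list(records)
    total_share = sum(share for _, share in records)
    if total_share <= 0:
        raise ValueError("at least one record must have a positive share")

    # Students taught only part of the time count proportionally less.
    return sum(residual * share for residual, share in records) / total_share


# Hypothetical records pooled from two years: one student taught full time,
# two students shared equally with another teacher.
pooled = [(2.0, 1.0), (4.0, 0.5), (-1.0, 0.5)]
estimate = teacher_value_added(pooled)  # (2.0 + 2.0 - 0.5) / 2.0 = 1.75
```

Students with incomplete records can still contribute under this layout, because each available residual enters the average with whatever share applies rather than being dropped.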

Sophisticated Inputs
Effective teaching is complex, so no educator should ever be evaluated solely on the basis of a single measure, not even one as robust as value-added assessment. Unfortunately, most school districts’ observation practices reveal little information about the quality of instruction.

A growing number of school districts have begun to rely on more comprehensive observation approaches, such as Charlotte Danielson’s Framework for Teaching. In these approaches, observation is carried out using sophisticated protocols instead of subjective judgments, and rubrics are used to differentiate among levels of performance.

To ensure accuracy, policymakers should create evaluation systems that:

•  Define effective teaching. An observation system should begin with a clear definition of effective instructional practice. These standards should explicitly define teacher competencies and describe what teachers should know and be able to do at various stages in their careers. Levels of performance should be established so the evaluation system accurately differentiates teachers based on their effectiveness.

•  Develop a standards-driven, reliable process of evaluation. For systems to be trusted, evaluation processes must yield accurate information. This means the instruments used to make judgments about teacher effectiveness should meet certain tests of validity.

•  Provide sufficient training and engage stakeholders. Evaluators, either administrators or teachers in a peer-review process, must be provided with sufficient training to ensure their judgments reveal consistent information about teacher performance. Trainings should build evaluators’ skills in identifying evidence of effective teaching and providing constructive feedback. All relevant stakeholders — teachers, principals, parents — should be engaged in the process to ensure its sustainability.

Fairness Issues
Even if the evaluation measures are accurate, for a performance pay system to be motivating, teachers must view the criteria as fair and believe that, through hard work, they can improve their own effectiveness. Thus, policymakers should develop evaluation systems that:

•  Include all educators. If we provide teachers in tested subjects and grades with an opportunity to earn additional pay as part of a new evaluation and compensation system, fairness suggests we should provide other teachers and specialists with the same opportunity. When systems have excluded teachers outside of tested subjects, studies have found increased resistance to change.

According to the Center for Educator Compensation Reform, the performance pay program in the Houston Independent School District had to be modified in response to the emotional outcry over differential pay by teachers outside of tested subjects. Defenders of the single salary schedule often hold to the position that if you can’t treat all educators the same way, then a compensation system inevitably will be perceived as unfair.

This argument confuses equity with equality, fairness with sameness. Policymakers can devise systems in which educators are treated fairly but not necessarily in identical fashion.

•  Offer additional support. Evaluation systems should serve a dual purpose: quality assurance and professional growth. Too often, professional development is treated as separate from the teacher evaluation and compensation system rather than as a reinforcement of it, with all resources viewed as investments in developing teacher capacity.

Policymakers should design evaluation systems that maximize human potential by providing educators the additional support they need to improve instructional practice. Because value-added assessment is unfamiliar to most educators, these systems also should provide relevant stakeholders with sufficient training on how value-added assessment will be used to evaluate and reward educators. 

•  Provide appropriate safeguards. Mechanisms should be put in place to ensure consequential career decisions are made with appropriate human safeguards. The Operation Public Education framework recommends adapting for use with all teachers a system of peer assistance and review, pioneered in Toledo and used for many years with new and untenured teachers in Columbus, Ohio.

Peer Assistance and Review, or PAR, provides teachers with professional development and an evaluation system that identifies, remediates and, if necessary, dismisses those who show little aptitude for the classroom. Peer review makes it statistically more likely the evaluator is familiar with the subject matter being taught. By giving the observation responsibility to teachers, it makes clear the evaluation is designed not as a “gotcha” system but as one intended to help teachers improve their instructional effectiveness.

Indispensable Means
Pay for performance plays a central role in the Race to the Top guidelines because it aligns new system goals with rewards. It was OK to pay educators based on longevity and academic credentials when the goal of schooling was to provide most students with rudimentary skills to succeed in an industrial economy. However, today, when the nation’s goal is to educate all students to new and demanding standards, we must link performance with rewards, both positive and negative.

While some teachers may work harder because of new incentives, it is understandable why many educators feel insulted by what they see as an implied notion they have been withholding expertise awaiting higher pay. The genuinely significant point, deserving far greater recognition, is that pay for performance also addresses critical long-term needs.

We no longer recruit the best and the brightest of our college graduates into the teaching profession. The lessening of gender discrimination has made it possible for women of talent to enter occupations they consider more rewarding. In 1972, just under a third of all professional women were teachers; by 2004, only a seventh were. Attracting more talented graduates and retaining more of our highly effective teachers will require more than a system of bonuses.

Fair and accurate evaluations have the potential to inform an entirely new compensation system. The opportunity is to create career ladders that give educators the incentive to improve instruction throughout their time in the classroom and that let successful teachers win greater prestige and secure higher pay more quickly.

The Denver experience with ProComp makes clear that because new evaluation systems and compensation reform get everyone’s attention, they can serve as a wonderful occasion on which to undertake more systemic reform.

Ted Hershberg is a professor of public policy and history at the University of Pennsylvania and director of the Center for Greater Philadelphia. E-mail: tedhersh@upenn.edu. Claire Robertson-Kraft is pursuing her doctorate in education policy in Penn’s Graduate School of Education.