Missing the Mark: What Test Scores Really Tell Us

Why teaching to the test is akin to studying for the eye exam for your driver’s license

By JOHN R. TANNER

Scores from state tests administered for accountability purposes are regularly used to adjust instruction in nuanced ways. This is no accident: No Child Left Behind demanded that students’ scores be returned quickly to teachers precisely so this could happen, and the idea of data-driven decision making continues as one way the promise of education reform might be realized.

But what if test scores from state tests were never designed to support instruction? What if the methodologies supporting these tests were intended for a purpose that actually had little, if anything, to do with informing nuanced changes in the curriculum? Or worse, what if the design is such that instruction in anticipation of the test actually makes the results invalid? And what if we did it anyway, out of fear, frustration or a feeling that no other option existed?

John Tanner in his office in San Antonio, Texas, where he heads a consulting firm on student assessment programs. Photo courtesy of Test Sense.

The result might be akin to what would happen if people studied for the eye test at the Department of Motor Vehicles. We would have lots of people passing the test, but we wouldn’t have a clue whether they actually could see well enough to drive. That eye test only works when it can reference something meaningful beyond itself. When it refers only to itself, it should be obvious it tells us nothing about the thing it was trying to measure, and interpretations that suggest otherwise are just plain wrong.

Illogical Purposes
While it comes as a shock to many, the standardized tests used for school accountability purposes aren’t much different. It is immediately obvious that results from a compromised eye test are useless for the purpose of the test’s design, but for some reason we fail to extend that same logic to a compromised standardized test score. Even though most Americans acknowledge and decry the problems of reducing a school to a “teach to the test” mentality, we treat standardized test results from those schools as if they still have something useful to tell us. Even compromised test results, we seem to reason, are better than nothing.

What needs to be understood is that compromised results from a standardized test are not helpful instructionally. At their most fundamental level, standardized tests aren’t even about instruction. Standardized test methodologies were designed to rank order or “distribute” students within a domain and do so consistently to allow for meaningful comparisons within and among schools.

The methodology has been deployed in schools in two primary forms. One is the “off the shelf” tests sold for many years by publishers (the Stanford Achievement Test, the California Achievement Test and the Iowa Tests of Basic Skills were in vogue when many of us were in school). The other is the current spate of accountability tests that are customized to each state’s standards.

While we apply different names to these tests — norm-referenced for the former, criterion-referenced for the latter — these refer to the way the results are interpreted, not how the tests are made. In a norm-referenced test, the comparison is to a nationally representative set of students, and in a criterion-referenced test, the comparison is to a line drawn at a particular test score.

Testing Limits
The material best suited to show the distribution of students is test items that roughly 40 to 60 percent of students answer correctly in a field test. Consider that if 100 percent of students answer an item correctly (or incorrectly), the item fails against the test designer’s goal of identifying differences in achievement. The first item in a standardized test thus divides a population of students into two piles, the second divides it into three (none right, one right, both right), and so on.
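The pile arithmetic can be seen in a few lines of code. This is a purely illustrative toy, not any testing vendor’s actual procedure: with right/wrong scoring, a test of n items can never sort students into more than n + 1 piles, one per possible raw score.

```python
# Toy illustration: with right/wrong scoring, n items can only split
# students into at most n + 1 "piles," one per possible raw score.
from itertools import product

def count_piles(num_items):
    """Count the distinct raw scores possible with num_items items scored 0/1."""
    patterns = product([0, 1], repeat=num_items)  # every response pattern
    return len({sum(p) for p in patterns})

for n in (1, 2, 3):
    print(n, "items ->", count_piles(n), "piles")  # 2, 3, 4 piles
```

This is why a designer can stop adding items once enough piles exist to separate students meaningfully: each additional item buys at most one more pile.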

A test designer only needs so many piles in order for the results to be meaningful and can stop adding more items once that threshold is met. While no magic number exists, most such tests wind up in the neighborhood of 40 or 50 items.

However, it is not enough to rely only on the percentage of students who answered an item correctly as the basis for inclusion. The pattern of responses also must be analyzed in order to ensure consistency in the results, and items that perform outside the expected pattern are eliminated.

For example, in one analysis the tested population is divided into several segments according to students’ overall scores, and each item is analyzed within each segment. The test designer is looking to see whether top-performing students answered most items correctly and whether low-performing students answered most items incorrectly. If most of the top-performing students answered an item incorrectly, the item is excluded from the live test because it undermines the test’s ability to show a consistent distribution. So, too, with items that most low-performing students answered correctly.
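A minimal sketch of this kind of screen might look like the following. The thresholds, group sizes and data are hypothetical, and real testing programs use more sophisticated item-response statistics, but the logic is the same: keep an item only if roughly half the field-test students got it right and the high scorers outperformed the low scorers on it.

```python
# Hypothetical item screen: keep an item only if (a) roughly half the
# field-test students answered it correctly and (b) top-scoring students
# did better on it than bottom-scoring students.

def screen_items(responses, p_range=(0.4, 0.6)):
    """responses: one list of 0/1 item scores per student.
    Returns a keep/drop flag for each item."""
    n = len(responses)
    third = n // 3
    # Sort students by total score to form low- and high-scoring groups.
    ranked = sorted(responses, key=sum)
    low, high = ranked[:third], ranked[-third:]
    flags = []
    for j in range(len(responses[0])):
        p = sum(s[j] for s in responses) / n          # overall difficulty
        disc = (sum(s[j] for s in high) / third
                - sum(s[j] for s in low) / third)     # discrimination
        flags.append(p_range[0] <= p <= p_range[1] and disc > 0)
    return flags

field_test = [
    [1, 0, 0], [1, 0, 0],        # low scorers: get item 0 right
    [1, 1, 0], [0, 1, 1],
    [0, 1, 1], [0, 1, 1],        # high scorers: miss item 0
]
print(screen_items(field_test))  # item 0 is dropped despite p = 0.5
```

Note that item 0 in this toy data is excluded even though half the students answered it correctly, solely because it behaves “backward”: the low scorers got it right and the high scorers did not.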

In many instances, the backward performance can be attributed to a poorly written item, but not always. If most low-performing students answered an item correctly, one reason might be that the material was part of their instruction, which would be helpful to know instructionally. Or if most high-performing students answered it incorrectly, it might be because the material was not part of their instruction, which also would be helpful to know.

But these results won’t be available from the test because these items fail to contribute to understanding how students distribute and will be excluded. The fact they may have value in an instructional context does not also render them useful in a standardized test context, and state testing programs operate within the standardized test context.

It should be readily understood that a test that includes and excludes items based not on their ability to guide instruction but on their ability to produce a rank ordering of students — and every state test used for accountability does just that — is not appropriate as a guide for instruction.

Consider that such tests, by definition, do not cover the full range or richness of a set of standards because items written against certain standards will fail to meet the statistical criteria for inclusion. That means the easiest-to-learn material is not likely to find its way onto the test, nor is the most difficult-to-learn material, which may also be the most important. Consider, too, that even within the tested standards, items must perform within a certain range to be included, further narrowing the material that makes it onto the test. The design is intended to illuminate the distribution of achievement, and to do that effectively, the instructional nuances so valuable to a teacher cannot be the most important consideration when selecting test items.

Student Distribution
Finally, even if items that don’t completely meet the criteria are included, an additional statistical process that translates the number of items answered correctly to a scale score that can be compared from one year to the next will ensure that their impact on the final score is negligible. The resulting scale score is designed to place a student at the appropriate point in the distribution.

Because an item that everyone answered correctly (or one that everyone answered incorrectly) doesn’t contribute to our understanding of that distribution, the overall effect is that it doesn’t count. Or said in another way, even though students all earned an extra point (or missed a point), the effect is to slide the scale up or down to accommodate that fact. A test designer will see such an item as a waste of time and money because it doesn’t improve the ability to understand the distribution and, at the next opportunity, will eliminate the item as unnecessary.
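The point can be shown in a tiny sketch, again illustrative rather than any vendor’s actual scaling method: if the scale score is based on a student’s place in the distribution, an item everyone answers correctly shifts every raw score up by one and changes nothing about where anyone stands.

```python
# Illustrative sketch: a universally correct item raises every raw score
# by one, but each student's place in the distribution is unchanged, so
# a rank-based scale score simply absorbs it.

def percentile_ranks(scores):
    """Fraction of students scoring at or below each student's score."""
    n = len(scores)
    return [sum(s <= x for s in scores) / n for x in scores]

raw = [12, 25, 31, 31, 40]
bumped = [s + 1 for s in raw]       # everyone got the extra item right
print(percentile_ranks(raw) == percentile_ranks(bumped))  # prints True
```

Every student “earned” the extra point, yet the distribution, and therefore each student’s reported position in it, is identical, which is exactly why the designer sees such an item as wasted money.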

What is so striking is how much of what is valuable, even essential, from an instructional viewpoint fails to make it into such a test. What is also striking is that the test design selected as the means to drive the system toward high standards ironically need not even contain items reflective of those high standards: if only a few kids answer such items correctly, a design whose purpose is to identify the distribution of students will exclude them.

That alone creates a misalignment between the goals of educational reform and the tools selected for accountability. The same is true with some of the basic material that is critical to struggling learners, many of whom are years behind instructionally.

What they most need from an instructional perspective isn’t represented on the exam because most of the kids got it right and it failed to contribute to understanding the distribution of students. And even within the tested content, only items that roughly half of the kids get right and half get wrong make it to the test. A truly engaging item that all kids get right, and that therefore reflects the true state of what students are learning, will be excluded, while a less engaging item that behaves well statistically will make it.

Test Prep Dangers
Consider now how many of the goals of reform are immediately compromised if the tested material is used to drive instruction. The high standards that represent the purpose of the reform all too often are not even in the measure. The idea that all students can receive instruction appropriate to their needs is compromised, particularly for students who struggle and are behind grade level because material that can direct their instructional path is excluded. The idea of high standards for all is reduced to a cut score on a test designed entirely for another purpose.

Standardized testing, it must be said, has been a useful tool in the hands of a thoughtful practitioner or researcher if, and only if, no one anticipated the tested material; that condition is critical. What a test never has been able to do on its own, whether an eye test or the standardized type being discussed here, is tell us whether someone prepped for it at the expense of real learning (or seeing), even though knowing that fact is critical to a valid interpretation.

Both tests are designed as proxies for much larger, far more important targets. If the tests are unanticipated, then a proper administration allows for inferences to be made against those larger targets — the ability to see or an overall sense of achievement. If the material in the tests is anticipated and becomes the basis for study, then the results tell us only about the level of prep work and nothing else. The fact the results can’t distinguish as to the conditions under which they were generated contributes greatly to the potential for misinterpretation, but it doesn’t change the underlying reality that anticipating the tested material produces results that risk being dead wrong.

Of course, it must be acknowledged that unlike an eye test that gets memorized, teaching to reading or math tests at least requires some teaching and learning of the content, albeit through a narrowed lens. The fact the tests contain some reading and math items becomes the basis for reasoning that even when the test becomes the focus for teaching, some sort of valid interpretation is still available. Thus, the argument would continue, while a memorized eye test tells us nothing about someone’s ability to see, a compromised math score at least tells us a little something about the student’s ability to do math and what might be done instructionally because the student had to do at least some math in the process.

Here’s why that isn’t the case. Consider, as just one example, research suggesting that the best way to teach students some of the more basic mathematics is to place them in rich, complex contexts that require them to learn the basics as a means to solving a bigger problem. A school that teaches to the test has already eliminated that opportunity, and interpreting the test results at face value will never suggest that a richer curriculum is the key, because given the design, that possibility isn’t in the items. The school may well have eliminated the very best means of educating its students and, by reducing its efforts to educating to a test, made it harder to do even the simplest things. What such a school will have done is limit the range of available interpretations to those in the items on the test, which, as shown above, is far more limiting than most people realize.

Giving credence to compromised scores excludes all but the simplest of interpretations, which is to work harder on the tested material — hardly the goal of education reform. That is why the compromised eye test and a compromised test score are more alike than not: They both leave us blind to what we really need to know. It’s harder to see in the case of a reading or math test, but that doesn’t make it less true.

Stressing Comparisons
It may seem counterintuitive that the best means to achieve higher test scores is not to teach to the current spate of standardized tests, and then to interpret the results broadly and only in the context of other data, but the design of the instruments suggests that is exactly the right thing to do.

I’m reminded of something Lee Cronbach of Stanford University, a leading psychologist and testing expert for the better part of the last century, said in an interview published by the Phi Delta Kappa Educational Foundation in 2004 about the design and purpose of test instruments. It went something like this: You can have a test that tells you something about what students are actually learning in class and that can then inform instruction, or you can have a test designed to allow for meaningful comparisons among schools and students, but one test cannot and never will be able to do both!

Current accountability testing is based upon a comparative, not an instructional model. Our hunger for instructional information and for alignment among all the parts of the system seems to have trumped Cronbach’s warning that one instrument cannot do everything. We may continue to believe and act otherwise, but if we do, we will continue to drive without being able to see very well, while thinking we see just fine.

John Tanner is executive director of Test Sense in San Antonio, Texas. E-mail: johnt@testsense.com