Cut Scores, NAEP Achievement Levels and Their Discontents

The attempts of political bodies to bludgeon public schools with arbitrary performance standards

by Gerald W. Bracey

On virtually all tests these days, there is a score that determines whether a student passes or fails, is proficient or not, or is being educated or left behind. This is the cut score.

While it might seem cynical, the fairest way to set the cut score is to decide in advance how many students you want to fail. This approach signals emphatically that the procedure is both arbitrary and political. But it might forestall many political and technical difficulties once the test is operational.

The common procedure for setting cut scores requires a committee, typically of about 20 people, to examine the items of a test and to judge the likelihood that a minimally competent person will get the item right. (The concept of “minimally competent person” is itself vague and seldom discussed.) One weakness in this approach is that those who simply look at a test invariably think it will be easier than it turns out to be.
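The arithmetic behind such a committee judgment can be sketched in a few lines. This is a minimal illustration, assuming each judge supplies a probability for every item and that the cut score is taken as the average of the judges' summed probabilities, which resembles what is often called a modified Angoff procedure. The function name and the ratings are hypothetical, not drawn from any actual standard-setting session.

```python
from statistics import mean

def committee_cut_score(ratings):
    """Return (cut_score, per_judge_scores).

    ratings[j][i] is judge j's estimate of the probability that a
    minimally competent test taker answers item i correctly.
    """
    n_items = len(ratings[0])
    if any(len(r) != n_items for r in ratings):
        raise ValueError("every judge must rate every item")
    # A judge's implied passing score is the sum of that judge's item probabilities.
    per_judge = [sum(r) for r in ratings]
    # The committee's cut score is the average of the judges' implied scores.
    return mean(per_judge), per_judge

# Hypothetical ratings: 3 judges, 5 items (invented for illustration).
ratings = [
    [0.9, 0.8, 0.6, 0.7, 0.5],
    [0.8, 0.7, 0.5, 0.6, 0.4],
    [1.0, 0.9, 0.7, 0.8, 0.6],
]
cut, per_judge = committee_cut_score(ratings)
print(f"cut score: {cut:.2f} of {len(ratings[0])} items")
print("spread across judges:", [round(s, 2) for s in per_judge])
```

The spread across judges is worth printing alongside the average: if judges tend to see items as easier than they are, every probability is inflated and the cut score rises with it.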

While I worked as director of research, evaluation and testing for the Virginia Department of Education, the legislature enacted a minimum competency test as a high school graduation requirement. I invited news media from around the state to come to Richmond at an appointed hour to take the test for themselves. Only the education writers from The Washington Post and the Roanoke Times accepted the offer.

After the first test administration, when many items were released to the public, most media outlets raged that the test was too easy and an insult to the intelligence of Virginia students. The reporters who had taken the test, neither of whom got a perfect score, wrote that the exam was a fair test of what it was intended to measure. Other writers had to try to figure out how a test that seemed so simple had stumped so many when it turned out that 12 percent of white students and 33 percent of African-American students did not reach the test’s 70 percent correct criterion for passing.

A similar problem arose later in Virginia with the Standards of Learning tests, the current gateway for high school graduation. Feeling a need to look tough with regard to high standards, the state board of education selected extremely high cut scores for the 21 tests in the program. The state has since had to lower some of the cut scores because a politically unacceptable number of students failed.

A Failed Attempt
Students who reach or surpass the cut score are said to meet standards or, more commonly today, are declared “proficient” in the subject. The word proficient implies the use of a truly criterion-referenced test in which certain behaviors define proficiency. (Recall that in his writings about the concept in the 1960s, Robert Glaser argued that for any skill we could imagine a continuum of achievement from no skill to conspicuous excellence and that there were specific behaviors that defined points along that continuum.)

The most ambitious attempt to define proficient in terms of what students can or cannot do occurs with the National Assessment of Educational Progress’s achievement levels of basic, proficient and advanced. But this attempt has failed — a student at any of the levels will get items right that should be too difficult and get items wrong that should be a cinch.

In fact, a cut score only establishes the height of a hurdle. Virtually every test today is reported in terms of the percent of students who leap over this hurdle. This practice has several pernicious outcomes.

First, for students who manage to leap over the hurdle, we know only that they got over it. If the hurdle were 3 feet high, we don’t know whether a student leapt 4 feet in the air, 5 feet, 6 feet or higher still. Reporting in terms of percentage proficient thus throws away important information, namely, the actual scores.
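A small numerical sketch makes the information loss concrete. The two score lists and the cut score of 70 below are invented for illustration; they are not data from any real test.

```python
from statistics import mean

def percent_at_or_above(scores, cut):
    """Percentage of scores that reach or surpass the cut score."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)

CUT = 70  # hypothetical cut score

# Two hypothetical schools with identical results below the cut.
school_a = [55, 62, 70, 71, 72, 73]  # passers barely clear the hurdle
school_b = [55, 62, 70, 85, 93, 99]  # passers clear it with room to spare

for name, scores in (("A", school_a), ("B", school_b)):
    print(
        f"School {name}: "
        f"{percent_at_or_above(scores, CUT):.1f}% proficient, "
        f"mean score {mean(scores):.1f}"
    )
```

Both schools report the same percentage proficient, yet their average scores differ by about 10 points. A report limited to the percentage cannot tell them apart.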

Focusing on the cut score — and currently that’s what’s being done obsessively — also leads to a form of gaming the system, notably giving extra attention to the kids who are close to making the leap (the “bubble kids”) and paying less attention to the “hopeless cases” and “sure things.”

More insidious still is the way a cut score can be manipulated to increase or decrease differences among different groups of students.

A Definition Finessed
Relying on cut scores and reporting the percentage passing or failing can actually obscure an increasing achievement gap that would be seen if actual test scores were used instead.

Discussions with David Herszenhorn, who at the time covered the education beat for The New York Times, revealed that this phenomenon occurred a couple of years ago in New York City. A lot of inner-city students had scored close to passing one year. Students in the more affluent sections of the city already had high pass rates.

There was great tumult and celebration the next year as the passing rates in poor neighborhoods soared while those in the affluent areas did not. The poor kids had finally managed to get over that hurdle in great numbers. The middle-class kids, of course, had already been clearing the hurdle and so their performance did not show up as much of an improvement. If New York had looked at actual test scores, though, it might have found big gains for middle-class students.
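The pattern can be reproduced with made-up numbers. The sketch below is hypothetical throughout: the scores, the cut score of 65 and the group labels are invented to show the mechanism and are not New York City data.

```python
from statistics import mean

def percent_at_or_above(scores, cut):
    """Percentage of scores that reach or surpass the cut score."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)

CUT = 65  # hypothetical cut score

# Year 1: one group sits just under the cut, the other is already over it.
low_income_y1 = [58, 62, 63, 64, 66, 70]
affluent_y1 = [66, 70, 75, 80, 85, 90]

# Year 2: the first group gains 2 points each, the second gains 8 points each.
low_income_y2 = [s + 2 for s in low_income_y1]
affluent_y2 = [s + 8 for s in affluent_y1]

def gaps(low, high):
    """Return (pass-rate gap, mean-score gap) between the two groups."""
    rate_gap = percent_at_or_above(high, CUT) - percent_at_or_above(low, CUT)
    score_gap = mean(high) - mean(low)
    return rate_gap, score_gap

rate_gap_y1, score_gap_y1 = gaps(low_income_y1, affluent_y1)
rate_gap_y2, score_gap_y2 = gaps(low_income_y2, affluent_y2)

print(f"Pass-rate gap: {rate_gap_y1:.1f} -> {rate_gap_y2:.1f} percentage points")
print(f"Mean-score gap: {score_gap_y1:.1f} -> {score_gap_y2:.1f} points")
```

In this invented case the pass-rate gap is cut in half while the gap in average scores grows by 6 points, because a group that already clears the cut cannot show its gains in the pass rate.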

To this point, we have mostly finessed that word “proficiency.” Its meaning has become more confused and confusing since the inception of No Child Left Behind. While some states, such as Colorado, had state standards cast in terms of proficiency, most references to proficient prior to NCLB were to the proficient level of the National Assessment of Educational Progress.

NCLB muddied the waters by declaring that all students must be proficient in reading and math by 2014 but then permitted each state to define proficient. Secretary of Education Rod Paige and especially his successor, Margaret Spellings, contributed to the Babel of definitions when they started using proficient interchangeably with the phrase “at grade level.”

Grade level had long been defined, albeit arbitrarily, as the score of the average child in a given grade. On norm-referenced tests, the universal form of testing before the onset of criterion-referenced testing, half of all test takers are always, by definition, below grade level. Grade level, at least as far as the federal government is concerned, apparently is now whatever a given state has defined as proficient, giving us 50 definitions of grade level to parallel the 50 definitions of proficient.

Although American educators have championed local control, 50 definitions of a concept in the context of a national program troubled a lot of folks. The only existing program that could provide some uniformity across states, of course, is NAEP. Calls came for NAEP to be the unifying common measure of states’ performance on NCLB. These calls grew louder when some perceived a “race to the bottom” — states setting low standards in order to look good while meeting the requirements for NCLB’s adequate yearly progress.

NAEP’s Prescription
Adopting NAEP achievement levels would be a multifaceted, unmitigated disaster, but to demonstrate this we need to back up and take a look at how the NAEP achievement levels of basic, proficient and advanced came into existence.

Until 1988, NAEP was purely descriptive. Starting in 1963, NAEP’s conceptual father, Francis Keppel, and technical father, Ralph Tyler, wanted to create something different from a norm-referenced test on which about 50 percent of students answer most items correctly. On purpose, NAEP created items that the test designers figured few students would answer correctly along with items the creators thought most would answer correctly, as well as the usual items that about half the people would get right. In the same way a medical survey might analyze the incidence of tuberculosis nationwide, NAEP would survey the incidence of knowledge in the country.

In 1988, though, Congress created the National Assessment Governing Board and charged it with establishing standards. NAEP now became prescriptive, reporting not only what people did know but also laying claim to what they should know. The attempt to establish achievement levels in terms of the proportion of students at the basic, proficient and advanced levels failed.

The governing board hired a team of three well-known evaluators and psychometricians to evaluate the process — Daniel Stufflebeam of Western Michigan University, Richard Jaeger of the University of North Carolina at Greensboro and Michael Scriven of NOVA Southeastern University. The team delivered its final report on Aug. 23, 1991. This process does not work, the team averred, saying: “[T]he technical difficulties are extremely serious … these standards and the results obtained from them should under no circumstances be used as a baseline or benchmark … the procedures used in the exercise should under no circumstances be used as a model.”

NAGB, led by Chester E. Finn Jr., summarily fired the team, or at least tried to. Because the researchers already had delivered the final report, the contract required payment.

Flawed Uses
The inappropriate use of these levels continues today. The achievement levels have been rejected by the Government Accountability Office, the National Academy of Sciences, the National Academy of Education, the National Center for Research on Evaluation, Standards, and Student Testing and the Brookings Institution, as well as by individual psychometricians.

I have repeatedly observed that the NAEP results do not mesh with those from international comparisons. In the 1995 Trends in International Mathematics and Science Study, or TIMSS, assessment, American 4th graders finished third among 26 participating nations in science, but the NAEP science results from the same year stated that only 31 percent of them were proficient or better.

The National Academy of Sciences put it this way: “NAEP’s current achievement-setting procedures remain fundamentally flawed. The judgment tasks are difficult and confusing; raters’ judgments of different item types are internally inconsistent; appropriate validity evidence for the cut scores is lacking; and the process has produced unreasonable results.”

The academy recommended use of the levels on a “developmental” basis (whatever that means) until something better could be developed. In 1996, the National Academy of Education recommended the current achievement levels “be abandoned by the end of the century and replaced by new standards … .”

Continuing Mischief
Here we are almost a decade into a new century and the old standards remain, causing a great deal of mischief every time a new NAEP assessment is released to the news media. No one is working to create new standards. Why? The use of the NAEP standards fits into the current zeitgeist of school reform as all stick and no carrot.

When the U.S. Chamber of Commerce and the Center for American Progress rolled out their jointly developed “Leaders and Laggards” in February 2007, the report lamented: “[T]he measures of our educational shortcomings are stark indeed; most 4th and 8th graders are not proficient in either reading or mathematics … .”

At the press conference announcing the report, an incensed John Podesta, president and CEO of the Center for American Progress, declared: “It is unconscionable to me that there is not a single state in the country where a majority of 4th and 8th graders are proficient in math and reading.” He based his claim on the 2005 NAEP assessments.

Podesta could have saved himself some embarrassment had he read the recent study by Gary Phillips, formerly the acting commissioner of statistics at the National Center for Education Statistics. Phillips, now at the American Institutes for Research, had asked: “If students in other nations sat for NAEP assessments in reading, mathematics and science, how many of them would be proficient?”

Because we have scores for American students on NAEP and TIMSS and scores for students in other countries on TIMSS, it is possible to estimate the performance of other nations if their students took NAEP assessments.
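One simple way to make such an estimate is to place the NAEP cut score on the TIMSS scale by matching the mean and standard deviation of American students on both tests, then ask what share of another country's students would fall above that point. The sketch below illustrates that idea only. It is not necessarily the method Phillips used, it assumes scores are roughly normally distributed, and every number in it is hypothetical.

```python
from statistics import NormalDist

def link_cut_score(naep_cut, naep_mean, naep_sd, timss_mean, timss_sd):
    """Map a NAEP cut score onto the TIMSS scale by matching the U.S.
    mean and standard deviation on the two tests (linear linking)."""
    z = (naep_cut - naep_mean) / naep_sd
    return timss_mean + z * timss_sd

def percent_above(cut, mean, sd):
    """Percentage of a normally distributed population scoring above cut."""
    return 100.0 * (1.0 - NormalDist(mean, sd).cdf(cut))

# Hypothetical U.S. results on both tests (invented for illustration).
US_NAEP_MEAN, US_NAEP_SD = 280.0, 35.0
US_TIMSS_MEAN, US_TIMSS_SD = 500.0, 80.0
NAEP_PROFICIENT_CUT = 299.0  # hypothetical cut score

timss_cut = link_cut_score(
    NAEP_PROFICIENT_CUT, US_NAEP_MEAN, US_NAEP_SD, US_TIMSS_MEAN, US_TIMSS_SD
)

# Hypothetical TIMSS results for another country.
other_mean, other_sd = 540.0, 75.0
other_pct = percent_above(timss_cut, other_mean, other_sd)

print(f"NAEP cut of {NAEP_PROFICIENT_CUT:.0f} is about {timss_cut:.0f} on the TIMSS scale")
print(f"Estimated share of the other country's students above it: {other_pct:.1f}%")
```

Even a country that outscores the United States by half a standard deviation in this invented example would have fewer than half its students labeled proficient. Because the two tests differ in content and sampling, any such linked estimate is rough.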

How many of the 45 countries in TIMSS have a majority of their students proficient in reading? Zero, said Phillips. Sweden, the highest-scoring nation, would show about one-third of its students proficient while the United States had 31 percent. In science, only two nations would have a majority of their students labeled proficient or better while six countries would cross that threshold in mathematics.

NAEP reports issued prior to the current Bush administration noted that the commissioner of education statistics had declared the NAEP achievement levels usable only in a “developmental” way. That is, only until someone developed something better. But no one was or is working to develop anything better. When I wrote an op-ed piece for The Washington Post (“A Test Everyone Will Fail,” May 20, 2007), an indictment of the achievement levels, I got feedback that officials at the National Assessment Governing Board were quite satisfied with the levels as they are. That can only mean NAGB approves of the achievement levels being used as sledgehammers to bludgeon public schools. They serve no other function.

Jerry Bracey is an independent researcher in Alexandria, Va., and author of Reading Educational Research: How to Avoid Getting Statistically Snookered. E-mail: gbracey1@verizon.net