Skewed Comparisons

"Then and now" studies of student performances come up short in delivering an honest assessment of public education today

By RICHARD ROTHSTEIN

How can we know whether today's schools are better or worse than yesterday's? The best way would be to administer the same tests we gave years ago to compare achievement then and now. But things are not so simple.


Few test series extend back in time, although scores long have been used to evaluate (and mostly condemn) American schools. Horace Mann administered the nation's first standardized exam in 1845 to the 500 brightest 14-year-olds in Boston's public schools. Should we ask today's 8th graders the same questions to compare them to their 19th-century peers? Probably not.

Mann tested science by asking questions such as "How high can you raise water in a common pump with a single box?" and "What is the altitude of a heavenly body?" Since Mann's time, American educators have sought to ask better questions, improve test conditions and emphasize different skills.

Commercial firms now compete to persuade educators to adopt their assessments. It's generally impossible to compare one of these to another without an expensive experiment in which scores on several tests given to the same students are compared. Even this would be misleading. Different tests emphasize different parts of the curriculum, so some students might perform better on one and worse on another.

Wayward Interpretations
Today, for example, we worry about whether math tests should assess computational skills, theoretic understanding or reasoning. Each test resolves this somewhat differently. Students scoring well on one are not necessarily comparable to students scoring well on another.

Even if identical tests were given to students at different times, misleading comparisons still would follow. Curricula change. There are mathematical concepts taught today that were not in the past. Consider, for instance, set theory. Re-administering an earlier test (when set theory was not taught) can't help us compare today's and yesterday's student achievement. And to the extent schools spend any time teaching set theory (whether too much or too little is not the point), they necessarily spend less time drilling multiplication tables (or some other aspect of the prior curriculum). We should expect students to do somewhat less well on tests emphasizing calculation.

Educational Testing Service researcher Robert Mislevy uses an example from athletic competition to suggest how difficult test interpretation can be. The Olympic decathlon includes 10 events (100-meter dash, long jump, shot put, etc.). It's easy to determine which athlete wins any particular event. But determining who wins the decathlon requires decisions about how each event should be weighted. (Is two seconds faster in a sprint worth two feet farther in the shot put, or should it be three feet? Is a discus throw worth more or less than a long jump?)

In 1932, an American, James Bausch, won the Olympic gold in the decathlon because he won some events (discus, shot put, javelin) by large margins, though he did less well in others (running). But in 1985, the Olympic Committee changed the weights. Now, athletes who do exceptionally well in some events cannot offset poorer showings in others. The new scoring table gives greater emphasis to well-roundedness. If the 1932 decathlon had been scored using 1985 weights, a Finn, Akilles Jarvinen, and not Bausch, would have won the gold medal.
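The decathlon point can be made concrete with a small sketch. The athletes, event groups and numbers below are invented for illustration, and the "capped" rule is only a stand-in for a scoring table that rewards well-roundedness; it is not the actual 1985 table. The same results produce a different winner once the weighting changes.

```python
# Hypothetical standardized results (higher is better) for two athletes.
# All names and numbers are invented for illustration only.
results = {
    "Specialist": {"throws": 9.0, "jumps": 6.0, "runs": 3.0},
    "All-rounder": {"throws": 6.0, "jumps": 6.0, "runs": 5.5},
}


def linear_total(scores):
    """Every point counts the same, so a big win can offset a weak event."""
    return sum(scores.values())


def capped_total(scores, cap=7.0):
    """Performance above `cap` earns no extra credit (an assumed rule
    standing in for a table that favors well-roundedness)."""
    return sum(min(score, cap) for score in scores.values())


def winner(score_fn):
    """Return the athlete with the highest total under the given rule."""
    return max(results, key=lambda name: score_fn(results[name]))


print("Linear weights:", winner(linear_total))
print("Capped weights:", winner(capped_total))
```

Nothing about the athletes changed between the two print statements. Only the scoring rule did, which is the difficulty any "then and now" test comparison inherits.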

Hasty Conclusions
In comparing scores of today's athletes to yesterday's, should we use old or modern weights? What if athletic training programs change to give greater emphasis to well-roundedness but scores continue to emphasize exceptional excellence in a few events? Modern athletes who are superior (by today's standards) then would score lower than prior generations whose mix of skills we no longer value.

Every standardized math, language arts, social studies or science test is an academic decathlon of its own, hiding a myriad of difficult curricular and assessment decisions. Yet without knowing how test questions match changing curricular emphases, we are quick to form conclusions about changes in performance.

Even where we have consistent tests over time, they were not likely administered with an eye toward future intergenerational comparisons, and so necessary background data were not collected. To assess schools' comparative performance, we must ensure we're comparing children in similar economic and social circumstances.

In the mid-1960s, the federal government sponsored a massive school data project, headed by sociologist James Coleman. Since his 1966 report, Equality of Educational Opportunity, education researchers have better understood that family and community characteristics affect even a good school's ability to educate. Apparent school quality may have less influence on academic outcomes than children's unique social capital.

Socioeconomic Effects
We now understand that student achievement is the product of many socializing institutions, not schools alone. Most important is the family, whose influence is affected by parents' social and economic status as well as ethnic cultural values and practices. University of Kansas researchers Betty Hart and Todd Risley, for example, found recently that the language infants hear in the first three years of life plays an important role in later intellectual development. (Parents with professional occupations addressed an average of 2,153 words an hour to their infant children. Among working-class parents, it was 1,251, and for parents on welfare, it was only 616.)

Even a good Head Start program would come too late to overcome entirely these initial differences. We learn less than it appears if we use test scores to compare a school whose students come from professional families to one whose students are working class.

Prenatal and infant nutrition and family structure also play a role. Some research has shown that in families with fewer siblings, children experience a higher average intellectual content at home because their environment has a higher proportion of adults. Children whose parents attended school longer (regardless of the quality of those schools) tend to have better test results, perhaps because these parents, no matter what they learned, come to value the importance of education.

Cultural values, too, influence achievement. It's no accident that the offspring of some, but not all, Asian groups tend to excel more in school. Although we can observe these patterns, we can't fully explain them. Researchers believe that single parenthood has a negative impact on children's achievement, and that children born to very young mothers can be expected to perform less well academically than children born to adults.

Increased sophistication about these nonschool influences does not imply iron determinism, but it does mean that unadjusted scores shed little light on school quality. A school where family and community characteristics are academically disadvantageous might bring students to the 40th percentile on nationally normed tests and be a far better place than one whose students are advantaged, but whose scores are only at the 60th percentile.
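A rough sketch of what "adjusting" might look like follows. The 40th and 60th percentile figures come from the example above; the "expected" percentiles are hypothetical values standing in for what students' background characteristics alone might predict, and subtracting percentiles is a deliberate simplification, not a method any researcher cited here is said to use.

```python
# Actual percentiles are from the article's example. The "expected"
# percentiles are hypothetical: what background alone might predict.
schools = {
    "disadvantaged_school": {"actual": 40, "expected": 25},
    "advantaged_school": {"actual": 60, "expected": 75},
}


def value_added(actual_percentile, expected_percentile):
    """Crude adjusted score: how far a school lands above or below
    what its students' circumstances would predict."""
    return actual_percentile - expected_percentile


adjusted = {
    name: value_added(s["actual"], s["expected"])
    for name, s in schools.items()
}

for name, gain in adjusted.items():
    print(f"{name}: {gain:+d} percentile points versus expectation")
```

Under these assumed expectations the school with the lower raw score is the one adding more, which is why unadjusted scores say little about school quality.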

A Defensive Tone
School administrators used to defend themselves from persistent claims of declining student performance by conducting "then and now" studies of education. In reading reports of these studies, one can't help but be struck by their consistently defensive tone: professors or administrators began their articles by recounting denunciations of schools by politicians, journalists, academics, pundits or parent groups who claimed that "today's schools" didn't measure up to past standards, that teachers no longer emphasized basic skills and that young people knew less than before.

The commentaries then exclaimed that public schools had been subjected to abuse long enough, and therefore the authors had combed archives for tests given to students decades earlier. These outdated tests were readministered to contemporary students, under conditions as similar to the past as possible. Reports almost always concluded by showing contemporary scores to be superior, refuting conventional wisdom of the day.

Social scientists today would never sanction research like this, which did not control for students' background characteristics, but these studies shed light on the unchanging debates about American education. If we make the not-unreasonable assumption that demographic change may not have been as rapid during the period of these earlier studies as it is today, the studies suggested that school critics of the past were mistaken.

In 1919, Otis Caldwell, a professor at Columbia University Teachers College, and Stuart Courtis, director of teacher training at Detroit Teachers College, noted that "[s]urvey after survey has revealed unsuspected inadequacy or inefficiency in American education," resulting in "[s]uperintendents and teachers [being] dismissed" and "school systems and methods [being] reorganized."

Caldwell and Courtis sought to "bring a long-delayed message of encouragement to all who have participated in accomplishing the educational progress of the last 50 years." To do so, they readministered Horace Mann's 1845 test to a national sample of 8th graders.

Questions that retained curricular relevance were selected (questions about the height of a heavenly body or the U.S. invasion of Canada in the "last war" were dropped). Caldwell and Courtis printed new exams with the remaining questions. School superintendents from 46 states volunteered to participate. Unlike Mann's test, which was given only to the best students (Mann described them as "the flower of Boston schools"), the superintendents agreed to test all 8th graders present on the day the test was given. Twelve thousand exams were returned for scoring.

Although the 1919 test was administered to a full range of 8th graders, the median score rose to 45.5 percent correct in 1919 from 37.5 percent correct in 1845. Caldwell and Courtis concluded that children in 1919 did somewhat worse than the "best" 1845 children on "pure memory" questions and somewhat better on the "thought or meaningful questions." For example, they reported that "in 1845, 35 percent of the children knew the year when the embargo was laid by President Jefferson, but only 28 percent knew what an embargo was. In 1919, only 23 percent knew the year, but 34 percent knew the meaning."

A Tiresome Refrain
In 1934, a Los Angeles school researcher, Elizabeth Woods, gave a 1924 6th-grade reading test to students in 33 elementary schools where the test had been administered 10 years earlier. She found that scores were half a grade higher in 1934 than they had been in 1924.

In 1946, Don Rogers, an assistant school superintendent in Chicago, grew tired of hearing "employers ... allege that present-day pupils (even high school graduates) are not proficient. ... The imputation is that ... our school system formerly trained them better than now." So Rogers readministered a 6th-grade arithmetic test from 1923. He found that the 1946 pupils scored about the same as 1923 pupils (despite unusually high teacher turnover during World War II and the constant disruptions of wartime drives conducted in schools). He concluded that this "discounts the allegations that ... Chicago pupils of an earlier generation did better work than their sons and daughters who are now in the elementary schools."

In 1948, Springfield, Mo., schools came under attack from a citizens group for embracing tenets of "progressive education" and for ignoring basic skills. University of Illinois Professors F. H. Finch and V. W. Gillenwater undertook to study "whether the teaching of reading had increased or decreased in effectiveness" by readministering a 1931 6th-grade reading test in the same Springfield schools. They found that 1948 scores were higher and concluded that "[a]pparently reading instruction ... is now more effective ... and most 6th-grade children now in schools do better in reading than did their predecessors."

Tests of General Educational Development, an alternative high school certification for students who drop out, originally were developed by the Army in 1943 to assess draftees' academic skills. To establish a scoring scale, the Army tested a representative sample of 35,000 seniors in high schools across the country (though in segregated states only white schools were included).

At a time when belief that schools had deteriorated was as widespread as today, Army officials wondered if the 1943 scale retained validity 12 years later. So the Pentagon conducted a new study, giving a 1955 GED test to a similarly representative national group of seniors. Then, a smaller sample was given both the 1943 and 1955 tests, so scales on the two could be equated.

The Chicago professor who analyzed the results, Benjamin Bloom, concluded that "[I]n each of the GED tests the performance of the 1955 sample of seniors is higher than the performance of the 1943 sample ... [I]n mathematics the average senior tested in 1955 exceeds 58 percent of the students tested in 1943." (Performance also exceeded earlier scores in science, literature, English and social studies.) "These differences are not attributable to chance variation," Bloom concluded, adding they show "high schools are doing a significantly better job of education in 1955 than they were doing in 1943."

In the early 1950s, Vera Miller and Wendell Lanton, Evanston, Ill., school researchers, noted that parents often charged that "too much time [is] being devoted to music, arts, crafts, dramatics and unit work [group projects] to the detriment of the 'Three R's.'" So Miller and Lanton readministered Evanston's standardized reading tests from 1932 to contemporary students. They found that 4th graders in 1952 scored six months higher in reading and eight months higher in vocabulary than their 1932 counterparts. "[P]resent day pupils read with more comprehension and understand the meaning of words better than did children who were enrolled in the same grades and schools more than two decades ago," Miller and Lanton concluded.

In 1976, Indiana’s state superintendent of public instruction, Harold Negley, reviewed reading instruction. He noted that in 1976 "the charge is sometimes made that children do not read as well as in the past and that schools are to blame." In 1945, Indiana had tested reading with a 25 percent sample of the state's students. So in 1976, Negley gave the same tests to a sample of 6th and 10th graders selected (taking account of urban-rural-suburban distribution) so that test-takers in the two time periods were as similar as possible. Raw results revealed that 6th and 10th graders read at virtually the same level as comparable 1945 students.

Indiana, however, had kept unusually good records on 1945 students, and Negley (along with two Indiana University researchers) found they were considerably older than those in 1976 who took the test. In 1976 when there was more social promotion, 6th graders included 11- and 12-year-olds almost exclusively, but in 1945 the 6th grade also had many 13- and 14-year-olds. There was nearly a full year's average age difference.

After adjusting results to compare age-equivalent scores rather than grade-equivalent scores, the researchers found that the 1976 sample for both 6th and 10th grades "outscored the 1945 sample significantly on every test." And because fewer teenagers dropped out in 1976 than in 1945, the earlier 10th-grade students were, on average, higher achievers compared to all young people their age than were the less selective group of 1976. But the more selective earlier students did no better than the more universal contemporary group.

"[T]he general national assumption that the reading abilities of our children are decreasing at an alarming rate [is] unsupported by this study," the Indiana team concluded. This "ungrounded alarm leads to attacks on school programs that have been developed over the same time span for which this study shows the improvement in student reading achievement."

Judging with Caution
School officials no longer publish such reports, perhaps because today even school officials believe a fable of school deterioration. Or perhaps they now recognize that reasonable "then and now" comparisons require sophisticated controls for student background characteristics. With demographic change in many districts more rapid today, we can no longer take "then and now" studies seriously without better data on test takers' socioeconomic circumstances. These data do not exist for past tests, and there's no way now to create them.

We can illustrate the difference background characteristics can make by carefully comparing scores on domestic and on international math tests, where Japanese and Korean students outscore Americans. These comparisons show that average scores in Iowa and North Dakota are higher than those in Korea and Japan.

Is this because Iowa and North Dakota math instruction is better than in Korea and Japan, while instruction elsewhere in America is worse? Or is it because average students in Iowa and North Dakota have more advantageous economic and social circumstances? If Iowa teaching methods were used in Mississippi, would Mississippi students also have done better than Japanese students? Or would Mississippi students have to acquire Iowa family and economic characteristics for them to outscore the Japanese? Without more sophisticated data, we have no way of answering these questions with certainty.

However, there has been one recent attempt to conduct a modern "then and now" study, using today's more sophisticated statistical techniques. RAND research scientist David Grissmer has developed a procedure for estimating how score changes on the National Assessment of Educational Progress have resulted from students' family characteristics.

After identifying statistical relationships between test scores and student background factors, Grissmer found marked changes in students' family situations since 1970. He then examined NAEP scores from the early 1970s and re-computed what these scores would have been if these students possessed the family characteristics that the census showed students actually had in 1990.

Grissmer initially expected these "predicted" scores would probably be lower. After all, today there is apparently more poverty, and there are more single-parent families, than in 1970, so achievement should be lower. Yet Grissmer found that when he pretended students in the 1970s had 1990 family characteristics, their predicted scores jumped, because while some family characteristics that affect student achievement deteriorated, other very important ones improved.

In particular, parental education levels increased in the last generation and family size decreased. According to Grissmer's calculations, a child whose parents graduated from college is likely, other things being equal, to have a NAEP score that is 18 percentile points higher than a child whose parents did not graduate from high school. More parental education means children get more academic support at home; smaller family size means they get more parental attention. These factors, which produce higher scores, improved more than enough to cancel the negative influences.

The gap between white and black students' math and reading NAEP scores was reduced by about 40 percent from 1975 to 1990. White test scores, Grissmer calculated, went up by about as much as white students' improved social and economic circumstances should have caused. (This was a very slight improvement.) But black students' reading and math scores improved by more than twice the predicted amount; i.e., twice what socioeconomic factors alone can explain. For all minority students, math test scores jumped three times as much as socioeconomic changes alone lead us to expect.
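The logic of comparing actual gains with "predicted" gains can be sketched in a few lines. This is a simplified reweighting, not Grissmer's statistical procedure: it holds each group's early-1970s average fixed and asks what the overall average would be if families looked as they did in 1990. The group shares, the group averages and the observed 1990 figure are all hypothetical; only the 18-point spread between the top and bottom parental-education groups echoes the article.

```python
# Hypothetical average scores in the early 1970s, by parents' education.
# Only the 18-point spread (58 - 40) echoes a figure from the article.
mean_score_1970 = {"no_hs_diploma": 40.0, "hs_diploma": 50.0, "college_degree": 58.0}

# Hypothetical shares of students in each group (each set sums to 1).
share_1970 = {"no_hs_diploma": 0.40, "hs_diploma": 0.45, "college_degree": 0.15}
share_1990 = {"no_hs_diploma": 0.20, "hs_diploma": 0.50, "college_degree": 0.30}


def weighted_mean(means, shares):
    """Overall average when each group's mean is weighted by its share."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(means[group] * shares[group] for group in means)


actual_1970 = weighted_mean(mean_score_1970, share_1970)
# "Predicted" score: 1970 group averages combined with 1990 family mix.
predicted_1990 = weighted_mean(mean_score_1970, share_1990)

observed_1990 = 53.0  # hypothetical observed average, for illustration

family_effect = predicted_1990 - actual_1970  # change explained by families
unexplained = observed_1990 - predicted_1990  # change families cannot explain

print(f"Explained by family change: {family_effect:.1f}")
print(f"Left unexplained:           {unexplained:.1f}")
```

Whatever remains in the unexplained part may reflect schools, programs or anything else the family variables miss; the arithmetic alone cannot say which.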

In the last generation, funding for compensatory and bilingual education programs has grown faster than funding for regular education. Could this have caused the results Grissmer documented? We can't know for sure because while Grissmer showed that family characteristics can't alone explain the relatively faster improvement of minority student achievement, his statistical analysis provides no evidence for evaluating what it is about schools that has probably made a difference.

But like earlier "then and now" studies, Grissmer's analysis of NAEP trends provides no support for the declining school achievement fable, particularly where minority students are concerned.

It may be that American schools are worse today than they once were. But without better historical data on student scores and on their characteristics, we simply have no way to confirm this judgment. Yet when claims of deterioration are repeated and refuted for nearly a century, we should be cautious. That we've been wrong before doesn't necessarily mean we're wrong today, but there's a good chance of it.

Richard Rothstein is a research associate of the Economic Policy Institute in Washington, D.C. E-mail: rothstei@oxy.edu. This article is adapted from his study, "The Way We Were?" published by Century Foundation Press.