The Golden Age of Standardized Test Creation
09/07/2010
Psychometrics is a relatively mature field of science, and a politically unpopular one. So you might think there isn't much money to be made in making up brand new standardized tests. Yet, there is.

From the NYT:

U.S. Asks Educators to Reinvent Student Tests, and How They Are Given
By SAM DILLON
Standardized exams – the multiple-choice, bubble tests in math and reading that have played a growing role in American public education in recent years – are being overhauled.

Over the next four years, two groups of states, 44 in all, will get $330 million to work with hundreds of university professors and testing experts to design a series of new assessments that officials say will look very different from those in use today.

The new tests, which Secretary of Education Arne Duncan described in a speech in Virginia on Thursday, are to be ready for the 2014-15 school year.

They will be computer-based, Mr. Duncan said, and will measure higher-order skills ignored by the multiple-choice exams used in nearly every state, including students’ ability to read complex texts, synthesize information and do research projects.

"The use of smarter technology in assessments," Mr. Duncan said, "makes it possible to assess students by asking them to design products of experiments, to manipulate parameters, run tests and record data."

I don't know what the phrase "design products of experiments" even means, so I suspect that the schoolchildren of 2014-15 won't be doing much of it.

Okay, I looked up Duncan's speech, "Beyond the Bubble Tests," and what he actually said was "design products or experiments," which almost makes sense, until you stop and think about it. Who is going to assess the products the students design? George Foreman? Donald Trump? (The Donald would be good at grading these tests: tough, but fair. Here's a video of Ali G pitching the product he designed, the "ice cream glove," to Trump.)

Because the new tests will be computerized and will be administered several times throughout the school year, they are expected to provide faster feedback to teachers than the current tests about what students are learning and what might need to be retaught.

Both groups will produce tests that rely heavily on technology in their classroom administration and in their scoring, she noted.

Both will provide not only end-of-year tests similar to those in use now but also formative tests that teachers will administer several times a year to help guide instruction, she said.

And both groups’ tests will include so-called performance-based tasks, designed to mirror complex, real-world situations.

In performance-based tasks, which are increasingly common in tests administered by the military and in other fields, students are given a problem – they could be told, for example, to pretend they are a mayor who needs to reduce a city’s pollution – and must sift through a portfolio of tools and write analytically about how they would use them to solve the problem.
Oh, boy ...

There is some good stuff here — adaptive tests are a good idea (both the military's AFQT and the GRE have gone over to them). But there's obvious trouble, too.
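"Adaptive" here means the computer keeps a running estimate of your ability and serves the next question at about that difficulty, so every minute of testing is spent where it's most informative. Here's a minimal sketch in Python, assuming a simple one-parameter (Rasch) item response model, a made-up item bank, and a crude update rule; the real AFQT and GRE use more sophisticated machinery, but the loop has this shape:

```python
import math
import random

# A minimal sketch of a computerized adaptive test (CAT) under a
# one-parameter (Rasch) IRT model. The item bank, update rule, and
# step sizes are hypothetical illustrations -- the actual AFQT and
# GRE algorithms use richer models and calibrated item pools.

def p_correct(ability, difficulty):
    # Rasch model: chance a student of this ability answers correctly.
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def run_cat(true_ability, difficulties, n_items=20):
    theta = 0.0                      # start the estimate at the mean
    unused = list(range(len(difficulties)))
    for k in range(1, n_items + 1):
        # The most informative Rasch item is the one whose difficulty
        # is closest to the current ability estimate.
        i = min(unused, key=lambda j: abs(difficulties[j] - theta))
        unused.remove(i)
        correct = random.random() < p_correct(true_ability, difficulties[i])
        # Crude stochastic update: step up after a right answer,
        # down after a wrong one, with a shrinking step size.
        theta += (1.0 if correct else -1.0) / k
    return theta

random.seed(0)
bank = [-3.0 + 6.0 * j / 199 for j in range(200)]    # 200 items, -3 to +3
print(run_cat(true_ability=1.2, difficulties=bank))  # rough estimate near 1.2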

Okay, so these new tests are going to be much more complex, much more subjective, and get graded much faster than fill-in-the-bubble tests? They'll be a dessert topping and a floor wax!

These sound a lot like the Advanced Placement tests offered to high school students, which usually include lengthy essays. But AP tests take two months to grade, and are only offered once per year (in May, with scores coming back in July), because they use high school teachers on their summer vacations to grade them.

There's no good reason why fill-in-the-bubble tests can't be scored quickly. A lot of public school bubble tests are graded slothfully, but they don't have to be. My son took the ERB's Independent School Entrance Exam on a Saturday morning and his score arrived at our house in the U.S. Mail the following Friday, six days later.

The only legitimate reason for slow grading is if there are also essays to be read, but in my experience, essay results tend to be dubious, at least below the level of Advanced Placement tests, where there is specific subject matter in common. The Writing test that was added to the SAT in 2005 has largely been a bust, with many colleges refusing to use it in the admissions process.

One often overlooked problem with any kind of writing test, for example, is that graders have a hard time reading kids' handwriting. You can't demand that kids type because millions of them can't. Indeed, writing test results tend to correlate with number of words written, which is often more of a test of handwriting speed than of anything else. Multiple choice tests have obvious weaknesses, but at least they minimize the variance introduced by small motor skills.

And the reference to "performance-based tasks" in which people are supposed to "write analytically" is naive. I suspect that Duncan and the NYT man are confused by all the talk during the Ricci case about the wonders of "assessment centers" in which candidates for promotion are supposed to sort through an in-basket and talk out loud about how they would handle problems. In other words, those are hugely expensive oral tests. The city of New Haven brought in 30 senior fire department officials from out of state to be the judges on the oral part of the test.

And the main point of spending all this money on an oral test is that an oral test can't be blind-graded. In New Haven, 19 of the 30 oral test judges were minorities, which isn't something that happens by randomly recruiting senior fire department officials from across the country.

But nobody can afford to rig the testing of 35,000,000 students annually.

Here are some excerpts from Duncan's speech:

President Obama called on the nation's governors and state education chiefs "to develop standards and assessments that don't simply measure whether students can fill in a bubble on a test, but whether they possess 21st century skills like problem-solving and critical thinking and entrepreneurship and creativity."
You know your chain is being yanked when you hear that schoolteachers are supposed to teach "21st century skills" like "entrepreneurship." So, schoolteachers are going to teach kids how to be Steve Jobs?

Look, there are a lot of good things to say about teachers, but, generally speaking, people who strive for union jobs with lifetime tenure and summers off are not the world's leading role models on entrepreneurship.

Further, whenever you hear teachers talk about how they teach "critical thinking," you can more or less translate that into "I hate drilling brats on their times tables. It's so boring." On the whole, teachers aren't very good critical thinkers. If they were, Ed School would drive them batty. (Here is an essay about Ed School by one teacher who is a good critical thinker.)

And last but not least, for the first time, the new assessments will better measure the higher-order thinking skills so vital to success in the global economy of the 21st century and the future of American prosperity. To be on track today for college and careers, students need to show that they can analyze and solve complex problems, communicate clearly, synthesize information, apply knowledge, and generalize learning to other settings. ...

Over the past 19 months, I have visited 42 states to talk to teachers, parents, students, school leaders, and lawmakers about our nation's public schools. Almost everywhere I went, I heard people express concern that the curriculum had narrowed as more educators "taught to the test," especially in schools with large numbers of disadvantaged students.

Two words: Disparate Impact.

The higher the intellectual skills that are tested, the larger the gaps between the races will turn out to be. Consider the AP Physics C exam, the harder of the two AP physics tests: In 2008, 5,705 white males earned 5s (the top score) versus six black females.

In contrast, tests of rote memorization, such as having third graders chant the multiplication tables, will have smaller disparate impact than tests of whether students "can analyze and solve complex problems, communicate clearly, synthesize information, apply knowledge, and generalize learning to other settings." That's a pretty decent description of what IQ tests measure.

Duncan says that the new tests could replace existing high school exit exams that students must pass to graduate.

Many educators have lamented for years the persistent disconnect between what high schools expect from their students and the skills that colleges expect from incoming freshmen. Yet both of the state consortia that won awards in the Race to the Top assessment competition pursued and got a remarkable level of buy-in from colleges and universities.

... In those MOUs, 188 public colleges and universities and 16 private ones agreed that they would work with the consortium to define what it means to be college-ready on the new high school assessments.

The fact that you can currently graduate from high school without being smart enough for college is not a bug, it's a feature. Look, this isn't Lake Wobegon. Half the people in America are below average in intelligence. They aren't really college material. But they shouldn't all have to go through life branded as a high school dropout instead of high school graduate because they weren't lucky enough in the genetic lottery to be college material.

The Gates Foundation and the U. of California ganged up on the LA public schools to get the school board to pass a rule that nobody will be allowed to graduate who hasn't passed three years of math, including Algebra II. That's great for UC, not so great for an 85-IQ kid who just wants a high school diploma so employers won't treat him like (uh oh) a high school dropout. But nobody gets that.

Another benefit of Duncan's new high stakes tests will be Smaller Sample Sizes of Questions:

With the benefit of technology, assessment questions can incorporate audio and video. Problems can be situated in real-world environments, where students perform tasks or include multi-stage scenarios and extended essays.

By way of example, the NAEP has experimented with asking eighth-graders to use a hot-air balloon simulation to design and conduct an experiment to determine the relationship between payload mass and balloon altitude. As the balloon rises in the flight box, the student notes the changes in altitude, balloon volume, and time to final altitude. Unlike filling in the bubble on a score sheet, this complex simulation task takes 60 minutes to complete.

So, the NAEP has experimented with this kind of question. How did the experiment work out?

You'll notice that the problem with using up 60 minutes of valuable testing time on a single multipart problem instead of, say, 60 separate problems is that it radically reduces the sample size. A lot of kids will get off track right away and get a zero for the whole one-hour segment. Other kids will have seen a hot-air balloon problem the week before, nail the whole thing, and get a perfect score for the hour.
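Here's a back-of-the-envelope way to see it: a toy Python simulation, with made-up numbers, in which each simulated student takes two parallel forms of each kind of test, and we check how well the scores agree with themselves (a rough proxy for reliability).

```python
import math
import random

# Toy simulation of the sampling argument above, with made-up numbers.
# Each student takes two parallel forms; the correlation between the
# two scores is a rough proxy for reliability.

random.seed(1)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def sixty_items(theta):
    # Sixty independent one-point questions: luck averages out.
    return sum(random.random() < logistic(theta) for _ in range(60))

def one_big_task(theta):
    # All-or-nothing hour: get off track early and score near zero,
    # or stay on track and score near the maximum.
    return 55 if random.random() < logistic(theta) else 5

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

students = [random.gauss(0, 1) for _ in range(5000)]
for form in (sixty_items, one_big_task):
    first = [form(t) for t in students]
    second = [form(t) for t in students]  # parallel re-test
    print(form.__name__, round(corr(first, second), 2))
```

With these toy numbers, the sixty-item form reproduces itself across administrations far better (a test-retest correlation somewhere up around 0.85) than the single all-or-nothing task (somewhere down around 0.2). Same hour of testing, far less information per student.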

That kind of thing is fine for the low stakes NAEP where results are only reported by groups with huge sample sizes (for example, the NAEP reports scores for whites, blacks, and Hispanics, but not for Asians). But for high stakes testing of individual students and of their teachers, it's too random. AP tests have large problems on them, but they are only given to the top quarter or so of high school students in the country, not the bottom half of grade school students.
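The arithmetic behind that distinction is just the law of large numbers: per-student error cancels in a group average, so a group mean's standard error shrinks with the square root of the group's size. A two-line illustration, with a made-up per-student error SD of 15 points:

```python
import math

# Standard error of a group mean with per-student error SD of 15
# (a made-up figure): 15 / sqrt(n).
for n in (1, 100, 10000):
    print(n, 15 / math.sqrt(n))  # 15.0, 1.5, 0.15
```

Noise that swamps one kid's score is a rounding error on a mean taken over ten thousand kids, which is why the NAEP can afford question formats that a high-stakes individual test can't.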

It's absurd to think that it's all that crucial that all American schoolchildren must be able to "analyze and solve complex problems, communicate clearly, synthesize information, apply knowledge, and generalize learning to other settings." You can be a success in life without being able to do any of that terribly well.

Look, for example, at the Secretary of Education. Arne Duncan has spent 19 months traveling to 42 states, talking about testing with teachers, parents, school leaders, and lawmakers. Yet, has he been able to synthesize information about testing terribly well at all? Has his failure to apply knowledge and generalize learning about testing gotten him fired from the Cabinet?
