Who Is Going To Win? Does Nate Silver Know?

In Maclean’s, Colby Cosh of Edmonton (who has always seemed like the smartest man in North American journalism) gives a balanced assessment of the Silver Age:

The whole world is suddenly talking about election pundit Nate Silver, and as a longtime heckler of Silver I find myself at a bit of a loss. These days, Silver is saying all the right things about statistical methodology and epistemological humility; he has written what looks like a very solid popular book about statistical forecasting; he has copped to being somewhat uncomfortable with his status as an all-seeing political guru, which tends to defuse efforts to make a nickname like “Mr. Overrated” stick; and he has, by challenging a blowhard to a cash bet, also damaged one of my major criticisms of his probabilistic presidential-election forecasts. That last move even earned Silver some prissy, ill-founded criticism from the public editor of the New York Times, which could hardly be better calculated to make me appreciate the man more.

The situation is that many of Nate Silver’s attackers don’t really know what the hell they are talking about. Unfortunately, this gives them something in common with many of Nate Silver’s defenders, who greet any objection to his standing or methods with cries of “Are you against SCIENCE? Are you against MAAATH?” If science and math are things you do appreciate and favour, I would ask you to resist the temptation to embody them in some particular person …

Silver is a terrific advocate for statistical literacy. But it is curious how often he seems to have failed upward almost inadvertently. Even this magazine’s coverage of Silver mentions the means by which he first gained public notice: his ostensibly successful background as a forecaster for the Baseball Prospectus website and publishing house.

Silver built a system for projecting future player performance called PECOTA—a glutinous mass of Excel formulas that claimed to offer the best possible guess as to how, say, Adam Dunn will hit next year. PECOTA, whose contents were proprietary and secret and which was a major selling point for BPro, quickly became an industry standard for bettors and fantasy-baseball players because of its claimed empirical basis. Unlike other projection systems, it would specifically compare Adam Dunn (and every other player) to similar players in the past who had been at the same age and had roughly the same statistical profile.

Colby’s on to an interesting distinction here between modern philosophy of science’s idealization of forecasting as the acid test of SCIENCE (which we also assume to be related to the quest for reductionism, Occam’s Razor, transparency, peer review, all that kind of elegant stuff) and the messy reality of forecasting as a profit-making business. In my corporate career, I got drafted into building sales forecasting models for both companies in the industry, so I’ve had firsthand experience with these issues.

For most players in most years, Silver’s PECOTA worked pretty well. But the world of baseball research, like the world of political psephology, does have its cranky internet termites. They pointed out that PECOTA seemed to blunder when presented with unique players who lack historical comparators, particularly singles-hitting Japanese weirdo Ichiro Suzuki.

Ichiro is one of baseball history’s more consistent players, across two decades and two continents, so his future stats have been pretty easy for even the most superficial fan to predict: just project the trendline to account for aging. But Silver’s proprietary system couldn’t believe Ichiro could continue to get all those obviously fluky infield hits before regression to the mean crashed in, so PECOTA routinely underpredicted his performance, although by less and less as Silver added some kind of specialized secret counter-Japanese-weirdo gizmo or gizmos to his model to make his Ichiro forecasts less wrong.

Notice, however, that by making his black box forecast more accurate, Silver was making it scientifically less useful. Back then, Silver was in effect saying: Based on everything we know about slap hitters, Ichiro is due for a comeuppance this year. Nobody can continue to accumulate seeing-eye singles and slow rollers and all the lucky crud that Ichiro got last year. And, year after year, Silver was wrong, which suggested that Ichiro wasn’t like the other slap hitters, and that American baseball minds needed to reverse engineer this Japanese import and figure out exactly how he does it. Maybe we could train a few guys over here to do it too, or maybe we could select more young players with some of Ichiro’s obscure gifts.

But, over time, as Silver added secret adjustments to his hidden model, that Ichiro Anomaly became less glaring.

Science progresses by the accumulation of wrong predictions. For example, if Isaac Newton had set up a proprietary firm called Astronomy Prospectus that owned a black box model for predicting planetary orbits based on Newton’s secret Law of Gravity, his successors could have made minor adjustments in their forecasts to account for the Mercury Anomaly. So, who needs Einstein’s General Theory of Relativity when Astronomy Prospectus can just nudge its forecasts?

More importantly, PECOTA produced reasonable predictions, but they were only marginally better than those generated by extremely simple models anyone could build. The baseball analyst known as “Tom Tango” (a mystery man I once profiled for Maclean’s, if you can call it a profile) created a baseline for projection systems that he named the “Marcels” after the monkey on the TV show Friends—the idea being that you must beat the Marcels, year-in and year-out, to prove you actually know more than a monkey. PECOTA didn’t offer much of an upgrade on the Marcels—sometimes none at all.

Einstein is famously quoted as saying that science should be as simple as possible, but no simpler. That still leaves a lot of latitude.

PECOTA came under added scrutiny in 2009, when it offered an outrageously high forecast—one that was derided immediately, even as people waited in fear and curiosity to see if it would pan out—for Baltimore Orioles rookie catcher Matt Wieters. Wieters did have a decent first year, but he has not, as PECOTA implied he would, rolled over the American League like the Kwantung Army sweeping Manchuria. By the time of the Wieters Affair, Silver had departed Baseball Prospectus for psephological godhood, ultimately leaving his proprietary model behind in the hands of a friendly skeptic, Colin Wyers, who was hired by BPro. In a series of 2010 posts by Wyers and others called “Reintroducing PECOTA”—though it could reasonably have been entitled “Why We Have To Bulldoze This Pigsty And Rebuild It From Scratch”—one can read between the lines. Or, hell, just read the lines.

Behind the scenes, the PECOTA process has always been like Von Hayes: large, complex, and full of creaky interactions and pinch points… The numbers crunching for PECOTA ended up taking weeks upon weeks every year, making for a frustrating delay for both authors of the Baseball Prospectus annual and fantasy baseball players nationwide. Bottlenecks where an individual was working furiously on one part of the process while everyone else was stuck waiting for them were not uncommon. To make matters worse, we were dealing with multiple sets of numbers.

…Like a Bizarro-world subway system where texting while drunk is mandatory for on-duty drivers, there were many possible points of derailment, and diagnosing problems across a set of busy people in different time zones often took longer than it should have. But we plowed along with the system with few changes despite its obvious drawbacks; Nate knew the ins and outs of it, in the end it produced results, and rebuilding the thing sensibly would be a huge undertaking. We knew that we weren’t adequately prepared in the event that Nate got hit by a bus, but such is the plight of the small partnership.

…As the season progressed, we had some of our top men—not in the Raiders of the Lost Ark meaning of the term—look at the spreadsheet to see how we could wring the intellectual property out of it and chuck what was left. But in addition to the copious lack of documentation, the measurables from the latest version of the spreadsheet I’ve got include nice round numbers like 26 worksheets, 532 variables, and a 103 MB file size. The file takes two and a half minutes to open on this computer, a fairly modern laptop. The file takes 30 seconds to close on this computer. …We’ve continued to push out PECOTA updates throughout the 2010 season, but we haven’t been happy with their presentation or documentation, and it’s become clear to everyone that it’s time to fix the problem once and for all.

The stuff in italics is not Colby talking; it’s Nate Silver’s successors at Baseball Prospectus talking.

I can sympathize with Silver, because his Excel-based PECOTA model / ball of twine reminds me of the Excel-based sales forecast model I built for the other firm in the CPG marketing data industry after building an almost unbreakable Lotus 1-2-3 3.0 sales forecasting model for the first firm.

The third release of the once-famous Lotus spreadsheet at the end of the 1980s was designed to make decentralized forecasting and budgeting simple before the Internet by elegantly implementing a 3D model in which identical spreadsheets could be easily stacked and summarized.

And it was simple. In Lotus, I sent each regional sales manager a spreadsheet on which he or she would list every sales proposal, its size, some other details, and its chance of closing this quarter. Multiply the dollars by the probabilities, sum, and there’s your regional sales forecast. Each manager would FedEx the diskette (this is c. 1990) back to me at the Chicago HQ, and I would have Lotus 3.0 aggregate the numbers across all the regional sheets and give the national forecast for the quarter to the CEO.
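The arithmetic here is just a probability-weighted sum, rolled up by region and then nationally. A minimal sketch in Python, assuming made-up regions, deal sizes, and closing probabilities (none of these figures or names come from the actual Lotus system):

```python
from collections import defaultdict

# Hypothetical proposals: (region, dollars, chance of closing this quarter).
proposals = [
    ("Midwest", 100_000, 0.5),
    ("Midwest", 40_000, 0.25),
    ("West", 200_000, 0.75),
]

def regional_forecasts(rows):
    """Sum dollars * probability for each region."""
    totals = defaultdict(float)
    for region, dollars, prob in rows:
        totals[region] += dollars * prob
    return dict(totals)

def national_forecast(rows):
    """Aggregate the regional forecasts into one national number."""
    return sum(regional_forecasts(rows).values())

by_region = regional_forecasts(proposals)
national = national_forecast(proposals)
```

Each regional number stays traceable to the individual proposals behind it, which is the point of keeping the model this simple.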

A while later, after I’d followed the firm’s Chairman into his second (and not quite as successful) high-tech startup, the CEO got hired away by the archrival firm, and he hired me to build him the same sales forecasting system.

The only problem was that the other corporation had standardized on Microsoft Office, including Excel, so I had to build the system in Excel, not Lotus 3.0. Even though I knew how to do it now, building the same system in Excel took about three times as long, including about 100 hours on the phone with Excel tech support in Redmond. On each phone call, I’d begin by explaining to the MS Excel wizard what the goal of my project was, and each one would reply that that sounded fascinating, but he’d never heard of anybody building a national forecasting system using Excel, and that, as far as he knew, nobody at Microsoft had ever contemplated that use when designing Excel. In theory, Excel supported drilling down through multiple worksheets, but, in practice, it was an ordeal to build, and, worse, practically impossible to explain to anybody else how it worked.

At my first company, when I left for the start-up, my assistants carried on running the Lotus-based sales forecasting system effortlessly. When I left the second company, I printed up a 45-page guide to how to keep the sales forecasting system running, which was about an order of magnitude longer than would have been required with Lotus.

Of course, Excel went on to become the global standard in spreadsheets and Lotus 1-2-3 vanished. The idea of non-programmers building large systems in spreadsheets largely vanished with it, as well. Who knows how much productivity has been lost due to Excel’s dominion?

Let’s come back to the Philosophy of Science questions. In my sales forecasting systems, I did not attempt to build in ad hoc Ichiro-style adjustments. I kept them super-reductionist. I could have made my forecasts more accurate by putting in stuff like Smith is always overoptimistic by 15% until the week before the quarter closes, while Jones is notorious for sandbagging 10% of her likely revenue. But, I went instead for total transparency for my bosses at HQ and total reporting for their regional sales managers. If the national forecast was wrong, it was because specific regional sales managers were wrong, and it’s up to those individuals to correct their biases and delusions, or face the consequences. Being honest and realistic with HQ was part of their job, and part of my job was to make clear when they weren’t.
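To make the trade-off concrete, here is a rough sketch in Python of the two approaches. The reported figures are invented, and the bias factors are one assumed reading of "overoptimistic by 15%" and "sandbagging 10%" (Smith reports 1.15 times what closes, Jones reports 0.90 times her likely revenue); nothing like the adjusted version existed in my actual system.

```python
# Hypothetical reported forecasts by regional manager (made-up figures).
reported = {"Smith": 115_000.0, "Jones": 90_000.0}

# Assumed reading of the biases: Smith reports 15% above what closes,
# Jones reports only 90% of her likely revenue.
bias_factor = {"Smith": 1.15, "Jones": 0.90}

def transparent_total(reports):
    """Pass the managers' own numbers through untouched."""
    return sum(reports.values())

def adjusted_total(reports, factors):
    """Quietly divide out each manager's assumed bias."""
    return sum(amount / factors[name] for name, amount in reports.items())

transparent = transparent_total(reported)
adjusted = adjusted_total(reported, bias_factor)
```

The adjusted total may well land closer to the truth, but it hides who was wrong and by how much, which is the information HQ needs to hold the managers accountable.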

The corporate officers’ jobs, however, included reporting profit forecasts to stock market analysts, and they could massage the numbers I reported to them for known biases (or hunches, hopes, or whatever, with only worries about their reputations with analysts and fears of shareholder lawsuits to rein them in).

In contrast to what I did, Silver is running a proprietary non-transparent black box business. He has incentives to be accurate, but he has other incentives as well, such as providing comforting fare to biased readers of the NYT and keeping his system secret.

Colby continues:

If the history of Silver’s PECOTA is new to you, and you’re shocked by brutal phrases like “wring the intellectual property out of it and chuck what was left”, you should now have the sense to look slightly askance at the New PECOTA, i.e., Silver’s presidential-election model. When it comes to prestige, it stands about where PECOTA was in 2006. Like PECOTA, it has a plethora of vulnerable moving parts. Like PECOTA, it is proprietary and irreproducible. That last feature makes it unwise to use Silver’s model as a straw stand-in for “science”, as if the model had been fully specified in a peer-reviewed journal.

Silver has said a lot about the model’s theoretical underpinnings, and what he has said is all ostensibly convincing. The polling numbers he uses as inputs are available for scrutiny, if (but only if) you’re on his list of pollsters. The weights he assigns to various polling firms, and the generating model for those weights, are public. But that still leaves most of the model somewhat obscure, and without a long series of tests—i.e., U.S. elections—we don’t really know that Nate is not pulling the numbers out of the mathematical equivalent of a goat’s bum.

Unfortunately, the most useful practical tests must necessarily come by means of structurally unusual presidential elections. The one scheduled for Tuesday won’t tell us much, since Silver gives both major-party candidates a reasonable chance of victory and there is no Ross Perot-type third-party gunslinger or other foreseeable anomaly to put desirable stress on his model.

I’m reminded of the 1996 election, when the consensus was Clinton over Dole by double digits, but it turned out considerably closer. Among pollsters, only Zogby got the margin right. This did wonders for Zogby’s career, but the whole incident remains shrouded. What did Zogby know that nobody else knew? Anything? Or was he just lucky? Who knows? His system was proprietary.

By the way, the failure of polls in 1996 helps explain why so many pundits didn’t question the catastrophic failure of the exit polls in 2004 to pick the winner of the election. The afternoon of Election Day 2004, the word swept the country that the exit polls showed Kerry winning easily. This was widely accepted as true, in part because the reputation of telephone polls before the election had been badly dented in 1996. But, it turned out the exit polls were biased.

The funniest recent forecasting fiasco was Election Night in 2000 when the networks first called Florida for Gore even though Bush had a big lead in the partially counted vote, making Gore the Presumptive President. Then they switched and called Florida for Bush, declaring the country for Bush, even though Bush’s lead in the actual votes counted was shrinking relentlessly. To me, watching at home, a simple trendline suggested that when they got to 99.9% of the votes counted, Florida would be virtually tied. Eventually, the networks figured that out too and switched Florida back to uncalled.
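The kind of back-of-the-envelope trendline I mean can be sketched in a few lines of Python: fit a straight line to the leader's margin against the share of votes counted, then extend it to a full count. The snapshots below are invented for illustration and are not the actual 2000 Florida returns.

```python
# Hypothetical snapshots: (fraction of vote counted, leader's margin in votes).
snapshots = [(0.80, 100_000.0), (0.90, 50_000.0), (0.95, 25_000.0)]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def project_margin(points, fraction_counted=1.0):
    """Extrapolate the fitted line to the given fraction of votes counted."""
    slope, intercept = fit_line(points)
    return slope * fraction_counted + intercept

projected = project_margin(snapshots)
```

With these made-up numbers the margin shrinks at a steady rate and extrapolates to roughly zero at a full count. A straight line is a crude assumption, since late-counted precincts need not resemble early ones, but it is enough to cast doubt on a call for the candidate whose lead is evaporating.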