How a Minor Calculation Error Cost Intel Half a Billion Dollars

A version of this post originally appeared on Tedium, a twice-weekly newsletter that hunts for the end of the long tail.

For the past few decades, Intel has been by far the largest and most influential maker of processor chips in the world, long outlasting many competitors such as Motorola and IBM on the way to sales supremacy. But the company, as it preps its next generation of laptop processors (code-named Tiger Lake), faces more competition than ever for its decades-long reign at the top of the processor world.

AMD has been besting it on the desktop processor front for quite a while, and thanks to Apple, ARM-based processors are destined to go mainstream in desktop computing form factors.

But Intel, still generally considered atop the laptop processor heap, has been on the ropes of negative public opinion before, with the one-two punch of Meltdown and Spectre a particularly damaging blow. And back in the 1990s, a mathematician found a calculation error that threatened an entire processor line just before computing truly went mainstream.

With all that said, let’s look back at the floating-point glitch that, for a time, nearly turned the Pentium into a laughingstock.

1919

The year that Norwegian mathematician Viggo Brun proved that the sum of the reciprocals of twin primes (prime numbers that differ by 2, such as 3 and 5) converges to a finite value, known as Brun’s constant. Based on a calculation covering all twin primes up to 10 to the 14th power, the constant stands at around 1.902160578, according to former University of Lynchburg mathematics professor Thomas Nicely. Nicely plays a key role in our story.
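To make the definition concrete, here is a minimal Python sketch that adds up the reciprocals of twin primes below a small limit. The function names and the sieve-based approach are illustrative assumptions, not a description of Nicely’s actual method, which covered a vastly larger range.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: return all primes <= limit."""
    if limit < 2:
        return []
    sieve = bytearray([1]) * (limit + 1)
    sieve[0] = sieve[1] = 0
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            # Mark every multiple of n, starting at n*n, as composite.
            sieve[n * n :: n] = bytearray(len(range(n * n, limit + 1, n)))
    return [n for n in range(limit + 1) if sieve[n]]


def brun_partial_sum(limit):
    """Sum 1/p + 1/(p+2) over twin prime pairs (p, p+2) with p + 2 <= limit."""
    primes = primes_up_to(limit)
    prime_set = set(primes)
    return sum(1 / p + 1 / (p + 2) for p in primes if p + 2 in prime_set)


if __name__ == "__main__":
    for limit in (10, 1_000, 100_000):
        print(limit, brun_partial_sum(limit))
```

The partial sums creep upward very slowly, which is why pinning down more digits of the constant takes an enormous number of divisions, and why a rare division error was more likely to surface in this kind of work than in everyday use.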


Image: pixel2013/Pixabay

Intel’s big Pentium PR headache was caused by the wrong person finding a needle in a gigantic haystack

To get to the root of this tale, let’s play a logic exercise, shall we? Here we go:

There is somethIng wrong with this sentence.

Did you catch it? What mistake did I make when I typed out the sentence above? Is it obvious, or subtle? Did it affect your understanding of what I wrote, or was it easy to trip over it?

Go on, look again. I’ll wait.

OK, you probably noticed it—the capital “I” in “somethIng.” I apologize for the error, which is something I do a lot in my personal life.

Now, imagine if that was the only spelling or grammatical error on this entire site. (It’s not.) And, perhaps, you duplicated this site a couple of times, correcting the “I” so it wasn’t capitalized in that one word. But on one version of the site, that one error persists.

Now imagine if I had millions of people poring over every phrase that I’ve ever shared on this site, some more than others, looking for this error. And one day, someone finds it. And that someone is an influential editor. And this error, though in the grand scheme very minor, becomes enough to threaten my reputation as a writer.

This is the grammatical equivalent of what Thomas Nicely unwittingly did to Intel in October of 1994, after getting his hands on a new Pentium processor. While hunting for a more precise estimate of Brun’s constant, he relied on the processor’s floating-point capabilities and realized that the answers it was giving were a bit off.


Thomas Nicely, as shown in 1984. Image: University of Lynchburg

For most people, an error like this would probably go unnoticed. After all, it’s not like Doom was going to stop working because the Pentium was giving the wrong answer. But for Nicely’s specific use case, it was a problem, because it threw off his results and undermined the calculation he was running. In a 1994 CNN interview recounted on Usenet at the time, Nicely said this of the saga:

I was working on a research problem in pure mathematics; it involved prime numbers and a huge number of divisions had to be performed over a long period of time and in the process of the calculation, I discovered at one point that there was a discrepancy and it took several months to track it down. And it turned out to be the least likely suspect of all: the chip.

The 60-MHz Pentium chip that Nicely got his hands on was the culprit, and it took him a few months to properly diagnose that the problem came down to the CPU.

The problem Nicely ran into was a major headache for him and for other mathematicians doing similar work, and not really for anyone else. But even a single error like that was enough to damage the high-profile Pentium chip’s reputation in the extremely technical field of mathematics.

After Nicely reported the error on CompuServe on October 30, 1994, it became one of the first stories to truly go viral thanks to the internet. Just a few days later, someone posted about the problem on the Usenet group comp.sys.intel confirming the floating-point error, and from there, the saga gained a life of its own after the story was picked up in the news by EE Times, an engineering trade publication. 

“It looks to me like the Pentium, both the 60 and the 90MHz models, only carry [floating-point] divisions to single precision,” wrote Norwegian programmer Terje Mathisen, confirming the error.
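For readers who want to see what a check for this kind of error looks like, here is a minimal Python sketch. The operands 4195835 and 3145727 are not from this post; they are the pair most widely circulated as a test case at the time, and the function name is purely illustrative. On a correctly working floating-point unit, dividing and then multiplying back leaves a residual of zero. Affected Pentiums were widely reported to return 256 for the same expression.

```python
def division_residual(x, y):
    """Divide x by y, multiply back by y, and report how far off we landed."""
    return x - (x / y) * y


if __name__ == "__main__":
    # Operand pair commonly cited as a test for the flawed divider
    # (an outside detail, not something stated in this article).
    residual = division_residual(4195835.0, 3145727.0)
    if residual == 0.0:
        print("Division looks correct for this operand pair.")
    else:
        print(f"Unexpected residual: {residual}")
```

A clean result here only says that this one pair divides correctly; it is a spot check, not proof that a divider is free of errors.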

From there, the story started to draw attention in the engineering and mathematics spaces. But the real problem might have been that Intel made an even bigger error—a business error—in response to this problem. By the end of November and the beginning of December, this story, which started with one person making an observation, blew up into one of the biggest technology stories of 1994, a year when the internet started to creep into mainstream contexts for the first time, however awkwardly.

But it’s worth considering that while the base issue was a good ol’ chip design error, the true problem lay in Intel’s handling of the saga. In short, Intel’s best users were feeling disrespected.

1913

The year that mathematician Émile Borel first posited the infinite monkey theorem, the famed idea that if a million monkeys typed on a million typewriters for ten hours a day, they would eventually type a great work of literature. (Some are literally trying to confirm this concept in the cloud, because of course.) In many ways, the problem Nicely stumbled upon was something of the academic-world equivalent of that.


An Intel Pentium chip. Image: Krzysztof Burghardt/Wikimedia Commons

The real reason the saga hurt Intel so much comes down to its reaction

The problem for Intel is less that the error was there and more how the company handled its discovery.

Going back to my earlier example, if I got an email from a very high-level editor telling me that I accidentally had a capital letter in the middle of one word out of my entire site, how would you expect I would react?

Likely, I would quietly just fix the error and be done with it. But Intel couldn’t do that. Like a correction in a newspaper, an error in a chip is basically final. (Well, unless you’re using something like a field-programmable gate array.) The best they can do is to remove it in future versions. While software can mitigate the problem to some degree, if something is wrong in a chip, they can’t exactly fix it without completely replacing it.

And to be clear, the floating-point error, as bad as it was, was minor in the grand scheme. It would be like owning a calculator and having it give you a wrong answer a single time out of the entire time you owned that calculator.

By modern comparison, the more recent processor flaws Meltdown and Spectre, the former of which was baked into the design of most Intel, Power Architecture, and ARM chips released throughout the past 20 years, were far more damaging overall. They weren’t theoretical problems—they were fundamental security risks. The solution to repair the issues uncovered by these two flaws meant that hardware and software makers had to go out of their way to turn off features of the processor, ensuring that people’s computers would have to run slower. In some use cases, like cloud computing, a change like this literally means that using the same processor will cost you more money and time to do the same amount of work.

And beyond the reputational damage, Intel still has folks trying to fix those flaws.

So what to say of the floating-point error? Thomas Nicely, the academic who discovered it, generally understood that while it was a problem for him, the sheer complexity of computer processors at the time meant that the flaw might never have surfaced had he, specifically, not run into it.

“The current generation of microprocessors has become so complex that it’s no longer possible to completely debug one,” he told PC Magazine in early 1995.

But Intel certainly could have handled things better. As recalled by mathematician and MATLAB developer Cleve Moler in 2013, Intel’s initial response to customers left a lot to be desired. Per Moler, here’s what the customer support apparatus sent customers:

There has been a lot of communication recently on the Internet about a floating point flaw on the Pentium processor. For almost all users, this is not a problem.

Here are the facts. Intel detected a subtle flaw in the precision of the divide operation for the Pentium processor. For rare cases (one in nine billion divides), the precision of the result is reduced. Intel discovered this subtle flaw during on going testing after several trillions of floating point operations in our continuing testing of the Pentium processor. Intel immediately tested the most stringent technical applications that use the floating point unit over the course of months and we have been unable to detect any error. In fact, after extensive testing and shipping millions of Pentium processor-based systems there has only been one reported instance of this flaw affecting a user to our knowledge, In this case, a mathematician doing theoretical analysis of prime numbers and reciprocals saw reduced precision at the 9th place to the right of the decimal.

In fact, extensive engineering tests demonstrated that an average spreadsheet user could encounter this subtle flaw of reduced precision once in every 27,000 years of use. Based on these empirical observations and our extensive testing, the user of standard off-the-shelf software will not be impacted. If you have this kind of prime number generation or other complex mathematics, call 1 800 628-8686 (International) 916 356-3551). If you don’t, you won’t encounter any problems with your Pentium processor-based system. If ever in the life of the computer this becomes a problem, Intel will work with the customer to resolve the issue.

As I pointed out above, Intel very much had a case of the million-monkeys problem on its hands. The problem that the passage highlights is that Intel knew about it before Nicely reached out, and sort of let it go. Part of the reason this was problematic is that the response reflected a shift in focus away from the technical community, which cared about this problem, and toward the average consumer, who didn’t. Intel was putting the full-court press on the consumer, leaning on its Intel Inside branding campaign along with the consumer-friendly (and trademark-friendly) Pentium name.

But in trying to win over a general audience, Intel seemed to hint that it was no longer taking its existing user base seriously.

For those focused on technical applications, the floating-point division situation created uncertainty, and Intel’s response just didn’t do the situation justice. In a 1994 Wall Street Journal article, Jet Propulsion Laboratory researcher Dave Bell made clear that the confusion over the chips would likely discourage the scientific community from using Pentiums.

“There are a lot of people who do research and have to stand up and publish their results based on computer simulations,” Bell said. “Maybe one of the questions now will be, ‘Was it done on a chip with the bug or without the bug?’”


Late Intel CEO Andy Grove, who died in 2016. Image: Intel Free Press/Wikimedia Commons

Eventually, Intel CEO Andy Grove took to social media, which in late 1994 meant posting a response on comp.sys.intel. The posting did not go well, especially for someone as technical as Grove. At first, Intel’s Richard Wirt posted it, leading to accusations that the response came from an impostor. Then Grove posted it again, in his personal voice, emphasizing that he took the problem seriously and noting that the issue only emerged on their side more than a year after the processor’s initial release.

“We held the introduction of the chip several months in order to give them more time to check out the chip and their systems,” he wrote, while emphasizing that no chip was perfect. “We worked extensively with many software companies to this end as well.”

As responses go, it struck a much better pose than the customer service message that angered so many technical users. But if you look at the thread, you’ll see that Grove had a lot of trolls to deal with.

“It really pisses me off that I paid good money for this chip,” one respondent wrote. “But because I don’t do math-intensive work for some big company who probably got their Pentiums wholesale, I am dirt.”

The dynamic is not unlike what you might see if someone posts a bad take on Twitter today.

Around the time Grove’s message went online right after Thanksgiving 1994, the story started to hit the mainstream press in a big way … and the company saw its stock take a big hit.

The timing in many ways couldn’t have been worse: 1994 was the year many families brought home multimedia, internet-capable home computers for the first time, many with Pentium chips, which had explicitly been branded as things average consumers could buy. And here was Grove, the day after Black Friday, having to ease the concerns of technical users and academics … while the mainstream press was effectively having to dumb down the saga for the public.

Some took advantage of this situation—IBM, which was in the midst of releasing some of its first PowerPC machines to the public, removed the Pentium chips from its devices and publicly claimed that the average consumer would run into the bug every 24 days, rather than 27,000 years. (Perhaps the truth was somewhere in the middle?) It was not a good time to be Intel.
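The gap between those two estimates is easier to judge with a little arithmetic. The sketch below takes Intel’s stated rate of one error per nine billion divides and works backward to the number of divisions per day each estimate would imply. Holding that error rate fixed for both figures is an assumption made for illustration; IBM’s analysis may well have used a different rate, and the names here are hypothetical.

```python
DIVIDES_PER_ERROR = 9_000_000_000  # Intel's "one in nine billion divides"
DAYS_PER_YEAR = 365.25


def implied_divides_per_day(days_between_errors,
                            divides_per_error=DIVIDES_PER_ERROR):
    """Divisions per day needed to hit one error in the given number of days."""
    return divides_per_error / days_between_errors


# Intel's claim: one error every 27,000 years for a spreadsheet user.
intel_rate = implied_divides_per_day(27_000 * DAYS_PER_YEAR)

# IBM's claim: one error every 24 days (same error rate assumed here).
ibm_rate = implied_divides_per_day(24)

if __name__ == "__main__":
    print(f"27,000-year estimate implies about {intel_rate:,.0f} divides per day")
    print(f"24-day estimate implies about {ibm_rate:,.0f} divides per day")
```

Under that assumption, the 27,000-year figure corresponds to under a thousand divisions a day, while the 24-day figure would need hundreds of millions. The disagreement was largely about how hard a typical user works the divider, which is a question neither number settles on its own.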

The PR crisis eventually saw its inevitable conclusion. Just before Christmas, Intel saw the writing on the wall, and recalled the chips. Brun’s constant was the one that pulled the trigger.

$475 Million

The approximate cost of the write-off related to the Pentium chips Intel had to replace after its recall, under which any consumer who wanted to swap out a new processor could get a corrected version. Despite the charge, Intel saw sales of its 486 and Pentium processors leap during the 1994 holiday season. (Perhaps the additional press was a good thing?)

“Bad companies are destroyed by crises; good companies survive them; great companies are improved by them.”

Say what you will about the Pentium floating-point division bug saga, but Intel found a good way to turn this crisis into a major learning moment of sorts.

This is highlighted by the company’s decision to convert the infamously broken chips into keychains with the above quote from Grove, basically ever-present reminders to employees that they’re not perfect, but they’re learning from their mistakes.

The Pentium processor, even with the flaw, became one of the defining technology launches of the ’90s, succeeding at its goal of upgrading the CPU’s role in a computer from a mere component hiding in the box into a household name. And one could make the case that, while it certainly upset some of its more technical users, the saga raised the profile of the company among the same general computing users it was trying to reach with its emphasis on branding.

And not to be lost here is that this saga actually made a mathematician famous, which is not something you can say every day. Thomas Nicely—whose prior claim to fame, before the fateful equation that cost Intel half a billion dollars, was a board game that predicted the future success of fantasy football—admittedly didn’t see it coming.

“Mathematicians in general have very private lives,” he said at the peak of the scandal in an Associated Press interview. “I find it kind of embarrassing to see my own name in print.”

Nicely, who died last year and was largely seen as legendary by his peers, spent roughly three decades at the University of Lynchburg before retiring in 2000.

One side effect of the scandal that, for a time, changed his life? He helped to extend Brun’s constant just a little bit further.