You need to use error-correcting memory

barrkel · on Feb 8, 2010

> So the probability of having at least one bit error in 4 gigabytes of memory at sea level on planet Earth in 72 hours is over 95% .

This is misleading. A flaky machine will indeed see bit errors, and it will probably be visible as random crashes, but that's not even necessarily the case for the average machine. If you look at the quantitative study from Google, which the author links to:

http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

... you can see that in terms of errors per DIMM:

"Across the entire fleet, 8.2% of all DIMMs are affected by correctable errors and an average DIMM experiences nearly 4000 correctable errors per year. These numbers vary greatly by platform. Around 20% of DIMMs in Platform A and B are affected by correctable errors per year, compared to less than 4% of DIMMs in Platform C and D. Only 0.05-0.08% of the DIMMs in Platform A and Platform E see an uncorrectable error per year compared to nearly 0.3% of the DIMMs in Platform C and Platform D. The mean number of correctable errors per DIMM are more comparable, ranging from 3351-4530 correctable errors per year."

So the mean rate of correctable errors is high, but the variance is also very high: depending on the manufacturer, 80 to 96% of DIMMs see out a whole year without a single correctable error. If the original statistics of 95% chance of error in 3 days were correct, a single-DIMM machine ought to have approximately 0% (astronomically close to 0%) of living out a whole year without errors - but we can see here that you seem to have between 80 and 96% of single-DIMM machines doing just that.

The moral here is to test your memory for a while - preferably a few days - before trusting the DIMMs. But once you know you have good DIMMs, it doesn't look like you need to be quite so paranoid about bit errors.

anamax · on Feb 8, 2010

> But once you know you have good DIMMs, it doesn't look like you need to be quite so paranoid about bit errors.

During the talk, http://www.stanford.edu/class/ee380/fall-schedule-20092010.h... , IIRC, she said that good DIMMs went bad over a fairly short period of time, on the order of 2-3 years.

tbrownaw · on Feb 8, 2010

> But once you know you have good DIMMs, it doesn't look like you need to be quite so paranoid about bit errors.

Assuming that only the one-error-per-year cases were due to random bit flips, and all the multiple-errors-per-year cases were due to bad DIMMs, I came up with about a 1/5 chance of getting a single random bit-flip over a 6 year lifespan. But there also seems to be about a 1/3 chance of having a DIMM randomly go bad after a couple years, which of course without ECC would manifest as random crashes and lost (or maybe corrupted) work.

Estragon · on Feb 8, 2010

Seems like running memtest every six months or so would be a good policy.

It would be nice if there were a way to test the memory while a machine is running.

psranga · on Feb 8, 2010

The errors we're talking about here are transient. The memory location itself is still usable, the contents get changed when a cosmic ray hits. After the hit, the corrupted value is held without a problem.

Memtest checks if the memory location has a gross fault which prevents it from storing values correctly.

Doesn't seem like memtest will help.

Estragon · on Feb 8, 2010

Good point. Thanks.

regularfry · on Feb 8, 2010

I remember someone-somewhere saying that in their experience building GCC was a more thorough memory test than memtest itself...

scotty79 · on Feb 8, 2010

I don't think that there was any mention that error rate was dependent on manufacturer.

InclinedPlane · on Feb 8, 2010

This article contains a fundamental flaw. It estimates the upset rate in memory due to cosmic ray flux as upsets/bits/hour, but this is an incorrect unit. Upsets depend on the total physical size of the memory (and thus the total neutron flux) and the sensitivity of each memory cell (bit) to cosmic rays. Sensitivity may increase as you decrease the size of memory cells, but not in lock-step with the change in size. A room full of 4 Mbit memory chips will almost certainly have a higher rate of upsets per bit than will a single 2GB DIMM. The figures quoted in the article are from studies of computer systems in the 1980s, so upsets/bits rates are much higher than would be expected with modern RAM (which has 3 orders of magnitude more bits in the same volume).

This error is probably why the article's theorized SEU event rate for modern systems is about 3 orders of magnitude higher than experimental evidence suggests (such as from this Google study): http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

jessriedel · on Feb 8, 2010

I don't understand your first paragraph. Shouldn't the cosmic ray flux through each bit be the same for each bit, and unchanged as you increase the total amount of memory? If I'm a given bit, should having a second stick of RAM 10 inches away affect whether I flip during a given time period?

InclinedPlane · on Feb 8, 2010

Shouldn't the cosmic ray flux through each bit be the same for each bit, and unchanged as you increase the total amount of memory?

For identically manufactured ram, generally yes. The total number of upsets you'll see from a collection of 10 sticks of RAM will be roughly 10x higher than from 1 stick of RAM. However, there are huge variations in RAM, especially when you're comparing modern RAM to RAM manufactured in, say, 1988. The main figure used in the article (1.3e-12 upsets/bit/hour) comes from a study of a Cray Y-MP 8 system that had a main memory system containing approximately 32,000 SRAM chips. This amount of memory is measured in cubic meters, yet today the same number of bits of RAM fits on half or a quarter of a single DIMM.

Suffice it to say, the cosmic ray flux through the Cray Y-MP 8's main memory system and through half of a 2GB DIMM is significantly different, by orders of magnitude. At the same time, the memory cell in the Y-MP 8 and the memory cell in a 2GB DDR2 DIMM will have a different rate of sensitivity to cosmic ray flux, translating to a different rate of upsets for the same rate of neutron flux per memory cell. However, these two factors don't balance each other out, modern memory cells aren't thousands of times more sensitive to cosmic rays even though they take up thousands of times less space. The result is that a figure of upsets/bits/year can only be taken to be constant so long as the memory technology remains constant. That is most decidedly not the case here. If one were using 4GB of Cray Y-MP ram (which would likely fill an entire server rack, and more) perhaps you'd see the SEU rates the author calculates. However, most folks these days are using 4GB of RAM in 2 tiny DIMMs which may have, at most, a combined cross-sectional area (of the actual memory chips) of at most maybe 16 cm^2. This has non-trivial effects on the SEU rate.

jessriedel · on Feb 8, 2010

Oh, OK. I guess then I wouldn't say that upsets/bit/hour is an incorrect unit. (It's clearly what you want to know to calculate the chance of error for a given piece of RAM.) It's just that this parameter varies across time and manufacturers. Using the value from a particular model of RAM manufactured in 1988 is sure to lead to wrong conclusions.

Thanks.

ynniv · on Feb 8, 2010

The rate of cosmic ray intersection is most likely directly proportional to the physical cross section of the silicon. The author uses per-bit empirical data from 10 years ago, when memory was 100 times less dense, and then extrapolates to the present. It would likely be more correct to use a per-chip (or per cm^2) rate.

adamc · on Feb 8, 2010

I think his point was that as you get a higher number of bits per volume, the number of flips per GB would go down, basically because for a given number of bits, a higher density would imply they are exposed to less flux.

Tuna-Fish · on Feb 8, 2010

Based on pure gut feeling, I doubt that sensitivity has increased at all for at least a decade now. Anything that penetrates into the casing is either something that will not interact with it, or has enough energy to flip a bit both in a 180nm and 32nm process.

djcapelis · on Feb 8, 2010

I love ECC as much as the next guy, but the reality is that the entry overstates the importance of bit-flips by assuming the bits always flip in something that matters.

The simple fact is that most bit flips occur in portions of memory no one cares about. If the error even manifests a lot of the time it'll just manifest as one pixel in some image somewhere changing color by one bit.

With the author's estimate of 1,000 bit flips over the lifetime of a computer, maybe 10 of them cause crashes. Most of those crashes are likely to be a web browser on most people's desktops anyways, so if you just imagine you have the previous version of the flash player, you can simulate an increased SEU rate pretty nicely.

ECC is standard on servers because we assume the data they carry matters the large majority of the time. We assume that server's typically have a larger portion of their memory devoted to "important" things. (I.E. not images, video or stuff a javascript interpreter forgot to free().) On desktops, it is still probably reasonable to purchase non-ECC hardware for the time being.

I agree with the author though, that this is only getting worse. The trends are all in directions where this is going to start affecting consumer-level stuff at some point, but I'm not sure we're there yet.

As always, it's a matter of your workload combined with good risk analysis.

ableal · on Feb 8, 2010

Last year I shopped around for a quad-core 8GB box with ECC RAM. The RAM itself is not much more expensive, the problem is CPU/chipset support. I went with an AMD Phenom - I think that with Intel CPUs, you only get ECC in the Xeon server line.

(Note that besides bunging in the ECC RAM DIMMs, you may have to turn on ECC support in the BIOS.)

Just using increasing amounts of RAM, storage and bandwidth, without adding data-integrity checks, is really asking for trouble ...

lutorm · on Feb 8, 2010

The increasing storage problem applies to hard drives, too, and the increasing need for something like RAID6 over RAID5.

derobert · on Feb 9, 2010

Neither RAID5 nor RAID6 give you integrity checks. Each block of data is only read from /one/ disk, unless that disk is failed (in which case parity & data is read from the remaining disks to calculate that block).

If the disk recognizes the sector as bad (through its own, internal redundancy checks), then (depending on RAID implementation) either that one block will be read from parity or the entire disk will be dropped from the array.

But, if the disk silently corrupts data, RAID5/6 will not protect you. In fact, it makes the problem worse; silent corruption is more likely the more disks you have)

Confusion · on Feb 8, 2010

the probability of having at least one bit error in 4 gigabytes of memory at sea level on planet Earth in 72 hours is over 95% .

This metric is only relevant if you read all 4 GB of your memory every second and use the data for something that can't stand a flipped bit. Then you'll have one problem for every 72 hours of constant use of all of your memory.

How much of your memory do you use on average? How many flipped bits will be read, before being overwritten? How many bit flips cause a real problem? If one of the gray background dots of HN turns blue, I don't really care. The likelihood of an actual problem for an average user is vastly lower because of these factors.

The average comment on the blog of this guy and on Reddit is just sad: it's all fine and well that anecdotal evidence and the Google paper tell you he's wrong, but his math makes sense. Doesn't anyone feel the need to get to the root of the error in his assertion?

mfukar · on Feb 8, 2010

The "metric" is always relevant. Your confusion comes from the fact that you're thinking about whether those errors are reflected in a user's experience. If you think about the concept of RAM coupled with the fact that desktops are most likely running every little piece of software available (OS, a browser with several windows/tabs open, IM programs, a game, etc.) you'll see why this is a big deal. And let's not even mention servers and machines where actually significant work is being carried out..

Confusion · on Feb 8, 2010

I'm responsible for a handful of servers where 'actually significant work' is being carried out. My reasoning applies to those machines just as much:

- how much memory is in use?

- how large is the chance that a flipped bit is read (as opposed to being overwritten before being read)?

- What are the consequences of the flipped bit?

Apart from that: there are quite a few desktops in the world where 'actually significant work' is being carried out.

regularfry · on Feb 8, 2010

The author touches on the "how much memory is in use" question: all major OSes use unallocated RAM as a file cache (or equivalent), so no matter where the error happens, it is almost certain to hit something. Whether that "something" is actually relevant is another matter.

mfukar · on Feb 8, 2010

Those are not directly related questions; the answer to each of those depends on the setup of the machine, both on hardware and software, and the practical cases are far too many to enumerate on a single blog post. You are welcome to perform some tests or even theorycrafting on your systems.

On an unrelated note, I did not mean to demean desktops, but the reality is that there's orders of magnitude more devices that carry out tasks more critical than image processing or development. Embedded devices are one example.

patrickgzill · on Feb 8, 2010

Sidenote: Solaris 10 and OpenSolaris have the ability to not only monitor ECC memory errors, but when detecting that the errors go over a certain threshold will automatically mark those pages "bad" and force the operating system to no longer use that range of memory.

I have an 8GB dual-Opteron system with a DIMM that is in production and should not be taken down - about 4MB on one DIMM has been marked bad and removed from use by the OS.

prewett · on Feb 8, 2010

In four years of reading Cassini updates, they have never once mentioned worrying about software errors due to cosmic rays. And this is without any atmospheric protection at all; we have 100 miles of atmosphere.

They have mentioned that their solid state relays trip 2 or 3 times a year due to cosmic rays. I'm not sure how comparable those are to DIMMs, but it does suggest that the author's claim of one error per day is a bit off...

AngryParsley · on Feb 9, 2010

Pretty much anything launched into space uses radiation-hardened electronics. I don't know what Cassini has on board, but this is popular: http://en.wikipedia.org/wiki/IBM_RAD6000

varaon · on Feb 8, 2010

We are much more shielded on Earth.

rlpb · on Feb 8, 2010

> First, let's assume you have a system with no error-correction nor parity. The probability that you'll experience a bit error during the time T will be 1-(1-p)^m .

OK, let's assume that.

> For T=1 hour , p = 1.3e-12 and m = 42^308 that gives 0.044 or 4.4% .

WTF? Where did those figures come from?

sailormoon · on Feb 8, 2010

Perhaps a good thing for "mission critical" software to do might be to implement software "hash checks" where the software keeps a running series of hashes of what it thinks it has written to disk, and then compares that against the actual stored bytes at the end of the operation. Or, if that is impractical, have a "safe mode" intended for kernel builds, software releases, etc, where it will compile and build twice, then compare the two for differences. That would solve a lot of the problems the author is postulating, however unlikely they might be.