[vox-tech] ECC memory --- is it worth it? (semi-OT)

Tue Apr 10 17:51:05 PDT 2007

Rick Moen wrote:
> You have a small point, but only for trivial values of "survive":  The
> lion's share of those bit flips will turn out to be harmless for any of
> sundry reasons.  (I'd speculate that some non-zero percentage of
> prematurely deceased httpd instances owed to that, for example -- but

Sure, killing a corrupted process is the best thing that could happen:
that way the corruption can't spread and the error is contained.

Corruptions can cascade.  Granted, not all do, but a bad bit in memory
could be a pointer, which then corrupts another region of memory.  If that
region is written to disk and then used for future operations, you could
quickly have millions if not billions of corrupt bytes (i.e. a dead
filesystem).  I'm quite grateful every time I see an ECC error: one
potential major issue stopped in its tracks.
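
A rough sketch of why a flipped bit in a pointer is so nasty, in Python.
The pointer value and the flip_bit helper are made up for illustration;
the point is only that the damage depends entirely on which bit flips.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with a single bit inverted, as a memory bit flip would."""
    return value ^ (1 << bit)

# Hypothetical 64-bit pointer value, chosen only for illustration.
pointer = 0x00007F3A_1C2B_4000

low = flip_bit(pointer, 3)    # a low bit: the pointer lands a few bytes away
high = flip_bit(pointer, 40)  # a high bit: the pointer lands in unrelated memory

print(hex(pointer), "->", hex(low), "off by", abs(low - pointer), "bytes")
print(hex(pointer), "->", hex(high), "off by", abs(high - pointer), "bytes")
```

A flip in a low bit may quietly hit a neighbouring field, while a flip in a
high bit points at something else entirely; neither is guaranteed to crash.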

I think AMD was quite smart to include ECC support on all their processors
and wish that Intel did the same.  After all, the importance of your
data isn't always related to the cost of the machine manipulating
said data.

It's the bit flips that don't cause a process crash that you worry about,
since you now have a corrupt process with (generally) the ability to
read and write part of the disk.  The same goes for the kernel: a bit flip
on a kernel-owned page resulting in an immediate panic is the best-case
scenario, and is much more likely if you have ECC.

> those just respawn.)

Dunno.  For instance, Linux caches the filesystem aggressively; if any dirty
page has a bit flip when it is written back, you have corruption on disk.
If any of the cached metadata has a bit flip, you potentially have a corrupt
filesystem.  Every disk write is at risk.  Most open processes could
do something bad.  Granted, limited permissions per process/daemon help...
unless of course the error is in the kernel.

> If that were a concern meriting real-world concern in situations where
> the RAM _doesn't_ give unmistakeable signs of defects, my data would have
> gone to mush a decade ago.

Er, ECC exists exactly because of real-world concerns.  I've seen entire
clusters replaced to add ECC because of exactly those real-world concerns.
Researchers don't like to hear that the results are usually right and it's
unlikely that their results are wrong.  I've heard of cases where a
month-long calculation on 64 nodes gave an exciting answer, and to be sure
they repeated it and got a second answer.  They were pretty sure one
was a memory problem... till they got a 3rd.  Not sure they ever figured it
out, but they did end up adding ECC.

Sure, if 99% of your data is disposable, say mp3 files that you can
re-rip or jpegs where you aren't going to notice a pixel being off (at
least until the viewer crashes), then fine.  Then again, other systems
have a greater percentage of valuable data.

> Frankly, HD defects are a many orders of
> magnitude more significant threat.

Not sure where this comes from.  How many orders are you suggesting?  Disks
also use ECC, with single bit errors on the order of 1 per 10^14 sectors
and an annual failure rate of 1-3%.  RAID is relatively common among servers,
and in my mind it provides similar protection against a similar risk and is
similarly justified for those who want reliability and higher uptimes.

So you are saying that HD defects are 10 or 100 times more likely than the 1
bit per GB per month?  Frankly, if that were true I would expect your "data
would have gone to mush a decade ago".  Try flipping 10 or 100 random
bits on your disk once a month and report back ;-).
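
For anyone who wants to try that experiment without sacrificing a real
disk, here is a minimal sketch in Python that flips random bits in an
in-memory buffer instead.  The buffer size, the seed and the
flip_random_bits helper are my own assumptions; a CRC comparison stands in
for whatever integrity check you actually use.

```python
import random
import zlib

def flip_random_bits(data: bytes, count: int, seed: int = 0) -> bytes:
    """Return a copy of data with `count` distinct, randomly chosen bits inverted."""
    rng = random.Random(seed)
    buf = bytearray(data)
    # Sample distinct bit positions so no flip cancels an earlier one.
    for pos in rng.sample(range(len(buf) * 8), count):
        buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)

# 64 KiB of zeros standing in for a handful of disk blocks (an assumption).
original = bytes(64 * 1024)
damaged = flip_random_bits(original, 10)

# Count how many bits actually differ between the two buffers.
changed = sum(bin(a ^ b).count("1") for a, b in zip(original, damaged))
print("bits changed:", changed)
print("CRC32 still matches:", zlib.crc32(original) == zlib.crc32(damaged))
```

Nothing crashes and the buffer is the same length, which is the point: without
a checksum or ECC, the damage goes unnoticed until something reads it.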

So, anyway, sure, don't run ECC if you don't want to, and sure, many desktop
users won't notice.  But since the original goal of the thread was a machine
that "will be up for months between reboots", spending an extra $10 [1]
for ECC DIMMs is reasonable.  I'd also suggest that running redundant
disks would be worth it.

BTW, out of 180 nodes with 4GB RAM each I did manage to find quite a few ECC
errors.  I'd consider it a major deal if I had to contact every user who
had run jobs in the last month about potentially erroneous results.
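
As a back-of-the-envelope check, here is the arithmetic in Python for a
cluster of that size, taking the 1 bit per GB per month figure quoted above
at face value.  That rate is an assumption for the sake of the estimate,
not something I measured, and real rates vary a lot between machines.

```python
def expected_flips_per_month(nodes: int, gb_per_node: int,
                             flips_per_gb_month: float) -> float:
    """Expected bit flips per month across a cluster, assuming a uniform rate."""
    return nodes * gb_per_node * flips_per_gb_month

# 180 nodes with 4GB each, at the assumed rate of 1 bit per GB per month.
estimate = expected_flips_per_month(nodes=180, gb_per_node=4,
                                    flips_per_gb_month=1.0)
print(estimate)  # 720.0
```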

[1] random data point on the price difference, both Kingston 1GB modules:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820144153
http://www.newegg.com/Product/Product.aspx?Item=N82E16820134045