[vox-tech] ECC memory --- is it worth it? (semi-OT)

Rick Moen rick at linuxmafia.com
Tue Apr 10 22:49:11 PDT 2007


Quoting Bill Broadley (bill at cse.ucdavis.edu):

> Corruptions can cascade, granted not all do, but a bad bit in memory,
> could be a pointer, which then corrupts another region of memory, if one
> is written to disk and then used for future operations you could
> quickly have millions if not billions (i.e. a dead filesystem) of corrupt
> bytes.

A bad bit in memory, if indicative of a physical defect, will quickly
manifest unmistakably on Linux in the manner I described.  If not thus
indicative, it's (from empirical observation over a long period of time)
extremely unlikely to have detectable long-term consequences.

> I'm quite grateful every time I see an ECC error, one potential
> major issue stopped in its tracks.

And if we all had unlimited funds, we'd all pay through the nose to buy
it for all of our machines.  Sadly lacking the wealth of Midas, however,
we're always obliged to decide in _which_ specific area of systems design 
that extra dollar is best applied.  E.g., one might splurge on disk
redundancy, or a less cheap and cruddy HBA, or a less laughably
inadequate PSU, all of which decisions are often (again, from empirical
observation over a long period of time), in commodity PC purchases,
likely to make a bigger difference to data integrity than does ECC.

Far be it from me to tell you you shouldn't be delighted with your ECC,
however.  Enjoy!

> It's the bit flips that don't cause a process crash that you worry
> about, since you now have a corrupt process with (generally) the
> ability to read and write part of the disk.

Again, if this were not basically damned close to a fantasy-novel
scenario, my data would have melted down into slag a decade ago.  So
would nearly everyone else's.

> I've heard cases where a month long calculation on 64 nodes gave an
> exciting answer, and to be sure they repeated it and got a second
> answer. 

You might preach cluster design to someone who didn't build the largest 
Linux HPC cluster in history (#3 on the Top 500 list, when deployed).  ;->

> > Frankly, HD defects are many orders of magnitude more significant
> > threat.
> 
> Not sure where this comes from, how many orders are you suggesting? 

I'd guesstimate about three orders of magnitude more likely to be a
threat to data than is RAM corruption that is not based in outright
defective RAM.  Where from?  From twenty years' experience, pretty much.

> So you are saying that HD defects are 10 or 100 times more likely than the
> 1 bit per GB per month?  

If you assumed I was endorsing your figure, you assumed wrong.  If you 
remain unclear on what I _was_ saying, you might want to re-read.

-- 
Cheers,          "You know, I've gone to a lot of psychics, and they've told me
Rick Moen        a lot of different things, but not one of them has ever told me
rick at linuxmafia.com     'You are an undercover policewoman, here to arrest me.'"
                                         -- New York City undercover policewoman

