[vox-tech] ECC memory --- is it worth it? (semi-OT)

Bill Broadley bill at cse.ucdavis.edu
Tue Apr 10 23:38:26 PDT 2007


Rick Moen wrote:
> A bad bit in memory, if indicative of a physical defect, will quickly
> manifest unmistakeably on Linux in the manner I described.  If not thus
> indicative, (from empirical observation over a long period of time:)
> it's extremely unlikely to have detectable long-term consequences.  

So you speculate that it contributes to premature httpd deaths, yet
has no detectable long-term consequences?

> And if we all had unlimited funds, we'd all pay through the nose to buy
> it for all of our machines.  Sadly lacking the wealth of Midas, however,

$10 a DIMM requires you to "pay through the nose" and the "wealth of
Midas"?  I guess we have different standards for differential costs on
a machine designed for multi-month uptimes.

> we're always obliged to decide in _which_ specific area of systems design 
> that extra dollar is best applied.  E.g., one might splurge on disk
> redundancy, or a less cheap and cruddy HBA, or a less laughably
> inadequate PSU, all of which decisions are often (again, from empirical
> observation over a long period of time), in commodity PC purchases,
> likely to make a bigger difference to data integrity than does ECC.

True, but if you are stating from the beginning that you want a
decent design with multi-month uptimes, you are IMO already above the
level of "laughably inadequate" PSUs and similar crud.

> Again, if this were not basically damned close to a fantasy-novel
> scenario, my data would have melted down into slag a decade ago.  So
> would nearly everyone else's.

I agree that a decent-quality machine is required before the
improvement from ECC is detectable.  But since ECC is incredibly
cheap, adding $10-$20 to a small server or desktop seems reasonable
to me.

> You might preach cluster design to someone who didn't build the largest 
> Linux HPC cluster in history (#3 on the Top 500 list, when deployed).  ;->

Cool.  Er, do I need to ask the obvious?  Did it use ECC?  If it did,
how many ECC errors did you see per GB per day?  I'll start collecting
this myself, but it will be months before I have any useful numbers.
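
In case anyone wants to collect the same numbers, here is roughly what
I plan to run -- a sketch only, assuming a kernel with EDAC support and
the usual sysfs counter layout (ce_count per memory controller, size_mb
per csrow); the paths may differ on other kernels:

    #!/usr/bin/env python
    # Sketch: normalize EDAC corrected-error counts to errors/GB/day.
    # Assumes the usual EDAC sysfs layout:
    #   /sys/devices/system/edac/mc/mc*/ce_count        corrected errors
    #   /sys/devices/system/edac/mc/mc*/csrow*/size_mb  per-csrow size
    import glob, time

    def read_int(path):
        return int(open(path).read().strip())

    def snapshot():
        ce = sum(read_int(p) for p in
                 glob.glob('/sys/devices/system/edac/mc/mc*/ce_count'))
        mb = sum(read_int(p) for p in
                 glob.glob('/sys/devices/system/edac/mc/mc*/csrow*/size_mb'))
        return ce, mb

    e0, mb = snapshot()
    t0 = time.time()
    time.sleep(24 * 3600)         # one sample per day; cron would do too
    e1, _ = snapshot()
    days = (time.time() - t0) / 86400.0
    print('%.6f corrected errors/GB/day' % ((e1 - e0) / (mb / 1024.0) / days))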

> I'd guesstimate about three orders of magnitude more likely to be a
> threat to data than is RAM corruption that is not based in outright
> defective RAM.  Where from?  From twenty years' experience, pretty much.

So what rate of disk errors have you seen that are not based on
outright defective disks?  Of course file corruption is similarly hard
to detect unless you run tripwire or related checksum-based
monitoring.
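
For the archives, the idea behind that kind of monitoring is simple
enough to sketch.  Something like the following -- a toy stand-in in
the spirit of tripwire, not its actual implementation; the usage and
baseline path here are made up:

    #!/usr/bin/env python
    # Toy checksum-based integrity monitor: the first run records a
    # baseline of SHA-1 sums, later runs report files whose contents
    # changed.  Usage (hypothetical): integ.py /etc /var/integ.baseline
    import hashlib, os, sys

    def sha1_of(path):
        h = hashlib.sha1()
        f = open(path, 'rb')
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
        f.close()
        return h.hexdigest()

    def checksums(root):
        sums = {}
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                sums[path] = sha1_of(path)
        return sums

    root, baseline = sys.argv[1], sys.argv[2]
    current = checksums(root)
    if not os.path.exists(baseline):
        out = open(baseline, 'w')
        for path in sorted(current):
            out.write('%s  %s\n' % (current[path], path))
        out.close()
    else:
        old = {}
        for line in open(baseline):
            digest, path = line.rstrip('\n').split('  ', 1)
            old[path] = digest
        for path in sorted(current):
            if path in old and old[path] != current[path]:
                print('CHANGED: %s' % path)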

In general people seem more worried about file integrity than memory
integrity, even though file integrity depends on memory integrity.

>> So you are saying that HD defects are 10 or 100 times more likely
>> than the 1 bit per GB per month?
> 
> If you assumed I was endorsing your figure, you assumed wrong.  If you 
> remain unclear on what I _was_ saying, you might want to re-read.

Which figures?  The 1 bit per GB per month that Wikipedia mentions?
Or the 1 sector per 10^14 bits read?  The latter is from one of the
Seagate enterprise/RAID edition drives.  Granted, real-world numbers
tend to be worse than reported values, and MTBFs are only very loosely
correlated with real-world annual return percentages.  Then again, a
corrupt sector being read from disk and accepted by the system as
valid seems very rare indeed.
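
Just to put those two figures in the same units, here is the
back-of-the-envelope arithmetic I am doing; the RAM size and daily
read volume below are assumptions for illustration, not measurements:

    # Comparing the two rates above under assumed workload figures.
    gb_ram = 4.0                  # GB of non-ECC RAM in the box (assumed)
    gb_read_per_day = 100.0       # GB read from disk per day (assumed)

    # 1 bit flip per GB per month (the Wikipedia soft-error figure)
    ram_errors_per_day = gb_ram / 30.0

    # 1 unrecoverable sector per 10^14 bits read (the Seagate spec figure)
    bits_read_per_day = gb_read_per_day * 8 * 2 ** 30
    disk_errors_per_day = bits_read_per_day / 1e14

    print('RAM:  %.4f soft errors/day' % ram_errors_per_day)    # ~0.1333
    print('disk: %.4f bad reads/day' % disk_errors_per_day)     # ~0.0086

On those made-up numbers the memory figure comes out an order of
magnitude or so above the disk figure, which is exactly why I'd rather
have measured rates than spec-sheet ones.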

What exactly do you mean by 3 orders of magnitude (base 2 or base 10)?
Undetected errors in healthy hardware?  Deaths?  Detectable errors?
Loss of files?  Something else?

How often do you see this type of disk corruption?  It seems most fair
to equate clearly bad DIMMs with clearly dead disks, and likewise to
equate non-ECC errors caused by random effects but reported as valid
memory with corrupted disk sectors reported as good data.

I'm genuinely interested in information on this, and I have
significant practical experience with these issues as well.  I'd still
like to compare notes, though.

