[vox-tech] ECC memory --- is it worth it? (semi-OT)

Wed Apr 11 17:42:21 PDT 2007

hajhouse wrote:
> Here's my perspective on that. Assuming that one of those uncorrected
> single-bit errors turned out to be in the worst possible place (say, a
> pointer in the kernel or in postgresql in a journaling memory structure)
> that turned out to cause data corruption that caused a day of work to be
> lost (i.e., the last good backup was 24 hours old), then:
> 
> - assuming a man-hour is worth $50 (that's probably low) 
> - assuming that the machine is used by four people (other people's
>   servers have more users),
> 
> then the problem would cost $1600 to recover from, plus whatever
> additional time was required to take the system down to restore the
> backup, fsck the filesystem, etc.
> 
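
For anyone checking the arithmetic in the quote: the $1600 figure falls out if
"a day of work" means an 8-hour day per person.  The 8 hours is my assumption;
the quote only gives the $50 rate and the four users.  A minimal sketch, with a
made-up helper name:

```python
def lost_work_cost(users, hours_lost_each, hourly_rate):
    """Cost of redoing lost work, excluding restore/fsck downtime."""
    return users * hours_lost_each * hourly_rate

# Figures from the quoted message; hours_lost_each=8 is an assumption.
cost = lost_work_cost(users=4, hours_lost_each=8, hourly_rate=50)
print(cost)  # 1600
```

Time spent restoring the backup and running fsck would come on top of that.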

Sure, that is a very special case, but there are many other insidious things
that could happen.  Say, for instance, a row in the database gets the wrong
value.  Many calculations are based on it, and the next time you do your
taxes, things that should add up don't.  Your months of backups all have
the same error, and you're not sure why it happened, what exactly is
reliable, or what is corrupt.

> That notwithstanding, I agree with Rick about disk failures being an
> order of magnitude more likely. I've experienced the pain of a failing
> disk more times than I care to remember.

I'm not trying to be argumentative.... but I don't understand this argument.

ECC memory doesn't protect against a dead DIMM; it protects against silent
corruption of data.  Sure, disks die more often than RAM, but that isn't a
reason to use ECC (or not to use it).  Disk deaths are fairly easy to protect
against: 300 GB RAID/enterprise edition disks go for $115 or so.  Disks already
have ECC on sectors to protect against bit rot, as well as in the protocol
(for SATA, anyway) to help protect against transfer errors.

Disks fail at an average rate of roughly 1-3% a year (see the Google study
of, er, 40k drives), but brand, model, and environmental factors can make
that dramatically worse.
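
To put that range in perspective, here is a rough sketch of the chance that at
least one disk in a box dies in a given year.  It assumes the 1-3% is an annual
rate and that failures are independent, which is optimistic for drives from the
same batch sitting in the same chassis.  The function name is just for
illustration.

```python
def p_any_failure(annual_failure_rate, n_disks):
    """Probability that at least one of n_disks fails within a year,
    assuming independent failures (an optimistic simplification)."""
    return 1.0 - (1.0 - annual_failure_rate) ** n_disks

# Low and high ends of the 1-3% range, for a few array sizes.
for rate in (0.01, 0.03):
    for n in (1, 2, 4):
        print(f"rate={rate:.0%} disks={n}: {p_any_failure(rate, n):.1%}")
```

More spindles means more chances for a dead drive, which is why the redundant
disk is cheap insurance.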

The fair comparison is between undetected corruption on disks (what looks
like a valid read/write returning bad data) and undetected corruption on DIMMs
(what looks like a valid read/write returning bad data).  That is exactly
what does (or does not) justify ECC.

So, to address the original question: yes, I'd recommend spending another $10
a DIMM and adding a redundant disk ($50-$150 for popular sizes) for any system
where you want high uptime.  Don't forget that RAID isn't a replacement
for backups.

Oh, I also wanted to note that the error rates in DIMMs are rising as the
process shrinks.  Things have changed in the last few process generations.
http://www.edn.com/article/CA454636.html has a good discussion, especially
the "getting worse not better" section.  The root of the problem seems to
be "As process technologies continue to shrink, the critical charge required
to cause an upset is decreasing faster than the charge-collection area in the
memory cell."

Folks who want to measure the numbers themselves can monitor ECC errors on
their own machines, or, if you don't have ECC, run memtest86 for as long as
you want to sample.
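
For the monitoring side, one possible approach on Linux is to read the
counters the kernel's EDAC subsystem exposes in sysfs.  This is only a sketch:
it assumes an EDAC driver for your chipset is loaded and that the usual
/sys/devices/system/edac/mc/mc*/ce_count and ue_count files are present.  The
function name is made up.

```python
from pathlib import Path

# Usual EDAC sysfs location; only present if an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_ecc_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": n, "ue_count": n}} for each
    memory controller found; a value is None if a file is unreadable."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts
```

Calling read_ecc_counts() from cron and logging the result gives a running
tally; ce_count is corrected errors and ue_count is uncorrected ones.  An empty
result most likely means no EDAC driver is loaded, not that the memory is clean.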