[vox-tech] ECC memory --- is it worth it? (semi-OT)

Bill Broadley bill at cse.ucdavis.edu
Sun Apr 8 13:03:43 PDT 2007


hajhouse wrote:
> ECC memory is supposed to correct single-bit errors that can be caused

And detect double-bit errors.
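
To make the "correct one, detect two" behavior concrete, here is a toy
sketch using an extended Hamming (8,4) code.  Real ECC DIMMs protect much
wider words and the controller does this in hardware; the function names
encode and decode are mine, purely for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity; 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2          # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        code[syndrome] ^= 1             # single error; syndrome 0 is the parity bit
        status = "corrected"
    else:
        return "uncorrectable", None    # two flips: detected, but not fixable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of encode(n) still decodes to n; flipping any two
is reported as uncorrectable rather than silently returning bad data.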

> by radiation and other freak events of the quantum-mechanical field in
> which we live.

Statistically, I believe the radiation is more likely to come from
the package... hmm, maybe that's only for ceramic packages.  I was
reading a story about the "Big Mac", a rather strange 1100-node
cluster at Virginia Tech based on the G5.  It didn't have ECC
support, and they noticed significantly higher error rates during
the day, I believe due to increased radiation from the sun.  They ended up
replacing all 1100 nodes with the next-generation Apple Xserves, which did
have ECC support.

> That sounds like a good thing that I would like to have
> and an important feature for a machine that will be up for months
> between reboots. 

Statistically I wouldn't necessarily expect a reboot, but it might well
twiddle a bit that causes corruption or a segfault (if it twiddles
a pointer).  So you get application misbehavior, crashes, corrupt disks,
and, if it's unlucky enough to hit the kernel, a crash or panic.


> However, ECC modules cost more than standard modules.  Also, most
> motherboards don't list ECC support in their feature lists. I assume

Careful: many don't mention it, but then include ECC DIMMs in their
certified list of DIMMs, usually found in the manual.  Athlon 64s
include the memory controller on chip and do include the ECC functionality;
the memory bus is 144 bits wide to allow for the ECC bits.
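
The 144 figure can be sanity-checked with a little arithmetic.  This is a
sketch that assumes the number refers to two 64-bit channels, each
protected by a single-error-correct, double-error-detect Hamming code;
secded_check_bits is a name I made up for the helper.

```python
def secded_check_bits(data_bits):
    """Check bits needed for single-error-correct, double-error-detect coding."""
    r = 0
    # A Hamming code needs r check bits with 2**r >= data_bits + r + 1,
    # so every bit position (plus "no error") gets a distinct syndrome.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall-parity bit adds double-error detection


check = secded_check_bits(64)    # check bits per 64-bit channel
per_channel = 64 + check         # width of one ECC-protected channel
dual_channel = 2 * per_channel   # width with two channels side by side
print(check, per_channel, dual_channel)
```

Under those assumptions this prints 8 72 144.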

Not sure on the Intel side of things; it might require one of the
higher-end chipsets (like the 975 instead of the 965).

> that this means that either plugging in ECC modules would lead to
> non-function, or that they would function as standard memory, without
> using their error-correcting capability (rather pointless). Choosing to
> use ECC memory then also means you get to pick from a smaller set of
> motherboards than you would otherwise, and will probably pay more for
> the board because only high-end boards have ECC support.

Not necessarily; I've seen amd64 boards under $100 with ECC support.

> How many of you are using non-ECC (standard) memory on long-uptime
> machines? Are you having any problems because of it? Do you think ECC is
> worth the premium?

If uptime is important I make sure to specify ECC, RAID, and, if it's
important enough and I have the budget, a redundant power supply.

> My current main machine does have ECC memory. I've not made a habit of
> looking at /proc/ram to see whether my machine has had RAM errors, but
> currently it shows none.

What kernel are you running?  I have a TB or so of RAM around, but alas no
/proc/ram that I've noticed.  Do you have to load a particular module?
Run a certain kernel?  Does its location under /proc vary?
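
For comparison, the one place I do know of for ECC counters on Linux is
the EDAC driver's sysfs tree, not /proc.  A minimal sketch, assuming an
EDAC driver for your memory controller is loaded and exposes the usual
ce_count (corrected) and ue_count (uncorrected) files; read_edac_counts
is just an illustrative name.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {'ce_count': int|None, 'ue_count': int|None}}."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(root, "mc*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None      # file missing or unreadable
        counts[os.path.basename(mc)] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found (driver not loaded?)")
    for mc, entry in found.items():
        print(mc, entry)
```

An empty result only means no EDAC driver is reporting; it says nothing
about whether the hardware is actually doing ECC.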

> Some
>    systems also 'scrub' the errors, by writing the corrected version back
>    to memory.

Usually under BIOS control; this helps keep recurring errors from overwhelming
the ECC.  The same idea applies to RAID.  Say you have a RAID-5: if you don't
read 100% of the disks, you might not know about a silent failure until a disk
dies and your rebuild fails.
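
The "read 100% of the disk" idea can be sketched as a plain sequential
read that stops at the first I/O error.  This is an illustration only:
scrub_read is a hypothetical helper, the path you hand it (a member disk
such as a /dev node, or an ordinary file) is up to you, and reads
satisfied from the page cache may never touch the platters.

```python
def scrub_read(path, chunk_size=1 << 20):
    """Read every byte of path; return (bytes_read, error_or_None)."""
    total = 0
    try:
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:           # end of file or device
                    break
                total += len(chunk)
    except OSError as exc:
        # An unreadable sector normally surfaces here as an I/O error.
        return total, exc
    return total, None
```

Reading the array device itself normally touches only the data blocks,
not the parity, so each member disk needs its own pass.  Linux software
RAID can do a fuller check itself by writing "check" to the array's
md/sync_action file in sysfs.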

