[vox-tech] ECC memory --- is it worth it? (semi-OT)
Bill Broadley
bill at cse.ucdavis.edu
Tue Apr 10 17:01:11 PDT 2007
hajhouse wrote:
> Linux wotan 2.6.17-10-generic #2 SMP Tue Dec 5 22:28:26 UTC 2006 i686 GNU/Linux
>
> Try 'modprobe ecc'.
My research found:
* Bluesmoke is now EDAC
* The ecc.ko is part of the EDAC project
* EDAC has been somewhat Intel-centric in the past
* Mainline kernels have EDAC and support Intel chipsets
* 2.6.17-10-generic does not support Opteron
* The devel tree on SourceForge has Opteron support
* Mcelog is the more AMD-centric way to do it
* Mcelog seems reasonably popular (Red Hat and Ubuntu, anyway)
* Mcelog seems to support numerous events, not just dimm related ecc errors
So while getting the ecc module to build would require a new kernel
(2.6.18 or newer) and custom patches from SourceForge, mcelog just
requires a small binary to read /dev/mcelog. I ran it on about 180
machines and found one very unhappy node:
CPU 0 1 instruction cache TSC e6a7a079a8a84
ADDR 117b00
Instruction cache ECC error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
instruction fetch mem transaction
memory access, level generic'
STATUS d400400000000853 MCGSTATUS 0
MCE 5
CPU 0 2 bus unit TSC e6a7a079a8ccd
ADDR c500
L2 cache ECC error
Bus or cache array error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d400400000000813 MCGSTATUS 0
MCE 6
CPU 0 4 northbridge TSC e6a7a079a906a
ADDR 3ce5e0
Northbridge ECC error
ECC syndrome = 64
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d432400100000813 MCGSTATUS 0
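
The STATUS words above can be checked by hand against the bit numbers
mcelog prints. Below is a minimal Python sketch of that check, handy for
summarizing saved mcelog text from many nodes. The bit meanings are only
the three that appear in the output above (bit 32 is reported there for
the northbridge bank only, so treat it as bank specific); the names
FLAGS, decode_status and statuses are made up for this example and are
not part of mcelog.

```python
import re

# Bit positions and labels copied from the mcelog output in this post.
# Assumption: bit 32 ("err cpu0") only has this meaning for the
# northbridge bank, which is the only place the output reports it.
FLAGS = {
    32: "err cpu0",
    46: "corrected ecc error",
    62: "error overflow (multiple errors)",
}

# Matches lines such as "STATUS d400400000000853 MCGSTATUS 0".
STATUS_RE = re.compile(r"^STATUS\s+([0-9a-fA-F]+)", re.MULTILINE)


def decode_status(status):
    """Return the labels of the FLAGS bits that are set in a STATUS word."""
    return [name for bit, name in sorted(FLAGS.items())
            if (status >> bit) & 1]


def statuses(text):
    """Extract every STATUS value from saved mcelog text as an integer."""
    return [int(value, 16) for value in STATUS_RE.findall(text)]


if __name__ == "__main__":
    sample = "STATUS d432400100000813 MCGSTATUS 0\n"
    for status in statuses(sample):
        print(hex(status), decode_status(status))
```

Feeding it the three STATUS lines above should report bits 46 and 62 for
all of them, plus bit 32 for the northbridge entry, which agrees with
what mcelog printed.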