[vox-tech] ECC memory --- is it worth it? (semi-OT)

Rick Moen rick at linuxmafia.com
Wed Apr 11 19:08:46 PDT 2007


Quoting Bill Broadley (bill at cse.ucdavis.edu):

> ECC memory doesn't protect from a dead DIMM; it protects from silent
> corruption of data.

I saw an example of that, back in 1989.  I was working in what was then
called the MIS Department at Blyth Software in Foster City:  The VP of
Engineering passed along a requirement for MIS to build a new
engineering NetWare 3.12 server.  He wanted that server to run DOS and
MacOS namespaces (to do SMB and AppleTalk-based file and print
services), be an NFS server, run the source code repository (whatever
that was; can't remember), _and_ run prototyping installations of the
Oracle and Sybase RDBMSes, _and_ handle all Engineering e-mail.  The
task was handed to me, with a budget of something like $20k.  

Even though I was just the PFY, I balked:  I countered that it would be
smarter to divide those functions among five or six servers, at no
greater total cost and possibly less.  The VP told me to never mind my
opinion and just implement his plan.  I politely dug in my heels and
talked about the advantages of doing it the other way, and alluded to
eggs and baskets.  The VP was annoyed (and complained to my boss), but
couldn't claim I'd refused, because I'd carefully never said "no", not
exactly.  

Losing patience, the VP took his specs to an outside VAR in Burlingame,
who was quite happy to spec a do-it-all HP NetServer something-or-other
with immense amounts of disk and RAM (for those days).  The VAR
deployed it.  Backups (full on Friday, differential Monday through
Thursday) went onto 8mm Exabyte tapes, per the MIS Department's
standard practice.
Months passed.

And then they started noticing that the data stored on the array were
corrupted.  Test restores were done from various tapes:  It emerged
that _all_ of the tape sets showed corruption, steadily worsening over
the roughly four months since the new server's deployment.  Engineering
thus got to decide how much random file
corruption it was willing to tolerate, versus how many months' work it
was willing to throw away.  After a few days' debate, they decided to
jettison _all_ of those four months of everyone's work -- plus the VP of
Engineering.

I did my best to not even look like I wanted to say "I told you so" --
not least because I hadn't actually anticipated that particular
scenario at all.

The HP NetServer was subjected to extensive testing in an effort to
save it.  The VAR tried, among other things, every available
memory-testing software tool to isolate the problem -- and I believe I
remember them actually swapping out all of the RAM, at one point.  I
vaguely recall that it was still a useless hulk when I left the firm in
1994.

It was a very striking experience.  And it's something I've never seen
since.  (I've seen plenty of bad sticks of RAM on *ix servers, but
never progressive and silent data corruption without other signs that
there's bad RAM needing immediate replacement.)

If I _had_ been seeing that, even rarely, my current view would be
different -- and of course I _will_ change my view if and when what I
see changes.
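
For anyone curious what ECC actually buys you, here is a minimal
sketch, in Python, of the kind of single-error-correcting Hamming code
that ECC memory is built on.  Purely illustrative: this is the
textbook Hamming(7,4) layout, not any real memory controller's, and
actual ECC DIMMs use a wider SECDED variant over 64-bit words.

    def encode(nibble):
        """Encode 4 data bits as a 7-bit Hamming codeword."""
        d = [(nibble >> i) & 1 for i in range(4)]
        p1 = d[0] ^ d[1] ^ d[3]   # covers codeword positions 1,3,5,7
        p2 = d[0] ^ d[2] ^ d[3]   # covers positions 2,3,6,7
        p3 = d[1] ^ d[2] ^ d[3]   # covers positions 4,5,6,7
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]

    def decode(bits):
        """Return (data nibble, syndrome).  Syndrome 0 means no error;
        otherwise it's the 1-based position of the corrected bit."""
        p1, p2, d0, p3, d1, d2, d3 = bits
        s1 = p1 ^ d0 ^ d1 ^ d3
        s2 = p2 ^ d0 ^ d2 ^ d3
        s3 = p3 ^ d1 ^ d2 ^ d3
        syndrome = s1 | (s2 << 1) | (s3 << 2)
        if syndrome:                   # one bit flipped in storage;
            bits = list(bits)
            bits[syndrome - 1] ^= 1    # flip it back, silently
            p1, p2, d0, p3, d1, d2, d3 = bits
        return d0 | (d1 << 1) | (d2 << 2) | (d3 << 3), syndrome

    word = encode(0b1011)
    word[4] ^= 1                       # simulate a one-bit RAM error
    value, fixed_at = decode(word)
    assert value == 0b1011             # the flip is corrected in transit

The point is the "if syndrome" branch:  with ECC, the hardware fixes
the flipped bit before the OS ever sees it.  Without ECC, there's no
syndrome to check, and the flip just becomes -- as above -- months of
quietly rotting files.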

