The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] ECC memory --- is it worth it? (semi-OT)

Quoting Bill Broadley (bill@cse.ucdavis.edu):

> ECC memory doesn't protect from a dead dimm, it protects from a silent
> corruption of data.

I saw an example of that, back in 1989.  I was working in what was then
called the MIS Department at Blyth Software in Foster City:  The VP of
Engineering passed along a requirement for MIS to build a new
engineering NetWare 3.12 server.  He wanted that server to run DOS and
MacOS namespaces (to do SMB and AppleTalk-based file and print
services), be an NFS server, run the source code repository (whatever
that was; can't remember), _and_ run prototyping installations of the
Oracle and Sybase RDBMSes, _and_ handle all Engineering e-mail.  The
task was handed to me, with a budget of something like $20k.  

Even though I was just the PFY, I balked:  I countered that it would be 
smarter to divide those functions among about five or six servers, at no
more total dollars and possibly fewer.  The VP told me to never mind my
opinion, but just implement his plan.  I politely dug in my heels and
talked about the advantages of doing it the other way, and alluded to
eggs and baskets.  The VP was annoyed (and complained to my boss), but
couldn't claim I'd refused, because I'd carefully never said "no", not
exactly.  

Losing patience, the VP took his specs to an outside VAR in Burlingame,
who was quite happy to spec a do-it-all HP NetServer something-or-other
with immensely large amounts of disk and RAM (for those days).  The VAR
deployed it.  Backups (weekly full on Friday, differential daily M-Th)
occurred per MIS Dept.'s standard practice onto 8mm Exabyte tapes.
Months passed.

And then they started noticing that the data stored on the array were
corrupted.  Test restores were done from various tapes:  It emerged that
_all_ of the tape sets featured data corruption in incrementally
increasing degrees, going back about four months to the new server's
deployment.  Engineering thus got to decide how much random file
corruption it was willing to tolerate, versus how many months' work it
was willing to throw away.  After a few days' debate, they decided to
jettison _all_ of those four months of everyone's work -- plus the VP of
Engineering.

I did my best to not even look like I wanted to say "I told you so" --
not least because I hadn't actually anticipated that particular
scenario at all.

The HP NetServer was subjected to extensive testing, in an effort to
save it.  The VAR used, among other things, all available memory-testing
software tools in an effort to isolate the problem -- and I believe I
remember them actually swapping out all of the RAM, at one point.  I
vaguely recall that it was still a useless hulk when I left the firm in
1994.

It was a very striking experience.  And it's also something I've never
seen since then.  (I've seen plenty of bad sticks of RAM on *ix servers,
but never progressive & silent data corruption without signs that
there's bad RAM needing immediate replacement.)

If I _had_ been seeing that, even rarely, my current view would be
different -- and of course I _will_ change my view if and when what I
see changes.


