Linux Users' Group of Davis (LUGOD)
Page last updated: 2007 Apr 10 23:38

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] ECC memory --- is it worth it? (semi-OT)



Quoting Bill Broadley (bill@cse.ucdavis.edu):

> Corruptions can cascade. Granted, not all do, but a bad bit in memory
> could be a pointer, which then corrupts another region of memory; if one
> is written to disk and then used for future operations, you could
> quickly have millions if not billions (i.e. a dead filesystem) of corrupt
> bytes.

A bad bit in memory, if indicative of a physical defect, will quickly
manifest unmistakably on Linux in the manner I described.  If not thus
indicative, it is (from empirical observation over a long period of time)
extremely unlikely to have detectable long-term consequences.
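To make Bill's pointer scenario concrete: a single flipped bit in an address does not nudge it slightly, it moves it by exactly 2**bit bytes, so the program ends up reading or writing somewhere entirely unrelated. The sketch below is a minimal illustration; the address value and the bit position are hypothetical, chosen only for the example.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit toggled (XOR with a single-bit mask)."""
    return value ^ (1 << bit)


# Hypothetical 64-bit user-space address, for illustration only.
ptr = 0x7F3A1C2D4000
corrupted = flip_bit(ptr, 33)

print(hex(ptr), "->", hex(corrupted))
# A one-bit flip always changes the value by exactly 2**bit (here 8 GiB).
print("distance:", abs(corrupted - ptr), "bytes")
```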

> I'm quite grateful every time I see an ECC error, one potential
> major issue stopped in its tracks.

And if we all had unlimited funds, we'd all pay through the nose to buy
it for all of our machines.  Sadly lacking the wealth of Midas, however,
we're always obliged to decide in _which_ specific area of systems design
that extra dollar is best applied.  E.g., one might splurge on disk
redundancy, a less cheap and cruddy HBA, or a less laughably
inadequate PSU, any of which is often (again, from empirical
observation over a long period of time), in commodity PC purchases,
likely to make a bigger difference to data integrity than ECC does.

Far be it from me to tell you you shouldn't be delighted with your ECC,
however.  Enjoy!
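For readers wondering what "an ECC error, stopped in its tracks" means mechanically: extra check bits stored alongside the data let the memory controller locate and correct a single flipped bit. The sketch below uses the textbook Hamming(7,4) code as a small stand-in; it is an assumption for illustration, since ECC DIMMs typically use a wider SECDED code (8 check bits per 64 data bits) that can also detect, but not correct, double-bit errors.

```python
from __future__ import annotations


def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(c: list[int]) -> tuple[list[int], int]:
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1  # flip it back
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single-bit flip at position 5
data, pos = hamming74_decode(word)
print(data, pos)  # [1, 0, 1, 1] 5
```

The syndrome bits spell out the binary position of the bad bit, which is why a single flip can be repaired rather than merely detected.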

> It's the bit flips that don't cause a process crash that you worry
> about, since you now have a corrupt process with (generally) the
> ability to read and write part of the disk.

Again, if this were not basically damned close to a fantasy-novel
scenario, my data would have melted down into slag a decade ago.  So
would nearly everyone else's.

> I've heard cases where a month long calculation on 64 nodes gave an
> exciting answer, and to be sure they repeated it and got a second
> answer. 

You might preach cluster design to someone who didn't build the largest 
Linux HPC cluster in history (#3 on the Top 500 list, when deployed).  ;->

> > Frankly, HD defects are many orders of magnitude more significant
> > threat.
> 
> Not sure where this comes from, how many orders are you suggesting? 

I'd guesstimate about three orders of magnitude more likely to be a
threat to data than is RAM corruption that is not based in outright
defective RAM.  Where from?  From twenty years' experience, pretty much.

> So you are saying that HD defects are 10 or 100 times more likely than
> the 1 bit per GB per month?

If you assumed I was endorsing your figure, you assumed wrong.  If you 
remain unclear on what I _was_ saying, you might want to re-read.
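The arithmetic behind this exchange, sketched under stated assumptions: "three orders of magnitude" means a ratio of about 10**3, i.e. roughly 1000x rather than 10x or 100x. The constant-rate model below simply multiplies memory size, time, and a caller-supplied rate; the "1 bit per GB per month" figure is Bill's, which Rick explicitly does not endorse, and the 4 GB machine is hypothetical.

```python
def expected_flips(gb: float, months: float, flips_per_gb_month: float) -> float:
    """Expected bit flips, assuming a constant, caller-supplied error rate."""
    return gb * months * flips_per_gb_month


def orders_to_factor(orders: int) -> int:
    """'N orders of magnitude' corresponds to a ratio of 10**N."""
    return 10 ** orders


# Bill's (disputed) rate applied to a hypothetical 4 GB machine for a year.
print(expected_flips(4, 12, 1.0))  # 48.0
# Rick's estimate: three orders of magnitude.
print(orders_to_factor(3))  # 1000
```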

-- 
Cheers,          "You know, I've gone to a lot of psychics, and they've told me
Rick Moen        a lot of different things, but not one of them has ever told me
rick@linuxmafia.com     'You are an undercover policewoman, here to arrest me.'"
                                         -- New York City undercover policewoman
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech


