Page last updated:
2007 Apr 11 19:08

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] ECC memory --- is it worth it? (semi-OT)



hajhouse wrote:
> Here's my perspective on that. Assuming that one of those uncorrected
> single-bit errors turned out to be in the worst possible place (say, a
> pointer in the kernel or in postgresql in a journaling memory structure)
> that turned out to cause data corruption that caused a day of work to be
> lost (i.e., the last good backup was 24 hours old), then:
> 
> - assuming a man-hour is worth $50 (that's probably low) 
> - assuming that the machine is used by four people (other people's
>   servers have more users),
> 
> then the problem would cost $1600 to recover from, plus whatever
> additional time was required to take the system down to restore the
> backup, fsck the filesystem, etc.
> 
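For reference, the $1600 figure in the quoted estimate works out if the lost day is an eight-hour working day per person, which is an assumption on my part since the quote only says "a day of work". A minimal Python sketch of the arithmetic (the function and parameter names are purely illustrative):

```python
def lost_work_cost(hourly_rate, users, hours_lost_per_user):
    """Cost of redoing lost work, ignoring restore and fsck downtime."""
    return hourly_rate * users * hours_lost_per_user

# $50 per man-hour and four users are from the quoted post;
# the 8-hour working day is an assumption.
cost = lost_work_cost(hourly_rate=50, users=4, hours_lost_per_user=8)
print(f"${cost}")  # prints $1600
```

The downtime for restoring the backup and running fsck would come on top of that.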

Sure, that is a very special case, but there are many other insidious things
that could happen.  Say, for instance, a row in the database gets the wrong
value.  Many calculations are based on it, and the next time you do your
taxes, things that should add up don't.  Your months of backups all have
the same error, and you're not sure why it happened, what exactly is
reliable, or what is corrupt.

> That notwithstanding, I agree with Rick about disk failures being an
> order of magnitude more likely. I've experienced the pain of a failing
> disk more times than I care to remember.

I'm not trying to be argumentative.... but I don't understand this argument.

ECC memory doesn't protect against a dead DIMM; it protects against silent
corruption of data.  Sure, disks die more often than RAM, but that isn't a
reason to use ECC (or not to use ECC).  Disk deaths are fairly easy to protect
against: 300 GB RAID/enterprise edition disks go for $115 or so.  Disks already
have ECC on sectors to protect against bit rot, as well as in the protocol
(for SATA, anyway) to help protect against transfer errors.

Disks die at an average rate of 1-3% (see the Google study of, er, 40k
drives), but brand, model, and environmental factors can make that
dramatically worse.

The fair comparison is between undetected corruption on disks (what looks
like a valid read/write returning bad data) and undetected corruption on DIMMs
(what looks like a valid read/write returning bad data).  That is exactly
what does (or does not) justify ECC.

So, to address the original question: yes, I'd recommend another $10
a DIMM and a redundant disk ($50-$150 for popular sizes) for any system on
which you want to achieve high uptime.  Don't forget that RAID isn't a
replacement for backups.
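One rough way to weigh the $10-per-DIMM premium is to ask how likely a corruption incident has to be before ECC pays for itself. The sketch below is only an illustration: the $1600 incident cost is the estimate quoted earlier in the thread, the four-DIMM machine is hypothetical, and it assumes ECC would fully prevent the incident.

```python
def break_even_probability(ecc_premium_per_dimm, dimm_count, incident_cost):
    """Incident probability above which the ECC premium pays for itself."""
    return (ecc_premium_per_dimm * dimm_count) / incident_cost

# $10 per DIMM is from this post; 4 DIMMs is a hypothetical machine;
# $1600 is the recovery cost estimated in the quoted message.
p = break_even_probability(ecc_premium_per_dimm=10, dimm_count=4, incident_cost=1600)
print(f"{p:.1%}")  # prints 2.5%
```

Under those assumptions, ECC is worth it if you expect more than a 2.5% chance of one such incident over the life of the machine. Harder-to-price cases like the silently wrong database row only push that threshold lower.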

Oh, I also wanted to note that the error rates in DIMMs are rising as the
process shrinks.  Things have changed in the last few process generations.
http://www.edn.com/article/CA454636.html has a good discussion, especially
the "getting worse not better" section.  The root of the problem seems to
be "As process technologies continue to shrink, the critical charge required
to cause an upset is decreasing faster than the charge-collection area in the
memory cell."

Folks who want to measure the numbers themselves can monitor ECC errors,
or, if you don't have ECC, run memtest86 for as long as you want to
sample.
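On Linux, one way to monitor ECC errors is through the kernel's EDAC sysfs counters. The sketch below assumes an EDAC driver for your memory controller is loaded and that the counters live under /sys/devices/system/edac/mc; the function name is my own, and the path is a parameter so you can point it elsewhere.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts

if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found (no ECC, or driver not loaded)")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run it from cron and log the output, and you get a corrected-error rate for your own hardware over whatever sampling period you like.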
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech


