
The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] ECC memory --- is it worth it? (semi-OT)



hajhouse wrote:
> Linux wotan 2.6.17-10-generic #2 SMP Tue Dec 5 22:28:26 UTC 2006 i686 GNU/Linux
> 
> Try 'modprobe ecc'.

My research found:
* Bluesmoke is now EDAC
* ecc.ko is part of the EDAC project
* EDAC has been somewhat Intel-centric in the past
* Mainline kernels include EDAC and support Intel chipsets
* 2.6.17-10-generic does not support Opteron
* The devel tree on SourceForge has Opteron support
* mcelog is the more AMD-centric way to do it
* mcelog seems reasonably popular (Red Hat and Ubuntu, anyway)
* mcelog seems to support numerous events, not just DIMM-related ECC errors

So while getting the ecc module to build would require a new kernel
(2.6.18 or newer) and custom patches from SourceForge, mcelog just requires
a small binary to read /dev/mcelog.  I ran it on 180 machines or so and
found one very unhappy node:

CPU 0 1 instruction cache TSC e6a7a079a8a84
ADDR 117b00
  Instruction cache ECC error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      instruction fetch mem transaction
      memory access, level generic'
STATUS d400400000000853 MCGSTATUS 0
MCE 5
CPU 0 2 bus unit TSC e6a7a079a8ccd
ADDR c500
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d400400000000813 MCGSTATUS 0
MCE 6
CPU 0 4 northbridge TSC e6a7a079a906a
ADDR 3ce5e0
  Northbridge ECC error
  ECC syndrome = 64
       bit32 = err cpu0
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d432400100000813 MCGSTATUS 0
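
For anyone who wants to sift through this kind of output from a lot of nodes, here is a minimal sketch (not part of mcelog) that pulls the STATUS words out of decoded text like the above and checks the two bits the decoder names in every record: bit 46 (corrected ecc error) and bit 62 (error overflow). The function names, the regex, and the SAMPLE text are my own assumptions for illustration; only the bit positions and the STATUS values come from the output above.

```python
import re

# Bit positions as printed by the decoder in the output above.
# Only the bits that appear in every record there are listed.
STATUS_BITS = {
    46: "corrected ecc error",
    62: "error overflow (multiple errors)",
}

# Matches lines such as "STATUS d400400000000853 MCGSTATUS 0".
# The pattern is an assumption based on the sample output.
STATUS_RE = re.compile(r"^STATUS\s+([0-9a-fA-F]+)\s+MCGSTATUS\s+([0-9a-fA-F]+)")


def decode_status(status_hex):
    """Return the names of the known bits that are set in a STATUS word."""
    value = int(status_hex, 16)
    return [name for bit, name in sorted(STATUS_BITS.items())
            if (value >> bit) & 1]


def summarize(text):
    """Return a list of (status_hex, flag_names) for each STATUS line."""
    results = []
    for line in text.splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            status_hex = match.group(1)
            results.append((status_hex, decode_status(status_hex)))
    return results


# The three STATUS lines from the unhappy node above.
SAMPLE = """\
STATUS d400400000000853 MCGSTATUS 0
STATUS d400400000000813 MCGSTATUS 0
STATUS d432400100000813 MCGSTATUS 0
"""

if __name__ == "__main__":
    for status_hex, flags in summarize(SAMPLE):
        print(status_hex, "->", ", ".join(flags))
```

Feeding it the saved output from each machine and flagging any node where summarize() returns a non-empty list would be one way to spot the bad ones; how you collect the output from 180 machines is left out here.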
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech


