We have 4 of these babies in the house. They are powerful, and they have been working 24/7 since they were installed about 5 years ago. Unfortunately, nothing is eternal, and they are starting to show signs of age.
For example, when they run at full power on processors and RAM, an annoying message pops up:
[6896093.455573] [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected
[6896093.468312] EDAC MC1: 1 CE on mc#1csrow#0channel#0 (csrow:0 channel:0 page:0xc9139a offset:0xb30 grain:0 syndrome:0xb903)
[6896093.468317] [Hardware Error]: Error Status: Corrected error, no action required.
The error is in fact corrected automatically, but that doesn’t make it any less annoying. What is it, really? Googling turns up several suggestions: Super User says it’s a North Bridge error. Another blog relates it to overheating. But what I want is to suppress it. For Debian systems there seems to be a solution. Unfortunately, it doesn’t seem to work on our CentOS 7. So we need to go for the hardware fix. The tool that lets us, at least, see the memory module errors is described here. I then do as suggested:
yum install edac-utils
edac-util -v
And yes, I find modules with errors. They look like this:
mc7: csrow0: mc#7csrow#0channel#0: 869 Corrected Errors
mc1: csrow0: mc#1csrow#0channel#0: 83 Corrected Errors
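With four machines and several memory controllers each, it helps to see the worst offenders first. Here is a small sketch that filters and sorts the report; the function name rank_ce is my own invention, and it assumes the lines look exactly like the two shown above (fields separated by ": ", count just before "Corrected Errors"):

```shell
# Hypothetical helper: rank channels by corrected-error count.
# Assumes input lines shaped like the edac-util -v output above, e.g.
#   mc7: csrow0: mc#7csrow#0channel#0: 869 Corrected Errors
rank_ce() {
  awk -F': ' '/ Corrected Errors$/ {
    split($NF, a, " ")            # a[1] is the error count
    if (a[1] + 0 > 0)             # skip channels with no errors
      print a[1], $(NF - 1)       # count, then the channel label
  }' | sort -rn                   # highest count first
}

# Usage (on the affected machine):
#   edac-util -v | rank_ce
```

Channels with zero corrected errors are dropped, so only the suspicious modules remain in the list.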
Next task: find out where they are and replace them with new ones 😦
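One possible starting point for locating them: the kernel's EDAC sysfs tree can carry a label per channel. The sketch below only turns a name like mc#7csrow#0channel#0 into the matching sysfs path. The function name label_path is hypothetical, and I am assuming the legacy csrow layout under /sys/devices/system/edac/mc; the label file may well be empty or missing on a given board, in which case the slot list from dmidecode -t memory and the motherboard manual are the fallback.

```shell
# Hypothetical helper: map an EDAC channel name to its sysfs label file.
# Assumes the legacy layout /sys/devices/system/edac/mc/mcN/csrowN/chN_dimm_label.
label_path() {
  printf '%s\n' "$1" | sed -E \
    's|^mc#([0-9]+)csrow#([0-9]+)channel#([0-9]+)$|/sys/devices/system/edac/mc/mc\1/csrow\2/ch\3_dimm_label|'
}

# Usage (on the affected machine; the file may be empty or absent):
#   cat "$(label_path 'mc#7csrow#0channel#0')"
```

Even when the label is empty, the controller, csrow and channel numbers narrow the search down to a socket and a bank of slots.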