
ECC Error every 314 seconds on Ryzen 5800X and Micron ECC SODIMM

Hi everyone. Here is a strange problem that you might find interesting. Comments are welcome, and I will keep this thread updated with my debugging progress (and any communication with vendors, where possible).

 

TL;DR for problem description

 

If both of these conditions are met:

  • RAM is inserted in DIMM_A1
  • The load is stable (idle or 100% CPU usage are both fine, as long as it is steady)

Then on my operating system (Linux 5.11.10-hardened), an ECC CE (correctable error) is logged every 5'14'' (314 seconds, or roughly 100 * pi), on a random page/bank at a random offset.


 

And here is part of the log (timestamps in GMT+8):

Apr 04 08:40:42 new_nas_server kernel: EDAC MC0: Giving out device to module amd64_edac controller F19h_M20h: DEV 0000:00:18.3 (INTERRUPT)
Apr 04 08:45:54 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000004f8c93a80
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 08:45:54 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x508c93 offset:0xa80 grain:64 syndrome:0x80)
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 08:51:08 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000018ee2f480
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xbe0700100a800903
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 08:51:08 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x19ee2f offset:0x480 grain:64 syndrome:0x10)
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:06:50 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000684a02380
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:06:50 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x694a02 offset:0x380 grain:64 syndrome:0x80)
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:12:04 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001628f49c0
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:12:04 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x1728f4 offset:0x9c0 grain:64 syndrome:0x80)
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:17:18 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000000f73a0e00
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:17:18 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x1073a0 offset:0xe00 grain:64 syndrome:0x80)
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:22:32 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000007c4e2b600
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:22:32 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x7d4e2b offset:0x600 grain:64 syndrome:0x80)
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:23:53 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000003f55f3740
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:23:53 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x4055f3 offset:0x740 grain:64 syndrome:0x80)
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:27:46 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0xdc2041000000011b
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000027a3176c0
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:27:46 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x28a317 offset:0x6c0 grain:64 syndrome:0x80)
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:33:00 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001c683b880
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:33:00 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x1d683b offset:0x880 grain:64 syndrome:0x80)
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:38:14 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000004275c9e80
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:38:14 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x4375c9 offset:0xe80 grain:64 syndrome:0x80)
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:43:28 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000017a5fcc80
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:43:28 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x18a5fc offset:0xc80 grain:64 syndrome:0x80)
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:48:42 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000002b4798280
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:48:42 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c4798 offset:0x280 grain:64 syndrome:0x80)
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:53:56 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000025bfea4c0
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:53:56 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x26bfea offset:0x4c0 grain:64 syndrome:0x80)
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:58:56 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:58:56 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:58:56 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:58:57 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000004c7993f80
Apr 04 09:58:57 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:58:57 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:58:57 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x4d7993 offset:0xf80 grain:64 syndrome:0x80)
Apr 04 09:58:57 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:03:57 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000002c11fa380
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:03:57 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x2d11fa offset:0x380 grain:64 syndrome:0x80)
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:09:11 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000025f6b0700
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:09:11 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x26f6b0 offset:0x700 grain:64 syndrome:0x80)
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:14:25 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000002f620c840
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:14:25 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x30620c offset:0x840 grain:64 syndrome:0x80)
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:19:39 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000014273bb80
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:19:39 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x15273b offset:0xb80 grain:64 syndrome:0x80)
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:24:53 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000786d15580
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:24:53 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x796d15 offset:0x580 grain:64 syndrome:0x80)
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:30:07 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000351090980
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:30:07 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x361090 offset:0x980 grain:64 syndrome:0x80)
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:35:21 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000212453f00
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:35:21 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x222453 offset:0xf00 grain:64 syndrome:0x80)
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:40:35 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001e2df0e80
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:40:35 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x1f2df0 offset:0xe80 grain:64 syndrome:0x80)
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:45:49 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000721870f00
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:45:49 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x731870 offset:0xf00 grain:64 syndrome:0x80)
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:51:03 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000144e82b00
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:51:03 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x154e82 offset:0xb00 grain:64 syndrome:0x80)
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:56:03 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000031770b180
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:56:04 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x32770b offset:0x180 grain:64 syndrome:0x80)
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:01:04 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000232e46340
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:01:04 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x242e46 offset:0x340 grain:64 syndrome:0x80)
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:06:18 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001667a1240
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:06:18 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x1767a1 offset:0x240 grain:64 syndrome:0x80)
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:11:32 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000428b809c0
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:11:32 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x438b80 offset:0x9c0 grain:64 syndrome:0x80)
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:16:46 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001c1d8ce80
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:16:46 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x1d1d8c offset:0xe80 grain:64 syndrome:0x80)
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:22:00 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000006459b19c0
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:22:00 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x6559b1 offset:0x9c0 grain:64 syndrome:0x80)
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:27:14 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000034e7aa140
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:27:14 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x35e7aa offset:0x140 grain:64 syndrome:0x80)
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:32:28 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000005a6933e80
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:32:28 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x5b6933 offset:0xe80 grain:64 syndrome:0x80)
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
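
To make the MC18_STATUS values in the log easier to compare, here is a minimal sketch that turns a raw status value into flag names. The bit positions are my understanding of how the Linux kernel decodes MCA_STATUS on AMD systems, so treat them as an assumption and check them against AMD's documentation; `decode_status` and `STATUS_BITS` are names made up for this example.

```python
# Bit positions are an assumption based on how the Linux kernel prints
# MCA_STATUS flags on AMD systems; verify against AMD's documentation.
STATUS_BITS = [
    (63, "Val"), (62, "Over"), (61, "UC"), (60, "En"), (59, "MiscV"),
    (58, "AddrV"), (57, "PCC"), (55, "TCC"), (53, "SyndV"),
    (46, "CECC"), (45, "UECC"), (44, "Deferred"), (43, "Poison"),
    (40, "Scrub"),
]


def decode_status(status: int) -> list[str]:
    """Return the names of the flag bits that are set, highest bit first."""
    return [name for bit, name in STATUS_BITS if (status >> bit) & 1]


if __name__ == "__main__":
    # The two distinct status values that appear in the log above.
    for value in (0xDC2040000000011B, 0xDC2041000000011B):
        print(hex(value), decode_status(value))
```

The only difference between the two values is bit 40, which matches the single log entry (09:27:46) where the kernel prints `Scrub` instead of `-`.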

 

Background Information

 

I'm assembling my new NAS with the following configuration (sorted by relevance to this thread):

  • CPU: Ryzen 7 5800X
    • microcode updated, patch level 0x0a201009
    • Stepping 0
  • Memory: 1 x Micron MTA18ASF4G72HZ-3G2B1 (datasheet)
    • Running at DDR4-2933 as reported by BIOS
    • CLS arguments are
  • Motherboard: ASRock Rack X570D4I-2T (manual)
    • BIOS 2.1.0 (AGESA ComboV2 1.1.0.0)
    • BMC Firmware 01.40.00
    • All settings in BIOS and BMC are default, except for Boot Filter (UEFI Only), BMC IP address and passwords
  • Cooler: Noctua NH-L9i
  • Case: the old case of an HP ProLiant MicroServer Gen8, with some modifications. Those (physical) modifications are not finished yet, so for now it runs on an open workbench:

[Image: photo of the current workbench setup]

 

Measures taken and results

  • Cleaned the gold fingers (DIMM edge contacts): CEs continue
  • Moved the RAM to DIMM_A2 or DIMM_B2: won't boot (as expected)
  • Moved the RAM to DIMM_B1: no further errors, but I don't want to settle on B1, since that only hides the problem and I hope to add more memory later
  • Ran memtest86 (both the freeware version by PassMark and the FOSS version, memtest86+):
    • RAM in DIMM_A1:
      • The FOSS version fails at a random block in the first 1M if SMP is enabled; with SMP disabled or the SMP mode changed to round-robin, no error is reported
      • The freeware (paid) version always passes
    • RAM in DIMM_B1: same as A1
    • A2/B2: not bootable, so N/A
  • Ran stress and memtest as memory pressure tests under Linux: the number of errors goes up or down at first, but after ~10 min (i.e. 2 rounds of errors) the periodic CEs continue
  • With the DRAM in A1, gently applied force to the motherboard in the following ways: CEs continue (DO NOT TRY THIS AT HOME IF YOU DON'T KNOW IR SOLDERING):
    • Loosened/tightened the CPU cooler screws
    • Bent the board by less than 2 degrees
  • Read through all the BIOS settings and the related manuals, but found nothing that looks relevant
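
For comparing the measures above, it can help to read the EDAC counters directly instead of scanning the kernel log. A minimal sketch, assuming the EDAC driver is loaded and exposes per-controller `ce_count` files under `/sys/devices/system/edac/mc`; the function name `read_ce_counts` is made up for this example.

```python
from pathlib import Path


def read_ce_counts(root: str = "/sys/devices/system/edac/mc") -> dict[str, int]:
    """Return {controller name: corrected-error count} for every mc* directory."""
    counts = {}
    for path in sorted(Path(root).glob("mc*/ce_count")):
        counts[path.parent.name] = int(path.read_text().strip())
    return counts


if __name__ == "__main__":
    # Prints an empty dict if no EDAC memory controller is registered.
    print(read_ce_counts())
```

Running this before and after each change (slot swap, stress run, and so on) gives a simple number to compare.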

Suspects

I think at least one of these is at fault (sorted by likelihood based on current findings):

  • Software problems, including:
    • The default BIOS settings for this motherboard cause false alarms in the Linux MCE-related subsystems
    • The default BIOS settings cause data to be changed without the ECC bits being set properly
    • Linux MCE handling simply does not work properly on Ryzen
    • ......
    • A software problem is quite plausible, since 314 seconds looks periodic rather than random
  • The CPU memory controller for channel A (mc0) is broken (which means RMA 😞).
    As far as I know, on Ryzen the memory is connected directly to the CPU, and if I put the memory on channel B (maybe a different MC?), the runtime errors disappear. So mc0 looks suspicious.
  • The RAM is not stable
    DIMM_A1 is much shorter than B1, so some physical effect (signal reflection, etc.) could expose marginally unstable RAM. Additionally, the vendor of this RAM is also somewhat suspicious, since I have never bought anything from them before.
  • A motherboard trace is broken, which could be a trace from A1 to the CPU, from B1 to the CPU, or between the DIMM slot and the motherboard
    This is plausible because the outer packaging looked roughly handled when I received the motherboard, although the inner box seemed fine.
  • The CPU socket contacts are not making good contact
    Least likely. According to the pinout diagram (source), the MA_* pins are very close to the MB_* pins, and are located even further inside.
  • EMI
    This can't explain why B1 works. There are no high-power devices operating near me.

To-Do List

  • Check whether Windows (PE) produces the same system logs (if it can generate any, hehe)
  • Reseat the CPU
  • Contact AMD and ASRock for help since I'm not sure if this is really some hardware problem
  • Try a new memory
  • Try a new CPU

 

The most interesting thing is: why 314 seconds? Is this something periodic, or am I just reading it wrongly?

 

In my R&D experience, even if I need a counter for something like drawing a circle, I would use 2**n instead of pi, since that is how binary works. Oscillators generating sine waves don't really use pi either, although the sine function has some historical relationship with it. This makes me think these CEs may be a software bug, not a hardware problem. But why would this happen.... I spent about 16 hours straight changing sockets and reading kernel source code... but I still have no idea.
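To check whether the period really is 314 seconds and not just roughly five minutes, the gaps between consecutive "Machine check events logged" lines can be computed directly from the journal text. The following is only a minimal sketch: the sample lines are copied from the log excerpt in the first post, while the function name `mce_intervals` and the assumed year are made up for illustration.

```python
from datetime import datetime

# Sample lines copied from the journal excerpt at the top of this thread.
SAMPLE = """\
Apr 04 08:45:54 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 08:51:08 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
"""

MARKER = "Machine check events logged"


def mce_intervals(log_text, year=2021):
    """Return the seconds between consecutive MCE marker lines.

    Syslog-style timestamps carry no year, so one is assumed here
    (hypothetical default); it only matters across a year boundary.
    """
    stamps = []
    for line in log_text.splitlines():
        if MARKER not in line:
            continue
        # The first three whitespace-separated fields are month, day, HH:MM:SS.
        month, day, clock = line.split()[:3]
        stamps.append(
            datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        )
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


print(mce_intervals(SAMPLE))  # [314.0]
```

Feeding it a longer journal dump would show whether the interval drifts, which is easier to judge from a list of numbers than from a graph.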

 

Anyway, thanks for everyone's help. If you have met something like this before, I hope you can give me some pointers. It would be even better if engineers from AMD, Micron or ASRock are interested in this.

 

Totally pi**ed off,

izumi_konata

 

 

I'm way off from being an expert here but I recently had issues with memory stability on a cheap B450 board and a R5 2600. I had RAM in the first two slots nearest the CPU, moving them to the last two slots farthest away solved it. I think it's something to do with the IMC in Ryzen, you need to populate the farthest slots first for some reason. I used Prime95 large fft to try to pin down the error, it was a different rounding error each time and took a different amount of time to surface. Went away 100% when I swapped the memory slots. I don't think it's just getting covered up, I think that it's just a Ryzen quirk maybe.

12 hours ago, Bitter said:

I don't think it's just getting covered up, I think that it's just a Ryzen quirk maybe.

I think this is reasonable... since now the error does not show up at exactly 314 seconds:

 

image.thumb.png.75148ac5b6b7a65aa3baa54d69846ac5.png

 

During 13:00-24:00 the only thing being turned off is my LED desk lamp...

 

Maybe I should buy another memory module and put it in B1. If the errors on A1 are gone, then this is some Ryzen mystery.

Status update: 

  • Ordered another RAM module from the same vendor for testing. I need more memory anyway
  • New finding: when the RAM is inserted in the A slot, the info in /sys/devices/system/edac/mc/mc0/ is slightly different. The only difference (B slot on the left, A slot on the right): image.thumb.png.b0392c946979a4702f641a00168569c6.png
  • And no, the lamp has nothing to do with it. I just forgot to stop some tasks.
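The slot comparison above can also be done as text instead of side-by-side screenshots. This is a rough sketch that dumps the readable attribute files directly under the mc0 directory into a dict and diffs two such dumps. The helper names `snapshot` and `diff_snapshots` are made up for illustration, and it makes no assumption about which attribute files your kernel exposes.

```python
import json
import os

MC_DIR = "/sys/devices/system/edac/mc/mc0"  # path mentioned in the post above


def snapshot(directory):
    """Read every regular, readable file directly under `directory` into a dict."""
    result = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories such as per-DIMM folders
        try:
            with open(path, "r", errors="replace") as fh:
                result[name] = fh.read().strip()
        except OSError:
            # Some sysfs attributes are write-only or need root; skip them.
            continue
    return result


def diff_snapshots(before, after):
    """Return {name: (old, new)} for attributes that differ or exist on one side only."""
    names = sorted(set(before) | set(after))
    return {n: (before.get(n), after.get(n)) for n in names if before.get(n) != after.get(n)}


if __name__ == "__main__" and os.path.isdir(MC_DIR):
    # Save this output once per slot configuration, then compare the two files.
    print(json.dumps(snapshot(MC_DIR), indent=2))
```

Running it once with the DIMM in B1 and once in A1, then passing the two saved dicts to `diff_snapshots`, lists only the attributes that changed.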
     
I love lamp.

 

Keep pluggin away dude, took me like 2 weeks to figure out my foulup with the R5 2600 and memory.

Hi all,

 

As of today I have a very similar problem:

 mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:60:1) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
[Hardware Error]: Error Addr: 0x00000000601a3480
[Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000020080a400f03
[Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x601a3 offset:0x480 grain:64 syndrome:0x2008)
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
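To compare reports between machines, the EDAC line in the log above can be split into its fields. The sketch below parses that line with a regular expression and rebuilds a byte address from `page` and `offset`, assuming 4 KiB pages (an assumption on my side). The names `EDAC_CE` and `parse_edac_ce` are made up for illustration.

```python
import re

# Pattern for the "EDAC MCn: N CE on ..." line as it appears in the log above.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE on \S+ "
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+) "
    r"grain:(?P<grain>\d+) syndrome:(?P<syndrome>0x[0-9a-fA-F]+)\)"
)

PAGE_SHIFT = 12  # assumes 4 KiB pages


def parse_edac_ce(line):
    """Return the fields of an EDAC CE line as ints, or None if it does not match."""
    match = EDAC_CE.search(line)
    if match is None:
        return None
    fields = {}
    for key, value in match.groupdict().items():
        # Hex fields carry a 0x prefix; the rest are plain decimal.
        fields[key] = int(value, 16) if value.startswith("0x") else int(value)
    # Rebuild a byte address from page number and in-page offset.
    fields["address"] = (fields["page"] << PAGE_SHIFT) | fields["offset"]
    return fields


LINE = ("EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 "
        "page:0x601a3 offset:0x480 grain:64 syndrome:0x2008)")
record = parse_edac_ce(LINE)
print(hex(record["address"]))  # 0x601a3480
```

For this log the rebuilt address equals the "Error Addr" line (0x00000000601a3480). In the first post's log the two values differ (page 0x508c93 versus Error Addr 0x4f8c93a80), so the match should not be taken for granted on every system.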

The error occurs every 311.29 s on Debian Linux (bullseye, most recent kernel)

Hardware:
Asrock B550 Pro4 (Bios 1.8)
2x Kingston 16GB ECC @2667 MT/s (dmidecode output) Bank: P0/Channel A&B
AMD Ryzen 4650G Pro (Stepping1, Microcode 0x8600106)

Maybe we can find some similarities.

Hope we find a solution soon.

Joe

35 minutes ago, Joe217 said:


Hope we find a solution soon.

Joe

Run the memory at JEDEC speed instead of overclocked; you will probably get fewer errors or none at all.

 

ECC memory is more or less designed to put in and run. 

 

The original poster is running at 2933 MHz and you are at 2667 MHz, while all systems generally POST at 2133 MHz.

 

I think some testing at defaults would be the way to go first, then maybe some OC.

On 4/5/2021 at 9:06 PM, ShrimpBrime said:

Run the memory at JEDEC speed instead of overclocked; you will probably get fewer errors or none at all.

Thank you for your idea!


I am not overclocking the ECC memory. The memory is certified by ASRock for board compatibility and was simply plugged into the board without any adjustments in the BIOS (I assume this is what running with default parameters means)... The memory JEDEC specs are "16GB 2Rx8 PC4-2666V-EE1-11", so 2666 seems fine to me; the only way to go down to 2133 would be underclocking (if this is possible at all with the board). I will try underclocking the memory, maybe this solves the problem.

 

Sorry forgot to mention the memory details: 2x Kingston KSM26ED8/16ME (dmidecode part number 9965745-002.A00G)

1 hour ago, Joe217 said:

I am not overclocking the ECC memory. The memory is certified by ASRock for board compatibility and was simply plugged into the board without any adjustments in the BIOS (I assume this is what running with default parameters means)... The memory JEDEC specs are "16GB 2Rx8 PC4-2666V-EE1-11", so 2666 seems fine to me; the only way to go down to 2133 would be underclocking (if this is possible at all with the board). I will try underclocking the memory, maybe this solves the problem.

JEDEC specs go from (if I remember correctly) 1600 MHz through 2667 MHz.

 

Even if the memory is rated at only 2667 MHz, it may still have lower-frequency profiles.

 

Will have to wait to look at your specs. I'm mobile rn.

decode-dimms gives the following output:

Guessing DIMM is in                              bank 3
Kernel driver used                               eeprom

---=== SPD EEPROM Information ===---
EEPROM CRC of bytes 0-125                        OK (0xC434)
# of bytes written to SDRAM EEPROM               384
Total number of bytes in EEPROM                  512
Fundamental Memory type                          DDR4 SDRAM
SPD Revision                                     1.1
Module Type                                      UDIMM
EEPROM CRC of bytes 128-253                      OK (0xD6A6)

---=== Memory Characteristics ===---
Maximum module speed                             2666 MT/s (PC4-21300)
Size                                             16384 MB
Banks x Rows x Columns x Bits                    16 x 16 x 10 x 64
SDRAM Device Width                               8 bits
Ranks                                            2
Rank Mix                                         Symmetrical
Primary Bus Width                                64 bits
Bus Width Extension                              8 bits
AA-RCD-RP-RAS (cycles)                           19-19-19-43
Supported CAS Latencies                          20T, 19T, 18T, 17T, 16T, 15T, 14T, 13T, 12T, 11T, 10T

---=== Timings at Standard Speeds ===---
AA-RCD-RP-RAS (cycles) as DDR4-2666              19-19-19-43
AA-RCD-RP-RAS (cycles) as DDR4-2400              17-17-17-39
AA-RCD-RP-RAS (cycles) as DDR4-2133              15-15-15-35
AA-RCD-RP-RAS (cycles) as DDR4-1866              13-13-13-30
AA-RCD-RP-RAS (cycles) as DDR4-1600              11-11-11-26

---=== Timing Parameters ===---
Minimum Cycle Time (tCKmin)                      0.750 ns
Maximum Cycle Time (tCKmax)                      1.600 ns
Minimum CAS Latency Time (tAA)                   13.750 ns
Minimum RAS to CAS Delay (tRCD)                  13.750 ns
Minimum Row Precharge Delay (tRP)                13.750 ns
Minimum Active to Precharge Delay (tRAS)         32.000 ns
Minimum Active to Auto-Refresh Delay (tRC)       45.750 ns
Minimum Recovery Delay (tRFC1)                   350.000 ns
Minimum Recovery Delay (tRFC2)                   260.000 ns
Minimum Recovery Delay (tRFC4)                   160.000 ns
Minimum Four Activate Window Delay (tFAW)        21.000 ns
Minimum Row Active to Row Active Delay (tRRD_S)  3.000 ns
Minimum Row Active to Row Active Delay (tRRD_L)  4.900 ns
Minimum CAS to CAS Delay (tCCD_L)                5.000 ns
Minimum Write Recovery Time (tWR)                15.000 ns
Minimum Write to Read Time (tWTR_S)              2.500 ns
Minimum Write to Read Time (tWTR_L)              7.500 ns

---=== Other Information ===---
Package Type                                     Monolithic
Maximum Activate Count (MAC)                     Unlimited
Post Package Repair                              One row per bank group
Soft PPR                                         Supported
Module Nominal Voltage                           1.2 V
Thermal Sensor                                   TSE2004 compliant

---=== Physical Characteristics ===---
Module Height                                    32 mm
Module Thickness                                 2 mm front, 2 mm back
Module Reference Card                            E revision 1

 

It looks to me as if the DIMM has profiles from 1600 up to 2666.
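The "Timings at Standard Speeds" table can be reproduced from the nanosecond values in "Timing Parameters", which is a quick way to see what the board should pick after downclocking. This is a simplified sketch: it takes the clock period as 2000 / data rate and rounds up to whole cycles, which matches the table here but is not necessarily the exact rounding rule decode-dimms uses. The function names are made up for illustration.

```python
import math

# Minimum timings in nanoseconds, copied from the decode-dimms output above.
T_AA, T_RCD, T_RP, T_RAS = 13.750, 13.750, 13.750, 32.000


def cycles(t_ns, data_rate_mts):
    """Convert a minimum time in ns to whole clock cycles at a DDR data rate."""
    # DDR transfers twice per clock, so the clock period in ns is 2000 / (MT/s).
    tck_ns = 2000.0 / data_rate_mts
    # Small epsilon so an exact multiple is not pushed up by float noise.
    return math.ceil(t_ns / tck_ns - 1e-9)


def timings(data_rate_mts):
    """Return (AA, RCD, RP, RAS) in cycles for the given data rate."""
    return tuple(cycles(t, data_rate_mts) for t in (T_AA, T_RCD, T_RP, T_RAS))


for rate in (2666, 2400, 2133, 1866, 1600):
    print(f"DDR4-{rate}: " + "-".join(str(c) for c in timings(rate)))
```

At DDR4-2133 this gives 15-15-15-35, the same row decode-dimms prints, so those are the timings to expect in the BIOS after downclocking.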

 

Update (April 15th, uptime 6 days): With memory downclocked to 2133 the system seems to be running w/o errors and stable...

 

@ShrimpBrime Thank you again for your help!

Welcome 🙂

  • 5 months later...

This thread is kind of old... but after nearly 5 months I finally got an identical memory module. After I plugged the two modules into A1 and B1, there have been no more MCEs during the last 24 hours. So I think ShrimpBrime is right (and thanks), it's some compatibility issue.

 

I'm going to run at 2999 even if the MCEs happen again, since they do not seem to affect my processes...

 

 

@ShrimpBrime @Joe217 

2 hours ago, izumi_konota said:

This thread is kind of old... but after nearly 5 months I finally got an identical memory module. After I plugged the two modules into A1 and B1, there have been no more MCEs during the last 24 hours. So I think ShrimpBrime is right (and thanks), it's some compatibility issue.

Hey thanks for coming back with a good news update! Glad it's working well. 

  • 2 years later...

Update after 2 years:

 

In 2021 I opened this thread. The problem was solved by swapping two memory sticks.

 

However, in October, after I swapped the M.2 SSD, the issue came back. Nearly everything was the same as before, including the 314-second period. It's so weird. The chassis design does not require touching the memory area when changing the SSD. ESD is also not a good explanation, because although MCEs still happen, they hit random banks and addresses.

 

I decided to ignore the issue since it does not affect running services, and went on to work on another small ITX-based machine. The new machine is also very small and accepts 12V VO input from an external jack. The case is basically six pieces of aluminum and very short in height, not leaving much room above the CPU heat sink. It leaves about 3 inches of room for a PCIe card (riser card required) and a power supply. But I already have a 12V bus in my rack, so the power supply is not necessary. I made a 12V CPU power cable with an XT60 connector so I can put the connector outside the case. Anyway, I kept the box open until everything worked fine. No ECC errors for days.

 

Then I finished everything and closed the chassis. The next morning, the journal reported an EDAC error. Since this machine has an Intel CPU, I could clearly see that the error was reported from slot A1. The error occurred about 10 hours after the host booted. I thought it was a random error and ignored it.

 

10 hours later, the same issue was reported on slot A2. Another 10 hours later, A1. Another 10 hours later, A1.

 

image.thumb.png.c921e9123e95a375a2431a154e983265.png

 

Okay. That's not a small problem. But how can both memory sticks have CEs with a different period for each stick, yet a constant period when counting them all together?

 

I opened the case, thought for a while, reseated the memory, and closed the case. The 10-hour issue was still there.

 

Then I opened the case again, reconnected memory again, and closed the case.

 

The problem disappears.

 

Weird, right?

 

Then I realized this has nothing to do with the memory. Luckily, I had installed a camera in my rack, and this time I did all the work right in the rack. After comparing the recordings, I found the real reason: the 12V power cable was running above the A1 and A2 memory, with the wire's insulation lightly touching the PCB. Meanwhile, once the case was closed, another part of the insulation touched the chassis cover. The second time I reseated the memory, I routed the power cable below the latches of the memory slots.

 

Then I remembered this 314-second problem. I pulled out the old machine and found that a power cable was also touching the memory stick. I did some cable management, and it has run well for a month. No memory errors have been reported since.

 

So that's it. Not a memory problem, but a magic problem. My idea is: although the memory PCB, the cable insulation, and the chassis cover are all well insulated and non-conducting, they can still act as some kind of capacitor, forming some kind of oscillating network and glitching the DRAM signals.

 

This was a pretty important lesson for me. Now I know cable management is useful. At the very least, avoid running cables above your memory.

23 hours ago, izumi_konota said:

So that's it. Not a memory problem, but a magic problem. My idea is: although the memory PCB, the cable insulation, and the chassis cover are all well insulated and non-conducting, they can still act as some kind of capacitor, forming some kind of oscillating network and glitching the DRAM signals.

Capacitance could be the cause, you have conductors and insulators and a really sensitive component, a small charge could build and then discharge across skin effect from cable to PCB to ground through the memory causing the error. Fascinating, thank you for updating.
