Hello!

 

Spec list:

 

  • Seasonic Focus GX-850 Gold (around 3–4 years old)

  • 5950X

  • TUF GAMING X570-PLUS (WI-FI) with latest stable BIOS (5021)

  • Dark Rock Pro 4

  • CL14 64GB (4x16GB)

  • ASUS RX 6800 XT Gaming OC

  • DeepCool Kendomen with 8 Arctic P12 Max fans

  • 9100 PRO 2TB + 6x 870 EVO 1TB SATA SSDs

The 64GB is made up of two kits. One kit is:
32GB CL14 B-die kit (G.Skill Trident Z Neo F4-3600C14D-32GTZN, Intel XMP ready).

 

The other is:
32GB CL14 B-die kit (G.Skill Trident Z Royal F4-3600C14D-32GTRGA, Intel XMP ready).

 


OC CPU:

 

curve:

 

⭐ Best cores (02,06,09): -8
🔥 Hot cores (01,05,03): -3
VID voltage hungry (11,12,13): -15
🌡 All other cores (0,2,4,7,8,10,14,15): -10
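
A quick sanity check on the per-core list above, as a minimal Python sketch. It assumes cores are numbered 0 to 15 (the list includes both 0 and 15), and the group names and `check_co_groups` helper are made up for illustration. It only reports cores that appear in more than one group or in none.

```python
from collections import Counter

# Per-core Curve Optimizer offsets exactly as listed in the post.
# The group labels are shorthand for this sketch only.
CO_GROUPS = {
    "best": ([2, 6, 9], -8),
    "hot": ([1, 5, 3], -3),
    "vid_hungry": ([11, 12, 13], -15),
    "other": ([0, 2, 4, 7, 8, 10, 14, 15], -10),
}
CORE_COUNT = 16  # assumption: 5950X cores numbered 0..15


def check_co_groups(groups, core_count):
    """Return (cores listed more than once, cores never listed)."""
    seen = Counter(core for cores, _ in groups.values() for core in cores)
    duplicates = sorted(core for core, n in seen.items() if n > 1)
    missing = sorted(set(range(core_count)) - set(seen))
    return duplicates, missing


duplicates, missing = check_co_groups(CO_GROUPS, CORE_COUNT)
print("listed twice:", duplicates)
print("never listed:", missing)
```

As posted, core 2 shows up in both the "best" group (-8) and the "all other" group (-10), so one of those two entries is probably a typo worth double-checking against the BIOS.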

 

PBO Advanced:

 

  • PPT: 200 W

  • TDC: 160 A

  • EDC: 150 A

  • Scalar: auto

  • Boost Override: +50 MHz

  • CPPC → Enabled

  • CPPC Preferred Cores → Enabled

  • PBO Fmax Enhancer → Disabled

  • VDDCR SOC Power Phase Control → Extreme

  • VDDCR CPU Power Phase Control → Extreme


RAM OC:

 

D.O.C.P:

 

  • 3600 MT/s (shown as 3600 MHz in the BIOS)

  • 14-14-14-34

  • 1.495 V DRAM

  • FCLK 1800 MHz

  • SOC 1.1625 V

Anything below 1.48 V DRAM and 1.1 V SOC gives TestMem5 errors.

 


ISSUE:

 

I'm getting MCE errors on Linux, pretty rarely. I've never had such errors on Windows (I've never hosted VMs there, but I played games and left my PC running for weeks, etc.)

 

My setup is extremely demanding on the IMC, but:

 

I've run dozens of tests (MemTest86, TestMem5, Prime95, y-cruncher, Linpack Xtreme, OCCT, HCI MemTest), averaging about 8 hours each (sometimes I left the PC running for over 24 hours per test), and I've never had any errors or crashes on Windows.

 

I daily-drive Arch, and with my OC setup I get MCE errors fairly rarely, usually about once a week (this machine is also my server and hosts 15 Windows 11 VMs under full load 24/7):

 


[Hardware Error]: System Fatal error.
[Hardware Error]: CPU:17 (19:21:2) MC5_STATUS[-|UE|-|-|-|-]: 0xbea0000001000108
[Hardware Error]: Error Addr: 0x00007ffd216e1517
[Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
[Hardware Error]: Execution Unit Ext. Error Code: 0
[Hardware Error]: cache level: RESU, tx: GEN, mem-tx: GEN

 


I then turned off PBO + CO and got the error again, this time within 24 hours:

 

[Hardware Error]: System Fatal error.
[Hardware Error]: CPU:1 (19:21:2) MC5_STATUS[-|UE|-|-|-|-]: 0xbaa0000000090150
[Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000020
[Hardware Error]: Execution Unit Ext. Error Code: 9
[Hardware Error]: cache level: RESU, tx: INSN, mem-tx: IRD
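
For anyone reading along, here is a minimal Python sketch that pulls a few fields out of the two raw MC5_STATUS values above. The flag positions in bits 63 to 57 are the architectural x86 machine-check ones. The extended error code in bits 21:16 and the TT/RRRR name tables are how I recall the Linux AMD decoder treating them, so treat those as assumptions. `decode_status` is a made-up helper name, and the sketch reads the raw hex only; it does not try to reproduce the kernel's bracketed summary.

```python
# Architectural MCi_STATUS flag bits (x86 machine-check architecture).
FLAGS = {63: "Val", 62: "Over", 61: "UC", 60: "En",
         59: "MiscV", 58: "AddrV", 57: "PCC"}

# Name tables as recalled from the Linux AMD MCE decoder (assumption).
TT = ["INSN", "DATA", "GEN", "RESV"]
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Decode selected fields of a 64-bit MCi_STATUS value."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    error_code = status & 0xFFFF
    decoded = {
        "flags": flags,
        "ext_code": (status >> 16) & 0x3F,  # assumed AMD field, bits 21:16
        "error_code": error_code,
    }
    # Memory-hierarchy error codes have the form 0000 0001 RRRR TTLL.
    if error_code >> 8 == 0x01:
        rrrr = (error_code >> 4) & 0xF
        decoded["tx"] = TT[(error_code >> 2) & 0x3]
        decoded["mem_tx"] = RRRR[rrrr] if rrrr < len(RRRR) else "unknown"
    return decoded


for raw in (0xBEA0000001000108, 0xBAA0000000090150):
    print(hex(raw), decode_status(raw))
```

Both values decode as uncorrected (UC) with PCC set. Only the first has AddrV set, which matches it being the only dump with an Error Addr line, and the extended codes come out as 0 and 9, matching the "Ext. Error Code" lines. The kernel labels this bank "Execution Unit" in both dumps, which may be worth keeping in mind when narrowing down the cause.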

 


So right now I'm stuck at the point where I have to use PBO + CO to reduce the frequency of these errors and deal with crashes once a week. I know that this is a pretty heavy load for the IMC, but maybe there are some tweaks I can apply to mitigate errors without reducing OC speed?

I'm 99.9% sure that this is due to the number of RAM sticks + tight timings + fairly high RAM speed.

 

I would like to keep 1:1 FCLK and definitely CL14 timings due to the latency, which I really need.

Are there any suggestions? I can provide logs as soon as new errors pop up (I've loosened my timings to 14-15-15-15-35 for now and left PBO + CO off, with 15 VMs running).

Link to comment
https://linustechtips.com/topic/1636107-linux-oc-mce-errors-4-dimms/

16 minutes ago, newbie0c said:

I'm 99.9% sure that this is due to the amount of ram sticks + tight timings + pretty high RAM speed. 

That would be my guess as well. B-die is notoriously hard on the CPU's IMC, and 3600 with quad rank memory is already pushing the limits of the memory IC. Unless you got lucky with a good memory controller, that would probably be the issue.

 

17 minutes ago, newbie0c said:

I would like to keep 1:1 FCLK and definitely CL14 timings due to the latency, which I really need.

1:1 FCLK sure, but CL14 latency is a bit more debatable whether you actually need it or not (CL14 and CL16 are almost identical for overall system performance all else being equal). I doubt it would fix anything changing it, but just figured I'd mention it. 
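
To put rough numbers on the CL14 vs CL16 question, here is a small Python sketch that converts CAS latency from clock cycles to nanoseconds and shows the FCLK needed for 1:1 at each speed. It only covers the CAS component, not end-to-end memory latency, and `cycles_to_ns` is a helper name made up for this example.

```python
def cycles_to_ns(cycles, data_rate_mts):
    """Convert DRAM clock cycles to nanoseconds for a DDR data rate in MT/s."""
    clock_mhz = data_rate_mts / 2  # DDR: two transfers per memory clock
    return cycles * 1000 / clock_mhz


for rate in (3600, 3466):
    fclk_mhz = rate / 2  # 1:1 means FCLK equals the memory clock
    for cl in (14, 16):
        print(f"DDR4-{rate} CL{cl}: {cycles_to_ns(cl, rate):.2f} ns CAS, "
              f"1:1 FCLK = {fclk_mhz:.0f} MHz")
```

At 3600 MT/s the gap between CL14 and CL16 works out to about 1.1 ns of CAS latency. Whether that matters for a given workload is a separate question that only benchmarking can answer.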

 

 

Anyway, for the steps I'd go through to fix this:

  1. Up the SOC voltage.
    1. Since your issue is likely the memory controller, increasing the SOC voltage can help. This is safe up to 1.2V, though the sweet spot tends to be around 1.15V (my 5900X did its best memory OC at around 1.175V, so try values around there to see if the errors go away). 
  2. Loosen tRCD/tRP
    1. Those timings are known to be the ones that stress the memory controller the most with B-die for whatever reason, so raising them from 14 to 16 should help alleviate some of that IMC stress. 
  3. Back off the frequency to something a bit lower.
    1. 3466MT/s should be more reliable if the other things don't fix it. 


 

Thanks for the reply!
I've wanted to raise SOC as well, but I was kind of doubtful about it (I thought it would have a negative effect). I'll try that ASAP!

I also forgot to mention that I have tRCD/tRP set on auto. Should I set them manually? And set the others like this:
 

 - 580-620 for tRFC

 - 35-36 tRAS

 - tRC 52
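
A small Python sketch for sanity-checking values like the ones above. It assumes the tRFC figures are in memory clock cycles at 3600 MT/s, and it uses the common rule of thumb that tRC should be at least tRAS + tRP. The helper names are made up for this example.

```python
def cycles_to_ns(cycles, data_rate_mts):
    """Convert DRAM clock cycles to nanoseconds for a DDR data rate in MT/s."""
    return cycles * 2000 / data_rate_mts


def trc_is_consistent(tras, trp, trc):
    """Rule of thumb: tRC should be at least tRAS + tRP."""
    return trc >= tras + trp


DATA_RATE = 3600  # MT/s, from the D.O.C.P settings in the first post

for trfc in (580, 620):
    print(f"tRFC {trfc} cycles = {cycles_to_ns(trfc, DATA_RATE):.1f} ns")

for tras in (35, 36):
    for trp in (14, 15, 16):
        ok = trc_is_consistent(tras, trp, trc=52)
        print(f"tRAS {tras} + tRP {trp} vs tRC 52: {'ok' if ok else 'too tight'}")
```

With tRC at 52, every combination of tRAS 35-36 and tRP 14-16 passes the check, and tRAS 36 with tRP 16 lands exactly on 52. The tRFC range works out to roughly 322 to 344 ns under the stated assumption.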
 

 

 

And is there any way to manually trigger that MCE error?


17 minutes ago, newbie0c said:

I also forgot to mention that i have tRCD/tRP set on auto. should i make it manual? and set it like this:
 

 - 580-620 for tRFC

 - 35-36 tRAS

 - tRC 52

On most of the boards I've used, leaving tRCD/tRP on auto sets them to the XMP values, so I would manually set them to something a bit looser like 15 or 16. 

 

As for those other timings, leaving those on auto is fine, they shouldn't really affect IMC stability. 

 

18 minutes ago, newbie0c said:

and is there any chance to manually trigger that MCE error?

That I have no clue about. I've never personally run into errors like that on Linux, but granted personally the only systems I have running Linux are ones where I will not do any OC whatsoever (I'll usually underclock them for power efficiency). 

 

Also I just realized you're running Arch for a server OS, that takes a special kind of masochism. 


Thanks, I'll try those suggestions.
 

10 minutes ago, RONOTHAN## said:

Also I just realized you're running Arch for a server OS, that takes a special kind of masochism. 

It is a bit 😀
It's actually my daily driver PC (PC and server in one, both on Arch), and there's more! 🙂 I use XFCE.

Right now I have to use a Mac instead whenever the daily PC is busy 😞
 


If your rig is a server I'd run it at stock (no OC no undervolt/CO)

As for the RAM try using CL16



28 minutes ago, PDifolco said:

If your rig is a server I'd run it at stock (no OC no undervolt/CO)

As for the RAM try using CL16

Actually, PBO + CO lasts longer without errors than with the CPU OC off (and gives +9% performance).
I'm trying looser timings right now, hope this will work 🙂
 
