
Folding@Home Causing Unexpected Shutdown and WHEA Errors with Ryzen 9 5900X?

I've been having intermittent WHEA errors that bring down my system when it's running Folding@Home for the last few months. I have tried my best to isolate the issue, and I believe I've ruled out the hardware - I've tried putting my CPU and RAM and even the whole BIOS config to stock settings, removing non-essential expansion cards and peripherals, updating the BIOS, cleaning out dust, swapping the position of the M.2 drives, running it with only 2 sticks of RAM, and every other thing I could think of with my config's hardware. I've run Prime95, OCCT, and Cinebench+Heaven as various stress tests for at least an hour each to try to rule out stability or PSU concerns. I've removed background software from my computer, like Mystic Light, and I'm not running anything unnecessary in the background - I close all windows and tasks when I leave my computer now.

 

Also, of note, I used to do CPU and GPU mining last winter for room heating, and the system was able to stay running without crashes. That's a very similar load situation where the system was fine. I'm pretty sure F@H is the culprit. I'm running version 7.6.21 - I don't think they've updated the software in a while.

 

But, for completeness, here is the complete list of internal components if you want to see them:


AMD Ryzen 9 5900X

MSI MAG B550 Gaming Edge WiFi

48GB (2x8+2x16) DDR4-3000 16-18-18-38 TeamForce Vulcan Z memory

Gigabyte Windforce OC RTX 2060 Super

Enermax Revolution D.F. 750W Gold PSU

 

1TB WD Black SN750 NVMe SSD (boot drive)

1TB Crucial P1 NVMe SSD

2TB Samsung 860 EVO SATA SSD

1TB TeamGroup C2 SATA SSD

 

Phanteks P300A case with stock fan for exhaust

Scythe Fuma 2 CPU cooler + Arctic Silver 5 paste

3x Arctic P12 PWM PST RGB fans (2 front, 1 top)

 

Tiergrade Superspeed 7 Ports PCI-E to USB 3.0 Expansion Card

4x USB 2.0 header expansion from 2 headers on the motherboard

 

The only time I get WHEA errors and crashing is when it's sitting running F@H on the CPU and GPU. About the only thing I haven't tried is running folding on only the GPU or only the CPU, but being able to run folding on both is kinda the point - I want my idle computer to be running folding to heat my room in the winter while also giving computing time and power to a worthy cause.

 

These crashes happen sometimes after a day and sometimes after a couple of weeks - the timing is seemingly completely random. The logs report a critical error of an unexpected shutdown, then an error saying that the system shut down about a half-hour before. It is always a pair of WHEA errors: one on a thread on one CCD and another on a thread on the other CCD. Also, in the logs, before each event is recorded, there are always a bunch of HTTP service events - not sure if that even means anything, but it was something I noticed.
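For anyone who wants to pull just the WHEA entries out of the System log instead of scrolling through Event Viewer, here is a minimal sketch. It assumes Windows' built-in `wevtutil` tool and the usual `Microsoft-Windows-WHEA-Logger` provider name; the function names are made up for illustration and nothing here comes from the attached logs.

```python
import platform
import subprocess

# XPath filter that keeps only events written by the WHEA-Logger provider.
WHEA_XPATH = "*[System[Provider[@Name='Microsoft-Windows-WHEA-Logger']]]"


def build_whea_query(max_events: int = 20) -> list[str]:
    """Build a wevtutil command listing the newest WHEA-Logger events (hypothetical helper)."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{WHEA_XPATH}",
        f"/c:{max_events}",  # cap the number of events returned
        "/rd:true",          # read newest events first
        "/f:text",           # plain-text output instead of XML
    ]


def recent_whea_events(max_events: int = 20) -> str:
    """Run the query and return its text output; only meaningful on Windows."""
    if platform.system() != "Windows":
        raise RuntimeError("wevtutil is only available on Windows")
    result = subprocess.run(
        build_whea_query(max_events), capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Always show the command; only execute it where wevtutil exists.
    print(" ".join(build_whea_query()))
    if platform.system() == "Windows":
        print(recent_whea_events())
```

Comparing the timestamps and processor IDs across several crashes makes it easier to see whether the same cores show up every time.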

 

I have attached the Event Viewer System logs going back the last two months or so, in case anyone wants to look at them in more detail. I've also included a version of the logs with just the Error and Critical events.

 

Any help would be appreciated.

Event Viewer Log.evtx OnlyErrors.evtx


Here, do this:

-> Prime95 Blend + FurMark

-> OCCT (CPU or Linpack) + MSI Kombustor

Test. Should be similar to folding cpu+gpu but will show issues faster.

M.S.C.E. (M.Sc. Computer Engineering), IT specialist in a hospital, 30+ years of gaming, 20+ years of computer enthusiasm, Geek, Trekkie, anime fan

  • Main PC: AMD Ryzen 7 5800X3D - EK AIO 360 D-RGB - Arctic Cooling MX-4 - Asus Prime X570-P - 4x8GB DDR4 3200 HyperX Fury CL16 - Sapphire AMD Radeon 6950XT Nitro+ - 1TB Kingston Fury Renegade - 2TB Kingston Fury Renegade - 512GB ADATA SU800 - 960GB Kingston A400 - Seasonic PX-850 850W  - custom black ATX and EPS cables - Fractal Design Define R5 Blackout - Windows 11 x64 23H2 - 3 Arctic Cooling P14 PWM PST - 5 Arctic Cooling P12 PWM PST
  • Peripherals: LG 32GK650F - Dell P2319h - Logitech G Pro X Superlight with Tiger Ice - HyperX Alloy Origins Core (TKL) - EndGame Gear MPC890 - Genius HF 1250B - Akliam PD4 - Sennheiser HD 560s - Simgot EM6L - Truthear Zero - QKZ x HBB - 7Hz Salnotes Zero - Logitech C270 - Behringer PS400 - BM700  - Colormunki Smile - Speedlink Torid - Jysk Stenderup - LG 24x External DVD writer - Konig smart card reader
  • Laptop: Acer E5–575G-386R 15.6" 1080p (i3 6100U + 12GB DDR4 (4GB+8GB) + GeForce 940MX + 256GB nVME) Win 10 Pro x64 22H2 - Logitech G305 + AAA Lithium battery
  • Networking: Asus TUF Gaming AX6000 - Arcadyan ISP router - 35/5 Mbps vDSL
  • TV and gadgets: TCL 50EP680 50" 4K LED + Sharp HT-SB100 75W RMS soundbar - Samsung Galaxy Tab A8 10.1" - OnePlus 9 256GB - Olymous Cameda C-160 - GameBoy Color 
  • Streaming/Server/Storage PC: AMD Ryzen 5 3600 - LC-Power LC-CC-120 - MSI B450 Tomahawk Max - 2x4GB ADATA 2666 DDR4 - 120GB Kingston V300 - Toshiba DT01ACA100 1TB - Toshiba DT01ACA200 2TB - 2x WD Green 2TB - Sapphire Pulse AMD Radeon R9 380X - 550W EVGA G3 SuperNova - Chieftec Giga DF-01B - White Shark Spartan X keyboard - Roccat Kone Pure Military Desert strike - Logitech S-220 - Philips 226L
  • Livingroom PC (dad uses): AMD FX 8300 - Arctic Freezer 64 - Asus M5A97 R2.0 Evo - 2x4GB DDR3 1833 Kingston - MSI Radeon HD 7770 1GB OC - 120GB Adata SSD - 500W Fractal Design Essence - DVD-RW - Samsung SM 2253BW - Logitech G710+ - wireless vertical mouse - MS 2.0 speakers

5 minutes ago, 191x7 said:

Here, do this:

-> Prime95 Blend + FurMark

-> OCCT (CPU or Linpack) + MSI Kombustor

Test. Should be similar to folding cpu+gpu but will show issues faster.

I've already run Cinebench+Heaven as a combined stress test, but I'll try these now. We'll see if anything happens, but I'm pretty sure this isn't a hardware issue.


1 minute ago, YoungBlade said:

I've already run Cinebench+Heaven as a combined stress test, but I'll try these now. We'll see if anything happens, but I'm pretty sure this isn't a hardware issue.

I've found Heaven not to be that reliable as a GPU stress test recently; it's probably too old to stress newer GPUs.


10 minutes ago, 191x7 said:

I've found Heaven not to be that reliable as a GPU stress test recently; it's probably too old to stress newer GPUs.

Sure, but is folding really that much more intensive than Heaven? And if it is, is it that much more intensive than ETH mining? Because why didn't mining cause these types of crashes last year? And why don't I get crashes when gaming, even if I'm playing a really GPU intensive game like Control with RT?


37 minutes ago, YoungBlade said:

Sure, but is folding really that much more intensive than Heaven? And if it is, is it that much more intensive than ETH mining? Because why didn't mining cause these types of crashes last year? And why don't I get crashes when gaming, even if I'm playing a really GPU intensive game like Control with RT?

It's not so much how much it loads as what it loads.


18 minutes ago, 191x7 said:

It's not so much how much it loads as what it loads.

Sure, and I think that something about F@H specifically is causing an issue, which is why nothing else causes a crash.

 

I just got done with 1 hour of P95+Furmark and nothing happened. I'll now switch over to an hour of OCCT+Kombustor.


1h isn't enough. A few weeks ago I worked on a system that would crash only after 1.5 to 2 hours.

 


1 minute ago, 191x7 said:

1h isn't enough. A few weeks ago I worked on a system that would crash only after 1.5 to 2 hours.

 

Well, OCCT is limited to 1 hour, and it plus Kombustor also didn't cause an issue. I ran those tests back to back, so that's basically 2 hours of testing right there.

 

I guess I can leave the P95+Furmark config running overnight to see if anything happens - it'll heat my room just as well as F@H does.

 

But let's say I run it all night and in the morning it's still running with no errors reported - does that actually prove anything about the issue?


16 minutes ago, YoungBlade said:

Well, OCCT is limited to 1 hour, and it plus Kombustor also didn't cause an issue. I ran those tests back to back, so that's basically 2 hours of testing right there.

 

I guess I can leave the P95+Furmark config running overnight to see if anything happens - it'll heat my room just as well as F@H does.

 

But let's say I run it all night and in the morning it's still running with no errors reported - does that actually prove anything about the issue?

Years ago I remember Linus having faulty RAM that crashed only after running Prime95 for days.


3 hours ago, YoungBlade said:

I've already run Cinebench+Heaven as a combined stress test, but I'll try these now. We'll see if anything happens, but I'm pretty sure this isn't a hardware issue.

Unfortunately, it likely is a hardware issue. Folding@Home is one of the heaviest loads you can put on a system and will usually find any hardware or overheating issues quickly. It could be that a component is overheating. The VRM on your motherboard should be robust enough to run the 5900X 24x7, though.

 

I'd suggest running HWiNFO64 while running Prime95 + FurMark and keeping an eye on the thermals.

 

Is Precision Boost Overdrive (PBO) enabled or any other non-default settings that might push the CPU past stock settings?

 

Are you running an overclock on the GPU using Afterburner, etc.? Generally, a GPU overclock that is stable for gaming will not be stable for F@H, but most of the time this will just result in a Work Unit (WU) crashing and not the system blue-screening.

3 hours ago, YoungBlade said:

Sure, but is folding really that much more intensive than Heaven? And if it is, is it that much more intensive than ETH mining? Because why didn't mining cause these types of crashes last year? And why don't I get crashes when gaming, even if I'm playing a really GPU intensive game like Control with RT?

How many threads are you using for Folding? On Windows you need to leave a thread free to feed the GPU and a couple of threads free for the operating system.

 

In the Folding Advanced Control (in the taskbar) you can set the number of threads used.

FaH BOINC HfM

Bifrost - 6 GPU Folding Rig  Linux Folding HOWTO Folding Remote Access Folding GPU Profiling ToU Scheduling UPS

Systems:

desktop: Lian-Li O11 Air Mini; Asus ProArt x670 WiFi; Ryzen 9 7950x; EVGA 240 CLC; 4 x 32GB DDR5-5600; 2 x Samsung 980 Pro 500GB PCIe3 NVMe; 2 x 8TB NAS; AMD FirePro W4100; MSI 4070 Ti Super Ventus 2; Corsair SF750

nas1: Fractal Node 804; SuperMicro X10sl7-f; Xeon e3-1231v3; 4 x 8GB DDR3-1666 ECC; 2 x 250GB Samsung EVO Pro SSD; 7 x 4TB Seagate NAS; Corsair HX650i

nas2: Synology DS-123j; 2 x 6TB WD Red Plus NAS

nas3: Synology DS-224+; 2 x 12TB Seagate NAS

dcn01: Fractal Meshify S2; Gigabyte Aorus ax570 Master; Ryzen 9 5900x; Noctua NH-D15; 4 x 16GB DDR4-3200; 512GB NVMe; 2 x Zotac AMP 4070ti; Corsair RM750Mx

dcn02: Fractal Meshify S2; Gigabyte ax570 Pro WiFi; Ryzen 9 3950x; Noctua NH-D15; 2 x 16GB DDR4-3200; 128GB NVMe; 2 x Zotac AMP 4070ti; Corsair RM750x

dcn03: Fractal Meshify C; Gigabyte Aorus z370 Gaming 5; i9-9900k; BeQuiet! PureRock 2 Black; 2 x 8GB DDR4-2400; 128GB SATA m.2; MSI 4070 Ti Super Gaming X; MSI 4070 Ti Super Ventus 2; Corsair TX650m

dcn05: Fractal Define S; Gigabyte Aorus b450m; Ryzen 7 2700; AMD Wraith; 2 x 8GB DDR 4-3200; 128GB SATA NVMe; Gigabyte Gaming RTX 4080 Super; Corsair TX750m

dcn06: Fractal Focus G Mini; Gigabyte Aorus b450m; Ryzen 7 2700; AMD Wraith; 2 x 8GB DDR 4-3200; 128GB SSD; Gigabyte Gaming RTX 4080 Super; Corsair CX650m


1 minute ago, 191x7 said:

Years ago I remember Linus having faulty RAM that crashed only after running Prime95 for days.

Yup - been there - running MemTest86 for 24 hours might also be an idea, and/or removing 2 of the memory sticks.


2 hours ago, Gorgon said:

Unfortunately, it likely is a hardware issue. Folding@Home is one of the heaviest loads you can put on a system and will usually find any hardware or overheating issues quickly. It could be that a component is overheating. The VRM on your motherboard should be robust enough to run the 5900X 24x7, though.

 

I'd suggest running HWiNFO64 while running Prime95 + FurMark and keeping an eye on the thermals.

 

Is Precision Boost Overdrive (PBO) enabled or any other non-default settings that might push the CPU past stock settings?

 

Are you running an overclock on the GPU using Afterburner, etc.? Generally, a GPU overclock that is stable for gaming will not be stable for F@H, but most of the time this will just result in a Work Unit (WU) crashing and not the system blue-screening.

 

How many threads are you using for Folding? On Windows you need to leave a thread free to feed the GPU and a couple of threads free for the operating system.

 

In the Folding Advanced Control (in the taskbar) you can set the number of threads used.

Nothing in the computer is overheating. The CPU is actually undervolted and I use PBO to reduce the power limit to 107W. When running F@H, it's always in the low-mid 50C range. I also power and thermal limit my graphics card as low as it can go, so it never goes above 65C. So I'm actually underclocking the GPU, not overclocking it. It usually runs around 1200-1450MHz when folding.

 

I have tried switching the CPU back to stock settings, and running the RAM at the JEDEC settings, 2400, and also running just 2 sticks. I mentioned these things in the OP, but I guess I made that original post too long.

 

I'm running folding on 22 threads, which leaves an entire Zen 3 core for controlling the GPU. I'm also able to use the computer just fine while it's folding for basic tasks like web browsing without that ever causing a crash.


3 hours ago, YoungBlade said:

Nothing in the computer is overheating. The CPU is actually undervolted and I use PBO to reduce the power limit to 107W. When running F@H, it's always in the low-mid 50C range. I also power and thermal limit my graphics card as low as it can go, so it never goes above 65C. So I'm actually underclocking the GPU, not overclocking it. It usually runs around 1200-1450MHz when folding.

 

I have tried switching the CPU back to stock settings, and running the RAM at the JEDEC settings, 2400, and also running just 2 sticks. I mentioned these things in the OP, but I guess I made that original post too long.

 

I'm running folding on 22 threads, which leaves an entire Zen 3 core for controlling the GPU. I'm also able to use the computer just fine while it's folding for basic tasks like web browsing without that ever causing a crash.

That all sounds good and you've got some good hardware and decent thermals so it _SHOULD_ be OK.

 

I would suggest, however, that you reduce the Folding threads to 18 or so, as under Windows many current GPU WUs will occasionally hit the CPU hard, and one thread for the OS just isn't sufficient anymore.
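To make the arithmetic concrete, here is a rough sketch of the thread budget on a 5900X (12 cores, 24 threads). The helper name is hypothetical, and how the spare threads get split between feeding the GPU and the OS is up to Windows, not something this code controls.

```python
def cpu_slot_threads(logical_threads: int, reserved_threads: int) -> int:
    """Threads left for the CPU folding slot after reserving some for GPU feeding and the OS."""
    # Never go below one thread, otherwise the CPU slot could not run at all.
    return max(logical_threads - reserved_threads, 1)


LOGICAL_THREADS = 24  # Ryzen 9 5900X: 12 cores x 2 threads

# Reserving 2 threads (one full core) gives the 22-thread setup from earlier in the thread.
print(cpu_slot_threads(LOGICAL_THREADS, 2))  # 22

# Reserving 6 threads (three full cores) gives the suggested 18.
print(cpu_slot_threads(LOGICAL_THREADS, 6))  # 18
```

So going from 22 to 18 threads triples the headroom left for the GPU work unit and the OS, from one core to three.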

 

I don't know about the Gigabyte Windforce cards, but my EVGA 2060 and 2060 Super both can't lower the power limit far enough to hit the efficiency sweet spot. I found that by clock-limiting them instead I can get them down to where they do the most work for the least amount of electricity used.

 

This command:

nvidia-smi -i 0 -lgc 0,1350

run from a Command Prompt or as a Scheduled Task at startup should get its power consumption even lower. I don't run F@H on my 2060s, but under Einstein@Home and OpenPandemics on World Community Grid they draw 69-73W and sit at 57C and 63C.
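If you would rather script the clock lock than type it each time, something like the sketch below could work. It only builds the `nvidia-smi` argument lists; the helper names are hypothetical, and `-rgc` (reset GPU clocks) is nvidia-smi's own undo flag rather than something mentioned above. Changing clocks typically needs an elevated prompt and a GPU/driver that supports clock locking, so treat this as a starting point.

```python
import subprocess


def lock_clocks_cmd(gpu_index: int, min_mhz: int, max_mhz: int) -> list[str]:
    """Build the nvidia-smi command that locks the GPU core clock to a range (MHz)."""
    if not 0 <= min_mhz <= max_mhz:
        raise ValueError("expected 0 <= min_mhz <= max_mhz")
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def reset_clocks_cmd(gpu_index: int) -> list[str]:
    """Build the nvidia-smi command that removes a previously set clock lock."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


def apply(cmd: list[str]) -> None:
    """Run one of the commands above; raises if nvidia-smi reports an error."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print only, so nothing on the machine changes unless apply() is called on purpose.
    print(" ".join(lock_clocks_cmd(0, 0, 1350)))
    print(" ".join(reset_clocks_cmd(0)))
```

Calling `apply(lock_clocks_cmd(0, 0, 1350))` before folding and `apply(reset_clocks_cmd(0))` before gaming would keep the lock from affecting games.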

 

Let us know if it survives the Prime95/Furmark stress tests.

 

I run Linux on most of my Folding systems but I still use mprime (Prime95 for Linux) with Small FFTs to stress test new systems and run it at Stock Settings for 24 hours. Just have to keep an eye on the VRM temperatures usually as some of my older motherboards (Gigabyte x370 Gaming K5, for example) have weak VRMs but these days pretty much all x570 and b550 motherboards have adequate VRMs to run R9 CPUs without hitting the thermal limits.


On 1/28/2023 at 9:22 PM, Gorgon said:

That all sounds good and you've got some good hardware and decent thermals so it _SHOULD_ be OK.

 

I would suggest, however, that you reduce the Folding threads to 18 or so, as under Windows many current GPU WUs will occasionally hit the CPU hard, and one thread for the OS just isn't sufficient anymore.

 

I don't know about the Gigabyte Windforce cards, but my EVGA 2060 and 2060 Super both can't lower the power limit far enough to hit the efficiency sweet spot. I found that by clock-limiting them instead I can get them down to where they do the most work for the least amount of electricity used.

 

This command:

nvidia-smi -i 0 -lgc 0,1350

run from a Command Prompt or as a Scheduled Task at startup should get its power consumption even lower. I don't run F@H on my 2060s, but under Einstein@Home and OpenPandemics on World Community Grid they draw 69-73W and sit at 57C and 63C.

 

Let us know if it survives the Prime95/Furmark stress tests.

 

I run Linux on most of my Folding systems but I still use mprime (Prime95 for Linux) with Small FFTs to stress test new systems and run it at Stock Settings for 24 hours. Just have to keep an eye on the VRM temperatures usually as some of my older motherboards (Gigabyte x370 Gaming K5, for example) have weak VRMs but these days pretty much all x570 and b550 motherboards have adequate VRMs to run R9 CPUs without hitting the thermal limits.

So the computer survived the overnight Prime95/Furmark load as I expected. I also did a series of OCCT/Kombustor tests yesterday - everything was fine.

 

However, I have now made the change that you suggested of lowering the CPU threads to 18. I noticed that the CPU usage when folding bounces around 93-97%, so apparently the GPU was requiring more than just a single core's worth of resources.

 

As for the GPU, this is my main system that I also use for gaming, so I'd prefer to just use Afterburner if I can. However, I also added a clockspeed reduction to the core in addition to the power/thermal limit. The card is an OC card, so it is technically overclocked out of the box. It's possible that, with folding, certain parts of the voltage/frequency curve aren't quite stable. Giving a bit more voltage for a given frequency could fix that.

 

Hopefully, those changes are the fix needed to stop the system from crashing. It'll take a few weeks to confirm, but I have hope that this was the solution.

 

Thanks for the advice! I'll keep you posted.


  • 2 weeks later...

@Gorgon @dogwitch

Well, I have bad news. I got back from grocery shopping this morning, and the computer had shut off while I was gone.

 

However, it had something I'd never seen before in all of the times looking at the Event Viewer logs: 3 WHEA errors, rather than just 2 of them.

 

[Attached image: image.png]

 

So I guess reducing the number of cores in use and underclocking the GPU did do something? Even if that something was seemingly worse... It took over 10 days for this crash to occur. The longest it had ever gone between crashes was two weeks, so if I had just made it 4 more days, I was going to declare victory. Sadly, that was not meant to be.

 

I'm really scratching my head with this one. Why is the system unstable only with F@H after many days and not unstable with any other stress tests or workloads? Even crypto mining on the CPU and GPU, which should be comparable in terms of system stress, did not have this issue over the course of months.

 

I fear I may have run across some sort of phantom bug that will never be solved or fixed...


Honestly, there have been some AMD CPUs that will crash when folding, be it normal Ryzen or Threadripper.

MSI X399 SLI Plus | AMD Threadripper 2990WX all-core 3 GHz lock | Thermaltake Floe Riing 360 | EVGA 2080, Zotac 2080 | G.Skill Ripjaws 128GB 3000 MHz | Corsair RM1200i | 150TB | ASUS TUF Gaming mid tower | 10Gb NIC


  • 2 weeks later...

Hi, I saw you in my thread, where I am also experiencing stupidly persistent WHEA 18s across multiple hardware swaps and RMAs in only one specific scenario, though mine's gaming while yours is F@H. You mentioned in my thread that the wiring in your home isn't so good, and that the circuit breaker trips when a microwave and a high-power device are on in the next room.

 

On my side, my house's lights sometimes go dim for a split second. There's also a clothes iron on the other side of the wall from my PC, and if that iron is on, my display artifacts and even black-screens for a split second before it comes back on. I got myself an AVR by CyberPower (basically a UPS minus batteries, am I right?) and the display artifacts stopped. But I wonder if my house power issues are bad enough that the AVR doesn't catch them, my PC is affected, and then it goes WHEA 18...

 

It's weird though that even with our house power issues, stress tests work perfectly fine but gaming/folding crashes our PCs.

Noelle best girl

 

PC specs:

CPU: AMD Ryzen 5 3600 3.6 GHz 6-Core Processor
CPU Cooler: Deepcool GAMMAXX 400 V2 64.5 CFM CPU Cooler
Motherboard: ASRock B450M Steel Legend Micro ATX AM4 Motherboard, BIOS P4.60
Memory: ADATA XPG 32 GB (2 x 16 GB) DDR4-3200 CL16 Memory
Storage: HP EX900 500 GB M.2-2280 PCIe 3.0 X4 NVME Solid State Drive, PNY CS900 1 TB 2.5" Solid State Drive
Video Card: Colorful iGame RTX 4060 Ti 16GB
Power Supply: Cooler Master MWE Bronze V2 650 W 80+ Bronze Certified ATX Power Supply
Operating System: Microsoft Windows 10 Pro
Wireless Network Adapter: TP-Link TL-WN881ND 802.11a/b/g/n PCIe x1 Wifi adapter
Monitor: Acer QG240Y S3 24.0" 1920 x 1080 180Hz Monitor


Update: I've been trying to pay attention to which WU is being worked on when I have a crash. I've noticed that WUs from one particular project do seem more likely to crash the system. The project in question is this one: https://stats.foldingathome.org/project/18483

 

On my 5900X, the WUs for this project tend to take more than a day, so it is important to note that it truly could just be a coincidence - the computer is more likely to crash during a long WU than during a short WU because it spends more time with a long WU.
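
To put a rough number on that coincidence check: if crashes strike at random moments, the share of crashes that land on a project should roughly match the share of folding time spent on it. Below is a minimal Python sketch of that comparison. The hours and crash counts are made-up placeholders, not measurements from this thread, and the helper names `time_share` and `prob_at_least` are purely illustrative.

```python
from math import comb

def time_share(hours_by_project):
    """Fraction of total folding time spent on each project."""
    total = sum(hours_by_project.values())
    return {project: hours / total for project, hours in hours_by_project.items()}

def prob_at_least(k, n, p):
    """Binomial tail: chance that at least k of n randomly timed crashes
    land on a project that occupies fraction p of the folding time."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical numbers for illustration only.
hours = {"18483": 60.0, "other": 140.0}   # hours folded per project
crashes_total = 5                          # crashes observed overall
crashes_on_project = 4                     # crashes while 18483 was running

share = time_share(hours)
p_value = prob_at_least(crashes_on_project, crashes_total, share["18483"])

print(f"time share of 18483: {share['18483']:.2f}")
print(f"chance of >=4 of 5 crashes on 18483 by luck alone: {p_value:.3f}")
```

A small probability here would suggest the project is over-represented beyond what its long run time explains. With only a handful of crashes, though, the result is weak evidence either way.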

 

However, something strange happened after this last crash that makes me wonder if F@H is aware of a potential issue here: the WU wouldn't start up again and, when I came back to my computer to write this, it had switched to a different WU on a different project. If that's the case, perhaps Ryzen systems won't be sent WUs from projects like this? I don't know anything about the internal workings of the F@H project, so I don't know if they ever make such changes or if such a change is even possible.

 

Again - and I feel I need to stress this because this is the Internet and people sometimes go off their rocker with wild theories - this is complete speculation. I am not, in any way, claiming that this project, the F@H project, or anyone working on these is at fault for these crashes. I am certain that, even if it is the cause of the issue - which is not a certainty - no one at any step of the chain is in any way acting in bad faith, nor is it negligence on their part to not correct what is obviously a very niche issue.

 

However, if this pattern holds, I think I'll change my settings to request work from a cause other than influenza research, in the hopes of minimizing the odds of my CPU working on WUs from that project. I could even switch my other computer over to influenza specifically in the hopes of balancing things out, because I do feel that research in that area is important. I just don't want my computer crashing at odd times.


So, I have the exact same issue. My system is a 5950X and a 3080 Ti, and I recently started using FAH. It crashes within 2 to 12 hours, depending on the day. I've tried adjusting the SoC voltage (both increasing and decreasing it) and reducing power limits for the CPU as well as the GPU. I have a full custom loop, so my CPU temps are in the 60s and my GPU temps are in the 40s, so heat is not my issue. Prime95, Linpack Xtreme, Cinebench, and FurMark running for 8+ hours in a row will not cause any crashes. Usually, I get 2 WHEA 18 errors after a reboot; however, sometimes there is no log at all.

I don't think it is that specific project, as project 18485 just crashed my desktop a little while ago. I think this may be some weird compatibility issue between certain Ryzen CPUs and FAH; no stress test program can replicate the instability. Very frustrating, and I wanted to point out that you are not alone with this issue.


Maybe it's a system inter-compatibility issue, where certain CPUs have a very obscure bug that only presents itself in specific scenarios. I don't know. The only thing that ensures stability is to not fold. Let me know if you manage to find a workaround, but I imagine this is something AMD would need to investigate to properly debug.


19 minutes ago, ToastyPillsbury said:

So, I have the exact same issue. My system is a 5950X and a 3080 Ti, and I recently started using FAH. It crashes within 2 to 12 hours, depending on the day. I've tried adjusting the SoC voltage (both increasing and decreasing it) and reducing power limits for the CPU as well as the GPU. I have a full custom loop, so my CPU temps are in the 60s and my GPU temps are in the 40s, so heat is not my issue. Prime95, Linpack Xtreme, Cinebench, and FurMark running for 8+ hours in a row will not cause any crashes. Usually, I get 2 WHEA 18 errors after a reboot; however, sometimes there is no log at all.

I don't think it is that specific project, as project 18485 just crashed my desktop a little while ago. I think this may be some weird compatibility issue between certain Ryzen CPUs and FAH; no stress test program can replicate the instability. Very frustrating, and I wanted to point out that you are not alone with this issue.

That's the project without info for it. My CPU crashed on that one, too, about two weeks ago - it was in my browser history because I'd looked at it to check what it was.

 

Interesting to see that we both have basically the exact same problem with Ryzen 9 5000 series CPUs.


Yeah, that's why I am thinking it is CPU-specific. I used to run FAH a lot in the past, and had a 3950X in my system (with a 2080 GPU). Zero issues on that hardware. I have the same RAM, mobo, and PSU from that old system, and I highly doubt the GPU is throwing cache hierarchy errors on the CPU.


7 minutes ago, ToastyPillsbury said:

Yeah, that's why I am thinking it is CPU-specific. I used to run FAH a lot in the past, and had a 3950X in my system (with a 2080 GPU). Zero issues on that hardware. I have the same RAM, mobo, and PSU from that old system, and I highly doubt the GPU is throwing cache hierarchy errors on the CPU.

My GPU used to be in a system with an i5 9600K before I moved it, and I folded on that a lot, so I don't think my issue is GPU related.

 

Heck, it even happened when I'd underclocked my GPU recently while keeping voltage the same, so I doubt the GPU was having problems with that when it was effectively over-volted.


Update: I am going to reach out to AMD support sometime this weekend and see if I can get something moving on their end. There's this older thread on Reddit about obscure memory issues with Ryzen, and it ended up being a microcode issue that AMD had to fix.

 

 

This issue doesn't seem to be too widespread either, and others with WHEA 18s seem to have them constantly, indicating a faulty CPU. For us, though, our systems are perfectly stable in all other applications, which doesn't point as easily to some hardware fault within the processor. Such a strange, specific bug. Well, keep the thread updated if you find anything new, and I will do so as well. I'm hoping this issue will somehow magically sort itself out in time 😂 Best of luck in the search for the cure!

Edited by ToastyPillsbury