So I have a bright and shiny 3950x. So far this thing has proved to be the beast it was advertised to be. But there's always that one fly that gets in the ointment.
The whole rig (which is not at all intended for gaming) looks like this:
3950x
Gigabyte X570 Aorus Pro
32GB Corsair memory (don't remember the model, but I chose it off the QVL for the MB and to clear the cooler.)
MSI GTX-1660 Super (no, I'm really not gaming with this)
1 TB Intel NVMe (because I could)
Seasonic GX-750 gold PSU
Noctua NH-D15
Windows 10 Pro
All the latest drivers have been installed and I flashed the board to the most recent UEFI.
It has handled Cinebench and Aida64 testing with no problem (both were run for hours; I know those aren't complete tests, but I only built it yesterday.) Temps are reasonable (hovering around 61 under load, with occasional spikes to around 80 for a second or two; I can't really explain why, but they're there) and I don't see any errors reported.
Being a bit old school, I then fired up Prime95. And right out of the gate I got errors on small FFTs. They were consistently on the same "core" numbers (18 and 19, so I assume them to be virtual as the physicals seem to be 0 through 15.) Long story short, I could "walk" the errors around by messing with the number of workers I used and how many "cores" I told each to exercise. Everything in the UEFI was set to stock; I hadn't even enabled XMP yet. (Though when I did that, it made the errors worse.)
I didn't disable Turbo (or whatever AMD is calling it these days.) So in a sense the chip was trying to OC itself when it detected load. I consider that "normal" behavior, and so it should be included in the test.
I started playing with XMP as the memory voltage seemed low. But that got me looking at other voltages and ultimately led me to start thinking about vdroop. (That was a long and twisted path that I won't bore everyone with.)
Ultimately I did find that if I set LLC to "Low", the system stopped throwing errors and Prime95 ran for slightly over 8 hours before I stopped it. I have another small FFT run going right now and it's behaving similarly.
What I'm now faced with is what to do with this mess. I do think the chip is a beast and even in my short time I've come to really like it, but I also want something that's long-term stable and doesn't have monsters lurking inside it just waiting for the right (and inopportune) time to come leaping out and wreaking havoc.
While I now know how to keep said monsters locked up where they don't show themselves, I don't really like that I had to tweak something in the UEFI to get it to be stable. I've never had to do that with any other chip and I've never seen a chip that didn't pass basic (albeit strenuous) tests. I cannot decide:
If there is some problem with how the "Auto"/"Normal"/"Standard" LLC setting is implemented in the Gigabyte UEFI (all three seem to be the same.)
If there is some system power supply issue that is showing up on power-hungry chips like the 3950x and the Threadripper series (there are several folks in other forums reporting similar issues with Prime95 and these chips.)
If this has something to do with the fact that the memory seems to be under-volted.
If there is a problem with my specific chip.
If there is a problem with some other component (motherboard, memory, or PSU.) All are going to get tested as best I can.
Or if I should be glad that I know how to keep this controlled and be happy with what I have.
I did also read a description posted by a guy in another forum who had an issue similar to mine. He decided to RMA his chip, and has ended up, two RMAs later, with one that behaves worse than either of its predecessors. That is a train I totally don't want to get on.
I figured that I'd see what folks here thought about this.
Thank you in advance for any input you might have, and sorry for the long post.