Intel Processor Instability Causing Oodle Decompression Failures

ook512

Summary

 

It looks like the Intel 13900K and 14900K can, under high load, in some cases produce wrong results. This causes incorrect decompressed data in Unreal Engine-based games, which is detected as an error.

 

Quotes

Quote

As far as we can tell, there is not any software bug in Oodle or Unreal that is causing this. Due to what seem to be overly optimistic BIOS settings, some small percentage of processors go out of their functional range of clock rate and power draw under high load, and execute instructions incorrectly. This is being seen disproportionately in Oodle Data decompression because unlike most gameplay, simulation, audio or rendering code, decompression needs to perform extra integrity checks to handle accidentally or maliciously corrupted data, and is thus likely to spot inconsistencies very soon after they occur.
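The point about integrity checks is easy to demonstrate: most compression formats carry a checksum of the payload (gzip, for example, stores a CRC-32 of the uncompressed data), so a single flipped bit, whether from corrupt storage or a miscomputing CPU, gets caught at decompression time. A minimal sketch in Python; this is not Oodle's actual format, just the same principle:

```python
import gzip
import zlib

payload = b"The quick brown fox jumps over the lazy dog. " * 500
blob = gzip.compress(payload)

# Round trip on intact data works; the stored CRC-32 matches.
assert gzip.decompress(blob) == payload

# Flip a single bit in the middle of the compressed stream, the way
# faulty hardware or corrupt storage might.
corrupted = bytearray(blob)
corrupted[len(corrupted) // 2] ^= 0x01

try:
    gzip.decompress(bytes(corrupted))
    result = "corruption slipped through"
except (gzip.BadGzipFile, zlib.error, EOFError) as exc:
    result = f"caught: {type(exc).__name__}"

print(result)
```

Gameplay or rendering code computing a wrong value just carries on with garbage; a decompressor hits an invalid stream or a failed checksum almost immediately, which is why it shows up as the messenger here.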

 

My thoughts

It might be a good idea to check/ensure that CPU/GPU benchmarks not only measure time-per-test or iterations-per-time, but also verify that the results are exactly the same. Does e.g. Cinebench check that the resulting image is bit-perfect on all runs?

There is https://reproducible-builds.org/ which tracks efforts to ensure that repeated compilation of software produces exactly the same binaries (i.e. don't embed the date of compilation or disk paths, don't randomize anything, etc.). It could be used as part of a benchmark: not only to measure compile times, but also to verify that the CPU does not miscalculate during sustained high load (a very useful test for the cheaper non-workstation-class laptops often used by technical students for workstation-level tasks).
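A benchmark along those lines doesn't need much: run the same fully deterministic workload several times and compare cryptographic hashes of the output. A hypothetical Python sketch; the workload here is a stand-in, a real benchmark would hash its actual render or compile output:

```python
import hashlib

def workload(n: int = 20_000) -> bytes:
    """Deterministic, CPU-bound stand-in: no timestamps, paths or RNG."""
    acc = 0
    out = bytearray()
    for i in range(1, n + 1):
        # 64-bit LCG-style mixing; purely integer, fully reproducible.
        acc = (acc * 6364136223846793005 + i) & 0xFFFFFFFFFFFFFFFF
        out += acc.to_bytes(8, "little")
    return bytes(out)

def runs_are_bit_perfect(runs: int = 5) -> bool:
    # On healthy hardware every digest is identical; any mismatch means
    # something (CPU, cache, RAM) silently miscomputed.
    digests = {hashlib.sha256(workload()).hexdigest() for _ in range(runs)}
    return len(digests) == 1

print("bit-perfect:", runs_are_bit_perfect())
```

The nice part is that the check is nearly free: hashing the output adds a constant cost per run, and a single mismatched digest is unambiguous evidence of silent corruption rather than a timing blip.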

Also, AM4 with ECC is nice, but full ECC should just be available everywhere. Producing new hardware without full ECC support (or fusing it off on some SKUs) is irresponsible.

 

Sources

https://www.radgametools.com/oodleintel.htm

https://web.archive.org/web/20240224072046/https://www.radgametools.com/oodleintel.htm


Could end up being a slightly smaller problem than the one Intel had when they tried to push Coppermine to 1.13 GHz. That ended in a total recall.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


A similar thing was happening with 6th- to 10th-gen Intel CPUs running Apex Legends.

 

https://answers.ea.com/t5/Technical-Issues/Apex-Legends-Crash-no-error-PC-apex-crash-txt/td-p/7766168/page/3

 

They had to write a workaround patch to stop the crashing.


It's very common for CPUs to have these kinds of issues, and they get fixed in software or hardware revisions without ever being acknowledged in the wider media.

 

In this case though, from what I can tell, the issue seems to be related to motherboard settings that cause the CPU to boost too much because of "auto-OC" features and such. 


2 hours ago, LAwLz said:

It's very common for CPUs to have these kinds of issues and they get fixed in software or hardware revisions without it ever getting acknowledged in the wider media.

Yes. CPU errata exist in every processor these days. Quite normal and to be expected. They are either corrected with a microcode update or in the next stepping. Here's a list of known 13th- and 14th-gen errata.

 

2 hours ago, LAwLz said:

In this case though, from what I can tell, the issue seems to be related to motherboard settings that cause the CPU to boost too much because of "auto-OC" features and such. 

If this is a known CPU erratum, the closest one I can see would be RPL047, with no known fix. That could cause bit flips and thus miscalculations, among other data integrity issues. Not good.

 

RPL047

DDR5 Clock Jitter Out of Spec

Problem

DDR5 Clock Jitter, as measured by jitter parameters Dj, Rj, and Tj (Dynamic/Random/Total jitter), may be beyond the JEDEC specification (JEDEC doc number JESD79-5B, Chapter 8.3) limits.

Implication

Due to this erratum Clock Jitter measurements may be out of spec. Intel has not observed any functional implications due to this erratum.

Workaround

None identified.

Status

For the steppings affected, refer to the Summary Table of Changes.


 


  • 2 weeks later...
On 2/24/2024 at 10:22 PM, ook512 said:

Does e.g. Cinebench check that the resulting image is bit-perfect on all runs?

idk, but i do know it throws certain errors with unstable OCs that indicate something like that, at least.

 

also, while CB is popular, it's also commonly known to employ "unrealistic" loads, so idk how reliable it even is (likely not very)

The direction tells you... the direction

-Scott Manley, 2021

 

Softwares used:

Corsair Link (Anime Edition) 

MSI Afterburner 

OpenRGB

Lively Wallpaper 

OBS Studio

Shutter Encoder

Avidemux

FSResizer

Audacity 

VLC

WMP

GIMP

HWiNFO64

Paint

3D Paint

GitHub Desktop 

Superposition 

Prime95

Aida64

GPUZ

CPUZ

Generic Logviewer

 

 

 


1 hour ago, Mark Kaine said:

idk, but i do know it throws certain errors with unstable OCs that indicate something like that, at least.

 

also, while CB is popular, it's also commonly known to employ "unrealistic" loads, so idk how reliable it even is (likely not very)

Just because it's a worst-case scenario doesn't make it unrealistic. The whole point of worst-case scenarios is making sure it'll never crash because of the overclock. There's nothing worse than a seemingly stable system that runs rock solid for days and then randomly starts crashing repeatedly the same day because you ran a load you hadn't tested for.

 

I've actually had overclocks that passed all the usual "burn-in" tests, but kept failing ASUS's RealBench video encoding in a loop. It even passed one encode, but on 5 repeats it crashed.

 

RAM is especially finicky in this regard and often unpredictable between boots, because memory training can work out differently from boot to boot at the same settings. Besides, RAM usually has such a small effect that it's just not worth running it on the edge.


30 minutes ago, RejZoR said:

Just because it's a worst-case scenario doesn't make it unrealistic. The whole point of worst-case scenarios is making sure it'll never crash because of the overclock. There's nothing worse than a seemingly stable system that runs rock solid for days and then randomly starts crashing repeatedly the same day because you ran a load you hadn't tested for.

 

I've actually had overclocks that passed all the usual "burn-in" tests, but kept failing ASUS's RealBench video encoding in a loop. It even passed one encode, but on 5 repeats it crashed.

 

RAM is especially finicky in this regard and often unpredictable between boots, because memory training can work out differently from boot to boot at the same settings. Besides, RAM usually has such a small effect that it's just not worth running it on the edge.

The most I've ever seen RAM overclocking have an effect is when bandwidth was already limited to below the CPU bus speed (think SDR) and every MHz helped.

Though tighter timings still made more of a difference.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


53 minutes ago, RejZoR said:

Just because it's a worst-case scenario doesn't make it unrealistic. The whole point of worst-case scenarios is making sure it'll never crash because of the overclock. There's nothing worse than a seemingly stable system that runs rock solid for days and then randomly starts crashing repeatedly the same day because you ran a load you hadn't tested for.

I've had more than a few disagreements with Mark so I don't want to assume, but I read "unrealistic" as "not generally representative", in which case I'd agree. At the end of the day, software can only represent itself. Cinebench is a benchmark first; anything else is secondary.

 

Even taken as a measure of loading potential, I view the latest 2024 version as getting into the upper mid range of overall stress, perhaps 6/10. It is still a relatively low-intensity workload that scales well to as many threads as you can throw at it. Higher-intensity workloads I'd rate include Y-cruncher at 8/10 and Prime95/Linpack at 10/10. Older Cinebench like R15 was pretty lightweight, 3/10 at most.

 

If you wonder what I mean by intensity, I'd illustrate it as: if you run a CPU at a fixed clock and voltage, how much power would it use? Or, to represent the modern context a bit better: how hard does the clock drop if you run it at a fixed power limit? Edit: now that I say that, I feel like doing some testing to get hard numbers on it.

 

53 minutes ago, RejZoR said:

RAM is especially finicky in this regard and often unpredictable between boots, because memory training can work out differently from boot to boot at the same settings. Besides, RAM usually has such a small effect that it's just not worth running it on the edge.

As always it comes with a big "depends on the workload", but I'd certainly agree with not worth pushing it in general. Either by manual OC or buying insane rated kits.

 

20 minutes ago, Dabombinable said:

The most I've ever seen RAM overclocking have an effect is when bandwidth was already limited to below the CPU bus speed (think SDR) and every MHz helped.

Though tighter timings still made more of a difference.

Y-cruncher and Prime95 certainly do scale with RAM performance. Modern cores can get through work faster than you can feed them; not even 3D cache is enough here. I'd like to see GB-scale caches on CPUs if the x86 makers aren't going to give us more affordable memory channels.

 

Even RAM configuration makes a difference. In the link below I tested 1Rx16 vs 1Rx8 vs 2Rx8, in 1DPC and 2DPC configurations. And that was at one speed with the same nominal timings. I don't usually bother testing timings, since for my interests they have a far smaller impact than practical (not peak) bandwidth.
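For what it's worth, "practical bandwidth" can be roughly probed even from a high-level language by timing large in-memory copies. A crude Python sketch, no substitute for a proper tool like STREAM or AIDA64, and the numbers will sit well below peak:

```python
import time

def copy_bandwidth_gbs(size_mb: int = 256, repeats: int = 5) -> float:
    """Time large buffer copies; count bytes both read and written."""
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)  # one full read pass + one full write pass
        best = min(best, time.perf_counter() - t0)
        del dst
    # 2x size because every copied byte is read once and written once.
    return (2 * size_mb / 1024) / best  # GB/s, best of N runs

print(f"~{copy_bandwidth_gbs():.1f} GB/s effective copy bandwidth")
```

Taking the best of several runs filters out scheduler noise; a buffer much larger than the last-level cache is what makes this measure memory rather than cache.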

 

 

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


30 minutes ago, porina said:

I'd like to see GB scale caches on CPUs if the x86 makers are not going to give us more affordable memory channels.

You can get both from AMD for:


 

AMD EPYC 9684X (1.125 GB L3 cache): ~14,000 USD. Probably better off with a 9184X for what you want, though.


Just now, leadeater said:

You can get both from AMD

I don't want to take this too far off topic, but implicitly I was thinking of consumer-tier CPUs with consumer-tier pricing. I've given up on seeing affordable HEDT again, so the next best chance would be a different take on CPU caches than current 3D: quantity over quality. Much bigger but slower, as long as it is still significantly faster than RAM. Imagine a 2025 version of Broadwell-C, which would also be 10 years from when it launched. I can dream!



I had a similar issue.

 

I overclocked my 13600K and it passed Prime95 Small FFTs, OCCT, and Y-cruncher for 8 hours each without any error message.

 

But as soon as I tried to install a big FitGirl repack game, it would error out, and I had to downclock the CPU for it to install properly.

Yeah, we're all just a bunch of idiots experiencing nothing more than the placebo effect.
