[Debunked] BSD team discovered a new hardware issue with Ryzen

zMeul · August 6, 2017

source: https://svnweb.freebsd.org/base?view=revision&revision=321899

4 days ago the FreeBSD developers issued a report on a new issue they encountered with Ryzen CPUs:

Quote

Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ. The %rip the IRETQ returns to (e.g. userland %rip address) matters a *LOT*. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it. When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.

the issue described has been observed on FreeBSD systems with SMT disabled

---

remember when me and others told "you" to stay away from Zen CPUs for at least ¹/₂ y? this is precisely why - AMD is dealing with new arch that is bound to have HW issues and some of them cannot be fixed with a micro-code update

on top of the segmentation fault, we now have this ... if this turns into a recall, AMD will be severely hit and not only financially

BuckGup · August 6, 2017

Yes zMuel because you are always right we should only listen to you....

zMeul · August 6, 2017

7 minutes ago, BuckGup said:

Yes zMuel because you are always right we should only listen to you....

you deny these issues exist?

no one has to listen to anything I "say", look at the sources I provide

the segmentation fault is 3 months old and AMD has yet to deal with it in one way or another

while at the same time they push server and workstation grade products that seem to be affected by the same issues the original desktop parts are - businesses will be thrilled that their new stations have unresolved HW issues

Edited August 6, 2017 by zMeul

The Sloth · August 6, 2017

not only ryzen - "the conftest segmentation faults aren't specific to Ryzen, so updating the tests to avoid confusion. Though one area being explored now as well is the Clang segmentation faults shown in the original article, not originating from conftest as well as Clang being able to yield the system hanging hard where the system is unresponsive and SSH is not working. Plus also incorporating more Ryzen-Kill tests as outlined in the aforelinked article. As many readers have pointed out, BSD developers have also discovered a Ryzen bug. More details soon."

zMeul · August 6, 2017

3 minutes ago, nerdslayer1 said:

not only ryzen

it was discovered with EPYC and ThreadRipper

leadeater · August 6, 2017

19 minutes ago, zMeul said:

the issue described has been observed on FreeBSD systems with SMT disabled

How is this happening when SMT is disabled when the issue being described is with SMT itself? The other mention of system stability seems to be a different problem.

This seems like a good explanation to the cause of some of the issues:

Quote

Well this could be somewhat worrying if it's not just a software bug... From what I've understood Ryzen is supposed to do some pretty aggressive runtime optimization with stuff like heavy handed out-of-order execution. If turning SMT off really fixes this bug and it's present on all modern versions of GCC my first guess would be that it's out-of-order execution clashing with SMT and generating segmentation faults when memory reads and writes are put out of sequence when they shouldn't be.

Then again it's not like Intel has never had chip bugs in their hardware...

https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads/page7

Sounds like GCC needs some Zen optimizations and a lot of software recompiled and update versions pushed out.

Also there's an update to the Segfaults issue that points to potential improper testing.

Quote

Update [5 August]: As a result of feedback, currently working on some updated results. As some have pointed out, the conftest segmentation faults aren't specific to Ryzen, so updating the tests to avoid confusion.

https://www.phoronix.com/scan.php?page=article&item=ryzen-segv-continues&num=1

theMillen · August 6, 2017

2 minutes ago, zMeul said:

it was discovered with EPYC and ThreadRipper

LMAO

zMeul · August 6, 2017

24 minutes ago, leadeater said:

How is this happening when SMT is disabled when the issue being described is with SMT itself? The other mention of system stability seems to be a different problem.

This seems like a good explanation to the cause of some of the issues:

https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads/page7

Sounds like GCC needs some Zen optimizations and a lot of software recompiled and update versions pushed out.

Also there's an update to the Segfaults issue that points to potential improper testing.

https://www.phoronix.com/scan.php?page=article&item=ryzen-segv-continues&num=1

wait a second or two

the original seg fault is a different issue than what the BSD team discovered

and the original seg fault issue is independent of SMT being disabled or not! new BIOSes have introduced a new option: OPcache control

but even with disabling OPcache it still seems to occur: https://community.amd.com/thread/215773

the original seg fault was not discovered running Phoronix's test suite but compiling Linux kernel with GCC

please get your ducks in a row

Edited August 6, 2017 by zMeul

ravenshrike · August 6, 2017

11 minutes ago, zMeul said:

it was discovered with EPYC and ThreadRipper

No, the people saying that are noting configuration segmentation faults are common across many processors. To wit from post #37

Quote

I am running the test in a Core i5 laptop and the conftest segfault appear here too. Michael, did you get only conftest segfaults?

Also, I love how you managed to quote the completely wrong portion in your original post.

leadeater · August 6, 2017

4 minutes ago, zMeul said:

wait a second or two

the original seg fault is a different issue than what the BSD team discovered

and the original seg fault issue is independent of SMT being disabled or not! new BIOSes have introduced a new option: OPcache control

Yes I know that that's what I'm commenting on, this issue.

26 minutes ago, zMeul said:

the issue described has been observed on FreeBSD systems with SMT disabled

How? This IS a bug with SMT so how can it exist if it's turned off? Your source seems to be linking two issues as one.

zMeul · August 6, 2017

2 minutes ago, ravenshrike said:

No, the people saying that are noting configuration segmentation faults are common across many processors. To wit from post #37

Also, I love how you managed to quote the completely wrong portion in your original post.

good god! I should not have included the phoronix link since you people can't get it straight

Phoronix was not the original discoverer of the seg fault issue - it was discovered back in May by people compiling the Linux Kernel with GCC

Haaselh0ff · August 6, 2017

24 minutes ago, zMeul said:

remember when me and others told "you" to stay away from Zen CPUs for at least ¹/₂ y? this is precisely why - AMD is dealing with new arch that is bound to have HW issues and some of them cannot be fixed with a micro-code update

on top of the segmentation fault, we now have this ... if this turns into a recall, AMD will be severely hit and not only financially

This sounds incredibly arrogant and @BuckGup's response seems entirely fair. Criticizing people for supporting a new platform is not the way to go about this. Having concern over whether this is a major bug or if it can even be fixed with a simple update is reason for concern but no reason to bash buyers.

zMeul · August 6, 2017

2 minutes ago, leadeater said:

Yes I know that that's what I'm commenting on, this issue.

How? This IS a bug with SMT so how can it exist if it's turned off? Your source seems to be linking two issues as one.

no they aren't

I'm removing the phoronix link since it seems to create confusion - I only linked it in the 1st place to give them credit for the BSD discovery

leadeater · August 6, 2017

4 minutes ago, zMeul said:

no they aren't

I'm removing the phoronix link since it seems to create confusion - I only linked it in the 1st place to give them credit for the BSD discovery

Yes that is what your quote is talking about, do you even read the stuff you post at all?

33 minutes ago, zMeul said:

In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize.

SMT...

33 minutes ago, zMeul said:

The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ

So I ask again how can this happen when SMT is disabled when the bug is with SMT?
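For anyone who wants to check the "same core" condition on their own machine, here is a rough sketch. It assumes a Linux box (not the BSD setup from the report) where the kernel exposes `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the function names `parse_cpu_list`, `share_core` and `read_siblings` are made up for illustration. The idea is simply that two logical CPUs can only sit on the same physical core if each appears in the other's sibling list, and with SMT turned off each logical CPU would be expected to list only itself.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,8" or "0-1,4" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def share_core(cpu_a, cpu_b, siblings_of):
    """True if two different logical CPUs are SMT siblings on one physical core.

    siblings_of is any callable mapping a logical CPU number to its sibling list,
    so the check can be exercised without touching sysfs.
    """
    return cpu_a != cpu_b and cpu_b in siblings_of(cpu_a)


def read_siblings(cpu):
    """Read the sibling list for one logical CPU from Linux sysfs (assumed layout)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return parse_cpu_list(path.read_text())
```

On a live Linux system, `share_core(0, 1, read_siblings)` would tell you whether logical CPUs 0 and 1 are a hyperthread pair. If no pair of CPUs ever comes back true, the cpu-bound loop and the IRETQ cannot be on the same core, which is the case the quoted report says does not trigger the stall.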

zMeul · August 6, 2017

1 minute ago, Haaselh0ff said:

This sounds incredibly arrogant and @BuckGup's response seems entirely fair. Criticizing people for supporting a new platform is not the way to go about this. Having concern over whether this is a major bug or if it can even be fixed with a simple update is reason for concern but no reason to bash buyers.

if putting other people's money into a defective platform is a thing for you, go ahead

I worked in the business for some time and I would not shove down a customer's throat a brand new uncertified product

zMeul · August 6, 2017

3 minutes ago, leadeater said:

Yes that is what your quote is talking about, do you even read the stuff you post at all?

SMT...

have you actually bothered to read the source?

I'm done replying to this BS

leadeater · August 6, 2017

5 minutes ago, zMeul said:

have you actually bothered to read the source?

I'm done replying to this BS

Yes and I am saying your source is wrong and it's two bugs he's hitting not one. Also your source is quoting another source, that is the source you have actually quoted not the one you linked. Reading the actual issue it is very clearly an issue with SMT so disabling it means it can no longer exist so any other stability problems are something else.

Haaselh0ff · August 6, 2017

3 minutes ago, zMeul said:

if putting other people's money into a defective platform is a thing for you, go ahead

I worked in the business for some time and I would not shove down a customer's throat a brand new uncertified product

How long was AMD Ryzen rumored and in production for? Did they just throw something together in 2 months like Intel did in response to AMD?

dexT · August 6, 2017

7 minutes ago, zMeul said:

I worked in the business for some time and I would not shove down a customer's throat a brand new uncertified product

Do they carry Ryzen at K-Mart? Perhaps you could test this IRL.

zMeul · August 6, 2017

20 minutes ago, Haaselh0ff said:

How long was AMD Ryzen rumored and in production for? Did they just throw something together in 2 months like Intel did in response to AMD?

sure mate, CPU manufacturers can develop new products in 1-2 months .... what parallel dimension are you from?

oh look, Skylake-X and KabyLake-X were talked about since at least 2016 .. https://benchlife.info/intel-study-skylake-x-kaby-lake-x-and-basin-falls-for-skylake-w-06022016/

Edited August 6, 2017 by zMeul

zMeul · August 6, 2017

2 minutes ago, dexT said:

Do they carry Ryzen at K-Mart? Perhaps you could test this IRL.

do you buy your computers at the grocery store? good luck to you if you do

Fetzie · August 6, 2017

If the problem can be mitigated by changing which core handles interrupts, couldn't this be a problem with a compiler that hasn't been properly updated to work with the Ryzen micro architecture instead of a hardware bug in Ryzen?

The fact that a bug regarding interrupt scheduling with SMT is reproducible with SMT disabled would tempt me to think in that direction too.

dexT · August 6, 2017

1 minute ago, zMeul said:

do you buy your computers at the grocery store? good luck to you if you do

I bought 8GB of KLevv Urbane DDR3 2666 from Walmart so you could say that.

Haaselh0ff · August 6, 2017

2 minutes ago, zMeul said:

do you buy your computers at the grocery store? good luck to you if you do

Wal-Mart sells computer parts (I don't think they have them physically in the store, but you can do in-store pickup which would allow for you to grocery shop and then pick up some parts/a decent rig).

leadeater · August 6, 2017

@zMeul Instead of marking my post funny you could actually try to answer my very simple question. It didn't need to be an argument at all, this issue is talking about SMT then you said it exists when SMT is disabled and I'd like to know how that is possible, not a hard question. Especially when the original source says the bug does not happen when the cpu-bound loop is on a different core than the core doing the IRETQ, the only way they can be run on the same core is with SMT.

Marking posts as funny is rather childish.
