Workaround boosts Ryzen Performance for Intel MKL

fanatiXalpha · November 20, 2019

I looked around and didn't find a post on this news topic on the forum, so I'm opening one myself.

If there is already a thread here on the same story: my apologies!

A user on Reddit found a workaround that boosts the performance of AMD CPUs in applications that use the Intel Math Kernel Library (MKL).

The problem of lower performance on AMD CPUs has existed, and been known about, for roughly 10 years.

The cause is that the library checks the CPU's vendor ID at startup; for an AMD CPU this results in a fallback to the plain SSE code path.

This happens even though AMD's Ryzen CPUs, and even older AMD CPUs, support much faster instruction sets such as AVX.

The workaround forces the library to run in AVX2 mode, regardless of whether the CPU is Intel or not.
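
To make the difference concrete, here is a minimal sketch of the two dispatch strategies. It is not MKL's actual dispatcher; the function names and the path labels are made up for illustration, and only the vendor strings are real CPUID values.

```python
# Illustrative only: this is not MKL's real dispatcher, just the two strategies.

def pick_path_by_features(features):
    """Dispatch on what the CPU reports it supports, ignoring the vendor."""
    for path in ("avx2", "avx", "sse"):  # fastest first
        if path in features:
            return path
    return "generic"


def pick_path_by_vendor(vendor_id, features):
    """Dispatch as described above: any non-Intel CPU gets the SSE fallback."""
    if vendor_id != "GenuineIntel":
        return "sse"
    return pick_path_by_features(features)


# A hypothetical Ryzen-like CPU: AMD vendor string, AVX2 capable.
vendor, features = "AuthenticAMD", {"sse", "avx", "avx2"}
print(pick_path_by_vendor(vendor, features))  # sse
print(pick_path_by_features(features))        # avx2
```

The workaround effectively overrides the first function's answer by hand.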

Performance gains vary between CPU generations, but in certain scenarios the uplift is 250% or more.

Zen 2, for example, gains more than Zen 1 and Zen+.

Hope this will be covered on TechLinked or the WAN Show...

Sources:

Reddit: https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/

CB: https://www.computerbase.de/2019-11/mkl-workaround-erhoeht-leistung-auf-amd-ryzen/

HWLUXX: https://www.hardwareluxx.de/index.php/news/hardware/prozessoren/51542-anwendungen-mit-intel-mkl-lassen-amd-cpus-oft-schlecht-darstehen.html

Sorry for the lack of detail, but I'm at work right now...

Humbug · November 20, 2019

I didn't know that stuff like this was still happening; AMD CPUs being detected via the vendor ID and then suboptimal methods being used.

That Franc · November 20, 2019

1 hour ago, Humbug said:

I didn't know that stuff like this was still happening; AMD CPUs being detected via the vendor ID and then suboptimal methods being used.

Unless this is sarcasm, Linus covered it already - Intel does indeed do this with their ICC compiler, and even discloses it in the docs, although the doc is hidden behind a couple of links on hard-to-navigate pages, a bunch of scrolling, and obscure wording. There's also a good article about this on SemiAccurate.

Sauron · November 20, 2019

Exhibit A for why all software should be open source.

porina · November 20, 2019

Before Zen 2, AMD's floating-point AVX implementations had such poor throughput that you might as well use some version of SSE. AMD cheaped out on the FPU, as they historically have, and only decided to catch up to Intel consumer CPUs with Zen 2, even slightly overtaking them. AMD is still way behind on AVX-512 support, although in cases where that might be used, AMD has the other trick of just throwing more cores at it. Supporting the instruction is not enough. You have to have the hardware behind it to make a real-world difference.

SpaceGhostC2C · November 20, 2019

2 hours ago, porina said:

Before Zen 2, AMD's floating-point AVX implementations had such poor throughput that you might as well use some version of SSE.

Not really - not using your hardware to its full potential will result in lower performance than necessary, no matter how bad your hardware is. This has been shown all the way back to Bulldozer, and I wouldn't be surprised if it applies to older architectures as well.

The issue isn't even in MKL itself, but in Intel's compilers (which are naturally used to compile MKL), which generate branched code with the branches selected by CPU ID instead of by the instruction sets supported. Luckily, there is a way to tell the MKL binaries manually which path to take (although maybe they'll close it in the future? Hopefully they need it there), but then you have to choose the right path for your CPU or it won't work.

Anyone wanting to use this on CPUs older than Zen 2 should check which AVX/FMA/SSE instruction sets the CPU supports and set MKL_DEBUG_CPU_TYPE to the value for the best supported one. The alternatives to "5" are discussed in the Reddit comments.

Since the fallback to basic SSE is a worst-case scenario, practically any non-Intel CPU you have around can benefit, albeit not as much as the latest Zen.
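
As a rough sketch of the advice above, the snippet below sets the variable only when the CPU reports AVX2. It assumes Linux (it reads /proc/cpuinfo) and that the variable has to be in the environment before the MKL-backed library is loaded; `cpu_flags` and `apply_mkl_workaround` are hypothetical helper names. Only the value "5" is covered, since that is the one this thread ties to the AVX2 path.

```python
import os


def cpu_flags(cpuinfo_path="/proc/cpuinfo"):
    """Return the CPU feature flags listed by Linux, or an empty set elsewhere."""
    try:
        with open(cpuinfo_path) as fh:
            for line in fh:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


def apply_mkl_workaround(flags, env=os.environ):
    """Set MKL_DEBUG_CPU_TYPE=5 only when the CPU reports AVX2."""
    if "avx2" in flags:
        env["MKL_DEBUG_CPU_TYPE"] = "5"
        return True
    return False


if apply_mkl_workaround(cpu_flags()):
    print("MKL_DEBUG_CPU_TYPE=5 set for this process")
else:
    print("AVX2 not detected; see the Reddit comments for other values")
# Import NumPy, or launch MATLAB as a child process, only after this point.
```

Since this switch could be closed in a future MKL release, treat it as something that may silently stop having any effect.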

porina · November 20, 2019

13 minutes ago, SpaceGhostC2C said:

Not really - not using your hardware to its full potential will result in lower performance than necessary, no matter how bad your hardware is.

I don't disagree with that, but it isn't what I'm saying. Basically, before Zen 2 the FP potential of the CPU was so low anyway that it didn't make a meaningful difference whether you used AVX or SSE (I can't remember the exact SSE version). You could see the AVX implementation as a kind of emulation: it looked like it supported the instructions, but it relied on multiple instructions to produce the result, negating the point of having them.

Intel also plays this trick: AVX-512 cores with a single unit are no better than AVX2. You need a 2-unit implementation to get double the throughput. At least Skylake-X got a 2-unit implementation.

Also note that I'm particularly referring to FP, specifically FP64 performance. Other areas may be less impacted.

fanatiXalpha · November 20, 2019

1 hour ago, porina said:

Basically, before Zen 2 the FP potential of the CPU was so low anyway,

Isn't that wrong?
The Reddit user tested it on a 2XXX-series CPU, which is Zen+, not Zen 2.

Yes, AMD improved from Zen(+) to Zen 2 by widening the FPU from 128-bit to 256-bit, but the performance gain is there for Zen+ too, not only for Zen 2.
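
The datapath-width argument in this exchange can be put into rough numbers. This is only back-of-the-envelope arithmetic for theoretical peak FP64 throughput per core; the unit counts passed in are illustrative assumptions rather than quoted specifications, and `peak_flops_per_cycle` is a made-up helper name.

```python
def peak_flops_per_cycle(vector_bits, fma_units, element_bits=64):
    """Peak FLOPs per cycle per core: lanes * units * 2 (an FMA is a multiply plus an add)."""
    lanes = vector_bits // element_bits
    return lanes * fma_units * 2


# Unit counts below are illustrative assumptions, not measured or quoted figures.
print(peak_flops_per_cycle(128, 2))  # two 128-bit units: 8
print(peak_flops_per_cycle(256, 2))  # two 256-bit units: 16
print(peak_flops_per_cycle(512, 1))  # one 512-bit unit: 16, same as two 256-bit units
print(peak_flops_per_cycle(512, 2))  # two 512-bit units: 32
```

Peak numbers like these say nothing about how far below the peak a forced SSE fallback leaves a given CPU, which is why a gain on Zen+ is still plausible.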

porina · November 20, 2019

23 minutes ago, fanatiXalpha said:

Isn't that wrong?
The Reddit user tested it on a 2XXX-series CPU, which is Zen+, not Zen 2.

Yes, AMD improved from Zen(+) to Zen 2 by widening the FPU from 128-bit to 256-bit, but the performance gain is there for Zen+ too, not only for Zen 2.

I can only speak for the use cases I'm interested in, which are FP64 heavy. Other mixes of instructions may see some benefit, but I haven't looked at them, nor do I care about them.

AnonymousGuy · November 21, 2019

15 hours ago, That Franc said:

Unless this is sarcasm, Linus covered it already - Intel does indeed do this with their ICC compiler, and even discloses it in the docs, although the doc is hidden behind a couple of links on hard-to-navigate pages, a bunch of scrolling, and obscure wording. There's also a good article about this on SemiAccurate.

Intel literally puts a disclaimer on every relevant benchmark slide saying that the results were produced with their own compilers and that it's not optimized for other platforms. You'd have to be foolish to think Intel would spend $ doing software dev for a competing product with trivial market share. And the fact that "proprietary" compilers work better with Intel is part of their value proposition.

fanatiXalpha · November 22, 2019

There are no $ to spend doing "software dev for a competing product".

The only question is whether I give my competitor the same tools, or rather: whether I allow them to use the tools.

It's like a tree-cutting competition.

Intel is allowed to use the chainsaw and AMD only an axe.

But hey, just keep comforting yourself with this type of thinking...

SpaceGhostC2C · November 22, 2019

2 hours ago, fanatiXalpha said:

It's like a tree-cutting competition.

Intel is allowed to use the chainsaw and AMD only an axe.

Well, you have to consider that, following the example, the chainsaw is designed and built by Intel. Is software part of the competition? Do we leave drivers out when comparing GPUs? Wouldn't it be fairer to compare F1 drivers by having them race in identical cars? Isn't car development part of the competition, though?

The subject more broadly defined is not trivial.

Having said that, this particular issue has always rested on a rather lame argument ("we're not 100% sure how AMD implemented it so we nuked the whole thing", which would be like Intel's Linux distribution disabling SMT on AMD because "we use hyperthreading, we can't vouch for AMD's implementation"), especially since it's not up to you to choose which instruction set to compile for and see what happens; it is hard-locked based on CPU ID instead, with the exception of this workaround.

Of course, there's always the question: do you ditch AMD due to poor MKL performance, or do you ditch MKL due to poor performance on your hardware? For people who may not even know they use MKL (like Matlab users), it's probably a matter of benchmarking hardware and discarding AMD. For people developing in C++ or Fortran, they're better off using GNU compilers and running in Linux (Fortran is something like GNU+Linux > Intel+Windows > Intel+Linux > GNU+Windows, for C++ I think the GNU compiler is always best), but unless they develop their own linear algebra libraries or find good-performing, platform-agnostic alternatives they may stick with MKL for simplicity...

KaitouX · November 22, 2019

26 minutes ago, SpaceGhostC2C said:

Well, you have to consider that, following the example, the chainsaw is designed and built by Intel. Is software part of the competition? Do we leave drivers out when comparing GPUs? Wouldn't it be fairer to compare F1 drivers by having them race in identical cars? Isn't car development part of the competition, though?

The subject more broadly defined is not trivial.

This seems to be more a case of Intel putting up a sign that says anyone without the "GenuineIntel" badge can't use any of the tools, even though all the tools are available for free. The fix basically creates a fake badge that allows any processor to use the tool it prefers.

Regarding the drivers on GPUs, I personally think this issue is closer to some of the Gameworks features gimping AMD GPUs than to drivers.

Flying Sausages · November 22, 2019

Anyone test this?

fanatiXalpha · November 22, 2019

@SpaceGhostC2C

I get your point, this gets somewhat in the direction of "lawful evil" from Linus.

Intel is probably in the right to do this, but it is an asshole move.

Because if you say "hey, we developed it and we don't want our competitor to be good at it", then why let the other CPU run the SW at all?!
Why not do it like this:
"It's our SW, so it only runs on our CPUs. On AMD, the SW won't start at all."

But this, this is a real bitch move.

Let it work, but in a bad way / with bad performance.

Because the average user that uses Matlab or other SW that relies on MKL will have no possibility to see why it is that much slower on AMD than Intel.

To my knowledge, there is no indication of which instructions are used.

And so most of the affected users think: "Hm, AMD really sucks at making CPUs."

But in fact, they don't.

It would be like car manufacturer A develops a new formula for gasoline (for more power, efficiency or whatever) and sells it to every gas station so that everyone can buy it.

But when a car from manufacturer B wants to get the same gasoline, the gas station detects that the car is from B and not A and proceeds by giving the customer with car B the standard gasoline without telling him*.

And this is the point I have a problem with.

Is it the right of Intel to do so? Probably, I'm no legal expert and from my moral standpoint I would say this shouldn't be, but I don't know.

But in the end, still a bitch move.

*With the same price, the same product name whatever.

It is hard to make comparisons, because they are fundamentally different things.

But I tried.

Edited November 22, 2019 by fanatiXalpha
Consistency about "Is Intel in the right"

BlueJedi · November 22, 2019

Interesting. This will also affect some of the functions in Maple. I'll have to test this out later.

Jito463 · November 23, 2019

Seriously? I thought Intel stopped doing this crap after the lawsuit against them. Interesting that someone found a workaround, but the title would be more accurate to say that it was one specific program (MatLab). Does anyone know if this has been tested on other ICC compiled software?

fanatiXalpha · November 25, 2019

Why would it be more accurate to mention a specific program?
It's not the program itself but rather the library that the program is using.

That's why I put "Intel MKL" in the title.

Another affected program is, for example, NumPy for Python.

Any program that uses the Intel MKL will show this behaviour on non-Intel CPUs.
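
For anyone who wants to check whether a given program is affected, one option is to time the same workload with and without the variable set. This is a minimal sketch using only the Python standard library; `time_command` and `compare` are hypothetical helper names, and the workload shown is a placeholder that does not touch MKL at all, so it needs to be replaced with a command that does.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, extra_env=None):
    """Run cmd once and return its wall-clock time in seconds."""
    env = dict(os.environ)
    env.pop("MKL_DEBUG_CPU_TYPE", None)  # start from a clean baseline
    env.update(extra_env or {})
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


def compare(cmd):
    """Time cmd with the default MKL path and with the workaround applied."""
    baseline = time_command(cmd)
    forced = time_command(cmd, {"MKL_DEBUG_CPU_TYPE": "5"})
    return baseline, forced


if __name__ == "__main__":
    # Placeholder workload; replace it with a script that exercises MKL.
    workload = [sys.executable, "-c", "print(sum(i * i for i in range(10**6)))"]
    baseline, forced = compare(workload)
    print(f"default: {baseline:.3f}s  forced AVX2 path: {forced:.3f}s")
```

A single run per setting is noisy, so repeat the comparison a few times before drawing conclusions.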

Mark Kaine · November 25, 2019

Ok I didn't understand a thing except AMD CPUs being neutered, what kind of programs are affected by this for example?

PS: and is this even legal? I suppose it is, but it could also not be of course...

fanatiXalpha · November 29, 2019

On 11/25/2019 at 12:23 PM, Mark Kaine said:

what kind of programs are affected by this for example?

Matlab
