Workaround boosts Ryzen Performance for Intel MKL

fanatiXalpha

I looked around and couldn't find a thread about this news on the forum, so I'm opening one myself.

If there is already a thread here on the same story: my apologies!

 

A user on Reddit found a workaround that boosts the performance of AMD CPUs in applications that use the Intel Math Kernel Library (MKL).

The problem of lower performance on AMD CPUs has existed, and been known about, for about 10 years.

The cause is that the library checks the CPU's vendor ID at startup; for an AMD CPU, this results in a fallback to the plain SSE code path.

This happens even though AMD's Ryzen CPUs, and even older AMD CPUs, support much faster instruction sets such as AVX.
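
To make the difference concrete, here is a minimal sketch contrasting a vendor-gated choice of code path with a feature-based one. It is purely illustrative and is not Intel's actual dispatch code: the function names, the path names and the feature set are all hypothetical.

```python
# Hypothetical illustration only: this is NOT Intel's actual dispatch code.
# Code path names, ordered from fastest to slowest.
PATHS = ["avx2", "avx", "sse"]


def pick_path_by_features(features):
    """Feature-based dispatch: use the fastest path the CPU reports."""
    for path in PATHS:
        if path in features:
            return path
    return "sse"


def pick_path_by_vendor(vendor, features):
    """Vendor-gated dispatch: non-Intel CPUs always get the baseline path."""
    if vendor != "GenuineIntel":
        return "sse"
    return pick_path_by_features(features)


# Made-up feature set for a CPU that supports AVX2.
ryzen_features = {"sse", "avx", "avx2"}
print(pick_path_by_vendor("AuthenticAMD", ryzen_features))  # sse
print(pick_path_by_features(ryzen_features))                # avx2
```

With the vendor gate, the same feature set ends up on the slow path simply because the vendor string differs.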

 

The workaround forces the library to run in AVX2 mode, regardless of whether the CPU is Intel or not.

Performance gains vary from one CPU generation to the next, but in certain scenarios there is an uplift of 250% or more.

Zen 2 gains more than Zen 1(+), for example.

 

Hope this will be covered on TechLinked or the WAN Show...

 

Sources:

 

Reddit: https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/

CB: https://www.computerbase.de/2019-11/mkl-workaround-erhoeht-leistung-auf-amd-ryzen/

HWLUXX: https://www.hardwareluxx.de/index.php/news/hardware/prozessoren/51542-anwendungen-mit-intel-mkl-lassen-amd-cpus-oft-schlecht-darstehen.html

 

Sorry for the lack of detail, but I'm at work right now... :D


1 hour ago, Humbug said:

I didn't know that stuff like this was still happening; AMD CPUs being detected via the vendor ID and then suboptimal methods being used. 

Unless this is sarcasm, Linus covered it already - Intel does indeed do this with their ICC compiler, and even discloses it in the docs, although the doc is hidden behind a couple of links on hard-to-navigate pages, a bunch of scrolling, and obscure wording. There's also a good article about this on SemiAccurate.


Before Zen 2, AMD's FP-related AVX implementations had such poor throughput that you might as well use some version of SSE. AMD cheaped out on the FPU, as they historically have, and only decided to catch up to Intel's consumer CPUs with Zen 2, even slightly overtaking them. AMD is still way behind on AVX-512 support, although in cases where that might be used, AMD has the other trick of just throwing more cores at it. Supporting the instruction is not enough: you have to have the hardware behind it to make a real-world difference.

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


2 hours ago, porina said:

Before Zen 2, AMD's FP-related AVX implementations had such poor throughput that you might as well use some version of SSE.

Not really - not using your hardware to its full potential will result in lower performance than necessary, no matter how bad your hardware is. This has been shown all the way back to Bulldozer, and I wouldn't be surprised if it applies to older CPUs as well.

 

The issue isn't even in MKL itself, but in Intel's compilers (which are naturally used to compile MKL), which create branched code with the branches selected by CPU ID instead of by the instruction sets supported. Luckily, there is a way to manually tell the MKL binaries which path to take (although maybe they'll close it in the future? Hopefully they need it there :P), but then you have to choose the right path for your CPU or it won't work.

Anyone wanting to use this on CPUs older than Zen 2 should check which AVX/FMA/SSE instruction sets are supported and choose the value of MKL_DEBUG_CPU_TYPE accordingly (to match the best one supported). There is a discussion of the alternatives to "5" in the comments.

Since the fallback to basic SSE is a worst-case scenario, practically any non-Intel CPU you have around can benefit, albeit not as much as the latest Zen.
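
As a rough sketch of how the override could be applied from a script: the helper below starts a program with MKL_DEBUG_CPU_TYPE set in its environment. The variable name and the value "5" come from this thread; the helper names are made up for illustration, and the sketch assumes the library reads the variable when it is loaded, so it has to be set before the MKL-using program starts.

```python
import os
import subprocess
import sys


def mkl_env(cpu_type="5", base=None):
    """Return a copy of the environment with MKL_DEBUG_CPU_TYPE set.

    "5" is the value discussed in the thread; other values are
    CPU-dependent and must match what the CPU actually supports.
    """
    env = dict(os.environ if base is None else base)
    env["MKL_DEBUG_CPU_TYPE"] = str(cpu_type)
    return env


def run_with_mkl_override(command, cpu_type="5"):
    """Run `command` (a list of arguments) with the override applied."""
    return subprocess.run(command, env=mkl_env(cpu_type), check=False)


if __name__ == "__main__":
    # Demo: a child Python process that only reports the variable it sees.
    run_with_mkl_override(
        [sys.executable, "-c",
         "import os; print(os.environ.get('MKL_DEBUG_CPU_TYPE'))"]
    )
```

Setting the variable only for the child process keeps the override limited to the one program you want to test, instead of changing it system-wide.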


13 minutes ago, SpaceGhostC2C said:

Not really - not using your hardware to its full potential will result in lower performance than necessary, no matter how bad your hardware is. 

I don't disagree with that, but it isn't what I'm saying. Basically, before Zen 2, the FP potential of the CPU was so low anyway that it didn't make a meaningful difference whether you used AVX or SSE (I can't remember the exact SSE version). You could see the AVX implementation as a kind of emulation: it looked like it supported the instructions, but it relied on multiple instructions to produce the result, negating the point of having them.

 

Intel also plays this trick: cores with a single AVX-512 unit are no better than AVX2. You need a 2-unit implementation to get double the throughput. At least Skylake-X got a 2-unit implementation.

 

Also note I'm particularly referring to FP, specifically FP64 performance. Lesser areas may be less impacted.
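
The unit-count argument above can be checked with a bit of back-of-the-envelope arithmetic. This sketch assumes peak FP64 throughput per core per cycle is (vector width / 64 bits) x 2 FLOPs per FMA x number of FMA units; the configurations listed are illustrative combinations, not measured figures for any specific CPU.

```python
def peak_fp64_flops_per_cycle(vector_bits, fma_units):
    """Theoretical peak FP64 FLOPs per core per cycle (simplified model)."""
    lanes = vector_bits // 64      # FP64 values that fit in one vector
    return lanes * 2 * fma_units   # one FMA = multiply + add = 2 FLOPs per lane


# Illustrative width/unit combinations (assumptions, not vendor data).
configs = {
    "128-bit, 2 units": (128, 2),
    "256-bit (AVX2), 2 units": (256, 2),
    "512-bit (AVX-512), 1 unit": (512, 1),
    "512-bit (AVX-512), 2 units": (512, 2),
}
for name, (bits, units) in configs.items():
    print(f"{name}: {peak_fp64_flops_per_cycle(bits, units)} FLOPs/cycle")
```

Under this model, one 512-bit unit gives the same peak as two 256-bit units (16 FLOPs/cycle), and only the 2-unit 512-bit case doubles it to 32.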


1 hour ago, porina said:

Basically, before Zen 2, the FP potential of the CPU was so low anyway,

Isn't that wrong?
The Reddit user tested it on a 2XXX-series Zen CPU, which is Zen+ and not Zen 2.

Yes, AMD improved from Zen(+) to Zen 2 by widening the FP units from 128-bit to 256-bit, but the performance gain is there for Zen+ too, not only for Zen 2.


23 minutes ago, fanatiXalpha said:

Isn't that wrong?
The Reddit user tested it on a 2XXX-series Zen CPU, which is Zen+ and not Zen 2.

Yes, AMD improved from Zen(+) to Zen 2 by widening the FP units from 128-bit to 256-bit, but the performance gain is there for Zen+ too, not only for Zen 2.

I can only speak for the use cases I'm interested in, which are FP64-heavy. Other mixes of instructions may see some benefit, but I haven't looked at them, nor do I care about them.


15 hours ago, That Franc said:

Unless this is sarcasm, Linus covered it already - Intel does indeed do this with their ICC compiler, and even discloses it in the docs, although the doc is hidden behind a couple of links on hard-to-navigate pages, a bunch of scrolling, and obscure wording. There's also a good article about this on SemiAccurate.

Intel literally puts a disclaimer on every relevant benchmark slide saying that the results were produced with their own compilers, which are not optimized for other platforms. You'd have to be naive to think Intel would spend $ doing software dev for a competing product with trivial market share. And the fact that "proprietary" compilers work better with Intel is part of their value proposition.

Workstation:  13700k @ 5.5Ghz || Gigabyte Z790 Ultra || MSI Gaming Trio 4090 Shunt || TeamGroup DDR5-7800 @ 7000 || Corsair AX1500i@240V || whole-house loop.

LANRig/GuestGamingBox: 9900nonK || Gigabyte Z390 Master || ASUS TUF 3090 650W shunt || Corsair SF600 || CPU+GPU watercooled 280 rad pull only || whole-house loop.

Server Router (Untangle): 13600k @ Stock || ASRock Z690 ITX || All 10Gbe || 2x8GB 3200 || PicoPSU 150W 24pin + AX1200i on CPU|| whole-house loop

Server Compute/Storage: 10850K @ 5.1Ghz || Gigabyte Z490 Ultra || EVGA FTW3 3090 1000W || LSI 9280i-24 port || 4TB Samsung 860 Evo, 5x10TB Seagate Enterprise Raid 6, 4x8TB Seagate Archive Backup ||  whole-house loop.

Laptop: HP Elitebook 840 G8 (Intel 1185G7) + 3080Ti Thunderbolt Dock, Razer Blade Stealth 13" 2017 (Intel 8550U)


There are no $ to spend doing "software dev for a competing product".

The only question is: do I give my competitor the same tools or, better put, do I allow them to use the tools?

 

It's like a competition to cut down a tree.

Intel is allowed to use the chainsaw and AMD only an axe.

 

But hey, just keep comforting yourself with this type of thinking...


2 hours ago, fanatiXalpha said:

It's like a competition to cut down a tree.

Intel is allowed to use the chainsaw and AMD only an axe.

 

Well, you have to consider that, following the example, the chainsaw is designed and built by Intel :P Is software part of the competition? Do we leave drivers out when comparing GPUs? Wouldn't it be more fair to compare F1 drivers by having them race in identical cars? Isn't car development part of the competition, though?

The subject more broadly defined is not trivial.

 

Having said that, this particular issue has always rested on a rather lame argument ("we're not 100% sure how AMD implemented it so we nuked the whole thing", which would be like Intel's Linux distribution disabling SMT on AMD because "we use hyperthreading, we can't vouch for AMD's implementation"), especially since it's not up to you to choose which instruction set to compile for and see what happens; rather, it is hard-locked based on CPU ID, with the exception of this workaround.

 

Of course, there's always the question: do you ditch AMD due to poor MKL performance, or do you ditch MKL due to poor performance on your hardware? For people who may not even know they use MKL (like Matlab users), it's probably a matter of benchmarking hardware and discarding AMD. For people developing in C++ or Fortran, they're better off using GNU compilers and running in Linux (Fortran is something like GNU+Linux > Intel+Windows > Intel+Linux > GNU+Windows, for C++ I think the GNU compiler is always best), but unless they develop their own linear algebra libraries or find good-performing, platform-agnostic alternatives they may stick with MKL for simplicity...


26 minutes ago, SpaceGhostC2C said:

Well, you have to consider that, following the example, the chainsaw is designed and built by Intel :P Is software part of the competition? Do we leave drivers out when comparing GPUs? Wouldn't it be more fair to compare F1 drivers by having them race in identical cars? Isn't car development part of the competition, though?

The subject more broadly defined is not trivial.

This seems to be more a case of Intel putting up a sign saying that anyone without the "GenuineIntel" badge can't use any of the tools, even when all the tools are available for free. The fix basically creates a fake badge that allows any processor to use the tool it prefers.

Regarding the drivers on GPUs, I personally think this issue is closer to some of the Gameworks features gimping AMD GPUs than to drivers.
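
For anyone who wants to look at the "badge" and the supported instruction sets on their own machine, here is a small sketch that parses the vendor string and feature flags from text in the format of Linux's /proc/cpuinfo. The sample text is made up for illustration; on a real Linux system you would pass in the contents of /proc/cpuinfo instead.

```python
def parse_cpuinfo(text):
    """Return (vendor_id, flags) from the first processor entry."""
    vendor, flags = None, set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        key = key.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return vendor, flags


# Made-up sample; on Linux use: open("/proc/cpuinfo").read()
SAMPLE = """\
processor\t: 0
vendor_id\t: AuthenticAMD
flags\t\t: fpu sse sse2 avx avx2 fma
"""

vendor, flags = parse_cpuinfo(SAMPLE)
print(vendor, sorted(flags & {"avx", "avx2", "fma"}))
# AuthenticAMD ['avx', 'avx2', 'fma']
```

The vendor string is what the "sign" checks; the flags are what actually say which instructions the CPU can run.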


@SpaceGhostC2C

I get your point, this gets somewhat in the direction of "lawful evil" from Linus.

Intel is probably in the right to do this, but it is an asshole move.

Because if you say, "Hey, we developed it and don't want our competitor to be good at it," then why let the other CPU run the software at all?!
Why not do it like this:
"It's our software, so it only runs on our CPUs. On AMD, the software won't start at all."

 

But this, this is a real bitch move.

Let it work, but in a bad way / with bad performance.

Because the average user of Matlab or other software that relies on MKL has no way of seeing why it is so much slower on AMD than on Intel.

To my knowledge, there is no indication of which instructions are used.

And so most of the affected users think: "Hm, AMD really sucks at making CPUs."

But in fact, they don't.

 

It would be like car manufacturer A developing a new formula for gasoline (for more power, efficiency or whatever) and selling it to every gas station so that everyone can buy it.

But when a car from manufacturer B wants to get the same gasoline, the gas station detects that the car is from B and not A and proceeds by giving the customer with car B the standard gasoline without telling him*.

 

And this is the point I have a problem with.

Is Intel within its rights to do so? Probably. I'm no legal expert, and from a moral standpoint I would say this shouldn't be allowed, but I don't know.

But in the end, still a bitch move.

 

*With the same price, the same product name whatever.

It is hard to make comparisons, because these are fundamentally different things.

But I tried :D

Edited by fanatiXalpha
Consistency about "Is Intel in the right"

Interesting. This will also affect some of the functions in Maple. I'll have to test this out later.


Seriously?  I thought Intel stopped doing this crap after the lawsuit against them.  Interesting that someone found a workaround, but the title would be more accurate if it said that this was about one specific program (MATLAB).  Does anyone know if this has been tested on other ICC-compiled software?


Why would it be more accurate to mention a specific program?
It's not the program itself but rather the library that the program is using.

That's why I put "Intel MKL" in the title.

Another affected program is, for example, NumPy for Python.

Any program that uses the Intel MKL will show this behaviour on non-Intel CPUs.
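
To check whether a given NumPy installation is built against MKL at all, one rough approach is to capture what numpy.show_config() prints and look for "mkl" in it. This is only a text heuristic, not an official detection method; the helper names are made up, and NumPy builds that use other BLAS libraries will not mention MKL.

```python
import contextlib
import io


def mentions_mkl(config_text):
    """Heuristic: does a build-configuration dump mention MKL?"""
    return "mkl" in config_text.lower()


def numpy_config_text():
    """Capture what numpy.show_config() prints; empty if NumPy is missing."""
    try:
        import numpy
    except ImportError:
        return ""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()


if __name__ == "__main__":
    print("NumPy build mentions MKL:", mentions_mkl(numpy_config_text()))
```

If the result is False, the NumPy build most likely uses a different BLAS backend and the workaround from this thread would not apply to it.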


OK, I didn't understand a thing except that AMD CPUs are being neutered. What kind of programs are affected by this for example?

 

 

PS: and is this even legal?   I suppose it is, but it could also not be of course... 

The direction tells you... the direction

-Scott Manley, 2021

 

Softwares used:

Corsair Link (Anime Edition) 

MSI Afterburner 

OpenRGB

Lively Wallpaper 

OBS Studio

Shutter Encoder

Avidemux

FSResizer

Audacity 

VLC

WMP

GIMP

HWiNFO64

Paint

3D Paint

GitHub Desktop 

Superposition 

Prime95

Aida64

GPUZ

CPUZ

Generic Logviewer

 

 

 


On 11/25/2019 at 12:23 PM, Mark Kaine said:

what kind of programs are affected by this for example?

Matlab

