
Linus Torvalds: "I Hope AVX512 Dies A Painful Death"

More rage bursts from Linus Torvalds, this time targeted at Intel.

 

Quotes


In a mailing list discussion stemming from the Phoronix article this week on the compiler instructions Intel is enabling for Alder Lake (and Sapphire Rapids), Linus Torvalds chimed in. The Alder Lake instructions being flipped on in GCC right now make no mention of AVX-512, only AVX2 and others, likely because Intel is targeting the subset supported by both the small and large cores in this new hybrid design.

 

The absence of AVX-512 for Alder Lake led Torvalds to comment:

 

I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on.
I hope Intel gets back to basics: gets their process working again, and concentrate more on regular code that isn't HPC or some other pointless special case.

 

My thoughts

Old Linus yells at yet another x86 ISA extension. 😅

 

Sources

https://www.phoronix.com/scan.php?page=news_item&px=Linus-Torvalds-On-AVX-512


Don't know that I would consider this NEWS as such.


True. Also, maybe if they get rid of it, software will finally be properly optimized for the universally available AVX2.
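For what it's worth, a minimal sketch of what "optimized for AVX2" tends to mean in practice: 256-bit SIMD intrinsics (or letting the compiler auto-vectorize to them). The function name and the multiple-of-8 length are my own assumptions, purely for illustration:

#include <immintrin.h>

/* Add two int32 arrays 8 elements at a time. 256-bit integer ops are
   the part AVX2 added over AVX. Assumes n is a multiple of 8;
   build with -mavx2 (GCC/Clang). */
void add_i32_avx2(const int *a, const int *b, int *out, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
}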


18 minutes ago, DuckDodgers said:

and concentrate more on regular code that isn't HPC or some other pointless special case.

Well, good luck with that; the "pointless special cases" are most likely what brings the monies in from big corp.


x86 is getting overloaded 

 

Edit: realised that that is not the point Linus Torvalds was making.


His point is about Intel creating an instruction, getting some benchmark company to create a tool to test it, and then running it against Ryzen and Intel chips to show that Intel is still relevant. But no one is using AVX-512 (yet). "Look at how bad Ryzen chips are at running this one-off test for an instruction that we just invented! Buy our stuff plx."

 

It has nothing to do with the size of the ISA. 


This guy is awesome lol. He has the balls to tell nVidia to fuck off and now Intel. Linus Torvalds is right tho. Intel is tryna make dem cpu's look great while AMD is doing better and kicks Intel in the ass.


1 hour ago, descendency said:

But no one is using AVX512 (yet)

Quite a lot of people actually seem to be using it. Heck, some of the more common software that does have AVX-512 support is... x265 and x264.
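That fits, since encoders like x264/x265 don't require AVX-512 - they probe the CPU at startup and dispatch to whichever SIMD kernels are available. A minimal sketch of that pattern using GCC/Clang's __builtin_cpu_supports; the encode_block_* names are hypothetical, not x264's actual functions:

#include <stdio.h>

/* Hypothetical kernels - real encoders have hand-written asm here. */
static void encode_block_scalar(void) { puts("scalar path"); }
static void encode_block_avx2(void)   { puts("AVX2 path"); }
static void encode_block_avx512(void) { puts("AVX-512 path"); }

/* Pick the widest SIMD path the CPU offers (the builtin reads CPUID). */
void encode_block(void)
{
    if (__builtin_cpu_supports("avx512f"))
        encode_block_avx512();
    else if (__builtin_cpu_supports("avx2"))
        encode_block_avx2();
    else
        encode_block_scalar();
}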


I agree that many of the workloads AVX-512 excels at are probably better suited to GPUs, though at the same time moving the data to the GPU has costs and the workloads don't always fit, which is probably a good place to improve things. Can we please get Gen-Z or CXL?


There's nothing wrong with instructions if they can speed things up. MMX, SSE and 3DNow! made a huge difference for games, to the point that trying to run them without these instructions meant they ran significantly slower.

 

Btw, Cinebench R20 uses AVX2 and AVX512 and AMD still just totally annihilates Intel. :D So, instructions alone don't really shift things in any direction if the competing product is just better.

 

Btw, how is it with FMA instructions? Are they used anywhere practical?


29 minutes ago, RejZoR said:

There's nothing wrong with instructions if they can speed things up. MMX, SSE and 3DNow! made a huge difference for games, to the point that trying to run them without these instructions meant they ran significantly slower.

 

Btw, Cinebench R20 uses AVX2 and AVX512 and AMD still just totally annihilates Intel. :D So, instructions alone don't really shift things in any direction if the competing product is just better.

 

Btw, how is it with FMA instructions? Are they used anywhere practical?

It depends; there are at least two FMA instruction sets, FMA4 and FMA3. FMA4 is present on Zen chips but not advertised - not sure if Zen 2 has it. Early tests showed better performance than AVX2, but it was buggy and gave wrong results.


49 minutes ago, RejZoR said:

Btw, Cinebench R20 uses AVX2 and AVX512 and AMD still just totally annihilates Intel. :D So, instructions alone don't really shift things in any direction if the competing product is just better.

Run-time analysis of the benchmark indicates it doesn't touch AVX-512 at all and it uses a mix of SSE(1/2), AVX(1/2) and FMA instructions in various proportions.

Looks like Maxon implemented only a small part of Intel's Embree library.

 

17 minutes ago, cj09beira said:

It depends; there are at least two FMA instruction sets, FMA4 and FMA3. FMA4 is present on Zen chips but not advertised - not sure if Zen 2 has it. Early tests showed better performance than AVX2, but it was buggy and gave wrong results.

FMA4 was removed in Zen 2.


5 hours ago, Sauron said:

True. Also, maybe if they get rid of it, software will finally be properly optimized for the universally available AVX2.

AVX-512 isn't just AVX2 but bigger; it's really a whole family of instruction sets. At the base level, which I think is standard across AVX-512 CPUs, it does what AVX2 does but on up to twice as much data at once. How much you gain depends on the CPU implementation: you need a "2 unit" AVX-512 design to get that doubling, otherwise it may be no better than AVX2. If code can already scale to AVX2, it shouldn't be that hard to extend it to AVX-512, although the increased processing rate can mean you hit other limits in the architecture sooner.
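A minimal sketch of that width doubling with intrinsics - the same loop, 8 floats per step in a 256-bit register versus 16 under AVX-512F. Function names and the multiple-of-width length are my assumptions for brevity:

#include <immintrin.h>

/* Scale an array in place: 256-bit version, 8 floats per iteration. */
void scale_avx2(float *x, float s, int n)
{
    __m256 vs = _mm256_set1_ps(s);
    for (int i = 0; i < n; i += 8)
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vs));
}

/* Same operation with AVX-512F: 512-bit registers, 16 floats per iteration. */
void scale_avx512(float *x, float s, int n)
{
    __m512 vs = _mm512_set1_ps(s);
    for (int i = 0; i < n; i += 16)
        _mm512_storeu_ps(x + i, _mm512_mul_ps(_mm512_loadu_ps(x + i), vs));
}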

 

AVX-512 also has optional additional feature sets, like the latest ones targeted at machine learning applications. Maybe in the future we could have a CPU equivalent of nvidia doing DLSS on the GPU, for example. Who knows where software could take us once there is sufficient support.

 

1 hour ago, RejZoR said:

Btw, Cinebench R20 uses AVX2 and AVX512 and AMD still just totally annihilates Intel. :D So, instructions alone don't really shift things in any direction if the competing product is just better.

Does CB R20 use AVX-512? I'm not sure about that. Guess I could walk approx 3m to my left and fire up my 7920X and try it...

 

Anyway, if R20 "only" uses AVX2, AMD Zen 2 does have an on-average better implementation of it than Intel's desktop consumer CPUs.

 

1 hour ago, RejZoR said:

Btw, how is it with FMA instructions? Are they used anywhere practical?

DSP-type processing. The way filter coefficients and data are used, you often have to do a multiply followed by an add, so why not do both with a single instruction?
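As a sketch of why that matters: a FIR filter inner loop is just coefficient × sample + accumulator, which FMA collapses into one instruction per vector. The function name and the multiple-of-8 tap count are assumptions for brevity:

#include <immintrin.h>

/* Dot product of filter coefficients and samples using FMA3.
   Assumes taps is a multiple of 8; build with -mfma (GCC/Clang). */
float fir_dot(const float *coeff, const float *sample, int taps)
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < taps; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(coeff + i),   /* a*b + acc */
                              _mm256_loadu_ps(sample + i), acc);

    float lanes[8];                      /* horizontal sum of 8 partials */
    _mm256_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3]
         + lanes[4] + lanes[5] + lanes[6] + lanes[7];
}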

 

35 minutes ago, cj09beira said:

It depends; there are at least two FMA instruction sets, FMA4 and FMA3. FMA4 is present on Zen chips but not advertised - not sure if Zen 2 has it. Early tests showed better performance than AVX2, but it was buggy and gave wrong results.

FMA3 and FMA4 differ in the number of operands but otherwise do the same thing. FMA4 was introduced by AMD during - was it the Bulldozer era? The thing with the modules. The problem was that instruction support is worthless without the hardware to back it up: those CPUs just didn't have the compute capability, so FMA4 ran no faster than the alternative instructions. Intel, on the other hand, did have FMA hardware from Haswell onwards, and that led to a big jump in performance over Sandy/Ivy Bridge.
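To illustrate the operand difference - the assembly forms are shown as comments; from C you normally just use the FMA3 intrinsic and let the compiler allocate registers:

#include <immintrin.h>

/* FMA3 is destructive - one source register doubles as the destination:
     vfmadd231ps ymm0, ymm1, ymm2       ; ymm0 = ymm1*ymm2 + ymm0
   FMA4 had a separate destination, hence the fourth operand:
     vfmaddps ymm0, ymm1, ymm2, ymm3    ; ymm0 = ymm1*ymm2 + ymm3 */
__m256 fma3_madd(__m256 a, __m256 b, __m256 c)
{
    return _mm256_fmadd_ps(a, b, c);    /* a*b + c in a single vfmadd */
}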


5 minutes ago, DuckDodgers said:

Run-time analysis of the benchmark indicates it doesn't touch AVX-512 at all and it uses a mix of SSE(1/2), AVX(1/2) and FMA instructions in various proportions.

Looks like Maxon implemented only a small part of Intel's Embree library.

 

FMA4 was removed in Zen 2.

I think most of the math libraries out there for 3D applications haven't touched AVX yet. They're still running SSE2 because it's pretty much ubiquitous across the processors in use. It's still a pain in the ass to use for any horizontal math, though that changed with SSE3, which brought some pretty big performance bumps because you don't have to get creative with it, and it got even better in SSE4.1 because they added a single-instruction dot product. I personally don't see a point in moving to AVX because the workload is perfectly suited to SSE; you don't really need much more. At least for games.
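For reference, the single-instruction dot product mentioned there is DPPS from SSE4.1, exposed as the _mm_dp_ps intrinsic. A minimal sketch for a 4-element dot product (function name mine):

#include <smmintrin.h>

/* 4-float dot product with SSE4.1's DPPS.
   Mask 0xF1: multiply all four lanes, sum them, write result to lane 0. */
float dot4(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
}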


31 minutes ago, porina said:

AVX-512 isn't just AVX2 but bigger; it's really a whole family of instruction sets. At the base level, which I think is standard across AVX-512 CPUs, it does what AVX2 does but on up to twice as much data at once. How much you gain depends on the CPU implementation: you need a "2 unit" AVX-512 design to get that doubling, otherwise it may be no better than AVX2. If code can already scale to AVX2, it shouldn't be that hard to extend it to AVX-512, although the increased processing rate can mean you hit other limits in the architecture sooner.

Yeah I know, but afaik, because it is a different set of instructions from AVX2, optimization is typically done with only one of the two in mind, even if the developers don't strictly need the extra features. Plus I'm not certain that workloads always scale up - a lot of things barely scale up to 64-bit words, let alone 256 or 512 - and it can obliterate your memory footprint because you need longer words even for small data, sometimes for no real reason.

38 minutes ago, porina said:

AVX-512 also has optional additional feature sets, like the latest ones targeted at machine learning applications. Maybe in the future we could have a CPU equivalent of nvidia doing DLSS on the GPU, for example. Who knows where software could take us once there is sufficient support.

...yeah, but honestly I don't see it; GPUs, and CUDA in particular, are too deeply dug in by now.

39 minutes ago, trag1c said:

I think most of the math libraries out there for 3D applications haven't touched AVX yet.

Eigen has afaik


29 minutes ago, Sauron said:

Eigen has afaik

Didn't know that... not that I really ever look at patch notes for any of those libraries lol. I wonder when that came in.


1 hour ago, porina said:

AVX-512 also has optional additional feature sets, like the latest ones targeted at machine learning applications. Maybe in the future we could have a CPU equivalent of nvidia doing DLSS on the GPU, for example. Who knows where software could take us once there is sufficient support.

Since no one is going back to software graphics engines for games, the prospect of more SIMD extensions with wider vectors has become a trend of diminishing returns for the mass market. Even workstation loads, like off-line rendering and photo/video editing, are relying more and more on the GPU for parallel processing beyond the traditional graphics acceleration -- it's faster and more energy efficient, and the API overhead is being reduced faster than any new CPU ISA addition could gain traction.


Some have stated that switching between AVX-512 and other instruction sets incurs a major latency penalty. So it seems to me that if you're going to use AVX-512, your application is better off going all-in rather than nipping at it casually. But then again, if you're going to use something like that so heavily, why not just GPGPU the app to begin with?


1 hour ago, Sauron said:

Plus I'm not certain that workloads always scale up - a lot of things barely scale up to 64-bit words, let alone 256 or 512 - and it can obliterate your memory footprint because you need longer words even for small data, sometimes for no real reason.

It makes sense to use SIMD if you do have that multiple data. It is not meant to be a universal solution: use it if available and appropriate, which for sure won't be always.

 

37 minutes ago, DuckDodgers said:

Since no one is going back to software graphics engines for games, the prospect of more SIMD extensions with wider vectors has become a trend of diminishing returns for the mass market. Even workstation loads, like off-line rendering and photo/video editing, are relying more and more on the GPU for parallel processing beyond the traditional graphics acceleration -- it's faster and more energy efficient, and the API overhead is being reduced faster than any new CPU ISA addition could gain traction.

I'd argue with the use of the term "software" here. GPU code is still software; the software runs on hardware, regardless of whether it is a CPU or a GPU. If you mean "software" in the sense of a generic implementation as opposed to one making use of specific hardware support, I could see that. But we do have that hardware support. A bit speculative perhaps, but what if AMD don't follow nvidia with tensor cores? That could leave a gap open on the CPU side to fill.

 

GPUs have come a long way for general compute since the first implementations, but they still have a long way to go to substantially replace the flexibility of a CPU. As such, they remain complementary where appropriate.

 

17 minutes ago, StDragon said:

Some have stated that switching between AVX-512 and other instruction sets incurs a major latency penalty. So it seems to me that if you're going to use AVX-512, your application is better off going all-in rather than nipping at it casually. But then again, if you're going to use something like that so heavily, why not just GPGPU the app to begin with?

As covered before, the GPU is not appropriate for all code, even if the code appears parallelisable. Data dependencies remain a sticking point, and while GPUs excel at more trivially parallel code, that doesn't work for everything. Something kinda in between a CPU and a GPU might be interesting. I think it was in a different thread that I suggested one possible avenue companies might take in future: simpler CPU cores, but a lot more of them. Core complexity would still be more CPU-like, but scaling would be more GPU-like.

 

 

Arguments seem to be going around in circles a bit here. It makes sense to use certain options where they fit, but that doesn't mean they'll be useful for everything.


50 minutes ago, trag1c said:

Didn't know that... not that I really ever look at patch notes for any of those libraries lol. I wonder when that came in.

At least 3-4 years ago, not sure exactly when though.


2 hours ago, trag1c said:

At least for games.

Games aren't why these instruction sets came to be, though.

 

15 minutes ago, porina said:

I think it was in a different thread that I suggested one possible avenue companies might take in future: simpler CPU cores, but a lot more of them. Core complexity would still be more CPU-like, but scaling would be more GPU-like.

That's Xeon Phi ;) 

 

 


8 minutes ago, SpaceGhostC2C said:

That's Xeon Phi ;) 

Kinda. If I were to design a CPU for my application, I would have come up with something like that. However, I think with today's balance the non-FPU parts would have to be better than what was included in Phi, plus it was still cache/bandwidth limited. It would need much more.


18 minutes ago, porina said:

Kinda. If I were to design a CPU for my application, I would have come up with something like that. However, I think with today's balance the non-FPU parts would have to be better than what was included in Phi, plus it was still cache/bandwidth limited. It would need much more.

I think Intel's more recent approach to that, probably related to the expansion of its GPU division, is to have the fusion happen in software rather than hardware:

https://software.intel.com/content/www/us/en/develop/tools/oneapi.html

A bit like AMD's HSA initiative, which didn't translate into much at the time.

 

And now I kind of feel we are re-having a conversation we had with you and @leadeater in another thread, I don't know which one - I'm becoming a broken record :P
