
Intel releases new Advanced Performance Extensions (APX) and AVX10 for their future CPUs

igormp

Summary

Intel has released a new extension set, called APX. In summary, it doubles the number of visible general-purpose registers from 16 to 32, adds direct jumps to full 64-bit addresses, and gives many instructions the option of a separate destination register, similar to how it's done in MIPS/RISC-V/ARM assembly (i.e. opcode dest_reg, reg1, reg2).
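To put the three-operand change in concrete terms, here's a minimal sketch; the two-operand code is real x86-64, while the APX form is written in illustrative syntax based on Intel's description of the new destination-register encoding:

```c
/* C source: */
long sum(long a, long b) { return a + b; }

/*
 * Legacy x86-64: ADD is two-operand and overwrites a source, so the
 * compiler first has to copy it (ignoring the LEA trick that works
 * for plain adds but not for most other instructions):
 *     mov rax, rdi
 *     add rax, rsi
 *
 * With APX's destination-register option, one instruction suffices
 * and both sources stay intact (illustrative syntax):
 *     add rax, rdi, rsi
 */
```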

 

Along with that, they also released a new AVX superset, called AVX10, which makes the 512-bit part optional while making the 256-bit part work across all of their P and E cores.

 

Quotes

Quote

Intel® APX doubles the number of general-purpose registers (GPRs) from 16 to 32. This allows the compiler to keep more values in registers; as a result, APX-compiled code contains 10% fewer loads and more than 20% fewer stores than the same code compiled for an Intel® 64 baseline. Register accesses are not only faster, but they also consume significantly less dynamic power than complex load and store operations.

 

Compiler enabling is straightforward – a new REX2 prefix provides uniform access to the new registers across the legacy integer instruction set. Intel® AVX instructions gain access via new bits defined in the existing EVEX prefix. In addition, legacy integer instructions now can also use EVEX to encode a dedicated destination register operand – turning them into three-operand instructions and reducing the need for extra register move instructions. While the new prefixes increase average instruction length, there are 10% fewer instructions in APX-compiled code, resulting in similar code density as before.

 

Intel® APX demonstrates the advantage of the variable-length instruction encodings of x86 – new features enhancing the entire instruction set can be defined with only incremental changes to the instruction-decode hardware. This flexibility has allowed Intel® architecture to adapt and flourish over four decades of rapid advances in computing – and it enables the innovations that will keep it thriving into the future.

 

My thoughts

I find this really nice, but wonder when we'll see a CPU that has support for those extensions. This will give compilers more flexibility when generating code, likely improving performance a bit. Along with the removal of some legacy boot stuff in their x86-S proposal, this does look like a nice modernization of the ISA.

 

It's not clear to me how their new AVX10 with both of its subsets differs from the existing AVX2 and AVX-512 predecessors. If a core has no support for AVX10/512, then code compiled with this extension in mind shouldn't work, from what I understood, making it no different from the current AVX-512 scenario.

 

Sources

https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html


FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


Very interesting. APX addresses the limitations in x86-64, and AVX10 seems to resolve the (lack of) AVX-512 problem since Alder Lake. Could this appear as soon as Meteor Lake I wonder? Or is this a further out technology?

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


6 minutes ago, porina said:

Very interesting. APX addresses the limitations in x86-64, and AVX10 seems to resolve the (lack of) AVX-512 problem since Alder Lake. Could this appear as soon as Meteor Lake I wonder? Or is this a further out technology?

I doubt we'll be seeing those that soon. I guess it may come on their new Royal architecture.

 

Keep in mind that AVX10 has 2 subsets, AVX10/256 (which is similar to AVX2) and AVX10/512 (which is similar to AVX-512); I'm not sure how those subsets will interoperate.

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


1 hour ago, igormp said:

I find this really nice, but wonder when we'll see a CPU that has support for those extensions. This will give compilers more flexibility when generating code, likely improving performance a bit.

This seems pretty nice (although L1 cache is pretty fast these days, so more registers won't be revolutionary), but it will take a long time between the first CPUs supporting it (probably not too far off) and any software starting to use it, because almost all binaries have to be compiled for the lowest common denominator. Windows 11 did set the bar pretty high with the TPM requirement, but I wonder how willing they will be to make a change like that again any time soon, especially for relatively minor improvements.

 

That's not to say this is a bad change - improvements are definitely a good thing - but for most applications it will take quite a long time before anyone sees the benefits.

HTTP/2 203


Quote

Intel has released a new extension set, called APX.

It's ironic that Intel's first attempt to kill x86 was an even more advanced CISC architecture called iAPX, back in the early 80s.

This new extension to the ISA (the first big one since AMD64) actually diminishes a lot of ARM's advantages, although the decoding overhead will still consume more power, I guess. The only question now is the market's rate of adoption; if that goes well, the x86 ISA will indeed be proven immortal.


1 minute ago, colonel_mortis said:

This seems pretty nice (although L1 cache is pretty fast these days, so more registers won't be revolutionary)

It gives the compiler more flexibility, so hot paths may not even need to touch the L1 at all (even if the L1 is fast, it's still an order of magnitude slower than the actual registers). There's also the added bonus of way fewer loads and stores.

 

Keep in mind that those are just the visible registers; the CPU has way more registers that it renames internally. This just gives the compiler the option to do some of that work on its own.

4 minutes ago, colonel_mortis said:

and any software starting to use it, because almost all binaries have to be compiled for the lowest common denominator.

I mean, it's just a simple flag away, and software can always make use of intrinsics to do dynamic checks for such features before using them, so the same binary can still work on older µarches while still being able to use the new extensions.
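A minimal sketch of that dynamic-check pattern, using GCC/Clang's CPUID-backed builtin (the kernel functions are made up for illustration):

```c
#include <stdio.h>

/* Hypothetical kernels: a real AVX2 one would use intrinsics or be
   compiled with -mavx2 in its own translation unit. */
static void kernel_avx2(void)     { puts("fast AVX2 path"); }
static void kernel_baseline(void) { puts("baseline path"); }

int main(void) {
    /* Checked once at runtime, so the same binary runs on older
       µarches and still takes the fast path where supported. */
    if (__builtin_cpu_supports("avx2"))
        kernel_avx2();
    else
        kernel_baseline();
    return 0;
}
```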

6 minutes ago, colonel_mortis said:

Windows 11

Yeah, on that side it may take a bit longer to catch up.

 

3 minutes ago, DuckDodgers said:

although the decoding overhead will still consume more power, I guess.

I'm not sure that's all that relevant considering all of the execution units and the area taken by those (and the caches) compared to the decoder itself.

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


Just now, igormp said:

I'm not sure that's all that relevant considering all of the execution units and the area taken by those (and the caches) compared to the decoder itself.

The x86 decoding and branch prediction stage in any modern implementation takes significant area and power budget, because an x86 instruction's length cannot be determined before full decode. This makes a superscalar front end even more expensive compared to any typical RISC architecture.


15 minutes ago, DuckDodgers said:

The x86 decoding and branch prediction stage in any modern implementation takes significant area and power budget, because an x86 instruction's length cannot be determined before full decode. This makes a superscalar front end even more expensive compared to any typical RISC architecture.

Branch prediction is indeed costly, but this also applies to other ISAs, doesn't it?

 

Looking at the current die annotations for Zen CPUs, the decoder still doesn't look that significant compared to other units, especially in Zen 4.

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


4 hours ago, igormp said:

 

 

It's not clear to me how their new AVX10 with both of its subsets differs from the existing AVX2 and AVX-512 predecessors. If a core has no support for AVX10/512, then code compiled with this extension in mind shouldn't work, from what I understood, making it no different from the current AVX-512 scenario.

 

The theory goes: C compilers will generate "as-is" code without using chip extensions, and then, when a chip extension is flagged as available, generate additional code paths.

 

In practice (as in literally taking programs and disassembling them) you will almost never run into chip-specific extensions when you disassemble. The developer has to use assembly directly, or intrinsics that contain these multiple code paths. But even switching to these code paths has overhead.

 

So you won't see extensions used in practice outside of compression algorithms in most software. The compiler doesn't go "hey, here's one loop that executes 8 times, we can parallelize it"; it goes "here's one loop we can unroll 8 times". Generally, "speed" optimizations actually make code much, much bigger, and when binaries are bigger they use more registers, cache, RAM, etc. So a "speed" compilation doesn't always make things faster if it blows past the smaller/cheaper/older chip configurations.

 

In theory, a program should be linked at runtime so that available chip-extension versions of library functions can be used. That way you can change the underlying CPU and never run into a situation where it doesn't work. But if you hand-code assembly, now it's stuck with that CPU. If Intel then decides to remove that functionality (e.g. AVX-512, MPX, TSX) or AMD decides to add functionality and then withdraw it (3DNow!, FMA4, TBM), now that binary can't ever work because the hand-coded assembly is the only code path.
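That runtime selection does exist for compiled code: GCC's function multi-versioning emits several clones plus an ifunc resolver that the dynamic linker uses to bind the best variant for the running CPU. A minimal sketch, with an illustrative function body:

```c
/* GCC 6+ on x86: one clone per listed target, selected at load time
   via CPUID, so no hand-written assembly locks the binary to a CPU. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
void scale(float *dst, const float *src, float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```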

 

Zlib, as an example, is used everywhere, but never the version of Zlib with the hand-coded assembly.

 

On both Windows and Linux, libraries are often compiled to work on all CPUs (e.g. 32-bit, 64-bit) and all extensions are avoided. If you want an optimized Linux experience, you need to re-compile EVERYTHING for the CPU in the machine, and that's often a huge time sink for what doesn't really amount to a general improvement in performance.


2 hours ago, igormp said:

Branch prediction is indeed costly, but this also applies to other ISAs, doesn't it?

 

Yes, but in theory it is made more complex by the lower number of addressable registers. Data that could live in a register on ARM most likely sits in a hardware register on x86 too, but from the instruction stream's perspective the compiler has had to emit a load, and store something else, since it has fewer registers to deal with.

This adds extra complexity to the branch predictor, since any value the compiler stores to memory might be being modified by another thread of the same app, so it's suddenly non-trivial to no-op these stores and loads (they do manage to no-op them).

 

It's much more complex than just having one instruction write to r21 and another read from r21; exposing the registers offloads all of that correctness checking to compile time, in the compiler, and away from runtime in the branch predictor.

Since these chips have had many, many more hardware registers for a long time, it makes sense to expose more of them to the compiler so as to simplify this stage a little.


51 minutes ago, Kisai said:

The theory goes: C compilers will generate "as-is" code without using chip extensions, and then, when a chip extension is flagged as available, generate additional code paths.

No, that's not how it works. If you enable the flag to use certain extensions (or even just -march=native), the compiler WILL make use of those extensions; that's easily seen when the 512-bit registers (the ZMM ones) are being used. Even basic stuff like strcmp and memcpy can make use of AVX extensions nowadays.
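For instance, a loop like the following needs no intrinsics at all; a minimal sketch, with the flags in the comment (the generated assembly is easy to check on godbolt.org):

```c
/* gcc -O3 -march=x86-64 keeps this in 128-bit SSE2 registers;
   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512
   vectorizes the very same loop with 512-bit ZMM registers. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```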

57 minutes ago, Kisai said:

So you won't see extensions used in practice outside of compression algorithms in most software. The compiler doesn't go "hey, here's one loop that executes 8 times, we can parallelize it"; it goes "here's one loop we can unroll 8 times". Generally, "speed" optimizations actually make code much, much bigger, and when binaries are bigger they use more registers, cache, RAM, etc. So a "speed" compilation doesn't always make things faster if it blows past the smaller/cheaper/older chip configurations.

I do daily.

58 minutes ago, Kisai said:

and all extensions are avoided

Many distros are starting to use the x86-64 microarchitecture levels, where x86-64-v3 sets AVX2 as the baseline.
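For reference, a quick summary of those levels as gcc/clang -march targets (feature lists abridged from the x86-64 psABI):

```c
/* x86-64    : baseline (SSE2)
   x86-64-v2 : + SSE3/SSSE3/SSE4.1/SSE4.2/POPCNT/CMPXCHG16B
   x86-64-v3 : + AVX/AVX2/BMI1/BMI2/FMA/F16C/LZCNT/MOVBE   <- AVX2 baseline
   x86-64-v4 : + AVX512F/BW/CD/DQ/VL

   A distro targeting v3 would build packages along the lines of:
       gcc -O2 -march=x86-64-v3 -o pkg pkg.c */
```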

1 hour ago, Kisai said:

you need to re-compile EVERYTHING for the CPU in the machine

You can also recompile only the software most relevant to you; that's way easier to do and still brings most of the benefits.

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


35 minutes ago, igormp said:

Many distros are starting to use the x86-64 microarchitecture levels, where x86-64-v3 sets AVX2 as the baseline.

Which ones? I know Arch discussed it in an RFC, but I don't think it was implemented. Can't wait to update my laptop... I heard there was a ~15% performance increase in Firefox for v3 without recompiling the dependencies, so I have high hopes!

1 hour ago, Kisai said:

So you won't see extensions used in practice outside of compression algorithms in most software. The compiler doesn't go "hey, here's one loop that executes 8 times, we can parallelize it"; it goes "here's one loop we can unroll 8 times". Generally, "speed" optimizations actually make code much, much bigger, and when binaries are bigger they use more registers, cache, RAM, etc. So a "speed" compilation doesn't always make things faster if it blows past the smaller/cheaper/older chip configurations.

Ehh, there's a bunch of high-profile software that checks the CPU's capabilities at runtime (glibc, some video decoding libraries, and the Linux kernel, to name a few).

 

I have also noticed a huge throughput increase in my C++ game (clang) from switching to -O3 and -march=native. As always, it depends on the software. Even modern gcc/clang can produce wild oddities on certain pieces of code that you can't spot without profiling (https://www.youtube.com/watch?v=bSkpMdDe4g4).

 

4 hours ago, DuckDodgers said:

The x86 decoding and branch prediction stage in any modern implementation takes significant area and power budget, because an x86 instruction's length cannot be determined before full decode. This makes a superscalar front end even more expensive compared to any typical RISC architecture.

Variable-length x86 decoding has very little cost on a modern CPU. Here's an in-depth look if you're curious.

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-matter/


8 hours ago, igormp said:

but wonder when we'll see a CPU that has support for those extensions.

I'd like it to be Redwood Cove & Crestmont, which is the earliest possibility, but it may have to come after either or both of those.

 

8 hours ago, igormp said:

It's not clear to me how their new AVX10 with both of its subsets differs from the existing AVX2 and AVX-512 predecessors. If a core has no support for AVX10/512, then code compiled with this extension in mind shouldn't work, from what I understood, making it no different from the current AVX-512 scenario.

It's to allow the E cores to have the instruction-set efficiencies and extra capabilities without having to support wider data paths and execution. There's some really nice stuff in AVX-512 that doesn't actually need to be 512-bit to be beneficial, but currently all the E cores, and thus hybrid CPUs, are limited to the AVX2 feature set.
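One concrete example of that: opmask registers, an AVX-512 feature that AVX512VL already exposes at AVX2's 256-bit width, and the kind of thing AVX10/256 would bring to E cores. A minimal sketch:

```c
#include <immintrin.h>

/* Masked add on a 256-bit vector; build with gcc -O2 -mavx512f -mavx512vl.
   With plain AVX2 you'd need an explicit compare + blend instead. */
__m256i add_where(__m256i a, __m256i b, __mmask8 keep) {
    /* Lanes where keep == 0 pass `a` through unchanged. */
    return _mm256_mask_add_epi32(a, keep, a, b);
}
```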


6 hours ago, DuckDodgers said:

The x86 decoding and branch prediction stage in any modern implementation takes significant area and power budget, because an x86 instruction's length cannot be determined before full decode. This makes a superscalar front end even more expensive compared to any typical RISC architecture.

The total die percentage, as well as the core-area percentage, for these isn't that out of whack, and as for power, that's insignificant compared to the execution units, for example. ARM isn't living in the '80s/'90s, and neither is x86 for that matter; all of this ARM vs x86 stuff dates back to then and is hardly relevant today. Performance and efficiency aren't won and lost at the ISA level of CPUs/SoCs.

 

Quote

That is important because the Arm-based Ampere Altra Max is quite a bit behind the AMD EPYC 9754. AMD’s performance on SPEC CPU2017 is roughly 3x the Altra Max at 128 cores. When we get to power, keep this in mind because AMD is nowhere near 3x the power consumption. To be fair, Ampere has its AmpereOne shipping to hyper-scalers, so an Altra Max 128 core to a Bergamo 128 core is roughly a 2-year gap in chip releases.

 

Quote

Power consumption is perhaps the most shocking. We often hear that Arm servers will always be better on power consumption than x86, but in the cloud native space, that is only part of the story. With our AMD EPYC 9754, we had SPEC CPU2017 figures that were roughly 3x its only 128-core competitor, the Ampere Altra Max M128-30. Power consumption was nowhere near 3x. In our recent HPE ProLiant RL300 Gen11 Review, we were seeing a server maximum of around 350-400W. In our 2U Supermicro ARS-210ME-FNR 2U Edge Ampere Altra Max Arm Server Review we saw idle at 132W and 365W-400W. We tested the Bergamo part in several single-socket 2U Supermicro servers that we have including the Supermicro CloudDC AS-2015CS-TNR and we saw idle in the 117-125W range and a maximum of 550-600W.

 

Quote

The impact of this is that AMD is now offering 3x the SPEC CPU2017 performance at similar idle but only around 50% higher power consumption. We fully expect Ampere AmpereOne will rebalance this, but for those who have counted x86 out in the cloud native space, it is not that simple.

https://www.servethehome.com/amd-epyc-bergamo-epyc-9754-cloud-native-sp5/2/ (I highly recommend watching their video and LevelOne on Zen4c/Bergamo)

 

AmpereOne has to nearly triple its performance, or something like that, while maintaining or only slightly increasing power to match Zen 4c. Problem is, its power increase is not going to be only slight.

 


https://amperecomputing.com/products/processors

 

All these ARM, RISC, and x86 comparisons, advantages, and disadvantages just don't matter at the end of the day when looking at the actual products available on the market.


7 hours ago, igormp said:

I doubt we'll be seeing those that soon. I guess it may come on their new Royal architecture.

I missed the rumours on that so I only just looked it up. While it sounds fancy, it claimed 2x the IPC of Alder Lake P cores, but it is also 5 generations on from that. The per-generation IPC increases over that time mean it should be in about that ballpark anyway.

 

From various suspect rumour posts, apparently Jim Keller was involved in Royal Core, which included feeding into Lion Cove in Arrow Lake, which comes after Meteor Lake. Maybe that would be a good time?

 

7 hours ago, igormp said:

Keep in mind that AVX10 has 2 subsets, AVX10/256 (which is similar to AVX2) and AVX10/512 (which is similar to AVX-512); I'm not sure how those subsets will interoperate.

I'm skimming the documents and some of it is beyond my comprehension. What I can gather is that AVX10 seems to be a superset of AVX-512. With more universal hardware support, hopefully more software can adopt it faster, like AVX2. Intel shot themselves in the foot on AVX-512 with Alder Lake, so we've only had one mainstream desktop generation officially supporting it.

 

Now I'm thinking of the MS Windows number scheme. For consumers: 3, 95, 98, 2000?, XP, Vista, 7, 8, 10, 11. On Intel side: AVX, AVX2, AVX-512, AVX10 😄 

 

1 hour ago, leadeater said:

I'd like it to be Redwood Cove & Crestmont, which is the earliest possibility, but it may have to come after either or both of those.

Those two are going into Meteor Lake. Redwood sounds like it is fundamentally the same architecture going back to Golden Cove. While I can't rule out dark silicon, it seems less likely to be added there. Crestmont is new, so there is some possibility of adding it there.

 

At some point Intel needs to get hardware out for those working on the software to get on with it. Sooner is better. Maybe there's something out there already under NDA although leaks tend to happen sooner or later.

 

1 hour ago, leadeater said:

There's some really nice stuff in AVX-512 that doesn't actually need to be 512-bit to be beneficial, but currently all the E cores, and thus hybrid CPUs, are limited to the AVX2 feature set.

Very much this. In the early days of AVX-512, Skylake-X came out with a 2-unit implementation. It had 2x the FP64 execution units relative to AVX2. In Prime95-like workloads (FP64 heavy) I saw ~80% IPC uplift compared to AVX2. Great. Then I got Rocket Lake with a 1-unit implementation. It has the same FP64 execution width as AVX2. I still saw ~40% IPC uplift compared to AVX2. Zen 4's AVX-512 implementation also doesn't have additional FP execution units compared to earlier generations, but it too shows a significant IPC uplift vs AVX2. Since I don't have one to test myself, using indirect data from elsewhere it seems to fall between Intel's 1-unit and 2-unit implementations in terms of IPC. While more execution helps to an extent, there are still challenges in getting the data where it needs to be, and that's where AVX-512 can help even without those extra execution units.

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


50 minutes ago, porina said:

Very much this. In the early days of AVX-512, Skylake-X came out with a 2-unit implementation. It had 2x the FP64 execution units relative to AVX2. In Prime95-like workloads (FP64 heavy) I saw ~80% IPC uplift compared to AVX2. Great. Then I got Rocket Lake with a 1-unit implementation. It has the same FP64 execution width as AVX2. I still saw ~40% IPC uplift compared to AVX2. Zen 4's AVX-512 implementation also doesn't have additional FP execution units compared to earlier generations, but it too shows a significant IPC uplift vs AVX2. Since I don't have one to test myself, using indirect data from elsewhere it seems to fall between Intel's 1-unit and 2-unit implementations in terms of IPC. While more execution helps to an extent, there are still challenges in getting the data where it needs to be, and that's where AVX-512 can help even without those extra execution units.

AMD's implementation of AVX-512 in Zen 4 is probably the most balanced and efficient approach to date. Zen 4 can sustain 1x 512-bit FADD and 1x 512-bit FMA every cycle at minimal transistor and power cost, while to achieve this Intel had to slap on a second 512-bit FMA unit, with all the resulting overhead, on their server parts. Intel turned around with Golden Cove, but they fused off AVX-512 in ADL, so ordinary customers were denied the performance benefits anyway. There are very few practical use cases where 2x 512-bit FMA is worth the extra cost, and FADD circuits are cheaper to implement. By the way, mask operations in AVX-512 are still much faster on Zen 4 than on any of Intel's designs. Kind of a shame, for one of the premiere features of the instruction set. On the other hand, Intel still beats AMD in load/store bandwidth.
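Back-of-envelope peak FP64 rates implied by those unit counts, assuming full pipelining (a sketch of the arithmetic, not measured numbers):

```c
#include <stdio.h>

int main(void) {
    const int lanes = 512 / 64;           /* FP64 lanes per 512-bit op: 8    */
    int zen4    = lanes * 2 + lanes * 1;  /* 1x FMA + 1x FADD = 24 FLOPs/clk */
    int intel_1 = 1 * lanes * 2;          /* one 512-bit FMA  = 16 FLOPs/clk */
    int intel_2 = 2 * lanes * 2;          /* two 512-bit FMAs = 32 FLOPs/clk */
    printf("peak FP64 per cycle: Zen 4 = %d, Intel 1-unit = %d, 2-unit = %d\n",
           zen4, intel_1, intel_2);
    return 0;
}
```

Which also lines up with porina's observation above that Zen 4 lands between Intel's 1- and 2-unit implementations.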


1 hour ago, porina said:

Those two are going into Meteor Lake. Redwood sounds like it is fundamentally the same architecture going back to Golden Cove. While I can't rule out dark silicon, it seems less likely to be added there. Crestmont is new, so there is some possibility of adding it there.

 

At some point Intel needs to get hardware out for those working on the software to get on with it. Sooner is better. Maybe there's something out there already under NDA although leaks tend to happen sooner or later.

I only list it as a maybe since this in a way falls under:

 

Quote

OneRaichu explains that Redwood Cove will not widen the architecture as much as Sunny Cove cores and will mainly focus on instruction execution efficiency. It can be regarded as 0x1.5 since there are talks of improvements to branch prediction, micro-operation fusion, instruction dispatch, register remake, and EU execution efficiency. 

 

But I'm not that hopeful. My only thought is that it wouldn't be all that useful to announce something like this so far out, and there are new architectures incoming, so maybe it's a hint towards those. It's more likely the next one, but maybe...

 

We could also get APX but not AVX10.


38 minutes ago, DuckDodgers said:

AMD's implementation of AVX-512 in Zen 4 is probably the most balanced and efficient approach to date. Zen 4 can sustain 1x 512-bit FADD and 1x 512-bit FMA every cycle at minimal transistor and power cost, while to achieve this Intel had to slap on a second 512-bit FMA unit, with all the resulting overhead, on their server parts.

Intel has both 1- and 2-unit implementations. The 1-unit implementation is, in execution terms, comparable to Zen 4's. Ryzen's FP execution has the extra ADD unit relative to Intel, although I'm not sure how much value that really gives. For example, Zen 2 was only about 4% faster than Skylake in Prime95 (AVX) IPC. Again, Zen 4's implementation seems to give an AVX-512 IPC between that of Intel's 1- and 2-unit implementations, but it also came out 5 years after Intel's. I'm not aware of Intel making any notable updates to the execution performance of AVX-512 over that time.

 

For Rocket Lake specifically, which is 1 unit, in my testing I found the perf/W unchanged, so there is no power penalty in using AVX-512. Sure, you use more power with AVX-512, but you get proportionately more performance out of it too. The power cost doesn't scale worse, like an old-school overclock would.

 

The 2-unit implementations do give more peak throughput vs the 1-unit ones.

 

I don't have data on Zen 4's AVX-512 power efficiency. With earlier Ryzen we still saw a power-per-clock increase when running AVX2, resulting in lower clocks on a fixed power limit. If anyone with a Zen 4 could try it, I would be interested to see if clocks are much different running Prime95 small FFT with AVX2 vs AVX-512, assuming the same fixed power limit for both.

 

38 minutes ago, DuckDodgers said:

There are very few practical use cases where 2x 512-bit FMA is worth the extra cost, and FADD circuits are cheaper to implement.

I'm not sure of the value of just adding an extra ADD unit, whereas an FMA unit can do everything. Provided you can feed it adequately, it scales better for intensive workloads. I'm OK with 2 units going into HEDT, WS and server parts, just not the consumer tier.

 

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


25 minutes ago, porina said:

I'm not sure of the value of just adding an extra ADD unit, whereas an FMA unit can do everything. Provided you can feed it adequately, it scales better for intensive workloads. I'm OK with 2 units going into HEDT, WS and server parts, just not the consumer tier.

FADDs are more common in compiled code, and the cost of implementing one in hardware is lower than that of another FMA unit that would just use the accumulator for additions. Way back with Skylake, Intel doubled the ADD rate for the SSE/AVX ops to match the rate of MULs at marginal cost.


1 hour ago, DuckDodgers said:

Way back with Skylake, Intel doubled the ADD rate for the SSE/AVX ops to match the rate of MULs at marginal cost.

Got a pointer to where I can read up on that? It could contribute to an observation I had. In Prime95-like workloads, Skylake was +14% IPC relative to Haswell and I didn't have a good explanation for it. I saw FP latency reported to go down from 5 to 4 cycles, but that shouldn't have much impact on more predictable pipelined code.

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


9 hours ago, kvuj said:

Which ones? I know Arch discussed it in an RFC, but I don't think it was implemented. Can't wait to update my laptop... I heard there was a ~15% performance increase in Firefox for v3 without recompiling the dependencies, so I have high hopes!

I had seen Arch and Fedora having those discussions, but it seems there's no official repo for those yet (Arch has unofficial ones in case you want to try). My bad on that assumption.

 

7 hours ago, leadeater said:

It's to allow the E cores to have the instruction-set efficiencies and extra capabilities without having to support wider data paths and execution. There's some really nice stuff in AVX-512 that doesn't actually need to be 512-bit to be beneficial, but currently all the E cores, and thus hybrid CPUs, are limited to the AVX2 feature set.

But what I haven't properly wrapped my head around is that there are 2 subsets of AVX10: the 256-bit and the 512-bit one. The 256-bit one is going to be supported on all cores after its debut, but are the E cores able to run the 512-bit subset somehow? That wasn't clear from Intel's docs (at least to me).

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


18 minutes ago, igormp said:

But what I haven't properly wrapped my head around is that there are 2 subsets of AVX10: the 256-bit and the 512-bit one. The 256-bit one is going to be supported on all cores after its debut, but are the E cores able to run the 512-bit subset somehow? That wasn't clear from Intel's docs (at least to me).

It needs clarification for sure. My reading of it for now is that on client CPUs, both P and E cores will only support 256-bit. No mismatch, no problem. I don't believe this will be worse than 1-unit AVX-512. The FP execution potential is unchanged, but it keeps the other benefits, which may get better utilisation out of it vs AVX2. The additional registers should make it even better than AVX-512. There may be a little overhead from issuing twice as many 256-bit instructions vs 512-bit ones.

 

I suspect 512-bit will only be present on high-end server and workstation offerings, and those will compare more with the 2-unit AVX-512 implementations given the extra execution resources. Right now those are P-core-only offerings, and it could remain that way. For high-core-count uses they could go all E-core. Does hybrid make any sense in the server space? Mixing might not be a problem there.

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


4 hours ago, porina said:

Got a pointer to where I can read up on that? It could contribute to an observation I had. In Prime95-like workloads, Skylake was +14% IPC relative to Haswell and I didn't have a good explanation for it. I saw FP latency reported to go down from 5 to 4 cycles, but that shouldn't have much impact on more predictable pipelined code.

Here's an extensive listing (raw dumps) of instruction throughputs and latencies for a wide variety of x86 CPUs from the last 30 years: http://instlatx64.atw.hu/

 

This is a more structured database on the same topic, with extra info on issue ports and micro-ops: https://uops.info/table.html


1 hour ago, DuckDodgers said:

Here's an extensive listing (raw dumps) of instruction throughputs and latencies for a wide variety of x86 CPUs from the last 30 years:

Thanks, that's a bit too low-level for me. I have no idea which specific instructions are used, so it feels a bit needle-in-a-haystack. Now I wonder where I saw the latency reduction I mentioned earlier. Probably somewhere in Agner Fog's microarchitecture guide, which I haven't checked up on in a long time.

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible

