
Intel: Chips To Become Slower But More Energy Efficient

HKZeroFive
31 minutes ago, Paranoid Kami said:

Decided to check it out and you weren't kidding when you said it was easy. Is this just something not taught in schools or do companies not think it's worth the time?

 

https://software.intel.com/videos/1-minute-intro-intel-tbb-parallel-for

Oh, it's taught in college HPC classes and if you go to work at a supercomputing site (OpenMP and Cilk Plus anyway, with TBB being Intel-specific until GCC and Clang ship implementations), but in standard industry? I consulted at Epic Games as a performance-tuning specialist. They looked at me like I had two heads when I told them to stop using native threads because they would never design well with them. I was fired, and yet we still see these horribly optimized games.
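
For anyone wondering how low the bar actually is, here's a minimal OpenMP sketch (the scaling function is just a made-up placeholder; compile with -fopenmp on GCC/Clang or /openmp on MSVC):

#include <omp.h>
#include <vector>
#include <cstddef>

// One pragma is all it takes: OpenMP creates the thread team, splits the
// iteration range across cores, and joins at the end of the loop.
void scale(std::vector<double>& data, double factor) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(data.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

Compare that to spinning up std::thread or native Win32 threads by hand, partitioning the range yourself, and joining them all at the end; that's the gap I'm talking about.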

 

Part of the issue is that Visual Studio doesn't support Cilk Plus at all, nor OpenMP beyond version 2.0, and currently it seems there's no plan to rectify this. Even though they're now using Clang as a back end to replace their absolutely atrocious optimization engine, the front-end parser is still part of the garbage Visual C/C++ compiler, so it's behind. And since so much consumer software is built for Windows in Visual Studio, well, you get the idea. That's why I use NetBeans as my development environment when I'm on Windows.

 

Microsoft is a problem, the old-guard developers who don't like change are a problem, and naive programmers with no HPC experience who have no clue how to write good C/C++ code that can be deeply optimized are a problem. Which demon do you want to slay first?

 

I harp on this crap for a reason. It's absolute BS that software is so stagnant when we have incredibly powerful coding tools built to unlock the potential of our processors. This is one reason I also defend Intel so much. They have done more good for the programming world than AMD and Nvidia combined. Intel pushed OpenMP even before we had multi-core CPUs, all the way back in 2000 (when multi-socket server boards were in their infancy). And yet here we stand, 15 years later, still stuck where we are.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


I bought the 5820K after hearing about all kinds of problems with the new Skylake and such. I just had a hunch the new "6950X" and its series of processors aren't going to be that amazing. No matter how you rationalize this, if the company itself isn't going after performance as a priority, but instead going for things like efficiency, it will benefit greatly in the mobile market but not so much on the desktop. Unless everything else in my PC suddenly decides it doesn't need power anymore, I'm still stuck with a brick of a PSU... And for the people saying that current CPUs are "fast enough" for general usage, what about multitasking? Heavier workloads? Streaming? CPUs aren't nearly as fast as they should be performance-wise. Better software utilization can come, but don't slow down my hardware first, please. Especially when some people are paying a grand for it. So that might raise the question: yes, they are more efficient, but will they be cheaper?


1 minute ago, Alfa404 said:

I bought the 5820K after hearing about all kinds of problems with the new Skylake and such. I just had a hunch the new "6950X" and its series of processors aren't going to be that amazing. No matter how you rationalize this, if the company itself isn't going after performance as a priority, but instead going for things like efficiency, it will benefit greatly in the mobile market but not so much on the desktop. Unless everything else in my PC suddenly decides it doesn't need power anymore, I'm still stuck with a brick of a PSU... And for the people saying that current CPUs are "fast enough" for general usage, what about multitasking? Heavier workloads? Streaming? CPUs aren't nearly as fast as they should be performance-wise. Better software utilization can come, but don't slow down my hardware first, please. Especially when some people are paying a grand for it. So that might raise the question: yes, they are more efficient, but will they be cheaper?

Unless you're multitasking with moderate to heavy tasks, that's what Hyper-Threading on their i3s is for.

 

For streaming, yes, you need that horsepower, but that's also an enthusiast-level task. And frankly, I'd argue that's only because the games themselves are dogs on the CPU side due to bad coding.

 

The efficiency is also an effort to keep ARM out of the server space. Avoton and Xeon D cut ARM off at the pass, but that doesn't keep them from advancing altogether. I suspect we'll be seeing 16+ core Avoton chips before too long to handle those workloads Google is allegedly looking to Qualcomm for.

 

Cheaper? I doubt it. The silicon economy is pretty damn stable.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


This stance has been obvious for the last 5 years. IPC gains have been MINIMAL, while gains in energy efficiency have been more apparent.

Sim Rig:  Valve Index - Acer XV273KP - 5950X - RTX 2080 Ti - B550 Master - 32 GB DDR4 @ 3800C14 - DG-85 - HX1200 - 360mm AIO

Quote

Long Live VR. Pancake gaming is dead.


2 hours ago, patrickjp93 said:

*snip*

While we're on the topic of software efficiency, you don't even have to go to that level to see atrocious practices.  I've got stories like you wouldn't believe.

Here's one: my friend has this large spreadsheet at work that takes so long to calculate (~5 hours) that IT gave him an additional computer to run it on.  I asked what it was doing, and the short version is that it does about 30,000 summations, each of about 30,000 numbers.  It's doing this using VB macros in Excel, so there's your problem.  I estimated I could whip up a C++ program to do the same thing in seconds, but they're fine with it like this  ¯\_(ツ)_/¯
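
For scale, here's a rough sketch of what that C++ replacement might look like (hypothetical random data standing in for his actual spreadsheet, obviously):

#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Roughly the same workload: 30,000 summations of 30,000 numbers each,
// with random values standing in for the real spreadsheet data.
int main() {
    const std::size_t rows = 30000, cols = 30000;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    std::vector<double> sums(rows, 0.0);
    std::vector<double> row(cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (double& v : row) v = dist(rng);  // stand-in for loading the real data
        double total = 0.0;
        for (double v : row) total += v;
        sums[r] = total;
    }
    std::cout << "First sum: " << sums[0] << "\n";
}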

The others include things from my dad's work experience with databases and servers, where he's seen scripts run over the weekend that he knows could be done in an hour or two if they'd been written properly, and other poor choices of language, like using Python or Java for some compute-intensive task where something like C++ would be a much better option.  Yeah, it's rampant, and with the prevailing state of mind being "it's good enough" or that modern computers are so fast they can afford to write crap, I don't see it changing any time soon, unfortunately.

 

What percentage of the hardware performance gains that we've been getting each year do you suppose is lost to the seemingly ever-worsening general state of software efficiency?

Solve your own audio issues  |  First Steps with RPi 3  |  Humidity & Condensation  |  Sleep & Hibernation  |  Overclocking RAM  |  Making Backups  |  Displays  |  4K / 8K / 16K / etc.  |  Do I need 80+ Platinum?

If you can read this you're using the wrong theme.  You can change it at the bottom.


6 hours ago, patrickjp93 said:

He said nothing of IPC, and IPC is fully decoupled from clock rate. Performance may slip just a hair, but IPC will continue going up. All this is doing is slowing the clock back down.

He did not say IPC, but neither did he say anything about frequency. All we know is that Intel is willing to make a new generation of chips which performs worse than the previous one. How it will be slower (IPC or lower clocks) is still unknown.

 

6 hours ago, patrickjp93 said:

Just AVX 256. x264 is also very bandwidth-bound. You're spending more time streaming data from RAM into cache than you are encoding or decoding it.

By "AVX 256" do you mean AVX2? If that's the case then no, it won't be anywhere near a 100% performance increase in any regular application. It might get close to 100% performance increase in some extremely specialized application, but those are edge cases and not the norm. You can't double performance across the board with a handful of new instructions.

From what I know, x264 is not at all bandwidth-bound. Got any source on that? Here is one of my many sources which disagrees. Going from 1333 MHz to 2133 MHz memory only increased performance by ~2%. Here is another one from Corsair, and this is a quote from it: "This benchmark is almost entirely CPU limited". You do not spend more time streaming data than you do encoding it...

 

6 hours ago, patrickjp93 said:

No, most mainstream consumers have zero need for more than 2 cores, let alone 4. For enthusiasts, more makes sense. High-end gaming (which is still so badly coded I'm not yet convinced more cores are necessary) is an enthusiast pursuit. It's not like Intel's even gouging you for 2 extra cores. It's a different platform, but the cost difference is overall microscopic.

If you are talking about the i7-5820K then yes, it has a pretty reasonable price compared to the mainstream (and by mainstream I mean Intel's low-PCIe-count platform, like LGA 1151). The problem is that the 5820K is two generations behind. I think we should get a 6-core on the mainstream platform, because the enthusiast one is behind in terms of architecture.

I have high hopes for AMD's Zen architecture because it seems like they are willing to go beyond 4 cores on mainstream products. There are plenty of tasks which already use more than 4 cores very well, and those are the tasks which actually need the CPU power we already have. For games and such we don't need better CPUs. That's almost entirely GPU-bottlenecked (and even more so with DX12/Vulkan).

 

And yes, the average Joe won't need 6 cores, but I would argue that the i7-6700K is not something the average Joe buys. It is enthusiast grade.

If there is a risk that Intel will release a generation of processors that performs worse than the previous one, then I think that is a good opportunity for them to move to 6 cores. That way each core might be worse, but the new generation will still be better overall.


6 hours ago, Ryan_Vickers said:

While we're on the topic of software efficiency, you don't even have to go to that level to see atrocious practices.  I've got stories like you wouldn't believe.

Here's one: my friend has this large spreadsheet at work that takes so long to calculate (~5 hours) that IT gave him an additional computer to run it on.  I asked what it was doing, and the short version is that it does about 30,000 summations, each of about 30,000 numbers.  It's doing this using VB macros in Excel, so there's your problem.  I estimated I could whip up a C++ program to do the same thing in seconds, but they're fine with it like this  ¯\_(ツ)_/¯

The others include things from my dad's work experience with databases and servers, where he's seen scripts run over the weekend that he knows could be done in an hour or two if they'd been written properly, and other poor choices of language, like using Python or Java for some compute-intensive task where something like C++ would be a much better option.  Yeah, it's rampant, and with the prevailing state of mind being "it's good enough" or that modern computers are so fast they can afford to write crap, I don't see it changing any time soon, unfortunately.

 

What percentage of the hardware performance gains that we've been getting each year do you suppose is lost to the seemingly ever-worsening general state of software efficiency?

Nearly all the improvement since Sandy Bridge has been very much hidden by all of these factors.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


4 hours ago, LAwLz said:

He did not say IPC, but neither did he say anything about frequency. All we know is that Intel is willing to make a new generation of chips which performs worse than the previous one. How it will be slower (IPC or lower clocks) is still unknown.

 

By "AVX 256" do you mean AVX2? If that's the case then no, it won't be anywhere near a 100% performance increase in any regular application. It might get close to 100% performance increase in some extremely specialized application, but those are edge cases and not the norm. You can't double performance across the board with a handful of new instructions.

From what I know, x264 is not at all bandwidth-bound. Got any source on that? Here is one of my many sources which disagrees. Going from 1333 MHz to 2133 MHz memory only increased performance by ~2%. Here is another one from Corsair, and this is a quote from it: "This benchmark is almost entirely CPU limited". You do not spend more time streaming data than you do encoding it...

 

If you are talking about the i7-5820K then yes, it has a pretty reasonable price compared to the mainstream (and by mainstream I mean Intel's low-PCIe-count platform, like LGA 1151). The problem is that the 5820K is two generations behind. I think we should get a 6-core on the mainstream platform, because the enthusiast one is behind in terms of architecture.

I have high hopes for AMD's Zen architecture because it seems like they are willing to go beyond 4 cores on mainstream products. There are plenty of tasks which already use more than 4 cores very well, and those are the tasks which actually need the CPU power we already have. For games and such we don't need better CPUs. That's almost entirely GPU-bottlenecked (and even more so with DX12/Vulkan).

 

And yes, the average Joe won't need 6 cores, but I would argue that the i7-6700K is not something the average Joe buys. It is enthusiast grade.

If there is a risk that Intel will release a generation of processors that performs worse than the previous one, then I think that is a good opportunity for them to move to 6 cores. That way each core might be worse, but the new generation will still be better overall.

Then someone didn't write the encode/decode algorithm with enough ILP and DLP in mind. Anything involving graphics is embarrassingly parallel, video included. Heck, it could even be that someone forgot to flip the optimization flag on in Visual Studio. I've personally seen AVX 256 perform at nearly 2x the performance of AVX 128 on my lab machines. Solving ordinary differential equations is also embarrassingly parallel, and that has no I/O requirements at all.

 

The enthusiast platform isn't far behind on instructions, and the ones it lacks are more important for enterprise users anyway. The next big performance change will come with AVX 512 which may come to Skylake-E.

 

They could react that way, certainly, but they could also push for heterogeneous acceleration (as easy as using #pragma offload target(gfx) variable list { same C++ code here as before, with preprocessor tags around parallelizable loops }) and beef up their iGPUs more.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


On 2/6/2016 at 9:34 AM, Coaxialgamer said:

they could just make bigger dies while staying on 14 nm a bit longer, instead of making smaller and smaller dies by going to 10 nm or less.

 

Sure, cooling would become an issue sooner or later, and they would have to move on eventually.

This.

I was never a fan of the move towards laptop-sized dies for anything but Celeron desktop processors.  It never made much sense to me, as I always thought like you did: just build more on the same space.  Hell, if nothing else, just asstons of cache.


1 hour ago, patrickjp93 said:

Then someone didn't write the encode/decode algorithm with enough ILP and DLP in mind. Anything involving graphics is embarrassingly parallel, video included. Heck, it could even be that someone forgot to flip the optimization flag on in Visual Studio. I've personally seen AVX 256 perform at nearly 2x the performance of AVX 128 on my lab machines. Solving ordinary differential equations is also embarrassingly parallel, and that has no I/O requirements at all.

x264 is one of the most well-written applications you can find. It even runs circles around Intel's own encoder. It's just that video encoding is extremely (and I mean extremely) CPU-dependent, and not really that dependent on other things. It is also very parallel, which is why x264 scales very well up to something like 50 cores. So no, it's not that it is poorly optimized.

 

You might have seen a 2x increase, but that was probably the exception rather than the norm, especially for more consumer-oriented applications.

 

1 hour ago, patrickjp93 said:

The enthusiast platform isn't far behind on instructions, and the ones it lacks are more important for enterprise users anyway. The next big performance change will come with AVX 512 which may come to Skylake-E.

It's not that far behind in terms of instructions, but it is behind in other areas such as process node and architecture as well. It adds up to quite a significant difference.

 

1 hour ago, patrickjp93 said:

They could react that way, certainly, but they could also push for heterogeneous acceleration (as easy as using #pragma offload target(gfx) variable list { same C++ code here as before, with preprocessor tags around parallelizable loops }) and beef up their iGPUs more.

That is one way of doing it, but the problem is that GPU acceleration does not work well for certain workloads. For example, take video encoding again. I don't know all the details, but GPU-accelerated video encoding always ends up much worse in terms of quality compared to pure software encoding. You also run into issues when you try to do things such as CABAC, since it is extremely difficult to parallelize (maybe GPU-accelerated implementations have to take shortcuts in this area?).

Moving to GPGPU also means we will have to update a ton of applications. That will take a very long time. I'd rather just get a better CPU which will be useful from day one than get a CPU which might be better if developers update their programs to take advantage of new features, which can take several years.

 

Maybe I should just suck it up and get Skylake-E or whatever it will be called. A 6-core chip on the mainstream platform is just something I am hoping for, but I doubt it will happen. Or maybe AMD will come to my rescue with Zen... lol jk, I fully expect it to be as big of a disappointment as Bulldozer was.


The others include things from my dad's work experience with databases and servers, where he's seen scripts run over the weekend that he knows could be done in an hour or two if they'd been written properly, and other poor choices of language, like using Python or Java for some compute-intensive task where something like C++ would be a much better option.  Yeah, it's rampant, and with the prevailing state of mind being "it's good enough" or that modern computers are so fast they can afford to write crap, I don't see it changing any time soon, unfortunately.

 

What percentage of the hardware performance gains that we've been getting each year do you suppose is lost to the seemingly ever-worsening general state of software efficiency?

I'm gonna defend the Excel spreadsheet...

Database programming is rarely done or taught in C/C++ anymore as far as I can tell. (I'm slightly biased: I think pointers are abominable due to lacking built-in overrun protection. The very concept of a thing in user land that is effectively a variable, not managed by any external library to keep unwanted data from uncleared memory out or to stop it leaving its allocated space if the programmer forgets to tell it not to, is stupid to me. If you feel like a TL;DR: I consider non-system-level code that doesn't have an execution manager enforcing variable/pointer integrity irresponsible in many scenarios. /rant over)

Regardless of my thoughts on what user-land code should or shouldn't be, code at the HPC level of computing should be optimized and multithreaded, but that's clearly not happening...

Everything you need to know about AMD CPUs in one simple post.  Christian Member 

Wii U, PS3 (2 USB, fat), PS4

iPhone 6 64 GB and Surface RT

HP DL380 G5 with one E5345 and a bunch of hot-swappable HDDs in RAID 5 from when I got it. I intend to run Xen Server on it.

Apple Power Macintosh G5 2.0 DP (PCI-X) with a notebook HDD I had lying around and 4 GB of RAM

Toshiba Satellite P850 with a Core i7-3610QM, 8 GB of RAM, and the default 750 GB HDD; dual screens via an external display as main and the laptop display as secondary, running Windows 10

MacBookPro11,3: i7-4870HQ, 512 GB SSD, 16 GB of memory


Time for AMD to push the accelerator

Error: 451                             

I'm not copying helping, really :P


4 minutes ago, azurezeed said:

Time for AMD to push the accelerator

It would seem like it, but since this interruption to Moore's Law is caused by hitting the physical limits of the current methods/technology, it's not like they can just magically push through. To swoop in and pass Intel, they would need to have one of these other technologies (besides silicon) ready to pull out of their back pocket, like, now, and since I don't think that's the case, and they certainly don't have the same kind of R&D funding as Intel, I don't expect we'll be seeing any miracles soon.

Solve your own audio issues  |  First Steps with RPi 3  |  Humidity & Condensation  |  Sleep & Hibernation  |  Overclocking RAM  |  Making Backups  |  Displays  |  4K / 8K / 16K / etc.  |  Do I need 80+ Platinum?

If you can read this you're using the wrong theme.  You can change it at the bottom.


11 hours ago, LAwLz said:

x264 is one of the most well-written applications you can find. It even runs circles around Intel's own encoder. It's just that video encoding is extremely (and I mean extremely) CPU-dependent, and not really that dependent on other things. It is also very parallel, which is why x264 scales very well up to something like 50 cores. So no, it's not that it is poorly optimized.

 

You might have seen a 2x increase, but that was probably the exception rather than the norm, especially for more consumer-oriented applications.

 

It's not that far behind in terms of instructions, but it is behind in other areas such as process node and architecture as well. It adds up to quite a significant difference.

 

That is one way of doing it, but the problem is that GPU acceleration does not work well for certain workloads. For example, take video encoding again. I don't know all the details, but GPU-accelerated video encoding always ends up much worse in terms of quality compared to pure software encoding. You also run into issues when you try to do things such as CABAC, since it is extremely difficult to parallelize (maybe GPU-accelerated implementations have to take shortcuts in this area?).

Moving to GPGPU also means we will have to update a ton of applications. That will take a very long time. I'd rather just get a better CPU which will be useful from day one than get a CPU which might be better if developers update their programs to take advantage of new features, which can take several years.

 

Maybe I should just suck it up and get Skylake-E or whatever it will be called. A 6-core chip on the mainstream platform is just something I am hoping for, but I doubt it will happen. Or maybe AMD will come to my rescue with Zen... lol jk, I fully expect it to be as big of a disappointment as Bulldozer was.

Sorry I couldn't respond earlier. It's been a long day.

 

The very evidence against your claim is the fact that it only gets a 15% boost from AVX 256 but does better with more cores. Clearly someone overlooked some levels of data- and instruction-level parallelism in how they designed parts of the algorithm (most likely obfuscated loops that could be joined together for less overhead). AVX 256 takes the same number of clock cycles as AVX 128 to crunch double the data, so clearly it's the developers' fault, or they used an outdated compiler version that was only just starting to become aware of the instructions.
 

No, it's the norm if you have data-parallel workloads, period, unless you have a bandwidth or I/O bottleneck. You can prove this very easily with GCC using a couple of target directives for AVX and AVX2 and a simple data-parallel loop that has enough operations per iteration to not cause a bandwidth bottleneck. And the fact that it's not bandwidth-heavy is also proof that it's either overcomplicated, with too many data dependencies between I-frames to be successfully parallelized, or the developers missed something that should be obvious. Like I said, anything involving graphics is embarrassingly parallel, and that includes video and image compression.
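
Something along these lines, as a rough sketch (using GCC's target attribute rather than pragmas, and a made-up integer kernel, since plain AVX only has 128-bit integer SIMD while AVX2 widens it to 256 bits; build with g++ -O3):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// The same data-parallel loop built twice: the vectorizer is limited to
// 128-bit integer SIMD under plain AVX and gets 256-bit vectors under AVX2.
__attribute__((target("avx")))
void add_avx128(int* out, const int* a, const int* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

__attribute__((target("avx2")))
void add_avx256(int* out, const int* a, const int* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

int main() {
    const std::size_t n = 1 << 14;  // small enough to stay in cache, so no bandwidth bottleneck
    std::vector<int> a(n, 1), b(n, 2), out(n);

    auto bench = [&](void (*fn)(int*, const int*, const int*, std::size_t)) {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100000; ++rep) fn(out.data(), a.data(), b.data(), n);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    };

    std::printf("AVX  (128-bit int SIMD): %.3f s\n", bench(add_avx128));
    std::printf("AVX2 (256-bit int SIMD): %.3f s\n", bench(add_avx256));
}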

 

Quality has everything to do with the algorithm, because an ALU is an ALU is an ALU and an FPU is ... Unless x264 is using double precision on video cards with no DP support, the problem is the algorithm, not the processor. An algorithm is only a mathematical structure which defines transformations of a system to gain new information. Once mathematically proven to do exactly XYZ, it's up to programmers to get it right for respective languages. If quality is lower, it's the implementation that's off, not the hardware, unless all GPUs just have huge arithmetic bugs we never knew about.

 

It wouldn't take long at all, just a few comments and preprocessor tags, as long as the drivers are there (for Intel and AMD, they already are). In the following example, if an iGPU is not present (and a dGPU isn't exposed with an OpenMP driver), the code will just run in parallel on your CPU cores. You can even control the thread count easily. There is zero excuse for multithreaded code to not be the norm now, and the excuse for not going heterogeneous is looking really flimsy with OpenMP being the open standard it has been for 15 years, with wide support on GCC, Clang, and ICC up through version 4.0 (and partial 4.5), and Visual Studio supporting all of 2.0 and bits and pieces of 3.0 and 4.0.

Spoiler

#include <omp.h>

// Offload target for Intel's graphics offload; with Intel's compiler this pairs
// with a #pragma offload target(gfx) at the call site. Without an offload-capable
// driver, the OpenMP loop below just runs in parallel on the CPU cores.
__declspec(target(gfx)) void function(double *vector, unsigned int length) {
    #pragma omp parallel for
    for (unsigned int i = 0; i < length; i++) {
        // do something with the vector elements in a parallel loop
    }
}

 

 

It won't be as big a disappointment as Bulldozer, but I think for consumers it will fall very short of the hype. If it doesn't have AVX-512, the only shot AMD will have in servers will be the HPC APUs, because otherwise, even with 32 cores against Skylake E5/E7's 28 (with 6-channel memory controllers of up to DDR4-2400 too), it'll be outgunned severely in compute workloads.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


10 hours ago, terminal2 said:

This.

I was never a fan of the move towards laptop-sized dies for anything but Celeron desktop processors.  It never made much sense to me, as I always thought like you did: just build more on the same space.  Hell, if nothing else, just asstons of cache.

Did you know that about half the heat on an E5/E7 Xeon actually comes from the cache? The other half, in a compute workload, is the FPU. I mean, if you can tolerate lower clock speeds and less overclocking headroom and thus get only some performance benefit, okay, but even then, the E5-2699 v3 is a 662 mm² die. That's bigger than most flagship GPU dies up to this point. Intel can't make the cores much bigger.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


10 hours ago, linuxfan66 said:

 

I'm gonna defend the Excel spreadsheet...

Database programming is rarely done or taught in C/C++ anymore as far as I can tell. (I'm slightly biased: I think pointers are abominable due to lacking built-in overrun protection. The very concept of a thing in user land that is effectively a variable, not managed by any external library to keep unwanted data from uncleared memory out or to stop it leaving its allocated space if the programmer forgets to tell it not to, is stupid to me. If you feel like a TL;DR: I consider non-system-level code that doesn't have an execution manager enforcing variable/pointer integrity irresponsible in many scenarios. /rant over)

Regardless of my thoughts on what user-land code should or shouldn't be, code at the HPC level of computing should be optimized and multithreaded, but that's clearly not happening...

By database programming, do you mean implementing the database algorithms to maintain data coherency and transactions? If it's not being taught in C/C++, it's being taught wrong. If you're being taught how to interface with a database through Java or another language, that's fine.

 

99.9% of things you can do with pointers can be done with references until you get into the nastiest parts of custom data structures. Also, const keyword much?
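
A trivial sketch of the difference (hypothetical summing function, nothing fancy):

#include <cstddef>
#include <vector>

// Reference version: no raw pointer, no separate length to get wrong,
// and const documents that the data is read-only here.
double sum(const std::vector<double>& values) {
    double total = 0.0;
    for (double v : values) total += v;
    return total;
}

// The C-style equivalent, where nothing stops a caller from lying about n
// or passing a pointer to freed memory:
// double sum(const double* values, std::size_t n);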

 

At the HPC level, that is most certainly happening, unless you go to some shit school/company.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


7 hours ago, Ryan_Vickers said:

It would seem like it, but since this interruption to Moore's Law is caused by hitting the physical limits of the current methods/technology, it's not like they can just magically push through. To swoop in and pass Intel, they would need to have one of these other technologies (besides silicon) ready to pull out of their back pocket, like, now, and since I don't think that's the case, and they certainly don't have the same kind of R&D funding as Intel, I don't expect we'll be seeing any miracles soon.

We don't hit those limits until 7 nm or 5 nm. IBM's proven we can do 7 nm without spintronics, but it did take quad-level patterning and EUV to do it, and nothing was ever said about clock speeds. Whether that's the limit, or someone can push down to 5 nm without these new-age techniques, remains an open question.

 

It's possible IBM gave GlobalFoundries a fighting chance when it paid to give them their last remaining fab (responsible for 22nm FDSOI and Power 8 CPU production) and probably IP access, but Intel doesn't sit on its butt as a foundry. That's probably the place they innovate the fastest.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


22 hours ago, Ryan_Vickers said:

While we're on the topic of software efficiency, you don't even have to go to that level to see atrocious practices.  I've got stories like you wouldn't believe.

Here's one: my friend has this large spreadsheet at work that takes so long to calculate (~5 hours) that IT gave him an additional computer to run it on.  I asked what it was doing, and the short version is that it does about 30,000 summations, each of about 30,000 numbers.  It's doing this using VB macros in Excel, so there's your problem.  I estimated I could whip up a C++ program to do the same thing in seconds, but they're fine with it like this  ¯\_(ツ)_/¯

The others include things from my dad's work experience with databases and servers, where he's seen scripts run over the weekend that he knows could be done in an hour or two if they'd been written properly, and other poor choices of language, like using Python or Java for some compute-intensive task where something like C++ would be a much better option.  Yeah, it's rampant, and with the prevailing state of mind being "it's good enough" or that modern computers are so fast they can afford to write crap, I don't see it changing any time soon, unfortunately.

 

What percentage of the hardware performance gains that we've been getting each year do you suppose is lost to the seemingly ever-worsening general state of software efficiency?

Hopefully the newer generation of coders can fix this. I'm currently taking a data structures course in C++, and we're spending a good month on parallel computing this semester. I can't wait to deploy to our Beowulf cluster!

F#$k timezone programming. Use UTC! (See XKCD #1883)

PC Specs:

Ryzen 5900x, MSI 3070Ti, 2 x 1 TiB SSDs, 32 GB 3400 DDR4, Cooler Master NR200P

 

 


9 minutes ago, Qub3d said:

Hopefully the newer generation of coders can fix this. I'm currently taking a data structures course in C++, and we're spending a good month on parallel computing this semester. I can't wait to deploy to our Beowulf cluster!

You'll be bandwidth-choked in MPI programming if you have to pass too much data around, but I've always been interested in seeing what a cluster of ordinary hardware could do.
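
For a taste, here's a bare-bones MPI sketch (a hypothetical example assuming an MPI implementation like Open MPI or MPICH; build with mpicxx and launch with mpirun): each rank sums its own slice, and only one double per rank ever crosses the network, which is the kind of communication pattern that keeps the interconnect from choking.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Each rank sums a local slice of data; only a single double per rank
// travels over the interconnect during the reduction.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Stand-in local data; a real job would read or generate its own slice.
    std::vector<double> local(1000000, rank + 1.0);

    double local_sum = 0.0;
    for (double v : local) local_sum += v;

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("Cluster-wide sum from %d ranks: %f\n", size, global_sum);

    MPI_Finalize();
}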

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


3 hours ago, patrickjp93 said:

We don't hit those limits until 7 nm or 5 nm. IBM's proven we can do 7 nm without spintronics, but it did take quad-level patterning and EUV to do it, and nothing was ever said about clock speeds. Whether that's the limit, or someone can push down to 5 nm without these new-age techniques, remains an open question.

Yeah, that's what I'm talking about :) My point is that, if Intel is going to be backing off on performance increases, it's not like AMD can just fly past them; they'd be heading for the same wall :)

Solve your own audio issues  |  First Steps with RPi 3  |  Humidity & Condensation  |  Sleep & Hibernation  |  Overclocking RAM  |  Making Backups  |  Displays  |  4K / 8K / 16K / etc.  |  Do I need 80+ Platinum?

If you can read this you're using the wrong theme.  You can change it at the bottom.


9 minutes ago, Ryan_Vickers said:

Yeah, that's what I'm talking about :) My point is that, if Intel is going to be backing off on performance increases, it's not like AMD can just fly past them; they'd be heading for the same wall :)

We'll see. Samsung could pull a Houdini between now and the 5nm transition in 5-6 years.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


As long as they increase the number of cores to counter it for PC chips. Heck, imagine a world in which hyperthreading isn't just useful for increasing heat output when playing games.


6 hours ago, huilun02 said:

Intel, I get it, you want to target the mobile sector. But for the love of god, don't pull a Windows 8.

 

Hopefully AMD catches up in IPC and per-core performance to shake things up, or at the very least brings affordability to performance.

The FX-9590 launched for $1,000. Do you seriously think AMD's prices won't rise significantly if they get performance competitive with Intel?

 

3 hours ago, RagnarokDel said:

As long as they increase the number of cores to counter it for PC chips. Heck, imagine a world in which hyperthreading isn't just useful for increasing heat output when playing games.

That's up to developers, not Intel. HPC seems to get more performance from hyperthreading so...

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


12 hours ago, patrickjp93 said:

The very evidence against your claim is the fact that it only gets a 15% boost from AVX 256 but does better with more cores. Clearly someone overlooked some levels of data- and instruction-level parallelism in how they designed parts of the algorithm (most likely obfuscated loops that could be joined together for less overhead). AVX 256 takes the same number of clock cycles as AVX 128 to crunch double the data, so clearly it's the developers' fault, or they used an outdated compiler version that was only just starting to become aware of the instructions.

It is open source, so you can try it out if you want.

However, the problem is that you don't just get double the performance because your registers become twice as wide. For example, x264 contains lots of functions whose data already fits inside registers smaller than 256 bits, so widening the registers doesn't help in those cases. x264 deals with lots and lots of small numbers.

 

Since I don't know enough myself, I decided to ask an x264 developer. Their answers made things very clear.

Quote

<LAwLz> Okay fine. So after looking around on the Internet it seems like people are reporting performance gains around 10-30% when using AVX in x264. I am not sure if those numbers are accurate since I can't use AVX2 on my processor, but I have no reason to distrust them.
<LAwLz> 10-30% seems rather low compared to the performance gains other applications seem to get when moving to AVX2. I was wondering if there is some specific reason why x264 don't seem to gain as much as some other applications?
<Gramner> AVX1 itself doesn't really do that much since x264 is mainly integer-based. AVX2 roughly doubles performance in certain algorithms, but those algorithms only make up a few percent of overall run time
<LAwLz> And why is it that only some algorithms see roughly double the performance with AVX2? Are the other functions small enough to not benefit from wider registers, or is it some other reason?
<Gramner> most algorithms in H.264 are based around blocks of 16x16 pixels, so registers wider than 16 bytes are hard to utilize efficiently (for 8-bit encoding)
<LAwLz> That makes sense. So we should not expect a big performance increase with the move to AVX 512 either?
<Gramner> or rather, blocks of 16x16 pixels or sub-blocks that are even smaller
<Gramner> I don't have any AVX-512 hardware yet so I can't really say. I'm guessing 5-10% maybe depending on how much of a penalty there is for doing cross-lane shuffles
<LAwLz> So basically, most functions fit inside 128-bit registers, so making them 256 or 512 bits wide won't really help.
<LAwLz> Is that just for 8-bit color depth, or can we see a bigger performance increase for 10-bit content?
<Gramner> you can do stuff like storing multiple rows in a register, but on current AVX-capable hardware there are significant penalties for moving data between lanes. each register is split into lanes, each 16 bytes wide
<Gramner> 10-bit might be easier. there's probably more AVX2 optimizations possible for 10-bit as it is
<LAwLz> I see
<Gramner> 8-bit is generally considered more important and higher priority than 10-bit
<LAwLz> Yeah I know.

<Gramner> also note that around 50% or so of x264's run time is spent in scalar code which is unaffected by any vector processing improvements

 

I also asked about GPU acceleration, and the TL;DR was that because GPUs are very parallel but bad when it comes to branches, the encoders which use GPUs run simplified algorithms with far fewer dependencies, but that comes at the expense of quality.
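
To make the 16x16 point concrete, here's a rough illustration (not actual x264 code, just hypothetical SSE2 intrinsics): one row of a 16x16 block of 8-bit pixels fills a 128-bit register exactly, so a 256-bit register would have to straddle two rows and run into the cross-lane penalties mentioned above.

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Saturating add of two rows from a 16x16 block of 8-bit pixels.
// Sixteen pixels = sixteen bytes = exactly one 128-bit XMM register,
// which is why 128-bit SIMD maps so neatly onto H.264's block sizes.
void add_row(const uint8_t* row_a, const uint8_t* row_b, uint8_t* out) {
    __m128i a   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row_a));
    __m128i b   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row_b));
    __m128i sum = _mm_adds_epu8(a, b);  // per-byte saturating add
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
}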

