
Hardware Advancements: Where Did They Go?

patrickjp93


I'm still considered a young frog in this pond, but even young ones like me learn fast and know what they know, even if that knowledge is incomplete. So, give this young frog a bit of time with your ear. The era between the i386 and the Pentium 4 mostly precedes my birth, but I was here for Core 2 Duo/Quad, Athlon X2/4, the first Core processor from Intel, Bulldozer, Nehalem, Piledriver/Vishera, Sandy Bridge, the birth of the APU, Steamroller, Ivy Bridge, Haswell, Broadwell, Excavator, and most recently Skylake. I've seen the big jumps come. I've seen what people complain about: the slowdown in performance gains. This isn't just true for synthetic benchmarks that don't see updates for years after they're finished. It's true even for modern games and apps. The difference between Sandy Bridge and Haswell in modern games tends to be between 3 and 10% per clock. Between Haswell and Skylake, in a few rare cases, we even see a tiny loss of performance. The key question is this: why? Why are these gains so small now, when Nehalem to Sandy Bridge alone was 15% and we got big overclocking headroom out of the deal?

 

The short answer is "it's complicated." To be persnickety, it's many-faceted, but the individual facets aren't complicated on their own.

 

I think it's best to start with x86 itself. If you've only ever heard about assembly instructions in the dreadful light of having to program with them, without ever really looking at what those "words" do, you can't hope to understand the fundamental limits of computing beyond a superficial level. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf It takes nearly 4,000 pages to describe how x86 processors work today, list and elaborate on every single instruction, and describe good system programming practices for the architecture. That's because there are more than 200 instructions to choose from, and there are many ways to solve a coding problem. You have instructions which contain 1 or more steps. Some very short algorithms have been reduced to a single instruction, shrinking executables and shaving clock cycles. This holds true for AMD and VIA in addition to Intel. If this surprises you, then I think you should ask yourself where performance comes from in the first place. We still have the instructions available to us that do 1 thing at a time to 1 piece of data. They are the core of x86 and x86_64. Without them, you can't solve any problem on an x86 processor.
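To make that concrete, here is a minimal sketch of my own (not from the manual) of a short algorithm that newer hardware reduces to a single instruction: counting set bits. The loop below is nothing but simple shift/mask/add/branch instructions; the GCC/Clang builtin compiles down to one POPCNT on CPUs that support it.

/* Counting set bits: a loop of simple instructions vs. a single POPCNT.   */
/* Build with gcc or clang; -mpopcnt (or -march=native) lets the builtin   */
/* lower to the POPCNT instruction on CPUs that have it.                   */
#include <stdint.h>
#include <stdio.h>

/* The "many small instructions" version: shift, mask, add, branch. */
static unsigned popcount_loop(uint64_t x) {
    unsigned n = 0;
    while (x) {
        n += (unsigned)(x & 1);
        x >>= 1;
    }
    return n;
}

int main(void) {
    uint64_t v = 0xF0F0F0F0F0F0F0F0ULL;
    /* GCC/Clang builtin: becomes one POPCNT instruction where supported. */
    printf("loop: %u  builtin: %u\n",
           popcount_loop(v), (unsigned)__builtin_popcountll(v));
    return 0;
}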

 

You may think of these instructions as the heart or core of x86. They fall into a more general category in computer architecture theory called SISD, or Single-Instruction, Single-Data. These cover everything from mathematical operations on 2 operands to 1- or 2-operand logical operations to bit shifting on singular 8/16/32/64/80-bit data types (yes, 80, usually used for extra-precision floats). These instructions are simple: easy to read and understand, easy to count clock cycles on and tune performance for. But that sword cuts 2 ways. Though they are relatively easy to optimize with, you need many, many more of them to solve a given problem relative to newer, complex instructions that do 2 or more things. That means larger executables. It requires that your caches hold more instructions, which constrains your performance. It also kneecaps modern hardware that can optimize beyond the level a compiler is capable of.
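Here's a minimal sketch of what "one instruction per piece of data" looks like. Compiled without vectorization (say, at -O0, or with GCC's -fno-tree-vectorize), every element costs its own load, add, store, and loop bookkeeping.

/* Adding two arrays with only scalar (SISD) operations: one load, one add,
 * one store per element, plus loop bookkeeping. Many instructions for very
 * little work per instruction. */
#include <stddef.h>

void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];   /* roughly 4-6 instructions per element when kept scalar */
}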

 

These days, nearly every processor available has a superscalar engine in every core: hardware that reorders instructions so it can interleave them and squeeze out more performance for you. You may also know this as out-of-order processing. That said, there's a limit to how many instructions any such engine can reorder: there's a fixed-length buffer in modern ARM, x86, PowerPC, SPARC, and MIPS architectures. If you limit yourself to SISD instructions, these buffers fill up on smaller tasks. The result? Less performance than the hardware can deliver when fed modern instructions that perform multiple steps, or that perform 1 step on multiple pieces of data at once.
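A rough sketch of why this matters to the code you write: the first loop below is one long dependency chain, so the reorder hardware has almost nothing it can overlap; the second keeps four independent partial sums in flight and gives the out-of-order engine work it can actually interleave.

/* Two ways to sum an array. sum_chained() forms one long dependency chain,
 * so each add must wait for the previous one. sum_interleaved() keeps four
 * independent chains going, which the out-of-order hardware can overlap.
 * (Note: reassociating floating-point sums can change the result slightly.) */
#include <stddef.h>

double sum_chained(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];                    /* every add depends on the one before it */
    return s;
}

double sum_interleaved(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      /* four independent dependency chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)                /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}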

 

So, we have newer instructions that do more. Some are very application-specific (the AES extensions do low-overhead encryption and decryption on the fly), but most aren't. Even for those that are not, compilers can't always optimize code to use them because of complex rules. You can't use MMX, SSE, or AVX on data that isn't aligned properly in the address space. I.e., you can't just pull any 4 floats from memory and add them to any other 4 floats; both groups must sit on the natural 128-bit (16-byte) boundaries. If programmers don't guarantee this in their code, compilers can't assume the data will land squarely on those boundaries, and only SISD instructions will be used. But more on that in another post.
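Here's a minimal sketch of what that alignment contract looks like with SSE intrinsics. aligned_alloc (C11) guarantees the 16-byte alignment that the aligned load/store intrinsics require; hand _mm_load_ps an unaligned pointer instead and the program faults.

/* Adding four floats at a time with SSE. _mm_load_ps/_mm_store_ps require
 * 16-byte-aligned addresses; aligned_alloc (C11) guarantees that alignment. */
#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>              /* SSE intrinsics */

int main(void) {
    float *a = aligned_alloc(16, 8 * sizeof(float));
    float *b = aligned_alloc(16, 8 * sizeof(float));
    float *c = aligned_alloc(16, 8 * sizeof(float));
    for (int i = 0; i < 8; ++i) { a[i] = (float)i; b[i] = 1.0f; }

    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);            /* aligned 128-bit load  */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));   /* 4 adds, 1 instruction */
    }
    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    free(a); free(b); free(c);
    return 0;
}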

 

After newer instructions and the superscalar engine, we have to know about pipelining: the interweaving of steps from multiple instructions so that instructions may execute in partial or even complete concurrency. In computer architecture 101, you'd be told there are four stages in an execution pipeline: Fetch, Decode, Execute, Write-Back. You'd then be shown four cascading rows, each starting with Fetch, offset by 1 from the row above. In general, this is how pipelining works. While the L1 cache is getting instruction 4, the core can be decoding instruction 3, an ALU can be executing instruction 2, and the results of instruction 1 can be on their way back to cache or main memory. However, with complex, multi-step instructions, there is actually much finer granularity in the execution stage, allowing many pieces of multiple instructions to be in progress simultaneously. We call these pieces micro-ops, or muOps. How many muOps can be executing simultaneously is a large determinant of x86 performance. For Haswell, the total number that can be in flight at once is a whopping 114, up from 64 in Sandy Bridge. http://www.realworldtech.com/haswell-cpu/3/
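If the four-stage picture is hard to visualize, this toy program of mine just prints the textbook diagram: each cycle a new instruction enters Fetch while the older ones march down the pipe, so four instructions are in flight at once.

/* Prints the classic cascading pipeline diagram: instruction i occupies
 * stage (cycle - 1 - i) each cycle, so up to four instructions overlap. */
#include <stdio.h>

int main(void) {
    const char *stages[] = { "Fetch", "Decode", "Execute", "Write-Back" };
    const int n_instr = 4, n_stages = 4;

    printf("cycle:");
    for (int c = 1; c <= n_instr + n_stages - 1; ++c) printf(" %10d", c);
    printf("\n");

    for (int i = 0; i < n_instr; ++i) {
        printf("inst%d:", i + 1);
        for (int c = 1; c <= n_instr + n_stages - 1; ++c) {
            int s = c - 1 - i;                 /* stage this instruction occupies */
            printf(" %10s", (s >= 0 && s < n_stages) ? stages[s] : "-");
        }
        printf("\n");
    }
    return 0;
}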

 

That 78% increase didn't make performance jump by 78%, though, not even 50%. Most people didn't even see 15% over Sandy Bridge. So, what else impacts performance? In terms of cache, a 4790K, a 6700K, and a 2600K all have the same 3-tier cache system with the same sizes. Minute reductions in read and write times have been made, but otherwise this particular aspect has seen little improvement. With Nehalem, Intel introduced the Loop Stream Detector, a circuit which detects bunches of instructions forming tight loops with a small number of branches. On detection, it puts those instructions into a buffer for rapid successive execution, removing the fetch stage from the pipeline. In Sandy Bridge, this buffer can capture loops up to 32 bytes/28 muOps. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html For Haswell and Skylake, this buffer's size has increased by another 6 bytes and 4 muOps.
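For a feel of the kind of loop that buffer is aimed at, here's a sketch of my own: a loop whose whole body is a handful of muOps ending in a single backward branch. Whether a particular loop actually fits is microarchitecture-dependent, so treat this purely as an illustration.

/* A tight loop: load, add, increment, compare, one backward branch. Bodies
 * this small are what the loop buffer described above is built to replay
 * without re-fetching and re-decoding every iteration. (Whether it actually
 * fits in the buffer depends on the exact microarchitecture.) */
#include <stddef.h>
#include <stdint.h>

uint64_t byte_sum(const uint8_t *p, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += p[i];
    return sum;
}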

 

For Sandy Bridge, one finds 3 integer ALUs, a dedicated floating-point adder, a dedicated floating-point multiplier, and a 256-bit AVX vector unit: 6 arithmetic units in total, distributed across 3 scheduling ports.

 

 

 

[Image: Sandy Bridge (Core i7-2600K) core block diagram]

 

With Haswell, Intel added a dedicated "store" AGU, freeing up two ports to better distribute arithmetic instructions. Intel also added a 4th integer ALU, freeing ports 0 and 1 for vector instructions. Ports 0 and 1 were given vector duties, and port 1 also received an additional dedicated floating-point unit, benefitting legacy workloads mixed in with modern ones by reducing pipeline pressure and conflicts. This increased scheduling flexibility is augmented by the increased size of the out-of-order execution buffer, the circuit responsible for finding pipeline hazards and reordering instructions to better optimize their execution. Sandy Bridge could handle 168 muOps; Haswell can handle 192. http://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.80-Processors2-epub/HC25.27.820-Haswell-Hammarlund-Intel.pdf Yet even with this increased flexibility, reduced pressure, and increased amount of resources, performance did not increase significantly for most existing software.
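To picture what distributing work across ports buys you, here is a deliberately simplified sketch of my own: a loop whose body mixes integer and floating-point work. On a core with the port layout described above, the integer ops and the FP ops can issue to different execution units in the same cycle rather than queueing behind one another.

/* A loop mixing integer and floating-point work. The FP multiply/add and the
 * integer AND/add are independent of each other, so a core with separate FP
 * and integer ports can issue them side by side. An illustration, not a
 * benchmark. */
#include <stddef.h>
#include <stdint.h>

double mixed(const double *x, const uint32_t *k, size_t n, uint64_t *odd_count) {
    double acc = 0.0;
    uint64_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * 0.5;      /* floating-point multiply + add: FP ports */
        count += k[i] & 1u;     /* integer AND + add: integer ALU ports    */
    }
    *odd_count = count;
    return acc;
}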

 

So, if removing pipeline pressure and contention, improving loop detection, tweaking cache behavior, and widening the out-of-order scheduler only gained us what some would consider small gains, and if Intel has spent these billions on research, why haven't we seen the benefits to the same degree we used to?

 

For older software using older instructions, especially those that fall into SISD, those instructions already cannot possibly execute any faster. For the last 4 generations, integer add and multiply have not changed at all. For division, some room remains, but integer division is very expensive and grows more expensive with every extra bit of width. In other words, it's not possible to make those instructions go any faster. Old software is about as fast as it can be, barring any growth in cache sizes and clock speeds, and even then the gains are marginal. However, for newer software written with knowledge of the new instructions that do more, and of the fixed-function and optimizing hardware, there is plenty of room for CPU-based performance to scream ahead. It's just going to take some effort to get the software where it needs to be.
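As a closing sketch of what "getting the software where it needs to be" can mean in practice: the loop below doesn't change at all; only the compiler flags do. Built conservatively it stays scalar; built with -O3 -march=native on a Haswell-or-newer machine, GCC and Clang are free to emit AVX2 and FMA forms of the same loop. The file name and exact flags here are just an illustration.

/* The same source, different instruction selection.
 *   gcc -O2 saxpy.c                 -> baseline scalar/SSE code
 *   gcc -O3 -march=native saxpy.c   -> may use AVX/AVX2 (and FMA on Haswell+)
 * "restrict" promises the arrays don't overlap, one of the guarantees
 * compilers need before they will vectorize. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}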

 

Therefore, from a hardware perspective, you can't blame Intel for the slowdown in PC performance improvements. It has provided all the groundbreaking new circuit designs to get us where we already are, and no one has yet caught up to or surpassed Intel's designs. If the hardware could do more, you can bet Intel would prefer the increased sales from selling better hardware for the same R&D price it already pays anyway. Perhaps you could lay some blame for the slowdown in sales and upgrades on its pricing strategy, but that is its own enormous debate. In my opinion, based on the evidence, the problem does not lie with Intel's hardware. It lies with the programmers who don't know how to use it and who still write the software we use today.

 
