
AMD Talks Next Generation Coherent Interconnect Fabric Connecting Polaris GPUs, Zen CPUs and HPC APUs (AMD's NVLink)

Mr_Troll

Oh great, one of the speculative execution brigade... MIT has gone down that road enough times and found it doesn't gain you anything and costs a ton of extra power and heat. Just stop with the lunacy.

Common folk's belief in that old claim wasn't based on good math. The actual engineers knew it was BS. This is not nearly an equivalent situation.

No, people who misuse statistics are bad at telling the future. Those automated stock-trading algorithms are proof that the future is predictable. We're approaching the fundamental limits of single-threaded SISD computing, and software developers have to get over that. Programmers need to get better at their craft; there's already a ton of untapped power in our current CPU cores. Pushing for more in consumer hardware and blaming Intel, as the sole provider of enthusiast parts, is absolutely stupid. Software is behind; there is no room for debate on that. Until it catches up, all Intel can do is scrape the bottom of the barrel. When AMD slams into that same wall, I hope it finally wakes you people up.

We are extremely close to that. You don't understand what branch prediction actually is, do you? Furthermore, you cannot have infinite hardware, so you cannot possibly execute every possible branch at once. The reconciliation alone would take more time than occasionally guessing wrong a couple of cycles ahead and then correcting.

SMT designs duplicate their register files and schedule threads onto different ALUs to gain performance. Sharing the same ones actually loses performance in most cases.

Carbon will only increase clock speeds due to easier cooling. It will not magically widen and deepen execution pipelines.

 

There's nothing wrong with speculative execution beyond its inherent inefficiency. Today we do it with branch prediction at roughly 95-96% accuracy. Some code doesn't care; some cares a great deal (a quick demonstration of that is below), much like some code is less sensitive to execution latencies than to memory timings. My mention of "infinite" refers to theory, not application; I hope you understand the difference, because you seem to be confusing the two. I don't dispute that software is behind, I've said as much before.
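
Here's a minimal C++ sketch of that sensitivity (my own illustration, with made-up sizes; absolute timings vary by machine). The loop is byte-for-byte identical in both runs; only the predictability of the branch changes, and on typical hardware the sorted run comes out several times faster:

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Sum the elements at or above a threshold. The if() is the branch
// the predictor must guess on every single iteration.
static std::int64_t sum_above(const std::vector<int>& v, int threshold)
{
    std::int64_t sum = 0;
    for (int x : v)
        if (x >= threshold)
            sum += x;
    return sum;
}

int main()
{
    std::vector<int> data(1 << 24);  // 16M values in [0, 255]
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : data) x = dist(rng);

    auto time_it = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        volatile std::int64_t s = sum_above(data, 128);  // volatile keeps the call alive
        auto t1 = std::chrono::steady_clock::now();
        (void)s;
        std::cout << label << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
    };

    time_it("unsorted");  // branch is a ~50/50 coin flip: predictor struggles
    std::sort(data.begin(), data.end());
    time_it("sorted");    // outcome flips once across the array: near-perfect prediction
}

On the unsorted data the predictor sits near 50% and every miss flushes work already in flight; once sorted, the branch outcome changes exactly once across the whole array and prediction is near-perfect.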

 

You do realize that current CPUs are full of trade-offs and half-solutions, right? They are not designed for maximum performance at all costs; they are designed to be sold. Corporate solutions are very rarely that efficient. They are what works in a profitable, and marketable, manner. They are also limited by how much risk their makers are willing to take; going with what is known to work is the safe bet, so most take that route. I'm also not saying that we will be able to accomplish any of this in five or ten years. I'm providing NO time-frame at all.

 

Every modern x86 CPU has a whole slew of instructions that could execute in half the time or less, but they run through generic circuits and timing logic because of economic decisions (not bad decisions, obviously). With a dedicated circuit for each of them, execution would speed up greatly. That's currently impractical, but it doesn't mean highly efficient techniques won't come along and change it. Some parts of a CPU can run much faster than others while drawing less power, so maybe we will see multiple clock domains in CPUs in the future. Or maybe we'll move away from the clock cadence entirely, using metadata, handshaking signals, and buffers to let each part of the CPU run as fast as it can, i.e. asynchronous logic. Lots of options.

 

Carbon's leakage characteristics are its most exploitable trait, no doubt, but transistor switching speeds are already far faster than we can run full CPUs, so leakage is the characteristic that lets a chip operate closer to its transistors' optimal switching speeds. With many caveats, buts, and ifs, naturally. Still, IBM, many years ago, showed off working transistors actually operating at 100GHz. That was only about 10 times faster than silicon transistors, and we've improved silicon transistors since then. Yet we don't see 10GHz CPUs. Why not, if our transistors can already switch that fast? Several reasons, naturally, but mostly localized power dissipation limits.
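
To put rough numbers on that last point (my own back-of-the-envelope, not from any datasheet): CMOS dynamic power scales roughly as P ≈ α·C·V²·f, and reaching higher frequencies generally demands more voltage too, so power grows faster than linearly with clock. Taking a ~4GHz part to 10GHz with even a modest 20% voltage bump gives (10/4) × 1.2² ≈ 3.6 times the heat in the same few hundred square millimetres of die. That's the localized dissipation wall.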

 

Now, please note that I am not saying this is the best, or only, way forward - far from it - I'm just saying it will absolutely be doable in time. I prefer massive parallelism, as I've spent years perfecting my ability to scale code. For fun, I wrote a replacement implementation of std::async that is about as much faster than std::async as you have cores to use. I can see exactly how C++11 features could be exploited to enable fantastic automated parallel execution (auto-loops, i.e. range-based for, are a prime suspect, though the compiler would need to do some complex profiling). A sketch of the general idea is below.
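
For the curious, here's a minimal C++11 sketch of that general idea (NOT my actual implementation, just the standard thread-pool pattern it builds on): a fixed set of hardware_concurrency() workers pulls tasks off a queue, and submit() hands back a std::future just like std::async does, without spawning a thread per call.

#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <type_traits>
#include <utility>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned n = std::thread::hardware_concurrency())
    {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // std::async-style interface: run f(args...) on the pool and hand
    // back a std::future for the result.
    template <class F, class... Args>
    auto submit(F&& f, Args&&... args)
        -> std::future<typename std::result_of<F(Args...)>::type>
    {
        using R = typename std::result_of<F(Args...)>::type;
        // packaged_task is move-only; wrap it in a shared_ptr so the
        // copyable std::function in the queue can hold it.
        auto task = std::make_shared<std::packaged_task<R()>>(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void worker_loop()
    {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;  // drain the queue before exiting
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so workers stay parallel
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

Usage mirrors std::async: auto f = pool.submit([](int a, int b) { return a + b; }, 2, 3); then f.get() yields 5. The win over a naive std::async is that thread creation cost is paid once at pool construction rather than per call, which is where the near-linear scaling with core count comes from, at least for tasks big enough to amortize the queue's lock.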
