
Heterogeneous Architecture and What's Coming

As a refresher for those unfamiliar with AMD's APUs, Heterogeneous System Architecture (HSA), the Kaveri line, and the roadmaps of both AMD and Intel, here is the long story made short.

 

HSA is a standard spearheaded by AMD to bring the floating-point performance of GPU cores to its processors, in much the same way large servers and supercomputers use graphics cards as accelerators for highly parallel tasks, or even for handling multiple independent tasks from separate programs. Some on this forum speculate it is meant to compete with Intel's approach of doing floating-point data parallelism directly on the CPU via SIMD (Single Instruction, Multiple Data). That approach is best codified in the AVX instruction sets, which have gone through 128-, 256-, and now 512-bit generations. Applying the same instruction across large sets of floating point numbers is the backbone of graphics and image transforms, which fall under matrix algebra, an "embarrassingly parallel" (read: comparatively simple/easy to divide up) problem. This is why graphics cards have such high core counts and why CPUs long ago stopped handling graphics. A few fast cores can't put together an image anywhere near as fast as many slower cores can, which is also why graphics cards run at ~1GHz while CPUs run at ~4GHz. More cores = more throughput.
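
To make the SIMD idea concrete, here is a minimal C sketch using AVX intrinsics (my own illustration, not from any AMD or Intel material): one 256-bit instruction adds eight floats at once, the same pattern a GPU scales out across hundreds of stream processors.

/* Minimal SIMD sketch: one AVX instruction operates on eight 32-bit floats.
 * Compile with: gcc -mavx -O2 simd.c */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float out[8];

    __m256 va = _mm256_loadu_ps(a);       /* load 8 floats into a 256-bit register */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vsum = _mm256_add_ps(va, vb);  /* one instruction, eight additions */
    _mm256_storeu_ps(out, vsum);

    for (int i = 0; i < 8; i++)
        printf("%.1f ", out[i]);          /* prints 9.0 eight times */
    printf("\n");
    return 0;
}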

 

Now AMD has released a series of Accelerated Processing Units (APUs) which put a low-power GPU (512 streaming processors across 8 Radeon R7 cores on the Kaveri A10-7850K chip) onto the same die as their traditional CPU architecture. These APUs can handle graphics on their own, run in Dual Graphics (CrossFire) mode with select cards, or stick to CPU-only tasks. Additionally, the graphics cores can be used to accelerate other workloads, such as neural networks for artificial intelligence.

 

AMD has gone through several generations of APUs, starting with an on-die cache to handle video frames of a limited size, where the CPU cores would pass data and instructions to the GPU cores through that shared cache. This is exactly how a CPU normally passes instructions and data to a graphics card over the PCIe bus, except the on-die cache and pathways are much shorter and have far higher bandwidth (at the severe cost of temperatures, power draw, and CPU multitasking being crippled while the CPU loads the graphics cores with instructions).

 

With Kaveri, AMD implemented what is called "unified memory," wherein the CPU only passes the GPU cores a set of RAM addresses to begin receiving instructions from, plus a location to store data for the frame buffer (image). This saves several trillion CPU clock cycles over the course of playing a game, but the CPU still has to govern when the GPU is allowed to access and manipulate memory. Furthermore, DDR3 memory is not high-bandwidth the way GDDR5 (traditional graphics card RAM) is. The best fix for this has been using DDR3-2400 or faster RAM with low CAS latency and tight timings, so the GPU cores can be fed as quickly as possible and the overhead of the CPU managing the GPU stays hidden. Unified memory allows both processors to access the same memory space, and it theoretically lets a Kaveri APU create a frame buffer of any size up to the RAM capacity supported by a given motherboard, assuming the GPU cores are clocked high enough to construct a frame every 1/60 of a second.
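
Conceptually, this pointer-passing model is what OpenCL 2.0 calls fine-grained shared virtual memory (SVM), which Kaveri supports. The following is a hedged sketch of that API (an illustration, not AMD's HSA runtime itself): the CPU allocates one buffer, hands the GPU the raw pointer, and reads the result back through that same pointer, with no copy in either direction. Error checks are omitted for brevity.

/* Illustrative OpenCL 2.0 fine-grained SVM sketch: CPU and GPU share one
 * pointer instead of copying buffers back and forth. Link with -lOpenCL. */
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void scale(__global float *data) {"
    "    data[get_global_id(0)] *= 2.0f;"
    "}";

int main(void) {
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);

    /* One allocation visible to both CPU and GPU -- the "set of RAM addresses"
     * handed to the GPU cores, rather than a staged copy. */
    size_t n = 1024;
    float *data = clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                             n * sizeof(float), 0);
    for (size_t i = 0; i < n; i++) data[i] = (float)i;  /* CPU writes directly */

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", &err);
    clSetKernelArgSVMPointer(k, 0, data);               /* pass the pointer itself */

    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(q);

    printf("data[10] = %.1f\n", data[10]);  /* CPU reads the GPU's result: 20.0 */
    clSVMFree(ctx, data);
    return 0;
}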

 

Please note there are many technical details related to microarchitecture and data passing which weren't included, but this should provide the basic idea for those not yet versed in HSA.

 

What is interesting now is that AMD's Carrizo APUs (the 2015-16 line) are slated to give the GPU cores schedulers of their own, independent of the CPU (i.e. the graphics cores can execute code with no need for CPU permission or pointing). The cores will still need to communicate where the boundaries of frames lie, but this eliminates much of the overhead remaining in the Kaveri line. Carrizo is also slated to support DDR4 memory, which brings lower voltage, better bandwidth, and higher clock rates (though initially worse CAS latency) and should improve graphics performance well beyond what independent code execution alone provides.

 

What is even more interesting is seeing Intel going down the same road. The 2016 Skylake architecture will implement unified memory for the integrated GPU cores, leaving Intel effectively where AMD is now. It is unknown how many streaming processors each of Intel's GPU cores accounts for; the AMD A10-7850K has 8 R7 graphics cores responsible for 512 streaming processors. Some have speculated the high-end Skylake chips will have 72, 80, or 120 graphics cores, vs. the 40/48/60 that will be aboard the Broadwell chips coming in Q4 2014/Q1 2015 for mobile and desktop parts respectively.

 

Intel's HD Graphics 4600 is nothing to sneeze at if you're a gamer on a budget, and it performs only slightly worse than AMD's onboard graphics head to head in most games (excluding those where AMD's Mantle API is enabled), so it's possible the streaming processor counts are quite similar.

 

Given that Intel is heading towards its own heterogeneous architecture, and that the step following Skylake will be to implement independent scheduling, what do we expect to see from both companies moving forward from 2015/16 in terms of GPGPU compute (and of course gaming graphics)? Does anyone know of another component of HSA which must be implemented after independent scheduling? Will Intel meet AMD at the ground floor as software companies begin seriously developing with heterogeneous systems in mind, will it fall behind, or will it dominate? I admit I am a greenhorn on the subject myself, but I wanted to put this much together to invite enlightened discussion.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


Well, we need to clarify a few things:

HSA is not mainly meant to bring the FP performance of a GPU to the CPU; it is meant to eliminate the current "bottleneck" of GPGPU.

The bottleneck is really multiple things. Latency: it takes a long time to send instructions and data from the CPU to the GPU. Copying: removing it is a big plus of HSA (with a slight minus attached). You will no longer be required to do huge amounts of copying to the coprocessors; instead we get unified memory (for integrated GPUs) and shared pointers (for discrete GPUs). However, this does raise an issue.

With x86 logic you cannot access the same data from two different cores (not without some kind of transactional memory technology like TSX, which raises other issues). Only one core can work with a piece of data at a time. So obviously you face problems that didn't exist when the data was simply copied over.

The way it works now is that you have one first-class citizen (the CPU) and all other processors are second-class citizens (coprocessors). Every time you need the coprocessors to do work, you have to go through the CPU (essentially through driver software). HSA will make the GPU a first-class citizen, so you will no longer need drivers to make the GPU cores work.
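
For contrast, here is a hedged sketch of that second-class-citizen path in plain OpenCL 1.x (again just an illustration): every dispatch goes through the CPU-side runtime, with an explicit copy to the device and another copy back; these two transfers are exactly what unified memory removes. Error checks omitted.

/* Illustrative OpenCL 1.x copy-based dispatch: the CPU stages data into a
 * device buffer, the GPU works on the copy, and results are copied back.
 * Link with -lOpenCL. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void scale(__global float *data) {"
    "    data[get_global_id(0)] *= 2.0f;"
    "}";

int main(void) {
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    size_t n = 1024;
    float host[1024];
    for (size_t i = 0; i < n; i++) host[i] = (float)i;

    /* Copy #1: host memory -> device buffer, mediated by the driver. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), host, 0, NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Copy #2: device buffer -> host memory, again through the driver. */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), host, 0, NULL, NULL);
    printf("host[10] = %.1f\n", host[10]);  /* 20.0 */
    clReleaseMemObject(buf);
    return 0;
}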

I have said it before: I doubt Intel will join the open HSA Foundation. But I have no doubt Intel will build a technology similar to HSA.

Intel will be using unified memory (it has more positives than negatives).

Intel and AMD are currently pushing VERY hard for the APU and the CPU with IGP to be the future for the regular consumer. You can see this especially on their Twitter accounts, where they constantly push APU and IGP content (e.g. "#ifitcangame" from AMD and "#iGameonIrisProGfx" from Intel). So I am currently worried about the discrete GPU market, and especially about Nvidia, as they don't really have anywhere to go.

Intel has said (if I remember correctly) it will improve its IGP by 40% per architecture update over the next couple of updates. The Intel HD 5000 (Iris Pro) features 40 execution units. With basic arithmetic we can estimate their future IGP core count:

(40 × 1.4) × 1.4 = 78.4, so ~80 IGP cores sounds more or less right.


If this rumor comes true, then the next Carrizo APUs will be something to consider. With HBM on chip, they would scream:

 

http://www.tweaktown.com/news/39034/amd-carrizo-apu-rumoured-to-use-28nm-process-and-stacked-dram/index.html

I doubt we will see stacked cache on Excavator; it seems too early. However, I think we will see it with AMD's next x86 architecture.

Also that is a VERY bad article.



HD 5400 is rumored to have 62 cores. 4600 currently has 32-40 depending on which chip you buy. I actually have no clue what Iris Pro 5200 in current mobile chips consists of.

As for transactional memory, I believe that is where Hybrid Memory Cube technology will come in. 320GB/s of bandwidth sounds like parallel accesses are the only thing that could possibly saturate it.




The 4600 has 20 EUs. Iris Pro has 40 EUs.

I haven't read anything about HMC providing a solution for transactional memory.



I didn't say HMC definitely will, but how the Hell else do you saturate that much bandwidth?




Bandwidth is not the issue nor is it the solution.

Transactional memory technology exists so multiple EUs can process the same data. We are talking about the same cache line, so something very specific, like the very element the EU is about to process.

This was one of the obstacles to successfully translating single-threaded work to multiple threads, and it is still a huge issue.



*facepalm* I didn't say bandwidth was an issue. I said: if the bandwidth is that wide, who the Hell could use it under the current paradigm? Answer: no one. No CPU is capable of saturating such a thing through serial accesses. That means it would have to support parallel accesses to make any sense as a new technology.



Are you really going to facepalm me after your previous comments?

 

However, you have avoided the issue (or we are not understanding each other at this point): you will run into the same problem we ran into when multithreading became a thing.

x86 rule number 1: first come, first served.

Let's take an example of this issue:
 

You have thread 1 doing heavy SIMD calculation on its quarter of the L2 cache when suddenly thread 2 wants to use one of those values. It will be denied.

Two threads cannot actively work with the same elements at once. You will need a transactional memory technology (like TSX, which some Haswell processors support), which is only a half-assed solution.

 

At certain points you cannot avoid this, and it is one of the big reasons why most things aren't naturally multi-threaded. Avoiding it was one of the few benefits of copying the data.

 

EDIT: Parallel accesses wouldn't fix the issue I have been describing. What we might see from higher bandwidth is less need for a bigger L3 cache.

 

EDIT EDIT: Not every piece of software can be successfully parallelized without a net performance loss.


@vm'N If you're not gonna quote, please remember to tag me so I can respond more readily. 

 

What I am saying is: if any memory has this kind of throughput, which bursty applications will not benefit much from, then it would only make sense for it to have a transactional quality for parallel accesses and manipulation. And x86 can allow two threads to work on the same data simultaneously by using hardware-level semaphores to avoid touching the exact same addresses at the same time (grant access to the space for reading, but selectively allow writes).
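
In software terms, that read-shared/write-exclusive scheme is a reader-writer lock. Here is a minimal sketch with POSIX threads (my illustration; nothing here is an actual hardware semaphore): any number of readers may hold the lock at once, while a writer gets exclusive access.

/* Reader-writer lock sketch: many threads may read `counter` concurrently,
 * but a writer excludes everyone. Compile with: gcc -pthread rwlock.c */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static long counter = 0;

static void *reader(void *arg) {
    pthread_rwlock_rdlock(&lock);     /* shared: readers don't block each other */
    printf("reader %ld sees %ld\n", (long)arg, counter);
    pthread_rwlock_unlock(&lock);
    return NULL;
}

static void *writer(void *arg) {
    (void)arg;
    pthread_rwlock_wrlock(&lock);     /* exclusive: blocks readers and writers */
    counter += 1;
    pthread_rwlock_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    pthread_create(&t[0], NULL, writer, NULL);
    for (long i = 1; i < 4; i++)
        pthread_create(&t[i], NULL, reader, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}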



@patrickjp93

TSX (Intel's transactional memory technology) and similar technologies offer a "half-assed" solution for working with the same elements.

They allow multiple threads to issue LOAD instructions on the same elements. The technology essentially checks the values before issuing a STORE to see if they have changed. If they have changed, it has to redo the entire operation (which is why it is a half-assed solution).

So to fully utilize these kinds of technologies you have to be very careful with your memory management and not lose the overview. It is by far a better solution to leave instructions that use the same elements on the same thread, since that avoids all these issues. This is why TSX and similar technologies are used in almost no software, and why people consider them near useless.

Without this kind of technology, a thread trying to issue a LOAD on a locked value (one already being processed by another thread) will create a memory violation. It will obviously break, forcing a pipeline flush.

This is why it is difficult to fully multithread certain workloads. At some points you have to work with the same elements, and there is no solution other than a single thread instead of multiple (unless you want a secondary thread stalled in the meantime, killing throughput).
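
The retry behaviour described above maps onto Intel's RTM intrinsics roughly as follows (a hedged sketch requiring a TSX-capable Haswell chip and gcc -mrtm; a production version would also read the fallback lock inside the transaction): the update runs speculatively, and if another core touches the same cache line, _xbegin() reports an abort and the work is redone under an ordinary mutex.

/* Sketch of a TSX/RTM transaction with a lock fallback. If another core
 * writes the same cache line mid-transaction, the hardware aborts and the
 * update is redone under a mutex -- i.e. the work is done twice. */
#include <immintrin.h>
#include <pthread.h>

static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;
static long shared_value = 0;

void increment(void) {
    unsigned status = _xbegin();      /* try to start a hardware transaction */
    if (status == _XBEGIN_STARTED) {
        shared_value += 1;            /* speculative: no lock taken */
        _xend();                      /* commit if no conflict occurred */
    } else {
        pthread_mutex_lock(&fallback);   /* aborted: redo conventionally */
        shared_value += 1;
        pthread_mutex_unlock(&fallback);
    }
}

int main(void) {
    increment();
    return 0;
}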

