
The Chiplet "Problem" with GPUs

Mira Yurizaki


 

UPDATE: I've edited this blog too many times because I always think I'm done, but then another idea comes up. *sigh* But I should be done now.

 

With AMD's semi-recent announcement of server processors using the so-called "chiplet" design, I thought it'd be a good idea to talk about how this could affect other processor types. People have pointed to GPUs as the next logical step, but I've been hesitant to jump on that idea, and this blog discusses why.

 

An Explanation: What is the Chiplet Design?

To understand the chiplet design, it's useful to understand how many processors are designed today. Typically they're designed using the so-called monolithic approach, where everything about the processor is built onto a single piece of silicon. The following is an example of a quad core design:

[Image: block diagram of a monolithic quad-core processor]

 

Everything entering or leaving the processor goes through an I/O section of the chip. Primarily this handles talking to main memory, but modern processors also have other I/O built in, like PCI Express lanes or display compositors (the integrated GPU would be considered a separate thing). From the I/O section, traffic moves onto a typically much faster internal bus where the processor's cores talk among each other and with the I/O.

 

What the chiplet design does is separate the cores and I/O section into different chips.

[Image: chiplet design with the cores and I/O on separate chips]

The advantage here is that if one part of the processor turns out defective, the entire processor doesn't have to be thrown away. But it doesn't stop there. As long as the I/O section can support more core chiplets, you can expand the design out to however many you want, something like this:

[Image: chiplet design scaled out to multiple core chiplets]

This is obviously a great design. You need more cores? Just throw on another chiplet!
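
To make that composition concrete, here's a minimal sketch of the idea (the class and method names are made up purely for illustration, not any vendor's actual design): an I/O die that owns the outside world, plus however many core chiplets you attach to it.

```python
# Minimal sketch of the chiplet idea: one I/O die, any number of core chiplets.
# Names are hypothetical and purely illustrative, not any vendor's real design.

class CoreChiplet:
    def __init__(self, cores: int):
        self.cores = cores

class IODie:
    """Owns memory/PCIe access and links the attached core chiplets together."""
    def __init__(self):
        self.chiplets = []

    def attach(self, chiplet: CoreChiplet):
        # "Need more cores? Just throw on another chiplet" -- as long as the
        # I/O die has links and bandwidth to spare for it.
        self.chiplets.append(chiplet)

    @property
    def total_cores(self) -> int:
        return sum(c.cores for c in self.chiplets)

package = IODie()
for _ in range(8):                      # e.g. an 8-chiplet server part
    package.attach(CoreChiplet(cores=8))
print(package.total_cores)              # 64
```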

 

So what's the problem with GPUs adopting this? It comes down to what each type of processor is expected to take care of, and their core designs reflect that.

 

A Comparison Between a CPU Core and a GPU Core

At the heart of any processor is the "core", which I will define as a unit containing a memory interface, a "front-end" with an instruction decoder and scheduler, and a "back-end" with the execution units. A CPU core tends to have a complicated front-end and a back-end with a small number of execution units, while a GPU core tends to have a simpler, smaller front-end with a much larger back-end. To put it visually:

 

[Image]

Block Diagram of an AMD Zen 1 CPU Core

 

[Image]

Block diagram of an AMD Fiji GPU core. Each "ACE" is a front-end unit and each "Shader Engine" is a back-end unit

 

They are designed this way because of the tasks they're expected to complete. A CPU is expected to handle an unpredictable mix of instructions from many different tasks, each touching a small amount of data, as fast as it can. A GPU is expected to perform a smaller set of instructions, specifically built and ordered, on a very large amount of data.
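
As a rough, toy-code illustration of that difference (this isn't how either chip actually executes anything), the GPU-style job is one simple operation applied to a huge pile of independent data, while the CPU-style job is a short, branchy chain of steps where each result feeds the next:

```python
import numpy as np

# GPU-style work: the same simple math applied to millions of independent values.
pixels = np.random.rand(1920 * 1080)
shaded = np.clip(pixels * 1.2 + 0.05, 0.0, 1.0)   # one operation, lots of data

# CPU-style work: little data, but branchy and data-dependent, so each step
# has to wait for the previous one and can take a different path every time.
def branchy_task(x: float, steps: int = 1000) -> float:
    for _ in range(steps):
        if x > 0.5:
            x = x * 0.75
        else:
            x = x + 0.3
    return x

result = branchy_task(0.1)
```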

 

From the previous section about chiplet design, you might be thinking to yourself: "Well, can't the Fiji GPU have the stuff on the left side (HBM + MC) and the right side (Multimedia Accelerators, Eyefinity, CrossFire XDMA, DMA, PCIe Bus Interface) separated into their own chips?" Well, let's take a look at what the Fiji die actually looks like (taken from https://www.guru3d.com/news-story/amd-radeon-r9-fiji-die-shot-photo.html):

 

 

[Image: AMD Fiji die shot]

 

The big block in the middle contains all of the ACEs, the Graphics Command Processor, and the Shader Engines from the block diagram. At a rough guess, it takes up about 72% of the die. Not only that, but aside from everything on the right side of the block diagram, this GPU still needs everything on the left side, namely all of the HBM and MC parts: something has to feed the main part of the GPU with data, and this is a hungry GPU. To put it another way, a two-chiplet design would look very similar to the dual-GPU, single-card designs of years past, like the Fiji-based Radeon Pro Duo:

[Image: the Radeon Pro Duo]

But Wouldn't Going to 7nm Solve This Issue?

While it's tempting to think that smaller nodes mean smaller dies, with GPUs, adding more execution units increases performance, because the work they solve is what's known as "embarrassingly parallel": it's trivial to split up across more units, since it's just more pixels per second to crunch. This isn't the case with the CPU, where instructions are almost never guaranteed to be orderly and predictable, the basic ingredient for parallel work. So while adding more transistors per CPU core hasn't always been viable, it has been for GPUs, and so the average die size of a GPU hasn't gone down as transistors have gotten smaller:

[Image]

Transistor count, die size, and fabrication process for the highest-end GPU of a generation for AMD GPUs (Data sourced from Wikipedia)
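
A toy model of why that transistor budget keeps going into more execution units rather than smaller dies (made-up per-pixel cost, purely illustrative): because every pixel is independent, doubling the unit count roughly halves the time per frame.

```python
# Toy throughput model: independent pixels split evenly across execution units.
PIXELS_PER_FRAME = 1920 * 1080
CYCLES_PER_PIXEL = 4                 # made-up cost per pixel

def frame_cycles(units: int) -> float:
    # Embarrassingly parallel: each unit simply takes pixels/units of the work.
    return (PIXELS_PER_FRAME / units) * CYCLES_PER_PIXEL

for units in (512, 1024, 2048, 4096):
    print(f"{units} units: {frame_cycles(units):,.0f} cycles per frame")
# Doubling the units roughly halves the frame time, so the extra transistors
# from a node shrink tend to become more units instead of a smaller die.
```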

 

Since AMD has had weird moments, let's take a look at its competitor, NVIDIA:

[Image]

Transistor count, die size, and fabrication process for the highest-end* GPU of a generation for NVIDIA GPUs (Data sourced from Wikipedia)

 

Notes:

  • G92 is considered its own generation due to appearing in two video card series
  • The GTX 280 and GTX 285 were both included because they're the same GPU with a die shrink
  • TITANs were not included, since the Ti versions are more recognizable and use the same GPU

 

But the trend is the same: the average die size for the GPUs has remained fairly level.

 

Unfortunately, transistor counts for CPUs aren't as straightforward as they are for GPUs. Over the years, processors have integrated more and more onto the die, so we can't fairly compare, say, an AMD Bulldozer transistor count to an AMD Ryzen transistor count, because Ryzen integrates extra features like more PCIe lanes and the entirety of what used to be the "northbridge", among other things. With that in mind, it's still nice to have some data to see where things have gone overall:

[Image]

Transistor count, die size, and fabrication process for various processors (Data from Wikipedia)

 

One just has to keep in mind that at various points, processors started integrating features that aren't related to the front-end, back-end, or memory interface, so from that point on the transistor count (and die size) actually devoted to the cores is lower than the raw numbers suggest.

 

How about separating the front-end from the back end?

This is a problem because the front-end needs to know how to allocate its resources, and those resources are the back-end. Separating them introduces latency from the increased distance, plus overhead from constantly having to find out what the back-end is up to. To put it another way: is it more efficient to have your immediate supervisor in a building across town, or in the same building you work in? On top of that, the front-end doesn't take up much space on the GPU anyway.

 

What About Making Smaller GPUs?

So instead of making large GPUs with a ton of execution units, why not build smaller GPUs and use those as the chiplets? As an example, let's take NVIDIA's GTX 1080:

[Image: GeForce GTX 1080 block diagram]

 

Compare this to the GTX 1050/1050 Ti (left) and the GT 1030 (right):

[Image: GeForce GTX 1050/1050 Ti block diagram]   [Image: GeForce GT 1030 block diagram]

 

With this, you could take away the memory and PCI Express controllers, move them to an I/O chip, and just duplicate the rest as many times as you want. Except what you've now built is essentially SLI, which has its own problems that need to be addressed.

 

The Problem with Multi-GPU Rendering

The idea of multi-GPU rendering is simple: break up the work equally and have each GPU work on the scene. If it's "embarrassingly" easy to break up the rendering task, wouldn't this be a good idea? Well, it depends on what's really being worked on. For example, let's take this scene:

[Image]

Approximate difficulty to render this scene: Green = Easy, Yellow = Medium, Red = Hard

 

The areas are color coded more or less to approximate the "difficulty" of rendering it. How would you divide this up evenly so that every GPU has an equal workload? Let's say we have four GPU chiplets.

 

  • Obviously splitting the scene into quadrants won't work, because one chiplet will be burdened by the large amount of red in the top right while another sits around doing almost nothing with the top left. And since you can't composite the final image until everything is done, the chiplet handling the top right becomes the bottleneck.
  • Another option is to have each chiplet render a whole frame in succession. This becomes more of a problem with more chiplets, since you can't render ahead very far, and this alternate-frame style of rendering is what causes microstuttering in multi-GPU systems.
  • Lastly, we could have each chiplet render the entire scene at a reduced resolution but offset slightly, or divvy the scene up by alternating pixels. This could minimize the workload imbalance, but something still has to composite the final image, and a lot of data could end up passing back and forth between the chiplets, possibly increasing bandwidth requirements more than necessary (the sketch after this list illustrates the difference in balance).
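
Here's a quick sketch of that balance problem (a hypothetical 8x8 cost map with an expensive top-right corner, nothing measured from a real renderer): a quadrant split leaves one chiplet with most of the work, while an alternating-pixel split comes out nearly even, at the cost of everyone touching the whole frame.

```python
import numpy as np

# Hypothetical per-pixel cost map: cheap sky top-left, expensive detail top-right.
H, W, CHIPLETS = 8, 8, 4
cost = np.ones((H, W))
cost[:4, :4] = 0.2                  # the "green" region
cost[:4, 4:] = 5.0                  # the "red" region

def quadrant_split(c):
    h, w = c.shape
    tiles = [c[:h//2, :w//2], c[:h//2, w//2:], c[h//2:, :w//2], c[h//2:, w//2:]]
    return [t.sum() for t in tiles]

def interleaved_split(c, n):
    flat = c.flatten()
    return [flat[i::n].sum() for i in range(n)]    # every n-th pixel per chiplet

print("quadrants:  ", quadrant_split(cost))        # one chiplet gets most of the work
print("interleaved:", interleaved_split(cost, CHIPLETS))  # nearly even shares
# The frame is only done when the slowest chiplet finishes, so the quadrant
# split is bottlenecked by whichever chiplet drew the expensive corner.
```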

This is also not including another aspect that GPUs have taken on lately: general compute tasks. And then there's the question of VR, which is sensitive to latency.

 

Ultimately the problem with graphics rendering is that it's time sensitive. CPU tasks often have the luxury of "it's done when it's done", and the pieces of data they work on are independent from beginning to end. Graphics rendering doesn't enjoy the same luxuries: it's "the sooner you get it done, the better", and everyone is writing to the same frame buffer.

 

What about DirectX 12 and Vulkan's multi-GPU support?

With the advent of DirectX 12 and (possibly) Vulkan adding effective multi-GPU support, we may be able to overcome the issues described above. However, that requires developer support, and not everyone is on board with either API. You may want them to be, but sad to say, a lot of game developers would probably rather worry about getting their game done than about optimizing it for performance.

 

Plus it would present issues for backwards compatibility. Up until this point, we've had games designed around the idea of a single GPU and only sometimes more than one. While some games may perform well enough on multiple GPUs, many others won't, and running those older games on a chiplet design may result in terrible performance. You could perhaps relieve this by using tools like NVIDIA Inspector to create custom SLI profiles, but doing that for every game would get old fast. Technology is supposed to make our lives better, and that certainly won't.

 

But Who Knows? Maybe We'll Get Something Yet

Only time will tell whether this design will work for GPUs, but I'm not entirely hopeful given the issues.

Comments

A note about this part of the blog:

Quote: "So while adding more transistors per CPU core hasn't always been viable..."

What this means is that in GPU land, you can get away with simply duplicating your basic execution units. In AMD terms, this is a stream processor. In NVIDIA terms, a CUDA core.

 

In CPUs, you can't duplicate the basic execution units, which are the ALU, AGU, and FPU, and expect a linear improvement in performance. Most of the per-core transistor count increase in CPUs over time is likely due to adding unrelated or semi-related features like SIMD processing.
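
One standard way to put numbers on that (a generic Amdahl's-law sketch, not a model of any specific CPU): if only a fraction p of the work can actually use the duplicated units, the overall speedup flattens out fast.

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the fraction of
# work that can use the extra units and n is how many units you duplicate.
def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 4, 8, 16):
    print(f"{n:2d} units: "
          f"GPU-like p=0.99 -> {speedup(0.99, n):.2f}x, "
          f"CPU-like p=0.50 -> {speedup(0.50, n):.2f}x")
# With mostly serial, branchy code (p around 0.5), duplicating ALUs/AGUs/FPUs
# stalls out near 2x no matter how many you add; with p near 1, as in pixel
# work, the gain keeps tracking the unit count.
```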


What if GPU rendering used ML/AI hardware (an ASIC, most likely) to perform object recognition on any given scene, and allocated more processing units/VRAM to the more complex objects in the scene (to speed up rendering)? Years ago, I was considering that idea over traditional rendering methods (due to the very issues you mentioned). Or maybe use a combination of that hypothetical technique with ray-casting or radiosity (as opposed to rasterisation)? Using a new method to divvy up the workload in a more logical manner could minimise the issues you mentioned in the last portion...

 

P.S. Just took a quick look at this as well.


Well, there is one potential solution to the GPU chiplet problem, but at this time we can't create a bus fast enough to emulate an on-die bus, which would make scaling the solution difficult.

 

Let's use the Nvidia GP104 as an example. For reference, the first image shows the GP104 layout, and the second shows the Streaming Multiprocessor layout:

[Image: GP104 block diagram]

[Image: GP104 Streaming Multiprocessor layout]


Looking at the Streaming Multiprocessor, we see that each "big core" has an instruction pipeline, some controller logic, a register file, and many processing cores. The supporting infrastructure around that contains some data and instruction caches, and some shared scratchpad memory, allowing a few "big cores" to be placed together.

The streaming multiprocessors are then grouped into clusters called GPCs. The GPC surrounds its streaming multiprocessors with some shared instruction pipelining and some task-specific compute resources. The GPCs are then repeated across the chip and glued together with some data caches and an instruction dispatcher. Finally, we have a multiplicity of memory interfaces and a single monolithic external bus controller.

There are two ways I can see this working; both require the assumption that we can build external buses with performance equivalent to on-die buses.

 

The first is to separate the GPCs into individual chips. All each one carries with it are its two memory controllers and a portion of the L2 cache. This requires building some external "Gigathread Engine" (instruction dispatcher) and piping it to all of the chips. All of the chips are still working off of the same instruction stream, and the same data in the same memory. With the assumption that the external buses are as fast as on-die buses, this is exactly equivalent to what is already happening on-die, but with an allowance for increasing the performance of the chip by adding more GPCs at assembly time instead of at fab time. The tradeoff is the number of traces required on the PCB, as well as an increase in cost for the same performance (each GPC needs its own physical packaging).

 

The second way trades some of the board complexity for some latency by adding a third level: a new layer of instruction dispatching (a "TeraThread Engine", perhaps?). Ostensibly, the TeraThread Engine would be identical in function to the GigaThread Engine, except that it would forward instructions to the GigaThread Engines instead of to individual GPCs. Doing this gives us the ability to put a multiplicity of the existing designs on a board, with two major tradeoffs: the first being slightly higher latency, and the second being a much more complicated main memory design.

Both of the cases above are logically identical to the way things are currently done, and could likely be pulled off without requiring any changes to the programming model. The reality of the situation, however, is that both of these designs rely on external bus speeds approaching those of on-die buses, which is just not realistic at this time.
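
For what it's worth, here's a rough structural sketch of that second approach in code form (the "TeraThread Engine" is the hypothetical top-level dispatcher proposed above, sitting over per-chip GigaThread-style dispatchers; the round-robin dispatch and every name here is invented for illustration):

```python
# Rough sketch of the two-level dispatch idea: a hypothetical top-level
# dispatcher forwards work to per-chip dispatchers, which feed their GPCs.

class GPC:
    def __init__(self, gpc_id: int):
        self.gpc_id = gpc_id
        self.queue = []             # thread blocks waiting on this GPC

    def submit(self, block):
        self.queue.append(block)

class GigaThreadEngine:
    """Per-chip dispatcher: spreads thread blocks across its local GPCs."""
    def __init__(self, gpcs):
        self.gpcs = gpcs
        self._next = 0

    def dispatch(self, block):
        self.gpcs[self._next % len(self.gpcs)].submit(block)
        self._next += 1

class TeraThreadEngine:
    """Hypothetical top-level dispatcher: forwards blocks to whole chips,
    adding one extra hop (and its latency) in front of every dispatch."""
    def __init__(self, chips):
        self.chips = chips
        self._next = 0

    def dispatch(self, block):
        self.chips[self._next % len(self.chips)].dispatch(block)
        self._next += 1

# Two chips of four GPCs each, fed from one top-level dispatcher.
chips = [GigaThreadEngine([GPC(i) for i in range(4)]) for _ in range(2)]
top = TeraThreadEngine(chips)
for block in range(16):
    top.dispatch(block)             # each block takes two dispatch hops
```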


The unicorn is getting a fast enough bus, but at the moment that's largely what it is: a unicorn.

 

The other question is how much of a benefit you're truly getting from this from a manufacturing standpoint. If we pull up some numbers on Wikipedia, we can find the following stats:

  • GT 1030
    • 384 SPUs, 24 TMUs, and 16 ROPs
    • 1.8 billion transistors
    • 74 mm^2 die
  • GTX 1050 Ti
    • 768 SPUs, 48 TMUs, and 16 ROPs
    • 3.3 billion transistors
    • 132 mm^2 die

Even though the GTX 1050 Ti is basically double a GT 1030, it's the more efficient design, since it uses fewer transistors and less die space than a straight doubling would suggest. Also note that nothing else is different between the two: both were designed with the same external bus and memory type. My math may be rough, but from the same amount of silicon that gives you 66 GT 1030s, you can make 37 GTX 1050 Tis. You can lose 4 of those 1050 Tis and still break even against 33 glued-together pairs of GT 1030s, which works out to an 89% yield. We also can't assume the GT 1030 has a 100% yield rate, so for the sake of simplicity say it too suffers from an 89% yield. That leaves about 58 good GT 1030 dies (29 pairs), which further increases the number of bad 1050 Tis you can tolerate before dropping below the break-even point. In other words, the GTX 1050 Ti's yield can fall to roughly 78% before it stops making sense to build it rather than gluing two GT 1030s together.
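
Laying that arithmetic out step by step (same die sizes as quoted above from Wikipedia; the 89% figure is the simplifying yield assumption from this comment):

```python
# Reworking the area/yield comparison with the die sizes quoted above.
GT1030_AREA    = 74.0    # mm^2 (1.8 billion transistors)
GTX1050TI_AREA = 132.0   # mm^2 (3.3 billion transistors)

silicon = 66 * GT1030_AREA                    # material for 66 GT 1030 dies
ti_candidates = int(silicon // GTX1050TI_AREA)
print(ti_candidates)                          # -> 37 GTX 1050 Ti dies instead

pairs_of_1030s = 66 // 2                      # 33 "two glued 1030s" equivalents
print(ti_candidates - pairs_of_1030s)         # -> 4 spare Tis; 33/37 ~= 89% to break even

# If the GT 1030 itself only yields 89%, about 58 dies (29 pairs) are good,
# so the 1050 Ti can drop to roughly 29/37 ~= 78% yield before pairing wins.
gt1030_good = int(66 * 0.89)                  # 58
print(gt1030_good, round((gt1030_good // 2) / ti_candidates, 2))
```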

 

We also have to consider what I mentioned in the blog post: a GPU for gaming is working on a time-sensitive task. Benchmarking something like POV-Ray or x264 on a CPU is fine because we don't really care about the order in which the final output is assembled (more or less), nor how long it takes (though the faster the better). On a GPU, the order in which the final output is assembled does matter, and we do care how long it takes. I'm not quite sure how much added latency would affect overall graphics performance, and the only thing SLI shares is frame buffer data (I'm not sure what NVLink shares).

 

But overall, until we solve the two biggest issues plaguing multi-GPU setups, namely that memory pools don't combine and that workload distribution is hard, I don't think chiplets will be anything more than a fancier way of doing multi-GPU setups.
