The Chiplet "Problem" with GPUs
UPDATE: I've edited this blog too many times because I always think I'm done, but then another idea comes up. *sigh* But I should be done now.
With AMD's semi-recent announcement of their server processors using the so-called "Chiplet" design, I thought it'd be a good idea to talk about how this could affect other processor types. People have pointed to GPUs as the next logical step, but I've been hesitant to jump on that idea, and this post discusses why.
An Explanation: What is the Chiplet Design?
To understand the chiplet design, it's useful to understand how many processors are designed today. Typically they're designed using the so-called monolithic approach, where everything about the processor is built onto a single piece of silicon. The following is an example of a quad core design:
Everything going to the processor has to go through an I/O section of the chip. Primarily this handles talking to main memory, but modern processors also have other I/O built in, like PCI Express lanes or display compositors (the GPU would be considered a separate thing). From there, data travels over a typically much faster internal bus that the cores use to talk to each other and to the I/O section.
What the chiplet design does is separate the cores and I/O section into different chips.
The advantage here is that if one part of the processor comes out of the fab broken, the entire processor doesn't have to be thrown away. But it doesn't stop there. As long as the I/O section can support more and more core chiplets, you can scale the design out as far as you want, something like this:
This is obviously a great design. You need more cores? Just throw on another chiplet!
So what's the problem with GPUs adopting this? It comes down to what each type of processor is expected to handle, and their core designs reflect that.
A Comparison Between a CPU Core and a GPU Core
At the heart of any processing unit is the "core", which I will define as a unit containing a memory interface, a "front-end" containing an instruction decoder and scheduler, and a "back-end" containing the execution units. A CPU core tends to have a complicated front-end and a back-end with a smaller number of execution units, while a GPU tends to have a simpler or smaller front-end with a much larger back-end. To put it visually:
Block Diagram of an AMD Zen 1 CPU Core
Block Diagram of an AMD Fiji GPU Core. Each "ACE" is a Front-End Unit and Each "Shader Engine" is a Back-End Unit
They are designed this way because of the tasks they're expected to complete. A CPU is expected to take a more or less random stream of instructions from various tasks and execute it as best it can on small amounts of data. A GPU is expected to perform a smaller set of instructions, specifically built and ordered, on a large amount of data.
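To make that contrast concrete, here's a minimal sketch of the kind of work a GPU gets fed: one tiny, fixed operation applied independently to millions of pixels. It's plain C++ with threads standing in for execution units; the "brighten" operation and the frame size are made up for illustration, not anyone's real shader.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical "shader": one small, fixed operation applied to every pixel.
static void brighten(uint32_t* pixels, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) {
        uint32_t p = pixels[i];
        uint32_t r = std::min<uint32_t>(((p >> 16) & 0xFF) + 16, 255);
        uint32_t g = std::min<uint32_t>(((p >> 8) & 0xFF) + 16, 255);
        uint32_t b = std::min<uint32_t>((p & 0xFF) + 16, 255);
        pixels[i] = (r << 16) | (g << 8) | b;
    }
}

int main() {
    std::vector<uint32_t> framebuffer(1920 * 1080, 0x00404040u);

    // No pixel depends on any other, so the work divides evenly across
    // however many workers exist (threads here, execution units on a GPU).
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const size_t chunk = framebuffer.size() / workers;

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        size_t begin = w * chunk;
        size_t end = (w + 1 == workers) ? framebuffer.size() : begin + chunk;
        pool.emplace_back(brighten, framebuffer.data(), begin, end);
    }
    for (auto& t : pool) t.join();
    return 0;
}
```

The loop body never changes and no pixel depends on another, so throwing more workers at it scales almost perfectly; a CPU's typical instruction stream offers no such guarantee.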
From the previous section about chiplet design, you might be thinking to yourself: "Well can't the Fiji GPU core have the stuff on the left side (HBM + MC) and the right side (Multimedia Accelerators, Eyefinity, CrossFire XDMA, DMA, PCIe Bus Interface) separated into its own chip?" Well let's take a look at what the Fiji GPU die looks like (taken from https://www.guru3d.com/news-story/amd-radeon-r9-fiji-die-shot-photo.html)
The big block in the middle contains all of the ACEs, the Graphics Command Processor, and the Shader Engines from the block diagram. At a rough guess, this takes up about 72% of the die itself. Not only that, aside from everything on the right side of the block diagram, this GPU core still needs everything from the left side, namely all of the HBM and MC (memory controller) parts. Something needs to feed the main bit of the GPU with data, and this is a hungry GPU! To put it another way, a two-chiplet design would look very similar to the two-GPU, single-card designs of years past, such as the Radeon Pro Duo:
But Wouldn't Going to 7nm Solve This Issue?
While it's tempting to think that smaller nodes mean smaller dies, the thing with GPUs is that adding more execution units increases performance, because the work they solve is what is known as embarrassingly parallel: it's trivial to split the work up across more units, since it's just more pixels per second to crunch. This isn't the case with the CPU, where instructions are almost never guaranteed to be orderly and predictable, the basic ingredient for parallel tasks. So while adding more transistors per CPU core hasn't always been viable, it has been for GPUs, and the average die size of a GPU hasn't gone down as transistors have gotten smaller:
Transistor count, die size, and fabrication process for the highest-end GPU of a generation for AMD GPUs (Data sourced from Wikipedia)
Since AMD has had weird moments, let's take a look at its competitor, NVIDIA:
Transistor count, die size, and fabrication process for the highest-end* GPU of a generation for NVIDIA GPUs (Data sourced from Wikipedia)
Notes:
- G92 is considered its own generation due to being used in two video card series
- GTX 280 and GTX 285 were included due to being the same GPU but with a die shrink
- TITANs were not included since the Ti versions are more recognizable and use the same GPU
But the trend is the same: the average die size for the GPUs has remained fairly level.
Unfortunately, transistor count for processors isn't as straightforward as it is for GPUs. Over the years, processors have integrated more and more things into the die. So we can't even compare, say, an AMD Bulldozer transistor count to an AMD Ryzen transistor count, because Ryzen integrates more features like extra PCIe lanes and the entirety of what used to be the "Northbridge", among other things. With that caveat in mind, it's still nice to see some data on where things have gone overall:
Transistor count, die size, and fabrication process for various processors (Data from Wikipedia)
One just has to keep in mind that at various points, processors started to integrate features unrelated to the front-end, back-end, or memory interface, so for processors from that point on, the transistor count (and thus die size) attributable to the cores themselves is lower than the headline numbers suggest.
How about separating the front-end from the back-end?
This is a problem because the front-end needs to know how to allocate its resources, and those resources are the back-end. Separating the two introduces latency from the increased distance, plus overhead from the constant need to figure out what the other side is doing. To put it another way: is it more efficient to have your immediate supervisor in a building across town, or in the same building you work in? Besides, the front-end doesn't take up much space on the GPU anyway.
What About Making Smaller GPUs?
So instead of making large GPUs with a ton of execution units, why not build smaller GPUs and use those as the chiplets? As an example, let's take NVIDIA's GTX 1080:
Compare this to the GTX 1050/1050 Ti (left) and the GT 1030 (right):
With this, you could take away the memory and PCI Express controllers, move them to an I/O chip, and just duplicate the rest as many times as you want. Except now you've essentially recreated SLI, which has its own problems that need to be addressed.
The Problem with Multi-GPU Rendering
The idea of multi-GPU rendering is simple: break up the work equally and have each GPU work on the scene. If it's "embarrassingly" easy to break up the rendering task, wouldn't this be a good idea? Well, it depends on what's really being worked on. For example, let's take this scene:
Approximate difficulty to render this scene: Green = Easy, Yellow = Medium, Red = Hard
The areas are color-coded to roughly approximate the "difficulty" of rendering them. How would you divide this up so that every GPU has an equal workload? Let's say we have four GPU chiplets.
- Obviously, splitting this scene up into quadrants won't work, because one chiplet will be burdened by the large amount of red in the top right while another sits around doing almost nothing handling the top left. And because you can't composite the final image until everything is done, the chiplet handling the top-right portion will be the bottleneck.
- Another option is to have the chiplets each work on a whole frame in succession (alternate frame rendering). But this gets worse with more chiplets, since you can't render ahead very far, and this style of rendering is what causes microstuttering in multi-GPU systems.
- Lastly, we could have each chiplet render the entire scene at a reduced resolution but offset a bit, or divvy up the scene by, say, alternating pixels (see the sketch after this list). This could minimize workload imbalance, but someone still has to composite the final image, and there could be a lot of data passing back and forth between the chiplets, possibly increasing bandwidth requirements more than necessary.
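To make that last option concrete, here's a rough sketch of an interleaved split and the compositing pass that still has to happen afterward. It's plain C++ with made-up numbers (four hypothetical chiplets, a 1920x1080 frame) and a stand-in for the actual shading work; it's not how any real driver does it.

```cpp
#include <cstdint>
#include <vector>

constexpr int kChiplets = 4;     // hypothetical chiplet count
constexpr int kWidth    = 1920;
constexpr int kHeight   = 1080;

// Interleaved ownership: adjacent pixels land on different chiplets,
// which tends to spread the "red" (expensive) areas across all of them.
int owner(int x, int y) {
    return (x + y) % kChiplets;
}

int main() {
    // Each chiplet shades only the pixels it owns, into its own local buffer.
    std::vector<std::vector<uint32_t>> partial(
        kChiplets, std::vector<uint32_t>(kWidth * kHeight, 0));

    for (int y = 0; y < kHeight; ++y) {
        for (int x = 0; x < kWidth; ++x) {
            // Stand-in for the real per-pixel shading work.
            partial[owner(x, y)][y * kWidth + x] = 0xFF202020u;
        }
    }

    // Someone still has to gather every chiplet's output into one frame
    // buffer -- all of that data crosses whatever link connects the chiplets.
    std::vector<uint32_t> framebuffer(kWidth * kHeight);
    for (int y = 0; y < kHeight; ++y)
        for (int x = 0; x < kWidth; ++x)
            framebuffer[y * kWidth + x] = partial[owner(x, y)][y * kWidth + x];

    return 0;
}
```

Even in this toy version the split itself evens out nicely, but the final gather touches every pixel a second time; that extra traffic between chiplets is exactly the cost a monolithic GPU never pays.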
None of this even touches another aspect that GPUs have taken on lately: general compute tasks. And then there's the question of VR, which is especially sensitive to latency.
Ultimately, the problem with graphics rendering is that it's time-sensitive. Tasks for CPUs often have the luxury of "it's done when it's done", and the pieces of data they work on are independent from beginning to end. Graphics rendering doesn't enjoy the same luxuries: it's "the sooner you get it done, the better" and "everyone's writing to the same frame buffer".
What about DirectX 12 and Vulkan's multi-GPU support?
With the advent of DirectX 12 and (possibly) Vulkan adding effective multi-GPU support, we may be able to overcome the issues described above. However, that requires developer support, and not everyone's on board with either API. You may want them to be, but a lot of game developers would probably rather focus on getting their game done than on optimizing it for performance, sad to say.
Plus, it would present issues for backwards compatibility. Up until this point, we've had games designed around the idea of a single GPU and only sometimes more than one. And while some games may perform well enough on multiple GPUs, many others won't, and running those older games on a chiplet design may result in terrible performance. You could perhaps relieve this by using tools like NVIDIA Inspector to create custom SLI profiles, but doing that for every game would get old fast. Technology is supposed to make our lives easier, and that certainly wouldn't.
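For a sense of what "developer support" means in practice, here's a minimal sketch of the very first step an engine would take under Vulkan's explicit multi-GPU model: asking which physical devices can be driven together as one logical device. This assumes a Vulkan 1.1 loader and skips most error handling; everything after this point, splitting the work, synchronizing it, and compositing the result, is also on the developer.

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    // Device groups are core in Vulkan 1.1, so request that version.
    VkApplicationInfo appInfo{};
    appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    appInfo.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo createInfo{};
    createInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    createInfo.pApplicationInfo = &appInfo;

    VkInstance instance;
    if (vkCreateInstance(&createInfo, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "couldn't create a Vulkan 1.1 instance\n");
        return 1;
    }

    // Ask the loader which physical devices belong to the same group.
    // GPUs (or, hypothetically, chiplets) that show up in one group can be
    // bound into a single logical device and addressed per-frame or per-draw.
    uint32_t groupCount = 0;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, nullptr);

    std::vector<VkPhysicalDeviceGroupProperties> groups(groupCount);
    for (auto& g : groups)
        g.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_GROUP_PROPERTIES;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, groups.data());

    for (uint32_t i = 0; i < groupCount; ++i)
        std::printf("device group %u: %u physical device(s)\n",
                    i, groups[i].physicalDeviceCount);

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```

Every game (or engine) has to opt in at roughly this level and then decide for itself how to divide frames, which is exactly the developer effort the paragraph above is worried about.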
But Who Knows? Maybe We'll Get Something Yet
Only time will tell whether this design will work for GPUs, but I'm not entirely hopeful given the issues above.