
The Chiplet "Problem" with GPUs

Mira Yurizaki


 

UPDATE: I've edited this blog too many times because I always think I'm done, but then another idea comes up. *sigh* But I should be done now.

 

With AMD's semi-recent announcement of server processors using the so-called "chiplet" design, I thought it'd be a good idea to talk about how this could affect other processor types. People have pointed to GPUs as the next logical step, but I've been hesitant to jump on that idea, and this blog discusses why.

 

An Explanation: What is the Chiplet Design?

To understand the chiplet design, it's useful to understand how many processors are designed today. Typically they're designed using the so-called monolithic approach, where everything about the processor is built onto a single piece of silicon. The following is an example of a quad core design:

[Image: block diagram of a monolithic quad-core processor]

 

Everything entering or leaving the processor goes through an I/O section of the chip. Primarily this handles talking to main memory, but modern processors also have other I/O built in, like PCI Express lanes or display compositors (the integrated GPU would be considered a separate thing). From the I/O section, traffic moves onto a typically much faster internal bus where the processor's cores talk among each other and with the I/O.

 

What the chiplet design does is separate the cores and I/O section into different chips.

[Image: chiplet design with the cores and I/O on separate chips]

The advantage here is that if one part of the processor turns out defective, the entire processor doesn't have to be thrown away. But it doesn't stop there. As long as the I/O section can support more core chiplets, you can expand the design out to however many you want, something like this:

[Image: chiplet design scaled out to multiple core chiplets]

This is obviously a great design. You need more cores? Just throw on another chiplet!
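
To make that composition concrete, here's a minimal sketch of the idea (the class and method names are made up purely for illustration, not any vendor's actual design): an I/O die that owns the outside world, plus however many core chiplets you attach to it.

```python
# Minimal sketch of the chiplet idea: one I/O die, any number of core chiplets.
# Names are hypothetical and purely illustrative, not any vendor's real design.

class CoreChiplet:
    def __init__(self, cores: int):
        self.cores = cores

class IODie:
    """Owns memory/PCIe access and links the attached core chiplets together."""
    def __init__(self):
        self.chiplets = []

    def attach(self, chiplet: CoreChiplet):
        # "Need more cores? Just throw on another chiplet" -- as long as the
        # I/O die has links and bandwidth to spare for it.
        self.chiplets.append(chiplet)

    @property
    def total_cores(self) -> int:
        return sum(c.cores for c in self.chiplets)

package = IODie()
for _ in range(8):                      # e.g. an 8-chiplet server part
    package.attach(CoreChiplet(cores=8))
print(package.total_cores)              # 64
```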

 

So what's the problem with GPUs adopting this? It comes down to what each type of processor is expected to take care of, and their core designs reflect that.

 

A Comparison Between a CPU Core and a GPU Core

At the heart of any processor is the "core", which I will define as a unit containing a memory interface, a "front-end" with an instruction decoder and scheduler, and a "back-end" with the execution units. A CPU core tends to have a complicated front-end and a back-end with a small number of execution units, while a GPU core tends to have a simpler, smaller front-end with a much larger back-end. To put it visually:

 

[Image]

Block Diagram of an AMD Zen 1 CPU Core

 

[Image]

Block diagram of an AMD Fiji GPU core. Each "ACE" is a front-end unit and each "Shader Engine" is a back-end unit

 

They are designed this way because of the tasks they're expected to complete. A CPU is expected to handle an unpredictable mix of instructions from many different tasks, each touching a small amount of data, as fast as it can. A GPU is expected to perform a smaller set of instructions, specifically built and ordered, on a very large amount of data.
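
As a rough, toy-code illustration of that difference (this isn't how either chip actually executes anything), the GPU-style job is one simple operation applied to a huge pile of independent data, while the CPU-style job is a short, branchy chain of steps where each result feeds the next:

```python
import numpy as np

# GPU-style work: the same simple math applied to millions of independent values.
pixels = np.random.rand(1920 * 1080)
shaded = np.clip(pixels * 1.2 + 0.05, 0.0, 1.0)   # one operation, lots of data

# CPU-style work: little data, but branchy and data-dependent, so each step
# has to wait for the previous one and can take a different path every time.
def branchy_task(x: float, steps: int = 1000) -> float:
    for _ in range(steps):
        if x > 0.5:
            x = x * 0.75
        else:
            x = x + 0.3
    return x

result = branchy_task(0.1)
```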

 

From the previous section about chiplet design, you might be thinking to yourself: "Well, can't the Fiji GPU have the stuff on the left side (HBM + MC) and the right side (Multimedia Accelerators, Eyefinity, CrossFire XDMA, DMA, PCIe Bus Interface) separated into their own chips?" Well, let's take a look at what the Fiji die actually looks like (taken from https://www.guru3d.com/news-story/amd-radeon-r9-fiji-die-shot-photo.html):

 

 

[Image: AMD Fiji die shot]

 

The big block in the middle contains all of the ACEs, the Graphics Command Processor, and the Shader Engines from the block diagram. At a rough guess, it takes up about 72% of the die. Not only that, but aside from everything on the right side of the block diagram, this GPU still needs everything on the left side, namely all of the HBM and MC parts: something has to feed the main part of the GPU with data, and this is a hungry GPU. To put it another way, a two-chiplet design would look very similar to the dual-GPU, single-card designs of years past, like the Fiji-based Radeon Pro Duo:

[Image: the Radeon Pro Duo]

But Wouldn't Going to 7nm Solve This Issue?

While it's tempting to think that smaller nodes mean smaller dies, with GPUs, adding more execution units increases performance, because the work they solve is what's known as "embarrassingly parallel": it's trivial to split up across more units, since it's just more pixels per second to crunch. This isn't the case with the CPU, where instructions are almost never guaranteed to be orderly and predictable, the basic ingredient for parallel work. So while adding more transistors per CPU core hasn't always been viable, it has been for GPUs, and so the average die size of a GPU hasn't gone down as transistors have gotten smaller:

[Image]

Transistor count, die size, and fabrication process for the highest-end GPU of a generation for AMD GPUs (Data sourced from Wikipedia)
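
A toy model of why that transistor budget keeps going into more execution units rather than smaller dies (made-up per-pixel cost, purely illustrative): because every pixel is independent, doubling the unit count roughly halves the time per frame.

```python
# Toy throughput model: independent pixels split evenly across execution units.
PIXELS_PER_FRAME = 1920 * 1080
CYCLES_PER_PIXEL = 4                 # made-up cost per pixel

def frame_cycles(units: int) -> float:
    # Embarrassingly parallel: each unit simply takes pixels/units of the work.
    return (PIXELS_PER_FRAME / units) * CYCLES_PER_PIXEL

for units in (512, 1024, 2048, 4096):
    print(f"{units} units: {frame_cycles(units):,.0f} cycles per frame")
# Doubling the units roughly halves the frame time, so the extra transistors
# from a node shrink tend to become more units instead of a smaller die.
```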

 

Since AMD has had weird moments, let's take a look at its competitor, NVIDIA:

[Image]

Transistor count, die size, and fabrication process for the highest-end* GPU of a generation for NVIDIA GPUs (Data sourced from Wikipedia)

 

Notes:

  • G92 is considered its own generation due to appearing in two video card series
  • The GTX 280 and GTX 285 were both included because they're the same GPU with a die shrink
  • TITANs were not included, since the Ti versions are more recognizable and use the same GPU

 

But the trend is the same: the average die size for the GPUs has remained fairly level.

 

Unfortunately, transistor counts for CPUs aren't as straightforward as they are for GPUs. Over the years, processors have integrated more and more onto the die, so we can't fairly compare, say, an AMD Bulldozer transistor count to an AMD Ryzen transistor count, because Ryzen integrates extra features like more PCIe lanes and the entirety of what used to be the "northbridge", among other things. With that in mind, it's still nice to have some data to see where things have gone overall:

[Image]

Transistor count, die size, and fabrication process for various processors (Data from Wikipedia)

 

One just has to keep in mind that at various points, processors started integrating features that aren't related to the front-end, back-end, or memory interface, so from that point on the transistor count (and die size) actually devoted to the cores is lower than the raw numbers suggest.

 

How about separating the front-end from the back end?

This is a problem because the front-end needs to know how to allocate its resources, and those resources are the back-end. Separating them introduces latency from the increased distance, plus overhead from constantly having to find out what the back-end is up to. To put it another way: is it more efficient to have your immediate supervisor in a building across town, or in the same building you work in? On top of that, the front-end doesn't take up much space on the GPU anyway.

 

What About Making Smaller GPUs?

So instead of making large GPUs with a ton of execution units, why not build smaller GPUs and use those as the chiplets? As an example, let's take NVIDIA's GTX 1080:

[Image: GeForce GTX 1080 block diagram]

 

Compare this to the GTX 1050/1050 Ti (left) and the GT 1030 (right):

[Image: GeForce GTX 1050/1050 Ti block diagram]   [Image: GeForce GT 1030 block diagram]

 

With this, you could take away the memory and PCI Express controllers, move them to an I/O chip, and just duplicate the rest as many times as you want. Except what you've now built is essentially SLI, which has its own problems that need to be addressed.

 

The Problem with Multi-GPU Rendering

The idea of multi-GPU rendering is simple: break up the work equally and have each GPU work on the scene. If it's "embarrassingly" easy to break up the rendering task, wouldn't this be a good idea? Well, it depends on what's really being worked on. For example, let's take this scene:

[Image]

Approximate difficulty to render this scene: Green = Easy, Yellow = Medium, Red = Hard

 

The areas are color coded more or less to approximate the "difficulty" of rendering it. How would you divide this up evenly so that every GPU has an equal workload? Let's say we have four GPU chiplets.

 

  • Obviously splitting the scene into quadrants won't work, because one chiplet will be burdened by the large amount of red in the top right while another sits around doing almost nothing with the top left. And since you can't composite the final image until everything is done, the chiplet handling the top right becomes the bottleneck.
  • Another option is to have each chiplet render a whole frame in succession. This becomes more of a problem with more chiplets, since you can't render ahead very far, and this alternate-frame style of rendering is what causes microstuttering in multi-GPU systems.
  • Lastly, we could have each chiplet render the entire scene at a reduced resolution but offset slightly, or divvy the scene up by alternating pixels. This could minimize the workload imbalance, but something still has to composite the final image, and a lot of data could end up passing back and forth between the chiplets, possibly increasing bandwidth requirements more than necessary (the sketch after this list illustrates the difference in balance).
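
Here's a quick sketch of that balance problem (a hypothetical 8x8 cost map with an expensive top-right corner, nothing measured from a real renderer): a quadrant split leaves one chiplet with most of the work, while an alternating-pixel split comes out nearly even, at the cost of everyone touching the whole frame.

```python
import numpy as np

# Hypothetical per-pixel cost map: cheap sky top-left, expensive detail top-right.
H, W, CHIPLETS = 8, 8, 4
cost = np.ones((H, W))
cost[:4, :4] = 0.2                  # the "green" region
cost[:4, 4:] = 5.0                  # the "red" region

def quadrant_split(c):
    h, w = c.shape
    tiles = [c[:h//2, :w//2], c[:h//2, w//2:], c[h//2:, :w//2], c[h//2:, w//2:]]
    return [t.sum() for t in tiles]

def interleaved_split(c, n):
    flat = c.flatten()
    return [flat[i::n].sum() for i in range(n)]    # every n-th pixel per chiplet

print("quadrants:  ", quadrant_split(cost))        # one chiplet gets most of the work
print("interleaved:", interleaved_split(cost, CHIPLETS))  # nearly even shares
# The frame is only done when the slowest chiplet finishes, so the quadrant
# split is bottlenecked by whichever chiplet drew the expensive corner.
```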

This is also not including another aspect that GPUs have taken on lately: general compute tasks. And then there's the question of VR, which is sensitive to latency.

 

Ultimately the problem with graphics rendering is that it's time sensitive. CPU tasks often have the luxury of "it's done when it's done", and the pieces of data they work on are independent from beginning to end. Graphics rendering doesn't enjoy the same luxuries: it's "the sooner you get it done, the better", and everyone is writing to the same frame buffer.

 

What about DirectX 12 and Vulkan's multi-GPU support?

With the advent of DirectX 12 and (possibly) Vulkan adding effective multi-GPU support, we may be able to overcome the issues described above. However, that requires developer support, and not everyone is on board with either API. You may want them to be, but sad to say, a lot of game developers would probably rather worry about getting their game done than about optimizing it for performance.

 

Plus it would present issues for backwards compatibility. Up until this point, we've had games designed around the idea of a single GPU and only sometimes more than one. While some games may perform well enough on multiple GPUs, many others won't, and running those older games on a chiplet design may result in terrible performance. You could perhaps relieve this by using tools like NVIDIA Inspector to create custom SLI profiles, but doing that for every game would get old fast. Technology is supposed to make our lives better, and that certainly won't.

 

But Who Knows? Maybe We'll Get Something Yet

Only time will tell whether this design will work for GPUs, but I'm not entirely hopeful given the issues.

Comments

A note about this part of the blog:

Quote: "So while adding more transistors per CPU core hasn't always been viable..."

What this means is that in GPU land, you can get away with simply duplicating your basic execution units. In AMD terms, this is a stream processor. In NVIDIA terms, a CUDA core.

 

In CPUs, you can't duplicate the basic execution units, which are the ALU, AGU, and FPU, and expect a linear improvement in performance. Most of the per-core transistor count increase in CPUs over time is likely due to adding unrelated or semi-related features like SIMD processing.
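
One standard way to put numbers on that (a generic Amdahl's-law sketch, not a model of any specific CPU): if only a fraction p of the work can actually use the duplicated units, the overall speedup flattens out fast.

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the fraction of
# work that can use the extra units and n is how many units you duplicate.
def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 4, 8, 16):
    print(f"{n:2d} units: "
          f"GPU-like p=0.99 -> {speedup(0.99, n):.2f}x, "
          f"CPU-like p=0.50 -> {speedup(0.50, n):.2f}x")
# With mostly serial, branchy code (p around 0.5), duplicating ALUs/AGUs/FPUs
# stalls out near 2x no matter how many you add; with p near 1, as in pixel
# work, the gain keeps tracking the unit count.
```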


What if GPU rendering used ML/AI hardware (an ASIC, most likely) to perform object recognition on any given scene, and allocated more processing units/VRAM to the more complex objects in the scene (to speed up rendering)? Years ago, I was considering that idea over traditional rendering methods (due to the very issues you mentioned). Or maybe use a combination of that hypothetical technique with ray-casting or radiosity (as opposed to rasterisation)? Using a new method to divvy up the workload in a more logical manner could minimise the issues you mentioned in the last portion...

 

P.S. Just took a quick look at this as well.


Well, there is one potential solution to the GPU chiplet problem, but at this time we can't create a bus fast enough to emulate an on-die bus, which would make scaling the solution difficult.

 

Let's use the Nvidia GP104 as an example. For reference, the first image shows the GP104 layout, and the second shows the Streaming Multiprocessor layout:

[Image: GP104 block diagram]

[Image: GP104 Streaming Multiprocessor layout]


Looking at the Streaming Multiprocessor, we see that each "big core" has an instruction pipeline, some controller logic, a register file, and many processing cores. The supporting infrastructure around that contains some data and instruction caches, and some shared scratchpad memory, allowing a few "big cores" to be placed together.

The streaming multiprocessors are then grouped into clusters called GPCs. The GPC surrounds its streaming multiprocessors with some shared instruction pipelining and some task-specific compute resources. The GPCs are then repeated across the chip and glued together with some data caches and an instruction dispatcher. Finally, we have a multiplicity of memory interfaces and a single monolithic external bus controller.

There are two ways I can see this working; both require the assumption that we can build external buses with performance equivalent to on-die buses.

 

The first is to separate the GPCs into individual chips. All each one carries with it are its two memory controllers and a portion of the L2 cache. This requires building some external "Gigathread Engine" (instruction dispatcher) and piping it to all of the chips. All of the chips are still working off of the same instruction stream, and the same data in the same memory. With the assumption that the external buses are as fast as on-die buses, this is exactly equivalent to what is already happening on-die, but with an allowance for increasing the performance of the chip by adding more GPCs at assembly time instead of at fab time. The tradeoff is the number of traces required on the PCB, as well as an increase in cost for the same performance (each GPC needs its own physical packaging).

 

The second way trades some of the board complexity for some latency by adding a third level: a new layer of instruction dispatching (a "TeraThread Engine", perhaps?). Ostensibly, the TeraThread Engine would be identical in function to the GigaThread Engine, except that it would forward instructions to the GigaThread Engines instead of to individual GPCs. Doing this gives us the ability to put a multiplicity of the existing designs on a board, with two major tradeoffs: the first being slightly higher latency, and the second being a much more complicated main memory design.

Both of the cases above are logically identical to the way things are currently done, and could likely be pulled off without requiring any changes to the programming model. The reality of the situation, however, is that both of these designs rely on external bus speeds approaching those of on-die buses, which is just not realistic at this time.
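
For what it's worth, here's a rough structural sketch of that second approach in code form (the "TeraThread Engine" is the hypothetical top-level dispatcher proposed above, sitting over per-chip GigaThread-style dispatchers; the round-robin dispatch and every name here is invented for illustration):

```python
# Rough sketch of the two-level dispatch idea: a hypothetical top-level
# dispatcher forwards work to per-chip dispatchers, which feed their GPCs.

class GPC:
    def __init__(self, gpc_id: int):
        self.gpc_id = gpc_id
        self.queue = []             # thread blocks waiting on this GPC

    def submit(self, block):
        self.queue.append(block)

class GigaThreadEngine:
    """Per-chip dispatcher: spreads thread blocks across its local GPCs."""
    def __init__(self, gpcs):
        self.gpcs = gpcs
        self._next = 0

    def dispatch(self, block):
        self.gpcs[self._next % len(self.gpcs)].submit(block)
        self._next += 1

class TeraThreadEngine:
    """Hypothetical top-level dispatcher: forwards blocks to whole chips,
    adding one extra hop (and its latency) in front of every dispatch."""
    def __init__(self, chips):
        self.chips = chips
        self._next = 0

    def dispatch(self, block):
        self.chips[self._next % len(self.chips)].dispatch(block)
        self._next += 1

# Two chips of four GPCs each, fed from one top-level dispatcher.
chips = [GigaThreadEngine([GPC(i) for i in range(4)]) for _ in range(2)]
top = TeraThreadEngine(chips)
for block in range(16):
    top.dispatch(block)             # each block takes two dispatch hops
```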


The unicorn is getting a fast enough bus, but at the moment that's largely what it is: a unicorn.

 

The other question is how much of a benefit you're truly getting from this from a manufacturing standpoint. If we pull up some numbers on Wikipedia, we can find the following stats:

  • GT 1030
    • 384 SPUs, 24 TMUs, and 16 ROPs
    • 1.8 billion transistors
    • 74 mm^2 die
  • GTX 1050 Ti
    • 768 SPUs, 48 TMUs, and 16 ROPs
    • 3.3 billion transistors
    • 132 mm^2 die

Even though the GTX 1050 Ti is basically double a GT 1030, it's the more efficient design, since it uses fewer transistors and less die space than a straight doubling would suggest. Also note that nothing else is different between the two: both were designed with the same external bus and memory type. My math may be rough, but from the same amount of silicon that gives you 66 GT 1030s, you can make 37 GTX 1050 Tis. You can lose 4 of those 1050 Tis and still break even against 33 glued-together pairs of GT 1030s, which works out to an 89% yield. We also can't assume the GT 1030 has a 100% yield rate, so for the sake of simplicity say it too suffers from an 89% yield. That leaves about 58 good GT 1030 dies (29 pairs), which further increases the number of bad 1050 Tis you can tolerate before dropping below the break-even point. In other words, the GTX 1050 Ti's yield can fall to roughly 78% before it stops making sense to build it rather than gluing two GT 1030s together.
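
Laying that arithmetic out step by step (same die sizes as quoted above from Wikipedia; the 89% figure is the simplifying yield assumption from this comment):

```python
# Reworking the area/yield comparison with the die sizes quoted above.
GT1030_AREA    = 74.0    # mm^2 (1.8 billion transistors)
GTX1050TI_AREA = 132.0   # mm^2 (3.3 billion transistors)

silicon = 66 * GT1030_AREA                    # material for 66 GT 1030 dies
ti_candidates = int(silicon // GTX1050TI_AREA)
print(ti_candidates)                          # -> 37 GTX 1050 Ti dies instead

pairs_of_1030s = 66 // 2                      # 33 "two glued 1030s" equivalents
print(ti_candidates - pairs_of_1030s)         # -> 4 spare Tis; 33/37 ~= 89% to break even

# If the GT 1030 itself only yields 89%, about 58 dies (29 pairs) are good,
# so the 1050 Ti can drop to roughly 29/37 ~= 78% yield before pairing wins.
gt1030_good = int(66 * 0.89)                  # 58
print(gt1030_good, round((gt1030_good // 2) / ti_candidates, 2))
```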

 

We also have to consider what I mentioned in the blog post: a GPU for gaming is working on a time-sensitive task. Benchmarking something like POV-Ray or x264 on a CPU is fine because we don't really care about the order in which the final output is assembled (more or less), nor how long it takes (though the faster the better). On a GPU, the order in which the final output is assembled does matter, and we do care how long it takes. I'm not quite sure how much added latency would affect overall graphics performance, and the only thing SLI shares is frame buffer data (I'm not sure what NVLink shares).

 

But overall, until we solve the two biggest issues plaguing multi-GPU setups, namely that memory pools don't combine and that workload distribution is hard, I don't think chiplets will be anything more than a fancier way of doing multi-GPU setups.
