NVidia exploring MCM GPUs?

WMGroomAK · July 4, 2017

I have to admit that I personally find this interesting, especially with how AMD and Intel have been exploring Multi-Chip Module CPUs, but it appears that Nvidia is also exploring the option of future GPUs being based on an MCM package.

http://techreport.com/news/32189/nvidia-explores-ways-of-cramming-many-gpus-onto-one-package

Quote

The proposal was put together by researchers and engineers from Arizona State University, Nvidia, the University of Texas at Austin, and the Barcelona Supercomputing Center. The idea starts with the recognition that Nvidia is soon going to struggle to squeeze more performance out of its current layouts with today's fabrication technology. Typically, the company has been able to improve GPU performance between generations by ratcheting up the streaming multiprocessor (SM) count. Unfortunately, it's getting increasingly difficult to cram more transistors into single dies. Nvidia's V100 GPU, for example, required TSMC to produce the chips at the reticle limit of its 12-nm process. Furthermore, there are costs and problems associated with making ever-larger dies, as yield numbers decrease due to manufacturing faults.

It's possible that Nvidia could take the approach of putting multiple GPUs on the same PCB, as it did with the Tesla K10 and K80. However, the researchers found a number of problems with this approach that the company has yet to solve. For example, they note that it's not easy to distribute work across multiple GPUs, so it requires a lot of effort from programmers to use the hardware efficiently.

Instead, these researchers want to take advantage of developments in package technologies that might allow Nvidia to place multiple GPU modules (GPMs) onto one package. These GPMs would be smaller than current GPUs, and therefore easier and cheaper to manufacture. While the researchers acknowledge that questions remain about the performance of packages like this one, they claim that recent developments in substrate technology could allow the company to implement a fast, robust interconnect architecture to let these modules communicate. Theoretically, on-package bandwidth could reach multiple terabytes per second.

In Nvidia's in-house GPU simulator, the research team put together an MCM-GPU with a whopping 256 SMs, compared to Pascal's "measly" 56 SMs. The team then pitted that against a hypothetical (and unbuildable) 256-SM GPU built with the company's current architecture. The results showed that the MCM-GPU was 45.5% faster than the monolithic chip. Further comparison with multiple GPUs on the same board (rather than integrated into one package) still gave the MCM-GPU a 26.8% performance advantage.

This definitely seems like a more doable approach to making more powerful GPUs at a more cost effective model as we hit limits on scaling down transistor sizes. It seems to be more of a stable option than some of the previous models where they placed two separate GPUs on the same package and had trouble with the interconnects. Of course, AMD may have a bit of an advantage on going this route with how they are designing their Infinity Fabric, but time will tell as I don't personally see any mainstream GPUs being proposed with this kind of layout for the foreseeable future.

sazrocks · July 4, 2017

Definitely interesting. I wonder if this will ever make it to market ahead of carbon nanotube based transistors (which allow smaller process nodes).

Do we know if infinity fabric can scale to something like this for amd?

RadiatingLight · July 4, 2017

30 minutes ago, sazrocks said:

Definitely interesting. I wonder if this will ever make it to market ahead of carbon nanotube based transistors (which allow smaller process nodes).

Do we know if infinity fabric can scale to something like this for amd?

I thought vega was infinity fabric based.

Jito463 · July 4, 2017

1 minute ago, RadiatingLight said:

I thought vega was infinity fabric based.

No, the rumor is that's what Navi is supposed to be based on, but Vega is a traditional, single die GPU.

RadiatingLight · July 4, 2017

7 minutes ago, Jito463 said:

No, the rumor is that's what Navi is supposed to be based on, but Vega is a traditional, single die GPU.

This is what I was talking about:

https://www.overclock3d.net/news/gpu_displays/amd_has_confirmed_that_vega_utilises_their_new_infinity_fabric_tech/2

some components are using infinity fabric, although yes, it is a single die.

Taf the Ghost · July 4, 2017

Vega should be using Infinity Fabric for certain aspects, but it is Navi that'll bring online the multi-GPU configurations. Navi is going to be more like the Ryzen -> Threadripper -> Epyc stack. We just don't know how many GPUs they're going to stack on. (4x 75w GPUs for the top-tier would make the most sense.)

But, seriously, the displayed part is literally how Epyc works. There's a reason some of us are super happy this tech is here.

Jito463 · July 4, 2017

2 minutes ago, RadiatingLight said:

This is what I was talking about:

https://www.overclock3d.net/news/gpu_displays/amd_has_confirmed_that_vega_utilises_their_new_infinity_fabric_tech/2

some components are using infinity fabric, although yes, it is a single die.

My mistake then. In retrospect, I recall them mentioning that Vega could utilize information directly from RAM, bypassing the CPU. Presumably, this is what IF is being used for on Vega, since obviously they're not using it for a multi-die GPU configuration.

VagabondWraith · July 4, 2017

Interesting, and makes the most sense. A die size of 815mm² is costly and yields I bet are horrendous. There's a reason why they cost $18,000 apiece. Will be interesting to see if AMD can get the technology to scale using multiple smaller GPU dies and how smoothly it goes. Will be interesting to watch this unfold moving forward.
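As a rough illustration of the die-size point, here is a minimal sketch using the textbook Poisson yield model, Y = exp(-D0 * A). The defect density below is a made-up placeholder, not a TSMC figure, and real yield models are more involved, so treat the numbers as directional only.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Expected fraction of defect-free dies under a simple Poisson model."""
    area_cm2 = die_area_mm2 / 100.0  # 100 mm^2 in one cm^2
    return math.exp(-defects_per_cm2 * area_cm2)


# ASSUMPTION: hypothetical defect density, chosen only for illustration.
ASSUMED_DEFECTS_PER_CM2 = 0.1

# One 815 mm^2 monolithic die versus one of four smaller modules
# covering the same total silicon area.
big_die = poisson_yield(815.0, ASSUMED_DEFECTS_PER_CM2)
small_die = poisson_yield(815.0 / 4, ASSUMED_DEFECTS_PER_CM2)

print(f"815 mm^2 die yield:        {big_die:.1%}")
print(f"815/4 mm^2 module yield:   {small_die:.1%}")
```

Whatever defect density is plugged in, the smaller module always yields better under this model, which is the basic economic argument for MCM packages.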

YoloSwag · July 4, 2017

I actually had an idea like this before. A lot of people said it was crazy and inefficient. Well here's to y'all haters.

A foreseeable issue with this design though is latency. Splitting a workload is an additional task. An avoidable issue if you're gonna set independencies for each module.

WMGroomAK · July 4, 2017

22 minutes ago, YoloSwag said:

I actually had an idea like this before. A lot of people said it was crazy and inefficient. Well here's to y'all haters.

A foreseeable issue with this design though is latency. Splitting a workload is an additional task. An avoidable issue if you're gonna set independencies for each module.

I agree that the issue of latency across modules will definitely be a big hurdle to development as we're already seeing some of this with AMD's Infinity Fabric on their Ryzen processors... Hopefully as development of this moves forward, we can see the latency issue drop or code written that can account and optimize for this issue. I think there is too much benefit from increased module yields for this to not move forward.

Drak3 · July 4, 2017

38 minutes ago, YoloSwag said:

I actually had an idea like this before. A lot of people said it was crazy and inefficient

Without having a thorough explanation of how it works, and some real world testing, it's to be expected.

However, AMD already has working concepts of multiple GPU cards that work well when they do work. The problem with their existing cards comes down to Crossfire: the system always has to recognize them as 2 GPUs, and in the end it was still somewhat inefficient compared to a single GPU solution. We'll see how these MCM designs pan out in the long run.

leadeater · July 4, 2017

3 hours ago, sazrocks said:

Do we know if infinity fabric can scale to something like this for amd?

The current Infinity Fabric probably not but the technology it's based on yes. Also AMD a while ago showed off a theoretical APU with stacked CPU, GPU and HBM on it, it's floating around the news section somewhere.

Also the technology group working on Gen-z is looking more at interconnecting components in a server and between servers, the technology might not be directly applicable to inter-die interconnects but some of it is since that's what the Infinity Fabric is based off of.

leadeater · July 4, 2017

1 hour ago, WMGroomAK said:

I agree that the issue of latency across modules will definitely be a big hurdle to development as we're already seeing some of this with AMD's Infinity Fabric on their Ryzen processors... Hopefully as development of this moves forward, we can see the latency issue drop or code written that can account and optimize for this issue. I think there is too much benefit from increased module yields for this to not move forward.

Fortunately GPUs and rendering are much more of a parallel workload type so the need to pass information between modules might not be as big of an issue as on a CPU, wild ass guess ofc since I'm not a GPU designer.

Prysin · July 4, 2017

42 minutes ago, leadeater said:

The current Infinity Fabric probably not but the technology it's based on yes. Also AMD a while ago showed off a theoretical APU with stacked CPU, GPU and HBM on it, it's floating around the news section somewhere.

Also the technology group working on Gen-z is looking more at interconnecting components in a server and between servers, the technology might not be directly applicable to inter-die interconnects but some of it is since that's what the Infinity Fabric is based off of.

the whole irony of this post is the location of the University...

University of Austin...

Radeon Technologies Group has a huge office in Austin.

Industrial espionage perhaps?

MageTank · July 4, 2017

2 hours ago, WMGroomAK said:

I agree that the issue of latency across modules will definitely be a big hurdle to development as we're already seeing some of this with AMD's Infinity Fabric on their Ryzen processors... Hopefully as development of this moves forward, we can see the latency issue drop or code written that can account and optimize for this issue. I think there is too much benefit from increased module yields for this to not move forward.

It's still superior to the latency imposed by using an on-board interconnect (multi-socket boards) as the trace topology on the substrate offers a far quicker path than the traces on your multi-socket motherboards. People can say what they will, but AMD has always had some extremely fast interconnects, far faster than QPI. The latency issue is indeed something to worry about when comparing MCMs to a single monolithic core, but when comparing it against multi-socket setups, it's a superior way to handle core scaling.

I find it odd that OP's post implies AMD's/Intel's exploration of this is something new, when both have done this before years ago. It's also not new for GPU's, as I believe the Xbox 360 used an MCM GPU, though I may be wrong. Either way, it will be interesting to see how they intend to tackle the latency hurdle compared to that of our current monolithic dies. Nvidia's NVLink interconnect is pretty fast, coming in at a whopping 1200Gbps on NVLink 2.0, 4x the bandwidth of Intel's latest QPI (300Gbps). In comparison, AMD's outdated HyperTransport 3.1 had a bandwidth of 409Gbps (from 2008). AMD, on the other hand, doubled down on Infinity Fabric, as it supports a theoretical bandwidth of 512Gbytes (yes, Gigabytes, not bits) per second. As long as latency is not abysmal, I can't imagine the bandwidth itself becoming a bottleneck anytime soon.
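To put those figures on one scale, here is a small sketch that converts the Infinity Fabric claim from gigabytes to gigabits and compares everything against QPI. The numbers are simply the ones quoted in this post, not independently verified specs, and the helper name is made up for the example.

```python
def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert GB/s to Gbps (8 bits per byte)."""
    return gbytes_per_s * 8


# Figures as quoted in the post above, all in Gbps.
links_gbps = {
    "Intel QPI": 300,
    "HyperTransport 3.1": 409,
    "NVLink 2.0": 1200,
    "Infinity Fabric (claimed peak)": gbytes_to_gbits(512),
}

baseline = links_gbps["Intel QPI"]
for name, gbps in links_gbps.items():
    print(f"{name:32s} {gbps:7.0f} Gbps  ({gbps / baseline:.2f}x QPI)")
```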

Bouzoo · July 4, 2017

16 minutes ago, MageTank said:

It's also not new for GPU's, as I believe the Xbox 360 used an MCM GPU, though I may be wrong..

You're not wrong, it used an MCM, unless I'm reading the pics here wrong.

https://forums.anandtech.com/threads/xbox-360-slim-to-have-integrated-cpu-gpu-die-fusion.2100408/

EDIT: DRAM is the only MCM part

MageTank · July 4, 2017

11 minutes ago, Bouzoo said:

You're not wrong, it used an MCM, unless I'm reading the pics here wrong.

https://forums.anandtech.com/threads/xbox-360-slim-to-have-integrated-cpu-gpu-die-fusion.2100408/

Yeah, I heard it from a friend in my telegram hardware group when we were originally discussing this about a month ago when threadripper was first rumored to be MCM. I didn't really do any research on it, but it did make sense at the time. I am mostly interested in how AMD plans on scaling the IF up. As it sits on their current CPU's, it's limited to DDR4 memory speeds. At 4266MHz RAM, you are looking at a peak theoretical Infinity Fabric bandwidth of 508Gbps. This is less than half that of Nvidia's NVLink 2.0, and roughly 20% slower than Nvidia's NVLink 1.0 (which is 640 Gbps). This is also assuming the fastest JEDEC approved DDR4 speed of 4266 (2133MHz actual frequency, since double data-rate). As it currently sits, Ryzen has a difficult time of achieving anything higher than 3600 (with some extremely lucky souls hitting 3800ish). At 3600, the IF's bandwidth would be roughly 429Gbps, which is barely faster than their original HyperTransport 3.1 seen on their older Phenoms. Now, this is assuming that the IF on their newer Ryzen/Threadripper SKU's are still 256-bit wide. If they widen the bus, bandwidth would improve proportionally. All we know for certain, is that AMD claims the fabric itself can scale up to 512GB/s, or 4096Gbps. How they intend to achieve that bandwidth is beyond me.
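The arithmetic behind those numbers can be sketched as bus width times fabric clock, with the fabric clock taken as half the DDR4 transfer rate. One assumption on my part: the 508 and 429 figures only come out if you divide by 2^30 rather than 10^9, so the sketch does the same. The function name and defaults are illustrative, not anything from AMD.

```python
GIBI = 2**30  # ASSUMPTION: binary prefix, chosen to reproduce the quoted figures


def fabric_bandwidth_gbps(ddr_rate_mt_s: float, bus_width_bits: int = 256) -> float:
    """Peak fabric bandwidth for a bus clocked at the actual memory frequency."""
    clock_hz = ddr_rate_mt_s / 2 * 1e6  # double data rate: real clock is half
    return bus_width_bits * clock_hz / GIBI


for rate in (4266, 3600):
    print(f"DDR4-{rate}: {fabric_bandwidth_gbps(rate):.1f} Gbps at 256-bit")

# Widening the bus scales bandwidth linearly, not exponentially.
print(f"DDR4-3600, 512-bit: {fabric_bandwidth_gbps(3600, 512):.1f} Gbps")
```

Under the same assumptions, even a 512-bit bus at DDR4-3600 lands well short of the claimed 4096Gbps, which is why the question of how AMD gets there is a fair one.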

DXMember · July 4, 2017

4 hours ago, Jito463 said:

No, the rumor is that's what Navi is supposed to be based on, but Vega is a traditional, single die GPU.

Vega is still heavily based on infinity-fabric - all the memory subsystems, multimedia accelerators and CUs are interconnected with infinity fabric

leadeater · July 4, 2017

3 hours ago, MageTank said:

Now, this is assuming that the IF on their newer Ryzen/Threadripper SKU's are still 256-bit wide. If they widen the bus, bandwidth would improve proportionally. All we know for certain, is that AMD claims the fabric itself can scale up to 512GB/s, or 4096Gbps. How they intend to achieve that bandwidth is beyond me.

Maybe they are using 256bit per die so on Epyc it's 4x the bandwidth??? And then some super magic RAM speed calculation that isn't possible yet??? That would get them to about half their claimed 4096Gbps.

Edit:

Oh and then dual socket to claim 2x, ha nailed it. Kidding btw.

VanayadGaming · July 4, 2017

I guarantee what AMD meant with "NAVI Scalability" is that they are going to do the SAME thing they did with CPUs with Ryzen.

tom_w141 · July 4, 2017

8 hours ago, WMGroomAK said:

time will tell as I don't personally see any mainstream GPUs being proposed with this kind of layout for the foreseeable future.

AMD have said they are doing this with Navi. 2 GPUs on 1 card linked by IF so the system sees only 1 big GPU.

Taf the Ghost · July 4, 2017

20 minutes ago, tom_w141 said:

AMD have said they are doing this with Navi. 2 GPUs on 1 card linked by IF so the system sees only 1 big GPU.

Well, for their top-tier & professional processors, more than likely. If you think of them taking 2x RX 580 or 4x RX 560. Things get really interesting when you can put them all together in that type of configuration. Once RX Vega launches, we should have a lot more technical information about the interior Infinity Fabric already within the Vega die. (As it appears to be quite a lot.) Seen some informed speculation about it, but we'll need to wait for AMD to start talking more.

The direction this is probably going is a little less like Epyc (4 monolith packages in an array) and more like a Lego system. We don't know how much they'll be able to move off-die and onto the package, but that's probably another generation or two down the line. Though I expect the "big Navi" will be 4x Navi GPUs in an array. 400W Toaster, but, well, that sucker is going to max out whatever CPU you toss at it.

Notional · July 4, 2017

8 hours ago, RadiatingLight said:

I thought vega was infinity fabric based.

Vega does indeed utilize infinity fabric, but it's not a multi die like what NVidia is looking into here.

Interesting. Afaik these are the rumours going around about AMD's next gen NAVI architecture. Since it worked quite well in the CPU space with Ryzen/TR/Epyc, I do wonder if AMD can pull it off without too many issues. Right now NVidia has already hit the max die size possible with today's technology on Volta. So something has to happen to go forward from here.

In the end it could lead to much cheaper GPU's compared to performance. And with a huge performance increase. After all, the Ryzen dies are over 90% in yields, so even if stitching them together with infinity fabric (or similar) on an interposer (which is necessary for HBM), it could still be cheaper.

VanayadGaming · July 4, 2017

1 hour ago, Notional said:

Vega does indeed utilize infinity fabric, but it's not a multi die like what NVidia is looking into here.

Interesting. Afaik these are the rumours going around about AMD's next gen NAVI architecture. Since it worked quite well in the CPU space with Ryzen/TR/Epyc, I do wonder if AMD can pull it off without too many issues. Right now NVidia has already hit the max die size possible with today's technology on Volta. So something has to happen to go forward from here.

In the end it could lead to much cheaper GPU's compared to performance. And with a huge performance increase. After all, the Ryzen dies are over 90% in yields, so even if stitching them together with infinity fabric (or similar) on an interposer (which is necessary for HBM), it could still be cheaper.

From what I know one of the reasons why AMD pushed so much xfire or dual GPUs is exactly because of this. At the moment they have games that scale at almost 100% with dual gpus, and if I recall correctly, they had far better scaling than nvidia. More than likely they will implement an MCM structure with navi. Now the question is...should I wait for that or get vega / volta ?

Notional · July 4, 2017

11 minutes ago, VanayadGaming said:

From what I know one of the reasons why AMD pushed so much xfire or dual GPUs is exactly because of this. At the moment they have games that scale at almost 100% with dual gpus, and if I recall correctly, they had far better scaling than nvidia. More than likely they will implement an MCM structure with navi. Now the question is...should I wait for that or get vega / volta ?

Well, remember that these stitched together chips will be seen and function like 1 chip. So xfire tech should not have anything to add in that regard.

Depends on what you have now. Navi probably won't be out until 2020, as we will probably see vega rebrands next year with the addition of Vega 20. So Navi won't be out until 2019 at the earliest but expect the year after.
