
AMD to publicly reveal new HW at Computex.

21 minutes ago, pas008 said:

have you looked at those old school benches with hydra included? in 2009

That type of thing is much simpler than an MCM GPU: you're still treating each GPU as a real, distinct PCIe device and dispatching work to them with some kind of balancing algorithm, so there is no high-speed I/O at all between the GPUs. Even with that much simpler approach, the most common outcome for Hydra was driver/game crashes.

 

An MCM GPU is a single PCIe device and there is no driver-side workload splitting, so the GPU itself needs to handle that. Alternate Frame Rendering (AFR), the simplest method, isn't really an option, otherwise you're actually going to halve the usable VRAM and halve the memory bandwidth. Acting as an actual single GPU is way harder than Hydra, and even harder than Threadripper and EPYC.
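A back-of-the-envelope sketch of why naive AFR across two dies halves the usable resources. All numbers here are hypothetical, not from any real product: each die renders complete frames on its own, so each die needs a full copy of everything.

```python
# Hypothetical two-die MCM package using naive Alternate Frame Rendering (AFR).
# All figures below are illustrative placeholders, not real product specs.

vram_per_die_gb = 16          # local VRAM attached to each die
mem_bw_per_die_gbps = 500     # local memory bandwidth per die

dies = 2

# With AFR, every die renders complete frames on its own, so every die
# needs its own full copy of textures, buffers, and geometry.
total_vram_gb = vram_per_die_gb * dies          # physically present on package
usable_vram_gb = vram_per_die_gb                # usable by any one frame

# Each frame only ever touches one die's memory controllers, so the
# bandwidth available *to that frame* is one die's worth, not the sum.
bw_per_frame_gbps = mem_bw_per_die_gbps

print(f"usable VRAM: {usable_vram_gb} GB of {total_vram_gb} GB on the package")
print(f"bandwidth seen by any single frame: {bw_per_frame_gbps} GB/s")
```

The point the arithmetic makes: the package advertises double the memory, but a frame can only ever use half of it.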

 

CPUs can at least do lots of different tasks; GPUs mostly do one thing at a time, massively in parallel. It's hard enough to get a dance group of 4 people to do their moves in perfect sync; try a 3,000-person dance group.


Yes, it's about finding the algorithm to combine them.

Mathematically it can be found by considering other multi-GPU solutions,

along with the latency of the PLX on the motherboard.

I think if we started to get a whiteboard out on PLX latency on the motherboard vs. on the GPU board, we would be onto something.


4 minutes ago, pas008 said:

Yes, it's about finding the algorithm to combine them.

Mathematically it can be found by considering other multi-GPU solutions,

along with the latency of the PLX on the motherboard.

I think if we started to get a whiteboard out on PLX latency on the motherboard vs. on the GPU board, we would be onto something.

The problem with MCM is that it isn't an algorithm issue; you aren't actually doing any workload balancing like multi-GPU. The problems MCM has are bandwidth between dies, memory zones, post-processing effects, frame order, frame timing, etc.

 

If you don't care about frame order and frame times, or about effects looking consistent frame by frame and even across the whole frame itself, then MCM is simple. That's basically why many-GPU computation in HPC is easy: you don't really care about any of those.

 

As @Taf the Ghost said, we'll see MCM GPUs in HPC before we see them in gaming, and they'll be doing it more for larger VRAM, so they can do bigger/more complex computations that are currently CPU-only due to the memory size required.


Just now, leadeater said:

The problem with MCM is that it isn't an algorithm issue; you aren't actually doing any workload balancing like multi-GPU. The problems MCM has are bandwidth between dies, memory zones, post-processing effects, frame order, frame timing, etc.

 

If you don't care about frame order and frame times, or about effects looking consistent frame by frame and even across the whole frame itself, then MCM is simple. That's basically why many-GPU computation in HPC is easy: you don't really care about any of those.

 

As @Taf the Ghost said, we'll see MCM GPUs in HPC before we see them in gaming, and they'll be doing it more for larger VRAM, so they can do bigger/more complex computations that are currently CPU-only due to the memory size required.

So why does SLI work? With 600Hz?

Why did Hydra work in 2009?

Why does DX12 mGPU work?

It's very possible, and latency can be reduced

on the same PCB.


1 minute ago, pas008 said:

So why does SLI work? With 600Hz?

Why did Hydra work in 2009?

Why does DX12 mGPU work?

It's very possible, and latency can be reduced

on the same PCB.

Because those are treated as distinct, separate GPUs that do not require high-bandwidth resource sharing, and they have the assistance of drivers and APIs to keep it all in line. An MCM GPU that relies on drivers for the hardware to actually function properly is a waste of time; you might as well stick with SLI and Crossfire, it won't be any better.


1 minute ago, leadeater said:

Because those are treated as distinct, separate GPUs that do not require high-bandwidth resource sharing, and they have the assistance of drivers and APIs to keep it all in line. An MCM GPU that relies on drivers for the hardware to actually function properly is a waste of time; you might as well stick with SLI and Crossfire, it won't be any better.

See above


Wait, drivers would see it as one.

Latency would be PCB-bound, not SLI fingers.

And not to mention PCIe 4 or 5 revisions.


Just now, pas008 said:

See above

Yes, I know about that, but a research paper based on theory and possibly some prototypes doesn't mean we are that close to an actual product we can buy, and Nvidia is working on that more for HPC, not desktop graphics/gaming; it says so in the first sentence of the introduction ;).

 

The point is more: don't get your expectations too high. It's a really complex problem that isn't going to be solved quickly, and it may ultimately not make it to gaming for many, many years.

 

Some perspective on the problem would likely help here: currently the Infinity Fabric in all Zen architecture CPUs tops out at 42GB/s, whereas a GPU, according to Nvidia, needs 3TB/s. To get 3TB/s of bandwidth between each die using current technology would use a LOT of die area and traces, lots and lots.

 

Quote

NVIDIA’s GRS technology can provide signaling rates up to 20 Gbps per wire. The actual on-package link bandwidth settings for our 256 SM MCM-GPU can vary based on the amount of design effort and cost associated with the actual link design complexity, the choice of packaging technology, and the number of package routing layers. Therefore, based on our estimations, an inter-GPM GRS link bandwidth of 768 GB/s (equal to the local DRAM partition bandwidth) is easily realizable. Larger bandwidth settings such as 1.5 TB/s are possible, albeit harder to achieve, and a 3TB/s link would require further investment and innovations in signaling and packaging technology. Moreover, higher than necessary link bandwidth settings would result in additional silicon cost and power overheads. Even though on-package interconnect is more efficient than its on-board counterpart, it is still substantially less efficient than on-chip wires and thus we must minimize inter-GPM link bandwidth consumption as much as possible.

 

[attached image from the quoted Nvidia paper]

 

Using 768 GB/s links could result in as much as 40% performance loss compared to the optimum 3 TB/s; I wouldn't bother with anything less than 1.5 TB/s.
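Putting the figures quoted above side by side shows the scale of the gap. The only numbers used here are the ones from the thread (42GB/s Infinity Fabric, 768GB/s GRS link, 3TB/s optimum); the comparison itself is just arithmetic.

```python
# Rough scale comparison using the figures quoted in this thread:
# Zen's Infinity Fabric die-to-die link vs. what the Nvidia MCM-GPU
# paper says an MCM GPU would ideally want between dies.

infinity_fabric_gbps = 42      # Zen inter-die bandwidth quoted above
mcm_target_gbps = 3000         # 3 TB/s optimum from the Nvidia paper
grs_easy_gbps = 768            # "easily realizable" GRS link per the paper

gap = mcm_target_gbps / infinity_fabric_gbps
print(f"IF would need to scale ~{gap:.0f}x to hit the 3 TB/s target")
print(f"even the 'easy' 768 GB/s link is {mcm_target_gbps / grs_easy_gbps:.1f}x short")
```

That ~71x factor is why "just clock IF higher" doesn't get you there.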


3 minutes ago, pas008 said:

HPC? Isn't that what NVLink was for?

Non-consumer.

NVLink has nothing to do with MCM; it's for connecting ASICs (graphics cards) and CPUs (IBM POWER only). Connecting dies is a different set of technologies.


Just now, leadeater said:

NVLink has nothing to do with MCM; it's for connecting ASICs (graphics cards) and CPUs (IBM POWER only). Connecting dies is a different set of technologies.

But for us end users, are we even reaching those numbers?


1 minute ago, pas008 said:

But for us end users, are we even reaching those numbers?

Inside the GPU, yes, very much so; look at the memory bandwidth on a typical high-end GPU. A 1080 Ti has 484GB/s, and AMD GPUs with HBM have even more, though it should be noted Nvidia needs less because they have way better memory compression, so they typically have more effective memory bandwidth.
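The "effective bandwidth" idea above is just raw bandwidth multiplied by how much the lossless compression shrinks memory traffic. A minimal sketch: the 484GB/s is the 1080 Ti figure from the post, but the compression ratios below are made-up placeholders purely to show the idea, not real measurements from either vendor.

```python
# Sketch of "effective" memory bandwidth: raw bandwidth multiplied by an
# assumed lossless-compression factor. Ratios here are hypothetical.

def effective_bandwidth(raw_gbps: float, compression_ratio: float) -> float:
    """Bandwidth the GPU behaves as if it had, given lossless color/delta
    compression that shrinks memory traffic by `compression_ratio`."""
    return raw_gbps * compression_ratio

raw_1080ti = 484.0  # GB/s, figure quoted in the thread

# Hypothetical: a GPU compressing traffic 1.6x vs. one managing only 1.2x.
print(effective_bandwidth(raw_1080ti, 1.6))
print(effective_bandwidth(raw_1080ti, 1.2))
```

With the same raw bandwidth, the better compressor behaves as if it had a third more, which is the point being made about Nvidia needing less raw bandwidth.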


Really simple way to put it: whatever a CPU needs, a GPU needs 10 times more of it.


1 minute ago, leadeater said:

Inside the GPU, yes, very much so; look at the memory bandwidth on a typical high-end GPU. A 1080 Ti has 484GB/s, and AMD GPUs with HBM have even more, though it should be noted Nvidia needs less because they have way better memory compression, so they typically have more effective memory bandwidth.

So you are telling me 2 cards in SLI, with PCIe latency, sometimes through a chipset,

would be better than a card that can handle all that on its own? On its own PCB, if they had a workload divider? Hmm.


Not sure if you understand DX12, or the Hydra 100 or 200 reasoning, even if those were DX10.


11 minutes ago, pas008 said:

So you are telling me 2 cards in SLI, with PCIe latency, sometimes through a chipset,

would be better than a card that can handle all that on its own? On its own PCB, if they had a workload divider? Hmm.

No, I'm saying those are two entirely different problems with different ways of solving each of them. Making two GPU dies act as one resource is nothing like SLI at all; they don't really share anything in common. It's actually all pretty well covered in the Nvidia article you linked: the fundamental problem for MCM is the bandwidth between dies.

 

SLI does not require sending high-bandwidth data between GPUs; the SLI bridge doesn't do much at all, which is why AMD got rid of it and just uses the PCIe bus, which is tiny compared to the bandwidth inside a GPU die.

 

Ryzen inter-CCX bandwidth is only 42GB/s, the same as inter-die bandwidth on TR and EPYC (42GB/s), and we already see the impact that has on a lot of workloads. We can put up with those oddities mostly because the impact is non-visual, as a CPU isn't rendering in real time for our eyes to see. If we simply 10x the Infinity Fabric to 420GB/s, we're still basically at half of what Nvidia says we need to even bother looking at this, and even 768GB/s is not enough for a real final product.


So now PCIe 4 being short-lived and skipping right to PCIe 5 makes sense now?

And now you are saying it's plausible?


13 minutes ago, leadeater said:

Ryzen inter-CCX bandwidth is only 42GB/s, the same as inter-die bandwidth on TR and EPYC (42GB/s), and we already see the impact that has on a lot of workloads. We can put up with those oddities mostly because the impact is non-visual, as a CPU isn't rendering in real time for our eyes to see. If we simply 10x the Infinity Fabric to 420GB/s, we're still basically at half of what Nvidia says we need to even bother looking at this, and even 768GB/s is not enough for a real final product.

AMD might need less bandwidth than Nvidia depending on how they go about doing it. For example, GCN already divides each frame into quadrants and solves one in each shader engine; they could have each die solve 1 or 2 quarters of the image (depending on the number of dies) and then only need to send the final image to a master to send to the monitor (after sewing them together). This method, unless I am missing something, would need little bandwidth at the cost of lots of VRAM.

 

edit:

By the way, those numbers are at what memory frequency? Because I bet IF can be clocked higher in the next iterations.
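The quadrant idea above can be sketched in a few lines. This is a toy model, not anything AMD has described: each "die" independently produces one quadrant of a frame (here just a 2D list of pixel values), and a master stitches them; only the finished quadrants would ever cross the inter-die link.

```python
# Toy sketch of split-frame rendering: each "die" produces one quadrant
# and a master sews the four quadrants into a single frame.

def render_quadrant(die_id, h, w):
    # Stand-in for a die rendering its quadrant; each die fills its
    # region with its own id so the stitch is easy to verify.
    return [[die_id for _ in range(w)] for _ in range(h)]

def stitch(q0, q1, q2, q3):
    """Master sews quadrants (top-left, top-right, bottom-left,
    bottom-right) into one frame."""
    top = [a + b for a, b in zip(q0, q1)]
    bottom = [a + b for a, b in zip(q2, q3)]
    return top + bottom

H, W = 4, 4  # tiny 4x4 "frame", split into 2x2 quadrants
quads = [render_quadrant(i, H // 2, W // 2) for i in range(4)]
frame = stitch(*quads)

print(frame[0])   # top row comes from dies 0 and 1
print(frame[3])   # bottom row comes from dies 2 and 3
```

The low inter-die traffic is the attraction; the cost, as the post says, is that each die needs its own copy of the scene data, so VRAM use balloons.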


5 minutes ago, pas008 said:

So now PCIe 4 being short-lived and skipping right to PCIe 5 makes sense now?

PCIe isn't used for linking dies either; even PCIe 5 is more than 10 times too slow anyway. PCIe is what Nvidia refers to as on-board; on-chip communication is the area of technology we are talking about. That would be things like EMIB and Infinity Fabric (though IF can be used for both).


1 minute ago, cj09beira said:

AMD might need less bandwidth than Nvidia depending on how they go about doing it. For example, GCN already divides each frame into quadrants and solves one in each shader engine; they could have each die solve 1 or 2 quarters of the image (depending on the number of dies) and then only need to send the final image to a master to send to the monitor (after sewing them together). This method, unless I am missing something, would need little bandwidth at the cost of lots of VRAM.

Like I said, Hydra-divider-like attributes.

Like what Matrox did with their drivers.


1 minute ago, leadeater said:

PCIe isn't used for linking dies either; even PCIe 5 is more than 10 times too slow anyway. PCIe is what Nvidia refers to as on-board; on-chip communication is the area of technology we are talking about. That would be things like EMIB and Infinity Fabric (though IF can be used for both).

Where are you?

Seriously, what conversation are you in?


6 minutes ago, cj09beira said:

AMD might need less bandwidth than Nvidia depending on how they go about doing it. For example, GCN already divides each frame into quadrants and solves one in each shader engine; they could have each die solve 1 or 2 quarters of the image (depending on the number of dies) and then only need to send the final image to a master to send to the monitor (after sewing them together). This method, unless I am missing something, would need little bandwidth at the cost of lots of VRAM.

 

edit:

By the way, those numbers are at what memory frequency? Because I bet IF can be clocked higher in the next iterations.

This is a conversation I've had many times here. First to do it gets that resource.


10 minutes ago, cj09beira said:

AMD might need less bandwidth than Nvidia depending on how they go about doing it. For example, GCN already divides each frame into quadrants and solves one in each shader engine; they could have each die solve 1 or 2 quarters of the image (depending on the number of dies) and then only need to send the final image to a master to send to the monitor (after sewing them together). This method, unless I am missing something, would need little bandwidth at the cost of lots of VRAM.

 

edit:

By the way, those numbers are at what memory frequency? Because I bet IF can be clocked higher in the next iterations.

Interestingly, MCM and the rise of ray tracing might lead to the end of post-processing, which would eliminate a lot of the issues MCM would have if you split the rendering of frames into independent zones. The problem with not rendering a frame in full and instead stitching is making sure things like shadow maps, blur, bloom, etc. actually look the same across the frame and line up physically. The gaming experience wouldn't be great if the shadow of a tree was a different shade halfway across and shifted to the right by 2 pixels.
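A tiny illustration of that stitching problem: a 1-D "frame" run through a blur as one piece vs. as two independent halves. At the seam each half lacks the other's neighbouring pixels, so the results disagree, which on screen shows up as a visible line. This is a toy model of the general effect, not any specific GPU's post-processing pipeline.

```python
# Post-processing per zone vs. per whole frame: a 3-tap box blur over a
# 1-D row of pixel values, clamping at the edges.

def box_blur(row):
    out = []
    for i in range(len(row)):
        left = row[max(i - 1, 0)]
        right = row[min(i + 1, len(row) - 1)]
        out.append((left + row[i] + right) / 3)
    return out

row = [0, 0, 0, 0, 9, 9, 9, 9]   # hard edge right at the split point

whole = box_blur(row)                            # one die blurs the whole row
halves = box_blur(row[:4]) + box_blur(row[4:])   # two dies blur independently

# Whole-frame blur smooths the edge across the seam; independent halves
# each clamp at the seam, so the blur disappears exactly at the split.
print(whole[3], whole[4])    # 3.0 6.0
print(halves[3], halves[4])  # 0.0 9.0
```

Same input, same filter, different answers at the seam: that mismatch is exactly the kind of inconsistency the post describes with shadows and bloom.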

