
The Great Grand-Daddy of your Next GPU

Multi-GPU setups like SLI and Crossfire once ruled the PC gaming scene, but micro stuttering and development complexity have made them obsolete. Why, then, are we boldly predicting that the future of gaming is multi-GPU?
 

 

Ross Tregemba's YouTube channel: https://www.youtube.com/@gtastuntcrew302

Emily @ LINUS MEDIA GROUP                                  

congratulations on breaking absolutely zero stereotypes - @cs_deathmatch


Boutta make the 4090 cry in its 8 gpu glory 

Message me on discord (bread8669) for more help 

 

Current parts list

CPU: R5 5600 CPU Cooler: Stock

Mobo: Asrock B550M-ITX/ac

RAM: Vengeance LPX 2x8GB 3200MHz CL16

SSD: P5 Plus 500GB Secondary SSD: Kingston A400 960GB

GPU: MSI RTX 3060 Gaming X

Fans: 1x Noctua NF-P12 Redux, 1x Arctic P12, 1x Corsair LL120

PSU: NZXT SP-650M SFX-L PSU from H1

Monitor: Samsung WQHD 34 inch and 43 inch TV

Mouse: Logitech G203

Keyboard: Rii membrane keyboard
Damn this space can fit a 4090 (just kidding)


Apple's GPU platform is rather interesting. Even within the single-chip solutions (M1/M2 Pro/Max) you are in fact using a multi-GPU solution: each group of 8 GPU `cores` (a `core` here is the same as what NV calls an SM) acts as a GPU, with its own little mini ARM co-processor core to manage it.

As Apple uses a TBDR GPU, they are doing tile-based (or checkerboard) splitting, but the tiles are small (typically 16x16 pixels or smaller). This does not hard-assign a given tile to each GPU; rather, work sits ready to be worked on and each of the GPUs (each group of 8 cores) pulls work as and when it has capacity.
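Here is a minimal sketch of that pull model (purely illustrative, not Apple's actual scheduler; the tile size, group count, and render_tile stub are all my assumptions):

```python
import queue
import threading

TILE = 16                 # small TBDR-style tiles, e.g. 16x16 pixels
WIDTH, HEIGHT = 2560, 1440
NUM_GPU_GROUPS = 4        # hypothetical: 4 groups of 8 cores each

# Every tile that is ready to render goes into one shared queue.
tiles = queue.Queue()
for ty in range(0, HEIGHT, TILE):
    for tx in range(0, WIDTH, TILE):
        tiles.put((tx, ty))

def render_tile(tx, ty):
    pass  # stand-in: rasterize/shade the tile entirely on-chip

def gpu_group(group_id):
    # Each group pulls work as and when it has capacity, so faster
    # groups naturally end up rendering more tiles (self-balancing).
    while True:
        try:
            tx, ty = tiles.get_nowait()
        except queue.Empty:
            return
        render_tile(tx, ty)

threads = [threading.Thread(target=gpu_group, args=(g,)) for g in range(NUM_GPU_GROUPS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point is that no tile is ever pinned to a GPU up front; the balancing falls out of the pull.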

 

The use of a TBDR pipeline also makes it a lot easier to do this splitting across multiple GPUs (either on-die or across the bridge): within a render pass, devs can only access data from that tile, and those data operations are scoped to memory that is within the silicon. This programming model resolves a lot of the issues that modern pipelines hit on other multi-GPU solutions, where devs need to be much more explicit about all of this.

Running an M1 Ultra in an immediate-mode (IM/IR) fashion for a modern AAA title's engine pipeline would result in a LOT of wasted memory reads/writes and waiting for the entire screen to finish rendering. For AMD or Nvidia to pull this off and create a multi-GPU solution that is transparent to the dev/user, they would need quite a bit more GPU-to-GPU bandwidth than Apple's solution has; otherwise they would risk a lot more stalls waiting for fences etc. while rendering.


Ugh, lots of nitpicks in this video.

First off, none of the modern 4-way SLI setups from nVidia were mentioned. Granted, you can only get them from nVidia directly in their DGX workstations, which use Quadro-class graphics, but they are similar to that 8-way Voodoo card in that they incorporate a chip on the bridge connector. The NVLink chip on the bridge is what permits standard desktop Quadros, which are normally limited to 2-way SLI, to scale higher via this fanout switch chip. Some who could get their hands on this DGX bridge board could build their own 4-way Quadro setup, and conceptually 8-way SLI is still possible due to the number of NVLink buses supported on the NVLink chip; however, nVidia has kept 8-GPU and higher systems isolated to the data center and their mezzanine-style carrier cards. Since NVLink is being leveraged, it also permits memory sharing and aggregation, which relates to another nitpick that I'll get to later on.

 

The second nitpick is that the video doesn't dive deep into the main challenge of leveraging multiple GPUs, which is simply load balancing. Splitting a frame up evenly in terms of output pixel count doesn't mean that the same amount of work needs to be performed in each region. With an uneven load across the GPUs, performance is limited by how long the GPU with the most work takes to finish, a classic bottleneck scenario.
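To put toy numbers on that (my own example, not from the video): split a frame 50/50 by pixels between two GPUs, but let one half contain 70% of the shading work.

```python
# One GPU renders the whole frame in 10 ms (illustrative figure).
single_gpu_ms = 10.0

# 50/50 pixel split, but the work lands 70/30 across the halves.
gpu_a_ms = 0.70 * single_gpu_ms   # 7.0 ms
gpu_b_ms = 0.30 * single_gpu_ms   # 3.0 ms

# The frame is only done when the slowest GPU finishes.
frame_ms = max(gpu_a_ms, gpu_b_ms)
print(f"speedup: {single_gpu_ms / frame_ms:.2f}x")  # ~1.43x, not 2.00x
```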

 

Third nitpick is that SLI's actual interleaving nature is never shown on screen. 3dfx figured out early on that having each GPU work on a different scan line was simple and relatively efficient for how 3D APIs worked back then. As more complex rendering techniques were developed, this technique no longer scaled simply: shaders would reference data from the pixels on the previous scan line.
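The assignment itself was trivially cheap, which is why it worked so well at the time. A toy sketch of the interleave and of the failure mode above (my illustration, not 3dfx code):

```python
NUM_GPUS = 2  # e.g. two Voodoo2 boards in SLI

def owner_of_scanline(y: int) -> int:
    # Scan-Line Interleave: even lines to GPU 0, odd lines to GPU 1.
    return y % NUM_GPUS

# The modern problem: a shader sampling the line above needs pixels
# owned by the *other* GPU, forcing a copy or a stall.
assert owner_of_scanline(100) != owner_of_scanline(101)
```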


Fourth nitpick is that split frame rendering was demonstrated as splitting the screen into quads, which isn't how it normally worked. Rather, the splits were horizontal bands whose heights varied to load balance across the GPUs. The reason is that each GPU would be responsible for full display scan lines. Splitting mid scan line was often glitchy visually without an additional aggregation buffer, and that extra buffer was not optimal due to the small memory sizes of GPUs at the time and the lag it would introduce. A quad split was not impossible, but it was not the norm.
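A rough sketch of how those band heights could be rebalanced (my own illustration of the idea, with made-up step size and timings): nudge the split line each frame toward whichever GPU finished first.

```python
HEIGHT = 1080

def rebalance(split_y: int, top_ms: float, bottom_ms: float, step: int = 8) -> int:
    # If the top band took longer last frame, shrink it; otherwise grow it.
    # Moving in whole scan lines keeps each GPU on full display lines.
    if top_ms > bottom_ms:
        split_y -= step
    elif bottom_ms > top_ms:
        split_y += step
    return max(step, min(HEIGHT - step, split_y))

split = HEIGHT // 2
split = rebalance(split, top_ms=9.1, bottom_ms=6.4)  # top half was heavy
print(split)  # 532: the top GPU renders fewer lines next frame
```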

 

Fifth nitpick is that different multi-GPU techniques can be used in tandem when there are 4 or more GPUs in a system; AFR of SFR was used in some games. Not really a nitpick but more a piece of trivia: AFR of SFR came to be because DirectX had a maximum of 8 frame buffers in flight at once. That figure is a bit deceptive, as one buffer is being output to the screen while the next is being rendered, which happens on a per-GPU basis with that API. This effectively limited AFR under DirectX to a 4-way maximum, hence why AFR of SFR was necessary to go beyond 4-way setups or when triple buffering was being leveraged. DirectX 10 effectively dropped SFR support, which capped GPU support on the consumer side at 4-way. I haven't kept up, but Vulkan and DX12 should be able to bring these techniques back; support rests on the game/application developer though, not the hardware manufacturer/API developer.
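For anyone who hasn't seen AFR of SFR spelled out: pair up the GPUs, alternate whole frames between the pairs (AFR), and split each frame within a pair (SFR). A hypothetical mapping for 4 GPUs:

```python
NUM_PAIRS = 2  # 4 GPUs = 2 AFR groups of 2 SFR GPUs each

def assign(frame: int) -> dict:
    pair = frame % NUM_PAIRS        # AFR: alternate frames between pairs
    return {"top": pair * 2,        # SFR: one GPU per half of the frame
            "bottom": pair * 2 + 1}

for f in range(4):
    print(f, assign(f))
# 0 {'top': 0, 'bottom': 1}
# 1 {'top': 2, 'bottom': 3}
# 2 {'top': 0, 'bottom': 1}
# 3 {'top': 2, 'bottom': 3}
```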

 

Checkerboard rendering is interesting in several respects. First off, even single GPUs have leveraged this technique as a means to vary image quality across a frame in minor ways that are difficult to discern. Think of a frame split into 64x64 tiles: for some complex tiles, to keep the frame rate up, the GPU will instead render a 16x16 or 8x8 pixel version of that tile and brute-force scale it up to save on computational time. When multiple GPUs were involved, the tile size was important for distributing the work evenly. There is a further load balancing technique of breaking larger tiles down into even smaller ones and distributing that workload across multiple GPUs. So instead of a complex 64x64 tile getting a 16x16 rendered tile that is scaled upward, in a multi-GPU scenario its sub-tiles are split across multiple GPUs at full resolution to maintain both speed and quality. Further subdividing tiles and assigning them to specific GPUs is indeed a very computationally complex task, but modern GPUs already have accelerators in place for this kind of workload: it is how various hardware video encoders produce high-quality compressed images by subdividing portions of the screen. While never explicitly said, I suspect this is one of the reasons why modern GPUs have started to include multiple hardware video encoders.
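A sketch of that subdivision idea (my own illustration; the complexity metric and thresholds are invented): complex tiles get cut into full-resolution children that round-robin across the GPUs, while simple tiles stay whole.

```python
from itertools import cycle

def subdivide(x, y, size, complexity, min_size=16):
    """Yield (x, y, size) tiles; split complex tiles into children."""
    if complexity < 0.5 or size <= min_size:
        yield (x, y, size)          # simple or already small: render whole
    else:
        half = size // 2            # complex: cut into four children at
        for dx in (0, half):        # full resolution instead of upscaling
            for dy in (0, half):
                yield from subdivide(x + dx, y + dy, half, complexity / 2)

gpus = cycle(range(4))  # round-robin the resulting tiles across 4 GPUs
for (x, y, size) in subdivide(0, 0, 64, complexity=1.0):
    print(f"GPU {next(gpus)} renders {size}x{size} tile at ({x},{y})")
```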

 

One technique not mentioned is using multiple monitors, where each display can be driven/rendered by a different GPU. While not load balanced, it is pretty straightforward to implement. Both AMD and nVidia provide additional hardware to synchronize refresh rates and output across the monitors as well as the GPUs.

 

The original CrossFire with the dongle was mostly AFR, as there was a video output switch on the master card. The master card decided which GPU was sending output to the monitor, mostly by detecting V-blank signals. The chip could conceptually switch mid-frame at the end of a scan line, but ATI never got this to work reliably, so they focused on AFR. (Note: later implementations of CrossFire that used internal ribbon cables could switch on a per-scanline basis, making this an issue only for ATI's early implementations.)

 

In the early days of multi-GPU, video memory was simply mirrored across the GPUs. This wasn't emphasized in the video but was implied in the AFR scenarios, since each GPU does the same work at a different slice in time. Modern GPUs that leverage NVLink from nVidia or Infinity Fabric links from AMD can actually aggregate memory spaces. They also permit regions dedicated to each GPU while keeping a portion mirrored across the GPUs. For example, two modern Quadros with 24 GB of memory on board could provide 24 GB (full mirroring), 36 GB (12 GB dedicated on each card with 12 GB mirrored), or 48 GB (fully dedicated) for an application to use. That flexibility is great to have.
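The arithmetic behind those configurations (assuming the same two hypothetical 24 GB cards):

```python
def visible_memory(per_card_gb, num_cards, mirrored_gb):
    # Mirrored memory exists once from the app's point of view; the rest
    # is dedicated per card and aggregates across the NVLink/IF link.
    dedicated = per_card_gb - mirrored_gb
    return mirrored_gb + dedicated * num_cards

print(visible_memory(24, 2, mirrored_gb=24))  # 24 GB: full mirroring
print(visible_memory(24, 2, mirrored_gb=12))  # 36 GB: 12 mirrored + 2x12 dedicated
print(visible_memory(24, 2, mirrored_gb=0))   # 48 GB: fully dedicated
```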


Incidentally, multi-GPU didn't completely disappear from the computing scene: it is still very much alive and well in the server/HPC space (see AMD's MI250X or Intel's Ponte Vecchio), as those workloads tend to scale well with more GPUs. It may not be long before multi-GPU designs become common among gaming GPUs too, and it already seems to be sort of happening, starting with the chiplet-based 7900XT(X).

 

 


Modern consoles, from the PS4 Pro on up, use checkerboard rendering.

MSI X399 SLI Plus | AMD Threadripper 2990WX all-core 3GHz lock | Thermaltake Floe Riing 360 | EVGA 2080, Zotac 2080 | G.Skill Ripjaws 128GB 3000MHz | Corsair RM1200i | 150TB | Asus TUF Gaming mid tower | 10Gb NIC


I'd love to see DOSBox-X support emulating this 8-way Voodoo SLI setup. Hell, I'd love to see its dynarec support AArch64. I'd love to avoid a real Wintel retro rig if I can; I'd rather save the space for more exotic hardware. That Voodoo brick is a little too exotic to obtain, though, so I'd love to just be able to emulate it.


6 hours ago, doubleflower said:

Incidentally, multi-GPU didn't completely disappear from the computing scene: it is still very much alive and well in the server/HPC space (see AMD's MI250X or Intel's Ponte Vecchio), as those workloads tend to scale well with more GPUs.

HPC has always been about weird systems, though. Xeon Phi wasn't too uncommon in HPC between 2012 and 2022, with Cori at NERSC being the largest Phi deployment.

Good luck, Have fun, Build PC, and have a last gen console for use once a year. I should answer most of the time between 9 and 3 PST.

NightHawk 3.0: R7 5700x @, B550A vision D, H105, 2x32gb Oloy 3600, Sapphire RX 6700XT  Nitro+, Corsair RM750X, 500 gb 850 evo, 2tb rocket and 5tb Toshiba x300, 2x 6TB WD Black W10 all in a 750D airflow.
GF PC: (nighthawk 2.0): R7 2700x, B450m vision D, 4x8gb Geli 2933, Strix GTX970, CX650M RGB, Obsidian 350D

Skunkworks: R5 3500U, 16gb, 500gb Adata XPG 6000 lite, Vega 8. HP probook G455R G6 Ubuntu 20. LTS

Condor (MC server): 6600K, z170m plus, 16gb corsair vengeance LPX, samsung 750 evo, EVGA BR 450.

Spirit (NAS): ASUS Z9PR-D12, 2x E5-2620 v2, 8x4GB, 24x 3TB HDD, F80 800GB cache, TrueNAS, 2x 12-disk RAID-Z3 striped

PSU Tier List      Motherboard Tier List     SSD Tier List     How to get PC parts cheap    HP probook 445R G6 review

 

"Stupidity is like trying to find a limit of a constant. You are never truly smart in something, just less stupid."

Camera Gear: X-S10, 16-80 F4, 60D, 24-105 F4, 50mm F1.4, Helios44-m, 2 Cos-11D lavs


Hey @linus - I watched this yesterday and was like, OK, cool... but then as I was testing my PC at home I realized that FurMark was showing CrossFire as enabled and working, even with the latest AMD Radeon drivers/software installed on my Win 10 PC... WHAT??? I thought they disabled it a LONG time ago. Well, apparently my old 8600K and old Z320XP SLI board from Gigabyte are still pushing frames, even with heterogeneous (dissimilar) GPUs, if CrossFire is enabled in the motherboard BIOS... wow! Five years after they killed CrossFire, it is back? Anyone else seeing this? I would love to post a screenshot: 200+ FPS on an RX 580 with an RX 6700 installed in this old beast, not even really overclocked. I have just been keeping it alive because I am too cheap to invest in a new CPU and hardware when I can push this machine to its limits and still be stable at just shy of 5 GHz... the best gaming and work PC I have ever owned just keeps surprising me. Yeah, I know, I need to play with newer stuff... but why? Point is, CrossFire is still working on my machine for some reason, and it seems to really help when OpenGL is used correctly.

I am also wondering when more devs will spend some time porting all that Python AI locked to tensor hardware so it works on AMD too. There are several projects out there for Stable Diffusion on AMD via DirectML, but I'm not sure how efficient that would be if we are talking about using decade-old tech to make use of CrossFire-connected GPUs... talk about a way to speed things up for home users who don't have access to A100s or better deep-learning GPUs. Now I want a speed-test showdown for machine learning and games: multiple previous-gen AMD cards in CrossFire vs one latest-gen Nvidia card, with a cost comparison to see whether there's value there... who knows... seems to me that a couple of RDNA2 cards could outperform at half the cost of one latest-gen card... maybe... SLI and CrossFire FTW? Or just another fad to debunk?

BTW I got this to 300 FPS by putting FurMark on the correct monitor lol.

Screenshot 2023-05-15 235649.png

