
The Great Grand-Daddy of your Next GPU

Multi-GPU setups like SLI and Crossfire once ruled the PC gaming scene, but micro stuttering and development complexity have made them obsolete. Why, then, are we boldly predicting that the future of gaming is multi-GPU?
 

 

Ross Tregemba's YouTube channel: https://www.youtube.com/@gtastuntcrew302

Emily @ LINUS MEDIA GROUP                                  

congratulations on breaking absolutely zero stereotypes - @cs_deathmatch


Boutta make the 4090 cry in its 8 gpu glory 

Message me on discord (bread8669) for more help 

 

Current parts list

CPU: R5 5600 CPU Cooler: Stock

Mobo: Asrock B550M-ITX/ac

RAM: Vengeance LPX 2x8GB 3200MHz CL16

SSD: P5 Plus 500GB Secondary SSD: Kingston A400 960GB

GPU: MSI RTX 3060 Gaming X

Fans: 1x Noctua NF-P12 Redux, 1x Arctic P12, 1x Corsair LL120

PSU: NZXT SP-650M SFX-L PSU from H1

Monitor: Samsung WQHD 34 inch and 43 inch TV

Mouse: Logitech G203

Keyboard: Rii membrane keyboard
Damn this space can fit a 4090 (just kidding)


Apple's GPU platform is rather interesting. Even within the single-chip solutions (M1/M2 Pro/Max) you are in fact using a multi-GPU solution: each group of 8 GPU `cores` (a `core` here is the same as what NV calls an SM) acts as a GPU, with its own little mini ARM co-processor core to manage it.

As Apple uses a TBDR GPU, they are doing tile-based (or checkerboard) splitting, but the tiles are small (typically 16x16 pixels or smaller). This does not hard-assign a given tile to each GPU; rather, work sits ready to be worked on and each of the GPUs (each group of 8 cores) pulls work as and when it has capacity.
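Here is a minimal sketch of that pull model (purely illustrative, not Apple's actual scheduler; the tile size, group count, and render_tile stub are all my assumptions):

```python
import queue
import threading

TILE = 16                 # small TBDR-style tiles, e.g. 16x16 pixels
WIDTH, HEIGHT = 2560, 1440
NUM_GPU_GROUPS = 4        # hypothetical: 4 groups of 8 cores each

# Every tile that is ready to render goes into one shared queue.
tiles = queue.Queue()
for ty in range(0, HEIGHT, TILE):
    for tx in range(0, WIDTH, TILE):
        tiles.put((tx, ty))

def render_tile(tx, ty):
    pass  # stand-in: rasterize/shade the tile entirely on-chip

def gpu_group(group_id):
    # Each group pulls work as and when it has capacity, so faster
    # groups naturally end up rendering more tiles (self-balancing).
    while True:
        try:
            tx, ty = tiles.get_nowait()
        except queue.Empty:
            return
        render_tile(tx, ty)

threads = [threading.Thread(target=gpu_group, args=(g,)) for g in range(NUM_GPU_GROUPS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point is that no tile is ever pinned to a GPU up front; the balancing falls out of the pull.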

 

The use of a TBDR pipeline also makes it a lot easier to do this splitting across multiple GPUs (either on-die or across the bridge): within a render pass, devs can only access data from that tile, and those data operations are scoped to memory that is within the silicon. This programming model resolves a lot of the issues that modern pipelines hit on other multi-GPU solutions, where devs need to be much more explicit about all of this.

Running an M1 Ultra in an immediate-mode (IM/IR) fashion for a modern AAA title's engine pipeline would result in a LOT of wasted memory reads/writes and waiting for the entire screen to finish rendering. For AMD or Nvidia to pull this off and create a multi-GPU solution that is transparent to the dev/user, they would need quite a bit more GPU-to-GPU bandwidth than Apple's solution has; otherwise they would risk a lot more stalls waiting for fences etc. while rendering.


Ugh, lots of nitpicks in this video.

First off, none of the modern 4-way SLI setups from nVidia were mentioned. Granted, you can only get them from nVidia directly in their DGX workstations, which use Quadro-class graphics, but they are similar to that 8-way Voodoo card in that they incorporate a chip on the bridge connector. The NVLink chip on the bridge is what permits standard desktop Quadros, which are normally limited to 2-way SLI, to scale higher via this fanout switch chip. Some who could get their hands on this DGX bridge board could build their own 4-way Quadro setup, and conceptually 8-way SLI is still possible due to the number of NVLink buses supported on the NVLink chip; however, nVidia has kept 8-GPU and higher systems isolated to the data center and their mezzanine-style carrier cards. Since NVLink is being leveraged, it also permits memory sharing and aggregation, which relates to another nitpick that I'll get to later on.

 

The second nitpick is that the video doesn't dive deep into the main challenge of leveraging multiple GPUs, which is simply load balancing. Splitting a frame up evenly in terms of output pixel count doesn't mean that the same amount of work needs to be performed in each region. With an uneven load across the GPUs, performance is limited by how long the GPU with the most work takes to finish, a classic bottleneck scenario.
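To put toy numbers on that (my own example, not from the video): split a frame 50/50 by pixels between two GPUs, but let one half contain 70% of the shading work.

```python
# One GPU renders the whole frame in 10 ms (illustrative figure).
single_gpu_ms = 10.0

# 50/50 pixel split, but the work lands 70/30 across the halves.
gpu_a_ms = 0.70 * single_gpu_ms   # 7.0 ms
gpu_b_ms = 0.30 * single_gpu_ms   # 3.0 ms

# The frame is only done when the slowest GPU finishes.
frame_ms = max(gpu_a_ms, gpu_b_ms)
print(f"speedup: {single_gpu_ms / frame_ms:.2f}x")  # ~1.43x, not 2.00x
```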

 

Third nitpick is that SLI's actual interleaving nature is never shown on screen. 3dfx figured out early on that having each GPU work on a different scan line was simple and relatively efficient for how 3D APIs worked back then. As more complex rendering techniques were developed, this technique no longer scaled simply: shaders would reference data from the pixels on the previous scan line.
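The assignment itself was trivially cheap, which is why it worked so well at the time. A toy sketch of the interleave and of the failure mode above (my illustration, not 3dfx code):

```python
NUM_GPUS = 2  # e.g. two Voodoo2 boards in SLI

def owner_of_scanline(y: int) -> int:
    # Scan-Line Interleave: even lines to GPU 0, odd lines to GPU 1.
    return y % NUM_GPUS

# The modern problem: a shader sampling the line above needs pixels
# owned by the *other* GPU, forcing a copy or a stall.
assert owner_of_scanline(100) != owner_of_scanline(101)
```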


Fourth nitpick is that split frame rendering was demonstrated as splitting the screen into quads, which isn't how it normally worked. Rather, the splits were horizontal bands whose heights varied to load balance across the GPUs. The reason is that each GPU would be responsible for full display scan lines. Splitting mid scan line was often glitchy visually without an additional aggregation buffer, and that extra buffer was not optimal due to the small memory sizes of GPUs at the time and the lag it would introduce. A quad split was not impossible, but it was not the norm.
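A rough sketch of how those band heights could be rebalanced (my own illustration of the idea, with made-up step size and timings): nudge the split line each frame toward whichever GPU finished first.

```python
HEIGHT = 1080

def rebalance(split_y: int, top_ms: float, bottom_ms: float, step: int = 8) -> int:
    # If the top band took longer last frame, shrink it; otherwise grow it.
    # Moving in whole scan lines keeps each GPU on full display lines.
    if top_ms > bottom_ms:
        split_y -= step
    elif bottom_ms > top_ms:
        split_y += step
    return max(step, min(HEIGHT - step, split_y))

split = HEIGHT // 2
split = rebalance(split, top_ms=9.1, bottom_ms=6.4)  # top half was heavy
print(split)  # 532: the top GPU renders fewer lines next frame
```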

 

Fifth nitpick is that different multi-GPU techniques can be used in tandem when there are 4 or more GPUs in a system; AFR of SFR was used in some games. Not really a nitpick but more a piece of trivia: AFR of SFR came to be because DirectX had a maximum of 8 frame buffers in flight at once. That figure is a bit deceptive, as one buffer is being output to the screen while the next is being rendered, which happens on a per-GPU basis with that API. This effectively limited AFR under DirectX to a 4-way maximum, hence why AFR of SFR was necessary to go beyond 4-way setups or when triple buffering was being leveraged. DirectX 10 effectively dropped SFR support, which capped GPU support on the consumer side at 4-way. I haven't kept up, but Vulkan and DX12 should be able to bring these techniques back; support rests on the game/application developer though, not the hardware manufacturer/API developer.
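For anyone who hasn't seen AFR of SFR spelled out: pair up the GPUs, alternate whole frames between the pairs (AFR), and split each frame within a pair (SFR). A hypothetical mapping for 4 GPUs:

```python
NUM_PAIRS = 2  # 4 GPUs = 2 AFR groups of 2 SFR GPUs each

def assign(frame: int) -> dict:
    pair = frame % NUM_PAIRS        # AFR: alternate frames between pairs
    return {"top": pair * 2,        # SFR: one GPU per half of the frame
            "bottom": pair * 2 + 1}

for f in range(4):
    print(f, assign(f))
# 0 {'top': 0, 'bottom': 1}
# 1 {'top': 2, 'bottom': 3}
# 2 {'top': 0, 'bottom': 1}
# 3 {'top': 2, 'bottom': 3}
```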

 

Checkerboard rendering is interesting in several respects. First off, even single GPUs have leveraged this technique as a means to vary image quality across a frame in minor ways that are difficult to discern. Think of a frame split into 64x64 tiles: for some complex tiles, to keep the frame rate up, the GPU will instead render a 16x16 or 8x8 pixel version of that tile and brute-force scale it up to save on computational time. When multiple GPUs were involved, the tile size was important for distributing the work evenly. There is a further load balancing technique of breaking larger tiles down into even smaller ones and distributing that workload across multiple GPUs. So instead of a complex 64x64 tile getting a 16x16 rendered tile that is scaled upward, in a multi-GPU scenario its sub-tiles are split across multiple GPUs at full resolution to maintain both speed and quality. Further subdividing tiles and assigning them to specific GPUs is indeed a very computationally complex task, but modern GPUs already have accelerators in place for this kind of workload: it is how various hardware video encoders produce high-quality compressed images by subdividing portions of the screen. While never explicitly said, I suspect this is one of the reasons why modern GPUs have started to include multiple hardware video encoders.
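A sketch of that subdivision idea (my own illustration; the complexity metric and thresholds are invented): complex tiles get cut into full-resolution children that round-robin across the GPUs, while simple tiles stay whole.

```python
from itertools import cycle

def subdivide(x, y, size, complexity, min_size=16):
    """Yield (x, y, size) tiles; split complex tiles into children."""
    if complexity < 0.5 or size <= min_size:
        yield (x, y, size)          # simple or already small: render whole
    else:
        half = size // 2            # complex: cut into four children at
        for dx in (0, half):        # full resolution instead of upscaling
            for dy in (0, half):
                yield from subdivide(x + dx, y + dy, half, complexity / 2)

gpus = cycle(range(4))  # round-robin the resulting tiles across 4 GPUs
for (x, y, size) in subdivide(0, 0, 64, complexity=1.0):
    print(f"GPU {next(gpus)} renders {size}x{size} tile at ({x},{y})")
```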

 

One technique not mentioned is using multiple monitors, where each display can be driven/rendered by a different GPU. While not load balanced, it is pretty straightforward to implement. Both AMD and nVidia provide additional hardware to synchronize refresh rates and output across the monitors as well as the GPUs.

 

The original CrossFire with the dongle was mostly AFR, as there was a video output switch on the master card. The master card decided which GPU was sending output to the monitor, mostly by detecting V-blank signals. The chip could conceptually switch mid-frame at the end of a scan line, but ATI never got this to work reliably, so they focused on AFR. (Note: later implementations of CrossFire that used internal ribbon cables could switch on a per-scanline basis, making this an issue only for ATI's early implementations.)

 

In the early days of multi-GPU, video memory was simply mirrored across the GPUs. This wasn't emphasized in the video but was implied in the AFR scenarios, since each GPU does the same work at a different slice in time. Modern GPUs that leverage NVLink from nVidia or Infinity Fabric links from AMD can actually aggregate memory spaces. They also permit regions dedicated to each GPU while keeping a portion mirrored across the GPUs. For example, two modern Quadros with 24 GB of memory on board could provide 24 GB (full mirroring), 36 GB (12 GB dedicated on each card with 12 GB mirrored), or 48 GB (fully dedicated) for an application to use. That flexibility is great to have.
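The arithmetic behind those configurations (assuming the same two hypothetical 24 GB cards):

```python
def visible_memory(per_card_gb, num_cards, mirrored_gb):
    # Mirrored memory exists once from the app's point of view; the rest
    # is dedicated per card and aggregates across the NVLink/IF link.
    dedicated = per_card_gb - mirrored_gb
    return mirrored_gb + dedicated * num_cards

print(visible_memory(24, 2, mirrored_gb=24))  # 24 GB: full mirroring
print(visible_memory(24, 2, mirrored_gb=12))  # 36 GB: 12 mirrored + 2x12 dedicated
print(visible_memory(24, 2, mirrored_gb=0))   # 48 GB: fully dedicated
```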


Incidentally, multi-GPU didn't completely disappear from the computing scene: it is still very much alive and well in the server/HPC space (see AMD's MI250X or Intel's Ponte Vecchio), as those workloads tend to scale well with more GPUs. It may not be long before multi-GPU designs become common among gaming GPUs too, and it already seems to be sort of happening, starting with the chiplet-based 7900XT(X).

 

 


Modern consoles, from the PS4 Pro on up, use checkerboard rendering.

MSI X399 SLI Plus | AMD Threadripper 2990WX all-core 3GHz lock | Thermaltake Floe Riing 360 | EVGA 2080, Zotac 2080 | G.Skill Ripjaws 128GB 3000MHz | Corsair RM1200i | 150TB | Asus TUF Gaming mid tower | 10Gb NIC


I'd love to see DOSBox-X support emulating this 8-way Voodoo SLI setup. Hell, I'd love to see its dynarec support AArch64. I'd love to avoid a real Wintel retro rig if I can; I'd rather save the space for more exotic hardware. That Voodoo brick is a little too exotic to obtain, though, so I'd love to just be able to emulate it.


6 hours ago, doubleflower said:

Incidentally, multi-GPU didn't completely disappear from the computing scene: it is still very much alive and well in the server/HPC space (see AMD's MI250X or Intel's Ponte Vecchio), as those workloads tend to scale well with more GPUs.

HPC has always been about weird systems, though. Xeon Phi wasn't too uncommon in HPC between 2012 and 2022, with Cori at NERSC being the largest Phi deployment.

Good luck, Have fun, Build PC, and have a last gen console for use once a year. I should answer most of the time between 9 and 3 PST.

NightHawk 3.0: R7 5700x @, B550A vision D, H105, 2x32gb Oloy 3600, Sapphire RX 6700XT  Nitro+, Corsair RM750X, 500 gb 850 evo, 2tb rocket and 5tb Toshiba x300, 2x 6TB WD Black W10 all in a 750D airflow.
GF PC: (nighthawk 2.0): R7 2700x, B450m vision D, 4x8gb Geli 2933, Strix GTX970, CX650M RGB, Obsidian 350D

Skunkworks: R5 3500U, 16gb, 500gb Adata XPG 6000 lite, Vega 8. HP probook G455R G6 Ubuntu 20. LTS

Condor (MC server): 6600K, z170m plus, 16gb corsair vengeance LPX, samsung 750 evo, EVGA BR 450.

Spirit (NAS): ASUS Z9PR-D12, 2x E5-2620 v2, 8x4GB, 24x 3TB HDD, F80 800GB cache, TrueNAS, 2x 12-disk RAID-Z3 striped

PSU Tier List      Motherboard Tier List     SSD Tier List     How to get PC parts cheap    HP probook 445R G6 review

 

"Stupidity is like trying to find a limit of a constant. You are never truly smart in something, just less stupid."

Camera Gear: X-S10, 16-80 F4, 60D, 24-105 F4, 50mm F1.4, Helios44-m, 2 Cos-11D lavs


Hey @linus - I watched this yesterday and was like, OK, cool... but then as I was testing my PC at home I realized that FurMark was showing CrossFire as enabled and working, even with the latest AMD Radeon drivers/software installed on my Win 10 PC... WHAT??? I thought they disabled it a LONG time ago. Well, apparently my old 8600K and old Z320XP SLI board from Gigabyte are still pushing frames, even with heterogeneous (dissimilar) GPUs, if CrossFire is enabled in the motherboard BIOS... wow! Five years after they killed CrossFire, it is back? Anyone else seeing this? I would love to post a screenshot: 200+ FPS on an RX 580 with an RX 6700 installed in this old beast, not even really overclocked. I have just been keeping it alive because I am too cheap to invest in a new CPU and hardware when I can push this machine to its limits and still be stable at just shy of 5 GHz... the best gaming and work PC I have ever owned just keeps surprising me. Yeah, I know, I need to play with newer stuff... but why? Point is, CrossFire is still working on my machine for some reason, and it seems to really help when OpenGL is used correctly.

I am also wondering when more devs will spend some time porting all that Python AI locked to tensor hardware so it works on AMD too. There are several projects out there for Stable Diffusion on AMD via DirectML, but I'm not sure how efficient that would be if we are talking about using decade-old tech to make use of CrossFire-connected GPUs... talk about a way to speed things up for home users who don't have access to A100s or better deep-learning GPUs. Now I want a speed-test showdown for machine learning and games: multiple previous-gen AMD cards in CrossFire vs one latest-gen Nvidia card, with a cost comparison to see whether there's value there... who knows... seems to me that a couple of RDNA2 cards could outperform at half the cost of one latest-gen card... maybe... SLI and CrossFire FTW? Or just another fad to debunk?

BTW I got this to 300 FPS by putting FurMark on the correct monitor lol.

Screenshot 2023-05-15 235649.png

