
Nvidia shows off RTX GPU on ARM

28 minutes ago, leadeater said:

Doubt it, considering the M1's CPU alone can use up to 27W on the Firestorm and Icestorm cores. In mixed CPU/GPU workloads the total chip power is actually lower, around 17W, but that's running a rather inefficient although real-world realistic (ish) workload. Something not as demanding as GTA V, as far as I know (could be wrong).

 

Either way, 10W dedicated to a GPU is not the same thing as shared SoC power, not even 15W (CPU) + 10W (GPU) versus a 25W SoC. A 5800U iGPU would almost never get a 10W allocation in a mixed workload. Bump that TDP up and it'll do 80 FPS, so an RDNA2 GPU core rather than a modified Vega would at least match the MX350.

An example would be a Snapdragon 888: it only uses 5 watts, so 15W total for both if paired together.


10 minutes ago, SlidewaysZ said:

An example would be a Snapdragon 888: it only uses 5 watts, so 15W total for both if paired together.

Sure, but a Snapdragon 888 would heavily limit game performance in something like GTA V, making a high-end GPU pairing pointless. You really want something on the level of the M1, or basically don't even bother.

 

Edit:

P.S. Spec I'm reading for the 888 is 10W.


1 hour ago, leadeater said:

Sure, but a Snapdragon 888 would heavily limit game performance in something like GTA V, making a high-end GPU pairing pointless. You really want something on the level of the M1, or basically don't even bother.

 

Edit:

P.S. Spec I'm reading for the 888 is 10W.

I feel like if something like an 888 were that weak, Samsung wouldn't be trying to pair their Exynos with the same GPU that's in the Steam Deck. It might bottleneck the MX350 some, but I doubt it would be that noticeable.

 

Also what I saw said the 888 is 5 watts

 

Either way I'm excited to see what happens with handhelds and mobile phones.

 

(screenshot attached)

 


5 hours ago, SlidewaysZ said:

Also what I saw said the 888 is 5 watts

Other material says 10W too, I suspect there is a 5W variant and a 10W variant.

 

5 hours ago, SlidewaysZ said:

I feel like if something like an 888 were that weak, Samsung wouldn't be trying to pair their Exynos with the same GPU that's in the Steam Deck. It might bottleneck the MX350 some, but I doubt it would be that noticeable.

The Exynos 2200 that is coming soon is quite a bit faster than the 888, and even then my point still stands. Watch that marketing video again and notice how devoid the game footage is of actual in-game elements, i.e. actual gameplay. That is why it's pointless: you can pretty up the game a decent bit, but without a supporting CPU you're still pushing garbage mobile games with no real gameplay.

 

You can either be bored at 720p or be bored at 1440p, not much really changes.


56 minutes ago, leadeater said:

You can either be bored at 720p or be bored at 1440p, not much really changes.

Why does that sound like one of those sayings on stickers people glue to their cars?


3 minutes ago, valdyrgramr said:

Get back to me when major retailers have fairly priced RTX GPUs on hand.

Give it 5 or 10 years and you can hoover them RTX 3000 series cards up for cheap! 😛


Just a thought... the only real reason PCs dominate desktop gaming is the plethora of GPUs available. AMD is already working on their own ARM-based chip and oh, they make GPUs as well. Nvidia already has a working tech demo of its RTX GPUs running on ARM. Apple showed us that ARM-based chips can match the performance of all but the highest-end x86-64 desktop CPUs (though Apple still sucks at gaming). Honestly... there is not much left to stop the transition from x86-64 to ARM.


23 hours ago, TheSage79 said:

Just a thought... the only real reason PCs dominate desktop gaming is the plethora of GPUs available. AMD is already working on their own ARM-based chip and oh, they make GPUs as well. Nvidia already has a working tech demo of its RTX GPUs running on ARM. Apple showed us that ARM-based chips can match the performance of all but the highest-end x86-64 desktop CPUs (though Apple still sucks at gaming). Honestly... there is not much left to stop the transition from x86-64 to ARM.

Absolutely. Worth noting we have only seen Apple's ultra-low-power (fanless) SoC; what is expected for the 16-inch MBP will likely outperform almost all (if not all) laptops, in both CPU and GPU workloads. Notably, Apple's GPU IP is almost 2x the perf/W of Nvidia's; in many ways their GPU team is doing better than their CPU team, it's just that we have not seen a GPU with more than 8 cores (an Apple GPU core is roughly equivalent to an Nvidia Streaming Multiprocessor (SM)).

Apple runs its cores at a lower clock rate than Nvidia, but the key to the perf/W advantage comes from the use of a TBDR pipeline and the large VRAM bandwidth savings this provides (memory that can provide high bandwidth under load can draw a LOT of power; it is estimated that the 3090's GDDR6X draws over 55W, and that does not include the power draw of the memory controller within the GPU die). It is very unlikely Nvidia or AMD move to a TBDR pipeline, as making proper use of it requires game devs to actively develop for it, and existing titles/game engines that assume TBIR and make use of modern TBIR blending modes basically require every single object to be broken into a separate render pass (using even more memory bandwidth than the same workload would on a TBIR GPU). To make use of a TBDR pipeline you need to go back to the whiteboard and rethink your shader pipeline, render sequencing, etc. from scratch; it's a lot of work.
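
To put very rough numbers on the framebuffer side of that bandwidth argument, here's a minimal back-of-envelope sketch (the resolution, overdraw factor and bytes-per-pixel are my own illustrative assumptions, not Apple or Nvidia figures): an immediate-mode renderer pays VRAM traffic for every overlapping fragment, while a TBDR GPU keeps each tile in on-chip memory and writes the resolved result out once.

```python
# Rough, illustrative framebuffer-traffic estimate for IMR vs TBDR.
# All numbers (resolution, overdraw, bytes per pixel) are assumptions
# chosen for the example, not measured figures.

def frame_traffic_gb(width, height, bytes_per_pixel, overdraw, fps):
    """Bytes moved to/from VRAM per second for colour + depth, in GB/s."""
    pixels = width * height
    per_frame = pixels * bytes_per_pixel * overdraw
    return per_frame * fps / 1e9

# 1440p, 60 fps, 8 bytes per pixel (RGBA8 colour + 32-bit depth/stencil)
imr  = frame_traffic_gb(2560, 1440, 8, overdraw=3.0, fps=60)  # every fragment hits VRAM
tbdr = frame_traffic_gb(2560, 1440, 8, overdraw=1.0, fps=60)  # only the resolved tile is written

print(f"IMR  framebuffer traffic: ~{imr:.1f} GB/s")
print(f"TBDR framebuffer traffic: ~{tbdr:.1f} GB/s")
```

Texture and geometry traffic are ignored entirely here; the point is only that overdraw multiplies VRAM framebuffer traffic on an immediate-mode design but stays on-chip on a TBDR one.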


On 7/22/2021 at 5:15 PM, TheSage79 said:

Honestly... there is not much left to stop the transition from x86-64 to ARM.

 

6 hours ago, hishnash said:

Absolutely

Quoting you mainly because you replied to this comment.

 

There is plenty stopping such a transition, or if you want to count a different way then one giant reason: it's called Windows. Also, Apple/Mac OS doesn't suck at gaming; it's just that the market size is so small that few want to target game releases for that OS, and sadly some games that do get Mac OS support get very little attention and run far worse compared to Windows. Civilization 6 is a good/bad example of that: it's at a minimum 2x slower on Mac OS, and it in no way should be or had to be if more time were spent making it run better.

 

So unless Windows ARM support gets significantly better than its current state, or these instances of poor game support on Mac OS/Linux improve, it actually won't matter how much performance ARM CPUs have for gamers, as they won't be a viable option for this group of people.

 

Also, be very wary of performance scaling; a design or architecture that works well at low power or in small configurations may not scale up as you might expect or want.

 

P.S. Apple's GPU arch isn't nearly as good for games as it is for application performance; even in the better-performing examples it struggles to match GPUs it was handily beating in application performance.


10 hours ago, leadeater said:

 

P.S. Apple's GPU arch isn't nearly as good for games as it is for application performance; even in the better-performing examples it struggles to match GPUs it was handily beating in application performance.

The reason for the difference between games and applications is that the majority of games/engines are optimised for TBIR GPUs, and running a TBIR code path on a TBDR GPU is a very bad idea. But games/engines that properly optimise for TBDR tend to out-perform (per TFLOP) an equivalent TBIR GPU. Because of its pipeline, TBDR has a load of optimisations in place: it is much better at discarding obscured fragments (so it skips work, as it does not render things that are obscured), and because it does not need to read/write so much data to/from VRAM it also has far fewer stalls waiting for that data to be read/written. My experience writing display shaders for both TBIR and TBDR GPUs tells me that optimising an existing TBIR rendering pipeline to run well on a TBDR GPU requires going back to a whiteboard and re-thinking it.
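
As a loose illustration of the hidden-surface-removal step described above (a toy software model in Python, nothing like real hardware): a TBDR pipeline bins geometry into screen tiles, resolves depth for every pixel in the tile first, and only then runs the expensive fragment shading for the single surviving fragment per pixel, keeping the tile in on-chip memory the whole time.

```python
# Toy model of the TBDR "defer shading until visibility is known" idea.
# Fragments here are pre-rasterised (x, y, depth, material) tuples; a real
# GPU does the binning and depth test in fixed-function hardware.

from collections import defaultdict

TILE = 32  # tile edge in pixels (assumed)

def render_tbdr(fragments, shade):
    """shade(material) is only called once per visible pixel."""
    tiles = defaultdict(dict)          # (tile_x, tile_y) -> {(x, y): (depth, material)}
    # Phase 1: bin fragments per tile and keep only the front-most one per pixel.
    for x, y, depth, material in fragments:
        tile = (x // TILE, y // TILE)
        best = tiles[tile].get((x, y))
        if best is None or depth < best[0]:
            tiles[tile][(x, y)] = (depth, material)
    # Phase 2: shade the surviving fragments; each tile would live in on-chip
    # memory and be written back to VRAM exactly once.
    framebuffer = {}
    for tile, pixels in tiles.items():
        for (x, y), (_, material) in pixels.items():
            framebuffer[(x, y)] = shade(material)
    return framebuffer

# Three overlapping fragments on one pixel: only the nearest gets shaded.
frags = [(5, 5, 0.9, "floor"), (5, 5, 0.2, "character"), (5, 5, 0.5, "prop")]
print(render_tbdr(frags, shade=lambda m: f"lit({m})"))
```

Obviously nothing like real silicon, but it shows the ordering that matters: visibility is fully resolved per tile before any expensive shading runs, and the tile only touches VRAM once at the end.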


7 hours ago, hishnash said:

Because of its pipeline, TBDR has a load of optimisations in place: it is much better at discarding obscured fragments (so it skips work, as it does not render things that are obscured), and because it does not need to read/write so much data to/from VRAM it also has far fewer stalls waiting for that data to be read/written.

This is already done on current GPUs; Nvidia introduced a much more advanced hardware implementation of that, I think in Pascal. AMD however was way later to the party on that, I don't even think Vega had it? At least when talking about Nvidia, they have done a lot of architectural and driver-level optimizations that take away many of the advantages you mention when you look at the higher-level differences between the two. The biggest difference I see is the control you get in the software stack, compared to having to rely on under-the-hood technologies Nvidia/AMD implement that you get little or no control over, and having to do a lot of optimization debugging to see the impacts as you go through the pipeline stages etc.

 

When Apple talks about the advantages of TBDR they are strictly talking about the differences in techniques and do not take into account any of the actual implementations out there; all those things are true only if no equivalent optimizations have been made. Honestly it's the only practical way to do it: having to take into account literally anything and everything everyone else has done would make a benefits write-up too complicated.

 

I kind of think that as new technologies or techniques come out, everyone in the industry tends to eventually adopt them anyway, or at least tries to. So if one way really is superior to another it'll become the standard over time. I could certainly see Nvidia and AMD working with Microsoft to make such a change in the next big release of DirectX, and then if required put in hardware-level translation assists from the old way to the new way for existing games. I see something like RDNA2's on-die cache being able to help out with that a lot.


9 hours ago, leadeater said:

This is already done on current GPUs; Nvidia introduced a much more advanced hardware implementation of that, I think in Pascal. AMD however was way later to the party on that, I don't even think Vega had it? At least when talking about Nvidia, they have done a lot of architectural and driver-level optimizations that take away many of the advantages you mention when you look at the higher-level differences between the two.

What you are referring to there is the TB part of TBIR. Neither AMD nor Nvidia support TBDR (and never have, at least not in recent history). 

 

9 hours ago, leadeater said:

I could certainly see Nvidia and AMD working with Microsoft to make such a change in the next big release of DirectX, and then if required put in hardware-level translation assists from the old way to the new way for existing games. I see something like RDNA2's on-die cache being able to help out with that a lot.

Sure, adding APIs is not the hard part (getting devs to properly adopt them is). The issue for AMD and Nvidia is that moving to a TBDR approach would mean a massive drop in performance for existing titles. Could they build a GPU that can run in either mode? Yes, one could, but that requires a lot of extra transistors.

 

The current cache AMD has in RDNA2 would need to be able to run in a non-cache mode, and would also likely need to be sub-divided within the shader engine. A top-end RDNA2 chip has only 4 shader engines, and since the on-die memory is a cache it has a memory controller that handles read/write access to ensure application A cannot read data from application B. To make the most effective use of tile memory, if they could move it to a smaller sub-section, down to a CU or dual-CU cluster, then it could run without that overhead. Maybe the overhead of the access control they have is not that high, so they could run it in a TBDR mode (would that, however, result in that entire shader engine being locked to TBDR?)


1 hour ago, hishnash said:

What you are referring to there is the TB part of TBIR. Neither AMD nor Nvidia support TBDR (and never have, at least not in recent history). 

I mean the culling advantage you talked about; Nvidia has that as well. Occlusion culling has been around since 2004-ish, but Nvidia implemented a better hardware-level approach in Pascal.
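
For anyone following along, here's the kind of thing being referred to, heavily simplified and written as CPU-side Python rather than anything resembling the actual hardware: coarse occlusion culling tests an object's screen-space bounds against a conservative depth buffer before its triangles are ever submitted, so fully hidden objects never reach the raster or fragment stages.

```python
# Simplified CPU-side occlusion culling sketch: test an object's screen-space
# bounding rectangle against a low-resolution, conservative depth buffer.
# Real engines/GPUs do this with depth pyramids or dedicated hardware.

def visible(obj_rect, obj_nearest_depth, depth_buffer):
    """obj_rect = (x0, y0, x1, y1) in coarse-buffer coordinates."""
    x0, y0, x1, y1 = obj_rect
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            # If the object could be closer than what is already stored
            # anywhere under its footprint, we must draw it.
            if obj_nearest_depth < depth_buffer[y][x]:
                return True
    return False  # fully behind existing geometry -> skip the draw call

# 4x4 coarse depth buffer: a wall at depth 0.4 covers the right half.
depth = [[1.0, 1.0, 0.4, 0.4] for _ in range(4)]
print(visible((2, 0, 3, 3), 0.7, depth))  # False: object is behind the wall
print(visible((0, 0, 1, 3), 0.7, depth))  # True: nothing in front of it there
```

Real engines usually test against a downsampled depth pyramid (Hi-Z); the hardware approaches being discussed move the same idea earlier in the pipeline and make it cheaper.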

 

TBDR does not have the specific advantage you and Apple speak about, because it's something that is already being done now. It's probably possible to do it better or more efficiently with TBDR, but how much better I'm not sure; it's definitely not a case of TBDR can do it and others cannot, though. The problem is how much better X vs Y actually is. Apple can say it has a certain advantage for a specified reason, but you must compare claims against real-world implementations to verify or quantify them, otherwise it's nothing more than marketing at worst, or a generalized technical statement about the advantages of one technique over another in its most basic form.

 

Also, Nvidia does support deferred rendering as well, I believe also since Pascal. Turing further introduced even more culling features as well, btw.

https://gamedevelopment.tutsplus.com/articles/forward-rendering-vs-deferred-rendering--gamedev-12342

https://developer.nvidia.com/blog/introduction-turing-mesh-shaders

https://www.programmersought.com/article/1982816870/ (first two)

 

When Apple says this:

Quote

In a traditional immediate-mode (IM) renderer, when a triangle is submitted to the GPU for processing, it's immediately rendered to device memory. The triangles are processed by the rasterization and fragment function stages even if they're occluded by other primitives submitted to the GPU later.

They quite literally mean "traditional", because what Nvidia and AMD actually do in the real world is not that; but that really isn't what Apple is trying to talk about with that statement. The problem is taking that statement outside of the point Apple is trying to make and relating it to what Nvidia and AMD have actually done.


5 hours ago, leadeater said:

TBDR does not have the specific advantage you and Apple speak about, because it's something that is already being done now. It's probably possible to do it better or more efficiently with TBDR, but how much better I'm not sure; it's definitely not a case of TBDR can do it and others cannot, though.

The other culling approaches, either in hardware or just by doing deferred lighting (a common pipeline these days), do a good job of avoiding extra costly fragment evaluation, but to do this they result in an increase in memory bandwidth usage.

 

 

5 hours ago, leadeater said:

Also, Nvidia does support deferred rendering as well, I believe also since Pascal. Turing further introduced even more culling features as well, btw.

 

That is very different; it's a rendering technique used on all platforms that requires multiple render passes. It is not deferred within the render pass; rather, you render out all objects writing an object ID + triangle ID into a buffer, then you run a compute-style render pass that is used to evaluate the costly fragment shaders. This is a great way of doing better culling and avoiding overdraw, and you can do it just as well on a TBDR pipeline, but it does not reduce memory bandwidth usage (in fact it increases it).
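
A rough sketch of the visibility-buffer flow described above (illustrative only; the buffer layout and the lighting function are made up for the example): the geometry pass writes only IDs per pixel, and a later compute-style pass looks the material up and runs the expensive shading once per pixel.

```python
# Toy visibility-buffer ("object ID + triangle ID") deferred flow.
# Pass 1 rasterises only IDs + depth; pass 2 shades each pixel exactly once.

def geometry_pass(fragments, width, height):
    """fragments: iterable of (x, y, depth, object_id, tri_id)."""
    vis = [[None] * width for _ in range(height)]   # (depth, obj, tri) per pixel
    for x, y, depth, obj, tri in fragments:
        cur = vis[y][x]
        if cur is None or depth < cur[0]:
            vis[y][x] = (depth, obj, tri)           # depth-tested ID write, no shading
    return vis

def shading_pass(vis, materials):
    """Compute-style pass: fetch material by object ID and light it once per pixel."""
    return [[None if p is None else f"lit({materials[p[1]]})" for p in row] for row in vis]

frags = [(0, 0, 0.8, 0, 12), (0, 0, 0.3, 1, 7)]     # two fragments fight for pixel (0, 0)
vis = geometry_pass(frags, width=2, height=1)
print(shading_pass(vis, materials={0: "stone", 1: "metal"}))  # only "metal" gets shaded
```

The extra VRAM cost described above comes from that intermediate ID + depth buffer, which every pixel writes once and the shading pass reads back.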


 


1 hour ago, hishnash said:

The other culling approaches, either in hardware or just by doing deferred lighting (a common pipeline these days), do a good job of avoiding extra costly fragment evaluation, but to do this they result in an increase in memory bandwidth usage.

Not on the later approaches, Nvidia specifically talks about the bandwidth reduction compared to the past with traditional occlusion culling.

 

Quote

The mesh shader gives developers new possibilities to avoid such bottlenecks. The new approach allows the memory to be read once and kept on-chip as opposed to previous approaches, such as compute shader-based primitive culling (see [3],[4],[5]), where index buffers of visible triangles are computed and drawn indirectly.

 

Quote

Mesh Shading Pipeline

A new, two-stage pipeline alternative supplements the classic attribute fetch, vertex, tessellation, geometry shader pipeline. This new pipeline consists of a task shader and mesh shader:

  • Task shader : a programmable unit that operates in workgroups and allows each to emit (or not) mesh shader workgroups
  • Mesh shader : a programmable unit that operates in workgroups and allows each to generate primitives

The mesh shader stage produces triangles for the rasterizer using the above-mentioned cooperative thread model internally. The task shader operates similarly to the hull shader stage of tessellation, in that it is able to dynamically generate work. However, like the mesh shader, the task shader also uses a cooperative thread mode. Its input and output are user defined instead of having to take a patch as input and tessellation decisions as output.

 

Quote
  • Bandwidth-reduction, as de-duplication of vertices (vertex re-use) can be done upfront, and reused over many frames. The current API model means the index buffers have to be scanned by the hardware every time. Larger meshlets mean higher vertex re-use, also lowering bandwidth requirements. Furthermore developers can come up with their own compression or procedural generation schemes.
    The optional expansion/filtering via task shaders allows to skip fetching more data entirely.

 

So when Apple says they are able to cull before the raster stage starts and other methods cannot, that simply isn't correct in practice. Nvidia has this and other culling phases before raster, and ones during it that bring great bandwidth reductions as well.

 

Quote

TBDR allows the vertex and fragment stages to run asynchronously—providing significant performance improvements over IM

The above also allows this, so it's also not unique to Apple and TBDR.
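
For a feel of what the task/mesh-shader culling quoted above amounts to, here is a deliberately tiny CPU-side sketch (the Meshlet layout and the simple depth-range test are my own stand-ins, not Nvidia's API): whole meshlets are rejected before any of their vertices or triangles are fetched.

```python
# Toy meshlet culling: reject whole groups of triangles ("meshlets") before
# rasterisation using a bounding-sphere test against the visible depth range.
# Real task shaders do this per-workgroup on the GPU; this is just the idea.

from dataclasses import dataclass

@dataclass
class Meshlet:
    center: tuple      # bounding-sphere centre in view space (x, y, z), +z forward
    radius: float
    triangle_count: int

def cull_meshlets(meshlets, near=0.1, far=100.0):
    """Keep only meshlets whose bounding sphere overlaps the visible depth range."""
    kept = [m for m in meshlets
            if m.center[2] + m.radius > near and m.center[2] - m.radius < far]
    emitted = sum(m.triangle_count for m in kept)
    return kept, emitted

scene = [Meshlet((0, 0, 5), 1.0, 64),     # in front of the camera -> kept
         Meshlet((0, 0, -3), 1.0, 64),    # behind the camera -> culled
         Meshlet((0, 0, 500), 1.0, 64)]   # beyond the far plane -> culled
kept, tris = cull_meshlets(scene)
print(f"{len(kept)} of {len(scene)} meshlets survive, {tris} triangles emitted")
```

The point is simply that rejection happens at meshlet granularity before rasterisation, which is the pre-raster culling being contrasted with Apple's claim.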

 

1 hour ago, hishnash said:

That is very different; it's a rendering technique used on all platforms that requires multiple render passes. It is not deferred within the render pass; rather, you render out all objects writing an object ID + triangle ID into a buffer, then you run a compute-style render pass that is used to evaluate the costly fragment shaders. This is a great way of doing better culling and avoiding overdraw, and you can do it just as well on a TBDR pipeline, but it does not reduce memory bandwidth usage (in fact it increases it).

Correct, deferred rendering here does increase the required bandwidth, and multiple hardware render targets are also required, but the benefits greatly outweigh that increase.
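
To make that bandwidth cost concrete (my own illustrative G-buffer layout and numbers, not any specific engine's): a classic deferred geometry pass writes several render targets per pixel, which multiplies the bytes written before any lighting even starts.

```python
# Rough G-buffer write-traffic estimate for a deferred geometry pass.
# The render-target layout and sizes below are assumptions for illustration.

GBUFFER_TARGETS = {
    "albedo_rgba8":    4,   # bytes per pixel
    "normal_rg16f":    4,
    "material_rgba8":  4,
    "depth_d32":       4,
}

def gbuffer_write_gb_per_s(width, height, fps):
    bytes_per_pixel = sum(GBUFFER_TARGETS.values())
    return width * height * bytes_per_pixel * fps / 1e9

print(f"1440p @ 60 fps G-buffer writes: ~{gbuffer_write_gb_per_s(2560, 1440, 60):.1f} GB/s "
      f"({sum(GBUFFER_TARGETS.values())} bytes/pixel before overdraw and reads)")
```

Overdraw multiplies the write side further and the lighting pass then reads these targets back, which is where the bandwidth increase comes from; the saving is that expensive lighting runs once per pixel instead of once per fragment.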

 

 

However, this all still ultimately comes back to the fact that what Apple says is a benefit of their TBDR method over other approaches simply isn't the case in reality. Like I said, it's true in the more generalized comparison, speaking in overview terms, but not actually so true in practice. If there is wastage or optimizations to be made, ways to do it will always be found; it's just an iterative process of finding the next most beneficial and achievable thing and then doing it. Nvidia is far better at doing that than AMD, which is why performance relative to memory bandwidth required is vastly better on Nvidia cards than on AMD.

 

Call me a skeptic, but when a company says they can do something special or unique my general response is "doubt it"; marketing is almost always dressed-up near-lies, strategically presented in a way that isn't actually informative about the real situation.


I'm not saying TBDR is a perfect solution; it has its downsides. For example, if you have many objects with semi-transparent materials, there is a limited number of these a TBDR blending solution can support without needing to BLIT the tile memory out to VRAM and then BLIT it back in. Also, if you want to do effects that combine values from multiple pixels (such as depth blur), any tile-based solution is an issue; TBIR GPUs can easily switch to just regular IR mode (without the tiling) to enable these types of effects. Any screen-space effect, in fact, is much more complicated to do on a TBDR GPU, be that shadows, reflections, ambient occlusion, bloom, etc. There are ways to mitigate some of this on TBDR pipelines, but in my experience doing so results in rather different outputs on TBIR GPUs vs TBDR GPUs, and very few game engines are going to prefer to optimise for TBDR any time in the near future; they will continue to target TBIR solutions and then do their best to hack that into a TBDR pipeline, as the market demands this (consoles are all TBIR GPUs, after all).

 

 

3 minutes ago, leadeater said:

Not on the later approaches, Nvidia specifically talks about the bandwidth reduction compared to the past with traditional occlusion culling.

It does reduce bandwidth, but not even close to as much as a TBDR approach; the result of this is still dumped back into VRAM before it is loaded for fragment shader evaluation. Possibly on AMD's chips, if your evaluated geometry buffer is within the cache limits (128 MB is not much for render-pass geometry), you avoid this power cost? Compared to a hand-rolled solution (which would evaluate the vertex shaders, dump to VRAM, load in a compute shader to do the depth comparison and then write out the result), Nvidia's approach does the depth buffer inline with the vertex evaluation, so it saves one load/store but not all of them. It also requires the fragment shader stencils used to control the evaluation of the fragment shader to be written out to VRAM, which can add up even though they're just buffers of booleans; you have one per draw-call object.

 

9 minutes ago, leadeater said:

but the benefits greatly outweigh that increase.

Absolutely, but you can use it on TBDR GPUs just the same as on TBIR, and the first (geometry index/material ID) pass of a deferred lighting pass on a TBDR GPU does require a lot less bandwidth and is a much simpler evaluation (as long as you don't have too many semi-transparent objects...! see above). Deferred lighting on TBDR GPUs is also a little more tricky if you have fragment shaders that use textures to adjust the depth of an object (any form of depth alteration in a TBDR pipeline needs to happen during the regular TBDR phase, otherwise you're going to end up with incorrect object/material IDs).

 
 


4 minutes ago, hishnash said:

I'm not saying TBDR is a perfect solution; it has its downsides

I know, I'm just pointing out that the listed benefits, or things said to be unique to TBDR, actually are not. Nothing more than that. That's why I said actual comparisons have to be made between X and Y to validate claims and quantify benefits, otherwise I have to treat it purely as marketing speak. Without a fair way to compare I cannot conclude how much better one thing is than another.


1 minute ago, leadeater said:

I know, I'm just pointing out that the listed benefits, or things said to be unique to TBDR, actually are not. Nothing more than that. That's why I said actual comparisons have to be made between X and Y to validate claims and quantify benefits, otherwise I have to treat it purely as marketing speak. Without a fair way to compare I cannot conclude how much better one thing is than another.

The benefit it has is a massive reduction in memory bandwidth needs. This is not marketing speak, at least in my experience of writing shaders for both platforms: you can easily see the same scene using a fraction of the bandwidth it does on other platforms.

 

However, the downside is that you end up needing to think a lot more about how to implement a given solution, since the regular algorithms you might be used to using will have some really nasty side effects on a TBDR GPU. The introduction of tile compute shaders in the last few years has opened up a lot of extra options, but there is still a class of common effects that are extremely hard to do efficiently, and since these GPUs assume much lower bandwidth requirements they tend not to be provided with much of it, so your bandwidth budget for just firing off extra compute passes is limited (in particular if you need to do these on the unresolved AA buffers, which can be large!).
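
To give a feel for why those unresolved AA buffers are a problem (illustrative numbers only, Python as a calculator): before resolve, an MSAA target stores every sample, so its footprint, and the cost of any pass that touches it, scales with the sample count.

```python
# Size of an un-resolved MSAA colour target: every sample is stored until resolve,
# so any full-screen pass that touches it pays for all samples. Illustrative only.

def msaa_buffer_mb(width, height, bytes_per_sample, samples):
    return width * height * bytes_per_sample * samples / 1e6

for samples in (1, 4, 8):
    mb = msaa_buffer_mb(2560, 1440, bytes_per_sample=4, samples=samples)
    print(f"{samples}x MSAA, RGBA8 @ 1440p: ~{mb:.0f} MB per pass over the un-resolved target")
```

Any full-screen effect that has to run before the resolve pays that full-sample cost in bandwidth, which is exactly where a bandwidth-light TBDR design gets squeezed.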


It's pretty funny that Nvidia is now working with the company they were unsuccessfully trying to buy.


10 minutes ago, wamred said:

It's pretty funny that Nvidia is now working with the company they were unsuccessfully trying to buy.

They were already working with ARM and already had ARM-based platforms (see the whole Jetson and Tegra lineup).

 

The merger is still ongoing, it wasn't unsuccessful.


3 minutes ago, igormp said:

They were already working with ARM and already had ARM-based platforms (see the whole Jetson and Tegra lineup).

 

The merger is still ongoing, it wasn't unsuccessful.

Ah that makes sense, still don't necessarily want it to go through. 


  • 2 weeks later...

Nvidia has paved the way for an RTX Nintendo Switch.

