Nvidia Brings DXR Raytracing to Pascal, Turing and Volta

randomhkkid

@Derangel Ah wait, now I think I understand what happened. NVIDIA probably added DXR fallback layer support at the driver level rather than making games require the library.
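If that's how it works, a game just asks the driver whether DXR is exposed, the same way on any GPU; roughly like this sketch of the standard D3D12 feature query (my illustration, assuming a valid ID3D12Device already exists):

```cpp
#include <d3d12.h>

// Query the driver for DXR support and its tier. With driver-level support
// the same check works on Pascal, Volta and Turing -- no separate fallback
// library to link. Sketch only; `device` is assumed to be created elsewhere.
bool SupportsDriverDXR(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS5 opts5 = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS5,
                                           &opts5, sizeof(opts5))))
        return false;

    // TIER_NOT_SUPPORTED means the driver offers no DXR path on this GPU.
    return opts5.RaytracingTier >= D3D12_RAYTRACING_TIER_1_0;
}
```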

 

4 minutes ago, dgsddfgdfhgs said:

Nvidia don't care about Maxwell anymore

To be fair, it's in its 5th year of service. That's pretty old in computer time.


58 minutes ago, Derangel said:

RTX is ray-tracing processed via the tensor cores. Both DXR and Vulkan ray-tracing can use the tensor cores. The proprietary implementation of the tech likely has more to do with the use and optimization of the tensor cores as both BFV and Metro use DXR.

Really it's just a brand name for a set of technologies and software tools. The way I see it, RTX has more to do with development tools than the hardware itself, but both fall under it because they're mutually required. The DXR spec allows for and implements methods to offload tasks, and doesn't impose any restrictions on how those tasks are allocated or what they are allocated to.

 

That's why Nvidia can show off the Star Wars demo on the Titan V: nothing prevents a Tensor-core-only implementation. Nothing stops an RT-core-only implementation either, for that matter, like BFV, though you would be pushing a lot back onto shader compute.

 

On a more speculative note, my wild-ass guess is that, in a comparison like the one Nvidia has done, AMD hardware would be as efficient as or slightly better than Turing in DXR-only mode, at least for cards that support Rapid Packed Math. The only question I have on that is whether Vega can actually execute both an FP and an INT instruction at the same time. It can do two FP16 ops, or four INT8 (Vega 20), but can those be mixed? Can you do an FP16 and an INT16, or an FP16 and two INT8?
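To make the "packed" part concrete, here's a software analogy of my own (plain C++, not AMD's actual hardware path): two independent 16-bit adds carried out inside one 32-bit operation, which is conceptually what RPM does within each 32-bit lane:

```cpp
#include <cstdint>

// SWAR sketch: add two independent 16-bit lanes packed into one 32-bit word,
// without letting a carry spill from the low lane into the high lane. This is
// only a software analogy for Rapid Packed Math, where one 32-bit ALU lane
// retires two FP16/INT16 operations per clock.
uint32_t packed_add_u16x2(uint32_t a, uint32_t b)
{
    // Add the low 15 bits of each lane; the top bit of each lane is masked
    // off, so no carry can cross the lane boundary.
    uint32_t sum = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    // Restore each lane's top bit with carry-free XOR arithmetic.
    return sum ^ ((a ^ b) & 0x80008000u);
}
```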


The BIG question, though, is: will the RTX cards be able to use the basic RT that the old Pascal cards will now use?

 

If you can toggle basic RT on the RTX cards, then you should be able to get almost as much FPS with, for example, an RTX 2070 with RT on as you usually get with it off.



9 minutes ago, ne0tic said:

The BIG question, though, is: will the RTX cards be able to use the basic RT that the old Pascal cards will now use?

 

If you can toggle basic RT on the RTX cards, then you should be able to get almost as much FPS with, for example, an RTX 2070 with RT on as you usually get with it off.

I don't quite understand what you're getting at. RT cores are just a hardware-accelerated offload; yes, RTX cards support DXR-only mode, but how and what is used doesn't change the game's implementation and graphical demand.

 

You could target DXR only; that means less graphical demand, because you have to account for the lower performance that can be utilized.

 

And even then, due to architectural limitations in Pascal, DXR-only is significantly slower than on Turing. Have a look at the picture with the 3 render-time plots on it: Turing DXR-only is faster.


3 minutes ago, leadeater said:

I don't quite understand what you're getting at. RT cores are just a hardware-accelerated offload; yes, RTX cards support DXR-only mode, but how and what is used doesn't change the game's implementation and graphical demand.

 

You could target DXR only; that means less graphical demand, because you have to account for the lower performance that can be utilized.

 

And even then, due to architectural limitations in Pascal, DXR-only is significantly slower than on Turing. Have a look at the picture with the 3 render-time plots on it: Turing DXR-only is faster.

What I meant was whether the RT and Tensor cores on the RTX cards could do the same basic ray tracing that the Pascal cards will do with only CUDA cores. If the RT and Tensor cores could run the same RT that the CUDA-based cards will run, then you could get much better FPS with ray tracing on with the RTX cards than you can right now.



11 minutes ago, ne0tic said:

What I meant was whether the RT and Tensor cores on the RTX cards could do the same basic ray tracing that the Pascal cards will do with only CUDA cores. If the RT and Tensor cores could run the same RT that the CUDA-based cards will run, then you could get much better FPS with ray tracing on with the RTX cards than you can right now.

It is the same; there is no difference in the task being run. The difference is that sections of the processing happen on RT cores instead of CUDA compute shader cores.

 

Edit:

Same task run 3 different ways: Pascal DXR only, Turing DXR only, Turing RTX.

[Image: Nvidia GDC slide with three render-time plots: Pascal DXR, Turing DXR, and Turing RTX]

As you can see, for Pascal the timeline (the top one) is much, much longer; that big consistent block in the middle is the ray tracing. Looking at the middle plot, you can see the ray tracing starts at the same point but takes around half as long as on Pascal. That is because ray tracing uses both FLOAT and INT, and Pascal can only do one or the other, while Turing can do both simultaneously; that is why it's faster for this task.
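For a feel of why ray tracing leans on both pipes, here's an illustrative sketch (my own C++, not any game's actual code) of one BVH traversal step: the box test is floating-point math, while chasing child nodes is integer address math, and the two interleave constantly:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

struct Node { float bmin[3], bmax[3]; uint32_t leftChild; };

// Ray vs. bounding-box "slab" test: almost pure FP32 work.
bool HitAabb(const Node& n, const float org[3], const float invDir[3],
             float tMax)
{
    float t0 = 0.0f, t1 = tMax;
    for (int axis = 0; axis < 3; ++axis) {                       // INT: loop index
        float tNear = (n.bmin[axis] - org[axis]) * invDir[axis]; // FP
        float tFar  = (n.bmax[axis] - org[axis]) * invDir[axis]; // FP
        if (tNear > tFar) std::swap(tNear, tFar);
        t0 = std::max(t0, tNear);
        t1 = std::min(t1, tFar);
    }
    return t0 <= t1;
}

// Picking the next node to test: pure INT index/address arithmetic. On Pascal
// this serializes with the FP work above; Turing issues FP32 and INT32 side
// by side, so the address math hides behind the intersection math.
inline const Node& Child(const Node* nodes, const Node& n, uint32_t which)
{
    return nodes[n.leftChild + which];
}
```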

 

The bottom one is Turing using the RT cores and Tensor cores: those long middle sections of the other two timelines all fit into that bright green section. That's how much faster the RT cores are at doing the exact same task the CUDA cores can do. Interestingly, the Tensor cores don't appear to have a big impact on render time; nearly all of it comes down to the RT cores.

 

Edit 2:

Actually dammit, damn you Nvidia. I just noticed the Turing RTX plot is with DLSS on. That explains why the ray tracing section starts sooner in the plot. Argh, the comparison isn't even valid.


27 minutes ago, leadeater said:

Edit 2:

Actually dammit, damn you Nvidia. I just noticed the Turing RTX plot is with DLSS on. That explains why the ray tracing section starts sooner in the plot. Argh, the comparison isn't even valid.

Yeah, that's why I added the second chart in the first post. According to Nvidia, without DLSS there's about a 20% loss of performance.



2 minutes ago, randomhkkid said:

Yeah, that's why I added the second chart in the first post. According to Nvidia, without DLSS there's about a 20% loss of performance.

Awesome, thanks for the info. Really wish all 4 were plotted, oh well.


8 hours ago, Mira Yurizaki said:

Turing can do INT32 + FP32 at the same time regardless of asynchronous compute. This is one of the architectural improvements they made over Pascal.

Ah my bad, was under the impression that INT32 + FP32 was the async compute feature. Amended. 



1 hour ago, leadeater said:

Really it's just a brand name for a set of technologies and software tools. The way I see it, RTX has more to do with development tools than the hardware itself, but both fall under it because they're mutually required. The DXR spec allows for and implements methods to offload tasks, and doesn't impose any restrictions on how those tasks are allocated or what they are allocated to.

 

That's why Nvidia can show off the Star Wars demo on the Titan V: nothing prevents a Tensor-core-only implementation. Nothing stops an RT-core-only implementation either, for that matter, like BFV, though you would be pushing a lot back onto shader compute.

 

On a more speculative note, my wild-ass guess is that, in a comparison like the one Nvidia has done, AMD hardware would be as efficient as or slightly better than Turing in DXR-only mode, at least for cards that support Rapid Packed Math. The only question I have on that is whether Vega can actually execute both an FP and an INT instruction at the same time. It can do two FP16 ops, or four INT8 (Vega 20), but can those be mixed? Can you do an FP16 and an INT16, or an FP16 and two INT8?

On the instruction part for AMD, the short answer is "yes", but the exact details are beyond my knowledge of the GCN architecture. Nvidia is still several generations behind in async compared to AMD with GCN 1.0. (I happened to be chatting with some people who know their GPU architecture lately.)


1 hour ago, leadeater said:

On a more speculative note, my wild-ass guess is that, in a comparison like the one Nvidia has done, AMD hardware would be as efficient as or slightly better than Turing in DXR-only mode, at least for cards that support Rapid Packed Math. The only question I have on that is whether Vega can actually execute both an FP and an INT instruction at the same time. It can do two FP16 ops, or four INT8 (Vega 20), but can those be mixed? Can you do an FP16 and an INT16, or an FP16 and two INT8?

GPU edition of SMT?



20 minutes ago, Taf the Ghost said:

On the instruction part for AMD, the short answer is "yes", but the exact details are beyond my knowledge of the GCN architecture. Nvidia is still several generations behind in async compared to AMD with GCN 1.0. (I happened to be chatting with some people who know their GPU architecture lately.)

Yeah, I had a really good look for the information and wasn't able to find it. AMD talks about combining FP16 tasks, and about being able to run INT4, INT8, INT16 and INT32, but isn't really clear on whether they can be mixed. Their diagrams are just labeled 'Ops' when showing how RPM works, but they don't state much beyond that.

 

Async compute information doesn't really clear it up much either, as that's about being able to run graphics and compute workloads at the same time on the same CU; it doesn't go into specifics like data types. Can a CU do both FP and INT? Probably, but I'd like to know for sure.

 

Each CU has 4 SIMD units, and each of those has a 16-lane integer and floating-point vector Arithmetic Logic Unit (ALU). Note this is old information about CUs and SIMDs, and I suspect some of it has changed, but that doesn't matter much here. I'm thinking each ALU can only do INT or FP at any one time, but since there are 4 per CU you could have 3 ALUs in FP mode and 1 in INT mode.
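As a sanity check on that layout, the back-of-envelope throughput math works out (a sketch of mine; the 64-CU count and ~1.55 GHz clock are Vega 64 figures assumed for illustration):

```cpp
#include <cstdio>

// Back-of-envelope GCN throughput from the CU layout described above.
// Assumed for illustration: Vega 64 has 64 CUs boosting to roughly 1.55 GHz,
// and an FMA counts as 2 FLOPs.
int main()
{
    const int    simdsPerCu   = 4;     // SIMD units per CU
    const int    lanesPerSimd = 16;    // 16-lane vector ALU per SIMD
    const int    cus          = 64;    // Vega 64 (assumed)
    const double clockGHz     = 1.55;  // approximate boost clock (assumed)

    // 64 CUs * 4 SIMDs * 16 lanes * 2 FLOPs * 1.55 GHz ~= 12.7 TFLOPs FP32,
    // matching Vega 64's quoted peak; RPM doubles this for FP16.
    double gflops = double(cus) * simdsPerCu * lanesPerSimd * 2.0 * clockGHz;
    std::printf("Peak FP32: %.0f GFLOPs\n", gflops);
    return 0;
}
```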


2 minutes ago, Jurrunio said:

GPU edition of SMT?

It looks like AMD will be implementing VLIW, SIMT and SSIMD (Google will help if you need it) in a reworked combination in their actual next-gen design. (That'd be after Navi.) In CPU terms, it's more like AMD is going with a bigger (not wider) Front End and a massively reworked Back End; at least, that's what some of the rumors and all of AMD's GPU patents from the last 2 years suggest. Bringing back VLIW as VLIW2 is similar to threading, but more like gaining the ability to double the number of calls sent at the beginning of the pipeline. Potentially double the IPC, but handling that on the Back End requires a huge amount of rework.


11 minutes ago, Taf the Ghost said:

It looks like AMD will be implementing VLIW, SIMT and SSIMD (Google will help if you need it) in a reworked combination in their actual next-gen design. (That'd be after Navi.) In CPU terms, it's more like AMD is going with a bigger (not wider) Front End and a massively reworked Back End; at least, that's what some of the rumors and all of AMD's GPU patents from the last 2 years suggest. Bringing back VLIW as VLIW2 is similar to threading, but more like gaining the ability to double the number of calls sent at the beginning of the pipeline. Potentially double the IPC, but handling that on the Back End requires a huge amount of rework.

Interesting note though: Fiji had 8 ACEs (1-way) and Vega has 4 ACEs (4-way), so Vega can actually do 16 tasks simultaneously.


51 minutes ago, leadeater said:

Yeah, I had a really good look for the information and wasn't able to find it. AMD talks about combining FP16 tasks, and about being able to run INT4, INT8, INT16 and INT32, but isn't really clear on whether they can be mixed. Their diagrams are just labeled 'Ops' when showing how RPM works, but they don't state much beyond that.

 

Async compute information doesn't really clear it up much either, as that's about being able to run graphics and compute workloads at the same time on the same CU; it doesn't go into specifics like data types. Can a CU do both FP and INT? Probably, but I'd like to know for sure.

 

Each CU has 4 SIMD units, and each of those has a 16-lane integer and floating-point vector Arithmetic Logic Unit (ALU). Note this is old information about CUs and SIMDs, and I suspect some of it has changed, but that doesn't matter much here. I'm thinking each ALU can only do INT or FP at any one time, but since there are 4 per CU you could have 3 ALUs in FP mode and 1 in INT mode.

Got into the middle of a hyper-technical GCN discussion just the other day. (AMD's big problem is really execution and hammering home what they're best at, more than GCN being bad.) Most GCN GPUs are actually best thought of as a collection of CPUs clustered into a Graphics Engine; those CUs can do an extreme amount independently. AMD also has something like a 50% performance-per-area advantage over Nvidia in FP-intensive compute, at the 14nm/16nm generation. It's the geometry throughput that unbalances the current designs. (Vega having full FP64 means the die is significantly larger than needed for a gaming GPU.) It's why the recent Crytek demo ran at 4K/30 on a Vega 56: Vega just has that much shader power.

 

As for the async, I saw some testing recently showing that GCN really does do INT and FP fully async, but I'm having trouble finding the link at the moment. I'll see if I can find it, because it was some interesting testing. Nvidia still schedules well with Pascal, but GCN really does do what it says it does.


9 minutes ago, leadeater said:

Interesting note though: Fiji had 8 ACEs (1-way) and Vega has 4 ACEs (4-way), so Vega can actually do 16 tasks simultaneously.

Speaking of not balancing your designs to fit all of the roles...

 

(The general consensus is that Vega is badly internally bottlenecked, so they had to clock it much higher. Thus you get an insanely efficient uArch that's suddenly "too hot".)


4 minutes ago, Taf the Ghost said:

Speaking of not balancing your designs to fit all of the roles...

Even so, looking at some information, which just so happens to be about 1st-gen GCN, AMD basically nailed a lot of proper async compute on the first try.

https://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

 

However it's not without problems.

[Image: OpenCL compute trace showing GPU resources sitting idle]

https://community.amd.com/thread/168154 (interesting read, highly recommend).

 

The above picture may be about OpenCL and pure compute, but it seems like it's very easy to leave large amounts of the GPU's resources idle.
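To make the idle-resource point concrete: async compute is exposed to an application as an extra queue, so whether those gaps get filled depends entirely on what gets submitted. A rough D3D12 sketch of my own (valid device assumed, error handling omitted):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Async compute as an application sees it: a compute-only queue alongside the
// graphics queue. If nothing independent is ever submitted here, the spare
// CU capacity the chart shows simply sits idle. Sketch only.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics + compute
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC cmpDesc = {};
    cmpDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // async compute queue
    device->CreateCommandQueue(&cmpDesc, IID_PPV_ARGS(&computeQueue));
}
```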


13 minutes ago, leadeater said:

Even so, looking at some information, which just so happens to be about 1st-gen GCN, AMD basically nailed a lot of proper async compute on the first try.

https://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

 

However it's not without problems.

[Image: OpenCL compute trace showing GPU resources sitting idle]

https://community.amd.com/thread/168154 (interesting read, highly recommend).

 

The above picture may be about OpenCL and pure compute, but it seems like it's very easy to leave large amounts of the GPU's resources idle.

Interesting find. 

 

As for Vega itself, one of the discussions was about the fact that it probably has 5 TFLOPs of ALUs sitting idle in gaming situations because of the geometry/ROP bottlenecks. Vega has the shader power to do all of it async, since those chunks would otherwise be sitting idle in most gaming situations. This is part of the reason Vega scales differently from 1080p to 4K than Pascal or Turing do.

 

Relatedly, Vega's bottlenecks are also why everyone in the industry is expecting a lot from Navi and the consoles. Leaned down into gaming-focused GPU designs, the modern GCN cards can do significant work if they're balanced properly.


On the Nvidia RTX stuff, Nvidia's async has a long way to go. When those Tensor cores are activated, they bottleneck the GPU because of the added latency. I believe both Tim at HWUB and Steve at GN noted this in Metro, in places like the tunnel scenes: even with no actual rays to process, simply having the other parts of the render pipeline active adds stalls.


Let's Race down the Trace and Trace down the Race!



@leadeater @Taf the Ghost

Do you think AMD GPUs will do better than Nvidia 10xx-series GPUs in RT because of better async compute?

 

(As in: if they perform the same in non-RT games, AMD will perform better with RT.)



1 minute ago, Mihle said:

@leadeater @Taf the Ghost

Do you think AMD GPUs will do better than Nvidia 10xx-series GPUs in RT because of better async compute?

 

(As in: if they perform the same in non-RT games, AMD will perform better with RT.)

They should, yes, so long as they can do parallel async compute rather than context switching (as on Pascal), and can do simultaneous FP and INT.


Just now, Mihle said:

@leadeater @Taf the Ghost

Do you think AMD GPUs will do better than Nvidia 10xx-series GPUs in RT because of better async compute?

 

(As in: if they perform the same in non-RT games, AMD will perform better with RT.)

Depending on the implementation, Vega would smoke both Pascal and Turing. Turing's async is far behind AMD's, and that's a big reason why. For all of those Gigarays, RTX bottlenecks the GPU badly.


1 hour ago, Taf the Ghost said:

On the Nvidia RTX stuff, Nvidia's async has a long way to go. When those Tensor cores are activated, they bottleneck the GPU because of the added latency. I believe both Tim at HWUB and Steve at GN noted this in Metro, in places like the tunnel scenes: even with no actual rays to process, simply having the other parts of the render pipeline active adds stalls.

This might also contribute to what was observed in that Metro testing:

 

Quote

Like Volta, the Turing SM is partitioned into 4 sub-cores (or processing blocks) with each sub-core having a single warp scheduler and dispatch unit, as opposed to Pascal’s 2 partition setup with two dispatch ports per sub-core warp scheduler. There are some fairly major implications with this change, and broadly-speaking this means that Volta/Turing loses the capability to issue a second, non-dependent instruction from a thread for a single clock cycle. Turing is presumably identical to Volta in performing instructions over two cycles, but with schedulers that can issue an independent instruction every cycle, so ultimately Turing can maintain 2-way instruction level parallelism (ILP) this way, while still having twice the amount of schedulers over Pascal.

https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4

 

Sounds to me like Turing and Volta gained capabilities but regressed in instruction execution (increased instruction latency), if I understand it correctly.
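One way to picture it (plain C++ standing in for GPU instructions; the cycle counts assume the two-cycle instructions described in the quote):

```cpp
// Dependent chain: each multiply needs the previous result, so a scheduler
// that issues one instruction per cycle still stalls every other cycle
// -> roughly 0.5 instructions per cycle.
float dependent_chain(float a, float b)
{
    float x = a * b;   // issues cycle 0, result ready cycle 2
    x = x * b;         // must wait: issues cycle 2
    x = x * b;         // issues cycle 4
    return x;
}

// Independent instructions: a new one can issue every cycle, overlapping the
// two-cycle latencies -> sustained 1 instruction per cycle (2-way ILP).
float independent_pair(float a, float b, float c)
{
    float x = a * b;   // issues cycle 0
    float y = a * c;   // issues cycle 1, no dependence on x
    return x + y;      // waits on both, then issues
}
```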

