Nvidia Brings DXR Raytracing to Pascal, Turing and Volta

randomhkkid

@Derangel Ah wait, now I think I understand what happened. NVIDIA probably added DXR fallback layer support at the driver level rather than making games require the library.
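If that's how it works, a game just asks the driver whether DXR is exposed, the same way on any GPU; roughly like this sketch of the standard D3D12 feature query (my illustration, assuming a valid ID3D12Device already exists):

```cpp
#include <d3d12.h>

// Query the driver for DXR support and its tier. With driver-level support
// the same check works on Pascal, Volta and Turing -- no separate fallback
// library to link. Sketch only; `device` is assumed to be created elsewhere.
bool SupportsDriverDXR(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS5 opts5 = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS5,
                                           &opts5, sizeof(opts5))))
        return false;

    // TIER_NOT_SUPPORTED means the driver offers no DXR path on this GPU.
    return opts5.RaytracingTier >= D3D12_RAYTRACING_TIER_1_0;
}
```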

 

4 minutes ago, dgsddfgdfhgs said:

Nvidia don't care about Maxwell anymore

To be fair, it's in its 5th year of service. That's pretty old in computer time.


58 minutes ago, Derangel said:

RTX is ray-tracing processed via the tensor cores. Both DXR and Vulkan ray-tracing can use the tensor cores. The proprietary implementation of the tech likely has more to do with the use and optimization of the tensor cores as both BFV and Metro use DXR.

Really it's just a brand name for a set of technologies and software tools. The way I see it, RTX has more to do with development tools than the hardware itself, but both fall under it because they're mutually required. The DXR spec allows for and implements methods to offload tasks, and doesn't impose any restrictions on how those tasks are allocated or what they are allocated to.

 

That's why Nvidia can show off the Star Wars demo on the Titan V: nothing prevents a Tensor-core-only implementation. Nothing stops an RT-core-only implementation either, for that matter, like BFV, though you would be pushing a lot back onto shader compute.

 

On a more speculative note, my wild-ass guess is that, in a comparison like the one Nvidia has done, AMD hardware would be as efficient as or slightly better than Turing in DXR-only mode, at least for cards that support Rapid Packed Math. The only question I have on that is whether Vega can actually execute both an FP and an INT instruction at the same time. It can do two FP16 ops, or four INT8 (Vega 20), but can those be mixed? Can you do an FP16 and an INT16, or an FP16 and two INT8?
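To make the "packed" part concrete, here's a software analogy of my own (plain C++, not AMD's actual hardware path): two independent 16-bit adds carried out inside one 32-bit operation, which is conceptually what RPM does within each 32-bit lane:

```cpp
#include <cstdint>

// SWAR sketch: add two independent 16-bit lanes packed into one 32-bit word,
// without letting a carry spill from the low lane into the high lane. This is
// only a software analogy for Rapid Packed Math, where one 32-bit ALU lane
// retires two FP16/INT16 operations per clock.
uint32_t packed_add_u16x2(uint32_t a, uint32_t b)
{
    // Add the low 15 bits of each lane; the top bit of each lane is masked
    // off, so no carry can cross the lane boundary.
    uint32_t sum = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    // Restore each lane's top bit with carry-free XOR arithmetic.
    return sum ^ ((a ^ b) & 0x80008000u);
}
```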


The BIG question, though, is: will the RTX cards be able to use the basic RT that the old Pascal cards will now use?

 

If you can toggle basic RT on the RTX cards, then you should be able to get almost as much FPS with, for example, an RTX 2070 with RT on as you usually get with it off.



9 minutes ago, ne0tic said:

The BIG question, though, is: will the RTX cards be able to use the basic RT that the old Pascal cards will now use?

 

If you can toggle basic RT on the RTX cards, then you should be able to get almost as much FPS with, for example, an RTX 2070 with RT on as you usually get with it off.

I don't quite understand what you're getting at. RT cores are just a hardware-accelerated offload; yes, RTX cards support DXR-only mode, but how and what is used doesn't change the game's implementation and graphical demand.

 

You could target DXR only; that means less graphical demand, because you have to account for the lower performance that can be utilized.

 

And even then, due to architectural limitations in Pascal, DXR-only is significantly slower than on Turing. Have a look at the picture with the 3 render-time plots on it: Turing DXR-only is faster.


3 minutes ago, leadeater said:

I don't quite understand what you're getting at. RT cores are just a hardware-accelerated offload; yes, RTX cards support DXR-only mode, but how and what is used doesn't change the game's implementation and graphical demand.

 

You could target DXR only; that means less graphical demand, because you have to account for the lower performance that can be utilized.

 

And even then, due to architectural limitations in Pascal, DXR-only is significantly slower than on Turing. Have a look at the picture with the 3 render-time plots on it: Turing DXR-only is faster.

What I meant was whether the RT and Tensor cores on the RTX cards could do the same basic ray tracing that the Pascal cards will do with only CUDA cores. If the RT and Tensor cores could run the same RT that the CUDA-based cards will run, then you could get much better FPS with ray tracing on with the RTX cards than you can right now.



11 minutes ago, ne0tic said:

What I meant was whether the RT and Tensor cores on the RTX cards could do the same basic ray tracing that the Pascal cards will do with only CUDA cores. If the RT and Tensor cores could run the same RT that the CUDA-based cards will run, then you could get much better FPS with ray tracing on with the RTX cards than you can right now.

It is the same; there is no difference in the task being run. The difference is that sections of the processing happen on RT cores instead of CUDA compute shader cores.

 

Edit:

Same task run 3 different ways: Pascal DXR only, Turing DXR only, Turing RTX.

[Image: Nvidia GDC slide with three render-time plots: Pascal DXR, Turing DXR, and Turing RTX]

As you can see, for Pascal the timeline (the top one) is much, much longer; that big consistent block in the middle is the ray tracing. Looking at the middle plot, you can see the ray tracing starts at the same point but takes around half as long as on Pascal. That is because ray tracing uses both FLOAT and INT, and Pascal can only do one or the other, while Turing can do both simultaneously; that is why it's faster for this task.
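For a feel of why ray tracing leans on both pipes, here's an illustrative sketch (my own C++, not any game's actual code) of one BVH traversal step: the box test is floating-point math, while chasing child nodes is integer address math, and the two interleave constantly:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

struct Node { float bmin[3], bmax[3]; uint32_t leftChild; };

// Ray vs. bounding-box "slab" test: almost pure FP32 work.
bool HitAabb(const Node& n, const float org[3], const float invDir[3],
             float tMax)
{
    float t0 = 0.0f, t1 = tMax;
    for (int axis = 0; axis < 3; ++axis) {                       // INT: loop index
        float tNear = (n.bmin[axis] - org[axis]) * invDir[axis]; // FP
        float tFar  = (n.bmax[axis] - org[axis]) * invDir[axis]; // FP
        if (tNear > tFar) std::swap(tNear, tFar);
        t0 = std::max(t0, tNear);
        t1 = std::min(t1, tFar);
    }
    return t0 <= t1;
}

// Picking the next node to test: pure INT index/address arithmetic. On Pascal
// this serializes with the FP work above; Turing issues FP32 and INT32 side
// by side, so the address math hides behind the intersection math.
inline const Node& Child(const Node* nodes, const Node& n, uint32_t which)
{
    return nodes[n.leftChild + which];
}
```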

 

The bottom one is Turing using the RT cores and Tensor cores: those long middle sections of the other two timelines all fit into that bright green section. That's how much faster the RT cores are at doing the exact same task the CUDA cores can do. Interestingly, the Tensor cores don't appear to have a big impact on render time; nearly all of it comes down to the RT cores.

 

Edit 2:

Actually dammit, damn you Nvidia. I just noticed the Turing RTX plot is with DLSS on. That explains why the ray tracing section starts sooner in the plot. Argh, the comparison isn't even valid.


27 minutes ago, leadeater said:

Edit 2:

Actually dammit, damn you Nvidia. I just noticed the Turing RTX plot is with DLSS on. That explains why the ray tracing section starts sooner in the plot. Argh, the comparison isn't even valid.

Yeah, that's why I added the second chart in the first post. According to Nvidia, without DLSS there's about a 20% loss of performance.



2 minutes ago, randomhkkid said:

Yeah, that's why I added the second chart in the first post. According to Nvidia, without DLSS there's about a 20% loss of performance.

Awesome, thanks for the info. Really wish all 4 were plotted, oh well.


8 hours ago, Mira Yurizaki said:

Turing can do INT32 + FP32 at the same time regardless of asynchronous compute. This is one of the architectural improvements they made over Pascal.

Ah my bad, was under the impression that INT32 + FP32 was the async compute feature. Amended. 



1 hour ago, leadeater said:

Really it's just a brand name for a set of technologies and software tools. The way I see it, RTX has more to do with development tools than the hardware itself, but both fall under it because they're mutually required. The DXR spec allows for and implements methods to offload tasks, and doesn't impose any restrictions on how those tasks are allocated or what they are allocated to.

 

That's why Nvidia can show off the Star Wars demo on the Titan V: nothing prevents a Tensor-core-only implementation. Nothing stops an RT-core-only implementation either, for that matter, like BFV, though you would be pushing a lot back onto shader compute.

 

On a more speculative note, my wild-ass guess is that, in a comparison like the one Nvidia has done, AMD hardware would be as efficient as or slightly better than Turing in DXR-only mode, at least for cards that support Rapid Packed Math. The only question I have on that is whether Vega can actually execute both an FP and an INT instruction at the same time. It can do two FP16 ops, or four INT8 (Vega 20), but can those be mixed? Can you do an FP16 and an INT16, or an FP16 and two INT8?

On the instruction part for AMD, the short answer is "yes", but the exact details are beyond my knowledge of the GCN architecture. Nvidia is still several generations behind in async compared to AMD with GCN 1.0. (I happened to be chatting with some people who know their GPU architecture lately.)


1 hour ago, leadeater said:

On a more speculative note, my wild-ass guess is that, in a comparison like the one Nvidia has done, AMD hardware would be as efficient as or slightly better than Turing in DXR-only mode, at least for cards that support Rapid Packed Math. The only question I have on that is whether Vega can actually execute both an FP and an INT instruction at the same time. It can do two FP16 ops, or four INT8 (Vega 20), but can those be mixed? Can you do an FP16 and an INT16, or an FP16 and two INT8?

GPU edition of SMT?



20 minutes ago, Taf the Ghost said:

On the instruction part for AMD, the short answer is "yes", but the exact details are beyond my knowledge of the GCN architecture. Nvidia is still several generations behind in async compared to AMD with GCN 1.0. (I happened to be chatting with some people who know their GPU architecture lately.)

Yeah, I had a really good look for the information and wasn't able to find it. AMD talks about combining FP16 tasks, and about being able to run INT4, INT8, INT16 and INT32, but isn't really clear on whether they can be mixed. Their diagrams are just labeled 'Ops' when showing how RPM works, but they don't state much beyond that.

 

Async compute information doesn't really clear it up much either, as that's about being able to run graphics and compute workloads at the same time on the same CU; it doesn't go into specifics like data types. Can a CU do both FP and INT? Probably, but I'd like to know for sure.

 

Each CU has 4 SIMD units, and each of those has a 16-lane integer and floating-point vector Arithmetic Logic Unit (ALU). Note this is old information about CUs and SIMDs, and I suspect some of it has changed, but that doesn't matter much here. I'm thinking each ALU can only do INT or FP at any one time, but since there are 4 per CU you could have 3 ALUs in FP mode and 1 in INT mode.
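As a sanity check on that layout, the back-of-envelope throughput math works out (a sketch of mine; the 64-CU count and ~1.55 GHz clock are Vega 64 figures assumed for illustration):

```cpp
#include <cstdio>

// Back-of-envelope GCN throughput from the CU layout described above.
// Assumed for illustration: Vega 64 has 64 CUs boosting to roughly 1.55 GHz,
// and an FMA counts as 2 FLOPs.
int main()
{
    const int    simdsPerCu   = 4;     // SIMD units per CU
    const int    lanesPerSimd = 16;    // 16-lane vector ALU per SIMD
    const int    cus          = 64;    // Vega 64 (assumed)
    const double clockGHz     = 1.55;  // approximate boost clock (assumed)

    // 64 CUs * 4 SIMDs * 16 lanes * 2 FLOPs * 1.55 GHz ~= 12.7 TFLOPs FP32,
    // matching Vega 64's quoted peak; RPM doubles this for FP16.
    double gflops = double(cus) * simdsPerCu * lanesPerSimd * 2.0 * clockGHz;
    std::printf("Peak FP32: %.0f GFLOPs\n", gflops);
    return 0;
}
```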


2 minutes ago, Jurrunio said:

GPU edition of SMT?

It looks like AMD will be implementing VLIW, SIMT and SSIMD (Google will help if you need it) in a reworked combination in their actual next-gen design. (That'd be after Navi.) In CPU terms, it's more like AMD is going with a bigger (not wider) Front End and a massively reworked Back End; at least, that's what some of the rumors and all of AMD's GPU patents from the last 2 years suggest. Bringing back VLIW as VLIW2 is similar to threading, but more like gaining the ability to double the number of calls sent at the beginning of the pipeline. Potentially double the IPC, but handling that on the Back End requires a huge amount of rework.


11 minutes ago, Taf the Ghost said:

It looks like AMD will be implementing VLIW, SIMT and SSIMD (Google will help if you need it) in a reworked combination in their actual next-gen design. (That'd be after Navi.) In CPU terms, it's more like AMD is going with a bigger (not wider) Front End and a massively reworked Back End; at least, that's what some of the rumors and all of AMD's GPU patents from the last 2 years suggest. Bringing back VLIW as VLIW2 is similar to threading, but more like gaining the ability to double the number of calls sent at the beginning of the pipeline. Potentially double the IPC, but handling that on the Back End requires a huge amount of rework.

Interesting note though: Fiji had 8 ACEs (1-way) and Vega has 4 ACEs (4-way), so Vega can actually do 16 tasks simultaneously.


51 minutes ago, leadeater said:

Yeah, I had a really good look for the information and wasn't able to find it. AMD talks about combining FP16 tasks, and about being able to run INT4, INT8, INT16 and INT32, but isn't really clear on whether they can be mixed. Their diagrams are just labeled 'Ops' when showing how RPM works, but they don't state much beyond that.

 

Async compute information doesn't really clear it up much either, as that's about being able to run graphics and compute workloads at the same time on the same CU; it doesn't go into specifics like data types. Can a CU do both FP and INT? Probably, but I'd like to know for sure.

 

Each CU has 4 SIMD units, and each of those has a 16-lane integer and floating-point vector Arithmetic Logic Unit (ALU). Note this is old information about CUs and SIMDs, and I suspect some of it has changed, but that doesn't matter much here. I'm thinking each ALU can only do INT or FP at any one time, but since there are 4 per CU you could have 3 ALUs in FP mode and 1 in INT mode.

Got into the middle of a hyper-technical GCN discussion just the other day. (AMD's big problem is really execution and hammering home what they're best at, more than GCN being bad.) Most GCN GPUs are actually best thought of as a collection of CPUs clustered into a Graphics Engine; those CUs can do an extreme amount independently. AMD also has something like a 50% performance-per-area advantage over Nvidia in FP-intensive compute, at the 14nm/16nm generation. It's the geometry throughput that unbalances the current designs. (Vega having full FP64 means the die is significantly larger than needed for a gaming GPU.) It's why the recent Crytek demo ran at 4K/30 on a Vega 56: Vega just has that much shader power.

 

As for the async, I saw some testing recently showing that GCN really does do INT and FP fully async, but I'm having trouble finding the link at the moment. I'll see if I can find it, because it was some interesting testing. Nvidia still schedules well with Pascal, but GCN really does do what it says it does.


9 minutes ago, leadeater said:

Interesting note though: Fiji had 8 ACEs (1-way) and Vega has 4 ACEs (4-way), so Vega can actually do 16 tasks simultaneously.

Speaking of not balancing your designs to fit all of the roles...

 

(The general consensus is that Vega is badly internally bottlenecked, so they had to clock it much higher. Thus you get an insanely efficient uArch that's suddenly "too hot".)


4 minutes ago, Taf the Ghost said:

Speaking of not balancing your designs to fit all of the roles...

Even so, looking at some information, which just so happens to be about 1st-gen GCN, AMD basically nailed a lot of proper async compute on the first try.

https://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

 

However it's not without problems.

[Image: OpenCL compute trace showing GPU resources sitting idle]

https://community.amd.com/thread/168154 (interesting read, highly recommend).

 

The above picture may be about OpenCL and pure compute, but it seems like it's very easy to leave large amounts of the GPU's resources idle.
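To make the idle-resource point concrete: async compute is exposed to an application as an extra queue, so whether those gaps get filled depends entirely on what gets submitted. A rough D3D12 sketch of my own (valid device assumed, error handling omitted):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Async compute as an application sees it: a compute-only queue alongside the
// graphics queue. If nothing independent is ever submitted here, the spare
// CU capacity the chart shows simply sits idle. Sketch only.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics + compute
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC cmpDesc = {};
    cmpDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // async compute queue
    device->CreateCommandQueue(&cmpDesc, IID_PPV_ARGS(&computeQueue));
}
```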


13 minutes ago, leadeater said:

Even so, looking at some information, which just so happens to be about 1st-gen GCN, AMD basically nailed a lot of proper async compute on the first try.

https://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

 

However it's not without problems.

[Image: OpenCL compute trace showing GPU resources sitting idle]

https://community.amd.com/thread/168154 (interesting read, highly recommend).

 

The above picture may be about OpenCL and pure compute, but it seems like it's very easy to leave large amounts of the GPU's resources idle.

Interesting find. 

 

As for Vega itself, one of the discussions was about the fact that it probably has 5 TFLOPs of ALUs sitting idle in gaming situations because of the geometry/ROP bottlenecks. Vega has the shader power to do all of it async, since those chunks would otherwise be sitting idle in most gaming situations. This is part of the reason Vega scales differently from 1080p to 4K than Pascal or Turing do.

 

Relatedly, Vega's bottlenecks are also why everyone in the industry is expecting a lot from Navi and the consoles. Leaned down into gaming-focused GPU designs, the modern GCN cards can do significant work if they're balanced properly.


On the Nvidia RTX stuff, Nvidia's async has a long way to go. When those Tensor cores are activated, they bottleneck the GPU because of the added latency. I believe both Tim at HWUB and Steve at GN noted this in Metro, in places like the tunnel scenes: even with no actual rays to process, simply having the other parts of the render pipeline active adds stalls.


Let's Race down the Trace and Trace down the Race!



@leadeater @Taf the Ghost

Do you think AMD GPUs will do better than Nvidia 10xx-series GPUs in RT because of better async compute?

 

(As in: if they perform the same in non-RT games, AMD will perform better with RT.)



1 minute ago, Mihle said:

@leadeater @Taf the Ghost

Do you think AMD GPUs will do better than Nvidia 10xx-series GPUs in RT because of better async compute?

 

(As in: if they perform the same in non-RT games, AMD will perform better with RT.)

They should, yes, so long as they can do parallel async compute rather than context switching (as on Pascal), and can do simultaneous FP and INT.


Just now, Mihle said:

@leadeater @Taf the Ghost

Do you think AMD GPUs will do better than Nvidia 10xx-series GPUs in RT because of better async compute?

 

(As in: if they perform the same in non-RT games, AMD will perform better with RT.)

Depending on the implementation, Vega would smoke both Pascal and Turing. Turing's async is far behind AMD's, and that's a big reason why. For all of those Gigarays, RTX bottlenecks the GPU badly.


1 hour ago, Taf the Ghost said:

On the Nvidia RTX stuff, Nvidia's async has a long way to go. When those Tensor cores are activated, they bottleneck the GPU because of the added latency. I believe both Tim at HWUB and Steve at GN noted this in Metro, in places like the tunnel scenes: even with no actual rays to process, simply having the other parts of the render pipeline active adds stalls.

This might also contribute to what was observed in that Metro testing:

 

Quote

Like Volta, the Turing SM is partitioned into 4 sub-cores (or processing blocks) with each sub-core having a single warp scheduler and dispatch unit, as opposed to Pascal’s 2 partition setup with two dispatch ports per sub-core warp scheduler. There are some fairly major implications with this change, and broadly-speaking this means that Volta/Turing loses the capability to issue a second, non-dependent instruction from a thread for a single clock cycle. Turing is presumably identical to Volta in performing instructions over two cycles, but with schedulers that can issue an independent instruction every cycle, so ultimately Turing can maintain 2-way instruction level parallelism (ILP) this way, while still having twice the amount of schedulers over Pascal.

https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4

 

Sounds to me like Turing and Volta gained capabilities but regressed in instruction execution (increased instruction latency), if I understand it correctly.
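One way to picture it (plain C++ standing in for GPU instructions; the cycle counts assume the two-cycle instructions described in the quote):

```cpp
// Dependent chain: each multiply needs the previous result, so a scheduler
// that issues one instruction per cycle still stalls every other cycle
// -> roughly 0.5 instructions per cycle.
float dependent_chain(float a, float b)
{
    float x = a * b;   // issues cycle 0, result ready cycle 2
    x = x * b;         // must wait: issues cycle 2
    x = x * b;         // issues cycle 4
    return x;
}

// Independent instructions: a new one can issue every cycle, overlapping the
// two-cycle latencies -> sustained 1 instruction per cycle (2-way ILP).
float independent_pair(float a, float b, float c)
{
    float x = a * b;   // issues cycle 0
    float y = a * c;   // issues cycle 1, no dependence on x
    return x + y;      // waits on both, then issues
}
```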

