
Titan V ray tracing (almost) like an RTX 2080 Ti

15 minutes ago, leadeater said:

Use of it might come later when AMD has something, problem for 3DMark is that they have to make something that will run on any compliant DX12 DXR device not just one company's implementation. Otherwise you are making a tech demo not a benchmark, we have plenty of those from Nvidia already.

 

That may happen, it kind of depends where industry development goes. I suspect it may be one of those "it's just too new" issues, and there is no solid understanding yet of how to use Tensor cores for this purpose on an algorithmic basis rather than an AI basis, and I think the AI method goes against what 3DMark allows. Allowing an implementation that is backed by an external AI training cluster isn't really benchmarking the GPU if you ask me; it's still useful information but it ends up being more of a software/AI benchmark. It's sort of dipping your toes into pre-rendering, very slightly.

 

Oh I agree it shouldn't only run on NVIDIA hardware. But let's say AMD releases a new Navi 2.0 card a generation from now and it has a dedicated AA subprocessing component onboard that can also handle RT denoising. I'd expect 3DMark to use Tensor cores for denoising on NVIDIA hardware, and the AMD AA subprocessor for the AMD implementation.

 

The key thing is that a benchmark which ignores a significant real-world performance factor is no longer really representative of real-world performance, and that makes it every bit as much a tech demo as one that only supports a single hardware configuration.

 

As far as the AI thing, I think one of us is misunderstanding how it works. To help you make sense of my understanding, I suggest looking up a YouTuber by the name of Code Bullet. He creates AIs, primarily neural-network based, to play simple games (his hill racer video is probably the best demonstration, I'll try and track it down before I hit post). It all works by looking at a series of input variables for the scenario in question (for denoising, the content of the target pixel and the surrounding pixels), and based on those inputs it picks one of an enormous list of different permutations of algorithm sequences to apply.

At first you start with tens, hundreds or even thousands of copies of this AI, each with randomly set-up responses. They all get run, their outputs are compared to the desired result (probably the same frame rendered with enough rays that no denoising is needed beyond normal AA), and the ones that were closest are selected and given slight random mutations, producing a massive number of slightly different copies derived from just the best. Then you repeat the whole process again, tens or hundreds of times or more, until you get an acceptable final output in all test scenarios. Then you push that final algorithm to the end user.
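
To make that a bit more concrete, here's a minimal sketch of that select-mutate-repeat loop in Python. To be clear, the population size, mutation rate and the toy "fitness" comparison are all numbers I've made up for illustration (a real denoiser AI would have vastly more parameters and a far more elaborate comparison); it just shows the shape of the process, not anything NVIDIA or Code Bullet actually uses.

import numpy as np

rng = np.random.default_rng(0)

POP_SIZE = 200     # number of candidate "AIs" per generation (made-up number)
N_PARAMS = 16      # parameters each candidate uses to map inputs to an output
GENERATIONS = 100
KEEP = 20          # how many of the best candidates survive each generation

def fitness(params, inputs, target):
    # Toy stand-in for "compare the candidate's output to the desired result":
    # here a candidate is just a linear filter and the target is the clean signal.
    output = inputs @ params
    return -np.mean((output - target) ** 2)   # higher is better

# Fake training scenario: inputs and the "ground truth" output we want back.
inputs = rng.standard_normal((500, N_PARAMS))
true_params = rng.standard_normal(N_PARAMS)
target = inputs @ true_params

population = rng.standard_normal((POP_SIZE, N_PARAMS))   # random initial candidates

for generation in range(GENERATIONS):
    scores = np.array([fitness(p, inputs, target) for p in population])
    best = population[np.argsort(scores)[-KEEP:]]          # keep the closest ones
    # Breed the next generation: copies of the best with slight random mutations.
    children = best[rng.integers(0, KEEP, POP_SIZE - KEEP)]
    children = children + 0.05 * rng.standard_normal(children.shape)
    population = np.vstack([best, children])

# The single best candidate is what you'd "push to the end user".
final = population[np.argmax([fitness(p, inputs, target) for p in population])]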

 

Now that I think about it (typing this lot out has kind of made my brain work things out), I suspect that technically speaking the actual denoising algorithm is being run in the shader pipeline and it's the Tensor cores that are doing the comparison math to figure out which algorithm to use in that scenario. The thing is, such AI-created algorithmic responses still tend to be way more efficient than anything a human comes up with (and as often as not impossible for a human to understand, because the list of variables and how they're used is so ungodly complex).

 


 

 


1 hour ago, CarlBar said:

The key thing is that a benchmark which ignores a significant real-world performance factor is no longer really representative of real-world performance, and that makes it every bit as much a tech demo as one that only supports a single hardware configuration.

It currently represents ray tracing performance solely, so it's fine for that; you'd still be able to compare Nvidia RT cores to AMD [insert name]. Reading the tech notes for it though, there is still a large amount of rasterization going on, DX12 DXR is for reflections/lighting/shadows only after all, which is going to make it harder to directly compare. Are you raster performance limited or ray tracing limited, for example? It may not be totally clear.

 

1 hour ago, CarlBar said:

As far as the AI thing, I think one of us is misunderstanding how it works.

Nvidia is working on AI Deep Learning Denoising methods which are trained on their large cluster, much like DLSS. That means if you implement that method (not in use currently) you are bringing in the computation power of an external cluster. That's totally fine and excellent, but for a 3DMark benchmark of GPU capability it violates the purpose of such a benchmark: you are no longer testing just the hardware or that single GPU/system, but a wide variety of factors, some of which are external and can change over time. What would be the point of benchmarking an RTX 2080 in 2019 and getting a score of 2000, then doing it again in 2020 and getting 3000? The GPU isn't any faster, and it's not actually a driver improvement increasing the performance of the shader pipeline, RT cores or Tensor cores; you have just made it easier/quicker for the Tensor cores to look at the image and figure out the closest match to the ground truth image.

 

https://news.developer.nvidia.com/rtx-coffee-break-ray-traced-reflections-and-denoising-952-minutes/

 

Quote

Question: How does DLSS train, and why is it a big jump forward in super resolution and anti-aliasing technology?

 

Answer: The DLSS model is trained on a mix of final rendered frames and intermediate buffers taken from a game’s rendering pipeline. We do a large amount of preprocessing to prepare the data to be fed into the training process on NVIDIA’s Saturn V DGX-based supercomputing cluster. One of the key elements of the preprocessing is to accumulate the frames so to generate “perfect frames”.

https://news.developer.nvidia.com/dlss-what-does-it-mean-for-game-developers/
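
Purely to illustrate what "accumulate the frames so to generate 'perfect frames'" might mean in practice, here's a toy Python sketch. The render_frame function and its noise model are stand-ins I've made up, not NVIDIA's actual preprocessing; the point is only that averaging many renders of the same frame gives you a near-noise-free target image to train against.

import numpy as np

rng = np.random.default_rng(0)

def render_frame(true_image, samples_per_pixel):
    # Toy stand-in for one rendered frame: the true image plus Monte Carlo
    # noise that shrinks as the per-pixel sample count grows.
    noise = rng.standard_normal(true_image.shape) / np.sqrt(samples_per_pixel)
    return true_image + noise

def accumulate_perfect_frame(true_image, n_frames=64, samples_per_pixel=64):
    # Average many renders of the same frame to approximate a noise-free
    # "perfect frame" to use as the training target.
    frames = [render_frame(true_image, samples_per_pixel) for _ in range(n_frames)]
    return np.mean(frames, axis=0)

truth = rng.uniform(0.0, 1.0, size=(270, 480, 3))    # pretend scene radiance
target = accumulate_perfect_frame(truth)              # near-ground-truth training frame
noisy = render_frame(truth, samples_per_pixel=1)      # what the GPU produces in real time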

 

That Saturn V system is a cluster of 5280 Volta V100 GPUs.

 

This is something Pixar already does.

https://www.pcgamer.com/what-is-ray-tracing/

 

1 hour ago, CarlBar said:

I suspect that technically speaking the actual denoising algorithm is being run in the shader pipeline and it's the Tensor cores that are doing the comparison math to figure out which algorithm to use in that scenario.

The Tensor cores are doing the image inference analysis. (For those that don't know what inference means it's similar to an educated guess)

 

Please note the below is only information to show that the Tensor cores can be and are used to do image analysis; this isn't directly what Nvidia RTX denoising is doing, because I don't actually know that kind of detail.

 

Quote

Let’s take a deep dive into the TensorRT workflow using a code example. We’ll cover importing trained models into TensorRT, optimizing them and generating runtime inference engines which can be serialized to disk for deployment. Finally, we’ll see how to load serialized runtime engines and run fast and efficient inference in production applications. But first, let’s go over some of the challenges of deploying inference and see why inference needs a dedicated solution.

 

Quote

Why Does Inference Need a Dedicated Solution?


As consumers of digital products and services, every day we interact with several AI powered services such as speech recognition, language translation, image recognition, and video caption generation, among others. Behind the scenes, a neural network computes the results for each query. This step is often called “inference”: new data is passed through a trained neural network to generate results. In traditional machine learning literature it’s also sometimes referred to as “prediction” or “scoring”.

 

This neural network usually runs within a web service in the cloud that takes in new requests from thousands or millions of users simultaneously, computes inference calculations for each request and serves the results back to users. To deliver a good user experience, all this has to happen under a small latency budget that includes network delay, neural network execution and other delays based on the production environment.

 

Similarly, if the AI application is running on a device such as in an autonomous vehicle performing real-time collision avoidance or a drone making real-time path planning decisions, latency becomes critical for vehicle safety. Power efficiency is equally important since these vehicles may have to go days, weeks or months between recharging or refueling.

 

Today, application developers and domain experts use GPU-accelerated deep learning frameworks such as Caffe, TensorFlow, or PyTorch to train deep neural networks to solve application-specific tasks. These frameworks give them the flexibility to prototype solutions by exploring network designs, performing model assessment and diagnostics and re-training models with new data.

 

Once the model is trained, developers typically follow one of the following deployment approaches.

  • Use a training framework such as Caffe, TensorFlow or others for production inference.
  • Build a custom deployment solution in-house using the GPU-accelerated cuDNN and cuBLAS libraries directly to minimize framework overhead.
  • Use training frameworks or build custom deployment solutions for CPU-only inference.

These deployment options often fail to deliver on key inference requirements such as scalability to millions of users, ability to process multiple inputs simultaneously, or ability to deliver results quickly and with high power efficiency.

 

Quote

TensorRT Optimization Performance Results


The result of all of TensorRT’s optimizations is that models run faster and more efficiently compared to running inference using deep learning frameworks on CPU or GPU. The chart in Figure 5 compares inference performance in images/sec of the ResNet-50 network on a CPU, on a Tesla V100 GPU with TensorFlow inference and on a Tesla V100 GPU with TensorRT inference.

 

With TensorRT, you can get up to 40x faster inference performance comparing Tesla V100 to  CPU. TensorRT inference with TensorFlow models running on a Volta GPU is up to 18x faster under a 7ms real-time latency requirement.

 

[Chart: Figure 5 from the linked post, ResNet-50 inference throughput in images/sec on CPU vs Tesla V100 with TensorFlow vs Tesla V100 with TensorRT]

https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/

 

Tensor Cores execute the inference model/runtime engine.
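
To show what "execute the inference model/runtime engine" roughly looks like in code, below is a sketch following the pattern in NVIDIA's published TensorRT Python samples. The engine file name, the image shape and the idea that this particular engine is a denoiser are placeholders of mine, and the exact API calls have shifted between TensorRT versions, so treat it as an illustration rather than a recipe.

import numpy as np
import pycuda.autoinit            # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Load a previously built, serialized engine from disk
# ("denoiser.engine" is a hypothetical file name).
with open("denoiser.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

# Hypothetical input: one noisy 1080p RGB frame; a real engine defines its own binding shapes.
noisy = np.random.rand(1, 3, 1080, 1920).astype(np.float32)
clean = np.empty_like(noisy)

# Allocate device buffers and copy the noisy frame to the GPU.
d_input = cuda.mem_alloc(noisy.nbytes)
d_output = cuda.mem_alloc(clean.nbytes)
cuda.memcpy_htod(d_input, noisy)

# Run the serialized inference engine; on Volta/Turing this is where Tensor core
# kernels get used, provided the engine was built with FP16/INT8 precision.
context.execute_v2([int(d_input), int(d_output)])

# Copy the inferred (denoised) result back to the host.
cuda.memcpy_dtoh(clean, d_output)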


2 hours ago, leadeater said:

What would be the point of benchmarking an RTX 2080 in 2019 and getting a score of 2000, then doing it again in 2020 and getting 3000? The GPU isn't any faster, and it's not actually a driver improvement increasing the performance of the shader pipeline, RT cores or Tensor cores; you have just made it easier/quicker for the Tensor cores to look at the image and figure out the closest match to the ground truth image.

 

The Tensor cores are doing the image inference analysis. (For those that don't know what inference means it's similar to an educated guess)

 

Tensor Cores execute the inference model/runtime engine.
 

So basically the Tensor cores are doing roughly what I thought when I wrote up my post above: they're running major parts of the decision engine by analysing the image to determine the values of all the properties that are used to decide which action to take, using the decision array created by the supercomputer beforehand. Parts of that decision array may also be executed there.

 

And my point with the whole NVIDIA denoising is that you'd also expect to see that same 2000-to-3000 (i.e. 50%) uplift in real-world in-game FPS in supported games. A benchmark that doesn't represent that isn't very useful unless every manufacturer sees the same uplift.

 

At the same time I acknowledge that there are some differences between a benchmark and a game, which mean NVIDIA could optimize a benchmark further and faster than an actual game, and that we don't know how good the uptake really will be (I hope good, but more on that later), so Futuremark has some good reasons for not letting NVIDIA specifically into their house (to use an analogy). But if uptake does happen, it's key IMO that their benchmark approximates the real-world benefits of it in some fashion if it wants to remain valid.

 

I did say I hope this tech sees good uptake from game devs (same for DLSS), and that's because if NVIDIA can build a large enough database they should be able to develop a workable general-purpose algorithm that will work for any game. But they're going to need a stupidly large dataset and a lot of processing to get to that point, hence why they're starting on a game-by-game basis.

 

 


51 minutes ago, CarlBar said:

And my point with the whole NVIDIA denoising is that you'd also expect to see that same 2000-to-3000 (i.e. 50%) uplift in real-world in-game FPS in supported games. A benchmark that doesn't represent that isn't very useful unless every manufacturer sees the same uplift.

If you need that kind of information then you need to look at game benchmarks. Benchmarks designed to assess the performance of hardware need to do just that, not assess someone's software implementation, especially if it changes over time. You simply can't have that and also be able to compare across cards, across vendors, across generations or across time.

 

What purpose does a benchmark have if it's invalidated within only months due to an efficiency increase in the denoising used in RTX-enabled games? 3DMark is not an RTX implementation or benchmark, it is a pure DX12 DXR benchmark.

 

We already have this very same discrepancy in 3DMark now: Nvidia GameWorks Flow, HairWorks, ShadowLib, Terrain Tessellation, Turf and VXAO are all untested and unimplemented in any 3DMark benchmark, and the only source of performance information for those is game benchmarks that use them. Having HairWorks in 3DMark would be utterly useless for an industry benchmark tool, and the same goes for a denoiser built on equally custom proprietary technology.

 

There's nothing wrong with what Nvidia is doing, just remember what the purpose of the benchmark is and what information it is supposed to portray. 3DMark only gives you a relative performance indicator that is most useful for showing performance differences between GPUs, not how much performance you are going to get in games. It's a ranking/scoring system of GPUs running identical tasks; if the task is not identical then it's not fit for purpose.

 

51 minutes ago, CarlBar said:

So basically the Tensor cores are doing roughly what I thought when I wrote up my post above: they're running major parts of the decision engine by analysing the image to determine the values of all the properties that are used to decide which action to take, using the decision array created by the supercomputer beforehand. Parts of that decision array may also be executed there.

No, the Tensor cores are inferencing the actual result; that is the action. This isn't training of an algorithm to complete a task or set of tasks like the example you showed, this is the execution of the algorithm that looks at the image and returns all the pixel values of everything that is missing, its best guess.

 

There's a lot more in the field of AI, deep learning etc. than just machine learning to teach an AI to complete a task or set of tasks; that example you showed was machine learning. Training and inference are different things. Nvidia TensorRT and the entire blog post is about inference and how their framework and Tensor hardware can be used to run inference workloads, which are normally done on CPU or GPGPU.

 

Quote

1. The first step is to train a deep neural network on massive amounts of labeled data using GPUs. During this step, the neural network learns millions of weights or parameters that enable it to map input data examples to correct responses. Training requires iterative forward and backward passes through the network as the objective function is minimized with respect to the network weights. Often several models are trained and accuracy is validated against data not seen during training in order to estimate real-world performance.

2. The next step–inference–uses the trained model to make predictions from new data. During this step, the best trained model is used in an application running in a production environment such as a data center, an automobile, or an embedded platform. For some applications, such as autonomous driving, inference is done in real time and therefore high throughput is critical.

https://devblogs.nvidia.com/deploying-deep-learning-nvidia-tensorrt/

 

We're at step 2, inference: using new data, i.e. the noisy image, to get a clean image.
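
A tiny PyTorch sketch of those two steps, just to illustrate the split. The toy "denoiser" network and the random data are mine and have nothing to do with Nvidia's actual model; the point is that step 1 needs forward and backward passes over huge amounts of data (the cluster's job), while step 2 is a single forward pass per new image (the job that ships to your GPU).

import torch
import torch.nn as nn

# Toy "denoiser": maps a noisy image to a clean one. Purely illustrative.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Step 1: training (what happens on the big cluster, with real labeled data).
clean = torch.rand(8, 3, 64, 64)                 # pretend ground-truth frames
noisy = clean + 0.1 * torch.randn_like(clean)    # pretend low-sample renders
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)          # forward pass
    loss.backward()                              # backward pass
    optimizer.step()                             # adjust the weights

# Step 2: inference (what runs on the end user's GPU, per frame).
model.eval()
with torch.no_grad():                            # no training, just the best guess
    new_noisy = torch.rand(1, 3, 64, 64)         # new data the model has never seen
    denoised = model(new_noisy)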


@leadeater

 

I'm still sort of floored it took anyone this long to bring in fixed-function matrix calculations. I think people will be shocked when they bring in diagonalization; there's going to be an order of magnitude or two of computational power gained when they get there. I also hadn't thought about Linear Algebra in a while, haha.


3 hours ago, leadeater said:

Why would you willingly?

Linear Algebra was fun, but I also like Vector Mathematics. Not that one has much use for it outside of a few engineering fields.


1 hour ago, VegetableStu said:

There're Bishops in this GPU?! o_o

I laugh, but that's as easy an explanation as you're going to get.


2 minutes ago, VegetableStu said:

awwh ,_,

(I don't blame you, Computerphile videos sound pretty hard to structure for a middling audience, LOL)

Well, considering high-end math guys have been abusing diagonalization for about 30+ years, the fact we haven't seen it in fixed-function hardware yet is probably a sign that it's really hard. I'm also pretty sure everyone would be lost about three sentences into any explanation of it. (Ignoring that I'd need to refresh on the topic so that any analogies I made wouldn't send people off on the wrong tangent.)
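
For anyone who wants the one-screen version of why diagonalization is such a big computational win, here's a toy numpy illustration of my own (nothing to do with how any GPU hardware actually implements anything): once you have A = P D P^T, repeated applications of A collapse to scalar powers of the eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                      # symmetric, so a real eigendecomposition exists

w, P = np.linalg.eigh(A)               # A = P @ diag(w) @ P.T

k = 10
direct = np.linalg.matrix_power(A, k)  # repeated matrix multiplications
via_diag = P @ np.diag(w ** k) @ P.T   # scalar powers of the eigenvalues + one change of basis

print(np.allclose(direct, via_diag))   # True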

