
The Misconceptions of DirectX 12 and Vulkan

Mira Yurizaki


Note: This is a repost, since the thread where I originally had it on the forums got buried within a few hours.

 

I spent the last week and change looking up how DirectX 12 and Vulkan work and how NVIDIA and AMD are handling them. Mostly I felt there were a lot of misconceptions floating around about how the APIs are supposed to work. So here’s what I found out:

What does “low-level” API even mean?


A common description I hear is that DirectX 12 and Vulkan provide direct hardware access or allow for “close to the metal” programming. What I’m finding out, though, is that they don’t really do either. What DirectX 12 and Vulkan do is provide a way for the application to build up data that can be easily mapped to the GPU, which also reduces CPU overhead. For example, this is roughly what would happen before:

  • The application creates software pipelines that represent certain parts of how to render an object.
  • The application submits the data, settings, and other information through these pipelines.
  • During each stage of rendering, the API would look at each of these pipelines to see what data it has and whether it’s relevant.
  • The API would copy this over to the driver, which would handle the rest in a single line.
  • Sometimes, though, the application wouldn’t be done pushing things down the pipeline, so the API would wait until it was.

With DirectX 12 and Vulkan, the following happens:

  • The application tells the API it has some pipelines. Right now the common ones are graphics, compute, and copy (the copy queue is for streaming in assets).
  • The application creates an object that holds the data, settings, and instructions on how to render an object, compute a task, or stream an asset. These are called Pipeline State Objects (PSOs).
  • The API takes the PSOs and shuffles them into the appropriate pipeline (graphics, compute, or copy).
  • When the driver wants to render something, it takes the PSO, looks at it, and can efficiently figure out what to do with it.

It’d be like trying to drive somewhere using a map, stopping once in a while to make sure you know where you’re going and to plot where to go next, versus using a printout of directions.
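To make the PSO idea a bit more concrete, here’s a rough sketch of what building one looks like through Direct3D 12 (the Vulkan equivalent is VkGraphicsPipelineCreateInfo / vkCreateGraphicsPipelines). This is a minimal illustration, not working rendering code; the device, root signature, shader blobs, and input layout are assumed to already exist.

```cpp
// Minimal sketch of building a graphics Pipeline State Object (PSO) in Direct3D 12.
// Assumes `device`, `rootSignature`, `vsBlob`, `psBlob`, and `inputElements`
// were created elsewhere.
#include <climits>
#include <d3d12.h>
#include <wrl/client.h>

Microsoft::WRL::ComPtr<ID3D12PipelineState> CreateSimplePSO(
    ID3D12Device* device,
    ID3D12RootSignature* rootSignature,
    ID3DBlob* vsBlob, ID3DBlob* psBlob,
    const D3D12_INPUT_ELEMENT_DESC* inputElements, UINT numElements)
{
    // All of the state that older APIs would have juggled across many small
    // calls gets baked into this one description up front.
    D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = {};
    desc.pRootSignature = rootSignature;
    desc.VS = { vsBlob->GetBufferPointer(), vsBlob->GetBufferSize() };
    desc.PS = { psBlob->GetBufferPointer(), psBlob->GetBufferSize() };
    desc.InputLayout = { inputElements, numElements };
    desc.RasterizerState.FillMode = D3D12_FILL_MODE_SOLID;
    desc.RasterizerState.CullMode = D3D12_CULL_MODE_BACK;
    desc.BlendState.RenderTarget[0].RenderTargetWriteMask = D3D12_COLOR_WRITE_ENABLE_ALL;
    desc.DepthStencilState.DepthEnable = FALSE;
    desc.DepthStencilState.StencilEnable = FALSE;
    desc.SampleMask = UINT_MAX;
    desc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
    desc.NumRenderTargets = 1;
    desc.RTVFormats[0] = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count = 1;

    // The driver can validate and compile the whole pipeline here, once,
    // instead of re-checking state on every draw call.
    Microsoft::WRL::ComPtr<ID3D12PipelineState> pso;
    device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso));
    return pso;
}
```

The point is that everything the old APIs would have validated piecemeal at draw time is declared up front in one object the driver can digest efficiently.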

 

What is this “asynchronous compute” I keep hearing about?


Asynchronous compute is one of the hot new buzzwords circulating in the graphics world. Strictly speaking, it is a GPU’s ability to do graphics and compute workloads at the same time. There’s contention over what exactly this means, with arguments about a so-called “true asynchronous compute,” which I find silly. There are, from what I can find, two camps:

  • Parallel computing: A graphics and a compute workload can be run at the same time. AMD and NVIDIA both fit this trait.
  • Concurrent computing: A graphics and a compute workload may not be running at the same time, but it can efficiently switch between the two. AMD fits this trait while NVIDIA does not (NVIDIA takes a non-trivial penalty for switching).

If you’re confused about what parallel and what concurrent mean, here’s a handy picture:

[Image: parallel_vs_compute.png (parallelism vs. concurrency)]

Using this picture as an example, parallelism means that for the entire time the processor is running, it is running both tasks at the same time. With concurrency, the processor is running one task at a time, but switching to the other at various points.
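To tie this back to the API side: in Direct3D 12 an application opts into this by creating more than one command queue, typically a direct (graphics) queue plus a compute queue, and submitting work to both. Whether the GPU runs the two queues in parallel, interleaves them, or serializes them is up to the hardware and driver. A minimal sketch, assuming a valid device already exists:

```cpp
// Sketch: creating a graphics (direct) queue and a compute queue in Direct3D 12.
// Whether work on the two queues overlaps on the GPU is up to the hardware/driver.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    // The "direct" queue accepts graphics, compute, and copy commands.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&graphicsQueue));

    // A separate compute queue lets compute work be submitted independently
    // of the graphics work -- the "asynchronous compute" setup.
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
}
```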

 

I heard NVIDIA doesn’t have a “hardware scheduler”, what’s that about?


Back when NVIDIA designed the Kepler architecture, they decided that, because graphics workloads tended to be predictable, the hardware component that schedules instructions (the scheduler) was more complex than it needed to be. Since the workloads are predictable, they can be sorted ahead of time before being sent to the GPU. Thus NVIDIA’s scheduling doesn’t happen in hardware, but in the driver, which compiles the work for the GPU.

 

The problem with this setup is that if NVIDIA needs to do what is called a context switch (switching between mixed graphics-and-compute work and pure compute), it’s expensive, because software has to set it up. This ties into the “can NVIDIA do asynchronous compute” debate. I think it can, because NVIDIA’s own literature points to it. And such literature is written by technical writers and engineers, not marketers, because it’s for people who have to work with the chip, not for gamers looking for the best specs to salivate over.

 

Anyway, here’s a slide showing a problem with NVIDIA’s Maxwell 2 architecture:

 

[Image: maxwell_load.png (Maxwell 2 load balancing)]

The driver would compile the work, allocate which portions of the GPU would work on which tasks, and send them off. The problem was, as shown in this graph, that if the graphics part gets done before the compute part, the GPU isn’t considered done and much of it sits unused.

 

Pascal, the architecture after this, solves this problem:

[Image: pascal_load.png (Pascal load balancing)]

However, here’s where AMD has a potential leg up on NVIDIA. I’m going to add a third task. So I’ll go and edit the “Maxwell 2 Load Balancing” image to show what an AMD chip would do:

 

[Image: gcn_load.png (the same workload, plus a third task, on an AMD GCN chip)]

Surely that makes better use of resources, right? Well, Pascal can do this:

 

[Image: gcn_load_with_3rd_Task.png]

There’s a gap there because NVIDIA needs time to switch to the new job. But otherwise, it might be similar in execution time.

 

I heard NVIDIA’s GPUs can’t do asynchronous compute


For NVIDIA's GPUs before Pascal, this is more or less true. Maxwell 2 has the capability, but for some reason or another the drivers cannot form both a graphics and a compute queue for the GPU; everything executes on a single queue. But there is a question of whether Maxwell would see an appreciable improvement even if they could. Ditto for Pascal. What do I mean by this?

 

I’ve read that AMD’s GPUs work a lot like Intel’s Pentium 4 with Hyper-Threading did, while NVIDIA’s work more like AMD’s Athlon XP processors did back in the day.

The Pentium 4 had a long 20-stage pipeline (31 stages in the later models), and it took a while before the thing you put into it spit out a result. Hyper-Threading allowed two threads to run in parallel, sharing the available execution resources. It also meant that if one thread stalled, the other could keep doing work. Expand those two threads to nine workloads, and you have something resembling an R9 Fury.

 

In contrast, AMD’s Athlon XP had a shorter pipeline (best case, 10 stages). This meant that instructions would go in and come out faster than on the Pentium 4 if both processors ran at the same clock speed. In best-case loads, the Athlon XP could finish an instruction in roughly half the time of a Pentium 4, assuming each pipeline stage took one clock cycle. So adding Hyper-Threading or something similar to an Athlon XP wouldn’t really benefit it; it’s already getting the work through just as fast.

 

But wait, how does NVIDIA end up with the shorter pipeline in this analogy? Because NVIDIA got rid of the scheduler in its GPUs. The GPU knows ahead of time what it’s doing, what resources it’s going to work on, and which tasks need which execution resources. AMD’s GPUs, in contrast, still need to figure out how to best allocate their resources before executing.

 

And this may be the basis for the argument that NVIDIA’s GPUs in general cannot do asynchronous compute.

Maybe it's just a semantics game


I say this because, also in my research, I'm finding that "asynchronous compute" is more of an AMD term than an industry-standard term. So in that sense, technically NVIDIA doesn't support it. It's like saying NVIDIA doesn't support FreeSync; NVIDIA does support variable refresh rates, just through G-Sync. FreeSync and G-Sync are different implementations of the same thing. Likewise, AMD's asynchronous compute and NVIDIA's dynamic scheduling and preemption aim to achieve the same thing: getting graphics and compute done at the same time, as fast as possible.

Let’s go back to DirectX 12 and Vulkan and talk about some caveats


What DirectX 11 and OpenGL did was abstract the hardware enough to make it convenient for programmers and developers to just get something working. The benefit was that you could write more generalized code that would run more or less fine on any architecture, but this limited performance. The catch with less abstraction is that the code becomes less generalized.

 

To put a programming analogy on this, DirectX 11 and OpenGL were like programming in Java. Java has a lot of features that make development easier. One is that as long as there’s a Java Virtual Machine for the target, you can run your Java program on it with almost no changes to your code. Another feature that sticks out in my mind is garbage collection, which periodically cleans up memory that the program holds but isn’t using.

 

DirectX 12 and Vulkan are more like programming in C. C is bare-bones programming, but you can now directly manage memory and resources. The catch is that you must compile the C program for the target architecture, and you may have to write a support layer to get your program working on the target platform. And that garbage collection feature? Absent. If you forget to clean up your memory, the program will keep eating it with no way of freeing it up (this, by the way, is a memory leak).
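To make that last point concrete, here’s a toy example (everything in it is made up for illustration) of the kind of mistake a garbage collector quietly absorbs but manual memory management never forgives:

```cpp
// Toy illustration of a memory leak -- a made-up example of what
// "forgetting to clean up" means in a C-style program.
#include <cstdlib>

void processFrame()
{
    // Grab a 1 MB scratch buffer for this frame's work.
    char* scratch = static_cast<char*>(std::malloc(1024 * 1024));
    if (scratch == nullptr)
        return;

    // ... do work with the buffer ...

    // Oops: no std::free(scratch). In Java the garbage collector would
    // eventually reclaim an unreferenced buffer; here the memory is simply
    // lost, and calling this every frame eats RAM until the program dies.
}

int main()
{
    for (int frame = 0; frame < 100000; ++frame)
        processFrame();  // leaks ~1 MB per call
    return 0;
}
```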

 

There’s one major difference between programming for an AMD GPU and an NVIDIA GPU: how much work you give it. To keep an AMD GPU completely fed, you must give it a large amount of work at once (something like 64+ tasks). NVIDIA, on the other hand, favors much smaller batches (around 32 at most). If a developer optimizes for AMD, NVIDIA’s GPUs are going to choke because they can’t handle that large a number of tasks at once; they have to work through it piecemeal at best. On the other hand, optimize for NVIDIA, and it’ll run circles around AMD, because smaller workloads leave AMD's GPUs underutilized.

 

To put it in another perspective, NVIDIA is like an enthusiast sports car. It’s fast, it has horsepower, but if you make it try to tow a 1-ton trailer, it’s going to hate you. AMD on the other hand is like a super duty pickup truck. It’s not as fast, but it can carry loads and haul a camper without breaking a sweat. However, if you don’t even use it to haul anything, what’s the point?

 

However, this does not mean a game developer needs to write two entirely separate code paths to render something on an AMD or an NVIDIA GPU. Lots of things stay generalized, but for the high-performance parts where DirectX 12 and Vulkan let you get specific, you must tailor the code to a particular architecture. Likewise, in a C application there are many reusable parts that don't change between processor architectures; it's only when you want to do something high performance, or dig into assembly language, that you have to make sure only that architecture runs it.

 

As an example, I worked on a battery-operated device that would pick up a laser communication (similar to how an optical drive works), process it, and store a log of what happened. On power-up, it would enter a service mode where you could plug it into a PC to get the logs or update it. If nothing pinged it for a minute, it would hibernate to maximize battery life and wait for a laser pulse. Booting into service mode automatically made sense, because at that point the device was effectively off. However, there was a problem: since waking up from hibernation is like restarting the program, how do you prevent it from entering service mode again? By the time the device got to its normal operation, the laser pulse would be long gone.

 

What I did, in the boot code (which was in assembly, mind you), was check how the device was powering up, because a flag was set if it woke up rather than powered up cold. The code would check the flag, check whether there was still a laser pulse, then start the program and skip service mode. If there was no pulse, it would not start the program and would go back to hibernating. It had to do this fast, within hundreds of clock cycles. This is high-performance code tailored to the device's processor. I can't just drop it onto another processor, even though the rest of the program probably could be moved with minor modification.
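In rough C++ rather than the actual assembly, the decision logic looked conceptually like this. This is a hypothetical reconstruction; every name and value in it is invented for illustration.

```cpp
// Hypothetical sketch of the wake-up decision described above -- not the real
// boot code. Every name, flag, and function here is invented for illustration.
#include <cstdio>

enum class BootAction { ServiceMode, RunProgram, Hibernate };

// Decide what to do on startup, given two things the (imagined) hardware
// reports: whether this was a wake-from-hibernate, and whether a laser
// pulse is still present on the receiver.
BootAction decideBootAction(bool wokeFromHibernate, bool laserPulsePresent)
{
    if (!wokeFromHibernate)
        return BootAction::ServiceMode;   // cold power-up: enter service mode

    // Woke from hibernation: never re-enter service mode. Either the pulse
    // is still there and we start logging, or we go straight back to sleep.
    return laserPulsePresent ? BootAction::RunProgram : BootAction::Hibernate;
}

int main()
{
    // Quick check of the three paths described in the post.
    std::printf("cold boot           -> %d (service mode)\n",
                static_cast<int>(decideBootAction(false, false)));
    std::printf("wake, pulse present -> %d (run program)\n",
                static_cast<int>(decideBootAction(true, true)));
    std::printf("wake, no pulse      -> %d (hibernate)\n",
                static_cast<int>(decideBootAction(true, false)));
    return 0;
}
```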

 

tl;dr: The overall takeaway is that DirectX 12 and Vulkan are not free lunches for performance. Developers must know what they’re doing to start tapping into the potential of their hardware, and you can't exactly write one approach that works for all GPUs. For the best performance, you have to optimize for a particular architecture.

Two more points I want to dump

  • Neither AMD nor NVIDIA has "full-featured DirectX 12" GPUs. AMD is only certified for feature level 12_0 while NVIDIA is certified for 12_1, but the features required for either certification are a subset of the entire feature set DirectX 12 provides. In a twist, Intel is the only one that implements all DirectX 12 features to their fullest (well, one isn't fully featured, but it's up in the air whether anyone uses it). See https://en.wikipedia.org/wiki/Feature_levels_in_Direct3D#Direct3D_12
  • Asynchronous compute is not a feature required for DirectX 12 certification. So yes, NVIDIA's Fermi and Kepler could become DirectX 12 GPUs, but it's highly unlikely NVIDIA will do it.
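If you want to see which feature level your own GPU reports, Direct3D 12 exposes this through ID3D12Device::CheckFeatureSupport. A rough sketch, assuming a device has already been created with D3D12CreateDevice:

```cpp
// Sketch: asking a Direct3D 12 device for the highest feature level it supports.
// Assumes `device` was already created elsewhere.
#include <windows.h>
#include <d3d12.h>

D3D_FEATURE_LEVEL QueryMaxFeatureLevel(ID3D12Device* device)
{
    const D3D_FEATURE_LEVEL candidates[] = {
        D3D_FEATURE_LEVEL_12_1,
        D3D_FEATURE_LEVEL_12_0,
        D3D_FEATURE_LEVEL_11_1,
        D3D_FEATURE_LEVEL_11_0,
    };

    D3D12_FEATURE_DATA_FEATURE_LEVELS info = {};
    info.NumFeatureLevels = static_cast<UINT>(sizeof(candidates) / sizeof(candidates[0]));
    info.pFeatureLevelsRequested = candidates;

    // The device fills in MaxSupportedFeatureLevel from the list we passed.
    if (SUCCEEDED(device->CheckFeatureSupport(
            D3D12_FEATURE_FEATURE_LEVELS, &info, sizeof(info))))
        return info.MaxSupportedFeatureLevel;

    return D3D_FEATURE_LEVEL_11_0;  // conservative fallback
}
```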

Also, you're free to chime in, but digging into this topic felt like walking into a toxic wasteland in the hope that there was some gold stashed away in there, and I'm still expecting that to be the case here.
