
So I've come across this really great post worth sharing on /r/nvidia. I've tried to summarize it a little, but it's still a long read, and a somewhat technical one at that, though it's pretty easy to grasp the metaphors.

Quote

Demystifying Asynchronous Compute

As some of you may know, this is one of my favorite topics ( ;) ) and I thought I'd try to write a sort of "definitive guide" to dispel some of the misconceptions, rumors and hysteria over this feature.

So to start off we're going to briefly look at what the DirectX 12 specification says, because a little background is needed in order to approach asynchronous compute.

DirectX 12; Command Lists and Multi-Engine

DirectX 12 employs a different command submission model using command lists, each of which is created by one CPU thread and submitted to one of three queues, corresponding to one of three "engines".

Quote

Most modern GPUs contain multiple independent engines that provide specialized functionality. Many have one or more dedicated copy engines, and a compute engine, usually distinct from the 3D engine. Each of these engines can execute commands in parallel with each other. Direct3D 12 provides granular access to the 3D, compute and copy engines, using queues and command lists.

Synchronization and Multi-Engine

So an "engine" is essentially a command processor, and each has its own queue(s), with the API exposing signalling in the form of fences, which are used to coordinate work across queues. To be clear, an "engine" is an API construct; neither AMD nor NVIDIA refers to any actual hardware block as an engine. The critical thing here is independence: the hardware is no longer forced to approach things sequentially. This means you can have graphics, compute and DMA dispatches in parallel, through independent dispatchers.
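To make the queue-and-fence idea concrete, here is a minimal Python sketch. It is not Direct3D 12 code: the `Fence` class, the `simulate` function and the command names are all hypothetical stand-ins, and each "engine" is just a thread that runs its own command stream in order. The only cross-queue ordering comes from the fence, which makes the 3D stream wait for the copy stream's upload.

```python
import threading


class Fence:
    """Hypothetical stand-in for an API fence: a counter that only moves forward."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def signal(self, value):
        # Raise the fence value and wake any queue waiting on it.
        with self._cond:
            self._value = max(self._value, value)
            self._cond.notify_all()

    def wait(self, value):
        # Block the calling queue until the fence reaches `value`.
        with self._cond:
            self._cond.wait_for(lambda: self._value >= value)


def simulate():
    log = []
    log_lock = threading.Lock()
    upload_done = Fence()

    def record(engine, what):
        with log_lock:
            log.append((engine, what))

    # One command stream per engine; commands within a stream run in order.
    streams = {
        "3d": [lambda: upload_done.wait(1),
               lambda: record("3d", "draw with texture")],
        "compute": [lambda: record("compute", "particle update")],
        "copy": [lambda: record("copy", "upload texture"),
                 lambda: upload_done.signal(1)],
    }

    def run(commands):
        for command in commands:
            command()

    threads = [threading.Thread(target=run, args=(cmds,))
               for cmds in streams.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for engine, what in simulate():
        print(engine, what)
```

The compute entry can land anywhere in the log, because nothing orders it against the other two streams. The draw always lands after the upload, because the fence says so.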

 

The compute queue, or rather the compute "engine" (ACEs for AMD, GMU for NVIDIA), does not have access to fixed function hardware, so no rasterizers, no geometry engines etc. The compute queue is good for things that need ALU/FPU power and not much else.

 

The copy queue is self-explanatory, but don't forget about it. The key term really shouldn't be async compute but multi-engine; we'll get to that later.

 

Parallelism and Concurrency

Parallelism and concurrency can be defined in a multitude of ways, and you will easily find several definitions for each with a quick google search.

 

Generally speaking, parallelism is a condition that stems from having multiple work units; the key here is independence. A task that can be broken up into a series of subtasks that can be executed simultaneously is said to be parallelizable.

Slapping people is an inherently parallelizable task for a vast majority of the population; if you split up slapping into left-handed slapping and right-handed slapping, you can now slap with both hands, simultaneously - you scoundrel.

 

Concurrency can be seen as a more general notion of parallelism, with parallelism being a subset of concurrency upon which an additional condition is placed: independence. The key to concurrency is interruptibility. In a concurrent execution model, multiple tasks move forward in their execution within the same time span on a single work unit. The start and end times of running tasks overlap.

 

If you weren't content with slapping people with two hands, and wanted to add insult to injury you could spit in their eye; but you know how it is; you're a busy person, you don't have all day. It takes five seconds to go through one cycle of the Parallel Slap™, if you add a two second spit routine in after it then you'll only get to do it 8 times a minute. Ain't nobody got time for that; we've all been there, time is money. Instead you can choose to spit right in their eye in the downtime between your palms making contact with their face and the moment you retract your arms back. You can now perform two Parallel Slaps™ and a Concurrent Spit™ in 5 seconds.

Congratulations, you are now slapping in parallel and spitting concurrently - you monster.

Asynchronous Compute and DX12

Microsoft mentions a few examples of use cases for multi-engine

Right off the bat it's worth saying the term 'asynchronous' is abused and misused very often; if two events are asynchronous they are time-independent of each other. Conventionally, asynchronous execution arises when you do not have to wait for a routine to return before dispatching additional work. That is somewhat similar to how concurrency was defined earlier, which many people find confusing.

 

Parallelism: Simultaneous execution of two or more tasks; they are executing at the same instant, and therefore on independent units.

Concurrency: Overlapping execution of two or more tasks; they are not executing at the same instant, but both tasks are progressing in their execution within the same time-frame.

Asynchrony: Order-independent execution of two or more tasks; a routine can be called before the preceding routine returns.

 

Parallel is the opposite of serial.

Concurrent is the opposite of sequential.

Asynchronous is the opposite of synchronous.
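As a rough illustration of "concurrent is the opposite of sequential", here is a small Python sketch of two tasks sharing one work unit. The `task`, `run_sequential` and `run_concurrent` names are made up for this example, and the round-robin scheduling is just one simple way to interleave; nothing here is parallel, since only one step runs at any instant.

```python
def task(name, steps):
    """A task that can be paused after every step (it is interruptible)."""
    for i in range(steps):
        yield f"{name}{i}"


def run_sequential(tasks):
    # Finish each task completely before starting the next one.
    trace = []
    for t in tasks:
        trace.extend(t)
    return trace


def run_concurrent(tasks):
    # One work unit, round-robin: every pending task advances one step per pass,
    # so the tasks' start and end times overlap.
    trace = []
    pending = list(tasks)
    while pending:
        for t in list(pending):
            try:
                trace.append(next(t))
            except StopIteration:
                pending.remove(t)
    return trace


if __name__ == "__main__":
    print(run_sequential([task("slap", 3), task("spit", 2)]))
    print(run_concurrent([task("slap", 3), task("spit", 2)]))
```

In the sequential trace every slap step comes before any spit step; in the concurrent trace the two are interleaved, even though there is still only a single work unit.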

 

Let's go back to the previous example; let's say you have a particularly fiendish pet monkey perched on your shoulder. After slapping with both hands in parallel you exploit the brief stall time before retracting your arms to order your monkey to spit. If we consider the task to be "order your monkey to spit" you are executing it concurrently with a parallel slap. Once the order is given, you can move forward in the SLAP task without waiting for the monkey to actually spit. This is the essence of asynchrony. While from your POV it was concurrent, if you consider the whole system including the slapping maniac and his asshole-ish monkey the spitting and the slapping are asynchronous, and possibly parallel; monkey spits while you are retracting your arms from the slap. So now, you're slapping with both hands in parallel, ordering your monkey to SPIT concurrently, and the monkey is spitting asynchronously - you slapping, spitting lunatic.

Using a more serious example: a CPU and I/O operations. If the CPU were to wait for I/O operations to return it would spend a great deal of time waiting. Instead these operations are executed asynchronously, such that the CPU only sends the command(s) and then moves on to another task without waiting for a return. The I/O operations will still take the same amount of time, but the CPU has effectively hidden a big chunk of latency.
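The CPU and I/O case can be sketched in a few lines of Python. This is only a toy under stated assumptions: `fake_io` and `other_work` are hypothetical stand-ins that simply sleep, and the 0.1 second durations are invented. The I/O takes just as long either way; the overlapped version merely does something useful while it is in flight.

```python
import time
from concurrent.futures import ThreadPoolExecutor

IO_SECONDS = 0.1    # assumed I/O latency
WORK_SECONDS = 0.1  # assumed duration of the other task


def fake_io():
    time.sleep(IO_SECONDS)  # stand-in for a disk or network request
    return "data"


def other_work():
    time.sleep(WORK_SECONDS)  # stand-in for unrelated work
    return "done"


def blocking():
    start = time.perf_counter()
    fake_io()      # wait for the I/O to return...
    other_work()   # ...and only then start the other task
    return time.perf_counter() - start


def overlapped():
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fake_io)  # issue the request and move on
        other_work()                    # make progress while the I/O is in flight
        pending.result()                # collect the result once it is needed
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"blocking:   {blocking():.2f} s")
    print(f"overlapped: {overlapped():.2f} s")
```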

 

So now that we've established what we mean by parallel, concurrent and asynchronous, we can finally move on to how NVIDIA and AMD are able to exploit this new command submission model.

NVIDIA and AMD - Two different approaches

Now for the purposes of this example let's assume we have two tasks on two different queues, let's call them A and B.

Task A is on the graphics queue, and it uses some fixed function hardware.

Task B is on the compute queue and uses only ALU/FPU resources.

We have two GPUs; one GCN and one Maxwell based, both containing 10 SMs/CUs.

Let's assume task A executes in 10 milliseconds on a single unit (an SM or a CU).

Task B executes in 3 milliseconds on a single unit.

When assigned to a single unit, both GCN and Maxwell execute task A in 10ms with no stall time on the unit whatsoever; CU/SM utilization is 100%.

However, if we spread the workload across all 10 units, instead of executing in 1ms it takes 1.25ms on both GCN and Maxwell; there's 0.25ms of stall time, so utilization is now 80%.

If we were to leave it at that and then execute task B sequentially (spread across all 10 units) we would have a total execution time of 1.25 + 0.3 = 1.55ms.

There's some stall time on the SM/CUs however, which means there's room for improvement.
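The arithmetic so far can be checked with a few lines of Python. The `spread_ms` helper is a made-up name, and it assumes the work divides perfectly evenly across units, which is the same simplification the example itself makes.

```python
def spread_ms(single_unit_ms, units, stall_ms=0.0):
    """Time to run a task split evenly across `units`, plus any stall time."""
    return single_unit_ms / units + stall_ms


UNITS = 10

task_a_ms = spread_ms(10.0, UNITS, stall_ms=0.25)  # graphics task, stalls on FFUs
task_b_ms = spread_ms(3.0, UNITS)                  # compute task, no stall
utilization = (10.0 / UNITS) / task_a_ms           # busy time / elapsed time
sequential_ms = task_a_ms + task_b_ms              # run A, then run B

if __name__ == "__main__":
    print(f"task A: {task_a_ms:.2f} ms at {utilization:.0%} utilization")
    print(f"task B: {task_b_ms:.2f} ms")
    print(f"A then B: {sequential_ms:.2f} ms")
```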

 

It should also be mentioned that GCN has one geometry engine and one rasterizer in each Shader Engine (usually 9 CUs). NVIDIA employs a geometry engine per SM, and rasterizers shared by all SMs in a GPC (4 or 5). The balance of resources is radically different.

GCN and Asynchronous Shaders

Task A is assigned to all 10 CUs. After the first part of the task is processed, the data must then be sent to a fixed function unit (a rasterizer, for example) before returning to the CU for completion. When this happens the CUs are idle, and the ACEs dispatch work from task B to each of the CUs. In order to execute this newly assigned task a context switch must be performed. A context switch is simply the transfer of all data relevant to a task in execution (registers, cache) to some form of temporary storage, and the retrieval of the context of another task so it can execute. The effectiveness of this approach is contingent on the context switch latency being significantly smaller than the execution time of the task being swapped in. If this is not the case, the context switching latency will itself add measurable stall time on the CUs, which is the very issue we are trying to solve. So on GCN each of the ACEs can dispatch work to each of the CUs, and they enable very fast context switching thanks to a dedicated cache.

 

Now, operating under the assumption that context switch latency is negligible and that the 0.25ms of stall time from task A is contiguous this is what happens:

1. Task A is dispatched to all 10 CUs.
2. Task A is in execution for 0.5ms.
3. Task A is dispatched to fixed function unit(s) using the intermediate result from each CU.
4. ACEs assign parts of task B to each CU.
5. Task A context is swapped to dedicated cache within an ACE.
6. Task B is dispatched.
7. Task B is executing on all 10 CUs for 0.3ms.
8. Task B is finished.
9. Task A context is swapped back into each CU.
10. Task A executes for 0.5ms.
11. Task A is complete.

Total time = 1.3ms vs 1.55ms without exploiting multi-engine.

So the CUs are executing graphics and compute tasks concurrently, and the execution of work using FFUs is asynchronous; the work is dispatched and the CU moves on to task B without waiting for the FFU operation(s) to return.
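The timeline above can be turned into a tiny Python model to see why context switch latency matters so much. This is a sketch under the example's own assumptions (the stall is contiguous, and task A resumes only once task B has finished and the context is swapped back). The function names are invented, and the 0.2ms switch cost in the second case is a purely hypothetical figure chosen to show the effect, not a measured one.

```python
def interleaved_ms(busy_a_ms, stall_a_ms, task_b_ms, switch_ms=0.0):
    """Task A's time on the CUs plus whatever fills, or overruns, its stall window.

    Task B costs one switch out and one switch back on top of its own runtime.
    """
    window = max(stall_a_ms, 2 * switch_ms + task_b_ms)
    return busy_a_ms + window


def back_to_back_ms(busy_a_ms, stall_a_ms, task_b_ms):
    """No multi-engine: task A (including its stall), then task B."""
    return busy_a_ms + stall_a_ms + task_b_ms


BUSY_A, STALL_A, TASK_B = 1.0, 0.25, 0.3  # ms, from the example above

baseline = back_to_back_ms(BUSY_A, STALL_A, TASK_B)
fast_switch = interleaved_ms(BUSY_A, STALL_A, TASK_B)                 # negligible latency
slow_switch = interleaved_ms(BUSY_A, STALL_A, TASK_B, switch_ms=0.2)  # hypothetical latency

if __name__ == "__main__":
    print(f"back to back:          {baseline:.2f} ms")
    print(f"interleaved, fast:     {fast_switch:.2f} ms")
    print(f"interleaved, 0.2ms sw: {slow_switch:.2f} ms")
```

With a negligible switch cost the model reproduces the 1.3ms total. With the made-up 0.2ms switch cost the interleaved run comes out slower than simply running the two tasks back to back, which is the failure mode described above.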

 

So what's going on here is that each queue is independent; command lists within each queue execute synchronously, but command lists from different queues (different command streams) are asynchronous with respect to each other. If Task A (command list A) is a shadowmap render, asynchronous shaders (which is just the name of this implementation involving ACEs etc.) enable the ACEs to quickly context swap however many CUs to Task B while the FFUs in the Shader Engine are processing Task A. They are thus executing in parallel on the Shader Engine, asynchronously with respect to each other and concurrently on the CU. Entiende?

ACEs enable fast context switching, which means they can afford to context switch and run something else in much smaller gaps in utilization than before.

 

Nvidia's architecture does not include a dedicated context swap cache; context swaps go off-die to VRAM. This is very slow: context switch latency is orders of magnitude higher than in GCN. The approach outlined above is totally untenable on a Maxwell or Pascal GPU.

NVIDIA - Maxwell and Pascal

On Maxwell what would happen is Task A is assigned to 8 SMs such that execution time is 1.25ms and the FFU does not stall the SMs at all. Simple, right? However we now have 20% of our SMs going unused.

So we assign task B to those 2 SMs, which will complete it in 1.5ms, in parallel with Task A's execution on the other 8 SMs. Here is the problem: when Task A completes, Task B will still have 0.25ms to go, and on Maxwell there's no way of reassigning those 8 SMs before Task B completes. Partitioning of resources is static (unchanging) and happens at the draw call boundary, controlled by the driver. So if the driver estimates the execution times of Tasks A and B incorrectly, the partitioning of execution units between them will lead to idle time as outlined above.

Pascal solves this problem with "dynamic load balancing": the 8 SMs assigned to A can be reassigned to other tasks while Task B is still running, thus saturating the SMs and improving utilization.
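Here is the static partitioning arithmetic as a short Python sketch, with made-up function names. The `idealized_rebalance` function is an assumption added purely for comparison: it pretends that the SMs freed by Task A can all pick up the remaining Task B work at zero cost, which is a best case bound and not a claim about how the hardware actually schedules.

```python
TASK_A_MS, TASK_B_MS, TOTAL_SMS = 10.0, 3.0, 10  # single-unit times from the example


def static_partition(sms_for_a):
    """Fixed split: returns (finish time in ms, idle time in SM-milliseconds)."""
    sms_for_b = TOTAL_SMS - sms_for_a
    a_done = TASK_A_MS / sms_for_a
    b_done = TASK_B_MS / sms_for_b
    finish = max(a_done, b_done)
    idle = sms_for_a * (finish - a_done) + sms_for_b * (finish - b_done)
    return finish, idle


def idealized_rebalance(sms_for_a):
    """Best case when Task A finishes first: freed SMs help finish Task B.

    Assumes zero reassignment cost and perfectly divisible work (hypothetical).
    """
    sms_for_b = TOTAL_SMS - sms_for_a
    a_done = TASK_A_MS / sms_for_a
    remaining_b = max(0.0, TASK_B_MS - sms_for_b * a_done)  # SM-ms of B still left
    return a_done + remaining_b / TOTAL_SMS


if __name__ == "__main__":
    finish, idle = static_partition(8)
    print(f"static 8/2 split: {finish:.2f} ms, {idle:.2f} SM-ms idle")
    print(f"idealized rebalance: {idealized_rebalance(8):.2f} ms")
```

With the 8/2 split the static case finishes in 1.5ms with the 8 SMs sitting idle for the last 0.25ms. Trying other splits, for example `static_partition(7)`, shows how a wrong estimate by the driver shifts the idle time around rather than removing it.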

 

For some reason many people have decided that Pascal uses preemption instead of async compute. This makes no sense at all. Preemption is the act of telling a unit to halt execution of its running task. Preemption latency measures the time between the halt command being issued and the unit being ready for another assignment. Pixel level preemption is good for time-critical tasks like async timewarp for VR because it means you can delay issuing the halt command as the unit will only need to finish working on the current pixel before halting, dumping context in VRAM and being ready for a new assignment. Thankfully NVIDIA's GTX 1080 whitepaper is pretty clear and divides the "Asynchronous Compute" section into two main points; overlapping workloads and time-critical workloads. Pixel level preemption is relevant to the latter, while "dynamic load balancing" is relevant to the former.

 

So if we stopped using the term asynchronous compute and focused on multi-engine, all our lives would be much more pleasant. DX12 only requires you to have those "engines" and their queues exposed. It places no requirements whatsoever on how the independent command streams are executed; "Async Shaders" and "Dynamic Load Balancing" are just marketing terms used to introduce new features, I guess, and frankly it's unusual for such a profoundly architecture-intimate feature to be propelled to the forefront of a marketing campaign.

 


Link to comment
https://linustechtips.com/topic/652588-demystifying-asynchronous-compute/

Well yeah, to that last part: async is just a way to compensate for sloppy coding optimizations. It doesn't add performance, it just allows a little more performance to be gotten out of a card if its drivers are not fully optimized for the task.

 

In other words, it will do nothing if the card makers actually optimized their cards' workflow from the get-go.


Huh, awesome read. I've started to get really into how this stuff works, like what actually happens on the die, and not just "oh it does the computing" but more "it does this, and then this, and then this, and then you get this, etc." It's really cool and this cleared up async compute for GPUs for me. I have to say the example with the slaps made it kinda click for me tbh :P


ASYNC COMPUTE WILL MAKE AMD GREAT AGAIN!!!! WAHHHHH ...oh shit wait...it will not? well fuck that then. >:(


This whole debate is what happens when you have a ton of people who have little idea how computers work at the lowest level, fed by marketing, trying to prove their e-penis is larger.

 

I'm glad though that this person put together the explanation at a higher level.


Class is about to start... following so I can read later.
