[Updated] Oxide responds to AotS Conspiracies, Maxwell Has No Native Support For DX12 Asynchronous Compute

othertomperson · September 5, 2015

Oh? I thought many have seen this? http://arstechnica.com/gaming/2015/08/directx-12-tested-an-early-win-for-amd-and-disappointment-for-nvidia/

This is the one I've seen. Combining both puts the 980 and the 980 Ti as equal, and both equal with the 390X. This doesn't make any sense.

Kinda Bottlenecked · September 5, 2015

This is the one I've seen. Combining both puts the 980 and the 980 Ti as equal, and both equal with the 390X. This doesn't make any sense.

Where's the 290x?

othertomperson · September 5, 2015

Where's the 290x?

390X. Close enough. Same GPU anyway. These results don't make any sense when you look at them together. There are no conclusions you can draw from them. Even comparing Nvidia and AMD's own GPUs, a 290X shouldn't be that far behind a 390X and a 980 shouldn't be the same as a 980 Ti.

I think something is seriously wrong with at least one of these benchmark results and until there is more data you can't really say anything.

Briggsy · September 5, 2015

390X. Close enough. Same GPU anyway. These results don't make any sense when you look at them together. There are no conclusions you can draw from them. Even comparing Nvidia and AMD's own GPUs, a 290X shouldn't be that far behind a 390X and a 980 shouldn't be the same as a 980 Ti.

I think something is seriously wrong with at least one of these benchmark results and until there is more data you can't really say anything.

One is running 1080p/1600p, the Ars technica is 1080p, 1400p and 4K, that's where the difference is coming from, aside from game settings and other hardware configs.

Kinda Bottlenecked · September 5, 2015

390X. Close enough. Same GPU anyway. These results don't make any sense when you look at them together. There are no conclusions you can draw from them. Even comparing Nvidia and AMD's own GPUs, a 290X shouldn't be that far behind a 390X and a 980 shouldn't be the same as a 980 Ti.

I think something is seriously wrong with at least one of these benchmark results and until there is more data you can't really say anything.

First up are the average FPS scores, made up of the frame times for the entire benchmark run, combining data for normal, medium, and heavy batch scenes.

othertomperson · September 5, 2015

One is running 1080p/1600p, the Ars technica is 1080p, 1400p and 4K, that's where the difference is coming from, aside from game settings and other hardware configs.

I'm not comparing the different resolutions, I'm comparing the 290X against the 980 Ti at 1080p, and then the 390X with the 980 at 1080p. The information on each benchmark is contradictory.

One shows the 980 Ti = 290X and the other shows 390X = 980. This is a massive discrepancy.

patrickjp93 · September 5, 2015

780Ti was matched against the 290X (at the time). So was the Titan Black...

So lets see how these add up, shall we?

R9 290X

Pix rate: 64 GPixels/s

Tex Rate: 176 GTexel/s

FLOPS: 5,632 GFLOPS

GTX 780Ti

Pix rate: 52.5 GPixels/s

Tex Rate: 210 GTexel/s

FLOPS: 5,040 GFLOPS

GTX Titan Black

Pix rate: 53.3 GPixels/s

Tex Rate: 213 GTexel/s

FLOPS: 5,125 GFLOPS

You were saying?

Time for an apology to all the people you tried to bullshit i guess....

2880 * 2 * 1.1*10^9 = 6.33TFlops, its theoretical rating. You conveniently quoted the 290X theoretical rate instead of RWP but quoted the 780TI's RWP. Way to skew your data. The 780TI walks away the champ except in DP, and then the Titan Black gets involved with 1/3 performance but still actually wins in RWP because AMD's OpenCL drivers suck.

Dabombinable · September 5, 2015

They will definitely have to use the ACEs in the Xbone to maximize its performance for ARK.

Xbox stands for "DirectXbox", however that's a bit of a mouthful, so the Xbox was born (though the Original Xbox is the only one that also physically fits the name).

Prysin · September 6, 2015

https://www.techpowerup.com/

2880 * 2 * 1.1*10^9 = 6.33TFlops, its theoretical rating. You conveniently quoted the 290X theoretical rate instead of RWP but quoted the 780TI's RWP. Way to skew your data. The 780TI walks away the champ except in DP, and then the Titan Black gets involved with 1/3 performance but still actually wins in RWP because AMD's OpenCL drivers suck.

Both numbers are taken from Techpowerup... you can go look at it here

https://www.techpowerup.com/

You can say what you want, but every source i could find stated that the numbers i got there, is the RWP of both the 290X and 780Ti. Feel free to find CREDIBLE SOURCES THAT CAN DISPROVE ME.

because your math, does not.

AFAIK, if you are going by non reference 780Tis, then go by Non reference 290Xs too... shits gonna be pretty damn close.

I did some theoretical calculations like you did too, and i can ONLY make the 780Ti win, when its OCd above 1.18GHz.... below that, OC vs OC, 290X wins

patrickjp93 · September 6, 2015

https://www.techpowerup.com/

Both numbers are taken from Techpowerup... you can go look at it here

https://www.techpowerup.com/

You can say what you want, but every source i could find stated that the numbers i got there, is the RWP of both the 290X and 780Ti. Feel free to find CREDIBLE SOURCES THAT CAN DISPROVE ME.

because your math, does not.

AFAIK, if you are going by non reference 780Tis, then go by Non reference 290Xs too... shits gonna be pretty damn close.

I did some theoretical calculations like you did too, and i can ONLY make the 780Ti win, when its OCd above 1.18GHz.... below that, OC vs OC, 290X wins

My math does. The boost clock of the 780TI is 1100MHz at least on most of the OEM models. 2880 shaders * 2 ops per clock (fused multiply add) * clocks = FLOPs. That is the theoretical flops calculation for any processor. For Intel you can think of their CPUs as 4 cores * 256/32 = 8 shaders per core * 2 ops per clock * clock speed. The 4790K is 4 * 8 * 2 * 4.2 * 10^9 = 268.8*10^9 = 268.8 GFlops (4790K multicore boost tops out at 4.2 GHz).

If you take the boost clock of the 780TI reference, you get 2880 * 2 * 0.928*10^9 = 5.345*10^12 or 5.345 TFlops. Problem with that is it discounts GPU boost entirely, and most can get it to 1100 MHz easily. The 780TI won the FLOPs contest hands down.

And no, clock for clock the 780TI wins in FLOPs, because it literally has to by the definition (my equation, which is the industry equation) since Nvidia has more shaders and both have the same # of operations per clock. Again, get on my level before you challenge me.

Mahigan · September 6, 2015

My math does. The boost clock of the 780TI is 1100MHz at least on most of the OEM models. 2880 shaders * 2 ops per clock (fused multiply add) * clocks = FLOPs. That is the theoretical flops calculation for any processor. For Intel you can think of their CPUs as 4 cores * 256/32 = 8 shaders per core * 2 ops per clock * clock speed. The 4790K is 4 * 8 * 2 * 4.2 * 10^9 = 268.8*10^9 = 268.8 GFlops (4790K multicore boost tops out at 4.2 GHz).

If you take the boost clock of the 780TI reference, you get 2880 * 2 * 0.928*10^9 = 5.345*10^12 or 5.345 TFlops. Problem with that is it discounts GPU boost entirely, and most can get it to 1100 MHz easily. The 780TI won the FLOPs contest hands down.

And no, clock for clock the 780TI wins in FLOPs, because it literally has to by the definition (my equation, which is the industry equation) since Nvidia has more shaders and both have the same # of operations per clock. Again, get on my level before you challenge me.

Umm,

Why are you two debating? The GTX 780 Ti (NVIDIA Kepler GK110) can't even handle mixed mode (can't Async Compute as it can process 1 Graphic or 32 Compute). It also relies on a heck of a lot of software driven scheduling. That makes it pretty poor at hitting any sort of theoretical compute numbers. On the nVIDIA side, only Maxwell 2, in theory, can handle mixed mode (1 Graphic and 31 Compute). The driver isn't working yet but nVIDIA are working on it.

The strength of GCN is in its ability to schedule many more threads per Compute Unit than anything on the nVIDIA side. GCN also relies on pure hardware scheduling (hence the higher power usage). This didn't make sense under DX11 but it makes a hell of a lot of sense for DX12.

gilang01 · September 6, 2015

found this yesterday:

(from eteknix)

"Oxide Games, developer of the highly-anticipated Ashes of the Singularity, has revealed that NVIDIA is working on a driver to fully implement DirectX 12’s Async Compute. According to Oxide developer Kollock, in a post on overclock.net, NVIDIA is in the process of refining its Async Compute driver, with the help of Oxide.

Kollock writes:

“We actually just chatted with Nvidia about Async Compute, indeed the driver hasn’t fully implemented it yet, but it appeared like it was. We are working closely with them as they fully implement Async Compute. We’ll keep everyone posted as we learn more.”

It seems as though NVIDIA will be implementing a combination of software and hardware to handle Async Compute, rather than by hardware alone.

Mahigan of Overclock offered this thorough explanation of Async Compute:

“The Asynchronous Warp Schedulers are in the hardware. Each SMM (which is a shader engine in GCN terms) holds four AWSs. Unlike GCN, the scheduling aspect is handled in software for Maxwell 2. In the driver there’s a Grid Management Queue which holds pending tasks and assigns the pending tasks to another piece of software which is the work distributor. The work distributor then assigns the tasks to available Asynchronous Warp Schedulers. It’s quite a few different “parts” working together. A software and a hardware component if you will.

With GCN the developer sends work to a particular queue (Graphic/Compute/Copy) and the driver just sends it to the Asynchronous Compute Engine (for Async compute) or Graphic Command Processor (Graphic tasks but can also handle compute), DMA Engines (Copy). The queues, for pending Async work, are held within the ACEs (8 deep each)… and ACEs handle assigning Async tasks to available compute units.

Simplified…

Maxwell 2: Queues in Software, work distributor in software (context switching), Asynchronous Warps in hardware, DMA Engines in hardware, CUDA cores in hardware.

GCN: Queues/Work distributor/Asynchronous Compute engines (ACEs/Graphic Command Processor) in hardware, Copy (DMA Engines) in hardware, CUs in hardware.”

Despite its own problems implementing DirectX 12, AMD already has a headstart when it comes to Async Compute. Now it seems that NVIDIA is hoping to close the gap very soon."

patrickjp93 · September 6, 2015

Umm,

Why are you two debating? The GTX 780 Ti (NVIDIA Kepler GK110) can't even handle mixed mode (can't Async Compute as it can process 1 Graphic or 32 Compute). It also relies on a heck of a lot of software driven scheduling. That makes it pretty poor at hitting any sort of theoretical compute numbers. On the nVIDIA side, only Maxwell 2, in theory, can handle mixed mode (1 Graphic and 31 Compute). The driver isn't working yet but nVIDIA are working on it.

The strength of GCN is in its ability to schedule many more threads per Compute Unit than anything on the nVIDIA side. GCN also relies on pure hardware scheduling (hence the higher power usage). This didn't make sense under DX11 but it makes a hell of a lot of sense for DX12.

We're not debating Async Compute. I'm disputing his claim Nvidia has been behind in raw compute at all. It hasn't until Maxwell, and frankly that doesn't even matter since most boosts are well over 1300 MHz. Furthermore, its scheduling makes it far better at compute than AMD's offerings. The proof of that has been in Nvidia's domination of enterprise compute accelerators for the past decade. Even Intel does better with the Xeon Phis than AMD does with its FirePro offerings.

Please don't barge into a conversation when you haven't a clue what's going on.

Mahigan · September 6, 2015

We're not debating Async Compute. I'm disputing his claim Nvidia has been behind in raw compute at all. It hasn't until Maxwell, and frankly that doesn't even matter since most boosts are well over 1300 MHz. Furthermore, its scheduling makes it far better at compute than AMD's offerings. The proof of that has been in Nvidia's domination of enterprise compute accelerators for the past decade. Even Intel does better with the Xeon Phis than AMD does with its FirePro offerings.

Please don't barge into a conversation when you haven't a clue what's going on.

Oh,

I was under the impression that this was a public forum...

I'll just ignore your tone. And I apologize if I've offended you.

No, it's scheduling doesn't make it superior to AMDs offerings under DX12 (or Compute). You can ask Dr David Kanter if you wish (starts at 1:18:00 in):

Maxwell is not really a great fit for compute, because the way it got more power efficient is they threw out the scheduling hardware....

Software scheduling makes it a better gaming GPU under DX11. nVIDIAs enterprise cards use hardware based scheduling (in case you weren't aware). In fact Fermi also used hardware based scheduling. Hardware based scheduling didn't make sense for DirectX 11, you had far more flexibility with a software scheduler than a hardware scheduler. A software scheduler also makes more sense if you want to save on power usage. Since DX11 was a serial API (feeding either a Compute or a Graphic task at a time) then you didn't need a complex hardware scheduler. It was just wasting power while being under utilized.

With DX12 we're not only moving to a far more parallel pipeline but also far more compute heavy workloads. The capacity to have many threads in-flight is of high importance. Each Compute Unit, in GCN, can have 2,560 Threads in-flight:

In the Hawaii based GPUs, with 44 CUs, that's 112,640 Threads in-flight. Each CU processes 64 Threads at a time (1 Wavefront). That means that there is less latency involved in moving from one set of 64 Threads to the next, in GCN, compared to Kepler, Maxwell, Maxwell 2. Why? Because the Threads (or Grids in CUDA language) are held in a Grid Management Unit which is part of the nVIDIA driver (software scheduling). While it used to make sense for Kepler/Maxwell/Maxwell 2 for DX11 Games and it did make sense for a world which wasn't ready for heavy degrees of parallelism it no longer makes sense for the Compute and DX12 tasks right around the corner.

Ever since Kepler, nVIDIAs consumer cards have lacked most of their hardware scheduling capacities. So no, you can't hit your theoretical compute numbers (most current OpenCL apps don't even use the CPUs multiple processors in order to feed GPUs either so you can't even rely on those benchmarks for true architecture performance). The reason is, you have a hard time keeping the units fed.

GF114, owing to its heritage as a compute GPU, had a rather complex scheduler. Fermi GPUs not only did basic scheduling in hardware such as register scoreboarding (keeping track of warps waiting on memory accesses and other long latency operations) and choosing the next warp from the pool to execute, but Fermi was also responsible for scheduling instructions within the warps themselves. While hardware scheduling of this nature is not difficult, it is relatively expensive on both a power and area efficiency basis as it requires implementing a complex hardware block to do dependency checking and prevent other types of data hazards. And since GK104 was to have 32 of these complex hardware schedulers, the scheduling system was reevaluated based on area and power efficiency, and eventually stripped down.

The end result is an interesting one, if only because by conventional standards it’s going in reverse. With GK104 NVIDIA is going back to static scheduling. Traditionally, processors have started with static scheduling and then moved to hardware scheduling as both software and hardware complexity has increased. Hardware instruction scheduling allows the processor to schedule instructions in the most efficient manner in real time as conditions permit, as opposed to strictly following the order of the code itself regardless of the code’s efficiency. This in turn improves the performance of the processor.

However based on their own internal research and simulations, in their search for efficiency NVIDIA found that hardware scheduling was consuming a fair bit of power and area for few benefits. In particular, since Kepler’s math pipeline has a fixed latency, hardware scheduling of the instruction inside of a warp was redundant since the compiler already knew the latency of each math instruction it issued. So NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling.

Ultimately it remains to be seen just what the impact of this move will be. Hardware scheduling makes all the sense in the world for complex compute applications, which is a big reason why Fermi had hardware scheduling in the first place, and for that matter why AMD moved to hardware scheduling with GCN. At the same time however when it comes to graphics workloads even complex shader programs are simple relative to complex compute applications, so it’s not at all clear that this will have a significant impact on graphics performance, and indeed if it did have a significant impact on graphics performance we can’t imagine NVIDIA would go this way.

What is clear at this time though is that NVIDIA is pitching GTX 680 specifically for consumer graphics while downplaying compute, which says a lot right there. Given their call for efficiency and how some of Fermi’s compute capabilities were already stripped for GF114, this does read like an attempt to further strip compute capabilities from their consumer GPUs in order to boost efficiency. Amusingly, whereas AMD seems to have moved closer to Fermi with GCN by adding compute performance, NVIDIA seems to have moved closer to Cayman with Kepler by taking it away.

http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3

If you keep the units fed, you get a rather interesting result...

Look at that powerful GTX 780 Ti...

http://techreport.com/review/28356/nvidia-geforce-gtx-980-ti-graphics-card-reviewed/3

Pardon me if you call this barging in...

Mahigan · September 6, 2015

Again... not trying to pick a fight. Just trying to end one.

Dabombinable · September 6, 2015

We're not debating Async Compute. I'm disputing his claim Nvidia has been behind in raw compute at all. It hasn't until Maxwell, and frankly that doesn't even matter since most boosts are well over 1300 MHz. Furthermore, its scheduling makes it far better at compute than AMD's offerings. The proof of that has been in Nvidia's domination of enterprise compute accelerators for the past decade. Even Intel does better with the Xeon Phis than AMD does with its FirePro offerings.

Please don't barge into a conversation when you haven't a clue what's going on.

You really need to learn how to reference your information correctly.

patrickjp93 · September 6, 2015

You really need to learn how to reference your information correctly.

I reference when there are actual things to reference. The equation for flops is known across the damn industry. I shouldn't have to hold the kiddies' hands on everything!

patrickjp93 · September 6, 2015

@Mahigan, and yet the new Maxwell Teslas are selling like hotcakes. The only problem with Maxwell in compute workloads is its lack of native 64-bit support. Otherwise it's just as good and sometimes better than Kepler.

Dabombinable · September 6, 2015

I reference when there's actual things to reference. The equation for flops is known across the damn industry. I shouldn't have to hold the kiddies' hands on everything!

Straight up, if you say something, you need to back it up with more than just your own calculations.

Briggsy · September 6, 2015

Straight up, if you say something, you need to back it up with more than just your own calculations.

yeah, its pretty similar to writing code without documentation. It might make sense to the author, but no one else.

patrickjp93 · September 6, 2015

Straight up, if you say something, you need to back it up with more than just your own calculations.

It's not just mine! If you don't know that equation by heart at this point you must be damn blind! http://optimisationcpugpu-hpc.blogspot.com/2012/10/how-to-calculate-flops-of-gpu.html?m=1

I am sick and tired of idiots like you saying I'm full of it when I have the damn proof right in front of you! Hell! Look at how the calculations are done by all these review sites! I match it perfectly. I swear if social Darwinism was legal more than two thirds of this forum wouldn't be considered eligible for breeding.

patrickjp93 · September 6, 2015

yeah, its pretty similar to writing code without documentation. It might make sense to the author, but no one else.

Good code reads well enough minimal documentation should be required beyond good function and variable naming, but that aside, I provided proof as I always have. Get over yourselves. I'm right, again. Now enough with the keyboard warrior trolling BS. Get on my level or get the bloody hell out of my way.

Ohlyver · September 6, 2015

I hope I'm not too late with this:https://m.reddit.com/r/nvidia/comments/3j5e9b/analysis_async_compute_is_it_true_nvidia_cant_do/

Dabombinable · September 6, 2015

Good code reads well enough minimal documentation should be required beyond good function and variable naming, but that aside, I provided proof as I always have. Get over yourselves. I'm right, again. Now enough with the keyboard warrior trolling BS. Get on my level or get the bloody hell out of my way.

Where not trolling, we are just asking you to reference the information you provide correctly. Oh and to provide a source of information that isn't just yourself.

Shakaza · September 6, 2015

It's not just mine! If you don't know that equation by heart at this point you must be damn blind! http://optimisationcpugpu-hpc.blogspot.com/2012/10/how-to-calculate-flops-of-gpu.html?m=1

I am sick and tired of idiots like you saying I'm full of it when I have the damn proof right in front of you! Hell! Look at how the calculations are done by all these review sites! I match it perfectly. I swear if social Darwinism was legal more than two thirds of this forum wouldn't be considered eligible for breeding.

While I think you're in the right here, I don't think that's sufficient justification for insulting other members, regardless of how frustrated you are. If you have a problem with them, you can take it up with the mods, not fly insults in their faces. Rules exist for a reason.

Sign In

[Updated] Oxide responds to AotS Conspiracies, Maxwell Has No Native Support For DX12 Asynchronous Compute

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites