
(Updated) AMD Navi GPU to Offer GTX 1080 Class Performance at ~$250, Report Claims

Ryujin2003
6 hours ago, Valentyn said:
7 hours ago, leadeater said:

 

On the latest gaming reviews, that's moved up to 30-40% with the latest drivers.
https://www.hardocp.com/article/2018/03/20/nvidia_titan_v_video_card_gaming_review

No, it's a 34% increase at most in games that aren't programmed by rabid, caffeine-deprived monkeys. That's directly in line with the CUDA core increase.


41 minutes ago, Razor01 said:

Base performance = throughput, and nV is so far ahead of AMD it's not funny.

Aren't ROPs what affect throughput the most? Because I seem to remember, and I might be wrong, you writing about AMD decoupling their ROPs from something while NV didn't, and the gist of it being that the decoupling is coming to NV too. Because as I see it, with my limited understanding, AMD being stuck at 64 ROPs is what's holding them back. If Navi is indeed MCM, isn't it feasible for each CCX to have the optimal number of ROPs for the CUs there and thus break the hard limit? I understand the scheduler is probably way more challenging for an MCM GPU, but assuming AMD somehow manages that, wouldn't they be able to harness the advantages of GCN while negating its abysmal scaling?


AMD's ROPs are more capable than nV's ROPs: even though AMD has lower fillrates on the ROPs, they are able to do more ops at the same time, so even with half the ROP count they are fine there. But this isn't the full reason either.

 

Fiji did better at 4K, where fillrate and shader counts matter more. But this was when AMD had a 50%+ advantage in raw shader horsepower (8.6 TFLOPs vs 5.6 TFLOPs, Fury X vs 980 Ti respectively). Pixel-fillrate-wise, the Fury X, if I remember correctly, was ~70 Gpixels/s vs the 980 Ti's ~100 Gpixels/s. At that resolution, though, current games become more shader bound than fillrate bound, so you need the shader throughput more than the fillrate.
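Just to put those figures side by side, here's a quick sketch using only the numbers quoted above:

```python
# Quick sanity check on the Fury X vs 980 Ti numbers as quoted in the post.
fury_x_tflops, gtx980ti_tflops = 8.6, 5.6      # raw shader throughput, TFLOPs
fury_x_gpix,   gtx980ti_gpix   = 70.0, 100.0   # rough pixel fillrate, Gpixels/s

print(f"Fury X shader lead:      {fury_x_tflops / gtx980ti_tflops - 1:+.0%}")  # about +54%
print(f"Fury X fillrate deficit: {fury_x_gpix / gtx980ti_gpix - 1:+.0%}")      # about -30%
```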

 

OK, multiple dies with Navi.

 

The latency increase across AMD's CCX modules is in the neighborhood of 200% to 300%: roughly 50ns going up to ~150ns. That's a huge amount. To cover that kind of latency, the chips need to keep 2 to 3 times more data in flight to fill that time; they would need to be able to create work 2 to 3 times faster. Now, with graphics memory that latency penalty will be cut down considerably, but not to 2 to 3 times, maybe 1.5 to 2 times, so you're still looking at chips that need to process 50% to 100% faster just to keep the SAME throughput as before. If throughput increases, you need that much more processing capability per clock on top of it.
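Here's a rough back-of-the-envelope of what that means for latency hiding. Only the 50ns/150ns figures come from the post; the issue rate is a made-up placeholder:

```python
# Toy Little's-law style estimate of how much work must stay in flight to hide latency.
# The 50 ns / 150 ns figures are from the post above; the issue rate is invented.

def work_in_flight(latency_ns, issue_rate_per_ns):
    # To keep the execution units fed, roughly latency * issue_rate independent
    # work items must be outstanding at any moment.
    return latency_ns * issue_rate_per_ns

issue_rate = 4.0  # hypothetical: work items the chip can start per nanosecond

on_die    = work_in_flight(50.0, issue_rate)    # monolithic die
cross_die = work_in_flight(150.0, issue_rate)   # MCM-style cross-die hop

print(on_die, cross_die, cross_die / on_die)    # 200.0 600.0 3.0 -> ~3x more work in flight
```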

 

GPUs are excellent at hiding latency because they can usually process data much faster than they can move it around. But the problem is we are looking at huge increases here; I don't think they can hide that much latency by themselves. Once a certain amount of latency is introduced, a GPU behaves differently from a CPU: a CPU loses some performance, while a GPU utterly falls apart. I can't remember exactly, but something like 6ns of added latency is enough to cut out around 12 FPS (also program dependent), and here we are talking about many multiples of 6ns.

 

Let's look back at the early theories of why Maxwell failed at async compute tasks; one was that it used a context switch. A context switch forces the GPU to flush out its SMs and do another task, which introduces latency; we're talking about a couple of cycles here, not too much. But even that would be worse than what we actually saw happen with Maxwell, lol. In theory this is what was happening, just not on purpose: the driver was trying to do something the hardware could not do, and the GPU then used a context switch so it wouldn't stall.

 

 


15 minutes ago, Razor01 said:

-snip-

So it sounds like without GCN dying there is not much for AMD to tweak. Another thing: would it not be possible to mitigate the huge latency increase with some kind of preprocessor die (like the one rumored for Threadripper 2/3), which, for example, would break the screen into several quadrants, like 4 for 4 CCXs, although that does sound like a tearing nightmare? Because even though I know jack shit about even amateur-level GPU operation, it does seem there might be some ways to mitigate it: either at the API level, which theoretically could remove nearly all the additional latency (if the API pre-schedules workloads for the CCXs, eliminating the need for the hardware to do so), or at the hardware level, with something like a neural net splitting workloads in a way that eliminates the need to process the separate CCX outputs in order to merge them.

Perhaps it's all just bullshit stemming from me not understanding how a modern GPU works; all my computer knowledge is based on an old Soviet book explaining the most basic instruction sets and their hardware-level implementation.

 

Oh, and another thing: could you perhaps recommend some literature to start understanding the inner workings of silicon on a deeper level?


10 minutes ago, hobobobo said:

-snip-

 

A preprocessor die or caching die can help in this regard, but not completely. It will help with storing data and routing it to the necessary units as needed, but hiding the latency has to be done by the drivers and software. And yeah, it will cause problems like tearing if it's not done correctly on the software side.

 

Hmm, the best way to learn is to do the programming lol. For references it's really hard, because we are talking about many different aspects of chips; if it were a single topic it would be easy to look up ;)


6 minutes ago, Razor01 said:

Hmm, the best way to learn is to do the programming lol. For references it's really hard, because we are talking about many different aspects of chips; if it were a single topic it would be easy to look up ;)

Oh, I've wanted to take up programming as a hobby, but with the number of languages out there I've got no idea where to start, since I don't really want to learn something dying like Java, and I was told high-level languages like Python or Ruby aren't a good place to start either.

On the hardware side, I've figured it all looks like a branching tree, with all the new stuff branching out of the basic old things. Perhaps several branching trees, but it seems even things like x86 and VLIW share a lot at the most basic level, and the differences become more and more prominent as you introduce complexity.


It's like every year it's "next year will be better, it will outperform Nvidia", and every year it's the same "next year will be better", and it goes on and on.

I hope this time it's for real.


6 hours ago, Trixanity said:

Nvidia said they could use the tensor cores to train a NN to reconstruct an image using fewer rays. That's it. Other than that, Volta does not use the tensor cores for raytracing whatsoever, hence I doubt it'll be a thing for most gaming cards, since they'll still be able to do fast raytracing without them. It's a combination of hardware and software that Nvidia won't reveal the specifics of.

The Tensor cores are used for denoising; without it, the produced image would either look very grainy or they would have to do a full ray simulation, which is far too demanding. If you cull ray paths, then they don't get rendered, so that's missing information in the final image. Denoising is a process in which those missing bits of information are generated by looking at the pixels around them and guessing what they would have been if they had been ray traced.
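To illustrate the idea, here's a toy sketch: it just fills un-traced pixels from the average of their traced neighbours, which is nowhere near Nvidia's actual AI denoiser; the function and the data in it are made up for illustration.

```python
# Naive "denoiser": fill pixels whose rays were culled from their traced neighbours.
# This only illustrates the idea described above, not how Nvidia actually does it.
import numpy as np

def naive_denoise(image, traced_mask, radius=1):
    """Fill un-traced pixels with the mean of traced pixels in a small neighbourhood."""
    out = image.copy()
    h, w = image.shape
    for y in range(h):
        for x in range(w):
            if traced_mask[y, x]:
                continue  # this pixel actually got a ray, keep it
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            patch, mask = image[y0:y1, x0:x1], traced_mask[y0:y1, x0:x1]
            if mask.any():
                out[y, x] = patch[mask].mean()  # guess from the surrounding traced pixels
    return out

# Example: a sparse render where only ~30% of the pixels got a ray.
rng = np.random.default_rng(0)
truth = rng.random((8, 8)).astype(np.float32)
mask = rng.random((8, 8)) < 0.3
sparse = np.where(mask, truth, 0.0)
print(naive_denoise(sparse, mask))
```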


5 hours ago, Trixanity said:

What I can glean from that is Nvidia has changed things in the backend to make raytracing faster while having optimized software running on the CUDA cores. Meanwhile they can leverage tensor cores to accelerate certain things, but the heavy lifting is done elsewhere.

What they changed was the job scheduling and resource assignment, so more tasks can run concurrently while avoiding underutilized GPU resources. Raytracing needs a lot of tasks to run at once; each one is not that demanding, but collectively it's extremely demanding, and if you have to assign a minimum amount of GPU resources to each task you get tons of wastage.


Since there was a side discussion about D3D12 and all that fun jazz, I feel people still don't seem to grasp what D3D12 does versus D3D11. I don't claim to be an expert in DirectX programming (it'd be nice if someone who's worked on a game engine would come over to clear up some things), but it helps to understand it at least from a higher-level point of view, starting with how D3D12 optimized the job submission portion. There's a whole primer on it at https://msdn.microsoft.com/en-us/library/windows/desktop/dn859354(v=vs.85).aspx.

 

Another thing to note: in order to be D3D12 compliant, the GPU must support one of the feature levels described in https://msdn.microsoft.com/en-us/library/windows/desktop/mt186615(v=vs.85).aspx. It does not matter how these features are implemented, only that they're implemented. And one of the curious things that got a lot of noise but shows up nowhere in this list? Asynchronous compute. It's not a feature of D3D12, and it doesn't make any sense that it would be either.

 

Also, it helps to understand what the ACEs in GCN are trying to do: they're hardware schedulers. To summarize, I'm just going to steal this from a reddit post:

Spoiler

Task A is assigned to all 10 CUs; after the first part of the task is processed, the data must then be sent to a fixed-function unit (a rasterizer, for example) before returning to the CU for completion.

 

When this happens, the CUs are idle, and the ACEs dispatch work from task B to each of the CUs.

 

In order to execute this newly assigned task a context switch must be performed.

 

A context switch is simply the transfer of all data relevant to a task in execution (registers, cache) to some form of temporary storage and the retrieval of context from another task so it can execute.

 

The effectiveness of this approach is contingent on the context switch latency being significantly smaller than the execution time of the task being swapped in. If this is not the case, then the context switching latency will have a measurable effect on the stall time on the CUs, which is the issue we are trying to solve.

So on GCN each of the ACEs can dispatch work to each of the CUs, and they enable very fast context switching thanks to a dedicated cache.

 

Now, operating under the assumption that the context-switch latency is negligible and that the 0.25ms of stall time from Task A is contiguous, this is what happens.

 

1. Task A is dispatched to all 10 CUs.
2. Task A executes for 0.5ms.
3. Task A's intermediate results from each CU are dispatched to the fixed-function unit(s).
4. The ACEs assign parts of Task B to each CU.
5. Task A's context is swapped to the dedicated cache within an ACE.
6. Task B is dispatched.
7. Task B executes on all 10 CUs for 0.3ms.
8. Task B finishes.
9. Task A's context is swapped back into each CU.
10. Task A executes for 0.5ms.
11. Task A completes.

 

Total time = 1.3ms vs 1.55ms without exploiting multi-engine
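A quick check of that arithmetic, using only the numbers from the quote and its assumption that the context switch itself is effectively free:

```python
# Back-of-the-envelope check of the multi-engine example quoted above.
task_a_exec = 0.5 + 0.5   # ms of CU time for Task A, split around the fixed-function stage
ffu_stall   = 0.25        # ms the CUs would sit idle waiting on the fixed-function unit
task_b_exec = 0.3         # ms of CU time for Task B

# Without multi-engine: the stall is dead time and Task B runs afterwards.
serial_total = task_a_exec + ffu_stall + task_b_exec

# With multi-engine: the ACEs swap Task B in during the stall; since Task B takes slightly
# longer than the stall, its overflow sets the extra time instead of the stall itself.
overlap_total = task_a_exec + max(ffu_stall, task_b_exec)

print(f"serial:     {serial_total:.2f} ms")   # 1.55 ms
print(f"overlapped: {overlap_total:.2f} ms")  # 1.30 ms
```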

In case people aren't aware, NVIDIA schedules GPU jobs at the driver level, by having the drivers allocate resources for a job and then submit it to the GPU. The problem with this approach in Maxwell was that once the job was submitted, those resources were committed to only that job and couldn't change (this is called static partitioning). Pascal solved this by allowing GPU resources that are done with a task to work on another:

Spoiler

On Maxwell what would happen is Task A is assigned to 8 SMs, such that execution time is 1.25ms and the FFU does not stall the SMs at all. Simple, right? However, we now have 20% of our SMs going unused.

 

So we assign task B to those 2 SMs which will complete it in 1.5ms, in parallel with Task A's execution on the other 8 SMs.

 

Here is the problem: when Task A completes, Task B will still have 0.25ms to go, and on Maxwell there's no way of reassigning those 8 SMs before Task B completes. Partitioning of resources is static (unchanging) and happens at the draw call boundary, controlled by the driver.

 

So if the driver estimates the execution times of Tasks A and B incorrectly, the partitioning of execution units between them will lead to idle time as outlined above.

 

Pascal solves this problem with 'dynamic load balancing': the 8 SMs assigned to A can be reassigned to other tasks while Task B is still running, thus saturating the SMs and improving utilization.

In other words, ACEs are AMD's way of handling the multiple command queues that D3D12 exposes, by dynamically scheduling them in hardware. NVIDIA handles them by scheduling the commands ahead of time at the driver level, then executing them on the GPU, and, starting with Pascal, resolving execution bubbles with dynamic load balancing.
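A toy sketch of the idle time described in the spoiler above, using the quoted 8 SM / 2 SM split and timings; the 'dynamic' case is idealized rather than a model of real hardware:

```python
# Static partitioning vs (idealized) dynamic load balancing, using the quoted numbers:
# 10 SMs total, Task A gets 8 SMs for 1.25 ms, Task B gets 2 SMs for 1.5 ms.
sms_a, time_a = 8, 1.25   # ms
sms_b, time_b = 2, 1.50   # ms
sms_total = sms_a + sms_b
frame_time = max(time_a, time_b)

# Maxwell-style static partitioning: A's SMs cannot be reassigned once A finishes,
# so they sit idle until B is also done.
idle_static = sms_a * (frame_time - time_a)   # 8 * 0.25 = 2.0 SM-ms wasted

# Pascal-style dynamic load balancing (idealized): freed SMs immediately pick up
# other queued work, so in this toy model no SM-time goes to waste.
idle_dynamic = 0.0

for name, idle in (("static", idle_static), ("dynamic", idle_dynamic)):
    util = 1 - idle / (sms_total * frame_time)
    print(f"{name:8s} {idle:.2f} SM-ms idle, {util:.0%} utilization")
```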

 

And lastly, one point I want to make is that D3D12 doesn't suddenly grant more performance for being lower overhead. If the GPU is already saturated with work using a D3D11 task, performance is not going to improve much, if at all, in a D3D12 task. The whole point of D3D12 was to make the command list operation of the graphics rendering process more efficient.
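And a tiny sketch of that last point. The millisecond numbers here are invented purely for illustration; the only takeaway is that cheaper submission doesn't help a GPU-bound frame:

```python
# Crude model: CPU submission and GPU execution pipeline against each other,
# so whichever is slower sets the frame time. All numbers are hypothetical.

def frame_time(cpu_submit_ms, gpu_work_ms):
    return max(cpu_submit_ms, gpu_work_ms)

gpu_work = 16.0                     # GPU-bound scene
print(frame_time(8.0, gpu_work))    # heavier D3D11-style submission -> 16.0 ms
print(frame_time(3.0, gpu_work))    # cheaper D3D12-style submission -> still 16.0 ms

gpu_work = 5.0                      # CPU-bound scene (lots of draw calls, light shading)
print(frame_time(8.0, gpu_work))    # 8.0 ms
print(frame_time(3.0, gpu_work))    # 5.0 ms -> lower overhead actually shows up here
```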


35 minutes ago, leadeater said:

The Tensor cores are used for denoising; without it, the produced image would either look very grainy or they would have to do a full ray simulation, which is far too demanding. If you cull ray paths, then they don't get rendered, so that's missing information in the final image. Denoising is a process in which those missing bits of information are generated by looking at the pixels around them and guessing what they would have been if they had been ray traced.

I don't know if we are allowed to redirect to research papers, but this one explains it well: Link . I don't know if you guys have access to it, but I got to it through Google without my university VPN, so it seems it is free to consult as is.


1 hour ago, M.Yurizaki said:

and starting with Pascal, resolving execution bubbles by Dynamic Load Balancing.

Which they further refined in Volta; there was a nice article explaining the improvements over Pascal. I'll find it later and edit this post with it.


7 minutes ago, leadeater said:

Which they further refined in Volta; there was a nice article explaining the improvements over Pascal. I'll find it later and edit this post with it.

I think I commented about that in the GPU forum. Something about being able to handle groups of threads that take one branch or the other efficiently.


>WCCFTech puts out more bullshit unconfirmed claims

>MRW


"There is a fine line between not listening, and not caring. I'd like to think I walk that line every day of my life."

 

 

Spoiler

Desktop:

Spoiler

CPU: Ryzen 5 2600X w/stock cooler, Motherboard: MSI X370 GAMING PLUS, RAM: Corsair Vengeance LPX 24gb DDR4-2600, GPU: EVGA RTX 2070 SUPER XC, Case: NZXT S340, PSU: Corsair RMx 750w, Keyboard: Corsair K50, Mouse: Corsair Ironclaw

Laptop:

Spoiler

Lenovo IdeaPad S540

 

 

Link to comment
Share on other sites

Link to post
Share on other sites

2 hours ago, M.Yurizaki said:

-snip-

You summed it up very well!

 

Both AMD and nV took different approaches to solving underutilization. AMD's vision with async compute is good in that it gives programmers the ability to extract more from the hardware themselves, but nV took a much more basic approach built around raw throughput. Although Pascal fills in the holes a bit, it's still not as capable as what AMD can do with GCN. Getting that extra performance out of Maxwell is probably impossible, and even with Pascal you won't be able to extract as much that way as with GCN.


Also, one more thing of note: NVIDIA added widespread support for D3D11 deferred contexts in 2011, before Kepler was even out (https://forums.anandtech.com/threads/6950-vs-gtx-460-768mb-why-does-nvidia-beat-the-radeon-in-civ5.2155665/page-2#post-31520674). D3D11 deferred contexts are arguably what gave NVIDIA the edge over AMD (I'm finding posts as late as 2015 still complaining that AMD doesn't support them). And it wasn't just that NVIDIA added support for games that used it; they apparently made it so that, at the driver level, it would be used regardless of whether the game used it or not.

 

So while people like to claim Kepler (and probably Maxwell) was a D3D11 beast, it probably really wasn't because of the hardware.


1 hour ago, Dylanc1500 said:

@M.Yurizaki

I believe this is the "article" you are looking for; if I am wrong, let me know. https://devblogs.nvidia.com/inside-volta/

Partly, yes; there was another one with a graphic of the SM cores and the resources assigned to each task, showing how Volta is more dynamic in that allocation than Pascal. Both articles are good reads.

 

Here's the article; the area improved is Multi-Process Service (MPS). https://docs.nvidia.com/deploy/mps/index.html

 


 

Though I get the feeling we're going a bit off track and talking a lot about Nvidia in an AMD news topic. 


49 minutes ago, M.Yurizaki said:

-snip-

 

Well, for multithreading in DX11, I don't know why AMD didn't implement that in their drivers as well; it would have been a worthwhile investment. Maybe they thought fast-tracking LLAPIs was a better fit for them, being resource constrained.


15 minutes ago, leadeater said:

-snip-

That CUDA page is really good; it spells out the differences in what the SM blocks are like and what their capabilities are with Volta.


1 minute ago, Razor01 said:

That is a good article; it really spells out what the SM blocks are like and what their capabilities are with Volta.

Yep, though it's mostly all aimed at the HPC compute audience, the tech itself can be used in other ways. Volta is actually a huge improvement for large shared HPC clusters running multiple loads of different types. That fine-grained resource allocation and QoS is really awesome. We've come a long way since GPUs could only do one thing at a time. There really isn't much in Volta at all that is geared towards gamers, to be honest.


17 minutes ago, leadeater said:

-snip-

 

 

Well, as long as it has more units and can use them while drawing less power, and I think that is what we are going to get: a chip that performs like an equal-tier Pascal card, but with higher performance at lower power consumption. If they up the clocks to Pascal levels or higher in the same TDP envelope, that's much more performance, but I don't think they will do that at this point; with no competition they can clock these cards in their ideal envelopes and they're set. nV stated in one of their interviews that on the 12nm process they can clock higher than on 16nm.

 

Now, we don't know if the gaming versions will be using TSMC 12nm; there are rumors that the gaming cards are using GF or Samsung, lol. But that's for a different topic.

