
Titan V Will Not Get NVLink OR SLI

3 hours ago, Bit_Guardian said:

Those apps are coded incorrectly then. NVLink, providing the coherent fabric and high bandwidth that it does, puts PCIe to shame. The scaling should be tiered and layered, just as it is in all cluster-style computing. HPC is about maximising the use of your resources at hand.

Only if the code requires it; video and image rendering does not require a coherent fabric or shared frame buffer at all. One person's or one application's requirements don't make it a requirement for everything.

 

DX12 mGPU AFR doesn't require it either, yet it allows full use of both GPUs' frame buffers, unlike SLI where the frame buffers are mirrored and therefore not additive.

 

If you do not need data to pass between GPUs then you absolutely do not need NVLink, and the use cases that don't require it are far more numerous than the ones that do. Also, two Titan Vs with an average of 54% scaling deliver more performance than a single Tesla V100 that supports NVLink, which alone is almost twice the TCO.

 

Titan V: 2 x 3000 = 6000 = 1.54x performance scaling

Tesla V100: 1 x 10000 = 10000 = 1.00x performance scaling

Tesla V100: 2 x 10000 = 20000 = 1.92x performance scaling

 

Note: I am assuming the same NVLink vs PCIe performance scaling as the Pascal Tesla P100.

 

54% more performance on average for 4000 (40%) less sounds like a good deal to me. Alternatively, for another 14000 you could get 38% more performance than two Titan Vs, but you would have to really need that 38% to justify the price difference.
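As a sanity check, the price/performance arithmetic above can be reproduced in a few lines (the prices and scaling factors are this post's rough assumptions, not official figures):

```python
# Rough price/performance comparison using the figures assumed in this post
# (USD prices and scaling factors are assumptions, not benchmark results).
titan_v_price = 3000
tesla_v100_price = 10000

configs = {
    "2x Titan V (PCIe)":      (2 * titan_v_price, 1.54),
    "1x Tesla V100":          (1 * tesla_v100_price, 1.00),
    "2x Tesla V100 (NVLink)": (2 * tesla_v100_price, 1.92),
}

for name, (cost, scaling) in configs.items():
    print(f"{name}: {cost} for {scaling:.2f}x the performance of a single V100")

# Two Titan Vs cost 4000 (40%) less than one V100 while scaling to 1.54x.
saving = 1 - 2 * titan_v_price / tesla_v100_price
print(f"saving vs one V100: {saving:.0%}")  # 40%
```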

 

More importantly, Nvidia's own data shows a limited performance benefit when using NVLink instead of PCIe, and also that NVLink is NOT a requirement for Unified Memory, which is native to CUDA 8.0 and any Pascal or newer GPU architecture, whether PCIe or NVLink.

[Image: performance chart from Nvidia's V100 application performance guide]

http://images.nvidia.com/content/pdf/v100-application-performance-guide.pdf

 

Quote

Just as with the GPU-to-GPU transfers, we see that NVLink enables much faster performance. Remember that this type of transfer occurs whenever you load a dataset into main memory and then process some of that data on the GPUs. These transfer times are often a factor in overall application performance, so a 3X speedup is welcome. This increased performance may also enable applications which were previously too data-movement-intensive.

 

Finally, consider that NVIDIA CUDA 8.0 (together with the Tesla P100 GPUs) allows for fully Unified Memory. You will be able to load datasets larger than GPU memory and let the system automatically manage data movement. In other words, the size of GPU memory no longer limits the size of your jobs. On such runs, having a wider pipe between CPU and GPU memory is of immense importance.

https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/

 

So in real-world usage, NVLink gives you 3x the bandwidth between GPUs, but going without it does not incur a significant performance penalty unless you are doing something 'data-movement-intensive', and for those workloads you need an IBM system with NVLink between the CPU and GPU.


1 hour ago, leadeater said:

Only if the code requires it; video and image rendering does not require a coherent fabric or shared frame buffer at all. One person's or one application's requirements don't make it a requirement for everything.

 

DX12 mGPU AFR doesn't require it either, yet it allows full use of both GPUs' frame buffers, unlike SLI where the frame buffers are mirrored and therefore not additive.

 

If you do not need data to pass between GPUs then you absolutely do not need NVLink, and the use cases that don't require it are far more numerous than the ones that do. Also, two Titan Vs with an average of 54% scaling deliver more performance than a single Tesla V100 that supports NVLink, which alone is almost twice the TCO.

 

Titan V: 2 x 3000 = 6000 = 1.54x performance scaling

Tesla V100: 1 x 10000 = 10000 = 1.00x performance scaling

Tesla V100: 2 x 10000 = 20000 = 1.92x performance scaling

 

Note: I am assuming the same NVLink vs PCIe performance scaling as the Pascal Tesla P100.

 

54% more performance on average for 4000 (40%) less sounds like a good deal to me. Alternatively, for another 14000 you could get 38% more performance than two Titan Vs, but you would have to really need that 38% to justify the price difference.

 

More importantly, Nvidia's own data shows a limited performance benefit when using NVLink instead of PCIe, and also that NVLink is NOT a requirement for Unified Memory, which is native to CUDA 8.0 and any Pascal or newer GPU architecture, whether PCIe or NVLink.

[Image: performance chart from Nvidia's V100 application performance guide]

http://images.nvidia.com/content/pdf/v100-application-performance-guide.pdf

 

https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/

 

So in real-world usage, NVLink gives you 3x the bandwidth between GPUs, but going without it does not incur a significant performance penalty unless you are doing something 'data-movement-intensive', and for those workloads you need an IBM system with NVLink between the CPU and GPU.

They most certainly do. Actually read the VP9 codec code: cache coherency and scaling are everything. Compression (encoding) gets a ton of benefit from multi-frame analysis, and having a coherent fabric speeds that up tremendously when you have multiple processors.

 

It's not a requirement. It is, however, an accelerator for coherent memory requirements.

 

Or a Cray x86 system with NVLink, or a Qualcomm/Cavium ARM system with NVLink, or a SPARC M7 system with NVLink... IBM is not the only game in town with NVLink in servers. AMD will implement it too if they want contracts with AWS and Alibaba.


18 minutes ago, Bit_Guardian said:

They most certainly do. Actually read the VP9 codec code: cache coherency and scaling are everything. Compression (encoding) gets a ton of benefit from multi-frame analysis, and having a coherent fabric speeds that up tremendously when you have multiple processors.

It's not a requirement, and the performance difference is very minimal; did you not actually go and check, or read any of the linked material at all? Also, it's fully able to be utilized over the PCIe bus if the application is using CUDA 8+.

 

20 minutes ago, Bit_Guardian said:

Or a Cray x86 system with NVLink, or a Qualcomm/Cavium ARM system with NVLink, or a SPARC M7 system with NVLink... IBM is not the only game in town with NVLink in servers. AMD will implement it too if they want contracts with AWS and Alibaba.

There are no ARM or SPARC CPU-to-GPU NVLink products that I am aware of; the only commercially available CPU-to-GPU NVLink is from IBM. Nvidia's own DGX-1, a purpose-built 8-Tesla-GPU machine learning appliance, uses standard x86 Intel CPUs with NVLink only between the GPUs. The use cases that require NVLink between the CPU and GPU are very few, and Nvidia will tell you this themselves.

 

AMD will not be implementing NVLink, ever; they, and everyone else other than Nvidia and Intel, are pushing Gen-Z. NVLink wins on raw bandwidth now and into the short-term future, but you have to need it to want it, whereas Gen-Z is far more flexible and useful in a multi-node cluster. Don't be surprised to see NVLink and Gen-Z in use at the same time in the same cluster.

 

As far as AWS goes, they use Intel CPUs for their Tesla V100 instances.

 

Quote

P3 instances use customized Intel Xeon E5-2686v4 processors running at up to 2.7 GHz. They are available in three sizes (all VPC-only and EBS-only):

Model        NVIDIA Tesla V100 GPUs   GPU Memory   NVIDIA NVLink   vCPUs   Main Memory   Network Bandwidth   EBS Bandwidth
p3.2xlarge   1                        16 GiB       n/a             8       61 GiB        Up to 10 Gbps       1.5 Gbps
p3.8xlarge   4                        64 GiB       200 GBps        32      244 GiB       10 Gbps             7 Gbps
p3.16xlarge  8                        128 GiB      300 GBps        64      488 GiB       25 Gbps             14 Gbps

Each of the NVIDIA GPUs is packed with 5,120 CUDA cores and another 640 Tensor cores and can deliver up to 125 TFLOPS of mixed-precision floating point, 15.7 TFLOPS of single-precision floating point, and 7.8 TFLOPS of double-precision floating point. On the two larger sizes, the GPUs are connected together via NVIDIA NVLink 2.0 running at a total data rate of up to 300 GBps. This allows the GPUs to exchange intermediate results and other data at high speed, without having to move it through the CPU or the PCI-Express fabric.

https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/

 

So anyway, it still comes back to the fundamental issue of the Titan V lacking NVLink: what exactly does this impact, and in what way? Because looking at Nvidia's own data and marketing, very little.


4 hours ago, leadeater said:

It's not a requirement, and the performance difference is very minimal; did you not actually go and check, or read any of the linked material at all? Also, it's fully able to be utilized over the PCIe bus if the application is using CUDA 8+.

 

There are no ARM or SPARC CPU-to-GPU NVLink products that I am aware of; the only commercially available CPU-to-GPU NVLink is from IBM. Nvidia's own DGX-1, a purpose-built 8-Tesla-GPU machine learning appliance, uses standard x86 Intel CPUs with NVLink only between the GPUs. The use cases that require NVLink between the CPU and GPU are very few, and Nvidia will tell you this themselves.

 

AMD will not be implementing NVLink, ever; they, and everyone else other than Nvidia and Intel, are pushing Gen-Z. NVLink wins on raw bandwidth now and into the short-term future, but you have to need it to want it, whereas Gen-Z is far more flexible and useful in a multi-node cluster. Don't be surprised to see NVLink and Gen-Z in use at the same time in the same cluster.

 

As far as AWS goes, they use Intel CPUs for their Tesla V100 instances.

 

https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/

 

So anyway, it still comes back to the fundamental issue of the Titan V lacking NVLink: what exactly does this impact, and in what way? Because looking at Nvidia's own data and marketing, very little.

It's minimal for underdeveloped programs.

 

You're also not very informed. http://www.techenablement.com/the-missing-link-in-nvlink-or-hello-pascal-bye-bye-pci-bus-limitations/

 

https://www.theregister.co.uk/2016/10/15/google_power9_intel_arm_cloud/

 

https://www.google.com.au/amp/s/www.nextplatform.com/2017/08/23/arm-servers-qualcomm-now-contender/amp/

 

All of the major architectures other than MIPS64 have native NVLink support from certain vendors.

 

Sorry, but Google is forcing AMD to support NVLink. It doesn't matter what AMD wishes.

 

Nvidia's information is incomplete, just as yours and mine are.


3 hours ago, Bit_Guardian said:

It's minimal for underdeveloped programs.

 

You're also not very informed. http://www.techenablement.com/the-missing-link-in-nvlink-or-hello-pascal-bye-bye-pci-bus-limitations/

 

https://www.theregister.co.uk/2016/10/15/google_power9_intel_arm_cloud/

 

https://www.google.com.au/amp/s/www.nextplatform.com/2017/08/23/arm-servers-qualcomm-now-contender/amp/

 

All of the major architectures other than MIPS64 have native NVLink support from certain vendors.

 

Sorry, but Google is forcing AMD to support NVLink. It doesn't matter what AMD wishes.

 

Nvidia's information is incomplete, just as yours and mine are.

You do know NVLink has two implementations, right? CPU-to-GPU and GPU-to-GPU; the second one already works alongside AMD, or any CPU for that matter, as it's CPU-agnostic. I don't think you actually understand the difference, because you didn't get what I was talking about. The independent source I linked has done the testing of all three possible hardware combinations: IBM POWER with NVLink CPU-GPU and GPU-GPU, NVLink GPU-GPU only, and PCIe GPU-GPU.

https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/ (three links at the top of the page, one for each configuration).

 

Nvidia has done a lot of testing and optimization of NVLink and shows it in its best possible light; that's what marketing is about. NVLink helps only when you need the data bandwidth, otherwise it does nothing. The biggest benefit you can actually get from it is overall application efficiency when you do a large amount of data movement from main system memory: while NVLink doesn't increase actual compute performance by much, what it does do very well is reduce data set load times, leading to quicker overall completion times.

 

Quote

Due to the new high-speed NVLink connection, there is only one server on the market with both Host-to-Device and Device-to-Device NVLink connectivity. This system, leveraging IBM’s POWER8 CPUs and innovation from the OpenPOWER foundation (including NVIDIA and Mellanox), began shipments in fall 2016.

You can now include POWER9 for host-to-device support; if you had read your own source a bit more carefully, you would have noted that Google is using IBM POWER9.

 

Quote

On Friday this week, Google stepped up from simply toying with gear from Intel's competitors to signaling it seriously hopes to deploy public cloud services powered by IBM Power chips.

 

"We look forward to a future of heterogeneous architectures within our cloud," said John Zipfel, Google Cloud's technical program manager.

 

This comes as Google shared draft blueprints for the Zaius P9 server: an OpenCAPI and Open-Compute-friendly box with an IBM Power9 scale-out microprocessor at its heart. That's the same Power9 the US government is using in its upcoming monster supercomputers. Google has worked with Rackspace, IBM and Ingrasys on the Zaius designs – you may recall that the Zaius concept was unveiled in April.

 

Back to the bandwidth topic about NVLink and why it doesn't significantly increase performance: you need to look at the details. Per-GPU device bandwidth is much lower than you might think, not much higher than PCIe if every GPU has x16. AMD actually solved the same issue NVLink did, but in a different way: more PCIe lanes from the CPU.

 

Quote

On systems with x86 CPUs (such as Intel Xeon), the connectivity to the GPU is only through PCI-Express (although the GPUs connect to each other through NVLink). On systems with POWER8 CPUs, the connectivity to the GPU is through NVLink (in addition to the NVLink between GPUs).

Nevertheless, the performance characteristics of the GPU itself (GPU cores, GPU memory, etc) do not vary. The Tesla P100 GPU itself will be performing at the same level. It’s the data flow and total system throughput that will determine the final performance for your workload.

 

Quote

It’s important to understand that the test below was run on a system with four Tesla GPUs. On each GPU, the available 80GB/s bandwidth was divided in half. One link goes to a POWER8 CPU and one link goes to the adjacent P100 GPU (see diagram below).

 

The NVLink connections on Tesla P100 GPUs provide a theoretical peak throughput of 80GB/s (160GB/s bi-directional). However, those links are made up of several bricks, which can be split up to connect to a number of other devices. For example, one GPU might dedicate 40GB/s for a link to a CPU and 40GB/s for a link to a nearby GPU.

Bandwidth is split into sub-links: the P100 NVLink model has 4 sub-links of 20GB/s in each direction (80GB/s; 80 + 80 = 160GB/s bidirectional), and the V100 NVLink model has 6 sub-links of 25GB/s in each direction (150GB/s; 150 + 150 = 300GB/s bidirectional). It's also worth noting that the PCIe P100 is equivalent to a single sub-link of 16GB/s, meaning 32GB/s in both directions: an 8GB/s difference in a two-GPU configuration compared to NVLink 1.0, and 18GB/s for NVLink 2.0.
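The sub-link arithmetic above can be sketched in a few lines (using the per-sub-link rates quoted in this thread):

```python
# NVLink per-direction bandwidth = sub-links x rate per sub-link (GB/s);
# the bidirectional total simply doubles it. Figures are those quoted above.
def nvlink_bandwidth(sublinks, gb_per_sublink):
    per_direction = sublinks * gb_per_sublink
    return per_direction, 2 * per_direction

print(nvlink_bandwidth(4, 20))  # NVLink 1.0, Tesla P100: (80, 160)
print(nvlink_bandwidth(6, 25))  # NVLink 2.0, Tesla V100: (150, 300)
print(nvlink_bandwidth(1, 16))  # PCIe 3.0 x16 as one 16GB/s "sub-link": (16, 32)
```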

 

The sub-links make things a bit more complicated than PCIe, and in an 8-GPU system you can't have all the GPUs talking directly to each other. That is why the Nvidia DGX-1 links 4 GPUs in their own mesh group, then links each of those to a single GPU in the other group; each GPU is therefore directly connected to 4 GPUs, with the remaining 3 only a single hop away. This means that if 1 GPU needs to do a lot of communication with the 3 GPUs in the other group that are one hop away, you only have 25GB/s of bandwidth; a PCIe system wouldn't suffer from that limitation.
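Those hop counts can be checked by modelling the topology as a small graph. This is a sketch of the layout as described in this post (two fully meshed groups of four, plus one cross-link per GPU), not an official topology file:

```python
from collections import deque

# GPUs 0-3 form one fully connected group, 4-7 the other;
# GPU i is additionally linked to GPU i+4 across the groups.
links = {g: set() for g in range(8)}
for group in ((0, 1, 2, 3), (4, 5, 6, 7)):
    for a in group:
        for b in group:
            if a != b:
                links[a].add(b)
for i in range(4):
    links[i].add(i + 4)
    links[i + 4].add(i)

def hops(src, dst):
    # Breadth-first search for the minimum number of NVLink links traversed.
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

print(len(links[0]))  # 4: each GPU has 4 direct NVLink peers
print(hops(0, 4))     # 1: direct cross-group link
print(hops(0, 5))     # 2: opposite group, no direct link, one intermediate GPU
```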

 

Out of interest, let's present the AMD EPYC PCIe bandwidth in a similar way: 985 x 128 = 126,080MB/s per direction, and 126,080 + 126,080 = 252,160MB/s bidirectional, which is only ~50GB/s less than NVLink 2.0. You're not actually going to get all of that bandwidth to GPUs, since other system devices require PCIe lanes, but in a system with 8 GPUs using PCIe you can fully mesh-connect the GPUs without a bandwidth penalty for doing it. PCIe 4.0 isn't that far off either; things are not looking all that good for NVLink.
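The EPYC figure is straightforward to reproduce (the ~985MB/s per-lane rate and 128 lanes are the numbers used in this post):

```python
# Aggregate PCIe 3.0 bandwidth on AMD EPYC: ~985 MB/s usable per lane
# per direction, across 128 lanes.
lane_mb_per_s = 985
lanes = 128

per_direction = lane_mb_per_s * lanes        # MB/s, one direction
bidirectional_gb = 2 * per_direction / 1000  # GB/s, both directions

print(per_direction)                     # 126080
print(bidirectional_gb)                  # 252.16
print(round(300 - bidirectional_gb, 2))  # 47.84, i.e. ~50GB/s short of NVLink 2.0
```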

 

So if we circle back to the actual thread topic: can you explain why the Titan V lacking NVLink is such an issue, when Nvidia's own data shows that NVLink GPU-to-GPU gives only a small overall performance increase?

 

Edit:

I'm a long-time AMD GPU user and I'll quite happily jump on an Nvidia-bashing bandwagon so long as it's warranted; this isn't one of those occasions.

