
Benchmark - 10x Tesla P100 GPU-NVTP100-16

Hey LinusTech team,

My son is a huge fan of yours :) Though he's too young to be left unsupervised on an internet forum :P 

I find your videos amusing. 

I especially enjoyed your 8 gamers 1 CPU video and thought I might return the favor.

 

Background: I work at an AI company, and we took delivery of some new toys today.

The servers are the updated Supermicro SYS-4028GR-TRT2: dual socket, with all 10 GPUs on the same PCIe root complex. PLX switches, of course, though I can't confirm the model number.

Contrary to popular belief, PCIe switches don't add much latency, though this beauty does pay a penalty of about 1 µs per switch hop.

Having all GPUs on the same root complex is nice for many reasons. I refer the reader to a nice article from Cirrascale -> http://www.cirrascale.com/blog/index.php/exploring-the-pcie-bus-routes/

So, in effect, by cutting the InfiniBand fabric out of the computation path, we see nice strong scaling as GPU counts increase. Hugely beneficial for TensorFlow et al.

 

1) The output of nvidia-smi is the most awesome thing I have seen in Konsole in the past 2 years :) J.K.

 

cluster|17:41:59: nvidia-smi

0-nvidia-smi.png

2) The PCIe topology of the GPUs can easily be queried like so:

cluster|17:45:40: nvidia-smi topo --matrix

 1-topo.png
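For anyone who wants the same information programmatically: here's a minimal sketch (my own illustration, not code from this server) that prints each GPU's PCI bus ID plus which pairs report peer-to-peer capability, using only standard CUDA runtime calls. It shows roughly the same picture as nvidia-smi topo --matrix.

#include <cstdio>
#include <cuda_runtime.h>

// Print each GPU's PCI bus ID and a peer-access matrix.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        char busId[32];
        cudaDeviceGetPCIBusId(busId, (int)sizeof busId, i);
        printf("GPU%d  %s  peers:", i, busId);
        for (int j = 0; j < n; ++j) {
            int ok = 0;
            if (i != j) cudaDeviceCanAccessPeer(&ok, i, j);
            printf(" %d", ok);
        }
        printf("\n");
    }
    return 0;
}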


3) You can see the NVLink field present in the image above, though the PCIe version doesn't have it.

The specs I Ctrl-C/Ctrl-V'd from the invoice:

10x PCIe x16 NVIDIA® Tesla P100 GPU-NVTP100-16

16 GB CoWoS HBM2, PCIe 3.0, passive cooling

Brand: NVIDIA / Product Name: Tesla P100

Part Number: GPU-NVTP100-16

Double-Precision Performance: 4.7 TeraFLOPS

Single-Precision Performance: 9.3 TeraFLOPS

Half-Precision Performance: 18.7 TeraFLOPS

PCIe x16 Interconnect Bandwidth: 32 GB/s

CoWoS HBM2 Stacked Memory Capacity: 16 GB

CoWoS HBM2 Stacked Memory Bandwidth: 720 GB/s

Thermal: Passive

 

Our other systems on order are these -> http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/ 

These have the nice NVLink feature enabled on the CPU side (IBM POWER8), which allows us to page-fault CPU RAM on demand. Useful when your convnet has 10^10 parameters :)

In fact, for anyone observing closely, I think IBM's OpenPOWER will start to chip away at Intel's x86 dominance in the enterprise.
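For the CUDA devs: a minimal sketch of what that on-demand page faulting looks like from the programming side. This is just my illustration of Pascal-era unified memory oversubscription with cudaMallocManaged (the 20 GiB size is an arbitrary number I picked to exceed one P100's 16 GB), not our production code.

#include <cstdio>
#include <cuda_runtime.h>

// Touch every element; pages migrate to the GPU on first fault.
__global__ void touch(float *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    size_t n = (20ULL << 30) / sizeof(float);   // ~20 GiB of floats, more than one P100 holds
    float *p = nullptr;
    if (cudaMallocManaged(&p, n * sizeof(float)) != cudaSuccess) return 1;
    size_t threads = 256;
    touch<<<(unsigned)((n + threads - 1) / threads), (unsigned)threads>>>(p, n);
    cudaDeviceSynchronize();                    // page faults are serviced while the kernel runs
    cudaFree(p);
    return 0;
}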

 

4) The rest of the system specs are not as special as the topology and the Pascals.

2x Xeon E5-2643 v4 (you really need high-clocked CPUs, with non-AVX turbo up to 3.7 GHz, to really utilize these GPUs; high core count at low clocks is a strict no-no)

1024 GB RAM, Intel DC S3100 SSDs in RAID 0 for local caching, and Mellanox ConnectX 4x aggregated EDR InfiniBand HCAs.

 

5) Output of deviceQuery (only the CUDA devs among you will be impressed by this :)):

 

3-deviceQuery.png
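If you want to reproduce the gist of it without the full CUDA sample, a few lines of the runtime API get you the headline fields. A minimal sketch of my own, not the actual deviceQuery source:

#include <cstdio>
#include <cuda_runtime.h>

// Enumerate every GPU and print the fields people usually look for.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d: %s, compute capability %d.%d, %zu MiB, %d SMs\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem >> 20, prop.multiProcessorCount);
    }
    return 0;
}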

 

6) Bandwidth Tests:

In the third test you see the HBM2 technology shine (this is not a benchmark), though it can still go much higher.

4-band1.png
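For reference, the core of what bandwidthTest does is nothing magical. The sketch below (my own simplification, with an arbitrary 256 MiB transfer size) times pinned host-to-device copies with CUDA events; the same pattern with a device-to-device copy is what makes the HBM2 number shine.

#include <cstdio>
#include <cuda_runtime.h>

// Time repeated pinned H2D copies and report effective bandwidth.
int main() {
    const size_t bytes = 256ULL << 20;   // 256 MiB per copy
    const int reps = 10;
    void *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);           // pinned memory is needed for peak PCIe rates
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("H2D: %.1f GB/s\n", reps * bytes / (ms * 1e6));
    return 0;
}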

 

7) P2P bandwidth & latency [the inter-GPU communication stuff]

This roughly confirms that the added premium the company pays for having all slots on the same root complex is actually there. You can see the ~1 µs additional hop latency I talked about earlier.

 

5-band2.png
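The underlying mechanism, for the curious: once two GPUs sit on the same root complex, the driver lets you enable peer access and copy directly between their memories, with no bounce through host RAM. A minimal sketch of the idea (my own, with arbitrary 64 MiB buffers; the real p2pBandwidthLatencyTest does this for every pair):

#include <cstdio>
#include <cuda_runtime.h>

// Enable P2P between GPU 0 and GPU 1 and time one direct copy.
int main() {
    const size_t bytes = 64ULL << 20;
    int ok = 0;
    cudaDeviceCanAccessPeer(&ok, 0, 1);
    if (!ok) { printf("no P2P between GPU0 and GPU1\n"); return 1; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // second argument (flags) must be 0
    void *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes);  // GPU0 -> GPU1, no host bounce
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("GPU0 -> GPU1: %.1f GB/s\n", bytes / (ms * 1e6));
    return 0;
}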

 

 

8) And finally, a benchmark

I'll give just one. Since this isn't a "gaming" GPU (though you still can game on it), there are no game benchmarks; I assure you all it can run Crysis ;) CUDA & NN benchmarks won't be of interest to this audience.

So I give you the LuxMark v3.1 bench, an OpenCL ray tracer. It's a shame there isn't a CUDA version of it, since it's common knowledge that Nvidia handicaps OpenCL on their chips.

The first screenshot is from my personal GPU, an EVGA 980 Ti, and the second is of the server. Enjoy & drool :) (I can't game on it either, haha)

 

6-980ti.png

 

7-server.png

 



A factor of 20 is very good :) I wish I sometimes had such a machine for rendering. May I ask if you could run a crazy render test with Blender?

GUITAR BUILD LOG FROM SCRATCH OUT OF APPLEWOOD

 

- Ryzen Build -

R5 3600 | MSI X470 Gaming Plus MAX | 16GB CL16 3200MHz Corsair LPX | Dark Rock 4

MSI 2060 Super Gaming X

1TB Intel 660p | 250GB Kingston A2000 | 1TB Seagate Barracuda | 2TB WD Blue

be quiet! Silent Base 601 | be quiet! Straight Power 550W CM

2x Dell UP2516D

 

- First System (Retired) -

Intel Xeon 1231v3 | 16GB Crucial Ballistix Sport Dual Channel | Gigabyte H97 D3H | Gigabyte GTX 970 Gaming G1 | 525 GB Crucial MX 300 | 1 TB + 2 TB Seagate HDD
be quiet! 500W Straight Power E10 CM | be quiet! Silent Base 800 with stock fans | be quiet! Dark Rock Advanced C1 | 2x Dell UP2516D

Reviews: be quiet! Silent Base 800 | MSI GTX 950 OC

 


Yes, that factor would not have been possible without the special PCIe layout.

Sure :) Link your scene files and let me know how to render them, though I'm not giving any guarantees.


I'm not sure if it's possible, but can you maybe post some pictures of the monster itself?

I have no idea what to expect from a (PC? server? what even is it?) with so much power.

If you want my attention, quote meh! D: or just stick an @samcool55 in your post :3

Spying on everyone to fight against terrorism is like shooting a mosquito with a cannon


Hot damn, that's some power. Can't imagine the heat that rack unit puts out :P


Can't wait to get 2 of these for SLI. :D 

CPU: Intel Core i7 7820X Cooling: Corsair Hydro Series H110i GTX Mobo: MSI X299 Gaming Pro Carbon AC RAM: Corsair Vengeance LPX DDR4 (3000MHz/16GB 2x8) SSD: 2x Samsung 850 Evo (250/250GB) + Samsung 850 Pro (512GB) GPU: NVidia GeForce GTX 1080 Ti FE (W/ EVGA Hybrid Kit) Case: Corsair Graphite Series 760T (Black) PSU: SeaSonic Platinum Series (860W) Monitor: Acer Predator XB241YU (165Hz / G-Sync) Fan Controller: NZXT Sentry Mix 2 Case Fans: Intake - 2x Noctua NF-A14 iPPC-3000 PWM / Radiator - 2x Noctua NF-A14 iPPC-3000 PWM / Rear Exhaust - 1x Noctua NF-F12 iPPC-3000 PWM


Mmm, dem GPUs.

 

Although I've always wondered: what's it like working at an AI company?

i5 4670k @ 4.2GHz (Coolermaster Hyper 212 Evo); ASrock Z87 EXTREME4; 8GB Kingston HyperX Beast DDR3 RAM @ 2133MHz; Asus DirectCU GTX 560; Super Flower Golden King 550 Platinum PSU;1TB Seagate Barracuda;Corsair 200r case. 


It can be fun, boring, mind-boggling, exhilarating & terrifying all in the same day.

This sums it up:

 

education-teaching-math-mathematics-math


Any chance you can run Afterburner (or a Linux equivalent) alongside the benchmark so we can see core & RAM clocks plus temps, please?

Main Rig:-

Ryzen 7 3800X | Asus ROG Strix X570-F Gaming | 16GB Team Group Dark Pro 3600Mhz | Corsair MP600 1TB PCIe Gen 4 | Sapphire 5700 XT Pulse | Corsair H115i Platinum | WD Black 1TB | WD Green 4TB | EVGA SuperNOVA G3 650W | Asus TUF GT501 | Samsung C27HG70 1440p 144hz HDR FreeSync 2 | Ubuntu 20.04.2 LTS |

 

Server:-

Intel NUC running Server 2019 + Synology DSM218+ with 2 x 4TB Toshiba NAS Ready HDDs (RAID0)


Hot damn, GP100 is a big GPU... I mean, 610 mm² sounds big and all, but you don't realise how big until you see the die and the HBM stacks...

AMD Ryzen R7 1700 (3.8ghz) w/ NH-D14, EVGA RTX 2080 XC (stock), 4*4GB DDR4 3000MT/s RAM, Gigabyte AB350-Gaming-3 MB, CX750M PSU, 1.5TB SDD + 7TB HDD, Phanteks enthoo pro case


haha

There's no Afterburner for Tesla on *nix. What you have on Linux for GeForce is "Coolbits", which you can set in X.Org config files; see the Coolbits docs.

For Tesla this is not available.

To monitor the GPUs, the IT dept creates log files with commands like this:

pts/7 0 : nvidia-smi --query-gpu=temperature.gpu --format=csv -i 0 -f t.txt --loop=1
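If you'd rather do it in code than shell out to nvidia-smi, the same number is available through NVML, the library nvidia-smi itself sits on. A minimal sketch, assuming the NVML header that ships with the driver/toolkit:

#include <cstdio>
#include <nvml.h>   // link with -lnvidia-ml

// Read GPU 0's core temperature once; loop and log as needed.
int main() {
    nvmlDevice_t dev;
    unsigned int temp = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
    printf("GPU0 temperature: %u C\n", temp);
    nvmlShutdown();
    return 0;
}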

- The P100s are all passively cooled; they rely on very high static pressure fans in a very cold room. Since we got them today, there was no time to put them in a proper HVAC setup.

- Without proper cooling, we saw constant thermal throttling today as the GPUs hit 80 °C. Running the GPU equivalents of LINPACK, we were pretty happy with what we saw. GPU Boost 3 is super weird: it caps TDP first and then lowers the clock, but this is something we still need to understand. Again, they are not meant to be run this way; they usually sit in a room at 4 °C :)


Well, I'm actually pretty happy: I can reach a tenth of this machine's LuxMark performance, lol. (By the way, your job is awesome; I'd love to be within range of such a beast.)

#subtle brag

luxmark.JPG

 

 

AMD Ryzen R7 1700 (3.8ghz) w/ NH-D14, EVGA RTX 2080 XC (stock), 4*4GB DDR4 3000MT/s RAM, Gigabyte AB350-Gaming-3 MB, CX750M PSU, 1.5TB SDD + 7TB HDD, Phanteks enthoo pro case


4 hours ago, yd1248 said:

haha

There's no Afterburner for Tesla on *nix. What you have on Linux for GeForce is "Coolbits", which you can set in X.Org config files; see the Coolbits docs.

For Tesla this is not available.

To monitor the GPUs, the IT dept creates log files with commands like this:

pts/7 0 : nvidia-smi --query-gpu=temperature.gpu --format=csv -i 0 -f t.txt --loop=1

- The P100s are all passively cooled; they rely on very high static pressure fans in a very cold room. Since we got them today, there was no time to put them in a proper HVAC setup.

- Without proper cooling, we saw constant thermal throttling today as the GPUs hit 80 °C. Running the GPU equivalents of LINPACK, we were pretty happy with what we saw. GPU Boost 3 is super weird: it caps TDP first and then lowers the clock, but this is something we still need to understand. Again, they are not meant to be run this way; they usually sit in a room at 4 °C :)

Strange how Google can run some of its data centers at 90 °F ambient and still keep everything cool using just air cooling.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


Quote

Strange how Google can run some of its data centers at 90 °F ambient and still keep everything cool using just air cooling.

Then I wonder why it doesn't run ALL its data centers at 90 °F ambient.


I don't know if you can call this a benchmark, but there you go.

PS: Very proud of the LinusTech folders :)

Screenshot from 2016-11-11 13:49:26.png


6.0 and 6.1 (e.g., Nvidia Titan X, 1080, 1070): Experimental Support
Requires Octane Version 3.03.2 or higher

That's what it says on the OTOY FAQ page - https://home.otoy.com/render/octane-render/faqs/

I can't find that 3.03.2 version; the link you posted was 2.7, and I checked that it doesn't support Pascal.

PM me a download link if you can find it.


@yd1248 You could download the 3.04 demo from here: https://render.otoy.com/downloads/47/0a/93/0e/OctaneRender_demo_3_04_linux.zip

then copy the "benchmark_data" folder, and optionally the script "run_benchmark_linux.sh", from the "OctaneBench_2_17_linux" folder into the "OctaneRender_demo_3_04_linux" folder.

 

2i7tj44.jpg

 

Note that the GTX 1050 Ti is capped to 1911 MHz by the vendor and/or Nvidia.

 

Something I noticed on the GTX 1080 is that CPU C-state latency can be detrimental to scores. The default GPU memory clock on cards with P-state P2 is reduced below nominal frequency, and P2 is invoked when doing GPU compute such as CUDA.
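Easy to check on your own card, for what it's worth: NVML reports the current performance state and memory clock, so you can watch P2 kick in while a CUDA job runs. A minimal sketch, assuming the NVML headers are installed:

#include <cstdio>
#include <nvml.h>   // link with -lnvidia-ml

// Print GPU 0's current P-state and memory clock; run it while a
// CUDA workload is active to see whether P2 lowers the memory clock.
int main() {
    nvmlDevice_t dev;
    nvmlPstates_t pstate;
    unsigned int memClock = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetPerformanceState(dev, &pstate);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memClock);
    printf("P-state: P%d, memory clock: %u MHz\n", (int)pstate, memClock);
    nvmlShutdown();
    return 0;
}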

 

Unfortunately, uploads of benchmark scores on 3.0+ are not allowed at this time, so you'll have to settle for a screenshot.

Have fun :)

AWOL


SUPER interesting read, thanks for the look at that hardware!

Case: Meatbag, humanoid - APU: Human Brain version 1.53 (stock clock) - Storage: 100TB SND (Squishy Neuron Drive) - PSU: a combined 500W of Mitochondrial cells - Optical Drives: 2 Oculi, with corrective lenses.


@X_X

It still says "no supported GPU" after following your method.

Btw, 

Quote

Something I noticed on the GTX 1080 is that CPU C-state latency can be detrimental to scores. The default GPU memory clock on cards with P-state P2 is reduced below nominal frequency, and P2 is invoked when doing GPU compute such as CUDA.

Wow!

Not to be sarcastic, but the only people I've ever heard complain about the performance impact of CPU C-states on anything are authors of RTOSes for communication satellites and the guys at NASA JPL in Pasadena :) So you should verify your sources; I'm very skeptical of that claim.

As for GPU P-states, Tesla defaults to P0, and even with P2 I have never, in my experience, seen the clock drop below the advertised rate unless thermally throttled.


5 hours ago, yd1248 said:

@X_X

Wow!

Not to be sarcastic, but the only people I've ever heard complain about the performance impact of CPU C-states on anything are authors of RTOSes for communication satellites and the guys at NASA JPL in Pasadena :)

Lol, we're all free to believe whatever we like :) I am the source. Never heard of people showing SSD 4K random R/W being better with C-states disabled? See Intel Dynamic Storage Accelerator. You don't think cores in C6, with exit latencies in the tens of microseconds, can have an effect on workloads where CPU usage comes and goes? A bit of work, then back to sleep.

 

Some other tests here

 

Shame it didn't detect the cards; it sounds like the claim of Pascal support is perhaps a little wide of the mark. Thanks for giving it a go.

AWOL

