
Benchmark - 10x Tesla P100 GPU-NVTP100-16

Hey LinusTech team,

My son is a huge fan of yours :) Though he's too young to be left unsupervised on an internet forum :P 

I find your videos amusing. 

I especially enjoyed your 8 gamers 1 CPU video and thought I might return the favor.

 

Background: I work at an AI company, and we took delivery of some new toys today.

The servers are the updated Supermicro SYS-4028GR-TRT2: dual socket, with all 10 GPUs on the same PCIe root complex. PLX switches, of course, though I can't confirm the model number.

Contrary to popular belief, PCIe switches don't add much latency, though this beauty does pay a penalty of about 1 µs per switch hop.

Having all GPUs on the same root complex is nice for many reasons. I refer the reader to a nice article from Cirrascale -> http://www.cirrascale.com/blog/index.php/exploring-the-pcie-bus-routes/

So, in effect, by cutting the InfiniBand fabric out of the computation path, we see nice strong scaling as GPU counts increase. Hugely beneficial for TensorFlow et al.

 

1) The output of nvidia-smi is the most awesome thing I have seen in Konsole in the past 2 years :) J.K.

 

cluster|17:41:59: nvidia-smi

0-nvidia-smi.png

2) The PCIe topology of the GPUs can easily be queried like so:

cluster|17:45:40: nvidia-smi topo --matrix

 1-topo.png
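For anyone who wants the same information programmatically: here's a minimal sketch (my own illustration, not code from this server) that prints each GPU's PCI bus ID plus which pairs report peer-to-peer capability, using only standard CUDA runtime calls. It shows roughly the same picture as nvidia-smi topo --matrix.

#include <cstdio>
#include <cuda_runtime.h>

// Print each GPU's PCI bus ID and a peer-access matrix.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        char busId[32];
        cudaDeviceGetPCIBusId(busId, (int)sizeof busId, i);
        printf("GPU%d  %s  peers:", i, busId);
        for (int j = 0; j < n; ++j) {
            int ok = 0;
            if (i != j) cudaDeviceCanAccessPeer(&ok, i, j);
            printf(" %d", ok);
        }
        printf("\n");
    }
    return 0;
}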


3) You can see the NVLink field present in the image above, though the PCIe version doesn't have it.

The specs I Ctrl-C/Ctrl-V'd from the invoice:

10x PCIe x16 NVIDIA® Tesla P100 GPU-NVTP100-16

16 GB CoWoS HBM2, PCIe 3.0, passive cooling

Brand: NVIDIA / Product Name: Tesla P100

Part Number: GPU-NVTP100-16

Double-Precision Performance: 4.7 TeraFLOPS

Single-Precision Performance: 9.3 TeraFLOPS

Half-Precision Performance: 18.7 TeraFLOPS

PCIe x16 Interconnect Bandwidth: 32 GB/s

CoWoS HBM2 Stacked Memory Capacity: 16 GB

CoWoS HBM2 Stacked Memory Bandwidth: 720 GB/s

Thermal: Passive

 

Our other systems on order are these -> http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/ 

These have the nice NVLink feature enabled on the CPU side (IBM POWER8), which allows us to page-fault CPU RAM on demand. Useful when your convnet has 10^10 parameters :)

In fact, for anyone observing closely, I think IBM's OpenPOWER will start to chip away at Intel's x86 dominance in the enterprise.
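For the CUDA devs: a minimal sketch of what that on-demand page faulting looks like from the programming side. This is just my illustration of Pascal-era unified memory oversubscription with cudaMallocManaged (the 20 GiB size is an arbitrary number I picked to exceed one P100's 16 GB), not our production code.

#include <cstdio>
#include <cuda_runtime.h>

// Touch every element; pages migrate to the GPU on first fault.
__global__ void touch(float *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    size_t n = (20ULL << 30) / sizeof(float);   // ~20 GiB of floats, more than one P100 holds
    float *p = nullptr;
    if (cudaMallocManaged(&p, n * sizeof(float)) != cudaSuccess) return 1;
    size_t threads = 256;
    touch<<<(unsigned)((n + threads - 1) / threads), (unsigned)threads>>>(p, n);
    cudaDeviceSynchronize();                    // page faults are serviced while the kernel runs
    cudaFree(p);
    return 0;
}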

 

4) The rest of the system specs are not as special as the topology and the Pascals.

2x Xeon E5-2643 v4 (you really need high-clocked CPUs, with non-AVX turbo up to 3.7 GHz, to really utilize these GPUs; high core count at low clocks is a strict no-no)

1024 GB RAM, Intel DC S3100 SSDs in RAID 0 for local caching, and Mellanox ConnectX 4x aggregated EDR InfiniBand HCAs.

 

5) Output of deviceQuery (only the CUDA devs among you will be impressed by this :)):

 

3-deviceQuery.png
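If you want to reproduce the gist of it without the full CUDA sample, a few lines of the runtime API get you the headline fields. A minimal sketch of my own, not the actual deviceQuery source:

#include <cstdio>
#include <cuda_runtime.h>

// Enumerate every GPU and print the fields people usually look for.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d: %s, compute capability %d.%d, %zu MiB, %d SMs\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem >> 20, prop.multiProcessorCount);
    }
    return 0;
}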

 

6) Bandwidth Tests:

In the third test you see the HBM2 technology shine (this is not a benchmark), though it can still go much higher.

4-band1.png
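For reference, the core of what bandwidthTest does is nothing magical. The sketch below (my own simplification, with an arbitrary 256 MiB transfer size) times pinned host-to-device copies with CUDA events; the same pattern with a device-to-device copy is what makes the HBM2 number shine.

#include <cstdio>
#include <cuda_runtime.h>

// Time repeated pinned H2D copies and report effective bandwidth.
int main() {
    const size_t bytes = 256ULL << 20;   // 256 MiB per copy
    const int reps = 10;
    void *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);           // pinned memory is needed for peak PCIe rates
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("H2D: %.1f GB/s\n", reps * bytes / (ms * 1e6));
    return 0;
}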

 

7) P2P bandwidth & latency [the inter-GPU communication stuff]

This roughly confirms that the added premium the company pays for having all slots on the same root complex is actually there. You can see the ~1 µs additional hop latency I talked about earlier.

 

5-band2.png
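The underlying mechanism, for the curious: once two GPUs sit on the same root complex, the driver lets you enable peer access and copy directly between their memories, with no bounce through host RAM. A minimal sketch of the idea (my own, with arbitrary 64 MiB buffers; the real p2pBandwidthLatencyTest does this for every pair):

#include <cstdio>
#include <cuda_runtime.h>

// Enable P2P between GPU 0 and GPU 1 and time one direct copy.
int main() {
    const size_t bytes = 64ULL << 20;
    int ok = 0;
    cudaDeviceCanAccessPeer(&ok, 0, 1);
    if (!ok) { printf("no P2P between GPU0 and GPU1\n"); return 1; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // second argument (flags) must be 0
    void *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes);  // GPU0 -> GPU1, no host bounce
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("GPU0 -> GPU1: %.1f GB/s\n", bytes / (ms * 1e6));
    return 0;
}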

 

 

8) And finally, a benchmark

I'll give just one. Since this isn't a "gaming" GPU (though you still can game on it), there are no game benchmarks; I assure you all it can run Crysis ;) CUDA & NN benchmarks won't be of interest to this audience.

So I give you the LuxMark v3.1 bench, an OpenCL ray tracer. It's a shame there isn't a CUDA version of it, since it's common knowledge that Nvidia handicaps OpenCL on their chips.

The first screenshot is from my personal GPU, an EVGA 980 Ti, and the second is of the server. Enjoy & drool :) (I can't game on it either, haha)

 

6-980ti.png

 

7-server.png

 



A factor of 20 is very good :) I wish I sometimes had such a machine for rendering. May I ask if you could run a crazy render test with Blender?

GUITAR BUILD LOG FROM SCRATCH OUT OF APPLEWOOD

 

- Ryzen Build -

R5 3600 | MSI X470 Gaming Plus MAX | 16GB CL16 3200MHz Corsair LPX | Dark Rock 4

MSI 2060 Super Gaming X

1TB Intel 660p | 250GB Kingston A2000 | 1TB Seagate Barracuda | 2TB WD Blue

be quiet! Silent Base 601 | be quiet! Straight Power 550W CM

2x Dell UP2516D

 

- First System (Retired) -

Intel Xeon 1231v3 | 16GB Crucial Ballistix Sport Dual Channel | Gigabyte H97 D3H | Gigabyte GTX 970 Gaming G1 | 525 GB Crucial MX 300 | 1 TB + 2 TB Seagate HDD
be quiet! 500W Straight Power E10 CM | be quiet! Silent Base 800 with stock fans | be quiet! Dark Rock Advanced C1 | 2x Dell UP2516D

Reviews: be quiet! Silent Base 800 | MSI GTX 950 OC

 


Yes, that factor would not have been possible without the special PCIe layout.

Sure :) Link your scene files and let me know how to render them, though I'm not giving any guarantees.


I'm not sure if it's possible, but can you maybe post some pictures of the monster itself?

I have no idea what to expect from a (PC? server? what even is it?) with so much power.

If you want my attention, quote meh! D: or just stick an @samcool55 in your post :3

Spying on everyone to fight against terrorism is like shooting a mosquito with a cannon


Hot damn, that's some power. Can't imagine the heat that rack unit puts out :P


Can't wait to get 2 of these for SLI. :D 

CPU: Intel Core i7 7820X Cooling: Corsair Hydro Series H110i GTX Mobo: MSI X299 Gaming Pro Carbon AC RAM: Corsair Vengeance LPX DDR4 (3000MHz/16GB 2x8) SSD: 2x Samsung 850 Evo (250/250GB) + Samsung 850 Pro (512GB) GPU: NVidia GeForce GTX 1080 Ti FE (W/ EVGA Hybrid Kit) Case: Corsair Graphite Series 760T (Black) PSU: SeaSonic Platinum Series (860W) Monitor: Acer Predator XB241YU (165Hz / G-Sync) Fan Controller: NZXT Sentry Mix 2 Case Fans: Intake - 2x Noctua NF-A14 iPPC-3000 PWM / Radiator - 2x Noctua NF-A14 iPPC-3000 PWM / Rear Exhaust - 1x Noctua NF-F12 iPPC-3000 PWM


Mmm, dem GPUs.

 

Although I've always wondered: what's it like working at an AI company?

i5 4670k @ 4.2GHz (Coolermaster Hyper 212 Evo); ASrock Z87 EXTREME4; 8GB Kingston HyperX Beast DDR3 RAM @ 2133MHz; Asus DirectCU GTX 560; Super Flower Golden King 550 Platinum PSU;1TB Seagate Barracuda;Corsair 200r case. 


It can be fun, boring, mind-boggling, exhilarating & terrifying all in the same day.

This sums it up:

 

education-teaching-math-mathematics-math


Any chance you can run Afterburner (or a Linux equivalent) alongside the benchmark so we can see core & RAM clocks plus temps, please?

Main Rig:-

Ryzen 7 3800X | Asus ROG Strix X570-F Gaming | 16GB Team Group Dark Pro 3600Mhz | Corsair MP600 1TB PCIe Gen 4 | Sapphire 5700 XT Pulse | Corsair H115i Platinum | WD Black 1TB | WD Green 4TB | EVGA SuperNOVA G3 650W | Asus TUF GT501 | Samsung C27HG70 1440p 144hz HDR FreeSync 2 | Ubuntu 20.04.2 LTS |

 

Server:-

Intel NUC running Server 2019 + Synology DSM218+ with 2 x 4TB Toshiba NAS Ready HDDs (RAID0)


Hot damn, GP100 is a big GPU... I mean, 610 mm² sounds big and all, but you don't realise how big until you see the die and the HBM stacks...

AMD Ryzen R7 1700 (3.8ghz) w/ NH-D14, EVGA RTX 2080 XC (stock), 4*4GB DDR4 3000MT/s RAM, Gigabyte AB350-Gaming-3 MB, CX750M PSU, 1.5TB SDD + 7TB HDD, Phanteks enthoo pro case


haha

There's no Afterburner for Tesla on *nix. What you have on Linux for GeForce is "Coolbits", which you can set in X.Org config files; see the Coolbits docs.

For Tesla this is not available.

To monitor the GPUs, the IT dept creates log files with commands like this:

pts/7 0 : nvidia-smi --query-gpu=temperature.gpu --format=csv -i 0 -f t.txt --loop=1
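If you'd rather do it in code than shell out to nvidia-smi, the same number is available through NVML, the library nvidia-smi itself sits on. A minimal sketch, assuming the NVML header that ships with the driver/toolkit:

#include <cstdio>
#include <nvml.h>   // link with -lnvidia-ml

// Read GPU 0's core temperature once; loop and log as needed.
int main() {
    nvmlDevice_t dev;
    unsigned int temp = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
    printf("GPU0 temperature: %u C\n", temp);
    nvmlShutdown();
    return 0;
}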

- The P100s are all passively cooled; they rely on very high static pressure fans in a very cold room. Since we got them today, there was no time to put them in a proper HVAC setup.

- Without proper cooling, we saw constant thermal throttling today as the GPUs hit 80 °C. Running the GPU equivalents of LINPACK, we were pretty happy with what we saw. GPU Boost 3 is super weird: it caps TDP first and then lowers the clock, but this is something we still need to understand. Again, they are not meant to be run this way; they usually sit in a room at 4 °C :)


Well, I'm actually pretty happy: I can reach a tenth of this machine's LuxMark performance, lol. (By the way, your job is awesome; I'd love to be within range of such a beast.)

#subtle brag

luxmark.JPG

 

 

AMD Ryzen R7 1700 (3.8ghz) w/ NH-D14, EVGA RTX 2080 XC (stock), 4*4GB DDR4 3000MT/s RAM, Gigabyte AB350-Gaming-3 MB, CX750M PSU, 1.5TB SDD + 7TB HDD, Phanteks enthoo pro case


4 hours ago, yd1248 said:

haha

There's no Afterburner for Tesla on *nix. What you have on Linux for GeForce is "Coolbits", which you can set in X.Org config files; see the Coolbits docs.

For Tesla this is not available.

To monitor the GPUs, the IT dept creates log files with commands like this:

pts/7 0 : nvidia-smi --query-gpu=temperature.gpu --format=csv -i 0 -f t.txt --loop=1

- The P100s are all passively cooled; they rely on very high static pressure fans in a very cold room. Since we got them today, there was no time to put them in a proper HVAC setup.

- Without proper cooling, we saw constant thermal throttling today as the GPUs hit 80 °C. Running the GPU equivalents of LINPACK, we were pretty happy with what we saw. GPU Boost 3 is super weird: it caps TDP first and then lowers the clock, but this is something we still need to understand. Again, they are not meant to be run this way; they usually sit in a room at 4 °C :)

Strange how Google can run some of its data centers at 90 °F ambient and still keep everything cool using just air cooling.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


Quote

Strange how Google can run some of its data centers at 90 °F ambient and still keep everything cool using just air cooling.

Then I wonder why it doesn't run ALL its data centers at 90 °F ambient.


I don't know if you can call this a benchmark, but there you go.

PS: Very proud of the LinusTech folders :)

Screenshot from 2016-11-11 13:49:26.png


6.0 and 6.1 (e.g., Nvidia Titan X, 1080, 1070): Experimental Support
Requires Octane Version 3.03.2 or higher

That's what it says on the OTOY FAQ page - https://home.otoy.com/render/octane-render/faqs/

I can't find that 3.03.2 version; the link you posted was 2.7, and I checked that it doesn't support Pascal.

PM me a download link if you can find it.


@yd1248 You could download the 3.04 demo from here: https://render.otoy.com/downloads/47/0a/93/0e/OctaneRender_demo_3_04_linux.zip

then copy the "benchmark_data" folder, and optionally the script "run_benchmark_linux.sh", from the "OctaneBench_2_17_linux" folder into the "OctaneRender_demo_3_04_linux" folder.

 

2i7tj44.jpg

 

Note that the GTX 1050 Ti is capped to 1911 MHz by the vendor and/or Nvidia.

 

Something I noticed on the GTX 1080 is that CPU C-state latency can be detrimental to scores. The default GPU memory clock on cards with P-state P2 is reduced below nominal frequency, and P2 is invoked when doing GPU compute such as CUDA.
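Easy to check on your own card, for what it's worth: NVML reports the current performance state and memory clock, so you can watch P2 kick in while a CUDA job runs. A minimal sketch, assuming the NVML headers are installed:

#include <cstdio>
#include <nvml.h>   // link with -lnvidia-ml

// Print GPU 0's current P-state and memory clock; run it while a
// CUDA workload is active to see whether P2 lowers the memory clock.
int main() {
    nvmlDevice_t dev;
    nvmlPstates_t pstate;
    unsigned int memClock = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetPerformanceState(dev, &pstate);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memClock);
    printf("P-state: P%d, memory clock: %u MHz\n", (int)pstate, memClock);
    nvmlShutdown();
    return 0;
}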

 

Unfortunately, uploads of benchmark scores on 3.0+ are not allowed at this time, so you'll have to settle for a screenshot.

Have fun :)

AWOL


SUPER interesting read, thanks for the look at that hardware!

Case: Meatbag, humanoid - APU: Human Brain version 1.53 (stock clock) - Storage: 100TB SND (Squishy Neuron Drive) - PSU: a combined 500W of Mitochondrial cells - Optical Drives: 2 Oculi, with corrective lenses.


@X_X

It still says "no supported GPU" after following your method.

Btw, 

Quote

Something I noticed on the GTX 1080 is that CPU C-state latency can be detrimental to scores. The default GPU memory clock on cards with P-state P2 is reduced below nominal frequency, and P2 is invoked when doing GPU compute such as CUDA.

Wow!

Not to be sarcastic, but the only people I've ever heard complain about the performance impact of CPU C-states on anything are authors of RTOSes for communication satellites and the guys at NASA JPL in Pasadena :) So you should verify your sources; I'm very skeptical of that claim.

As for GPU P-states, Tesla defaults to P0, and even with P2 I have never, in my experience, seen the clock drop below the advertised rate unless thermally throttled.


5 hours ago, yd1248 said:

@X_X

Wow!

Not to be sarcastic, but the only people I've ever heard complain about the performance impact of CPU C-states on anything are authors of RTOSes for communication satellites and the guys at NASA JPL in Pasadena :)

Lol, we're all free to believe whatever we like :) I am the source. Never heard of people showing SSD 4K random R/W being better with C-states disabled? See Intel Dynamic Storage Accelerator. You don't think cores in C6, with exit latencies in the tens of microseconds, can have an effect on workloads where CPU usage comes and goes? A bit of work, then back to sleep.

 

Some other tests here

 

Shame it didn't detect the cards; it sounds like the claim of Pascal support is perhaps a little wide of the mark. Thanks for giving it a go.

AWOL

