
NVIDIA REFUSED To Send Us This

jakkuh_t

Correction: the SXM4 version of this card does have an IHS on it. [image: SXM4 card showing the IHS]

Main System: 2 x Intel Xeon Platinum 8268, 384GB DDR4 2933 ECC, 2 x NVIDIA 2080 Ti FE, 2 x Samsung Enterprise 3.2TB NVMe PCIe Gen 3 x8 SSD, custom water cooling.


3 hours ago, Minionflo said:

Can you post the command to recreate the ResNet-50 benchmark?

The command was:
 

wget https://github.com/tensorflow/benchmarks/archive/master.zip && unzip master.zip && cd benchmarks-master/scripts/tf_cnn_benchmarks && \
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=512 --num_batches=100 --model=resnet50 \
  --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions \
  --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 \
  --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True \
  --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10 --use_fp16


Note that, per a forum thread I found here: https://forum.level1techs.com/t/testing-resnet50-performance-nvidia-docker-ubuntu/145182, it would appear that the

--xla_compile=True

flag passed in the above command is specifically Intel-only. For AMD, this should be changed to:

--xla_compile=False


I didn't change this, however, and it only appears to cause warnings, but I can't say it didn't impact my performance in the benchmark (although, as I note below, I appeared to be power limited at a lower wattage than the MSI card used in the video, which was expected).
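For reference, this is what the AMD-CPU variant of the invocation would presumably look like, i.e. the same command with only that one flag flipped (untested on my end, so treat it as a sketch):

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=512 --num_batches=100 --model=resnet50 \
  --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions \
  --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 \
  --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True \
  --xla_compile=False --local_parameter_device=gpu --num_gpus=1 --display_every=10 --use_fp16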

 

Also, I don't know exactly what container they were actually running it in, but the latest container seems to be tensorflow:22.01-tf1-py3.
Be aware that the container is ~12 gigabytes.

Run with:

docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:22.01-tf1-py3
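Putting the pieces together, a rough end-to-end sequence would look something like this (my own sketch, assuming the NGC container above has wget and unzip available; the nvidia-smi call is just a sanity check that the GPU is visible inside the container):

docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:22.01-tf1-py3
# then, inside the container:
nvidia-smi
wget https://github.com/tensorflow/benchmarks/archive/master.zip && unzip master.zip
cd benchmarks-master/scripts/tf_cnn_benchmarks
# finally, run the full tf_cnn_benchmarks.py command from earlier in this post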



I ran this using WSL2 Ubuntu and Docker on Windows 11, and I was able to generally reproduce the video's results for an RTX 3090 with the above command and container, obtaining a result of total images/sec: 1378.18 on my PNY 3090, which was power limited at 360W.
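(If you want to compare power limits on your own card before reading too much into the numbers, a generic nvidia-smi query like the following should show them; this is standard nvidia-smi, nothing specific to this benchmark:)

nvidia-smi --query-gpu=name,power.limit,power.max_limit --format=csv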


I feel like the capability of my NVIDIA Tesla T4 is more flexible.

I wonder if it will show up under Task Manager?


So, price tag aside, this would be an efficient card to get for GPU-intensive tasks, especially in a server chassis where multiple PSUs can be installed to power 8 to 10 of these things. Even in a workstation tower arrangement you could stack 3 to 4 of these, depending on your mobo, along with an A6000 to do the video-related stuff - that's impressive.

 

What's NOT so impressive is the price, and the only guy to blame for that is Jensen. Even when you factor in the 80GB of HBM2, there is no way in hell the BOM for this should exceed four digits. You are saving nothing in efficiency whatsoever spending $10-14K on this card versus the inflated $3K on a 3090. In fact, a water-cooled 3090 would more than likely outlast one of these when used 24/7. Obviously, for those buying HUNDREDS, it won't matter if a card costing 10 grand dies under that 5-year extended service plan and they can get a replacement for free - the average Joe won't be throwing that kind of money at just a GPU every 5 years.

 

Server installs also benefit from renewable energy infrastructure, so things like massive solar arrays can be deployed to help bring down that power cost further. Never mind the subsidies companies can get for doing green energy installs, much of which just doesn't apply to a residential setup. You would need more than just a handful of panels on your roof, and many years before your wallet sees any difference.

 

Workstation users MAY be able to afford it, though, considering the price of a high-end system ($50-60K) and the kind of money it can bring in for its users over that same 5-year period. Once the warranty period is up, the cards can continue to be used, either until they die or in a secondary system, extending that ROI. Reselling can be a smart option at that point as well, especially if the card is still in demand because it performs well, or some guy just needs a spare for cheap.


4 hours ago, Luscious said:

So, price tag aside, this would be an efficient card to get for GPU-intensive tasks, especially in a server chassis where multiple PSUs can be installed to power 8 to 10 of these things. Even in a workstation tower arrangement you could stack 3 to 4 of these, depending on your mobo, along with an A6000 to do the video-related stuff - that's impressive.

 

What's NOT so impressive is the price, and the only guy to blame for that is Jensen. Even when you factor in the 80GB of HBM2, there is no way in hell the BOM for this should exceed four digits. You are saving nothing in efficiency whatsoever spending $10-14K on this card versus the inflated $3K on a 3090. In fact, a water-cooled 3090 would more than likely outlast one of these when used 24/7. Obviously, for those buying HUNDREDS, it won't matter if a card costing 10 grand dies under that 5-year extended service plan and they can get a replacement for free - the average Joe won't be throwing that kind of money at just a GPU every 5 years.

 

Server installs also benefit from renewable energy infrastructure, so things like massive solar arrays can be deployed to help bring down that power cost further. Never mind the subsidies companies can get for doing green energy installs, much of which just doesn't apply to a residential setup. You would need more than just a handful of panels on your roof, and many years before your wallet sees any difference.

 

Workstation users MAY be able to afford it, though, considering the price of a high-end system ($50-60K) and the kind of money it can bring in for its users over that same 5-year period. Once the warranty period is up, the cards can continue to be used, either until they die or in a secondary system, extending that ROI. Reselling can be a smart option at that point as well, especially if the card is still in demand because it performs well, or some guy just needs a spare for cheap.

From the research I did, if you need something that's easy to install and you're willing to wait a bit longer in exchange for saving a ton of power, I would look at:

NVIDIA Tesla T4 (at 70W) or NVIDIA A2 (at 45W), and only get higher-end GPUs if you need the VRAM.


I think it was interesting to see consumer GPU performance in the data science space. As such, I feel like it would be a valuable benchmark to add to GPU reviews when you guys make them... though I have a side interest in data science/machine learning, so I couldn't say whether the greater LTT community feels the same way.


3 hours ago, T8z5h3 said:

From the research I did, if you need something that's easy to install and you're willing to wait a bit longer in exchange for saving a ton of power, I would look at:

NVIDIA Tesla T4 (at 70W) or NVIDIA A2 (at 45W), and only get higher-end GPUs if you need the VRAM.

That T4 is a quirky little card with its HH/HL form factor, LOL. Not sure which one (Supermicro or Tyan) actually had a server SKU that crammed TWENTY of those into a single 4U chassis, all forced-air cooled. The motherboard they used for it was just as insane, with a completely proprietary daughterboard that had 20 PCIe x8 slots!!! Made for a nice 1400W space heater, not including the dual Xeons and everything else in there.

 

FWIW, I would like to see LTT now cover the AMD side of things, specifically the Instinct MI100 card. It's priced about the same as the outgoing 40GB A100 shown here. How does the MI100 compare performance-wise to NVIDIA, and how does it measure up when it comes to efficiency - questions that I'm sure many people interested in it would want to ask.


Cross-posting from Reddit, as I feel this is an important consideration.

 

The A100 is even faster than you showed for deep learning tasks. 

 

A major selling point of the A100 is its larger memory, which is not taken advantage of in the video. It _appears_ as if you are using all the memory on both cards, but that is simply TensorFlow allocating the whole GPU without actually using all of it. That means you can drastically increase the batch size in deep learning for the A100. Batch size is one of the first hyperparameters in deep learning to tune for your data<>device combination when you're optimizing your processing efficiency.

 

Batch size is basically "how many items should I push through the neural network at the same time". The more items you push through at the same time, the faster you'll be in the long run, but the more memory you need. Imagine you are moving watermelons from the store to your car with a cart. The larger the cart (memory), the more watermelons you can fit (batch size), and the faster you're done. In this case, you have a small cart (RTX 3090) and a large cart (A100), and you're only filling the A100 with the same number of watermelons as the small cart holds. So you're not making optimal use of the whole cart; you could be done a lot faster if you filled the whole cart (larger batch size).

 

So, in conclusion, the A100 is likely even faster than what you are currently seeing when making full use of the memory and optimising the batch size for each device.
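In terms of the tf_cnn_benchmarks command posted earlier in this thread, that just means raising --batch_size until the card runs into out-of-memory errors; a rough sketch with an illustrative (not tuned) value:

python tf_cnn_benchmarks.py --model=resnet50 --batch_size=1024 --num_batches=100 --use_fp16 \
  --data_format=NCHW --num_gpus=1 --display_every=10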

 

Furthermore, as was noted on Reddit, ResNet-50 is an easy nut to crack in general. With its mere 23 million parameters, it's a drop of rain compared to the ocean of today's models. If you look at the field of natural language processing, we have models with hundreds of billions of parameters (GPT-3) up to trillions (T5-XXL). But I understand that having a ready-made benchmark script to test out the device is more feasible. I would still have loved to see how far you could push the batch sizes in both cases before running into out-of-memory issues, though!


On 2/23/2022 at 7:24 PM, ACastanza said:

I ran this using WSL2 Ubuntu and Docker on Windows 11, and I was able to generally reproduce the video's results for an RTX 3090 with the above command and container, obtaining a result of total images/sec: 1378.18 on my PNY 3090, which was power limited at 360W.

When trying to do such a benchmark on Windows, you're leaving tons of performance on the table; see: https://medium.com/analytics-vidhya/comparing-gpu-performance-for-deep-learning-between-pop-os-ubuntu-and-windows-69aa3973cc1f

 

Even under WSL2 it's still slower, sadly. Here are my results with a 3060 using FP32 on Linux:

[image: 3060 FP32 results on Linux]

 

And here are the results from a 3060 Ti on Windows, from an acquaintance:

[image: 3060 Ti results on Windows]

 

Whereas the 3060 Ti should be around 30% faster than my 3060.

 

Anyway, here are some comparison values from runs I did some time ago on ResNet-50:

[image: ResNet-50 comparison results]

As you can see, the V100 is not that far behind while using a batch size 4x smaller (I didn't try a 256 batch size because there was no point in trying to compare it; I can try it again later), so a higher batch size (2056 maybe?) would be nice to see.

 

23 hours ago, Luscious said:

What's NOT so impressive is the price, and the only guy to blame for that is Jensen. Even when you factor in the 80GB HBM2 there is no way in hell the BOM for this should exceed four digits.

That's just the list price; you can get those way cheaper when you're an actual big company looking to buy many GPUs at once.

 

14 hours ago, Luscious said:

FWIW, I would like to see LTT now cover the AMD side of things, specifically the Instinct MI100 card. It's priced about the same as the outgoing 40GB A100 shown here. How does the MI100 compare performance-wise to NVIDIA, and how does it measure up when it comes to efficiency - questions that I'm sure many people interested in it would want to ask.

Sadly, you can't just run most of the workloads those NVIDIA GPUs are meant to run. Most of the big ML frameworks are built on top of CUDA, and AMD's software stack is severely lacking when it comes to ML.

Now, if you're looking into FP64 workloads, then an AMD GPU is what you're looking for (think physics simulations).

 

13 hours ago, MountainGoatAOE said:

Cross-posting from Reddit, as I feel this is an important consideration.

 

The A100 is even faster than you showed for deep learning tasks. 

 

A major selling point of the A100 is its larger memory, which is not taken advantage of in the video. It _appears_ as if you are using all the memory on both cards, but that is simply TensorFlow allocating the whole GPU without actually using all of it. That means you can drastically increase the batch size in deep learning for the A100. Batch size is one of the first hyperparameters in deep learning to tune for your data<>device combination when you're optimizing your processing efficiency.

 

Batch size is basically "how many items should I push through the neural network at the same time". The more items you push through at the same time, the faster you'll be in the long run, but the more memory you need. Imagine you are moving watermelons from the store to your car with a cart. The larger the cart (memory), the more watermelons you can fit (batch size), and the faster you're done. In this case, you have a small cart (RTX 3090) and a large cart (A100), and you're only filling the A100 with the same number of watermelons as the small cart holds. So you're not making optimal use of the whole cart; you could be done a lot faster if you filled the whole cart (larger batch size).

 

So, in conclusion, the A100 is likely even faster than what you are currently seeing when making full use of the memory and optimising the batch size for each device.

 

Furthermore, as was noted on Reddit, ResNet-50 is an easy nut to crack in general. With its mere 23 million parameters, it's a drop of rain compared to the ocean of today's models. If you look at the field of natural language processing, we have models with hundreds of billions of parameters (GPT-3) up to trillions (T5-XXL). But I understand that having a ready-made benchmark script to test out the device is more feasible. I would still have loved to see how far you could push the batch sizes in both cases before running into out-of-memory issues, though!

Sadly, LTT is mostly a gamer-focused channel, so they're neither really knowledgeable enough to do this kind of stuff, nor would their audience really appreciate it. PugetSystems, ServeTheHome or Level1 on the other hand...

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


On 2/23/2022 at 10:38 PM, danwat1234 said:

On your Facebook page, the video is cut off, like numerous others on your page. Why not do the little bit of work needed to make the aspect ratio correct? It's just a matter of re-encoding once you find the magic settings. Because of ad revenue?

 

https://www.facebook.com/LinusTech/posts/517830633046006

From my experience, Facebook's and Instagram's algorithms tend to prefer taller videos, presumably because they take up the majority of a user's phone screen.

Quote or mention me or I won't be notified of your reply!

Main Rig: R7 3700x New!, EVGA GTX 1060 6GB, ROG STRIX B450-F Gaming New!, Corsair RGB 2x16GB 3200MHz New!, 512GB Crucial P5, 120GB Samsung SSD, 1TB Seagate SSHD, 2TB Barracuda HDD

MacBook Pro 14" (M1 Max, 32GB RAM)

Links: My beautiful sketchy case | My website


13 hours ago, ImAlsoRan said:

From my experience, Facebook's and Instagram's algorithms tend to prefer taller videos, presumably because they take up the majority of a user's phone screen.

I believe that. But I think a larger percentage of the Linus community than of general users would gladly turn their mobile devices to landscape to get the full experience if they could.


On 2/24/2022 at 4:27 AM, Luscious said:

That T4 is a quirky little card with its HH/HL form factor, LOL. Not sure which one (Supermicro or Tyan) actually had a server SKU that crammed TWENTY of those into a single 4U chassis, all forced-air cooled. The motherboard they used for it was just as insane, with a completely proprietary daughterboard that had 20 PCIe x8 slots!!! Made for a nice 1400W space heater, not including the dual Xeons and everything else in there.

 

FWIW, I would like to see LTT now cover the AMD side of things, specifically the Instinct MI100 card. It's priced about the same as the outgoing 40GB A100 shown here. How does the MI100 compare performance-wise to NVIDIA, and how does it measure up when it comes to efficiency - questions that I'm sure many people interested in it would want to ask.

I have an NVIDIA Tesla T4 waiting to go in a Dell Precision 3930 1U workstation with an Intel 9900, 16GB DDR4, a 500GB HDD, dual 550W power supplies and an NVIDIA T400, running Win 10 Pro (likely going to Win 11 Pro). I will be using that card mostly for video encoding and AI upscaling.

 


I hope LTT looks at Instinct server GPUs. Those things are monsters.

MSI X399 SLI Plus | AMD Threadripper 2990WX all-core 3GHz lock | Thermaltake Floe Riing 360 | EVGA 2080, Zotac 2080 | G.Skill Ripjaws 128GB 3000MHz | Corsair RM1200i | 150TB | ASUS TUF Gaming mid tower | 10Gb NIC


Shout out to the only true thermal paste pattern, the rice mark.

 


  • 3 weeks later...

I did some tests myself with an A100 in case anyone is interested:

 

Just now, igormp said:
Got an A100 to try out because I was bored, got some nice numbers, and we can clearly see that this workload is simply too simple for this GPU.


+-------------------+---------------+----------------+----------------+----------------+----------------+---------------+----------------+----------------+----------------+----------------+-----------------+
|    GPU-Imgs/s     | FP32 Batch 64 | FP32 Batch 128 | FP32 Batch 256 | FP32 Batch 384 | FP32 Batch 512 | FP16 Batch 64 | FP16 Batch 128 | FP16 Batch 256 | FP16 Batch 384 | FP16 Batch 512 | FP16 Batch 1024 |
+-------------------+---------------+----------------+----------------+----------------+----------------+---------------+----------------+----------------+----------------+----------------+-----------------+
| 2060 Super        | 172           | NA             |            NA  |            NA  |            NA  | 405           |            444 |            NA  |            NA  |            NA  |            NA   |
| 3060              | 220           | NA             |            NA  |            NA  |            NA  | 475           |            500 |            NA  |            NA  |            NA  |            NA   |
| 3080              | 396           | NA             |            NA  |            NA  |            NA  | 900           |            947 |            NA  |            NA  |            NA  |            NA   |
| V100              | 369           | 394            |            NA  |            NA  |            NA  | 975           |           1117 |            NA  |            NA  |            NA  |            NA   |
| A100              | 766           | 837            |           873  |           865  |           OOM  | 1892          |           2148 |          2379  |          2324  |          2492  |          2362   |
| Radeon VII (ROCm) | 288           | 304            |            NA  |            NA  |            NA  | 393           |            426 |            NA  |            NA  |            NA  |            NA   |
| 6800XT (DirectML) | NA            | 63             |            NA  |            NA  |            NA  | NA            |             52 |            NA  |            NA  |            NA  |            NA   |
+-------------------+---------------+----------------+----------------+----------------+----------------+---------------+----------------+----------------+----------------+----------------+-----------------+

 

Also did an AI-benchmark run:

 

Device Inference Score: 21692
Device Training Score: 23542
Device AI Score: 45234
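For anyone who wants to produce comparable scores on their own card: these numbers are in the output format of the ai-benchmark Python package, so a minimal run would be something along these lines (assuming a working TensorFlow GPU environment, e.g. the NGC container mentioned earlier in the thread):

pip install ai-benchmark
python -c "from ai_benchmark import AIBenchmark; AIBenchmark().run()"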


 

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


  • 4 months later...
On 2/24/2022 at 12:21 AM, jakkuh_t said:

 

What components are used in this build? Can someone provide the details?

I can only make out that the PSU is a Corsair AX1600i, plus an MSI 3090 Suprim and the A100. What about the CPU and mobo?

 

Also, is it mandatory to use the 3090 for display output, or is any regular GPU fine?


  • 10 months later...

@jakkuh_t Did you ever revise that 3D print of the cooler you designed for the A100? I'm thinking about slapping 3 similar form-factor cards in an ATX rig, and cooling is going to be an issue, to say the least.

