
Nvidia loses half of its value

exetras
9 minutes ago, asus killer said:

Nvidia's fall kind of mirrors the Nasdaq's fall. The problem is much greater than the company; it's the market as a whole. MAGA

I would have clicked agree, but...

 

[Image: the-queen-is-disappointed.jpg]


12 minutes ago, Citadelen said:

Nvidia's stock had been riding a multi-year high based on their consistent successive earnings beats, so when they finally fell short, cut guidance, and had racked up a massive inventory, the market reacted quite violently.

Why do people keep saying this "massive inventory"? Nvidia have claimed it will be back to normal by January and that the initial claims of high stock problems were inflated. If it isn't back to normal by then and we have some actual numbers to look at, I'd be interested in investigating it then, but for now it seems all we have is one or two out-of-context claims fueled by a huge drop in the market value. The 1080 Ti and low-end cards are not in stock, and mid-range cards don't seem to be falling in price (which is what one would expect from an oversupply of said GPUs) from either company.

 

I think they both learned from the last bitcoin debacle and this overstock problem is more an everyday issue that seems to have been caught in the wind with all the other hype going on.

 

 

Grammar and spelling is not indicative of intelligence/knowledge.  Not having the same opinion does not always mean lack of understanding.  


21 minutes ago, mr moose said:

Why do people keep saying this "massive inventory"? Nvidia have claimed it will be back to normal by January and that the initial claims of high stock problems were inflated. If it isn't back to normal by then and we have some actual numbers to look at, I'd be interested in investigating it then, but for now it seems all we have is one or two out-of-context claims fueled by a huge drop in the market value. The 1080 Ti and low-end cards are not in stock, and mid-range cards don't seem to be falling in price (which is what one would expect from an oversupply of said GPUs) from either company.

 

I think they both learned from the last bitcoin debacle and this overstock problem is more an everyday issue that seems to have been caught in the wind with all the other hype going on.

 

 

The 2060 is coming soon and they still made a 1060 with GDDR5X. Even if there isn't a problem with the 1080 Ti that is disappearing, they sure look desperate to unleash some 1060s out of the warehouse. That's also their best seller, so it's appropriate for that to be the biggest indicator of overstock.



NVDA stock skyrocketed because of several factors, the biggest being the cryptocurrency boom, which is why it's now also collapsing. The fact that Nvidia has a huge overstock of Pascal didn't help matters, and the short-term investors were looking to cash out ASAP.

 

My 2 cents is that the RTX cards are overpriced partly due to the overstock of Pascal. Nvidia wanted to milk the bleeding-edge consumers while still comfortably being able to sell Pascal. If I'm right, we might be looking at a price decrease for Turing in the near future, perhaps even without any competition from AMD.


13 minutes ago, asus killer said:

The 2060 is coming soon and they still made a 1060 with GDDR5X. Even if there isn't a problem with the 1080 Ti that is disappearing, they sure look desperate to unleash some 1060s out of the warehouse. That's also their best seller, so it's appropriate for that to be the biggest indicator of overstock.

Well, I haven't seen the price drop here. I just bought an RX 570 because the $150 difference over the 1060 was better in my pocket, and to be honest I don't think the performance gap was enough to justify that cost difference. So it seems they either don't have an inventory issue here or it doesn't exist to the extent people are making out.



1 minute ago, mr moose said:

Well, I haven't seen the price drop here. I just bought an RX 570 because the $150 difference over the 1060 was better in my pocket, and to be honest I don't think the performance gap was enough to justify that cost difference. So it seems they either don't have an inventory issue here or it doesn't exist to the extent people are making out.

Used 1060s sell for a lot more than equivalent AMD RX cards for some stupid reason. The same goes for new ones; people are willing to pay a lot more for a 1060 than an RX 580, for example. Just like you pointed out.

So Nvidia doesn't really need to lower prices; people keep paying a premium to go green. Also, because they increased the price of the 2000 series so much, the Pascal cards never dropped in price, the 1070 and 1080 included. They are smart at Nvidia. Couple that with disappointing RTX cards, and people just stocked up on Pascal cards at full price, resolving their excess inventory.

Still, you can't deny that phasing out 1070s and 1080s while releasing both a new 1060 and a new 2060 means overstocking. I don't see how you could.



4 minutes ago, asus killer said:

Still, you can't deny that phasing out 1070s and 1080s while releasing both a new 1060 and a new 2060 means overstocking. I don't see how you could.

Because if they don't need to do anything about the overstock (reduce prices, shift stock or have a promotion),  then it's not actually a problem.



20 minutes ago, mr moose said:

Because if they don't need to do anything about the overstock (reduce prices, shift stock or have a promotion),  then it's not actually a problem.

I get your point, and they have so much money they could probably just build a tree house of chips in the back of the headquarters, but analysts and stockholders don't like inventory mismanagement and wasted money. That's why it's a big problem.



I'm a day trader, so this should be simple. Both AMD and Nvidia had a humongous 2000%+ rally starting in 2016, and both crashed (retraced) accordingly. When something rises on a logarithmic scale it retraces on a logarithmic scale too; think of bitcoin. A rise that fast causes a bubble effect with a frenzy of investors and traders just riding the wave, followed by an exodus of selling and panic. You can blame news that happened at the peak all you want, but the news is really only a catalyst for the bubble to fail, and from there things have only just gotten started.

 

There's more to it though. Speaking of bitcoin, they should also both have crashed due to a delayed reaction to the decline of bitcoin and all cryptocurrencies, causing reduced hardware demand from miners. The latest breakdown of bitcoin from $7k to $3k on DECLINING VOLUME has crushed the hopes of all the miners who thought they were getting in on some kind of breakout from a $6k consolidation.

 

The real question is why both AMD and Nvidia are still up 1000% since 2015. I could be wrong, but the logical reason they would have both rallied 2000% in the first place is that it paralleled the bitcoin and crypto rally that started at the same time. In AMD's case they are also dominating Intel with Threadripper, while Nvidia has no such thing except maybe some AI sales, and both still traded suspiciously similarly (trailing bitcoin).

 

So now that crypto has lost 90% of its value, where does that put AMD and Nvidia? Why are they still up 1000%? We haven't had a sudden change in the number of gamers, and it doesn't sound like it's earned by ray tracing. I second the poster talking about a falling knife, because the charts don't look pretty either.


3 hours ago, leadeater said:

HPCG isn't actually run in a way the system is actually going to be used either; tasks that would be run on dedicated hardware are run on everything, and that affects the scoring.

 

3 hours ago, leadeater said:

HPL is best case and HPCG is worst case, and neither of them represents the performance that is actually going to be obtained when running the workloads the cluster was designed for; performance sits above HPCG and below HPL, basically. Cluster efficiency is much higher than HPCG indicates.

Well, it's true that no synthetic benchmark is going to run exactly the way a real application does, because applications aren't usually that similar to synthetic benchmarks.

 

However, many pre-exascale and exascale workloads are very dependent on memory bandwidth and the ability to sustain lots of low-latency, high-bandwidth interconnect traffic, which is why you really have to look at multiple benchmarks like HPL and HPCG and the % of peak that each sustained maximum result represents. Higher computational efficiency is difficult and expensive to attain, especially when you get to systems above 90% like Fujitsu's.
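To put rough numbers on that (going from the published November 2018 list figures from memory, so treat them as approximate), computational efficiency is just the sustained result over the theoretical peak:

efficiency = Rmax / Rpeak
Summit HPL:  ~143.5 PF / ~200.8 PF ≈ 72% of peak
Summit HPCG:   ~2.9 PF / ~200.8 PF ≈ 1.5% of peak

Real memory-bound workloads tend to land somewhere between those two extremes, which is exactly why you want both numbers and the % of peak rather than either headline figure on its own.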

 

That's also why most exascale architectures are going to be using HBM-style on-package memory instead of DIMMs, and why Fujitsu abandoned DIMMs on their dedicated HPC architecture in 2015.

 

3 hours ago, leadeater said:

The interconnects from each of them are all actually very similar in design, meshes of meshes. Aries went with high compatibility so used PCIe 3.0 between the Aries SoC and the nodes, that's really where the limitations in bandwidth come from.

I wouldn't really call the dragonfly topology terribly similar to a five or six dimensional torus, since dragonfly is a hierarchical network that connects groups to each other. Tofu is a hybrid of a torus and a mesh. Blue Gene's interconnect is similar to Tofu. Summit and Sierra use EDR Infiniband fat trees which is a pretty standard setup by comparison to more exotic topologies.

 

Cray just released their latest custom interconnect, called Slingshot, which is what their highest-end Shasta systems will use. They don't seem to think it's a waste of time to create their own custom interconnect and ASIC.

 

They'll have OmniPath as an option, but they spent an awful lot of money to make yet another of their dragonfly topology based custom interconnects for their latest and greatest system.

 

3 hours ago, leadeater said:

Not sure what you mean by gluelessly

 

I mean that IBM Power9 SU scales to 16 sockets using its X and O busses, while Xeon only scales to 8. You need router chips and a separate interconnect to scale beyond 8 sockets with Xeons. AMD doesn't even have a scale-up CPU, and they're limited to 2 sockets.

 

Intel used to have the Jordan Creek scalable memory buffer, similar to what IBM has with their Centaur DIMMs (though much less sophisticated). They reduced the capacity per socket because they got rid of their buffer chip and just use normal 6-channel DDR4 controllers. Broadwell-EX effectively had 8-channel memory in performance mode (instead of mirrored mode).

 

IBM Centaur DIMMs come in sizes up to 512GB, using 152 individual 3DS-stacked DDR4 chips with TSVs. Not only that, but because each DIMM has up to 152 DRAMs on it, it has Chipkill ECC, which like most ECC requires a 9th chip per group, but it also has a 10th chip so that one Chipkill event doesn't require replacing a DIMM. No one else has DIMMs like that.

 

And when it comes to memory bandwidth, at 119GB/s Xeons can't compete with Power9's 230GB/s or PrimeHPC's 480GB/s. Although the current highest memory bandwidth is the NEC SX-Aurora, with 1.2TB/s from 6 stacks of HBM2.
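For anyone wondering where figures like that come from, a rough back-of-the-envelope (channel counts and data rates are the published ones; the per-stack HBM2 number is approximate):

Skylake-SP: 6 channels x DDR4-2666 x 8 bytes ≈ 6 x 21.3 GB/s ≈ 128 GB/s theoretical peak per socket (the ~119 GB/s figure is what typically gets quoted)
SX-Aurora:  6 HBM2 stacks x ~200 GB/s per stack ≈ 1.2 TB/s per processor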

 

The other features in addition to the Centaur DIMMs are IBM's X and O bus scalability to 16 sockets, OpenCAPI, and built-in NVLink support. The reason that Summit and Sierra are so powerful also has to do with the fact that the GPUs can directly talk to the CPUs and have true coherency between the HBM on the GPU and the CPU's 512GB of DDR4.

 

As far as benchmarks go.

 

https://www.ibm.com/blogs/systems/ibm-power9-scale-out-servers-deliver-more-memory-better-price-performance-vs-intel-x86/


1 hour ago, Amazonsucks said:

I mean that IBM Power9 SU scales to 16 sockets using its X and O busses, while Xeon only scales to 8. You need router chips and a separate interconnect to scale beyond 8 sockets.

You can't scale beyond 8 in a single system; past that point you're going to multi-node. In HPC you rarely see more than 4 sockets per system anyway, with accelerator-based nodes mostly being 2-socket. IBM Power systems are the same here too; the AC922 is dual socket, as an example.

 

I think you might be referring to some other part of the system design?

 

Big scale-up systems are targeted more towards telcos, data analytics, banking, etc., where the software and systems aren't designed to scale out or can't be. I believe there are no Power9 scale-up systems in the supercomputer rankings, though I wouldn't be that surprised if there was one hidden in there (E950/E980 etc.).

 

BlueGene on the other hand is a totally different ballpark compared to Power9 and Intel/AMD; it's a huge system of 1-socket nodes, but what constitutes a node is much smaller and more components of the system are separated out. Each node is just a CPU and memory, which go into drawers, and then drawers go into crates. Love the terminology.

 

BlueGene and the bespoke IBM systems of old are actually far more interesting to look at; everything else is very similar. Personally I liked old IBM back when they were right in there in the hardware game; their company stance right now is "We are a software company". I think Cell BE burned them big time, even though it's my favorite CPU design (oh, the possibilities nowadays with chiplets).

 

1 hour ago, Amazonsucks said:

Intel used to have the Jordan Creek scalable memory buffer, similar to what IBM has with their Centaur DIMMS(though much less sophisticated). They reduced the capacity per socket because they got rid of their buffer chip.

Yeah, I'm just not sure why they did that; they already have a special M variant now that supports more memory, so I don't know why they went away from it. You'd have alternative CPU options for deployments that don't like SMI; to regress in system capability is just weird.

 

1 hour ago, Amazonsucks said:

And when it comes to memory bandwidth, with 119GB/s Xeons cant compete with Power9's 230GB/s or PrimeHPC's 480GB/s.

In these deployments we're dealing with scale-out, so it's around 120GB/s, roughly equal between Intel and Power9.

 

1 hour ago, Amazonsucks said:

OpenCAPI

Intel doesn't care about this; they have their own technology called Omni-Path. Intel hasn't joined many of the HPC open standards, and the same goes for Nvidia.

 

1 hour ago, Amazonsucks said:

The reason that Summit and Sierra are so powerful also has to do with the fact that the GPUs can directly talk to the CPUs and have true coherency between the HBM on the GPU and the CPU's 512GB of DDR4. 

Nope, the cluster is just way, way larger in scale than anything else. It's faster because it has the newest Nvidia V100 accelerators, combined with having more of them than any system before it had P100s. It's not hard to be the fastest when you've brought the most hardware to the game by a lot. Just look at spots 5 and 7, then at spots 1 and 2, with spots 5 and 1 having nearly the same performance per core. You're seeing 6 times the performance because there is 6 times the hardware.

 

All NVLink does for the CPU is allow it to load the data set into the GPUs faster; it does not make the computation run faster. In a benchmark like Linpack it will have no effect on Rmax. The benefit of NVLink for the CPU is to load jobs from the queue faster, meaning you dispatch more jobs over the same time period and the GPUs are idle less often. Reducing I/O wait times is good, but the benefit shrinks as the job run time increases, i.e. 10 jobs per hour vs 1 job per hour.
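To illustrate that last point with made-up numbers: say loading the data set takes 60 s over PCIe and 20 s over NVLink. A 5 minute job goes from 360 s to 320 s per run, so roughly 12% more jobs through the queue; a 5 hour job goes from 18,060 s to 18,020 s, about 0.2%. Same link, same speedup, very different payoff depending on the job length.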

 

1 hour ago, Amazonsucks said:

Yeah, because vendor-supplied benchmarks are so trustworthy. But yes, IBM is the best platform for running SAP and DB2 (IBM's database). If we only look at benchmarks that IBM is the best in, then they will look like the fastest. That is really what it comes down to at the end of it: buying the optimal system for what you are doing, and you'll more often find that is an Intel system than an IBM Power one.


2 hours ago, Amazonsucks said:

Cray just released their latest custom interconnect which is what their highest end Shasta systems will use, called Slingshot. They dont seem to think its a waste of time to create their own custom interconnect and ASIC.

Which is based on Ethernet, if you didn't know. They are doing what I said: taking existing transport technologies and augmenting them with their own logic to customize them for these specialty use cases. You don't need to develop your own layer 1/2 technology when there are already excellent high-performance, low-latency options.

 

Also Shasta can use Slingshot, Infiniband and Omni-path.


14 hours ago, leadeater said:

You can't scale beyond 8 in a single system, past that point you're going to multi-node.

https://www.ibm.com/us-en/marketplace/power-system-e980

 

16-way Power9 with 64TB max memory. That's as much as an SGI UV from a couple of years ago, except this is with no routers.

 

14 hours ago, leadeater said:

. You'd have alternative CPU options for deployment that don't like SMI, to regress in system capability is just weird.

Memory buffers are expensive. Even IBM is working with JEDEC to make a standard to at least partially replace Centaur for Power10.

 

https://www.nextplatform.com/2018/08/28/ibm-power-chips-blur-the-lines-to-memory-and-accelerators/
 

And as the author points out, even unbuffered Power9 has about 70% more bandwidth than Skylake-SP, and Centaur systems are 100% faster.

 

14 hours ago, leadeater said:

You're seeing 6 times the performance because there is 6 times the hardware.

 

All NVLink does for the CPU is allow it to load the data set into the GPUs faster; it does not make the computation run faster.

Not really. The node count of Summit is 1/4 the node count of Titan, which it replaces. Each node of Summit can be 2 CPUs with 6 GPUs, with complete coherency between all 8 of the processors and all of the memory in a node. Increasing the bandwidth and reducing the latency between the CPUs and GPUs allows for such fat nodes, whereas older systems needed a much higher ratio of CPUs to GPUs, and since the GPUs do most of the math, having all those CPUs just to let the GPUs talk to each other was a waste of power.

 

Summit is also only like 250 racks. It does use a healthy 13MW though.

 

And you could do a system using nodes like DGX, and Fujitsu actually made one of those, and so did Nvidia.

 

http://www.fujitsu.com/global/about/resources/news/press-releases/2018/0420-01.html

 

https://www.nvidia.com/en-au/data-center/dgx-saturnv/

 

But obviously that doesn't work well enough for ORNL and LLNL, so they used IBM Power9s for the specific reason that NVLink allowed them to make a system without the huge bottleneck that the PCI Express bus represents in terms of throughput and latency, and they can have fewer but much more powerful nodes to simplify programming.

 

As far as the whole Ethernet thing goes, yes and Tofu is also built on 100Gb Ethernet. To say that Tofu and Aries or other custom interconnects requiring bespoke ASICs are just modified Ethernet is really an oversimplification.


8 minutes ago, Amazonsucks said:

https://www.ibm.com/us-en/marketplace/power-system-e980

 

16-way Power9 with 64TB max memory. That's as much as an SGI UV from a couple of years ago, except this is with no routers.

That was in relation to Intel's 8-socket support; there is nothing above that without breaking out into another system. I'm just not sure what you mean by router chips and separate interconnects to scale beyond 8? Beyond 8 just isn't a thing for Intel, not in a single-system-boundary context.

 

12 minutes ago, Amazonsucks said:

Memory buffers are expensive.

If you need that kind of memory per socket, money typically isn't a problem. Intel CPUs are already expensive, so adding a thousand or so onto a 6142M/8180M system wouldn't be of much impact when the RAM itself costs far more (the additional 1.5TB of it vs the buffer chips). Removing them probably had more to do with reducing latency, but I still don't see why an option with and without buffers wasn't possible. Intel already makes custom CPUs for AWS/Azure etc., so I have my suspicions the demand for 3TB per socket wasn't there to bother.

 

18 minutes ago, Amazonsucks said:

Not really. The node count of Summit is 1/4 the node count of Titan, which it replaces

Not the point; you didn't actually look at what I pointed you to, did you? Piz Daint is a similar cluster using last-generation hardware. Summit has 6 times the hardware of Piz Daint and has 6 times the performance; I shouldn't think that is a surprise. Titan uses very old GPUs, so having fewer nodes doesn't indicate much beyond technology moving forward over time.

 

Summit tops the list because it has the most raw hardware by a lot. Summit is not the fastest because it has IBM Power9 or NVLink; it's faster because it's the biggest. Scale Piz Daint out to a similar scale and you'll get similar performance, albeit at much higher power draw because of P100s instead of V100s.

 

22 minutes ago, Amazonsucks said:

where as older systems needed a much higher ratio of CPUs to GPUs, and since the GPUs do most of the math, having all those CPUs just to let the GPUs talk to each other was a waste of power.

Other than the systems with 2 Intel CPUs and 8 or 16 V100/P100 GPUs in them. The CPUs don't let the GPUs talk to each other if you're using NVLink between the GPUs; you don't need NVLink on the CPU to allow for high-bandwidth, low-latency communication between the GPUs or unified GPU memory.

 

[Image: Tesla V100 NVLink hybrid cube mesh diagram]

 

37 minutes ago, Amazonsucks said:

But obviously that doesn't work well enough for ORNL and LLNL, so they used IBM Power9s for the specific reason that NVLink allowed them to make a system without the huge bottleneck that the PCI Express bus represents in terms of throughput and latency.

This is likely because the workloads they intend to run load a lot of data into the GPUs and do so often; this is where NVLink will help. This is a workload requirement: clusters built on PCIe accelerators work just fine and aren't necessarily limited by the PCIe bandwidth. You'd actually have to run tests and assess the benefit, which I'm sure they did.

 

Most of the workloads that the researchers here run on GPUs run on a single GPU; they just run multiple jobs and iterations at one time, so PCIe with no NVLink between the GPUs doesn't affect them at all. NVLink wouldn't do much for them, other than increase cost and require them to re-optimize everything for Power, and that won't happen (we're very unlikely to put in a Power-based cluster; renting time on NeSI is easier and cheaper).

 

Increasing bandwidth in areas you don't need it doesn't do anything; if it were that widely important to everyone, then you'd see far more Power9 clusters.

 

59 minutes ago, Amazonsucks said:

As far as the whole Ethernet thing goes, yes and Tofu is also built on 100Gb Ethernet. To say that Tofu and Aries or other custom interconnects requiring bespoke ASICs are just modified Ethernet is really an oversimplification.

It wasn't a simplification, I don't think you understood the point.


3 hours ago, leadeater said:

The CPUs don't let the GPUs talk to each other if you're using NVLink between the GPUs, you don't need NVLink on the CPU to allow for high bandwidth low latency communication between the GPUs or unified GPU memory.

NVLink isn't just for GPU-to-GPU communication. In the diagram you show, which I believe is the DGX-1, the GPUs communicate with each other using NVLink, but they can only access the CPU's memory over the extremely low-bandwidth PCIe bus.

 

PCIe 3.0 x16 is ~16GB/s. Each NVLink brick is 50GB/s. The Power9 AC922 servers on which Summit and Sierra are based have 2 Power9 SO chips, each of them having 6 NVLink bricks for a total of 300GB/s. PCIe can't even come close.

 

For Summit, it's 2 NVLink bricks per GPU since there are 3 per socket, with an aggregate bandwidth between the CPUs and GPUs of 100GB/s per 2-brick link (vs 16GB/s for PCIe), as well as NVLink between the GPUs.

 

For Sierra, it's 3 bricks per GPU with 150GB/s, which is roughly 10 TIMES a PCIe setup's individual CPU-to-GPU bandwidth.
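Spelling the brick math out (this is just the aggregate arithmetic, using the bidirectional figures Nvidia quotes):

Per NVLink 2.0 brick: 25 GB/s each way, 50 GB/s bidirectional
Summit node (3 GPUs per Power9): 6 bricks / 3 GPUs = 2 bricks per GPU -> 100 GB/s per GPU
Sierra node (2 GPUs per Power9): 6 bricks / 2 GPUs = 3 bricks per GPU -> 150 GB/s per GPU
PCIe 3.0 x16 for comparison: ~16 GB/s each way, ~32 GB/s bidirectional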

 

That allows Summit and Sierra (or any Power9 Volta setup) to bypass the PCIe bus entirely for things like RDMA. And while the GPUs' memory is coherent in a DGX-style system, it is not coherent between the CPUs and GPUs.

 

Summit has 512GB of DDR4 per node, with 96GB of HBM. NVLink allows the GPUs to access the host CPU's memory in a way that they simply can't over PCIe, at bandwidths that approach the entirety of the 8-channel DDR4 bandwidth.

 

That's where you get the scalability, reduced overhead and latency reduction that's required for a true 130 PFLOPS pre-exascale system to actually run the workloads it does and still be heterogeneous.

 

A homogeneous system using only CPUs would be much simpler to use than a system using GPUs, but impractical from a power and cost perspective. In addition to solving problems, adding accelerators like GPUs to a system causes plenty of problems as well. This guy works there, so I'll let him explain.

 

 

3 hours ago, leadeater said:

Most of the workloads that the researchers here run on GPUs run on a single GPU

 

Then they aren't running simulations that require petabytes of RAM and PFLOPS of performance, and don't need it. ORNL, LLNL, RIKEN and other HPC centers do. Summit has 2.53PB of DDR4 and only 475TB of HBM. The much faster access to the large DDR4 memory per node is imperative when you're running a workload that uses a massive number of nodes and the datasets being worked on don't fit in the 96GB of GPU memory.

 

As soon as you run out of GPU memory on a Xeon PCIe heterogeneous system and you have to talk over the PCIe bus, latency begins to add up. Scale that 6x bigger than Piz Daint and you may run Linpack just fine, but many data-centric real workloads would undoubtedly suffer.

 

Probably even more so for Summit, since it's actually capable of ExaOPS when using the tensor cores.

 

https://www.olcf.ornl.gov/2018/06/08/genomics-code-exceeds-exaops-on-summit-supercomputer/

 

3 hours ago, leadeater said:

Beyond 8 just isn't a thing for Intel, not in a single system boundary context.

I thought you meant Power9 didn't scale past 8 sockets. I know Intel doesn't, which was another of the advantages Power9 has over Xeon.

 

3 hours ago, leadeater said:

Increasing bandwidth in areas you don't need it doesn't do anything, if it were that widely important to everyone then you'd see far more Power9 clusters.

Not that many people have pre-exascale systems. I don't think Power CPUs are going to end up in the initial exascale systems, and probably no heterogeneous exascale system will be without NVLink or something similar.

 

What you will see in the near future is a continuation of two things: easy-to-use all-CPU homogeneous systems like the ARM A64FX, and a lot more coherency between the CPUs and GPUs in heterogeneous systems.

 

3 hours ago, leadeater said:

It wasn't a simplification, I don't think you understood the point.

I get your point. It's just that I never said that the exotic, custom interconnects weren't based on Ethernet. It's kind of like saying "well, a Veyron and a Civic are both based on a gasoline internal combustion engine". While that is true, the Veyron has a lot of additional capabilities engineered into it, just like Tofu, Aries or Slingshot.


2 hours ago, Amazonsucks said:

PCIe 3.0 x16 is ~16GB/s. Each NVLink brick is 50GB/s.

25GB/s per direction vs 16GB/s per direction; NVLink marketing is, well... crafty. PCIe 4.0 is double that of 3.0, so CPU-to-GPU bandwidth is plenty ample, if there were GPUs that supported it and more than just IBM with PCIe 4.0 CPUs.
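For anyone following along, here's roughly where the two per-direction numbers come from (lane counts and signalling rates are the public spec figures):

PCIe 3.0 x16:      16 lanes x 8 GT/s x (128/130 encoding) / 8 ≈ 15.75 GB/s per direction
NVLink 2.0 brick:  8 lanes x 25 GT/s / 8 = 25 GB/s per direction (the 50 GB/s figure is both directions added together)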

 

2 hours ago, Amazonsucks said:

extremely low bandwidth PCIe bus

That's a rather large exaggeration; it's slower than NVLink but it's not 'extremely low'. A single EPYC has a total of 126GB/s of bi-directional PCIe bandwidth on offer, not a far cry from the 150GB/s of NVLink. PCIe 4.0, anyone?

 

2 hours ago, Amazonsucks said:

That allows Summit and Sierra(or any Power9 Volta setup) to bypass the PCIe bus entirely for things like RDMA. And while the GPU's memory is coherent in a DGX style system, it is not between the CPUs and GPUs.

It doesn't work like you think even with NVLink on the CPUs; this is often what I see when people read the brochure or technical marketing information but don't understand that it ultimately doesn't mean much. These are first and foremost CUDA features, which means they're possible with PCIe as well as NVLink. The big win for NVLink, and why it exists, is to interconnect the GPUs and avoid going through QPI links or equivalent, and to have interconnect controllers on each GPU, increasing the total bandwidth available.

 

https://devblogs.nvidia.com/unified-memory-cuda-beginners/
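A minimal sketch of what that post describes, just to show there's nothing NVLink-specific in the programming model: cudaMallocManaged gives you one pointer usable from both host and device code, and the driver migrates pages on demand over whatever link is underneath (the kernel and sizes here are purely illustrative).

#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: scale an array in place on the GPU.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    cudaMallocManaged(&x, n * sizeof(float));    // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; i++) x[i] = 1.0f;     // first touched on the CPU

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f); // pages migrate to the GPU on demand
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                 // and migrate back when the CPU touches them
    cudaFree(x);
    return 0;
}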

 

You either have a need for NVLink or you don't; it's not universally applicable to everyone and the benefits aren't uniform. If you'd only gain 10% more effective performance using it but increase the per-node cost by 30%, it would be better to just buy more nodes and you'd have a bigger performance increase for the same spend. On the other hand, if it doubles performance then you'd really want to opt for it, but then do you only need it between the GPUs?

 

Also RDMA to where to do what?

 

2 hours ago, Amazonsucks said:

Then they arent running simulations that require petabytes of RAM and PFLOPS of performance and dont need it.

Yes they are; for that they book time on NeSI or NeCTAR. How they run the workload doesn't change between running it on our equipment or theirs, they just have more nodes, more GPUs, just more of everything, so they can run more jobs and get the task done faster. NeSI has a Cray XC50 cluster btw, #482 in the Top 500 currently (still hanging in there).

 

2 hours ago, Amazonsucks said:

As soon as you run out of GPU memory on a Xeon PCIe heterogeneous system and you have to talk over the PCIe bus, latency begins to add up. Scale that 6x bigger than Piz Daint and you may run Linpack just fine, but many data centric real workloads would undoubtedly suffer.

You don't ever load a task into GPUs that is larger than the memory. When your data set is larger than GPU memory you need to get well acquainted with CUDA memory management, https://devblogs.nvidia.com/beyond-gpu-memory-limits-unified-memory-pascal/ (NVLink helps, but not being stupid with your dataset helps more). Scaling Piz Daint will have no issues, unless you count cost.
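A sketch of the kind of thing that article means by memory management when the data set is bigger than GPU memory: managed allocations on Pascal/Volta can oversubscribe device memory, and prefetch/advise hints let you control when the migration traffic happens rather than eating page faults mid-kernel (the device id, sizes and chunking scheme here are made up for illustration).

#include <cuda_runtime.h>

__global__ void process(float *chunk, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) chunk[i] += 1.0f;
}

int main() {
    const int dev = 0;
    cudaSetDevice(dev);

    const size_t bytes      = 32ull << 30;       // 32 GB data set, bigger than the GPU (illustrative)
    const size_t chunkBytes = 1ull  << 30;       // work through it 1 GB at a time
    const size_t chunkElems = chunkBytes / sizeof(float);

    float *data = nullptr;
    cudaMallocManaged(&data, bytes);             // allowed to exceed device memory on Pascal/Volta

    // Hint that the data's home is host memory so evictions stay cheap.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

    for (size_t off = 0; off < bytes / sizeof(float); off += chunkElems) {
        // Stage the next chunk onto the GPU before the kernel needs it.
        cudaMemPrefetchAsync(data + off, chunkBytes, dev);
        process<<<(chunkElems + 255) / 256, 256>>>(data + off, chunkElems);
    }
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}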

 

2 hours ago, Amazonsucks said:

I get your point. Its just that i never said that the exotic, custom interconnects weren't based on Ethernet.

I believe you didn't, because I made the original point that a lot of the bespoke nature of HPC interconnects is becoming less necessary: industry-standard offerings have matured to the point that augmenting them is more cost effective and there's no drawback in doing so. Summit and Sierra are standard InfiniBand.

 

2 hours ago, Amazonsucks said:

and probably no heterogeneous exascale system without NVLink or something similar.

 

What you will see in the near future is a continuation of two things: easy to use all CPU homogeneous systems like ARM A64FX and a lot more coherency between the CPUs and GPUs in heterogeneous systems.

CCIX and Gen-Z will supplant NVLink outside of Nvidia/IBM-sponsored projects; even then NVLink might not survive, but it'll be around for the next 3 years easily. Accelerators are here to stay, and with the shift to memory-semantic common topologies it's no longer going to matter what the device is or where it is.

 

If you really are that interested I'd say get into the industry so you can get some practical experience, and get to talk to the people using the clusters rather than just the people that build them. Experience lets you quickly cut through the fluff, because the majority of it is fluff; very rarely are all the nodes in a cluster active, and they're almost never running the same job or even jobs from the same batch. Running Linpack or similar is one of the only times.


1 hour ago, leadeater said:

if there were GPUs that supported it and more than just IBM with PCIe 4.0 CPUs.

Which begs the question: if PCIe 4.0 was superior, why did Nvidia bother developing NVLink, and why would IBM integrate both PCIe 4 and NVLink into Power9 and then use NVLink instead of PCIe 4 in their AC922 systems like Summit and Sierra?

 

The answer is that NVLink allows the CPU and GPU to have cache and memory coherence, in addition to raw bandwidth.

 

And their marketing isn't really crafty. But this is.

 

1 hour ago, leadeater said:

A single EPYC has a total of 126GB/s bi-directional PCIe bandwidth on offer, not a far cry from 150GB/s of NVLink, PCIe 4.0 anyone?

 

You're making it sound as if 126GB/s is going to be divided up the same way as the 150GB/s from six NVLink bricks, except there are no PCIe accelerators that use more than x16. When comparing it to PCIe 3 on a per-GPU basis it's 75GB/s each way per GPU for 2 GPUs per CPU.

75 vs 16 is hardly marketing fluff. For 3:1 nodes it's still 50GB/s per GPU vs 16 for PCIe.

 

1 hour ago, leadeater said:

On the other hand if it doubles it then you'd really want to opt for it, but then do you only need it between the GPUs.

 

Also RDMA to where to do what?

Unless you want to have easier access to the CPU's huge pool of RAM. RDMA between nodes.

 

1 hour ago, leadeater said:

bespoke nature of HPC interconnects are becoming less necessary because industry standard offerings have matured to the point that augmenting these is more cost effective and doesn't have a drawback in doing so, Summit and Sierra are standard Infiniband.

Except that several companies like Cray and Fujitsu are still making them, which was my original point. And they are necessary for exascale machines.

 

Infiniband itself emerged from competing HPC interconnect standards and was an HPC focused interconnect from the start.

 

Summit and Sierra use a pruned fat-tree InfiniBand network because it reduced cost. I would say it's less than ideal, but they're working around its inherent compromises very well.

 

1 hour ago, leadeater said:

very rarely are all the nodes in the clusters active and almost never running the same job or even from that same batch of jobs, running Linkpack or similar is one of the only times.

Unless it's a simulation that can only be run on a massive machine. If we only needed small systems, why bother making something like the K Computer, Summit, Frontier or all the exascale systems that are in the works? Summit was barely completed when they ran that genomics simulation on 4,000 of its nodes at 1.8 ExaOPS.

 

Sure, a lot of clusters that people run HPL on and submit to the Top500 are really just commodity clusters that rarely run real HPC workloads, but that's definitely not everyone.

 

And is HPE still working on Gen-Z? Seems like they said The Machine was going to revolutionize computing with its memristors, which turned into RAM with batteries. I'll believe their hype when I see something actually delivered.


 

On 12/17/2018 at 9:39 AM, mr moose said:

Why do people keep saying this "massive inventory"? Nvidia have claimed it will be back to normal by January and that the initial claims of high stock problems were inflated. If it isn't back to normal by then and we have some actual numbers to look at, I'd be interested in investigating it then, but for now it seems all we have is one or two out-of-context claims fueled by a huge drop in the market value. The 1080 Ti and low-end cards are not in stock, and mid-range cards don't seem to be falling in price (which is what one would expect from an oversupply of said GPUs) from either company.

 

I think they both learned from the last bitcoin debacle and this overstock problem is more an everyday issue that seems to have been caught in the wind with all the other hype going on.

 

 

[Screenshot: Nvidia inventory figures from the fiscal reports]

C'mon mate, this stuff isn't hard to find. Their inventory has steadily grown to the point of almost having doubled over the past year. This isn't something we're guessing at, it's literally in the fiscal reports.


3 hours ago, Amazonsucks said:

You're making it sound as if 126GB/s is going to be divided up the same way as the 150GB/s from six NVLink bricks, except there are no PCIe accelerators that use more than x16. When comparing it to PCIe 3 on a per-GPU basis it's 75GB/s each way per GPU for 2 GPUs per CPU.

75 vs 16 is hardly marketing fluff. For 3:1 nodes it's still 50GB/s per GPU vs 16 for PCIe.

You don't need to when the GPUs connect to the CPU at x16 and the GPUs use NVLink. You still actually need that extra bandwidth for it to matter; just look at Nvidia's own testing with a GPU limited to PCIe 3.0 x8. Had they used x16, the performance would have been very similar to NVLink, because if you look at the profiling tool the bandwidth utilization was 12GB/s.

 

If you don't need a lot of bandwidth between the CPU and the GPUs, you can drop down to x8 PCIe links and put more GPUs in the system than you could using NVLink, without implementing a backplane-switched NVLink solution like the DGX-2. Having a CPU that uses NVLink takes up sub-links, which means fewer GPUs per system, so the question is whether you need it, and that's not a global yes.

 

3 hours ago, Amazonsucks said:

Which begs the question,if PCIe 4.0 was superior, why did Nvidia bother developing NVLink and why would IBM integrate both PCIe 4 and NVLink into Power9, and then use NVLink instead of PCIe 4 in their AC922 systems like Summit and Sierra?

It's not superior, it's just well suited to connecting GPUs to the system. NVLink shines because you're putting interconnect controllers on each GPU; you could do the same with PCIe and have a 'PCIe NVLink', but there's a lot in the PCIe spec you don't need for that situation, so a purpose-designed interconnect is probably better (less power, for example).

 

3 hours ago, Amazonsucks said:

Unless you want to have easier access to the CPUs huge pool of RAM. RDMA between nodes.

NVLink doesn't enable this; you have this regardless. Again, this is a CUDA feature, not an NVLink feature, and RDMA between nodes is a network device feature, with the network device connected to the CPU via PCIe.

 

3 hours ago, Amazonsucks said:

Except that several companies like Cray and Fujitsu are still making them, which was my original point. And they are necessary for exascale machines.

Didn't we just literally cover how they are augmenting existing Ethernet technology? So they are not doing that from scratch.

 

3 hours ago, Amazonsucks said:

Summit and Sierra use a pruned fat tree infiniband network because it reduced cost. I would say its less than ideal but they're working around its inherent compromises very well.

Yeah, they cheaped out, right. Maybe it's exactly what is required. Seriously, don't you think that if they needed something better they would have gone for it? They had the money to.

 

3 hours ago, Amazonsucks said:

Unless its a simulation that can only be run on a massive machine. If we only needed small systems, why bother making something like K Computer, Summit, Frontier or all the exascale systems that are in the works? Summit was barely completed when they ran that genomics simulation on 4000 of its nodes at 1.8 ExaOPS. 

Because they are running the same simulation with different parameters hundreds of thousands of times, so distributing that across a very large cluster means you can do it quicker. I get the sense that you think the clusters are actually Massively Parallel Processing (MPP) systems, which are a very different thing. In an HPC cluster all the nodes are independent from each other; it isn't one system.

 

MPP systems make up a small percentage of the Top 500; those are your IBM BlueGene/Cray XC type systems. The rest are just clusters of independent servers connected via high-speed networking, with a workload manager such as SLURM dishing out work to the nodes. Depending on what is dished out you need to access your data set, potentially copy it to the node's local NVDIMM or other local storage, and there may also be inter-node communication as required by the job being run, as well as SLURM control traffic.

 

HPC cluster - Death by a thousand cuts

MPP - Death by one giant cut

 

3 hours ago, Amazonsucks said:

Sure, a lot of clusters that people run HPL on and submit to the Top500 are really just commodity clusters that rarely run real HPC workloads, but thats definitely not everyone.

CERN can only have something like half their nodes active at any one time; SLURM has rule sets in it to control where workloads get put and how many active jobs can be run per server, per rack, per row, per room, etc. Running Linpack across all the nodes really was a one-time deal.

 

CERN also primarily care about storage performance and data throughput to the compute nodes.

 

3 hours ago, Amazonsucks said:

And is HPE still working on Gen-Z? Seems like they said that The Machine was going to revolutionize computing with its memristors that turned into RAM with batteries. Ill believe their hype when i see something actually delivered.

They have already made The Machine and shown it working; it's now decommissioned. Gen-Z is an industry-wide technology and The Machine actually had very little to do with it.


5 hours ago, leadeater said:

so do you need it and that's not a global yes.

Oh absolutely, but that depends on what the machine does. In a lot of cases you wouldn't want GPUs at all either.

 

5 hours ago, leadeater said:

seriously you don't think if they need something better they wouldn't have gone for it? They had the money to.

They explicitly said they didn't. Their budget even caused them to buy half as much RAM and use a less-than-ideal interconnect topology with Sierra.

 

They figured out how to make the best of it, as the guy in the video and other people from LLNL have explained.

 

Summit and Sierra are the top 2 supers in the world, and very good systems, but they're relatively inexpensive compared to a no-compromise system like the K Computer. They didn't even necessarily want to use GPUs at LLNL, since they had a Blue Gene system before. ORNL had experience with them, and LLNL ended up with a 2:1 GPU:CPU ratio with Sierra instead of 3:1 like Summit.

 

5 hours ago, leadeater said:

Didn't we just literally cover how they were augmenting existing Ethernet technology so they are not doing that.

Well if you really want to call Tofu or Aries "augmented Ethernet". My point was that they're really much more than that.

 

5 hours ago, leadeater said:

They have already made The Machine and shown it working, it's now decommissioned.

The 160TB-of-DRAM-with-batteries-or-supercapacitors version from a couple of years back was interesting, but a far cry from its originally planned specs, which included memristors. I mean, I'll believe it when we see a real Machine deployed with some kind of truly persistent shared memory like Optane, since it seems memristors won't happen.

 

What's really interesting about The Machine is that they were limited by the Xeon's limited memory address space. Weren't they going to switch to ARM with larger physical addressing?

 

5 hours ago, leadeater said:

MPP systems make up a small percentage of the top 500

But they are arguably the most important ones and usually in the top 10 doing extremely important work.


58 minutes ago, Amazonsucks said:

But they are arguably the most important ones and usually in the top 10 doing extremely important work.

Summit and Sierra are not MPP, and neither are Tianhe-2A and ABCI. MPP makes up only 11.6% of the Top 500.

 

58 minutes ago, Amazonsucks said:

Well if you really want to call Tofu or Aries "augmented Ethernet". My point was that they're really much more than that.

If that's what you take away from what I said then that's your issue; you're the one oversimplifying what I said.


9 hours ago, Citadelen said:

 

[Screenshot: Nvidia inventory figures from the fiscal reports]

C'mon mate, this stuff isn't hard to find. Their inventory has steadily grown to the point of almost having doubled over the past year. This isn't something we're guessing at, it's literally in the fiscal reports.

You know that Intel's has steadily grown too; in fact it is not uncommon for inventory to shift up and down. But you must also know that the inventory listed above (which I would like a link to) doesn't necessarily just mean stock in the sales channel. Also, it being large doesn't necessarily mean it is a "huge problem". Companies can easily weather large stocks much of the time, and given that Nvidia's stock hasn't nosedived beyond much of the market, it appears many investors also don't consider it to be the "huge problem" everyone is making out.

 



21 hours ago, mr moose said:

You know that Intel's has steadily grown too; in fact it is not uncommon for inventory to shift up and down. But you must also know that the inventory listed above (which I would like a link to) doesn't necessarily just mean stock in the sales channel. Also, it being large doesn't necessarily mean it is a "huge problem". Companies can easily weather large stocks much of the time, and given that Nvidia's stock hasn't nosedived beyond much of the market, it appears many investors also don't consider it to be the "huge problem" everyone is making out.

 

Intel's inventory isn't building up. And yes, the market has reacted quite violently to Nvidia's growing problems; their stock is down 27% since their Q3 2018 earnings, and almost 50% since September.


4 hours ago, Citadelen said:

Intel's inventory isn't building up. And yes, the market has reacted quite violently to Nvidia's growing problems; their stock is down 27% since their Q3 2018 earnings, and almost 50% since September.

Most tech companies have taken a nosedive in the same time period. When you account for shifts in the entire market, Nvidia's change in value is not largely out of the ordinary, maybe 10% below their trending average? AMD lost 40%, Nvidia lost 48%. Intel (which I believe is largely considered a safe long-term bet by most investors) dropped 16%.

 

EDIT: also Intel's inventory is steadily growing.  https://www.stock-analysis-on.net/NASDAQ/Company/Intel-Corp/Analysis/Inventory



1 hour ago, mr moose said:

EDIT: also Intel's inventory is steadily growing.  https://www.stock-analysis-on.net/NASDAQ/Company/Intel-Corp/Analysis/Inventory

That whole inventory thing was not what people think it was; for Nvidia it was only a few weeks' worth. It was big for the OEM that returned them, but Nvidia never ended up swimming in unsold stock.

