Intel Stratix 10 Destroys Nvidia Titan XP in Machine Learning, With Beta Libraries

Just now, MandelFrac said:

Did you not read this article? The Titan XP and the Tesla P100 have exactly the same half- and quarter-precision performance. Both are decimated by the Stratix 10, and the Stratix 10 draws just 180 W at full tilt vs. 250 W for the Tesla P100.

I did, but the Tesla card was not mentioned. So arguably that $4200 card is 1.5x faster than the Titan XP but costs 3.5x as much (in current 32-bit applications)? Not much of a value proposition.

 

N.b. I'm playing devil's advocate here; I'm all for pushing performance further and advancing machine learning :P

Data Scientist - MSc in Advanced CS, B.Eng in Computer Engineering


Just now, randomhkkid said:

I did, but the Tesla card was not mentioned. So arguably that $4200 card is 1.5x faster than the Titan XP but costs 3.5x as much (in current 32-bit applications)? Not much of a value proposition.

 

N.b. I'm playing devil's advocate here; I'm all for pushing performance further and advancing machine learning :P

The P100 would be slower than the Titan XP by virtue of its lower clock speed, so the S10 is both cheaper and VASTLY faster, while also being much more efficient for this workload.


5 minutes ago, Dylanc1500 said:

Power 9 is what I'm referring to. I know Power 8 all too well. I develop and teach databases for clients that use them.

https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/1cb956e8-4160-4bea-a956-e51490c2b920/attachment/56cea2a9-a574-4fbb-8b2c-675432367250/media/POWER9-VUG.pdf

I'm on my phone, so it's a little difficult to elaborate further without taking forever.

Do you work for IBM? I'll be looking for a new job in a couple months. It'd be nice to get somewhere closer to the bleeding edge of software and hardware.


Just now, MandelFrac said:

The P100 would be slower than the Titan XP by virtue of its lower clock speed, so the S10 is both cheaper and VASTLY faster, while also being much more efficient for this workload.

What I'm saying is that, compared to the Titan XP, which you say has performance equivalent to or better than the P100, the S10 is only 1.5x faster at 3.5x the cost.

Data Scientist - MSc in Advanced CS, B.Eng in Computer Engineering


3 minutes ago, randomhkkid said:

I did, but the Tesla card was not mentioned. So arguably that $4200 card is 1.5x faster than the Titan XP but costs 3.5x as much (in current 32-bit applications)? Not much of a value proposition.

 

N.b. I'm playing devil's advocate here; I'm all for pushing performance further and advancing machine learning :P

The people this kind of accelerator is aimed at really don't care that much about price/performance, but about efficiency at the given task without sacrificing speed.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


Just now, randomhkkid said:

What I'm saying is that, compared to the Titan XP, which you say has performance equivalent to or better than the P100, the S10 is only 1.5x faster at 3.5x the cost.

Which makes little sense to say, since no one would use the TXP for real HPC: it has neither the high VRAM, nor the ECC, nor the HPC-validated drivers. The P100 is not an available product for purchase, ergo they tested with the nearest analogue.


1 minute ago, Dabombinable said:

The people this kind of accelerator is aimed at really don't care that much about price/performance, but about efficiency at the given task without sacrificing speed.

They care. Perf/watt/$ is the usual guiding metric.
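
For concreteness, here is that metric computed from the numbers quoted in this thread: a back-of-the-envelope sketch in C++, assuming the claimed 1.5x speedup, $4200 and 180 W for the Stratix 10, and roughly $1200 (i.e. $4200 / 3.5) and 250 W for the Titan XP as the 1.0x baseline.

#include <cstdio>

int main() {
    // Figures as claimed in this thread, not measured; the Titan XP is
    // the 1.0x performance baseline.
    struct Accel { const char* name; double perf, watts, usd; };
    const Accel cards[] = {
        {"Titan XP",   1.0, 250.0, 1200.0},
        {"Stratix 10", 1.5, 180.0, 4200.0},
    };
    for (const Accel& a : cards)
        std::printf("%-10s perf/W=%.5f perf/$=%.6f perf/(W*$)=%.9f\n",
                    a.name, a.perf / a.watts, a.perf / a.usd,
                    a.perf / (a.watts * a.usd));
}

On those numbers the S10 wins perf/W by about 2.1x but loses perf/$ by about 2.3x, and the combined perf/(W*$) still favours the Titan XP by roughly 1.7x, which is why both sides of this exchange can be right at once.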


4 minutes ago, MandelFrac said:

Which makes little sense to say, since no one would use the TXP for real HPC: it has neither the high VRAM, nor the ECC, nor the HPC-validated drivers. The P100 is not an available product for purchase, ergo they tested with the nearest analogue.

Plenty of university-level research, including doctoral-level work, is carried out on consumer GeForce-based cards because of funding constraints.

 

5 minutes ago, Dabombinable said:

The people this kind of accelerator is aimed at....really don't care about price/performance that much, but efficiency at the given task without sacrificing speed.

Sure, I understand that it may not be the main concern, but my original point is that hardware that can be customised for a task will outperform software targeted at general-purpose hardware.

 

The efficiency here is impressive. Do we know if learning can be split over multiple FPGAs like it can over multiple GPUs?

Data Scientist - MSc in Advanced CS, B.Eng in Computer Engineering


It may destroy a Titan XP, but what about the P6000?

Or the Quadro GP100?

I'm also interested to see how both of those compare to a Naples + MI25 machine.

Pixelbook Go i5 Pixel 4 XL

Just now, randomhkkid said:

Plenty of university-level research, including doctoral-level work, is carried out on consumer GeForce-based cards because of funding constraints.

 

Sure, I understand that it may not be the main concern, but my original point is that hardware that can be customised for a task will outperform software targeted at general-purpose hardware.

 

The efficiency here is impressive. Do we know if learning can be split over multiple FPGAs like it can over multiple GPUs?

For preliminary design and tests, yes; but the validation runs are done with real DP performance and validated drivers.

 

And yes, it can.

http://www.nallatech.com/nallatech-to-support-and-deliver-fpga-product-for-intel-quickpath-interconnect-qpi/

 

The Arria FPGAs have been integrated in pairs with an HMC tile between them for seismic simulations for ages, and learning is a highly parallel problem. If it can be spread among nodes in a cluster, of course it can be split among a couple of accelerators for far cheaper.
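
For readers wondering what "split among a couple of accelerators" looks like in practice, below is a minimal host-side sketch of the generic data-parallel pattern in OpenCL: enumerate the devices, give each one its own command queue and buffer, and hand each a slice of the batch. The "step" kernel is a placeholder for a real training step, and nothing here is Intel's actual FPGA toolflow; it's only the general pattern.

#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_platform_id plat;
    cl_device_id dev[8];
    cl_uint ndev = 0;
    clGetPlatformIDs(1, &plat, nullptr);
    // CL_DEVICE_TYPE_ALL so the sketch also runs on machines without a
    // dedicated accelerator; cap at 8 devices for the fixed array.
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 8, dev, &ndev);
    if (ndev == 0) return 1;
    if (ndev > 8) ndev = 8;

    // Placeholder kernel standing in for one training step.
    const char* src =
        "__kernel void step(__global float* w) {"
        "  w[get_global_id(0)] += 0.01f;"
        "}";

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, ndev, dev, nullptr, nullptr, &err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, ndev, dev, "", nullptr, nullptr);

    const size_t N = 1 << 20;        // the whole "batch"
    const size_t slice = N / ndev;   // per-device share (remainder elided)
    std::vector<float> w(N, 0.0f);

    std::vector<cl_command_queue> q(ndev);
    std::vector<cl_mem> buf(ndev);
    std::vector<cl_kernel> k(ndev);
    for (cl_uint i = 0; i < ndev; ++i) {
        q[i] = clCreateCommandQueue(ctx, dev[i], 0, &err);
        // Each device gets its own copy of its slice of the batch.
        buf[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                slice * sizeof(float), w.data() + i * slice, &err);
        k[i] = clCreateKernel(prog, "step", &err);
        clSetKernelArg(k[i], 0, sizeof(buf[i]), &buf[i]);
        clEnqueueNDRangeKernel(q[i], k[i], 1, nullptr, &slice, nullptr,
                               0, nullptr, nullptr);
    }
    for (cl_uint i = 0; i < ndev; ++i) clFinish(q[i]);
    std::printf("split %zu work-items across %u device(s)\n", N, ndev);
}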


5 minutes ago, Citadelen said:

It may destroy a Titan XP, but what about the P6000?

Or the Quadro GP100?

I'm also interested to see how both of those compare to a Naples + MI25 machine.

Given that both of those have lower single-, half-, and quarter-precision performance than the TXP (lower clock speed), the Stratix 10 will beat them both by even wider margins.


9 minutes ago, MandelFrac said:

Some workloads are far better for scale-out than scale-up. :)

That's all I was getting at. People seem to think there is a be-all, end-all system. This is precisely why Intel is delving heavily into different technologies for systems instead of just brute-forcing x86 or shoving in "moar cores."


Just now, Dylanc1500 said:

That's all I was getting at. People seem to think there is a be-all, end-all system. This is precisely why Intel is delving heavily into different technologies for systems instead of just brute-forcing x86 or shoving in "moar cores."

It was definitely high time for them to concede on needing higher SMT, though. Analytics just LOVES more threads without more cores. I'd argue the algorithm libraries used on IBM systems are just super cache-unfriendly, but hey, you can't argue with the benchmarks unless you want to stick your neck out and accuse SAP of foul play. ;)
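
As a toy illustration of why analytics loves threads: a column scan is mostly memory stalls, and extra SMT threads give the core something useful to do while another thread waits on DRAM. A minimal OpenMP sketch in C++ (the column and predicate are made up; rerun it with different OMP_NUM_THREADS values to see the scaling):

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    // A toy analytics scan: count rows matching a predicate over a big
    // column. Memory-bound loops like this often gain from extra SMT
    // threads because a stalled thread cheaply yields the core's pipelines.
    std::vector<int> column(1 << 26, 1);
    long hits = 0;
    #pragma omp parallel for reduction(+ : hits)
    for (size_t i = 0; i < column.size(); ++i)
        hits += (column[i] > 0);
    std::printf("threads=%d hits=%ld\n", omp_get_max_threads(), hits);
}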


30 minutes ago, MandelFrac said:

Not really. A flagship Tesla? $5000 USD if you buy in bulk, closer to $6000 for a single unit. The Stratix 10? $4200 apiece for the 16GB HBM2 version, $3900 for the bare die.

$4k for the module, but expect at least $5k for a usable board. I was expecting more, though, having seen some development boards around $25k.

 

The problem with FPGAs is the time and money required to program your use case. While you can get some IP (also expensive), it needs to be fine-tuned and optimized for every task. This adds up quickly.

Mineral oil and 40 kg aluminium heat sinks are a perfect combination: 73 cores and a Titan X, Twenty Thousand Leagues Under the Oil


4 minutes ago, Stefan1024 said:

$4k for the module, but expect at least $5k for a usable board. I was expecting more, though, having seen some development boards around $25k.

 

The problem with FPGAs is the time and money required to program your use case. While you can get some IP (also expensive), it needs to be fine-tuned and optimized for every task. This adds up quickly.

Nope, it's $4200 USD for a PCIe accelerator card.

 

But the big-iron Omni-Path 20x100G switches with one of these housed inside, fully integrated with programmable fabric logic and drivers, are a whopping $600,000 USD.

 

And Intel's been furiously racing Xilinx to make them easier to program. OpenCL tooling has already greatly lowered the barrier, and now Intel is opening up beta OpenMP/OpenACC/SYCL toolchains. No bleeding-edge tech is easy up front, but it's getting easier by the month.
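
To make "lowered the barrier" concrete: in the OpenCL flow, an FPGA kernel is just ordinary OpenCL C that an offline compiler turns into fabric logic; no HDL required. A minimal sketch; the compiler invocation in the comment is from memory, so treat it as an assumption rather than the documented flow.

// vadd.cl: a trivial FPGA kernel. With Intel's toolchain this would be
// compiled offline to a bitstream (something like "aoc vadd.cl", yielding
// vadd.aocx); the exact command and flags are an assumption here.
__kernel void vadd(__global const float* restrict a,
                   __global const float* restrict b,
                   __global float* restrict c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];   // synthesized as a pipelined adder in fabric
}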


1 minute ago, MandelFrac said:

It was definitely high time for them to concede on needing higher SMT though. Analytics just LOVES more threads without more cores, though I'd argue the algorithm libraries used on IBM systems are just super cache-unfriendly, but hey, can't argue with the benchmarks unless you want to stick your neck out and accuse SAP of foul play. ;)

I agree completely. lol, isn't that what most people do nowadays with benchmarks? Honestly, I want IBM to make a modernized PowerPC just to see what they could create.


Just now, Dylanc1500 said:

I agree completely. lol, isn't that what most people do nowadays with benchmarks? Honestly, I want IBM to make a modernized PowerPC just to see what they could create.

I think IBM is happy to live off of licenses and open initiatives after 2020. The hardware design costs are just becoming too prohibitive for the 800% profit margins xD


1 minute ago, MandelFrac said:

I think IBM is happy to live off of licenses and open initiatives after 2020. The hardware design costs are just becoming too prohibitive for the 800% profit margins xD

I legitimately laughed out loud at that while walking into the office and one of the secretaries looked at me like I was stupid. I will now officially be that guy that randomly laughs to himself.

At least we don't have packages like this anymore:

[attached image: old chip package]


10 minutes ago, Dylanc1500 said:

I legitimately laughed out loud at that while walking into the office and one of the secretaries looked at me like I was stupid. I will now officially be that guy that randomly laughs to himself.

At least we don't have packages like this anymore:

 

Well, there is that obnoxious 1200 mm² AI chip IBM developed and demoed a couple of years back...

 

So I take it you work in one of the U.S. offices, and based on the time, you're in either the Mountain or Pacific time zone. I'd seriously like to get out of building web front ends and API providers for internal customers at a bank; it's literally a waste of my talent. I run out of work on a constant basis but can't seem to convince our solutions architect that I can do more, even after working for him for 7 months...


18 minutes ago, MandelFrac said:

Nope, it's $4200 USD for a PCIe accelerator card.

 

But the big-iron Omni-Path 20x100G switches with one of these housed inside, fully integrated with programmable fabric logic and drivers, are a whopping $600,000 USD.

 

And Intel's been furiously racing Xilinx to make them easier to program. OpenCL tooling has already greatly lowered the barrier, and now Intel is opening up beta OpenMP/OpenACC/SYCL toolchains. No bleeding-edge tech is easy up front, but it's getting easier by the month.

Using tools like OpenCL or OpenMP that were developed and optimized for completely different hardware is difficult. But if it works well, it would be great.

 

Also, compiling the code can take several hours on a high-end workstation.

 

Mineral oil and 40 kg aluminium heat sinks are a perfect combination: 73 cores and a Titan X, Twenty Thousand Leagues Under the Oil


13 minutes ago, Dylanc1500 said:

I legitimately laughed out loud at that while walking into the office and one of the secretaries looked at me like I was stupid. I will now officially be that guy that randomly laughs to himself.

At least we don't have packages like this anymore:

[attached image: old chip package]

This thing was the maximum one could reach with the technology of the time.

Manufacturers have gotten lazy or just cheap out nowadays and don't do it anymore, which is a shame.

Mineral oil and 40 kg aluminium heat sinks are a perfect combination: 73 cores and a Titan X, Twenty Thousand Leagues Under the Oil


2 minutes ago, Stefan1024 said:

Using tools like OpenCL or OpenMP that were developed and optimized for completely different hardware is difficult. But if it works well, it would be great.

 

Also, compiling the code can take several hours on a high-end workstation.

 

Hours?! Excuse me? Even I can program the full logic of an Arria in 20 minutes, compilation and deployment included. Whoever built your makefile and/or the program architecture itself needs a kick in the groin.
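
Part of why those two experiences differ so wildly: in the FPGA OpenCL flow, the hours go into the offline place-and-route of the kernel, which only reruns when the kernel itself changes; the host program just loads the precompiled image in seconds. A hedged sketch of that load path ("vadd.aocx" is a hypothetical file name, and error handling is pared down):

#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Read a precompiled FPGA image from disk (e.g. an .aocx produced offline).
static std::vector<unsigned char> read_file(const char* path) {
    std::vector<unsigned char> buf;
    if (FILE* f = std::fopen(path, "rb")) {
        std::fseek(f, 0, SEEK_END);
        buf.resize(std::ftell(f));
        std::fseek(f, 0, SEEK_SET);
        std::fread(buf.data(), 1, buf.size(), f);
        std::fclose(f);
    }
    return buf;
}

int main() {
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);

    std::vector<unsigned char> img = read_file("vadd.aocx");
    if (img.empty()) return 1;
    const unsigned char* p = img.data();
    size_t len = img.size();
    // clCreateProgramWithBinary skips compilation entirely; the slow
    // place-and-route already happened offline.
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len, &p,
                                                nullptr, &err);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
    std::printf("program load: %s\n", err == CL_SUCCESS ? "ok" : "failed");
}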


1 minute ago, MandelFrac said:

Well, there is that obnoxious 1200 mm² AI chip IBM developed and demoed a couple of years back...

 

So I take it you work in one of the U.S. offices, and based on the time, you're in either the Mountain or Pacific time zone. I'd seriously like to get out of building web front ends and API providers for internal customers at a bank; it's literally a waste of my talent. I run out of work on a constant basis but can't seem to convince our solutions architect that I can do more, even after working for him for 7 months...

I forgot about that thing; they wanted to live up to the "big iron" name more literally.

 

Central. It's only a formal office day for me when I have conference calls with my boss or clients, or actual paperwork; otherwise I'm traveling to clients. Honestly, people seem to greatly dislike the company I work for.


1 hour ago, xentropa said:

ARM and POWER 8 are both reduced instruction set computing (RISC) architectures.

 

It is true that x86 binaries are smaller and higher-performing, which I believe is enabled by complex instructions that can be pipelined entirely within the CPU itself. (ARM doesn't have the same instruction sets, so commands must be "pipelined" during software compilation into DRAM before being sent to the CPU, meaning ARM programs will require more RAM and higher memory bandwidth.) However, there is a downside: any instruction set or portion of the CPU that isn't used (by the average Joe checking his email and browsing the web) will sit idling on the CPU, resulting in greater power consumption and reduced wafer yield due to a larger die size. Both of these increase costs for the consumer and the manufacturer.

 

ARM has the entire mobile industry, a ship Intel unfortunately missed.

Not extremely relevant to the conversation, but x86 actually uses RISC internally for efficiency reasons. It uses a CISC wrapper (it appears CISC to the outside) in order to maintain compatibility. This has been the case since the Pentium Pro.

Also, the argument of complexity for ARM vs. x86 is getting less and less true. While x86 is certainly very complex because it has to maintain legacy support (plus the inherent complexity of x86 in the first place), ARM has been getting increasingly complex as well, with ARMv8 actually getting very close, implementing more and more of the features seen in desktop processors (vector instructions, FP support, NEON, SIMD, etc.).
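
A concrete taste of those ARMv8 features: a four-wide single-precision add written with the NEON intrinsics from arm_neon.h. A minimal sketch; it only builds for AArch64 targets.

#include <arm_neon.h>
#include <cstdio>

int main() {
    // Four single-precision adds per instruction via ARMv8 NEON; the
    // kind of desktop-class SIMD/FP support mentioned above.
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    float32x4_t va = vld1q_f32(a);      // load 4 floats into a vector register
    float32x4_t vb = vld1q_f32(b);
    float32x4_t vc = vaddq_f32(va, vb); // one vectorized add
    vst1q_f32(c, vc);                   // store the 4 results
    std::printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
}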

AMD Ryzen R7 1700 (3.8ghz) w/ NH-D14, EVGA RTX 2080 XC (stock), 4*4GB DDR4 3000MT/s RAM, Gigabyte AB350-Gaming-3 MB, CX750M PSU, 1.5TB SDD + 7TB HDD, Phanteks enthoo pro case


2 minutes ago, Dylanc1500 said:

I forgot about that thing; they wanted to live up to the "big iron" name more literally.

 

Central. It's only a formal office day for me when I have conference calls with my boss or clients, or actual paperwork; otherwise I'm traveling to clients. Honestly, people seem to greatly dislike the company I work for.

That's b/c, as a SaaS provider, IBM is damn awful. OPS caused no fewer than 86 companies in Australia a ton of grief last week when payroll and leave data couldn't be accessed for 36 hours.

