
Intel Stratix 10 Destroys Nvidia Titan XP in Machine Learning, With Beta Libraries

MandelFrac

https://www.nextplatform.com/2017/03/21/can-fpgas-beat-gpus-accelerating-next-generation-deep-learning/

 


 

Quote

Can FPGAs beat GPUs in performance for next-generation DNNs? Intel’s evaluation of various emerging DNNs on two generations of FPGAs (Intel Arria 10 and Intel Stratix 10) and the latest Titan X GPU shows that current trends in DNN algorithms may favor FPGAs, and that FPGAs may even offer superior performance. While the results described are from work done in 2016, the Intel team continues testing Intel FPGAs for modern DNN algorithms and optimizations (e.g., FFT/winograd math transforms, aggressive quantizations, compressions). The team also pointed out FPGA opportunities for other irregular applications beyond DNNs, and on latency sensitive applications like ADAS and industrial uses.

“The current ML problems using 32-bit dense matrix multiplication is where GPUs excel. We encourage other developers and researchers to join forces with us to reformulate machine learning problems to take advantage of the strength of FPGAs using smaller bit processing because FPGAs can adapt to shifts toward lower precision,” says Huang. --As in Jen-Hsun Huang, CEO of Nvidia--

So it seems the battle in supercomputers may shift away from GPGPUs to more efficient processors, even if customizing them on the fly is currently a bit painful. All in all it looks like exciting times ahead, and the race to Exascale will be a tight one between AMD, Nvidia, Intel, and Xilinx. It should be noted that the Stratix 10 is nowhere near Intel's max die size tolerances for 14nm, and in fact it's only 70% the size, meaning there's ~40% more performance to squeeze in on the same node with no architecture change. If only I could afford one :ph34r:
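Quick back-of-the-envelope check on that die-size headroom, as a sketch (the 70% figure is from above; the actual 14nm reticle limit is just treated as "whatever the max is"):

# If the Stratix 10 die is only ~70% of the largest die Intel will build
# on 14nm, the scale factor up to a max-size die follows directly.
current_fraction = 0.70            # "only 70% the size"
headroom = 1.0 / current_fraction  # ~1.43x
print(f"A max-size die would be {headroom:.2f}x the area, "
      f"i.e. ~{(headroom - 1) * 100:.0f}% more silicon on the same node.")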


2 minutes ago, xentropa said:

So have they finally turned to the ARM side?

No. Intel has simply committed to Altera's existing customers that they can keep their ARM compatibility while they are transitioned to x86 for future products. Brian Krzanich said as much when the Stratix 10 was unveiled. It's better to give ARM a tiny cut now than to lose a big chunk of the market to Xilinx for the next five years.


This is why AMD and Nvidia won't let Intel make a dGPU...

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


I don't know.

 

CISC and x86 were designed so that the OS and software industry didn't have to build very complicated operating systems, since many instructions could be executed at the hardware level, courtesy of Intel's complex instruction set architecture.

 

With ARM, the approach is to reduce the hardware complexity so that chip makers can cut manufacturing costs, in an era where software and operating systems are becoming more complex. The downside is that OS and program compilation becomes more complicated, with fewer instructions at hand.

 

So what is a bit confusing to me, if that's true, is why Intel would spend more money and resources on an expensive CPU architecture just to make the OS and software industry's work easier, especially when that industry has already adapted to the more difficult ARM programming model?


Just now, xentropa said:

I don't know.

 

CISC and x86 were designed so that the OS and software industry didn't have to build very complicated operating systems, since many instructions could be executed at the hardware level, courtesy of Intel's complex instruction set architecture.

With ARM, the approach is to reduce the hardware complexity so that chip makers can cut manufacturing costs, in an era where software and operating systems are becoming more complex. The downside is that OS and program compilation becomes more complicated, with fewer instructions at hand.

So what is a bit confusing to me, if that's true, is why Intel would spend more money and resources on an expensive CPU architecture just to make the OS and software industry's work easier, especially when that industry has already adapted to the more difficult ARM programming model?

Because x86 binaries remain much smaller and higher-performing. The only real exception is I/O, where ARM still hasn't caught up to x86, while x86 itself is only now catching up to Power 8. If you analyze the SAP business system benchmarks across x86, ARM, and Power 8, you'll find Intel takes the crown for raw calculation throughput, IBM takes the crown for database I/O and analytics, and ARM gets nothing.


ARM and POWER 8 are both reduced instruction set computing (RISC) architectures.

 

It is true that x86 binaries are smaller and higher-performing, which I believe is enabled by complex instructions that can be pipelined entirely within the CPU itself. (ARM doesn't have the same instructions, so commands must be "pipelined" during software compilation into DRAM before being sent to the CPU, meaning ARM programs will require more RAM and higher memory bandwidth.) However, there is a downside: any instruction-set hardware that isn't used (by the average joe checking his email and browsing the web) sits idle on the CPU, resulting in greater power consumption and a reduced wafer yield due to a larger die size, both of which increase costs for the consumer and the manufacturer.

 

ARM has the entire mobile industry, a ship Intel unfortunately missed.


Just now, xentropa said:

ARM and POWER 8 are both reduced instruction set computing (RISC) architectures.

It is true that x86 binaries are smaller and higher-performing, which I believe is enabled by complex instructions that can be pipelined entirely within the CPU itself. (ARM doesn't have the same instructions, so commands must be "pipelined" during software compilation into DRAM before being sent to the CPU, meaning ARM programs will require more RAM and higher memory bandwidth.) However, there is a downside: any instruction-set hardware that isn't used (by the average joe) sits idle on the CPU, resulting in greater power consumption and a reduced wafer yield due to a larger die size, both of which increase costs for the consumer and the manufacturer.

 

ARM has the entire mobile industry, a ship Intel unfortunately missed.

Nope, unused logic is literally powered off when it isn't in use, just like on an ARM core today. See AMD's Carrizo presentations.

 

And die size has nothing to do with the instruction set. It has everything to do with library density and performance. ARM is still VASTLY behind x86 in performance, trading it for efficiency and size.

 

ARM won't keep it imho. Now that China owns it and ARM v9 won't be open source, no way in Hell.


If x86 uses "hardware-based calculations" for its instruction set, it requires more transistors, which require more physical space on the silicon die.  I am pretty sure x86 processors have a lot more transistors (per core) than any ARM processor on the same process node.

 

I don't think ARM in particular will break into the supercomputing arena, but RISC... I don't know.


6 minutes ago, xentropa said:

If x86 uses "hardware-based calculations" for its instruction set, it requires more transistors, which require more physical space on the silicon die.  I am pretty sure x86 processors have a lot more transistors than any ARM processor on the same process node.

 

I don't think ARM in particular will break into the supercomputing arena, but RISC... I don't know.

Even though Intel's 4790K uses fewer transistors than the Apple A8 (1.4 billion for Intel, 2 billion for Apple): https://en.wikipedia.org/wiki/Transistor_count

 

It's not the transistor count that's the problem. Intel pursued performance first and is only now, decades later, building efficiency and density into its design libraries. ARM went efficiency-first and is slowly building up performance.

 

And just to keep things crystal clear, Intel's Stratix 10 packs more than 30 billion transistors and fits in less space than the full 24-core Broadwell E7 Xeon, which takes just 7.2 billion.


Just now, xentropa said:

They are SoCs but alright.

So is the 4790K, minus a wireless module (roughly 200 million transistors) and the south bridge, which the A8 would have a MUCH smaller version of (~80 million). So seriously... x86 uses way fewer transistors to achieve way higher performance, even if that comes at the cost of efficiency and the ability to scale to ultra-low power.


From a performance standpoint it's actually not all that surprising that an FPGA custom-built for the maths of DNN computation will outperform a GPU. GPUs are multi-purpose chips that just happen to be good at compute because of their highly parallel nature.

Recent changes in large CNNs and DNNs are a move to 16-bit operations as opposed to 32-bit for increased speed, despite the loss in precision. This should bring speed benefits to training networks on GPUs, as twice the amount of information can be fed through at once.

TL;DR It's not very surprising that custom hardware can outperform general-purpose GPUs. Think of it like how recent phones can decode 4K 60fps video without breaking a sweat while your laptop lags on the same file.
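To put a number on the 16-bit point, here's a minimal NumPy sketch (my own illustration, not from the article): dropping a weight matrix from fp32 to fp16 literally halves its memory footprint, which is where the bandwidth and throughput win comes from, at the cost of a small rounding error.

import numpy as np

# A toy fully-connected layer's weights at single vs. half precision.
weights_fp32 = np.random.randn(4096, 4096).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

print(f"fp32: {weights_fp32.nbytes / 2**20:.0f} MiB")  # 64 MiB
print(f"fp16: {weights_fp16.nbytes / 2**20:.0f} MiB")  # 32 MiB

# The rounding error from dropping to 16 bits is tiny relative to
# typical weight magnitudes, which is why many networks tolerate it.
err = np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max()
print(f"max abs rounding error: {err:.1e}")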

Data Scientist - MSc in Advanced CS, B.Eng in Computer Engineering


1 minute ago, randomhkkid said:

From a performance standpoint it's actually not all that surprising that an FPGA custom-built for DNN computation will outperform a GPU. GPUs are multi-purpose chips that just happen to be good at compute because of their highly parallel nature.

Recent changes in large CNNs and DNNs are a move to 16-bit operations as opposed to 32-bit for increased speed, despite the loss in precision. This should bring speed benefits to training networks on GPUs, as twice the amount of information can be fed through at once.

TL;DR It's not very surprising that custom hardware can outperform general-purpose GPUs. Think of it like how recent phones can decode 4K 60fps video without breaking a sweat while your laptop lags on the same file.

It wasn't custom-built for DNN. In fact, the Stratix 10 was designed to be coupled with the OmniScale 200G fabric to do in-interconnect map-reduce or slightly higher-order math. Since the bandwidth of the processor itself is multiple TB/s, it can sustain a 20x200G switch.

But just utilising the underlying 10 TFLOPS of single-precision DSP (up to 40 at quarter precision, including integer ops) has let Intel's hardware outcompete Nvidia's equally general-purpose hardware. Intel hasn't even brought the logic elements into play yet, and those can become n-bit ALUs/FPUs if so desired.
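For reference, here's how that precision scaling works out, as a rough sketch that assumes peak throughput scales linearly as the operand width shrinks (the 10 TFLOPS single-precision figure is the one quoted above):

# Peak rate if a fixed pool of DSP resources scales with 32 / operand width.
base_tflops = 10.0  # single precision (32-bit)
for bits in (32, 16, 8):
    print(f"{bits:>2}-bit: ~{base_tflops * 32 / bits:.0f} TFLOPS peak")
# -> 10 at 32-bit, 20 at 16-bit, 40 at 8-bit ("quarter precision")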


Just now, MandelFrac said:

It wasn't custom-built for DNN. In fact, the Stratix 10 was designed to be coupled with the OmniScale 200G fabric to do in-interconnect map-reduce or slightly higher-order math. Since the bandwidth of the processor itself is multiple TB/s, it can sustain a 20x200G switch.

But just utilising the underlying 10 TFLOPS of single-precision DSP (up to 40 at quarter precision, including integer ops) has let Intel's hardware outcompete Nvidia's equally general-purpose hardware. Intel hasn't even brought the logic elements into play yet, and those can become n-bit ALUs/FPUs if so desired.

Ah I see, I didn't realise that. However, it still holds that the GPU is a multi-purpose chip versus one designed solely for maths, which is what the backpropagation algorithms of DNNs need.
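To make the "backprop is just maths" point concrete, here's a minimal NumPy sketch of one training step for a single fully-connected layer with a squared-error loss (a toy example of my own, nothing to do with Intel's or Nvidia's libraries). Both the forward and backward passes boil down to dense matrix multiplies, which is exactly the throughput these chips are fighting over.

import numpy as np

# One fully-connected layer with a squared-error loss.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256))         # batch of 64 inputs
W = 0.01 * rng.standard_normal((256, 10))  # layer weights
y_true = rng.standard_normal((64, 10))     # toy targets

def loss(W):
    return float(np.mean((x @ W - y_true) ** 2))

before = loss(W)

y_pred = x @ W                                   # forward pass: one matmul
grad_out = 2.0 * (y_pred - y_true) / y_pred.size
grad_W = x.T @ grad_out                          # backward pass: another matmul
W = W - 0.5 * grad_W                             # gradient-descent update

print(f"loss before: {before:.4f}, after one step: {loss(W):.4f}")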

Data Scientist - MSc in Advanced CS, B.Eng in Computer Engineering


Just now, randomhkkid said:

Ah I see, I didn't realise that. However, it still holds that the GPU is a multi-purpose chip versus one designed solely for maths, which is what the backpropagation algorithms of DNNs need.

It wasn't just built for math(s) either. FPGAs are used for software-defined networking, and Alibaba is using Intel's to build its cluster interconnect in a way that lets customers create custom-sized instances on a level AWS absolutely cannot compete with at the moment, because AWS's switching capabilities are hard-wired into ASIC-based Ethernet controllers. The math elements are there so you can use something like the MPI or HPX libraries to make analysis/map-reduce functions and have the results collated in-fabric by vastly more efficient processors than CPUs/GPUs. But the key advantage of using FPGAs as your network backbone is that you can partition off entire portions of a given system on the fly, and you instantly gain failover if a NIC or a portion of the FPGA itself somehow dies.

 

No one in the industry other than Intel offers this capability, which is why a 20x Omnipath switch will set you back 600,000 USD.
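If you're wondering what that in-fabric map-reduce looks like from the software side, here's a minimal mpi4py sketch (purely illustrative: the program just calls a standard allreduce, and whether the reduction is collated by the interconnect or by the host CPUs is up to the fabric underneath).

# Run with e.g.: mpirun -n 4 python allreduce_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank computes a partial result (here just a toy vector).
local = np.full(4, float(rank))

# Allreduce sums the partial results across every rank; on fabrics with
# in-network compute, that reduction can happen in the interconnect
# instead of burning CPU/GPU cycles on every node.
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    print("sum across ranks:", total)  # [6. 6. 6. 6.] with 4 ranks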


25 minutes ago, xentropa said:

If x86 uses "hardware-based calculations" for its instruction set, it requires more transistors, which require more physical space on the silicon die.  I am pretty sure x86 processors have a lot more transistors (per core) than any ARM processor on the same process node.

 

I don't think ARM in particular will break into the supercomputing arena, but RISC... I don't know.

If you, or anyone else, really want to understand x86 and how it computes I suggest reading this: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf

Also, I have read all of them. I had to do the same thing with Power9. I can tell you this: there is a place for each. IBM is heavily pushing scalability and offering four threads per core.


2 minutes ago, Dylanc1500 said:

If you, or anyone else, really want to understand x86 and how it computes I suggest reading this: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf

Also, I have read all of them. I had to do the same thing with Power9. I can tell you this: there is a place for each. IBM is heavily pushing scalability and offering four threads per core.

8*.

 

IBM offers up to 96 threads on its 12-core Power 8 chips. I have not heard any detail of IBM lowering that count, and Intel is finally moving up from 2 to 4 threads per core on its KNL architecture, supposedly doing the same with the Skylake-EX Xeons (but, interestingly enough, not EP, if the rumors hold true).


2 minutes ago, MandelFrac said:

It wasn't just built for math(s) either. FPGAs are used for software-defined networking, and Alibaba is using Intel's to build its cluster interconnect in a way that lets customers create custom-sized instances on a level AWS absolutely cannot compete with at the moment, because AWS's switching capabilities are hard-wired into ASIC-based Ethernet controllers. The math elements are there so you can use something like the MPI or HPX libraries to make analysis/map-reduce functions and have the results collated in-fabric by vastly more efficient processors than CPUs/GPUs. But the key advantage of using FPGAs as your network backbone is that you can partition off entire portions of a given system on the fly, and you instantly gain failover if a NIC or a portion of the FPGA itself somehow dies.

 

No one in the industry other than Intel offers this capability, which is why a 20x Omnipath switch will set you back 600,000 USD.

FPGAs are just programmable logic; technically they're used for chip design or, in rarer cases like the one you describe, as a way to create custom instances. My point is that they can be reprogrammed for specific tasks, unlike a GPU, which runs programs without changing the underlying hardware. That aside, FPGAs are very, very expensive compared to your average GPU.

Data Scientist - MSc in Advanced CS, B.Eng in Computer Engineering


2 minutes ago, randomhkkid said:

FPGAs are just programmable logic; technically they're used for chip design or, in rarer cases like the one you describe, as a way to create custom instances. My point is that they can be reprogrammed for specific tasks, unlike a GPU, which runs programs without changing the underlying hardware. That aside, FPGAs are very, very expensive compared to your average GPU.

Not really. A flagship Tesla? $5000 USD if you buy in bulk, closer to $6000 for a single card. A Stratix 10? $4200 apiece for the 16GB HBM2 version, $3900 for the bare die.


Just now, MandelFrac said:

Not really. A flagship Tesla? $5000 USD if you buy in bulk? Stratix 10? $4200 apiece.

I did say average haha. Any info on how it would perform against that $5k Tesla?

Data Scientist - MSc in Advanced CS, B.Eng in Computer Engineering


3 minutes ago, randomhkkid said:

I did say average haha. Any info on how it would perform against that $5k Tesla?

Did you not read this article? The Titan XP and the Tesla P100 have the exact same half- and quarter-precision performance. They're decimated by the Stratix 10, and the Stratix 10 draws just 180W at full tilt vs. 250W for the P100 Tesla. And this is without the logic elements implemented in the software library yet, so we're far from seeing what the Stratix 10 can do, whereas Nvidia has tuned AlexNet to within a few percentage points of its maximum capability.

The Stratix 10 is also THE most powerful FPGA in the industry at 30 billion transistors, and only ~450 mm², whereas the KNL Xeon Phi goes out to 683 mm². Intel can pump this up in ways Nvidia stands no chance of matching currently.


1 minute ago, MandelFrac said:

8*.

 

IBM offers up to 96 threads on its 12-core Power 8 chips. I have not heard any detail of IBM lowering that count, and Intel is finally moving up from 2 to 4 threads per core on its KNL architecture, supposedly doing the same with the Skylake-EX Xeons (but, interestingly enough, not EP, if the rumors hold true).

Power 9 is what I'm referring to. I know Power 8 all too well. I develop and teach databases for clients that use them.

https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/1cb956e8-4160-4bea-a956-e51490c2b920/attachment/56cea2a9-a574-4fbb-8b2c-675432367250/media/POWER9-VUG.pdf

I'm on my phone, so it's a little difficult to elaborate further without taking forever.


1 minute ago, Dylanc1500 said:

Power 9 is what I'm referring to. I know Power 8 all too well. I develop and teach databases for clients that use them.

https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/1cb956e8-4160-4bea-a956-e51490c2b920/attachment/56cea2a9-a574-4fbb-8b2c-675432367250/media/POWER9-VUG.pdf

I'm on my phone, so it's a little difficult to elaborate further without taking forever.

Available as SMT8 or 4, depending on workload needs. Okay, not ground-breaking news. Some workloads are far better for scale-out than scale-up. :)

