
More Knight's Landing Details Surface!

http://cdn3.wccftech.com/wp-content/uploads/2015/04/SZ15_SFTS003_100_ENGf.pdf

 

WCCFTech obtained the slide deck for an upcoming presentation at IDF 2015 about Knight's Landing. Inside there is a whole bunch of jargon about revolutionizing parallel computing, but most important are the specs for Intel's new 715mm^2 beast of a chip, which is dwarfed in size only by IBM's newest AI CPU. Mind you, this is built on the 14nmFF SHP process too. Transistor counts are not yet known.

 

Knight's Landing will come in variants with core counts ranging from 60 to 72, in both socketed-package and Xeon Phi form. All socketed packages will come with 6-channel DDR4 2400MHz support capable of addressing up to 384GB each. Also, the large L3 cache of the original model has been replaced by 8 or 16GB of JEDEC-compliant 3D-stacked HBM produced by Micron on the Intel 14nmFF process (a shot across the bow to SK Hynix). There is still a 36MB shared L2 cache.

 

The HBM is supposed to offer over 1TB/s of internal bandwidth thanks to its 8-channel configuration.
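To put the two memory pools side by side, here is a rough sketch of the peak DDR4 bandwidth implied by the figures above. It assumes the standard 64-bit (8-byte) data bus per DDR4 channel, which the slides quoted here do not spell out, and it treats "over 1TB/s" as a flat 1000 GB/s lower bound. The function and variable names are just for illustration.

```python
# Rough peak-bandwidth comparison using the numbers quoted in the post.
DDR4_CHANNELS = 6               # from the post
DDR4_TRANSFERS_PER_S = 2400e6   # DDR4-2400: 2400 mega-transfers per second
BYTES_PER_TRANSFER = 8          # assumption: 64-bit data bus per channel


def ddr4_peak_gb_per_s(channels, transfers_per_s,
                       bytes_per_transfer=BYTES_PER_TRANSFER):
    """Theoretical peak bandwidth in GB/s (decimal gigabytes)."""
    return channels * transfers_per_s * bytes_per_transfer / 1e9


ddr4_gbs = ddr4_peak_gb_per_s(DDR4_CHANNELS, DDR4_TRANSFERS_PER_S)

HBM_TOTAL_GBS = 1000.0          # "over 1TB/s" from the post, used as a lower bound
HBM_CHANNELS = 8                # from the post
hbm_per_channel_gbs = HBM_TOTAL_GBS / HBM_CHANNELS

print(f"DDR4 peak:        {ddr4_gbs:.1f} GB/s")
print(f"HBM per channel:  {hbm_per_channel_gbs:.1f} GB/s (at least)")
print(f"HBM / DDR4 ratio: {HBM_TOTAL_GBS / ddr4_gbs:.1f}x (at least)")
```

Under those assumptions the six DDR4 channels top out at roughly 115 GB/s, so the on-package memory would be somewhere north of eight times faster.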

 

Also be aware this is PCIe 4.0 compliant (double your PCIe bandwidth), as it will be released with Skylake-E in 2016.

 

The theoretical performance figure many are currently quoting is 6TFlops SP and 3TFlops DP, but new to this conversation is that this is for the lowest-end version (60 cores). Clock speeds have not been given.

 

We can directly estimate the clock speed for the lowest model like so:

Edit: I forgot there are 2 AVX-512 units per Silvermont core.

6*10^12 FLOPS = 60 cores * (1024/32 = 32 SIMD units/core) * (2 FMA + 1 M = 3 op/(SIMD*cycle)) * cycles/second

clockspeed = 6*10^12 / (60 * 32 * 3) = 1.0416666...*10^9 cycles/second = 1.0416666... GHz

 

72 * 32 * 3 * 1.0416666...*10^9 = ~7.2TFlops SP / 3.6TFlops DP

It should be noted the previous (currently selling) generation increases in clock speed as core counts increase, so in truth these numbers are a bit conservative for the 72-core model.
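The estimate above can be re-run as a short script. This is only a sketch of the same back-of-the-envelope arithmetic: the 3 ops per SIMD lane per cycle (2 for an FMA plus 1 multiply) is the assumption made in this post, not a confirmed Intel figure, and the function names are made up for the example.

```python
# Back-of-the-envelope clock and peak-FLOPS estimate from the post.

def clock_from_peak(peak_flops, cores, lanes_per_core, ops_per_lane_cycle):
    """Solve peak = cores * lanes * ops * clock for the clock (Hz)."""
    return peak_flops / (cores * lanes_per_core * ops_per_lane_cycle)


def peak_flops(cores, lanes_per_core, ops_per_lane_cycle, clock_hz):
    """Theoretical peak FLOPS for a given configuration."""
    return cores * lanes_per_core * ops_per_lane_cycle * clock_hz


# 2 AVX-512 units per core, 32-bit (single precision) lanes: 1024 / 32 = 32.
SP_LANES = 2 * 512 // 32
OPS_PER_LANE_CYCLE = 3   # the post's assumption: 2 (FMA) + 1 (multiply)

# Lowest model: 60 cores quoted at 6 TFlops single precision.
clock_hz = clock_from_peak(6e12, 60, SP_LANES, OPS_PER_LANE_CYCLE)

# Top model: 72 cores, assuming the same clock (likely conservative).
sp_72 = peak_flops(72, SP_LANES, OPS_PER_LANE_CYCLE, clock_hz)
dp_72 = sp_72 / 2        # double precision uses half as many lanes

print(f"Implied clock: {clock_hz / 1e9:.4f} GHz")
print(f"72-core peak:  {sp_72 / 1e12:.1f} TFlops SP / {dp_72 / 1e12:.1f} TFlops DP")
```

Changing `OPS_PER_LANE_CYCLE` to 2 (FMA only) would push the implied clock up by half, so the clock figure is only as good as that assumption.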

 

And this is promised in a 300W TDP... :D

 

For reference, the nearest competition is the Nvidia K40 and K80 Tesla based on the Kepler architecture. The theoretical performance and memory bandwidth figures can be found here: http://www.nvidia.com/object/tesla-servers.html

 

As you can see, Intel has taken the DP performance crown by a fair margin. AMD's best FirePro is rated for 2.58 TFlops DP, but obviously neither it nor Nvidia gets near its theoretical figures in Linpack benchmarks. However, the Xeon Phi coprocessors already get much closer to their theoretical performance bounds. So, AMD and Nvidia had better bring the performance in 2016 or this will be a one-sided slaughter.

 

Enjoy the supercomputing hype y'all.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


I'm enjoying it for some reason that I will never know.

 

So:

 

16 or 8 GB of HBM instead of L3 cache.

2304 GB of RAM - theoretical limit - 

14nm.

Two form factors.

36MB of L2 cache.

and more.

 

Bloody amazing and expensive.

  ﷲ   Muslim Member  ﷲ

KennyS and ScreaM are my role models in CSGO.

CPU: i3-4130 Motherboard: Gigabyte H81M-S2PH RAM: 8GB Kingston hyperx fury HDD: WD caviar black 1TB GPU: MSI 750TI twin frozr II Case: Aerocool Xpredator X3 PSU: Corsair RM650


Hmmm, this is super cool. I wonder how it'll compare to Teslas, if you can even compare those two.

Computing enthusiast. 
I use to be able to input a cheat code now I've got to input a credit card - Total Biscuit
 


16GB HBM? the hard on is real

this is one of the greatest thing that has happened to me recently, and it happened on this forum, those involved have my eternal gratitude http://linustechtips.com/main/topic/198850-update-alex-got-his-moto-g2-lets-get-a-moto-g-for-alexgoeshigh-unofficial/ :')

i use to have the second best link in the world here, but it died ;_; its a 404 now but it will always be here

 


Hmmm, this is super cool. I wonder how it'll compare to Teslas, if you can even compare those two.

Xeon Phi are already getting better real-world performance than Kepler Teslas in a number of instances. Now that Intel has the theoretical performance advantage as well, I don't think 2016 is going to be kind to Nvidia, with or without Pascal.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


I want one! I have no use for it, nor can I afford it if I want to go to college, but I want one! xD

Why is the God of Hyperdeath SO...DARN...CUTE!?

 

Also, if anyone has their mind corrupted by an anthropomorphic black latex bat, please let me know. I would like to join you.


Xeon Phi are already getting better real-world performance than Kepler Teslas in a number of instances. Now that Intel has the theoretical performance advantage as well, I don't think 2016 is going to be kind to Nvidia, with or without Pascal.

I'd say the next big competitor to what Intel has to offer is Volta in 2017: ~300W with HBM and NVLink.

Computing enthusiast. 
I use to be able to input a cheat code now I've got to input a credit card - Total Biscuit
 


I thought this was gonna be Game of Thrones related...Silly me...

"Graphics and gameplay are not mutually exclusive."


"Nvidia, AMD, Intel, or whatever company out there has only one end goal and that is PROFIT.


If you think these companies exist for any other reason you're gonna be disappointed my dear. CAVEAT EMPTOR"


I thought this was gonna be Game of Thrones related...Silly me...

 

Out...OUT NOW! *shuts door*

 

Anyways, cool stuff Intel, can't wait. :P

- Fresher than a fruit salad.


I'd say the next big competitor to what Intel has to offer is Volta in 2017: ~300W with HBM and NVLink.

Fiji will probably be the next big contender with 8.6 TFLOPS of single precision and 4.3 TFLOPS of double precision at 300w TDP.


I thought this was gonna be Game of Thrones related...Silly me...

Not gonna lie that's immediately what I thought too!

So disappointed. Haha.

CPU: Ryzen 9 5900 Cooler: EVGA CLC280 Motherboard: Gigabyte B550i Pro AX RAM: Kingston Hyper X 32GB 3200mhz

Storage: WD 750 SE 500GB, WD 730 SE 1TB GPU: EVGA RTX 3070 Ti PSU: Corsair SF750 Case: Streacom DA2

Monitor: LG 27GL83B Mouse: Razer Basilisk V2 Keyboard: G.Skill KM780 Cherry MX Red Speakers: Mackie CR5BT

 



I'd say the next big competitor to what Intel has to offer is Volta in 2017: ~300W with HBM and NVLink.

PCIe 4.0 will offer 16GT/s or up to 156Gb/s bandwidth vs. NVLINK 2.0 offering 200. If the bottleneck still ends up being PCIe then Nvidia could still do well, and then maybe Intel can work to use NVLink with its own coprocessors since PCI-SIG is so damn slow...
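For anyone who wants to check the PCIe side of that comparison, here is a small sketch of the raw link-rate arithmetic. It assumes a x16 slot and the 128b/130b line encoding that PCIe 3.0 and 4.0 both use, counts one direction only, and ignores packet and protocol overhead. The post does not say which link width or units its own figure assumes, so these numbers are not expected to line up with it exactly; the function name is made up for the example.

```python
# Raw per-direction PCIe link bandwidth from transfer rate and lane count.

def pcie_gbit_per_s(gt_per_s, lanes, payload_bits=128, line_bits=130):
    """Usable Gb/s in one direction after 128b/130b line encoding."""
    return gt_per_s * lanes * payload_bits / line_bits


gen3_x16 = pcie_gbit_per_s(8, 16)    # PCIe 3.0: 8 GT/s per lane
gen4_x16 = pcie_gbit_per_s(16, 16)   # PCIe 4.0: 16 GT/s per lane

for name, gbit in (("PCIe 3.0 x16", gen3_x16), ("PCIe 4.0 x16", gen4_x16)):
    # Divide by 8 to convert gigabits to gigabytes.
    print(f"{name}: {gbit:.1f} Gb/s = {gbit / 8:.2f} GB/s per direction")
```

That works out to roughly 31.5 GB/s each way for a PCIe 4.0 x16 slot, double the PCIe 3.0 figure.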

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


Fiji will probably be the next big contender with 8.6 TFLOPS of single precision and 4.3 TFLOPS of double precision at 300w TDP.

Yes, but AMD gets nowhere near its theoretical numbers, and there's still the problem of OpenCL's ugliness/clunkiness vs. OpenMP and the lack of HSA adoption as well.

 

Also, what are the values for the AMD flops equation? edit: nvm

 

5.632*10^12 flop/second = 2816 SPs * (2flop/SP/cycle) * 1*10^9 cycles/second

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


Yes, but AMD gets nowhere near its theoretical numbers, and there's still the problem of OpenCL's ugliness/clunkiness vs. OpenMP and the lack of HSA adoption as well.

No chip on the market will reach its theoretical compute performance, including the Phi. OpenCL isn't as bad as most make it out to be. Most, I would figure, have no experience in using it, as it doesn't take much to establish a device driver and begin executing kernels. If you dislike it you can always migrate over to OpenACC.

 

Also, what are the values for the AMD flops equation?

 

5.632*10^12 = 2816 cores * (x simd units) * (instructions/clock) * clock speed

4096 SPs * 2 flop/SP/cycle * 1050 MHz = 8.6 TFLOPS SP; at a 1/2 DP ratio that is 4.3 TFLOPS DP.
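Both GPU figures in this exchange come from the same formula, so here is a quick sketch that reproduces them. It assumes 2 flops per stream processor per cycle (one FMA), as in the equation quoted above, and takes the 1/2 double-precision ratio from this post at face value. The function name is invented for the example.

```python
# Peak GPU throughput: stream processors * flops per SP per cycle * clock.

def gpu_peak_tflops(stream_processors, clock_mhz, flops_per_sp_cycle=2):
    """Theoretical single-precision peak in TFLOPS."""
    return stream_processors * flops_per_sp_cycle * clock_mhz * 1e6 / 1e12


# 2816 SPs at 1 GHz, the configuration from the earlier equation.
earlier_sp = gpu_peak_tflops(2816, 1000)

# 4096 SPs at 1050 MHz, the Fiji figures quoted in this post.
fiji_sp = gpu_peak_tflops(4096, 1050)
fiji_dp = fiji_sp / 2    # 1/2 DP ratio as claimed in the post

print(f"2816 SPs @ 1000 MHz: {earlier_sp:.3f} TFLOPS SP")
print(f"4096 SPs @ 1050 MHz: {fiji_sp:.2f} TFLOPS SP / {fiji_dp:.2f} TFLOPS DP")
```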


No chip on the market will reach its theoretical compute performance, including the Phi. OpenCL isn't as bad as most make it out to be. Most, I would figure, have no experience in using it, as it doesn't take much to establish a device driver and begin executing kernels. If you dislike it you can always migrate over to OpenACC.

4096*2*1050 = 8.6 TFLOPS SP + (1/2 ratio DP) = 4.3 TFLOPS DP.

No chip reaches its theoretical values, but some get much closer than others. On the IBM platform Nvidia actually gets closer than the earlier Xeon Phi figures did, and Linpack has been slow to update, but Intel's Xeon Phi is now performing at between 75 and 95% of theoretical performance (yay branch predictors), which has put Intel in a better and better position to take the performance crown from Nvidia entirely. AMD's FirePro cards seem to be performing in the 35-55% range.
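To show why those percentages matter more than the headline numbers, here is a sketch that scales the theoretical DP figures quoted in this thread by the efficiency ranges claimed above. Everything in it is an assumption carried over from the posts: the percentages are the claimed ranges rather than measurements, and applying current-generation efficiency to unreleased parts (Knight's Landing, Fiji) is pure extrapolation.

```python
# Scale theoretical peak by a claimed efficiency range to get a sustained range.

def sustained_range(peak_tflops, low_frac, high_frac):
    """Return (low, high) sustained TFLOPS for a given efficiency range."""
    return (peak_tflops * low_frac, peak_tflops * high_frac)


# (theoretical DP TFlops, low efficiency, high efficiency), all from the thread.
parts = {
    "Knight's Landing, 60 cores": (3.0, 0.75, 0.95),
    "Current best FirePro":       (2.58, 0.35, 0.55),
    "Fiji (as claimed)":          (4.3, 0.35, 0.55),
}

results = {name: sustained_range(*spec) for name, spec in parts.items()}

for name, (low, high) in results.items():
    print(f"{name}: {low:.2f} to {high:.2f} TFlops DP sustained")
```

If those ranges held, a 4.3 TFlops part running at 35-55% would land below a 3 TFlops part running at 75-95% in most cases, which is the whole argument in a nutshell.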

OpenCL is okay, but it's just not as nice to set up kernels as CUDA and it's definitely nowhere near as clean as OpenMP/OpenACC. I think that's one big thing AMD shot itself in the foot on. It should have jumped on those standards instead of striving for HSA first.

And given the Titan X offers 9.9 TFlops SP, it might want to do better than that. Doesn't AMD yet have FMA + M?

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


I'd rather have a trillion raspberry pis in a cluster :P /s

Wow even the lowest end one is a damn beast!

MacBook Pro 15' 2018 (Pretty much the only system I use)


The Fiji XT based FirePros and the R9 390X will wipe the floor with this. Full HSA support, over 8 Teraflops of single precision compute and over 4 Teraflops of double precision compute, 16GB of HBM and undoubtedly half the price with a 250W TDP.

And let's not forget all of this is going to be on TSMC's 28nm process and a 550mm² chip. The Phi is over 700mm² large on 14nm and still only manages 300W TDP.
With the same process node Fiji XT would literally be half the size and half the TDP with a 25% performance advantage.


The Fiji XT based FirePros and the R9 390X will wipe the floor with this. Full HSA support, over 8 Teraflops of single precision compute and over 4 Teraflops of double precision compute, 16GB of HBM and undoubtedly half the price with a 250W TDP.

And let's not forget all of this is going to be on TSMC's 28nm process and a 550mm² chip. The Phi is over 700mm² large on 14nm and still only manages 300W TDP.

With the same process node Fiji XT would literally be half the size and half the TDP with a 25% performance advantage.

*Facepalm* I haven't seen a fanboy post this bad in a while...

No, the Fiji FirePro will not simply wipe the floor with this. If that were remotely true, the current Hawaii FirePro would be wiping the floor with both Nvidia and Intel in the HPC space right now based on the theoretical numbers. However, AMD barely has any supercomputer sales, and the current Hawaii FirePro already supports OpenCL 2.0 and most of HSA.

HSA is not seeing much adoption or even exploration. That's because most of the HPC code base is already in OpenMP and OpenACC which do 95% of what AMD proposed in HSA, and they're easy programming models to learn.

Intel has one huge advantage over GPUs when it comes to parallel compute: fully asynchronous computation. Every thread can be individually handled and programmed, on a much bigger scale than AMD's 8 ACEs. Furthermore, the Xeon Phi have branch predictors, allowing you to put both the computations and the decision tree on the Xeon Phi and not have to ping-pong kernel I/O, leaving the Phi saturated for far greater percentages of time. This is why Intel's real-world performance is so much closer to its theoretical performance and beats its competition despite having lower theoretical performance numbers.

As for die sizes, you forget AMD is already using its HDL move and Intel isn't. At 14nmFF with no HDL, Fiji would easily be as big as 700mm^2. Intel is also pulling double the clocks of AMD's GPUs. And both chips will have a 300W TDP.

And actually Intel's Xeon Phi are very price-competitive against Tesla and FirePro products.

Only in truly embarrassingly parallel workloads will Fiji eke out a small performance win over Knight's Landing. The problem with using this as your measure is that HPC has many different parallel workloads, and you can't eliminate all branching for all workloads for dynamic sizes no matter how hard you try. As Linpack has published, Intel wins in 18 of 22 major tests on the current Knight's Corner generation vs the K40 Tesla and the W9150 FirePro, some by small margins and others by large ones.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


Wouldn't this chip technically work as a GPU with so many smaller cores?

It's been confirmed DX12 and Vulkan can work with these, but Xeon Phi lack the nice TMUs and ROPs that GPUs depend on, so they won't do so well, especially for the price. It is a highly parallel computation unit though similar to GPUs in that respect.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd

