xeon phi Intel begins to ship its KNL Xeon Phi processors

porina · June 21, 2016

Was discussing this on another forum where we eat flops for breakfast. Based on claimed peak DP performance, it'll be roughly 12x a 6700K at stock. On that basis it isn't badly priced relative to the lower models, and it still offers a big potential power saving. And that's not counting that the Phi has a ton more memory bandwidth, so it would not be choked like the 6700K is now. Now if only I could get two: one for me, one for a developer to optimise code for it!

patrickjp93 · June 21, 2016

17 minutes ago, porina said:

Was discussing this on another forum where we eat flops for breakfast. Based on claimed peak DP performance, it'll be roughly 12x a 6700K at stock. On that basis it isn't badly priced relative to the lower models, and it still offers a big potential power saving. And that's not counting that the Phi has a ton more memory bandwidth, so it would not be choked like the 6700K is now. Now if only I could get two: one for me, one for a developer to optimise code for it!

How the hell did you calculate that?!

6700K SFlops: 4 * 256/32 * 2 * 4.2*10^9 = 268.8*10^9 flops

DFlops: SFlops / 2 = 134.4*10^9

KNL SFlops: 72 * 1024/32 * 2 * 1.5*10^9 = 6.912*10^12

DFlops: SFlops/2 = 3.456*10^12 = 3456*10^9

3456/134.4 = 25.71X the performance of the 6700K.
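
For anyone who wants to recheck the arithmetic, here is a minimal Python sketch of the same formula (cores * lanes * 2 for FMA * clock). The function name `peak_flops` is made up for illustration, and the inputs are simply the figures used in the post above (256 bits of vector width per core at 4.2 GHz for the 6700K, 1024 bits per core at 1.5 GHz for KNL), not verified specs.

```python
def peak_flops(cores, vector_bits, element_bits, clock_hz, fma=True):
    """Theoretical peak FLOP/s: cores * lanes * (2 if FMA) * clock."""
    lanes = vector_bits // element_bits      # elements per vector register
    ops_per_lane = 2 if fma else 1           # FMA counts as multiply + add
    return cores * lanes * ops_per_lane * clock_hz

# Inputs as used in the post (assumptions, not datasheet values).
i7_sp = peak_flops(4, 256, 32, 4.2e9)
i7_dp = peak_flops(4, 256, 64, 4.2e9)
knl_sp = peak_flops(72, 1024, 32, 1.5e9)
knl_dp = peak_flops(72, 1024, 64, 1.5e9)

print(f"6700K: {i7_sp / 1e9:.1f} SP GFLOPS, {i7_dp / 1e9:.1f} DP GFLOPS")
print(f"KNL:   {knl_sp / 1e9:.1f} SP GFLOPS, {knl_dp / 1e9:.1f} DP GFLOPS")
print(f"DP ratio KNL / 6700K: {knl_dp / i7_dp:.2f}x")
```

Since DP halves the lane count on both chips, the ratio is the same for SP and DP under these assumptions.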

TheRandomness · June 21, 2016

@Slick Use your powers as techy people and obtain one. Please? A 'review' would be nice (throw crysis at it)

Tbh, chances of that happening are next to none...

Sauron · June 21, 2016

6 hours ago, Mr.Meerkat said:

Hmmmm...a situation where 288 threads are all used...25% of the speed (per thread) of an actual core anyone? (I know that's not exactly how HT works but still)

GPU computing (or at least, the sort of computing that is USUALLY done on a GPU and that Intel would like to shift over to their CPUs)

porina · June 21, 2016

5 minutes ago, patrickjp93 said:

How the hell did you calculate that?!

Kinda like you did, but someone else convinced me to throw in a factor of 2 somewhere for fma. I wasn't awake enough to question it and ran with it. If actually 24x, even better!

patrickjp93 · June 21, 2016

1 minute ago, porina said:

Kinda like you did, but someone else convinced me to throw in a factor of 2 somewhere for fma. I wasn't awake enough to question it and ran with it. If actually 24x, even better!

That's where my 2 is in both equations as well: from fma.

patrickjp93 · June 21, 2016

5 minutes ago, Sauron said:

GPU computing (or at least, the sort of computing that is USUALLY done on a GPU and that Intel would like to shift over to their CPUs)

No, more like a ton of independent tasks. You could run an email server (like, GMail-sized) off of these for pretty cheap compared to a bunch of E5 or E7 servers.

Sauron · June 21, 2016

1 minute ago, patrickjp93 said:

No, more like a ton of independent tasks. You could run an email server (like, GMail-sized) off of these for pretty cheap compared to a bunch of E5 or E7 servers.

Yeah, that's what gpus excel at - small, independent tasks. Maybe not mail servers specifically but that sort of thing.

patrickjp93 · June 21, 2016

3 minutes ago, Sauron said:

Yeah, that's what gpus excel at - small, independent tasks. Maybe not mail servers specifically but that sort of thing.

GPUs do not excel at launching a ton of totally independent, asynchronous threads, and further, unless you intend to engineer a GPU with a bunch of network interfaces, that won't end well. GPUs excel at launching workgroups where each thread is working on pretty much the same thing and reducing the results together at the end in a fork-join model where the fork factor tends to be 16x or higher.
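
The fork-join pattern described above can be sketched roughly like this. It is plain Python running on CPU threads, so it only illustrates the shape of the pattern (every worker runs the same function on its own slice, then the partial results are reduced), not actual GPU code. The function names and the default fork factor of 16 are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    """Work done by every forked worker: same operation, different slice."""
    return sum(x * x for x in chunk)

def fork_join_sum_of_squares(data, fork_factor=16):
    """Fork: split data into up to fork_factor chunks. Join: reduce partials."""
    chunk_size = max(1, -(-len(data) // fork_factor))  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=fork_factor) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)  # the reduction step at the join

if __name__ == "__main__":
    print(fork_join_sum_of_squares(list(range(1000))))
```

A mail server looks nothing like this: each connection is its own unrelated task with its own control flow and I/O, and there is no reduction at the end.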

porina · June 21, 2016

1 minute ago, patrickjp93 said:

That's where my 2 is in both equations as well: from fma.

I only applied it to 6700k, and used the quoted DP flops for phi, or something like that. Actually, my working was done a bit differently (and very possibly incorrectly):

4 cores * 2 AVX units * 4 DP data chunks * 4.0 GHz = 128 GFLOPS. I got persuaded to double that again for FMA, then comparing to quoted ~3 DP TFLOPS. Also recognising peak rate can't be sustained in real world.

Where did 256/32 come from? Presumably the 256 is the bit width of each AVX unit, but I don't get the 32?
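
As a rough check of the working above, here it is as a few lines of Python. The two-AVX-units-per-core figure and the ~3 DP TFLOPS for Phi are taken as stated in the post; the variable names are only for illustration.

```python
cores = 4
avx_units_per_core = 2        # as assumed in the post
dp_lanes_per_unit = 4         # 256-bit unit / 64-bit doubles
clock_ghz = 4.0

base_gflops = cores * avx_units_per_core * dp_lanes_per_unit * clock_ghz
with_fma_gflops = base_gflops * 2      # doubled again for FMA

phi_quoted_gflops = 3000.0             # "~3 DP TFLOPS" as quoted
ratio = phi_quoted_gflops / with_fma_gflops

print(f"6700K without FMA: {base_gflops:.0f} GFLOPS")
print(f"6700K with FMA:    {with_fma_gflops:.0f} GFLOPS")
print(f"Phi / 6700K:       {ratio:.1f}x")
```

That is where the "roughly 12x" comes from: two units and FMA both counted on the 6700K side, against the quoted Phi number.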

sof006 · June 22, 2016

2 hours ago, SamStrecker said:

Ask china

I will. Just you wait... just you wait.....

awesomeness10120 · June 22, 2016

So, these new Xeon Phis are built with the Silvermont core. I know that Intel makes a very good portion of their money from servers, but why doesn't Intel make a cost effective desktop platform with Silvermont to compete with the low end Athlons?

KeltonDSMer · June 22, 2016

Interesting to see HBM. I thought Intel heavily invested in HMC with Micron, specifically for Xeon Phi. What makes HBM more suitable? Am I missing something obvious here? What happened to HMC?

I mean, look at the HMC page on Micron's site: https://www.micron.com/products/hybrid-memory-cube/high-performance-on-package-memory

2FA · June 22, 2016

Remember the idiot who was saying this would be Skylake-E?

patrickjp93 · June 22, 2016

4 hours ago, porina said:

I only applied it to 6700k, and used the quoted DP flops for phi, or something like that. Actually, my working was done a bit differently (and very possibly incorrectly):

4 cores * 2 AVX units * 4 DP data chunks * 4.0 GHz = 128 GFLOPS. I got persuaded to double that again for FMA, then comparing to quoted ~3 DP TFLOPS. Also recognising peak rate can't be sustained in real world.

Where did 256/32 come from? Presumably the 256 is the bit width of each AVX unit, but I don't get the 32?

1 AVX unit per core, but there is a factor of 2 ops per clock applied because of FMA. That's why I did 256/32 as part of my calculation for single precision. You could do 256/64 for DP. And the 6700K stays boosted at 4.2GHz on all 4 cores.

porina · June 22, 2016

5 hours ago, patrickjp93 said:

1 AVX unit per core, but there is a factor of 2 ops per clock applied because of FMA. That's why I did 256/32 as part of my calculation for single precision. You could do 256/64 for DP. And the 6700K stays boosted at 4.2GHz on all 4 cores.

The 32 was the SP size? I didn't make the connection earlier. It was my understanding there were two 256-bit FMA capable units per core.

http://www.anandtech.com/show/6355/intels-haswell-architecture/8

Above is for Haswell. Skylake isn't much different although I was unable to locate an equivalent illustration for it. I know from experience Skylake is a good bit faster (IPC) than Haswell in FMA heavy applications although I don't understand why, other than perhaps the reduced instruction latency giving a boost where peak rate can't be sustained.

Also from: http://www.agner.org/optimize/microarchitecture.pdf

"...it enables Intel to boast a floating point performance of 32 FLOPS per cycle."

If I'm applying it correctly that would be consistent with two units, 8 SP per unit, x2 for FMA. That document also describes the availability of two FMA units.
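
A small sketch of that reading, assuming two 256-bit FMA-capable units per core as described in the linked documents. The helper name `flops_per_cycle_per_core` is hypothetical.

```python
def flops_per_cycle_per_core(fma_units, vector_bits, element_bits):
    """Per-core FLOPs per cycle: units * lanes * 2 (FMA = multiply + add)."""
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2

sp = flops_per_cycle_per_core(fma_units=2, vector_bits=256, element_bits=32)
dp = flops_per_cycle_per_core(fma_units=2, vector_bits=256, element_bits=64)

print(f"SP: {sp} FLOPs/cycle/core, DP: {dp} FLOPs/cycle/core")
```

With one unit instead of two, the same function gives half of each figure.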

Putting the above aside, assuming Phi has similar execution capabilities, the difference between them would simply come down to cores * clock (again, ignoring the potential for memory bandwidth limiting like I often see on Skylake).

Stefan1024 · June 22, 2016

I hope they produce too many of them and then want to get rid of them as fast as possible, like they did with the last generation. A 57-core Xeon Phi for $200 was pretty nice.

But I'm pretty sure this won't happen again.

Liltrekkie · June 22, 2016

Linus: uh yeah intel contact, can we get some of these?

Intel contact: what will you use them for?

Linus: network render

Intel contact: ok

later.....

Linus: The new 500 gamer 1 CPU in a 1U rack....

Anytime I see stuff like this that's powerful and has a lot of cores, this pops into my head. I bet Linus has already had this discussion with his Intel contact for this product.

patrickjp93 · June 22, 2016

3 hours ago, porina said:

The 32 was the SP size? I didn't make the connection earlier. It was my understanding there were two 256-bit FMA capable units per core.

http://www.anandtech.com/show/6355/intels-haswell-architecture/8

Above is for Haswell. Skylake isn't much different although I was unable to locate an equivalent illustration for it. I know from experience Skylake is a good bit faster (IPC) than Haswell in FMA heavy applications although I don't understand why, other than perhaps the reduced instruction latency giving a boost where peak rate can't be sustained.

Also from: http://www.agner.org/optimize/microarchitecture.pdf

If I'm applying it correctly that would be consistent with two units, 8 SP per unit, x2 for FMA. That document also describes the availability of two FMA units.

Putting the above aside, assuming Phi has similar execution capabilities, the difference between them would simply come down to cores * clock (again, ignoring the potential for memory bandwidth limiting like I often see on Skylake).

No, 32 flops per cycle is 4 * 256/32 or 4*8. If you see FLOPs outside of the supercomputing world, it's for single precision. It's annoying, but it's true. Haswell has only 1 vector unit per core (FMA capable), a dedicated 32/64-bit FPU (FMA capable), a dedicated 32/64-bit integer ALU, and a dedicated AGU or address-translation unit. And while you could potentially use both the vector and singleton units in your code, it's tough to extract any extra performance over just using the vector unit.

porina · June 22, 2016

1 hour ago, patrickjp93 said:

No, 32 flops per cycle is 4 * 256/32 or 4*8. If you see FLOPs outside of the supercomputing world, it's for single precision. It's annoying, but it's true. Haswell has only 1 vector unit per core (FMA capable), a dedicated 32/64-bit FPU (FMA capable), a dedicated 32/64-bit integer ALU, and a dedicated AGU or address-translation unit. And while you could potentially use both the vector and singleton units in your code, it's tough to extract any extra performance over just using the vector unit.

Not sure if it was you, but I vaguely recall similar talk over Prime95 previously. The author of that continues to try to extract as much performance as is available from the architecture, and if there are two FMA units, let's use two FMA units. In many cases, the limitation is not in the execution units but in trying to keep them fed. Consumer CPUs have too little cache and too little RAM bandwidth. Why do my interests have to have more to do with HPC than consumer?

patrickjp93 · June 22, 2016

2 minutes ago, porina said:

Not sure if it was you, but I vaguely recall similar talk over Prime95 previously. The author of that continues to try to extract as much performance as is available from the architecture, and if there are two FMA units, let's use two FMA units. In many cases, the limitation is not in the execution units but in trying to keep them fed. Consumer CPUs have too little cache and too little RAM bandwidth. Why do my interests have to have more to do with HPC than consumer?

Well, then you wouldn't have 2 256-bit units. You have 1 256-bit vector unit AND 1 FPU that can do either 1 FMA 32-bit or 1 FMA 64-bit operation per clock, so multiply by 9/8 or 5/4 depending on which mode you're in. And keeping them fed isn't an issue either. You can load up to 64 bytes of data from cache per clock and up to 32 bytes of instructions, an entire data cache line and an entire instruction cache line. Keeping 288 bits of compute units fed isn't hard to do. To do it in a way that's useful...that's tougher.
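
Where the 9/8, 5/4 and 288 figures come from, as a short sketch. It takes the post's model at face value (one 256-bit vector unit plus one scalar FPU per core), and `width_multiplier` is a made-up helper name.

```python
from fractions import Fraction

def width_multiplier(vector_bits, scalar_bits):
    """Extra throughput from adding a scalar FPU next to the vector unit."""
    return Fraction(vector_bits + scalar_bits, vector_bits)

sp_multiplier = width_multiplier(256, 32)   # scalar FPU in 32-bit mode
dp_multiplier = width_multiplier(256, 64)   # scalar FPU in 64-bit mode
sp_total_bits = 256 + 32                    # the "288 bits" in the post

print(f"SP: x{sp_multiplier}, DP: x{dp_multiplier}, SP width: {sp_total_bits} bits")
```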

patrickjp93 · June 22, 2016

4 hours ago, Liltrekkie said:

Linus: uh yeah intel contact, can we get some of these?

Intel contact: what will you use them for?

Linus: network render

Intel contact: ok

later.....

Linus: The new 500 gamer 1 CPU in a 1U rack....

Anytime I see stuff like this that's powerful and has a lot of cores, this pops into my head. I bet Linus has already had this discussion with his Intel contact for this product.

yeah right... the limiting factor would be the graphics card hookup actually. Xeon Phi CPUs still only come with 32 lanes, so the best you could do is shove 4 Radeon Pro Duos in and get 8 gamers, 1 CPU with 9 physical cores per player.
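
The lane budget behind that estimate, as a quick sketch. The 32 lanes and 72 cores are the figures used in this thread; running each card at x8 is an assumption made here so that four cards fit, and the variable names are only illustrative.

```python
pcie_lanes = 32        # as stated in the post
lanes_per_card = 8     # assumption: each card runs at x8
gpus_per_card = 2      # Radeon Pro Duo is a dual-GPU card
cpu_cores = 72

cards = pcie_lanes // lanes_per_card
gamers = cards * gpus_per_card           # one GPU per player
cores_per_gamer = cpu_cores // gamers

print(f"{cards} cards, {gamers} gamers, {cores_per_gamer} cores each")
```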

porina · June 22, 2016

13 minutes ago, patrickjp93 said:

Well, then you wouldn't have 2 256-bit units. You have 1 256-bit vector unit AND 1 FPU that can do either 1 FMA 32-bit or 1 FMA 64-bit operation per clock, so multiply by 9/8 or 5/4 depending on which mode you're in. And keeping them fed isn't an issue either. You can load up to 64 bytes of data from cache per clock and up to 32 bytes of instructions, an entire data cache line and an entire instruction cache line. Keeping 288 bits of compute units fed isn't hard to do. To do it in a way that's useful...that's tougher.

Refer back to the two links I had earlier, both describe the presence of two 256 bit FMA capable units, one each on port 0 and port 1. Am I misunderstanding something here?

Also, the data limits I refer to are in two parts: the cache is too small for the working data set, and then fetching things from system RAM is limiting. Crystalwell eDRAM is kinda halfway there and I'd love to see that on more CPUs, as it does help a lot on the i5-5675C I got to test, within the limits of the poor maximum clock speed of that generation and an unexplained real-world IPC lower even than Haswell in this application.

patrickjp93 · June 22, 2016

25 minutes ago, porina said:

Refer back to the two links I had earlier, both describe the presence of two 256 bit FMA capable units, one each on port 0 and port 1. Am I misunderstanding something here?

Also, the data limits I refer to are in two parts: the cache is too small for the working data set, and then fetching things from system RAM is limiting. Crystalwell eDRAM is kinda halfway there and I'd love to see that on more CPUs, as it does help a lot on the i5-5675C I got to test, within the limits of the poor maximum clock speed of that generation and an unexplained real-world IPC lower even than Haswell in this application.

It's the same unit, accessible by two ports. Each of those execution ports obeys mutual exclusion. No two instructions may be in the same port at the same time, and no two instructions may use the same ALU at the same time. You're misreading. Notice how both of those ports also have integer capabilities, and many of the same ones. It's a way to allow the pipelines to have more flexibility.

http://www.realworldtech.com/haswell-cpu/4/

Your analysis of the cache is also incomplete. If you're doing 1 operation over a large dataset, then yes, memory streaming will be necessary, but what about a lot of transformations on a small dataset?
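
The streaming-versus-reuse distinction can be put in numbers with a simple roofline-style estimate: attainable throughput is the smaller of peak compute and bandwidth times FLOPs per byte moved. This is a sketch only. The 100 GFLOPS peak, the 30 GB/s bandwidth and both intensity values are placeholder numbers picked for illustration, not measurements of any CPU discussed here.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

PEAK_GFLOPS = 100.0      # hypothetical peak compute
BANDWIDTH_GB_S = 30.0    # hypothetical memory bandwidth

# One operation per 8-byte double streamed from RAM: 0.125 FLOPs per byte.
streaming = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, 0.125)

# Many transformations on a small, cache-resident dataset: 50 FLOPs per byte.
reuse_heavy = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, 50.0)

print(f"streaming:   {streaming:.2f} GFLOPS (bandwidth bound)")
print(f"reuse-heavy: {reuse_heavy:.2f} GFLOPS (compute bound)")
```

Both posts are describing one side of the same `min()`: which side applies depends on how many operations the code performs per byte it has to fetch.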

VerticalDiscussions · June 22, 2016

That's a lot of threads, hehe: 288!
