Intel begins to ship its KNL Xeon Phi processors

NumLock21

Was discussing this on another forum where we eat flops for breakfast. Based on claimed peak DP performance it'll be roughly 12x a 6700K at stock. So on that basis it isn't badly priced relative to the lower models, and still offers a big potential power saving. And that's not counting that the Phi has a ton more memory bandwidth, so it would not be choked like the 6700K is now. Now if only I could get two: one for me, one for a developer to optimise code for it!

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, random 1080p + 720p displays.
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible

17 minutes ago, porina said:

Was discussing this on another forum where we eat flops for breakfast. Based on claimed peak DP performance it'll be roughly 12x a 6700K at stock. So on that basis it isn't badly priced relative to the lower models, and still offers a big potential power saving. And that's not counting that the Phi has a ton more memory bandwidth, so it would not be choked like the 6700K is now. Now if only I could get two: one for me, one for a developer to optimise code for it!

How the hell did you calculate that?!

 

6700K SFlops: 4 * 256/32 * 2 * 4.2*10^9 = 268.8*10^9 flops

           DFlops: SFlops / 2 = 134.4*10^9

KNL    SFlops: 72 * 1024/32 * 2 * 1.5*10^9 = 6.912*10^12

           DFlops: SFlops/2 = 3.456*10^12 = 3456*10^9

3456/134.4 = 25.71X the performance of the 6700K.
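For anyone who wants to check figures like these, here is a rough Python sketch of the formula used in this post (cores × vector bits / element bits × 2 for FMA × clock). The function name `peak_flops` and its parameters are made up for illustration, and the inputs are the figures quoted in the post, not measurements.

```python
def peak_flops(cores, vector_bits, element_bits, clock_hz, fma_factor=2):
    """Theoretical peak FLOPS: cores * SIMD lanes * FMA factor * clock."""
    lanes = vector_bits // element_bits  # elements processed per vector op
    return cores * lanes * fma_factor * clock_hz

# Inputs as quoted in the post (assumed, not measured).
skl_sp = peak_flops(4, 256, 32, 4.2e9)    # 6700K, single precision
skl_dp = peak_flops(4, 256, 64, 4.2e9)    # 6700K, double precision
knl_sp = peak_flops(72, 1024, 32, 1.5e9)  # KNL, single precision
knl_dp = peak_flops(72, 1024, 64, 1.5e9)  # KNL, double precision

print(f"6700K SP: {skl_sp / 1e9:.1f} GFLOPS, DP: {skl_dp / 1e9:.1f} GFLOPS")
print(f"KNL   SP: {knl_sp / 1e9:.1f} GFLOPS, DP: {knl_dp / 1e9:.1f} GFLOPS")
print(f"DP ratio KNL / 6700K: {knl_dp / skl_dp:.2f}")
```

These are peak numbers only; sustained throughput depends on how well the code keeps the vector units and memory fed.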

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd

6 hours ago, Mr.Meerkat said:

Hmmmm...a situation where 288 threads are all used...25% of the speed (per thread) of an actual core anyone? (I know that's not exactly how HT works but still :P)

GPU computing (or at least, the sort of computing that is USUALLY done on a GPU and that Intel would like to shift over to their CPUs)

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*

5 minutes ago, patrickjp93 said:

How the hell did you calculate that?!

Kinda like you did, but someone else convinced me to throw in a factor of 2 somewhere for fma. I wasn't awake enough to question it and ran with it. If actually 24x, even better!

1 minute ago, porina said:

Kinda like you did, but someone else convinced me to throw in a factor of 2 somewhere for fma. I wasn't awake enough to question it and ran with it. If actually 24x, even better!

That's where my 2 is in both equations as well: from fma.

5 minutes ago, Sauron said:

GPU computing (or at least, the sort of computing that is USUALLY done on a GPU and that Intel would like to shift over to their CPUs)

No, more like a ton of independent tasks. You could run an email server (like, GMail-sized) off of these for pretty cheap compared to a bunch of E5 or E7 servers.

1 minute ago, patrickjp93 said:

No, more like a ton of independent tasks. You could run an email server (like, GMail-sized) off of these for pretty cheap compared to a bunch of E5 or E7 servers.

Yeah, that's what gpus excel at - small, independent tasks. Maybe not mail servers specifically but that sort of thing.

3 minutes ago, Sauron said:

Yeah, that's what gpus excel at - small, independent tasks. Maybe not mail servers specifically but that sort of thing.

GPUs do not excel at launching a ton of totally independent, asynchronous threads, and further, unless you intend to engineer a GPU with a bunch of network interfaces, that won't end well. GPUs excel at launching workgroups where each thread is working on pretty much the same thing and reducing the results together at the end in a fork-join model where the fork factor tends to be 16x or higher.
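The fork-join shape described here can be sketched in plain Python. This is only an illustration of the pattern using CPU threads, not GPU code: every worker runs the same operation on its own slice, and the partial results are reduced at the end. The function names and the choice of 16 workers (echoing the fork factor mentioned above) are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    # Every worker runs the same operation on its own slice of the data.
    return sum(x * x for x in chunk)

def fork_join_sum_of_squares(data, workers=16):
    # Fork: split the input into roughly equal chunks, one per worker.
    size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    # Join: reduce the partial results into a single value.
    return sum(partials)

result = fork_join_sum_of_squares(list(range(1000)))
print(result)
```

A mail server looks nothing like this: each request does different work at a different time, with no common reduction step at the end.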

1 minute ago, patrickjp93 said:

That's where my 2 is in both equations as well: from fma.

I only applied it to 6700k, and used the quoted DP flops for phi, or something like that. Actually, my working was done a bit differently (and very possibly incorrectly):

4 cores * 2 AVX units * 4 DP data chunks * 4.0 GHz = 128 GFLOPS. I got persuaded to double that again for FMA, then comparing to quoted ~3 DP TFLOPS. Also recognising peak rate can't be sustained in real world.
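As a sanity check, that working can be written out in a few lines of Python. The variable names are made up for the example, and the ~3 TFLOPS figure is the quoted Phi number from this post, taken at face value.

```python
cores = 4
avx_units_per_core = 2   # "2 AVX units" as stated in the post
dp_lanes = 256 // 64     # 4 doubles fit in one 256-bit register
clock_ghz = 4.0

base_gflops = cores * avx_units_per_core * dp_lanes * clock_ghz
with_fma_gflops = base_gflops * 2  # an FMA counts as two operations

phi_dp_gflops = 3000.0   # quoted "~3 DP TFLOPS" (assumed figure)
ratio = phi_dp_gflops / with_fma_gflops

print(f"6700K: {base_gflops:.0f} GFLOPS, {with_fma_gflops:.0f} GFLOPS with FMA")
print(f"Phi / 6700K: {ratio:.1f}x")
```

With the FMA doubling applied, the ratio lands close to the "roughly 12x" figure mentioned earlier in the thread.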

 

Where did 256/32 come from? Presumably the 256 is the bit width of each AVX unit, but I don't get the 32?

2 hours ago, SamStrecker said:

Ask china

I will. Just you wait... just you wait.....

System Specs:

CPU: Ryzen 7 5800X

GPU: Radeon RX 7900 XT 

RAM: 32GB 3600MHz

HDD: 1TB Sabrent NVMe -  WD 1TB Black - WD 2TB Green -  WD 4TB Blue

MB: Gigabyte  B550 Gaming X- RGB Disabled

PSU: Corsair RM850x 80 Plus Gold

Case: BeQuiet! Silent Base 801 Black

Cooler: Noctua NH-DH15

 

 

 

So, these new Xeon Phis are built with the Silvermont core. I know that Intel makes a very good portion of their money from servers, but why doesn't Intel make a cost effective desktop platform with Silvermont to compete with the low end Athlons?

Interesting to see HBM. I thought Intel heavily invested in HMC with Micron, specifically for Xeon Phi. What makes HBM more suitable? Am I missing something obvious here? What happened to HMC?

 

I mean, look at the HMC page on Micron's site: https://www.micron.com/products/hybrid-memory-cube/high-performance-on-package-memory

 

 

 

Remember the idiot who was saying this would be Skylake-E?

[Out-of-date] Want to learn how to make your own custom Windows 10 image?

 

Desktop: AMD R9 3900X | ASUS ROG Strix X570-F | Radeon RX 5700 XT | EVGA GTX 1080 SC | 32GB Trident Z Neo 3600MHz | 1TB 970 EVO | 256GB 840 EVO | 960GB Corsair Force LE | EVGA G2 850W | Phanteks P400S

Laptop: Intel M-5Y10c | Intel HD Graphics | 8GB RAM | 250GB Micron SSD | Asus UX305FA

Server 01: Intel Xeon D 1541 | ASRock Rack D1541D4I-2L2T | 32GB Hynix ECC DDR4 | 4x8TB Western Digital HDDs | 32TB Raw 16TB Usable

Server 02: Intel i7 7700K | Gigabyte Z170N Gaming5 | 16GB Trident Z 3200MHz

4 hours ago, porina said:

I only applied it to 6700k, and used the quoted DP flops for phi, or something like that. Actually, my working was done a bit differently (and very possibly incorrectly):

4 cores * 2 AVX units * 4 DP data chunks * 4.0 GHz = 128 GFLOPS. I got persuaded to double that again for FMA, then comparing to quoted ~3 DP TFLOPS. Also recognising peak rate can't be sustained in real world.

 

Where did 256/32 come from? Presumably the 256 is the bit width of each AVX unit, but I don't get the 32?

1 AVX unit per core, but there is a factor of 2 ops per clock applied because of FMA. That's why I did 256/32 as part of my calculation for single precision. You could do 256/64 for DP. And the 6700K stays boosted at 4.2GHz on all 4 cores.

5 hours ago, patrickjp93 said:

1 AVX unit per core, but there is a factor of 2 ops per clock applied because of FMA. That's why I did 256/32 as part of my calculation for single precision. You could do 256/64 for DP. And the 6700K stays boosted at 4.2GHz on all 4 cores.

The 32 was the SP size? I didn't make the connection earlier. It was my understanding there were two 256-bit FMA capable units per core. 

 

http://www.anandtech.com/show/6355/intels-haswell-architecture/8

 

Above is for Haswell. Skylake isn't much different although I was unable to locate an equivalent illustration for it. I know from experience Skylake is a good bit faster (IPC) than Haswell in FMA heavy applications although I don't understand why, other than perhaps the reduced instruction latency giving a boost where peak rate can't be sustained.

 

Also from: http://www.agner.org/optimize/microarchitecture.pdf

Quote

...it enables Intel to boast a floating point performance of 32 FLOPS per cycle.

If I'm applying it correctly that would be consistent with two units, 8 SP per unit, x2 for FMA. That document also describes the availability of two FMA units.
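The two readings of the per-core figure can be compared with a small Python sketch. The helper `flops_per_cycle` is a made-up name, and the unit counts passed in are the ones being debated in this thread, not values taken from a datasheet.

```python
def flops_per_cycle(fma_units, vector_bits, element_bits):
    """FLOPs per core per clock, counting each FMA as two operations."""
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2

two_units_sp = flops_per_cycle(2, 256, 32)  # two units, 8 SP lanes each
two_units_dp = flops_per_cycle(2, 256, 64)  # two units, 4 DP lanes each
one_unit_sp = flops_per_cycle(1, 256, 32)   # single-unit reading

print(two_units_sp, two_units_dp, one_unit_sp)
```

Only the two-unit reading reproduces 32 single-precision FLOPs per cycle per core; the single-unit reading gives half that.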

 

Putting the above aside, assuming Phi has similar execution capabilities, the difference between them would simply come down to cores * clock (again, ignoring the potential for memory bandwidth limiting like I often see on Skylake).
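Under that simplifying assumption (identical per-core execution width), the comparison reduces to a one-liner. The core counts and clocks below are the figures quoted earlier in the thread, used here purely for illustration.

```python
# Figures quoted earlier in the thread (assumed, not measured).
knl_cores, knl_clock_ghz = 72, 1.5
skl_cores, skl_clock_ghz = 4, 4.2

# If every core did the same work per clock, throughput scales with cores * clock.
ratio = (knl_cores * knl_clock_ghz) / (skl_cores * skl_clock_ghz)
print(f"cores * clock ratio: {ratio:.2f}x")
```

Any larger gap in peak numbers would then have to come from wider or additional vector units per core rather than from core count and clock alone.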

I hope they produce too many of them and then want to get rid of them as fast as possible, like they did with the last generation. A 57 core Xeon Phi for $200 was pretty nice.

 

But I'm pretty sure this won't happen again :(

Mineral oil and 40 kg aluminium heat sinks are a perfect combination: 73 cores and a Titan X, Twenty Thousand Leagues Under the Oil

Linus: uh yeah intel contact, can we get some of these?

Intel contact: what will you use them for?

Linus: network render

Intel contact: ok

 

later.....

 

Linus: The new 500 gamer 1 CPU in a 1U rack....

 

 

Anytime I see stuff like this that's powerful and has a lot of cores, this pops into my head. I bet Linus has already had this discussion with his Intel contact for this product.

Do you even fanboy bro?

3 hours ago, porina said:

The 32 was the SP size? I didn't make the connection earlier. It was my understanding there were two 256-bit FMA capable units per core. 

 

http://www.anandtech.com/show/6355/intels-haswell-architecture/8

 

Above is for Haswell. Skylake isn't much different although I was unable to locate an equivalent illustration for it. I know from experience Skylake is a good bit faster (IPC) than Haswell in FMA heavy applications although I don't understand why, other than perhaps the reduced instruction latency giving a boost where peak rate can't be sustained.

 

Also from: http://www.agner.org/optimize/microarchitecture.pdf

If I'm applying it correctly that would be consistent with two units, 8 SP per unit, x2 for FMA. That document also describes the availability of two FMA units.

 

Putting the above aside, assuming Phi has similar execution capabilities, the difference between them would simply come down to cores * clock (again, ignoring the potential for memory bandwidth limiting like I often see on Skylake).

No, 32 flops per cycle is 4 * 256/32 or 4*8. If you see FLOPs outside of the supercomputing world, it's for single precision. It's annoying, but it's true. Haswell has only 1 vector unit per core (FMA capable), a dedicated 32/64-bit FPU (FMA capable), a dedicated 32/64-bit integer ALU, and a dedicated AGU (address-generation unit). And while you could potentially use both the vector and singleton units in your code, it's tough to extract any extra performance over just using the vector unit.

1 hour ago, patrickjp93 said:

No, 32 flops per cycle is 4 * 256/32 or 4*8. If you see FLOPs outside of the supercomputing world, it's for single precision. It's annoying, but it's true. Haswell has only 1 vector unit per core (FMA capable), a dedicated 32/64-bit FPU (FMA capable), a dedicated 32/64-bit integer ALU, and a dedicated AGU (address-generation unit). And while you could potentially use both the vector and singleton units in your code, it's tough to extract any extra performance over just using the vector unit.

Not sure if it was you but I vaguely recall similar talk over Prime95 previously. The author of that continues to try and extract as much performance as is available from the architecture, and if there are two FMA units, let's use two FMA units. In many cases, the limitation is not in the execution but in trying to keep them fed. Consumer CPU cache is too small and lacking in ram bandwidth. Why do my interests have to have more to do with HPC than consumer?

2 minutes ago, porina said:

Not sure if it was you but I vaguely recall similar talk over Prime95 previously. The author of that continues to try and extract as much performance as is available from the architecture, and if there are two FMA units, let's use two FMA units. In many cases, the limitation is not in the execution but in trying to keep them fed. Consumer CPU cache is too small and lacking in ram bandwidth. Why do my interests have to have more to do with HPC than consumer?

Well, then you wouldn't have 2 256-bit units. You have 1 256-bit vector unit AND 1 FPU that can do either 1 FMA 32-bit or 1 FMA 64-bit operation per clock, so multiply by 9/8 or 5/4 depending on which mode you're in. And keeping them fed isn't an issue either. You can load up to 64 bytes from cache per clock and up to 32 bytes of instructions, an entire data cache line and an entire instruction cache line. Keeping 288 bits of compute units fed isn't hard to do. To do it in a way that's useful...that's tougher.
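Taking the model in this post at face value (one 256-bit vector unit plus one scalar FPU per core, which is the point under dispute in this thread), the 9/8 and 5/4 multipliers fall out as follows. The helper name `scalar_bonus` is made up for the example.

```python
from fractions import Fraction

def scalar_bonus(vector_bits, element_bits, scalar_units=1):
    """Throughput multiplier from adding scalar units to one vector unit."""
    lanes = vector_bits // element_bits
    return Fraction(lanes + scalar_units, lanes)

sp_multiplier = scalar_bonus(256, 32)  # 8 vector lanes + 1 scalar op
dp_multiplier = scalar_bonus(256, 64)  # 4 vector lanes + 1 scalar op

print(sp_multiplier, dp_multiplier)
```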

4 hours ago, Liltrekkie said:

Linus: uh yeah intel contact, can we get some of these?

Intel contact: what will you use them for?

Linus: network render

Intel contact: ok

 

later.....

 

Linus: The new 500 gamer 1 CPU in a 1U rack....

 

 

Anytime I see stuff like this that's powerful and has a lot of cores, this pops into my head. I bet Linus has already had this discussion with his Intel contact for this product.

yeah right... the limiting factor would be the graphics card hookup actually. Xeon Phi CPUs still only come with 32 lanes, so the best you could do is shove 4 Radeon Pro Duos in and get 8 gamers, 1 CPU with 9 physical cores per player.
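The back-of-envelope numbers here work out as below. This is just the arithmetic from the post in Python, using its 32-lane figure and the 72-core count quoted earlier in the thread; the variable names are invented for the example.

```python
pcie_lanes = 32     # lane count as stated in the post
cards = 4           # Radeon Pro Duo cards
gpus_per_card = 2   # the Pro Duo is a dual-GPU card
cores = 72          # KNL core count quoted earlier in the thread

lanes_per_card = pcie_lanes // cards
players = cards * gpus_per_card       # one GPU per player
cores_per_player = cores // players

print(f"{lanes_per_card} lanes per card, {players} players, "
      f"{cores_per_player} cores per player")
```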

13 minutes ago, patrickjp93 said:

Well, then you wouldn't have 2 256-bit units. You have 1 256-bit vector unit AND 1 FPU that can do either 1 FMA 32-bit or 1 FMA 64-bit operation per clock, so multiply by 9/8 or 5/4 depending on which mode you're in. And keeping them fed isn't an issue either. You can load up to 64 bytes from cache per clock and up to 32 bytes of instructions, an entire data cache line and an entire instruction cache line. Keeping 288 bits of compute units fed isn't hard to do. To do it in a way that's useful...that's tougher.

Refer back to the two links I had earlier, both describe the presence of two 256 bit FMA capable units, one each on port 0 and port 1. Am I misunderstanding something here?

 

Also the data limits I refer to are in two parts: the cache is too small for the working data set, and then fetching things from system RAM is limiting. Crystalwell eDRAM is kinda halfway there and I'd love to see that on more CPUs, as it does help a lot on the i5-5675C I got to test, within the limits of the poor maximum clock speed of that generation and an unexplained real-world IPC lower even than Haswell in this application.

25 minutes ago, porina said:

Refer back to the two links I had earlier, both describe the presence of two 256 bit FMA capable units, one each on port 0 and port 1. Am I misunderstanding something here?

 

Also the data limits I refer to are in two parts: the cache is too small for the working data set, and then fetching things from system RAM is limiting. Crystalwell eDRAM is kinda halfway there and I'd love to see that on more CPUs, as it does help a lot on the i5-5675C I got to test, within the limits of the poor maximum clock speed of that generation and an unexplained real-world IPC lower even than Haswell in this application.

It's the same unit, accessible by two ports. Each of those execution ports obeys mutual exclusion. No two instructions may be in the same port at the same time, and no two instructions may use the same ALU at the same time. You're misreading. Notice how both of those ports also have integer capabilities, and many of the same ones. It's a way to allow the pipelines to have more flexibility.

 

http://www.realworldtech.com/haswell-cpu/4/

 

Your analysis of the cache is also incomplete. If you're doing 1 operation over a large dataset, then yes, memory streaming will be necessary, but what about a lot of transformations on a small dataset?
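One way to put numbers on that distinction is arithmetic intensity: FLOPs performed per byte fetched from RAM. The sketch below is deliberately idealised; it assumes the data is loaded from memory once and that every later pass hits cache, and the element counts and pass counts are made-up examples rather than figures from any real workload.

```python
def arithmetic_intensity(elements, passes, flops_per_element=2, bytes_per_element=8):
    """FLOPs per byte fetched from RAM.

    Idealised assumption: the data is read from memory once, and every
    pass after the first is served entirely from cache.
    """
    flops = elements * passes * flops_per_element
    bytes_from_ram = elements * bytes_per_element
    return flops / bytes_from_ram

# One FMA pass over a huge array of doubles: memory streaming dominates.
streaming = arithmetic_intensity(elements=10**9, passes=1)

# A thousand passes over a small array that stays in cache.
reused = arithmetic_intensity(elements=10**5, passes=1000)

print(f"streaming: {streaming} FLOPs/byte, reused: {reused} FLOPs/byte")
```

The higher the intensity, the less RAM bandwidth matters and the closer a kernel can get to the peak rates discussed above.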
