Intel begins to ship its KNL Xeon Phi processors

NumLock21
30 minutes ago, patrickjp93 said:

From the above link "One interesting point is that while Haswell can execute two 256-bit FMAs or FMULs per cycle, there is still only a single 256-bit FADD." which still suggests to me they're separate units. 
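To put numbers on that quote, here is a rough sketch of the peak per-core throughput it implies, taking the quoted unit counts at face value. The double-precision element width and the helper names are my own assumptions for illustration, not something from the linked article.

```python
# Back-of-envelope peak throughput per core per cycle, using the unit counts
# from the quote above. Assumption: 64-bit (double-precision) elements.

def lanes(vector_bits: int, element_bits: int = 64) -> int:
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def peak_flops_per_cycle(units: int, vector_bits: int, flops_per_lane: int,
                         element_bits: int = 64) -> int:
    """Units issuing per cycle x lanes per vector x flops each lane performs."""
    return units * lanes(vector_bits, element_bits) * flops_per_lane


# An FMA counts as 2 flops per lane (one multiply plus one add).
fma = peak_flops_per_cycle(units=2, vector_bits=256, flops_per_lane=2)
fmul = peak_flops_per_cycle(units=2, vector_bits=256, flops_per_lane=1)
fadd = peak_flops_per_cycle(units=1, vector_bits=256, flops_per_lane=1)

if __name__ == "__main__":
    print(f"FMA:  {fma} DP flops/cycle")   # 16
    print(f"FMUL: {fmul} DP flops/cycle")  # 8
    print(f"FADD: {fadd} DP flops/cycle")  # 4
```

So under those figures, code that is pure adds would top out at a quarter of the FMA peak.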

30 minutes ago, patrickjp93 said:

Your analysis of the cache is also incomplete. If you're doing 1 operation over a large dataset, then yes, memory streaming will be necessary, but what about a lot of transformations on a small dataset?

For small enough datasets, it can run within L3 cache at what I call practically RAM-unlimited performance, meaning RAM isn't holding it back. I use that condition to examine the behaviour of the architecture without worrying about RAM impact. Over at PrimeGrid there is a challenge on right now, running tasks with 256k FFT size. This translates into 2MB per task, and most people run one task per core. Now, this isn't the complete story, as that is only the FFT data and doesn't count anything else. I understand it also uses lookup tables for some things, for example, so "a bit more" is required. This came out when I looked at how performance varied with architecture after clock normalisation: 2MB/core was quite a bit better than 1.5MB/core (Skylake+Haswell). More recently I got a Xeon with 2.5MB/core, and that had a slight improvement over 2MB/core within the Haswell family. I don't have the exact numbers on me right now.
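The 256k to 2MB arithmetic can be sketched as below. This assumes each FFT element is a single 8-byte double, which is my assumption (it matches the 2MB figure but I haven't checked it against the actual software), and it only counts the FFT data, not the lookup tables or anything else. The function names are made up for the example.

```python
# Rough working-set check for one task per core.
# Assumption: 8 bytes per FFT element (double precision).
BYTES_PER_ELEMENT = 8


def fft_working_set_mib(fft_size_k: int,
                        bytes_per_element: int = BYTES_PER_ELEMENT) -> float:
    """Size of the FFT data alone in MiB, for an FFT length given in units of 1024 elements."""
    return fft_size_k * 1024 * bytes_per_element / (1024 * 1024)


def fits_in_l3_share(fft_size_k: int, l3_mib_per_core: float) -> bool:
    """True if the FFT data alone fits in one core's share of L3 (ignores other data)."""
    return fft_working_set_mib(fft_size_k) <= l3_mib_per_core


if __name__ == "__main__":
    print(f"256k FFT: {fft_working_set_mib(256):.1f} MiB per task")
    for l3 in (1.5, 2.0, 2.5):
        print(f"{l3} MiB/core: FFT data fits = {fits_in_l3_share(256, l3)}")
```

With 2MB/core the FFT data only just fits with nothing to spare, which would be consistent with the extra headroom at 2.5MB/core still helping a little.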

1 hour ago, porina said:

From the above link "One interesting point is that while Haswell can execute two 256-bit FMAs or FMULs per cycle, there is still only a single 256-bit FADD." which still suggests to me they're separate units. 

For small enough datasets, it can run within L3 cache at what I call practically RAM-unlimited performance, meaning RAM isn't holding it back. I use that condition to examine the behaviour of the architecture without worrying about RAM impact. Over at PrimeGrid there is a challenge on right now, running tasks with 256k FFT size. This translates into 2MB per task, and most people run one task per core. Now, this isn't the complete story, as that is only the FFT data and doesn't count anything else. I understand it also uses lookup tables for some things, for example, so "a bit more" is required. This came out when I looked at how performance varied with architecture after clock normalisation: 2MB/core was quite a bit better than 1.5MB/core (Skylake+Haswell). More recently I got a Xeon with 2.5MB/core, and that had a slight improvement over 2MB/core within the Haswell family. I don't have the exact numbers on me right now.

The author is making a small mistake. The core can process 2 instructions, but can only execute one of them per cycle. Or he's getting confused by the fact that an FMA is itself 2 ops.

On 6/22/2016 at 4:12 AM, sof006 said:

Jesus... That's a lot of cores. Wonder how many of them you'd need to make a 100 Petabyte machine...

 

On 6/22/2016 at 4:16 AM, FloRolf said:

Wtf is a 100 petabyte machine?

 

On 6/22/2016 at 4:45 AM, sof006 said:

Super computer

You mean petaflop. And I'm assuming your comment is in relation to the Chinese supercomputer they've just finished building? Source: http://arstechnica.com/information-technology/2013/06/chinese-supercomputer-destroys-speed-record-and-will-get-much-faster/
