
A REAL 64 Core CPU - For SCIENCE!

You can use Phi with any x86/x64 code with minimal additional work for thread parallelization.  It's significantly more of a pain (so I've been told) to write code for CUDA, especially if whatever you're starting with was written for a CPU.  It wouldn't surprise me if you can download an Intel compiler that will auto-Phi-ize your code for you, too.

 

If whatever task you're trying to do already supports a GPU, then use a GPU.  Otherwise, that's where Phi comes in.

Workstation:  14700nonk || Asus Z790 ProArt Creator || MSI Gaming Trio 4090 Shunt || Crucial Pro Overclocking 32GB @ 5600 || Corsair AX1600i@240V || whole-house loop.

LANRig/GuestGamingBox: 9900nonK || Gigabyte Z390 Master || ASUS TUF 3090 650W shunt || Corsair SF600 || CPU+GPU watercooled 280 rad pull only || whole-house loop.

Server Router (Untangle): 13600k @ Stock || ASRock Z690 ITX || All 10Gbe || 2x8GB 3200 || PicoPSU 150W 24pin + AX1200i on CPU|| whole-house loop

Server Compute/Storage: 10850K @ 5.1Ghz || Gigabyte Z490 Ultra || EVGA FTW3 3090 1000W || LSI 9280i-24 port || 4TB Samsung 860 Evo, 5x10TB Seagate Enterprise Raid 6, 4x8TB Seagate Archive Backup ||  whole-house loop.

Laptop: HP Elitebook 840 G8 (Intel 1185G7) + 3080Ti Thunderbolt Dock, Razer Blade Stealth 13" 2017 (Intel 8550U)


2 hours ago, AnonymousGuy said:

You can use Phi with any x86/x64 code with minimal additional work for thread parallelization.  It's significantly more of a pain (so I've been told) to write code for CUDA, especially if whatever you're starting with was written for a CPU.  It wouldn't surprise me if you can download an Intel compiler that will auto-Phi-ize your code for you, too.

 

If whatever task you're trying to do already supports a GPU, then use a GPU.  Otherwise, that's where Phi comes in.

Not to mention that CUDA is limited to one vendor's platform, and each architecture has its own limits too, so newer CUDA code written for Pascal may not work on Fermi.


With x86 the compatibility is huge, so it's not just the Xeon Phi but any other x86 CPU too.

 

I'm curious about OpenCL support for the Xeon Phi.


14 hours ago, SiverJohn17 said:

So fun fact: I am actually the customer for products like these. And I straight up don't care about Xeon Phis. They were interesting a while ago, but from my perspective there is no reason to use one of those chips over a standard GPU. In fact my lab uses off-the-shelf Titan XPs for most of our simulations. For most of our purposes these cards are the best options, with only the new Tesla V100s being more powerful. Standard disclaimer: our simulations only utilize single-precision floating point, so pure FLOPS are fine. That being said, I (being the odd man in our lab) want to get a V100 for a personal project.

TL;DR even for HPC applications Xeon Phis are mainly irrelevant.

I thought for raw simulation compute it would be more beneficial (at single precision) to run a Radeon Pro Duo or similar products, given their absurdly high compute performance... or is your software suite CUDA-only?

 

I know AMD is fairly popular in geological and petrochemical simulation, but there you need FP64 for a lot of the work (billions of dollars potentially lost if you get it wrong)

 


"For Science" ;) When I was a kid, I persuaded my parents to buy me almost a high-end PC for "education" ;)


15 hours ago, brandishwar said:

 

snip

Very interesting, I see your point about use in a cluster and for driving GPU-based coprocessors. I just believe that the whole Phi line is kinda outdated now, because its basic idea was to be run as a coprocessor on an add-in card. Nowadays, whenever someone has a task that can be highly parallelized, I believe they would always go for CUDA optimization etc. instead of developing code for a Phi, because one way or another the code still has to be adjusted to run properly, and Nvidia especially is doing everything they can to make GPU crunching simpler and easier to implement.

Folding stats

Vigilo Confido

 


2 hours ago, Prysin said:

I thought for raw simulation compute it would be more beneficial (at single precision) to run a Radeon Pro Duo or similar products, given their absurdly high compute performance... or is your software suite CUDA-only?

 

I know AMD is fairly popular in geological and petrochemical simulation, but there you need FP64 for a lot of the work (billions of dollars potentially lost if you get it wrong)

 

So, two problems with that. One, you are correct: the suite is CUDA-only. It's an out-of-house software solution, though even if it were in-house I'd probably develop on CUDA for some of the extra luxuries, and (correct me if I'm wrong, because it has been a few years since I've looked into it) even on OpenCL you still have to code for the particular GPU architecture you want to use. The second problem, and personally more aggravating to me, is that the code can only be run on a single GPU, so this would get us less performance than a Titan XP for more cost.

The award fits...


6 hours ago, AnonymousGuy said:

You can use Phi with any x86/x64 code with minimal additional work for thread parallelization.  It's significantly more of a pain (so I've been told) to write code for CUDA, especially if whatever you're starting with was written for a CPU.  It wouldn't surprise me if you can download an Intel compiler that will auto-Phi-ize your code for you, too.

 

If whatever task you're trying to do already supports a GPU, then use a GPU.  Otherwise, that's where Phi comes in.

It depends on how your initial code is written, because thread parallelization isn't how you squeeze most of the performance out of a Phi. You also have to take vectorization into consideration, and if your code isn't written with that in mind, it can be just as much of a pain as rewriting parts in CUDA. Granted, if you've written your code to be nice and modular, it shouldn't be as terrible.

 

3 hours ago, System Error Message said:

Not to mention that CUDA is limited to one vendor's platform, and each architecture has its own limits too, so newer CUDA code written for Pascal may not work on Fermi.


With x86 the compatibility is huge, so it's not just the Xeon Phi but any other x86 CPU too.

 

I'm curious about OpenCL support for the Xeon Phi.

The same is true for the Phi. If you noticed, in the video Linus talked about the new AVX-512 instruction set: if you write code optimized for it, that code won't run on older CPUs (it will of course work on the other AVX-512 Xeons, though I don't know about AMD support). However, basic code on either platform will work on any other; basic SAXPY code written for a Fermi card will work on Pascal. The trouble begins when you get architecture-specific, but since these chips were mainly built for the world of HPC, leaving that off the table isn't an option. So that'll always be a problem no matter what you use.


Just now, SiverJohn17 said:

It depends on how your initial code is written, because thread parallelization isn't how you squeeze most of the performance out of a Phi. You also have to take vectorization into consideration, and if your code isn't written with that in mind, it can be just as much of a pain as rewriting parts in CUDA. Granted, if you've written your code to be nice and modular, it shouldn't be as terrible.

 

The same is true for the Phi. If you noticed, in the video Linus talked about the new AVX-512 instruction set: if you write code optimized for it, that code won't run on older CPUs (it will of course work on the other AVX-512 Xeons, though I don't know about AMD support). However, basic code on either platform will work on any other; basic SAXPY code written for a Fermi card will work on Pascal. The trouble begins when you get architecture-specific, but since these chips were mainly built for the world of HPC, leaving that off the table isn't an option. So that'll always be a problem no matter what you use.

Still, OpenCL will run on all of them; CUDA runs only on Nvidia, and C++ would require a recompilation (and some tweaks) per platform.

 

However, everyone here is missing the point. This isn't about comparing the numbers to a GPU. Did you see how there weren't any GPU benchmarks for some workload sizes? This is where having a Xeon Phi helps: in cases where a GPU would be absolutely terrible at the work, or not able to run it at all, the Xeon Phi can.

 

In Blender, for instance, setting smaller tile sizes makes the CPU render faster than the GPU, whereas the GPU renders much faster if each tile is bigger.


2 hours ago, Nicnac said:

I just believe that the whole Phi line is kinda outdated now, because its basic idea was to be run as a coprocessor on an add-in card.

This is likely why they changed it to be a primary processor instead of running on a daughter card. It eliminates the need for a separate mainboard and processor capable of supporting the daughter card (which typically meant expensive workstation and server mainboards), and it makes all of the cores and threads immediately accessible to the operating system without having to talk to a device across the PCI-Express bus.

Wife's build: Amethyst - Ryzen 9 3900X, 32GB G.Skill Ripjaws V DDR4-3200, ASUS Prime X570-P, EVGA RTX 3080 FTW3 12GB, Corsair Obsidian 750D, Corsair RM1000 (yellow label)

My build: Mira - Ryzen 7 3700X, 32GB EVGA DDR4-3200, ASUS Prime X470-PRO, EVGA RTX 3070 XC3, beQuiet Dark Base 900, EVGA 1000 G6


3 hours ago, System Error Message said:

 

In Blender, for instance, setting smaller tile sizes makes the CPU render faster than the GPU, whereas the GPU renders much faster if each tile is bigger.

Can you explain why tile size matters?  Why not just use a giant tile for everything?


11 hours ago, AnonymousGuy said:

Can you explain why tile size matters?  Why not just use a giant tile for everything?

The CPU is better at smaller tile sizes than larger ones; it's the opposite for GPUs.

 

Not sure how to explain it; it's a common issue. If you run the car render benchmark, it will show poor results for the GPU vs the CPU because it is set to a small tile size. An AMD GPU takes less of a hit using OpenCL and small tile sizes, whereas Nvidia CUDA will show the GPU at max utilization but perform poorly.

 

This is another strike against CUDA; CUDA isn't as good as many think, and again this makes the Xeon Phi relevant. There's no case for going with Nvidia just because of price and CUDA, and I would strongly advise against GeForce for computational work, because a lot of compute features are removed from GeForce and kept only on the Tesla line: things like full-duplex buses, cache, and more. Nvidia wants you to use CUDA but wants you to pay thousands for their Tesla, whereas with x86 it's simply up to you which CPU you want and how much you want to pay. Even if AVX-512 isn't mainstream yet, the SSE (+AVX) instruction sets are all about working with large datasets faster, and with the new Xeon Phi you can add as much RAM as you want.


On 31/10/2017 at 9:06 PM, SiverJohn17 said:

So fun fact: I am actually the customer for products like these. And I straight up don't care about Xeon Phis. They were interesting a while ago, but from my perspective there is no reason to use one of those chips over a standard GPU. In fact my lab uses off-the-shelf Titan XPs for most of our simulations. For most of our purposes these cards are the best options, with only the new Tesla V100s being more powerful. Standard disclaimer: our simulations only utilize single-precision floating point, so pure FLOPS are fine. That being said, I (being the odd man in our lab) want to get a V100 for a personal project.

TL;DR even for HPC applications Xeon Phis are mainly irrelevant.

What do you do? 

Bleigh!  Ever hear of AC series? 


1 hour ago, Nup said:

What do you do? 

I am actually just a first-year PhD student doing a rotation in a computational chemistry group. However, I have done some self-study on this stuff over the years, as I have kept a constant interest in hardware. So much so that even though I am the newest member of the lab, I have become the de facto tech guru. It's a fun job, though my rotation is ending this week.

 

Edited: For derp.


6 hours ago, SiverJohn17 said:

I am actually just a first-year PhD student doing a rotation in a computational chemistry group. However, I have done some self-study on this stuff over the years, as I have kept a constant interest in hardware. So much so that even though I am the newest member of the lab, I have become the de facto tech guru. It's a fun job, though my rotation is ending this week.

 

Edited: For derp.

Sounds like a good start! It's a sector that's really growing, as far as I'm aware. Computational chemistry is really somewhere I'd like to end up; I'm in the last year of an undergraduate chemistry course and having difficulty choosing which direction to go in.


On 11/2/2017 at 7:08 PM, Nup said:

Sounds like a good start! It's a sector that's really growing, as far as I'm aware. Computational chemistry is really somewhere I'd like to end up; I'm in the last year of an undergraduate chemistry course and having difficulty choosing which direction to go in.

Sorry I missed this. Interestingly my weakest subject is probably chemistry. My undergraduate was in physics and biology. But yeah, I know what you mean. I thankfully knew where I was going from the start of undergraduate.


It should be pointed out that the AVX-512 the Xeon Phi supports is not exactly the same as the AVX-512 support in Skylake-X.  There is indeed a common subset of AVX-512 instructions supported on both (AVX-512 Foundation), but each chip has some exclusive AVX-512 instructions too. This mess should get sorted out in the coming generation of chips (Cannon Lake, Knights Hill, etc.).

 

What memory settings were used for benchmarking?  There wasn't any mention of the 16 GB of HMC that the Xeon Phi has in that video.  For things that love memory bandwidth, this system should be able to set some world records by setting up the HMC as main memory and removing the DDR4 DIMMs.  That provides four times as much bandwidth as the six-channel DDR4 memory and operates at a lower latency too.  The catch is that the HMC is stuck at 16 GB capacity in the socket, but that's plenty for some of the gaming and desktop benchmarks performed here.  Yep, this system can run without any DIMMs installed if configured right.  For those ever wondering what it would be like to use HBM as main memory, this should provide similar numbers to get a good general idea.

 

If the HMC is set up to extend DDR4 memory, applications really need to be NUMA-aware.  While there is only one processor die in that package, internally there are six memory controllers.  Each of the HMC stacks can be seen as a NUMA node if desired, and the six DDR4 channels can be configured as an additional two memory nodes.  At a minimum, there will still be two nodes, split between the HMC and DDR4 memory types.

 

The HMC can be set up as a cache if an application is more bandwidth-sensitive than latency-sensitive.  There is of course overhead involved in transferring these large amounts of data.

 

The other thing is that there needs to be significant tuning to see performance gains, especially beyond 64 concurrent threads per application.  See https://msdn.microsoft.com/en-us/library/windows/hardware/dn653313(v=vs.85).aspx for reference.  Some of these benchmarks would need to explicitly support scaling beyond 64 threads to show the full performance gains one would expect.  Alternatively, four instances of the benchmarks could be run concurrently to see if there is any slowdown.  I strongly suspect that the multithreaded testing done here isn't tapping the full potential of this hardware (single-threaded performance should be pretty spot-on, though).  Another possible test would be to disable the 4-way SMT (aka Hyper-Threading) to bring the logical processor count down to 64, which fits into a single processor group in Windows.

 

Also not mentioned is that the board in that Supermicro workstation supports the 100 Gbit Omni-Path networking that some Xeon Phis offer on-package.

