Windows drops the ball on Threadripper 2 performance

rcmaehl

Source:
Phoronix

 

TL;DR:
Windows may not be making full use of the architecture of the new Threadripper 2 chips: Phoronix measured Linux running up to 35% faster.

 

Media:

[Two images from the original post, not preserved]
 

Quotes/Excerpts:

Quote

Tests were done from Microsoft Windows 10 against Clear Linux, Ubuntu 18.04, the Arch-based Antergos 18.7-Rolling, and openSUSE Tumbleweed... while maintaining the hardware's configuration and BIOS settings. Had I known how poorly Windows 10 works on current high core count NUMA environments under some workloads, I would have certainly ran more benchmarks. Threadripper 2990WX at stock speeds with the ASUS ROG ZENITH EXTREME motherboard, Cooler Master Wraith Ripper heatsink, 4 x 8GB G-SKILL DDR4-3200MHz memory, 500GB Samsung 970 EVO NVMe SSD, and Radeon RX Vega 56 graphics. All of these benchmarks were carried out in a fully-automated and reproducible environment using the open-source Phoronix Test Suite benchmarking software. Long story short, the Linux performance in a majority of these CPU-focused benchmarks were running much faster on the AMD Threadripper 2990WX than Windows 10 Pro when tested with the same hardware in the same configuration. We usually see better performance with Linux over Windows... but not always to some of the extremes encountered.

 

My thoughts:

 

Windows Pro definitely was not configured for NUMA environments. Hopefully the benchmarks for Windows Server will look better, but time will tell. In the meantime, I know at least a few people who will be glad to hear this news.




 


Windows lagging behind on modern technologies? Who woulda thunk it? 


1 minute ago, DrMacintosh said:

Windows lagging behind on modern technologies? Who woulda thunk it? 

'cept gaming. pretty much...

"If a Lobster is a fish because it moves by jumping, then a kangaroo is a bird" - Admiral Paulo de Castro Moreira da Silva

"There is nothing more difficult than fixing something that isn't all the way broken yet." - Author Unknown


6 minutes ago, bcredeur97 said:

'cept gaming. pretty much...

So, pretty much any modern computer workload.

 

Gaming is just now climbing out of the dual-thread dark ages and should not be considered modern by any stretch of the imagination. I mean... I can still play most games on 10-year-old hardware.


8 minutes ago, DrMacintosh said:

Windows lagging behind on modern technologies? Who woulda thunk it? 

Who would have thought supporting 1,000s of different devices is not the easiest thing to do?

 

While Linux might show improvements here, they should also show a plethora of other benchmarks where Linux is poorly optimized for some hardware.


2 minutes ago, mynameisjuan said:

Who would have thought supporting 1,000s of different devices is not the easiest thing to do?

 

While Linux might show improvements here, they should also show a plethora of other benchmarks where Linux is poorly optimized for some hardware.

No, that has nothing to do with the kernel and its thread scheduling.

 

Linux arguably supports more hardware than W10.  Many "legacy" devices that have been dropped from the Windows platform in terms of driver support still function properly under Linux.


20 minutes ago, rcmaehl said:

Windows Pro definitely was not configured for NUMA environments. Hopefully the benchmarks for Windows Server will look better, but time will tell.

Windows 10 Pro supports multiple physical processors, which suggests it's NUMA aware (https://answers.microsoft.com/en-us/windows/forum/windows_10/windows-10-versions-cpu-limits/905c24ad-ad54-4122-b730-b9e7519c823f?auth=1). However, since Threadripper is for all intents and purposes a single processor, Windows may think the system itself isn't NUMA enabled.

 

And call me fanboyish or whatever, but Phoronix is a Linux-biased website, so I wouldn't put it past them to not dive deep into Windows performance. Unless they do when it matters, in which case I stand corrected.


3 minutes ago, KarathKasun said:

No, that has nothing to do with the kernel and its thread scheduling.

 

Linux arguably supports more hardware than W10.  Many "legacy" devices that have been dropped from the Windows platform in terms of driver support still function properly under Linux.

Where the fuck did I mention anything about the kernel and scheduling? Couldn't care less about legacy systems.


1 minute ago, M.Yurizaki said:

Windows 10 Pro supports multiple physical processors, which suggests it's NUMA aware (https://answers.microsoft.com/en-us/windows/forum/windows_10/windows-10-versions-cpu-limits/905c24ad-ad54-4122-b730-b9e7519c823f?auth=1). However, since Threadripper is for all intents and purposes a single processor, Windows may think the system itself isn't NUMA enabled.

 

And call me fanboyish or whatever, but Phoronix is a Linux-biased website, so I wouldn't put it past them to not dive deep into Windows performance. Unless they do when it matters, in which case I stand corrected.

They most definitely have done that. If you look at their history of comparisons, CPU performance is better in Linux while GPU performance is better in Windows.

I've used both platforms for development for a looooong time and can corroborate this phenomenon.


[Level1ing Intensifies]


11 minutes ago, mynameisjuan said:

Where the fuck did I mention anything about the kernel and scheduling? Couldn't care less about legacy systems.

That is why Windows is slower with high-thread-count setups: it sucks at thread scheduling.

 

In case you missed my point, it has nothing to do with supporting "1,000s of devices".


11 minutes ago, KarathKasun said:

That is why Windows is slower with high-thread-count setups: it sucks at thread scheduling.

Then I'd like to see macOS vs. Linux in the same tests, or as close to the same as possible, since macOS and Windows both use a priority-based multi-level feedback queue for scheduling. That would show whether it's the design of the scheduling algorithm itself or the implementation of it.
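
For anyone unfamiliar with the term, here is a minimal Python sketch of a priority-based multi-level feedback queue. It is a toy model on a single CPU, not how Windows or macOS actually implement it: the three levels, the time quanta, and the names `Task` and `run_mlfq` are all made up for illustration, and real schedulers also boost priorities back up, which this leaves out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    remaining: int  # units of CPU time still needed (hypothetical workload)


def run_mlfq(tasks, quanta=(2, 4, 8)):
    """Simulate a toy multi-level feedback queue on one CPU.

    New tasks start in the highest-priority queue (index 0). A task that
    uses its whole quantum without finishing is demoted one level, so
    long-running CPU hogs sink while short tasks finish quickly.
    Returns the task names in completion order.
    """
    queues = [deque() for _ in quanta]
    for task in tasks:
        queues[0].append(task)

    finished = []
    while any(queues):
        # Always serve the highest-priority non-empty queue first.
        level = next(i for i, q in enumerate(queues) if q)
        task = queues[level].popleft()
        task.remaining -= min(quanta[level], task.remaining)
        if task.remaining == 0:
            finished.append(task.name)
        else:
            # Quantum used up: demote (the lowest level acts as round-robin).
            queues[min(level + 1, len(queues) - 1)].append(task)
    return finished


order = run_mlfq([Task("long", 10), Task("short", 1), Task("medium", 5)])
print(order)
```

The short task finishes first even though it was queued behind the long one, which is the whole point of the feedback part: the scheduler learns from observed behavior which tasks are CPU-bound.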


7 minutes ago, M.Yurizaki said:

Then I'd like to see macOS vs. Linux in the same tests, or as close to the same as possible, since macOS and Windows both use a priority-based multi-level feedback queue for scheduling. That would show whether it's the design of the scheduling algorithm itself or the implementation of it.

I think it's more related to the way Windows handles context switching, combined with the fact that Windows is much more likely to bounce threads around to different cores. AFAIK OSX is right there with Linux as far as scalability goes; the Mach microkernel was much more efficient than Windows on those metrics last I saw it compared. Granted, it has been a hot minute since I dug that deeply into the differences between the three platforms.

 

The bonus with Linux is that you can tune the threading mechanisms in the kernel yourself.
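
Short of touching the kernel, one user-space way to stop a process being bounced between cores on Linux is to pin it with CPU affinity. A minimal sketch, assuming a Linux system where Python exposes `os.sched_setaffinity`; the helper name `pin_current_process` is hypothetical, and which cores are worth pinning to depends on the machine.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to the given CPU ids (Linux only).

    Returns the affinity set actually in effect afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("os.sched_setaffinity is not available on this platform")
    os.sched_setaffinity(0, set(cpus))  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)   # CPUs we are currently allowed on
    first = min(allowed)
    print(pin_current_process([first]))  # now limited to a single core
    os.sched_setaffinity(0, allowed)     # restore the original mask
```

This only constrains where the scheduler may place the process. It does not change the scheduling policy itself, which is the part you would need kernel tunables or patches for.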


Chances are this is just a compatibility issue that can be fixed with a simple patch. I love how people take new hardware on a system as any type of performance metric. Sure, atm there is a compatibility issue, but in a few weeks that will likely be gone.


3 minutes ago, KarathKasun said:

I think it's more related to the way Windows handles context switching, combined with the fact that Windows is much more likely to bounce threads around to different cores. AFAIK OSX is right there with Linux as far as scalability goes; the Mach microkernel was much more efficient than Windows on those metrics last I saw it compared. Granted, it has been a hot minute since I dug that deeply into the differences between the three platforms.

 

The bonus with Linux is that you can tune the threading mechanisms in the kernel yourself.

At which point, you hotpatch the kernel so it can make special considerations when running on Threadripper.


51 minutes ago, M.Yurizaki said:

At which point, you hotpatch the kernel so it can make special considerations when running on Threadripper.

You can do similar things on Windows as well, but not to the same degree. Then again, if TR2 is ahead in Windows (with the accompanying ~30% penalty) compared to Intel's current lineup, how many data centers (which almost exclusively run some *nix flavor) do you think are going to make the change over to AMD? How about people running small tech businesses that need one server to handle all tasks for a reasonable price (hello, reasonable VM use case for the low-end server market)?

 

Windows is a tiny slice of the market that the WX CPUs are targeted at, and Linux has been better at the jobs in that market (from a performance standpoint) for the past two decades anyway.


2 minutes ago, KarathKasun said:

You can do similar things on Windows as well, but not to the same degree. Then again, if TR2 is ahead in Windows (with the accompanying ~30% penalty) compared to Intel's current lineup, how many data centers (which almost exclusively run some *nix flavor) do you think are going to make the change over to AMD? How about people running small tech businesses that need one server to handle all tasks for a reasonable price (hello, reasonable VM use case for the low-end server market)?

 

Windows is a tiny slice of the market that the WX CPUs are targeted at, and Linux has been better at the jobs in that market (from a performance standpoint) for the past two decades anyway.

Hotpatching is up to AMD to do though, with support from Microsoft. They did this when Bulldozer came out to address its supposed shortcomings.


4 minutes ago, M.Yurizaki said:

Hotpatching is up to AMD to do though, with support from Microsoft. They did this when Bulldozer came out to address its supposed shortcomings.

I would say that it's up to MS at this point. MS has been making NUMA-aware kernels since Windows Server 2003. All multi-socket Opteron setups from the day they were introduced are NUMA (and there were configurations including a CPU socket w/o memory). Same goes for Nehalem-based Xeons and newer. If MS hasn't gotten NUMA figured out at this point, they deserve to be kicked to the side of the road in the HPC and server space.


3 minutes ago, KarathKasun said:

I would say that it's up to MS at this point. MS has been making NUMA-aware kernels since Windows Server 2003. All multi-socket Opteron setups from the day they were introduced are NUMA. Same goes for Nehalem-based Xeons and newer. If MS hasn't gotten NUMA figured out at this point, they deserve to be kicked to the side of the road in the HPC and server space.

The problem here, and I'm making guesses, is that Threadripper reports itself as a single processor with 32 cores and 64 threads. If it doesn't report anything else, Windows is going to treat it like a single processor, regardless of what goes on in the background. If the CPU is hiding details from the OS, then how do you expect the OS to know how to treat the CPU properly from the get-go? It also can't report itself as two or more separate processors, because then Threadripper wouldn't work on Home editions of Windows (limited to one physical processor). Threadripper is a unique beast as far as system configuration is concerned.

 

And with regards to thread scheduling, since I was stewing on this while getting lunch: Microsoft's approach may work for the use cases it expects its customers to use the OS for, but it may not be the best for other use cases. It may make sense to re-schedule a thread on the same core, but what if the thread is ready and some other thread has that core? Do you wait for that other thread's CPU time to expire and take a hit on the original thread's performance, or do you let the thread run on the next available processor?
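
The wait-or-migrate question can be framed as a simple cost comparison. Below is a rough Python sketch under made-up assumptions: the function name `pick_core`, the idea of a known wait estimate, and both penalty values are hypothetical, and no real scheduler is claimed to work this way. It only shows why a NUMA-unaware scheduler can pick badly: if every migration looks equally cheap, hopping to a core on another node looks as good as staying close.

```python
def pick_core(home_core, idle_cores, node_of, wait_us,
              same_node_cost_us=5.0, cross_node_cost_us=50.0):
    """Choose where a ready thread should run next.

    home_core  -- core the thread last ran on (its caches are warm there)
    idle_cores -- cores that could run the thread right now
    node_of    -- dict mapping core id -> NUMA node id
    wait_us    -- estimated time until home_core becomes free
    The two *_cost_us values are made-up migration penalties.
    """
    if home_core in idle_cores:
        return home_core  # home core is free: no wait, no migration

    # Option 1: stay put and wait for the home core.
    best_core, best_cost = home_core, wait_us

    # Option 2: migrate to whichever idle core is cheapest to move to.
    for core in idle_cores:
        same_node = node_of[core] == node_of[home_core]
        cost = same_node_cost_us if same_node else cross_node_cost_us
        if cost < best_cost:
            best_core, best_cost = core, cost
    return best_core


# Hypothetical layout: cores 0 and 1 on node 0, core 8 on node 1.
node_of = {0: 0, 1: 0, 8: 1}
print(pick_core(0, [1, 8], node_of, wait_us=20))  # same-node neighbor wins
print(pick_core(0, [8], node_of, wait_us=20))     # waiting beats a cross-node hop
```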


If I'm not mistaken, AMD has said multiple times in the past that there isn't a scheduling problem on Windows, so I'm pretty sure they'll say the same this time.


1 minute ago, M.Yurizaki said:

The problem here, and I'm making guesses, is that Threadripper reports itself as a single processor with 32 cores and 64 threads. If it doesn't report anything else, Windows is going to treat it like a single processor, regardless of what goes on in the background. If the CPU is hiding details from the OS, then how do you expect the OS to know how to treat the CPU properly from the get-go? It also can't report itself as two or more separate processors, because then Threadripper wouldn't work on Home editions of Windows (limited to one physical processor). Threadripper is a unique beast as far as system configuration is concerned.

 

And with regards to thread scheduling, since I was stewing on this while getting lunch: Microsoft's approach may work for the use cases it expects its customers to use the OS for, but it may not be the best for other use cases. It may make sense to re-schedule a thread on the same core, but what if the thread is ready and some other thread has that core? Do you wait for that other thread's CPU time to expire and take a hit on the original thread's performance, or do you let the thread run on the next available processor?

The OS knows the logical layout of the CPU. NUMA is not a new thing, and mechanisms for reporting this have been around for a long time now.
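
As an illustration of how that layout surfaces on the Linux side, the kernel exposes each NUMA node and its CPUs under `/sys/devices/system/node`. A small Python sketch that reads it; the function names `parse_cpulist` and `read_numa_nodes` are my own, and what it prints depends entirely on the machine and its BIOS memory mode.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs CPU list such as "0-3,8-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means the node has no CPUs
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_nodes(root="/sys/devices/system/node"):
    """Map NUMA node id -> CPU ids, as exposed by the Linux kernel in sysfs.

    Returns an empty dict if the directory does not exist (non-Linux, or a
    kernel built without NUMA support).
    """
    nodes = {}
    for node_dir in Path(root).glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            nodes[int(node_dir.name[4:])] = parse_cpulist(cpulist.read_text())
    return nodes


for node_id, cpus in sorted(read_numa_nodes().items()):
    print(f"node {node_id}: {len(cpus)} CPUs")
```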

 

For the thread scheduling... On many-core platforms (more than 4c) there should be a dynamic pinning mechanism, which sort of exists in Windows already, in a very basic way, in the core parking mechanism. I'm sure that it is also used at some point in keeping threads within a CCX, module, or local group (see Intel mesh and ring topologies).

 

I suppose we will see if MS patches their kernel with a TR2-specific scheduler faster than they patched for BD. xD


2 hours ago, DrMacintosh said:

Windows lagging behind on modern technologies? Who woulda thunk it? 

There is a problem in your argument.

It ain't no modern technology. It's old technology resurrected from the dead.

 

Proof:

http://hw-museum.cz/mb/11/msi-k8t-master2-far7

"Hell is full of good meanings, but Heaven is full of good works"


8 minutes ago, Stefan Payne said:

There is a problem in your argument.

It ain't no modern technology. It's old technology resurrected from the dead.

 

Proof:

http://hw-museum.cz/mb/11/msi-k8t-master2-far7

No, it's still in use.

 

https://www.newegg.com/Product/Product.aspx?Item=N82E16813119062&ignorebbr=1&nm_mc=KNC-GoogleAdwords-PC&cm_mmc=KNC-GoogleAdwords-PC-_-pla-_-Motherboards+-+Server-_-N82E16813119062&gclid=CjwKCAjw-8nbBRBnEiwAqWt1zVc1mDT83a07jhHaCxAQpFmT3Jc9vLmpt7nKAk5H9d4ME3xYV-ULYxoCiXQQAvD_BwE&gclsrc=aw.ds

 

Any dual CPU board from the last 10 years is a NUMA config because of the memory controller being integrated into the CPU.  Though the memory-less node setup is not officially supported.


28 minutes ago, The Benjamins said:

This is a repost of 

 

 

It would be nice if a patch came along making things better for the 2990WX.

Don't mind thread merging but

A) Their thread doesn't mention their opinion on the matter
B) My thread isn't shoving the entire article down people's throats, just provides a well trimmed summary


This topic is now closed to further replies.
