AMD speaks on W10 scheduler and Ryzen

12 minutes ago, leadeater said:

Basically the CCXs are paired with a memory controller and I/O silicon logic. That means that while they can put any number of CCXs they wish (maybe in pairs?) on a die, they still need the CCX interconnects plus the memory controller/PCIe that go with them. Cut the die and you cut the memory controller, which means a broken, non-functional die.

Aren't all of those things part of the same die, though? That's what it looks like to me when looking at the die shot above: both CCXs and the other logic (memory controller etc.) are on the same die.

 

18 minutes ago, leadeater said:

 

That's where it starts to get really complex and heads further into more unknowns:

  • Can a single CCX be properly connected to the memory controller or is a minimum of two required?
  • Can a single CCX die actually be made smaller in physical area? Can the memory controller etc be rearranged?
  • Is the potential sales volume of the 4 core SKUs worth investing in a dedicated die design?

That's where my uncertainty about the 4 core version comes from as well. To me, judging by the die shot, you need two CCXs for a fully functional chip, but that would mean having 4 cores disabled on a 4 core SKU, which seems wasteful (unless yields are quite poor).


5 minutes ago, LAwLz said:

Aren't all of those things part of the same die, though? That's what it looks like to me when looking at the die shot above: both CCXs and the other logic (memory controller etc.) are on the same die.

Correct, any design using a different number of CCXs would be a totally different die.

 

Edit:

The CCX approach just makes it easier to design different dies: simplicity through commonality/duplication. Adding more cores doesn't lead to a drastic architecture change like it normally would. It's a bit like playing Tetris, but with CCXs, memory controllers, and I/O.


7 minutes ago, leadeater said:

Correct, any design using a different number of CCXs would be a totally different die.

 

Edit:

The CCX approach just makes it easier to design different dies: simplicity through commonality/duplication. Adding more cores doesn't lead to a drastic architecture change like it normally would. It's a bit like playing Tetris, but with CCXs, memory controllers, and I/O.

It's kinda like the CU system that AMD had with their APUs then, I guess?


On 3/13/2017 at 6:48 PM, tlink said:

Screaming Bulldozer is as much of a good argument as screaming Hitler is.

I consider Bulldozer and invading Russia mistakes of similar scale. 



3 hours ago, leadeater said:

For the above point about TDP: someone needs to go look up what TDP actually means, because it is not the power draw of the CPU. The 4 core SKU and the 6 core SKU having the same TDP in no way indicates the CCX makeup.

For this type of comparison, when the arch is identical and assuming the TDP is measured by loading the cores in the exact same way, estimating TDP that way is a safe bet.

There are other factors to consider, like core clocks.

For comparison, look at the i3 6100 vs the i5 6500: 51W vs 65W; the i3 is clocked higher.
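As a rough illustration of how much clocks muddy that comparison, here is a back-of-the-envelope sketch fitting those two published TDPs to a naive fixed-plus-per-core-GHz model. This model is an assumption for illustration only; it ignores voltage binning, turbo, and uncore/IGP differences, and it is not how Intel actually derives TDP.

```python
# Crude model (an assumption, not Intel methodology):
#   TDP = fixed_watts + k * cores * base_clock_GHz
# Fit it to the two published data points quoted above.
i3_6100 = {"cores": 2, "ghz": 3.7, "tdp": 51}
i5_6500 = {"cores": 4, "ghz": 3.2, "tdp": 65}

cg_i3 = i3_6100["cores"] * i3_6100["ghz"]   # 7.4 core-GHz
cg_i5 = i5_6500["cores"] * i5_6500["ghz"]   # 12.8 core-GHz

# Two equations, two unknowns: solve for k and the fixed term.
k = (i5_6500["tdp"] - i3_6100["tdp"]) / (cg_i5 - cg_i3)
fixed = i3_6100["tdp"] - k * cg_i3

print(f"~{k:.1f} W per core-GHz, ~{fixed:.0f} W fixed")   # ~2.6 W and ~32 W
```

Under this toy model a large share of each chip's TDP is a fixed term that has nothing to do with core count or clocks, which is why same-TDP SKUs can only give a ballpark hint about the core/CCX makeup.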


@Tomsen @leadeater

Behardware has benchmarked a 2+2 vs a 4+0 configuration to determine how much of an impact the interconnect has.

 

Conclusion: next to no impact at all. So if this is any indication, an update to the Windows scheduler that avoids moving threads between CCXs would not make a difference.

Assuming that Windows' scheduler became 100% flawless in allocating threads to specific CCXs, we're looking at an average (not counting the abnormality that is BF1) of ~2% performance increase. That might as well be within margin of error.

(Funnily enough, that's the same performance increase the "legendary" Bulldozer optimization patch brought us.)

 

So once again, the idea that Microsoft is to blame, and that Windows 10 just isn't "optimized" for Ryzen, has been debunked.
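For anyone who wants to sanity-check that average, here is a minimal sketch of the calculation. The per-game deltas below are hypothetical placeholders, not the actual Behardware figures; substitute the real 4+0 vs 2+2 results from the article.

```python
# Hypothetical placeholder deltas (percent gained by 4+0 over 2+2);
# replace these with the real numbers from the Behardware article.
gains = {
    "BF1": 20.0,          # the outlier being excluded below
    "Project Cars": 5.0,
    "Civilization VI": 4.0,
    "Game D": 2.0,
    "Game E": 1.0,
    "Game F": 0.5,
}

def average_gain(results, exclude=()):
    """Mean percentage gain, optionally dropping outlier titles."""
    kept = [v for title, v in results.items() if title not in exclude]
    return sum(kept) / len(kept)

print(f"All titles:    {average_gain(gains):.2f}%")
print(f"Without BF1:   {average_gain(gains, exclude=('BF1',)):.2f}%")
```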

 

[Attached image: Capture.PNG, the Behardware 4+0 vs 2+2 benchmark results]


6 minutes ago, zMeul said:

For this type of comparison, when the arch is identical and assuming the TDP is measured by loading the cores in the exact same way, estimating TDP that way is a safe bet.

There are other factors to consider, like core clocks.

For comparison, look at the i3 6100 vs the i5 6500: 51W vs 65W; the i3 is clocked higher.

I agree, taking an entire CCX away should produce a significant reduction in power draw, which should be reflected in the TDP figure. The only thing to be careful of is that TDP is thermal design power: it is there to advise on the recommended cooling solution. Having said that, if AMD had a chance to state a lower TDP, they would have done so.


3 minutes ago, LAwLz said:

@Tomsen @leadeater

abnormality that is BF1) of ~2% performance increase. That might as well be within margin of error.

(Funnily enough, that's the same performance increase the "legendary" Bulldozer optimization patch brought us.)

 

 

It's ~3-7% for gaming if you only look at gaming results, and ignore BF1.

 

That's a nice little increase in some games, and one many would gladly accept, as it's the IPC difference between Haswell and Broadwell. :P



4 minutes ago, LAwLz said:

@Tomsen @leadeater

Behardware has benchmarked a 2+2 vs a 4+0 configuration to determine how much of an impact the interconnector has.

That likely has a lot to do with the L3 cache being a victim cache and not a fully inclusive cache like Intel's L3. I suspect there is little to almost no data movement between CCX L3 caches at all.


1 minute ago, leadeater said:

I agree, taking an entire CCX away should produce a significant reduction in power draw, which should be reflected in the TDP figure. The only thing to be careful of is that TDP is thermal design power: it is there to advise on the recommended cooling solution. Having said that, if AMD had a chance to state a lower TDP, they would have done so.

I'm assuming AMD does things in a similar fashion to how Intel does it and doesn't just throw a sticker on it.

Furthermore, Intel has two thermal specifications:

  • TDP - Thermal Design Power (TDP) represents the average power, in watts, the processor dissipates when operating at Base Frequency with all cores active under an Intel-defined, high-complexity workload
  • TSS - Intel Reference Heat Sink specification for proper operation of this SKU.

the i7 6700K has a TDP of 91W, but the TSS is 130W


4 minutes ago, Valentyn said:

It's ~3-7% for gaming if you only look at gaming results, and ignore BF1.

the avg is 2.33% - that's within margin of error -_-


1 minute ago, zMeul said:

I'm assuming AMD does things in a similar fashion to how Intel does it and doesn't just throw a sticker on it.

Furthermore, Intel has two thermal specifications:

  • TDP - Thermal Design Power (TDP) represents the average power, in watts, the processor dissipates when operating at Base Frequency with all cores active under an Intel-defined, high-complexity workload
  • TSS - Intel Reference Heat Sink specification for proper operation of this SKU.

the i7 6700K has a TDP of 91W, but the TSS is 130W

Well, it was Intel that changed the meaning of TDP and made up TSS, but hey, it doesn't really matter what it's called as long as we know what it represents and are comparing the same things between vendors.

 

As far as I'm aware, AMD uses TDP the way Intel uses TSS.

 

Quote

The thermal design power (TDP), sometimes called thermal design point, is the maximum amount of heat generated by a computer chip or component (often the CPU or GPU) that the cooling system in a computer is designed to dissipate in typical operation. Rather than specifying CPU's real power dissipation, TDP serves as the nominal value for designing CPU cooling systems.

https://en.wikipedia.org/wiki/Thermal_design_power


2 minutes ago, leadeater said:

Well, it was Intel that changed the meaning of TDP

that's the way Intel has defined TDP since, I dunno... forever?!


10 minutes ago, zMeul said:

that's the way Intel has defined TDP since, I dunno... forever?!

Not back before processors had power states and variable multipliers. It was on or after the Intel Core series that Intel started marketing TDP differently; they don't use it by the industry-standard definition, but in the CPU world they are big enough to set their own standard without really causing any big problems.

 

Here's an interesting read on TDP for AMD vs Intel.

http://www.anandtech.com/show/2807/2

 

Edit:

Quote

In particular, until around 2006 AMD used to report the maximum power draw of its processors as TDP, but Intel changed this practice with the introduction of its Conroe family of processors.[4]

https://en.wikipedia.org/wiki/Thermal_design_power


On 3/13/2017 at 7:13 PM, zMeul said:

except it's ~14% xD

Ryzen on DDR4 does less than Haswell IPC... or did you forget that

Can I just insert the obligatory "man, Moore's law is more dead than a hooker in a river"? I mean, core count... yeah, I guess, but come on.



On 3/13/2017 at 9:32 PM, zMeul said:

content creation does not include compiling code

and compiling code can be done on a bottom of the barrel celly

Oh yeah? That's why I literally almost shot my PC when I first made the jump to quad core all those years ago and discovered the compiler I used only used 2 threads. Because compile times, not gaming, were why I made the jump.



1 hour ago, leadeater said:

Not back before processors had power states and variable multipliers. It was on or after the Intel Core series that Intel started marketing TDP differently; they don't use it by the industry-standard definition, but in the CPU world they are big enough to set their own standard without really causing any big problems.

 

Here's an interesting read on TDP for AMD vs Intel.

http://www.anandtech.com/show/2807/2

 

Edit:

https://en.wikipedia.org/wiki/Thermal_design_power

Let's not forget, Intel pushed for "Scenario Design Power" in an attempt to get away with power-throttling to stay within a specified TDP. This happens on their mobile SKUs and their desktop T SKUs (unless you specifically turn it off via BIOS on the desktop SKUs).

 

This is why I tell people to only take TDP into consideration if they do not intend to overclock or to run the most stressful programs (Prime95, Linpack, etc.). If you intend to overclock, TDP is no longer accurate, not even in the slightest. When it comes to heat, power scales quadratically with voltage, and you will likely need a cooler with a TDP rating far more aggressive than what your CPU originally needed. Heat still scales with clock speed changes, but far more linearly if you are just changing clocks.
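To put rough numbers on that, here is a back-of-the-envelope sketch using the common CMOS dynamic-power approximation, where power scales linearly with clock and quadratically with voltage. The clocks and voltages in the example are hypothetical, and static/leakage power is ignored, so treat the output as a ballpark rather than a measurement.

```python
# Dynamic power approximation: P ~ C * V^2 * f, so relative to a baseline:
#   P_new = P_old * (f_new / f_old) * (V_new / V_old)^2
def scaled_power(base_watts, base_volts, base_ghz, new_volts, new_ghz):
    """Estimate power after an overclock: linear in clock, quadratic in voltage."""
    return base_watts * (new_ghz / base_ghz) * (new_volts / base_volts) ** 2

# Hypothetical example: a 91 W part at 4.0 GHz / 1.20 V pushed to 4.5 GHz / 1.35 V.
print(f"{scaled_power(91, 1.20, 4.0, 1.35, 4.5):.0f} W")   # roughly 130 W
```

Even a modest voltage bump dominates the increase, which is why a cooler sized for the stock TDP falls short once you overclock.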

 

I could go on and on about TDP, like how different cooler manufacturers often test differently, or outright lie, when it comes to advertising these numbers (ID Cooling pretends that this 45mm vapor chamber is rated for 130W; I can assure you it's not), but that rant would go on for days.



23 minutes ago, MageTank said:

Let's not forget, Intel pushed for "Scenario Design Power" in an attempt to get away with power-throttling to stay within a specified TDP. This happens on their mobile SKUs and their desktop T SKUs (unless you specifically turn it off via BIOS on the desktop SKUs).

 

This is why I tell people to only take TDP into consideration if they do not intend to overclock or to run the most stressful programs (Prime95, Linpack, etc.). If you intend to overclock, TDP is no longer accurate, not even in the slightest. When it comes to heat, power scales quadratically with voltage, and you will likely need a cooler with a TDP rating far more aggressive than what your CPU originally needed. Heat still scales with clock speed changes, but far more linearly if you are just changing clocks.

 

I could go on and on about TDP, like how different cooler manufacturers often test differently, or outright lie, when it comes to advertising these numbers (ID Cooling pretends that this 45mm vapor chamber is rated for 130W; I can assure you it's not), but that rant would go on for days.

Hah, have you ever seen the fucking ratings on passive coolers?

"We idled a 140W CPU on it, it's good for 140W."

"You can place it on a stove burner, man, it's all good."

And I'm not talking about server ones that are meant to have air forced through them; I mean consumer ones.



6 minutes ago, Syntaxvgm said:

Hah, have you ever seen the fucking ratings on passive coolers?

"We idled a 140W CPU on it, it's good for 140W."

"You can place it on a stove burner, man, it's all good."

And I'm not talking about server ones that are meant to have air forced through them; I mean consumer ones.

Yes, actually, lol. This vapor chamber of mine is rated for 130W, but the moment I throw even 120W at it, I get near 98°C under 48K FFT Prime95. I hit the thermal junction limit before the first pass of Linpack finishes, so I can't even tell you what it does at its max heat.

 

Now, I am not ignorant, and I know they can't possibly rate their performance for absolutely every case in existence, but it's an ITX cooler, which implies it's designed to be used in an ITX case. My thermal testing is done in a 10L case with very good airflow from the side vents for the CPU. In fact, taking the top panel completely off and exposing all of the internals actually made the performance worse (less centralized air on the CPU itself). This is why I ignore TDP ratings and have taken the time to study the design of the heatsinks themselves. That being said, they were not too far off with their rating; I would say it's rated for about 105-110W. This is still more than the Cryorig C7 (95W, and in my personal tests it did slightly outperform the C7, though at much louder fan speeds), so it's still one of the best ITX coolers you can buy (ignoring ID Cooling's awful QA, and nearly double the price of the C7 for only a few degrees C of difference).

 

Another thing to consider: my CPU's stock TDP is 91W. I delidded it and undervolted it from 1.23V down to 1.14V, and its clocks are still stock as well. I had to do all of this to survive 48K FFT Prime95 with this cooler. It does, however, draw 122W during Prime95, which is where I got my 120W numbers from. During gaming loads this cooler is more than enough, so I give it a pass.



2 hours ago, LAwLz said:

@Tomsen @leadeater

Behardware has benchmarked a 2+2 vs a 4+0 configuration to determine how much of an impact the interconnect has.

 

Conclusion: next to no impact at all. So if this is any indication, an update to the Windows scheduler that avoids moving threads between CCXs would not make a difference.

Assuming that Windows' scheduler became 100% flawless in allocating threads to specific CCXs, we're looking at an average (not counting the abnormality that is BF1) of ~2% performance increase. That might as well be within margin of error.

(Funnily enough, that's the same performance increase the "legendary" Bulldozer optimization patch brought us.)

 

So once again, the idea that Microsoft is to blame, and that Windows 10 just isn't "optimized" for Ryzen, has been debunked.

 


You didn't come to the same conclusion as the person you got your data from. Did you even read the report? I know it is in French or whatever, but you could just use Google Translate.

 

Just to quote some pieces of the report:

Quote

Interestingly enough (we will come back to this), the 8 MB cache is available in each CCX in the 2+2 configuration. In theory, therefore, this configuration is advantageous: it has access to 2 x 8 MB of L3, against only 1 x 8 MB for the 4+0 configuration.

Quote

x264 and x265, which are not sensitive to the memory subsystem, perform virtually identically in both modes.

Surprisingly, however, WinRAR and 7-Zip, two benchmarks very sensitive to the memory subsystem, show very different results.

In the case of 7-Zip, the 2+2 configuration is the more interesting one. It seems that the software benefits from the presence of 16 MB of L3. Conversely, WinRAR is penalized more by synchronization, and the additional L3 cache does not compensate.

So right off the bat, 4 out of the 10 programs that were run don't seem to have much cross-CCX communication, so 40% of the tested results are basically useless to this debate.

 

The remaining 60% of the programs (the games) all show some kind of regression in performance, with variance from ~3% up to ~20% (you could argue that half of them are within margin of error, but I would argue that since EVERY game showed a regression, that wouldn't necessarily be true). This gives better insight into our debate. To quote the reporter (or whatever he is):

Quote

In all cases, the configuration where a CCX is disabled is the best performing. The additional L3 cache does not change anything; the losses vary considerably between titles, but for some the difference is massive: Battlefield 1 shows a differential of almost 20%, which translates in practice to a difference of 22 FPS! Data synchronization seems extremely penalizing in this title. Project Cars and Civilization VI also incur significant losses.

 

And here is his summarization:

Quote

We are now beginning to see a little more clearly. Yes, communication between CCXs has a cost, and depending on the application it is not necessarily harmless.

 

For less sensitive applications, the effect is almost zero, as is the case with video encoding software, for example.

 

For others it is much more striking, as for example Battlefield 1, where one loses 20% performance.

 

He also notes:

Quote

There may be additional aggravating factors. Knowing that communication between CCXs is a hassle, moving threads from one CCX to another (Windows 10 loves constantly moving threads!) can make the situation worse, although it is difficult to quantify in what proportion.

Quote

This does not mean that things will not change for Ryzen in the future. The most obvious solution would be a patch for the Windows scheduler, in order to limit thread movements from one CCX to another.

To sum it up, the test data is extremely limited: 10 programs, of which 4 don't seem to have much cross-CCX communication, while the rest all showed some kind of regression.

 

It is funny how you have framed my arguments; I never said Microsoft is to blame. I said that scheduler optimizations could potentially yield some sort of performance improvement, which the reporter seemed to agree with.

 

I would consider this lazy debunking from you; I would have expected better, to be honest. It really seems like you didn't even bother to read the report.



On that cross-CCX benchmarking in the 4+0 or 2+2 scenarios, I think they are showing the wrong numbers. In gaming, I think the bigger impact would be felt on the MINIMUMS: forcing more cross-CCX talk would lower the minimums more than it would drop the maximums or averages. But since the L3 cache doesn't appear to be affected by disabling cores, this route may be fruitless, as you have less and less cause for cross-talk when you disable cores. A more rigorous methodology may be needed, forcing cross-talk and denying it and observing the specific performance changes. I do not think it will be enough to make a big difference, though, in either case.

 

EDIT: Hypothesis: keep all 8 cores running, force affinity for 2 multi-threaded apps into their OWN 2+2 or 4+0 affinities, and run them concurrently. Then check the results against each other: 2+2 and 2+2 vs 4+0 and 0+4.
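Here is a minimal sketch of how that experiment could be scripted. It assumes an 8-core Ryzen with SMT disabled so that logical CPUs 0-3 are CCX0 and 4-7 are CCX1 (verify the actual mapping on your system before trusting it), it uses the third-party psutil package, and `my_benchmark.exe` is a hypothetical placeholder workload.

```python
import subprocess

import psutil  # third-party: pip install psutil

CCX0 = [0, 1, 2, 3]  # assumed mapping, SMT disabled
CCX1 = [4, 5, 6, 7]

def launch_pinned(cmd, cpus):
    """Start a workload and restrict it to the given logical CPUs."""
    proc = subprocess.Popen(cmd)
    # Affinity is applied immediately after launch.
    psutil.Process(proc.pid).cpu_affinity(cpus)
    return proc

# 4+0 and 0+4: each workload owns one CCX, so no cross-CCX traffic.
a = launch_pinned(["my_benchmark.exe"], CCX0)
b = launch_pinned(["my_benchmark.exe"], CCX1)
a.wait()
b.wait()

# For the 2+2 vs 2+2 case, split each workload across both CCXs instead:
#   launch_pinned(["my_benchmark.exe"], [0, 1, 4, 5])
#   launch_pinned(["my_benchmark.exe"], [2, 3, 6, 7])
```

Comparing throughput between the two placements would expose the synchronization cost directly, including its effect on the minimums.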


9 hours ago, leadeater said:

Yeah, that's basically where I'm stuck as to which of those two options pays off better. It seems wasteful and costly to disable that many cores and use up wafer area to deliver 4 core products. Naples, while it does show how AMD can scale CCXs, is very different in die design regarding PCIe lanes and the memory controller, and looks to be offering only 8 (2 CCX), 16 (4/6 CCX), 24 (6/8 CCX) and 32 (8 CCX) core products, so it is a poor example to use for gauging whether a single-CCX die design is going to be used.

 

We also know AMD is favoring a market push towards high core count products all round, so how much they actually want to invest in 4 core products is unknown.

There's something important here also, which is that cheap lower-end CPUs sell in much greater volume, which might mean that AMD would have to use perfectly good chips for lower-end products, which is bad for business.

I would argue that the sheer potential demand for 4 core CPUs would make them use a die just for that, as it would be expensive to use 2 CCXs for it. Don't forget this node is mature, as it has been producing all the Polaris chips for almost a year now.


8 hours ago, leadeater said:

I probably wasn't that clear about what I meant by a scalable CCX design. Any SKU that uses a different number of CCXs is a different die design; cutting the current die isn't possible.

 

Basically the CCXs are paired with a memory controller and I/O silicon logic. That means that while they can put any number of CCXs they wish (maybe in pairs?) on a die, they still need the CCX interconnects plus the memory controller/PCIe that go with them. Cut the die and you cut the memory controller, which means a broken, non-functional die.

 

That's where it starts to get really complex and heads further into more unknowns:

  • Can a single CCX be properly connected to the memory controller or is a minimum of two required?
  • Can a single CCX die actually be made smaller in physical area? Can the memory controller etc be rearranged?
  • Is the potential sales volume of the 4 core SKUs worth investing in a dedicated die design?

My personal feeling on the matter regarding the CCX is that it was always designed primarily for Naples, to let AMD cheaply make a few different die designs covering a large range of core count SKU offerings and to maximize wafer area usage. Ryzen is the product of making do with what you have, and not necessarily a specifically dedicated design; surely a 100% Ryzen-focused design wouldn't have a CCX interconnect in it at all and would be a unified 8 core?

 

I'd love to have an open and honest discussion with the Zen architecture engineers to find out where the real focus was, Ryzen or Naples.

Don't forget they also made Ryzen with low-power things in mind, like consoles, laptops, embedded, etc., and we will have a 4 core APU late this year, so that has to have 1 CCX, as they won't sell any 8/6 core APU.

Ryzen's CCX is a cost-saving measure, so that making various dies is as cheap as possible.


5 hours ago, Tomsen said:

So right off the bat, 4 out of the 10 programs that were run don't seem to have much cross-CCX communication, so 40% of the tested results are basically useless to this debate.

All I hear is "don't use programs which don't show the results I want them to show! You should only look at the benchmarks which agree with me!".

Showing a broad range of programs (which 10 is not, I will admit) is important when talking about what kind of performance differences we can expect. If you just look at the programs which benefit the most, then you will give a twisted picture of what people should actually expect from an update (if we get one at all, that is).

You can't just dismiss 40% of the benchmarks because they are the ones that benefit the least from it.

 

5 hours ago, Tomsen said:

The remaining 60% of the programs (the games) all show some kind of regression in performance, with variance from ~3% up to ~20% (you could argue that half of them are within margin of error, but I would argue that since EVERY game showed a regression, that wouldn't necessarily be true). This gives better insight to our debate. To quote the reporter (or whatever he is):

The average (not counting BF1, because you should not include abnormalities like that) is a 4% increase, and that's assuming the scheduler is essentially perfect, which it won't be.

So what I expect is something like a ~3% performance boost in games (less overall) IF (huge if) we get a patch.

 

 

5 hours ago, Tomsen said:

It is funny how you have framed my arguments; I never said Microsoft is to blame. I said that scheduler optimizations could potentially yield some sort of performance improvement, which the reporter seemed to agree with.

I did not mean for my post to read that way. I tagged you since I thought you'd think it was interesting data.

The whole thing about "Microsoft isn't to blame and Windows 10 is optimized for Ryzen" was meant more as a comment to the people who were flaming me in another thread when I said that you shouldn't expect a 10-20% performance increase from some Windows 10 patch. And yes, I am incredibly salty from that thread, because I got quite a few warning points for arguing with the people who said my posts were just trolling, spam, idiotic, etc.

 

Anyway, yes, it seems like a patch could in theory increase performance (although it would be extremely hard to actually make, and getting even a tiny thing wrong could reduce performance by a lot), but only by ~3%.

That will be like... 1 or 2 FPS in games. It's still free performance, so it would be good news (assuming no drawbacks), but it's far from the 10-20% performance gains people seem to think Microsoft is keeping from Ryzen owners.

 

5 hours ago, Tomsen said:

Really seemed like you didn't even bother to read the report.

I didn't, but I think I have to. The person who wrote the report was looking at the same numbers I was looking at. Just because the author has one opinion does not mean it is the right one.


9 minutes ago, LAwLz said:

-snip-

Not sure why you tagged me in that, though? I mean, I had the same view as you on the matter in the other thread too. Not complaining or anything, and happy to read that post :).

