Why does hyperthreading improve synthetic "benchmark" performance?

corrado33 · November 25, 2018

Hyperthreading works by using "unused transistors" in a core to do an additional process during one clock cycle. So, why does a synthetic benchmark which "theoretically" uses 100% of the CPU core, show improvements with hyperthreading?

In theory, it should be possible to design a synthetic benchmark that does NOT scale AT ALL with hyperthreading. (If the operation you are doing uses enough of the transistors, you won't be able to do 2 during one clock cycle, so hyperthreading wouldn't help.)

On the other hand, it should also be possible to design a synthetic benchmark that would scale virtually PERFECTLY with hyperthreading. (On a specific chip, of course.)

So, say, for example, I do a simple addition, 8+8. Googling around, I found that this could use as few as ~100 transistors to do the calculation; let's multiply that by 2 to direct the signal from the correct pin to the correct pin and you get 200 transistors. Considering modern CPUs have literally BILLIONS of transistors, it seems like a waste to do such a simple calculation. If you were running a multithreaded program that simply calculated 8+8 over and over again, the CPU would say it's at 100% utilization because it ALWAYS has something to do, but in reality it'd only be using very few transistors. Is this... true? If I do that simple addition, does it really occupy a few hundred million transistors for 1 clock cycle?

If that's true, then why don't they develop hyper-hyper-hyper-hyper-hyper threaded processors? This would scale very well with simple processes.

Another question is this.

The trend with CPUs is to increase transistor count by reducing transistor size.... but... how does that increase performance if operations require the same number of transistors? Say, hypothetically a CPU had 405 transistors. If it was NOT hyperthreaded, it could do 1 8+8 operation per clock cycle (or per command cycle, whatever); if it were hyperthreaded, it could do 2 per clock cycle. But if then the next generation of CPUs has 805 transistors, the results for the 8+8 hyperthreaded vs non hyperthreaded comparison shouldn't change. (Assuming same frequency and signal handling etc.)

So, with updated CPUs, are there also updated command sets which utilize the greater number of cores better? So, in the example above, maybe there was a command that could use a different operation to compute 8+8 twice as fast but using twice as many transistors.

Finally, is this the explanation for why OCCT generally makes a CPU hotter than AIDA64 even though both of them are technically using 100% of the CPU time? OCCT uses a calculation that requires more transistors, therefore more power, therefore more heat?

EDIT: Is there any way to observe the % of transistors being used in a CPU at any given time? That'd be super cool.

Lurick · November 25, 2018

Because that's not how hyperthreading works; it doesn't use "unused transistors". It's a physical core that is presented to the OS as two logical cores, and the OS addresses them as such.
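To see the "one physical core, two logical cores" view from the OS side, here is a minimal Python sketch. It assumes a Linux system that exposes the usual `thread_siblings_list` files under sysfs; the helper names `parse_cpu_list` and `sibling_groups` are made up for this illustration. On other systems it simply reports the logical CPU count.

```python
import os
from pathlib import Path


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0,4" or "0-1" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(sysfs_root="/sys/devices/system/cpu"):
    """Group logical CPUs that share one physical core (Linux sysfs only)."""
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        try:
            groups.add(tuple(parse_cpu_list(path.read_text())))
        except (OSError, ValueError):
            continue  # offline CPU or unexpected file contents
    return sorted(groups)


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs, i.e. hardware threads.
    print("logical CPUs seen by the OS:", os.cpu_count())
    groups = sibling_groups()
    if groups:
        print("physical cores (groups of sibling threads):", len(groups))
        for group in groups:
            print("  logical CPUs sharing a core:", list(group))
    else:
        print("no sysfs topology found (not Linux?)")
```

On a hyperthreaded chip each group would typically list two logical CPUs; with hyperthreading disabled or absent, each group has one.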

Neftex · November 25, 2018

5 minutes ago, corrado33 said:

Hyperthreading works by using "unused transistors" in a core to do an additional process during one clock cycle

who told you that?

Sakkura · November 25, 2018

2 minutes ago, corrado33 said:

Hyperthreading works by using "unused transistors" in a core to do an additional process during one clock cycle. So, why does a synthetic benchmark which "theoretically" uses 100% of the CPU core, show improvements with hyperthreading?

Because it doesn't use 100% of the CPU core. Not only is it very difficult to do so, but there's no incentive to do that because it's a completely unrealistic load. Even synthetic benchmarks are trying to emulate real-world performance.

3 minutes ago, corrado33 said:

If that's true, then why don't they develop hyper-hyper-hyper-hyper-hyper threaded processors? This would scale very well with simple processes.

Something like that does exist for specialized uses. More often GPGPU gets used instead though.

4 minutes ago, corrado33 said:

Another question is this.

The trend with CPUs is to increase transistor count by reducing transistor size.... but... how does that increase performance if operations require the same number of transistors? Say, hypothetically a CPU had 405 transistors. If it was NOT hyperthreaded, it could do 1 8+8 operation per clock cycle (or per command cycle, whatever); if it were hyperthreaded, it could do 2 per clock cycle. But if then the next generation of CPUs has 805 transistors, the results for the 8+8 hyperthreaded vs non hyperthreaded comparison shouldn't change. (Assuming same frequency and signal handling etc.)

You can do a lot of different things, like better caches so the core doesn't have to wait as long, or more execution units, or you can make a longer pipeline so it can clock higher, or a more complicated branch predictor (often combined with a longer pipeline to get the higher clocks without losing effective IPC).

3 minutes ago, Lurick said:

Because that's not how hyperthreading works; it doesn't use "unused transistors". It's a physical core that is presented to the OS as two logical cores, and the OS addresses them as such.

It effectively does that. If a part of the CPU isn't being used by one thread, it can be used by the other thread.

corrado33 · November 25, 2018

5 minutes ago, Neftex said:

who told you that?

It's an oversimplification. Hyperthreading essentially duplicates the "front end" for a core, but does not duplicate the part of the core that executes the operations. So two "front ends" share the same "execution core." Therefore they share the same "transistors."

Jurrunio · November 25, 2018

14 minutes ago, corrado33 said:

So, why does a synthetic benchmark which "theoretically" uses 100% of the CPU core, show improvements with hyperthreading?

A single thread doesn't feed the compute cores fast enough. Think about single-clutch sequential gearboxes compared to dual-clutch gearboxes on performance cars: even though the number of gears doesn't change, more clutches mean faster gear changes (as long as they use parts of similar grade).

corrado33 · November 25, 2018

6 minutes ago, Sakkura said:

Because it doesn't use 100% of the CPU core. Not only is it very difficult to do so, but there's no incentive to do that because it's a completely unrealistic load. Even synthetic benchmarks are trying to emulate real-world performance.

So am I thinking about this incorrectly? A hyperthreaded core has 2 front ends and 1 execution "section." So is it like.... this metaphor? If you were chopping wood and had someone putting logs on the stand, you could chop one log at a time, but if you had 2 people putting logs on the stand (on top of each other) (and you were strong enough) you could chop TWO pieces of wood at once. Or is it like "If you were chopping wood and had someone putting logs on the stand, you could chop one log at a time, but if you had 2 people putting logs on the stand you could still only chop 1 at a time, but you could chop faster because there is downtime when the person is grabbing the piece of wood."

If hyperthreading is like the former, then my questions still hold, but if it's like the latter then I understand.

Alex Atkin UK · November 25, 2018

It's nearer the latter, but that's still a bit too simplistic a way of looking at it.

corrado33 · November 25, 2018

1 minute ago, Jurrunio said:

A single thread doesn't feed the compute cores fast enough. Think about single-clutch sequential gearboxes compared to dual-clutch gearboxes on performance cars: even though the number of gears doesn't change, more clutches mean faster gear changes (as long as they use parts of similar grade).

Why doesn't it feed the compute cores fast enough? What's the limitation? CPU -> RAM data transfer? CPU -> CPU cache data transfer? Why don't CPU manufacturers work on fixing THAT issue rather than just trying to shove more transistors in the chip?

corrado33 · November 25, 2018

3 minutes ago, Alex Atkin UK said:

It's the latter.

So a hyperthreaded core compute unit technically could run at twice the frequency of the CPU clock speed? OR is it that typical cores run at less than their "frequency" because they can't get data fast enough?

Alex Atkin UK · November 25, 2018

Not sure what you mean. If you mean could it theoretically be as fast as a non-HT core running at twice the clock speed, then "theoretically", in very specific workloads, it could possibly come close. But it would have to be an extremely inefficient workload, I think.

All HT is really doing, I believe, is giving the CPU something else to work on while it's sat idle waiting for other parts of your system to deal with the data it needs to proceed with the current job.
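The "something else to work on while it waits" idea can be sketched with a toy simulation in Python. Everything here is a deliberate simplification: the function name `run_core` and its parameters are invented for the example, the core issues at most one instruction per cycle, and each thread just alternates between a burst of work and a fixed wait for data. Real cores are superscalar and far more complicated.

```python
def run_core(num_threads, work_cycles, stall_cycles, total_cycles):
    """Count instructions completed by a toy single-issue core.

    Each hardware thread repeats the same pattern: issue `work_cycles`
    instructions (one per cycle, when it gets the execution unit), then
    wait `stall_cycles` cycles for data. In every cycle the core issues
    one instruction from the lowest-numbered thread that is ready.
    """
    work_left = [work_cycles] * num_threads
    ready_at = [0] * num_threads  # first cycle in which each thread may issue
    completed = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                completed += 1
                work_left[t] -= 1
                if work_left[t] == 0:
                    # Burst finished: this thread now waits for its data.
                    work_left[t] = work_cycles
                    ready_at[t] = cycle + 1 + stall_cycles
                break  # only one instruction issues per cycle
    return completed


if __name__ == "__main__":
    cycles = 1000
    for work, stall in [(2, 2), (4, 0)]:
        one = run_core(1, work, stall, cycles)
        two = run_core(2, work, stall, cycles)
        print(f"work={work} stall={stall}: 1 thread -> {one}, 2 threads -> {two}")
```

With these made-up numbers, a thread that waits half the time leaves exactly enough idle cycles for a second thread, so two threads complete twice as many instructions. With no waiting at all, the second thread adds nothing, because the execution unit was never idle in the first place.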

corrado33 · November 25, 2018

2 minutes ago, Alex Atkin UK said:

Not sure what you mean. If you mean could it theoretically be as fast as a non-HT core running at twice the clock speed, then "theoretically", in very specific workloads, it could possibly come close. But it would have to be an extremely inefficient workload, I think.

All HT is really doing, I believe, is giving the CPU something to work on while it's sat idle waiting for other parts of your system to deal with the data it needs to proceed with the current job.

Say, for example, we had an extremely simple workload that didn't require storage access or anything. The thread could complete in a single cycle of the CPU. If you used a hyperthreaded CPU, would it be twice as fast as a non hyperthreaded CPU? If so, then the "compute core" would have to be running twice as fast (to deal with twice the amount of data). Or is it that in reality, CPUs spend a crap ton of their time sitting around waiting for stuff. And if that's true, then why on earth don't we work on fixing that problem?

Glenwing · November 25, 2018

12 minutes ago, corrado33 said:

Why doesn't it feed the compute cores fast enough? What's the limitation? CPU -> RAM data transfer? CPU -> CPU cache data transfer? Why don't CPU manufacturers work on fixing THAT issue rather than just trying to shove more transistors in the chip?

Limitations of RAM transfer speed and things like that are huge concerns, which is why they are already avoided as much as possible by advanced techniques (out-of-order execution, branch prediction, prefetching, etc.), many of which involve predicting what instructions or data will be needed later and fetching them into the cache so they're ready to go when the CPU needs them. As you might imagine, this is not 100% reliable, because it involves predicting the future. It is not a "problem" that can be fixed, it is just reality. If you have any ideas for more accurate prediction methods, I'm sure they'd be interested though.

From the WP page:

Quote

When execution resources would not be used by the current task in a processor without hyper-threading, and especially when the processor is stalled, a hyper-threading equipped processor can use those execution resources to execute another scheduled task. (The processor may stall due to a cache miss, branch misprediction, or data dependency.)
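To illustrate why "predicting what will be needed later" works well for some access patterns and not for others, here is a toy next-line prefetcher in Python. This is only a sketch under loud assumptions: `count_misses` is a made-up name, the cache is an unbounded set of line numbers (so nothing is ever evicted), and the only prediction is "the next line after the one just touched". Real hardware prefetchers are much more elaborate.

```python
def count_misses(addresses, line_size=64, prefetch_next=True):
    """Count misses for a toy cache with an optional next-line prefetcher.

    The cache is modelled as an unbounded set of line numbers, so only
    first-touch misses are counted. That keeps the sketch short but is
    not how a real, finite cache behaves.
    """
    cached_lines = set()
    misses = 0
    for address in addresses:
        line = address // line_size
        if line not in cached_lines:
            misses += 1
            cached_lines.add(line)
        if prefetch_next:
            # Guess that the program will want the following line soon.
            cached_lines.add(line + 1)
    return misses


if __name__ == "__main__":
    # Walk through 100 consecutive cache lines, 8 bytes at a time.
    sequential = list(range(0, 64 * 100, 8))
    # Touch 100 lines that are far apart, so "next line" never helps.
    scattered = [i * 7919 * 64 for i in range(100)]

    for name, pattern in [("sequential", sequential), ("scattered", scattered)]:
        without = count_misses(pattern, prefetch_next=False)
        with_pf = count_misses(pattern, prefetch_next=True)
        print(f"{name}: {without} misses without prefetch, {with_pf} with")
```

The sequential walk is predicted almost perfectly, while the scattered one gets no benefit at all. Those unpredictable misses are the stalls a second hardware thread can fill.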

Alex Atkin UK · November 25, 2018

I think it's basically impossible to be twice as fast; latencies would kill that. To be anything close, though, that workload would have to be somehow causing the CPU to be idle 50% of the time.

If the workload was 100% efficient (impossible in a real world workload) then HT would be pointless.
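The arithmetic behind the 50% and 100% figures can be written down as a rough upper bound. This is a back-of-the-envelope sketch, not a model of any real CPU: it assumes a single shared execution resource, and the function name `smt_speedup_bound` and its parameters are invented here. If one thread alone keeps that resource busy a fraction `busy_fraction` of the time, then several threads together can at best keep it busy all of the time.

```python
def smt_speedup_bound(busy_fraction, threads=2):
    """Upper bound on throughput gain from sharing one execution resource.

    busy_fraction: share of cycles a single thread keeps the resource busy.
    threads: number of hardware threads sharing it.
    The gain can exceed neither the thread count nor 1 / busy_fraction.
    """
    if not 0.0 < busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be in (0, 1]")
    return min(float(threads), 1.0 / busy_fraction)


if __name__ == "__main__":
    for busy in (0.5, 0.8, 1.0):
        print(f"busy {busy:.0%} of the time -> at most {smt_speedup_bound(busy):.2f}x")
```

A thread that is idle half the time allows at most a 2x gain, and a thread that never idles allows none, which matches both statements above. Real gains sit below this bound because the two threads also compete for caches and other shared parts of the core.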

corrado33 · November 25, 2018

13 minutes ago, Jurrunio said:

A single thread doesn't feed the compute cores fast enough. Think about single-clutch sequential gearboxes compared to dual-clutch gearboxes on performance cars: even though the number of gears doesn't change, more clutches mean faster gear changes (as long as they use parts of similar grade).

Also, since the hyperthreaded and non hyperthreaded versions of chips are identical dies (a la the 6600k and 6700k), does that really mean there are freaking 4 other core "Front ends" just sitting around doing freaking nothing? Why? That doesn't make sense unless "bad dies" often kill parts of the front end.

corrado33 · November 25, 2018

1 minute ago, Glenwing said:

Limitations of RAM transfer speed and things like that are huge concerns, which is why they are already avoided as much as possible by advanced techniques (out-of-order execution, branch prediction, prefetching, etc.), many of which involve predicting what instructions or data will be needed later and fetching them into the cache so they're ready to go when the CPU needs them. As you might imagine, this is not 100% reliable, because it involves predicting the future. It is not a "problem" that can be fixed, it is just reality. If you have any ideas for more accurate prediction methods, I'm sure they'd be interested though.

So why not just give a CPU a crap ton of cache to work with? That way the prediction algorithms don't need to be accurate and therefore aren't the limiting factor as often.

Glenwing · November 25, 2018

1 minute ago, corrado33 said:

So why not just give a CPU a crap ton of cache to work with? That way the prediction algorithms don't need to be accurate and therefore aren't the limiting factor as often.

Cache takes up a lot of space. And when it's larger and more spread out, it can't operate as quickly.
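The trade-off between a bigger cache and a slower one can be put into numbers with the standard average-access-time formula: hit time plus miss rate times miss penalty. The Python sketch below uses entirely hypothetical latencies and miss rates, chosen only to show the shape of the trade-off; the function name `average_access_ns` is likewise made up.

```python
def average_access_ns(hit_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_ns + miss_rate * miss_penalty_ns


if __name__ == "__main__":
    ram_penalty_ns = 80.0  # hypothetical cost of going out to RAM

    # Hypothetical designs: each step up in size misses less often,
    # but every hit takes longer because the cache is physically larger.
    designs = {
        "small, fast cache": (1.0, 0.10),
        "larger, slower cache": (4.0, 0.05),
        "huge, much slower cache": (10.0, 0.04),
    }
    for name, (hit_ns, miss_rate) in designs.items():
        avg = average_access_ns(hit_ns, miss_rate, ram_penalty_ns)
        print(f"{name}: about {avg:.1f} ns per access on average")
```

With these invented numbers the middle design comes out ahead, and the huge cache is worse than the small one, because its slower hits cost more than the misses it saves.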

Alex Atkin UK · November 25, 2018

5 minutes ago, corrado33 said:

Also, since the hyperthreaded and non hyperthreaded versions of chips are identical dies (a la the 6600k and 6700k), does that really mean there are freaking 4 other core "Front ends" just sitting around doing freaking nothing? Why? That doesn't make sense unless "bad dies" often kill parts of the front end.

It makes sense from a profit perspective.

I remember it being widely reported many years ago that the demand for a lower-end AMD CPU was so high that they were literally disabling "good" CPUs to sell them as lower-end models. Some people figured out how to hack them to re-enable the disabled cores, and many were perfectly stable doing so (likely proving the theory), but I think it's pretty much impossible to reverse the disabling of parts these days.

Bottom line, it's more profitable to be selling lower-end CPUs than to have higher-end CPUs just sat in a warehouse somewhere.

Jurrunio · November 25, 2018

4 minutes ago, corrado33 said:

Why?

so there's a reason for you to spend more money.

16 minutes ago, corrado33 said:

Why doesn't it feed the compute cores fast enough? What's the limitation? CPU -> RAM data transfer? CPU -> CPU cache data transfer? Why don't CPU manufacturers work on fixing THAT issue rather than just trying to shove more transistors in the chip?

memory to cache, cache management etc.

2 minutes ago, corrado33 said:

So why not just give a CPU a crap ton of cache to work with?

cost. Too expensive to give a lot of cache. Also more cache = longer route for signal to move around, increasing latency in the first place.

Glenwing · November 25, 2018

3 minutes ago, corrado33 said:

Also, since the hyperthreaded and non hyperthreaded versions of chips are identical dies (a la the 6600k and 6700k), does that really mean there are freaking 4 other core "Front ends" just sitting around doing freaking nothing? Why? That doesn't make sense unless "bad dies" often kill parts of the front end.

It is an artificial market segmentation.

Consider the fact that it takes billions of dollars and many years to design a CPU. Those costs have to be paid for by the combined sales of all the chips. The manufacturing cost of the chips has very little to do with the retail price, so the idea that "the i3 and i7 have the same manufacturing cost" is irrelevant. They have a certain chip design, and that chip design needs to pay for itself. One way to do that is to open it to a wider market by offering lower cost options, so people who would not normally be able to afford the full chip can at least buy something, even if it's a cut-down version. In return for this discount, part of the processor is disabled.

Yes, it is an artificially created limitation, but that doesn't actually matter when you think about it.

corrado33 · November 25, 2018

4 minutes ago, Glenwing said:

Cache takes up a lot of space. And when it's larger and more spread out, it can't operate as quickly.

It's gotta be faster than fetching things from RAM. Also, it seems that die maps could be changed.

For example: This is an i7-980X

All the cache is at one side of the die. Why don't they surround the cores in cache? I assume that the cores don't actually talk to each other, so there's no need for them to be side by side. What about increasing the number of... what I assume to be "cache controllers"?

Say, for example, each core has its own smaller cache while also having access to a larger, albeit slightly slower/further away shared cache?

corrado33 · November 25, 2018

9 minutes ago, Glenwing said:

It is an artificial market segmentation.

Consider the fact that it takes billions of dollars and many years to design a CPU. Those costs have to be paid for by the combined sales of all the chips. The manufacturing cost of the chips has very little to do with the retail price, so the idea that "the i3 and i7 have the same manufacturing cost" is irrelevant. They have a certain chip design, and that chip design needs to pay for itself. One way to do that is to open it to a wider market by offering lower cost options, so people who would not normally be able to afford the full chip can at least buy something, even if it's a cut-down version. In return for this discount, part of the processor is disabled.

Yes, it is an artificially created limitation, but that doesn't actually matter when you think about it.

So how do they "disable" the chips? I know sometimes they'll laser cut off extra cores, but other times (like the 6600k vs the 6700k) I'd imagine it's physically impossible to remove the extra "front ends." Is it an external resistor/capacitor that does it? I'm making the assumption here that the material used to hold the die (likely some sort of circuit board material... FR-4?) is identical between the 6600 and 6700, so the only chance at "customization" would be the little surface mount components on the bottom of the chip?

I mean, that'd be FAR too simple... right..... right?

Glenwing · November 25, 2018

1 minute ago, corrado33 said:

Say, for example, each core has its own smaller cache while also having access to a larger, albeit slightly slower/further away shared cache?

Already done a long time ago. The large section you are looking at is the L3 (level 3) cache. There is also a smaller and faster L2 cache and an even smaller (and even faster) L1 cache located within the actual cores.

7 minutes ago, corrado33 said:

All the cache is at one side of the die. Why don't they surround the cores in cache? I assume that the cores don't actually talk to each other, so there's no need for them to be side by side. What about increasing the number of... what I assume to be "cache controllers"?

For specific design decisions you would probably need to talk to someone from Intel. But I think it's safe to say it's not quite as simple as that.

5 minutes ago, corrado33 said:

So how do they "disable" the chips? I know sometimes they'll laser cut off extra cores, but other times (like the 6600k vs the 6700k) I'd imagine it's physically impossible to remove the extra "front ends." Is it an external resistor/capacitor that does it? I'm making the assumption here that the material used to hold the die (likely some sort of circuit board material... FR-4?) is identical between the 6600 and 6700, so the only chance at "customization" would be the little surface mount components on the bottom of the chip?

That isn't something Intel talks about publicly.

corrado33 · November 25, 2018

46 minutes ago, Sakkura said:

Something like that does exist for specialized uses. More often GPGPU gets used instead though.

So a GPU is essentially a CPU that's VERY... VERY heavily hyperthreaded? Possibly with smaller compute units because the GPU only has to do certain, specific calculations?

Glenwing · November 25, 2018

Just now, corrado33 said:

So a GPU is essentially a CPU that's VERY... VERY heavily hyperthreaded? Possibly with smaller compute units because the GPU only has to do certain, specific calculations?

Not hyperthreaded, a GPU has thousands of actual cores. But they are heavily simplified cores with basically only one type of execution unit (FP32 normally) and maybe a few special function units, rather than a CPU core, which has many hundreds of different types of operations it can pick from.
