Hyperthreading works by using "unused transistors" in a core to do an additional process during one clock cycle. So, why does a synthetic benchmark which "theoretically" uses 100% of the CPU core, show improvements with hyperthreading? 

 

In theory, it should be possible to design a synthetic benchmark that does NOT scale AT ALL with hyperthreading. (If the operation you are doing uses enough of the transistors, you won't be able to do 2 during one clock cycle, therefore hyperthreading wouldn't work.)

 

On the other hand, it should also be possible to design a synthetic benchmark that would scale virtually PERFECTLY with hyperthreading. (On a specific chip of course.)

 

So, say, for example, I do a simple addition, 8+8. Googling around, I found that it could take as few as ~100 transistors to do the calculation. Let's multiply that by 2 to direct the signal from the correct pin to the correct pin and you get 200 transistors. Considering modern CPUs have literally BILLIONS of transistors, it seems like a waste to do such a simple calculation. If you were running a multithreaded program that simply calculated 8+8 over and over again, the CPU would say it's at 100% utilization because it ALWAYS has something to do, but in reality it'd only be using very few transistors. Is this... true? If I do that simple addition does it really occupy a few hundred million transistors for 1 clock cycle?
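
For a sense of scale, here is a minimal sketch of that 8+8, assuming a textbook 8-bit ripple-carry adder built from basic logic gates. The function names are made up for illustration, and real cores use wider and faster adder designs. It counts gates rather than transistors; each gate is a handful of transistors, so the total lands in the low hundreds, the same ballpark as the figure above.

```python
GATES_PER_FULL_ADDER = 5  # 2 XOR + 2 AND + 1 OR in this textbook layout


def full_adder(a, b, carry_in):
    """Add three single bits; returns (sum_bit, carry_out)."""
    partial = a ^ b                                # XOR gate 1
    sum_bit = partial ^ carry_in                   # XOR gate 2
    carry_out = (a & b) | (partial & carry_in)     # 2 AND gates + 1 OR gate
    return sum_bit, carry_out


def ripple_carry_add(x, y, width=8):
    """Add two unsigned ints one bit at a time.

    Returns (sum modulo 2**width, final carry bit, number of gates used).
    """
    result, carry = 0, 0
    for bit in range(width):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        sum_bit, carry = full_adder(a, b, carry)
        result |= sum_bit << bit
    return result, carry, width * GATES_PER_FULL_ADDER


print(ripple_carry_add(8, 8))  # (16, 0, 40)
```

Forty gates is a vanishingly small slice of a modern core, which is why "100% utilization" as reported by the OS says nothing about how much of the silicon is actually switching.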

 

If that's true, then why don't they develop hyper-hyper-hyper-hyper-hyper threaded processors? This would scale very well with simple processes. 

 

Another question is this.

 

The trend with CPUs is to increase transistor count by reducing transistor size.... but... how does that increase performance if operations require the same number of transistors? Say, hypothetically a CPU had 405 transistors. If it was NOT hyperthreaded, it could do 1 8+8 operation per clock cycle (or per command cycle, whatever), if it were hyperthreaded, it could do 2 per clock cycle. But if then the next generation of CPUs has 805 transistors, the results for the 8+8 hyperthreaded vs non-hyperthreaded comparison shouldn't change. (Assuming same frequency and signal handling etc.)

 

So, with updated CPUs, are there also updated command sets which utilize the greater number of cores better? So, in the example above, maybe there was a command that could use a different operation to compute 8+8 twice as fast but using twice as many transistors. 

 

 

Finally, is this the explanation for why OCCT generally makes a CPU hotter than AIDA64 even though both of them are technically using 100% of the CPU time? OCCT uses a calculation that requires more transistors, therefore more power, therefore more heat?

 

 

EDIT: Is there any way to observe the % of transistors being used in a CPU at any given time? That'd be super cool. 


Because that's not how hyperthreading works; it doesn't use "unused transistors". It's a physical core that's addressed as two logical cores by the OS.


2 minutes ago, corrado33 said:

Hyperthreading works by using "unused transistors" in a core to do an additional process during one clock cycle. So, why does a synthetic benchmark which "theoretically" uses 100% of the CPU core, show improvements with hyperthreading?

Because it doesn't use 100% of the CPU core. Not only is it very difficult to do so, but there's no incentive to do that because it's a completely unrealistic load. Even synthetic benchmarks are trying to emulate real-world performance.

 

3 minutes ago, corrado33 said:

If that's true, then why don't they develop hyper-hyper-hyper-hyper-hyper threaded processors? This would scale very well with simple processes. 

Something like that does exist for specialized uses. More often GPGPU gets used instead though.

 

4 minutes ago, corrado33 said:

Another question is this.

 

The trend with CPUs is to increase transistor count by reducing transistor size.... but... how does that increase performance if operations require the same number of transistors? Say, hypothetically a CPU had 405 transistors. If it was NOT hyperthreaded, it could do 1 8+8 operation per clock cycle (or per command cycle, whatever), if it were hyperthreaded, it could do 2 per clock cycle. But if then the next generation of CPUs has 805 transistors, the results for the 8+8 hyperthreaded vs non-hyperthreaded comparison shouldn't change. (Assuming same frequency and signal handling etc.)

You can do a lot of different things, like better caches so the core doesn't have to wait as long, or more execution units, or you can make a longer pipeline so it can clock higher, or a more complicated branch predictor (often combined with a longer pipeline to get the higher clocks without losing effective IPC).
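
To make the branch predictor point a bit more concrete, here is a rough sketch comparing two classic textbook schemes: a 1-bit "same as last time" predictor and a 2-bit saturating counter. The function names and the loop pattern are assumptions for illustration only; predictors in real CPUs are far more elaborate and their details are mostly not public.

```python
def one_bit_accuracy(outcomes):
    """Predict that each branch repeats its previous outcome."""
    prediction, correct = True, 0
    for taken in outcomes:
        correct += (prediction == taken)
        prediction = taken
    return correct / len(outcomes)


def two_bit_accuracy(outcomes):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    state, correct = 3, 0
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        # Move one step toward the observed outcome, clamped to 0..3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(outcomes)


# Hypothetical loop branch: taken 9 times, then falls through once, repeated.
pattern = ([True] * 9 + [False]) * 100

print(one_bit_accuracy(pattern), two_bit_accuracy(pattern))  # 0.801 0.9
```

The extra bit of state costs more transistors per tracked branch but avoids the second misprediction every time the loop restarts, which is the kind of trade a bigger transistor budget allows.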

 

3 minutes ago, Lurick said:

Because that's not how hyperthreading works; it doesn't use "unused transistors". It's a physical core that's addressed as two logical cores by the OS.

It does effectively that. If a part of the CPU isn't being used by one thread, it can be used by the other thread.


 

5 minutes ago, Neftex said:

who told you that?

It's an oversimplification. Hyperthreading essentially duplicates the "front end" for a core, but does not duplicate the part of the core that executes the operations. So two "front ends" share the same "execution core." Therefore they share the same "transistors." 


14 minutes ago, corrado33 said:

So, why does a synthetic benchmark which "theoretically" uses 100% of the CPU core, show improvements with hyperthreading? 

A single thread doesn't feed the compute cores fast enough. Think about single-clutch sequential gearboxes compared to dual-clutch gearboxes on performance cars: even though the number of gears doesn't change, more clutches mean faster gear changes (as long as they use parts of similar grade).

CPU: i7-2600K 4751MHz 1.44V (software) --> 1.47V at the back of the socket Motherboard: Asrock Z77 Extreme4 (BCLK: 103.3MHz) CPU Cooler: Noctua NH-D15 RAM: Adata XPG 2x8GB DDR3 (XMP: 2133MHz 10-11-11-30 CR2, custom: 2203MHz 10-11-10-26 CR1 tRFC:230 tREFI:14000) GPU: Asus GTX 1070 Dual (Super Jetstream vbios, +70(2025-2088MHz)/+400(8.8Gbps)) SSD: Samsung 840 Pro 256GB (main boot drive), Transcend SSD370 128GB PSU: Seasonic X-660 80+ Gold Case: Antec P110 Silent, 5 intakes 1 exhaust Monitor: AOC G2460PF 1080p 144Hz (150Hz max w/ DP, 121Hz max w/ HDMI) TN panel Keyboard: Logitech G610 Orion (Cherry MX Blue) with SteelSeries Apex M260 keycaps Mouse: BenQ Zowie FK1

 

Model: HP Omen 17 17-an110ca CPU: i7-8750H (0.125V core & cache, 50mV SA undervolt) GPU: GTX 1060 6GB Mobile (+80/+450, 1650MHz~1750MHz 0.78V~0.85V) RAM: 8+8GB DDR4-2400 18-17-17-39 2T Storage: HP EX920 1TB PCIe x4 M.2 SSD + Crucial MX500 1TB 2.5" SATA SSD, 128GB Toshiba PCIe x2 M.2 SSD (KBG30ZMV128G) gone cooking externally, 1TB Seagate 7200RPM 2.5" HDD (ST1000LM049-2GH172) left outside Monitor: 1080p 126Hz IPS G-sync

 

Desktop benching:

Cinebench R15 Single thread:168 Multi-thread: 833 

SuperPi (v1.5 from Techpowerup, PI value output) 16K: 0.100s 1M: 8.255s 32M: 7m 45.93s


6 minutes ago, Sakkura said:

Because it doesn't use 100% of the CPU core. Not only is it very difficult to do so, but there's no incentive to do that because it's a completely unrealistic load. Even synthetic benchmarks are trying to emulate real-world performance.

So am I thinking about this incorrectly? A hyperthreaded core has 2 front ends and 1 execution "section." So is it like.... this metaphor? If you were chopping wood and had someone putting logs on the stand, you could chop one log at a time, but if you had 2 people putting logs on the stand (on top of each other) (and you were strong enough) you could chop TWO pieces of wood at once. Or is it like "If you were chopping wood and had someone putting logs on the stand, you could chop one log at a time, but if you had 2 people putting logs on the stand you could still only chop 1 at a time, but you could chop faster because there is downtime when the person is grabbing the piece of wood."

 

If hyperthreading is like the former, then my questions still hold, but if it's like the latter then I understand. 


It's nearer the latter, but that's still a bit too simplistic a way of looking at it.

ASUS B650E-F GAMING WIFI + R7 7800X3D + 2x Corsair Vengeance 32GB DDR5-6000 CL30-36-36-76  + ASUS RTX 4090 TUF Gaming OC

Router:  Intel N100 (pfSense) Backup: GL.iNet GL-X3000/ Spitz AX Switches: Netgear MS510TXUP, MS510TXPP, GS110EMX
WiFi6: Zyxel NWA210AX (1.7Gbit peak at 160Mhz) WiFi5: Ubiquiti NanoHD OpenWRT (~500Mbit at 80Mhz)
ISPs: Zen Full Fibre 900 (~930Mbit down, 115Mbit up) + Three 5G (~1200Mbit down, 115Mbit up, variable)
Upgrading Laptop/Desktop CNVIo WiFi 5 cards to PCIe WiFi6e/7


1 minute ago, Jurrunio said:

A single thread doesn't feed the compute cores fast enough. Think about single-clutch sequential gearboxes compared to dual-clutch gearboxes on performance cars: even though the number of gears doesn't change, more clutches mean faster gear changes (as long as they use parts of similar grade).

Why doesn't it feed the compute cores fast enough? What's the limitation? CPU-> ram data transfer? CPU->CPU Cache data transfer? Why don't CPU manufacturers work on fixing THAT issue rather than just trying to shove more transistors in the chip?


3 minutes ago, Alex Atkin UK said:

It's the latter.

So a hyperthreaded core compute unit technically could run at twice the frequency of the CPU clock speed? OR is it that typical cores run at less than their "frequency" because they can't get data fast enough? 


Not sure what you mean. If you mean could it theoretically be as fast as a non-HT core running at twice the clock speed, then "theoretically", in very specific workloads, it could possibly come close. But it would have to be an extremely inefficient workload, I think.

All HT is really doing, I believe, is giving the CPU something else to work on while it's sat idle waiting for other parts of your system to deal with the data it needs to proceed with the current job.


2 minutes ago, Alex Atkin UK said:

Not sure what you mean. If you mean could it theoretically be as fast as a non-HT core running at twice the clock speed, then "theoretically", in very specific workloads, it could possibly come close. But it would have to be an extremely inefficient workload, I think.

All HT is really doing, I believe, is giving the CPU something to work on while it's sat idle waiting for other parts of your system to deal with the data it needs to proceed with the current job.

Say, for example, we had an extremely simple workload that didn't require storage access or anything. The thread could complete in a single cycle of the CPU. If you used a hyperthreaded CPU, would it be twice as fast as a non hyperthreaded CPU? If so, then the "compute core" would have to be running twice as fast (to deal with twice the amount of data). Or is it that in reality, CPUs spend a crap ton of their time sitting around waiting for stuff. And if that's true, then why on earth don't we work on fixing that problem? 


12 minutes ago, corrado33 said:

Why doesn't it feed the compute cores fast enough? What's the limitation? CPU-> ram data transfer? CPU->CPU Cache data transfer? Why don't CPU manufacturers work on fixing THAT issue rather than just trying to shove more transistors in the chip?

Limitations of RAM transfer speed and things like that are huge concerns, which is why they are already avoided as much as possible by advanced techniques (out of order execution, branch prediction etc.), which involve predicting what data will be needed later, and fetching it into the cache so it's ready to go when the CPU needs it. As you might imagine, this is not 100% reliable, because it involves predicting the future. It is not a "problem" that can be fixed, it is just reality. If you have any ideas for more accurate prediction methods I'm sure they'd be interested though :)

 

From the WP page:

Quote

When execution resources would not be used by the current task in a processor without hyper-threading, and especially when the processor is stalled, a hyper-threading equipped processor can use those execution resources to execute another scheduled task. (The processor may stall due to a cache miss, branch misprediction, or data dependency.)
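
The quoted behaviour can be sketched with a toy model. This assumes a core with a single execution unit that issues one instruction per cycle, and threads that stall for a fixed number of cycles after every few instructions (standing in for cache misses). `run`, `stall_every` and `stall_cycles` are hypothetical names, and real SMT cores are far more complicated than this.

```python
def run(threads, stall_every, stall_cycles):
    """Cycle-by-cycle toy model of one core with one shared execution unit.

    threads: list of instruction counts, one entry per hardware thread.
    Each thread stalls for `stall_cycles` cycles after every `stall_every`
    instructions. Returns the number of cycles until all threads finish.
    """
    remaining = list(threads)
    ready_at = [0] * len(threads)   # first cycle each thread may issue again
    issued = [0] * len(threads)
    cycle = 0
    while any(remaining):
        for t in range(len(threads)):
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                issued[t] += 1
                if issued[t] % stall_every == 0:
                    ready_at[t] = cycle + 1 + stall_cycles
                break               # only one instruction issues per cycle
        cycle += 1
    return cycle


# Two 1000-instruction tasks that stall 4 cycles after every 4 instructions.
print(2 * run([1000], 4, 4), run([1000, 1000], 4, 4))  # 3992 2000

# The same tasks with no stalls at all: the second thread gains nothing.
print(2 * run([1000], 4, 0), run([1000, 1000], 4, 0))  # 2000 2000
```

In the first case the second thread fills exactly the cycles the first one spends waiting, so two tasks finish in about the time one took. In the second case the execution unit was never idle, so there is nothing for a second thread to reclaim.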

 


I think it's basically impossible to be twice as fast; latencies would kill that. To be anything close, though, that workload would have to be somehow causing the CPU to be idle 50% of the time.

 

If the workload was 100% efficient (impossible in a real world workload) then HT would be pointless.
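
As a back-of-the-envelope version of that argument: assuming an idealized core where the only shared bottleneck is the execution hardware, and one thread keeps it busy for some fraction of the time, the best case for extra hardware threads is capped both by the thread count and by how much idle time there is to reclaim. `smt_speedup_bound` is a made-up helper for illustration, not a measured model of any real CPU.

```python
def smt_speedup_bound(busy_fraction, hw_threads=2):
    """Upper bound on throughput gain from running `hw_threads` identical threads.

    busy_fraction: share of cycles a single thread keeps the execution
    hardware busy (0 < busy_fraction <= 1).
    """
    if not 0 < busy_fraction <= 1:
        raise ValueError("busy_fraction must be in (0, 1]")
    return min(float(hw_threads), 1 / busy_fraction)


for busy in (0.5, 0.8, 1.0):
    print(busy, smt_speedup_bound(busy))
```

With the core idle half the time the bound is 2x, at 80% busy it drops to 1.25x, and at 100% busy it is 1x, which matches the point that a perfectly efficient workload would get nothing from HT. Real gains sit below this bound because the two threads also compete for cache and other shared resources.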


13 minutes ago, Jurrunio said:

A single thread doesn't feed the compute cores fast enough. Think about single-clutch sequential gearboxes compared to dual-clutch gearboxes on performance cars: even though the number of gears doesn't change, more clutches mean faster gear changes (as long as they use parts of similar grade).

Also, since the hyperthreaded and non hyperthreaded versions of chips are identical dies (a la the 6600k and 6700k), does that really mean there are freaking 4 other core "Front ends" just sitting around doing freaking nothing? Why? That doesn't make sense unless "bad dies" often kill parts of the front end. 


1 minute ago, Glenwing said:

Limitations of RAM transfer speed and things like that are huge concerns, which is why they are already avoided as much as possible by advanced techniques (out of order execution, branch prediction etc.), which involve predicting what data will be needed later, and fetching it into the cache so it's ready to go when the CPU needs it. As you might imagine, this is not 100% reliable, because it involves predicting the future. It is not a "problem" that can be fixed, it is just reality. If you have any ideas for more accurate prediction methods I'm sure they'd be interested though :)

So why not just give a CPU a crap ton of cache to work with? That way the prediction algorithms don't need to be accurate and therefore aren't the limiting factor as often.


1 minute ago, corrado33 said:

So why not just give a CPU a crap ton of cache to work with? That way the prediction algorithms don't need to be accurate and therefore aren't the limiting factor as often.

Cache takes up a lot of space. And when it's larger and more spread out, it can't operate as quickly.
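
One way to see the trade-off is the usual average memory access time formula: hit latency plus miss rate times miss penalty. The sketch below uses entirely made-up numbers (they are assumptions, not measurements of any real CPU) to show how a bigger cache that misses less often can still come out slower overall, because its longer hit latency is paid on every access.

```python
def average_access_time(hit_latency, miss_rate, miss_penalty):
    """Average cycles per memory access for a single cache in front of RAM."""
    return hit_latency + miss_rate * miss_penalty


RAM_PENALTY = 200  # hypothetical cycles to fetch from RAM on a miss

# Hypothetical small, fast cache: 4-cycle hits, misses 5% of the time.
small_fast = average_access_time(4, 0.05, RAM_PENALTY)

# Hypothetical huge, slower cache: 12-cycle hits, misses 2% of the time.
big_slow = average_access_time(12, 0.02, RAM_PENALTY)

print(small_fast, big_slow)
```

With different numbers the bigger cache wins, so this is not a rule, only an illustration of why "just add more cache" is a balancing act rather than a free improvement.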


5 minutes ago, corrado33 said:

Also, since the hyperthreaded and non hyperthreaded versions of chips are identical dies (a la the 6600k and 6700k), does that really mean there are freaking 4 other core "Front ends" just sitting around doing freaking nothing? Why? That doesn't make sense unless "bad dies" often kill parts of the front end. 

It makes sense from a profit perspective.

 

I remember it being widely reported many years ago that the demand for a lower-end AMD CPU was so high they literally were disabling "good" CPUs to sell them as lower-end models. Some people figured out how to hack it to re-enable the disabled cores and many were perfectly stable doing so (likely proving the theory), but I think it's pretty much impossible to reverse the disabling of parts these days.

 

Bottom line, it's more profitable to be selling lower-end CPUs than to have higher-end CPUs just sat in a warehouse somewhere.


4 minutes ago, corrado33 said:

Why?

So there's a reason for you to spend more money.

 

16 minutes ago, corrado33 said:

Why doesn't it feed the compute cores fast enough? What's the limitation? CPU-> ram data transfer? CPU->CPU Cache data transfer? Why don't CPU manufacturers work on fixing THAT issue rather than just trying to shove more transistors in the chip?

Memory to cache, cache management, etc.

 

2 minutes ago, corrado33 said:

So why not just give a CPU a crap ton of cache to work with?

Cost. It's too expensive to give a CPU a lot of cache. Also, more cache = a longer route for the signal to travel, which increases latency in the first place.


3 minutes ago, corrado33 said:

Also, since the hyperthreaded and non hyperthreaded versions of chips are identical dies (a la the 6600k and 6700k), does that really mean there are freaking 4 other core "Front ends" just sitting around doing freaking nothing? Why? That doesn't make sense unless "bad dies" often kill parts of the front end. 

It is an artificial market segmentation.

 

Consider the fact that it takes billions of dollars and many years to design a CPU. Those costs have to be paid for by the combined sales of all the chips. The manufacturing cost of the chips has very little to do with the retail price, so the idea that "the i3 and i7 have the same manufacturing cost" is irrelevant. They have a certain chip design, and that chip design needs to pay for itself. One way to do that is to open it to a wider market by offering lower cost options, so people who would not normally be able to afford a CPU can at least buy something, even if it's at a lower cost. In return for this discount, part of the processor is disabled.

 

Yes, it is an artificially created limitation, but that doesn't actually matter when you think about it.


4 minutes ago, Glenwing said:

Cache takes up a lot of space. And when it's larger and more spread out, it can't operate as quickly.

It's gotta be faster than fetching things from ram. Also, it seems that die maps could be changed. 

 

For example, this is an i7-980X:

[Image: i7-980X die shot]

 

All the cache is at one side of the die. Why don't they surround the cores in cache? I assume that the cores don't actually talk to each other, so there's no need for them to be side by side. What about increasing the number of... what I assume to be "cache controllers." 

 

Say, for example, each core has its own smaller cache while also having access to a larger, albeit slightly slower/further away shared cache?


9 minutes ago, Glenwing said:

It is an artificial market segmentation.

 

Consider the fact that it takes billions of dollars and many years to design a CPU. Those costs have to be paid for by the combined sales of all the chips. The manufacturing cost of the chips has very little to do with the retail price, so the idea that "the i3 and i7 have the same manufacturing cost" is irrelevant. They have a certain chip design, and that chip design needs to pay for itself. One way to do that is to open it to a wider market by offering lower cost options, so people who would not normally be able to afford a CPU can at least buy something, even if it's at a lower cost. In return for this discount, part of the processor is disabled.

 

Yes, it is an artificially created limitation, but that doesn't actually matter when you think about it.

So how do they "disable" the chips? I know sometimes they'll laser cut off extra cores, but other times (like the 6600k vs the 6700k) I'd imagine it's physically impossible to remove the extra "front ends." Is it an external resistor/capacitor that does it? I'm making the assumption here that the material used to hold the die (likely some sort of circuit board material... FR-4?) is identical between the 6600 and 6700, so the only chance at "customization" would be the little surface mount components on the bottom of the chip?

 

I mean, that'd be FAR too simple... right..... right?


1 minute ago, corrado33 said:

Say, for example, each core has its own smaller cache while also having access to a larger, albeit slightly slower/further away shared cache?

Already done a long time ago. The large section you are looking at is the L3 (level 3) cache. There is also a smaller and faster L2 cache and an even smaller (and even faster) L1 cache located within the actual cores.
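
A rough sketch of why the layering helps, assuming made-up latencies and miss rates for each level (none of these figures come from Intel, and `hierarchy_access_time` is a hypothetical helper). Each level only handles the accesses that missed in the level before it, so the slow outer levels are paid for rarely.

```python
def hierarchy_access_time(levels, memory_latency):
    """Average cycles per access through a cache hierarchy.

    levels: list of (hit_latency_cycles, local_miss_rate), ordered L1 outward.
    memory_latency: cycles to reach RAM after missing in the last level.
    """
    time = memory_latency
    # Work inward from RAM: each level adds its own latency and passes
    # only its misses on to the slower level behind it.
    for hit_latency, miss_rate in reversed(levels):
        time = hit_latency + miss_rate * time
    return time


# Hypothetical L1, L2, L3: (hit latency in cycles, fraction of accesses missed)
levels = [(4, 0.10), (12, 0.40), (40, 0.50)]

print(hierarchy_access_time(levels, 200))  # with all three levels
print(hierarchy_access_time([], 200))      # no cache at all: always 200
```

Under these assumed numbers the average access costs roughly a twentieth of a trip to RAM, even though the L3 on its own is ten times slower than the L1.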

 

7 minutes ago, corrado33 said:

All the cache is at one side of the die. Why don't they surround the cores in cache? I assume that the cores don't actually talk to each other, so there's no need for them to be side by side. What about increasing the number of... what I assume to be "cache controllers."

For specific design decisions you would probably need to talk to someone from Intel. But I think it's safe to say it's not quite as simple as that.

 

5 minutes ago, corrado33 said:

So how do they "disable" the chips? I know sometimes they'll laser cut off extra cores, but other times (like the 6600k vs the 6700k) I'd imagine it's physically impossible to remove the extra "front ends." Is it an external resistor/capacitor that does it? I'm making the assumption here that the material used to hold the die (likely some sort of circuit board material... FR-4?) is identical between the 6600 and 6700, so the only chance at "customization" would be the little surface mount components on the bottom of the chip?

That isn't something Intel talks about publicly :)


46 minutes ago, Sakkura said:

Something like that does exist for specialized uses. More often GPGPU gets used instead though.

 

So a GPU is essentially a CPU that's VERY... VERY heavily hyperthreaded? Possibly with smaller compute units because the GPU only has to do certain, specific calculations?


Just now, corrado33 said:

 

So a GPU is essentially a CPU that's VERY... VERY heavily hyperthreaded? Possibly with smaller compute units because the GPU only has to do certain, specific calculations?

Not hyperthreaded, a GPU has thousands of actual cores :) But they are heavily simplified cores with basically only one type of execution unit (FP32 normally) and maybe a few special function units, rather than a CPU core which has many hundreds of different types of operations it can pick from.
