[Guide] Hyper-threading and Windows, explained for real

vanished
1 hour ago, PianoPlayer88Key said:

@Ryan_Vickers maybe so. :( I still would like the single-threaded performance to catch up to where we would be if the pace hadn't slowed down.  Also, improved IPC from one generation to the next would be good.  5% isn't NEARLY good enough for me.  I was reading recently that the 286 was 2x or more faster than the 8086 at the same clock speed.

To my understanding, a CPU takes instructions from the software and breaks them down internally into micro-operations. These are then passed to the appropriate internal units for processing. The difficulty of keeping all the internal units fed is why we have hyper-threading. We need to feed the CPU faster to make use of what it already has before we can think about making it do more.

 

Also, to gain IPC, we have to think more about what IPC is... doing more per clock. I think the simple fact is, for a lot of the older and more commonly used instructions, they're about as optimised as they're ever going to be. Any IPC gains, measured as an overall average as opposed to on a specific instruction, will come from adding new instructions that do things faster than they could be done without them. Look at the processors in mobile devices. Overall, they're weak compared to x86, but they have been highly optimised for the tasks they're required to do, such as video decoding, and can drive high pixel count displays. Elsewhere, I do a lot of FPU intensive processing, and I see massive improvements across Intel generations. Sandy Bridge doubled throughput with AVX compared to its predecessor, which lacked it. Haswell was 50% faster again thanks to FMA. Skylake didn't add any new instructions in that area, but still showed a 14% improvement. As far as I can tell, the most likely contributor is the reduction in clock cycles needed for FMA execution from 5 to 4. That would be 20%, not considering pipelining and that not all instructions used are FMA.

1 hour ago, PianoPlayer88Key said:

Also with the trend toward better power efficiency ... I wonder if we'll ever see an enthusiast-class CPU and GPU ($350 to $1000 range) that at stock settings doesn't need a fan or even a heatsink, even with ambient temps around 40°C?  I think 15-20 years ago, even the highest-end consumer products didn't have any coolers on them, right?  Although they weren't nearly as capable then, I realize. :)

It is a contradiction to ask for an ultra-low-power enthusiast-grade product. Essentially, you buy by power budget: you either want low power (both consumption and performance) or high power (both consumption and performance). Mobile (phone, tablet) technology is probably the best compromise you're going to get for that requirement. I recently got an Nvidia Shield tablet, and it has no fans; arguably the case is the heatsink. Its 3D performance is not to be sniffed at. You're not going to play modern PC games on it, but it can handle the equivalent of older PC 3D games. For example, I understand they ported Portal to it.

35 minutes ago, straight_stewie said:

RAM is the largest performance bottleneck in modern computers. For some reason the speed of memory hasn't been able to keep pace with improvements in all other areas.

It depends on the use case: for a lot of applications, RAM performance is largely insignificant. However, I do fit one of the cases where RAM performance is as important as CPU clock, and I wish there were better options there, such as:

Quad channel RAM support in mainstream CPUs. I don't want to buy overpriced enthusiast CPUs just to get it.

Official CPU support for higher RAM speeds. I don't feel there is enough compatibility once you get to the 3000+ MT/s ballpark, and this should be part of the official specification rather than an overclocking afterthought.

It would also suffice, and even be preferable, to just have bigger on-CPU caches, since at sufficient size they negate the need for RAM performance to a large extent. The Broadwell CPU I got to try has 128MB of L4, which is sufficient for my uses for now. It is a bit slow compared to other caches at only 50GB/s bandwidth (roughly comparable to dual channel 3200), but Broadwell doesn't clock well anyway. Operating out of that cache meant I could use any old cheap RAM and not impact performance at all. I'd like to see it brought closer to the cores though, either as L2 or L3. As a rough estimate, 4 GB/s per core per core-GHz would be "practically unlimited" in my uses. For a quad core CPU with dual channel RAM, that would be equivalent to running the RAM at the same clock as the CPU.

 

Wow, this ended up longer than I thought, and I haven't got on to hyper threading yet... :) 

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


On 7/8/2016 at 6:53 PM, porina said:

To my understanding, a CPU takes instructions from the software, and breaks it down internally to micro-operations. These are then passed to the appropriate internal units for processing.

"Micro-operations" only go on in the Control Unit. You can accomplish the same goals with logic gates. Microcode and micro-operations simply allow an abstract (and short bit-length) statement like "ADD A, B" to be turned into a multi step operation like (this example assumes an imaginary accumulator machine):

  1. Increment the Program Counter.
  2. Put the Program Counter value in the Memory Address Register.
  3. Move the value located at the address in the Memory Address Register to the Instruction Register. (Fetch cycle: these three steps are the same for every instruction.)
  4. Move the value at address A to the X register.
  5. Move the value at address B to the Y register.
  6. Add the X and Y registers and store the result in ACC.

Micro-operations are these steps. Microcode is what allows these steps to occur. Microcode is really just specialized firmware stored in a ROM. This ROM is organized as an instruction lookup table, where the binary opcode of the instruction is the address of the first micro-operation that needs to take place for that instruction. The fetch cycle is left out because it is the same for every instruction. A "micro-word" (a term I just made up) is a control word: a word whose bits map 1:1 to control lines on pieces of the processor.

https://en.wikipedia.org/wiki/Microcode

 

On 7/8/2016 at 6:53 PM, porina said:

It depends on the use cases, since for a lot of them the ram performance is largely insignificant. However, I do fit in one of those cases where ram performance is as important as CPU clock, and I wish there were better options there - such as:

 

I have to disagree. It is known and widely discussed that RAM is the single largest performance hold-up in all modern systems. This is why ways to work around RAM have been invented, namely caches and DMA. Computerphile has a good video on this.


ENCRYPTION IS NOT A CRIME


13 minutes ago, straight_stewie said:

I have to disagree. It is known and widely discussed that RAM is the single largest performance hold-up in all modern systems. This is why ways to work around RAM have been invented, namely caches and DMA. Computerphile has a good video on this.

I guess we are looking at it from different perspectives. Because we do have those caches, the actual RAM speed makes relatively little practical difference for a lot of applications. The stuff I run often doesn't fit in the last-level cache, so RAM performance remains a bottleneck.



I'm going to admit to skimming through most of this thread (I did read a bit of the OP), but I feel like either the subject was not explained thoroughly enough or there is some misinformation still being spread about.

 

A background on processors

To understand how HyperThreading and Clustered Multithreading work (both are similar to an extent), a basic understanding of processors is needed.

 

A processor is divided into three major sections:

  • The front end, or control unit. This fetches instructions, decodes them, and schedules how to run them.
  • The back end, or execution unit. This does the actual work of executing instructions.
  • The memory interface, which connects the processor to RAM and handles caching.

Originally, processors were designed to do one thing from start to finish in a fetch, decode, execute, write-back pipeline. In simpler processors, this could just be fetch, execute, write back. Over time, several features were developed to increase the throughput of the processor.

 

What exactly is HyperThreading?

HyperThreading arose from a side effect of a feature in processors known as superscalar pipelining. A superscalar pipeline has two or more pathways for instructions to follow in order to increase instruction throughput, and superscalar pipelines often duplicate resources. What Intel found was that much of the time these duplicated resources went unused. To understand this better, here's a block diagram for a single core of a Nehalem-based Intel processor:

[Image: Intel Nehalem core architecture block diagram]

Towards the bottom are the execution units. Notice there are multiples of the same unit (AGU, ALU) and components that are split apart (FP ADD, FP MUL).

 

Another thing HyperThreading adds is a duplicate set of the processor's registers. Registers are small bits of memory that hold the current execution state, i.e., what the processor is working on right now. For example, if you were working out 1+2, registers hold those two numbers, and the result is written into another register.

 

So HyperThreading allows two CPU states to exist in a single core without creating an actual second core. At this point, if there's something ready to run on the secondary state and there are execution units available to it, the processor will use those execution units to do the task. So if you have one thread that's doing basic integer math (like 1+1, 2*3) and another that's doing floating point math, because there are separate resources available for those instructions, HyperThreading will run both at the same time.

 

The only catch is I'm not sure how processors do load balancing or how many execution units a thread actually uses, since HyperThreading has been shown to improve performance by only about 20% at best.

 

How is this related to Clustered Multithreading (CMT)?

For that, let's look at the block diagram for AMD's Bulldozer

[Image: AMD Bulldozer block diagram (CPU core)]

Essentially, CMT is almost identical to HyperThreading, but arranged differently. HyperThreading has two processor states sharing the exact same resources, which means one thread can hog all the resources and leave the other starved. CMT instead gives each processor state its own resources, making separate execution cores. What HyperThreading and CMT have in common is that both share the same front end.

 

Where CMT falters is that if a single thread needs extra execution resources, it can't borrow the other execution core's; it's stuck with what it has. This is why Bulldozer was lackluster in single-threaded performance: each integer core had fewer resources than a single K10 core.

 

Let's talk about threads real quick

Switching gears, let's talk about threads. As the OP said, a thread is a task in an application. One thread can handle the graphics, another the input, another some processing. Regardless, all threads share the following life cycle:

[Image: thread life cycle diagram]

  • Running: The thread is currently being served by the processor.
  • Waiting/Sleeping: The thread currently has no task to do or is waiting on something to be freed up. An example would be a thread waiting to use the hard drive.
  • Ready: The thread is ready to run.
  • Done: The thread is no longer needed and its resources are cleaned up.

 

Okay, so about Windows, scheduling, and all that

One of the most misinformed areas I've seen on this topic is how operating systems schedule their workloads. Here's how Windows and macOS schedule their tasks:

  • Every application breaks its work up into threads. It can use as many threads as it needs.
  • The OS then looks for threads that are in the ready state.
  • Ready threads are queued up on a first-come, first-served basis. Threads may also have priority over others.
  • When it comes time to schedule a new thread to run, the OS picks first by priority, then by who's next in line.

In other words, operating systems don't care about the application they're running. They only care about the threads that are available. Think of an application like building a car: the boss in charge of manufacturing doesn't care which car is being built, only which components need to be assembled and who can do it.

 

Linux has a different approach to scheduling, called the Completely Fair Scheduler. Tasks are still broken up into threads, but the scheduler picks whichever thread has spent the least amount of time executing.

 

The Misconception of "Dual threaded" or "Is only good up to x amount of cores"

Applications come in two flavors of execution style: serial or parallel (though different forms of parallelism exist). Serial means all tasks execute one after another, with the next task in line waiting for the current one to finish. Parallel means tasks are broken up so they don't wait on each other.

 

In programming, you have what is called a "main loop": basically a forever loop that keeps the program running until something tells it to exit. A serial program will look something like this (we're making a car):

 

for(;;){
  make_frame();
  make_engine();
  make_instruments();
  put_panels_on_frame();
  mount_engine();
  install_instruments();
}

And the program will continue doing this, in order, forever. However, some of these tasks can be done in parallel. The first three aren't dependent on each other, while the last three each depend on the frame (and the relevant part) being made first, but can otherwise run in parallel to some degree.

 

So in regards to the first part, "dual threaded" or whatever flavor it is, it's a misconception that the program literally has only two threads. Pop open Task Manager's "Details" tab and show the "Threads" column, and you'll find a lot of processes have more than two threads. My Firefox instance at the time of writing has 88 threads running.

 

However, I've also heard that term used to mean "it can only run on two cores, period," which is a misconception based on how operating systems schedule tasks. Applications do not schedule themselves, and they do not know how many cores the PC has. A program may simply never have more than two threads ready to run at once throughout its life. But there's also another stick in the mud.

 

Operating systems these days have also gotten smarter about scheduling and using their resources. For example, if you have a quad-core processor but your task only uses a little under 50% of the CPU, the OS may not fire up all four cores in order to be more efficient; it may only fire up two. That doesn't mean the application can only produce two threads. It may be that some threads wait often enough that others can slip in and do work, or the program has 10 threads but they finish so quickly they don't utilize the CPU all that much.

 

The other factor is simply how good the processor is at doing things. An application may stop gaining performance after four cores, but only on that architecture. If you hit that limit on a Skylake processor, a 6-core Nehalem would still see a benefit over a 4-core Nehalem, because Nehalem does less work per core than Skylake. Otherwise, shouldn't PS4 and XB1 games ported to the PC require 8 cores?

 

The tl;dr version is this: scheduling is a complicated subject, and it can't be reduced to "programs are incapable of doing better due to some hard cap," because there is no hard cap.


18 minutes ago, M.Yurizaki said:

*snip*

Wow, that's a lot of info!  Thanks :)

I'm going to reply to this more in detail later, but until then, know my feelings are a mix of "informative", "agree", and "I want to discuss this" ;)

Solve your own audio issues  |  First Steps with RPi 3  |  Humidity & Condensation  |  Sleep & Hibernation  |  Overclocking RAM  |  Making Backups  |  Displays  |  4K / 8K / 16K / etc.  |  Do I need 80+ Platinum?

If you can read this you're using the wrong theme.  You can change it at the bottom.


2 minutes ago, Ryan_Vickers said:

Wow, that's a lot of info!  Thanks :)

I'm going to reply to this more in detail later, but until then, know my feelings are a mix of "informative", "agree", and "I want to discuss this" ;)

If you have any questions let me know, because I sort of threw this together without re-re-rereading it.


44 minutes ago, M.Yurizaki said:

[snip part I'm replying to]

 

How is this related to Clustered Multithreading (CMT)?

[heading left for landmark]

Thanks again, very informative.  It really helps to have some detailed descriptions on this kind of thing.  That said, would you agree that my analogy is more or less correct, in the context of people looking for a simple explanation that will allow them to predict HT's behaviour, and not necessarily fully understand every technical aspect?

 

44 minutes ago, M.Yurizaki said:

Let's talk about threads real quick

[snip]

Yeah that all sounds right :) Did any part of my OP seem like I was off about any of these points?

 

44 minutes ago, M.Yurizaki said:

The Misconception of "Dual threaded" or "Is only good up to x amount of cores"

[snip]

Here's where I wanted to chat... to the best of my knowledge, everything I had in the OP was right in this regard.  Suppose you have a program that has several threads... say, 7 (doesn't really matter) but only 2 of them run tasks that are CPU intensive enough to even show up on the chart in task manager.  Suppose that those 2 threads are actually so intensive that each alone could completely occupy 1 core.  Running this application on a quad core vs an 8 core (all other things equal) should have no impact on the application's performance, correct?  In fact, even running it on a dual core should be fine (but all those little background processes might make a measurable dent in performance in this situation I would imagine).



1 hour ago, Ryan_Vickers said:

Thanks again, very informative.  It really helps to have some detailed descriptions on this kind of thing.  That said, would you agree that my analogy is more or less correct, in the context of people looking for a simple explanation that will allow them to predict HT's behaviour, and not necessarily fully understand every technical aspect?

I think that's a pretty good way of looking at it. You could add something like: say the checkout line has someone bagging the items, but another person is ready to check out, so the bagger can go open another register.

 

1 hour ago, Ryan_Vickers said:

Yeah that all sounds right :) Did any part of my OP seem like I was off about any of these points?

Pretty much hit the nail on the head and I was likely repeating what you said.

 

1 hour ago, Ryan_Vickers said:

Here's where I wanted to chat... to the best of my knowledge, everything I had in the OP was right in this regard.  Suppose you have a program that has several threads... say, 7 (doesn't really matter) but only 2 of them run tasks that are CPU intensive enough to even show up on the chart in task manager.  Suppose that those 2 threads are actually so intensive that each alone could completely occupy 1 core.  Running this application on a quad core vs an 8 core (all other things equal) should have no impact on the application's performance, correct?  In fact, even running it on a dual core should be fine (but all those little background processes might make a measurable dent in performance in this situation I would imagine).

All other things being equal, yes, it wouldn't matter if you threw more cores at the problem, assuming the other threads aren't CPU-intensive enough. It might hiccup on a dual core depending on what the other threads are doing. And I think this is where people get all confused.


Yay, double posting! But I think I've thought of another example of what HyperThreading is like.

 

Say we have Bethesda. They're pouring all of their resources into the new Elder Scrolls game. At some point the artists are done: no more concept work needs to be made, we know what the characters look like, and so on. Effectively they're done on this project. The boss in charge of Fallout goes, "Hey, we've got some work for you guys," and those artists start working on Fallout. In effect, the entire company is working on two things at once. The TES team is still working at its peak, while the Fallout work gets done by people who would otherwise sit idle.

