Intel's Core i7-12700H Benchmarked Against Apple's M1 Max

Lightwreather
1 hour ago, mr moose said:

I wouldn't say that. As RejZoR said, HT and SMT are there because it's more cost-efficient to make more threads with HT/SMT than to make more full cores. I dare say we'd need an engineer to explain why.

 

My point was not so much to get into the technicalities of CPU design (apologies if it wasn't worded all that well), but simply to highlight a general issue single-threaded synthetic tests will face. More importantly, it's an issue we end users can't completely account for, as we don't know how many threads the scheduler is throwing onto the core alongside said benchmark, thus affecting the outcome. We could probably argue that it's not much different from the real world, but then again, the M1 won't have this issue, so one would expect the single thread/core test to be more accurate on it. However, real-world tests should quickly show us what's what.

 

 

Just as a note of interest, HT was introduced by Intel on (going from memory) Northwood processors (Xeons first I believe, but for desktop the P4), which we are told was rushed out to combat the awesome performance of the AMD lineup at the time. So it has been around for a very long time.

 

EDIT: @leadeater it doesn't help that GB calls them "single core scores", or that I use single core/thread interchangeably; I probably shouldn't, but I just don't care.

 

 

It doesn't really require an engineer to explain the reasoning behind HT/SMT. The HT/SMT logic (basically additional transistors on top of those for the main core itself) requires something like 10-15% of a full core in die surface area, and you gain 20-30% performance from that. A full extra core indeed gives the full 100% performance, but also requires an extra 100% of die surface area. So, almost regardless of how you turn it, you gain more in performance than you have to invest in extra die size.

 

Another reason for this is that cores are often not 100% utilized even when we see 100% core utilization in Windows. It says that, but you basically always have some unused per-core resources. HT/SMT basically ensures you really squeeze 100% of the resources from each core, or at least closer to 100%...
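To put rough numbers on that tradeoff, here's a tiny back-of-the-envelope calculation (the 12% area and 25% throughput figures are just illustrative midpoints of the ranges mentioned above, not measured values):

```python
# Perf-per-area comparison: adding SMT vs. adding a second full core.
# Figures are illustrative midpoints of the ranges quoted above.
base_area, base_perf = 1.00, 1.00        # one core, no SMT

smt_area = base_area * 1.12              # ~12% extra die area for SMT logic
smt_perf = base_perf * 1.25              # ~25% extra throughput from SMT

dual_area = base_area * 2.0              # a second full core
dual_perf = base_perf * 2.0              # best case: perfect scaling

print(f"SMT core:    {smt_perf / smt_area:.2f} perf per unit area")    # ~1.12
print(f"Second core: {dual_perf / dual_area:.2f} perf per unit area")  # 1.00
```

On these assumptions SMT buys more throughput per unit of die area than a second core does, even though the second core gives more absolute performance.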

[Attached image: Hyperthreading.webp]


2 hours ago, leadeater said:

Don't have a cry at me when others are literally having arguments about which core and core arch is better and has more performance than another.

I think those discussions are interesting and very relevant.

 

 

2 hours ago, leadeater said:

Total performance is certainly relevant

Total performance of the chip, with all of its cores, or total performance in a certain workload with X amounts of threads? Yes, very relevant.

Total performance of a core? No, it's not.

 

 

2 hours ago, leadeater said:

I've also yet to really see many applications today be well and truly only single thread as well, this is actually where HT/SMT can have effects that go both ways performance wise.

I guess it depends on what you define as "truly only multithreaded", and I suspect you added that word specifically to be nitpicking.

When I say single-thread performance matters in a lot of applications, I am not talking about programs that only spawn a single thread. I am talking about applications where the overall performance is tied to a single thread. It might be a program that spawns 10 threads, but if one thread is responsible for 90% of the work and the remaining 9 threads are responsible for the other 10%, then it is for all intents and purposes a single-threaded program. Throwing more cores or SMT at that program won't impact performance in any meaningful way. Switching to a CPU with higher per-thread performance will, however, have a massive impact.
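A rough Amdahl's-law style sketch of that hypothetical 90/10 split (treating the heavy thread as the serial portion; the numbers are just the ones from the paragraph above):

```python
# One thread does 90% of the work; the other 10% parallelises freely.
serial, parallel = 0.90, 0.10

def speedup(threads: int, per_thread_boost: float = 1.0) -> float:
    """Speedup vs. the baseline machine, Amdahl's-law style."""
    new_time = serial / per_thread_boost + parallel / (threads * per_thread_boost)
    return 1.0 / new_time

print(f"10 threads, same cores:     {speedup(10):.2f}x")     # ~1.10x
print(f"1 thread, 30% faster core:  {speedup(1, 1.3):.2f}x") # 1.30x
```

More threads barely move the needle, while a faster core lifts the whole program, which is the point being made.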

 

 

2 hours ago, leadeater said:

As above it is when used how I said I've seen it being used, not how you are using it right now.

In that case I would strongly recommend you just relearn what "single core performance" means, and equate it to single-threaded performance.

That's what it means in 99% of cases from what I've seen, and "per core performance", as in "how much work a particular core can do", is a meaningless term anyway, so losing a word to describe it won't matter.

You're gonna have a very bad time if you're going to go around correcting people who say "single core" instead of "single thread", and pretending to not understand that when people say "single core" they mean "single thread" will just lead to confusion.

 

 

2 hours ago, leadeater said:

No it really is not; a thread should be scheduled onto the same core where it makes sense for cache reasons, or simply because it doesn't need the extra performance or to tie up CPU time on another core.

 

For me this metric is hugely important for VM hosting, if not one of the most important.

I am not sure what you are talking about.

Can you give me an example where "full core utilization performance" has mattered more than a benchmark measuring performance at a given thread count?

 

If you are going to compare two chips, when does "per core" (in the literal sense of the words) matter more than per thread performance or performance at a given number of threads?


1 hour ago, LAwLz said:

I guess it depends on what you define as "truly only multithreaded", and I suspect you added that word specifically to be nitpicking.

Huh? I was talking single threaded, not multithreaded. An application is only truly single threaded when it only allocates a single thread and all handles are within that thread; such applications are very rare nowadays, with the more common examples being benchmark applications. Even then it's actually many threads, with only one of them executing the benchmark workload.

 

Almost everything else uses 2 or more threads, with utilization not required to be equal across them of course.

 

[Spoiler: screenshot of per-process thread counts]

The number of processes with 2 or fewer threads is very low, and none of those on my system are foreground user applications. However, not every thread in a process is executing simultaneously, so knowing how many are in use at any given time isn't as simple as opening Task Manager and going "steam.exe is using 38 threads". Also, given that in reality only one thread can be executing on an OS CPU thread at any one time, all threads are collectively time-sharing those OS CPU threads.
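If you want to reproduce that kind of count yourself, here's a minimal sketch using Python's psutil package (assuming it's installed; the numbers are allocated threads at a point in time, not threads actually executing):

```python
import psutil

# Snapshot of allocated (not necessarily executing) threads per process.
few, many = [], []
for proc in psutil.process_iter(attrs=["name", "num_threads"]):
    n = proc.info["num_threads"] or 0          # None if access was denied
    (few if n <= 2 else many).append((proc.info["name"] or "?", n))

print(f"Processes with <= 2 threads: {len(few)}")
print(f"Processes with  > 2 threads: {len(many)}")

# The five most thread-hungry processes, e.g. "steam.exe  38 threads"
for name, n in sorted(many, key=lambda t: t[1], reverse=True)[:5]:
    print(f"{name:<30} {n} threads")
```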

 

1 hour ago, LAwLz said:

When I say single-thread performance matters in a lot of applications, I am not talking about programs that only spawn a single thread. I am talking about applications where the overall performance is tied to a single thread. It might be a program that spawns 10 threads, but if one thread is responsible for 90% of the work and the remaining 9 threads are responsible for the other 10%, then it is for all intents and purposes a single-threaded program. Throwing more cores or SMT at that program won't impact performance in any meaningful way. Switching to a CPU with higher per-thread performance will, however, have a massive impact.

Yes, this includes what I was talking about; where those other threads go does matter, it's not irrelevant, and it can and does affect application performance.

 

Edit: This is a cleanliness problem with benchmarks; in the real world you aren't going to have a single application open and restrict your usage of the system the way you do when benchmarking. That only really holds for a set-and-forget task that's going to take a long time while you walk away from the computer. In the past I've had to close background music, depending on the codec used and whether HW acceleration was being used, as it would cause in-game stuttering, usually a sign my CPU really does suck now and it's time to upgrade. /Edit

 

 

1 hour ago, LAwLz said:

In that case I would strongly recommend you just relearn what "single core performance" means, and equate it to single-threaded performance.

I don't have to re-learn anything; how about you go back and read what I said 😉

 

1 hour ago, LAwLz said:

You're gonna have a very bad time if you're going to go around correcting people who say "single core" instead of "single thread", and pretending to not understand that when people say "single core" they mean "single thread" will just lead to confusion.

Who said anything about going around correcting them? All I said is that I observe this being done, and it generally doesn't matter or get in the way of what they are actually saying. But if the comment is specifically that a Golden Cove core has more performance than a Zen 3 core, and you measure that only with a single-thread benchmark, then this is inaccurate and potentially incorrect.

 

Of course that is a really hard thing to actually benchmark, because you could well run a workload on it that fully utilizes all the execution units of the core in the most efficient way, but that doesn't mean the same is being done on the comparison core architecture.

 

It matters where it matters and it doesn't where it doesn't, quite simple. There's literally no need to get so agitated over something you don't encounter where it's important, so it's not front of mind for you.

 

1 hour ago, LAwLz said:

I am not sure what you are talking about.

Can you give me an example where "full core utilization performance" has mattered more than a benchmark measuring performance at a given thread count?

 

If you are going to compare two chips, when does "per core" (in the literal sense of the words) matter more than per thread performance or performance at a given number of threads?

There are different situations where such things matter. One of them is where there is thread interdependency and the locality of the cache at the sub-arch level matters, because the thread-to-thread, cache-to-cache latency impacts the performance. I get this a lot in backup applications where you are moving a lot of data that's being split across threads, and there are also hash table lookups and network I/O scheduling going on. Optimizing NIC settings for things like RSS queues and the number of allowed data mover threads in the backup application matters a lot to get peak network and disk throughput.

 

Why this matters needs to be looked at in the reverse direction, opposite to the way you are thinking about it. What matters is how much more a core can do when a given high-utilization thread is placed on one of the core's threads and there is another thread with a dependency on it. That second thread would be best placed on the same core, but only if there are enough resources left that it would not slow down the original beyond the benefit of both threads sharing the same L2 cache and memory access paths, compared to landing on another CPU core thread that might be distant across the Intel Mesh or an AMD CCD.
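As a rough illustration of that placement idea: on Linux you can read which logical CPUs are SMT siblings of the same physical core and pin a process (and hence its cooperating threads) onto them. A minimal sketch, assuming a Linux host where /sys exposes the CPU topology and using the Linux-only os.sched_setaffinity call:

```python
import os

def smt_siblings(cpu: int = 0) -> set[int]:
    """Logical CPUs that share a physical core with `cpu` (Linux sysfs)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        text = f.read().strip()                 # e.g. "0,8" or "0-1"
    cpus: set[int] = set()
    for part in text.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus

siblings = smt_siblings(0)
print("SMT siblings of CPU 0:", siblings)

# Pin this process to those siblings so two cooperating threads share the
# same core's L2 and memory paths rather than bouncing across the mesh/CCDs.
os.sched_setaffinity(0, siblings)
```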

 

For VM hosting it matters quite a lot due to the layers of abstraction and how CPU cores are scheduled. When a VM asks for time on 8 CPU threads, is it best to give it 8 threads from 8 different CPU cores, or is it better to give it 8 threads from 6 cores, sharing 2 extra threads among those 6? What happens when there are 30 VMs on a single host with a total of 240 allocated vCPUs and only 40 physical cores and 80 threads? Would you not think that how much a specific core can do matters to application performance inside a specific VM allocated to a certain set of threads at a given time, and where those threads are actually being run?
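Putting numbers on that hypothetical host (these are just the figures from the paragraph above, nothing measured):

```python
# Hypothetical host from the example above.
vms             = 30
vcpus_allocated = 240     # total vCPUs handed out across those VMs
physical_cores  = 40
cpu_threads     = 80      # 2 SMT threads per physical core

print(f"Average vCPUs per VM:        {vcpus_allocated / vms:.0f}")               # 8
print(f"vCPU-to-thread ratio:        {vcpus_allocated / cpu_threads:.1f}:1")     # 3.0:1
print(f"vCPU-to-physical-core ratio: {vcpus_allocated / physical_cores:.1f}:1")  # 6.0:1
# An SMT thread is only a fraction of a core (~20-30% extra, per the earlier
# posts), so the 6:1 per-core figure is the better measure of real contention.
```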

 

Total CPU performance is important, but knowing how much each core can actually do also affects the number, and the workload profile, of VMs I allow to run on any given VM host. It would be a very bad time if all 30 VMs were SQL VMs, rather than 1 of them being an SQL VM and the others being web servers and license servers that barely go above idle but still require scheduled host thread time.

 

Actual per-core performance is what I use to know how many high-performance workload VMs I can safely place on a single host, alongside other low-performance ones, before performance becomes greatly impacted.

 

1 hour ago, LAwLz said:

Total performance of a core? No, it's not.

Again, yes, of course that is important. You've just been talking around ways to measure it, which I hope you realize is what you are ultimately doing.


4 hours ago, mr moose said:

I wouldn't say that. As RejZoR said, HT and SMT are there because it's more cost-efficient to make more threads with HT/SMT than to make more full cores. I dare say we'd need an engineer to explain why.

A very simplistic explanation of HT/SMT is the following:

 

A CPU core (in particular a CISC one, but not exclusive to CISC) almost always has parts that go unused during normal operation. Smart people figured out that if we have two threads performing different tasks that make use of different parts of the CPU core, we might as well throw both of these threads at the same CPU core, do them in parallel, and achieve higher utilization of the core. Thus HT/SMT was born.

This is also why the gains from HT/SMT vary widely. In the ideal situation HT/SMT will perform like two of the CPU core, because the tasks have no overlap in the hardware functions they need. In reality you often have an overlap in the functions needed, but you still gain some performance, since you only need to stall a small part of the task and can run a lot of it in parallel.

HT/SMT is a solution to the bloated CISC design.

If we back up to the original RISC idea, there is no need for HT/SMT since there are no complex function blocks that go unused; everything is just built up from a few basic functions at the CPU level.

Today there really isn't much that is built on the pure RISC idea, and thus we see some "RISC" CPUs with HT/SMT coming out.


32 minutes ago, Spindel said:

If we back up to the original RISC idea, there is no need for HT/SMT since there are no complex function blocks that go unused; everything is just built up from a few basic functions at the CPU level.

Not quite; not all resources in a pure RISC CPU get used either. RISC is merely a reduced instruction set, so if all you do is issue an integer instruction, a floating-point execution unit will do nothing while that instruction is executed in the pipeline. Given this, wouldn't it be more ideal to be able to issue two instructions to a single core, 1 INT and 1 FP? Or how about if each core has 2 or 4 INT execution units and an instruction only requires the use of 1 or 2? SMT is still applicable in RISC, and it's why IBM has Power processors all the way up to SMT8.

 

You are right though that not much is actually pure/classic RISC today; Power certainly is not.

 

Edit:

ThunderX2 and ThunderX3 ARM CPUs are SMT4. Cortex A65AE is SMT2. I think these are the only SMT ARM CPUs?


7 minutes ago, leadeater said:

Not quite; not all resources in a pure RISC CPU get used either. RISC is merely a reduced instruction set, so if all you do is issue an integer instruction, a floating-point execution unit will do nothing while that instruction is executed in the pipeline. Given this, wouldn't it be more ideal to be able to issue two instructions to a single core, 1 INT and 1 FP? Or how about if each core has 2 or 4 INT execution units and an instruction only requires the use of 1 or 2? SMT is still applicable in RISC, and it's why IBM has Power processors all the way up to SMT8.

 

You are right though that not much is actually pure/classic RISC today; Power certainly is not.

True. 
 

But UGHHH, being that scheduler breaking up threads at such a basic level 😛


1 minute ago, Spindel said:

True. 
 

But UGHHH, being that scheduler breaking up threads at such a basic level 😛

I'll stick to scripting and C# and let smarter things do all the work for me, haha. Having never programmed for a RISC system, and programming not being my core thing, I know for sure I'd be useless at it; I definitely need an environment that protects me from my own stupid.


Basically, the more you can keep all the execution units of a core active at any one time, the faster the chip will be. Any given application you're actually running, and what operations it's performing, can seriously hamper your ability to do so.

 

For a TL;DR version of what's been discussed in this thread so far, this video by Engadget from back in February gives a good layman's terms explanation.

 

 


15 minutes ago, Paul Thexton said:

Engadget from back in February gives a good layman's terms explanation

And a cat in the background so you know the video is going to be good. Cats, the superior pet 🙂

 


 

The only big omission was the power difference between Ryzen and M1 when talking about that.


2 minutes ago, leadeater said:

Cats, the superior pet 🙂

I'm in a temporary office setup at home right now while I decide how best to remodel the room I would normally be using... as such there's nothing behind me which my cats can treat as a deathmatch zone. Everybody at work is very upset at me for not migrating back to my proper office yet; they used to love watching my cats fight during meetings.


22 hours ago, Spindel said:

A very simplistic explanation of HT/SMT is the following:

 

A CPU core (in particular a CISC one, but not exclusive to CISC) almost always has parts that go unused during normal operation. Smart people figured out that if we have two threads performing different tasks that make use of different parts of the CPU core, we might as well throw both of these threads at the same CPU core, do them in parallel, and achieve higher utilization of the core. Thus HT/SMT was born.

This is also why the gains from HT/SMT vary widely. In the ideal situation HT/SMT will perform like two of the CPU core, because the tasks have no overlap in the hardware functions they need. In reality you often have an overlap in the functions needed, but you still gain some performance, since you only need to stall a small part of the task and can run a lot of it in parallel.

HT/SMT is a solution to the bloated CISC design.

If we back up to the original RISC idea, there is no need for HT/SMT since there are no complex function blocks that go unused; everything is just built up from a few basic functions at the CPU level.

Today there really isn't much that is built on the pure RISC idea, and thus we see some "RISC" CPUs with HT/SMT coming out.

I know how it works. What I meant was that if you wanted to know how well optimized the same CPU would be without HT/SMT, you'd probably need an engineer to explain it, simply because there is little to no literature on the core design of said CPUs that doesn't include HT/SMT.



On 11/15/2021 at 12:49 AM, RejZoR said:

It doesn't really require an engineer to explain the reasoning behind HT/SMT. The HT/SMT logic (basically additional transistors on top of those for the main core itself) requires something like 10-15% of a full core in die surface area, and you gain 20-30% performance from that. A full extra core indeed gives the full 100% performance, but also requires an extra 100% of die surface area. So, almost regardless of how you turn it, you gain more in performance than you have to invest in extra die size.

 

Another reason for this is that cores are often not 100% utilized even when we see 100% core utilization in Windows. It says that, but you basically always have some unused per-core resources. HT/SMT basically ensures you really squeeze 100% of the resources from each core, or at least closer to 100%...

[Attached image: Hyperthreading.webp]

To add to your excellent post, each thread also utilizes a core differently. SMT is more beneficial when threads aren’t fully utilizing the core’s resources. On the other hand, a well-coded thread that is able to leverage all of a core’s resources means little is available to share. 
 

Beyond this, SMT is also beneficial in switching between a multitude of lighter threads, though Little cores can be used for this too. 



On 11/13/2021 at 5:28 PM, SorryClaire said:

Wait, people actually take geekbench seriously?

Yes, because it is actually a legitimate benchmark.

 

The only problem I have with it is that a lot of reviewers do a single GB run and use it alone to evaluate overall performance.

 

You should supplement it with various other benchmarks such as PugetBench, Cinebench R23 and such to paint a better overall picture of what it can and cannot do.



4 hours ago, Zodiark1593 said:

Beyond this, SMT is also beneficial in switching between a multitude of lighter threads, though Little cores can be used for this too. 

I don't know how this works with Intel's heterogeneous offerings, but with the OG M1 (with 4 big and 4 small cores) I was initially surprised by how much of the "daily" workload is handled by the small cores.
 

You rarely see the big cores being used other than for short spikes here and there; most of the load is handled by the small cores.


1 hour ago, D13H4RD said:

 

You should supplement it with various other benchmarks such as PugetBench, Cinebench R23 and such to paint a better overall picture of what it can and cannot do.

And of course other benchmarks have their own problems. I think AnandTech pointed out that Cinebench has some weird behaviour on M1 Macs: when running the multithreaded test, the power draw is lower than it could be with the M1 using all the cores at the same time, and lower than in some other multithreaded workloads. Either CB simply can't use everything in the M1, or there is a bug or poor coding meaning it doesn't use all the resources available.
 

Either way, for me personally, CB is totally irrelevant to my use case anyway.


6 hours ago, Zodiark1593 said:

Beyond this, SMT is also beneficial in switching between a multitude of lighter threads, though Little cores can be used for this too. 

Back in the bad old days, Pentium 4 HT was great for when an application spun out of control and locked up a CPU thread at 100%; pretty well everything really was single threaded then, so the other thread prevented the OS from becoming completely unusable. I had both P4 HT and AMD at the time and kept switching between them because of that annoyance: performance vs stability.

 

Oh those were the days.


5 minutes ago, leadeater said:

Back in the bad old days, Pentium 4 HT was great for when an application spun out of control and locked up a CPU thread at 100%; pretty well everything really was single threaded then, so the other thread prevented the OS from becoming completely unusable. I had both P4 HT and AMD at the time and kept switching between them because of that annoyance: performance vs stability.

 

Oh those were the days.

PFFF...!

 

Athlon XP for life!

 

Still miss my Athlon XP 2500+ 😞 


5 minutes ago, Spindel said:

PFFF...!

 

Athlon XP for life!

 

Still miss my Athlon XP 2500+ 😞 

I had a Socket 939 platform with an FX CPU (the OG good ones, of course), but P4 HT still had its daily-life benefits outside of performance.


27 minutes ago, Spindel said:

PFFF...!

 

Athlon XP for life!

 

Still miss my Athlon XP 2500+ 😞 

Same. I fondly remember the first 64-bit AMD system I built to run Linux on.


2 minutes ago, Paul Thexton said:

Same. I fondly remember the first 64-bit AMD system I built to run Linux on.

Bahhh, fancy 64-bit stuff; the Athlon XP was a 32-bit system. Back then only Apple's G5 was 64-bit 😛


2 minutes ago, Spindel said:

Bahhh, fancy 64-bit stuff; the Athlon XP was a 32-bit system. Back then only Apple's G5 was 64-bit 😛

I'm old, so details are lost to alcohol at this point. I vaguely recall mine was either 1st or 2nd gen AMD 64-bit; I just don't clearly recall the product name by now 😂

 

The only reason I even remember it was 64-bit is because that's the machine I used to discover how many of the FOSS libraries I used at work didn't play nicely on 64-bit.


On 11/17/2021 at 2:16 PM, Spindel said:

And of course other benchmarks have their own problems. I think AnandTech pointed out that Cinebench has some weird behaviour on M1 Macs: when running the multithreaded test, the power draw is lower than it could be with the M1 using all the cores at the same time, and lower than in some other multithreaded workloads. Either CB simply can't use everything in the M1, or there is a bug or poor coding meaning it doesn't use all the resources available.
 

Either way, for me personally, CB is totally irrelevant to my use case anyway.

Yep, but I still think it's better to run them in order to paint a more complete picture of the architecture's strengths and weaknesses, alongside other metrics such as software compatibility.

 

That last bit is important because there are still several applications whose native Apple Silicon support is very iffy, or that still run through Rosetta emulation.



On 11/13/2021 at 1:28 AM, SorryClaire said:

Wait, people actually take geekbench seriously?

Well yes, because it's the only "fair" benchmark across device types.

 

That said, it doesn't mean it's representative of performance, because who cares if an ARM CPU inside a smartphone is more powerful than a laptop CPU inside a laptop with cooling fans? The power usage and cooling are completely different. Now, if that ARM CPU is inside an identical type of device (e.g. a passively cooled ultrabook, like the MacBook Air) to one with a comparable passively cooled Intel CPU (are there any?), you could really only compare it to the previous Intel MacBook Air. 12" laptops from Dell and HP running the ultra-low-power chips aren't even in the same ballpark for performance.

 

Hence it's fair when you compare the same device type, and misleading if you're comparing two different device types. 

 

You can't run Prime95 and PCMark/3DMark on an iPhone or an Android phone, or even on an ARM Windows or macOS device. They don't exist for those platforms.

 

Where Geekbench fouls things up is that GB2/GB3/GB4/GB5 keep moving the goalposts (and asking for more money for their benchmark tools, as you can't upgrade from 2 to 3 to 4 to 5), so a GB3 score and a GB5 score aren't even comparable to each other.

 

 

