
Apple M1 Ultra - 2nd highest multicore score, lost to 64-core AMD Threadripper.

TheReal1980
3 hours ago, igormp said:

One point that you left out is that geekbench doesn't "force" raw CPU usage, and most of those tasks are offloaded to dedicated hardware

Source on this claim? I can't find any evidence that this is true. 

I suspect that it's one of those myths that has been repeated enough times to the point where it's "common knowledge" that doesn't get checked for validity. 

 

 

Edit:

I found a few posts with people saying similar things, but I think that stems from some confusion regarding cryptographic functions.

For a couple of years now, CPUs have included specific instructions for accelerating some common cryptographic functions such as AES. While some might say that "doesn't test CPU performance", it's still part of the CPU, not some ASIC or other specialized hardware. But cryptography is just 5% of the score, and all vendors have it (including AMD and Intel), so it doesn't really matter. 
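For context on what "part of the CPU" means here: on x86 those AES instructions (AES-NI) are reachable from ordinary code through compiler intrinsics, with no driver or separate device involved. A minimal sketch, assuming an AES-NI-capable CPU and compiling with -maes; the block and round keys below are placeholders, not a real key schedule:

```cpp
#include <immintrin.h>  // pulls in the AES-NI intrinsics (compile with -maes)
#include <cstdio>

int main() {
    // Placeholder 128-bit plaintext block and round keys. A real AES-128
    // encryption runs 10 rounds with keys from a key schedule; this only
    // shows that the work is done by regular CPU instructions in the core.
    __m128i block = _mm_set1_epi32(0x01234567);
    __m128i rk1   = _mm_set1_epi32(0x0f0f0f0f);
    __m128i rk2   = _mm_set1_epi32(0x13572468);

    block = _mm_xor_si128(block, rk1);        // AddRoundKey
    block = _mm_aesenc_si128(block, rk2);     // one full AES round (AESENC)
    block = _mm_aesenclast_si128(block, rk2); // final round (AESENCLAST)

    alignas(16) unsigned int out[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), block);
    std::printf("%08x %08x %08x %08x\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```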

 

 

There is "Geekbench compute" that uses the GPU, but that's a separate program from the CPU version of Geekbench. 


8 hours ago, Dracarris said:

If their presentation is anything to go by, this is not your classic multi-chiplet approach that we know from AMD. Does anyone know the bandwidth of infinity fabric? Btw not having a separate IO die is a huge advantage in terms of speed and power. However, not having one limits the scalability, so for M2 they'll need to create interfaces on at least two edges of the die to scale up beyond two.

Here lies the problem: Apple's die-to-die connection is not just for the CPU clusters but for everything. And bandwidth is only one metric; we know nothing about the latency or software support. Say what you want about AMD's implementation, but it's good enough. Scaling with multi-core workloads is excellent. If the scheduler is not switching threads wildly between CCXs, you wouldn't know it's there. 

And before you predict anything for M2, wait for the final iteration of M1 with the Mac Pro. It could be a dual socket design for all we know. They will pull a trick out of their sleeves. 😉


50 minutes ago, HenrySalayne said:

And before you predict anything for M2, wait for the final iteration of M1 with the Mac Pro. It could be a dual socket design for all we know. They will pull a trick out of their sleeves. 😉

I'm starting to wonder if the Mac Pro will get its own bespoke silicon under a different name (even if it's using the same A14 cores); the wording they used did sound like the M1 Ultra is the last M1 variant. 
 

WWDC should be interesting this year.


2 hours ago, LAwLz said:

Source on this claim? I can't find any evidence that this is true. 

I suspect that it's one of those myths that has been repeated enough times to the point where it's "common knowledge" that doesn't get checked for validity. 

 

 

Edit:

I found a few posts with people saying similar things, but I think that stems from some confusion regarding cryptographic functions.

For a couple of years now, CPUs have included specific instructions for accelerating some common cryptographic functions such as AES. While some might say that "doesn't test CPU performance", it's still part of the CPU, not some ASIC or other specialized hardware. But cryptography is just 5% of the score, and all vendors have it (including AMD and Intel), so it doesn't really matter. 

 

 

There is "Geekbench compute" that uses the GPU, but that's a separate program from the CPU version of Geekbench. 

Even vector units are accelerators, or accelerated paths, so it's not just things like AES/cryptographic functions. Both are in each core as well, unlike a dedicated matrix unit or ML/neural engine that sits within the SoC but outside the CPU cores. Does GB even utilize that at all? And does it leverage the media engines at all?

 

Does this mean that if we are testing FP AVX2 or AVX-512 we are not testing "CPU performance"? Crypto benchmarking shouldn't be seen as any different from this. And even if the conversation were to shift more towards things like those Neural Engines, and GB or something else did utilize them, then that would either push other vendors to implement something similar in their own products or lead them to decide they don't need it or that it doesn't benefit the platforms their products go into.
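To illustrate that point, an AVX2 path is just a wider way of issuing the same work to the same core. A minimal sketch (hypothetical loop, assumes an AVX2-capable CPU and compiling with -mavx2); both functions run entirely on the CPU core, only the width per instruction differs:

```cpp
#include <immintrin.h>  // AVX2 intrinsics
#include <cstdio>

// Scalar reference: one float per iteration.
float sum_scalar(const float* x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += x[i];
    return s;
}

// AVX2 version: eight floats per iteration, executed by the core's vector units.
float sum_avx2(const float* x, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, acc);
    float s = tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
    for (; i < n; ++i) s += x[i];  // scalar tail
    return s;
}

int main() {
    float data[32];
    for (int i = 0; i < 32; ++i) data[i] = 1.0f;
    std::printf("scalar=%f avx2=%f\n", sum_scalar(data, 32), sum_avx2(data, 32));
    return 0;
}
```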

 

In silicon, ASICs or accelerators are there to be used, whether or not GB utilizes them, and I would sure bet a lot of Apple software does/will/would utilize what is there when it is possible to do so. So, as you said, these types of distinctions don't matter. The only time it really would matter is if you had some special workload and requirement for crypto or ML and that and only that sub-score mattered; at that point benchmarking should be done with a more dedicated tool anyway.


2 hours ago, LAwLz said:

I suspect that it's one of those myths that has been repeated enough times to the point where it's "common knowledge" that doesn't get checked for validity. 

 

Yep, Geekbench is a benchmark of a collection of common tasks you do on the device. For these tasks it uses the best part of the system to do them (this is not just on Apple chips but also Intel/AMD chips).


For example, it will use AVX and even the Intel media engine when needed. This is a completely valid comparison of the chip; there is no point saying it should strictly ensure that it only ever uses the CPU cores (whatever that means), since no real-world application doing that task would be stupid enough to do it on the CPU cores when there are alternative pathways that free up CPU time for other tasks. (Sometimes using the co-processors is not faster than the CPU, but it frees up the CPU to do other work.)

If you look at other benchmarks (like Cinebench), on x86 they are highly optimised and make use of lots of vendor-specific extensions, but on Apple Silicon Cinebench does not (to date) make use of the equivalent extensions. It is effectively running plain, simple C without much (if any) vectorisation and no use of the AMX (matrix/vector) units, even though on x86 it makes extensive use of AVX and other extensions. This is a valid benchmark if and only if your use case is CPU rendering of small scenes in Cinema 4D (the Cinebench scene has a very small memory footprint, which is a key parameter in how this might perform). But if you want to treat Cinebench as a buying guide for any other task you are very wrong (even other CPU-based raytracing tasks can be optimised very differently). 
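For what "vectorisation" means on the Apple side: the documented ARM equivalent of AVX is NEON, exposed through arm_neon.h (AMX has no public API, so it is left out here). A minimal, hypothetical dot-product sketch assuming an AArch64 compiler, roughly the kind of hand-written path the post above says Cinebench lacks on AS:

```cpp
#include <arm_neon.h>  // NEON intrinsics, available with any AArch64 compiler
#include <cstdio>

// Plain C: what you get if the compiler doesn't (or can't) vectorise.
float dot_scalar(const float* a, const float* b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// NEON: four multiply-accumulates per instruction on the core's vector units.
float dot_neon(const float* a, const float* b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));  // fused multiply-add
    float s = vaddvq_f32(acc);  // horizontal sum of the four lanes
    for (; i < n; ++i) s += a[i] * b[i];  // scalar tail
    return s;
}

int main() {
    float a[16], b[16];
    for (int i = 0; i < 16; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    std::printf("scalar=%f neon=%f\n", dot_scalar(a, b, 16), dot_neon(a, b, 16));
    return 0;
}
```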


--
People always say GB is not fair since it performs better on macOS, but if you look at Intel Macs and pull up scores from macOS, Windows and Linux, you will find GB performs better on the same hardware on Linux than it does on macOS! (Likely due to less background shit running at the same time.)


On 3/10/2022 at 1:55 AM, jaslion said:

Of course, take it with a huge grain of salt as well; Geekbench is a specific set of instructions and, as proven with the regular M1 and its current variations, is not a reflection of real-world performance.

 

No doubt the M1 is going to perform well. Just gotta keep in mind that you aren't buying one of the best CPUs of the current day in a tiny, relatively cheap box for the performance it seems to produce.

Yes. Benchmarks aren't proof of any kind of real-world performance... just stop.


6 hours ago, hishnash said:

If you look at other benchmarks (like Cinebench), on x86 they are highly optimised and make use of lots of vendor-specific extensions, but on Apple Silicon Cinebench does not (to date) make use of the equivalent extensions. It is effectively running plain, simple C without much (if any) vectorisation and no use of the AMX (matrix/vector) units, even though on x86 it makes extensive use of AVX and other extensions. This is a valid benchmark if and only if your use case is CPU rendering of small scenes in Cinema 4D (the Cinebench scene has a very small memory footprint, which is a key parameter in how this might perform). But if you want to treat Cinebench as a buying guide for any other task you are very wrong (even other CPU-based raytracing tasks can be optimised very differently). 

Cinebench is an interesting case, partly because it has become the darling of YouTubers and partly because it reflects typical computer usage worse than, for example, Geekbench.

 

But the most interesting part is that Cinebench is unable to push AS to its max. Running other synthetic loads on all cores gives higher power use than running Cinebench.

 

 


13 minutes ago, Spindel said:

But the most interesting part is that Cinebench is unable to push AS to its max. Running other synthetic loads on all cores gives higher power use than running Cinebench.

 

The simple reason it is not able to push AS to its max is that it is very poorly optimised for AS; the devs have not put any direct effort into making it use things like the AMX (matrix) co-processors or even ARM's vector extensions. And why would they? It is intentionally taken directly from Cinema 4D, and that product really does not expect people to be using CPU raytracing on AS. When the GPU on AS has the same amount of addressable memory as the CPU, there is no good reason to use the CPU renderer on this platform.

On other platforms the reason you might opt for CPU-based rendering over GPU-based rendering is that your scene and its assets are too large to fit in your GPU's VRAM (not the case at all for the test scenes in Cinebench). But with AS having the same memory for the CPU and GPU, if you can run your render on the CPU without swapping to disk, you can also run it on the GPU (or even both at once), and this will give orders of magnitude better perf! This is why the makers of Cinema 4D are not going to waste any time optimising CPU rendering for Apple Silicon; for them, paying customers are never going to use the CPU renderer on this platform, and when you're a company you care about paying customers! 


7 minutes ago, hishnash said:

The simple reason it is not able to push AS to its max is that it is very poorly optimised for AS; the devs have not put any direct effort into making it use things like the AMX (matrix) co-processors or even ARM's vector extensions. And why would they? It is intentionally taken directly from Cinema 4D, and that product really does not expect people to be using CPU raytracing on AS. When the GPU on AS has the same amount of addressable memory as the CPU, there is no good reason to use the CPU renderer on this platform.

On other platforms the reason you might opt for CPU-based rendering over GPU-based rendering is that your scene and its assets are too large to fit in your GPU's VRAM (not the case at all for the test scenes in Cinebench). But with AS having the same memory for the CPU and GPU, if you can run your render on the CPU without swapping to disk, you can also run it on the GPU (or even both at once), and this will give orders of magnitude better perf! This is why the makers of Cinema 4D are not going to waste any time optimising CPU rendering for Apple Silicon; for them, paying customers are never going to use the CPU renderer on this platform, and when you're a company you care about paying customers! 

Is there a way to run CB on the GPU?

 

Would be an interesting test.

 

EDIT: or why not CPU+GPU?


31 minutes ago, Spindel said:

Is there a way to run CB on the GPU?

 

No, CB is a benchmark of the CPU rendering provided by Cinema 4D. That is it; it is in fact a very poor benchmark of your CPU for any other task (even other CPU-based raytracers do not always agree on which machine is fastest!). The fact that the YouTube industry (and OC industry) have opted to use CB is a little stupid! 

 


8 hours ago, hishnash said:

People always say GB is not fair since it performs better on macOS, but if you look at Intel Macs and pull up scores from macOS, Windows and Linux, you will find GB performs better on the same hardware on Linux than it does on macOS! (Likely due to less background shit running at the same time.)

Just a slight correction: people complain it runs better on ARM, not so much macOS. It's a hardware-based complaint, based on the simple instructions applications feed to the ARM ISA, and those instructions yield very good performance. Similarly, those can tend to do slightly worse on x86 due to all the decoding overhead; that said, that's what the instruction and data caches are for, to mitigate that. Much of a muchness I guess, and I don't think there has been much effort to actually evidence this, so it's hearsay and speculation.


1 hour ago, hishnash said:

No, CB is a benchmark of the CPU rendering provided by Cinema 4D. That is it; it is in fact a very poor benchmark of your CPU for any other task (even other CPU-based raytracers do not always agree on which machine is fastest!). The fact that the YouTube industry (and OC industry) have opted to use CB is a little stupid! 

That's because it's simple to run, a stable application, and gives highly repeatable and accurate results. That's actually a hard thing to come by; 3DMark is a little worse in that respect, plus it's not a CPU-only benchmark either.


10 hours ago, Paul Thexton said:

I'm starting to wonder if the Mac Pro will get its own bespoke silicon under a different name (even if it's using the same A14 cores); the wording they used did sound like the M1 Ultra is the last M1 variant. 
 

WWDC should be interesting this year.

 

To recap, Apple has made 3 chips during the A14 gen

1) A14

2) M1

3) M1 Max (with M1 Pro and M1 Ultra being derivatives of the M1 Max)

 

Each one of these justifies its existence by being used in a wide array of final products. 

 

Would Apple just throw “f*ck you money” at developing a 4th chip, let’s call it “P1”, just for the low-volume MacPro? What’s the big picture behind this? Expecting the MacPro to become more adopted? Expecting to re-enter the server market with an Xserve machine? Experimenting with huge chips on the road to the mother-of-all-Apple-chips humongous chip that will be used in the Apple Car self driving system? Interesting scenarios and open questions. 


11 hours ago, HenrySalayne said:

And before you predict anything for M2, wait for the final iteration of M1 with the Mac Pro. It could be a dual socket design for all we know. They will pull a trick out of their sleeves. 😉

Would Apple boast about their UltraFusion software-transparent approach being superior and then 3 months later go back to a dual socket (or even a more traditional dual chiplet) approach? Not sure about that. Unless it’s instrumental to creating a pool of stick-based expandable RAM. 


36 minutes ago, leadeater said:

Just a slight correction: people complain it runs better on ARM, not so much macOS. It's a hardware-based complaint, based on the simple instructions applications feed to the ARM ISA, and those instructions yield very good performance. Similarly, those can tend to do slightly worse on x86 due to all the decoding overhead; that said, that's what the instruction and data caches are for, to mitigate that. Much of a muchness I guess, and I don't think there has been much effort to actually evidence this, so it's hearsay and speculation.

GB is not hand-coded in assembly; it will be mostly C/C++, so really this is a judgment of the quality of the C/C++ compilers... hint: this applies to all real-world applications (other than those hand-crafted in x86 assembly). Almost all code running on modern x86 CPUs these days is `simplistic instructions`. So if GB performs better on a given ISA than others, other applications will behave just the same, which shows it is a valid benchmark. 

 

36 minutes ago, leadeater said:

That's because it's simple to run, stable application and give highly repeatable and accurate results. That's actually a hard thing to come by, 3DMark is actually a little worse in that respect, plus not a CPU only benchmark either.

Benchmarks like GB are also easy to run and do a much better job of stressing the entire system. Of course, there is the discussion point of the weighting that GB gives to different tasks; this is subjective, and depending on the weighting you choose, different systems will perform better than others. The other issue GB has is that it is quite a short benchmark for any given part of the CPU, so on systems with timed boost clocks (x86 systems, not ARM) it will perform better than in a real-world long task.
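To make the weighting point concrete, here is a toy composite score: a weighted geometric mean over per-workload speedups versus a common baseline. This is not Geekbench's actual formula; the workloads, ratios and weights are invented purely to show that the choice of weighting alone can flip which system "wins":

```cpp
#include <cmath>
#include <cstdio>

// Weighted geometric mean of per-workload speedup ratios vs a baseline.
double composite(const double ratio[], const double weight[], int n) {
    double log_sum = 0.0, weight_sum = 0.0;
    for (int i = 0; i < n; ++i) {
        log_sum += weight[i] * std::log(ratio[i]);
        weight_sum += weight[i];
    }
    return std::exp(log_sum / weight_sum);
}

int main() {
    // Hypothetical speedups over the same baseline, ordered {integer, FP, crypto}.
    double cpu_a[3] = {1.10, 1.10, 2.00};  // strong crypto, average elsewhere
    double cpu_b[3] = {1.30, 1.30, 1.00};  // no crypto acceleration, faster elsewhere

    double w_low_crypto[3]  = {0.65, 0.30, 0.05};  // crypto barely counts
    double w_high_crypto[3] = {0.40, 0.30, 0.30};  // crypto counts a lot

    std::printf("low-crypto weighting : A=%.3f B=%.3f\n",
                composite(cpu_a, w_low_crypto, 3), composite(cpu_b, w_low_crypto, 3));
    std::printf("high-crypto weighting: A=%.3f B=%.3f\n",
                composite(cpu_a, w_high_crypto, 3), composite(cpu_b, w_high_crypto, 3));
    return 0;
}
```

With the first weighting CPU B comes out ahead; with the second, CPU A does, even though no individual sub-score changed.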


5 minutes ago, saltycaramel said:

Would Apple boast about their UltraFusion software-transparent approach being superior and then 3 months later go back to a dual socket (or even a more traditional dual chiplet) approach? Not sure about that. Unless it’s instrumental to creating a pool of stick-based expandable RAM. 

 

31 minutes ago, saltycaramel said:

Would Apple just throw “f*ck you money” at developing a 4th chip, let’s call it “P1”, just for the low-volume MacPro? What’s the big picture behind this? Expecting the MacPro to become more adopted? Expecting to re-enter the server market with an Xserve machine? Experimenting with huge chips on the road to the mother-of-all-Apple-chips humongous chip that will be used in the Apple Car self driving system? Interesting scenarios and open questions. 

No, I expect this year's Mac Pro will use the M1 Ultra die with a few additions.

It will support a two-tiered memory system, with the expandable memory being OS-managed like a very fast/write-tolerant swap (possibly provided through the PCIe/MPX slots). 

And Apple will provide add-in GPU cards that can be used for apps that support multi-GPU (apps for the Mac Pro already support this). I feel that these cards might well make use of M1 Ultra dies that have defective CPU cores, but others here think that is unlikely and that they would rather be bespoke dies.

The M1 Ultra is a good bit faster in CPU than the current Mac Pro, and in GPU it's as fast as any single GPU the current Mac Pro has to offer, so with the option of adding additional add-in GPUs for apps that support multi-GPU it will be a lot better. As the power draw of Apple's GPU IP is so much lower, the cards would likely be 2-slot rather than 4-slot, so you can put a LOT more of them in (also good for Apple's $ income). 


3 minutes ago, hishnash said:

 

No, I expect this year's Mac Pro will use the M1 Ultra die with a few additions.

It will support a two-tiered memory system, with the expandable memory being OS-managed like a very fast/write-tolerant swap (possibly provided through the PCIe/MPX slots). 

And Apple will provide add-in GPU cards that can be used for apps that support multi-GPU (apps for the Mac Pro already support this). I feel that these cards might well make use of M1 Ultra dies that have defective CPU cores, but others here think that is unlikely and that they would rather be bespoke dies.

The M1 Ultra is a good bit faster in CPU than the current Mac Pro, and in GPU it's as fast as any single GPU the current Mac Pro has to offer, so with the option of adding additional add-in GPUs for apps that support multi-GPU it will be a lot better. As the power draw of Apple's GPU IP is so much lower, the cards would likely be 2-slot rather than 4-slot, so you can put a LOT more of them in (also good for Apple's $ income). 

 

I think you’re onto something.

 

In the end the M1 Ultra is fast enough CPU-wise. Maybe clock it higher and use beefier cooling, just so Mac Pro users aren't left with the sour aftertaste of having exactly the same CPU as the "Mac Studio plebs". 

 

If I can point to a google keyword for the bespoke GPU chip: “Apple Lifuka”.


11 minutes ago, hishnash said:

GB is not hand-coded in assembly; it will be mostly C/C++, so really this is a judgment of the quality of the C/C++ compilers... hint: this applies to all real-world applications (other than those hand-crafted in x86 assembly). Almost all code running on modern x86 CPUs these days is `simplistic instructions`. So if GB performs better on a given ISA than others, other applications will behave just the same, which shows it is a valid benchmark. 

x86 instructions get decoded into simpler internal operations executed on the execution units; big difference. Very big difference. That is THE biggest difference between ARM and x86: fixed-length, simpler instructions vs variable-length, complex instructions that require decoding.
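A rough way to see that difference is to compile the same trivial function for both ISAs and compare the encodings. The snippet below is a hedged sketch: the assembly in the comments is typical of what an optimising compiler emits, shown for illustration rather than taken from any specific compiler version.

```cpp
// add.cpp - compile with e.g. "g++ -O2 -S add.cpp" on each platform to compare.
int add(int a, int b) { return a + b; }

// Typical x86-64 output (variable-length encodings, 1 to 15 bytes per instruction):
//   lea eax, [rdi + rsi]   ; 3-byte encoding
//   ret                    ; 1-byte encoding
//
// Typical AArch64 output (every instruction is a fixed 4 bytes):
//   add w0, w0, w1
//   ret
//
// An x86 front end has to find instruction boundaries before it can decode in
// parallel; an AArch64 front end fetches and decodes on fixed 4-byte boundaries,
// which is the decoding-overhead point being discussed above.

int main() { return add(2, 3) == 5 ? 0 : 1; }
```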

 

I did not at all imply GB is coded in assembly.


1 minute ago, leadeater said:

x86 instructions get decoded into simpler internal operations executed on the execution units; big difference. Very big difference. That is THE biggest difference between ARM and x86: fixed-length, simpler instructions vs variable-length, complex instructions that require decoding.

Yes, but this applies to all apps on x86 vs ARM. So it's not really a valid reason to say GB is not a valid benchmarking tool, when the perf advantage seen on modern RISC systems will also be seen in other day-to-day applications just the same. 



 


Well, anyone can take an ARM processor, put multiple of them on a single die and market it as competitive with desktops - nothing magical here.

The thing is that the die of the M1 Ultra is so big that yields will be significantly lower and costs will be significantly higher than for the M1.

Chiplets are one way to reduce costs and improve yields in such situations, but the M1 Ultra is monolithic.

 

[Attached image: M1 Ultra vs M1 size comparison]

A PC Enthusiast since 2011
AMD Ryzen 7 5700X@4.65GHz | GIGABYTE GTX 1660 GAMING OC @ Core 2085MHz Memory 5000MHz
Cinebench R23: 15669cb | Unigine Superposition 1080p Extreme: 3566

28 minutes ago, hishnash said:

Benchmarks like GB are also easy to run and do a much better job of stressing the entire system. Of course, there is the discussion point of the weighting that GB gives to different tasks; this is subjective, and depending on the weighting you choose, different systems will perform better than others. The other issue GB has is that it is quite a short benchmark for any given part of the CPU, so on systems with timed boost clocks (x86 systems, not ARM) it will perform better than in a real-world long task.

That is still no good for OC benchmarking; that's why it and others like it are not used. You want something fairly short that does the same, usually singular, task every time, which is important for gauging the effects of any vcore changes, multiplier changes, or memory timing changes. Something like GB is just not suited to this, SPEC even more so by many, many factors. But at the very least SPEC would let you know if your OC is actually truly stable heh, not much else.

OC runs aren't even looking for statistical accuracy either; it's not like they do each OC configuration 10 times to average it. They just take single runs, essentially meaning the singular highest result, no matter if it's not repeatable, so long as it's classed as a valid result.

That's why CB is liked so much: it's a very easy, by-the-numbers comparison tool that suits both product reviews and general testing of OC results, in or outside of a product review. The same is true of 3DMark for GPUs; 3DMark doesn't tell you a lot about many things either and is flawed in its own ways too. I don't see any reason to change these current tools to something else for the purposes they are used for.


5 minutes ago, saltycaramel said:

 

In the end the M1 Ultra is fast enough CPU-wise. Maybe clock it higher and use beefier cooling, just so Mac Pro users aren't left with the sour aftertaste of having exactly the same CPU as the "Mac Studio plebs". 

 

If I can point to a google keyword for the bespoke GPU chip: “Apple Lifuka”.

I don't think Apple will alter clock speeds at all! I think the silicon team there only sees a gain when it is a gain in perf/W; all other gains are ignored. This is how they got to where they are: years and years of only caring about improving perf/W and never considering any change that negatively impacts it! They are willing to spend massive transistor counts on this target because they know the total cost of the system is where they make the money, not the silicon itself. (If need be they can always increase the margins on the RAM or SSD, after all.) 

Yeah, I did see those rumours a year ago or so. I could see them building a dedicated GPU die, a bit like how the M1 Pro is based on the design of the M1 Max but is a separate mask; I'm just not sure if the Mac Pro dedicated-GPU market is large enough to justify the large cost of doing this, since even on the Mac Pro most users will be fine with the SoC GPU (most Mac Pro users, now that the Mac Studio exists, will be audio pros who will be happy there is no GPU wasting a PCIe slot). 


8 minutes ago, hishnash said:

Yes but this applies to all apps on x86 vs ARM.  So not really a valid reason to say GB is not a valid benchmarking tool when the perf advantage seen on modern RISC systems will also be seen in other day to day applications just the same. 

I didn't say it wasn't valid; I also said it's mitigated by instruction and data caches anyway. What I said is that this is the source of the criticisms, however there is no actual testing to prove or show it, and it wouldn't even be easy to do anyway.

 

Edit:

Why people think it matters is because apps don't do single tasks like GB does, and that may, though probably not, make a difference. Going through different instruction types and workloads one by one isn't how an application typically operates and interacts with hardware. So in a way this actually favors x86, because an application would put more pressure on the instruction cache than GB would.


1 minute ago, leadeater said:

That is still no good for OC benchmarking; that's why it and others like it are not used. You want something fairly short that does the same, usually singular, task every time, which is important for gauging the effects of any vcore changes, multiplier changes, or memory timing changes. Something like GB is just not suited to this, SPEC even more so by many, many factors. But at the very least SPEC would let you know if your OC is actually truly stable heh, not much else.

 

True, but with your OC tuning CB might not show an issue, as it does a very poor job of pushing all of the CPU's functions. I suppose that might be why GB is not considered a good tool for OC, since some of the dedicated pathways it uses (even on Intel/AMD systems) are likely unaffected by OC (such as AES, or compression/decompression). 

 

4 minutes ago, leadeater said:

They just take single runs, essentially meaning the singular highest result, no matter if it's not repeatable, so long as it's classed as a valid result.

Oh god, that is sad. I would want at least 30 runs for each step, then do a Mann-Whitney test to show there is really a significant change in perf. This will give you an indication of confidence in how different the values are. 
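For anyone who wants to do exactly that with their own runs, the rank-sum statistic is short enough to drop into a test harness. A minimal sketch (midranks for ties, normal approximation without the tie correction, hypothetical scores), not a replacement for a proper stats package:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Mann-Whitney U test (two-sided, normal approximation) for two sets of
// benchmark scores. Ties get midranks; the tie correction to the variance
// is omitted for brevity, so treat p as approximate for small samples.
double mann_whitney_p(const std::vector<double>& a, const std::vector<double>& b) {
    struct Obs { double value; int group; };
    std::vector<Obs> all;
    for (double v : a) all.push_back({v, 0});
    for (double v : b) all.push_back({v, 1});
    std::sort(all.begin(), all.end(),
              [](const Obs& x, const Obs& y) { return x.value < y.value; });

    // Assign midranks to tied values, accumulate the rank sum of group a.
    double rank_sum_a = 0.0;
    for (size_t i = 0; i < all.size();) {
        size_t j = i;
        while (j < all.size() && all[j].value == all[i].value) ++j;
        double midrank = (double(i + 1) + double(j)) / 2.0;  // ranks are 1-based
        for (size_t k = i; k < j; ++k)
            if (all[k].group == 0) rank_sum_a += midrank;
        i = j;
    }

    double n1 = double(a.size()), n2 = double(b.size());
    double u     = rank_sum_a - n1 * (n1 + 1.0) / 2.0;   // U statistic for group a
    double mu    = n1 * n2 / 2.0;                        // mean of U under H0
    double sigma = std::sqrt(n1 * n2 * (n1 + n2 + 1.0) / 12.0);
    double z     = (u - mu) / sigma;
    return std::erfc(std::fabs(z) / std::sqrt(2.0));     // two-sided p-value
}

int main() {
    // Hypothetical benchmark scores before and after an OC change.
    std::vector<double> baseline = {15020, 15110, 15060, 15080, 15045, 15090};
    std::vector<double> tuned    = {15180, 15150, 15210, 15160, 15230, 15170};
    std::printf("p = %.4f (small p => the change is likely real, not run-to-run noise)\n",
                mann_whitney_p(baseline, tuned));
    return 0;
}
```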

