Apple M1 Ultra - 2nd highest multicore score, lost to 64-core AMD Threadripper.

6 minutes ago, Spindel said:

@hishnash and @saltycaramel

 

I’ve predicted the Mac Pro having regular RAM DIMMs as fast swap ever since I got my original M1 Mini at the end of November 2020. 

Not sure about regular DDR DIMMs, as this would require a DDR controller. Maybe they put regular DIMMs on an MPX card and have a controller on the card abstract them away, but I think Apple would rather just solder the memory to the card so they can make it nice and thin and full length. You could fit quite a bit of memory on a full-length, double-sided MPX card, and if it's operating behind a large 128GB cache it doesn't need to be that fast to access. 


2 minutes ago, hishnash said:

Oh god, that is sad. I would want at least 30 runs for each step, then do a Mann–Whitney test to show there is really a significant change in performance. That gives you an indication of confidence in how different the values are. 

Well, you just cannot, and that's the problem. OCs are by their very nature unstable, and things like heat soak impact the results greatly. Confidence beyond a complete run and any validation tests of that run is all that is required for this. Remember, it's literally just a "measuring" contest lol


9 minutes ago, leadeater said:

So in a way this actually favors x86 because an application would put more pressure on instruction cache than GB would.

I think GB tasks run long enough that this is not an issue. GB is not doing ultra-small instruction tests with tiny execution payloads; the payloads for each test run for many thousands of cycles and are much larger than any instruction cache. These are real-world tasks (decompress a zip file), and most production apps do end up doing a lot of these sorts of tasks. Perhaps the knock against GB is that it does them sequentially, when modern apps are more mixed, doing them at the same time on multiple threads.  


4 minutes ago, hishnash said:

Not sure about regular DDR DIMMs, as this would require a DDR controller. Maybe they put regular DIMMs on an MPX card and have a controller on the card abstract them away, but I think Apple would rather just solder the memory to the card so they can make it nice and thin and full length. You could fit quite a bit of memory on a full-length, double-sided MPX card, and if it's operating behind a large 128GB cache it doesn't need to be that fast to access. 

They could do something similar, or just literally use CXL, maybe with a sprinkle of Apple branding: CXL memory pooling for off-package memory and memory expansion.

 

https://blocksandfiles.com/2021/10/07/samsung-sw-virtualises-cxl-attached-memory/

https://www.computeexpresslink.org/post/cxl-2-0-specification-memory-pooling-questions-from-the-webinar-part-2

 

 

The right-hand box: [CXL usage diagram]


3 minutes ago, hishnash said:

Not sure about regular DDR DIMMs, as this would require a DDR controller. Maybe they put regular DIMMs on an MPX card and have a controller on the card abstract them away, but I think Apple would rather just solder the memory to the card so they can make it nice and thin and full length. You could fit quite a bit of memory on a full-length, double-sided MPX card, and if it's operating behind a large 128GB cache it doesn't need to be that fast to access. 

Or they add an external controller on the motherboard, like in the olden times.

 

I mean, it’s logical if we look at most other common systems nowadays; the memory structure is:

 

L1C -> L2C -> L3C (in the case of M1 it actually doesn't have an L3C) -> RAM -> Swap

 

A Mac Pro would then be:

L1C -> L2C -> RAM -> Fast Swap -> Swap

 

On the other hand, what I see as logical is based on old conventions. Since there is stuff that AS does differently, I might be surprised.
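
To put some rough numbers on why a slower "fast swap" tier behind the on-package RAM could still work, here is a back-of-the-envelope effective-access-latency sketch. All latencies and hit fractions below are made-up illustrative assumptions, not Apple figures:

```python
# Back-of-the-envelope effective access latency for the hierarchy above.
# All latencies (ns) and hit fractions are illustrative assumptions, not real M1 figures.
tiers = [
    ("L1",        1,      0.95),
    ("L2",        5,      0.04),
    ("RAM",       100,    0.0095),
    ("fast swap", 1000,   0.0004),   # hypothetical DIMM/CXL tier behind the package RAM
    ("SSD swap",  100000, 0.0001),
]

avg = sum(latency * fraction for _, latency, fraction in tiers)
print(f"effective average access latency ≈ {avg:.2f} ns")
# Because the slow tiers are hit so rarely, even a relatively slow "fast swap"
# tier barely moves the average - which is the argument for it not needing to be fast.
```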


 

6 minutes ago, leadeater said:

Well, you just cannot, and that's the problem. OCs are by their very nature unstable, and things like heat soak impact the results greatly. Confidence beyond a complete run and any validation tests of that run is all that is required for this. Remember, it's literally just a "measuring" contest lol

The reason I would want to do a proper statistical test is that, from my understanding, it's not a simple single parameter that one is tuning (I have not played with OC since the days of the northbridge). When optimising multiple params, how they influence each other is not always so straightforward, so to get the best combination you really need to be able to have confidence that a given configuration is better than the last, and that it's not just random chance that it produced a better number. But then again, I would not be doing it for a measuring contest but rather to try to improve the underlying performance of my system (from my understanding, on today's systems you are at best only going to get a few percentage points, so it's not really worth it if that's your goal).
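
For what it's worth, here is a minimal sketch of the kind of comparison described above: two OC configurations with ~30 benchmark runs each, compared with a Mann–Whitney U test via SciPy. The score numbers are invented purely for illustration:

```python
# Minimal sketch: two OC configurations, ~30 benchmark runs each,
# compared with a Mann–Whitney U test. Scores are made-up illustration data.
from scipy.stats import mannwhitneyu
import numpy as np

rng = np.random.default_rng(0)
config_a = rng.normal(loc=24000, scale=150, size=30)  # baseline OC, hypothetical scores
config_b = rng.normal(loc=24120, scale=150, size=30)  # tweaked OC, hypothetical scores

stat, p = mannwhitneyu(config_a, config_b, alternative="two-sided")
print(f"U = {stat:.0f}, p = {p:.3f}")
# A small p-value (e.g. < 0.05) suggests the score difference is unlikely to be
# run-to-run noise; a large one means the "improvement" could easily be luck.
```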


5 minutes ago, hishnash said:

I think GB tasks run long enough that this is not an issue. GB is not doing ultra-small instruction tests with tiny execution payloads; the payloads for each test run for many thousands of cycles and are much larger than any instruction cache. These are real-world tasks (decompress a zip file), and most production apps do end up doing a lot of these sorts of tasks. Perhaps the knock against GB is that it does them sequentially, when modern apps are more mixed, doing them at the same time on multiple threads.  

I know GB isn't ultra tiny; it may have been more so in the past, and I don't know what it's like now specifically, but I'm also not that interested in GB either. I don't think it's all that good, nor all that bad; it's just there, a thing like other things, and many things give better answers.

 

Also, running the same thing once or thousands of times would have no effect on the instruction cache; every cycle would be a cache hit and utilization would not change. Sequentially running through different tasks will put less pressure on the cache than applications doing task switching, as well as any other applications running at the same time. But these caches are all sized with this in mind, and I think it would be quite a difficult task for any normal user to put so much pressure on instruction caches as to actually cause a significant and measurable difference/problem.
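
A toy way to see the sequential-vs-task-switching point: compare the combined hot-code footprint in each case against a typical L1 instruction cache (192 KB is the commonly reported figure for an M1 performance core). The per-task footprints below are assumptions picked for illustration:

```python
# Toy model of instruction-cache pressure: sequential benchmark subtests vs.
# several different tasks interleaved on the same core. Footprints are assumptions.
ICACHE_KB = 192  # commonly reported L1 instruction cache of an M1 performance core

task_footprints_kb = {"zip": 60, "json": 40, "render": 80, "crypto": 30}

# Sequential (Geekbench-style): only one task's code is hot at a time.
sequential_kb = max(task_footprints_kb.values())
# Interleaved (desktop-style task switching): the hot working sets add up.
interleaved_kb = sum(task_footprints_kb.values())

print(f"sequential hot code ≈ {sequential_kb} KB (fits: {sequential_kb <= ICACHE_KB})")
print(f"interleaved hot code ≈ {interleaved_kb} KB (fits: {interleaved_kb <= ICACHE_KB})")
```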

 

But these are the underlying reasons why people complain about GB: many think it does matter. Just letting you know.


4 minutes ago, Spindel said:

L1C -> L2C -> L3C (in the case of M1 it actually doesn't have an L3C) -> RAM -> Swap

 

M1-series chips have what they call an SLC (system level cache). Each memory controller has one, and it is exposed to all parts of the chip that read/write to memory. I think that is why they do not call it L3: it is not just for the CPU; the GPU, NPU etc. all read/write through it. 

Interestingly, the AMX units (matrix units that are surprisingly fast, over 2 TFLOPS of FP80 per complex, in addition to the CPU) read/write through the L2 cache of each CPU complex. 


1 minute ago, hishnash said:

M1-series chips have what they call an SLC (system level cache). Each memory controller has one, and it is exposed to all parts of the chip that read/write to memory. I think that is why they do not call it L3: it is not just for the CPU; the GPU, NPU etc. all read/write through it. 

Interestingly, the AMX units (matrix units that are surprisingly fast, over 2 TFLOPS of FP80 per complex, in addition to the CPU) read/write through the L2 cache of each CPU complex. 

OK that quickly got way over my head in technicality 🙂 


16 minutes ago, hishnash said:

The reason I would want to do a proper statistical test is that, from my understanding, it's not a simple single parameter that one is tuning (I have not played with OC since the days of the northbridge). When optimising multiple params, how they influence each other is not always so straightforward, so to get the best combination you really need to be able to have confidence that a given configuration is better than the last, and that it's not just random chance that it produced a better number. But then again, I would not be doing it for a measuring contest but rather to try to improve the underlying performance of my system (from my understanding, on today's systems you are at best only going to get a few percentage points, so it's not really worth it if that's your goal).

You absolutely do not change multiple things at a time when OC'ing. You make one and only one change, otherwise you are not able to isolate what you did and how it improved performance.

 

You don't go in and bump up vcore, increase the core multiplier, jump over to memory tuning and change primary or secondary timings, etc.; that's just really bad practice, unless you are applying a known working baseline OC, which you should save in a BIOS profile as well.

 

I suggest you watch any of GN's OC videos, or der8auer, or Bearded Hardware.


1 minute ago, leadeater said:

You absolutely do not change multiple things at a time when OC'ing. You make one and only one change, otherwise you are not able to isolate what you did and how it improved performance.

 

You don't go in and bump up vcore, increase the core multiplier, jump over to memory tuning and change primary or secondary timings, etc.; that's just really bad practice, unless you are applying a known working baseline OC, which you should save in a BIOS profile as well.

For sure, but even if you just change one thing at a time, if you're just looking at the max score over 5 runs of CB, how is that giving you any good indication of whether your change has really improved things? It might just be luck that this run ended up in a thermal or other state that let it be a little better (or Windows was just a little less interested in background tasks that one time...).
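
A quick simulation of the "might just be luck" point: even with no change at all, best-of-5 from a noisy benchmark will "win" about half the time. The score distribution here is invented for illustration:

```python
# Simulation of the "might just be luck" point: with no real change at all,
# how often does max-of-5 runs beat the previous config's max-of-5?
# Scores are drawn from an invented noisy distribution for illustration only.
import numpy as np

rng = np.random.default_rng(1)
trials = 10_000
baseline  = rng.normal(24000, 150, size=(trials, 5)).max(axis=1)
unchanged = rng.normal(24000, 150, size=(trials, 5)).max(axis=1)  # same settings, new runs

print(f"'improved' purely by chance: {np.mean(unchanged > baseline):.0%}")
# Prints roughly 50% - half the time an identical configuration "wins",
# which is why a single best-of-N number says little about a real change.
```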


5 minutes ago, hishnash said:

For sure, but even if you just change one thing at a time, if you're just looking at the max score over 5 runs of CB, how is that giving you any good indication of whether your change has really improved things? It might just be luck that this run ended up in a thermal or other state that let it be a little better (or Windows was just a little less interested in background tasks that one time...).

Heh, 5 runs. Each configuration typically gets 1 run, then another change is made, another run, another change, etc. You underestimate how little anyone cares about multi-run average accuracy.

 

One shot wonders 🙃



10 hours ago, leadeater said:

In silicon ASICs or accelerators are there to be used, whether or not GB utilizes them

I just want to stress that I have found no evidence whatsoever that GB does use some accelerator that is not already widely available from all vendors, such as AES.

I think that some people assume more things are hardware accelerated on the M1 than on AMD or Intel, and as a result try to dismiss GB as being unfair. However, as far as I am aware, that is just not true at all. The tasks that are being hardware accelerated, such as AES, are being hardware accelerated on all modern platforms, not just the M1.

 

10 hours ago, leadeater said:

So as you said these types of distinctions don't matter

Again, I also want to stress that I didn't say that and it was not my intention to imply that I don't think it matters. 

For discussions that are purely about comparing CPU architectures, I absolutely think it does matter. For discussions about user experience I don't think it matters.

However, as I have said before and I want to make absolutely clear, I have found zero evidence that the M1 runs more hardware-accelerated tasks than competing AMD or Intel processors when running GB5 CPU. All the tasks I could find use purely the CPU, not some non-CPU ASIC like the neural engine.

 

 

10 hours ago, hishnash said:

For example, it will use AVX and even the Intel media engine when needed; this is a completely valid comparison of the chip

I have not found any evidence that GB uses the Intel Media Engine. If you have found some then can you please link it to me?

 

 

 

1 hour ago, leadeater said:

That's because it's a simple-to-run, stable application and gives highly repeatable and accurate results. That's actually a hard thing to come by; 3DMark is actually a little worse in that respect, plus it's not a CPU-only benchmark either.

I think another factor in CB's success as a benchmark is that it is also one of the few (easily accessible) benchmarks that has a single-core mode, and scales extremely well with multiple cores.


15 minutes ago, LAwLz said:

For discussions that are purely about comparing CPU architectures, I absolutely think it does matter. For discussions about user experience I don't think it matters.

Yep 100% agree 👍


9 hours ago, Vishera said:

Well, anyone can take an ARM processor, put multiple of them on a single die and market it as competitive with desktops - nothing magical here.

The thing is that the die of the M1 Ultra is so big that yields will be significantly lower and costs will be significantly higher than the M1.

Chiplets are one way to reduce costs and improve yields in such situations, but the M1 Ultra is monolithic.

The M1 Ultra is not monolithic. It's two M1 Max dies on an interposer with TSMC chip interconnect technology; the implementation of it within a silicon design is Apple's.

 

In this instance the M1 Max dies are chiplets, like what AMD is doing but with a different interconnect technology and no central I/O die, which is both a benefit and a curse at the same time.

 

However, M1 Max dies are big, really big: double the size of Alder Lake-S. On raw silicon size and transistor count, Apple has brought a gun to a knife fight, and now they are dual-wielding.

 

Apple: [image]

Intel: [image]
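
To put rough numbers on the yield argument from the quoted post, here is the standard Poisson defect-yield estimate. The die areas are approximate public figures and the defect density is an assumed value, so treat this as illustrative only:

```python
# Rough Poisson yield estimate, yield ≈ exp(-D * A), for the die-size discussion above.
# Die areas are approximate public figures; the defect density D is an assumption.
from math import exp

D = 0.10  # assumed defects per cm^2 on a mature node (illustrative)

dies_cm2 = {
    "Alder Lake-S (~215 mm^2)": 2.15,
    "M1 Max (~430 mm^2)": 4.30,
    "hypothetical monolithic Ultra (~860 mm^2)": 8.60,
}

for name, area in dies_cm2.items():
    print(f"{name}: yield ≈ {exp(-D * area):.0%}")
# Two M1 Max chiplets on an interposer keep per-die yield at the ~430 mm^2 level,
# whereas a single monolithic die twice the size would take the full hit.
```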


42 minutes ago, leadeater said:

with TSMC chip interconnect technology

with some sprinkles of Apple magic, probably 

 


1 hour ago, saltycaramel said:

with some sprinkles of Apple magic, probably 

It's still just an implementation of TSMC technology; you literally cannot do any of this without the fab actually having the technology to do it. Apple's patent is for their approach to using it, and it will not actually block anyone else from using the same underlying TSMC capability to do the same/similar thing so long as they don't breach this patent. Given that everyone has technology patents for chip interconnects, stacking, chiplets, etc., I doubt anyone is going to get blocked from doing anything.

 

Let me ask you, do these sound familiar?  😉

 

Quote

This new generation CoWoS technology can accommodate multiple logic system-on-chip (SoC) dies, and up to 6 cubes of high-bandwidth memory (HBM), offering as much as 96GB of memory. It also provides bandwidth of up to 2.7 terabytes per second, 2.7 times faster than TSMC’s previously offered CoWoS solution in 2016.

https://pr.tsmc.com/english/news/2026

 

Also

Quote

Like its predecessor GLink 1.0, GLink 2.0 supports InFO_oS and all CoWoS types (both silicon and organic interposers). GLink 2.0 is fully backward compatible with GLink 1.0, with similar power consumption while doubling speed per lane, beachfront and area efficiency. Die edge is the scarcest resource and GLink-2.0 allows the most efficient use of it by transferring 1.3 Tbps of full duplex traffic per every mm of beachfront. Lead AI & Networking customers adopted GLink 2.0 for their next-generation products and expect mass productions in 2023 and thereafter.

 


 

Quote

The next GLink versions using TSMC 5nm and 3nm technologies will support 2.5 Tbps/mm, error-free full duplex traffic with similar power consumption, and will be available in Q4, 2021 and Q1, 2022, respectively.

 

https://www.guc-asic.com/en-global/news/pressDetail/GLink2

 

And the great news here is that these aren't even TSMC's latest and greatest interconnect and interposer technologies either! So expect to see even more and faster in future Apple silicon MCM SoCs.

https://ieeexplore.ieee.org/document/9501649
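
As a back-of-the-envelope sanity check against Apple's claimed 2.5 TB/s UltraFusion bandwidth, assuming a GLink-2.0-class density of 1.3 Tbit/s per mm of beachfront (rough arithmetic only; we don't know Apple's actual PHY):

```python
# Back-of-the-envelope: how much die edge ("beachfront") would Apple's claimed
# 2.5 TB/s die-to-die bandwidth need at a GLink-2.0-class 1.3 Tbit/s per mm?
# Rough, illustrative arithmetic only.
claimed_bandwidth_TBps = 2.5        # Apple's stated UltraFusion bandwidth
link_density_Tbps_per_mm = 1.3      # GLink 2.0 figure from the quote above

required_Tbps = claimed_bandwidth_TBps * 8          # bytes -> bits
edge_mm = required_Tbps / link_density_Tbps_per_mm
print(f"≈ {edge_mm:.1f} mm of die edge")            # roughly 15 mm on each die
# An M1 Max die is on the order of 20 mm on a side, so a link of this class
# plausibly fits along a single edge of the die.
```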


What if the relationship of Apple with its biggest suppliers/partners/foundries/assemblers is a complex co-financing knowledge-sharing fuzzy-boundaries situation where it’s difficult to tell who’s the chicken, who’s the egg and whose huge R&D spending went into what?

 

They must be doing something right in these kind of relationships. 

 

If one client is 1/4 of my whole business, it’s probably got a seat in the planning room and at the drawing board..

 



3 hours ago, saltycaramel said:

 

To recap, Apple has made 3 chips during the A14 gen

1) A14

2) M1

3) M1 Max (with M1 Pro and M1 Ultra being derivatives of the M1 Max)

 

Each one of these justifies its existence by being used in a wide array of final products. 

 

Would Apple just throw “f*ck you money” at developing a 4th chip, let’s call it “P1”, just for the low-volume Mac Pro? What’s the big picture behind this? Expecting the Mac Pro to become more widely adopted? Expecting to re-enter the server market with an Xserve machine? Experimenting with huge chips on the road to the mother-of-all-Apple-chips humongous chip that will be used in the Apple Car self-driving system? Interesting scenarios and open questions. 

Fair questions. It doesn’t look like the M1 Max has two interconnects, which would be needed for the rumoured Jade-4C die.
 

If we take Apple at face value and assume no further iterations of M1, then it means either the Mac Pro will be given the M1 Ultra (which doesn’t seem to have any support for PCIe or other expansion, which is the whole point of the current Xeon-equipped Mac Pro), or when they launch the M2 line they’ll do so by releasing the big dog first, before the cut-back versions for other products.

 

I’ve no real vested interest in the Mac Pro myself, so I’m just interested from a geek point of view, but there are definitely companies out there with them who still haven’t forgiven Apple for the issues they had with the trashcan model, and they’re probably all very concerned right now at the prospect of losing the possibility of 1.5 TB of RAM and the internal expansion of the current 2019 model.


6 hours ago, saltycaramel said:

What if the relationship of Apple with its biggest suppliers/partners/foundries/assemblers is a complex co-financing knowledge-sharing fuzzy-boundaries situation where it’s difficult to tell who’s the chicken, who’s the egg and whose huge R&D spending went into what?

 

They must be doing something right in these kind of relationships. 

 

If one client is 1/4 of my whole business, it’s probably got a seat in the planning room and at the drawing board..

To a degree yes, Apple is paying for a lot. It's a partnership, Apple sets out requirements and goals and TSMC works with Apple to meet those. TSMC isn't just doing this for Apple as they have other customers wanting/needing/asking for similar.

 

Like with leading-edge nodes, Apple is one of their key risk partners to produce chips and utilize emerging technology.

 

I think the reason Apple is filing this patent and not TSMC is that they probably had quite a large design input into it and how or what is being done, which makes it fall under Apple rather than TSMC. I'm sure they had to get agreement from TSMC to file this, otherwise TSMC would file an objection.

 

I would have to think TSMC and Apple like each other a lot.


15 hours ago, HenrySalayne said:

Here lies the problem: Apple's die-to-die connection is not just for the CPU clusters but everything. And bandwidth is only one metric, we know nothing about the latency and software support. You can say about AMD's implementation what you want, but it's good enough. Scaling with multi-core workloads is excellent. If the scheduler is not switching threads wildly between CCXs, you wouldn't know it's there. 

Latency basically also always benefits when you make distances shorter or the electrical properties (resistance, inductance) of the link better. As for software support, it looks a lot like there doesn't need to be much, as at least a large part is managed in hardware; we already know, e.g., that the GPUs present themselves as a single device to the OS.

15 hours ago, HenrySalayne said:

And before you predict anything for M2, wait for the final iteration of M1 with the Mac Pro. It could be a dual socket design for all we know. They will pull a trick out of their sleeves. 😉

Hmm, if the final words of the last presentation are anything to go by, this was the last iteration of M1. They could possibly do a dual-socket M1 Ultra, but I am not sure the I/O of the current die supports this. I'd heavily root for M2. After all, the base M1 was announced quite some time ago; M2 has pretty surely been in the making for quite some time, and we'll see a public announcement this fall at the latest.


2 hours ago, leadeater said:

The M1 Ultra is not monolithic. It's two M1 Max dies on an interposer with TSMC chip interconnect technology; the implementation of it within a silicon design is Apple's.

 

In this instance the M1 Max dies are chiplets, like what AMD is doing but with a different interconnect technology and no central I/O die, which is both a benefit and a curse at the same time.

 

However, M1 Max dies are big, really big: double the size of Alder Lake-S. On raw silicon size and transistor count, Apple has brought a gun to a knife fight, and now they are dual-wielding.

Imagine how it would be if AMD and Intel designed their processor dies at such a size...

400 W CPUs, here I come!


19 minutes ago, leadeater said:

Like with leading-edge nodes, Apple is one of their key risk partners to produce chips and utilize emerging technology.

I think the current partnership is indeed beautiful, where Apple is funding leading-edge silicon nodes with their infinite money-printing machine. As long as this is not abused, ultimately everyone wins, as the nodes will become available for other customers. And in the race between basically the only two competing leading-edge manufacturers, TSMC and Samsung, the former has pulled ahead by a rather large margin if we are honest.

 

I am more concerned about the geopolitical implications, given that TSMC is a Taiwanese company with big brother China breathing down their neck. They really should push their geographic diversification a lot more, with fabs in NA and Europe. After all, basically the worldwide technological progress in chip manufacturing depends on TSMC at this point, and on them being able to operate and do business in the way they currently can.


19 minutes ago, Vishera said:

Imagine how it would be if AMD and Intel designed their processor dies at such a size...

400 W CPUs, here I come!

That is not what the industry usually does. You add more silicon so you can run it less aggressively (lower clocks), which usually allows you to lower voltage, which translates into better power efficiency.
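
A worked example of that trade-off using the usual dynamic-power relation P ≈ C·V²·f; the core counts, voltages and clocks below are invented purely to illustrate the shape of the curve:

```python
# Illustration of "more silicon, run less aggressively": dynamic power scales
# roughly with C * V^2 * f, so doubling the cores at a lower V/f point can give
# more throughput per watt. All numbers are invented for illustration only.
def rel_power(cores, voltage, freq):
    return cores * voltage**2 * freq   # relative units, capacitance folded into "cores"

small_hot = rel_power(cores=8,  voltage=1.30, freq=5.0)   # few cores, clocked hard
big_cool  = rel_power(cores=16, voltage=1.00, freq=3.5)   # twice the silicon, relaxed

throughput_small = 8 * 5.0
throughput_big   = 16 * 3.5     # assuming near-linear multicore scaling

print(f"perf/W small, hot : {throughput_small / small_hot:.2f}")
print(f"perf/W big, cool  : {throughput_big / big_cool:.2f}")
# The bigger, lower-clocked configuration does more work per watt, which is why
# a huge die does not automatically mean a 400 W CPU.
```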

