Apple M1 Ultra - 2nd highest multicore score, lost to 64-core AMD Threadripper.

hishnash · March 19, 2022

22 hours ago, FakeKGB said:

Fair, but I doubt they know what AMD CPUs are.

They do know. macOS even accidentally shipped at one point with AMD APU drivers and AMD CPU IDs. Apple had clearly been testing first- and second-gen Zen chips internally but probably decided it was not worth the effort given they were moving to their own architecture soon after (properly supporting AMD chips to the extent macOS supported Intel would be a lot of work: custom work in the scheduler, and system libraries like Accelerate would need to be hand-tuned). Worth remembering that the silicon industry is very fluid with staff: most experienced CPU engineers in the industry have worked at Intel, AMD, and Apple during their careers. This also applies to the chip teams within Apple, whose members have worked at other silicon vendors in the past.

hishnash · March 19, 2022

On 3/19/2022 at 6:54 AM, LAwLz said:

Not sure what exactly you are referring to, but Arm has native instructions for easily converting between the looser memory model of regular Arm and the stricter memory model of x86.

Arm, not Apple, have created several instructions specifically for this purpose. Anyone making Arm cores can implement it if they want. It's not Apple specific.

ARMv8 does not include TSO memory modes (these are not different instructions; they are different modes the CPU is set into that affect all memory operations while in that mode). Sure, if you make your own Arm cores you can add any instructions/modes you want, but this is not part of the required ARMv8 spec (and is not present on any other ARMv8 CPU). Yes, the ARM spec does not forbid you from adding this mode, but the ARM spec does not forbid you from adding any mode you like.

Given that developers cannot toggle this mode themselves (it is limited to the kernel), the runtime environment that runs within the mode does not need to comply with the ARM spec, since the only code that runs within it is the translated Rosetta 2 executables.

LAwLz · March 20, 2022

On 3/20/2022 at 12:19 AM, hishnash said:

ARMv8 does not include TSO memory modes (these are not different instructions; they are different modes the CPU is set into that affect all memory operations while in that mode). Sure, if you make your own Arm cores you can add any instructions/modes you want, but this is not part of the required ARMv8 spec (and is not present on any other ARMv8 CPU). Yes, the ARM spec does not forbid you from adding this mode, but the ARM spec does not forbid you from adding any mode you like.

Given that developers cannot toggle this mode themselves (it is limited to the kernel), the runtime environment that runs within the mode does not need to comply with the ARM spec, since the only code that runs within it is the translated Rosetta 2 executables.

It's been a few years since I looked into this, but my understanding of TSO is that it is just Apple's way of enabling developers to easily access the standard Arm features/instructions that were designed for x86 memory consistency. A lot of them were introduced in the Armv8.3 extension, for example LDAPR, which I believe has been supported in standard Arm cores since 2019, such as the A77.

Hell, even ARMv7 had somewhat of a strong memory order mode. It was called "strongly ordered memory" back then. In ARMv8 it is called "Device-nGnRnE most restrictive"

But even without this memory ordering mode, chances are Apple's M1 would still be really fast with translating x86 code. Having strong memory ordering is nice, but it's not really needed, and in the cases where it is needed you can just throw in some barriers. Most programs will likely work just fine regardless of the memory mode on the M1.
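To make the "throw in some barriers" point concrete, here is a minimal sketch of the classic message-passing litmus test, enumerated under three toy ordering rules. It is a simplified model, not how the M1 or Rosetta 2 actually work: the variable names `data` and `flag`, the model names, and the reordering rules are all assumptions made for illustration.

```python
# Message-passing litmus test with hypothetical variables `data` and `flag`.
# Writer thread: data = 1, then flag = 1.
# Reader thread: r1 = flag, then r2 = data.
WRITER = [("store", "data"), ("store", "flag")]
READER = [("load", "flag"), ("load", "data")]


def thread_orders(ops, model):
    """Return the orders in which one thread's two accesses may take effect."""
    orders = [tuple(ops)]  # program order is always allowed
    first, second = ops
    if model == "relaxed":
        # Toy weak model: two accesses to different addresses may swap freely.
        orders.append((second, first))
    elif model == "tso" and first[0] == "store" and second[0] == "load":
        # TSO only lets a later load pass an earlier store (store buffering).
        orders.append((second, first))
    # model == "barrier": a fence between the two accesses keeps program order.
    return orders


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(model):
    """Collect every (flag seen, data seen) pair the reader can observe."""
    seen = set()
    for w in thread_orders(WRITER, model):
        for r in thread_orders(READER, model):
            for trace in interleavings(list(w), list(r)):
                memory = {"data": 0, "flag": 0}
                regs = {}
                for kind, addr in trace:
                    if kind == "store":
                        memory[addr] = 1
                    else:
                        regs[addr] = memory[addr]
                seen.add((regs["flag"], regs["data"]))
    return seen


for model in ("tso", "relaxed", "barrier"):
    stale = (1, 0) in outcomes(model)  # flag seen set while data is still old
    print(f"{model:8s} stale read possible: {stale}")
```

In this toy model the stale result (flag set, data still old) only shows up under the relaxed rule. Either a TSO-style rule or a barrier between the two accesses rules it out, which is the trade-off being discussed: a translator can get x86-like behaviour from a hardware mode or by inserting barriers where they are needed.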

DrMacintosh · March 20, 2022

On 3/9/2022 at 9:44 AM, Spindel said:

At a power draw of 65-70 W on the CPU

It definitely draws more than that at peak load. Closer to 250W but that’s also the whole computer, including the graphics.

Spindel · March 20, 2022

12 minutes ago, DrMacintosh said:

It definitely draws more than that at peak load. Closer to 250W but that’s also the whole computer, including the graphics.

As I said later in this thread I too would expect it to draw more (200-220 W) when you fully load CPU+GPU. But the GB test in OP only strains the CPU part of the SoC so that’s the number that is interesting when comparing to the artificial suns that are x86 CPUs.

YamiYukiSenpai · March 20, 2022

This thing can reach 3080/3090, and they haven't even unleashed the big guns yet.

Gotta wonder what the upcoming Mac Pro would look like...

Roswell · March 20, 2022

4 hours ago, YamiYukiSenpai said:

This thing can reach 3080/3090, and they haven't even unleashed the big guns yet.

Gotta wonder what the upcoming Mac Pro would look like...

While impressive, it can reach those cards when they’re constrained to half of their power envelope.

WolframaticAlpha · March 20, 2022

9 hours ago, YamiYukiSenpai said:

This thing can reach 3080/3090, and they haven't even unleashed the big guns yet.

Gotta wonder what the upcoming Mac Pro would look like...

*In ideal conditions and slices of the power draw territory for certain applications.

It is a good system, but Apple's claims of "fastest pc chip" don't hold water. A bit disappointing, considering I was just coming around to trusting their graphs.

Despite the slightly disappointing results of the GPU, I am still optimistic, since the really awesome efficiency might allow for really cool form factors, which the 3090 or a Xeon won't allow.

WolframaticAlpha · March 20, 2022

On 3/19/2022 at 1:02 AM, FakeKGB said:

That's likely because Intel Mac users have no hecking idea what AMD even is. They know Intel is older, so if Apple Silicon > Intel, they're happy.

Dude, there are computer-illiterate Mac users, but there are plenty of people who use Macs and still know about the world outside of their Macintosh. Don't make such generalizations.

Spindel · March 21, 2022

Interesting tidbit about the Ultra GPU.

Apparently it only draws around 30 W when for example running Blender. GB compute tops at around 50 W.

GFXBench apparently uses more power.

But it’s interesting that there seems to be some software issues in utilizing the full potential of the GPU.

I hope we will get a more in-depth investigation of this from the likes of AnandTech or some other outlet.

leadeater · March 21, 2022

1 hour ago, Spindel said:

Apparently it only draws around 30 W when for example running Blender.

It's used in Blender? I thought the version with M1 GPU support wasn't out yet?

Edit:

Oh it came out on the 9th, very interesting.

Spindel · March 21, 2022

Just now, leadeater said:

It's used in Blender? I thought the version with M1 GPU support wasn't out yet?

It has been out for about a week.

hishnash · March 21, 2022

3 hours ago, Spindel said:

I hope we will get a more in-depth investigation of this from the likes of AnandTech or some other outlet.

You can attach Xcode's debugging tools to Blender to inspect the GPU activity, and it's quite clear it's not doing that great a job even on the M1 Max.

This is a snapshot of about 3 seconds of work during the BMW GPU render; it should be solid orange. But you can see there are empty areas in the perf state, indicating nothing was running at all, and the little green segments show that the task that was dispatched was low enough priority that it was running in a low-power state.

Also, if you look at the GPU counters that give an indication of how much cache, ALU, etc. are limiting the operations, you can see the shaders are doing a very poor job of saturating the GPU:

In fact it's even worse than that: if you look at the shader breakdown you can see massive segments of unused time.

From looking at these graphs I would say more than 70% of the time is unused, and the time that is used has a very poor fill rate. The fact that there are any gaps between these scheduled calls is a big red flag: the system should be queueing up compute tasks well in advance so that there is a task ready to work on before the current task finishes (that is what Blender does for CUDA).
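As a rough illustration of why queueing work in advance matters, here is a toy model of GPU utilisation as a function of how many tasks the CPU keeps submitted ahead of the GPU. This is not Blender's or Metal's actual scheduling code; the function name, parameters, and the timings used are all hypothetical.

```python
def gpu_busy_fraction(task_ms, round_trip_ms, tasks, queue_depth):
    """Toy model: fraction of wall time the GPU spends executing work.

    task_ms        GPU time per dispatched task (hypothetical figure)
    round_trip_ms  CPU delay between a completion callback and the next submit
    tasks          number of tasks to run
    queue_depth    how many tasks the CPU keeps submitted ahead of the GPU
    """
    finish = []  # finish[i] is the time at which task i completes on the GPU
    for i in range(tasks):
        if i < queue_depth:
            submitted = 0.0  # the initial batch is queued up front
        else:
            # A new task is only submitted once an older one has completed
            # and the CPU has reacted to its completion callback.
            submitted = finish[i - queue_depth] + round_trip_ms
        gpu_free = finish[-1] if finish else 0.0
        start = max(submitted, gpu_free)  # the GPU idles if nothing is queued
        finish.append(start + task_ms)
    return tasks * task_ms / finish[-1]


# Hypothetical numbers: 1 ms of GPU work per task, 1 ms CPU round trip.
for depth in (1, 2, 4):
    busy = gpu_busy_fraction(task_ms=1.0, round_trip_ms=1.0, tasks=100, queue_depth=depth)
    print(f"queue depth {depth}: GPU busy {busy:.0%}")
```

With these made-up numbers, waiting for each task to finish before submitting the next leaves the GPU idle about half the time, while keeping even two tasks in flight closes the gaps. The real gaps in the capture may have other causes too, so treat this only as a sketch of the scheduling argument.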

Spindel · March 21, 2022

1 hour ago, hishnash said:

You can attach Xcode's debugging tools to Blender to inspect the GPU activity, and it's quite clear it's not doing that great a job even on the M1 Max.

This is a snapshot of about 3 seconds of work during the BMW GPU render; it should be solid orange. But you can see there are empty areas in the perf state, indicating nothing was running at all, and the little green segments show that the task that was dispatched was low enough priority that it was running in a low-power state.

Also, if you look at the GPU counters that give an indication of how much cache, ALU, etc. are limiting the operations, you can see the shaders are doing a very poor job of saturating the GPU:

In fact it's even worse than that: if you look at the shader breakdown you can see massive segments of unused time.

From looking at these graphs I would say more than 70% of the time is unused, and the time that is used has a very poor fill rate. The fact that there are any gaps between these scheduled calls is a big red flag: the system should be queueing up compute tasks well in advance so that there is a task ready to work on before the current task finishes (that is what Blender does for CUDA).

The question then is: where does the problem lie?

Is it the Metal API or the implementation of the API?

As I said, this problem is not specific to Blender alone. In any case, either Apple will need to issue a fix to their drivers and/or the API, or the developers need to fix their implementation.

hishnash · March 21, 2022

22 minutes ago, Spindel said:

Is it the Metal API or the implementation of the API?

Not an issue with the API: it is possible to saturate these GPUs with compute workloads. Maybe RT workloads can't fully fill the ALUs, as they are mostly IO-bound unless you have costly shaders, but they should at least be pre-scheduling shader calls in advance rather than waiting for the shader to complete, calling back to the CPU, and then issuing the next one.

This is an issue with Blender not having been optimised yet. They only just got full Metal feature support into Cycles in Blender 3.1; I expect they were focusing on ensuring they could render all the object and shader types (Cycles has quite a complex set of features it supports when you look at it: volumetrics, multiple types of hair, particles, and then the entire node-based shader graph).

You should not start optimising until you have all these features supported. I expect over the next few releases we will see some rather large performance improvements as they tune shaders for maximum usage and as they tune the dispatch so that the GPU is not sitting around doing nothing, waiting for the next task.

Just that Mario · March 26, 2022

On 3/9/2022 at 8:35 PM, Alex Atkin UK said:

Pinch of salt indeed.

My Macbook Pro scores higher than my 9900K, almost as high as my 5950X, but in real-world desktop use both of those "feel" faster, they're just more responsive in general.

E.g. MacOS is terrible when dealing with network drives; it feels like I'm back in the 90s with how long it can take. Linux and Windows will open the drive in less than a second, MacOS can take minutes, and I've performed every single tweak I can find online to both MacOS and Samba. It's not WiFi either: my wired Mac Mini M1 takes just as long, and hard-wiring the Macbook Pro makes zero difference either.

IDK, I have constant issues with Windows File Explorer crashing when not connected to the network. It takes multiple minutes for that garbage thing to suggest the troubleshooter, and then it fixes itself. MacOS connects super fast, though it's a bit tedious.

Alex Atkin UK · March 28, 2022

On 3/26/2022 at 5:02 PM, Just that Mario said:

IDK, I have constant issues with Windows File Explorer crashing when not connected to the network. It takes multiple minutes for that garbage thing to suggest the troubleshooter, and then it fixes itself. MacOS connects super fast, though it's a bit tedious.

I wish I knew what was up with MacOS, same problem on a Mac Mini and Macbook Pro - same server on Linux or Windows is absolutely fine.

Another annoying thing is that MacOS unmounts the drives whenever Samba is restarted on the server, so I regularly wake up the Mac to a "drives disconnected" message because my server updates at 3am automatically. Windows, on the other hand, doesn't care. It probably DOES still technically disconnect, but it sensibly just reconnects in the background so you'd never know. Why bother the end user about this unless the reconnection fails? I don't need to know, and I certainly shouldn't have to waste time going back into Finder to reconnect the drives. It has more than once caused resuming work from a network drive to fail outright, and it is problematic with the software I use (Topaz Video Enhance AI) because once that happens it doesn't seem to recover even once reconnected.

Granted that's partly a flaw in that software, but it's one that wouldn't happen if MacOS didn't behave so oddly.

Paul Thexton · March 28, 2022

5 minutes ago, Alex Atkin UK said:

Granted that's partly a flaw in that software, but it's one that wouldn't happen if MacOS didn't behave so oddly.

I can’t say I’ve had the issues you’ve described here myself, but that said, I never use Windows for hosting SMB shares (I only ever use NAS appliances or Linux Samba). I do know others who’ve had similar issues, though.

It’s unlikely to help with the disconnect issue but does following this help at all?

Alex Atkin UK · March 28, 2022

1 minute ago, Paul Thexton said:

I can’t say I’ve had the issues you’ve described here myself, but that said, I never use Windows for hosting SMB shares (I only ever use NAS appliances or Linux Samba). I do know others who’ve had similar issues, though.

It’s unlikely to help with the disconnect issue but does following this help at all?

I'm not using Windows, I'm using Fedora, and I made all the recommended tweaks to smb.conf. That improved how quickly the directory contents are initially displayed, but it is still painfully slow pulling in the rest of the listing when you scroll down.

Paul Thexton · March 28, 2022

2 minutes ago, Alex Atkin UK said:

I'm not using Windows, I'm using Fedora, and I made all the recommended tweaks to smb.conf. That improved how quickly the directory contents are initially displayed, but it is still painfully slow pulling in the rest of the listing when you scroll down.

Just found this too, but I don’t know if Samba supports multichannel; I've never needed to look into it.

https://support.apple.com/en-us/HT212277

Alex Atkin UK · March 28, 2022

39 minutes ago, Paul Thexton said:

It’s unlikely to help with the disconnect issue but does following this help at all?

Yes, that is what at least allowed the initial directory listing to appear almost immediately; before, it would hang even doing that.

leadeater · March 28, 2022

3 hours ago, Paul Thexton said:

Just found this too, but I don’t know if Samba supports multichannel; I've never needed to look into it.

https://support.apple.com/en-us/HT212277

Samba supports both SMB Multi-Channel and SMB Direct. I'm not sure whether they have transitioned out of "experimental" or not; I last looked at these for Samba a few years ago. Either way, both work, and I have gotten both to work personally.

Paul Thexton · March 28, 2022

6 hours ago, Alex Atkin UK said:

Yes, that is what at least allowed the initial directory listing to appear almost immediately; before, it would hang even doing that.

I’m out of ideas then. May sound silly but have you raised a Feedback Assistant ticket with Apple? They don’t always reply to them, but the more people use that to tell them there’s a problem it increases the slim chance an engineer who cares will see it.

Alex Atkin UK · March 29, 2022

23 hours ago, Paul Thexton said:

I’m out of ideas then. May sound silly but have you raised a Feedback Assistant ticket with Apple? They don’t always reply to them, but the more people use that to tell them there’s a problem it increases the slim chance an engineer who cares will see it.

It seems to have been going on for a long time.

Paul Thexton · March 29, 2022

@Alex Atkin UK it definitely has. I’ve just never run into it myself and I’ve never quite understood why. I can only assume it’s down to how I tend to manage folder structures on my shared drives, because there’s not much else I tend to do differently.
