
leadeater

Moderator
Reputation Activity

  1. Informative
    leadeater reacted to danwat1234 in BOINC Pentathlon 2024   
    Nice guess, I'll pwn that one! Good timing, because I think shortly after this time period SiDock may have a quiet period as it transitions to another batch after the huge, long "corona_RdRp_v2" run, currently 84% complete. https://www.sidock.si/sidock/server_status.php
     
  2. Informative
    leadeater got a reaction from danwat1234 in BOINC Pentathlon 2024   
    Yes, but the point is that having to use that is a symptom of the issue; needing it at all is the problem itself, you shouldn't need to. All Process Lasso is doing is setting the below affinity masks statically, which isn't actually a good thing most of the time. It is what we want in this instance, but it's also something that can be set correctly, and not statically, by a process itself. This is exactly why I'm saying it shouldn't be necessary.
     
    If you're not telling Windows to group your processes into a common NUMA node you're going to have problems. If you force your processes onto NUMA Node 0 you'll also have problems.
     
    This is something not even Cinebench does correctly above 64 threads per NUMA node, which is something I have on some of my servers.
     
    Anyway:

     

     
    As you can see, if a new thread is spawned on a NUMA node/processor that is already busy doing work rather than on the other, completely idle one, then it's not the Windows scheduler; someone has 100% done something wrong somewhere. This is not how Windows works by default with nothing overriding what it would normally do.
     
    https://empyreal96.github.io/nt-info-depot/CS490_Windows_Internals/08_Scheduling.pdf
     
    And:
    https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
    https://learn.microsoft.com/en-us/windows/win32/procthread/multiple-processors
     
    If you aren't touching affinity masks and aren't specifically starting your control thread on Node 0, there is actually no reason a new process would start on, or always run on, Node 0 while Node 1 is idle. If that is happening, then the Windows scheduler isn't the only thing to blame.
     
    Win32 Affinity Masks
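     
    To make that concrete, here's a minimal sketch (my own illustration, not Process Lasso's or any project's actual code) of a process pinning one of its own threads to a chosen NUMA node using the Win32 APIs linked above; it assumes Windows 7 or later and the Windows SDK headers.
     
    #include <windows.h>
    #include <stdio.h>

    /* Pin the calling thread to the processors of one NUMA node. */
    static BOOL pin_current_thread_to_node(USHORT node)
    {
        GROUP_AFFINITY affinity;
        /* The node's mask also tells us which processor group it lives in. */
        if (!GetNumaNodeProcessorMaskEx(node, &affinity))
            return FALSE;                      /* node doesn't exist, or the call failed */
        /* Replace this thread's affinity with the node's mask so the scheduler
           keeps the thread, and its memory allocations, local to that node. */
        return SetThreadGroupAffinity(GetCurrentThread(), &affinity, NULL);
    }

    int main(void)
    {
        ULONG highest = 0;
        GetNumaHighestNodeNumber(&highest);
        printf("Highest NUMA node: %lu\n", highest);

        /* e.g. a worker thread that should live on node 1 when a second node exists */
        USHORT target = (USHORT)(highest >= 1 ? 1 : 0);
        if (!pin_current_thread_to_node(target))
            fprintf(stderr, "Could not pin thread to node %u\n", (unsigned)target);

        /* ... do the actual work here ... */
        return 0;
    }
     
    The point being that a process can do this itself, per thread and at the right time, instead of having a static mask forced onto it from outside.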
     
    Slurm is a cluster job scheduler; it's not something you would use, or need to use, at a single-server level. Both Windows and Linux have the required scheduler flags, and that's actually what slurm uses (slurm is Linux only, mind you). Any application you run on a system can do what slurm does.
     
     
  3. Informative
    leadeater got a reaction from danwat1234 in BOINC Pentathlon 2024   
    Detecting instruction sets isn't particularly difficult. You're also talking about optimizing the code for the arch it's running on, which is an entirely different thing from ensuring that your process/application plays nicely with the system when running multiple instances of it. It's just a different aspect entirely from optimizing the code to run fast on XYZ CPU.
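     
    As a rough sketch of the kind of check meant here (my own illustration, not Prime95's or PrimeGrid's actual detection code), assuming an MSVC-style x86-64 compiler that provides the __cpuidex intrinsic:
     
    #include <intrin.h>
    #include <stdio.h>

    int main(void)
    {
        int regs[4];                              /* EAX, EBX, ECX, EDX */
        __cpuidex(regs, 7, 0);                    /* leaf 7, subleaf 0: extended features */

        int has_avx2    = (regs[1] >> 5)  & 1;    /* EBX bit 5  */
        int has_avx512f = (regs[1] >> 16) & 1;    /* EBX bit 16 */

        /* Real code would also confirm OS support for the wider registers
           (OSXSAVE/XGETBV) before dispatching to AVX code paths. */
        printf("AVX2: %s, AVX-512F: %s\n",
               has_avx2 ? "yes" : "no", has_avx512f ? "yes" : "no");
        return 0;
    }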
     
    Neither Prime95 nor PrimeGrid is going to know that you, for whatever reason, want to run multiple instances on a system, so unless they have catered for that you easily get into the situation of not having sufficiently accurate resource allocation. When I was running P95 to get the benchmark figures it was allocating based on how many cores and how many tasks you want to run at the same time, so P95 certainly can do it; it's part of that application and it does it correctly from what I observed.
     
    PrimeGrid on the other hand has to work within what BOINC allows in order to track jobs and give out points etc. Ideally you'd configure the same parameters on the project website as you do in P95, so the task that gets generated and issued to your system knows it's configured for, e.g., 6 threads and 2 tasks, and you'd get credited the right amount of points based on that and the run time. That way the main task is aware of both task processes and can allocate them to different NUMA nodes, for example. However, I have no idea if that is actually possible with BOINC at all.
     
    Since the above is not how it's being done, you have to rely more on BOINC and Windows scheduling, while also making sure the process you start is telling them the right information and trying to reserve the correct things, which allows the resource allocators to do a better or more correct job. BOINC should be looking at the tasks it has allocated and ensuring it isn't overlapping task resource allocation while there are unutilized system resources. I can't think of a situation in a BOINC context where, if BOINC starts task 1 on NUMA Node 0, you'd also want to allocate task 2 to NUMA Node 0; logically, if NUMA Node 1 exists, then that is the preferred place to allocate it. BOINC tasks/jobs, from what I understand, are independent and don't need to talk to each other or share memory, which is when you would want them on the same NUMA node, e.g. a SQL DB process and an application/web process on the same system/OS, which would give the highest performance running on the same NUMA node. ESXi actually has detection for that at the VM level: if it sees two VMs talking to each other a lot, it'll allocate them to the same NUMA node (sometimes you want to disable that, but almost never).
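     
    For illustration only, a minimal sketch of what that kind of allocation could look like on Windows (my own sketch, not BOINC's code; "worker.exe" is a hypothetical task binary, and it assumes at most 64 logical processors per processor group so the single-group mask APIs are enough):
     
    #include <windows.h>
    #include <stdio.h>

    /* Launch one worker process and confine it to a single NUMA node. */
    static void launch_on_node(wchar_t *cmdline, UCHAR node)
    {
        ULONGLONG node_mask = 0;
        if (!GetNumaNodeProcessorMask(node, &node_mask) || node_mask == 0) {
            fprintf(stderr, "NUMA node %u not available\n", (unsigned)node);
            return;
        }

        STARTUPINFOW si = { sizeof(si) };
        PROCESS_INFORMATION pi = { 0 };

        /* Start suspended so the affinity is in place before the worker runs. */
        if (CreateProcessW(NULL, cmdline, NULL, NULL, FALSE,
                           CREATE_SUSPENDED, NULL, NULL, &si, &pi)) {
            SetProcessAffinityMask(pi.hProcess, (DWORD_PTR)node_mask);
            ResumeThread(pi.hThread);
            CloseHandle(pi.hThread);
            CloseHandle(pi.hProcess);
        }
    }

    int main(void)
    {
        /* CreateProcessW may modify the command line, so use writable buffers. */
        wchar_t task1[] = L"worker.exe";
        wchar_t task2[] = L"worker.exe";

        launch_on_node(task1, 0);   /* task 1 -> NUMA node 0 */
        launch_on_node(task2, 1);   /* task 2 -> NUMA node 1 */
        return 0;
    }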
     
    This is what slurm can do for example:
     
     
    https://slurm.schedmd.com/salloc.html
     
    Also: https://slurm.schedmd.com/cpu_management.html
    See above. slurm does a lot more intelligent resource allocation than BOINC does, by necessity, but you, the user, have to set the correct parameters when submitting jobs into the job queue or it'll run poorly, or worse, not at all. The other thing you don't want to do is put anything in your application code that would conflict with the slurm allocator, like static thread allocations; if you do, you need to make sure you match that when submitting into slurm.
     
    I'd say, very roughly, slurm actually lets you do a little less of this work in the application/process code, since you have to define a lot of it during job queue submission, but you still have to be careful that you have done your thread allocations correctly in code and not done something bad.
     
    As to my comment about researchers, do remember that while they can be, for example, math experts who know how to get the best out of a CPU for a particular calculation, that doesn't actually mean they understand a lot of other aspects of coding and system design. In the same way, I understand system design and the interaction between NUMA nodes and PCIe devices (GPUs, NICs), but that doesn't mean I have sufficient coding experience and knowledge to do anything with that understanding. I could, for example, tell you not to utilize more than 2 GPUs per server node even though there are 4, because it's 2 per CPU and NVLink is not being used, so you'd get much less performance if you tried to use 4. 
  4. Informative
    leadeater reacted to porina in BOINC Pentathlon 2024   
    The tasks will be generated for the number of threads reported by the client, unless limited in PrimeGrid project preferences. I guess if you have different BOINC settings for "in use" and "idle" that could also affect it.
     
    Since each task does roughly the same amount of work, they get about the same points. Up to the user to optimise throughput. It does scale with core count, but not with thread count once cores are exceeded. It is not perfect scaling. I went into this in more detail at 
     
    Been looking to moderate my power usage, but I'll throw a few cores at it shortly today.
  5. Like
    leadeater got a reaction from danwat1234 in BOINC Pentathlon 2024   
    @danwat1234 One of the events chosen this year is SiDock, which you are currently running, yayyyyy 😀
     
     

  6. Like
    leadeater got a reaction from porina in BOINC Pentathlon 2024   
    The points per task are the same; the thread count you configure on the site is mostly about getting more tasks to run at once if your L3+L2 cache size is large enough. If you have a 16-thread CPU, an 8-thread and a 16-thread task will basically take the same amount of time since, as far as I understand it, these CUL tasks are mostly a cache workload more than anything else. I'm sure there is a minimum thread count, but past a certain point more threads likely won't help.
     
    Divide your L3 cache amount by 22.5 and that is how many tasks you should try to run at once; then divide your core count (not thread count) by that figure.
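     
    For example (my own numbers, and assuming the 22.5 figure is MB of cache per task): a CPU with 64MB of L3 gives 64 / 22.5 ≈ 2.8, so run 2 tasks at once, and with 16 cores that's 16 / 2 = 8 threads per task.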
  7. Like
    leadeater got a reaction from Kilrah in BOINC Pentathlon 2024   
    The points per task are the same; the thread count you configure on the site is mostly about getting more tasks to run at once if your L3+L2 cache size is large enough. If you have a 16-thread CPU, an 8-thread and a 16-thread task will basically take the same amount of time since, as far as I understand it, these CUL tasks are mostly a cache workload more than anything else. I'm sure there is a minimum thread count, but past a certain point more threads likely won't help.
     
    Divide your L3 cache amount by 22.5 and that is how many tasks you should try to run at once; then divide your core count (not thread count) by that figure.
  8. Funny
    leadeater got a reaction from Kilrah in BOINC Pentathlon 2024   
    Intel CPU detected 😉
  9. Funny
    leadeater got a reaction from Lightwreather in BOINC Pentathlon 2024   
    If it was that much I'm pretty sure I'd be on fire, and everything around me.
  10. Funny
    leadeater got a reaction from Lightwreather in BOINC Pentathlon 2024   
  11. Funny
    leadeater got a reaction from HoldSquat in BOINC Pentathlon 2024   
  12. Funny
    leadeater got a reaction from Chachunka in BOINC Pentathlon 2024   
  13. Funny
    leadeater got a reaction from dogwitch in BOINC Pentathlon 2024   
  14. Like
    leadeater reacted to GOTSpectrum in BOINC Pentathlon 2024   
    another 40k update 
     

  15. Funny
    leadeater got a reaction from Justaphf in BOINC Pentathlon 2024   
  16. Funny
    leadeater got a reaction from CWP in BOINC Pentathlon 2024   
  17. Funny
    leadeater reacted to SkillzTA in BOINC Pentathlon 2024   
    That wasn't you in that pic?
  18. Funny
    leadeater got a reaction from SkillzTA in BOINC Pentathlon 2024   
    If it was that much I'm pretty sure I'd be on fire, and everything around me.
  19. Agree
    leadeater got a reaction from SkillzTA in BOINC Pentathlon 2024   
    Cores, I only count cores 😉
     
    It means I want to break in and take some of those, damn I can only dream of having that many 4090s
     
    Those DL580 Gen9s btw use 1000W fully loaded, so that's 2kW for me just from those. 8x DL360 Gen9 @ 405W each, 4x XL420 Gen10 @ 600W each, 6x DL380 Gen9 @ 320W each, 3x DL385 Gen10+ (only using 32 cores total per server via VMs) @ 509W each, 2x DL360 Gen10 @ 600W each, and finally 5x DL380p Gen8 that died last night 😢
     
     
    13,247W 

  20. Like
    leadeater reacted to SkillzTA in BOINC Pentathlon 2024   
    I mean, what happens if you have 13 4090s? lol That's ~400W * 13 = 5,200 watts
     
    Meanwhile, 796 CPUs (cores or threads?)
    796 CPU cores ≈ 13 64-core Rome EPYCs. That's ~280W * 13 = 3,640 watts
     
    I am pretty confident that with 13 64-core EPYCs vs 13 4090s on Numberfields, the 64-core EPYCs will win, by a lot.
     
    For reference, I had ~12 64-core EPYCs on Numberfields during the competition and dropped over 20M points.
    Not sure if any teams had any 4090s or how many they had, but I highly doubt 13 4090s would produce 20M points.
     
    If you meant threads, then we're talking half as many EPYCs and half the power draw, and the comparison will probably be much closer.
     
    edit
    It's a little unfair to compare GPUs vs CPUs on Numberfields on points alone, for the simple fact that GPU tasks are worth 240 points while CPU tasks are worth 262 points. However, the 64-core EPYCs will still complete more tasks.
     
    64 cores / 128 threads running 128 tasks at once will complete them roughly every 2 hours. That's 120 minutes to complete 128 tasks, which works out to just under one task every 60 seconds.
     
    As for a 4090: while I don't own any, I've seen someone mention that they complete the tasks in a little OVER 1 minute of effective time. I'm guessing it's a little over 3 minutes running 3 at once, which means the effective time would be over 60 seconds per task.
     
    So one 64-core EPYC would do more work than one 4090 and consume less power.
    That's not even accounting for the fact that the 4090 needs a host CPU to work, adding more power, while the EPYC can run headless without a GPU, so the CPU system draws less power overall than the GPU system.
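     
    Working through those figures roughly (my own arithmetic from the estimates above): the EPYC does 128 tasks per ~120 minutes, about 64 tasks an hour at ~280W, while the 4090 at 3 tasks per a little over 3 minutes is just under 60 tasks an hour at ~400W, so throughput per watt favours the CPU even before counting the GPU's host system.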
     
    I don't even know why I am typing all this, but it's a cool, civil argument so far so I'm okay with it. Never learned anything from someone who agrees with everything I say. 
  21. Funny
    leadeater got a reaction from GOTSpectrum in BOINC Pentathlon 2024   
    But what if you have 796 CPU cores? 🙃
  22. Agree
    leadeater reacted to Kilrah in BOINC Pentathlon 2024   
    Okay, moved my desktop to RNMA now that it works; it also estimates things correctly now (it used to think a WU would take 3h30 when it finished in ~18 mins), so 48 threads on it
     
    I hate those 8+ hour PG WUs. The 3.5-min NF ones were way cooler to see zip by.
  23. Like
    leadeater reacted to GOTSpectrum in BOINC Pentathlon 2024   
    76 threads now crunching 
  24. Like
    leadeater reacted to GOTSpectrum in BOINC Pentathlon 2024   
    I've not been involved due to power issues, but both systems have now joined the city run 
  25. Funny
    leadeater reacted to HoldSquat in BOINC Pentathlon 2024   
    So what you're saying is it's time to deploy even my wife's laptop 😅