
BOINC Pentathlon 2024

41 minutes ago, HoldSquat said:

let's see how it does after everyone unloads their bunkered WUs 😅

I don't think it'll matter too much. Their shared memory partition seems to be too small to support any form of bunkering or a large contest.

Western Sydney University - 4th year BCompSc student


1 minute ago, Kilrah said:

Also, on (of course) my main machine it gets the estimate grossly wrong and that seems to give only a fraction of the credit, and I can't seem to fix it... unless it's that it doesn't like you not giving it all of your resources, hmm

That's just CreditScrewNew for you.

Western Sydney University - 4th year BCompSc student


9 hours ago, leadeater said:

Wish the project people gave a little more love and time to making sure their processes are NUMA-aware and handle it more correctly on Windows. Dreams are free.

Is this a project, BOINC or Windows problem? Each task independently spawns with as many threads as you tell it, and has no knowledge of whether other tasks are running simultaneously. So you could argue the control point needs to be at a higher level, either BOINC or Windows. I don't know what software options there are to aid in this. You'd think if you spawned a process using, say, 8 threads, they might be related, so maybe keep them together? Watch Windows split it across cache domains, never mind NUMA. Then again, for other work, say similar to Cinebench R15, it might be optimal to spread across whatever domains are available. I don't know if there is some way for software to flag this to the scheduler.

 

In the past when I had a Zen 2 CPU with two CCXs, I had to use Process Lasso to keep the two tasks on their own CCX. Windows for whatever reason really wanted to shuffle them around, which had a performance impact.

 

Does Linux get this right out of the box? 

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


40 minutes ago, porina said:

Does Linux get this right out of the box? 

 

It handles it a little better as long as the BIOS options are enabled, otherwise task pinning is needed on Linux also.

BOINC Team Member of TeAm AnandTech

BOINCGames.com - BOINC Competition all year long.


2 hours ago, porina said:

Is this a project, BOINC or Windows problem?

Kind of all 3, but as a programmer you can code your application to detect NUMA domains and processor groups, balance across these, and give Windows enough information to allocate better. I don't have these problems with other software, but that software is also made with multi-socket server usage in mind as standard.

 

When Processor Node 0 is 100% utilized, a new BOINC task should never be allocated to it; it should go to Processor Node 1 when that is at 0%. But that's not what is happening, and it's stupid and annoying.

 

BOINC being what it is, though, I doubt the typical host is a dual- or quad-socket system like I have.
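
For a sense of scale, here is a minimal sketch of the NUMA detection and thread pinning leadeater describes, using documented Win32 calls (GetNumaHighestNodeNumber, GetNumaNodeProcessorMaskEx, SetThreadGroupAffinity). It is illustrative only, not any project's actual code:

/* numa_pin.c - detect NUMA nodes and pin a worker thread to one. */
#include <windows.h>
#include <stdio.h>

static BOOL pin_current_thread_to_node(USHORT node)
{
    GROUP_AFFINITY ga;
    /* Ask Windows which processor group/mask this node covers. */
    if (!GetNumaNodeProcessorMaskEx(node, &ga))
        return FALSE;
    /* Constrain the calling thread to that node's processors. */
    return SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL);
}

int main(void)
{
    ULONG highest = 0;
    GetNumaHighestNodeNumber(&highest);
    printf("NUMA nodes present: %lu\n", highest + 1);

    /* A worker that wants to stay off a busy node 0 could do: */
    if (highest >= 1 && pin_current_thread_to_node(1))
        puts("worker thread now confined to node 1");
    return 0;
}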


21 minutes ago, leadeater said:

Kind of all 3, but as a programmer you can code your application to detect NUMA domains and processor groups, balance across these, and give Windows enough information to allocate better.

Let's use an example. Say we have a 7950X, so two CCDs of 8 cores each. For this example, let's say it is optimal to run 2 tasks, each using 8 threads, split one per CCD.

 

The current situation is we start two independent tasks of 8 threads each. Neither task is aware of the other. Can they hint to Windows that those 8 threads should be kept together? If SMT is on, can they further hint that they'd rather not share with anything else? Windows' default behaviour does seem to be to move all the threads around, and crossing a CCD boundary does result in a performance drop.

 

Both tasks are started by BOINC, so that's why I'm thinking that could be one location where it could direct the OS. For example, if you set affinity on the BOINC client, it does get inherited by any tasks it spawns, assuming those tasks don't override it somehow. Could BOINC have a user setting, for example, on how best to fill cores with work? This is what people use Process Lasso for.
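
To illustrate that inheritance concretely, a small sketch added here for illustration; the child command is a stand-in and the mask choice is arbitrary:

/* inherit_demo.c - show that a child process takes on the parent's
   CPU affinity mask, which is why setting affinity on boinc.exe also
   constrains the science apps it launches. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR proc_mask, sys_mask;
    GetProcessAffinityMask(GetCurrentProcess(), &proc_mask, &sys_mask);

    /* Restrict ourselves to the lowest CPU we are allowed to use. */
    DWORD_PTR one_cpu = proc_mask & (~proc_mask + 1);
    SetProcessAffinityMask(GetCurrentProcess(), one_cpu);

    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    wchar_t cmd[] = L"cmd.exe /c exit";          /* stand-in child */
    if (CreateProcessW(NULL, cmd, NULL, NULL, FALSE, CREATE_SUSPENDED,
                       NULL, NULL, &si, &pi)) {
        DWORD_PTR child_mask;
        GetProcessAffinityMask(pi.hProcess, &child_mask, &sys_mask);
        /* Prints the same single-CPU mask we set on the parent. */
        printf("parent 0x%llx, child 0x%llx\n",
               (unsigned long long)one_cpu,
               (unsigned long long)child_mask);
        ResumeThread(pi.hThread);
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }
    return 0;
}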

 

21 minutes ago, leadeater said:

I don't have these problems with other software, but that software is also made with multi-socket server usage in mind as standard.

It could be done at project level, but there will be a bad tradeoff: you'd have to allocate both those tasks simultaneously in a single container, and if one task finishes before the other, you'll have idle time. For predictable work like this it could work, I guess, as long as you don't have other things stealing CPU time and causing the two units to diverge.

 

Another way might be for those tasks to probe whether other tasks are also present, and self-organise that way. But then you can't fully isolate tasks from each other.

 

Either way, I think this is niche enough that no dev time is likely to be put in this direction at project level. If anything is done I feel it belongs at BOINC level as I've long wished for better task control, especially when running very different types of units e.g. CPU and GPU from different projects.

 

While I still think Windows could do better, look at the problems with Intel hybrid CPUs, and also the problems with multiple-CCD AMD CPUs. I don't have any confidence that MS can ever get the scheduler to work well as it has to work at an abstract enough level. BOINC at least is more focused in what it does.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


2 hours ago, porina said:

Let's use an example. Say we have a 7950X, so two CCDs of 8 cores each. For this example, let's say it is optimal to run 2 tasks, each using 8 threads, split one per CCD.

 

The current situation is we start two independent tasks of 8 threads each. Neither task is aware of the other. Can they hint to Windows that those 8 threads should be kept together? If SMT is on, can they further hint that they'd rather not share with anything else? Windows' default behaviour does seem to be to move all the threads around, and crossing a CCD boundary does result in a performance drop.

This isn't really anything to do with sockets and NUMA nodes; that's microarchitecture, or what people also refer to as sub-NUMA. Not jamming a new process onto an already 100%-utilized CPU socket is pretty damn basic and shouldn't happen. Multiple sockets have been around since the 80s or even earlier, and for Windows since the late 80s to early 90s. I can assure you what is happening is not solely due to Windows or it being incapable of doing this.

 

2 hours ago, porina said:

Either way, I think this is niche enough that no dev time is likely to be put in this direction at project level.

It's done literally every time for HPC clusters running slurm etc. The issue is BOINC isn't that, and putting time and effort into a platform designed for utilizing idle time on people's desktops doesn't make a lot of sense. Also remember researchers aren't always programming experts either; that's why NeSI, for example, offers professional services to their (our) users to make sure their workloads run correctly and actually work in with slurm and their server nodes.

 

5 hours ago, porina said:

Does Linux get this right out of the box? 

Linux does at least go "this one is 100% utilized, so unless you explicitly tell me not to I'm going to put it on the other one". Windows isn't quite as aggressive about doing that, but I still suspect it's more a code issue on the project's side with how they are doing threading. The behavior is not normal for Windows.
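
For comparison, explicit pinning on Linux is only a few lines; a sketch with arbitrary CPU IDs, which you'd pick after checking lstopo or numactl --hardware for the real layout:

/* pin_linux.c - confine the calling process to a chosen set of CPUs. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* "Explicitly tell me": confine this process to CPUs 0-7,
       e.g. one node's cores on a two-node box (IDs illustrative). */
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    puts("pinned to CPUs 0-7");
    return 0;
}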

 

Optimizing for a specific CPU microarchitecture is well beyond the basic NUMA node/socket stuff; personally, I think it would be quite unfair to expect that from a BOINC project that could run on 1000 different CPU models.


Got to love it.

Why are 2 of the 3 PCs restarting...

Noticed Windows updated...

Really, Windows? Really?

MSI X399 SLI Plus | AMD Threadripper 2990WX all-core 3GHz lock | Thermaltake Floe Riing 360 | EVGA 2080, Zotac 2080 | G.Skill Ripjaws 128GB 3000MHz | Corsair RM1200i | 150TB | Asus TUF Gaming mid tower | 10Gb NIC


Dropped the bunker, but now it seems like Ramanujan doesn't want to give me any tasks…

Should I bunker for NFS?

"The most important step a man can take. It’s not the first one, is it?
It’s the next one. Always the next step, Dalinar."
–Chapter 118, Oathbringer, Stormlight Archive #3 by Brandon Sanderson

 

 

Older stuff:

Spoiler

"A high ideal missed by a little, is far better than low ideal that is achievable, yet far less effective"

 

If you think I'm wrong, correct me. If I've offended you in some way, tell me what it is and how I can correct it. I want to learn, and along the way one can make mistakes; being wrong helps you learn what's right.

 


2 minutes ago, Lightwreather said:

Should I bunker for NFS?

That, or run Numbers on CPU; it's actually pretty decent for that. If we do NFS we'll have to make sure enough of us are doing it on the same day or it'll go to waste.

 

I'm trying to get a lot of systems moved over to Ramanujan and getting work, but it's not going so well. I'm also adding NFS to them, so if it just doesn't work out I'll turn off networking and bunker. I'll let you know if I go for NFS.


Are we doing all NFS subprojects? It doesn't look like there's anything mandated, but is there any strategic advantage to concentrating on one of them?


I released about 100 bunkered Ramanujan units.

 

Laptop's running Ramanujan (which still doesn't want to give enough units); desktop's running NF on GPU and alternating between bunkering NFS and running PG on CPU.

F@H
Desktop: i9-13900K, ASUS Z790-E, 64GB DDR5-6000 CL36, RTX3080, 2TB MP600 Pro XT, 2TB SX8200Pro, 2x16TB Ironwolf RAID0, Corsair HX1200, Antec Vortex 360 AIO, Thermaltake Versa H25 TG, Samsung 4K curved 49" TV, 23" secondary, Mountain Everest Max

Mobile SFF rig: i9-9900K, Noctua NH-L9i, Asrock Z390 Phantom ITX-AC, 32GB, GTX1070, 2x1TB SX8200Pro RAID0, 2x5TB 2.5" HDD RAID0, Athena 500W Flex (Noctua fan), Custom 4.7l 3D printed case

 

Asus Zenbook UM325UA, Ryzen 7 5700u, 16GB, 1TB, OLED

 

GPD Win 2


6 hours ago, leadeater said:

It's done literally every time for HPC clusters running slurm etc.

I have no idea what slurm is, but if I had access to an HPC resource, the software does have visibility of the system, does it not? We don't necessarily have that with project-level code under BOINC.

 

6 hours ago, leadeater said:

Also remember researchers aren't always programming experts either

If you hang around mersenneforum you'll find the author of Prime95, whose math library (gwnum) is used in LLR and does the heavy lifting. For CUL, LLR2 is used, a branch of LLR modified to enable fast checking. gwnum does take microarchitecture optimisations into consideration, such as availability of instructions (e.g. AVX-512), differences in implementations of those instructions (1 vs 2 unit AVX-512, and whatever AMD did in Zen 4), and different cache sizes. The performance-critical parts were stated as being done in assembly.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


35 minutes ago, porina said:

If you hang around mersenneforum you'll find the author of Prime95, whose math library (gwnum) is used in LLR and does the heavy lifting. For CUL, LLR2 is used, a branch of LLR modified to enable fast checking. gwnum does take microarchitecture optimisations into consideration, such as availability of instructions (e.g. AVX-512), differences in implementations of those instructions (1 vs 2 unit AVX-512, and whatever AMD did in Zen 4), and different cache sizes. The performance-critical parts were stated as being done in assembly.

Detecting instruction sets isn't particularly difficult. You're also talking about optimizing the code for the arch it's running on, which is an entirely different thing from ensuring that your application plays nicely with the system when running multiple instances of it. Optimizing code to run fast on XYZ CPU is a separate aspect altogether.
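
Indeed, the detection itself is a few lines. A sketch using MSVC's __cpuidex intrinsic to check for AVX-512F; gwnum's real checks go much further, and a complete test must also confirm OS support of ZMM state via XGETBV, which this skips:

/* cpuid_avx512.c - minimal instruction-set detection sketch. */
#include <intrin.h>
#include <stdio.h>

int main(void)
{
    int regs[4] = { 0 };                 /* EAX, EBX, ECX, EDX */
    __cpuidex(regs, 7, 0);               /* leaf 7, subleaf 0  */
    int avx512f = (regs[1] >> 16) & 1;   /* EBX bit 16: AVX-512F */
    printf("AVX-512F supported by CPU: %s\n", avx512f ? "yes" : "no");
    return 0;
}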

 

Neither Prime95 nor PrimeGrid is going to know that you, for whatever reason, want to run multiple instances on a system, so unless they have catered for that you easily get into the situation of not having sufficiently accurate resource allocation. When I was running P95 to get the benchmark figures, it was doing that based on how many cores and how many concurrent tasks you want to run, so P95 certainly can do it; it's part of the application and it does it correctly from what I observed.

 

PrimeGrid, on the other hand, has to work in with what BOINC allows for tracking jobs and giving out points etc. Ideally you'd configure the same parameters on the project website as you do in P95, so the task that gets generated and issued to your system knows it's configured for, say, 2 tasks of 6 threads, and you'd get credited the right amount of points based on that and run time. That way the main task is aware of both task processes and can allocate them to different NUMA nodes, for example. However, I have no idea if that is actually possible with BOINC at all.

 

Since the above is not how it's being done, you have to rely more on BOINC and Windows scheduling, while also making sure the processes you start are telling them the right information and reserving the correct things, which allows the resource allocators to do a better job. BOINC should be looking at the tasks it has allocated and ensuring it is not overlapping task resource allocation while there are unutilized system resources. I can't think of a situation in a BOINC context where, if BOINC starts task 1 on NUMA Node 0, you'd also want task 2 allocated to NUMA Node 0; logically, if NUMA Node 1 exists then that is the preferred place for it. BOINC tasks, from what I understand, are independent and don't need to talk to each other or share memory, which is when you would want them on the same NUMA node, e.g. an SQL DB process and an application/web process on the same system/OS, which would give the highest performance running on the same NUMA node. ESXi actually has detection for that at the VM level: if it sees two VMs talking to each other a lot it will allocate them to the same NUMA node (sometimes you want to disable that, but almost never).
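
As a concrete illustration of that allocation logic, a hypothetical sketch of a BOINC-like launcher placing each new task on the next NUMA node. This is not how BOINC is actually implemented, and the function name is made up; it assumes a single processor group:

/* launch_on_node.c - start each new task pinned to the next node. */
#include <windows.h>

static USHORT next_node;   /* rotates across nodes */

BOOL launch_task_round_robin(wchar_t *cmdline)
{
    ULONG highest = 0;
    GROUP_AFFINITY ga;
    GetNumaHighestNodeNumber(&highest);

    USHORT node = (USHORT)(next_node++ % (highest + 1));
    if (!GetNumaNodeProcessorMaskEx(node, &ga))
        return FALSE;

    /* Start suspended so the mask is applied before any compute
       thread runs, then pin the whole process to the node. */
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    if (!CreateProcessW(NULL, cmdline, NULL, NULL, FALSE,
                        CREATE_SUSPENDED, NULL, NULL, &si, &pi))
        return FALSE;
    SetProcessAffinityMask(pi.hProcess, (DWORD_PTR)ga.Mask);
    ResumeThread(pi.hThread);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return TRUE;
}

A real policy would look at node utilization rather than blind round-robin, but the Win32 mechanism would be the same.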

 

This is what slurm can do for example:

 

Quote

--cores-per-socket=<cores>
Restrict node selection to nodes with at least the specified number of cores per socket. See additional information under -B option above when task/affinity plugin is enabled.
NOTE: This option may implicitly set the number of tasks (if -n was not specified) as one task per requested thread.

 

Quote

-c, --cpus-per-task=<ncpus>
Advise Slurm that ensuing job steps will require ncpus processors per task. By default Slurm will allocate one processor per task.
For instance, consider an application that has 4 tasks, each requiring 3 processors. If our cluster is comprised of quad-processors nodes and we simply ask for 12 processors, the controller might give us only 3 nodes. However, by using the --cpus-per-task=3 options, the controller knows that each task requires 3 processors on the same node, and the controller will grant an allocation of 4 nodes, one for each of the 4 tasks.

https://slurm.schedmd.com/salloc.html

 

Also: https://slurm.schedmd.com/cpu_management.html

35 minutes ago, porina said:

I have no idea what slurm is, but if I had access to an HPC resource, the software does have visibility of the system, does it not? We don't necessarily have that with project-level code under BOINC.

See above. slurm does a lot more intelligent resource allocation than BOINC does, by necessity, but you the user have to set the correct parameters when submitting jobs into the job queue or it'll run poorly, or worse not at all. The other thing you don't want to do is put anything in your application code that would conflict with the slurm allocator, like static thread allocations; if you need those, make sure you match them when submitting into slurm.

 

Very roughly, I'd say slurm actually lets you do a little less of this work in the application code, since you have to define a lot of it during job submission, but you still have to be careful that you have done your thread allocations correctly in code and not done something bad.

 

As to my comment about researchers, do remember that while they can be, for example, math experts who know how to get the best out of a CPU for a particular calculation, that doesn't mean they understand a lot of other aspects of coding and system design. In the same way, I understand system design and the interaction between NUMA nodes and PCIe devices (GPUs, NICs), but that doesn't mean I have sufficient coding experience and knowledge to do anything with that understanding. I could, for example, tell you not to utilize more than 2 GPUs per server node even though there are 4, because it's 2 per CPU and NVLink is not being used, so you'd get far less performance if you tried to use 4.


So: 6 cores chewing through the PrimeGrid CPU tasks, then Numbers getting processed through the 2080s and 2070s at a rate of 7 minutes per task.

 

MSI X399 SLI Plus | AMD Threadripper 2990WX all-core 3GHz lock | Thermaltake Floe Riing 360 | EVGA 2080, Zotac 2080 | G.Skill Ripjaws 128GB 3000MHz | Corsair RM1200i | 150TB | Asus TUF Gaming mid tower | 10Gb NIC


2 hours ago, leadeater said:

That, or run Numbers on CPU; it's actually pretty decent for that. If we do NFS we'll have to make sure enough of us are doing it on the same day or it'll go to waste.

 

I'm trying to get a lot of systems moved over to Ramanujan and getting work, but it's not going so well. I'm also adding NFS to them, so if it just doesn't work out I'll turn off networking and bunker. I'll let you know if I go for NFS.

Any luck with Ramanujan? I dropped my bunker, then only got two tasks since.

My Folding Stats

 

Current Rigs

Raspberry Pi 5 8GB, Raspberry Pi 4 4GB, Raspberry Pi 3, Raspberry Pi Zero W, Raspberry Pi Zero...I like Pi

Fractal North, ASRock Riptide B550, Ryzen 7 5700X, 6700XT with custom CPU/GPU loop.

 


17 minutes ago, leadeater said:

When I was running P95 to get the benchmark figures, it was doing that based on how many cores and how many concurrent tasks you want to run, so P95 certainly can do it; it's part of the application and it does it correctly from what I observed.

Prime95 is one piece of software running as one instance. It has control that is not possible through the LLR variants.

 

17 minutes ago, leadeater said:

PrimeGrid on the other hand has to work in with what BOINC allows to track jobs and give out points etc. Ideally you'd configure the same parameters on the project website as you do in P95 so your task that gets generated and issued to your system knows it's configured for i.e. 6 threads 2 tasks and you'll get credit the right amount of points based on that and run time. That way the main task is aware of both task processes and can allocated to different NUMA nodes for example. However I have no idea if that is actually possible with BOINC at all.

As I mentioned in an earlier post, the only way I can see this happening without BOINC changes is if you issue a container with subtasks in it. But that will come with new problems.

 

17 minutes ago, leadeater said:

Since the above is not how it's being done, you have to rely more on BOINC and Windows scheduling, while also making sure the processes you start are telling them the right information and reserving the correct things, which allows the resource allocators to do a better job.

I'm not familiar with the details but I'd imagine if you configure it to use say 8 threads, all it does is split the work into 8 threads and it is someone else's problem where they go. I don't know if there are ways to signal how those threads should be grouped/assigned.

 

17 minutes ago, leadeater said:

As to my comment about researchers, do remember that while they can be, for example, math experts who know how to get the best out of a CPU for a particular calculation, that doesn't mean they understand a lot of other aspects of coding and system design.

I get that. It just feels like you're blaming them for something not under their control.

 

As mentioned earlier, people have used Process Lasso to address this. I'm still not sure I understand Slurm, but very loosely it seems to have a similar role in controlling what should go where.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


19 minutes ago, HoldSquat said:

Any luck with Ramanujan? I dropped my bunker, then only got two tasks since.

I'll get some, but pretty much only if I manually click the update button after I'm done with most of those I had. It really doesn't want to give anything in advance.

F@H
Desktop: i9-13900K, ASUS Z790-E, 64GB DDR5-6000 CL36, RTX3080, 2TB MP600 Pro XT, 2TB SX8200Pro, 2x16TB Ironwolf RAID0, Corsair HX1200, Antec Vortex 360 AIO, Thermaltake Versa H25 TG, Samsung 4K curved 49" TV, 23" secondary, Mountain Everest Max

Mobile SFF rig: i9-9900K, Noctua NH-L9i, Asrock Z390 Phantom ITX-AC, 32GB, GTX1070, 2x1TB SX8200Pro RAID0, 2x5TB 2.5" HDD RAID0, Athena 500W Flex (Noctua fan), Custom 4.7l 3D printed case

 

Asus Zenbook UM325UA, Ryzen 7 5700u, 16GB, 1TB, OLED

 

GPD Win 2


1 hour ago, porina said:

As mentioned earlier, people have used Process Lasso to address this.

Yes, but the point is that having to use it is a symptom of the issue; needing it at all is the problem itself. All Process Lasso is doing is setting the affinity masks mentioned below statically, which isn't actually a good thing most of the time. It is what we want in this instance, but it's also something that can be set correctly, and not statically, by a process itself. This is exactly why I'm saying it shouldn't be necessary.

 

1 hour ago, porina said:

I'm not familiar with the details but I'd imagine if you configure it to use say 8 threads, all it does is split the work into 8 threads and it is someone else's problem where they go. I don't know if there are ways to signal how those threads should be grouped/assigned.

If you're not telling Windows to group your processes into a common NUMA node, you're going to have problems. If you force your processes onto NUMA Node 0, you'll also have problems.

 

This is something not even Cinebench does correctly above 64 threads per NUMA node, which is a configuration I have on some of my servers.

 

Anyway:

[two screenshots attached]

 

As you can see, if a new thread is spawned on a NUMA node/processor core that is already busy doing work rather than on the other, completely idle one, then it's not the Windows scheduler; someone has 100% done something wrong somewhere. This is not how Windows works by default, with nothing overriding what it would normally do.

 

https://empyreal96.github.io/nt-info-depot/CS490_Windows_Internals/08_Scheduling.pdf

 

And:

https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support

https://learn.microsoft.com/en-us/windows/win32/procthread/multiple-processors

 

If you aren't touching affinity masks and aren't specifically starting your control thread on Node 0, then there is actually no reason a new process would start on, or always run on, Node 0 while Node 1 is idle. That means if it is happening, the Windows scheduler is not alone to blame.

 

1 hour ago, porina said:

I'm not familiar with the details but I'd imagine if you configure it to use say 8 threads, all it does is split the work into 8 threads and it is someone else's problem where they go. I don't know if there are ways to signal how those threads should be grouped/assigned.

Win32 Affinity Masks

 

Slurm is a cluster job scheduler; it's not something you would use, or need to use, at the single-server level. Both Windows and Linux have the required scheduler flags, and those are actually what slurm uses (slurm is Linux-only, mind you). Any application you run on a system can do what slurm does.

 

 


8 minutes ago, leadeater said:

Win32 Affinity Masks

Is that the same thing as manually setting affinity? That shouldn't be what we're after; we want a way for software to say, for example, "keep these threads logically together" and "prefer empty real cores".

 

Thanks for the links to the scheduler stuff, that's going to take some digesting. Might be a read for after lunch.

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


17 minutes ago, porina said:

Is that the same thing as manually setting affinity?

Sort of. In Windows Task Manager or something like Process Lasso you typically only get static options, but that's not the full scope of Win32 Affinity Masks.

 

I don't disagree with using Process Lasso for something like confining tasks within a CCX/CCD; it's really not common to need that or for it to matter enough. But it just shouldn't ever be required on a multi-socket server for the simple act of starting a new process and having it not be jammed onto an already busy physical CPU. That is not how Windows works by default.

 

If you for whatever reason allocate some memory on startup on Node 0 and then spawn child processes for the actual main computation, that would unfortunately lead to a situation where all child processes/threads have an 'Ideal Processor' preference of NUMA Node 0. Something as simple and as small as that can have unintended flow-on effects.
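
To illustrate, a minimal sketch of sidestepping that trap by making the node preference explicit at allocation time; the node number is hard-coded purely for the example:

/* numa_alloc.c - memory placement made explicit instead of inherited
   from wherever the startup code happened to run. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T bytes = 64u * 1024 * 1024;
    /* Request that the pages come from NUMA node 1 rather than
       defaulting to the node the allocating thread is on. */
    void *buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, bytes,
                                   MEM_RESERVE | MEM_COMMIT,
                                   PAGE_READWRITE, 1 /* preferred node */);
    if (!buf) {
        fprintf(stderr, "VirtualAllocExNuma failed: %lu\n", GetLastError());
        return 1;
    }
    /* ... hand the buffer to workers pinned to node 1 ... */
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}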

 

17 minutes ago, porina said:

That shouldn't be what we're after; we want a way for software to say, for example, "keep these threads logically together" and "prefer empty real cores".

That is within Win32 Affinity Masks. You're just thinking of static affinities like you see in Process Lasso.


17 minutes ago, leadeater said:

If you for whatever reason allocate some memory on startup on Node 0 and then spawn child processes for the actual main computation, that would unfortunately lead to a situation where all child processes/threads have an 'Ideal Processor' preference of NUMA Node 0. Something as simple and as small as that can have unintended flow-on effects.

That could be BOINC then. Like I've mentioned, if I set manual affinity to the BOINC client process, any tasks it spawns also have the same affinity set. So that might be the clue.

 

We could test this later by manually running multiple individual LLR instances outside of BOINC. Revisit after Pentathlon?

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, RTX 4070, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


22 minutes ago, porina said:

That could be BOINC then. Like I've mentioned, if I set manual affinity to the BOINC client process, any tasks it spawns also have the same affinity set. So that might be the clue.

Ah, you know, I forgot that the BOINC tasks would be child processes of boinc.exe, since that is the process that starts them. I had a look at the process tree, and the parent PID of GetDecics_4.00_windows_x86_64 (Numbers) is boinc.exe.

 

Even so, everything we run is a child process of explorer.exe (which boinc.exe/boincmgr.exe is), so that shouldn't be the cause, and this issue doesn't happen on all projects either. If a CPU is busy it should still prefer the idle one.


Just joined the forums. I've been BOINCing for years, mostly on PrimeGrid, and decided to take part in the Pentathlon this year for LTT 🙂

 

PrimeGrid has a really good Discord with great people, and I was given a script that you can run to solve these issues on Linux (I use Process Lasso on Windows as well). I've adapted it quite a bit and it's been working well for me. Is it OK to post that code here, should anyone be interested?

 

It doesn't always solve the issues though:

 

You can use the command lstopo on Linux to get a good idea of what the CPU architecture looks like.

 

The main problem is that each CCD/CCX (I never know which one is which! but depending on whether you're on 3000, 5000 or 7000 it either matters or it doesn't) doesn't necessarily have enough cache for PrimeGrid tasks.

 

So, for example, I have a 7502P (rented from Hetzner) with 32 physical cores. These are split into 8 "units" of 4 cores that share the same L3 cache. The total L3 for the processor is 128MB; divided by 8, that's 16MB for each unit of 4 cores. The current Cullen marathon challenge requires just over 20MB, so no matter what you do, you're not going to get great results. If you run 8 tasks concurrently the CPU will have to rely on RAM, which is a lot slower than L3 cache; if you're running 4 tasks concurrently, you're going over the Infinity Fabric (though it may fall back to RAM as well sometimes), which is faster than RAM but still much slower than if the entire task fitted into L3.
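
A quick way to check that per-unit L3 figure on such a box; a small sketch assuming the usual sysfs layout where index3 is the L3 (confirm with lstopo):

/* l3_check.c - read the L3 size visible to CPU 0 via sysfs. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/system/cpu/cpu0/cache/index3/size";
    char size[32] = "unknown";
    FILE *f = fopen(path, "r");
    if (f) {
        fscanf(f, "%31s", size);
        fclose(f);
    }
    /* This reports the L3 slice shared by cpu0's unit, not the full
       processor total, which is the number that matters for whether
       a ~20MB LLR task can stay resident in cache. */
    printf("L3 visible to cpu0: %s\n", size);
    return 0;
}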


8 hours ago, dogwitch said:

Got to love it.

Why are 2 of the 3 PCs restarting...

Noticed Windows updated...

Really, Windows? Really?

Hah, I had the same problem during the Pentathlon last year; it ended up occurring while I was at work and I lost pretty much a day's worth of work as a result. I've made it a point before every Folding/BOINC event to check for Windows updates on all my systems the day before, then pause updates for 2-3 weeks.


This topic is now closed to further replies.

