leadeater

Moderator
  • Posts

    24,332
  • Joined

  • Last visited

Awards

About leadeater

  • Birthday Sep 23, 1987

Profile Information

  • Gender
    Male
  • Location
    New Zealand
  • Occupation
    Systems Engineer | IT

System

  • CPU
    Intel i7 4930K
  • Motherboard
    Asus Rampage IV Black Edition
  • RAM
    16GB G.Skill TridentX F3-2400C10-4GTX
  • GPU
    Dual Asus R9-290X
  • Case
    LD PC-V8
  • Storage
    4x 512GB Samsung 850 Pro & 2x 512GB Samsung 840 Pro & 1x 256GB Samsung 840 Pro
  • PSU
    EVGA Supernova NEX 1500 Classified
  • Display(s)
    Dell U3014 30"
  • Cooling
    Custom EKWB loop, 3x 480mm rads, everything cooled incl. RAM (why not?)
  • Keyboard
    Razer BlackWidow Ultimate BF4
  • Mouse
    Mad Catz R.A.T. 5
  • Sound
    Custom-built speakers, home theater sound
  • Operating System
    Windows 10

Recent Profile Visitors

32,735 profile views
  1. Their project servers are having issues sending out the work even though there are 100k+ tasks ready to send.
  2. If anyone can complete Numbers CPU tasks in less than 2 hours then they should do it; every point is going to matter a lot. We need the kitchen sink and everything.
  3. Landscape camera position, nice. Teaching people to rotate their device and helping those that already do/want to.
  4. Not much at all really. I'm going to wait ~10 hours for the sprint to finish, then see how things are going or whether there are any new announcements for start times etc. If nothing has improved I'm going to switch everything to NFS and go for the first Jav throw.
  5. Ah, you know, I forgot that the BOINC tasks would be child processes of boinc.exe since that is the process that starts them. I had a look at the process tree and the parent PID of GetDecics_4.00_windows_x86_64 (Numbers) is boinc.exe. Even so, everything we run is a child process of explorer.exe (which boinc.exe/boincmgr.exe is), so that shouldn't be the cause, and this issue doesn't happen on all projects either. If a CPU is busy, the scheduler should still prefer the idle one.
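For anyone wanting to check this themselves, here's a rough sketch (my own illustration, nothing from BOINC) of walking the process list with the Toolhelp API to find a task's parent PID; the exe name is just the task above, with ".exe" assumed.
```cpp
// Sketch only: report the parent PID of a named process via a Toolhelp snapshot.
#include <windows.h>
#include <tlhelp32.h>
#include <cstdio>
#include <cwchar>

int main() {
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (snap == INVALID_HANDLE_VALUE) return 1;

    PROCESSENTRY32W pe{};
    pe.dwSize = sizeof(pe);
    for (BOOL ok = Process32FirstW(snap, &pe); ok; ok = Process32NextW(snap, &pe)) {
        // Name taken from the post above; swap in whatever task you're inspecting.
        if (_wcsicmp(pe.szExeFile, L"GetDecics_4.00_windows_x86_64.exe") == 0) {
            printf("PID %lu has parent PID %lu\n", pe.th32ProcessID, pe.th32ParentProcessID);
        }
    }
    CloseHandle(snap);
    return 0;
}
```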
  6. Sort of. In Windows Task Manager or something like Process Lasso you typically only get given static options, but that's not the full scope of Win32 affinity masks. I don't disagree with using Process Lasso for something like confining a workload within a CCX/CCD; it's not common to need that or for it to matter much. It just shouldn't ever be required on a multi-socket server for the simple act of starting a new process and having it not be jammed onto an already busy physical CPU. That is not how Windows works by default.
If you for whatever reason allocate some memory on startup on Node 0 and then spawn child processes for the actual main computation, that can unfortunately lead to all child processes/threads having an 'Ideal Processor' preference for NUMA Node 0. Something as simple and as small as that can have unintended flow-on effects, and it all sits within Win32 affinity masks; you're just thinking of static affinities like you see in Process Lasso.
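To illustrate the 'Ideal Processor' part, here's a minimal sketch (Windows 7+ API; not the project's actual code) that reports the calling thread's ideal processor and which NUMA node it maps to:
```cpp
// Sketch only: new threads inherit an "ideal processor" preference, which is
// how work can end up biased toward Node 0 without any static affinity set.
#include <windows.h>
#include <cstdio>

int main() {
    PROCESSOR_NUMBER ideal{};
    if (!GetThreadIdealProcessorEx(GetCurrentThread(), &ideal)) return 1;

    USHORT node = 0;
    GetNumaProcessorNodeEx(&ideal, &node);  // map group/processor -> NUMA node

    printf("ideal processor: group %u, cpu %u, NUMA node %u\n",
           (unsigned)ideal.Group, (unsigned)ideal.Number, (unsigned)node);
    return 0;
}
```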
  7. Yes, but the point is that having to use that is a symptom of the issue; needing it at all is the problem itself, and you shouldn't need to. All Process Lasso is doing is setting those affinity masks statically, which isn't actually a good thing most of the time. It is what we want in this instance, but it's also something that can be set correctly, and not statically, by a process itself. This is exactly why I'm saying it shouldn't be necessary.
If you're not telling Windows to group your processes into a common NUMA node you're going to have problems, and if you force your processes onto NUMA Node 0 then you'll also have problems. This is something not even Cinebench does correctly above 64 threads per NUMA node/processor group, which is something I have on some of my servers.
Anyway: as you can see, if a new thread is spawned on a NUMA node/processor core that is already busy doing work rather than on the other, completely idle one, then it's not the Windows scheduler; someone has 100% done something wrong somewhere. This is not how Windows works by default with nothing overriding what it would normally do.
https://empyreal96.github.io/nt-info-depot/CS490_Windows_Internals/08_Scheduling.pdf
And:
https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
https://learn.microsoft.com/en-us/windows/win32/procthread/multiple-processors
If you aren't touching affinity masks and not specifically starting your control thread on Node 0, there is actually no reason a new process would start on, or always run on, Node 0 when Node 1 is idle. That means if it is happening, it's not the Windows scheduler alone that is to blame.
Slurm is a cluster job scheduler; it's not something you would use or need to use at the single-server level. Both Windows and Linux have the required scheduler flags, and that is actually what slurm uses (slurm is Linux only, mind you). Any application you run on a system can do what slurm does.
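As a rough illustration of what a process can do for itself instead of relying on a static Process Lasso rule, here's a sketch (assuming a Windows box with more than one NUMA node; the helper name is made up for the example) that binds the calling thread onto a chosen node's processor mask:
```cpp
// Sketch only: pin the current thread onto a chosen NUMA node by fetching that
// node's processor mask and applying it as the thread's group affinity.
#include <windows.h>
#include <cstdio>

// Hypothetical helper: bind the calling thread to the given NUMA node.
bool BindCurrentThreadToNode(USHORT node) {
    GROUP_AFFINITY nodeMask{};
    if (!GetNumaNodeProcessorMaskEx(node, &nodeMask)) return false;
    // Previous affinity could be captured via the third argument; not needed here.
    return SetThreadGroupAffinity(GetCurrentThread(), &nodeMask, nullptr) != 0;
}

int main() {
    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);
    // Example choice: the last node (Node 1 on a two-socket system).
    USHORT target = (USHORT)highestNode;
    if (BindCurrentThreadToNode(target))
        printf("thread bound to NUMA node %u\n", (unsigned)target);
    return 0;
}
```
Do something like this per worker, handing each a different node, and the affinity follows the work rather than being fixed from outside.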
  8. Detecting instruction sets isn't particularly difficult (a quick CPUID sketch is below). You're also talking about optimizing the code for the architecture it's running on, which is an entirely different thing from ensuring that your process/application plays nicely with the system when running multiple instances of it; a different aspect entirely from optimizing the code to run fast on XYZ CPU.
Neither Prime95 nor PrimeGrid is going to know that you, for whatever reason, want to run multiple instances on a system, so unless they have catered for that you easily get into the situation of not having sufficiently accurate resource allocation. When I was running P95 to get the benchmark figures it was doing that based on how many cores and how many tasks you want to run at the same time etc., so P95 certainly can do it; it's part of that application and it does it correctly from what I observed. PrimeGrid, on the other hand, has to work in with what BOINC allows in order to track jobs and give out points etc. Ideally you'd configure the same parameters on the project website as you do in P95, so the task that gets generated and issued to your system knows it's configured for, say, 6 threads and 2 tasks, and you'd get credited the right amount of points based on that and the run time. That way the main task is aware of both task processes and can allocate them to different NUMA nodes, for example. However I have no idea if that is actually possible with BOINC at all.
Since the above is not how it's being done, you have to rely more on BOINC and Windows scheduling, while also making sure the process you start is telling them the right information and trying to reserve the correct things, which allows the resource allocators to do a better or more correct job. BOINC should be looking at the tasks it has allocated and ensure it is not overlapping task resource allocation while there are unutilized system resources. I can't think of a situation in a BOINC context where, if BOINC starts task 1 on NUMA Node 0, you'd also want to allocate task 2 to NUMA Node 0 when it goes to start it; logically, if NUMA Node 1 exists then that is the preferred place to allocate it. BOINC tasks/jobs, from what I understand, are independent and don't need to talk to each other or share memory, which is when you would want them on the same NUMA node, e.g. an SQL DB process and an application/web process on the same system/OS, which would give the highest performance running on the same NUMA node. ESXi actually has detection for that at the VM level, and if it sees two VMs talking to each other a lot it'll allocate them to the same NUMA node (sometimes you want to disable that, almost never).
This is what slurm can do for example: https://slurm.schedmd.com/salloc.html
Also: https://slurm.schedmd.com/cpu_management.html
See above. slurm applies a lot more intelligence in allocating resources than BOINC does, by necessity, but you the user have to set the correct parameters when submitting jobs into the job queue or it'll run poorly, or worse, not at all. The other thing you don't want to do is put anything in your application code that would conflict with the slurm allocator, like static thread allocations; or if you do, you need to make sure you match that when submitting into slurm.
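On the instruction set point, a minimal detection sketch (MSVC intrinsics on x86; the feature bits come from the CPUID documentation, and real code should also check OS XSAVE support before taking AVX paths):
```cpp
// Sketch only: query CPUID feature flags to pick a code path at runtime.
#include <intrin.h>
#include <cstdio>

int main() {
    int leaf1[4] = {0};   // EAX=1: basic feature flags
    __cpuid(leaf1, 1);
    bool sse42 = (leaf1[2] & (1 << 20)) != 0;  // ECX bit 20
    bool avx   = (leaf1[2] & (1 << 28)) != 0;  // ECX bit 28 (also needs OS XSAVE support)

    int leaf7[4] = {0};   // EAX=7, ECX=0: extended features
    __cpuidex(leaf7, 7, 0);
    bool avx2    = (leaf7[1] & (1 << 5))  != 0;  // EBX bit 5
    bool avx512f = (leaf7[1] & (1 << 16)) != 0;  // EBX bit 16

    printf("SSE4.2=%d AVX=%d AVX2=%d AVX-512F=%d\n", sse42, avx, avx2, avx512f);
    return 0;
}
```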
I'd say, very roughly, slurm actually lets you do a little less of this work in the application/process code, since you have to define a lot of it during job queue submission, but you still have to be careful that you have done your thread allocations correctly in code and not done something bad. As to my comment about researchers, do remember that while they can be, for example, math experts who know how to get the best out of a CPU for a particular calculation, that doesn't mean they understand a lot of other aspects of coding and system design. In the same way, I understand system design and the interaction between NUMA nodes and PCIe devices (GPUs, NICs), but that doesn't mean I have sufficient coding experience and knowledge to do anything with that understanding. I could, for example, tell you not to utilize more than 2 GPUs per server node even though there are 4, because it's 2 per CPU and NVLink is not being used, so you'd get far less performance if you tried to use 4.
  9. That or run Numbers on CPU, it's actually pretty decent for that. If we do NFS we'll have to make sure enough of us are doing it on the same day or it'll go to waste. I'm trying to get a lot of systems moved over to Ramanujan and getting work, but it's not going so well; I'm also adding NFS to them, so if it just doesn't work out I'll turn off networking and bunker. I'll let you know if I go for NFS.
  10. Shorter than 12 months doesn't make a lot of sense; while 19 might be long, 16 is pretty reasonable. If anything, Apple would be looking to tie in with yearly product refresh cycles, which doesn't mean every 12 months for the SoC, but it does mean either side of 12 or 24 months is the most logical, with product release dates drifting to align with it etc. 6 months is way too short; they could announce it, but you might find actual availability is a decent length of time after. Also remember Apple does SoC refreshes with suffixes, i.e. A12X etc., so it's probably not best to assume M4 when it's just as likely to be refresh model naming for a refreshed SoC rather than a new one.
  11. This isn't really anything to do with sockets and NUMA nodes; that's microarchitecture, or what people also refer to as sub-NUMA. Not jamming a new process onto an already 100%-utilized CPU socket is pretty damn basic and shouldn't happen. Multiple sockets have been around since the '80s, and even earlier than that, and for Windows since at least the late '80s to early '90s. I can assure you what is happening is not solely due to Windows or it not being capable of doing it; it's done literally every time for HPC clusters running slurm etc. The issue is BOINC isn't that, and putting time and effort into a platform designed to utilize idle time on people's desktops doesn't make a lot of sense here. Also remember researchers aren't always programming experts either; that's why NeSI, for example, offers professional services to their (our) users to make sure their workloads run correctly and actually work in with slurm and their server nodes etc.
Linux does at least go "this one is 100% utilized, so unless you explicitly tell me not to, I'm going to put it on the other one". Windows isn't quite as aggressive at doing that, but I still suspect it's more a code issue on the project's side with how they are doing threading; the behavior is not normal for Windows. Optimizing for a specific CPU microarchitecture is well beyond the basic NUMA node/socket stuff; personally I think it would be quite unfair to expect that from a BOINC project that could run on 1000 different CPU models.
  12. Kind of all 3, but as a programmer you can code your application to detect NUMA domains and processor groups, balance across them, and also give Windows enough information to allocate better (a rough sketch of the detection side is below). I don't have these problems with other software, but that software is also made with multi-socket server usage in mind as standard etc. When Processor Node 0 is 100% utilized, another new BOINC task should never be allocated to it and should go to Processor Node 1 when that is at 0%, but that's not what is happening, and it's stupid and annoying. BOINC being what it is, I doubt the typical host is a dual- or quad-socket system like I have.
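Here's roughly what I mean by detecting and balancing, as a sketch only (the PickNodeForTask name and the round-robin choice are just for illustration, not how BOINC actually allocates anything):
```cpp
// Sketch only: enumerate processor groups and NUMA nodes, then pick a node
// for the Nth worker in round-robin fashion.
#include <windows.h>
#include <cstdio>

// Hypothetical helper: spread independent tasks across available NUMA nodes.
USHORT PickNodeForTask(unsigned taskIndex) {
    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);
    return (USHORT)(taskIndex % (highestNode + 1));
}

int main() {
    WORD groups = GetActiveProcessorGroupCount();
    for (WORD g = 0; g < groups; ++g)
        printf("group %u: %lu logical processors\n",
               (unsigned)g, GetActiveProcessorCount(g));

    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);
    for (USHORT n = 0; n <= (USHORT)highestNode; ++n) {
        GROUP_AFFINITY mask{};
        if (GetNumaNodeProcessorMaskEx(n, &mask))
            printf("NUMA node %u lives in group %u, mask 0x%llx\n",
                   (unsigned)n, (unsigned)mask.Group, (unsigned long long)mask.Mask);
    }

    printf("task 0 -> node %u, task 1 -> node %u\n",
           (unsigned)PickNodeForTask(0), (unsigned)PickNodeForTask(1));
    return 0;
}
```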
  13. Nah, but they can just run other stuff. Installing Linux on them takes less than 15 mins; as you can tell, I'm lazy. Would be nice if I could commandeer the eighteen 2x 7713 hosts at work that have next to no CPU usage on them, but that would get me in a lot of trouble haha.
  14. Winter(ish) here so that's exactly what I want, heattttt