
I have two computers folding: my desktop (a Ryzen 3700X + Nvidia 2060 Super) and my home lab server (dual Opteron, 3380+6282SE + Radeon R9 390). Both are set to prefer small jobs, and both run CPU and GPU jobs. The desktop is set to a full 16 threads atm (I tried 12, but I thought that might be what was causing the issue, so I put it back to 16), and the server is set to 24 cores (out of 32).

My desktop mostly seems to get CPU jobs that take 1 to 4 days, with 10M+ steps. The server often gets 500k-step jobs that take a few hours at worst. Because the jobs take so long on my desktop, I basically get no points, maybe 200/day. That doesn't seem worth it for me or F@H. It would be better if these bigger jobs went to beefier (and more efficient) servers. Is there any way I can get it to fetch smaller jobs? I already have max-packet-size in the config set to small. It's basically not worth running the CPU jobs on my desktop if it makes little to no progress like this; the extra power draw just isn't worth it.

Both of the computers run Debian Linux.

Link to post
Share on other sites

Unfortunately I'm not aware of a way to limit it in that sense. The WUs likely aren't all that big, just computationally intense, so max packet size only helps to a point. Which WUs you're getting probably comes down to which instruction sets your CPU supports. I believe the server uses the CPU information your client reports to decide which WUs to assign.

 

I imagine the solution is for the FAH people to up the requirements for those WUs. It might be handing them out to any CPU with AVX2 compatibility, when practically that list needs to be trimmed a bit. This may also be a consequence of them trying to get as many WUs out there as they can.
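If you want to see which SIMD extensions your own CPU reports on Linux, here's a minimal sketch. It only parses the `flags` line of `/proc/cpuinfo`; whether the assignment servers actually key on these flags is my guess above, and this does nothing to confirm it. The function names are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. not an x86 Linux box)


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flags of the running machine (Linux only)."""
    with open(path) as handle:
        return cpu_flags(handle.read())


def simd_report(flags, wanted=("sse2", "avx", "avx2")):
    """Map each extension of interest to True/False."""
    return {name: name in flags for name in wanted}
```

Calling `simd_report(read_cpu_flags())` on each box would show whether, say, one machine reports `avx2` and the other doesn't.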

 

 

Link to post
Share on other sites

15 hours ago, BlueJedi said:

Unfortunately I'm not aware of a way to limit it in that sense. The WUs likely aren't all that big, just computationally intense, so max packet size only helps to a point. Which WUs you're getting probably comes down to which instruction sets your CPU supports. I believe the server uses the CPU information your client reports to decide which WUs to assign.

 

I imagine the solution is for the FAH people to up the requirements for those WUs. It might be handing them out to any CPU with AVX2 compatibility, when practically that list needs to be trimmed a bit. This may also be a consequence of them trying to get as many WUs out there as they can.

 

 

Fair enough. I've also been getting jobs for my R9 390 that outright fail to initialize. A search says at least one particular project isn't handling a Radeon quirk properly. I've seen two different OpenCL errors so far. Hopefully it gets fixed soon. Too many errors and FAH starts to hate you.

It doesn't help that the card was overheating to begin with, because the server chassis is completely unable to cool an R9 390 properly. It kept the CPUs fine, at least. I had to underclock the 390 to about 890 MHz to stop it running at 97 °C until the card and FAH lock up. I need to clean out that chassis and add a fan back to the window in that room (even though it's still winter here), because that small room gets a little toasty.

Link to post
Share on other sites

5 hours ago, Tomasu said:

Fair enough. I've also been getting jobs for my R9-390 that outright fail to initialize.

I run all AMD GPUs (been a budget buyer for years) and I see that exact thing across all my cards. A few other people here have too, so you're not alone. It's not isolated to specific projects, at least in my experience. One or more of the early OpenCL calls are sensitive to instability in the AMD driver. Once it gets going it's fine.

So card stability definitely plays a big part in how many failures you get. My factory-overclocked 280X, which can run about as hot as your 390, has the most failures out of all my cards. It's on a Linux box and too old to downclock via the power tables. I'll have to throw a custom BIOS on it when I have time.

Ultimately the root issue is the AMD OpenCL driver itself, though; instability just makes the issue worse. I know from my own cards that the problem persists on Tahiti, Polaris, Vega 10 and Vega 20, on both Windows and Linux. Even my coolest cards running stock have WUs fail to initialize, just less often. I've been talking to AMD about it and it's a known issue, and supposedly high priority, but I haven't been given any firm answers on when it will be fixed.

That isn't to say FAH or the OpenMM people couldn't find a workaround for Core22. Whether it's worth their time, when the Nvidia driver is fine, I couldn't say. I'm with you though, I hope someone, anyone, fixes it. I have 105 failed WUs to 161 completed across all my cards since I started logging for the event. All failures were at initialization. I can't imagine it's helping me get WUs right now. My PPD is probably a third of what it could be.

Link to post
Share on other sites

On 4/5/2020 at 11:52 AM, Tomasu said:

I have two computers folding: my desktop (a Ryzen 3700X + Nvidia 2060 Super) and my home lab server (dual Opteron, 3380+6282SE + Radeon R9 390). Both are set to prefer small jobs, and both run CPU and GPU jobs. The desktop is set to a full 16 threads atm (I tried 12, but I thought that might be what was causing the issue, so I put it back to 16), and the server is set to 24 cores (out of 32).

My desktop mostly seems to get CPU jobs that take 1 to 4 days, with 10M+ steps. The server often gets 500k-step jobs that take a few hours at worst. Because the jobs take so long on my desktop, I basically get no points, maybe 200/day. That doesn't seem worth it for me or F@H. It would be better if these bigger jobs went to beefier (and more efficient) servers. Is there any way I can get it to fetch smaller jobs? I already have max-packet-size in the config set to small. It's basically not worth running the CPU jobs on my desktop if it makes little to no progress like this; the extra power draw just isn't worth it.

Both of the computers run Debian Linux.

max-packet-size and similar settings were for people on dial-up connections and I'm not even certain that they do anything anymore.

 

Your 3700x in your desktop should get lots of jobs that only take a few hours to complete just using the stock settings. I'm running 2 2700s, a 2700x and a 2600 and get jobs that run between 2 and 6 hours all the time.

 

Perhaps remove the packet-size setting and see what happens; you might be surprised.
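If you'd rather script that than hand-edit the file, here is a rough sketch. It assumes the v7 client's `config.xml` layout, where each option is an element such as `<max-packet-size v='small'/>`, and that on a Debian package install the file lives at `/etc/fahclient/config.xml`; both are assumptions to check on your own box, and the helper name is made up. Stop the client before touching the file and keep a backup.

```python
import xml.etree.ElementTree as ET


def drop_option(config_xml, option="max-packet-size"):
    """Return config_xml with every <option .../> element removed."""
    root = ET.fromstring(config_xml)
    # Snapshot the element list first so removing children is safe.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == option:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


def rewrite_config(path):
    """Apply drop_option to a config file in place (path is an assumption)."""
    with open(path) as handle:
        cleaned = drop_option(handle.read())
    with open(path, "w") as handle:
        handle.write(cleaned)
```

For example, `rewrite_config("/etc/fahclient/config.xml")` run as root, then restart the client and watch what it gets assigned.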

 

If you're folding with a GPU and a 16-thread CPU then F@H will reserve 1 thread for the GPU. At that point, due to the large-prime issue, it will skip past 15, 14 (2 x 7) and 13 and drop to 12 anyway.
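To make that counting rule concrete, here's a small sketch of "step down until the thread count has no large prime factors". This is an illustration of the idea, not the client's actual code, and the cutoff is an assumption: with `max_prime=3` it reproduces the 15 -> 12 behaviour described above, while a cutoff of 5 would keep 15.

```python
def largest_prime_factor(n):
    """Largest prime factor of n (returns 1 for n == 1)."""
    factor, largest = 2, 1
    while factor * factor <= n:
        while n % factor == 0:
            largest, n = factor, n // factor
        factor += 1
    # Whatever is left above 1 is itself a prime factor.
    return max(largest, n) if n > 1 else largest


def usable_threads(requested, max_prime=3):
    """Largest count <= requested whose prime factors are all <= max_prime."""
    for count in range(requested, 0, -1):
        if largest_prime_factor(count) <= max_prime:
            return count
    return 1


# 16 threads minus 1 reserved for the GPU leaves 15 to start from.
print(usable_threads(16 - 1))               # cutoff of 3
print(usable_threads(16 - 1, max_prime=5))  # cutoff of 5
```

Either way, 14 (2 x 7) and 13 both fall through to 12, so asking for 13 or 14 threads buys nothing over 12.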


Link to post
Share on other sites

17 hours ago, BlueJedi said:

I run all AMD GPUs (been a budget buyer for years) and I see that exact thing across all my cards. A few other people here have too, so you're not alone. It's not isolated to specific projects, at least in my experience. One or more of the early OpenCL calls are sensitive to instability in the AMD driver. Once it gets going it's fine.

So card stability definitely plays a big part in how many failures you get. My factory-overclocked 280X, which can run about as hot as your 390, has the most failures out of all my cards. It's on a Linux box and too old to downclock via the power tables. I'll have to throw a custom BIOS on it when I have time.

Ultimately the root issue is the AMD OpenCL driver itself, though; instability just makes the issue worse. I know from my own cards that the problem persists on Tahiti, Polaris, Vega 10 and Vega 20, on both Windows and Linux. Even my coolest cards running stock have WUs fail to initialize, just less often. I've been talking to AMD about it and it's a known issue, and supposedly high priority, but I haven't been given any firm answers on when it will be fixed.

That isn't to say FAH or the OpenMM people couldn't find a workaround for Core22. Whether it's worth their time, when the Nvidia driver is fine, I couldn't say. I'm with you though, I hope someone, anyone, fixes it. I have 105 failed WUs to 161 completed across all my cards since I started logging for the event. All failures were at initialization. I can't imagine it's helping me get WUs right now. My PPD is probably a third of what it could be.

Interesting. I did read on the FAH forum that they know about it and would look into it.

15 hours ago, Gorgon said:

max-packet-size and similar settings were for people on dial-up connections and I'm not even certain that they do anything anymore.

 

Your 3700x in your desktop should get lots of jobs that only take a few hours to complete just using the stock settings. I'm running 2 2700s, a 2700x and a 2600 and get jobs that run between 2 and 6 hours all the time.

 

Perhaps remove the packet-size setting and see what happens; you might be surprised.

If you're folding with a GPU and a 16-thread CPU then F@H will reserve 1 thread for the GPU. At that point, due to the large-prime issue, it will skip past 15, 14 (2 x 7) and 13 and drop to 12 anyway.

I added the packet size setting because it was acting weird. To begin with I was getting normal jobs, but after a while I'd only get jobs that require 10,000,000+ iterations instead of 250,000 or 500,000 like most. The former take days; the latter take hours at worst.

 

Desktop:

22:22:54:WU00:FS00:0xa7:Completed 5706250 out of 10375000 steps (55%)
23:24:48:WU00:FS00:0xa7:Completed 5810000 out of 10375000 steps (56%)
 

Yup. One hour+ per percent.

 

Server:

13:25:40:WU00:FS00:0xa7:Completed 202500 out of 250000 steps (81%)
13:26:47:WU00:FS00:0xa7:Completed 205000 out of 250000 steps (82%)

 

About 1 minute per percent. While it has double the active cores of my desktop, each core is about half as fast on average, so it should even out somewhat. And it did to begin with.

 

They are just getting different jobs. I think both have the max-packet-size setting set.
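For anyone who wants to put numbers on that, here's a quick sketch that takes two progress lines in the format above and works out steps per second and the projected total run time. It assumes both lines come from the same work unit and are less than a day apart; the function names are made up.

```python
import re
from datetime import datetime

PROGRESS = re.compile(r"^(\d\d:\d\d:\d\d):.*Completed (\d+) out of (\d+) steps")


def parse_progress(line):
    """Return (timestamp, steps_done, steps_total) from one log line."""
    match = PROGRESS.match(line)
    if match is None:
        raise ValueError("not a progress line: " + line)
    stamp = datetime.strptime(match.group(1), "%H:%M:%S")
    return stamp, int(match.group(2)), int(match.group(3))


def wu_rate(first, second):
    """Return (steps_per_second, projected_total_seconds) for one WU."""
    t1, done1, total = parse_progress(first)
    t2, done2, _ = parse_progress(second)
    # The log has no date, so wrap around midnight if needed.
    elapsed = (t2 - t1).total_seconds() % 86400
    steps_per_second = (done2 - done1) / elapsed
    return steps_per_second, total / steps_per_second


desktop = wu_rate(
    "22:22:54:WU00:FS00:0xa7:Completed 5706250 out of 10375000 steps (55%)",
    "23:24:48:WU00:FS00:0xa7:Completed 5810000 out of 10375000 steps (56%)",
)
server = wu_rate(
    "13:25:40:WU00:FS00:0xa7:Completed 202500 out of 250000 steps (81%)",
    "13:26:47:WU00:FS00:0xa7:Completed 205000 out of 250000 steps (82%)",
)
print("desktop: %.1f steps/s, %.1f days total" % (desktop[0], desktop[1] / 86400))
print("server:  %.1f steps/s, %.1f hours total" % (server[0], server[1] / 3600))
```

On those four lines the desktop works out to roughly 28 steps/s and the server to roughly 37 steps/s. Steps from different projects aren't strictly comparable, but it suggests the machines aren't far apart per step; the desktop unit simply has 41.5 times as many steps (10,375,000 vs 250,000).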

 

Edit: I should say, it /could/ just be a fluke that I got a couple of jobs like this in a row. The first one got "forgotten" when I was messing with my install. Maybe I won't get more like this in the future. There are still another 2-3 hours left on the ETA, but I can't have FAH running while I'm using the computer: GPU jobs make the desktop GUI lag so badly that I can't actively use it.

Link to post
Share on other sites

Need more detail on the work unit itself.

I have not observed CPU work units taking more than 3 hours (running Windows on a 3800X); some take less than 40 minutes.

In the past you had to use Linux to get the BigAdv work units (you couldn't get them on the Windows client). I thought they stopped doing that.

It works on 15 or 12 threads (with 13 or 14 it drops to 12; with 16 threads it should drop to 15).

Link to post
Share on other sites

Good news is the work unit finished, and my desktop has been doing other jobs. Except now it's having trouble uploading that big multi-day job. It's over 500 MB and fails 0.03% in every time :( to at least two different servers. Hopefully it'll work eventually.

 

That job info seems to be: 

Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13821 run:83 clone:2 gen:83 core:0xa7 unit:0x0000006c80fccb095c8839f1021cd73b

Project: 13821 (Run 83, Clone 2, Gen 83)
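If you need to pull those identifiers out of a log to report them, a tiny sketch like this works on the `key:value` layout of the line above. The helper name is made up, and it assumes values never contain spaces.

```python
import re


def unit_fields(line):
    """Collect the key:value pairs from a 'Received Unit' / 'Sending unit results' line."""
    return dict(re.findall(r"(\w+):(\S+)", line))


fields = unit_fields(
    "Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13821 "
    "run:83 clone:2 gen:83 core:0xa7 unit:0x0000006c80fccb095c8839f1021cd73b"
)
summary = "Project: {project} (Run {run}, Clone {clone}, Gen {gen})".format(**fields)
print(summary)
```

The project/run/clone/gen numbers are what people on the forum usually ask for when looking into a problem unit.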

 

Last couple of upload attempts:

Sending unit results: id:00 state:SEND error:NO_ERROR project:13821 run:83 clone:2 gen:83 core:0xa7 unit:0x0000006c80fccb095c8839f1021cd73b
17:30:38:WU00:FS00:Uploading 570.61MiB to 128.252.203.9
17:30:38:WU00:FS00:Connecting to 128.252.203.9:8080
17:30:38:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
17:30:38:WU00:FS00:Trying to send results to collection server
17:30:38:WU00:FS00:Uploading 570.61MiB to 52.224.109.74
17:30:38:WU00:FS00:Connecting to 52.224.109.74:8080
17:31:29:WU00:FS00:Upload 0.03%
17:31:29:ERROR:WU00:FS00:Exception: Transfer failed

 

Edited by Tomasu
Link to post
Share on other sites

Ah, finally figured it out.

 

That project was misconfigured and I just happened to get a unit before they yanked it.

 

https://foldingforum.org/viewtopic.php?f=19&t=34032&p=323637&hilit=uploading+failed

 

Once the work unit times out, it'll go away. It kinda sucks that it wasted all that CPU time. Oh well.

Link to post
Share on other sites
