
LTT Folding Team's Emergency Response to Covid-19

Go to solution Solved by GOTSpectrum,

This event has ended and I recommend you guys head over to the Folding Community Board for any general folding conversation. 

 

 

1 hour ago, Krankenstein said:

CPU gets hit by these small work units that seem very inefficient.

The CPU WUs are small but very important.

 

1 hour ago, Krankenstein said:

GPU not getting any more folding work so I am cutting my folding for now.

The whole point of F@H is that you keep the client open whenever your computer is running anyway: if there's something to do it'll do it, if not it won't.

 

1 hour ago, Krankenstein said:

Next attempt is in 14hrs.

Then you didn't pause/fold in a very long time...

F@H
Desktop: i9-13900K, ASUS Z790-E, 64GB DDR5-6000 CL36, RTX3080, 2TB MP600 Pro XT, 2TB SX8200Pro, 2x16TB Ironwolf RAID0, Corsair HX1200, Antec Vortex 360 AIO, Thermaltake Versa H25 TG, Samsung 4K curved 49" TV, 23" secondary, Mountain Everest Max

Mobile SFF rig: i9-9900K, Noctua NH-L9i, Asrock Z390 Phantom ITX-AC, 32GB, GTX1070, 2x1TB SX8200Pro RAID0, 2x5TB 2.5" HDD RAID0, Athena 500W Flex (Noctua fan), Custom 4.7l 3D printed case

 

Asus Zenbook UM325UA, Ryzen 7 5700u, 16GB, 1TB, OLED

 

GPD Win 2


Didn't have a computer in time to join the event, but underway now :).

 

Part of me is regretting that stock AMD cooler now. Temps are fine, GPU (RTX 2060) and CPU (Ryzen 5 3600) both sticking around the 60C mark under full load, but the noise is irritating😆.

 

It's nice that they provide so much info on the actual proteins you are folding as well.

 

I'm a biologist anyway so I thought it might be fun to share a bit of the background context for them.

 

Mine is currently working on SARS-CoV-2 (the virus that causes COVID-19) non-structural protein 15 (NSP15). People have suggested it might be involved in evading the immune system in other coronaviruses.

 

Coronaviruses make dsRNA as part of their replication cycle. A kind of white blood cell (macrophages) has sensors to detect dsRNA and kick off an immune response. The idea is that NSP15 interferes with that detection in some way to delay the immune response. 

 

Happy folding!


1 hour ago, Krankenstein said:

GPU not getting any more folding work so I am cutting my folding for now. GPU sits at around 300 attempts and got nothing. (Yes I did the pause/fold clicking). Next attempt is in 14hrs.

CPU gets hit by these small work units that seem very inefficient.

I'll be back when the next event for F@H is scheduled!

I find that if you click pause then fold too quickly, it won't reduce the next attempt delay.  If it's stuck for a long time, pause for a couple of minutes.

It is true though that for the last couple of days, GPU units seem to be drying up.  Not having any problems getting CPU WUs though, I guess everyone has been focusing on GPU.

Router:  Intel N100 (pfSense) WiFi6: Zyxel NWA210AX (1.7Gbit peak at 160Mhz)
WiFi5: Ubiquiti NanoHD OpenWRT (~500Mbit at 80Mhz) Switches: Netgear MS510TXUP, MS510TXPP, GS110EMX
ISPs: Zen Full Fibre 900 (~930Mbit down, 115Mbit up) + Three 5G (~800Mbit down, 115Mbit up)
Upgrading Laptop/Desktop CNVIo WiFi 5 cards to PCIe WiFi6e/7


2 hours ago, Alex Atkin UK said:

I find that if you click pause then fold too quickly, it won't reduce the next attempt delay.  If it's stuck for a long time, pause for a couple of minutes.

It is true though that for the last couple of days, GPU units seem to be drying up.  Not having any problems getting CPU WUs though, I guess everyone has been focusing on GPU.

Actually, I made a "bot" for this problem, as I was too tired to babysit all of this... It's written in Python and still buggy (still in development), but hey, it works...


preview2.png.eda8f015c956d79cd1d40a1e2560f0ec.png

If you want to try it, just make sure you have Python installed, inspect the script, and set up some of the config (if you have Telegram, the bot will report when something goes wrong, such as a host going offline or the log not updating for too long...)


 

Stay Hungry. Stay Foolish. Stay Folding. Stay at home!


This won't last: GPU WUs have totally dried up now and my Google Cloud credit is almost gone.

 

Screenshot_20200416_145314b.png


33 minutes ago, bafo_ah said:

Actually, I made a "bot" for this problem, as I was too tired to babysit all of this... It's written in Python and still buggy (still in development), but hey, it works...


preview.png.55f0b23e5ce148b8b83b5d431a481f2e.png

If you want to try it, just make sure you have Python installed, inspect the script, and set up some of the config (if you have Telegram, the bot will report when something goes wrong, such as a host going offline or the log not updating for too long...)


 

Yeah I actually have a PHP script that pulls my current stats from each client into a JSON file.  I was going to adapt that to automatically pause/unpause but just never got around to doing it, and there's not much point now that I will only be doing GPU overnight once that Google credit is used up.  I've not been sleeping well the last few days so I've not really got the right head to do any coding.

I already dropped it down to just two GPUs yesterday after seeing the trend, so if it runs over the credit it will be by less.  Might drop it down to one GPU now, just giving it one last chance to get WUs.

 

This was at its peak during the event:

spacer.png

 

Now it's looking much sadder:

Screenshot_20200416_151033.png


3 hours ago, Kilrah said:

1. The CPU WUs are small but very important.

 

2. The whole point of F@H is that you keep the client open whenever your computer is running anyway: if there's something to do it'll do it, if not it won't.

 

3. Then you didn't pause/fold in a very long time...

1. This is something I couldn't find an answer for (well I didn't go out of my way to find an explanation really). What kind of calculations is the CPU better for than a GPU? Just a very short answer is preferred.

2. Well I kept my machine running for 3 weeks 24/7 just for this and was happy to see CPU and GPU doing a lot of work. But it seems, as a whole, the computer isn't getting much to do during the nights... so... less noise in the house and a smaller electric bill. And less babysitting.

3. Yeah I went to sleep and the next day I had to cut some trees in the yard and chop them up for firewood. So I forgot the computer for a long time.


6 minutes ago, Krankenstein said:

1. This is something I couldn't find an answer for (well I didn't go out of my way to find an explanation really). What kind of calculations is the CPU better for than a GPU? Just a very short answer is preferred.

2. Well I kept my machine running for 3 weeks 24/7 just for this and was happy to see CPU and GPU doing a lot of work. But it seems, as a whole, the computer isn't getting much to do during the nights... so... less noise in the house and a smaller electric bill. And less babysitting.

3. Yeah I went to sleep and the next day I had to cut some trees in the yard and chop them up for firewood. So I forgot the computer for a long time.

I don't think it's that the CPU is better, it's that it's easier to use your idle CPU time for F@H than your GPU (GPU folding is a non-starter on Linux as it makes the UI unresponsive).  There are likely more CPUs available, so it makes sense to use them for jobs where they can finish quickly enough that a GPU is not essential. https://apps.foldingathome.org/serverstats


16 minutes ago, Krankenstein said:

1. This is something I couldn't find an answer for (well I didn't go out of my way to find an explanation really). What kind of calculations is the CPU better for than a GPU? Just a very short answer is preferred

Just think of it in this way:

 

A GPU is good for specific types of jobs (hence why it destroys a CPU's PPD), while a CPU can do almost any job but is much slower.

Favebook's F@H Stats

Favebook's BOINC Stats

 

CPU i7-8700k (5.0GHz)  Motherboard Aorus Z370 Gaming 7  RAM Vengeance® RGB Pro 16GB DDR4 3200MHz  GPU  Aorus 1080 Ti

Case Carbide Series SPEC-OMEGA  Storage  Samsung Evo 970 1TB & WD Red Pro 10TB

PSU Corsair HX850i  Cooling Custom EKWB loop

 

Display Acer Predator x34 120Hz


32 minutes ago, Krankenstein said:

1. This is something I couldn't find an answer for (well I didn't go out of my way to find an explanation really). What kind of calculations is the CPU better for than a GPU? Just a very short answer is preferred.

 

163174759_Anmerkung2020-04-16164521.png.a22c9f2818d223a2580de1f3257511f6.png

 

https://foldingathome.org/2020/03/30/covid-19-free-energy-calculations/

 

This is from a news article on the F@H website.


2 hours ago, Krankenstein said:

1. This is something I couldn't find an answer for (well I didn't go out of my way to find an explanation really). What kind of calculations is the CPU better for than a GPU? Just a very short answer is preferred.

There are algorithms / calculation engines that can't run on a GPU. Also there are WAY more CPUs than GPUs in the FAH pool of available resources.

 

See post above.

 

  

2 hours ago, bafo_ah said:

Actually, I made a "bot" for this problem, as I was too tired to babysit all of this... It's written in Python and still buggy (still in development), but hey, it works...

Someone posted a perfectly working python script earlier in the thread. It bumped a slot for me 3 times this afternoon:

 

dsbdh.jpg.42693d5456582c7384157e00a2d11a94.jpg

 

 


I just bought and installed a 5700 XT (Sapphire Pulse), and if I'm honest I was mostly motivated to increase my PPD.  

 

Unfortunately, I'm struggling to consistently get WUs despite having very little issue previously with my 1060 6GB.  I've had it running since around midnight last night (~12 hours) and I've only gone through 3 WUs.  Anyone have any ideas?  Are there known issues with WU availability for Navi?  When I do get work, my PPD goes up to around 1.2M, compared to ~450k with the 1060.  

 

I have paused folding on my CPU (R5 3600) because its contribution is negligible (~100k PPD) and I want to tinker with heat / fan speed / noise level with the 5700 XT so I'm eliminating variables.  Would having the CPU running increase my chance of getting GPU WUs?


No, there just haven't been many WUs available for the past few days, and FAH have confirmed it: they need time to deal with the results before they can put more work up.


26 minutes ago, jhawk2122 said:

I have paused folding on my CPU (R5 3600) because its contribution is negligible (~100k PPD) and I want to tinker with heat / fan speed / noise level with the 5700 XT so I'm eliminating variables.  Would having the CPU running increase my chance of getting GPU WUs?

FWIW (likely little), I restarted folding on my CPU, which immediately got a WU and then my GPU got a WU shortly thereafter.  


28 minutes ago, jhawk2122 said:

Would having the CPU running increase my chance of getting GPU WUs?

No.

 

CPU WUs and GPU WUs are in no way connected.


1 hour ago, Kilrah said:

Someone posted a perfectly working python script earlier in the thread. It bumped a slot for me 3 times this afternoon:

 

dsbdh.jpg.42693d5456582c7384157e00a2d11a94.jpg

 

 

Over here for anyone interested:

 

CPU Intel i7-7700 | Cooling Noctua NH-D14 SE2011 | Motherboard ASUS ROG Strix Z270F Gaming | RAM Corsair Vengeance LPX 3.6GHz 32GB | GPU EVGA GeForce RTX 3070 FTW3 Ultra Gaming |

Case Fractal Design Define R5 | Storage Samsung 980 PRO 500GB, Samsung 970 EVO+ "v2" 2TB | PSU Corsair RM850x 2021 | Display ASUS VP247QG + Samsung SyncMaster T220 | OS Garuda Linux


9 hours ago, marknd59 said:

 I've got 2 stuck trying to send to 155.247.166.219, which is up and down like a yoyo and only has 2 TiB left on it so it's pretty full up, and 1 stuck that lists the work server as 13.82.98.119 and collection as 0.0.0.0. I've tried shutting down and restarting F@H and all that did was get the retry time down; not a lot I can do about it.

 

9 hours ago, FraktalU said:

Also got one that has been stuck since yesterday 14h

 

9 hours ago, efka112 said:

If WUs are stuck for a while, check the server status to see if the server is online and has disk space, then try to relaunch FAH. Sometimes the issue is on the client side.

 

11 hours ago, marknd59 said:

 

 

I have 3 that have been stuck for over a day.

 

11 hours ago, Kilrah said:

Had one stuck for half an hour yesterday, transfer started and failed at a random percentage, looking at server stats it was rebooting constantly, uptime kept varying between 1-2-3 minutes and back... eventually stayed up and took my WU.

 

11 hours ago, efka112 said:

you get message in log that upload failed or just don't get points?

Failed upload, made sure it's all working on my end. Seems like a problem on their end. Thankfully it's not just me.


3 minutes ago, Bitter said:

 

 

 

 

 

Failed upload, made sure it's all working on my end. Seems like a problem on their end. Thankfully it's not just me.

The 1 from 13.82.98.119 managed to upload a little while ago. The other 2 for 155.247.166.219 are still stuck.


5 hours ago, marknd59 said:

The 1 from 13.82.98.119 managed to upload a little while ago. The other 2 for 155.247.166.219 are still stuck.

Same here, been stuck for at least a day, maybe longer now. Annoying but whatever. The P106 is getting work; it's low-points work but it's been steady. The V56 and 2700X are getting work more regularly, while the RX580 isn't getting any now, whereas before it always was.


On 4/5/2020 at 6:32 PM, danielocdh said:

babysitter python script

It will automatically scan all the slots and pause+unpause any that are "Waiting On: WS Assignment" and have too high a "Next Attempt".

You need to set the host(s) and the password (only if you use one).


 



################################################################################
##                                  options                                   ##
################################################################################
hosts = [ #quoted strings, hosts or IPs separated by comma
  'localhost',
  '192.168.0.123',
]
hostsPassword = '' #quoted string, if the host(s) don't use a password just leave it as: ''

restartLimit = 10 * 60 #in seconds, pause+unpause if next attempt to get WU is this or more
checkEvery = 2 * 60 #in seconds, do a check for all hosts every this seconds

tConTimeout = 15 #in seconds, connection timeout
tReadTimeout = 10 #in seconds, read timeout
testMode = False # if set to True: checkEvery=6 and restartLimit=0 but won't actually pause+unpause slots

################################################################################
##                                    code                                    ##
################################################################################
import json
import re
import telnetlib #note: deprecated in Python 3.11 and removed in 3.13, so run this on 3.12 or older
import time
import datetime

if testMode:
    restartLimit = 0
    checkEvery = 6
countEvery = 1 #seconds, have to be a factor of checkEvery, default: 1
countEveryDec = max(0, str(countEvery)[::-1].find('.'))
countEveryDecStr = f'{{:.{countEveryDec}f}}'
def remSeconds(seconds):
    if seconds > 0:
        if (seconds * 10000) % (countEvery * 10000) == 0:
            secondsP = countEveryDecStr.format(seconds)
            pr(f'Next check in {secondsP} seconds', same=True)
        time.sleep(countEvery)
        seconds = round((seconds - countEvery) * 10000) / 10000
        remSeconds(seconds)

prLastLen = 0
prLastSame = False
def pr(t, indent=0, same=False, overPrev=False):
    global prLastLen, prLastSame
    if not overPrev and not same and prLastSame:
        prLastLen = 0
        print('')
    t = str(t)
    toPrint = ('  ' * indent) + t
    tLen = len(toPrint)
    print(toPrint + (' ' * max(0, prLastLen - tLen)), end='\r')
    prLastSame = same
    prLastLen = tLen
    if not same:
        print('')
        prLastLen = 0

def checkKeep():
    while (True):
        checkAll()
        remSeconds(checkEvery)

def checkAll():
    for host in hosts: check(host)
    now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    pr(f'check complete at {now}', 0, False, True)


tEnd = ['\n> '.encode('utf-8'), '\n---\n'.encode('utf-8')]
def readResult(expected, expectedResult=''):
    index = expected[0]
    readB = expected[2]
    read = readB.decode('utf-8')
    #nothing
    if index < 0 or read == '': return [False, 'nothing was read']
    #expected result
    if expectedResult:
        endWith = tEnd[index].decode('utf-8')
        readStrip = read[0:-len(endWith)].strip()
        if (readStrip != expectedResult):
            return [False, f'{readB}']
    #PyON->json
    match = re.search(r'\nPyON (\d+) ([-_a-zA-Z\d]+)\n(.*)\n---\n', read, re.DOTALL)
    ###print('');print('');print('');print(index);print(match);print(read);print(readB);print('');
    if match:
        version = match.group(1)
        if version != '1': raise Exception('Response data version does not match')
        data = match.group(3)
        #to json
        data = re.sub(r'(:\s)?False', r'\1false', data)
        data = re.sub(r'(:\s)?True', r'\1true', data)
        data = re.sub(r'(:\s)?None', r'\1null', data)
        data = json.loads(data)
        return [True, data]
    #auth error
    match = re.search('\nERROR: unknown command or variable', read, re.DOTALL)
    if match:
        raise Exception('error sending command, wrong password?')
    #return read
    return [True, read]
def tnCreate(host):
    tn = telnetlib.Telnet(host, 36330, tConTimeout)
    readResult(tn.expect(tEnd, tReadTimeout))
    return tn

def sendCmd(tn, cmd, par=''):
    if cmd == 'auth':
        if hostsPassword:
            cmdStr = f'auth {hostsPassword}';
            tn.write(f'{cmdStr}\n'.encode('utf-8'))
            res = readResult(tn.expect(tEnd, tReadTimeout), 'OK')
            if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
            return res[1]
        return True
    elif cmd == 'exit':
        cmdStr = f'{cmd}';
        tn.write(f'{cmdStr}\n'.encode('utf-8'))
        res = readResult(tn.expect(tEnd, tReadTimeout))
        if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
        return res[1]
    elif cmd == 'slot-info' or cmd == 'queue-info':
        cmdStr = f'{cmd}';
        tn.write(f'{cmdStr}\n'.encode('utf-8'))
        res = readResult(tn.expect(tEnd, tReadTimeout))
        if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
        return res[1]
    elif cmd == 'get-info-and-restart':
        queueData = sendCmd(tn, 'queue-info')
        slotData = sendCmd(tn, 'slot-info')
        ###
        #if type(queueData) == str: print('');print('');print('');print(queueData);print(queueData.encode('utf-8'));print('');
        #if type(slotData) == str: print('');print('');print('');print(slotData);print(slotData.encode('utf-8'));print('');
        restarted = []
        for slot in slotData:
            isStillRunning = False
            queueDl = False
            for queue in queueData:
                if queue['slot'] == slot['id']:
                    if queue['state'] == 'RUNNING': isStillRunning = True
                    if queue['state'] == 'DOWNLOAD': queueDl = queue
            if not isStillRunning and queueDl and queueDl['waitingon'] == 'WS Assignment':
                match = re.match(r'\s?(\d+ days?)?\s?(\d+ hours?)?\s?(\d+ mins?)?\s?([\d.]+ secs?)?', queueDl['nextattempt'])
                if match:
                    seconds = 0
                    if match.group(1): seconds += int(re.sub(r'[^\d.]', '', match.group(1))) * 3600 * 24
                    if match.group(2): seconds += int(re.sub(r'[^\d.]', '', match.group(2))) * 3600
                    if match.group(3): seconds += int(re.sub(r'[^\d.]', '', match.group(3))) * 60
                    if match.group(4): seconds += round(float(re.sub(r'[^\d.]', '', match.group(4))) * 1)
                    if seconds >= restartLimit:
                        if not testMode:
                          sendCmd(tn, 'pause', queueDl['slot'])
                          time.sleep(1)
                          sendCmd(tn, 'unpause', queueDl['slot'])
                        restarted.append([queueDl['slot'], queueDl['nextattempt']])
                else: raise Exception(f'Error with {cmd}, parsing queue nextattempt:{queueDl["nextattempt"]}')
        return restarted
    elif par and (cmd == 'pause' or cmd == 'unpause'):
        cmdStr = f'{cmd} {par}';
        tn.write(f'{cmdStr}\n'.encode('utf-8'))
        res = readResult(tn.expect(tEnd, tReadTimeout))
        if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
        return res[1]
    else : return False

def check(host):
    st = time.time()
    pr(f'checking {host}', 1, True)
    try:
        tn = tnCreate(host)
        sendCmd(tn, 'auth')
        restarted = sendCmd(tn, 'get-info-and-restart')
        if len(restarted):
            pr(f'{host}: restarted {len(restarted)} slot{"s" if len(restarted) > 1 else ""}: ' + ', '.join(map(lambda item: '' + (' with '.join(item)), restarted)), 1, False, True)
        sendCmd(tn, 'exit')
        ed = time.time()
        time.sleep(max(0, 1 - (ed - st)))
    except Exception as err:
        pr(f'{host} error: {err}', 1, False, True)

checkKeep()

 

 

Looks like this when running:

ubu.png.dcef9a0ae7cddf57aa326276700f68ad.png

 

It accesses the client's API in a similar (much simpler) way to FAHControl or HFM.NET.
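The script turns the client's PyON replies into JSON by swapping True, False, and None with regexes. A minimal alternative sketch, assuming PyON bodies are always plain Python literals (not verified against the client's documentation), is to let ast.literal_eval do the parsing. The frame layout is taken from the script's own regex; parse_pyon and the sample reply are hypothetical.

```python
import ast
import re

# Frame layout assumed from the script's regex: "PyON <version> <name>", a body, then "---".
PYON_FRAME = re.compile(r'PyON\s+(\d+)\s+([-\w]+)\n(.*?)\n---\n', re.DOTALL)

def parse_pyon(text):
    """Return (name, data) for the first PyON message in text, or None if there is none."""
    match = PYON_FRAME.search(text)
    if match is None:
        return None
    version, name, body = match.groups()
    if version != '1':
        raise ValueError(f'unsupported PyON version: {version}')
    # literal_eval reads True/False/None directly, so no text substitution is needed.
    return name, ast.literal_eval(body)

# Hypothetical reply, shaped like what the script expects back from "slot-info".
sample = '\nPyON 1 slots\n[{"id": "00", "status": "RUNNING", "idle": False}]\n---\n'
name, slots = parse_pyon(sample)
print(name, slots)
```

Unlike the regex substitution, this leaves the words True, False, or None untouched when they happen to appear inside a string value.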

It won't restart slots that are still running and trying to download a new WU at the same time.
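The selection rule can be sketched on its own, without any network code. This is only an illustration under assumptions: the field names (slot, id, state, waitingon, nextattempt) are the ones the script reads, while next_attempt_seconds, stalled_slots, and the sample data are hypothetical.

```python
import re

# Same units the script's regex accepts: days, hours, mins, secs.
UNIT_SECONDS = {'day': 86400, 'hour': 3600, 'min': 60, 'sec': 1}
PART = re.compile(r'([\d.]+)\s*(day|hour|min|sec)s?')

def next_attempt_seconds(text):
    """Convert a string such as '2 hours 14 mins' to whole seconds."""
    return round(sum(float(n) * UNIT_SECONDS[u] for n, u in PART.findall(text)))

def stalled_slots(slots, queue, limit_seconds):
    """Return ids of idle slots that have waited too long for a WS assignment."""
    stalled = []
    for slot in slots:
        entries = [q for q in queue if q['slot'] == slot['id']]
        if any(q['state'] == 'RUNNING' for q in entries):
            continue  # still folding, so leave it alone
        for q in entries:
            if (q['state'] == 'DOWNLOAD'
                    and q.get('waitingon') == 'WS Assignment'
                    and next_attempt_seconds(q['nextattempt']) >= limit_seconds):
                stalled.append(slot['id'])
                break
    return stalled

# Hypothetical data using the field names the script reads.
slots = [{'id': '00'}, {'id': '01'}, {'id': '02'}]
queue = [
    {'slot': '00', 'state': 'RUNNING'},
    {'slot': '00', 'state': 'DOWNLOAD', 'waitingon': 'WS Assignment', 'nextattempt': '3 hours 2 mins'},
    {'slot': '01', 'state': 'DOWNLOAD', 'waitingon': 'WS Assignment', 'nextattempt': '14 hours 5 mins'},
    {'slot': '02', 'state': 'DOWNLOAD', 'waitingon': 'WS Assignment', 'nextattempt': '45.5 secs'},
]
print(stalled_slots(slots, queue, 10 * 60))
```

Slot 00 is skipped because it is still running a WU, and slot 02 is skipped because its next attempt is under the 10 minute limit.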

Tested on Python 3 on Windows (3.8.2) and Ubuntu (3.6.9).

Let me know if you find any issues, it was stable for me after a few hours

 

I made a minor *cough* hacky *cough* change to @danielocdh's code to suit a use case I had. I started playing with connecting to one of my systems over a local port forwarding SSH tunnel, partially because I didn't want to open firewall ports and mostly because FAH's client API is not secure. Either way, in case anyone else has the same use case, I made it so the hosts variable can take standard host:port notation.
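For anyone who only wants the host:port handling, here is a small standalone sketch of the idea. It is not the code from the modified script: split_host_port is a hypothetical helper, and 36330 is simply the port the original script hard-codes.

```python
DEFAULT_PORT = 36330  # the port the original script connects to when none is given

def split_host_port(entry, default_port=DEFAULT_PORT):
    """Split 'host' or 'host:port' into (host, port) with an integer port."""
    host, sep, port = entry.rpartition(':')
    if sep and host and port.isdigit():
        return host, int(port)
    return entry, default_port

for entry in ['localhost', 'localhost:36331', '192.168.0.123']:
    print(split_host_port(entry))
```

Bracketed IPv6 literals are not handled here, which should be fine for the localhost tunnel case described above.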




################################################################################
##                                  options                                   ##
################################################################################
hosts = [ #list of quoted strings, hosts or IPs, with optional colon separted port, separated by comma
  'localhost',
  'localhost:36331'
]
hostsPassword = '' #quoted string, if the host(s) don't use a password just leave it as: ''

restartLimit = 10 * 60 #in seconds, pause+unpause if next attempt to get WU is this or more
checkEvery = 2 * 60 #in seconds, do a check for all hosts every this seconds

tConTimeout = 15 #in seconds, connection timeout
tReadTimeout = 10 #in seconds, read timeout
testMode = False # if set to True: checkEvery=6 and restartLimit=0 but won't actually pause+unpause slots

################################################################################
##                                    code                                    ##
################################################################################
import json
import re
import telnetlib #note: deprecated in Python 3.11 and removed in 3.13, so run this on 3.12 or older
import time
import datetime

if testMode:
    restartLimit = 0
    checkEvery = 6
countEvery = 1 #seconds, have to be a factor of checkEvery, default: 1
countEveryDec = max(0, str(countEvery)[::-1].find('.'))
countEveryDecStr = f'{{:.{countEveryDec}f}}'
def remSeconds(seconds):
    if seconds > 0:
        if (seconds * 10000) % (countEvery * 10000) == 0:
            secondsP = countEveryDecStr.format(seconds)
            pr(f'Next check in {secondsP} seconds', same=True)
        time.sleep(countEvery)
        seconds = round((seconds - countEvery) * 10000) / 10000
        remSeconds(seconds)

prLastLen = 0
prLastSame = False
def pr(t, indent=0, same=False, overPrev=False):
    global prLastLen, prLastSame
    if not overPrev and not same and prLastSame:
        prLastLen = 0
        print('')
    t = str(t)
    toPrint = ('  ' * indent) + t
    tLen = len(toPrint)
    print(toPrint + (' ' * max(0, prLastLen - tLen)), end='\r')
    prLastSame = same
    prLastLen = tLen
    if not same:
        print('')
        prLastLen = 0

def checkKeep():
    while (True):
        checkAll()
        remSeconds(checkEvery)

def checkAll():
    for host in hosts: check(host)
    now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    pr(f'check complete at {now}', 0, False, True)


tEnd = [r'\n>\s+'.encode('utf-8'), '\n---\n'.encode('utf-8')]
def readResult(expected, expectedResult=''):
    index = expected[0]
    readB = expected[2]
    
    read = readB.decode('utf-8')
    #nothing
    if index < 0 or read == '': return [False, 'nothing was read']
    #expected result
    if expectedResult:
        #tEnd holds regex patterns, so cut at the start of the actual match instead of using the pattern length
        readStrip = readB[:expected[1].start()].decode('utf-8').strip()
        if (readStrip != expectedResult):
            return [False, f'{readB}']
    #PyON->json
    match = re.search(r'\n*PyON\s+(\d+)\s+([-_a-zA-Z\d]+)\n(.*)\n---\n', read, re.DOTALL)
    ###print('');print('');print('');print(index);print(match);print(read);print(readB);print('');
    if match:
        version = match.group(1)
        if version != '1': raise Exception('Response data version does not match')
        data = match.group(3)
        #to json
        data = re.sub(r'(:\s)?False', r'\1false', data)
        data = re.sub(r'(:\s)?True', r'\1true', data)
        data = re.sub(r'(:\s)?None', r'\1null', data)
        data = json.loads(data)
        return [True, data]
    #auth error
    match = re.search('\nERROR: unknown command or variable', read, re.DOTALL)
    if match:
        raise Exception('error sending command, wrong password?')
    #return read
    return [True, read]
def tnCreate(host):
    match = re.search(r'^(.*):(\d+)$', host) #optional ":port" suffix
    port = 36330 #same default as the original script
    
    if match:
        host = match.group(1)
        port = int(match.group(2)) #keep the port numeric
        
    tn = telnetlib.Telnet(host, port, tConTimeout)
    readResult(tn.expect(tEnd, tReadTimeout))
    return tn

def sendCmd(tn, cmd, par=''):
    if cmd == 'auth':
        if hostsPassword:
            cmdStr = f'auth {hostsPassword}';
            tn.write(f'{cmdStr}\n'.encode('utf-8'))
            res = readResult(tn.expect(tEnd, tReadTimeout), 'OK')
            if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
            return res[1]
        return True
    elif cmd == 'exit':
        cmdStr = f'{cmd}';
        tn.write(f'{cmdStr}\n'.encode('utf-8'))
        try:
            readResult(tn.expect(tEnd, tReadTimeout))
        except EOFError as err:
            pass #Linux FAH Client without confirmation
        return;
    elif cmd == 'slot-info' or cmd == 'queue-info':
        cmdStr = f'{cmd}';
        tn.write(f'{cmdStr}\n'.encode('utf-8'))
        res = readResult(tn.expect(tEnd, tReadTimeout))
        if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
        return res[1]
    elif cmd == 'get-info-and-restart':
        queueData = sendCmd(tn, 'queue-info')
        slotData = sendCmd(tn, 'slot-info')
        ###
        #if type(queueData) == str: print('');print('');print('');print(queueData);print(queueData.encode('utf-8'));print('');
        #if type(slotData) == str: print('');print('');print('');print(slotData);print(slotData.encode('utf-8'));print('');
        restarted = []
        for slot in slotData:
            isStillRunning = False
            queueDl = False
            for queue in queueData:
                if queue['slot'] == slot['id']:
                    if queue['state'] == 'RUNNING': isStillRunning = True
                    if queue['state'] == 'DOWNLOAD': queueDl = queue
            if not isStillRunning and queueDl and queueDl['waitingon'] == 'WS Assignment':
                match = re.match(r'\s?(\d+ days?)?\s?(\d+ hours?)?\s?(\d+ mins?)?\s?([\d.]+ secs?)?', queueDl['nextattempt'])
                if match:
                    seconds = 0
                    if match.group(1): seconds += int(re.sub(r'[^\d.]', '', match.group(1))) * 3600 * 24
                    if match.group(2): seconds += int(re.sub(r'[^\d.]', '', match.group(2))) * 3600
                    if match.group(3): seconds += int(re.sub(r'[^\d.]', '', match.group(3))) * 60
                    if match.group(4): seconds += round(float(re.sub(r'[^\d.]', '', match.group(4))) * 1)
                    if seconds >= restartLimit:
                        if not testMode:
                          sendCmd(tn, 'pause', queueDl['slot'])
                          time.sleep(1)
                          sendCmd(tn, 'unpause', queueDl['slot'])
                        restarted.append([queueDl['slot'], queueDl['nextattempt']])
                else: raise Exception(f'Error with {cmd}, parsing queue nextattempt:{queueDl["nextattempt"]}')
        return restarted
    elif par and (cmd == 'pause' or cmd == 'unpause'):
        cmdStr = f'{cmd} {par}';
        tn.write(f'{cmdStr}\n'.encode('utf-8'))
        res = readResult(tn.expect(tEnd, tReadTimeout))
        if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
        return res[1]
    else : return False

def check(host):
    st = time.time()
    pr(f'checking {host}', 1, True)
    try:
        tn = tnCreate(host)
        sendCmd(tn, 'auth')
        restarted = sendCmd(tn, 'get-info-and-restart')
        if len(restarted):
            pr(f'{host}: restarted {len(restarted)} slot{"s" if len(restarted) > 1 else ""}: ' + ', '.join(map(lambda item: '' + (' with '.join(item)), restarted)), 1, False, True)
        sendCmd(tn, 'exit')
        ed = time.time()
        time.sleep(max(0, 1 - (ed - st)))
    except Exception as err:
        pr(f'{host} error: {err}', 1, False, True)

checkKeep()

 

 

Edited by Koppa315
Added additional logic to better support Linux FAH: whitespace handling in the regex and error handling in exit (the Linux client just closes the connection, which causes an EOFError). Also removed the return value there (please re-add it if you're going to use it for something).
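For anyone adapting the nextattempt handling, here is a minimal standalone sketch of the duration parsing. It assumes the client reports strings like '14 hours 5 mins', which is the format implied by the regex in the script; parse_next_attempt is a hypothetical helper name and not part of the script. In the script's pattern every group is optional, so re.match always succeeds and the error branch can never fire. This version anchors the pattern and rejects input with no recognised unit.

```python
import re

# Assumed format: optional "N days", "N hours", "N mins", "N.N secs", in that order.
_DURATION = re.compile(
    r'\s*(?:(\d+)\s*days?)?'
    r'\s*(?:(\d+)\s*hours?)?'
    r'\s*(?:(\d+)\s*mins?)?'
    r'\s*(?:([\d.]+)\s*secs?)?'
    r'\s*$'
)

def parse_next_attempt(text):
    """Convert a 'next attempt' string into whole seconds (hypothetical helper)."""
    m = _DURATION.match(text)
    # Every group is optional, so also require that at least one unit matched.
    if m is None or not any(m.groups()):
        raise ValueError(f'cannot parse next attempt: {text!r}')
    days, hours, mins, secs = m.groups()
    return (int(days or 0) * 86400
            + int(hours or 0) * 3600
            + int(mins or 0) * 60
            + round(float(secs or 0)))
```

The result can be compared directly against restartLimit, which is also expressed in seconds.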

Looks like some WUs are finally timing out and getting cleared from the queue on one PC, and sending reliably on the other PC.

 

P106-100 PPD estimates seem low (100K, should be 300K), but the work I'm getting seems low on points too (10,000 to 30,000 point units). I wonder if my Celeron G1820 CPU is bottlenecking. It's a Haswell at 2.7GHz, non-AVX, 2 cores, no SMT: a pretty basic, low-wattage CPU. All I have sitting unused are some Haswell i3s, a 4130 and a 4150, but those are also 2-core, though they have SMT at least. I wonder if I should swap one of those in, or is everyone just getting low-point GPU work units lately?

 

Ok that's better, I had forgotten to put in my passkey!


17 hours ago, Koppa315 said:

I made a minor *cough* hacky *cough* change to @danielocdh's code to suit a use case I had. I started playing with connecting to one of my systems over a local port forwarding SSH tunnel, partly because I didn't want to open firewall ports and mostly because FAH's API client is not secure. Either way, in case anyone else has the same use case, I made it so the hosts variable can take standard host:port notation.
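The host:port handling can be sketched on its own, separate from the telnet code. This is a minimal sketch under a few assumptions, not the script's exact logic: split_host_port is a hypothetical name, 36330 is the default port the script falls back to, and IPv6 literals are not handled.

```python
DEFAULT_PORT = 36330  # the port the script uses when none is given

def split_host_port(entry, default_port=DEFAULT_PORT):
    """Split 'host' or 'host:port' into (host, port) (hypothetical helper)."""
    host, sep, port = entry.rpartition(':')
    # Only treat the suffix as a port if there was a colon and it is all digits.
    if sep and port.isdigit():
        return host, int(port)
    return entry, default_port

for entry in ['localhost', 'localhost:36331']:
    print(split_host_port(entry))
```

Returning the port as an int keeps it the same type as the default, so the caller never has to care which branch produced it.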





################################################################################
##                                  options                                   ##
################################################################################
hosts = [ #list of quoted strings, hosts or IPs, with optional colon-separated port, separated by comma
  'localhost',
  'localhost:36331'
]
hostsPassword = '' #quoted string, if the host(s) don't use a password just leave it as: ''

restartLimit = 10 * 60 #in seconds, pause+unpause if next attempt to get WU is this or more
checkEvery = 2 * 60 #in seconds, run a check of all hosts this often

tConTimeout = 15 #in seconds, connection timeout
tReadTimeout = 10 #in seconds, read timeout
testMode = False # if set to True: checkEvery=6 and restartLimit=0 but won't actually pause+unpause slots

################################################################################
##                                    code                                    ##
################################################################################
import json
import re
import telnetlib #removed from the standard library in Python 3.13, so run this with Python 3.12 or older
import time
import datetime

if testMode:
    restartLimit = 0
    checkEvery = 6
countEvery = 1 #seconds, have to be a factor of checkEvery, default: 1
countEveryDec = max(0, str(countEvery)[::-1].find('.'))
countEveryDecStr = f'{{:.{countEveryDec}f}}'
def remSeconds(seconds):
    #loop instead of recursing, so a large checkEvery can't hit Python's recursion limit
    while seconds > 0:
        if (seconds * 10000) % (countEvery * 10000) == 0:
            secondsP = countEveryDecStr.format(seconds)
            pr(f'Next check in {secondsP} seconds', same=True)
        time.sleep(countEvery)
        seconds = round((seconds - countEvery) * 10000) / 10000

prLastLen = 0
prLastSame = False
def pr(t, indent=0, same=False, overPrev=False):
    global prLastLen, prLastSame
    if not overPrev and not same and prLastSame:
        prLastLen = 0
        print('')
    t = str(t)
    toPrint = ('  ' * indent) + t
    tLen = len(toPrint)
    print(toPrint + (' ' * max(0, prLastLen - tLen)), end='\r')
    prLastSame = same
    prLastLen = tLen
    if not same:
        print('')
        prLastLen = 0

def checkKeep():
    while (True):
        checkAll()
        remSeconds(checkEvery)

def checkAll():
    for host in hosts: check(host)
    now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    pr(f'check complete at {now}', 0, False, True)


tEnd = ['\n>\\s+'.encode('utf-8'), '\n---\n'.encode('utf-8')] #\\s is the same pattern as before, minus the invalid-escape warning
def readResult(expected, expectedResult=''):
    index = expected[0]
    readB = expected[2]
    
    read = readB.decode('utf-8')
    #nothing
    if index < 0 or read == '': return [False, 'nothing was read']
    #expected result
    if expectedResult:
        endWith = tEnd[index].decode('utf-8')
        readStrip = read[0:-len(endWith)].strip()
        if (readStrip != expectedResult):
            return [False, f'{readB}']
    #PyON->json
    match = re.search(r'\n*PyON\s+(\d+)\s+([-_a-zA-Z\d]+)\n(.*)\n---\n', read, re.DOTALL)
    ###print('');print('');print('');print(index);print(match);print(read);print(readB);print('');
    if match:
        version = match.group(1)
        if version != '1': raise Exception('Response data version does not match')
        data = match.group(3)
        #to json
        data = re.sub(r'(:\s)?False', r'\1false', data)
        data = re.sub(r'(:\s)?True', r'\1true', data)
        data = re.sub(r'(:\s)?None', r'\1null', data)
        data = json.loads(data)
        return [True, data]
    #auth error
    match = re.search('\nERROR: unknown command or variable', read, re.DOTALL)
    if match:
        raise Exception('error sending command, wrong password?')
    #return read
    return [True, read]
def tnCreate(host):
    #split an optional ':port' suffix off the host
    match = re.search(r'(.*):(\d+)', host)
    port = 36330 #default, used when the host has no ':port' suffix

    if match:
        host = match.group(1)
        port = int(match.group(2))

    tn = telnetlib.Telnet(host, port, tConTimeout)
    readResult(tn.expect(tEnd, tReadTimeout))
    return tn

def sendCmd(tn, cmd, par=''):
    if cmd == 'auth':
        if hostsPassword:
            cmdStr = f'auth {hostsPassword}';
            tn.write(f'{cmdStr}\n'.encode('utf-8'))
            res = readResult(tn.expect(tEnd, tReadTimeout), 'OK')
            if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
            return res[1]
        return True
    elif cmd == 'exit':
        cmdStr = f'{cmd}';
        tn.write(f'{cmdStr}\n'.encode('utf-8'))
        try:
            readResult(tn.expect(tEnd, tReadTimeout))
        except EOFError as err:
            pass #Linux FAH Client without confirmation
        return;
    elif cmd == 'slot-info' or cmd == 'queue-info':
        cmdStr = f'{cmd}';
        tn.write(f'{cmdStr}\n'.encode('utf-8'))
        res = readResult(tn.expect(tEnd, tReadTimeout))
        if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
        return res[1]
    elif cmd == 'get-info-and-restart':
        queueData = sendCmd(tn, 'queue-info')
        slotData = sendCmd(tn, 'slot-info')
        ###
        #if type(queueData) == str: print('');print('');print('');print(queueData);print(queueData.encode('utf-8'));print('');
        #if type(slotData) == str: print('');print('');print('');print(slotData);print(slotData.encode('utf-8'));print('');
        restarted = []
        for slot in slotData:
            isStillRunning = False
            queueDl = False
            for queue in queueData:
                if queue['slot'] == slot['id']:
                    if queue['state'] == 'RUNNING': isStillRunning = True
                    if queue['state'] == 'DOWNLOAD': queueDl = queue
            if not isStillRunning and queueDl and queueDl['waitingon'] == 'WS Assignment':
                match = re.match(r'\s?(\d+ days?)?\s?(\d+ hours?)?\s?(\d+ mins?)?\s?([\d.]+ secs?)?', queueDl['nextattempt'])
                if match:
                    seconds = 0
                    #raw strings keep \d and \s as regex escapes instead of invalid string escapes
                    if match.group(1): seconds += int(re.sub(r'[^\d.]', '', match.group(1))) * 3600 * 24
                    if match.group(2): seconds += int(re.sub(r'[^\d.]', '', match.group(2))) * 3600
                    if match.group(3): seconds += int(re.sub(r'[^\d.]', '', match.group(3))) * 60
                    if match.group(4): seconds += round(float(re.sub(r'[^\d.]', '', match.group(4))) * 1)
                    if seconds >= restartLimit:
                        if not testMode:
                            sendCmd(tn, 'pause', queueDl['slot'])
                            time.sleep(1)
                            sendCmd(tn, 'unpause', queueDl['slot'])
                        restarted.append([queueDl['slot'], queueDl['nextattempt']])
                else: raise Exception(f'Error with {cmd}, parsing queue nextattempt:{queueDl["nextattempt"]}')
        return restarted
    elif par and (cmd == 'pause' or cmd == 'unpause'):
        cmdStr = f'{cmd} {par}';
        tn.write(f'{cmdStr}\n'.encode('utf-8'))
        res = readResult(tn.expect(tEnd, tReadTimeout))
        if not res[0]: raise Exception(f'Error with {cmd}, {res[1]}')
        return res[1]
    else : return False

def check(host):
    st = time.time()
    pr(f'checking {host}', 1, True)
    try:
        tn = tnCreate(host)
        sendCmd(tn, 'auth')
        restarted = sendCmd(tn, 'get-info-and-restart')
        if len(restarted):
            pr(f'{host}: restarted {len(restarted)} slot{"s" if len(restarted) > 1 else ""}: ' + ', '.join(map(lambda item: '' + (' with '.join(item)), restarted)), 1, False, True)
        sendCmd(tn, 'exit')
        ed = time.time()
        time.sleep(max(0, 1 - (ed - st)))
    except Exception as err:
        pr(f'{host} error: {err}', 1, False, True)

checkKeep()

 

 

Wish I had this during the event. Thanks for the code, this should help me not worry about checking machines constantly.

Hardware & Programming Enthusiast - Creator of LAR_Systems "Folding@Home in the Dark" browser extension and GPU / CPU PPD Database. 

