

archiso
7 hours ago, AlfonsLM said:

You can use the old Google algorithm and keep track of how many times the crawler finds a page, then rank the search results from that. Also, it might be worth considering how you will find your search words before building the database. One really easy way, and definitely not the best, would be to break down the page header and use that. It might be worth checking e.g. images for metadata too.
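A minimal sketch of that counting idea, assuming every discovered link is collected into one flat list as the crawler runs (the names here are hypothetical):

from collections import Counter

def rank_pages(discovered_links):
    # Pages the crawler runs into more often rank higher, a crude
    # stand-in for the early PageRank idea described above.
    counts = Counter(discovered_links)
    return [url for url, _ in counts.most_common()]

# Example: 'a.com' was discovered twice, so it ranks first.
print(rank_pages(["a.com", "b.com", "a.com"]))  # ['a.com', 'b.com']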

That is definitely an option, but right now I need to get the part of my code that finds the URLs working. It doesn't work right now; I think I need to rewrite that part, but I can't figure out what to do for it.

 

9 hours ago, vorticalbox said:

And if it's a relative href you need to add the domain back on.

yep🙂


@Ulysses499 I have been testing the code using Wikipedia as the base URL and this happened:

https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
https://en.wikipedia.org//creativecommons.org/licenses/by-sa/3.0/
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Terms_of_Use
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Privacy_policy

These are only a couple of them, and not all of them have the double slash that causes the problem. Do you have any ideas?


4 hours ago, archiso said:

@Ulysses499 I have been testing the code using Wikipedia as the base URL and this happened:



https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
https://en.wikipedia.org//creativecommons.org/licenses/by-sa/3.0/
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Terms_of_Use
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Privacy_policy

These are only a couple of them, and not all of them have the double slash that causes the problem. Do you have any ideas?

Hmm, I think you could solve this in two ways:

1. Search for the reason for the double slash -> an escaped symbol?

2. (easier, but more demanding on resources) Implement a validate function that searches for incorrect symbols and replaces them. In this instance it would replace every "//" with "/" whenever it doesn't follow a protocol definition like http(s): (see the sketch below).
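A minimal sketch of such a validate function, assuming all it has to do is collapse a doubled slash that doesn't belong to the scheme (the function name is hypothetical):

import re

def validate(url):
    # Collapse "//" into "/" unless it directly follows a protocol
    # definition like "http:" or "https:".
    return re.sub(r"(?<!:)//", "/", url)

print(validate("https://en.wikipedia.org//wiki/Main_Page"))
# https://en.wikipedia.org/wiki/Main_Page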

 

EDIT: I found the bug. Those are two URLs each, but the second https:// is missing.


1 minute ago, Ulysses499 said:

Hmm, I think you could solve this in two ways:

1. Search for the reason for the double slash -> an escaped symbol?

2. (easier, but more demanding on resources) Implement a validate function that searches for incorrect symbols and replaces them. In this instance it would replace every "//" with "/" whenever it doesn't follow a protocol definition like http(s):

 

Maybe. Can you look this over and see if I messed anything up?

import urllib.request as urlr
from bs4 import BeautifulSoup

def extract(soup, base_url):
    outlst = []
    for link in soup.findAll('a'):
        url = str(link.get('href'))
        if(url not in outlst):
            if(url.startswith("http" or "//")):
                print(url)
                outlst.append(url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append(url)
    return outlst

def findnth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

def crawl(lst):
    savelst = lst
    depth = 1
    for i in savelst:
        toRead = urlr.urlopen(i)
        content = BeautifulSoup(toRead, "html.parser")
        while(depth < 10):
            for link in extract(content, i):
                if(link not in savelst):
                    print(link)
                    savelst.append(link)
                depth += 1
    return savelst

 


2 minutes ago, archiso said:

Maybe. Can you look this over and see if I messed anything up?


import urllib.request as urlr
from bs4 import BeautifulSoup

def extract(soup, base_url):
    outlst = []
    for link in soup.findAll('a'):
        url = str(link.get('href'))
        if(url not in outlst):
            if(url.startswith("http" or "//")):
                print(url)
                outlst.append(url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append(url)
    return outlst

def findnth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

def crawl(lst):
    savelst = lst
    depth = 1
    for i in savelst:
        toRead = urlr.urlopen(i)
        content = BeautifulSoup(toRead, "html.parser")
        while(depth < 10):
            for link in extract(content, i):
                if(link not in savelst):
                    print(link)
                    savelst.append(link)
                depth += 1
    return savelst

 

The URLs are combined in the if clause; my guess is it's the elif. But to be sure, run this code:
 

            if(url.startswith("http" or "//")):
                print(url)
                outlst.append("[IF]" + url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append("[ELIF]" + url)

Then your previous findings should give you a hint whether it's the if or the elif.


4 hours ago, archiso said:

@Ulysses499 I have been testing the code using Wikipedia as the base URL and this happened:


https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
https://en.wikipedia.org//creativecommons.org/licenses/by-sa/3.0/
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Terms_of_Use
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Privacy_policy

These are only a couple of them, and not all of them have the double slash that causes the problem. Do you have any ideas?

These are actually two URLs each:

https://en.wikipedia.org/
en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
https://en.wikipedia.org/
creativecommons.org/licenses/by-sa/3.0/
https://en.wikipedia.org/
foundation.wikimedia.org/wiki/Terms_of_Use
https://en.wikipedia.org/
foundation.wikimedia.org/wiki/Privacy_policy

You prepend the base URL to links that don't start with a protocol definition.


3 minutes ago, Ulysses499 said:

The URLs are combined in the if clause; my guess is it's the elif. But to be sure, run this code:
 


            if(url.startswith("http" or "//")):
                print(url)
                outlst.append("[IF]" + url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append("[ELIF]" + url)

Then your previous findings should give you a hint whether it's the if or the elif.

Yep, the elif is the problem...


6 minutes ago, Ulysses499 said:

The URLs are combined in the if clause; my guess is it's the elif. But to be sure, run this code:
 


            if(url.startswith("http" or "//")):
                print(url)
                outlst.append("[IF]" + url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append("[ELIF]" + url)

Then your previous findings should give you a hint whether it's the if or the elif.

Not every elif, though...

[IF]https://nn.wikipedia.org/wiki/
[IF]https://sk.wikipedia.org/wiki/
[IF]https://sl.wikipedia.org/wiki/
[IF]https://th.wikipedia.org/wiki/
[IF]https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=1004593520
[ELIF]https://en.wikipedia.org/wiki/Special:MyTalk
[ELIF]https://en.wikipedia.org/wiki/Special:MyContributions
[ELIF]https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page
[ELIF]https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page
[ELIF]https://en.wikipedia.org/wiki/Main_Page
[ELIF]https://en.wikipedia.org/wiki/Talk:Main_Page
[ELIF]https://en.wikipedia.org/w/index.php?title=Main_Page&action=edit
[ELIF]https://en.wikipedia.org/w/index.php?title=Main_Page&action=history
[ELIF]https://en.wikipedia.org/wiki/Wikipedia:Contents
[ELIF]https://en.wikipedia.org/wiki/Special:Random
[ELIF]https://en.wikipedia.org/wiki/Wikipedia:About
[ELIF]https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us

weird?


Just now, archiso said:

Not every elif, though...


[IF]https://nn.wikipedia.org/wiki/
[IF]https://sk.wikipedia.org/wiki/
[IF]https://sl.wikipedia.org/wiki/
[IF]https://th.wikipedia.org/wiki/
[IF]https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=1004593520
[ELIF]https://en.wikipedia.org/wiki/Special:MyTalk
[ELIF]https://en.wikipedia.org/wiki/Special:MyContributions
[ELIF]https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page
[ELIF]https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page
[ELIF]https://en.wikipedia.org/wiki/Main_Page
[ELIF]https://en.wikipedia.org/wiki/Talk:Main_Page
[ELIF]https://en.wikipedia.org/w/index.php?title=Main_Page&action=edit
[ELIF]https://en.wikipedia.org/w/index.php?title=Main_Page&action=history
[ELIF]https://en.wikipedia.org/wiki/Wikipedia:Contents
[ELIF]https://en.wikipedia.org/wiki/Special:Random
[ELIF]https://en.wikipedia.org/wiki/Wikipedia:About
[ELIF]https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us

weird?

My guess is that there is something different with those links in the original file that your code did not account for. 

What do you use the if/else for? Maybe you can avoid the problem by using simpler logic, like the sketch below.
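For what it's worth, the standard library already provides that simpler logic: urllib.parse.urljoin resolves relative, root-relative, and scheme-relative hrefs against a base URL, which could replace the whole if/elif:

from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Main_Page"

print(urljoin(base, "/wiki/Wikipedia:About"))
# https://en.wikipedia.org/wiki/Wikipedia:About

print(urljoin(base, "//creativecommons.org/licenses/by-sa/3.0/"))
# https://creativecommons.org/licenses/by-sa/3.0/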


Just now, Ulysses499 said:

My guess is that there is something different with those links in the original file that your code did not account for. 

What do you use the if/else for? Maybe you can avoid the problem by using simpler logic.

It was to find out which URLs needed the domain prepended to them.


6 minutes ago, archiso said:

It was to find out which URLs needed the domain prepended to them.

Ahhh okay.

Which Wikipedia link created this error? I would like to know why there are links that start with "/".

 

Anyway, you could try to solve this by looking for "[subhostname].[hostname].[domain]" in the elif case; a rough sketch follows.
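A rough sketch of that check, assuming a regular expression is acceptable for spotting a hostname-like first segment (the helper name is hypothetical):

import re

# Matches hrefs whose first path segment looks like "host.name.tld",
# i.e. scheme-relative links that lost their protocol.
HOST_LIKE = re.compile(r"^//?[\w-]+(\.[\w-]+)+(/|$)")

def looks_like_host(href):
    return bool(HOST_LIKE.match(href))

print(looks_like_host("//en.wikipedia.org/wiki/Main_Page"))  # True
print(looks_like_host("/wiki/Main_Page"))                    # False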


On 3/28/2021 at 8:39 PM, Ulysses499 said:

Ahhh okay.

Which Wikipedia link created this error? I would like to know why there are links that start with "/".

 

Anyway, you could try to solve this by looking for "[subhostname].[hostname].[domain]" in the elif case.

Sorry for not responding earlier. I did find the problem: it was the "or" in the if statement. I fixed that and now it works! Now I have the problem of making it multithreaded so it runs faster; I think I'll make a separate post to discuss that.
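For the record, "http" or "//" in Python evaluates to just "http", so scheme-relative links starting with "//" silently fell through to the elif and got the base URL prepended. A sketch of the fixed branch inside extract, reusing findnth from the code above (defaulting scheme-relative links to https is an assumption):

if url.startswith(("http", "//")):  # startswith accepts a tuple of prefixes
    if url.startswith("//"):
        url = "https:" + url  # assumption: default to https
    outlst.append(url)
elif url.startswith("/"):
    url = base_url[:findnth(base_url, "/", 3)] + url
    outlst.append(url)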


4 hours ago, archiso said:

Sorry for not responding earlier. I did find the problem: it was the "or" in the if statement. I fixed that and now it works! Now I have the problem of making it multithreaded so it runs faster; I think I'll make a separate post to discuss that.

Okay, but here is my quick advice:

Use the master/worker principle and have one thread act as the task giver, then give each worker thread a list of pages and collect the results via a "done" event. There are other ways to do this more efficiently, but all of them would probably involve a lot of synchronization work. A minimal sketch is below.
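A minimal sketch of that task-giver pattern using only the standard library, where ThreadPoolExecutor plays the coordinating role and as_completed delivers the "done" events (fetch and crawl_batch are hypothetical names):

import urllib.request as urlr
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Each worker thread downloads one page.
    with urlr.urlopen(url) as resp:
        return url, resp.read()

def crawl_batch(urls, workers=8):
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for fut in as_completed(futures):
            url, html = fut.result()
            pages[url] = html
    return pages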


17 hours ago, Ulysses499 said:

Okay, but here is my quick advice:

Use the master/worker principle and have one thread act as the task giver, then give each worker thread a list of pages and collect the results via a "done" event. There are other ways to do this more efficiently, but all of them would probably involve a lot of synchronization work.

I found another problem: the web crawler code doesn't work correctly. The while loop with the depth variable only loops through one website, but does it 10 times, and it only does that once because depth is set outside of the main for loop.


On 3/31/2021 at 8:48 PM, archiso said:

I found another problem: the web crawler code doesn't work correctly. The while loop with the depth variable only loops through one website, but does it 10 times, and it only does that once because depth is set outside of the main for loop.

Easy, you indented the depth += 1 one level too many 😉

Also, I think your for/while loops need to be restructured.

You want 10 iterations of looking up all the links and adding them to the list; something like the sketch below.
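A sketch of that structure, reusing extract and the imports from the code above: each of the 10 iterations crawls every page discovered in the previous one, so the depth counter no longer leaks between sites:

def crawl(seed_urls, max_depth=10):
    seen = set(seed_urls)
    frontier = list(seed_urls)
    for _ in range(max_depth):
        next_frontier = []
        for page_url in frontier:
            toRead = urlr.urlopen(page_url)
            content = BeautifulSoup(toRead, "html.parser")
            for link in extract(content, page_url):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return list(seen)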


On 4/2/2021 at 6:51 AM, Ulysses499 said:

Easy, you indented the depth += 1 one level too many 😉

Also, I think your for/while loops need to be restructured.

You want 10 iterations of looking up all the links and adding them to the list.

I just removed the indent and now I think it is looping infinitely...

