

archiso
7 hours ago, AlfonsLM said:

You can use the old Google algorithm and keep track of how many times the crawler finds a page, then rank the search results from that. Also, it might be worth considering how you will find your search words before building the database. One really easy way, and definitely not the best, would be to break down the page header and use that. It might be worth checking e.g. images for metadata too.
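A minimal sketch of that counting idea, assuming every discovered link is collected into one flat list as the crawler runs (the names here are hypothetical):

from collections import Counter

def rank_pages(discovered_links):
    # Pages the crawler runs into more often rank higher, a crude
    # stand-in for the early PageRank idea described above.
    counts = Counter(discovered_links)
    return [url for url, _ in counts.most_common()]

# Example: 'a.com' was discovered twice, so it ranks first.
print(rank_pages(["a.com", "b.com", "a.com"]))  # ['a.com', 'b.com']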

That is definitely an option, but right now I need to get the part of my code that finds the URLs working. It doesn't work right now; I think I need to rewrite that part, but I can't figure out what to do for it.

 

9 hours ago, vorticalbox said:

And if it's a relative href you need to add the domain back on.

yep🙂


@Ulysses499 I have been testing the code using Wikipedia as the base URL and this happened:

https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
https://en.wikipedia.org//creativecommons.org/licenses/by-sa/3.0/
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Terms_of_Use
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Privacy_policy

These are only a couple of them, and not all of them have the double slash that causes the problem. Do you have any ideas?


4 hours ago, archiso said:

@Ulysses499 I have been testing the code using Wikipedia as the base URL and this happened:



https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
https://en.wikipedia.org//creativecommons.org/licenses/by-sa/3.0/
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Terms_of_Use
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Privacy_policy

These are only a couple of them, and not all of them have the double slash that causes the problem. Do you have any ideas?

Hmm, I think you could solve this in two ways:

1. Search for the reason for the double slash -> an escaped symbol?

2. (easier, but more demanding on resources) Implement a validate function that searches for incorrect symbols and replaces them. In this instance it would replace every "//" with "/" whenever it doesn't follow a protocol definition like http(s): (see the sketch below).
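A minimal sketch of such a validate function, assuming all it has to do is collapse a doubled slash that doesn't belong to the scheme (the function name is hypothetical):

import re

def validate(url):
    # Collapse "//" into "/" unless it directly follows a protocol
    # definition like "http:" or "https:".
    return re.sub(r"(?<!:)//", "/", url)

print(validate("https://en.wikipedia.org//wiki/Main_Page"))
# https://en.wikipedia.org/wiki/Main_Page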

 

EDIT: I found the bug. Those are two URLs each, but the second https:// is missing.


1 minute ago, Ulysses499 said:

Hmm, I think you could solve this in two ways:

1. Search for the reason for the double slash -> an escaped symbol?

2. (easier, but more demanding on resources) Implement a validate function that searches for incorrect symbols and replaces them. In this instance it would replace every "//" with "/" whenever it doesn't follow a protocol definition like http(s):

 

Maybe. Can you look this over and see if I messed anything up?

import urllib.request as urlr
from bs4 import BeautifulSoup

def extract(soup, base_url):
    outlst = []
    for link in soup.findAll('a'):
        url = str(link.get('href'))
        if(url not in outlst):
            if(url.startswith("http" or "//")):
                print(url)
                outlst.append(url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append(url)
    return outlst

def findnth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

def crawl(lst):
    savelst = lst
    depth = 1
    for i in savelst:
        toRead = urlr.urlopen(i)
        content = BeautifulSoup(toRead, "html.parser")
        while(depth < 10):
            for link in extract(content, i):
                if(link not in savelst):
                    print(link)
                    savelst.append(link)
                depth += 1
    return savelst

 


2 minutes ago, archiso said:

Maybe. Can you look this over and see if I messed anything up?


import urllib.request as urlr
from bs4 import BeautifulSoup

def extract(soup, base_url):
    outlst = []
    for link in soup.findAll('a'):
        url = str(link.get('href'))
        if(url not in outlst):
            if(url.startswith("http" or "//")):
                print(url)
                outlst.append(url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append(url)
    return outlst

def findnth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

def crawl(lst):
    savelst = lst
    depth = 1
    for i in savelst:
        toRead = urlr.urlopen(i)
        content = BeautifulSoup(toRead, "html.parser")
        while(depth < 10):
            for link in extract(content, i):
                if(link not in savelst):
                    print(link)
                    savelst.append(link)
                depth += 1
    return savelst

 

The URLs are combined in the if clause; my guess is it's the elif. But to be sure, run this code:
 

            if(url.startswith("http" or "//")):
                print(url)
                outlst.append("[IF]" + url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append("[ELIF]" + url)

Then your previous findings should give you a hint whether it's the if or the elif.


4 hours ago, archiso said:

@Ulysses499 I have been testing the code using Wikipedia as the base URL and this happened:


https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
https://en.wikipedia.org//creativecommons.org/licenses/by-sa/3.0/
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Terms_of_Use
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Privacy_policy

These are only a couple of them, and not all of them have the double slash that causes the problem. Do you have any ideas?

These are actually two URLs each:

https://en.wikipedia.org/
en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
https://en.wikipedia.org/
creativecommons.org/licenses/by-sa/3.0/
https://en.wikipedia.org/
foundation.wikimedia.org/wiki/Terms_of_Use
https://en.wikipedia.org/
foundation.wikimedia.org/wiki/Privacy_policy

You prepend the base URL to links that don't start with a protocol definition.


3 minutes ago, Ulysses499 said:

The URLs are combined in the if clause; my guess is it's the elif. But to be sure, run this code:
 


            if(url.startswith("http" or "//")):
                print(url)
                outlst.append("[IF]" + url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append("[ELIF]" + url)

Then your previous findings should give you a hint whether it's the if or the elif.

Yep, the elif is the problem...


6 minutes ago, Ulysses499 said:

The URLs are combined in the if clause; my guess is it's the elif. But to be sure, run this code:
 


            if(url.startswith("http" or "//")):
                print(url)
                outlst.append("[IF]" + url)
            elif(url.startswith("/")):
                url = base_url[:findnth(base_url, "/", 3)] + url
                print(url)
                outlst.append("[ELIF]" + url)

Then your previous findings should give you a hint whether it's the if or the elif.

Not every elif, though...

[IF]https://nn.wikipedia.org/wiki/
[IF]https://sk.wikipedia.org/wiki/
[IF]https://sl.wikipedia.org/wiki/
[IF]https://th.wikipedia.org/wiki/
[IF]https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=1004593520
[ELIF]https://en.wikipedia.org/wiki/Special:MyTalk
[ELIF]https://en.wikipedia.org/wiki/Special:MyContributions
[ELIF]https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page
[ELIF]https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page
[ELIF]https://en.wikipedia.org/wiki/Main_Page
[ELIF]https://en.wikipedia.org/wiki/Talk:Main_Page
[ELIF]https://en.wikipedia.org/w/index.php?title=Main_Page&action=edit
[ELIF]https://en.wikipedia.org/w/index.php?title=Main_Page&action=history
[ELIF]https://en.wikipedia.org/wiki/Wikipedia:Contents
[ELIF]https://en.wikipedia.org/wiki/Special:Random
[ELIF]https://en.wikipedia.org/wiki/Wikipedia:About
[ELIF]https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us

weird?


Just now, archiso said:

Not every elif, though...


[IF]https://nn.wikipedia.org/wiki/
[IF]https://sk.wikipedia.org/wiki/
[IF]https://sl.wikipedia.org/wiki/
[IF]https://th.wikipedia.org/wiki/
[IF]https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=1004593520
[ELIF]https://en.wikipedia.org/wiki/Special:MyTalk
[ELIF]https://en.wikipedia.org/wiki/Special:MyContributions
[ELIF]https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page
[ELIF]https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page
[ELIF]https://en.wikipedia.org/wiki/Main_Page
[ELIF]https://en.wikipedia.org/wiki/Talk:Main_Page
[ELIF]https://en.wikipedia.org/w/index.php?title=Main_Page&action=edit
[ELIF]https://en.wikipedia.org/w/index.php?title=Main_Page&action=history
[ELIF]https://en.wikipedia.org/wiki/Wikipedia:Contents
[ELIF]https://en.wikipedia.org/wiki/Special:Random
[ELIF]https://en.wikipedia.org/wiki/Wikipedia:About
[ELIF]https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us

weird?

My guess is that there is something different with those links in the original file that your code did not account for. 

What do you use the if/else for? Maybe you can avoid the problem by using simpler logic, like the sketch below.
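For what it's worth, the standard library already provides that simpler logic: urllib.parse.urljoin resolves relative, root-relative, and scheme-relative hrefs against a base URL, which could replace the whole if/elif:

from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Main_Page"

print(urljoin(base, "/wiki/Wikipedia:About"))
# https://en.wikipedia.org/wiki/Wikipedia:About

print(urljoin(base, "//creativecommons.org/licenses/by-sa/3.0/"))
# https://creativecommons.org/licenses/by-sa/3.0/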


Just now, Ulysses499 said:

My guess is that there is something different with those links in the original file that your code did not account for. 

What do you use the if/else for? Maybe you can avoid the problem by using simpler logic.

It was to find out which URLs needed the domain prepended to them.


6 minutes ago, archiso said:

It was to find out which URLs needed the domain prepended to them.

Ahhh okay.

Which Wikipedia link created this error? I would like to know why there are links that start with "/".

 

Anyway, you could try to solve this by looking for "[subhostname].[hostname].[domain]" in the elif case; a rough sketch follows.
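A rough sketch of that check, assuming a regular expression is acceptable for spotting a hostname-like first segment (the helper name is hypothetical):

import re

# Matches hrefs whose first path segment looks like "host.name.tld",
# i.e. scheme-relative links that lost their protocol.
HOST_LIKE = re.compile(r"^//?[\w-]+(\.[\w-]+)+(/|$)")

def looks_like_host(href):
    return bool(HOST_LIKE.match(href))

print(looks_like_host("//en.wikipedia.org/wiki/Main_Page"))  # True
print(looks_like_host("/wiki/Main_Page"))                    # False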


On 3/28/2021 at 8:39 PM, Ulysses499 said:

Ahhh okay.

Which Wikipedia link created this error? I would like to know why there are links that start with "/".

 

Anyway, you could try to solve this by looking for "[subhostname].[hostname].[domain]" in the elif case.

Sorry for not responding earlier. I did find the problem: it was the "or" in the if statement. I fixed that and now it works! Now I have the problem of making it multithreaded so it runs faster; I think I'll make a separate post to discuss that.
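For the record, "http" or "//" in Python evaluates to just "http", so scheme-relative links starting with "//" silently fell through to the elif and got the base URL prepended. A sketch of the fixed branch inside extract, reusing findnth from the code above (defaulting scheme-relative links to https is an assumption):

if url.startswith(("http", "//")):  # startswith accepts a tuple of prefixes
    if url.startswith("//"):
        url = "https:" + url  # assumption: default to https
    outlst.append(url)
elif url.startswith("/"):
    url = base_url[:findnth(base_url, "/", 3)] + url
    outlst.append(url)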


4 hours ago, archiso said:

Sorry for not responding earlier. I did find the problem: it was the "or" in the if statement. I fixed that and now it works! Now I have the problem of making it multithreaded so it runs faster; I think I'll make a separate post to discuss that.

Okay, but here is my quick advice:

Use the master/worker principle and have one thread act as the task giver, then give each worker thread a list of pages and collect the results via a "done" event. There are other ways to do this more efficiently, but all of them would probably involve a lot of synchronization work. A minimal sketch is below.
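A minimal sketch of that task-giver pattern using only the standard library, where ThreadPoolExecutor plays the coordinating role and as_completed delivers the "done" events (fetch and crawl_batch are hypothetical names):

import urllib.request as urlr
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Each worker thread downloads one page.
    with urlr.urlopen(url) as resp:
        return url, resp.read()

def crawl_batch(urls, workers=8):
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for fut in as_completed(futures):
            url, html = fut.result()
            pages[url] = html
    return pages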


17 hours ago, Ulysses499 said:

Okay, but here is my quick advice:

Use the master/worker principle and have one thread act as the task giver, then give each worker thread a list of pages and collect the results via a "done" event. There are other ways to do this more efficiently, but all of them would probably involve a lot of synchronization work.

I found another problem: the web crawler code doesn't work correctly. The while loop with the depth variable only loops through one website, but does it 10 times, and it only does that once because depth is set outside of the main for loop.


On 3/31/2021 at 8:48 PM, archiso said:

I found another problem: the web crawler code doesn't work correctly. The while loop with the depth variable only loops through one website, but does it 10 times, and it only does that once because depth is set outside of the main for loop.

Easy, you indented the depth += 1 one level too many 😉

Also, I think your for/while loops need to be restructured.

You want 10 iterations of looking up all the links and adding them to the list; something like the sketch below.
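A sketch of that structure, reusing extract and the imports from the code above: each of the 10 iterations crawls every page discovered in the previous one, so the depth counter no longer leaks between sites:

def crawl(seed_urls, max_depth=10):
    seen = set(seed_urls)
    frontier = list(seed_urls)
    for _ in range(max_depth):
        next_frontier = []
        for page_url in frontier:
            toRead = urlr.urlopen(page_url)
            content = BeautifulSoup(toRead, "html.parser")
            for link in extract(content, page_url):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return list(seen)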


On 4/2/2021 at 6:51 AM, Ulysses499 said:

Easy, you indented the depth += 1 one level too many 😉

Also, I think your for/while loops need to be restructured.

You want 10 iterations of looking up all the links and adding them to the list.

I just removed the indent and now I think it is looping infinitely...

