read DNS server python

archiso

I have a list var that I would like to be a list of all websites (from Google DNS). Can I do this?

 


I have a feeling that you might have a problem with some other part of it. You can definitely do a lookup on all IPv4 addresses, but I'm not sure 1 TB of disk space would be enough for a complete DNS. There is a reason why they don't ask for updates from each other and instead just ask where to find it.


8 minutes ago, archiso said:

I have a list var that I would like to be a list of all websites (from Google DNS). Can I do this?

 

So from the estimate on this page there are about 360 million domains, which means you would need just 2.9 TB of storage for the IPv4 addresses:
How Many Domains Are There? - Domain Name Stats for 2021 - Make A Website Hub


26 minutes ago, AlfonsLM said:

So from the estimate on this page there are about 360 million domains, which means you would need just 2.9 TB of storage for the IPv4 addresses:
How Many Domains Are There? - Domain Name Stats for 2021 - Make A Website Hub

@archiso you should also implement a save to disk every x elements, or else you will need a lot of RAM and will probably have the issues that come along with that. 360M entries is definitely a list you don't want sitting in memory on a normal system.
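A minimal sketch of that idea in Python, assuming you just want to flush a buffer to a plain text file every so often (the file name and chunk size are made-up placeholders):

# Sketch: flush results to disk every chunk_size entries instead of
# holding everything in RAM. File name and chunk size are placeholders.
chunk_size = 100_000
buffer = []

def flush():
    if buffer:
        with open("results.txt", "a", encoding="utf-8") as f:
            f.write("\n".join(buffer) + "\n")
        buffer.clear()

def record(entry):
    buffer.append(entry)
    if len(buffer) >= chunk_size:
        flush()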


2 minutes ago, Ulysses499 said:

@archiso you should also implement a save to disk every x elements, or else you will need a lot of RAM and will probably have the issues that come along with that. 360M entries is definitely a list you don't want sitting in memory on a normal system.

I would not want to be the one building the search index you'd need to make the list useful, either. Can't see any use for it if you can't at least get an answer in less time than it takes to ask a DNS server through the TOR network.


2 hours ago, archiso said:

I have a list var that I would like to be a list of all websites (from Google DNS). Can I do this?

When you do a normal DNS query, you send a host name and the DNS server returns the associated IP address (A record).

When you do a reverse DNS query, you send an IP address and the DNS server returns the associated host name (PTR record).

 

There is typically only a single PTR record associated with an IP address, even though there may be multiple A records pointing to that IP. This means you cannot realistically use this to compile a list of all known domain names.

 

~edit: There's another issue:

If you were to do a reverse DNS lookup of every possible IPv4 address, you'd be querying upwards of 4.2 billion IP addresses. The DNS server is most likely going to blacklist you in short order. To avoid this you could try to slow down and/or spread the queries to other DNS servers. But keep in mind how long this would take.

 

Even if you were able to do one request every 10 milliseconds it would take you more than 1.33 years to query all of them. The faster you go, the more likely you'll end up blocked. But if you slow down to one request every 100 ms, we're already talking about 13.3+ years to test every possible IP address.
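For reference, a single reverse lookup can be done in Python with the standard library's socket module; a minimal sketch (8.8.8.8 is just an example address, and most IPs simply have no PTR record):

import socket

# Reverse DNS lookup (PTR record) for one IPv4 address.
# Most addresses have no PTR record and raise socket.herror.
try:
    hostname, aliases, addresses = socket.gethostbyaddr("8.8.8.8")
    print(hostname)
except socket.herror:
    print("no PTR record")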


1 hour ago, Eigenvektor said:

When you do a normal DNS query, you send a host name and the DNS server returns the associated IP address (A record).

When you do a reverse DNS query, you send an IP address and the DNS server returns the associated host name (PTR record).

 

There is typically only a single PTR record associated with an IP address, even though there may be multiple A records pointing to that IP. This means you cannot realistically use this to compile a list of all known domain names.

 

~edit: There's another issue:

If you were to do a reverse DNS lookup of every possible IPv4 address, you'd be querying upwards of 4.2 billion IP addresses. The DNS server is most likely going to blacklist you in short order. To avoid this you could try to slow down and/or spread the queries to other DNS servers. But keep in mind how long this would take.

 

Even if you were able to do one request every 10 milliseconds it would take you more than 1.33 years to query all of them. The faster you go, the more likely you'll end up blocked. But if you slow down to one request every 100 ms, we're already talking about 13.3+ years to test every possible IP address.

 

3 hours ago, AlfonsLM said:

I would not want to be the one building the search index you'd need to make the list useful, either. Can't see any use for it if you can't at least get an answer in less time than it takes to ask a DNS server through the TOR network.

 

3 hours ago, Ulysses499 said:

@archiso you should also implement a save to disk every x elements, or else you will need a lot of RAM and will probably have the issues that come along with that. 360M entries is definitely a list you don't want sitting in memory on a normal system.

 

3 hours ago, AlfonsLM said:

So from the estimate on this page there are about 360 million domains, which means you would need just 2.9 TB of storage for the IPv4 addresses:
How Many Domains Are There? - Domain Name Stats for 2021 - Make A Website Hub

 

3 hours ago, AlfonsLM said:

I have a feeling that you might have a problem with some other part of it. You can definitely do a lookup on all IPv4 addresses, but I'm not sure 1 TB of disk space would be enough for a complete DNS. There is a reason why they don't ask for updates from each other and instead just ask where to find it.

Ok. This all makes sense. The reason for doing this is that I'm trying to create a search engine. I'm not actually going to use it, this is just for fun. So I needed a way to have a list of the websites to search.


5 minutes ago, archiso said:

Ok. This all makes sense. The reason for doing this is that I'm trying to create a search engine. I'm not actually going to use it, this is just for fun. So I needed a way to have a list of the websites to search.

Oof, well, what ranking algorithm did you want to use? With big data this is the more complex part. You can use any web crawler on the popular sites and gather the addresses most people use, but after that, where does it go? A search engine needs more than the DNS, and you'd wind up with petabytes of data to look through...


3 minutes ago, Ulysses499 said:

Oof, well, what ranking algorithm did you want to use? With big data this is the more complex part. You can use any web crawler on the popular sites and gather the addresses most people use, but after that, where does it go? A search engine needs more than the DNS, and you'd wind up with petabytes of data to look through...

Currently my search engine just runs a basic search algorithm over a URL by saving its contents to a var using urllib, and loops that over a list of URLs.


2 minutes ago, archiso said:

Currently my search engine just runs a basic search algorithm over a URL by saving its contents to a var using urllib, and loops that over a list of URLs.

Hmm okay, since you are basically searching for the first entry with a URL that contains the searched-for term, I would advise you to use the web crawler strategy with a "core" list of websites and a maximum depth.


8 minutes ago, archiso said:

Ok. This all makes sense. The reason for doing this is that I'm trying to create a search engine. I'm not actually going to use it, this is just for fun. So I needed a way to have a list of the websites to search.

You could try to build a very simple web crawler for this. You could simply enter a list of known websites as a starting point.

 

What the crawler should do:

  1. Visit the website(s) you entered/it found previously
  2. Download the (presumably HTML) document at each URL
  3. Scan the document for any (new) URLs it can find
  4. Go back to 1

You should take some steps to prevent it from running in circles, i.e. remember the URLs you've recently visited and scanned and don't scan them again for some time.

 

The next step would be to actually index the contents of the documents you have scanned, so you can then search through them.
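As a rough illustration of those four steps, a very naive breadth-first sketch in Python might look something like this (the seed URL, page limit and the crude href regex are placeholders, not a production crawler):

import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=50):
    # steps 1/2: visit each URL and download the document
    # step 3: scan it for new absolute URLs
    # step 4: queue them up and repeat, remembering what we've seen
    seen = set(seed_urls)
    queue = deque(seed_urls)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to load
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(crawl(["https://example.com"], max_pages=10))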


Just now, Ulysses499 said:

Hmm okay, since you are basically searching for the first entry with a URL that contains the searched-for term, I would advise you to use the web crawler strategy with a "core" list of websites and a maximum depth.

I'm actually using the count() function to find the number of results and ordering the URLs by the count.


2 minutes ago, archiso said:

I'm actually using the count() function to find the number of results and ordering the URLs by the count.

So something like

// Pseudocode
search(term){
    for(website in listOfSites){
        countResults(website, term)
    }
    return highestCount
}

? The problem with this is that you don't really return the "best" result, just the one with the most hits.

E.g. if I search for "Linus Tech Tips", this forum will pop up rather than the YouTube channel.

Why don't you implement the basic PageRank algorithm? 🙂


2 minutes ago, Ulysses499 said:

So something like

// Pseudocode
search(term){
    for(website in listOfSites){
        countResults(website, term)
    }
    return highestCount
}

? The problem with this is that you don't really return the "best" result, just the one with the most hits.

E.g. if I search for "Linus Tech Tips", this forum will pop up rather than the YouTube channel.

Why don't you implement the basic PageRank algorithm? 🙂

this is what it is rn:

def search1(url, search):
    # download the page, strip HTML tags, then count occurrences of each search word
    toRead = urlr.urlopen(url)
    content = str(toRead.read())[1:]
    content = remove(content)
    nonewline = str(content)
    content = nonewline.replace(' ', "")
    searchlst = search.split()
    relevance = 0
    for query in searchlst:
        relevance += content.count(query)
    containlst.append(relevance)
    print(containlst)

def remove(test_str):
    # strip anything between '<' and '>' (very crude HTML tag removal)
    ret = ''
    skip1c = 0
    for i in test_str:
        if i == '<':
            skip1c += 1
        elif i == '>' and skip1c > 0:
            skip1c -= 1
        elif skip1c == 0:
            ret += i
    return ret

containlst = []  # one relevance score per URL
search = input("Search:\n")
urls = ['url1', 'url2', 'url3']
for url in urls:
    search1(url, search)

 


1 minute ago, archiso said:

this is what it is rn:


def search1(url, search):
    # download the page, strip HTML tags, then count occurrences of each search word
    toRead = urlr.urlopen(url)
    content = str(toRead.read())[1:]
    content = remove(content)
    nonewline = str(content)
    content = nonewline.replace(' ', "")
    searchlst = search.split()
    relevance = 0
    for query in searchlst:
        relevance += content.count(query)
    containlst.append(relevance)
    print(containlst)

def remove(test_str):
    # strip anything between '<' and '>' (very crude HTML tag removal)
    ret = ''
    skip1c = 0
    for i in test_str:
        if i == '<':
            skip1c += 1
        elif i == '>' and skip1c > 0:
            skip1c -= 1
        elif skip1c == 0:
            ret += i
    return ret

containlst = []  # one relevance score per URL
search = input("Search:\n")
urls = ['url1', 'url2', 'url3']
for url in urls:
    search1(url, search)

 

What is urlr?


Just now, Ulysses499 said:

What is urlr?

oh sorry:

import urllib.request as urlr

that should be at the top


1 minute ago, archiso said:

oh sorry:


import urllib.request as urlr

that should be at the top

Seems like a very demanding search engine. If you are doing this for fun, I would suggest a more phased design.

Pseudocode

def crawl(startList):
    saveList = startList
    depth = 1
    while depth < 10:
        for item in saveList:
            content = urlr.urlopen(item)
            for link in content:
                if link not in saveList:
                    saveList.append(link)
        depth += 1
    return saveList

def search(term):
    pass  # implement your search algorithm here

 


Just now, Ulysses499 said:

Seems like a very demanding search engine. If you are doing this for fun, I would suggest a more phased design.

Pseudocode


def crawl(startList):
    saveList = startList
    depth = 1
    while depth < 10:
        for item in saveList:
            content = urlr.urlopen(item)
            for link in content:
                if link not in saveList:
                    saveList.append(link)
        depth += 1
    return saveList

def search(term):
    pass  # implement your search algorithm here

 

so input some base sites then it gathers more sites, makes sense.


3 minutes ago, Ulysses499 said:

Seems like a very demanding search engine. If you are doing this for fun, I would suggest a more phased design.

Pseudocode


def crawl(startList):
    saveList = startList
    depth = 1
    while depth < 10:
        for item in saveList:
            content = urlr.urlopen(item)
            for link in content:
                if link not in saveList:
                    saveList.append(link)
        depth += 1
    return saveList

def search(term):
    pass  # implement your search algorithm here

 

One problem: how does it know what is a URL/link and what's not?


9 minutes ago, archiso said:

One problem: how does it know what is a URL/link and what's not?

You could look for www. or http or .com etc., then keep saving characters until you hit a space or a closing quote, and you have a URL.
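A rough sketch of that heuristic with a regular expression (the sample text is made up): grab anything starting with http(s):// or www. and stop at the first whitespace or quote.

import re

text = 'see https://example.com/page and also www.example.org for more'

# crude heuristic: runs of non-whitespace, non-quote characters
# starting with http(s):// or www.
urls = re.findall(r'(?:https?://|www\.)[^\s"\'<>]+', text)
print(urls)  # ['https://example.com/page', 'www.example.org']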


1 minute ago, Ulysses499 said:

You could look for www. or http or .com etc., then keep saving characters until you hit a space or a closing quote, and you have a URL.

I think I could just look for the items after href="url" in the HTML code.


Just now, archiso said:

I think I could just look for the items after href="url" in the HTML code.

Or that yes. 🙂 just make sure to check if " is escaped
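If you go the href route, the standard library's html.parser already deals with attribute quoting and escaping for you; a minimal sketch of pulling out href values (the sample HTML is made up):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # collect the href attribute of every <a> tag
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="https://example.com">home</a> <a href="/docs">docs</a>')
print(parser.links)  # ['https://example.com', '/docs']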


Just now, Ulysses499 said:

Or that yes. 🙂 just make sure to check if " is escaped

yep


12 hours ago, Ulysses499 said:

Or that yes. 🙂 just make sure to check if " is escaped

And if it's a relative href you need to add the domain back on.
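For the relative-href case, urllib's urljoin does exactly that; a quick sketch (the base URL is made up):

from urllib.parse import urljoin

base = "https://example.com/forum/topic/123"      # page the link was found on
print(urljoin(base, "/profile/archiso"))          # https://example.com/profile/archiso
print(urljoin(base, "page-2"))                    # https://example.com/forum/topic/page-2
print(urljoin(base, "https://other.example/x"))   # absolute links pass through unchanged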


14 hours ago, archiso said:

Currently my search engine just runs a basic search algorithm over a URL by saving its contents to a var using urllib, and loops that over a list of URLs.

You can use the old Google algorithm and keep track of how many times the crawler finds a given page, and rank the search results from that. Also, it might be worth considering how you will find your search words before building the database. One really easy way, and definitely not the best, would be to break down the page header and use that. Might be worth checking e.g. images for metadata too.
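A minimal sketch of the "count how often the crawler finds a page" idea, assuming the crawler hands you every link it encounters, repeats included (the URLs are placeholders):

from collections import Counter

# hypothetical stream of links the crawler encountered, repeats included
found_links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    "https://example.com/a",
]

# rank pages by how many times they were linked to
rank = Counter(found_links)
for url, hits in rank.most_common():
    print(hits, url)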

