read DNS server python

archiso

I have a list var that I would like to be a list of all websites (from Google DNS). Can I do this?

 


I have a feeling that you might have a problem with some other part of it. You can definitely do a lookup on all IPv4 addresses, but I'm not sure 1 TB of disk space would be enough for a complete DNS. There is a reason why they don't ask for updates from each other and instead just ask where to find it.


8 minutes ago, archiso said:

I have a list var that I would like to be a list of all websites (from Google DNS). Can I do this?

 

So from the estimate on this page there are about 360 million domains, which means you would need just 2.9 TB of storage for the IPv4 addresses:
How Many Domains Are There? - Domain Name Stats for 2021 - Make A Website Hub


26 minutes ago, AlfonsLM said:

So from the estimate on this page there are about 360 million domains, which means you would need just 2.9 TB of storage for the IPv4 addresses:
How Many Domains Are There? - Domain Name Stats for 2021 - Make A Website Hub

@archiso you should also implement a save to disk every x elements, or else you will need a lot of RAM and will probably have the issues that come along with that. 360M entries is definitely a list you don't want sitting in memory on a normal system.
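A minimal sketch of that idea in Python, assuming you just want to flush a buffer to a plain text file every so often (the file name and chunk size are made-up placeholders):

# Sketch: flush results to disk every chunk_size entries instead of
# holding everything in RAM. File name and chunk size are placeholders.
chunk_size = 100_000
buffer = []

def flush():
    if buffer:
        with open("results.txt", "a", encoding="utf-8") as f:
            f.write("\n".join(buffer) + "\n")
        buffer.clear()

def record(entry):
    buffer.append(entry)
    if len(buffer) >= chunk_size:
        flush()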


2 minutes ago, Ulysses499 said:

@archiso you should also implement a save to disk every x elements, or else you will need a lot of RAM and will probably have the issues that come along with that. 360M entries is definitely a list you don't want sitting in memory on a normal system.

I would not want to be the one building the search index you'd need to make the list useful, either. Can't see any use for it if you can't at least get an answer in less time than it takes to ask a DNS server through the TOR network.


2 hours ago, archiso said:

I have a list var that I would like to be a list of all websites (from Google DNS). Can I do this?

When you do a normal DNS query, you send a host name and the DNS server returns the associated IP address (A record).

When you do a reverse DNS query, you send an IP address and the DNS server returns the associated host name (PTR record).

 

There is typically only a single PTR record associated with an IP address, even though there may be multiple A records pointing to that IP. This means you cannot realistically use this to compile a list of all known domain names.

 

~edit: There's another issue:

If you were to do a reverse DNS lookup of every possible IPv4 address, you'd be querying upwards of 4.2 billion IP addresses. The DNS server is most likely going to blacklist you in short order. To avoid this you could try to slow down and/or spread the queries to other DNS servers. But keep in mind how long this would take.

 

Even if you were able to do one request every 10 milliseconds it would take you more than 1.33 years to query all of them. The faster you go, the more likely you'll end up blocked. But if you slow down to one request every 100 ms, we're already talking about 13.3+ years to test every possible IP address.
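For reference, a single reverse lookup can be done in Python with the standard library's socket module; a minimal sketch (8.8.8.8 is just an example address, and most IPs simply have no PTR record):

import socket

# Reverse DNS lookup (PTR record) for one IPv4 address.
# Most addresses have no PTR record and raise socket.herror.
try:
    hostname, aliases, addresses = socket.gethostbyaddr("8.8.8.8")
    print(hostname)
except socket.herror:
    print("no PTR record")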


1 hour ago, Eigenvektor said:

When you do a normal DNS query, you send a host name and the DNS server returns the associated IP address (A record).

When you do a reverse DNS query, you send an IP address and the DNS server returns the associated host name (PTR record).

 

There is typically only a single PTR record associated with an IP address, even though there may be multiple A records pointing to that IP. This means you cannot realistically use this to compile a list of all known domain names.

 

~edit: There's another issue:

If you were to do a reverse DNS lookup of every possible IPv4 address, you'd be querying upwards of 4.2 billion IP addresses. The DNS server is most likely going to blacklist you in short order. To avoid this you could try to slow down and/or spread the queries to other DNS servers. But keep in mind how long this would take.

 

Even if you were able to do one request every 10 milliseconds it would take you more than 1.33 years to query all of them. The faster you go, the more likely you'll end up blocked. But if you slow down to one request every 100 ms, we're already talking about 13.3+ years to test every possible IP address.

 

3 hours ago, AlfonsLM said:

I would not want to be the one building the search index you'd need to make the list useful, either. Can't see any use for it if you can't at least get an answer in less time than it takes to ask a DNS server through the TOR network.

 

3 hours ago, Ulysses499 said:

@archiso you should also implement a save to disk every x elements, or else you will need a lot of RAM and will probably have the issues that come along with that. 360M entries is definitely a list you don't want sitting in memory on a normal system.

 

3 hours ago, AlfonsLM said:

So from the estimate on this page there are about 360 million domains, which means you would need just 2.9 TB of storage for the IPv4 addresses:
How Many Domains Are There? - Domain Name Stats for 2021 - Make A Website Hub

 

3 hours ago, AlfonsLM said:

I have a feeling that you might have a problem with some other part of it. You can definitely do a lookup on all IPv4 addresses, but I'm not sure 1 TB of disk space would be enough for a complete DNS. There is a reason why they don't ask for updates from each other and instead just ask where to find it.

Ok. This all makes sense. The reason for doing this is that I'm trying to create a search engine. I'm not actually going to use it, this is just for fun. So I needed a way to have a list of the websites to search.


5 minutes ago, archiso said:

Ok. This all makes sense. The reason for doing this is that I'm trying to create a search engine. I'm not actually going to use it, this is just for fun. So I needed a way to have a list of the websites to search.

Oof, well, what ranking algorithm did you want to use? With big data this is the more complex part. You can use any web crawler on the popular sites and gather the addresses most people use, but after that, where does it go? A search engine needs more than the DNS, and you'd wind up with petabytes of data to look through...


3 minutes ago, Ulysses499 said:

Oof, well, what ranking algorithm did you want to use? With big data this is the more complex part. You can use any web crawler on the popular sites and gather the addresses most people use, but after that, where does it go? A search engine needs more than the DNS, and you'd wind up with petabytes of data to look through...

Currently my search engine just runs a basic search algorithm over a URL by saving its contents to a var using urllib, and loops that over a list of URLs.


2 minutes ago, archiso said:

Currently my search engine just runs a basic search algorithm over a URL by saving its contents to a var using urllib, and loops that over a list of URLs.

Hmm okay, since you are basically searching for the first entry with a URL that contains the searched-for term, I would advise you to use the web crawler strategy with a "core" list of websites and a maximum depth.


8 minutes ago, archiso said:

Ok. This all makes sense. The reason for doing this is that I'm trying to create a search engine. I'm not actually going to use it, this is just for fun. So I needed a way to have a list of the websites to search.

You could try to build a very simple web crawler for this. You could simply enter a list of known websites as a starting point.

 

What the crawler should do:

  1. Visit the website(s) you entered/it found previously
  2. Download the (presumably HTML) document at each URL
  3. Scan the document for any (new) URLs it can find
  4. Go back to 1

You should take some steps to prevent it from running in circles, i.e. remember the URLs you've recently visited and scanned and don't scan them again for some time.

 

The next step would be to actually index the contents of the documents you have scanned, so you can then search through them.
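As a rough illustration of those four steps, a very naive breadth-first sketch in Python might look something like this (the seed URL, page limit and the crude href regex are placeholders, not a production crawler):

import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=50):
    # steps 1/2: visit each URL and download the document
    # step 3: scan it for new absolute URLs
    # step 4: queue them up and repeat, remembering what we've seen
    seen = set(seed_urls)
    queue = deque(seed_urls)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to load
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(crawl(["https://example.com"], max_pages=10))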


Just now, Ulysses499 said:

Hmm okay, since you are basically searching for the first entry with a URL that contains the searched-for term, I would advise you to use the web crawler strategy with a "core" list of websites and a maximum depth.

I'm actually using the count() function to find the number of results and ordering the URLs by the count.


2 minutes ago, archiso said:

I'm actually using the count() function to find the number of results and ordering the URLs by the count.

So something like

// Pseudocode
search(term){
    for(website in listOfSites){
        countResults(website, term)
    }
    return highestCount
}

? The problem with this is that you don't really return the "best" result, just the one with the most hits.

E.g. if I search for "Linus Tech Tips", this forum will pop up rather than the YouTube channel.

Why don't you implement the basic PageRank algorithm? 🙂


2 minutes ago, Ulysses499 said:

So something like

// Pseudocode
search(term){
    for(website in listOfSites){
        countResults(website, term)
    }
    return highestCount
}

? The problem with this is that you don't really return the "best" result, just the one with the most hits.

E.g. if I search for "Linus Tech Tips", this forum will pop up rather than the YouTube channel.

Why don't you implement the basic PageRank algorithm? 🙂

this is what it is rn:

def search1(url, search):
    # download the page, strip HTML tags, then count occurrences of each search word
    toRead = urlr.urlopen(url)
    content = str(toRead.read())[1:]
    content = remove(content)
    nonewline = str(content)
    content = nonewline.replace(' ', "")
    searchlst = search.split()
    relevance = 0
    for query in searchlst:
        relevance += content.count(query)
    containlst.append(relevance)
    print(containlst)

def remove(test_str):
    # strip anything between '<' and '>' (very crude HTML tag removal)
    ret = ''
    skip1c = 0
    for i in test_str:
        if i == '<':
            skip1c += 1
        elif i == '>' and skip1c > 0:
            skip1c -= 1
        elif skip1c == 0:
            ret += i
    return ret

containlst = []  # one relevance score per URL
search = input("Search:\n")
urls = ['url1', 'url2', 'url3']
for url in urls:
    search1(url, search)

 


1 minute ago, archiso said:

this is what it is rn:


def search1(url, search):
    # download the page, strip HTML tags, then count occurrences of each search word
    toRead = urlr.urlopen(url)
    content = str(toRead.read())[1:]
    content = remove(content)
    nonewline = str(content)
    content = nonewline.replace(' ', "")
    searchlst = search.split()
    relevance = 0
    for query in searchlst:
        relevance += content.count(query)
    containlst.append(relevance)
    print(containlst)

def remove(test_str):
    # strip anything between '<' and '>' (very crude HTML tag removal)
    ret = ''
    skip1c = 0
    for i in test_str:
        if i == '<':
            skip1c += 1
        elif i == '>' and skip1c > 0:
            skip1c -= 1
        elif skip1c == 0:
            ret += i
    return ret

containlst = []  # one relevance score per URL
search = input("Search:\n")
urls = ['url1', 'url2', 'url3']
for url in urls:
    search1(url, search)

 

What is urlr?


Just now, Ulysses499 said:

What is urlr?

oh sorry:

import urllib.request as urlr

that should be at the top


1 minute ago, archiso said:

oh sorry:


import urllib.request as urlr

that should be at the top

Seems like a very demanding search engine. If you are doing this for fun, I would suggest a more phased design.

Pseudocode

def crawl(startList):
    saveList = startList
    depth = 1
    while depth < 10:
        for item in saveList:
            content = urlr.urlopen(item)
            for link in content:
                if link not in saveList:
                    saveList.append(link)
        depth += 1
    return saveList

def search(term):
    pass  # implement your search algorithm here

 


Just now, Ulysses499 said:

Seems like a very demanding search engine. If you are doing this for fun, I would suggest a more phased design.

Pseudocode


def crawl(startList):
    saveList = startList
    depth = 1
    while depth < 10:
        for item in saveList:
            content = urlr.urlopen(item)
            for link in content:
                if link not in saveList:
                    saveList.append(link)
        depth += 1
    return saveList

def search(term):
    pass  # implement your search algorithm here

 

so input some base sites then it gathers more sites, makes sense.


3 minutes ago, Ulysses499 said:

Seems like a very demanding search engine. If you are doing this for fun, I would suggest a more phased design.

Pseudocode


def crawl(startList):
    saveList = startList
    depth = 1
    while depth < 10:
        for item in saveList:
            content = urlr.urlopen(item)
            for link in content:
                if link not in saveList:
                    saveList.append(link)
        depth += 1
    return saveList

def search(term):
    pass  # implement your search algorithm here

 

One problem: how does it know what is a URL/link and what's not?


9 minutes ago, archiso said:

One problem: how does it know what is a URL/link and what's not?

You could look for www. or http or .com etc., then keep saving characters until you hit a space or a closing quote, and you have a URL.
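A rough sketch of that heuristic with a regular expression (the sample text is made up): grab anything starting with http(s):// or www. and stop at the first whitespace or quote.

import re

text = 'see https://example.com/page and also www.example.org for more'

# crude heuristic: runs of non-whitespace, non-quote characters
# starting with http(s):// or www.
urls = re.findall(r'(?:https?://|www\.)[^\s"\'<>]+', text)
print(urls)  # ['https://example.com/page', 'www.example.org']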


1 minute ago, Ulysses499 said:

You could look for www. or http or .com etc., then keep saving characters until you hit a space or a closing quote, and you have a URL.

I think I could just look for the items after href="url" in the HTML code.


Just now, archiso said:

I think I could just look for the items after href="url" in the HTML code.

Or that yes. 🙂 just make sure to check if " is escaped
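If you go the href route, the standard library's html.parser already deals with attribute quoting and escaping for you; a minimal sketch of pulling out href values (the sample HTML is made up):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # collect the href attribute of every <a> tag
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="https://example.com">home</a> <a href="/docs">docs</a>')
print(parser.links)  # ['https://example.com', '/docs']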


Just now, Ulysses499 said:

Or that yes. 🙂 just make sure to check if " is escaped

yep


12 hours ago, Ulysses499 said:

Or that yes. 🙂 just make sure to check if " is escaped

And if it's a relative href you need to add the domain back on.
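For the relative-href case, urllib's urljoin does exactly that; a quick sketch (the base URL is made up):

from urllib.parse import urljoin

base = "https://example.com/forum/topic/123"      # page the link was found on
print(urljoin(base, "/profile/archiso"))          # https://example.com/profile/archiso
print(urljoin(base, "page-2"))                    # https://example.com/forum/topic/page-2
print(urljoin(base, "https://other.example/x"))   # absolute links pass through unchanged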


14 hours ago, archiso said:

Currently my search engine just runs a basic search algorithm over a URL by saving its contents to a var using urllib, and loops that over a list of URLs.

You can use the old Google algorithm and keep track of how many times the crawler finds a given page, and rank the search results from that. Also, it might be worth considering how you will find your search words before building the database. One really easy way, and definitely not the best, would be to break down the page header and use that. Might be worth checking e.g. images for metadata too.
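A minimal sketch of the "count how often the crawler finds a page" idea, assuming the crawler hands you every link it encounters, repeats included (the URLs are placeholders):

from collections import Counter

# hypothetical stream of links the crawler encountered, repeats included
found_links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    "https://example.com/a",
]

# rank pages by how many times they were linked to
rank = Counter(found_links)
for url, hits in rank.most_common():
    print(hits, url)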

