web crawler help

archiso

I have made a search engine/web crawler (code: https://github.com/archiso7/Search-Engine), but I have a couple of problems:

1. It is VERY slow, so I want to make it multithreaded.

2. The crawl function (on line 67 of Fun.py) does not work as intended. The problems are: the while(depth < 10): loop only runs once inside the main for loop, and when it does run it just loops over the same website 10 times.

First, I believe you linked the wrong repo.

 

Second, I see that you're using bs4. Try using the lxml parser instead of the default one (there's a quick sketch at the end of this post). You can read more about it here: https://thehftguy.com/2020/07/28/making-beautifulsoup-parsing-10-times-faster/

 

Third, multithreading will only speed up your code by roughly the number of cores you have in the best case (and only if you actually use multiprocessing), but you'll still be doing blocking operations. Try to use async wherever you can first; that should make things much faster than multiprocessing by itself.
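
Here's the kind of change I mean for the parser, as a minimal sketch (the URL and variable names are made up for illustration, not taken from your repo):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL, just for illustration
url = "https://example.com"
html = requests.get(url, timeout=10).text

# "lxml" is a faster parser backend than the default "html.parser";
# it needs `pip install lxml` to be available.
soup = BeautifulSoup(html, "lxml")

# Collect all hyperlinks on the page
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```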

4 hours ago, archiso said:

the crawl function (on line 67 of Fun.py) does not work as intended. The problems are: the while(depth < 10): loop only runs once inside the main for loop, and when it does run it just loops over the same website 10 times.

You shouldn't be editing your variable while looping over it.

1 hour ago, igormp said:

You shouldn't be editing your variable while looping over it.

What do you mean by that?

23 minutes ago, archiso said:

What do you mean by that?

You are looping over your savelst list and appending values to it while doing so, which can cause unwanted results. It's similar to what's explained here: https://stackoverflow.com/a/1637875/4542015
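
Here's a tiny illustration of the pattern (the list contents are made up, not your actual savelst data):

```python
savelst = ["a", "b", "c"]

# BAD: appending to the list you are iterating over.
# The for loop keeps seeing the newly added items, so it runs far
# longer than intended (or never behaves the way you expect).
# for item in savelst:
#     savelst.append(item + "!")

# One safe pattern: iterate over a snapshot and collect new items separately.
new_items = []
for item in list(savelst):      # list(...) makes a copy to iterate over
    new_items.append(item + "!")
savelst.extend(new_items)

print(savelst)  # ['a', 'b', 'c', 'a!', 'b!', 'c!']
```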

The cases where you can safely iterate over and alter a collection at the same time, no matter the language, are very limited, and it has to be done in a specific way. For example, when appending you have to use an index-based loop that starts at the last element and goes down to the first. That makes it safe, yes, but it also makes it harder for the next person reading the code to figure out why you did it that way unless they are somewhat experienced.

 

For your speed issue you are pretty limited, because AFAIK with Python multithreading you are limited to your number of cores, or a bit less. In fully featured languages like C++, C#, or Java you can run more threads than you have cores, especially for web work, since you can have a couple thousand callbacks on a stack without any issues. For example, in C# I am able to nearly saturate my 1 Gbps internet connection while scraping web pages. You usually start to see diminishing returns once you hit a hardware limit (in my case it was usually RAM, as I couldn't store any more data with a mere 48 GB).

1 hour ago, Franck said:

The cases where you can safely iterate over and alter a collection at the same time, no matter the language, are very limited, and it has to be done in a specific way. For example, when appending you have to use an index-based loop that starts at the last element and goes down to the first. That makes it safe, yes, but it also makes it harder for the next person reading the code to figure out why you did it that way unless they are somewhat experienced.

 

For your speed issue you are pretty limited, because AFAIK with Python multithreading you are limited to your number of cores, or a bit less. In fully featured languages like C++, C#, or Java you can run more threads than you have cores, especially for web work, since you can have a couple thousand callbacks on a stack without any issues. For example, in C# I am able to nearly saturate my 1 Gbps internet connection while scraping web pages. You usually start to see diminishing returns once you hit a hardware limit (in my case it was usually RAM, as I couldn't store any more data with a mere 48 GB).

So I should use a different language for this project?

2 hours ago, Franck said:

For your speed issue you are pretty limited, because AFAIK with Python multithreading you are limited to your number of cores, or a bit less. In fully featured languages like C++, C#, or Java you can run more threads than you have cores, especially for web work, since you can have a couple thousand callbacks on a stack without any issues. For example, in C# I am able to nearly saturate my 1 Gbps internet connection while scraping web pages. You usually start to see diminishing returns once you hit a hardware limit (in my case it was usually RAM, as I couldn't store any more data with a mere 48 GB).

You can easily saturate a 1 Gbit connection with Python, I've done that before. You just need to design your code structure properly so there are no blockers, which is not the case with OP's code.

 

35 minutes ago, archiso said:

So I should use a different language for this project?

You can easily do what you want with Python. As I said before, refactor your code to use lxml and make every blocking operation async instead of doing it sequentially. No matter the language, you're going to get the same slow result if you keep waiting for a page to load before going on to the next one.

5 minutes ago, igormp said:

You can easily saturate a 1 Gbit connection with Python, I've done that before. You just need to design your code structure properly so there are no blockers, which is not the case with OP's code.

 

You can easily do what you want with Python. As I said before, refactor your code to use lxml and make every blocking operation async instead of doing it sequentially. No matter the language, you're going to get the same slow result if you keep waiting for a page to load before going on to the next one.

What do you mean by async? (Sorry, I'm kind of new to this.)

1 minute ago, archiso said:

What do you mean by async? (Sorry, I'm kind of new to this.)

Instead of doing a GET on a page and waiting for its result before going to the next one, you fire it off, leave it running in the background (checking once in a while whether the result has arrived yet), and keep going with the rest of your code until it's ready.

 

Have a look at asyncio in Python. You'll probably end up using aiohttp.
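
A minimal sketch of what a single async fetch looks like (assuming aiohttp is installed; the URL is just a placeholder):

```python
import asyncio
import aiohttp

async def fetch(session, url):
    # session.get() and response.text() are awaited, so while this
    # coroutine waits on the network, other coroutines can run.
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, "https://example.com")  # placeholder URL
        print(len(html))

asyncio.run(main())
```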

11 minutes ago, igormp said:

Instead of doing a GET on a page and waiting for its result before going to the next one, you fire it off, leave it running in the background (checking once in a while whether the result has arrived yet), and keep going with the rest of your code until it's ready.

 

Have a look at asyncio in Python. You'll probably end up using aiohttp.

Is the response.text from aiohttp in a form that Beautiful Soup can read?

3 minutes ago, archiso said:

Is the response.text from aiohttp in a form that Beautiful Soup can read?

Yes. Going from your regular sync code to async is not that hard; you'll just have to get rid of those loops and use tasks instead. I did the same thing a couple of weeks ago.

1 minute ago, igormp said:

Yes. Going from your regular sync code to async is not that hard; you'll just have to get rid of those loops and use tasks instead. I did the same thing a couple of weeks ago.

OK. My main problem right now is that I can't figure out how to crawl websites only down to a certain depth.

1 hour ago, igormp said:

Yes. Going from your regular sync code to async is not that hard; you'll just have to get rid of those loops and use tasks instead. I did the same thing a couple of weeks ago.

What are tasks?

@igormp how would I make it so that it runs multiple GET requests at once, and then, when each one loads, runs the rest of the code on it?

4 hours ago, archiso said:

@igormp how would I make it so that it runs multiple GET requests at once, and then, when each one loads, runs the rest of the code on it?

You need your list of sites to visit, enqueue each one with the function that will crawl it, and then call some async function to process all of those. I hope I've given you enough pointers to keep going by yourself.
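
Roughly like this, as a sketch (crawl_page here is a stand-in for whatever your crawl function does with a page, and the site list is a placeholder):

```python
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def crawl_page(session, url):
    # Placeholder crawl logic: fetch the page and pull out its links.
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "lxml")
    return [a["href"] for a in soup.find_all("a", href=True)]

async def crawl_all(urls):
    async with aiohttp.ClientSession() as session:
        # Wrap each URL in a task; they all run concurrently,
        # and gather() waits until every one has finished.
        tasks = [asyncio.create_task(crawl_page(session, u)) for u in urls]
        return await asyncio.gather(*tasks)

# Placeholder list of sites to visit.
sites = ["https://example.com", "https://example.org"]
results = asyncio.run(crawl_all(sites))
```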

3 minutes ago, igormp said:

You need your list of sites to visit, enqueue each one with the function that will crawl it, and then call some async function to process all of those. I hope I've given you enough pointers to keep going by yourself.

Yep, I will try. I'm going to try to use a cluster computer for this, so I might use JavaScript or C# instead of Python.

5 minutes ago, archiso said:

Yep, I will try. I'm going to try to use a cluster computer for this, so I might use JavaScript or C# instead of Python.

Just my 2 cents: you're putting the cart before the horse. Your performance will be shit and you'll have a hard time getting stuff to work due to all the complexity required.

15 hours ago, archiso said:

OK. My main problem right now is that I can't figure out how to crawl websites only down to a certain depth.

When you first visit a new domain (example.com), set depth to zero. When you find/follow a URL that leads to another document on the same domain (example.com/somepath), increment depth by one. When depth exceeds your configured maximum, don't follow further URLs that link to the same host.

 

I would store URLs along with their depth and whether (and when) you've already indexed them. If you find a shorter route to a URL, you might want to upgrade its depth. That way you can keep track of whether you need to (re-)index a document and whether you should follow links to the same domain.

 

It might be prudent to go breadth first, meaning that once you've collected URLs, you first follow those leading to other domains instead of descending deeper on the current domain. Too many hits on the same domain in short order might be seen as a DoS attempt and could lead to further requests being blocked.
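
A rough sketch of that depth bookkeeping, synchronous and simplified (fetch_links here is a placeholder for however you extract the links from a page, not a function from the repo):

```python
from collections import deque
from urllib.parse import urlparse

MAX_DEPTH = 3  # example limit

def crawl(start_url, fetch_links):
    # fetch_links(url) -> list of URLs found on that page (placeholder).
    seen = {start_url: 0}            # url -> shallowest depth found so far
    queue = deque([(start_url, 0)])  # breadth-first frontier

    while queue:
        url, depth = queue.popleft()
        for link in fetch_links(url):
            # Same host: the link costs one more level of depth.
            # Different host: treat it as a fresh start at depth 0.
            same_host = urlparse(link).netloc == urlparse(url).netloc
            new_depth = depth + 1 if same_host else 0

            if same_host and new_depth > MAX_DEPTH:
                continue  # too deep on this domain, don't follow
            if link not in seen or new_depth < seen[link]:
                seen[link] = new_depth  # record the shallower route
                queue.append((link, new_depth))

    return seen
```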

 

8 hours ago, igormp said:

Just my 2 cents: you're putting the cart before the horse. Your performance will be shit and you'll have a hard time getting stuff to work due to all the complexity required.

I agree. There's a good strategy: make it run, make it right, make it fast. Meaning: first you build something that does what you need; then you worry about making it work correctly, covering edge cases, error handling and so on; then you start worrying about performance.

 

Of course it doesn't hurt to keep those "future requirements" in mind when you build something, so that you don't end up with a design that you have to tear down completely before you can get there.

