Is there any way to automatically collect specific text from the internet?

Seabottom

What I want is to take any article, news story, research paper, blog post and literally anything else on the internet that contains a specific word, and output it all into the same .txt file.

For instance, say I want anything with the word "fish" in it. The program trawls through the entire internet (or at least the most popular places) and dumps the entire article, or whatever the text is, containing "fish" into the same file.

In the end, I would have a single .txt file with a billion lines and an enormous number of words.

 

The purpose of this is to create a list of commonly used words that I will be using in my ongoing work.

 

Is something like this even possible?


You could probably make a macro to do it, but it would take absolutely forever to gather a billion lines


45 minutes ago, Seabottom said:

The program trawls through the entire internet (or at least the most popular places)

What are the most popular places on the internet? Do you want to include, exclude or limit yourself to the web?

Write in C.


38 minutes ago, Derrk said:

You could probably make a macro to do it, but it would take absolutely forever to gather a billion lines

I don't know what you mean by macro. If you mean writing a program, then yes, it is plausible; if you mean something like AutoHotkey scripting, then it's a lot more complicated, I reckon.

 

4 minutes ago, Dat Guy said:

What are the most popular places on the internet? Do you want to include, exclude or limit yourself to the web?

Popular places, well, basically anything that's indexed, you know, anything that appears in Google's search results. Or if that's still too much, other sources would include news sites like the BBC, the NY Times and hundreds of others.

I don't know what you mean by the last bit. The only way to get data like this is from the web, right?


3 minutes ago, Seabottom said:

The only way to get data like this is from the web, right?

No, not at all. What about Gopher? Usenet? Files available on IRC? Files available on FTP servers? Probably even torrents and eMule? Indexing "the internet" won't be that easy.

 

5 minutes ago, Seabottom said:

If you mean writing a program, then yes, it is plausible

Depends. You might want to have a local copy of all indexed resources for performance reasons. Enjoy mirroring the internet. Otherwise, it's a loop that looks like this:

#! /bin/sh
curl <something> | grep <searchterm>

And it will take a long time, depending on your list of <something>s.

Write in C.


I have some experience writing web spiders/scrapers in Python to crawl and scrape websites. What you're asking for is actually pretty easy to implement. The only issue is that you want to crawl the entire web. Do you realize how big the web is? Even the most popular places have millions of pages; that's millions of requests. If you go balls to the wall at full speed, expect to be blocked instantly. The big boys don't like bots.

 

That being said, if you're still up for the challenge, then a nice framework for this kind of thing is Scrapy. Here's some links to get you started:

- Python

- Scrapy

- A Proxy Middleware
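
Just to give a rough idea beyond the links above, a minimal spider might look something like the sketch below. The start URL, the keyword "fish" and the output file name are placeholders I've made up, and it assumes the text you care about sits in <p> tags:

import scrapy

class KeywordSpider(scrapy.Spider):
    name = "keyword"
    start_urls = ["https://example.com/"]  # hypothetical starting page
    keyword = "fish"

    def parse(self, response):
        # Collect the visible paragraph text on the page.
        text = " ".join(response.css("p::text").getall())
        # Keep the page only if it mentions the keyword.
        if self.keyword.lower() in text.lower():
            with open("dump.txt", "a", encoding="utf-8") as f:
                f.write(text + "\n")
        # Follow the links on the page to keep crawling.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

You would run it with "scrapy runspider" and point start_urls at whatever sites you actually want to cover.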

 


58 minutes ago, codesidian said:

I have some experience writing web spiders/scrapers in Python to crawl and scrape websites. What you're asking for is actually pretty easy to implement. The only issue is that you want to crawl the entire web. Do you realize how big the web is? Even the most popular places have millions of pages; that's millions of requests. If you go balls to the wall at full speed, expect to be blocked instantly. The big boys don't like bots.

 

That being said, if you're still up for the challenge, then a nice framework for this kind of thing is Scrapy. Here's some links to get you started:

- Python

- Scrapy

- A Proxy Middleware

 

I looked at the Scrapy website, and I think that's just what I need. The only problem is I don't know Python, although I could learn it over time.

I know the web is incomprehensibly big, so I'll probably do one website at a time. I see that Scrapy can be configured with a more human scan rate, and even so, the GitHub project you link to will come in handy as well.


To my knowledge this is going to be a lifetime project. From time to time I write tools to extract datasets from websites, but with every website's layout being different, I have to write a slightly different interpreter, or the software ends up taking advertisements as data and so on. It's finicky enough for one or two sites, but on such a large scale I'm getting a migraine just thinking about it. Maybe someone has some uber-powerful way to do it, but not by reading and interpreting the sites.

"You know it'll clock down as soon as it hits 40°C, right?" - "Yeah ... but it doesnt hit 40°C ... ever  😄"

 

GPU: MSI GTX1080 Ti Aero @ 2 GHz (watercooled) CPU: Ryzen 5600X (watercooled) RAM: 32GB 3600Mhz Corsair LPX MB: Gigabyte B550i PSU: Corsair SF750 Case: Hyte Revolt 3

 


48 minutes ago, Seabottom said:

I looked at the Scrapy website, and I think that's just what I need. The only problem is I don't know Python, although I could learn it over time.

I know the web is incomprehensibly big, so I'll probably do one website at a time. I see that Scrapy can be configured with a more human scan rate, and even so, the GitHub project you link to will come in handy as well.

Python's one of the easiest languages to learn imho, and if you've coded before it'll be a breeze. The only hard bit is understanding how the framework works and how crawling works. Scrapy does have built-in AutoThrottle, which is great, but even without AutoThrottle, sending hundreds of requests a second, a crawl will still take a long time. You could also implement a cycling user agent along with your proxies if you want to go full speed, but that's pretty unethical.
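
For reference, politely throttling a crawl in Scrapy only takes a few lines in the project's settings.py. The numbers below are just example values to tune, not recommendations:

# settings.py sketch: slow the crawl to a more human rate
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay between requests, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for roughly one request in flight per site
DOWNLOAD_DELAY = 2                      # baseline delay even when throttling adapts
ROBOTSTXT_OBEY = True                   # respect each site's robots.txt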

 

Regarding the actual logic, you probably wouldn't tailor your algorithm to each specific site if you don't care too much about how clean your data is. Just visit, search for the keyword, dump, crawl, repeat. You'll probably want to dump the contents of <p> tags, or whatever else you're wanting to extract.


You can download Wikipedia to your computer.

You can download books (legally) from Project Gutenberg: http://www.gutenberg.org/

You can download loads of books, magazines and scientific material from Genlib (check the menus, they have torrent dumps of all content, terabytes of books and scientific material): http://gen.lib.rus.ec/

 

You can build a list of your favorite online newspapers and whatever, and then use website downloaders to download the websites, but it would take ages unless you rent a bunch of servers from Amazon, for example, and have each one work on individual websites to speed things up.

 

Python, PHP, even Visual Basic / Free Pascal are programming languages that are super easy to use to parse text and pages once downloaded, and to extract/separate content, etc.

 


3 hours ago, Dat Guy said:

What about Gopher? Usenet? Files available on IRC? Files available on FTP servers? Probably even torrents and eMule? Indexing "the internet" won't be that easy.

Hint, hint.

Write in C.


If you know JavaScript/NodeJS this could be a fun project; you could set up a simple command-line program to do this. Just thinking on the fly here: most newer websites have a search bar, so take note of how they search (for most you just post a simple query), send your query to each of them, collect the first X articles and dump each one into a folder. Boom, research machine!

 

You really wouldn't need to use any external libraries; most languages have some form of DOM parser/manipulator (a tool for reading and manipulating web pages) and an HTTP library for making requests.
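
The same idea sketched in Python's standard library, just to show that the built-in HTTP client and HTML parser are enough on their own; the URL, keyword and output file here are made-up placeholders:

import urllib.request
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collects the text inside <p> tags."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

url = "https://example.com/some-article"  # hypothetical article URL
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

parser = ParagraphCollector()
parser.feed(html)
text = " ".join(parser.chunks)

# Append the article text to the dump file if it mentions the keyword.
if "fish" in text.lower():
    with open("dump.txt", "a", encoding="utf-8") as f:
        f.write(text + "\n")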

 

It will take some work and some fine tuning for sure, but ultimately you could customize it to include thousands of sites and make your own academic search engine.


Thank you for the answers, everyone. I know now that it is very possible to do what I want, considering there are so many different suggestions by now.

I have already created a program in NodeJS that builds a list of words and the number of times each appears, like this:

The: 3764

And: 2076

To: 2028

For: 1975

....

 

You get the idea, so I'll be looking at using NodeJS, after I've had a look at the freely available resources mentioned by @mariushm

@datguy

 

Thank you all


On 4/10/2019 at 6:44 PM, Seabottom said:

Thank you for the answers, everyone. I know now that it is very possible to do what I want, considering there are so many different suggestions by now.

I have already created a program in NodeJS that builds a list of words and the number of times each appears, like this:

The: 3764

And: 2076

To: 2028

For: 1975

....

 

You get the idea, so I'll be looking at using NodeJS, after I've had a look at the freely available resources mentioned by @mariushm

@datguy

 

Thank you all

One thing I would say is to look at removing filler words if context is not important (at this stage). Also, where you may be reading user-generated content, remember that typos happen, so be aware your count may (or will) be off until this is taken into account.

 

This may help streamline your database. However, if you are scanning the web, don't forget to parse robots.txt to make sure you don't get into too much trouble.
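
Python's standard library even includes a robots.txt parser, so the check is only a few lines; something like the sketch below, where the bot name and URLs are placeholders:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()

# Only fetch a page if the site's robots.txt allows our bot to.
if rp.can_fetch("MyWordBot", "https://example.com/some/article"):
    print("allowed to crawl this page")
else:
    print("disallowed, skip this page")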

 

A final thought is to look at different formats, i.e. PDF, DOCX, RTF, etc., and possibly OCR for images.

 

And from previous experience, Python is awesome for string manipulation; failing that, take the raw file > strings > strip > split.
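
For instance, the raw file > strip > split > count step could be as small as the sketch below; the file name and the little stop-word list are made up, so extend them to whatever filler words you want to drop:

from collections import Counter
import re

# Hypothetical filler words to remove before counting.
STOP_WORDS = {"the", "and", "to", "for", "a", "of", "in", "is"}

with open("dump.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(w for w in words if w not in STOP_WORDS)

# Print the most common words in the same "Word: count" style as above.
for word, count in counts.most_common(20):
    print(f"{word.capitalize()}: {count}")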

 

