Webscraping with Python

Yo yo, I'm trying to webscrape the shitttttt out of Yahoo finance using python and urllib.request and they blocked me with a 503 error, which I'm given to understand means my IP doesn't work? ANYWAY, this is a thing I want to be able to make into an exe and put online someday and I don't wanna be limited by the silly webscraping rules. Is there an IP randomiser module? Would a VPN module work? Or should I try Tor? Help me out please guys, much appreciated! Or perhaps is there a better way for me to get on demand/live data on stocks etc?

1 minute ago, willnburger said:

Yo yo, I'm trying to webscrape the shitttttt out of Yahoo finance using python and urllib.request and they blocked me with a 503 error, which I'm given to understand means my IP doesn't work? ANYWAY, this is a thing I want to be able to make into an exe and put online someday and I don't wanna be limited by the silly webscraping rules. Is there an IP randomiser module? Would a VPN module work? Or should I try Tor? Help me out please guys, much appreciated! Or perhaps is there a better way for me to get on demand/live data on stocks etc?

You do know that you can't just scrape anything you like, right? You can get sued and face pretty hefty fines if you scrape stuff you're not allowed to scrape. If they are blocking you, there's a good chance you're not allowed to scrape it.

PSU tier list // Motherboard tier list // Community Standards 

My System:

Spoiler

AMD Ryzen 5 3600, Gigabyte RTX 3060TI Gaming OC ProFractal Design Meshify C TG, 2x8GB G.Skill Ripjaws V 3200MHz, MSI B450 Gaming Plus MaxSamsung 850 EVO 512GB, 2TB WD BlueCorsair RM850x, LG 27GL83A-B

3 minutes ago, martward said:

You do know that you can't just scrape anything you like, right? You can get sued and face pretty hefty fines if you scrape stuff you're not allowed to scrape. If they are blocking you, there's a good chance you're not allowed to scrape it.

I'm pretty sure it's fine... it's all stuff that's on their pages, just some data

3 minutes ago, willnburger said:

I'm pretty sure it's fine... it's all stuff that's on their pages, just some data

That's not how it works. Just because you can access it via your browser does not mean you can copy it and use it for whatever you like.

@willnburger Yahoo Finance apparently used to have an API. There are alternatives, however. I'd suggest you look into those, since they are easier and more robust than scraping (one change in the HTML and CSS of the website can break your entire scraper) and probably legal.

1 minute ago, martward said:

@willnburger Yahoo Finance apparently used to have an API. There are alternatives, however. I'd suggest you look into those, since they are easier and more robust than scraping (one change in the HTML and CSS of the website can break your entire scraper) and probably legal.

What are the alternatives? Is there a module I could use? Or should I be paying for data vending?

26 minutes ago, willnburger said:

Is there an IP randomiser module?

There is no such thing; you'd have to get a new IP address from your ISP.

 

27 minutes ago, willnburger said:

Would a VPN module work? Or should I try Tor?

Temporarily, until they ban that IP again.

 

27 minutes ago, willnburger said:

Or perhaps is there a better way for me to get on demand/live data on stocks etc?

Contact Yahoo and pay them for the data. Or add a delay to your code after every request.
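A minimal sketch of the "add a delay after every request" suggestion, assuming plain urllib.request as in the original post. The function name fetch_with_delay, the 5-second default and the injectable fetch/sleep parameters are illustrative assumptions, not anything Yahoo documents; a suitable delay depends on the site's terms.

```python
import time
import urllib.request


def fetch_with_delay(urls, delay_seconds=5.0, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    if fetch is None:
        # Default fetcher: a plain urllib.request GET that returns the body bytes.
        def fetch(url):
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()

    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

Passing fetch and sleep in as parameters is only there so the pacing logic can be tried without touching the network; in normal use you would call fetch_with_delay(list_of_urls) and leave the defaults alone.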

Hand, n. A singular instrument worn at the end of the human arm and commonly thrust into somebody’s pocket.

https://www.datadriveninvestor.com/2019/02/25/6-alternatives-to-the-yahoo-finance-api/

 

I'm not familiar with financial data, so I can't really give you advice on particular APIs and what they do. The article says they are (mostly) free, so I'd start there.

1 minute ago, WereCatf said:

There is no such thing; you'd have to get a new IP address from your ISP.

 

Temporarily, until they ban that IP again.

 

Contact Yahoo and pay them for the data. Or add a delay to your code after every request.

If I pay them for the data will I get a constant live stream of it? Can I then distribute it however I like?

1 minute ago, willnburger said:

If I pay them for the data will I get a constant live stream of it? Can I then distribute it however I like?

You'd have to ask them, I have no idea.

Just now, WereCatf said:

You'd have to ask them, I have no idea.

Ok

1 minute ago, WereCatf said:

You'd have to ask them, I have no idea.

How do I contact them then?

 

1 hour ago, willnburger said:

Yo yo, I'm trying to webscrape the shitttttt

Friendly reminder: you might want to use less annoying language if you're seeking advice here.

 

1 hour ago, willnburger said:

they blocked me with a 503 error

A 503 is not a "block".

 

1 hour ago, willnburger said:

How do I contact them then?

https://io.help.yahoo.com/contact/index?page=home&locale=en_US&y=PROD_FIN

Write in C.

10 hours ago, Dat Guy said:

A 503 is not a "block".

It can be, if the server just wants to send 500-series statuses to banned IPs. Heck, I like to send status 501 for everything.
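Either way, it helps to look at what the server actually sent back. HTTP 503 officially means "Service Unavailable", and a server may attach a Retry-After header to it. Here is a rough sketch, assuming urllib.request as in the original post; the helper names describe_failure and try_fetch are made up for illustration.

```python
import urllib.error
import urllib.request


def describe_failure(error):
    """Summarise an HTTPError: status code, reason, and any Retry-After header."""
    retry_after = error.headers.get("Retry-After") if error.headers else None
    return {"status": error.code, "reason": error.reason, "retry_after": retry_after}


def try_fetch(url):
    """Return (body, None) on success or (None, failure summary) on an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read(), None
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx/5xx responses instead of returning them.
        return None, describe_failure(error)
```

The status code alone won't tell you whether you were banned or the server was just overloaded, but logging it together with Retry-After is a cheap first step before guessing.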

 

For scraping, you just want something quick. A scripting language suits this purpose better; C is not ideal.

Sudo make me a sandwich 

17 hours ago, Erik Sieghart said:

I mean yes and no. It's a bunch of legal grey area. Technically all he's scraping is facts off of publicly available web pages, which isn't copyrightable. There's no law that exists that explicitly mentions scraping or web crawling.

 

A good practice for web crawling/scraping is to respect robots.txt: https://developers.google.com/search/reference/robots_txt

True, I don't know anything about the data he's trying to scrape, so it could be the case that it's perfectly fine. However, just because there are no explicit rules doesn't mean it's automatically legal; there have been lawsuits. There was one involving Craigslist where, if I understand correctly, the court ruled that the fact that Craigslist was trying to keep the scrapers out meant the data wasn't strictly public anymore, or something like that (I'm not going to pretend I understand the whole thing).

If you're not distributing the data nobody's going to care, however if you do chances are someone is getting pissed off at some point.
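On the robots.txt point from the quote: Python's standard library ships urllib.robotparser, so a scraper can check a URL before requesting it. A small sketch follows; the robots.txt content, the "my-scraper" user agent and the example.com URLs are made up for illustration and are not Yahoo's actual rules.

```python
import urllib.robotparser

# Invented robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
""".splitlines()


def build_parser(robots_txt_lines):
    """Build a parser from robots.txt lines that are already in memory."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed_quote = parser.can_fetch("my-scraper", "https://example.com/quote/ABC")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed_quote, allowed_private)
```

Against a live site you would instead call parser.set_url(...) with the site's robots.txt address and then parser.read(). Note that passing this check says nothing about the site's terms of service.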

The data is kinda useless either way, because what you see publicly is delayed by 15-30 minutes... it's not useful for trading and all that.

Yahoo has some terms of service that you accept when you log in. You may not have to log in to see that information which is public, but they reserve the right to block visitors that abuse their services (which is what you're doing by automating requests and not using the service as a regular human person would use it)

I suspect you also requested the data constantly instead of requesting every 10-30s and caching the data into some local database or text file, to reduce the number of requests.
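To make the caching idea concrete, here is a rough sketch that keeps the last result in a local JSON text file and only re-requests once it is older than a chosen age. The name cached_fetch, the file layout and the 30-second default are assumptions for illustration; fetch is whatever function actually performs the request.

```python
import json
import time
from pathlib import Path


def cached_fetch(key, fetch, cache_path, max_age_seconds=30.0, now=time.time):
    """Return a cached value for key if it is fresh, otherwise call fetch(key)."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}

    entry = cache.get(key)
    if entry is not None and now() - entry["fetched_at"] < max_age_seconds:
        return entry["value"]  # Fresh enough: no new request is made.

    value = fetch(key)  # Stale or missing: make exactly one request.
    cache[key] = {"value": value, "fetched_at": now()}
    path.write_text(json.dumps(cache))
    return value
```

With this in place, calling cached_fetch in a tight loop still results in at most one real request per key every max_age_seconds. The values must be JSON-serialisable for the text file to work.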

 

They may have an API available if you pay a monthly or yearly fee, which may or may not give you a smaller delay (say, data that is 1 or 5 minutes old) and more requests per hour (say, 1 request per second).

An API is also much easier to use: you simply request something like  get?currency=euro&year=2020  and the API responds with an XML or JSON file containing the value of the euro against the dollar for every day of 2020 so far, with one entry per hour or even finer granularity. You get just that data, not a 500 KB HTML file with banners, logos, news articles and all that stuff, so there's nothing to scrape.
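As a rough illustration of that difference, the sketch below builds the get?currency=euro&year=2020 query from the example and reads a JSON reply. Everything specific here is hypothetical: the api.example.invalid host, the shape of the JSON and the numbers in the sample response are invented placeholders, not a real provider's API or real exchange rates.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.invalid/get"  # hypothetical endpoint

# Invented placeholder payload; a real API defines its own fields and values.
SAMPLE_RESPONSE = (
    '{"currency": "euro", "rates": ['
    '{"date": "2020-01-02", "usd": 1.12}, '
    '{"date": "2020-01-03", "usd": 1.11}]}'
)


def build_url(currency, year):
    """Build the query URL, letting urlencode handle escaping of the parameters."""
    return f"{BASE_URL}?{urlencode({'currency': currency, 'year': year})}"


def parse_rates(body):
    """Turn the JSON text into a {date: rate} dictionary."""
    payload = json.loads(body)
    return {row["date"]: row["usd"] for row in payload["rates"]}


print(build_url("euro", 2020))
print(parse_rates(SAMPLE_RESPONSE))
```

Compare that to a scraper: there is no HTML to walk through and nothing that breaks when the page layout changes, only a documented set of fields to read.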

 

Definitely look into the legal issues around scraping, but you can get around IP blocks with proxy IP rotation. There's a lot of documentation on the web about doing it in Python. There are many free proxies, but to make it work smoothly you'll need a paid proxy service.

  • 1 month later...

With urllib.request, what you are basically doing is sending raw requests. Remember that many web apps respond to a request with only sparse HTML and a lot of JS, since most of the work is done in JS (in those web apps), and the HTML might be generated from that JS. Plus, you need to keep a session alive or hold on to cookies if you want to be logged in.

My advice is to simulate a user with mechanize and use regular expressions to create your own functions to scrape the pages, or help yourself with BeautifulSoup. Remember that the response might change depending on your browser (I mean the one you simulate).

This is the best way to go that I could think of, since you are not just scraping; instead, you are simulating a user.
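For the parsing half of that advice, here is a small sketch of pulling table cells out of HTML. To keep it dependency-free it uses the standard library's html.parser as a stand-in for BeautifulSoup, which would do the same job with less code. The CellCollector class and the sample HTML are invented for illustration, and this only works on markup that is actually present in the response, not on content that JS generates later.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Start a new, empty cell and begin recording its text.
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


# Invented sample markup, not taken from any real page.
SAMPLE_HTML = "<table><tr><td>ABC</td><td>12.34</td></tr></table>"

collector = CellCollector()
collector.feed(SAMPLE_HTML)
print(collector.cells)
```

A proper parser like this tends to survive small markup changes better than regular expressions do, though it still breaks whenever the page structure changes.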

On 1/9/2020 at 2:40 PM, Erik Sieghart said:

I mean yes and no. It's a bunch of legal grey area.

Just an update on this: in the US, a court ruled that scraping public data is legal [0].

 

[0] https://parsers.me/us-court-fully-legalized-website-scraping-and-technically-prohibited-it/

                     ¸„»°'´¸„»°'´ Vorticalbox `'°«„¸`'°«„¸
`'°«„¸¸„»°'´¸„»°'´`'°«„¸Scientia Potentia est  ¸„»°'´`'°«„¸`'°«„¸¸„»°'´
