Webscraping with Python

willnburger · January 9, 2020

Yo yo, I'm trying to webscrape the shitttttt out of Yahoo finance using python and urllib.request and they blocked me with a 503 error, which I'm given to understand means my IP doesn't work? ANYWAY, this is a thing I want to be able to make into an exe and put online someday and I don't wanna be limited by the silly webscraping rules. Is there an IP randomiser module? Would a VPN module work? Or should I try Tor? Help me out please guys, much appreciated! Or perhaps is there a better way for me to get on demand/live data on stocks etc?

martward · January 9, 2020

1 minute ago, willnburger said:

Yo yo, I'm trying to webscrape the shitttttt out of Yahoo finance using python and urllib.request and they blocked me with a 503 error, which I'm given to understand means my IP doesn't work? ANYWAY, this is a thing I want to be able to make into an exe and put online someday and I don't wanna be limited by the silly webscraping rules. Is there an IP randomiser module? Would a VPN module work? Or should I try Tor? Help me out please guys, much appreciated! Or perhaps is there a better way for me to get on demand/live data on stocks etc?

You do know that you can't just scrape anything you like, right? As in, you can get sued and face pretty huge fines if you scrape stuff you're not allowed to scrape. If they are blocking you, there's a good chance you're not allowed to scrape it.

willnburger · January 9, 2020

3 minutes ago, martward said:

You do know that you can't just scrape anything you like, right? As in, you can get sued and face pretty huge fines if you scrape stuff you're not allowed to scrape. If they are blocking you, there's a good chance you're not allowed to scrape it.

I'm pretty sure it's fine... it's all stuff that's on their pages, just some data

martward · January 9, 2020

3 minutes ago, willnburger said:

I'm pretty sure it's fine... it's all stuff that's on their pages, just some data

That's not how it works, just because you can access it via your browser does not mean you can copy it and use it for whatever you like.

martward · January 9, 2020

@willnburger Yahoo Finance apparently used to have an API; however, there are alternatives. I'd suggest you look into those alternatives, since they are easier and more robust than scraping (one change in the HTML and CSS of the website can destroy your entire scraper) and probably legal.

willnburger · January 9, 2020

1 minute ago, martward said:

@willnburger Yahoo Finance apparently used to have an API; however, there are alternatives. I'd suggest you look into those alternatives, since they are easier and more robust than scraping (one change in the HTML and CSS of the website can destroy your entire scraper) and probably legal.

What are the alternatives? Is there a module I could use? Or should I be paying for data vending?

WereCatf · January 9, 2020

26 minutes ago, willnburger said:

Is there an IP randomiser module?

There is no such thing; you'd have to get a new IP address from your ISP.

27 minutes ago, willnburger said:

Would a VPN module work? Or should I try Tor?

Temporarily, until they ban that IP again.

27 minutes ago, willnburger said:

Or perhaps is there a better way for me to get on demand/live data on stocks etc?

Contact Yahoo and pay them for the data. Or add a delay to your code after every request.
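
The "add a delay after every request" suggestion could look roughly like the sketch below. It is only an illustration: the function name fetch_all, the 2-second default and the injectable fetch/sleep parameters are assumptions made here so the loop can be tried without touching a real site, and nothing in it reflects what Yahoo actually permits.

```python
import time
import urllib.request


def fetch_all(urls, delay_seconds=2.0, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing after every request."""
    if fetch is None:
        # Default fetcher: plain urllib.request, as in the original question.
        def fetch(url):
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()

    results = {}
    for url in urls:
        results[url] = fetch(url)
        sleep(delay_seconds)  # wait before sending the next request
    return results
```

Passing your own fetch and sleep callables makes it easy to test the pacing logic offline before pointing it at anything real.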

martward · January 9, 2020

https://www.datadriveninvestor.com/2019/02/25/6-alternatives-to-the-yahoo-finance-api/

I'm not familiar with financial data, so I can't really give you advice on particular APIs and what they do. The article says they are (mostly) free, so I'd start there.

willnburger · January 9, 2020

1 minute ago, WereCatf said:

There is no such thing; you'd have to get a new IP address from your ISP.

Temporarily, until they ban that IP again.

Contact Yahoo and pay them for the data. Or add a delay to your code after every request.

If I pay them for the data will I get a constant live stream of it? Can I then distribute it however I like?

willnburger · January 9, 2020

1 minute ago, martward said:

https://www.datadriveninvestor.com/2019/02/25/6-alternatives-to-the-yahoo-finance-api/

I'm not familiar with financial data, so I can't really give you advice on particular APIs and what they do. The article says they are (mostly) free, so I'd start there.

Much appreciated

WereCatf · January 9, 2020

1 minute ago, willnburger said:

If I pay them for the data will I get a constant live stream of it? Can I then distribute it however I like?

You'd have to ask them, I have no idea.

willnburger · January 9, 2020

Just now, WereCatf said:

You'd have to ask them, I have no idea.

Ok

willnburger · January 9, 2020

1 minute ago, WereCatf said:

You'd have to ask them, I have no idea.

How do I contact them then?

Dat Guy · January 9, 2020

1 hour ago, willnburger said:

Yo yo, I'm trying to webscrape the shitttttt

Friendly reminder: you might want to use less annoying language if you're looking for advice here.

1 hour ago, willnburger said:

they blocked me with a 503 error

A 503 is not a "block".

1 hour ago, willnburger said:

How do I contact them then?

https://io.help.yahoo.com/contact/index?page=home&locale=en_US&y=PROD_FIN

wasab · January 9, 2020

Why would you want to scrape the shit out of yahoo??

wasab · January 9, 2020

10 hours ago, Dat Guy said:

A 503 is not a "block".

It can be, if the server just wants to send 500-series statuses to banned IPs. Heck, I like to send status 501 for everything.

For scraping, you just want something quick. Script suits this purpose better. C is not ideal.
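
On the 503 back-and-forth above: in the HTTP spec a 503 means "Service Unavailable", and the server may attach a Retry-After header, while explicit refusals usually show up as 403 or 429. A server is free to send whatever it likes, so the status alone doesn't settle it. A minimal sketch of looking at what actually came back with urllib, where the helper names and the 30-second fallback are assumptions of this example:

```python
import urllib.error
import urllib.request


def retry_after_seconds(error, default=30.0):
    """Return the wait suggested by a numeric Retry-After header, else default."""
    headers = error.headers
    value = headers.get("Retry-After") if headers is not None else None
    try:
        return float(value)
    except (TypeError, ValueError):
        # Header missing, or given as an HTTP date rather than seconds.
        return default


def fetch_status(url):
    """Return (status_code, body_or_None) instead of raising on HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        return error.code, None
```

If the 503 comes with a Retry-After value, waiting that long before trying again is the polite reading of it.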

martward · January 10, 2020

17 hours ago, Erik Sieghart said:

I mean yes and no. It's a bunch of legal grey area. Technically all he's scraping is facts off of publicly available web pages, which isn't copyrightable. There's no law that exists that explicitly mentions scraping or web crawling.

A good practice in web crawling/scraping is to respect robots.txt: https://developers.google.com/search/reference/robots_txt

True, I don't know anything about the data he's trying to scrape, so it could be the case that it's perfectly fine. However, just because there are no rules doesn't mean it's automatically legal; there have been lawsuits. There was one involving Craigslist where they, if I understand correctly, ruled that the fact that Craigslist was trying to keep the scrapers out meant that it wasn't strictly public anymore, or something like that (I'm not going to pretend I understand the whole thing).

If you're not distributing the data nobody's going to care, however if you do chances are someone is getting pissed off at some point.
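
For the robots.txt point quoted above, Python's standard library already ships a parser. The sketch below checks a URL against rules passed in as text; the helper name is_allowed, the user agent string and the example rules are made up for illustration. Against a live site you would call set_url() and read() on the parser instead of parse().

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Check whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, not taken from any real site.
example_rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(example_rules, "my-bot", "https://example.com/private/page"))
```

Note that robots.txt only states the site's crawling preferences; it doesn't answer the legal questions discussed in this thread.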

mariushm · January 10, 2020

The data is kinda useless either way, because what you see publicly is delayed by 15-30 minutes, so it's not useful for trading and all that.

Yahoo has some terms of service that you accept when you log in. You may not have to log in to see that information which is public, but they reserve the right to block visitors that abuse their services (which is what you're doing by automating requests and not using the service as a regular human person would use it)

I suspect you also requested the data constantly instead of requesting every 10-30s and caching the data into some local database or text file, to reduce the number of requests.
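
A rough sketch of that caching idea, assuming a JSON file as the "local database" and a 30-second freshness window. The names cached_fetch, cache_path and max_age_seconds are invented for this example, and fetch stands in for whatever function actually retrieves the data:

```python
import json
import time
from pathlib import Path


def cached_fetch(key, fetch, cache_path, max_age_seconds=30.0, now=time.time):
    """Return cached data for key if still fresh; otherwise fetch and store it."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}

    entry = cache.get(key)
    if entry is not None and now() - entry["fetched_at"] < max_age_seconds:
        return entry["data"]  # fresh enough, no request needed

    data = fetch()  # the only place a real request would happen
    cache[key] = {"fetched_at": now(), "data": data}
    path.write_text(json.dumps(cache))
    return data
```

Anything asking for the same key within the window is served from disk, so the remote server sees at most one request per key per window.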

They may have an API available to you if you pay a monthly or yearly fee, which may or may not give you a smaller delay (say, data that is 1 or 5 minutes old) and more requests per hour (say, 1 request per second).

An API is also much easier to use: you send something like get?currency=euro&year=2020 and the API responds with an XML or JSON file containing the value of the euro against the dollar for every day of 2020 so far (one entry per hour, or even finer granularity). You get just that data, not a 500 KB HTML file with banners, logos, news articles and all that stuff, so there's nothing to scrape.
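
To make the get?currency=euro&year=2020 idea concrete, here is a small sketch of building such a query and reading a JSON reply. Everything about the service is hypothetical: the base URL, the parameter names and the response shape (a list of objects with "date" and "value" fields) are assumptions for illustration, not any real provider's API.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, **params):
    """Append URL-encoded query parameters to base_url."""
    return f"{base_url}?{urlencode(params)}"


def parse_rates(json_text):
    """Turn a JSON list of {"date": ..., "value": ...} objects into a dict."""
    return {row["date"]: row["value"] for row in json.loads(json_text)}


# Hypothetical endpoint and sample payload, for illustration only.
url = build_query_url("https://api.example.invalid/get", currency="euro", year=2020)
sample = '[{"date": "2020-01-02", "value": 1.12}]'
print(url)
print(parse_rates(sample))
```

Compared with scraping, the parsing step is a single json.loads call and doesn't break when the site's layout changes.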

Sjaakie · January 11, 2020

Definitely look into the legal issues around scraping, but you can get around IP blocks with proxy IP rotation. There's a lot of documentation on the web about doing it in Python. There are many free proxies, but to make it work smoothly you'll need a paid proxy service.

qxZap · February 13, 2020

With urllib.request, what you are basically doing is sending requests. Remember that many web apps respond to a request with only sparse HTML and a lot of JS, since most of the work (in those web apps) is done in JS, and the HTML may be generated from that JS. Plus, you need to keep a session alive or hold cookies if you want to be logged in.

My advice is to simulate a user with mechanize (http://wwwsearch.sourceforge.net/mechanize/) and use regular expressions to write your own functions to scrape the pages, or help yourself with BeautifulSoup. Remember that the response might change depending on your browser (I mean the one you might simulate).

This is the best way to go that I could think of, since you are not scraping; instead you are simulating a user.
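
The session and cookie part of this can also be sketched with the standard library alone, without mechanize. This is a swapped-in alternative, not the approach recommended in the post, and it does not execute JavaScript, so JS-generated pages will still come back mostly empty. The function name build_session and the user agent string are assumptions of this example.

```python
import http.cookiejar
import urllib.request


def build_session(user_agent):
    """Create an opener that keeps cookies between requests and sets a User-Agent."""
    cookies = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookies)
    )
    # Replace the default "Python-urllib/x.y" header with the given string.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookies


opener, cookies = build_session("example-client/0.1")
# opener.open(url) would now send the header and store any cookies it receives.
```

Reusing the same opener for every request is what keeps the session alive, because cookies set by one response are sent back with the next request.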

vorticalbox · February 13, 2020

On 1/9/2020 at 2:40 PM, Erik Sieghart said:

I mean yes and no. It's a bunch of legal grey area.

Just an update on this: in the US, a court ruled that scraping public web data is legal [0].

[0] https://parsers.me/us-court-fully-legalized-website-scraping-and-technically-prohibited-it/
