
Python Beautiful Soup Indiegogo?!

paps511

Hey All,

 

I'm trying to scrape some data from Indiegogo for my final class project, and I am struggling hard.

I basically need to scrape an ended campaign for things like "Title, Story, Category, Asking funding, Received funding" and then write it to a CSV.

 

Here is my code, full of errors.... What are you guys seeing?

# importing libraries
import nltk
from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
import csv

# creating CSV file to be used

file = open(os.path.expanduser(r"~/Desktop/igg_test.csv"), "w+", encoding = 'utf-8')
writer = csv.writer(file)
file.write(b"Title,Story,Category,Asked,Raised,Successful" + b"\n")

campaigns = ['aero-digital-earphone-upgrade-to-hifi-music-now/#/'
            ]

start_urls = [ 'https://www.indiegogo.com/projects/%s' % s for s in campaigns]

url = ['https://www.indiegogo.com/projects/aero-digital-earphone-upgrade-to-hifi-music-now/#/']

num_reviews = 1 # Number of reviews you want for each restaurant
page_order = range(0, (num_reviews+1), 40)

titles = []
categories = []
stories = []
asks = []
raiseds = []
#successfuls = []

for ur in start_urls:
    for o in page_order:
        page = urllib.request.urlopen(ur + ("?start=%s" % o))
        #page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, "html.parser")

        titles = titles + soup.findAll(attrs={"class":"campaignHeader-title ng-binding"})
        categories = categories + soup.findAll(attrs={"class":"campaignHeader-bylineComponent ng-scope"})
        stories = stories + soup.findAll(attrs={"class":"i-description"})
        asks = asks + soup.findAll(attrs={"class":"campaignGoalTech-percent ng-binding"})
        raiseds = raiseds + soup.findAll(attrs={"class":"campaignGoalTech-fundsAmount ng-binding"})
        #successfuls = successfuls + soup.findAll(attrs={"class":"review-content"})

for rb in titles:
    #print(rb.meta['content'])
    rb.p.contents = str(rb.p.contents)
    rb.p.contents = rb.p.contents.replace("<br>", "").replace("</br>",
"").replace("[", "").replace("]", "").replace(".", "")
    print(rb.p.contents)
    writer.writerow([rb.p.contents] + [rb.meta['content']])        

file.close()

The first error comes from writing the header line, but I don't really care about that one:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-8710ab00a2be> in <module>()
      3 file = open(os.path.expanduser(r"~/Desktop/igg_test.csv"), "w+", encoding = 'utf-8')
      4 writer = csv.writer(file)
----> 5 file.write(b"Title,Story,Category,Asked,Raised,Successful" + b"\n")
      6 
      7 campaigns = ['aero-digital-earphone-upgrade-to-hifi-music-now/#/'

TypeError: write() argument must be str, not bytes
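
For what it's worth, that first error has a simple cause: the file was opened in text mode (an `encoding` was given), so `write()` expects `str`, while `b"..."` is `bytes`. Below is a minimal sketch of one way around it, letting `csv.writer` write the header row itself. The helper name `write_header` is made up for illustration, and an in-memory buffer stands in for the file on the Desktop.

```python
import csv
import io

HEADER = ["Title", "Story", "Category", "Asked", "Raised", "Successful"]


def write_header(text_file):
    """Write the CSV header to a file opened in text mode and return the writer."""
    writer = csv.writer(text_file)
    writer.writerow(HEADER)  # a list of str values, not bytes
    return writer


# With a real file, newline="" lets the csv module control the line endings:
# with open(path, "w", encoding="utf-8", newline="") as f:
#     writer = write_header(f)

buffer = io.StringIO()  # stands in for the real file in this sketch
write_header(buffer)
header_line = buffer.getvalue()
print(repr(header_line))
```

Using `writer.writerow` for the header also keeps quoting consistent with the data rows written later.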

These errors are the ones I am concerned with:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-3-e0e42bb83620> in <module>()
     24 for ur in start_urls:
     25     for o in page_order:
---> 26         page = urllib.request.urlopen(ur + ("?start=%s" % o))
     27         #page = urllib.request.urlopen(url)
     28         soup = BeautifulSoup(page, "html.parser")

//anaconda/lib/python3.5/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    160     else:
    161         opener = _opener
--> 162     return opener.open(url, data, timeout)
    163 
    164 def install_opener(opener):

//anaconda/lib/python3.5/urllib/request.py in open(self, fullurl, data, timeout)
    469         for processor in self.process_response.get(protocol, []):
    470             meth = getattr(processor, meth_name)
--> 471             response = meth(req, response)
    472 
    473         return response

//anaconda/lib/python3.5/urllib/request.py in http_response(self, request, response)
    579         if not (200 <= code < 300):
    580             response = self.parent.error(
--> 581                 'http', request, response, code, msg, hdrs)
    582 
    583         return response

//anaconda/lib/python3.5/urllib/request.py in error(self, proto, *args)
    507         if http_err:
    508             args = (dict, 'default', 'http_error_default') + orig_args
--> 509             return self._call_chain(*args)
    510 
    511 # XXX probably also want an abstract factory that knows when it makes

//anaconda/lib/python3.5/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    441         for handler in handlers:
    442             func = getattr(handler, meth_name)
--> 443             result = func(*args)
    444             if result is not None:
    445                 return result

//anaconda/lib/python3.5/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    587 class HTTPDefaultErrorHandler(BaseHandler):
    588     def http_error_default(self, req, fp, code, msg, hdrs):
--> 589         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    590 
    591 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 416: Requested Range Not Satisfiable

 

 Thoughts?

CPU: Intel Core i7 8700k CPU Cooler: Corsair Hydro Series H100i Mobo:  Memory: G.Skill Ripjaws X 32GB 2133 Storage #1: 1TB 850 EVO SSD Storage #2: Western Digital Black 2TB Storage #3: Western Digital Green 4TB GPU: Gigabyte 980 Ti G1 Case: Mastercase5 PSU: EVGA 750 W G2 80+Gold Keyboard: Corsair K70 RGB Cherry MX Brown Mouse: Razer Deathadder Elite Monitor: LG 34UM94 Headset: Bose

Phone: Samsung Galaxy S9


Anyone? Where my python wizards at?


After some quick playing around with this, I get the same HTTP error 416 whenever I run urllib.request.urlopen("http://www.indiegogo.com/"), and with every other Indiegogo page I try, but it works for other websites.  I also get the same thing with the Linux wget command (which downloads things from the internet for you).  I've tried to read up on what exactly error 416 is, but I don't fully understand it, so I can't entirely help you there.  A few things I've found indicate that the client (in this case, Python or wget) is requesting a range of bytes that the server (Indiegogo) cannot provide, but nothing I've found goes into more detail than that, so at the moment I have no idea what can be done about it.  Regardless, it's almost certainly not a problem with your Python code, but something about connecting to Indiegogo.

 

Do you specifically have to do an Indiegogo project, or can you use another site like Kickstarter?  It might be easier to do that rather than figure out what the exact source of this problem is and how it can be fixed, if you're allowed to.

 

Edit: Error 416 seems to occur when some client requests some data/bytes from a server that do not actually exist on that server.  E.g., bytes 1000-1100 of a file that's only 700 bytes long.  I don't know why this error would be popping up for the entire webpage, though; I can get to it with my browser just fine.
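
To make the byte-range idea concrete, here is a small sketch. `range_is_satisfiable`, `describe_http_error` and `try_fetch` are hypothetical helper names, and the rule encoded is only the basic check for a single "first-last" range (it must start inside the resource), not a full Range header parser. It shows how to look at the error instead of getting a traceback; it does not explain why Indiegogo answers a plain page request this way.

```python
import urllib.error
import urllib.request


def range_is_satisfiable(first_byte, resource_length):
    """A single 'first-last' byte range can be served only if it starts inside the resource."""
    return 0 <= first_byte < resource_length


def describe_http_error(err):
    """Summarise an HTTPError so the status and any Content-Range header are visible."""
    headers = err.headers
    content_range = headers.get("Content-Range") if headers is not None else None
    return "HTTP %d %s (Content-Range: %s)" % (err.code, err.reason, content_range)


def try_fetch(url):
    """Return the page bytes, or a one-line description of the HTTP error."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        return describe_http_error(err)


# The example from the post: bytes 1000-1100 of a 700-byte file.
print(range_is_satisfiable(1000, 700))
# A range that starts at byte 0 of the same file.
print(range_is_satisfiable(0, 700))
```

A server sending 416 is expected to include a `Content-Range` header giving the real length, so printing it via `try_fetch` may give a hint about what the server thinks was requested.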


15 hours ago, Azgoth 2 said:


 

Do you specifically have to do an Indiegogo project, or can you use another site like Kickstarter?  It might be easier to do that rather than figure out what the exact source of this problem is and how it can be fixed, if you're allowed to.

 


I can't do Kickstarter because someone else is already doing that as a topic.

 

The other options are http://prosper.com/ and kiva.org, but I don't know where to find a list of completed loans. We need to be able to scrape descriptions and the amount funded.


Use the requests library instead of those low-level ones. That should make the code easier to write. It also helps to write the code top-down, split into functions or class methods whose names say what they do.
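
A rough sketch of what that could look like, with the top-level `scrape` function first and the named steps below it. All function names here are made up for illustration, the two CSS class names are copied from the original post, and only two of the fields are handled. The browser-like `User-Agent` is an assumption worth trying, not a confirmed fix for the 416 error. The `ng-binding` classes suggest the page may be filled in by JavaScript, in which case the downloaded HTML might not contain the values at all.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup

# Assumption: a browser-like User-Agent; whether Indiegogo responds differently is untested.
HEADERS = {"User-Agent": "Mozilla/5.0 (class project scraper)"}
FIELDS = ["Title", "Raised"]


def scrape(urls, text_file):
    """Top-level step: fetch every page, parse it, and write the rows as CSV."""
    rows = [parse_campaign(fetch_html(url)) for url in urls]
    write_campaigns(rows, text_file)


def fetch_html(url):
    """Download one page and raise an exception on 4xx/5xx responses."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text


def parse_campaign(html):
    """Extract the title and raised amount; missing elements become empty strings."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find(class_="campaignHeader-title")
    raised = soup.find(class_="campaignGoalTech-fundsAmount")
    return {
        "Title": title.get_text(strip=True) if title else "",
        "Raised": raised.get_text(strip=True) if raised else "",
    }


def write_campaigns(rows, text_file):
    """Write a header plus one CSV line per campaign dict."""
    writer = csv.DictWriter(text_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Offline demonstration with hand-written HTML, so nothing is downloaded here.
SAMPLE_HTML = (
    '<h1 class="campaignHeader-title ng-binding">Aero</h1>'
    '<span class="campaignGoalTech-fundsAmount ng-binding">$1,000</span>'
)
sample_row = parse_campaign(SAMPLE_HTML)
sample_csv = io.StringIO()
write_campaigns([sample_row], sample_csv)
print(sample_csv.getvalue())
```

Keeping the download, the parsing and the CSV writing in separate functions means the parsing can be checked against saved HTML even while the site refuses the request.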

 

 

 


It seems like Indiegogo won't let us scrape, so we have switched our project to Kiva.org.

 

I'm going to be making a new topic later today with some other questions. It seems to be working much better, but there are still some issues I am running into.


You could use Selenium to script a real browser to go through Indiegogo and fetch the required data.


1 hour ago, riklaunim said:

You could use Selenium to script a real browser to go through Indiegogo and fetch the required data.

We have to use Python for the project... =(


1 hour ago, riklaunim said:

Selenium is usable from Python and is often used for front-end integration testing. For example, in https://github.com/django-ckeditor/django-ckeditor/blob/master/ckeditor_demo/demo_application/tests.py I use Selenium to test CKEditor and its file-uploading widget together with a Django backend.

That looks a bit above my level, but the assignment changed to Kiva.org to be a bit more compatible.

 

If you want to share some thoughts on that topic, it's here: https://linustechtips.com/main/topic/577273-kiva-beautiful-soup-issues/

 

I am also having some issues with a list loop. I'm not sure how to edit items in a list, and keep them in the list.

 

I have a list of strings with a lot of garbage in it. I made a for loop to edit down to just the useful bits, but I don't know how to put them back into a list.

 

for item in url_list:
    item = str(item)
    item = item.split('href="',1)[-1]
    item = item.split('"',1)[0]
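
The loop above only rebinds the name `item` on each pass, so the list itself never changes. One way around that, sketched below, is to build a new list with a list comprehension, or to assign back into the list by index. The helper name `extract_href` and the sample strings are made up for illustration; the splitting logic is the same as in the loop.

```python
def extract_href(item):
    """Return the text between href=" and the next double quote."""
    text = str(item)
    return text.split('href="', 1)[-1].split('"', 1)[0]


# Hypothetical stand-ins for the scraped items.
url_list = [
    '<a class="loan" href="https://example.org/lend/1">First</a>',
    '<a href="https://example.org/lend/2" class="loan">Second</a>',
]

# Option 1: build a new list and leave the original untouched.
cleaned = [extract_href(item) for item in url_list]

# Option 2: overwrite each slot of the existing list by index.
for index, item in enumerate(url_list):
    url_list[index] = extract_href(item)

print(cleaned)
```

If the items are BeautifulSoup tags rather than plain strings, `tag.get("href")` is probably a more direct way to read the link than splitting the text.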

