Python Beautiful Soup Indiegogo?!

paps511 · April 2, 2016

Hey All,

I'm trying to scrape some data from indiegogo for my final class project, and I am struggling hard.

I basically need to scrape an ended campaign for things like "Title, Story, Category, Asking funding, Received funding" and then write it to a CSV.

Here is my code, full of errors.... What are you guys seeing?

# importing libraries
import nltk
from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
import csv

# creating CSV file to be used

file = open(os.path.expanduser(r"~/Desktop/igg_test.csv"), "w+", encoding = 'utf-8')
writer = csv.writer(file)
file.write(b"Title,Story,Category,Asked,Raised,Successful" + b"\n")

campaigns = ['aero-digital-earphone-upgrade-to-hifi-music-now/#/'
            ]

start_urls = [ 'https://www.indiegogo.com/projects/%s' % s for s in campaigns]

url = ['https://www.indiegogo.com/projects/aero-digital-earphone-upgrade-to-hifi-music-now/#/']

num_reviews = 1 # Number of reviews you want for each restaurant
page_order = range(0, (num_reviews+1), 40)

titles = []
categories = []
stories = []
asks = []
raiseds = []
#successfuls = []

for ur in start_urls:
    for o in page_order:
        page = urllib.request.urlopen(ur + ("?start=%s" % o))
        #page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, "html.parser")

        titles = titles + soup.findAll(attrs={"class":"campaignHeader-title ng-binding"})
        categories = categories + soup.findAll(attrs={"class":"campaignHeader-bylineComponent ng-scope"})
        stories = stories + soup.findAll(attrs={"class":"i-description"})
        asks = asks + soup.findAll(attrs={"class":"campaignGoalTech-percent ng-binding"})
        raiseds = raiseds + soup.findAll(attrs={"class":"campaignGoalTech-fundsAmount ng-binding"})
        #successfuls = successfuls + soup.findAll(attrs={"class":"review-content"})

for rb in titles:
    #print(rb.meta['content'])
    rb.p.contents = str(rb.p.contents)
    rb.p.contents = rb.p.contents.replace("<br>", "").replace("</br>",
"").replace("[", "").replace("]", "").replace(".", "")
    print(rb.p.contents)
    writer.writerow([rb.p.contents] + [rb.meta['content']])        

file.close()

The first error comes from writing the header line, but I don't really care about that one:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-8710ab00a2be> in <module>()
      3 file = open(os.path.expanduser(r"~/Desktop/igg_test.csv"), "w+", encoding = 'utf-8')
      4 writer = csv.writer(file)
----> 5 file.write(b"Title,Story,Category,Asked,Raised,Successful" + b"\n")
      6 
      7 campaigns = ['aero-digital-earphone-upgrade-to-hifi-music-now/#/'

TypeError: write() argument must be str, not bytes
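
For what it's worth, this TypeError happens because the file is opened in text mode (an encoding is given), so write() expects a str rather than bytes. Below is a minimal sketch of one way around it: send the header through the csv writer instead of calling file.write() directly. It writes to an in-memory io.StringIO buffer rather than ~/Desktop/igg_test.csv purely to stay self-contained, and the data row and the write_rows helper are made up for illustration.

```python
import csv
import io

# Column names taken from the header line in the original post.
HEADER = ["Title", "Story", "Category", "Asked", "Raised", "Successful"]


def write_rows(handle, rows):
    """Write the header and then each row through a single csv.writer."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)   # a list of str, not a bytes literal
    writer.writerows(rows)


# In-memory stand-in for the real file (assumption, for the example only).
buffer = io.StringIO(newline="")
write_rows(buffer, [
    # Made-up row; the comma inside the story is quoted by the csv module.
    ["Example title", "Example story, with a comma", "Audio", "20000", "25000", "yes"],
])
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, the same helper could be called on open(path, "w", newline="", encoding="utf-8"), which is how the csv module documentation recommends opening files for writing.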

These errors are the ones I am concerned with:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-3-e0e42bb83620> in <module>()
     24 for ur in start_urls:
     25     for o in page_order:
---> 26         page = urllib.request.urlopen(ur + ("?start=%s" % o))
     27         #page = urllib.request.urlopen(url)
     28         soup = BeautifulSoup(page, "html.parser")

//anaconda/lib/python3.5/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    160     else:
    161         opener = _opener
--> 162     return opener.open(url, data, timeout)
    163 
    164 def install_opener(opener):

//anaconda/lib/python3.5/urllib/request.py in open(self, fullurl, data, timeout)
    469         for processor in self.process_response.get(protocol, []):
    470             meth = getattr(processor, meth_name)
--> 471             response = meth(req, response)
    472 
    473         return response

//anaconda/lib/python3.5/urllib/request.py in http_response(self, request, response)
    579         if not (200 <= code < 300):
    580             response = self.parent.error(
--> 581                 'http', request, response, code, msg, hdrs)
    582 
    583         return response

//anaconda/lib/python3.5/urllib/request.py in error(self, proto, *args)
    507         if http_err:
    508             args = (dict, 'default', 'http_error_default') + orig_args
--> 509             return self._call_chain(*args)
    510 
    511 # XXX probably also want an abstract factory that knows when it makes

//anaconda/lib/python3.5/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    441         for handler in handlers:
    442             func = getattr(handler, meth_name)
--> 443             result = func(*args)
    444             if result is not None:
    445                 return result

//anaconda/lib/python3.5/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    587 class HTTPDefaultErrorHandler(BaseHandler):
    588     def http_error_default(self, req, fp, code, msg, hdrs):
--> 589         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    590 
    591 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 416: Requested Range Not Satisfiable

Thoughts?

paps511 · April 3, 2016

Anyone? Where my python wizards at?

Azgoth 2 · April 3, 2016

After some quick playing around with this, I get the same HTTP error 416 whenever I run urllib.request.urlopen("http://www.indiegogo.com/"), and I get it with every Indiegogo page I try, but the same call works for other websites. I also get it when using the Linux wget command (which downloads things from the internet for you). I've tried to read up on what exactly error 416 is, but I don't fully understand it, so I can't entirely help you there. A few things I've found indicate that the client (in this case, Python or wget) is requesting a range of bytes that the server (Indiegogo) cannot provide, but nothing has given any more detail than that pretty unhelpful description, so I have no idea at the moment what can be done about it. Regardless, it's almost certainly not your Python code that's the problem; it's something about connecting to Indiegogo.

Do you specifically have to do an Indiegogo project, or can you use another site like Kickstarter? It might be easier to do that rather than figure out what the exact source of this problem is and how it can be fixed, if you're allowed to.

Edit: Error 416 seems to occur when some client requests some data/bytes from a server that do not actually exist on that server. E.g., bytes 1000-1100 of a file that's only 700 bytes long. I don't know why this error would be popping up for the entire webpage, though; I can get to it with my browser just fine.
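
To make the byte-range idea concrete, here is a toy model of how a range-aware server might choose between 206 (Partial Content) and 416. It is only an illustration of the rule described above: the function names are made up, real servers may ignore a Range header entirely and answer 200, and nothing here explains why Indiegogo returns 416 for a plain page request.

```python
def range_is_satisfiable(first_byte, last_byte, resource_length):
    """Return True if the range 'bytes=first_byte-last_byte' overlaps the resource.

    A well-formed byte range is satisfiable as long as its first byte
    lies inside the resource; the end may run past the resource's length.
    """
    if first_byte < 0 or last_byte < first_byte:
        raise ValueError("malformed byte range")
    return first_byte < resource_length


def status_for_range(first_byte, last_byte, resource_length):
    """Status code a range-aware server would pick in this toy model."""
    if range_is_satisfiable(first_byte, last_byte, resource_length):
        return 206  # Partial Content
    return 416      # Range Not Satisfiable


# The example from the post: bytes 1000-1100 of a 700-byte file.
print(status_for_range(1000, 1100, 700))
# A range that starts inside the file is fine, even if it runs past the end.
print(status_for_range(600, 1100, 700))
```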

paps511 · April 3, 2016

I can't do Kickstarter because someone else is already doing that as a topic.

Other options are http://prosper.com/ and kiva.org, but I don't know where to go to get a list of completed loans. We need to be able to scrape data and get descriptions and amounts funded.

riklaunim · April 4, 2016

Use the requests library instead of those low-level ones. That should make the code easier to write. It also helps to write the code top-down, split into functions or class methods whose names say what they do.
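
A rough sketch of that structure, with each step in its own small, named function. It sticks to the standard library's urllib.request so it runs without extra installs; with requests, the fetch step would become requests.get(url, headers=...). The function names and the User-Agent value are hypothetical, and it is an untested assumption that sending a browser-like header changes how Indiegogo responds.

```python
import urllib.request

BASE_URL = "https://www.indiegogo.com/projects/"
# Hypothetical header value, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (compatible; class-project-scraper)"


def build_campaign_url(slug):
    """Turn a campaign slug into a full URL, dropping any '#...' fragment."""
    clean_slug = slug.split("#")[0].strip("/")
    return BASE_URL + clean_slug


def build_request(url):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Build, but do not send, a request for the campaign from the original post.
campaign_url = build_campaign_url("aero-digital-earphone-upgrade-to-hifi-music-now/#/")
request = build_request(campaign_url)
print(request.full_url)
```

Parsing and CSV writing would then go into their own functions, so each piece can be checked separately before the whole scraper is run.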

paps511 · April 5, 2016

It seems like Indiegogo won't let us scrape, so we have adjusted our project to use Kiva.org instead.

I'm going to be making a new topic later today with some other questions. It seems to be working much better, but there are still some issues I am running into.

riklaunim · April 5, 2016

You could use Selenium to script a real browser to go through Indiegogo and fetch the required data.

paps511 · April 5, 2016

We have to use Python for the project... =(

riklaunim · April 5, 2016

Selenium is usable from Python and is often used for front-end integration testing. For example, in https://github.com/django-ckeditor/django-ckeditor/blob/master/ckeditor_demo/demo_application/tests.py I use Selenium to test CKEditor and its file-uploading widget together with a Django backend.

paps511 · April 5, 2016

That looks a bit above my level, but the assignment changed to Kiva.org to be a bit more compatible.

If you want to throw some thoughts on that topic, https://linustechtips.com/main/topic/577273-kiva-beautiful-soup-issues/

I am also having some issues with a list loop. I'm not sure how to edit items in a list, and keep them in the list.

I have a list of strings with a lot of garbage in it. I made a for loop to edit down to just the useful bits, but I don't know how to put them back into a list.

for item in url_list:
    item = str(item)
    item = item.split('href="',1)[-1]
    item = item.split('"',1)[0]
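
One possible way to keep the cleaned values: assigning to item inside the loop only rebinds the loop variable, so the list itself never changes. Either build a new list or write back by index. The sketch below does both with the same split logic; the sample strings and the extract_href name are made up for illustration.

```python
def extract_href(item):
    """Return the text between href=" and the next double quote."""
    text = str(item)
    return text.split('href="', 1)[-1].split('"', 1)[0]


# Made-up sample data standing in for the scraped tags.
url_list = [
    '<a class="loan" href="https://www.kiva.org/lend/123">Loan 123</a>',
    '<a href="/lend/456" title="x">Loan 456</a>',
]

# Option 1: build a new list and leave the original untouched.
clean_urls = [extract_href(item) for item in url_list]

# Option 2: overwrite each entry in place by index.
for index, item in enumerate(url_list):
    url_list[index] = extract_href(item)

print(clean_urls)
```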
