
Hey,

 

So I am scraping data from Kiva.org, but I am running into some trouble:

# importing libraries
from bs4 import BeautifulSoup
import urllib.request
import os
import csv

file = open(os.path.expanduser(r"~/Desktop/kiva_test.csv"), "w+", encoding='utf-8')
writer = csv.writer(file)
writer.writerow(["Loan Excerpt", "Information", "Country", "Status"])

url = ['https://www.kiva.org/lend/574380']

loanexp = []
info = []
country = []
status = []

for ur in url:
    page = urllib.request.urlopen(ur)
    soup = BeautifulSoup(page, "html.parser")

    loanexp = loanexp + soup.find_all(attrs={"class": "loanExcerpt"})
    info = info + soup.find_all(attrs={"section ID": "additionalLoanInformation"})  # this doesn't work
    country = country + soup.find_all(attrs={"class": "country"})  # this works, but matches multiple times
    status = status + soup.find_all(attrs={"class": "loanStatus notice"})


entry = loanexp + info + country + status

writer.writerow(entry)

file.close()

Unfortunately the info scraping part does not work, and the country prints four times into the CSV.

 

Also, do you have any tips for cleaning the scraped data? Here is what it looks like so far:

 

Loan Excerpt Information Country Status

<div class="loanExcerpt">
A loan of $11,650 helped this borrower to pay tuition fees for an undergraduate degree in Bachelor of Commerce at Strathmore University.
</div>

<div class="country"><span class="f16 ke"></span>Kenya</div> <div class="country"><span class="f16 ke"></span>Kenya</div> <div class="country"><span class="f16 ke"></span>Kenya</div> <a class="country" href="#aboutTheCountry"><span class="f16 ke"></span>Kenya</a> <div class="loanStatus notice">Ended with Loss - Defaulted</div>

Thanks!

CPU: Intel Core i7 8700k CPU Cooler: Corsair Hydro Series H100i Mobo:  Memory: G.Skill Ripjaws X 32GB 2133 Storage #1: 1TB 850 EVO SSD Storage #2: Western Digital Black 2TB Storage #3: Western Digital Green 4TB GPU: Gigabyte 980 Ti G1 Case: Mastercase5 PSU: EVGA 750 W G2 80+Gold Keyboard: Corsair K70 RGB Cherry MX Brown Mouse: Razer Deathadder Elite Monitor: LG 34UM94 Headset: Bose

Phone: Samsung Galaxy S9

https://linustechtips.com/topic/577273-kiva-beautiful-soup-issues/

Anyone?



I haven't had a chance to play with your code yet, but I do see some possible issues and have a few thoughts for you.  First, to clean up your tags, there's a .get_text() method for BeautifulSoup objects that strips out the tags and just returns the text inside them.  So,

import bs4

soup = bs4.BeautifulSoup("<div>This is some text</div>", "html.parser")
soup.get_text()  # returns the string "This is some text"

That's probably the best way to clean your data.  Sometimes get_text() doesn't do exactly what you want, since it just removes tags and adds in a few newlines/whitespaces, but it's the best tool to start with.

 

Second, if you're always getting the country name three times, and it's always the same country name, then the tag you're looking for almost certainly appears three times in the page.  Either narrow it down a bit more--you can call find() for some tag, then call find() on that result, etc., and there's also another way I don't remember offhand to search like that--or, since find_all() returns a list, just use the first item from the list instead of the whole list.  The latter way is easier, assuming you always have the same country name all three times.
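Here's a quick sketch of what I mean, using a made-up snippet shaped like the output you pasted (the repeated country tags):

```python
from bs4 import BeautifulSoup

# Made-up snippet shaped like the pasted output, where the same
# country tag appears several times on the page.
html = """
<div class="country"><span class="f16 ke"></span>Kenya</div>
<div class="country"><span class="f16 ke"></span>Kenya</div>
<a class="country" href="#aboutTheCountry"><span class="f16 ke"></span>Kenya</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, so the duplicates disappear:
country = soup.find(attrs={"class": "country"}).get_text()

# Equivalently, take the first item of the find_all() list:
country = soup.find_all(attrs={"class": "country"})[0].get_text()

print(country)  # Kenya
```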

 

Third, you're doing something a little off with your lists.  You can't use the + operator to add a single item to a list--adding a list and a string throws an error--so use the .append() method for that instead.

foo = []

# this won't work--it throws a TypeError (you can't add a list and a string)
foo = foo + "bar"

# this works--no need for "foo =", foo is updated in place
foo.append("bar")

You can also tidy up some of the rest of your main loop by just treating loanexp, info, etc. as strings, creating a list at the very end--[loanexp, info, ...]--and writing that to the file.  It would be a little cleaner than what you're doing right now.

 

Fourth, I haven't used BeautifulSoup in a few months so I could be mistaken on this, but when you're getting the data for your info list in your loop, I think you might be running into problems with your attribute tags.  "section ID" with a space in it doesn't look like a valid HTML attribute name--I suspect the attribute you actually want to match on is just id.  Either way, you should double-check that you've got the right tags.
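For example, with a made-up stand-in for the section you're after (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the section being scraped: match on the id
# attribute instead of an invented "section ID" attribute name.
html = '<section id="additionalLoanInformation">Some extra loan info</section>'
soup = BeautifulSoup(html, "html.parser")

info = soup.find(id="additionalLoanInformation")             # keyword shortcut
info = soup.find(attrs={"id": "additionalLoanInformation"})  # same thing via attrs

print(info.get_text())  # Some extra loan info
```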

 

Fifth, and this is more of a minor note, but there's another way you can do your file opening/closing.  It should work as it is right now, but you can also structure your program like this:

with open(...) as file:
    writer = csv.writer(file)
    etc

Python will automatically close the file when the with block finishes.  This is usually interchangeable with the structure you have (file = open() at the top, file.close() at the end), but the with form also closes the file for you if an error is raised partway through, which is useful in a lot of cases.

 

As usual, my post got longer than I initially anticipated.  Let me know if any of this helps; I can take a closer look at your code and play with it a bit if it's still misbehaving after you try some of the above.


13 hours ago, Azgoth 2 said:


I ended up using .split(), looking for the end of the metadata and then the beginning of the next segment of metadata.

 

This did help a bit. My professor ended up sending part of this:
    info = info + soup.find(attrs={"id":"additionalLoanInformation"}).contents
    info = str(info)
    info = info.split(';">',1)[-1]
    info = info.split('</',1)[0]

 

We needed to adjust it from "class" to "id" because the section I needed to scrape was not a separate class.
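For anyone finding this later, the same cleanup can probably be done with get_text() instead of the two splits--this is just a sketch against a made-up stand-in for the section, since the real page's markup may differ:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the additionalLoanInformation section.
html = ('<section id="additionalLoanInformation">'
        '<p style="margin: 0;">Some extra loan info</p>'
        '</section>')
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) pulls out the text without any string splitting:
info = soup.find(attrs={"id": "additionalLoanInformation"}).get_text(strip=True)
print(info)  # Some extra loan info
```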


