
Hey,

 

So I am scraping data from Kiva.org, but I am running into some trouble:

# importing libraries
from bs4 import BeautifulSoup
import urllib.request
import os
import csv

file = open(os.path.expanduser(r"~/Desktop/kiva_test.csv"), "w+", encoding='utf-8')
writer = csv.writer(file)
writer.writerow(["Loan Excerpt", "Information", "Country", "Status"])

url = ['https://www.kiva.org/lend/574380']

loanexp = []
info = []
country = []
status = []

for ur in url:
    page = urllib.request.urlopen(ur)
    soup = BeautifulSoup(page, "html.parser")

    loanexp = loanexp + soup.find_all(attrs={"class": "loanExcerpt"})
    info = info + soup.find_all(attrs={"section ID": "additionalLoanInformation"})  # this doesn't work
    country = country + soup.find_all(attrs={"class": "country"})  # this works, but matches multiple times
    status = status + soup.find_all(attrs={"class": "loanStatus notice"})


entry = loanexp + info + country + status

writer.writerow(entry)

file.close()

Unfortunately the info scraping part does not work, and the country prints four times into the CSV.

 

Also, do you have any tips for cleaning the scraped data? Here is what it looks like so far:

 

Loan Excerpt Information Country Status

<div class="loanExcerpt">
A loan of $11,650 helped this borrower to pay tuition fees for an undergraduate degree in Bachelor of Commerce at Strathmore University.
</div>

<div class="country"><span class="f16 ke"></span>Kenya</div> <div class="country"><span class="f16 ke"></span>Kenya</div> <div class="country"><span class="f16 ke"></span>Kenya</div> <a class="country" href="#aboutTheCountry"><span class="f16 ke"></span>Kenya</a> <div class="loanStatus notice">Ended with Loss - Defaulted</div>

Thanks!

CPU: Intel Core i7 8700k CPU Cooler: Corsair Hydro Series H100i Mobo:  Memory: G.Skill Ripjaws X 32GB 2133 Storage #1: 1TB 850 EVO SSD Storage #2: Western Digital Black 2TB Storage #3: Western Digital Green 4TB GPU: Gigabyte 980 Ti G1 Case: Mastercase5 PSU: EVGA 750 W G2 80+Gold Keyboard: Corsair K70 RGB Cherry MX Brown Mouse: Razer Deathadder Elite Monitor: LG 34UM94 Headset: Bose

Phone: Samsung Galaxy S9

https://linustechtips.com/topic/577273-kiva-beautiful-soup-issues/

Anyone?



I haven't had a chance to play with your code yet, but I do see some possible issues and have a few thoughts for you.  First, to clean up your tags, there's a .get_text() method for BeautifulSoup objects that strips out the tags and just returns the text inside them.  So,

import bs4

soup = bs4.BeautifulSoup("<div>This is some text</div>", "html.parser")
soup.get_text()  # returns the string "This is some text"

That's probably the best way to clean your data.  Sometimes get_text() doesn't do exactly what you want, since it just removes tags and adds in a few newlines/whitespaces, but it's the best tool to start with.

 

Second, if you're always getting the country name three times, and it's always the same country name, then the tag you're looking for almost certainly appears three times in the page.  Either narrow it down a bit more--you can call find() for some tag, then call find() on that result, etc., and there's also another way I don't remember offhand to search like that--or, since find_all() returns a list, just use the first item from the list instead of the whole list.  The latter way is easier, assuming you always have the same country name all three times.
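Here's a quick sketch of what I mean, using a made-up snippet shaped like the output you pasted (the repeated country tags):

```python
from bs4 import BeautifulSoup

# Made-up snippet shaped like the pasted output, where the same
# country tag appears several times on the page.
html = """
<div class="country"><span class="f16 ke"></span>Kenya</div>
<div class="country"><span class="f16 ke"></span>Kenya</div>
<a class="country" href="#aboutTheCountry"><span class="f16 ke"></span>Kenya</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, so the duplicates disappear:
country = soup.find(attrs={"class": "country"}).get_text()

# Equivalently, take the first item of the find_all() list:
country = soup.find_all(attrs={"class": "country"})[0].get_text()

print(country)  # Kenya
```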

 

Third, you're doing something a little off with your lists.  You can't use the + operator to add a single item to a list--adding a list and a string throws an error--so use the .append() method for that instead.

foo = []

# this won't work--it throws a TypeError (you can't add a list and a string)
foo = foo + "bar"

# this works--no need for "foo =", foo is updated in place
foo.append("bar")

You can also tidy up some of the rest of your main loop by just treating loanexp, info, etc. as strings, creating a list at the very end--[loanexp, info, ...]--and writing that to the file.  It would be a little cleaner than what you're doing right now.

 

Fourth, I haven't used BeautifulSoup in a few months so I could be mistaken on this, but when you're getting the data for your info list in your loop, I think you might be running into problems with your attribute tags.  "section ID" with a space in it doesn't look like a valid HTML attribute name--I suspect the attribute you actually want to match on is just id.  Either way, you should double-check that you've got the right tags.
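For example, with a made-up stand-in for the section you're after (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the section being scraped: match on the id
# attribute instead of an invented "section ID" attribute name.
html = '<section id="additionalLoanInformation">Some extra loan info</section>'
soup = BeautifulSoup(html, "html.parser")

info = soup.find(id="additionalLoanInformation")             # keyword shortcut
info = soup.find(attrs={"id": "additionalLoanInformation"})  # same thing via attrs

print(info.get_text())  # Some extra loan info
```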

 

Fifth, and this is more of a minor note, but there's another way you can do your file opening/closing.  It should work as it is right now, but you can also structure your program like this:

with open(...) as file:
    writer = csv.writer(file)
    etc

Python will automatically close the file when the with block finishes.  This is usually interchangeable with the structure you have (file = open() at the top, file.close() at the end), but the with form also closes the file for you if an error is raised partway through, which is useful in a lot of cases.

 

As usual, my post got longer than I initially anticipated.  Let me know if any of this helps; I can take a closer look at your code and play with it a bit if it's still misbehaving after you try some of the above.


13 hours ago, Azgoth 2 said:


I ended up using .split(), looking for the end of the metadata and then the beginning of the next segment of metadata.

 

This did help a bit. My professor ended up sending part of this:
    info = info + soup.find(attrs={"id":"additionalLoanInformation"}).contents
    info = str(info)
    info = info.split(';">',1)[-1]
    info = info.split('</',1)[0]

 

We needed to adjust it from "class" to "id" because the section I needed to scrape was not a separate class.
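For anyone finding this later, the same cleanup can probably be done with get_text() instead of the two splits--this is just a sketch against a made-up stand-in for the section, since the real page's markup may differ:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the additionalLoanInformation section.
html = ('<section id="additionalLoanInformation">'
        '<p style="margin: 0;">Some extra loan info</p>'
        '</section>')
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) pulls out the text without any string splitting:
info = soup.find(attrs={"id": "additionalLoanInformation"}).get_text(strip=True)
print(info)  # Some extra loan info
```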


