Python Help, Tuples and more

paps511 · March 6, 2016

Hey all,

So I've got a doozy of a homework assignment due tonight, and man is it kicking my ass (my classmates' as well).

Our assignment:

1. Scrape hotel reviews from Trip advisor with rating (done)

2. Convert the rating to "positive" or "negative" (done)

3. Clean and lemmatize the reviews (problem 1)

4. Split the data 75/25 and develop a naïve Bayesian classifier. (problem 2)

Then

5. Test how accurate is the classifier? (should be easy)

6. What are the key words (lemmas) that predict the rating? (should be easy)

I am quite lost. I have my source csv with the review and positive/negative.

But the first problem is to lemmatize the reviews. I have tried to iterate through the reviews, but it keeps giving me errors because each review is in a list and it needs to be a string.

#clean and lematize reviews
import csv
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

lemmed = ""

source = open("a5raw.csv","r").readlines()
reviews = csv.reader(source, delimiter=',')
#stopset = set(stopwords.words('english'))

lmzr = WordNetLemmatizer()
for words in reviews:
    join_words=','.join(words).lower()
    review_words=join_words.split()
#    print((review_words))
    
    stopset = set(stopwords.words('english'))
    tokens = word_tokenize(str(review_words))
    tokens = [w for w in tokens if not w in stopset]
    
    wordnet_lemmatizer = WordNetLemmatizer()
    for word in tokens:
        lemmed += wordnet_lemmatizer.lemmatize(word)+ " "
    print (lemmed)

For the second part, I know I need to use bag_of_words and I have this framework taken from examples in class, but don't know how to connect the dots...

#split .75 and make a naïve Bayesian classifier


def random_data_split(lst,split=0.75):
    
    import random
    random.shuffle(lst)
    i1 = int(len(lst)*split)
    i2 = int(len(lst) - i1)
    train_set = lst[i1:]
    test_set = lst[:i2]
    return train_set, test_seto

train_set, test_set = random_data_split(all_data)

def bag_of_words(indiwords):
    return dict([(word, '') for word in words])

bag_of_words(indiwords)

classifier = nltk.NaiveBayesClassifier.train(train_set)

Any help would be much appreciated

paps511 · March 6, 2016

Anyone?

Azgoth 2 · March 7, 2016

Okay, I'm playing around with some of your code, and I notice a few possible points where you might be going wrong.

First: I don't know how your .csv file is structured, but I'm imagining it's something like one review in the first cell, a second review in the next cell down, etc. Assuming this is the case: once you pull the review text out of each row, your csv.reader() object will give you something that looks more or less like this (csv.reader hands back every row as a list of cells, so even a one-column file needs a little cleanup, like taking row[0], and with multiple data points per row you'll be dealing with a list of lists):

reviews = ["text of review 1", "text of review 2", ..., "text of review N"]
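Here is a minimal sketch of that cleanup, assuming each row holds the review text followed by its positive/negative label. The sample rows are made up, and the io.StringIO object is only a stand-in for your real file so the snippet runs on its own.

```python
import csv
import io

# Stand-in for open("a5raw.csv", newline=""); the two-column layout is assumed.
sample = io.StringIO(
    '"The room was clean, and the staff were lovely",positive\n'
    '"Dirty sheets and a rude clerk",negative\n'
)

reviews = []  # review text only
labels = []   # "positive" / "negative"
for row in csv.reader(sample):
    # csv.reader yields one list of cell strings per row
    if not row:
        continue  # skip blank lines
    reviews.append(row[0])
    labels.append(row[1])

print(reviews)
print(labels)
```

Note that the comma inside the first quoted review stays part of the text; csv.reader only splits on the delimiters between cells.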

This is probably fine, and I imagine what you want--all text relevant to one review is in a single list item. But this also explains where some of your issues are coming in with the first question. When I run the first part of your for loop, here's what I get:

reviews = ["this is a story all about how", "my life got flipped turned upside down"]
for words in reviews:
	join_words=','.join(words).lower()
	review_words=join_words.split()

print(join_words)
# prints 'm,y, ,l,i,f,e, ,g,o,t, ,f,l,i,p,p,e,d, ,t,u,r,n,e,d, ,u,p,s,i,d,e, ,d,o,w,n'

print(review_words)
# prints ['m,y,', ',l,i,f,e,', ',g,o,t,', ',f,l,i,p,p,e,d,', ',t,u,r,n,e,d,', ',u,p,s,i,d,e,', ',d,o,w,n']

word_tokenize(review_words)
# error about "list object has no attribute split"

Once words in your for-loop is a plain string (it's just an element of reviews, which should be strings if you're doing text analysis on them), joining and then re-splitting it does more harm than good. Calling ','.join(words) on a string inserts a comma between every character (remember, strings are sequences of characters, and Python treats them much like lists), and .lower() then lowercases the result. Then you're re-splitting at the spaces, but leaving all the commas in there. This will make your data utterly useless unless you're looking specifically at graphemes, which you aren't, based on your code/description. You should only need to call .lower() on words, rather than joining and splitting anything.

Another problem--the one you specifically asked about--is that NLTK's word_tokenize() function takes a string argument, but you're passing it a list of strings by giving it review_words. It calls Python's built-in .split() method, which also requires a string, so that's where the error is coming from. You can get around this by not messing with your words variable in the loop, and just call word_tokenize(words.lower()) directly.

Also, these aren't strictly problems, but just some odd things I'm spotting. You don't need the set to de-duplicate your stop words, but it is still worth keeping, because checking membership in a set is much faster than scanning a list; just build it once before the loop instead of on every pass. You also shouldn't need to convert review_words to a string: what you want to tokenize is the original review string, and calling str() on a list just gives you the list's printed form, brackets and quotes included. Lastly, unless you have a specific need to keep lemmed as a string, it might save a step later on in your assignment if you make it a list or NumPy array (if you're using the NumPy library; lists are fine if you're not). Presumably you're going to put it through some sort of statistical analysis, which is usually done on lists or arrays rather than strings.

So something like this should hopefully fix your problem. It works for me, but I'm only doing the briefest of tests with it.

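from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmed = []
stopset = set(stopwords.words('english'))  # build once; set lookups are fast
wnl = WordNetLemmatizer()                  # one lemmatizer is enough
reviews = ["this is a story all about how", "my life got flipped turned upside down"]
for words in reviews:
    # words is already a string, so tokenize it directly
    tokens = word_tokenize(words.lower())
    tokens = [w for w in tokens if w not in stopset]
    for item in tokens:
        lemmed.append(wnl.lemmatize(item))

print(lemmed)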
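One thing to watch: a single flat lemmed list mixes every review together, and for a classifier you will probably want the lemmas grouped per review. Here is a rough sketch of that idea. The function name lemmatize_reviews and its parameters are my own invention, and the toy tokenizer, identity lemmatizer, and tiny stop list are stand-ins so it runs without the NLTK data installed.

```python
def lemmatize_reviews(reviews, tokenize, lemmatize, stop_words):
    """Return one list of lemmas per review instead of a single flat list."""
    per_review = []
    for text in reviews:
        tokens = tokenize(text.lower())
        per_review.append([lemmatize(t) for t in tokens if t not in stop_words])
    return per_review


# Toy stand-ins so the sketch runs without NLTK data; swap in the real ones:
#   tokenize   -> nltk.tokenize.word_tokenize
#   lemmatize  -> WordNetLemmatizer().lemmatize
#   stop_words -> set(stopwords.words('english'))
toy_stop_words = {"this", "is", "a", "all", "about", "how", "my"}
result = lemmatize_reviews(
    ["this is a story all about how", "my life got flipped turned upside down"],
    tokenize=str.split,
    lemmatize=lambda word: word,
    stop_words=toy_stop_words,
)
print(result)
```

Each inner list lines up with one review, so it can be paired with that review's positive/negative label later on.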

Let me know if that makes sense and works.

As for your second question: I don't know anything about doing Naive Bayesian Classifiers (generally or in NLTK), but I do see a few quick general Python things: you should probably move your import random line out of the function--just do that at the top of the script where you would normally have your import statements. And this is probably just a typo, but your random_data_split function returns test_seto rather than test_set, which is what it should be returning.
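Two more things in that framework look off to me. As posted, lst[i1:] and lst[:i2] each hold roughly a quarter of the data, so you never get a 75% training set, and bag_of_words takes indiwords but loops over words. Below is a minimal sketch of how the helpers might be fixed. It is not your class's version: the single cut index, the seed parameter, and the True feature values are my assumptions, and train_and_score is a hypothetical wrapper I wrote going by the NLTK docs rather than experience (it relies on NLTK classifiers taking a list of (feature dict, label) pairs).

```python
import random


def random_data_split(lst, split=0.75, seed=None):
    """Shuffle a copy of lst and cut it once: the first `split` share is training data."""
    shuffled = list(lst)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split)
    return shuffled[:cut], shuffled[cut:]


def bag_of_words(words):
    """Map each token to True, giving the feature-dict shape NLTK classifiers accept."""
    return {word: True for word in words}


def train_and_score(labeled_reviews):
    """Hypothetical wrapper: labeled_reviews is a list of (token_list, label) pairs."""
    import nltk  # imported here only so the helpers above run without NLTK installed

    featuresets = [(bag_of_words(tokens), label) for tokens, label in labeled_reviews]
    train_set, test_set = random_data_split(featuresets)
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.accuracy(classifier, test_set)
    classifier.show_most_informative_features(10)  # prints the strongest features
    return classifier, accuracy
```

If that holds up, the accuracy value would cover your step 5 and the most-informative-features printout your step 6.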

I don't know if you're using the Natural Language Processing with Python book, but you probably should be. It's free online. Here's the chapter where they talk about doing text classification, including Naive Bayes Classifiers.

Incidentally, just out of curiosity, what is the class? It looks like an NLP class if this is a representative assignment.

paps511 · March 7, 2016

Thank you so much! This really helped a ton, and it will come in handy for our future projects.

I was actually really close. The only thing I was missing:

for words in reviews2:
    hotel_review = words[0]
    replacer = RegexpReplacer()
    contractionless = replacer.replace(hotel_review)
    train_list.append((bag_of_words(contractionless), words[1]))

That and a few other things. One of the other students' sons is really good at coding and solved it for all of us.

Once I got the data cleaned and properly tupled, I had the rest of the code ready.

Azgoth 2 · March 7, 2016

Glad I could help! And glad to hear the other problem was a pretty simple fix!
