Hey all,

 

So I've got a doozy of a homework assignment due tonight and man is it kicking my ass (and my classmates' as well).

 

Our assignment:

1. Scrape hotel reviews from TripAdvisor with rating (done)

2. Convert the rating to "positive" or "negative" (done)

3. Clean and lemmatize the reviews (problem 1)

4. Split the data 75/25 and develop a naïve Bayesian classifier. (problem 2)

Then

5. Test how accurate the classifier is (should be easy)

6. What are the key words (lemmas) that predict the rating? (should be easy)

 

I am quite lost. I have my source csv with the review and positive/negative.

But the first problem is to lemmatize the reviews. I have tried to iterate through the reviews, but it keeps giving me errors because each review is in a list and it needs to be a string.

#clean and lematize reviews
import csv
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

lemmed = ""

source = open("a5raw.csv","r").readlines()
reviews = csv.reader(source, delimiter=',')
#stopset = set(stopwords.words('english'))

lmzr = WordNetLemmatizer()
for words in reviews:
    join_words=','.join(words).lower()
    review_words=join_words.split()
#    print((review_words))
    
    stopset = set(stopwords.words('english'))
    tokens = word_tokenize(str(review_words))
    tokens = [w for w in tokens if not w in stopset]
    
    wordnet_lemmatizer = WordNetLemmatizer()
    for word in tokens:
        lemmed += wordnet_lemmatizer.lemmatize(word)+ " "
    print (lemmed)

For the second part, I know I need to use bag_of_words and I have this framework taken from examples in class, but don't know how to connect the dots...
 

#split .75 and make a naïve Bayesian classifier


def random_data_split(lst,split=0.75):
    
    import random
    random.shuffle(lst)
    i1 = int(len(lst)*split)
    i2 = int(len(lst) - i1)
    train_set = lst[i1:]
    test_set = lst[:i2]
    return train_set, test_seto

train_set, test_set = random_data_split(all_data)

def bag_of_words(indiwords):
    return dict([(word, '') for word in words])

bag_of_words(indiwords)

classifier = nltk.NaiveBayesClassifier.train(train_set)

Any help would be much appreciated

CPU: Intel Core i7 8700k CPU Cooler: Corsair Hydro Series H100i Mobo:  Memory: G.Skill Ripjaws X 32GB 2133 Storage #1: 1TB 850 EVO SSD Storage #2: Western Digital Black 2TB Storage #3: Western Digital Green 4TB GPU: Gigabyte 980 Ti G1 Case: Mastercase5 PSU: EVGA 750 W G2 80+Gold Keyboard: Corsair K70 RGB Cherry MX Brown Mouse: Razer Deathadder Elite Monitor: LG 34UM94 Headset: Bose

Phone: Samsung Galaxy S9

https://linustechtips.com/topic/560847-python-help-tuples-and-more/

Anyone?


Okay, I'm playing around with some of your code, and I notice a few possible points where you might be going wrong.

 

First: I don't know how your .csv file is structured, but I'm imagining it's something like one review in the first cell, a second review in the next cell down, etc.  Strictly speaking, your csv.reader() object is an iterable that yields one list of cells per row (so, a list of lists), and you'll need to pull the review text out of each row.  Once you've done that little bit of cleanup, you have something that looks more or less like this:

reviews = ["text of review 1", "text of review 2" ..., "text of review N"]

This is probably fine, and I imagine what you want--all text relevant to one review is in a single list item.  But this also explains where some of your issues are coming in with the first question.  When I run the first part of your for loop, here's what I get:

reviews = ["this is a story all about how", "my life got flipped turned upside down"]
for words in reviews:
	join_words=','.join(words).lower()
	review_words=join_words.split()

print(join_words)
# prints 'm,y, ,l,i,f,e, ,g,o,t, ,f,l,i,p,p,e,d, ,t,u,r,n,e,d, ,u,p,s,i,d,e, ,d,o,w,n'

print(review_words)
# prints ['m,y,', ',l,i,f,e,', ',g,o,t,', ',f,l,i,p,p,e,d,', ',t,u,r,n,e,d,', ',u,p,s,i,d,e,', ',d,o,w,n']

word_tokenize(review_words)
# error about "list object has no attribute split"

In my test, words in your for-loop is already a string (it's just an element of reviews, which should be strings if you're doing text analysis on them), but you're joining and then re-splitting it.  When words is a string, ','.join(words).lower() inserts a comma between every character and then lowercases the result.  (Remember, strings are sequences of characters, and join() iterates over them just as it would over a list.)  Then you're re-splitting at the spaces, but leaving all the commas in there.  This will make your data utterly useless unless you're looking specifically at graphemes, which you aren't based on your code/description.  You should only need to call .lower() on the review string, rather than joining and splitting anything.
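To see the difference concretely, here's a small sketch in plain Python (no NLTK needed; the sample row is made up) of what ','.join() does to a list of cells versus a single string:

```python
# Hypothetical sample data: one CSV row as a list of cells, and one bare string.
row = ["Great hotel", "positive"]
text = "Great hotel"

# join() on a list puts the separator between the list items.
joined_row = ",".join(row).lower()

# join() on a string puts the separator between every character.
joined_text = ",".join(text).lower()

# Lowercasing the string directly is all that's needed before tokenizing.
clean_text = text.lower()

print(joined_row)
print(joined_text)
print(clean_text)
```

So whether the join is harmless or destructive depends entirely on whether the loop variable is a row (list) or a review (string).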

 

Another problem (the one you specifically asked about) is that NLTK's word_tokenize() function takes a string argument, but you're passing it a list of strings by giving it review_words.  Internally it relies on string operations that a list doesn't support, so that's where the error is coming from.  You can get around this by not messing with your words variable in the loop, and just calling word_tokenize() directly on the lowercased review string.

 

Also, these aren't strictly problems, but just some odd things I'm spotting.  When you're filtering out your stop words, you don't need the set to remove duplicates (there are no duplicate items in the stopwords.words('english') list), but a set does make the "w in stopset" check faster than a list would, so it's worth keeping; just build it once before the loop rather than on every pass.  The same goes for WordNetLemmatizer(): create it once, not once per review.  You also shouldn't need to call str() on review_words; what you want to hand to word_tokenize() is the review string itself, and str() on a list just gives you the list's printed form, brackets and quotes included.  Lastly, unless you have a specific need to keep lemmed as a string, it might save a step later on in your assignment if you make it a list or Numpy array (if you're using the Numpy library; lists are fine if you're not).  Presumably you're going to put it through some sort of statistical analysis, which is usually done on lists or arrays rather than strings.
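A rough sketch of the filtering step with the stop words held in a set that's built once, and the results collected in a list per review. To keep it runnable without any NLTK downloads, STOP_WORDS is a tiny hand-made stand-in for stopwords.words('english') and str.split() stands in for word_tokenize(); both are assumptions for illustration only.

```python
# Stand-in for set(stopwords.words('english')); built once, not per review.
STOP_WORDS = {"this", "is", "a", "all", "about", "how", "my", "got"}

reviews = ["this is a story all about how",
           "my life got flipped turned upside down"]

filtered = []  # one list of kept tokens per review
for review in reviews:
    tokens = review.lower().split()  # stand-in for word_tokenize()
    filtered.append([t for t in tokens if t not in STOP_WORDS])

print(filtered)
```

Keeping one token list per review (rather than one long string) also means each review can later be paired with its own positive/negative label.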

 

So something like this should hopefully fix your problem.  It works for me, but I'm only doing the briefest of tests with it.

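from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Build these once, outside the loop
stopset = set(stopwords.words('english'))
wnl = WordNetLemmatizer()

lemmed = []
reviews = ["this is a story all about how", "my life got flipped turned upside down"]
for words in reviews:
	# words is already a string, so lowercase and tokenize it directly
	tokens = word_tokenize(words.lower())
	tokens = [w for w in tokens if w not in stopset]
	for item in tokens:
		lemmed.append(wnl.lemmatize(item))

print(lemmed)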
	

Let me know if that makes sense and works.
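One more thing for when you move from my toy list to the real file: csv.reader() hands back each row as a list of cells, so you'd pull the review text out of the row before tokenizing. A sketch, assuming the review is in the first column and the positive/negative label in the second (I'm guessing at your layout), with io.StringIO standing in for the opened a5raw.csv:

```python
import csv
import io

# Stand-in for the real file; assumed layout: review text, then label.
fake_file = io.StringIO(
    '"Lovely room, great view",positive\n'
    '"Dirty and noisy",negative\n'
)

rows = []
for row in csv.reader(fake_file):
    if len(row) < 2:      # skip blank or malformed lines
        continue
    text, label = row[0], row[1]
    rows.append((text.lower(), label))

print(rows)
```

Note that the comma inside the first review survives because the field is quoted; that's the main reason to let the csv module do the splitting instead of splitting on commas by hand.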

 

As for your second question: I don't know anything about doing Naive Bayesian Classifiers (generally or in NLTK), but I do see a few quick general Python things: you should probably move your import random line out of the function--just do that at the top of the script where you would normally have your import statements.  And this is probably just a typo, but your random_data_split function returns test_seto rather than test_set, which is what it should be returning.
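For what it's worth, here's how I'd sketch the split with those two fixes applied. I've also changed the slicing so the first 75% goes to training and the rest to testing; as written, your training set is the last 25% of the list, your test set is the first 25%, and the middle half never gets used. That's my reading of the intent, so double-check it against your class examples. The seed parameter and the fake all_data are my additions, just to make the example repeatable.

```python
import random

def random_data_split(lst, split=0.75, seed=None):
    """Shuffle a copy of lst and return (train_set, test_set)."""
    items = list(lst)  # copy so the caller's list isn't reordered
    random.Random(seed).shuffle(items)
    cut = int(len(items) * split)
    return items[:cut], items[cut:]

# Hypothetical stand-in for all_data: (review, label) pairs.
all_data = [("review %d" % i, "positive" if i % 2 else "negative")
            for i in range(20)]

train_set, test_set = random_data_split(all_data, seed=0)
print(len(train_set), len(test_set))
```

Slicing at a single cut point guarantees the two sets don't overlap and together cover every item.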

 

I don't know if you're using the Natural Language Processing with Python book, but you probably should be.  It's free online.  Here's the chapter where they talk about doing text classification, including Naive Bayes Classifiers.

 

Incidentally, just out of curiosity, what is the class?  It looks like an NLP class if this is a representative assignment.


Thank you so much! This really helped a ton, and it will come in handy for our future projects.

 

I was actually really close. The only thing I was missing:

for words in reviews2:
    hotel_review = words[0]
    replacer = RegexpReplacer()
    contractionless=replacer.replace(hotel_review)
    train_list.append((bag_of_words(contractionless), words[1]))

That and a few other things. One of the other students' sons is really good at coding and solved it for all of us.

Once I got the data cleaned and properly tupled, I had the rest of the code ready.
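For anyone finding this later, a rough sketch of what "properly tupled" means here: each item is a (feature dict, label) pair, which is the format nltk.NaiveBayesClassifier.train() takes. This bag_of_words maps each word to True (a common convention; the version from class may differ), str.split() stands in for the real tokenizing and lemmatizing, and the sample rows are made up.

```python
def bag_of_words(words):
    """Map each word to True so the dict can serve as a feature set."""
    return {word: True for word in words}

# Hypothetical (review, label) rows, as they'd come out of the cleaned CSV.
reviews2 = [("clean room friendly staff", "positive"),
            ("dirty room rude staff", "negative")]

train_list = []
for text, label in reviews2:
    tokens = text.split()  # stand-in for tokenizing + lemmatizing
    train_list.append((bag_of_words(tokens), label))

print(train_list[0])
```

From there, nltk.NaiveBayesClassifier.train(train_set) builds the classifier, nltk.classify.accuracy(classifier, test_set) covers step 5, and classifier.show_most_informative_features() covers step 6.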
