
Implement a Java program using Hadoop MapReduce

 

Problem definition

Given TWO text files, for each word that appears in both files, find the smaller of its two occurrence counts (its count in the first file versus its count in the second). Output the top 20 common words ranked by that smaller count, highest first. For words with the same count, there is no special requirement for the output order.

Example: if the word “John” appears 5 times in the 1st file and 3 times in the 2nd file, the smaller count is 3.
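
As a rough illustration of the definition above, here is a minimal plain-Java sketch of the "smaller count, then top 20" step. It is not a Hadoop job: it assumes the per-file word counts have already been computed and are held in memory as maps, and the names `CommonWords`, `smallerCounts` and `topK` are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CommonWords {
    /** For each word present in both maps, keep the smaller of its two counts. */
    static Map<String, Integer> smallerCounts(Map<String, Integer> first,
                                              Map<String, Integer> second) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : first.entrySet()) {
            Integer other = second.get(e.getKey());
            if (other != null) { // word occurs in both files
                result.put(e.getKey(), Math.min(e.getValue(), other));
            }
        }
        return result;
    }

    /** Return up to k entries in descending order of count; tie order is left unspecified. */
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }

    public static void main(String[] args) {
        // Hypothetical counts matching the "John" example from the problem statement.
        Map<String, Integer> file1 = new HashMap<>();
        file1.put("John", 5);
        file1.put("apple", 2);
        Map<String, Integer> file2 = new HashMap<>();
        file2.put("John", 3);
        file2.put("pear", 7);
        System.out.println(topK(smallerCounts(file1, file2), 20)); // prints [John=3]
    }
}
```

In a MapReduce setting the counting itself would presumably be done by the job, with something like this min/top-k logic applied in a later stage; how to split that across mappers and reducers is a design decision this sketch does not make.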

 

Requirements

Split the input text on the whitespace characters “(space)\t\n\r\f” only. Any other characters, such as “,.:`”, are treated as part of the word they are attached to.

Remove the stop-words given in Stopwords.txt, such as “a”, “the”, “that”, “of”, … (matching is case sensitive)

Sort the common words in descending order of the smaller number of occurrences in the two files.

In general, words with different case or different non-whitespace punctuation are considered different words.
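
To make the tokenizing and stop-word rules above concrete, here is a small plain-Java sketch of counting words in one piece of text. It relies on `java.util.StringTokenizer`, whose delimiter set is passed explicitly as `" \t\n\r\f"`. The class name `TokenCount`, the method `countWords`, and the in-memory stop-word set are assumptions for illustration; loading Stopwords.txt and wiring this into a Hadoop mapper are left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

public class TokenCount {
    /** Count tokens split on space, \t, \n, \r, \f, skipping exact (case-sensitive) stop-word matches. */
    static Map<String, Integer> countWords(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer tokens = new StringTokenizer(text, " \t\n\r\f");
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken(); // punctuation stays attached to the word
            if (!stopWords.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>();
        stopWords.add("the"); // hypothetical entry; the real list comes from Stopwords.txt
        Map<String, Integer> counts = countWords("the cat, the Cat\tcat, The", stopWords);
        System.out.println(counts.get("cat,")); // prints 2
        System.out.println(counts.get("Cat"));  // prints 1
        System.out.println(counts.get("The"));  // prints 1 ("The" is not the stop-word "the")
        System.out.println(counts.get("the"));  // prints null (removed as a stop-word)
    }
}
```

Note that “cat,” keeps its comma and “The” survives the stop-word filter, which is what the case-sensitivity and punctuation rules imply.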


Archive.zip


People here don't do your assignments for you.

 

If you are stuck on a specific part, or need help working out why your approach isn't working, I'm more than happy to help. Post what you have, and we can point out the issues in the logic or code that are causing you to get the wrong answer.

3735928559 - Beware of the dead beef
