word count in C

iWearKiltz
Solved by JamieF

 

Hi guys,

For part of an assignment I've got to get the word count of a file. I've done characters and lines. Words are a bit different though.

It's not like you can count the number of spaces and add one, because that just counts spaces (i.e. if there is more than one space between words, it will count an extra word when it should only be counting one).

So I thought to myself: a two-dimensional array (of strings?).

I'm using Microsoft's secure versions of things (the _s functions; if you don't know what these are, just answer without the secure versions and I'll try to work it out). I've tried multiple ways, but my lecturers just have not covered this material (nor half the other stuff I've done for this assignment).

Was wondering if you guys could help.

//Places file 1 into an array
rewind(file_in); //THIS IS ALL A TEST, CHECK FOR PRINTF
while ((f = fgetc(file_in)) != EOF)
{
    if (isupper(f) || islower(f))
    {
        arrayfile1[q][p] = f;
        p++;
    }
    else
    {
        arrayfile1[q][p] = '\0';
        q++;
    }
}

Was looking at something similar on Reddit the other day, might be of use to you.

 

http://www.reddit.com/r/dailyprogrammer/comments/2nynip/2014121_challenge_191_easy_word_counting/


You could still check spaces, but keep track of if the last character was a space or not

 

If the current character is a space and the last character wasn't a space, then update count.


Interesting idea actually, means I get to avoid a two-dimensional array.

Any idea how to implement it? Have two variables and increment both by one each loop (one starting at 0, the other at 1), with an if statement inside a while loop?

That's my idea right now, have any better ones?


I would use a boolean variable. Depending on what version of C you are using, you may have the bool type. In C99 you can include stdbool.h to access it.

 

If you're using an earlier version than C99 and don't have bool, it can easily be implemented yourself. For example, use an int where 1 means true and 0 means false. Here are other options for representing the bool type in C.

// initialize your "boolean" as false
int lastCharWasSpace = 0;

// then inside your loop
if (currentCharacter == ' ') { // check that the current character is a space
    if (!lastCharWasSpace) { // if last character wasn't a space, then update the word count
        wordCount++;
    }
    lastCharWasSpace = 1; // because character was a space, set variable
}
else { // current character wasn't a space so reset variable
    lastCharWasSpace = 0;
}

edit: Also, I just remembered there is the isspace() function in ctype.h that you should use instead of currentCharacter == ' ' if you can. It'll cover more types of whitespace.
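For reference, the flag idea above can be put together into a complete function. This is only a minimal sketch, not the poster's exact code: the name count_words is hypothetical, it assumes C99 or later (for bool from stdbool.h), it uses isspace() as suggested, and it counts a word where one starts rather than where one ends, so a last word with no trailing whitespace is still counted.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: returns the number of whitespace-separated
   words read from fp, starting at the current file position. */
long count_words(FILE *fp)
{
    long words = 0;
    bool last_was_space = true;  /* treat the start of the file like whitespace */
    int c;                       /* int, not char, so EOF can be told apart */

    while ((c = fgetc(fp)) != EOF) {
        bool is_space = isspace(c) != 0;
        if (!is_space && last_was_space) {
            words++;             /* first character of a new word */
        }
        last_was_space = is_space;
    }
    return words;
}
```

Called on a file opened with fopen() (after rewind() if the file has already been read), it returns the count as a long. Runs of spaces, tabs and blank lines between words are each treated as a single gap.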


Okay, so I implemented this and it's a great idea. Here's my sample code:

//working out numbers of words for file 1, via 'boolean' with lastcharspace (1 true, 0 false)
int lastcharspace = 0;
long wc = 0;
rewind(file_in);
while ((f = fgetc(file_in)) != EOF)
{
    if (f == ' ')
    {
        wc++;
        lastcharspace == 1;
    }
    else
    {
        lastcharspace == 0;
    }
}
printf("The number of words is: %ld \n\n", wc);

 

The issue is that in my test file, i've got this:

thank you laurie

that was super helpful
 
email start
Hi Laurie,
 
Good to hear                         from you.
 
Did some test:
 
If you have the following keyboard layout:
 
 
email end
notepad end
Note the load of spaces on the line 'Good to hear.....'
So obviously the program is reading those spaces as a bunch of words. Any amendments you can think of? I've been playing around with the if statement but to no avail.

Proof that it doesn't work:
[screenshot attachment]

Thanks again

It doesn't work because you never check if the last character was a space.


working on a solution right now, two ticks


Amended reply:

//working out numbers of words for file 1, via 'boolean' with lastcharspace (1 true, 0 false)
int lastcharspace = 0;
long wc = 0;
rewind(file_in);
while ((f = fgetc(file_in)) != EOF)
{
    if (f != ' ' && lastcharspace == 0)
    {
        lastcharspace == 0;
    }
    else if (f == ' ' && lastcharspace == 0)
    {
        lastcharspace = 1;
        wc++;
    }
    else if (f != ' ' && lastcharspace == 1)
        lastcharspace = 0;
    else if (f == ' ' && lastcharspace == 1)
        lastcharspace = 1;
    else
        printf("there was an error");
}
printf("The number of words is: %ld \n\n", wc);

[screenshot attachment]


You're not checking for whitespace other than the space character, you should use the isspace() function like @madknight3 said.


Okay, but I get 62 words instead of 30... how have I done this lmao

Code:

//working out numbers of words for file 1, via 'boolean' with lastcharspace (1 true, 0 false)
int lastcharspace = 0;
long wc = 0;
rewind(file_in);
while ((f = fgetc(file_in)) != EOF)
{
    if (isspace(f) && lastcharspace == 1)
        lastcharspace == 1;
    else if (isspace(f) && lastcharspace == 0)
    {
        lastcharspace == 1;
        wc++;
    }
    else if (lastcharspace == 1)
        lastcharspace == 0;
    else if (lastcharspace == 0)
        lastcharspace = 0;
    else
        printf("there was an error");
}
printf("The number of words is: %ld \n\n", wc);
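A likely reason for the inflated count: in the code above, `lastcharspace == 1;` and `lastcharspace == 0;` are comparisons whose results are discarded, so the flag never leaves 0 and every whitespace character bumps wc. The sketch below contrasts that behaviour with a version that uses `=`. Both function names are made up for illustration, and the first function is only an equivalent of what the posted loop ends up doing, not a copy of it.

```c
#include <ctype.h>
#include <stdio.h>

/* What the posted loop effectively does: the flag stays 0 forever,
   so the counting branch runs for every whitespace character. */
long count_with_stuck_flag(FILE *file_in)
{
    long wc = 0;
    int f;

    while ((f = fgetc(file_in)) != EOF) {
        if (isspace(f)) {
            wc++;
        }
    }
    return wc;
}

/* Same idea with '=' so the flag is actually updated:
   only the first whitespace character after a word is counted. */
long count_with_assignment(FILE *file_in)
{
    int lastcharspace = 0;
    long wc = 0;
    int f;

    while ((f = fgetc(file_in)) != EOF) {
        if (isspace(f)) {
            if (lastcharspace == 0) {
                wc++;
            }
            lastcharspace = 1;   /* '=' assigns; '==' would only compare */
        } else {
            lastcharspace = 0;
        }
    }
    return wc;
}
```

Note that the second version still counts word ends, so a file whose last word is not followed by whitespace comes out one short, and a file that starts with whitespace comes out one over.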


Re: using isspace() instead of checking only for the space character: ahh yeah, that was edited in, didn't see it till now. Cheers!

Re: JamieF's Reddit link: this may have solved my problem, look further down.

Re: madknight3's tips on bool and isspace(): yeah, that is really useful to know (this stuff just was not taught to me and the more I know the better I'll be xD!)

Okay, so using JamieF's link I've got this code now:

//TEST
rewind(file_in);
int wordcount = 1, letter = 0;
char word[30];
while ((f = fgetc(file_in)) != EOF)
{
    if (f == '.' || f == '=' || f == '-' || f == '_')
    {
        f = ' ';
    }
    if (f != ' ' && f != '\n')
    {
        if (isalpha(f))
        {
            word[letter] = tolower(f);
            letter++;
        }
    }
    else if ((f == ' ' || f == '\n') && letter > 0)
    {
        letter = 0;
        wordcount++;
        word[0] = '\0';
    }
}
//TEST
printf("printing TEST WORD: %d\n", wordcount);

However, I have to start with wordcount = 1 to cover the case where EOF comes right at the end of a word. The loop never sees EOF as a whitespace character, so it exits without counting the final word. Any amendments (other than wordcount = 1, because if the last character of the file is a space then we have one extra word)?

Thanks


Hold up, just tried this:

//TEST
rewind(file_in);
int wordcount = 0, letter = 0;
char word[30];
while ((f = fgetc(file_in)) != EOF)
{
    if (f == '.' || f == '=' || f == '-' || f == '_')
    {
        f = ' ';
    }
    if (f != ' ' && f != '\n')
    {
        if (isalpha(f))
        {
            word[letter] = tolower(f);
            letter++;
        }
    }
    else if ((f == ' ' || f == '\n') && letter > 0)
    {
        letter = 0;
        wordcount++;
        word[0] = '\0';
    }
}
if (letter > 0)
    wordcount++;
//TEST
printf("printing TEST WORD: %d\n", wordcount);

The last if statement checks whether there are letters left in the string after the loop ends. Anyone see any issues with this?


It looks like your solution works to me. Also please use proper indentation when posting code. It's very hard to read without it.
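One thing to watch in that final version: word[letter] is written with no bounds check against the 30-element array, so a run of more than 30 letters between separators would write past its end, and the buffer is not needed for counting anyway. Here is a sketch of the same counting rules with a flag in place of the buffer. The function names are hypothetical, and treating every isspace() character as a separator (the posted code only checks ' ' and '\n') is an assumption on my part.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Separators: any whitespace, plus the punctuation the posted code
   converts to a space. */
static bool is_separator(int c)
{
    return isspace(c) || c == '.' || c == '=' || c == '-' || c == '_';
}

/* Counts runs between separators that contain at least one letter. */
long count_letter_words(FILE *file_in)
{
    long wordcount = 0;
    bool has_letter = false;  /* replaces the 'letter > 0' test on the buffer */
    int f;

    while ((f = fgetc(file_in)) != EOF) {
        if (is_separator(f)) {
            if (has_letter) {
                wordcount++;
            }
            has_letter = false;
        } else if (isalpha(f)) {
            has_letter = true;
        }
    }
    if (has_letter) {         /* last word ended at EOF, not at a separator */
        wordcount++;
    }
    return wordcount;
}
```

As in the posted code, a token made only of digits or other symbols is not counted, and "www.reddit.com" counts as three words because '.' is a separator.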


  • 3 weeks later...

Hi,

Word counting is a basic task, but doing it fast requires looking into structures like hashes, B-trees and even suffix trees. Prof. Knuth gives this very etude as a must-do in his famous trilogy (Volume 1):

The Art of Computer Programming

http://www-cs-faculty.stanford.edu/~uno/taocp.html

Just want to share my old console tool written in C and inspire you a byte, he-he, not a bit.

The word ripper ('counter' doesn't have the ring of monstrosity) is so fast that I consider it the fastest on the Internet; see for yourself:

D:\_KAZE\wordcount>dir pg47498.txt/b> grablist.lst

D:\_KAZE\wordcount>Leprechaun_x-leton_32bit_Intel_01_4p.exe grablist.lst pg47498.txt.wrd 1234 Y
Leprechaun_singleton (Fast-In-Future Greedy n-gram-Ripper), rev. 16FIXFIX, written by Svalqyatchx.
Purpose: Rips all distinct 1-grams (1-word phrases) with length 1..31 chars from incoming texts.
Feature1: All words within x-lets/n-grams are in range 1..31 chars inclusive.
Feature2: In this revision 128MB 1-way hash is used which results in 16,777,216 external B-Trees of order 3.
Feature3: In this revision, 4 passes are to be made.
Feature4: If the external memory has latency 99+microseconds then !(look no further), IOPS(seek-time) rules.
Pass #1 of 4:
Size of input file with files for Leprechauning: 13
Allocating HASH memory 134,217,793 bytes ... OK
Allocating memory 2MB ... OK
Size of Input TEXTual file: 83,164
/; 00,013,551P/s; Phrase count: 13,551 of them 757 distinct; Done: 64/64
Bytes per second performance: 83,164B/s
Phrases per second performance: 13,551P/s
Time for putting phrases into trees: 1 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,001,514P/s
Time for shaking phrases from trees: 1 second(s)
Leprechaun: Current pass done.
Pass #2 of 4:
Size of input file with files for Leprechauning: 13
Allocating HASH memory 134,217,793 bytes ... OK
Allocating memory 2MB ... OK
Size of Input TEXTual file: 83,164
/; 00,013,551P/s; Phrase count: 13,551 of them 790 distinct; Done: 64/64
Bytes per second performance: 83,164B/s
Phrases per second performance: 13,551P/s
Time for putting phrases into trees: 1 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,001,580P/s
Time for shaking phrases from trees: 1 second(s)
Leprechaun: Current pass done.
Pass #3 of 4:
Size of input file with files for Leprechauning: 13
Allocating HASH memory 134,217,793 bytes ... OK
Allocating memory 2MB ... OK
Size of Input TEXTual file: 83,164
/; 00,013,551P/s; Phrase count: 13,551 of them 758 distinct; Done: 64/64
Bytes per second performance: 83,164B/s
Phrases per second performance: 13,551P/s
Time for putting phrases into trees: 1 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,001,516P/s
Time for shaking phrases from trees: 1 second(s)
Leprechaun: Current pass done.
Pass #4 of 4:
Size of input file with files for Leprechauning: 13
Allocating HASH memory 134,217,793 bytes ... OK
Allocating memory 2MB ... OK
Size of Input TEXTual file: 83,164
/; 00,013,551P/s; Phrase count: 13,551 of them 800 distinct; Done: 64/64
Bytes per second performance: 83,164B/s
Phrases per second performance: 13,551P/s
Time for putting phrases into trees: 1 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,001,600P/s
Time for shaking phrases from trees: 1 second(s)
Leprechaun: Current pass done.
Total memory needed for one pass: 294KB
Total distinct phrases: 3,105
Total time: 1 second(s)
Total performance: 13,551P/s i.e. phrases per second
Leprechaun: Done.

D:\_KAZE\wordcount>dir pg47498.*
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\_KAZE\wordcount

12/25/2014  02:43 AM            83,164 pg47498.txt
12/25/2014  02:46 AM            57,629 pg47498.txt.wrd
               2 File(s)        140,793 bytes
               0 Dir(s)  58,275,557,376 bytes free

D:\_KAZE\wordcount>type pg47498.txt.wrd|more
0,000,001       pulled
0,000,001       pulsed
0,000,002       sleeps
0,000,005       hollow
0,000,002       provisions
0,000,001       increases
0,000,001       meaning
0,000,002       passed
0,000,001       dampness
0,000,001       extract
0,000,002       quite
0,000,004       requirements
0,000,001       silence
0,000,001       rid
0,000,001       dust
0,000,002       possessed
0,000,001       favors
0,000,002       frequently
0,000,002       indies
0,000,008       almost
0,000,001       erect
0,000,001       jolly
0,000,001       actual
0,000,001       types
0,000,007       law
0,000,014       your
0,000,001       cornfields
0,000,001       arrive
0,000,001       civilized
0,000,001       servants
0,000,003       stream
0,000,014       see
0,000,001       sundown
0,000,001       throng
0,000,001       towns
0,000,005       forest
0,000,001       flood
0,000,002       pressed
0,000,001       feature
0,000,002       everything
0,000,001       perform
0,000,002       choosing
0,000,002       knowledge
0,000,001       tread
0,000,001       colias
0,000,001       palace
0,000,001       gain
0,000,001       laboring
0,000,002       pglaf
0,000,002       boo
0,000,001       upspringing
0,000,002       otherwise
0,000,001       reared
0,000,001       australia
0,000,002       formats
0,000,001       climbers
0,000,001       correspondence
0,000,001       attacking
0,000,001       planted
0,000,001       shade
0,000,001       commercial
0,000,001       moving
0,000,001       share
0,000,001       descended
0,000,003       looking
0,000,001       construction
0,000,003       passenger
The process tried to write to a nonexistent pipe.
^C
D:\_KAZE\wordcount>

The beauty of Leprechaun is that it can rip/extract the words of the entire English language, I mean ALL, using a simple netbook.

If you are interested, you can count the words of the whole EnWiki (a 50GB file) the same way as shown above, and fast; there are 20,000,000+ unique words.

It is 100% FREE. Source and executable code are on the Internet. I won't give a link here, but you can search bing.com for:

"A free open-source and demonically fast phrase ripper"

Hope you will enjoy the power of C.
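To make the mention of hashes a little more concrete, here is a small sketch of counting distinct words with a chained hash table. It is not Leprechaun's code or design: the table size, the FNV-1a hash, every name in it, and the choice to truncate words longer than 31 letters are all assumptions made to keep the example short.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS  1024u  /* assumed fixed table size */
#define MAX_WORD 31     /* assumed cap; longer words are truncated */

struct entry {
    struct entry *next;
    long count;
    char word[MAX_WORD + 1];
};

struct table {          /* must be zero-initialised before use */
    struct entry *bucket[BUCKETS];
    long distinct;
};

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t hash_word(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Records one occurrence of word (expected to be at most MAX_WORD chars).
   Returns 0 on success, -1 if memory runs out. */
int table_add(struct table *t, const char *word)
{
    struct entry **slot = &t->bucket[hash_word(word) % BUCKETS];
    struct entry *e;

    for (e = *slot; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            e->count++;
            return 0;
        }
    }
    e = malloc(sizeof *e);
    if (e == NULL) {
        return -1;
    }
    snprintf(e->word, sizeof e->word, "%s", word);
    e->count = 1;
    e->next = *slot;    /* push onto the front of the bucket's chain */
    *slot = e;
    t->distinct++;
    return 0;
}

/* Returns how many times word was recorded, or 0 if never. */
long table_get(const struct table *t, const char *word)
{
    const struct entry *e = t->bucket[hash_word(word) % BUCKETS];
    for (; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            return e->count;
        }
    }
    return 0;
}

/* Splits fp into lowercased runs of letters and records each run. */
int tally_words(struct table *t, FILE *fp)
{
    char word[MAX_WORD + 1];
    size_t len = 0;
    int c;

    do {
        c = fgetc(fp);
        if (c != EOF && isalpha(c)) {
            if (len < MAX_WORD) {
                word[len++] = (char)tolower(c);
            }
        } else if (len > 0) {   /* a word just ended (or EOF hit) */
            word[len] = '\0';
            if (table_add(t, word) != 0) {
                return -1;
            }
            len = 0;
        }
    } while (c != EOF);
    return 0;
}

/* Frees every entry and resets the table to empty. */
void table_free(struct table *t)
{
    for (size_t i = 0; i < BUCKETS; i++) {
        struct entry *e = t->bucket[i];
        while (e != NULL) {
            struct entry *next = e->next;
            free(e);
            e = next;
        }
        t->bucket[i] = NULL;
    }
    t->distinct = 0;
}
```

A caller would zero-initialise a struct table, run tally_words() on an open file, read the distinct field for the number of unique words, and call table_free() when done. A fixed 1024-bucket table is fine for a small text but would need resizing, or a different structure, for anything on the scale described above.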
