Re: Word count
- Subject: Re: Word count
- From: "Louis C. Sacha" <email@hidden>
- Date: Wed, 9 Jun 2004 23:45:56 -0700
Hello...
Or, more simply:
(Typed in Mail...)
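- (unsigned)wordCountForString:(NSString *)textString
{
    NSScanner *wordScanner = [NSScanner scannerWithString:textString];
    // Include newlines, so that words separated only by a line break
    // are not merged into one.
    NSCharacterSet *whiteSpace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    unsigned wordCount = 0;

    // Each successful scan consumes one run of non-whitespace characters.
    while ([wordScanner scanUpToCharactersFromSet:whiteSpace
                                       intoString:nil]) {wordCount++;}
    return wordCount;
}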
Since NSScanner skips whitespace and newline characters by default at
the beginning of anything it scans (you can change this with the
setCharactersToBeSkipped: method), you don't need to manually scan
over the whitespace between words.
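As a rough illustration of that skipping behavior, here is a sketch that counts tokens either with the scanner's default skip set or with skipping turned off. The function name CountTokens and its keepDefaultSkipping flag are made up for this example. With skipping disabled, the loop has to consume the separators itself, which is the extra step the default saves you.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): counts
// whitespace-delimited tokens, optionally disabling the scanner's
// default skipping of whitespace and newlines.
static NSUInteger CountTokens(NSString *text, BOOL keepDefaultSkipping)
{
    NSScanner *scanner = [NSScanner scannerWithString:text];
    NSCharacterSet *separators =
        [NSCharacterSet whitespaceAndNewlineCharacterSet];

    if (!keepDefaultSkipping) {
        // Passing nil tells the scanner to skip nothing.
        [scanner setCharactersToBeSkipped:nil];
    }

    NSUInteger count = 0;
    while (![scanner isAtEnd]) {
        if ([scanner scanUpToCharactersFromSet:separators intoString:NULL]) {
            count++;    // consumed one run of non-separator characters
        } else if (![scanner scanCharactersFromSet:separators
                                        intoString:NULL]) {
            break;      // nothing could be consumed; avoid looping forever
        }
    }
    return count;
}
```

With the default skip set, the else branch is never needed because the scanner steps over the separators on its own. With skipping disabled, the same loop only makes progress because of the explicit scanCharactersFromSet:intoString: call.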
Also, you can speed things up by looking up the character set only
once, outside the loop. I prefer to use
scanUpToCharactersFromSet:intoString: to do the actual scanning, but
there is probably little, if any, performance difference compared
to using scanCharactersFromSet:intoString:.
There are a variety of things that will trip up this way of counting
words, for example the string @"Hello World ! ! ! ! ! !" would come
out as 8 words (there are spaces between the !'s).
A Cocoa equivalent to Alan's method -- I think ;) -- would be:
- (unsigned)wordCountForString:(NSString *)textString
{
    NSScanner *wordScanner = [NSScanner scannerWithString:textString];
    NSCharacterSet *nonLetters = [[NSCharacterSet
        letterCharacterSet] invertedSet];
    [wordScanner setCharactersToBeSkipped:nonLetters];
    unsigned wordCount = 0;
    while ([wordScanner scanUpToCharactersFromSet:nonLetters
                                       intoString:nil]) {wordCount++;}
    return wordCount;
}
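To see what a letters-only definition of a word does in practice, here is the same idea as a free-standing function you can call on a few strings. The name LetterRunCount is made up for this example; it counts every maximal run of characters from letterCharacterSet as one word.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): every maximal run
// of letters counts as one word; everything else is a separator.
static NSUInteger LetterRunCount(NSString *text)
{
    NSScanner *scanner = [NSScanner scannerWithString:text];
    NSCharacterSet *nonLetters =
        [[NSCharacterSet letterCharacterSet] invertedSet];

    // Skip anything that is not a letter before each scan.
    [scanner setCharactersToBeSkipped:nonLetters];

    NSUInteger count = 0;
    while ([scanner scanUpToCharactersFromSet:nonLetters intoString:NULL]) {
        count++;
    }
    return count;
}
```

Free-standing punctuation is ignored, so @"Hello World ! ! ! ! ! !" should count as 2. The trade-off is that an apostrophe or a digit also ends a word: @"don't" should count as 2 ("don" and "t"), and @"10,000" as 0.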
In one application where I wanted the count to be basically accurate
to within +/- 5% for a variety of types of text, I did something
similar to the following:
static NSCharacterSet *cachedSet = nil;

@implementation ThatClass

+ (NSCharacterSet *)whitespaceAndPunctuationSet
{
    if (!cachedSet)
    {
        // Must be typed as mutable, since formUnionWithCharacterSet:
        // is only declared on NSMutableCharacterSet.
        NSMutableCharacterSet *tempSet = [NSMutableCharacterSet
            whitespaceAndNewlineCharacterSet];
        [tempSet formUnionWithCharacterSet:[NSCharacterSet
            punctuationCharacterSet]];
        cachedSet = [tempSet copy];
    }
    return cachedSet;
}

- (unsigned)wordCountForString:(NSString *)textString
{
    NSScanner *wordScanner = [NSScanner scannerWithString:textString];
    // Newlines are included so that a line break also ends a word.
    NSCharacterSet *whiteSpace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSCharacterSet *skipSet = [ThatClass whitespaceAndPunctuationSet];
    [wordScanner setCharactersToBeSkipped:skipSet];
    unsigned wordCount = 0;

    // Leading whitespace and punctuation are skipped before each scan;
    // punctuation inside a word does not end it.
    while ([wordScanner scanUpToCharactersFromSet:whiteSpace
                                       intoString:nil]) {wordCount++;}
    return wordCount;
}

@end
That implementation had the advantage that it would skip
free-standing punctuation, but still count things like "don't",
"Micro$oft" and "10,000" as a single word. Of course, there are still
things that would throw it off.
The most accurate way of counting words will depend on the exact type
of text that you will be checking (and what you consider to be a
word). The best way to find out is to write several different
versions of the word counting code and throw as many different
examples of text at them as you expect to occur in the application's
use.
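A minimal sketch of that kind of side-by-side comparison might look like the following. All of the names here (CountWithSets, ByWhitespace, ByLetters, SkippingPunctuation, CompareCounters) are made up for this example, and the three counters are only rough stand-ins for the approaches discussed above, not the exact code from a real application.

```objc
#import <Foundation/Foundation.h>

// Shared loop: skip characters in `skip`, then count each run of
// characters that ends at a member of `stop`.
static NSUInteger CountWithSets(NSString *text,
                                NSCharacterSet *skip,
                                NSCharacterSet *stop)
{
    NSScanner *scanner = [NSScanner scannerWithString:text];
    [scanner setCharactersToBeSkipped:skip];
    NSUInteger count = 0;
    while ([scanner scanUpToCharactersFromSet:stop intoString:NULL]) {
        count++;
    }
    return count;
}

// Variant 1: anything between whitespace is a word.
static NSUInteger ByWhitespace(NSString *text)
{
    NSCharacterSet *ws = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    return CountWithSets(text, ws, ws);
}

// Variant 2: only runs of letters are words.
static NSUInteger ByLetters(NSString *text)
{
    NSCharacterSet *nonLetters =
        [[NSCharacterSet letterCharacterSet] invertedSet];
    return CountWithSets(text, nonLetters, nonLetters);
}

// Variant 3: skip leading whitespace and punctuation, stop at whitespace.
static NSUInteger SkippingPunctuation(NSString *text)
{
    NSCharacterSet *ws = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableCharacterSet *skip =
        [NSMutableCharacterSet whitespaceAndNewlineCharacterSet];
    [skip formUnionWithCharacterSet:
        [NSCharacterSet punctuationCharacterSet]];
    return CountWithSets(text, skip, ws);
}

// Logs the three counts for every sample so they can be compared by eye.
static void CompareCounters(NSArray *samples)
{
    for (NSString *sample in samples) {
        NSLog(@"\"%@\": whitespace=%lu letters=%lu skipping-punctuation=%lu",
              sample,
              (unsigned long)ByWhitespace(sample),
              (unsigned long)ByLetters(sample),
              (unsigned long)SkippingPunctuation(sample));
    }
}
```

Calling CompareCounters with strings such as @"Hello World ! ! ! ! ! !", @"don't 10,000" and @"2 + 2" makes the differences easy to spot. The last one is a case where even the punctuation-skipping variant counts a stray symbol as a word, because "+" is a symbol rather than punctuation.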
Hope that helps,
Louis
(Typed in Mail...)
int words = 0;
NSScanner *scanner = [NSScanner scannerWithString:string];
while (![scanner isAtEnd])
{
    [scanner scanCharactersFromSet:[NSCharacterSet
        whitespaceAndNewlineCharacterSet] intoString:nil];
    if ([scanner scanCharactersFromSet:[[NSCharacterSet
        whitespaceAndNewlineCharacterSet] invertedSet] intoString:nil])
        words++;
}
zach