splitting CJK text into "words"
- Subject: splitting CJK text into "words"
- From: Martin Wierschin <email@hidden>
- Date: Wed, 26 Sep 2012 14:12:27 -0700
Hello everyone,
I'm trying to split CJK text using the kind of word boundaries detected by -[NSAttributedString doubleClickAtIndex:]. That method does the job correctly, but only if the Word Break mode in the system preferences is set to Japanese. I need this kind of word splitting to work regardless of the user's system preferences.
It was my understanding that I could use CFStringTokenizer for this task, but it doesn't seem to work. Here is test code that produces the wrong result:
> NSString* str = @"\u4E2D\u79CB\u5FEB\u5230\u4E86"; // 中秋快到了
> CFRange strRange = CFRangeMake(0, [str length]);
>
> // Build the locale handed to the tokenizer.
> CFStringRef cjkIdent = CFLocaleCreateCanonicalLocaleIdentifierFromString(NULL, CFSTR("jp"));
> CFLocaleRef cjkLoc = CFLocaleCreate( NULL, cjkIdent );
> CFStringTokenizerRef cjkTokenizer = CFStringTokenizerCreate( NULL, (CFStringRef)str, strRange, kCFStringTokenizerUnitWordBoundary, cjkLoc );
>
> // Advance to the first token and read its range.
> CFStringTokenizerTokenType tokenType = CFStringTokenizerAdvanceToNextToken(cjkTokenizer);
> CFRange wordRange = CFStringTokenizerGetCurrentTokenRange(cjkTokenizer);
>
> // Release everything obtained from a Create function.
> CFRelease(cjkTokenizer);
> CFRelease(cjkLoc);
> CFRelease(cjkIdent);
This code sets wordRange to (0,2), not the (0,5) I'd like.
I've tried a variety of locale identifiers (e.g. "zh", "jp_JP", etc.) but no joy. Am I missing something?
Thanks for any help,
~Martin
_______________________________________________
Cocoa-dev mailing list (email@hidden)