Re: splitting CJK text into "words"
- Subject: Re: splitting CJK text into "words"
- From: Martin Wierschin <email@hidden>
- Date: Thu, 27 Sep 2012 18:33:40 -0700
> There are the Kinsoku rules, which are wrap rules for Japanese. Semantically similar rules exist for Chinese and Korean. A simple implementation is not too difficult, see here for a quick overview:
>
> http://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages
Thanks for the link, Markus, but unless I'm missing something, that page only covers line breaking/wrapping, not detecting word boundaries.
So it looks like the Cocoa/CoreFoundation frameworks don't have what's needed for this, but after some digging it seems ICU does:
http://userguide.icu-project.org/boundaryanalysis
I can just drop down to using the libicu C functions (e.g. ubrk_open). Passing "ja_JP" as the locale there seems to do the trick.
Best,
~Martin
_______________________________________________
Cocoa-dev mailing list (email@hidden)