Re: parsing a string into words
- Subject: Re: parsing a string into words
- From: Ken Thomases <email@hidden>
- Date: Sat, 25 Apr 2009 22:15:28 -0500
On Apr 25, 2009, at 10:06 PM, Gerriet M. Denkmann wrote:
One question though: why are "version4", "ปี2009" or "ทีมA"
all parsed as one word?
I would think that the change from letters to numbers, or from Thai
to Latin would indicate a word-break.
I haven't read it, myself, but the docs for CFStringTokenizer have a
link all the way at the bottom to this page:
http://www.unicode.org/reports/tr29/#Word_Boundaries
That's presumably the governing document for how it does its work.
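For the Latin part of the question, the rules that seem to matter in that section of TR29 are WB9 (no break between a letter and a following digit) and WB10 (no break between a digit and a following letter), which would explain why "version4" stays together. Below is a rough Python sketch of just those rules plus WB5 and WB8, restricted to ASCII. It is only an illustration under those assumptions, not the CFStringTokenizer or ICU implementation; the names word_break_class and split_words are made up for this example, and Thai segmentation is not modelled at all.

```python
# Simplified, ASCII-only sketch of a few UAX #29 word-boundary rules.
# This is an illustration, not how CFStringTokenizer or ICU is implemented.

def word_break_class(ch):
    """Rough stand-in for the Word_Break property (hypothetical helper)."""
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    if ch.isascii() and ch.isdigit():
        return "Numeric"
    return "Other"

# Class pairs across which the rules forbid a break.
NO_BREAK = {
    ("ALetter", "ALetter"),  # WB5: letter x letter
    ("Numeric", "Numeric"),  # WB8: digit x digit
    ("ALetter", "Numeric"),  # WB9: letter x digit
    ("Numeric", "ALetter"),  # WB10: digit x letter
}

def split_words(text):
    """Split text at the allowed boundaries; keep letter/digit segments only."""
    segments = []
    current = ""
    prev = None
    for ch in text:
        cls = word_break_class(ch)
        # Break before ch unless one of the rules above joins it to prev.
        if current and (prev, cls) not in NO_BREAK:
            segments.append(current)
            current = ""
        current += ch
        prev = cls
    if current:
        segments.append(current)
    # Drop segments that are only spaces or punctuation.
    return [s for s in segments if any(c.isalnum() for c in s)]

print(split_words("version4"))
print(split_words("version 4 of ICU"))
```

With these rules the letter-to-digit transition alone never produces a boundary; only the spaces in the second call do.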
I also suspect that CFStringTokenizer, like some other parts of
CoreFoundation, is using ICU <http://site.icu-project.org/> under the
hood. So, any documentation of the ICU implementation
<http://userguide.icu-project.org/boundaryanalysis> would probably be
relevant to CFStringTokenizer.
Regards,
Ken
_______________________________________________
Cocoa-dev mailing list (email@hidden)