Re: How to convert a UTF-8 byte offset into an NSString character offset?
- Subject: Re: How to convert a UTF-8 byte offset into an NSString character offset?
- From: Quincey Morris <email@hidden>
- Date: Tue, 06 May 2014 11:12:52 -0700
On May 5, 2014, at 12:06 , Jens Alfke <email@hidden> wrote:
> How can I map a byte offset in a UTF-8 string back to the corresponding character offset in the NSString it came from?
I’ve been thinking about this since your original question, and it seems to me that this is a subtler problem than it first appears:
1. You cannot *in general* map a UTF-8 byte offset to an NSString (UTF-16) “character” offset. The two representations may have different numbers of code units (1–4 for UTF-8, 1–2 for UTF-16) per code point. There’s no real answer to the question of what UTF-16 offset corresponds to the 3rd code unit of a 4-byte UTF-8 code point.
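To make the code-unit mismatch concrete, here’s a minimal C sketch (not Cocoa API — just illustrative, and it assumes the input is valid UTF-8 and that the byte offset falls on a code-point boundary). It walks the UTF-8 buffer up to a byte offset, counting how many UTF-16 code units the same prefix would occupy: 1–3 UTF-8 bytes become one UTF-16 unit, 4 bytes become a surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name): map a UTF-8 byte offset that
 * lies on a code-point boundary to the equivalent UTF-16 code-unit offset.
 * Assumes valid UTF-8; does no error checking. */
static size_t utf16_offset_for_utf8_offset(const uint8_t *buf, size_t byteOffset) {
    size_t u16 = 0;
    for (size_t i = 0; i < byteOffset; ) {
        uint8_t b = buf[i];
        if (b < 0x80)      { i += 1; u16 += 1; }  /* 1 byte  -> 1 UTF-16 unit */
        else if (b < 0xE0) { i += 2; u16 += 1; }  /* 2 bytes -> 1 unit        */
        else if (b < 0xF0) { i += 3; u16 += 1; }  /* 3 bytes -> 1 unit        */
        else               { i += 4; u16 += 2; }  /* 4 bytes -> surrogate pair */
    }
    return u16;
}
```

For example, in the string "Aé😀B" (bytes 41 C3 A9 F0 9F 98 80 42), the ‘B’ sits at UTF-8 byte offset 7 but at UTF-16 offset 4, because the emoji alone takes 4 bytes but 2 UTF-16 units.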
2. So, you’re restricted at least to byte offsets of UTF-8 code units that are the *start* of a code point. However, there’s a potential problem with this, because you’re not in control of the structure of the NSString. It’s possible, for example, that the UTF-8 byte offset points to the second (or later) code point of a base+combining mark sequence, but an equivalent NSString has a single code point consisting of one or two code units (a “composed character”). Even if both versions of the string have the same number of code points (“characters”), they may have different orders.
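At least the “start of a code point” restriction is cheap to enforce, since UTF-8 is self-synchronizing: continuation bytes all have the bit pattern 10xxxxxx. A small C check (again just a sketch, not any existing API):

```c
#include <stdbool.h>
#include <stdint.h>

/* A UTF-8 continuation byte has its top two bits set to 10, i.e. it lies
 * in 0x80-0xBF. A byte offset pointing at one of these is mid-code-point
 * and therefore has no meaningful UTF-16 equivalent. */
static bool is_code_point_start(uint8_t byte) {
    return (byte & 0xC0) != 0x80;
}
```

So you can at least reject (or round down) client-supplied offsets that land inside a multi-byte sequence before attempting any mapping.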
3. It’s *possible* that you can create an NSString that has the same code points in the same order as the UTF-8 string, but I don’t see any API contract that clearly guarantees it. The documentation for -[NSString initWithCharacters:length:] says that the return value is “An initialized NSString object containing length characters taken from characters.” That *might* be a sufficient guarantee, but code-point equivalence possibly isn’t guaranteed across some NSString manipulation methods, so you’d have to be careful.
4. Otherwise, I think it’s yet more difficult. The next-higher Unicode boundary is “grapheme clusters”. You can divide a NSString into grapheme clusters (either through direct iteration using ‘rangeOfComposedCharacterSequence…’, or through enumeration using ‘enumerateSubstrings…’), but to match the UTF-8 and NSString representations cluster by cluster you’d need to break the UTF-8 string into grapheme clusters using the same algorithm as NSString, and it’s not documented what the precise algorithm is.
(The documentation at:
https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html
refers to this:
http://unicode.org/reports/tr29/
which I find pretty overwhelming.)
5. Even if #3 works, you may still run into trouble with grapheme clusters. For example, if a UTF-8 byte offset actually points to a code point in the middle of a cluster, you may have trouble getting consistent NSString behavior from substrings that start at that code point.
FWIW, my opinion is that if your library clients are specifying UTF-8 sequences at the API, and expect byte offsets into those sequences to be meaningful, you might well be forced to maintain the original UTF-8 sequence in the library’s internal data model — or, perhaps, an array of the original code points — and do all of your internal processing in terms of code points. Conversion to NSString would happen only in the journey from data model to UI text field.
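If you did keep an array of the original code points, the bookkeeping is straightforward. Here’s a rough C sketch of the idea (names and struct are hypothetical, and it assumes valid UTF-8): decode each code point while recording the byte offset where it started, so a client’s byte offset can be looked up and mapped to a code-point index.

```c
#include <stddef.h>
#include <stdint.h>

/* One decoded code point plus the UTF-8 byte offset where it began, so
 * that client-supplied byte offsets can be mapped to code-point indices. */
typedef struct {
    uint32_t codePoint;
    size_t   byteOffset;
} CodePointRecord;

/* Decode valid UTF-8 into records; returns the number of code points.
 * 'out' must have room for one record per input byte (the worst case). */
static size_t decode_utf8(const uint8_t *buf, size_t len, CodePointRecord *out) {
    size_t n = 0;
    for (size_t i = 0; i < len; ) {
        uint8_t b = buf[i];
        uint32_t cp;
        size_t width;
        if (b < 0x80)      { cp = b;        width = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; width = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; width = 3; }
        else               { cp = b & 0x07; width = 4; }
        for (size_t k = 1; k < width; k++)       /* fold in continuation bytes */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        out[n].codePoint = cp;
        out[n].byteOffset = i;
        n++;
        i += width;
    }
    return n;
}
```

With that table in hand, a byte offset is valid exactly when it appears as some record’s byteOffset, and the record’s index is the code-point offset you’d do all internal processing in.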
_______________________________________________
Cocoa-dev mailing list (email@hidden)