On Jul 22, 2014, at 23:26 , Roland King <email@hidden> wrote:
Running that in a playground (playgrounds are working better in Beta 4) gives me those 3 UTF-16 code points.
No, what you get in UTF-16 are code *units* — the 16-bit values that make up the UTF-16 sequence. (Code units are 16-bit values in UTF-16, 8-bit values in UTF-8 and 32-bit values in UTF-32.)
Code *points* are the 21-bit Unicode “characters” [but beware that UTF-16 code *units* are called “characters” in Cocoa] that are represented by sequences of code units (1-4 code units in UTF-8, 1-2 code units in UTF-16, 1 code unit in UTF-32).
Graphemes are sequences of code *points*. They’re inherently of varying length (varying numbers of code points, each of which may in turn be represented by varying numbers of code units in the various UTF encodings).
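To make the distinction concrete, here is a quick illustration (assuming the utf8 and utf16 views on String; I haven’t checked exactly which of these beta 4 exposes):

	let e = "e\u{0301}"                     // LATIN SMALL LETTER E plus COMBINING ACUTE ACCENT: two code points
	for unit in e.utf16 { println(unit) }   // two code units: 0x0065 and 0x0301 (printed as 101 and 769)
	for unit in e.utf8  { println(unit) }   // three code units: 0x65, 0xCC, 0x81 (the accent takes two bytes)
	for ch   in e       { println(ch)   }   // one grapheme: é

Two code points either way; the number of code units depends on the encoding, and the grapheme count is one.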
What the beta 4 release notes say is:
Unicode String improvements
The String type now implements a grapheme cluster segmentation algorithm to produce Characters. This means that iteration over complex strings that include combining marks, variation sequences, and regional indicators work properly. For example, this code now returns the value 15: countElements("a\u{1F30D}cafe\u{0301}umbrella\u{FE0E} \u{1F1E9}\u{1F1EA}")
Also, a for-in loop over the string produces each human visible character in sequence.
I saw “segmentation” in this and thought that meant that grapheme clusters were now (in beta 4) being broken apart (and that they weren’t before). Gerriet is saying that the opposite is true: grapheme clusters are now *not* being broken apart, but they were before. And running that cafe/umbrella string through the debugger, I see that this is so. (\u{1F30D}, e\u{0301}, and \u{FE0E} \u{1F1E9}\u{1F1EA} are each a single grapheme, and there are 12 other single-code-point letters.)
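The same thing shows up outside the debugger just by counting and looping (a sketch; the exact count of course depends on whether that string survived mail transport intact):

	let ss = "a\u{1F30D}cafe\u{0301}umbrella\u{FE0E} \u{1F1E9}\u{1F1EA}"
	println(countElements(ss))     // counts graphemes, not code units: 15 per the release notes
	for ch in ss {
	    println(ch)                // \u{1F30D}, e\u{0301}, and the regional-indicator pair each come out as one Character
	}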
That means that “Character” in Swift doesn’t mean anything simple — not anything that can be represented as a single numeric value, and certainly not a code point. For mainstream string usage, that’s good news — no more messing around with ‘rangeOfComposedCharacterSequenceAtIndex:’, but if you actually want the code points, it’s a retrogression.
I suppose the logical way to get code points would be to enumerate ‘ss.utf32’ (code units and code points are numerically identical in that case), but no such property seems to exist in Swift. Perhaps there is something else that gives code points, but we just don’t know what it is.
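If something like that does exist, I’d expect it to look like a scalar view — I’m calling it ‘unicodeScalars’ here, but that name is a guess on my part as far as beta 4 is concerned:

	// Hypothetical: assumes a 'unicodeScalars' view giving one value per code point.
	let word = "cafe\u{0301}"
	for scalar in word.unicodeScalars {
	    println(scalar.value)      // 99, 97, 102, 101, 769, i.e. U+0063, U+0061, U+0066, U+0065, U+0301
	}

That would give the 21-bit code points directly, which amounts to the same thing as a utf32 view, since code unit and code point coincide there.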