
Re: NSString's handling of Unicode extension B (and C) characters


Subject: Re: NSString's handling of Unicode extension B (and C) characters
From: Ryan Homer <email@hidden>
Date: Sat, 7 Nov 2009 09:17:08 -0500

[SOLVED]

On 2009-11-06, at 12:42 PM, Clark Cox wrote:

On Fri, Nov 6, 2009 at 5:22 AM, Ryan Homer <email@hidden> wrote:
On 2009-11-05, at 1:42 PM, Clark Cox wrote:
Yes. I am importing characters from a text file and need to process them in a certain way. A word may have an alternate form which is denoted after the word in square brackets. When the alternate form contains some of the same characters in the same position, they are represented with a dash. It's more complicated than that in that there are alternate words and characters that are separated by / and //.

Anyway, without getting into more details, the way I'm currently processing the data depends on the number of characters.
Then you need to be very careful how you define "character".
Is "ü" a single character, or two characters?

When you define a string using ü, isn't it stored internally as one UTF-16 code unit (not sure if I'm using the notation correctly), U+00FC, and only when you decompose the string do you get the base character 'u' stored as one unit and the umlaut stored as another? In my experience, this is how it has seemed to me.
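
Whether "ü" occupies one or two UTF-16 code units depends on how the string was built, because NSString stores the code units it is given and only normalizes on request. A minimal sketch of that difference follows; the helper names UTF16LengthComposed and UTF16LengthDecomposed are made up for illustration and are not part of Foundation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: number of UTF-16 code units after canonical
// composition (NFC).
static NSUInteger UTF16LengthComposed(NSString *s) {
    return [[s precomposedStringWithCanonicalMapping] length];
}

// Hypothetical helper: number of UTF-16 code units after canonical
// decomposition (NFD).
static NSUInteger UTF16LengthDecomposed(NSString *s) {
    return [[s decomposedStringWithCanonicalMapping] length];
}
```

With these, @"\u00FC" has a length of 1 and decomposes to 2 code units, while @"u\u0308" (u followed by a combining diaeresis) has a length of 2 and composes to 1.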

Is "킴" a single character, or three characters?

I don't know about Korean characters, but when dealing with Chinese characters, we have, for example, 中, which I consider a character, and 心, another character and then a combination of the two, 忠. Now, you can't use decomposedStringWithCanonicalMapping on 忠 to get 中 and 心. So, when I say character in this context, 中, 心 and 忠 are each a single character.
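
The contrast between the Korean and Chinese examples can be checked directly: a Hangul syllable such as 킴 (U+D0B4) has a canonical decomposition into three jamo, whereas 忠 (U+5FE0) has none. The sketch below assumes a hypothetical helper, DecomposedCodeUnits, that lists the UTF-16 code units of the decomposed string.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the UTF-16 code units of the canonical
// decomposition (NFD) of s, boxed as NSNumbers so they are easy to log.
static NSArray<NSNumber *> *DecomposedCodeUnits(NSString *s) {
    NSString *d = [s decomposedStringWithCanonicalMapping];
    NSMutableArray<NSNumber *> *units = [NSMutableArray array];
    for (NSUInteger i = 0; i < d.length; i++) {
        [units addObject:@([d characterAtIndex:i])];
    }
    return units;
}
```

DecomposedCodeUnits(@"\uD0B4") yields three values (0x110F, 0x1175, 0x11B7), and DecomposedCodeUnits(@"\u5FE0") yields the single value 0x5FE0.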

It's just that I've never come across surrogate pairs in dealing with Chinese characters until now, now that I have to deal with extension B & C characters.
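
Extension B and C characters lie above U+FFFF, so UTF-16 represents each of them as a high surrogate followed by a low surrogate. As a rough illustration, the sketch below decodes such a pair by hand; CodePointAtIndex is a hypothetical helper written for this note, not a Foundation call.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes the code point that starts at UTF-16 index i.
// Returns the code unit itself when it does not start a valid surrogate pair.
static uint32_t CodePointAtIndex(NSString *s, NSUInteger i) {
    unichar hi = [s characterAtIndex:i];
    BOOL isHigh = (hi >= 0xD800 && hi <= 0xDBFF);
    if (isHigh && i + 1 < s.length) {
        unichar lo = [s characterAtIndex:i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            // Each surrogate carries 10 bits of the offset above U+10000.
            return 0x10000 + (((uint32_t)hi - 0xD800) << 10)
                           + ((uint32_t)lo - 0xDC00);
        }
    }
    return hi;
}
```

For the Extension B character U+247E5, the string @"\U000247E5" has a length of 2, its code units are 0xD851 and 0xDFE5, and CodePointAtIndex(s, 0) returns 0x247E5.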

... then you can use
-rangeOfComposedCharacterSequenceAtIndex: to find the range of indices
(representing a single "character") that contain the given index.


THANKS! This solves my problem.

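	// U+7075 followed by the UTF-8 bytes of U+247E5, an Extension B character
	NSString *s = @"\u7075\xf0\xa4\x9f\xa5";
	NSUInteger length = 0;
	// Step through the string one composed character sequence at a time.
	for (NSUInteger i = 0; i < s.length; ) {
		NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:i];
		length++;
		i = NSMaxRange(r);	// first index after this sequence
	}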

The length is the 2 that I need!
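
The same count can probably be obtained without managing indices by hand, using -enumerateSubstringsInRange:options:usingBlock: (available from Mac OS X 10.6) with the composed-character-sequence option. This is only a sketch; ComposedCharacterCount is a name invented here.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts composed character sequences, so a surrogate
// pair or a base letter plus combining marks counts once.
static NSUInteger ComposedCharacterCount(NSString *s) {
    __block NSUInteger count = 0;
    // SubstringNotRequired avoids creating a substring for every sequence.
    NSStringEnumerationOptions opts =
        NSStringEnumerationByComposedCharacterSequences |
        NSStringEnumerationSubstringNotRequired;
    [s enumerateSubstringsInRange:NSMakeRange(0, s.length)
                          options:opts
                       usingBlock:^(NSString *substring, NSRange substringRange,
                                    NSRange enclosingRange, BOOL *stop) {
        count++;
    }];
    return count;
}
```

ComposedCharacterCount(@"\u7075\U000247E5") returns 2, and ComposedCharacterCount(@"abc") returns 3.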

Is there a way to count the surrogate pair as one character?
Not using the -length method. Conceptually, NSString deals in UTF-16
code units and only UTF-16 code units.
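
To make the distinction concrete: -length reports UTF-16 code units, so counting Unicode code points means treating each valid surrogate pair as one. A minimal sketch, assuming a hypothetical helper named CodePointCount:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts Unicode code points by treating each valid
// high/low surrogate pair as one. Unpaired surrogates count individually.
static NSUInteger CodePointCount(NSString *s) {
    NSUInteger count = 0;
    NSUInteger n = s.length;
    for (NSUInteger i = 0; i < n; i++) {
        unichar c = [s characterAtIndex:i];
        count++;
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n) {
            unichar next = [s characterAtIndex:i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i++;  // skip the low surrogate that completes the pair
            }
        }
    }
    return count;
}
```

For @"\u7075\U000247E5", -length is 3 while CodePointCount returns 2. Code points are still not the same as composed character sequences: @"u\u0308" contains 2 code points but a single composed character sequence.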



