Re: Number of chars
- Subject: Re: Number of chars
- From: Aki Inoue <email@hidden>
- Date: Thu, 21 Mar 2013 11:10:15 -0700
Please note that std::string does not provide the localized behavior for collation, searching, case mapping, etc. that our customers are accustomed to.
If you're handling user visible strings, we recommend sticking to NSString at least for those operations.
Also, looking for a safe byte boundary by checking whether the high bit is set doesn't work for Unicode encodings like UTF-8.
With Unicode, some characters can be represented with multiple code points (not just multiple bytes).
For example, Á is the single code point U+00C1.
It can also be represented as A followed by a combining acute accent (U+0041 U+0301).
That decomposed form is encoded in UTF-8 as a 3-byte sequence: 0x41, 0xCC, 0x81.
By just checking the high bit, you're separating the accent from the base character.
For that matter, even UTF-32 (aka UCS-4) is not safe to truncate at an arbitrary 4-byte boundary, because a combining sequence can still be split there.
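
To make the distinction concrete, here is a minimal C++ sketch, assuming well-formed UTF-8 input. The names count_code_points and code_point_boundary are hypothetical helpers written for this illustration, not part of any library. The first counts code points by skipping continuation bytes; the second backs up to the start of a code point. Neither knows anything about combining sequences, which is exactly the limitation described above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 string.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a code point.
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: move pos back to the first byte of the code point
// that contains it, so utf8.substr(0, result) never cuts a multi-byte
// sequence in half. It can still separate a combining mark from its base.
std::size_t code_point_boundary(const std::string& utf8, std::size_t pos) {
    if (pos >= utf8.size()) {
        return utf8.size();
    }
    while (pos > 0 &&
           (static_cast<unsigned char>(utf8[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}
```

With the precomposed "\xC3\x81" (U+00C1) the count is 1, but with the decomposed "A\xCC\x81" (U+0041 U+0301) it is 2, even though the user sees one character in both cases. Likewise, code_point_boundary on the decomposed string at byte offset 2 returns 1, so truncating there keeps the A and drops the accent. If what you need is user-perceived characters, NSString's rangeOfComposedCharacterSequenceAtIndex: is the safer tool.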
Aki
On Mar 21, 2013, at 9:59 AM, Luther Baker <email@hidden> wrote:
> I apologize for leading you the wrong way Luca!
>
> -Luther
>
>
>
> On Thu, Mar 21, 2013 at 9:46 AM, Luca Ciciriello <email@hidden> wrote:
>
>> Ok, thanks.
>>
>> Luca.
>>
>> On Mar 21, 2013, at 3:43 PM, Glenn L. Austin <email@hidden> wrote:
>>
>>>
>>> On Mar 21, 2013, at 2:34 AM, Jean-Daniel Dupas <email@hidden> wrote:
>>>
>>>>
>>>> On Mar 21, 2013, at 9:27 AM, Luca Ciciriello <email@hidden> wrote:
>>>>
>>>>> Hi all.
>>>>> I'm using some Objective-C++ modules in my iOS project. In them I convert from NSString to C++11 std::string. After this conversion I found (correctly) some 2-byte characters in my std::string.
>>>>> My question is: how can I count the number of characters, rather than the number of bytes, in my std::string?
>>>>>
>>>>
>>>> Don't use std::string to store Unicode strings. It is not designed to support such content.
>>>>
>>>> You can use std::wstring instead.
>>>
>>>
>>> Actually, std::string works *just fine* for UTF-8 strings.
>>>
>>> It's just that, in Unicode, 1 character doesn't necessarily fit in 1 byte. Also, you can't easily do truncation of strings (you might be truncating the string in the middle of a multi-byte sequence -- which is true in pretty much every encoding except UCS-4).
>>>
>>> UTF-8 is relatively easy to work with, however. You look at the previous byte in the string to see if your current character is part of a multi-byte sequence or not -- and keep going back until you find one that doesn't have the high-bit set, and that's the last character of the previous sequence. Of course, that "go back" doesn't mean anything if you're already at the first byte in your string...
>>>
>>> --
>>> Glenn L. Austin, Computer Wizard and Race Car Driver <><
>>> <http://www.austin-soft.com>
>>>
>>>
>>
>>
>> _______________________________________________
>>
>> Cocoa-dev mailing list (email@hidden)
>>
>> Please do not post admin requests or moderator comments to the list.
>> Contact the moderators at cocoa-dev-admins(at)lists.apple.com
>>
>> Help/Unsubscribe/Update your Subscription:
>>
>> This email sent to email@hidden
>>