Re: CFStringCreateWithBytes and Endianness
- Subject: Re: CFStringCreateWithBytes and Endianness
- From: Fritz Anderson <email@hidden>
- Date: Thu, 04 Aug 2011 07:49:18 -0500
On 4 Aug 2011, at 6:49 AM, Andreas Grosam wrote:
> I want to create a CFString using function CFStringCreateWithBytes.
>
> CFStringRef CFStringCreateWithBytes (
> CFAllocatorRef alloc,
> const UInt8 *bytes,
> CFIndex numBytes,
> CFStringEncoding encoding,
> Boolean isExternalRepresentation
> );
>
> I suspect, the "encoding" parameter refers to the encoding of the source string.
Yes. The thing to bear in mind is that it is the encoding of the _source_ bytes, a fact about the data you're importing. Facts about the data aren't changeable at runtime, so the encoding argument isn't a free choice when you call CFStringCreateWithBytes: you pass whatever encoding the bytes already have.
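To make that concrete, here is a minimal sketch in portable C (no Core Foundation calls) of how the same two bytes turn into different UTF-16 code units depending on the byte order you declare for them. The name decode_utf16_unit is made up for illustration and is not part of any Apple API.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Core Foundation: combine two bytes
   into one UTF-16 code unit under the byte order the caller declares. */
uint16_t decode_utf16_unit(const uint8_t pair[2], int big_endian)
{
    if (big_endian) {
        /* First byte is the high-order byte. */
        return (uint16_t)((pair[0] << 8) | pair[1]);
    }
    /* First byte is the low-order byte. */
    return (uint16_t)((pair[1] << 8) | pair[0]);
}
```

Given the bytes 0x00 0x41, declaring big-endian yields U+0041 ("A"), while declaring little-endian yields U+4100, a different character altogether.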
> My source buffer containing the string can be encoded in UTF-16LE or UTF-16BE.
> I don't want to have a BOM in the resulting CFString - and the source buffer does not contain it either.
CFString is an opaque type. You don't know how it stores its characters internally, and you shouldn't have to care. It might store endianness as a BOM in a character buffer, or as a flag in an associated data structure, or it might have a preferred internal endianness that you never see from the outside. It may or may not store the characters as UTF-16 (either endianness) at all. These details may vary by architecture, version of Core Foundation, and even from string to string.
"[T]he need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment." — Wikipedia, "Byte order mark," <http://en.wikipedia.org/wiki/Byte_order_mark>
> The documentation does not tell me which source encoding would be the most preferred to initialize the CFString in the most efficient manner. I would guess this is UTF-16LE on Intel machines.
If you mean that you have control over how the bytes in the source data were originally written, little-endian may be a good choice, but it's only a guess, and guesses about the "efficiency" of opaque functions are worthless. If Core Foundation doesn't always use UTF-16 internally, there may be a conversion anyway, and the efficiency of the source is at most a minor consideration.
If I were less lazy, I'd look at the source of CFLite, and know for sure. The best way to know, however, is not to guess. Prepare your source text in both orders, and benchmark CFStringCreateWithBytes each way. That way, you can get the answer that matches your actual use. You may find that byte order makes so little difference in speed that it needn't be a consideration.
Wikipedia says that the Unicode standard says that if there is no BOM, you assume the byte stream is big-endian. So if your first priority is to avoid a BOM, your choice is made for you: Pass kCFStringEncodingUTF16BE. Correctness is a much bigger consideration than the presence of two bytes. One of my slogans is that it's a false economy to get the wrong answer as quickly as possible.
However, assuming big-endian assumes you absolutely trust every writer of your source stream. If you accept a BOM, you'll be able to handle more inputs. Otherwise, try big-endian, and if CFStringCreateWithBytes returns NULL, try again with little-endian. Bear in mind that most byte-swapped UTF-16 is still well-formed, so a non-NULL result doesn't prove the guess was right.
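A rough sketch of the "accept a BOM, otherwise assume big-endian" policy, again in portable C. The type and function names (utf16_order, sniff_utf16_order) are hypothetical, and the sketch assumes the buffer is already known to be some form of UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; these are not Core Foundation types. */
typedef enum {
    UTF16_ORDER_BIG,
    UTF16_ORDER_LITTLE
} utf16_order;

/* Inspect the first two bytes for a BOM (U+FEFF).
   FE FF means big-endian, FF FE means little-endian.
   Without a BOM, fall back to big-endian.
   *bom_len receives the number of leading bytes to skip (0 or 2). */
utf16_order sniff_utf16_order(const uint8_t *bytes, size_t len, size_t *bom_len)
{
    assert(bom_len != NULL);
    *bom_len = 0;
    if (bytes != NULL && len >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            *bom_len = 2;
            return UTF16_ORDER_BIG;
        }
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            *bom_len = 2;
            return UTF16_ORDER_LITTLE;
        }
    }
    /* No BOM found: assume big-endian. */
    return UTF16_ORDER_BIG;
}
```

The caller would then skip bom_len bytes and hand the rest to CFStringCreateWithBytes with kCFStringEncodingUTF16BE or kCFStringEncodingUTF16LE, whichever matches the result.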
> And what happens if I just specify kCFStringEncodingUTF16 ? Is then the source encoding assumed to be in host endianness? Or UTF-16BE as the Unicode Standard suggests?
Possibly CFStringCreateWithBytes tries it both ways, and accepts the way that doesn't error. Maybe, to favor the standard behavior, it tries big-endian first. I haven't looked at the source, and can't tell you for sure. The thing to do is _test_, with the kind of data you'll actually use, and you'll know.
— F
_______________________________________________
Cocoa-dev mailing list (email@hidden)