Re: Xcode Editor's Regex now uses PCRE instead of ICU?
- Subject: Re: Xcode Editor's Regex now uses PCRE instead of ICU?
- From: Alastair Houghton <email@hidden>
- Date: Thu, 13 Mar 2008 00:40:57 +0000
On 12 Mar 2008, at 23:01, John Engelhart wrote:
>> Also, ICU's regex engine is implemented natively for UTF-16.
>> PCRE's interface is UTF-8. If, as is commonly the case for Cocoa
>> apps, strings are stored internally in UTF-16, you would have to
>> convert to use PCRE whereas ICU's engine can handle the native
>> representation.
> I've actually found this to not be true in practice.
(I assume you're talking about UTF-16 being more common than UTF-8,
which isn't quite what I said... and the strings in an NSTextStorage,
which is I imagine what Xcode uses to hold its text, are probably
UTF-16 anyway, though I don't think the code for that particular class
is public.)
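
To make the conversion cost concrete, here is a minimal Python sketch of the round trip a UTF-16 application would need before handing text to a byte-oriented UTF-8 matcher. Python's re module on bytes is only a stand-in for a UTF-8 interface (it is not PCRE), and utf8_offset_to_utf16 is a hypothetical helper name. The point is just that the text has to be transcoded, and that match offsets come back in bytes and must be mapped back to UTF-16 code units.

```python
import re


def utf8_offset_to_utf16(utf8_bytes: bytes, byte_offset: int) -> int:
    """Translate a byte offset in UTF-8 data into a UTF-16 code-unit offset.

    Hypothetical helper: it re-encodes the prefix, which is simple but O(n).
    """
    prefix = utf8_bytes[:byte_offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


# The text as the application is assumed to hold it: a UTF-16 buffer.
utf16_buffer = "na\u00efve \U0001F600 regex".encode("utf-16-le")

# Step 1: transcode to UTF-8 for the byte-oriented matcher.
utf8_buffer = utf16_buffer.decode("utf-16-le").encode("utf-8")

# Step 2: match on the UTF-8 bytes; the offset is measured in bytes.
match = re.search(b"regex", utf8_buffer)
byte_start = match.start()

# Step 3: map the byte offset back to UTF-16 code units.
unit_start = utf8_offset_to_utf16(utf8_buffer, byte_start)

print("byte offset in UTF-8 buffer:", byte_start)
print("code-unit offset in UTF-16 buffer:", unit_start)
```

The two offsets differ because the accented letter and the emoji take a different number of units in each encoding, so the mapping step cannot be skipped for non-ASCII text.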
> My observation is that CFString / NSString tries to avoid
> converting the buffer it was initialized with, if possible.
Yes, that's true. You can see the sources for CFString in the Darwin
source tree. Furthermore, string constants (even @"" and CFSTR("")
ones) are encoded in ASCII by the compiler, which makes 8-bit strings
quite common in practice.
> Since CFString / NSString don't keep their internal buffers in a
> fixed format (i.e., UTF-16), the question becomes one of "What
> is the most common internal format?" The answer to this is
> extremely usage sensitive.
Indeed.
> Unicode strings encoded in UTF-8 take a variable number of bytes to
> encode, anywhere from 1 to 6 bytes,
Just a pedantic correction, but it's 1 to *4* bytes, not 1 to 6. The
original definition allowed sequences of up to 6 bytes, but there are
no code points defined above U+10FFFF (and IIRC there are officially
never going to be), so nothing needs more than 4.
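
The 1-to-4 range is easy to check. The sketch below is only an illustration using Python's built-in codec; utf8_length is a hypothetical helper name and the sample code points are arbitrary picks from each length band.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a single code point."""
    return len(chr(code_point).encode("utf-8"))


# One sample from each length band, ending at the last code point.
samples = {
    "U+0041 LATIN CAPITAL LETTER A": 0x0041,
    "U+00E9 LATIN SMALL LETTER E WITH ACUTE": 0x00E9,
    "U+20AC EURO SIGN": 0x20AC,
    "U+10FFFF (last code point)": 0x10FFFF,
}

for label, code_point in samples.items():
    print(f"{label}: {utf8_length(code_point)} byte(s)")
```

Anything above U+10FFFF is rejected outright: chr(0x110000) raises ValueError in Python.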
> UTF-16 requires either 2 or 4 bytes to encode each character, and
> finally UTF-32 always requires exactly 4 bytes to encode a
> character. Encoding ASCII strings in anything other than UTF-8
> automatically doubles or quadruples the size required to store the
> string.
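But encoding many languages requires characters outside of ASCII;
far-eastern languages expand quite badly in UTF-8 (typically three
bytes per character against two in UTF-16), and even for
Cyrillic-script eastern European languages UTF-8 gains little over
UTF-16 AFAIK, since each letter takes two bytes in either.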
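
How the sizes work out for a given string is simple to measure. This is a rough sketch only: encoded_sizes is a hypothetical helper, the three sample strings are arbitrary, and the little-endian codecs are used so that no byte-order mark inflates the counts. Real text will mix scripts, spaces and punctuation differently.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte count of text in each encoding (no byte-order mark)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


# Arbitrary sample strings, one per script.
samples = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Japanese": "こんにちは世界",
}

for language, text in samples.items():
    print(language, encoded_sizes(text))
```

For these samples the ASCII string doubles from 12 to 24 bytes in UTF-16, the Japanese one shrinks from 21 bytes in UTF-8 to 14 in UTF-16, and the Russian one is 20 bytes in UTF-8 against 22 in UTF-16.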
> Without getting into compression (even the Unicode standard
> compression), the optimal encoding for the fewest bytes
> used is heavily string dependent.
UTF-16 is, for most purposes, a happy medium. It is only twice the
size for plain ASCII, but it is easier to deal with than UTF-8 (in
spite of having surrogate pairs) and for many languages it is
smaller. UTF-32 is always pointless IMO; it is never smaller than
either of the others, and because of combining characters it
isn't any simpler to handle than UTF-16.
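
The combining-character point can be shown in a few lines. In this sketch, code_units is a hypothetical helper that counts fixed-width units; the example strings are my own choice. It illustrates that a surrogate pair is the only extra case UTF-16 adds, while "one unit per visible character" fails in UTF-32 as well.

```python
import unicodedata

# Width in bytes of one code unit for each fixed-unit encoding.
UNIT_WIDTH = {"utf-16-le": 2, "utf-32-le": 4}


def code_units(text: str, encoding: str) -> int:
    """Count the code units text occupies in the given encoding."""
    return len(text.encode(encoding)) // UNIT_WIDTH[encoding]


decomposed = "e\u0301"                               # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9
emoji = "\U0001F600"                                 # outside the BMP

for label, text in [("decomposed e-acute", decomposed),
                    ("composed e-acute", composed),
                    ("emoji", emoji)]:
    print(label,
          "UTF-16 units:", code_units(text, "utf-16-le"),
          "UTF-32 units:", code_units(text, "utf-32-le"))
```

The decomposed accent takes two units in both encodings even though it displays as one character, so code that walks text by user-visible characters needs the same care either way.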
> PS- CFStringGetCStringPtr() is a public API function.
Yes, I'm aware of that, thanks :-) And it certainly makes sense to
have it.
> Well, one can dream, I suppose. The user ended up "fixing" this by
> building the latest version of ICU under 10.4 and linking against
> that. I don't know why 10.4 behaves like this. The latest ICU
> version is 3.8, 10.5 uses 3.6, and 10.4 uses 3.2. I suspect,
> however, it's because 10.4 omits the Thai word breaking dictionary
> and thus falls back to ordinary \b behavior. A (very) rough
> estimate is that the breaking dictionaries are about 450K for
> everything, with the Thai word breaking dictionary weighing in at
> ~240K alone!
That I can well believe. IIRC Thai doesn't have spaces between words,
so it's *really* hard to find word breaks and it actually has to use
dictionary matching of words to do it.
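
As a toy illustration of dictionary matching, here is a greedy longest-match segmenter in Python. It is emphatically not ICU's algorithm; the four-word dictionary, the sample sentence and the function name segment are all assumptions made for the example. A real word breaker needs a large dictionary and a smarter strategy for ambiguous or unknown words.

```python
def segment(text: str, dictionary: set) -> list:
    """Split text into words by greedy longest match against dictionary.

    Characters that start no known word are emitted one at a time.
    """
    max_len = max(len(word) for word in dictionary)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in dictionary:
                break
        else:
            candidate = text[pos]  # unknown character: emit it on its own
        words.append(candidate)
        pos += len(candidate)
    return words


# Tiny hypothetical dictionary: "I", "love", "language", "Thai".
thai_words = {"ฉัน", "รัก", "ภาษา", "ไทย"}
sentence = "ฉันรักภาษาไทย"  # "I love the Thai language", written without spaces

print(sentence.split())               # whitespace splitting finds a single token
print(segment(sentence, thai_words))  # dictionary matching finds four words
```

This also shows why the dictionary is so large: every word the breaker should recognise has to be in it.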
The entire issue of UTF-8 versus UTF-16 makes Oniguruma, the other
regexp library that is in common use on OS X, quite an interesting
choice because it provides both UTF-8 and UTF-16 APIs.
Incidentally, I get the impression that the intention with ICU is to
move towards an encoding-independent interface for the regexp matcher
also, though you'd have to ask the ICU people when that was likely to
happen.
Kind regards,
Alastair.
--
http://alastairs-place.net
_______________________________________________
Xcode-users mailing list (email@hidden)