Re: Xcode Editor's Regex now uses PCRE instead of ICU?

Subject: Re: Xcode Editor's Regex now uses PCRE instead of ICU?
From: Alastair Houghton <email@hidden>
Date: Thu, 13 Mar 2008 00:40:57 +0000

On 12 Mar 2008, at 23:01, John Engelhart wrote:

>> Also, ICU's regex engine is implemented natively for UTF-16. PCRE's interface is UTF-8. If, as is commonly the case for Cocoa apps, strings are stored internally in UTF-16, you would have to convert to use PCRE whereas ICU's engine can handle the native representation.
>
> I've actually found this to not be true in practice.

(I assume you're talking about UTF-16 being more common than UTF-8, which isn't quite what I said... and the strings in an NSTextStorage, which is I imagine what Xcode uses to hold its text, are probably UTF-16 anyway, though I don't think the code for that particular class is public.)
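To illustrate what "having to convert" costs beyond the copy itself: match positions reported against a UTF-8 buffer are byte offsets, and they have to be mapped back before they can be used as ranges in UTF-16 text. The sketch below uses Python's re module on bytes purely as a stand-in for an engine with a UTF-8 interface; the helper name utf8_offset_to_utf16 is made up for this example and is not part of PCRE, ICU or Cocoa.

```python
import re

def utf8_offset_to_utf16(text: str, byte_offset: int) -> int:
    """Map a byte offset in the UTF-8 form of `text` to a UTF-16 code-unit offset.

    The offset must fall on a character boundary, otherwise decoding the
    prefix raises UnicodeDecodeError.
    """
    prefix = text.encode("utf-8")[:byte_offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2

text = "na\u00efve \U0001F600 world"   # contains a 2-byte and a 4-byte UTF-8 sequence
data = text.encode("utf-8")            # the extra conversion step

match = re.search(rb"world", data)     # stand-in for a UTF-8 based matcher
byte_start = match.start()
utf16_start = utf8_offset_to_utf16(text, byte_start)

print(byte_start, utf16_start)         # prints "12 9"
```

The two offsets differ as soon as any non-ASCII character precedes the match, so the mapping cannot be skipped.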

> My observed behavior of CFString / NSString is that it tries to avoid converting the buffer the string was initialized with, if possible.

Yes, that's true. You can see the sources for CFString in the Darwin source tree. Furthermore, string constants (even @"" and CFSTR("") ones) are encoded in ASCII by the compiler, which makes 8-bit strings quite common in practice.

> Since CFString / NSString don't keep their internal buffers in a fixed format (i.e., UTF-16), the question becomes one of "What is the most common internal format?" The answer to this is extremely usage-sensitive.


Indeed.

> Unicode strings encoded in UTF-8 take a variable number of bytes to encode, anywhere from 1 to 6 bytes,

Just a pedantic correction, but it's 1 to *4* bytes, not 1 to 6. The original definition was for 1 to 6 bytes, but there are no code points defined above U+10FFFF (and IIRC there are officially never going to be).
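As a quick sketch of where the 1 to 4 comes from, the byte count depends only on which range the code point falls in. The function name utf8_length is made up for this example; the loop cross-checks it against Python's own encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs to encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point < 0x80:
        return 1          # ASCII
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3          # rest of the Basic Multilingual Plane
    return 4              # U+10000 .. U+10FFFF

# Cross-check a few code points against the built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```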

> UTF-16 requires either 2 or 4 bytes to encode each character, and finally UTF-32 always requires exactly 4 bytes to encode a character. Encoding ASCII strings in anything other than UTF-8 automatically doubles or quadruples the size required to store the string.

But encoding many languages requires characters outside of ASCII; Far Eastern languages expand quite badly in UTF-8, and even eastern European languages do better in UTF-16 AFAIK.
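A rough way to check size claims like these for a particular piece of text is simply to encode it and count bytes. The sample strings below are arbitrary choices for illustration, and real text usually mixes in ASCII spaces, digits and markup, which shifts the balance toward UTF-8.

```python
samples = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Japanese": "こんにちは世界",
}

def encoded_sizes(text: str) -> dict:
    """Return the encoded size of `text` in bytes for each encoding (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

For these samples the Japanese string takes 21 bytes in UTF-8 against 14 in UTF-16, the English one 12 against 24, and the Cyrillic one 20 against 22, since Cyrillic letters take two bytes in either encoding and only the punctuation differs.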

> Without getting into compression (even the Unicode standard compression), the optimal encoding for the fewest bytes used is heavily string-dependent.

UTF-16 is, for most purposes, a happy medium. It is only twice the size for plain ASCII, but it is easier to deal with than UTF-8 (in spite of having surrogate pairs) and for many languages it is smaller. UTF-32 is always pointless IMO; it is guaranteed to be the largest encoding in all cases, and because of combining characters it isn't any simpler to handle than UTF-16.
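A small sketch of the combining-character point: even with one code point per unit, as in UTF-32, a single user-visible character can still span several units, so the code has to cope with multi-unit characters either way. The helper name utf16_units and the two sample strings are made up for this example.

```python
import unicodedata

def utf16_units(text: str) -> int:
    """Number of UTF-16 code units needed for `text`."""
    return len(text.encode("utf-16-le")) // 2

clef = "\U0001D11E"      # MUSICAL SYMBOL G CLEF, outside the BMP
e_acute = "e\u0301"      # 'e' followed by COMBINING ACUTE ACCENT

# A surrogate pair: one code point, two UTF-16 code units.
print(len(clef), utf16_units(clef))        # prints "1 2"

# A combining sequence: one visible character, two code points,
# even when counted in UTF-32 terms (len() counts code points).
print(len(e_acute), utf16_units(e_acute))  # prints "2 2"

# Normalizing to NFC happens to compose this pair into a single code point,
# but many combining sequences have no precomposed form.
composed = unicodedata.normalize("NFC", e_acute)
print(len(composed))                       # prints "1"
```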

> PS- CFStringGetCStringPtr() is a public API function.

Yes, I'm aware of that, thanks :-) And it certainly makes sense to have it.

> Well, one can dream, I suppose. The user ended up "fixing" this by building the latest version of ICU under 10.4 and linking against that. I don't know why 10.4 behaves like this. The latest ICU version is 3.8, 10.5 uses 3.6, and 10.4 uses 3.2. I suspect, however, it's because 10.4 omits the Thai word-breaking dictionary and thus falls back to ordinary \b behavior. A (very) rough estimate is that the breaking dictionaries are about 450K for everything, with the Thai word-breaking dictionary weighing in at ~240K alone!

That I can well believe. IIRC Thai doesn't have spaces between words, so it's *really* hard to find word breaks and it actually has to use dictionary matching of words to do it.
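To give a feel for what dictionary matching means here, below is a toy greedy longest-match segmenter. It is only a sketch of the general idea and not ICU's actual algorithm, which is more elaborate; the function name segment and the tiny English dictionary are invented for the example (English without spaces stands in for Thai).

```python
def segment(text: str, dictionary: set) -> list:
    """Split `text` into words by greedy longest match against `dictionary`.

    Characters that start no known word are emitted as one-character tokens.
    """
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]   # no dictionary word starts here
        words.append(candidate)
        i += len(candidate)
    return words

TOY_DICTIONARY = {"word", "break", "breaking", "is", "hard"}
print(segment("wordbreakingishard", TOY_DICTIONARY))
# prints "['word', 'breaking', 'is', 'hard']"
```

Greedy matching can pick the wrong split when a long word hides two shorter ones, which is part of why real implementations need a sizeable dictionary and smarter selection.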

The entire issue of UTF-8 versus UTF-16 makes Oniguruma, the other regexp library that is in common use on OS X, quite an interesting choice because it provides both UTF-8 and UTF-16 APIs.

Incidentally, I get the impression that the intention with ICU is to move towards an encoding-independent interface for the regexp matcher also, though you'd have to ask the ICU people when that was likely to happen.

Kind regards,

Alastair.

--
http://alastairs-place.net


_______________________________________________
Xcode-users mailing list


