Re: Xcode Editor's Regex now uses PCRE instead of ICU?

Subject: Re: Xcode Editor's Regex now uses PCRE instead of ICU?
From: John Engelhart <email@hidden>
Date: Wed, 12 Mar 2008 19:01:47 -0400

On Mar 11, 2008, at 9:25 AM, Alastair Houghton wrote:

On 11 Mar 2008, at 04:19, Stuart Malin wrote:
This question of what Xcode uses aside, I am curious though: I seem to be sensing the reason for the use of ICU is its support of Unicode. But doesn't PCRE support Unicode?
Not to the same extent that ICU does. ICU is the canonical implementation of the Unicode spec.

Also, ICU's regex engine is implemented natively for UTF-16. PCRE's interface is UTF-8. If, as is commonly the case for Cocoa apps, strings are stored internally in UTF-16, you would have to convert to use PCRE whereas ICU's engine can handle the native representation.

I've actually found this not to be true in practice. My observation is that CFString / NSString tries to avoid converting the buffer it was initialized with if possible. From an API perspective, however, strings "appear" to be UTF-16 encoded. Naturally, these are internal implementation details, so the usual caveats apply here.

Since CFString / NSString don't keep their internal buffers in a fixed format (i.e., UTF-16), the question becomes one of "What is the most common internal format?" The answer to this is extremely usage sensitive. Unicode strings encoded in UTF-8 take a variable number of bytes per character, anywhere from 1 to 4 bytes (the original UTF-8 design allowed up to 6, but RFC 3629 restricts it to 4), UTF-16 requires either 2 or 4 bytes to encode each character, and finally UTF-32 always requires exactly 4 bytes to encode a character. Encoding ASCII strings in UTF-16 or UTF-32 instead of UTF-8 automatically doubles or quadruples the size required to store the string. Without getting into compression (even the Unicode standard compression), the optimal encoding for the fewest bytes used is heavily string dependent.
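To put rough numbers on the size argument, here is a minimal sketch in plain C that adds up the bytes a sequence of code points needs in each encoding. It assumes modern UTF-8 as defined by RFC 3629 (at most 4 bytes per character) and valid Unicode scalar values as input. The names encoded_sizes and measure_encodings are made up for this illustration and are not part of CoreFoundation, PCRE, or ICU.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8 (RFC 3629: 1 to 4). */
size_t utf8_size(uint32_t cp) {
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;   /* rest of the BMP, e.g. Thai */
    return 4;                     /* supplementary planes */
}

/* Bytes needed in UTF-16: one 16-bit unit, or a surrogate pair. */
size_t utf16_size(uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

/* Hypothetical helper type: total bytes per encoding for a whole string. */
typedef struct {
    size_t utf8;
    size_t utf16;
    size_t utf32;
} encoded_sizes;

/* Sum the per-character sizes for n code points (assumed valid). */
encoded_sizes measure_encodings(const uint32_t *cps, size_t n) {
    encoded_sizes total = {0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        total.utf8  += utf8_size(cps[i]);
        total.utf16 += utf16_size(cps[i]);
        total.utf32 += 4;         /* UTF-32 is always 4 bytes */
    }
    return total;
}
```

For a four character ASCII string this gives 4 / 8 / 16 bytes, while three Thai characters come out at 9 / 6 / 12, which is the "heavily string dependent" part: UTF-8 wins for ASCII, UTF-16 wins for most non-Latin BMP text.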

My observations on how CFString / NSString keep their internal buffers (which, it goes without saying, is an internal detail, not to be depended on) are roughly:

If the string is ASCII, or otherwise "8-bit simple / optimal", down convert the buffer to UTF-8 (which ASCII is a subset of) if the initialization buffer is not already in a UTF-8 compatible format.

Otherwise, convert the string to native endian UTF-16 if the initialization buffer is not already.

Owing to both its English / ASCII development and Unix roots, an awful lot of strings fall into the first category, which is also a win in terms of the space required for the buffer.

RegexKit has various DTrace probes embedded in it. One of them is the "PerformanceNote" probe, which fires when some non-optimal condition is detected. Since RegexKit uses PCRE, and thus requires UTF-8 encoded buffers, the reason I put this probe in RegexKit in the first place was so that I could tell when just such "encoding mismatch" issues happen. When matching a string, RegexKit checks the string's "Fastest Encoding" to see if it's UTF-8 compatible, and uses CFStringGetCStringPtr() to try to get direct access to the string buffer if at all possible. If it is unable to get direct access to a UTF-8 compatible buffer, it obviously has to go through the expensive process of converting that string into UTF-8, and will fire off a DTrace PerformanceNote probe.

Since DTrace allows you to trivially snoop in on any process in the system at any time, and Safari AdBlock happens to use RegexKit, we can get an idea of how often the URLs that Safari AdBlock is checking, which (presumably) it's getting straight from Safari, need to be converted:

shell% sudo dtrace -Z -q -n 'RegexKit*:::PerformanceNote { this->description = arg6 == 0 ? "" : copyinstr(arg6); printf("Note: %s\n", this->description);}'
... [time, and web surfing passes] ...
^C
shell%

Nada, not a single URL Safari AdBlock handed to RegexKit had to be converted to UTF-8. That would mean that in this particular usage of regular expression engines, using ICU would require constant up-conversion to UTF-16 for every single URL that passed through Safari (often hundreds per page). Just to make sure things were 'working', I kicked off the unit tests:

shell% sudo dtrace -Z -q -n 'RegexKit*:::PerformanceNote { this->description = arg6 == 0 ? "" : copyinstr(arg6); printf("Note: %s\n", this->description);}'
...
Note: UTF16 to UTF8 requires slow conversion.
Note: UTF8 to UTF16 requires slow conversion.
Note: NSString encoding requires expensive UTF8 conversion.
Note: NSString encoding requires expensive UTF8 conversion.
Note: pcre_study() was able to optimize the regular expression.
Note: Slow conversion via sscanf.
...

Note the "NSString encoding requires expensive UTF8 conversion." message, indicating that the buffer for the source string was not in a UTF-8 friendly encoding, and thus required a full conversion from the source encoding into UTF-8. The UTF8 to UTF16 / UTF16 to UTF8 messages are a related issue: All the offsets returned by PCRE are "UTF-8 Encoded" (byte offsets), and must be converted to their "UTF-16 Encoded" equivalents. Happily, ASCII byte offsets are exactly equal to their UTF-16 character offsets, and thus don't require a conversion between encodings. Conversion is required due to the "from a user of the API's perspective, everything looks like a UTF-16 encoded string" requirement, so that things like NSRange values from RegexKit are usable with other Foundation / NSString methods.
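As a rough illustration of the offset conversion described above (this is not RegexKit's actual code, just a sketch of the idea), a UTF-8 byte offset can be mapped to a UTF-16 offset by walking the bytes and counting how many 16-bit units each character would occupy. The function name utf8_offset_to_utf16 is made up for this example, and it assumes the input is valid UTF-8 and that the offset falls on a character boundary.

```c
#include <stddef.h>

/*
 * Convert a byte offset into UTF-8 text to the equivalent offset in
 * UTF-16 code units. Assumes valid UTF-8 and an offset that lands on
 * a character boundary (as regex match offsets normally do).
 */
size_t utf8_offset_to_utf16(const unsigned char *utf8, size_t byte_offset) {
    size_t units = 0;
    for (size_t i = 0; i < byte_offset; i++) {
        unsigned char b = utf8[i];
        if ((b & 0xC0) == 0x80) {
            continue;   /* continuation byte: its character was already counted */
        }
        /* A 4-byte sequence (lead byte 0xF0 and up) is a code point
           above U+FFFF, which takes a surrogate pair in UTF-16. */
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For pure ASCII the loop never hits a continuation byte, so the result equals the input, which is why the ASCII case can skip the conversion entirely. The walk is linear in the offset, which gives some idea of why the probe calls this a "slow conversion".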

Again, because these results are peeking into the internal workings of objects, the results could be completely different for someone else. For example, CoreFoundation / Foundation would be free to "convert the internal buffer representation to the user's locale", which might just always force the conversion of ASCII strings to UTF-16. So take the above with as large a grain of salt as you feel is appropriate. :)

PS- CFStringGetCStringPtr() is a public API function. It's a pragmatic tradeoff between internal details and reality: strings are used a lot, and it saves a substantial amount of overhead if you don't have to constantly create temporary buffers for every little string operation. The docs are pretty clear that if it works, great, but have a (slower) backup plan for when it returns NULL:

This function either returns the requested pointer immediately, with no memory allocations and no copying, in constant time, or returns NULL. If the latter is the result, call an alternative function such as the CFStringGetCString function to extract the characters.

Another difference IIRC is that the set support is very much more sophisticated in ICU. In Perl and PCRE, there is basic support for character sets (the square-bracket syntax), but AFAIK there are no set operations (besides inversion), and I think the set of Unicode properties you can query is somewhat smaller than for ICU. ICU also supports string values as members of character sets, presumably so that you can use combining marks and the like in a set.

Emacs! vi!

PCRE doesn't have ICU character class set operations, but the one thing that PCRE lacks that ICU shines in is "enhanced \b break detection". There was a recent discussion on cocoa-dev regarding this topic, and almost simultaneously I got a request from a user for help using RegexKit to perform word breaking with \b on Thai strings. In PCRE, \w \d \s and \b (and friends) are only "ASCII aware", but in ICU are "Unicode aware" (surprise).

There is an option when compiling a regex in ICU, UREGEX_UWORD, which turns on "enhanced" \b behavior. As I've come to learn (speaking only English myself), this is one of those things that if you need it, you NEED it, and there is no simple workaround. The regular ICU \b behavior is "like" PCRE's, or can be reasonably simulated in PCRE with \p{} and assertions, but the enhanced \b brings the specialized, dictionary-driven ICU word breaker to bear on finding word breaks:

The regex "(\w+?)\b" with UREGEX_UWORD turned on:

[johne@LAPTOP_10_5] icu% ./icu_matcher
2008-03-07 20:10:34.774 icu_matcher[43421:807] subject: 'ฉัน กินข้าว'
2008-03-07 20:10:34.865 icu_matcher[43421:807] matched: 'ฉัน'
2008-03-07 20:10:34.869 icu_matcher[43421:807] range : '{0, 3}'
2008-03-07 20:10:34.873 icu_matcher[43421:807] matched: 'กิน'
2008-03-07 20:10:34.877 icu_matcher[43421:807] range : '{3, 3}'
2008-03-07 20:10:34.881 icu_matcher[43421:807] matched: 'ข้าว'
2008-03-07 20:10:34.884 icu_matcher[43421:807] range : '{6, 4}'

And turned off:

[johne@LAPTOP_10_5] icu% ./icu_matcher
2008-03-12 18:49:50.788 icu_matcher[88171:807] subject: 'ฉัน กินข้าว'
2008-03-12 18:49:50.882 icu_matcher[88171:807] matched: 'ฉัน กินข้าว'
2008-03-12 18:49:50.888 icu_matcher[88171:807] range : '{0, 10}'

UREGEX_UWORD turned on (sic) under 10.4:

[johne@LAPTOP_X86] /tmp% ./icu_matcher
2008-03-08 16:18:03.932 icu_matcher[11810] subject: 'ฉันกิน ข้าว'
2008-03-08 16:18:03.954 icu_matcher[11810] matched: 'ฉันกิน ข้าว'
2008-03-08 16:18:03.954 icu_matcher[11810] range : '{0, 10}'

Well, one can dream, I suppose. The user ended up "fixing" this by building the latest version of ICU under 10.4 and linking against that. I don't know for certain why 10.4 behaves like this. The latest ICU version is 3.8, 10.5 uses 3.6, and 10.4 uses 3.2. I suspect, however, it's because 10.4 omits the Thai word breaking dictionary and thus falls back to ordinary \b behavior. A (very) rough estimate is that the breaking dictionaries are about 450K for everything, with the Thai word breaking dictionary weighing in at ~240K alone!


Kind regards,

Alastair.

--
http://alastairs-place.net


_______________________________________________
Do not post admin requests to the list. They will be ignored.
Xcode-users mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden







