Re: Unicode search [was Re: the Holy Grail of AppleScript lists]

Subject: Re: Unicode search [was Re: the Holy Grail of AppleScript lists]
From: has <email@hidden>
Date: Fri, 21 Mar 2003 01:23:09 +0000

Paul Berkowitz wrote:

> I recall your saying recently that the great speed advantage of searching a
> long string to see if it contains a search string, over the very slow
> process of searching a list of strings to see if the list contains the same
> search string, is not matched by Unicode text. Does anyone know if this
> slower processing of Unicode text searches using 'contains' has anything to
> do with the lack of a limit on the stack, or is totally unrelated (and
> therefore fixable)?

Basically what Helmut wrote. (Except the maximum's 4 bytes per character, not 6.) Unicode's performance characteristics are a feature of its design. Just getting a simple substring is O(n) in raw Unicode text, compared to O(1) with an old-timey string. I hate to think how much code it'd need slathered on top to get any better performance. Personally, I believe it to be a cunning plot by Intel to sell the next generation of Pentium processors with the 333THz core and built-in thermonuclear power station to run it, but I'm just paranoid that way.
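
To make the O(n) versus O(1) point concrete, here is a minimal Python sketch. It is not how AppleScript stores text internally, and the function names are made up for illustration. It assumes UTF-8 as the variable-width encoding (1 to 4 bytes per character) and compares it with a fixed-width layout, where the byte offset is simply index times width.

```python
def utf8_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def nth_char_fixed(data: bytes, n: int, width: int = 1) -> bytes:
    """Fixed-width text: jump straight to character n, O(1)."""
    start = n * width
    return data[start:start + width]


def nth_char_utf8(data: bytes, n: int) -> bytes:
    """Variable-width text: walk from the start to character n, O(n)."""
    pos = 0
    for _ in range(n):
        pos += utf8_len(data[pos])  # cannot skip ahead without looking
    return data[pos:pos + utf8_len(data[pos])]


if __name__ == "__main__":
    text = "na\u00efve caf\u00e9"
    data = text.encode("utf-8")
    print(nth_char_utf8(data, 3).decode("utf-8"))  # walks past the 2-byte i-diaeresis
    print(nth_char_fixed(b"hello", 1))             # direct index, no walking
```

The sketch assumes well-formed input. Anything built on top of the walking version (substring, offset, 'contains' with positions) inherits the same crawl unless an index is cached alongside the text.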

John Delacour wrote:

> Every character in Unicode proper
> consists of two bytes (or 4 in the case of UTF-32) and the length is
> not variable.

Not quite: in UTF-16, characters may consist of either one or two 2-byte blocks (cf. recent discussions of separation between characters and accents, for example), so again length is not fixed and you've got to crawl across it every time you want to find something. Only UTF-32 is fixed length (and I don't imagine we'll see it in everyday use for a while yet). Out of interest, John, do Perl regexes understand Unicode, or are they strictly old-school one-byte-one-character? (If they do, what's their performance like?)
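
A quick way to see the variable length for yourself is to count code units with Python's built-in codecs. This is only an illustrative sketch: the helper names are invented here, and the big-endian codec variants are used so that no byte-order mark inflates the counts.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


def utf32_units(text: str) -> int:
    """Number of 32-bit code units needed to store text as UTF-32."""
    return len(text.encode("utf-32-be")) // 4


if __name__ == "__main__":
    samples = [
        ("plain A", "A"),
        ("precomposed e-acute", "\u00e9"),
        ("e + combining acute", "e\u0301"),
        ("musical G clef", "\U0001D11E"),
    ]
    for label, s in samples:
        print(f"{label}: UTF-16 units = {utf16_units(s)}, UTF-32 units = {utf32_units(s)}")
```

The G clef (U+1D11E) lies outside the 16-bit range, so UTF-16 needs two units for it while UTF-32 needs one. The "e plus combining accent" case takes two units in both encodings because it is two separate code points, which is a different effect from the two-unit pairs: UTF-32 is fixed length per code point, not per character as it appears on screen.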

Here's the background spiel for them that want to knock themselves out:

http://www.unicode.org/standard/principles.html

The bit on Encoding Forms is the kicker vis-a-vis performance.

And the JoS "Shlemiel the painter's algorithm" link I like to trot out for occasions such as these:

http://www.joelonsoftware.com/articles/fog0000000319.html

HTH

has
--
http://www.barple.pwp.blueyonder.co.uk -- The Little Page of AppleScripts
