Re: Cocoa-dev Digest, Vol 5, Issue 1470
- Subject: Re: Cocoa-dev Digest, Vol 5, Issue 1470
- From: John Joyce <email@hidden>
- Date: Sat, 16 Aug 2008 20:22:08 -0500
On Fri, Aug 15, 2008 at 10:53 PM, John Joyce
<email@hidden> wrote:
Right now, I'm toying with using Flex/Lex in a Cocoa project.
Unfortunately, I don't see a reliable or easy way to handle NSStrings
correctly all the time with Flex.
Does anybody have any suggestions for such text handling and reliable
unicode aware regexes?
I'm seriously not interested in implementing such details in C with
Flex.
Flex is fast and cool for that, but if it's going to be stupidly
difficult to use reliably with other languages on a Mac, it's not a
good idea for me.
Depending on exactly what you need, unicode awareness can be fairly
straightforward.
Commonly, unicode in regexes is only needed to pass through
undifferentiated blobs of text, with ASCII delimiters. For example,
imagine parsing a CSV file which potentially has unicode text inside
the quotes. For this case, you can convert the file to UTF-8, and then
constructs like . will accept the non-ASCII text. Every non-ASCII
character in UTF-8 is encoded entirely with bytes in the range 128-255,
so if you just pass those bytes through then you'll be fine. But be
aware of some potential problem areas:
- Each non-ASCII character will be more than one byte, and flex will
think of it as more than one character. Write your regexes
accordingly. In particular, avoid length limits on runs of arbitrary
characters, and avoid using non-ASCII characters directly in your
regex.
- It's very difficult to split UTF-8 strings correctly. If you
encounter a run of non-ASCII characters, ensure that you follow that
run through the end, until you get back to ASCII. Don't have a regex
that stops in the middle of it and then expects your code to be able
to do something useful with it.
- If you need to do something with non-ASCII characters besides read
them in one side and write them out the other, for example doing
something special with all accented characters, then Flex is probably
not the right answer.
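
As a rough illustration of the points above (this is a sketch, not code from the thread), the following C helpers show why a byte-oriented scanner sees more "characters" than a reader does, and how to follow a run of non-ASCII bytes until the input gets back to ASCII. The function names utf8_count and non_ascii_run are made up for this example, and both assume the input is valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a NUL-terminated string.
 * Assumes valid UTF-8: every byte that is not a continuation
 * byte (10xxxxxx) starts a new code point. */
static size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Length in bytes of the run of non-ASCII bytes starting at s.
 * The run ends at the first byte below 0x80, which is either an
 * ASCII character or the terminating NUL. */
static size_t non_ascii_run(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while ((unsigned char)s[n] >= 0x80)
        n++;
    return n;
}
```

For a three-character Japanese word such as 日本語, strlen reports 9 bytes while utf8_count reports 3 code points, which is the mismatch to keep in mind when writing length limits into a Flex pattern.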
Besides this it ought to be pretty straightforward. Since Flex just
passes your code straight through to the compiler, you can write
Objective-C in the actions (as long as you compile the result as
Objective-C, of course!), convert the text from UTF-8 back to an
NSString, and take things from there.
Mike
Thanks to all.
Mike, your answer especially made it clear that Flex is not going to
do what I would like, at least not without a lot of work that might
be silly.
Certainly, I could extract the runs of text that fall within the ASCII
range, apply rules to them, and lump everything else into a separate
group, but I'd like to have more control.
If I were willing to just do ASCII, it would be a wonderful thing,
since Flex is so fast.
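
For reference, the "ASCII runs versus everything else" split described above might look roughly like this in plain C. This is only a sketch: the run_kind enum and next_run function are hypothetical names, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { RUN_END, RUN_ASCII, RUN_NON_ASCII } run_kind;

/* Classify the run that starts at s and store its byte length in *len.
 * A run is a maximal stretch of bytes that are either all ASCII
 * (below 0x80) or all non-ASCII (0x80 and above). Because UTF-8 never
 * uses bytes below 0x80 inside a multi-byte character, a run boundary
 * never falls in the middle of a character. */
static run_kind next_run(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    *len = 0;
    if (s[0] == '\0')
        return RUN_END;

    int high = ((unsigned char)s[0] >= 0x80);
    size_t n = 0;
    while (s[n] != '\0' && (((unsigned char)s[n] >= 0x80) == high))
        n++;

    *len = n;
    return high ? RUN_NON_ASCII : RUN_ASCII;
}
```

A caller would loop, advancing by *len each time, applying its own rules to the RUN_ASCII pieces and passing the RUN_NON_ASCII pieces through untouched.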
I understand well enough how difficult it can be to establish rules
for unicode strings when there are a lot of semantics possible
depending on the language.
In my case, I am mainly interested in working with Japanese text,
which is difficult mostly because there is generally no white space to
rely on; beyond that, the main complication is the large character set.
I want to be able to use Japanese strings as tokens throughout
something.
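
One possible hand-rolled alternative for fixed Japanese tokens, sketched here as an assumption rather than anything proposed in this thread, is to compare UTF-8 bytes directly and take the longest token that prefixes the input. The function name match_token and the token table are hypothetical. It assumes both the input and the tokens are valid UTF-8 and that matching starts on a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the index of the longest token that is a byte-wise prefix of
 * input, or -1 if none matches. The matched byte length is stored in
 * *matched_len when that pointer is not NULL.
 * No decoding is needed: when input starts on a character boundary,
 * a byte-wise match of a valid UTF-8 token covers whole characters. */
static int match_token(const char *input,
                       const char *const tokens[], size_t ntokens,
                       size_t *matched_len)
{
    assert(input != NULL && tokens != NULL);
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < ntokens; i++) {
        size_t len = strlen(tokens[i]);
        /* strncmp stops at the NUL in input, so short input is safe. */
        if (len > best_len && strncmp(input, tokens[i], len) == 0) {
            best = (int)i;
            best_len = len;
        }
    }

    if (matched_len != NULL)
        *matched_len = best_len;
    return best;
}
```

With a table holding も (3 bytes) and もし (6 bytes), input beginning with もし selects the longer entry. This only covers literal tokens; anything that depends on character classes would still need real Unicode support.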
I seriously wonder how sophisticated syntax highlighting in Xcode
really is... it does handle things quite well!
As cool and wonderful as Flex is, it just isn't going to be reliable
for what I want to do.
_______________________________________________
Cocoa-dev mailing list (email@hidden)