Mailing Lists: Apple Mailing Lists


Re: Regex weirdness



"Todd O'Bryan" <email@hidden> wrote:

>(?<!<b/?|<br/?)> should match all >'s not preceded by <b, <b/, <br, or
><br/.

OK. That's what I thought.


>(?<!<br/?|<b/?)> should do the same thing with just the disjunction
>stated in a different order.

But I believe the order has a direct bearing on your problem.


>To clarify for those who don't live in regexes, (?<!pattern1)pattern2
>will match pattern2 when it is not preceded by pattern1. So, in my
>case, pattern1 is either <b/?|<br/? or <br/?|<b/? which should function
>identically. Right?
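
For readers following along, the general (?<!pattern1)pattern2 form can be tried out with a deliberately simpler pattern1. This is only a minimal sketch: the class name, the method name, and the "&gt;" replacement text are assumptions made for illustration, not something taken from the original code.

```java
import java.util.regex.Pattern;

class LookbehindSketch {
    // pattern2 is ">" and pattern1 is the literal "<b" (a simplified stand-in
    // for the alternation discussed in the thread).
    static final Pattern GT_NOT_AFTER_B = Pattern.compile("(?<!<b)>");

    // Replaces every ">" that is not immediately preceded by "<b".
    // The "&gt;" replacement is a hypothetical choice for this sketch.
    static String escape(String s) {
        return GT_NOT_AFTER_B.matcher(s).replaceAll("&gt;");
    }

    public static void main(String[] args) {
        // The ">" closing "<b" is left alone; the free-standing ">" is replaced.
        System.out.println(escape("<b>1 > 0")); // prints "<b>1 &gt; 0"
    }
}
```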

As I read this expression:
<b/?

it says:
"the chars '<', 'b' exactly, followed by zero or one occurrences of '/'"

It's the "zero occurrences of '/'" that poses the problem. Specifically,
that pattern will see a match in the string "<br/>", and the matching
substring is "<b". Because that match occurs first in your X|Y
alternatives, that's the expression that defines the extent of the result:
"<b". That matching result is followed by "r>", not ">", so it meets the
criteria for replacement, given everything stated in the regex.
Unfortunately, what's stated in the regex is not what you want.

You have to match the LONGER expression first, otherwise the shorter one
will dominate. The crucial point is that the shorter pattern happens to
match an initial substring of the longer one. It's not JUST the matching
that's important here, it's the side-effect of having the extent of that
match determine where a subsequent pattern (the following ">") begins.

That's why I think there's not a bug in Java's regex, and why the two
regexes are NOT identical.

Also see:
<http://www.instantiations.com/assist/docs/regex_syntax.htm>

Find the second occurrence of "|", where the operator description says:
... REs separated by "|" are tried from left to right, and the first one that
allows the complete pattern to match is considered the accepted branch.
This means that if A matches, B will never be tested, even if it would
produce a longer overall match. ...
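
The left-to-right rule quoted above is easy to observe on the alternation by itself, outside any lookbehind. The following is a small sketch; the class and helper names are made up for illustration. It only shows which branch a plain find() accepts, and says nothing about how a lookbehind is evaluated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<br/>";
        // Shorter alternative listed first: it matches, so the longer one is never tried.
        System.out.println(firstMatch("<b/?|<br/?", input)); // prints "<b"
        // Longer alternative listed first: it is tried first and wins.
        System.out.println(firstMatch("<br/?|<b/?", input)); // prints "<br/"
    }
}
```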


As an alternative, have you considered:
<br/|<br|<b/|<b

Since you're using regexes directly on Strings, speed doesn't seem to be a
major consideration for you, so even if this is less efficient than the
patterns you're using now, it shouldn't matter.
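
A rough sketch of how the spelled-out alternation might be dropped into the lookbehind follows. The class name, the method name, and the "&gt;" replacement are assumptions for illustration; the real replacement text isn't shown in this thread.

```java
import java.util.regex.Pattern;

class ExplicitAlternatives {
    // Every alternative is written out in full, longest first, with no "?" quantifier.
    static final Pattern STRAY_GT = Pattern.compile("(?<!<br/|<br|<b/|<b)>");

    // Replaces each ">" that does not close one of the four listed tag openings.
    // "&gt;" is a hypothetical replacement chosen for this sketch.
    static String escapeStrayGt(String s) {
        return STRAY_GT.matcher(s).replaceAll("&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayGt("<br/>")); // prints "<br/>"
        System.out.println(escapeStrayGt("<b>"));   // prints "<b>"
        System.out.println(escapeStrayGt("a > b")); // prints "a &gt; b"
    }
}
```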

BTW, unless you can guarantee the case of your tags, you need to cover that
in your patterns. There may also be whitespace to consider, since I think
e.g. "<b >" is a valid HTML tag, though perhaps unconventionally formatted.
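
For the case issue, one option is to compile the pattern with Pattern.CASE_INSENSITIVE. The sketch below assumes a fully spelled-out alternation and a hypothetical "&gt;" replacement, and uses made-up class and method names. It does not attempt the whitespace variants such as "<b >", which would need extra alternatives.

```java
import java.util.regex.Pattern;

class CaseInsensitiveTags {
    // The flag applies to the whole pattern, including the lookbehind,
    // so "<B", "<BR", "<Br/" and so on are all recognized.
    static final Pattern STRAY_GT =
        Pattern.compile("(?<!<br/|<br|<b/|<b)>", Pattern.CASE_INSENSITIVE);

    // "&gt;" is a hypothetical replacement chosen for this sketch.
    static String escapeStrayGt(String s) {
        return STRAY_GT.matcher(s).replaceAll("&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayGt("<BR/>")); // prints "<BR/>"
        System.out.println(escapeStrayGt("<B>"));   // prints "<B>"
        System.out.println(escapeStrayGt("<I>"));   // prints "<I&gt;"
    }
}
```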

-- GG
_______________________________________________
java-dev mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/java-dev
Do not post admin requests to the list. They will be ignored.





Copyright © 2007 Apple Inc. All rights reserved.