Re: Help with find text command
- Subject: Re: Help with find text command
- From: "Wallace, William" <email@hidden>
- Date: Thu, 02 Aug 2007 12:57:25 -0500
- Thread-topic: Help with find text command
has,
Thanks for the heads up. However, in this case I think it may be a moot
point because of the nature of the project. The script is intended to assist
in updating the publisher's catalog which currently contains only 10 digit
ISBNs. I am using the script to put the 13 digit ISBNs into the catalog (as
well as updating the prices). So I'm fairly confident that the situation you
describe won't be encountered (but still good to know for future reference).
In fact, now that I think about it, the word boundary issue in general
probably wouldn't be encountered (since these documents have already been
through several editorial/production rounds). Even if it were, it would
probably just be a minor formatting error where whitespace was accidentally
omitted between the product title and its ISBN. In which case, who cares? The
match would still almost certainly be an ISBN that needs to be updated. It's
not my job to fix their formatting errors for them. ;-} And if the script
somehow gives a false positive once in a while, there will be a subsequent
round of editorial review on these pages anyway, and one would think that
they should be able to catch an extraneous ISBN thrown into the text stream in
an inappropriate location.
Thanks again, everybody!
--
B!ll
> From: has <email@hidden>
> Date: Thu, 2 Aug 2007 18:28:08 +0100
> To: <email@hidden>
> Subject: Re: Help with find text command
>
> I wrote:
>
>> -- find 13-character substrings that may be an ISBN
>> set possMatches to find text "\\<[[:digit:]][[:digit:]-]{11}[[:digit:]X]\\>" in theText with regexp and all occurrences
>
> Additional testing uncovers a subtle problem with this pattern - the
> word boundary patterns (\< and \>) consider hyphens as boundaries, so
> something like "979-0-123-45678-X" would match as "0-123-45678-X"
> which you don't want it to.
>
> If you can switch to a more powerful regexp command that supports
> lookbehind and lookahead assertions, I think the following Perl-
> compatible regexp will work as intended. As a bonus, it checks both
> length and structure so will provide the single-pass solution you
> originally wanted:
>
> (?<![\w-])(?=[0-9X-]{13}(?![\w-]))([0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9X])
>
> (Caveat emptor; do your own tests to make sure I've not missed
> anything.) Basically it uses negative lookbehind and negative
> lookahead assertions to check for a potential ISBN's beginning and
> end, and a positive lookahead assertion to check for the correct
> length in between. If all that matches, it then checks for a valid
> ISBN structure.
>
> You could use TextCommands' search command for this, but if you do, be
> aware that the 'finding match indexes' option currently has an off-by-
> one bug in the indexes returned [1] and remember to compensate for
> that. Alternatively, Smile's 'ufind text' command may be powerful
> enough for the job (Emmanuel can advise here) or you can always call
> out to Perl or Python (remembering to compensate for their 0-
> indexing, of course).
>
> HTH
>
> has
>
> [1] AppleScript uses 1-indexing while Python uses 0-indexing, and I
> forgot to adjust the numbers accordingly. I'll fix this for the next
> release.
> --
> http://appscript.sourceforge.net
> http://rb-appscript.rubyforge.org
> http://appscript.sourceforge.net/objc-appscript.html
>
>
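For future reference, here is a rough sketch of the "call out to Python" route mentioned above. It uses has's Perl-compatible pattern joined back onto one line, and converts Python's 0-based match positions to AppleScript-style 1-based, inclusive indexes. The names ISBN10_RE and find_isbns and the sample string are made up for illustration, and I'm assuming Python's re module treats the pattern the same way a PCRE engine would, so test it against real catalog text before relying on it.

```python
import re

# has's pattern from the quoted message, joined onto one logical line.
# Assumption: Python's re module handles it the same way PCRE does.
ISBN10_RE = re.compile(
    r"(?<![\w-])(?=[0-9X-]{13}(?![\w-]))"
    r"([0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9X])"
)


def find_isbns(text):
    """Return (isbn, first, last) tuples with 1-based, inclusive indexes."""
    results = []
    for m in ISBN10_RE.finditer(text):
        # m.start() is 0-based, so add 1 to get the AppleScript index.
        # m.end() is 0-based and exclusive, which is numerically the same
        # as the 1-based inclusive index of the last character.
        results.append((m.group(1), m.start() + 1, m.end()))
    return results


# Hypothetical sample: one hyphenated 10-digit ISBN, plus the longer
# string from the quoted message that should NOT match.
sample = "Widget Guide 0-123-45678-X, not 979-0-123-45678-X"
print(find_isbns(sample))
# → [('0-123-45678-X', 14, 26)]
```

With those numbers, "text 14 thru 26 of theText" on the AppleScript side should pick out the same substring, at least for plain ASCII text.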