Re: Parsing HTML
- Subject: Re: Parsing HTML
- From: Gary Lists <email@hidden>
- Date: Sat, 04 Jan 2003 15:37:16 -0500
On or about 1/4/03 2:42 PM, Randal L. Schwartz wrote:
> <tag1
> attribute1="fo'o>b'ar"
> attribute2='lef"t>r"ight'
> attribute3=unquoted
> >
> some text
> </tag1>
>
> so you can't just scan to ">": you need to know if you're inside a
> quoted attribute value or not. And notice that each kind of quotes
> can contain the other kind of quotes. Yeah, messy problem, eh?

Only messy in theory...mostly.
Because...
attribute3 is not valid HTML; attribute2 is not valid HTML; attribute1
should use the greater than entity to be valid.
If anyone really wrote HTML like the sample you offer above, now _that'd_ be
messy. ;)
(Your sample won't even work in all browsers, so it's unlikely anyone would
ever write it that way.) But throw in some JavaScript escaping or some XML
<br /> tags and WHAM! ... the Parse HTML routine will probably break.
And you are right, of course, about the general principle, but swapping
on/off a boolean for being inside a tag isn't all that difficult. (This is
what I do in BBEdit to fix my empty ALT= attributes, but there is grep
there, so avoiding such mis-quoted quotes is easier.)
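
To make the boolean idea concrete, here is a minimal sketch in Python (not AppleScript, and not the Apple sub-routine Sal referenced). The function name find_tag_end is made up for illustration, and it assumes the only complication is quoted attribute values; comments, scripts and CDATA are ignored.

```python
def find_tag_end(html, start):
    """Return the index of the '>' that closes the tag beginning at start.

    Hypothetical helper: tracks only whether we are inside a quoted
    attribute value. Returns -1 if no closing '>' is found.
    """
    quote = None  # the quote character we are currently inside, or None
    for i in range(start, len(html)):
        ch = html[i]
        if quote is not None:
            # Inside a quoted value: only the same quote character ends it,
            # so the other kind of quote and any '>' are plain text here.
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == ">":
            return i
    return -1


# Randal's sample, collapsed onto one line.
sample = """<tag1 attribute1="fo'o>b'ar" attribute2='lef"t>r"ight' attribute3=unquoted>some text</tag1>"""
end = find_tag_end(sample, 0)
print(sample[:end + 1])   # the full opening tag
print(sample.find(">"))   # a naive scan stops inside attribute1
```

Remembering which quote character opened the value, rather than keeping a bare true/false flag, is what lets each kind of quote contain the other.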
The sub-routine that Sal referenced is a pretty good one, especially for
those whose needs are simpler. Anyone using _only_ this sub-routine in a
real production workflow would be foolhardy, but if you want to quickly get
to the next <IMG...>, then this will do it for you.
Thanks for re-reminding everyone of the useful sub-routines at Apple, Sal.
Keep 'em coming...and don't forget us OS 9ers. ;)
--
Gary
Incoming replies are auto-deleted.
Please post directly to the list or newsgroup.
Really need direct? Rot me at:
email@hidden
Lbhe fhowrpg zhfg ortva "abgwhax:" (ab dhbgrf)
Avpr gb zrrg lbh! Qba'g fcnz zr.