
Re: Removing html tags



At 11:31 AM +0100 2/28/05, Paff wrote:
Hi all!

I want to remove all HTML tags from a document downloaded with the curl utility. I want to remove everything between the <> characters (including the <> characters themselves) so that I end up with plain text. Example: if the document contains "<title>BOS Bank</title>", I'd like to get only "BOS Bank"; if it contains "<TD class="tabelka01" rowspan="2" align="center">Kod</TD>", I want to get only the string "Kod", and so on.

However, I have no idea how to do that. I've searched macscripter.net and tried Google, but with no luck.

In the simplest cases, you can do that with a single regular expression. Depending on whether your file is ASCII, UTF-8, or something else, you would use the Satimage osax (which is free) from the Smile environment (which is also free).


Assuming your file is UTF-8, for instance, you would do:

-- untested
uchange "<[^>]+>" into "" in the_file with regexp
----------
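
To see what that pattern does on the two examples from the question, here is a minimal sketch of the same regular expression in Python rather than AppleScript. It is only an illustration, not part of the Satimage osax; the function name strip_tags is made up for this example.

```python
import re

# Same pattern as in the uchange line above: "<", then one or more
# characters that are not ">", then ">".
TAG = re.compile(r"<[^>]+>")

def strip_tags(text):
    """Remove everything between < and >, including the brackets."""
    return TAG.sub("", text)

print(strip_tags("<title>BOS Bank</title>"))  # prints: BOS Bank
print(strip_tags('<TD class="tabelka01" rowspan="2" align="center">Kod</TD>'))  # prints: Kod
```

As with the one-liner above, this only covers the simplest cases: a ">" inside an attribute value, or the text inside script and style elements, will not be handled correctly, and entities such as &amp; are left as they are.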

Emmanuel
_______________________________________________
Applescript-users mailing list