Re: Removing html tags

  • Subject: Re: Removing html tags
  • From: "Marc K. Myers" <email@hidden>
  • Date: Mon, 28 Feb 2005 20:23:16 -0500

On Feb 28, 2005, at 8:09 PM, Christian Vinaa wrote:
At 19:49 -0500 28/02/2005, Marc K. Myers wrote:
On Feb 28, 2005, at 6:22 PM, Christian Vinaa wrote:
At 15:19 -0500 28/02/2005, Marc K. Myers wrote:
On Feb 28, 2005, at 1:34 PM, Paff <email@hidden> wrote:
I want to remove all html tags from a document downloaded with curl
utility. I want to remove everything that is between <> characters
(including <> chars) so I end up with plain text. Example: if in a
document there's "<title>BOS Bank</title>" I'd like to get only "BOS
Bank"; if there's "<TD class="tabelka01" rowspan="2"
align="center">Kod</TD>" I want to get only "Kod" string etc.

set theText to "<tag1>this is some text</tag1> and then there's this text followed by <tag2>and <tag3>its</tag3> contents</tag2>"
-- Save the old delimiters, then split the text at every "<"
set {od, AppleScript's text item delimiters} to ¬
	{AppleScript's text item delimiters, "<"}
set theText to text items of theText
set newText to ""
-- Within each piece, whatever precedes ">" is the tag itself
set AppleScript's text item delimiters to ">"
repeat with anItem in theText
	set newList to text items of anItem
	if (count newList) > 1 then
		-- Keep only the text that follows the tag
		set newText to newText & item 2 of newList
	end if
end repeat
set AppleScript's text item delimiters to od
newText


-->"this is some text and then there's this text followed by and its contents"


I haven't tried it out, but at a quick glance it doesn't seem to take into consideration, for example, the tag

<TD class="tabelka01" rowspan="2" align="center">

only tags like  </tag1>

but PageSpinner has a script that does in fact remove all tags,
large and small  :-))

Actually, it deals quite well with that kind of tag. What it can't handle is text like "If x<3 and y>10, what are the solutions?" I'm not sure how anything without artificial intelligence could distinguish text between angle brackets from tags.
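
To make that limitation concrete, here is a minimal sketch that wraps the same split-on-"<", then split-on-">" idea in a handler and feeds it the sentence above. The handler name stripTags is made up for this illustration, and the expected result comes from tracing the logic by hand:

```applescript
-- stripTags is a hypothetical name; the logic follows the script posted earlier,
-- plus a branch that keeps pieces containing no ">".
on stripTags(theText)
	set od to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "<"
	set theChunks to text items of theText
	set AppleScript's text item delimiters to ">"
	set newText to ""
	repeat with aChunk in theChunks
		set theParts to text items of aChunk
		if (count theParts) > 1 then
			-- Everything before the first ">" is taken to be a tag and dropped
			set newText to newText & item 2 of theParts
		else
			-- No ">" in this piece, so keep it unchanged
			set newText to newText & item 1 of theParts
		end if
	end repeat
	set AppleScript's text item delimiters to od
	return newText
end stripTags

stripTags("If x<3 and y>10, what are the solutions?")
-->"If x10, what are the solutions?"
```

The script has no way of knowing that "3 and y" is not a tag, so it is removed along with the brackets.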


Marc [2/28/05 7:47:51 PM]



To make my meaning clearer:

a tag like  <TD class="tabelka01" rowspan="2" align="center">

that contains one or more " characters will upset the script!

That happens only when the text is entered as a literal in the script, where each embedded quote has to be escaped as \". Text drawn from outside the script, as from a text file, needs no escaping, because the quotes are just ordinary characters in the data. Try it this way:


-- Read the text from a file, so no quotes need escaping
set theFile to (choose file)
set fileRef to (open for access theFile)
set theText to (read fileRef)
close access fileRef
-- Save the old delimiters, then split the text at every "<"
set {od, AppleScript's text item delimiters} to ¬
	{AppleScript's text item delimiters, "<"}
set theText to text items of theText
set newText to ""
set AppleScript's text item delimiters to ">"
repeat with anItem in theText
	set newList to text items of anItem
	if (count newList) > 1 then
		-- Drop the tag (item 1); rejoin the rest so a stray ">" in the text survives
		set newText to newText & ((items 2 thru -1 of newList) as string)
	else
		-- No ">" here: plain text before the first tag, or after a lone "<"
		set newText to newText & item 1 of newList
	end if
end repeat
set AppleScript's text item delimiters to od
newText
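
If you would rather test with a literal than a file, the quotes inside the tag can be escaped with a backslash. The following is only a sketch using the sample tag from the original question; the variable names theChunks and theParts are my own, and the result shown comes from tracing the loop by hand:

```applescript
-- The sample tag written as a script literal: each embedded quote becomes \"
set theText to "<TD class=\"tabelka01\" rowspan=\"2\" align=\"center\">Kod</TD>"
set {od, AppleScript's text item delimiters} to ¬
	{AppleScript's text item delimiters, "<"}
set theChunks to text items of theText
set newText to ""
set AppleScript's text item delimiters to ">"
repeat with aChunk in theChunks
	set theParts to text items of aChunk
	if (count theParts) > 1 then
		-- Item 1 is the tag, attributes and quotes included; item 2 is the text
		set newText to newText & item 2 of theParts
	else
		set newText to newText & item 1 of theParts
	end if
end repeat
set AppleScript's text item delimiters to od
newText
-->"Kod"
```

The attributes and their quotes all sit before the ">", so they are discarded together with the rest of the tag.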

Marc [2/28/05  8:23:00 PM]

Applescript-users mailing list      (email@hidden)