Re: HTML to Text to Tagged Text to XHTML
- Subject: Re: HTML to Text to Tagged Text to XHTML
- From: Jens Alfke <email@hidden>
- Date: Sun, 25 Aug 2013 13:15:37 -0700
Some thoughts:
If you just convert the HTML to a plain string, you’ve lost the knowledge of how the characters in that string map back to the HTML, and I don’t think you can feasibly put it back together after modifying the string.
There are two approaches I can see.
(1) Use an NSMutableAttributedString. Don’t use the regular styled-text attributes, but instead a custom attribute that stores the HTML element metadata for that span of text. Then you can modify the string and it will still keep track of which ranges belong to which tag. Unfortunately I have doubts about whether you can restore the HTML exactly; I can foresee issues with elements that don’t contain any text (like <br/>), since there are no characters to attach the attribute to.
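A rough sketch of approach (1), in Swift for brevity. The attribute key name "HTMLTag" and the renderHTML helper are made up for illustration, and storing a single tag name per range ignores nesting and HTML attributes, so treat this as a starting point only:

```swift
import Foundation

// Hypothetical custom attribute key; the name is an assumption, not a Cocoa constant.
let htmlTagKey = NSAttributedString.Key("HTMLTag")

// Re-emit markup by walking the runs of the custom attribute.
// No HTML escaping is done here; a real converter would need it.
func renderHTML(_ text: NSAttributedString) -> String {
    var html = ""
    let whole = NSRange(location: 0, length: text.length)
    text.enumerateAttribute(htmlTagKey, in: whole, options: []) { value, range, _ in
        let piece = text.attributedSubstring(from: range).string
        if let tag = value as? String {
            html += "<\(tag)>\(piece)</\(tag)>"
        } else {
            html += piece
        }
    }
    return html
}

// Build "Hello world!" where "world" came from <em>world</em>.
let text = NSMutableAttributedString(string: "Hello ")
text.append(NSAttributedString(string: "world", attributes: [htmlTagKey: "em"]))
text.append(NSAttributedString(string: "!"))

// Edit the text ahead of the tagged span; the attribute range shifts with it.
text.insert(NSAttributedString(string: "big "), at: 6)

print(renderHTML(text))
```

The point is only that the tagged range follows "world" after the insertion. An empty element like <br/> has no range to carry the attribute, which is the limitation mentioned above.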
(2) Parse the HTML into an NSXMLDocument, i.e. a DOM tree, and walk through the tree looking at the text nodes. At that point it’s easy to insert new nodes or text wherever you want. The difficulty here is that the text will be broken up across lots of nodes, for instance if one word in a phrase is italicized. Some HTML generators even do redundant stuff like breaking text into multiple <span> elements unnecessarily. So it depends on how well your NLP engine works with disconnected chunks of text. If you can stream the text into it a piece at a time, that would be ideal: you just do a depth-first traversal of the DOM tree, feeding all the text nodes into it as you find them.
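The depth-first traversal in approach (2) could look something like this in Swift, where NSXMLDocument is spelled XMLDocument. The helper names visitTextNodes and collectText are invented for the example, and the input is assumed to be well-formed XHTML. For real-world HTML you would probably want to pass the .documentTidyHTML option when parsing, which I haven't tried here:

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML   // only needed on non-Apple platforms
#endif

// Depth-first walk that hands each text node to a callback, in document order.
func visitTextNodes(of node: XMLNode, _ body: (XMLNode) -> Void) {
    if node.kind == .text {
        body(node)
        return
    }
    for child in node.children ?? [] {
        visitTextNodes(of: child, body)
    }
}

// Hypothetical helper: gather the text chunks the way an NLP engine
// would receive them if fed one text node at a time.
func collectText(from source: String) throws -> [String] {
    let document = try XMLDocument(xmlString: source,
                                   options: [.nodePreserveWhitespace])
    var chunks: [String] = []
    if let root = document.rootElement() {
        visitTextNodes(of: root) { textNode in
            chunks.append(textNode.stringValue ?? "")
        }
    }
    return chunks
}

let chunks = try collectText(from: "<p>One <i>italic</i> word</p>")
print(chunks)
```

Note that the sentence arrives as three separate chunks because of the <i> element, which is exactly the fragmentation problem described above. Since the callback receives the node itself, it could also set stringValue or insert sibling nodes at that point.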
—Jens