HTML to Text to Tagged Text to XHTML
- Subject: HTML to Text to Tagged Text to XHTML
- From: Thomas Wetmore <email@hidden>
- Date: Sun, 25 Aug 2013 15:49:47 -0400
I am looking for some pointers or advice.
I am developing an application to semantically tag HTML pages with genealogical information as defined by the schema.org/Person object and related objects.
The NLP required to do the semantic analysis resides in a well-proven text processing library that I have developed over the past couple of years. Once the text from the HTML page has been put into a pure string form (i.e., tags removed), the NLP is run and the results map every semantic object (e.g., names, dates, places, birth and death events, parent-child relationships) to its position (i.e., an NSRange) within the pure text string. FYI, my NLP results on simple things like names, dates, and other entities are considerably better than those from Apple's semantic tagging system.
So the overall program does the following:
1. Read an HTML file (I am doing this by building an NSXMLDocument with the HTML tidy feature, so the output will be good XHTML regardless of the input).
2. Create the untagged equivalent of the text from the document for use by the NLP and semantic tagging.
3. Do the NLP processing to find and catalog all the semantic objects within the text.
4. Convert the untagged text back into HTML with new tags that match as closely as possible the tags used on the original page, but with extra <div> and/or <span> tags inserted as required to hold semantic information -- the page must render exactly as it used to, but with the semantic tags added.
It is the fourth area -- converting text with auxiliary semantic information back into HTML form that matches a previous HTML page -- that seems to have some marvelous challenges. I've been prototyping a few ideas on how to do this, but the algorithms seem finicky enough that I thought I would ask whether anyone here has come across a similar type of project.
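To make the fourth step concrete, here is a minimal sketch of one possible approach, written in Python rather than Cocoa only for brevity. It is not the prototype described above. The idea is to keep the original markup untouched, record where each run of plain text came from in the source, and then splice <span> tags into the source at the mapped offsets, splitting a span wherever an original tag interrupts it. The names extract_text and annotate, the (start, length, itemprop) tuples standing in for NSRange results, and the sample page are all hypothetical. The sketch assumes well-formed XHTML with no entities, comments, CDATA, or script content.

```python
import re

# Assumption: every tag matches this pattern (no comments, CDATA, or
# '>' inside attribute values).
TAG = re.compile(r"<[^>]+>")

def extract_text(markup):
    """Strip tags; return (plain_text, runs).

    Each run is (text_start, source_start, length): a stretch of plain
    text and the offset in the markup it was copied from.
    """
    runs, parts = [], []
    text_pos = src_pos = 0
    for m in TAG.finditer(markup):
        if m.start() > src_pos:
            chunk = markup[src_pos:m.start()]
            runs.append((text_pos, src_pos, len(chunk)))
            parts.append(chunk)
            text_pos += len(chunk)
        src_pos = m.end()
    if src_pos < len(markup):  # trailing text after the last tag
        chunk = markup[src_pos:]
        runs.append((text_pos, src_pos, len(chunk)))
        parts.append(chunk)
    return "".join(parts), runs

def annotate(markup, runs, annotations):
    """Wrap plain-text ranges in <span itemprop=...> inside the markup.

    annotations: non-overlapping (start, length, itemprop) tuples that
    index into the plain text. A range that crosses an original tag is
    emitted as one span per text run, so nesting stays well-formed.
    """
    inserts = []  # (source_offset, order, string); closes sort before opens
    for start, length, prop in annotations:
        end = start + length
        for t0, s0, n in runs:
            lo, hi = max(start, t0), min(end, t0 + n)
            if lo < hi:
                inserts.append((s0 + lo - t0, 1, f'<span itemprop="{prop}">'))
                inserts.append((s0 + hi - t0, 0, "</span>"))
    out, prev = [], 0
    for pos, _, piece in sorted(inserts):
        out.append(markup[prev:pos])
        out.append(piece)
        prev = pos
    out.append(markup[prev:])
    return "".join(out)

# Hypothetical usage: the ranges would come from the NLP pass.
page = "<p>John <b>Smith</b> was born in 1850.</p>"
text, runs = extract_text(page)
found = [(0, 10, "name"), (text.index("1850"), 4, "birthDate")]
tagged = annotate(page, runs, found)
```

Because nothing but spans is inserted, stripping the tags from the result gives back exactly the same plain text, which is an easy invariant to test. One visible weakness is that a name crossing a <b> boundary comes out as two spans with the same itemprop. A real solution would need a policy for that case, for example hoisting the span outside the inline element when the range covers it completely.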
Do any of you know of any applications that round-trip HTML text to pure strings and then back to possibly modified HTML text?
Tom Wetmore, Chief Bottle Washer
DeadEnds Software