I am looking for some pointers or advice.

I am developing an application to semantically tag HTML pages with genealogical 
information as defined by the schema.org/Person object and related objects.

The NLP required to do the semantic analysis resides in a well-proven text 
processing library that I have developed over the past couple of years. Once the 
text from the HTML page has been put into a pure string form (i.e., tags 
removed), the NLP is run and the results catalog every semantic object (e.g., 
names, dates, places, birth and death events, parent-child relationships) by 
its position (i.e., NSRange) within the pure text string. FYI, my NLP results on 
simple things like names, dates, and other entities are considerably better 
than those from Apple's semantic tagging system.
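
For illustration only, a catalog entry of this kind might look like the 
sketch below. It is written in Python to keep it short; the type name 
SemanticObject and its fields are hypothetical and are not the library's 
actual types. Note also that Python string offsets count code points, 
whereas an NSRange counts UTF-16 units, so real offsets would need 
converting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticObject:
    """Hypothetical catalog entry: what was found and where it sits."""
    kind: str       # e.g. "name", "date", "place", "birth"
    location: int   # start offset in the plain text (like NSRange.location)
    length: int     # number of characters covered (like NSRange.length)

    def text_in(self, plain):
        """Return the substring of the plain text that this object covers."""
        return plain[self.location:self.location + self.length]


plain = "John Smith was born in 1850."
catalog = [
    SemanticObject("name", 0, 10),
    SemanticObject("date", 23, 4),
]
for obj in catalog:
    print(obj.kind, repr(obj.text_in(plain)))
```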

So the overall program does the following:

1. Read an HTML file (I am doing this by building an NSXMLDocument with the 
HTML tidy feature, so the output will be good XHTML regardless of the input).

2. Create the untagged equivalent of the text from the document for use by the 
NLP and semantic tagging.

3. Do the NLP processing to find and catalog all the semantic objects within 
the text.

4. Convert the untagged text back into HTML with new tags that match as closely 
as possible the tags used on the original page, but with extra <div> and/or 
<span> tags inserted as required to hold semantic information -- the page must 
render exactly as it used to, but with the semantic tags added.
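
One of the ideas I have been prototyping for steps 2 and 4 can be sketched 
roughly as follows. This is a minimal sketch under several assumptions, not 
a finished design: it is in Python rather than Cocoa for brevity, the names 
TextMap and retag are made up for the example, annotations are assumed not 
to overlap, and the semantic markup is assumed to be a <span> carrying a 
schema.org itemprop attribute. It keeps the original markup as opaque 
tokens, records where each text node starts in the plain string, and then 
wraps the annotated ranges one text node at a time.

```python
from html import escape
from html.parser import HTMLParser


class TextMap(HTMLParser):
    """Split HTML into markup tokens and text tokens with plain-text offsets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []   # ("markup", raw) or ("text", decoded, offset)
        self.length = 0    # length of the plain text collected so far

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("markup", self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as a single raw token.
        self.tokens.append(("markup", self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.tokens.append(("markup", f"</{tag}>"))

    def handle_comment(self, data):
        self.tokens.append(("markup", f"<!--{data}-->"))

    def handle_decl(self, decl):
        self.tokens.append(("markup", f"<!{decl}>"))

    def handle_data(self, data):
        self.tokens.append(("text", data, self.length))
        self.length += len(data)

    def plain_text(self):
        return "".join(t[1] for t in self.tokens if t[0] == "text")


def retag(html, annotations):
    """Re-emit html with (location, length, itemprop) ranges wrapped in spans.

    Ranges refer to the plain text and must not overlap. A range that crosses
    an existing tag is split into one span per text node it touches.
    """
    parser = TextMap()
    parser.feed(html)
    parser.close()
    ordered = sorted(annotations)
    out = []
    for token in parser.tokens:
        if token[0] == "markup":
            out.append(token[1])
            continue
        _, text, offset = token
        pos = 0  # how much of this text node has been written so far
        for loc, length, prop in ordered:
            lo = max(loc - offset, pos)
            hi = min(loc + length - offset, len(text))
            if lo >= hi:
                continue  # this range does not touch this text node
            out.append(escape(text[pos:lo], quote=False))
            inner = escape(text[lo:hi], quote=False)
            out.append(f'<span itemprop="{prop}">{inner}</span>')
            pos = hi
        out.append(escape(text[pos:], quote=False))
    return "".join(out)


page = "<p>John <b>Smith</b> was born in 1850.</p>"
print(retag(page, [(0, 10, "name"), (23, 4, "birthDate")]))
```

The finicky part shows up right away: the name "John Smith" crosses the 
<b> tag, so it comes back as two spans rather than one. The sketch also 
re-escapes text instead of preserving the original entities, and it treats 
script and style content as ordinary text, both of which a real version 
would have to handle.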

It is the fourth step -- converting text with auxiliary semantic information 
back into HTML that matches the original page -- that seems to pose some 
marvelous challenges. I've been prototyping a few ideas on how to do this, 
but the algorithms seem finicky enough that I thought I would ask whether 
anyone here has come across a similar type of project.

Do any of you know of any applications that round trip HTML text to pure 
strings and then back to possibly modified HTML text?

Tom Wetmore, Chief Bottle Washer
DeadEnds Software
_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)