On Sep 6, 2011, at 9:23 PM, Douglas Davidson wrote:

> 
> On Sep 6, 2011, at 11:53 AM, Jens Alfke wrote:
> 
>> On Sep 6, 2011, at 11:11 AM, Michael Thon wrote:
>> 
>>> Yup, they're HTML, all right. Now I'm thinking of moving this code to a 
>>> separate command line app that I can call from the main application. It 
>>> should work, but I'm not sure if I'd need to provide a runloop for the HTML 
>>> importing to work.
>> 
>> The background tool will need to link against WebKit and AppKit, so it won’t 
>> be strictly-speaking ‘background’. You can mark its bundle with a special 
>> key (LSBackgroundOnly?) to keep it from showing up in the Dock or getting a 
>> menu-bar though.
>> 
>> The bigger problem is how this tool sends results back to the main app. An 
>> NSAttributedString is an in-memory object, and the tool has a separate 
>> address space. I guess you could try archiving the string and sending back 
>> the data, but I’m not sure whether all the different attribute values used 
>> in parsed HTML are archivable.
>> 
>> What do you use the attributed string for, if this is a background-only 
>> operation? Maybe there’s a less expensive way to accomplish it.
> 
> One possibility would be to convert the HTML to RTF or RTFD, which could be 
> loaded in the background.  For that sort of conversion we already have a tool 
> on the system, /usr/bin/textutil.  There are also other potential methods for 
> parsing HTML, if the intent is for something other than full editable rich 
> text support.
> 
> Douglas Davidson
> 


The app is used to find potential cases of plagiarism.  I'm importing 
documents that users have on their local computers, as well as web pages 
that they select from the web.  In the case of HTML, the user does not need to 
edit it, and it will be presented to the user in a WebView.  I need to extract 
plain text from the HTML in order to compare it to the user's document.  The 
HTML has not been loaded into the WebView instance yet, so I can't extract 
the plain text from there.  So, in a background thread (an NSOperation), I 
fetch the URL using NSURLConnection, convert the NSData object to an 
NSAttributedString, and from there get an NSString.  Later, if a potential 
case of plagiarism is detected, the user will view the web page by loading 
the original URL into a WebView.  The earlier conversion to plain text is 
accurate enough that if I select a substring of it (a potentially plagiarized 
sentence) and use [webView searchFor:subString ...] in the WebView instance, 
the WebView can find the string and highlight it for the user.
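
For reference, the highlighting step described above might look like the 
sketch below.  It assumes the legacy WebKit WebView class, ARC, and a page 
that has already finished loading.  HighlightSentence is a hypothetical 
helper name, not something from my actual code.

```objc
#import <WebKit/WebKit.h>

// Hypothetical helper: selects (and thereby highlights) the next occurrence
// of `sentence` in an already-loaded legacy WebView.
// Returns YES if the text was found, NO otherwise.
static BOOL HighlightSentence(WebView *webView, NSString *sentence)
{
    return [webView searchFor:sentence
                    direction:YES    // search forward
                caseSensitive:YES    // match the extracted text exactly
                         wrap:YES];  // continue from the top if needed
}
```

Because the search runs against the rendered page, it only succeeds when the 
extracted plain text matches what WebKit displays, which is why the accuracy 
of the earlier conversion matters.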

So, I don't need editable rich text, but I do need a plain text string that 
faithfully represents the text content of the original HTML as displayed in a 
WebView.  I've found that NSAttributedString's initWithData:... method does a 
good job of the conversion, and now I see why: it's using WebKit to do the 
actual conversion.  There might be a lighter-weight way to do the conversion, 
but I haven't investigated alternatives yet.  One drawback to my approach is 
that I can't get the web page title using NSAttributedString; I don't know 
how to get at that yet.  I've considered trying to parse the HTML myself, but 
I think that would eventually just cause me a lot of grief.
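
A minimal sketch of the conversion step described above, assuming ARC and 
UTF-8 input.  PlainTextFromHTMLData is a hypothetical helper name.  The 
title lookup is an unverified assumption: the documentAttributes 
out-parameter may include NSTitleDocumentAttribute for HTML input, but that 
should be checked against real pages before relying on it.  Since the 
importer uses WebKit, where this can safely be called is a separate 
question that the sketch does not address.

```objc
#import <Cocoa/Cocoa.h>

// Hypothetical helper: converts fetched HTML data to plain text.
// If outTitle is non-NULL, it receives the page title when the importer
// reports one (assumption: may be nil for some pages).
static NSString *PlainTextFromHTMLData(NSData *htmlData, NSString **outTitle)
{
    // Assumption: the page is UTF-8.  In practice the encoding would come
    // from the NSURLResponse.
    NSDictionary *options = [NSDictionary dictionaryWithObjectsAndKeys:
        NSHTMLTextDocumentType, NSDocumentTypeDocumentOption,
        [NSNumber numberWithUnsignedInteger:NSUTF8StringEncoding],
        NSCharacterEncodingDocumentOption,
        nil];

    NSDictionary *docAttributes = nil;
    NSError *error = nil;
    NSAttributedString *attributed =
        [[NSAttributedString alloc] initWithData:htmlData
                                         options:options
                              documentAttributes:&docAttributes
                                           error:&error];
    if (attributed == nil) {
        NSLog(@"HTML import failed: %@", error);
        return nil;
    }

    if (outTitle != NULL) {
        *outTitle = [docAttributes objectForKey:NSTitleDocumentAttribute];
    }

    // Drop the attributes and keep only the characters.
    return [attributed string];
}
```

The returned string is what would be compared against the user's document, 
and a sentence taken from it is what would later be passed to the WebView 
search.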

Mike




_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
