Re: Plugin extracting text from docs

Jonas Eckerman Thu, 25 Jun 2009 10:05:43 -0700

Matus UHLAR - fantomas wrote:

This I don't understand. Do they put PDFs inside .doc files as if the .doc was an archive?
I am not sure but I think something like that was done.

Considering that an OpenXML format is basically a zip file with XML files inside, and that the actual document can contain hyperlinks, I guess it could be possible to do something like that. I don't know enough about the format to say for sure, though.
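
To illustrate the "zip file with XML files inside" point, here is a minimal sketch in Python (not the Perl the plugin itself is written in) that lists the archive members that are not XML, which is where an embedded file would have to live if one exists. The function name and the idea of filtering by file extension are my own assumptions for illustration, not something the format specification prescribes.

```python
import io
import zipfile


def non_xml_members(docx_bytes):
    """Return the names of archive members that are neither .xml nor .rels.

    Assumption: anything that is not XML markup or a relationship file is a
    candidate binary payload (image, embedded object, and so on).
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return [
            name
            for name in archive.namelist()
            if not name.lower().endswith((".xml", ".rels"))
        ]
```

Whether a PDF can really be carried this way is still the open question above; the sketch only shows that checking is cheap once the file is treated as a zip archive.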

What I mean is to have
a generic chain of format converters, where at the end would be a plain image
or even text, that could be processed by classic rules like bayes,
replacetags etc.

If I manage to figure out how to add new parts to a message from within the "post_message_parse" method, that should work just fine.

An extractor plugin can return a list of parts to be added to the message, and my module will keep looping through the message parts if new parts are added.

So, if a Word extractor extracts a PDF and returns it, the PDF would be added as a new part, and in the next loop the PDF part will be sent to a PDF extractor if one exists. And so on. I'm running "post_message_parse" at priority -1, so any added image parts should be available to plugins like FuzzyOCR as well as plugins running "post_message_parse" at default priority.
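
The looping described above can be sketched roughly as follows. This is Python used purely for illustration, not the actual Perl module: the `run_extractors` name, the `(mime_type, payload)` tuples, and the `max_parts` guard against extractors that keep producing parts forever are all assumptions of the sketch.

```python
from collections import deque


def run_extractors(parts, extractors, max_parts=100):
    """Feed each part to the extractor registered for its MIME type.

    parts:      list of (mime_type, payload) tuples from the parsed message
    extractors: dict mapping a MIME type to a callable that takes a payload
                and returns a list of new (mime_type, payload) tuples
    max_parts:  hypothetical safety limit so a misbehaving extractor cannot
                make the loop run forever
    """
    all_parts = list(parts)
    queue = deque(parts)
    while queue:
        mime_type, payload = queue.popleft()
        extractor = extractors.get(mime_type)
        if extractor is None:
            continue  # nothing knows how to unpack this part
        for new_part in extractor(payload):
            if len(all_parts) >= max_parts:
                return all_parts
            all_parts.append(new_part)
            queue.append(new_part)  # new parts get their own turn later
    return all_parts
```

With a Word extractor that returns a PDF part and a PDF extractor that returns a text part, a single Word attachment ends up as three parts: the original, the PDF, and the plain text that the classic rules can then see.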


The missing parts are:

1: How do I add a new part to a parsed message (including a single-part one)? This is of course the main problem.

2: The actual extractor plugin that extracts whatever files are included in the Word document. Antiword only extracts text, and my extractor for OpenXML is little more than an extremely basic XML remover.
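
For reference, an "extremely basic XML remover" of the kind mentioned in point 2 could look something like the sketch below. It is Python for illustration only, not the actual extractor. It assumes the main document part is named `word/document.xml` and is UTF-8 encoded, which is what Word normally produces but is not guaranteed for every OpenXML file; the `docx_text` name is made up.

```python
import html
import io
import re
import zipfile


def docx_text(docx_bytes):
    """Return the text of a .docx by stripping the markup from its main part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # Assumption: conventional part name and UTF-8 encoding.
        xml = archive.read("word/document.xml").decode("utf-8")
    xml = xml.replace("</w:p>", "\n")    # keep paragraph breaks as newlines
    text = re.sub(r"<[^>]+>", "", xml)   # drop every remaining tag
    return html.unescape(text).strip()   # decode &amp;, &lt; and friends
```

This ignores everything except the visible text, so embedded files would still need a separate extractor.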


Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/
