Matus UHLAR - fantomas wrote:
This I don't understand. Do they put PDFs inside .doc files as if the
.doc was an archive?
I am not sure, but I think something like that was done.
Considering that an OpenXML document is basically a zip file with XML
files inside, and that the actual document can contain hyperlinks, I guess
it could be possible to do something like that. I don't know enough about
the format to say for certain, though.
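
As a rough illustration of the "zip file with XML files inside" idea, here
is a minimal Python sketch (not SpamAssassin code, and not tested against
real Word documents). It lists the non-XML members of a zip-based document,
which is where an embedded file would be expected to show up. The function
names and the sample layout ("word/embeddings/attached.pdf") are
assumptions made up for the example.

```python
import io
import zipfile


def list_embedded_files(data: bytes) -> list[str]:
    """Return the names of non-XML members in a zip-based document."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [
            name
            for name in archive.namelist()
            # Skip directory entries and the XML/relationship plumbing.
            if not name.endswith("/")
            and not name.lower().endswith((".xml", ".rels"))
        ]


def build_sample_document() -> bytes:
    """Build a tiny in-memory zip that only mimics the layout.

    This is NOT a valid .docx file; the member names are assumptions
    chosen for illustration.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document>Hello</w:document>")
        archive.writestr("word/embeddings/attached.pdf", b"%PDF-1.4 dummy")
    return buffer.getvalue()


if __name__ == "__main__":
    print(list_embedded_files(build_sample_document()))
```

Whether real documents actually carry PDFs this way is exactly the part
that is still unknown here; the sketch only shows that looking inside the
container is cheap.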
What I mean is to have
a generic chain of format converters, where at the end would be a plain
image or even text, that could be processed by classic rules like bayes,
replacetags etc.
If I manage to figure out how to add new parts to a message from within
the "post_message_parse" method, that should work just fine.
An extractor plugin can return a list of parts to be added to the
message, and my module will keep looping through the message parts if
new parts are added.
So, if a Word extractor extracts a PDF and returns it, the PDF would be
added as a new part, and in the next loop the PDF part will be sent to a
PDF extractor if that exists. And so on. I'm running
"post_message_parse" at priority -1 so any added image parts should be
available to plugins like FuzzyOCR as well as plugins running
"post_message_parse" at default priority.
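
The looping described above could be sketched roughly like this. It is
written in Python purely for illustration and uses none of SpamAssassin's
real API: the Part class, the extractor functions, the EXTRACTORS table and
the max_parts cap are all hypothetical. The cap is an extra safety measure
that the description above does not include.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Part:
    content_type: str
    payload: bytes


def extract_word(part):
    # Hypothetical extractor: pretend every Word part embeds one PDF.
    return [Part("application/pdf", b"pdf from " + part.payload)]


def extract_pdf(part):
    # Hypothetical extractor: pretend every PDF yields one text part.
    return [Part("text/plain", b"text from " + part.payload)]


# Hypothetical dispatch table from content type to extractor.
EXTRACTORS = {
    "application/msword": extract_word,
    "application/pdf": extract_pdf,
}


def expand_parts(parts, max_parts=100):
    """Keep looping over the parts for as long as extractors add new ones."""
    all_parts = list(parts)
    queue = deque(all_parts)
    # max_parts is a rough cap (an assumption) so that a misbehaving
    # extractor cannot keep the loop running forever.
    while queue and len(all_parts) < max_parts:
        part = queue.popleft()
        extractor = EXTRACTORS.get(part.content_type)
        if extractor is None:
            continue  # nothing knows how to unpack this part
        for new_part in extractor(part):
            all_parts.append(new_part)
            queue.append(new_part)  # new parts get their own turn later
    return all_parts


if __name__ == "__main__":
    for p in expand_parts([Part("application/msword", b"doc")]):
        print(p.content_type, p.payload)
```

With a single Word part as input, the Word extractor runs first, its PDF
is queued, and the PDF extractor then runs on that, so the chain ends with
a plain text part that ordinary rules could look at.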
The missing parts are:
1: How do I add a new part to a parsed message (including a singlepart
one)? This is of course the main problem.
2: The actual extractor plugin that extracts whatever files are included
in the Word document. Antiword only extracts text, and my extractor for
OpenXML is little more than an extremely basic XML remover.
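
For reference, an "extremely basic XML remover" can be as small as the
following Python sketch. It is only an illustration of the idea, not the
extractor mentioned above; the function name strip_xml is made up, and it
knows nothing about Word's actual markup.

```python
import html
import re


def strip_xml(xml_text: str) -> str:
    """Drop every tag, decode entities and collapse whitespace."""
    # Replace tags with a space so text from adjacent elements stays apart.
    without_tags = re.sub(r"<[^>]*>", " ", xml_text)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())


if __name__ == "__main__":
    sample = "<w:p><w:t>Hello</w:t></w:p><w:p><w:t>A &amp; B</w:t></w:p>"
    print(strip_xml(sample))
```

Something like this gets text out for the usual body rules, but it cannot
find embedded files, which is why a proper extractor is still missing.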
Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/