Matus UHLAR - fantomas wrote:

This I don't understand. Do they put PDFs inside .doc files as if the .doc was an archive?

I am not sure, but I think something like that was done.

Considering that an OpenXML document is basically a zip file with XML files inside, and that the actual document can contain hyperlinks, I guess it could be possible to do something like that. I don't know enough about the format to say for sure, though.
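As a rough illustration of the zip-container point, here is a minimal Python sketch (not SpamAssassin code) that opens a document as a zip archive and lists the members that are not XML. The function name `list_non_xml_members` is made up for this example, and treating every non-XML member as a candidate embedded file is an assumption, not something the format guarantees.

```python
import io
import zipfile


def list_non_xml_members(data: bytes) -> list[str]:
    """Return names of zip members that are not XML or relationship files.

    Assumption: anything that is not .xml/.rels is a candidate embedded
    file (image, embedded object, etc.). Returns [] if data is not a zip.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return [
                name
                for name in archive.namelist()
                if not name.endswith("/")  # skip directory entries
                and not name.lower().endswith((".xml", ".rels"))
            ]
    except zipfile.BadZipFile:
        # Old binary .doc files and other non-zip data end up here.
        return []
```

Each name returned could then be read with `archive.read(name)` and handed on as a new part, if the surrounding code has a way to add parts.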

What I mean is to have a generic chain of format converters, ending in
a plain image or even text that could be processed by classic rules
like bayes, replacetags etc.

If I manage to figure out how to add new parts to a message from within the "post_message_parse" method, that should work just fine.

An extractor plugin can return a list of parts to be added to the message, and my module will keep looping through the message parts if new parts are added.

So, if a Word extractor extracts a PDF and returns it, the PDF would be added as a new part, and in the next loop the PDF part will be sent to a PDF extractor if one exists. And so on. I'm running "post_message_parse" at priority -1, so any added image parts should be available to plugins like FuzzyOCR as well as to plugins running "post_message_parse" at default priority.
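The looping idea can be sketched roughly like this in Python. This is a stand-alone illustration, not the actual plugin: parts are modelled as plain `(mime_type, payload)` tuples, the name `expand_parts` and the `max_parts` cap are assumptions added for the example, and nothing here touches the SpamAssassin API.

```python
from typing import Callable, Dict, List, Tuple

# A part is modelled as (mime_type, payload); real message parts are richer.
Part = Tuple[str, bytes]
Extractor = Callable[[bytes], List[Part]]


def expand_parts(parts: List[Part], extractors: Dict[str, Extractor],
                 max_parts: int = 100) -> List[Part]:
    """Run extractors over all parts, including parts added along the way."""
    result = list(parts)
    index = 0
    # len(result) is re-checked every pass, so newly added parts get visited.
    while index < len(result):
        mime_type, payload = result[index]
        extractor = extractors.get(mime_type)
        # Hypothetical soft cap: stop calling extractors once the list is
        # this long, so a self-nesting document cannot loop forever.
        if extractor is not None and len(result) < max_parts:
            result.extend(extractor(payload))
        index += 1
    return result
```

With a Word extractor that returns a PDF part and a PDF extractor that returns a text part, a single Word part would expand to three parts in one call, which is the chained behaviour described above.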

The missing parts are:

1: How do I add a new part to a parsed message (including a single-part one)? This is of course the main problem.

2: The actual extractor plugin that extracts whatever files are included in the word document. Antiword only extracts text, and my extractor for OpenXML is little more than an extremely basic XML remover.
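For reference, an "extremely basic XML remover" of the kind mentioned in point 2 could look something like the following Python sketch. It is not the actual extractor; the name `strip_xml_tags` is made up, and the approach ignores document structure entirely, so words that are split across several XML elements will come out with a space in the middle.

```python
import re
from html import unescape


def strip_xml_tags(xml_text: str) -> str:
    """Crudely reduce XML to plain text by dropping every tag."""
    # Replace each tag with a space so text from adjacent elements
    # does not fuse into one word.
    without_tags = re.sub(r"<[^>]+>", " ", xml_text)
    # html.unescape decodes the XML entities (&amp;, &lt;, ...) and also
    # HTML ones, which is more than strictly needed here.
    decoded = unescape(without_tags)
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(decoded.split())
```

For example, `strip_xml_tags("<w:p><w:t>Hello</w:t><w:t>world</w:t></w:p>")` gives `"Hello world"`, which is enough for text rules like bayes but says nothing about embedded files.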

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/
