Hello!
For anyone who likes to test stuff, I've uploaded my plugin that
extracts text from documents to
<http://whatever.frukt.org/graphdefang/ExtractText.zip>
I started writing it last week, so it hasn't been heavily tested yet, but
it has been running here over the weekend with no show-stopping problems.
What it does is use external tools and simple (interface-wise) extractor
plugins to extract text from message parts. The extractors are chosen
by MIME type, file name and optionally content magic. The extracted text
is seen by Bayes and SA rules. It is entirely possible to create an
OCR extractor, but I haven't done so, and I currently don't plan on
doing it.
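To illustrate the selection step, here is a minimal Python sketch. It is not
the plugin's actual code: the Extractor class, choose_extractor() and the
sample entries are hypothetical, and it assumes that a MIME type match or a
file name match is enough to pick an extractor, with the magic bytes acting
as an optional extra check. The real plugin may combine the criteria
differently.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Sequence


@dataclass
class Extractor:
    """Hypothetical description of one extractor and what it accepts."""
    name: str
    mime_types: Sequence[str] = ()
    file_globs: Sequence[str] = ()       # lowercase glob patterns
    magic: Optional[bytes] = None        # optional required leading bytes


def choose_extractor(extractors, mime_type, file_name, data):
    """Return the first extractor that accepts this message part, or None.

    Assumption: MIME type OR file name must match; if the extractor
    declares magic bytes, the content must also start with them.
    """
    mime_type = mime_type.lower()
    file_name = file_name.lower()
    for ex in extractors:
        type_ok = mime_type in ex.mime_types
        name_ok = any(fnmatchcase(file_name, g) for g in ex.file_globs)
        if not (type_ok or name_ok):
            continue
        if ex.magic is not None and not data.startswith(ex.magic):
            continue
        return ex
    return None


# Illustrative entries only, not the plugin's shipped configuration.
EXTRACTORS = [
    Extractor("pdf", ("application/pdf",), ("*.pdf",), b"%PDF"),
    Extractor("rtf", ("application/rtf", "text/rtf"), ("*.rtf",), b"{\\rtf"),
]
```

With these entries, a part sent as application/octet-stream but named
"Invoice.PDF" and starting with "%PDF" would still be handed to the PDF
extractor, while a ".pdf" attachment without the PDF magic would be skipped.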
The plugin currently comes with a *very* rudimentary OpenXML (recent MS
Word) extractor, and a configuration using external tools "antiword",
"unrtf", "odt2txt" and "pdftohtml" to extract text from MS Word, RTF,
OpenDocument (OpenOffice/StarOffice) and PDF files.
It is also possible for an extractor plugin to return several binary
objects as well as text. These objects will also be processed by all
extractors, so an extractor for a container type of file can return (as
an example) a bunch of images that are then processed by an OCR
extractor. I have not implemented any extractor that does this, so it's
completely untested.
Stuff I already know is missing:
* A safe-guarding maximum depth of processing.
* A way for extractor plugins to get config lines.
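
The re-processing of returned objects, together with the kind of depth limit
that is still missing, could look roughly like the Python sketch below. It is
only an illustration under assumptions, not the plugin's code: extract_all(),
the max_depth parameter, the "x-demo/..." types and the toy extractors are
all made up, and each extractor is assumed to return a (text, objects) pair
where objects is a list of (mime_type, data) tuples.

```python
def extract_all(data, mime_type, extractors, max_depth=3, depth=0):
    """Collect text from a part and from any objects its extractor returns.

    extractors maps a MIME type to a function data -> (text, objects).
    max_depth is the hypothetical safeguard: nesting beyond it is ignored.
    """
    if depth >= max_depth:
        return []
    extract = extractors.get(mime_type)
    if extract is None:
        return []
    text, objects = extract(data)
    texts = [text] if text else []
    # Returned objects go through the same extractors, one level deeper.
    for child_type, child_data in objects:
        texts.extend(
            extract_all(child_data, child_type, extractors, max_depth, depth + 1)
        )
    return texts


# Toy extractors standing in for real ones (hypothetical types).
def container_extractor(data):
    # Pretend the container holds one image and no text of its own.
    return "", [("x-demo/image", data)]


def image_extractor(data):
    # Stand-in for OCR: just decode and upper-case the bytes.
    return data.decode("ascii").upper(), []


def loop_extractor(data):
    # Pathological case: the object contains itself over and over.
    return "loop", [("x-demo/loop", data)]


DEMO_EXTRACTORS = {
    "x-demo/container": container_extractor,
    "x-demo/image": image_extractor,
    "x-demo/loop": loop_extractor,
}
```

Without the depth check, the self-containing "x-demo/loop" object would
recurse until the interpreter gives up, which is the situation a maximum
processing depth is meant to prevent.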
Test it if you feel like it.
Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/