Hello!

For anyone who likes to test stuff, I've uploaded my plugin that extracts text from documents to
<http://whatever.frukt.org/graphdefang/ExtractText.zip>

I started writing it last week, so it hasn't been heavily tested yet, but it has been running here over the weekend with no showstopping problems.

It uses external tools and simple (interface-wise) extractor plugins to extract text from message parts. The extractors are chosen by MIME type, file name and, optionally, content magic. The extracted text is then seen by Bayes and the SA rules. It is entirely possible to create an OCR extractor, but I haven't done so, and I currently don't plan to.
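To illustrate the selection step, here is a rough sketch in Python. It is not the plugin's code: the `Extractor` class, the `choose_extractor` function and the table entries are made up for the example, and it assumes that either the MIME type or the file name may match, with the magic check (when configured) having to pass as well. The real plugin may combine these criteria differently.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Sequence


@dataclass
class Extractor:
    """Hypothetical description of one extractor and its match criteria."""
    name: str
    mime_types: Sequence[str] = ()
    name_globs: Sequence[str] = ()
    magic: Optional[bytes] = None  # optional leading bytes of the content

    def matches(self, mime_type: str, filename: str, data: bytes) -> bool:
        by_mime = mime_type.lower() in self.mime_types
        by_name = any(fnmatchcase(filename.lower(), g) for g in self.name_globs)
        # Assumption: MIME type OR file name is enough to select a candidate.
        if not (by_mime or by_name):
            return False
        # Assumption: if magic is configured, the content must start with it.
        if self.magic is not None and not data.startswith(self.magic):
            return False
        return True


# Illustrative table only; the tool names come from the post, the rest is assumed.
EXTRACTORS = [
    Extractor("pdftohtml", ("application/pdf",), ("*.pdf",), b"%PDF-"),
    Extractor("unrtf", ("application/rtf", "text/rtf"), ("*.rtf",), b"{\\rtf"),
    Extractor("antiword", ("application/msword",), ("*.doc",)),
]


def choose_extractor(mime_type: str, filename: str, data: bytes) -> Optional[Extractor]:
    """Return the first extractor whose criteria match, or None."""
    for extractor in EXTRACTORS:
        if extractor.matches(mime_type, filename, data):
            return extractor
    return None
```

With this table, a part sent as "application/octet-stream" but named "report.PDF" would still reach the PDF extractor as long as its content starts with the PDF magic bytes.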

The plugin currently comes with a *very* rudimentary OpenXML (recent MS Word) extractor, and a configuration that uses the external tools "antiword", "unrtf", "odt2txt" and "pdftohtml" to extract text from MS Word, RTF, OpenDocument (OpenOffice/StarOffice) and PDF files, respectively.

It is also possible for an extractor plugin to return several binary objects as well as text. These objects will in turn be processed by all extractors, so an extractor for a container file type can return (as an example) a bunch of images, which are then processed by an OCR extractor. I have not implemented any extractor that does this, so it's completely untested.

Stuff I already know is missing:

* A safe-guarding maximum depth of processing.

* A way for extractor plugins to get config lines.
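As a sketch of how the recursive processing of returned objects could be combined with a maximum depth, here is a small Python illustration. It is not taken from the plugin: `extract_all`, the `max_depth` parameter and the toy extractors are all hypothetical, and the assumed interface is that an extractor returns either `None` (not applicable) or a pair of extracted text and a list of binary objects.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed interface: None if the extractor does not apply,
# otherwise (extracted text, list of nested binary objects).
ExtractorFn = Callable[[bytes], Optional[Tuple[str, List[bytes]]]]


def extract_all(data: bytes, extractors: Sequence[ExtractorFn],
                max_depth: int = 3, depth: int = 0) -> List[str]:
    """Collect text from data and, recursively, from any objects returned."""
    if depth > max_depth:
        return []  # safeguard: stop descending into nested objects
    texts: List[str] = []
    for extractor in extractors:
        result = extractor(data)
        if result is None:
            continue
        text, objects = result
        if text:
            texts.append(text)
        for obj in objects:
            texts.extend(extract_all(obj, extractors, max_depth, depth + 1))
    return texts


# Toy extractors for demonstration only.
def container_extractor(data: bytes):
    """Pretend container format: b'BOX:' followed by parts separated by b'|'."""
    if not data.startswith(b"BOX:"):
        return None
    return "", data[4:].split(b"|")


def text_extractor(data: bytes):
    """Pretend text format: b'TXT:' followed by the text itself."""
    if not data.startswith(b"TXT:"):
        return None
    return data[4:].decode("ascii"), []


def loop_extractor(data: bytes):
    """Pathological case: an object that always contains itself."""
    if data != b"LOOP":
        return None
    return "loop", [b"LOOP"]
```

Calling `extract_all(b"BOX:TXT:hello|TXT:world", [container_extractor, text_extractor])` collects the text of both nested parts, while the self-containing `b"LOOP"` object is cut off once `max_depth` is exceeded instead of recursing forever.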

Test it if you feel like it.

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/
