ExtractText plugin

Jonas Eckerman Mon, 29 Jun 2009 11:08:01 -0700

Hello!

For anyone who likes to test stuff, I've uploaded my plugin that extracts text from documents to

<http://whatever.frukt.org/graphdefang/ExtractText.zip>

I started writing it last week, so it hasn't been heavily tested yet, but it has been running here over the weekend with no showstopping problems.

What it does is use external tools and simple (interface-wise) extractor plugins to extract text from message parts. The extractors are chosen by MIME type, file name and optionally content magic. The extracted text is seen by bayes and SA rules. It is entirely possible to create an OCR extractor, but I haven't done so, and I currently don't plan on doing it.
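A minimal sketch of how selection by MIME type, file name and optional content magic might look, written in Python purely for illustration. It is not the plugin's code: `ExtractorRule`, `choose_extractor`, and the rule that a match on either MIME type or file name is enough (with magic as an extra check when set) are all assumptions.

```python
import fnmatch
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ExtractorRule:
    """Hypothetical description of when one extractor applies."""
    name: str
    mime_types: Sequence[str] = ()
    file_globs: Sequence[str] = ()
    magic: Optional[bytes] = None  # optional leading bytes of the content


def choose_extractor(rules, mime_type, file_name, data):
    """Return the name of the first matching rule, or None.

    Assumed policy: a rule matches when the MIME type OR the file name
    matches, and, if the rule defines magic bytes, the data starts with them.
    """
    mime_type = mime_type.lower()
    file_name = file_name.lower()
    for rule in rules:
        mime_ok = mime_type in rule.mime_types
        name_ok = any(fnmatch.fnmatchcase(file_name, g) for g in rule.file_globs)
        if not (mime_ok or name_ok):
            continue
        if rule.magic is not None and not data.startswith(rule.magic):
            continue
        return rule.name
    return None
```

With a rule such as `ExtractorRule("pdf", ("application/pdf",), ("*.pdf",), b"%PDF-")`, a part sent as `application/octet-stream` but named `Report.PDF` would still be picked up, while a part that claims to be a PDF but does not start with the magic bytes would be skipped.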

The plugin currently comes with a *very* rudimentary OpenXML (recent MS Word) extractor, and a configuration using the external tools "antiword", "unrtf", "odt2txt" and "pdftohtml" to extract text from MS Word, RTF, OpenDocument (OpenOffice/StarOffice) and PDF files.
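As a rough illustration of the external-tool approach (again in Python, and not taken from the plugin), the sketch below writes a message part to a temporary file, runs a tool on it, and collects whatever the tool prints. The `TOOLS` table, the `{file}` placeholder and `run_tool` are hypothetical, and the argument lists are assumptions; check each tool's manual for the options you actually need.

```python
import os
import subprocess
import tempfile

# Hypothetical mapping from MIME type to a command line.
# "{file}" is replaced with the path of the temporary file.
TOOLS = {
    "application/msword": ["antiword", "{file}"],
    "application/vnd.oasis.opendocument.text": ["odt2txt", "{file}"],
}


def run_tool(command, data, timeout=30):
    """Run an external tool on `data` and return its stdout as text.

    Returns None if the tool is missing, fails, or times out.
    """
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        argv = [arg.replace("{file}", path) for arg in command]
        try:
            result = subprocess.run(argv, capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            return None
        if result.returncode != 0:
            return None
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        os.unlink(path)  # always remove the temporary copy
```

Passing the command as a list and never going through a shell keeps attacker-controlled file names out of the command line, which matters when the input is untrusted mail.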

It is also possible for an extractor plugin to return several binary objects as well as text. These objects will also be processed by all extractors, so an extractor for a container type of file can return (as an example) a bunch of images that are then processed by an OCR extractor. I have not implemented any extractor that does this, so it's completely untested.


Stuff I already know is missing:

* A safe-guarding maximum depth of processing.

* A way for extractor plugins to get config lines.
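To illustrate why a maximum depth matters once extractors can return further objects, here is a small Python sketch of recursive processing with a depth guard. It is an assumption-laden illustration rather than the plugin's design: `extract_all`, the `(mime_type, data)` tuples, the lookup by MIME type only, and the default limit of 5 are all made up for the example.

```python
def extract_all(part, extractors, max_depth=5, depth=0):
    """Collect text from a part and from any objects its extractor returns.

    part:       (mime_type, data) tuple
    extractors: dict mapping MIME type to a callable that takes the data
                and returns (text, objects), where objects is a list of
                further (mime_type, data) tuples
    max_depth:  number of nesting levels processed before giving up
    """
    if depth >= max_depth:
        return []  # guard against containers nested without end
    mime_type, data = part
    extractor = extractors.get(mime_type)
    if extractor is None:
        return []
    text, objects = extractor(data)
    texts = [text] if text else []
    for obj in objects:
        texts.extend(extract_all(obj, extractors, max_depth, depth + 1))
    return texts
```

Without the guard, a file crafted so that a container keeps yielding another container would make the processing recurse until something else breaks.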

Test it if you feel like it.

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/
