> We can use antiword to render text from MSWord files, and unrtf to
> render text from RTF files.  What is the best tool to render text from
> PDF files?
> 
> (We are running Solaris 9)

AFAIK, antiword offers the best tradeoff between speed and conversion quality
for Word files.

The best converter I know of, even for batch use, is actually OpenOffice with
its "uno" interface, but it isn't that easy to drive from Perl, since it uses
some kind of Java JNDI to exchange Word files and converted text with whatever
conversion controller you implement. It also tends to consume a lot of memory:
current versions keep growing their core size with each document you convert
(even after you close the documents).

Antiword seems more resource-conscious in this respect.
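
For what it's worth, driving such converters as short-lived external
processes can be sketched roughly as follows. This is only an illustration in
Python, not the plugin Jonas describes: the CONVERTERS table and the
extract_text helper are made-up names, and I am assuming that antiword prints
the rendered text to stdout when given a file name and that unrtf does the
same with its --text option.

```python
import shutil
import subprocess

# Hypothetical dispatch table: MIME type -> command prefix. The file path is
# appended as the last argument. The tool names come from this thread; the
# table itself is an illustration, not anyone's actual configuration.
CONVERTERS = {
    "application/msword": ["antiword"],
    "application/rtf": ["unrtf", "--text"],
    "text/rtf": ["unrtf", "--text"],
}


def extract_text(path, mime_type, converters=CONVERTERS, timeout=30):
    """Return the text rendered by an external converter, or None.

    None means: no converter is configured for this MIME type, the tool
    is not installed, or the tool failed or timed out.
    """
    command = converters.get(mime_type)
    if command is None or shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(
            command + [path],
            capture_output=True,
            timeout=timeout,
            check=True,  # a non-zero exit status raises CalledProcessError
        )
    except (subprocess.SubprocessError, OSError):
        return None
    # Converters may emit various encodings; decode leniently.
    return result.stdout.decode("utf-8", errors="replace")
```

Each conversion runs in its own process, so whatever memory the tool uses is
released when it exits, and the timeout keeps one bad attachment from stalling
a whole batch.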

Giampaolo



> 
> L
> 
> > -----Original Message-----
> > From: Jonas Eckerman [mailto:jonas_li...@frukt.org]
> > Sent: Wednesday, June 24, 2009 1:34 PM
> > To: users@spamassassin.apache.org
> > Subject: Plugin extracting text from docs (was: new spam using large
> > images)
> >
> > Jason Haar wrote:
> >
> > > Speaking of image/rtf/word attachment spam; is there any work going on
> > > to standardize this so that the textual output of such attachments could
> > > be fed back into SA?
> >
> > Just as a note:
> >
> > I'm currently working on a modular plugin for extracting text and adding
> > it to SA message parts.
> >
> > The plugin can use either external tools or its own simple plugin
> > modules. How to extract text from parts is configurable, and based on
> > mime types and file names, so new formats can be added by simply
> > configuring for new external tools or creating a new plugin module.
> >
> > My *far* from finished module currently manages to extract text from
> > Word documents (using antiword), OpenXML text documents (using a simple
> > plugin) and RTF (using unrtf).
> >
> > I haven't tested where and how the extracted text is available to
> > SpamAssassin yet (as noted, it's *far* from finished), but I am using
> > the "set_rendered" method as in the example, so it should work. ;-)
> >
> > Regards
> > /Jonas
> > --
> > Jonas Eckerman
> > Fruktträdet & Förbundet Sveriges Dövblinda
> > http://www.fsdb.org/
> > http://www.frukt.org/
> > http://whatever.frukt.org/
