ExtractText can handle any attachment type you want, but it's just a framework. We use it to call a script that pulls out images, Office documents and PDF files, passes them through a range of tools (antiword, unzip [newer Office files are just XML files inside a zip archive], gocr) and converts them to text, which ExtractText then feeds back into SpamAssassin. It pushes the load on the server up, of course, but it certainly improves scoring. The "unzip-office-files" trick is a total hack, but it's better than skipping those files ;-)
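To illustrate the unzip idea, here is a minimal Python sketch. It is not the script described above: the function name `office_zip_to_text` and the `MAX_PART_BYTES` cap are made up for the example. It treats the attachment as a zip archive, strips the tags from every XML part, and returns whatever text is left.

```python
import html
import io
import re
import zipfile

# Assumed safety cap so a hostile archive cannot make us read huge parts.
MAX_PART_BYTES = 5 * 1024 * 1024


def office_zip_to_text(data: bytes) -> str:
    """Crudely extract text from a zipped XML document (docx, xlsx, odt, ...).

    Hypothetical helper: it does no real XML parsing, it simply removes
    anything that looks like a tag and decodes character entities.
    """
    chunks = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            # Only look at XML parts, and skip suspiciously large ones.
            if not info.filename.endswith(".xml"):
                continue
            if info.file_size > MAX_PART_BYTES:
                continue
            xml = archive.read(info).decode("utf-8", errors="replace")
            # Replace each tag with a space, then decode &amp; and friends.
            text = html.unescape(re.sub(r"<[^>]+>", " ", xml))
            text = " ".join(text.split())  # collapse runs of whitespace
            if text:
                chunks.append(text)
    return "\n".join(chunks)
```

A wrapper script would read the attachment bytes, call `office_zip_to_text`, and print the result so the text can be handed back for scoring. How the file reaches the script depends on how ExtractText is configured locally, so that part is left out here.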
Jason

On 19/11/12 17:43, Olivier Nicole wrote:
> Thank you Jari,
>
>>> In the same way, I am wondering if something similar exists for all
>>> the (open|libre|MS)office documents?
>> ExtractText
>>
>> Works with this documents as well as with PDF.
> I had a look at ExtractText, but it only extract text,not the images.
> And antiword, the extractor for MS office that it is based on is
> limited for MS office 2003.
>
> Best regards,
>
> Olivier
>

--
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1