ExtractText can handle any attachment type you want. However, it's just a
framework. We use it to call a script that extracts images, office
documents and PDF files and passes them through a range of tools (antiword,
unzip [newer Office files are just XML files in a zip file], gocr) to
convert them into text, which ExtractText then feeds back into
SpamAssassin. It pushes the load on the server up, of course, but it
certainly improves scoring. The "unzip-office-files" trick is a total hack,
but it's better than skipping those files ;-)
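
To give an idea of what the "unzip-office-files" trick amounts to, here is a
minimal Python sketch. It is not our actual script, just an illustration
under the assumption that the attachment is a zip archive of XML parts; the
function name office_zip_to_text is made up for this example, and how
ExtractText hands the attachment to an external tool is not shown.

```python
import io
import zipfile
from xml.etree import ElementTree


def office_zip_to_text(data: bytes) -> str:
    """Return the plain text found in every XML member of a zipped Office file.

    Hypothetical helper for illustration; it treats the document as a plain
    zip archive and ignores all Office-specific structure.
    """
    chunks = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue  # images and other binary parts are handled elsewhere
            try:
                root = ElementTree.fromstring(archive.read(name))
            except ElementTree.ParseError:
                continue  # skip a malformed part rather than fail the whole file
            # itertext() yields the character data of each element, tags dropped
            text = " ".join(t.strip() for t in root.itertext() if t.strip())
            if text:
                chunks.append(text)
    return "\n".join(chunks)
```

A real script would also want size limits on what it unpacks, since the
input is untrusted mail. Embedded images would still need a separate pass
through an OCR tool such as gocr.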

Jason

On 19/11/12 17:43, Olivier Nicole wrote:
> Thank you Jari,
>
>>> In the same way, I am wondering if something similar exists for all
>>> the (open|libre|MS)office documents?
>> ExtractText
>>
>> Works with these documents as well as with PDF.
> I had a look at ExtractText, but it only extracts text, not the images.
> And antiword, the extractor for MS Office that it is based on, is
> limited to MS Office 2003.
>
> Best regards,
>
> Olivier
>  


-- 
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
