arni wrote:
Hi,
It's come up several times now that people ask for a way to detect
PDF spam directly from the PDF content, and not only through headers
or other means (hashes, Bayes).
I've found a solution that should be pretty easy to implement in a
FuzzyOCR-like plugin. Here is what it should do:
Use xpdf (http://www.foolabs.com/xpdf/download.html) to read the PDF
document:
- export the images to PPM files using `pdfimages`
- export the text parts to a plain text file using `pdftotext`
This plugin should run as one of the first, so that the extracted raw
text is available to later rules (for example by attaching it as an
extra MIME part, or somehow internally) and the images are available
to FuzzyOCR or similar plugins by the same means.
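
The two extraction steps could look roughly like the sketch below. It
is in Python purely for illustration (a real SpamAssassin plugin would
be Perl), and the function names, the "attachment.pdf" file name, the
"img" image root and the 30 second timeout are all made up for this
example. It assumes `pdfimages` and `pdftotext` are on the PATH and
that attaching the results to the message is handled elsewhere.

```python
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf_path, image_root):
    """Command line that dumps embedded images as <image_root>-NNN.ppm/.pbm."""
    return ["pdfimages", str(pdf_path), str(image_root)]


def pdftotext_cmd(pdf_path):
    """Command line that writes the text layer to stdout ("-") as UTF-8."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_pdf_parts(pdf_bytes, workdir, timeout=30):
    """Return (text, image_paths) for one PDF attachment.

    workdir should be a fresh temporary directory, since the PDF and
    the extracted images are written into it.
    """
    workdir = Path(workdir)
    pdf_path = workdir / "attachment.pdf"  # hypothetical file name
    pdf_path.write_bytes(pdf_bytes)

    # Step 1: export the images (PPM/PBM by default).
    subprocess.run(pdfimages_cmd(pdf_path, workdir / "img"),
                   check=True, timeout=timeout)

    # Step 2: export the text parts and capture them from stdout.
    result = subprocess.run(pdftotext_cmd(pdf_path),
                            check=True, timeout=timeout,
                            capture_output=True)
    text = result.stdout.decode("utf-8", errors="replace")

    # Collect whatever image files pdfimages produced.
    images = sorted(p for p in workdir.glob("img-*")
                    if p.suffix in {".ppm", ".pbm"})
    return text, images
```

The returned text could then be fed to the normal body rules and the
image paths handed to FuzzyOCR. The timeout is there because spam PDFs
may be malformed on purpose.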
Unfortunately I won't be able to write such a plugin myself. It should
be rather easy to do, but I can't start to learn Perl just for this ;-)
I already have... I'll be releasing the info soon.
--
Dallas Engelken
[EMAIL PROTECTED]
http://uribl.com