Re: pdf spam solution idea

Dallas Engelken Wed, 27 Jun 2007 20:12:25 -0700

arni wrote:

Hi,
It has come up several times now that people ask for a way to detect PDF spam directly from the PDF content, and not only through headers or other means (hashes, Bayes). I've found a solution that should be pretty easy to realise in a FuzzyOCR-like plugin. Here is what it should do:
Use xpdf (http://www.foolabs.com/xpdf/download.html) to read the PDF document:
export the images to ppm files using `pdfimages`
export the text parts to a simple text using `pdftotext`
This plugin should run as one of the first, so that the extracted raw text is made available (for example by attaching it as an extra MIME part, or somehow internally), and so that the images are made available to FuzzyOCR or similar by the same means.
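The extraction steps above could be sketched roughly as follows. This is only an illustration in Python, not the plugin itself (a real SpamAssassin plugin would be written in Perl). It assumes the xpdf tools `pdftotext` and `pdfimages` are on the PATH; the function names, the output file names `body.txt` and `img`, and the 30-second timeout are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the pdftotext and pdfimages command lines for one PDF."""
    out_dir = Path(out_dir)
    text_file = out_dir / "body.txt"   # hypothetical output file name
    image_root = out_dir / "img"       # pdfimages appends -NNN and an extension
    return (
        ["pdftotext", str(pdf_path), str(text_file)],
        ["pdfimages", str(pdf_path), str(image_root)],
    )


def extract_pdf(pdf_path, out_dir):
    """Run both tools and return (extracted text, list of image paths)."""
    for tool in ("pdftotext", "pdfimages"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    text_cmd, image_cmd = build_commands(pdf_path, out_dir)
    # Assumed timeout, so a malformed PDF cannot stall mail processing.
    subprocess.run(text_cmd, check=True, timeout=30)
    subprocess.run(image_cmd, check=True, timeout=30)
    text = Path(text_cmd[-1]).read_text(errors="replace")
    images = sorted(Path(out_dir).glob("img-*"))
    return text, images
```

The returned text could then be handed to the normal body rules, and the image files to FuzzyOCR or a similar plugin, by whatever mechanism the plugin framework offers.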
Unfortunately I won't be able to write such a plugin myself. It should be rather easy to do, but I can't start learning Perl just for this ;-)


I already have... I'll be releasing the info soon.

--
Dallas Engelken
[EMAIL PROTECTED]
http://uribl.com
