Re: PDFText Plugin for PDF file scoring - PDFText2.pm for ver 3.2

James MacLean Sun, 15 Jul 2007 13:06:03 -0700

Theo Van Dinter wrote, on 14/07/07 02:13 PM:

On Sat, Jul 14, 2007 at 09:54:36AM -0300, James MacLean wrote:

Where do I find information on hooking into post_message_parse()? Tried grepping in the module area with no luck :(. Certainly agree it would be better to get the text out and let everyone at it :).


You can ask. :)  But yes, I didn't do a good job of fully documenting how
this is supposed to work -- you have to know about the plugin call, then
hunt around Message and Message::Node, etc.  Sorry.  Here's the basics:

First, create a plugin with the post_message_parse method.  Then in
there, use $msg->find_parts() to find the parts that you're looking
for (find_parts() is pretty well documented).  Then, you simply take
the data from $part->decode() and do something to convert it to text.
Then you take that text and call $part->set_rendered($text).

Later on, when SA looks for the text to use for body rules, uri parsing,
etc, it takes anything that has rendered text.
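
As a rough illustration of the steps described above, a minimal plugin might look like the sketch below. This is not the code from PDFText2.pm. The package name PDFTextSketch and the helper _pdf_to_text are hypothetical, the pdftotext path is assumed to match the sample setup, and the sketch has not been exercised under spamd or taint mode.

```perl
package PDFTextSketch;   # hypothetical name, not the PDFText2.pm from this thread

use strict;
use warnings;
use File::Temp ();
use Mail::SpamAssassin::Plugin;
our @ISA = qw(Mail::SpamAssassin::Plugin);

sub new {
    my ($class, $mailsa) = @_;
    $class = ref($class) || $class;
    my $self = $class->SUPER::new($mailsa);
    bless($self, $class);
    return $self;
}

# Called by SpamAssassin once the message has been parsed.
sub post_message_parse {
    my ($self, $opts) = @_;
    my $msg = $opts->{message};

    # Leaf parts whose Content-Type looks like a PDF.
    foreach my $part ($msg->find_parts(qr{^application/pdf\b}i, 1)) {
        my $text = _pdf_to_text($part->decode());
        # Rendered text is what body rules and URI parsing pick up later.
        $part->set_rendered($text) if defined $text && length $text;
    }
    return 1;
}

# Hypothetical helper: write the PDF to a temp file and run pdftotext on it.
# The command path is an assumption taken from the sample setup.
sub _pdf_to_text {
    my ($data) = @_;
    my $tmp = File::Temp->new(SUFFIX => '.pdf');   # unlinked when $tmp goes away
    binmode $tmp;
    print {$tmp} $data;
    $tmp->flush;

    # "-" as the output file asks pdftotext to write to stdout.
    open(my $fh, '-|', '/usr/bin/pdftotext', $tmp->filename, '-') or return;
    local $/;
    my $text = <$fh>;
    close $fh;
    return $text;
}

1;
```

Such a module would be loaded with a loadplugin line in a SpamAssassin configuration file, after which ordinary body rules can match the extracted text.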

Thanks Theo. From this I now have:

http://support.ednet.ns.ca/SpamAssassin/PDFText2.pm

Sorry that I was not aware that I had not been developing for a current version :(. Explains why I could not find the pieces that I was told about ;).


Sample setup in local.cf :

pdftext_pdfinfo_cmd /usr/bin/pdfinfo
pdftext_pdftotext_cmd /usr/bin/pdftotext
pdftext_pdfimages_cmd /usr/bin/pdfimages
pdftext_gocr_cmd /usr/bin/gocr

body PLUGIN_PDFTEXT_TEST /Stock/i
describe PLUGIN_PDFTEXT_TEST Found word Stock
score PLUGIN_PDFTEXT_TEST 2.5

body PLUGIN_PDFTEXT2 /PDFText2-Title: stock_tmp\.pdf/i
describe PLUGIN_PDFTEXT2 Found the Title stock_tmp.pdf
score PLUGIN_PDFTEXT2 4.5

Current comments :

. now it will prepend PDFText2- to the pdfinfo pushed to render so that accurate PDFinfo matching can be done. Or was that the wrong thing to do?
. added gocr of the images, but I see FuzzyOCR does fuzzy matching which this doesn't, so as long as you don't set pdftext_gocr_cmd, it won't do that part. Maybe there is a way this one can call that one?
. not comfortable with how I create temporary dirs for pdfimages, so that might make trouble for folks.
. I cannot test it in our production environment as that is still 3.1 and I don't want to try the SVN FuzzyOCR just yet :). So that means I am only lightly testing in a development environment.
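
For the PDFText2- prefix and the temporary-directory concern noted above, one possible approach is sketched here using only the core File::Temp module. The function names prefix_pdfinfo and make_workdir are hypothetical and are not taken from PDFText2.pm. The sketch assumes pdfinfo output arrives as plain "Key: value" lines.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Prefix every non-empty line of pdfinfo-style output with "PDFText2-",
# so a body rule can match e.g. "PDFText2-Title: ..." unambiguously.
sub prefix_pdfinfo {
    my ($info) = @_;
    return join '', map { "PDFText2-$_\n" } grep { length } split /\r?\n/, $info;
}

# Create a private scratch directory (for example, for pdfimages output).
# tempdir() picks a unique name under the system temp dir, and CLEANUP => 1
# removes the directory and its contents when the program exits.
sub make_workdir {
    return tempdir('pdftext2-XXXXXX', TMPDIR => 1, CLEANUP => 1);
}
```

Letting File::Temp choose the name avoids collisions between concurrent scans, and the cleanup flag avoids leaving extracted images behind.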


Is there any similar function to post_message_parse in the 3.1 series?

Thanks again everyone,
JES
