Theo Van Dinter wrote, on 14/07/07 02:13 PM:
On Sat, Jul 14, 2007 at 09:54:36AM -0300, James MacLean wrote:
Where do I find information on hooking into post_message_parse()? Tried grepping in the module area with no luck :(. Certainly agree it would be better to get the text out and let everyone at it :).

You can ask. :)  But yes, I didn't do a good job of fully documenting how
this is supposed to work -- you have to know about the plugin call, then
hunt around Message and Message::Node, etc.  Sorry.  Here's the basics:

First, create a plugin with the post_message_parse method.  Then in
there, use $msg->find_parts() to find the parts that you're looking
for (find_parts() is pretty well documented).  Then, you simply take
the data from $part->decode() and do something to convert it to text.
Then you take that text and call $part->set_rendered($text).

Later on, when SA looks for the text to use for body rules, uri parsing,
etc, it takes anything that has rendered text.
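
The steps described above could be sketched roughly as follows. This is only an illustration, not the PDFText2 plugin itself: the package name ExampleTextRender, the application/pdf type filter and the _to_text() helper are assumptions made for the example, and the helper is a crude stand-in for a real converter such as pdftotext. It assumes a SpamAssassin version that provides post_message_parse.

```perl
package ExampleTextRender;    # hypothetical plugin name, not from this thread

use strict;
use warnings;

use Mail::SpamAssassin::Plugin;
our @ISA = ('Mail::SpamAssassin::Plugin');

sub new {
    my ($class, $mailsa) = @_;
    $class = ref($class) || $class;
    my $self = $class->SUPER::new($mailsa);
    bless($self, $class);
    return $self;
}

# Plugin hook: runs after the message has been parsed.
sub post_message_parse {
    my ($self, $opts) = @_;
    my $msg = $opts->{message};

    # Assumption: we only care about leaf parts typed application/pdf.
    foreach my $part ($msg->find_parts(qr{^application/pdf$}i, 1)) {
        my $data = $part->decode();
        next unless defined $data && length $data;

        # Hand the converted text back so body rules and URI parsing see it.
        my $text = _to_text($data);
        $part->set_rendered($text) if defined $text && length $text;
    }
    return 1;
}

# Hypothetical placeholder converter: keeps runs of 4 or more printable
# ASCII characters. A real plugin would call an external tool instead.
sub _to_text {
    my ($data) = @_;
    my @runs = $data =~ /([\x20-\x7e]{4,})/g;
    return join("\n", @runs);
}

1;
```

Keeping the conversion in its own helper means the hook stays the same whichever tool ends up producing the text.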

Thanks Theo. From this I now have:

http://support.ednet.ns.ca/SpamAssassin/PDFText2.pm

Sorry that I was not aware that I had not been developing for a current version :(. Explains why I could not find the pieces that I was told about ;).

Sample setup in local.cf :

pdftext_pdfinfo_cmd /usr/bin/pdfinfo
pdftext_pdftotext_cmd /usr/bin/pdftotext
pdftext_pdfimages_cmd /usr/bin/pdfimages
pdftext_gocr_cmd /usr/bin/gocr

body PLUGIN_PDFTEXT_TEST /Stock/i
describe PLUGIN_PDFTEXT_TEST Found word Stock
score PLUGIN_PDFTEXT_TEST 2.5

body PLUGIN_PDFTEXT2 /PDFText2-Title: stock_tmp.pdf/i
describe PLUGIN_PDFTEXT2 Found the Title stock_tmp.pdf
score PLUGIN_PDFTEXT2 4.5

Current comments :
. now it will prepend PDFText2- to the pdfinfo pushed to render so that accurate PDFinfo matching can be done. Or was that the wrong thing to do?
. added gocr of the images, but I see FuzzyOCR does fuzzy matching, which this doesn't. So as long as you don't set pdftext_gocr_cmd, it won't do that part. Maybe there is a way this one can call that one?
. not comfortable with how I create temporary dirs for pdfimages, so that might make trouble for folks.
. I cannot test it in our production environment as that is still 3.1 and I don't want to try the SVN FuzzyOCR just yet :). So that means I am only lightly testing in a development environment.
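
On the temporary directory concern, one possible approach is to let the core File::Temp module create and remove the directory. The sketch below is a suggestion, not what PDFText2.pm currently does; the helper name with_scratch_dir, the pdftext-XXXXXX template and the file name in the usage part are made up for the example.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical helper: run $code with a private scratch directory.
# The File::Temp::Dir object deletes the directory and its contents
# when it goes out of scope, i.e. when this sub returns.
sub with_scratch_dir {
    my ($code) = @_;
    my $dir = File::Temp->newdir('pdftext-XXXXXX', TMPDIR => 1);
    return $code->($dir->dirname);
}

# Usage: the callback is where pdfimages would be pointed at "$dir/...".
# Here a file is simply created by hand to stand in for an extracted image.
my $count = with_scratch_dir(sub {
    my ($dir) = @_;
    open(my $fh, '>', "$dir/page-000.pbm") or die "cannot write: $!";
    close($fh);

    opendir(my $dh, $dir) or die "cannot open $dir: $!";
    my @files = grep { !/^\.\.?$/ } readdir($dh);
    closedir($dh);
    return scalar @files;
});
print "files seen: $count\n";
```

Because the cleanup is tied to the lifetime of the directory object, the scratch files go away even if the callback dies partway through.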

Is there any similar function to post_message_parse in the 3.1 series?

Thanks again everyone,
JES
