Theo Van Dinter wrote, on 14/07/07 02:13 PM:
> On Sat, Jul 14, 2007 at 09:54:36AM -0300, James MacLean wrote:
>> Where do I find information on hooking into post_message_parse()? Tried
>> grepping in the module area with no luck :(. Certainly agree it would be
>> better to get the text out and let everyone at it :).
>
> You can ask. :) But yes, I didn't do a good job of fully documenting how
> this is supposed to work -- you have to know about the plugin call, then
> hunt around Message and Message::Node, etc. Sorry. Here are the basics:
>
> First, create a plugin with the post_message_parse method. Then in
> there, use $msg->find_parts() to find the parts that you're looking
> for (find_parts() is pretty well documented). Then, you simply take
> the data from $part->decode() and do something to convert it to text.
> Then you take that text and call $part->set_rendered($text).
>
> Later on, when SA looks for the text to use for body rules, uri parsing,
> etc, it takes anything that has rendered text.

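
For anyone following along, the steps Theo describes might look roughly like the sketch below. This is not the PDFText2 code: the package name PDFTextSketch and the helper _pdf_to_text are made up for illustration, the MIME type pattern is an assumption about how PDF attachments are labelled, and the pdftotext path is simply the one from my local.cf. It assumes a SpamAssassin release that provides the post_message_parse plugin call.

```perl
package PDFTextSketch;    # hypothetical name, for illustration only

use strict;
use warnings;
use File::Temp qw(tempfile);
use Mail::SpamAssassin::Plugin;
our @ISA = qw(Mail::SpamAssassin::Plugin);

sub new {
    my ($class, $mailsa) = @_;
    my $self = $class->SUPER::new($mailsa);
    bless $self, $class;
    return $self;
}

# Plugin call made after the message has been parsed into a MIME tree.
sub post_message_parse {
    my ($self, $opts) = @_;
    my $msg = $opts->{message};

    # Assumption: PDF attachments are labelled application/pdf.
    # The second argument asks for leaf parts only.
    foreach my $part ($msg->find_parts(qr{^application/pdf\b}i, 1)) {
        my $data = $part->decode();
        next unless defined $data && length $data;

        my $text = _pdf_to_text($data);
        next unless defined $text && length $text;

        # Body rules and URI parsing later pick up this rendered text.
        $part->set_rendered($text);
    }
    return 1;
}

# Hypothetical helper: write the PDF to a temp file and read back the
# output of "pdftotext <file> -" (the trailing "-" means stdout).
sub _pdf_to_text {
    my ($data) = @_;
    my ($fh, $path) = tempfile(SUFFIX => '.pdf', UNLINK => 1);
    binmode $fh;
    print {$fh} $data;
    close $fh;

    open(my $out, '-|', '/usr/bin/pdftotext', $path, '-') or return;
    local $/;    # slurp the whole output at once
    my $text = <$out>;
    close $out;
    return $text;
}

1;
```

Such a module would be loaded with a loadplugin line in a .pre or .cf file. Error handling is kept minimal on purpose.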
Thanks Theo. From this I now have:
http://support.ednet.ns.ca/SpamAssassin/PDFText2.pm
Sorry, I had not realized that I was developing against an outdated
version :(. That explains why I could not find the pieces that I was
told about ;).
Sample setup in local.cf :
pdftext_pdfinfo_cmd /usr/bin/pdfinfo
pdftext_pdftotext_cmd /usr/bin/pdftotext
pdftext_pdfimages_cmd /usr/bin/pdfimages
pdftext_gocr_cmd /usr/bin/gocr
body PLUGIN_PDFTEXT_TEST /Stock/i
describe PLUGIN_PDFTEXT_TEST Found word Stock
score PLUGIN_PDFTEXT_TEST 2.5
body PLUGIN_PDFTEXT2 /PDFText2-Title: stock_tmp\.pdf/i
describe PLUGIN_PDFTEXT2 Found the Title stock_tmp.pdf
score PLUGIN_PDFTEXT2 4.5
Current comments :
. It now prepends PDFText2- to the pdfinfo output pushed to render, so
that rules can match the PDF info fields exactly. Or was that the wrong
thing to do?
. Added gocr of the images. I see FuzzyOCR does fuzzy matching, which
this doesn't. As long as you don't set pdftext_gocr_cmd, the plugin skips
that part. Maybe there is a way this one can call that one?
. I'm not comfortable with how I create temporary dirs for pdfimages, so
that might cause trouble for folks.
. I cannot test it in our production environment, as that is still 3.1
and I don't want to try the SVN FuzzyOCR just yet :). So that means I am
only lightly testing in a development environment.
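
On the temporary directory point above, one option I may try is letting File::Temp manage the directory. The following is only a sketch, not what PDFText2.pm currently does: the helper name with_scratch_dir and the pdftext-XXXXXX template are invented for illustration. File::Temp->newdir creates a private directory and removes it, contents included, when the object goes out of scope.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical helper: run $work->($dir) inside a fresh private
# directory that is deleted again once the helper returns.
sub with_scratch_dir {
    my ($work) = @_;
    # TMPDIR => 1 places the directory under the system temp location.
    my $tmp = File::Temp->newdir('pdftext-XXXXXX', TMPDIR => 1);
    return $work->($tmp->dirname);
}

# Usage: a real plugin would run pdfimages here with "$dir/img" as its
# image root; this stand-in just creates an empty file.
my $seen = with_scratch_dir(sub {
    my ($dir) = @_;
    open(my $fh, '>', "$dir/img-000.pbm") or die "cannot write: $!";
    close $fh;
    return $dir;
});
print "scratch dir was $seen\n";
```

The point of tying cleanup to the object's lifetime rather than to program exit is that a long-running daemon such as spamd would otherwise accumulate directories.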
Is there any similar function to post_message_parse in the 3.1 series?
Thanks again everyone,
JES