Matus UHLAR - fantomas wrote:
> Hello,
>
> I have quite a large archive of phishing mail (bank and mail accounts), where many words and phrases repeat.
>
> I was thinking about processing them manually and creating rules, but that would be a lot of work. I remember that the SOUGHT ruleset used to contain phrases that appear repeatedly, so I'd like to try using those, if possible.
>
> So far I have found:
> - a description of how it works: https://taint.org/2007/03/05/134447a.html
> - scripts to search a corpus:
>   https://svn.apache.org/repos/asf/spamassassin/trunk/masses/rule-dev/seek-phrases-in-corpus
>
> which seem to use plugins (Dumptext.pm, GrepRenderedBody.pm) that I found at:
> https://svn.apache.org/repos/asf/spamassassin/branches/3.3/masses/plugins/
>
> Are these still working, or do they have newer versions?

I'm a little hazy on the deep internals, but all the parts are still in SVN trunk. I've been using this locally with a growing collection of configuration wrappers to generate a number of rule sets for different subgroups of spam.

I've just tried a test in a current trunk checkout and everything seems to work without issue. Some components may need a little more tweaking for local conditions.
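If it helps to see the moving parts, that quick test boils down to a trunk checkout plus a mass-check run with the Dumptext plugin loaded. This is only a rough sketch; the checkout path and corpus location are placeholders, and the mass-check options are just the ones from the example_backend run script (see the attached patch):

# Check out SpamAssassin trunk (base URL taken from the links above);
# the local path is only an example.
svn checkout https://svn.apache.org/repos/asf/spamassassin/trunk ~/dev/satrunk

# Run mass-check over a directory of spam with the Dumptext plugin loaded,
# capturing its output as the "spam" log that the seek-phrases scripts
# consume later.  Options copied from the example_backend run script.
cd ~/dev/satrunk/masses
./mass-check \
  --cf='loadplugin Dumptext plugins/Dumptext.pm' \
  --cf='loadplugin Mail::SpamAssassin::Plugin::Check' \
  -n -o -C=/dev/null \
  spam:dir:/path/to/phish/cur/ \
  > /tmp/all_w.s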

> Does anyone have hints on how to process a phish archive?
>
> I mean, I could apparently weed out any repeated non-phish phrases manually to avoid FPs, or check manually which mail they hit, so I wouldn't need to keep much ham mail.

The minimal setup is to modify masses/rule-dev/sought/example_backend/run for your local pathnames, change the rule fragment names however you like, and run that script. I've attached a patch showing my own changes from my quick test above.
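In other words, the backend mostly comes down to pointing a few variables at your own tree and mail, then running the script. A minimal sketch, with placeholder paths (the variable names are the ones used in the run script):

# Variables to adjust near the top of masses/rule-dev/sought/example_backend/run
# (names from the script itself; the paths you put there are site-specific):
#   dir        - where the example_backend scripts live
#   sasvndir   - your SpamAssassin SVN checkout
#   outsvndir  - where the generated 20_sought_fraud.cf should land
# plus the spam:/ham: targets in the two mass-check calls.

cd /path/to/satrunk/masses/rule-dev/sought/example_backend
./run
# The script ends with a --lint check of the generated rules file and exits
# non-zero if that fails, so a clean exit means the output is at least loadable.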

You *do* need a collection of ham, however; as-is it relies on that to weed out patterns you don't actually want firing, as well as sorting/grouping the patterns by hit-rate thresholds.
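For reference, the step that does that weeding is the seek-phrases-in-log call, which takes both logs plus the pattern-length and hit-rate thresholds. Roughly (values copied from the attached patch, paths are placeholders):

# Feed the ham and spam mass-check logs to seek-phrases-in-log; candidate
# patterns are kept or dropped based on --reqpatlength / --reqhitrate.
# The values below are simply the ones from my patch, not tuned advice.
$sasvndir/masses/rule-dev/seek-phrases-in-log \
  --ham  /path/to/workingdata/w.h \
  --spam /path/to/workingdata/all_w.s \
  --rules --ruleprefix __SEEK_FRAUD_ \
  --reqpatlength 40 --reqhitrate "2.0 0.3 1.0" \
  > /path/to/workingdata/out.cf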

You could probably still use one of the intermediate files to bootstrap what you might have done manually, but you risk including poor patterns (either ones that don't hit much, or ones that also hit ham).
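If you do go that route, something along these lines should be enough to pull the candidates out for manual review. This assumes the intermediate out.cf/clean.cf files contain ordinary rules carrying the __SEEK_FRAUD_ prefix set via --ruleprefix above, so adjust to whatever the files actually look like on your end:

# List the candidate phrase rules from the raw output for manual review
# (assumes the __SEEK_FRAUD_ prefix passed via --ruleprefix).
grep '__SEEK_FRAUD_' /path/to/workingdata/out.cf | less

# Compare how many candidates survive the kill_bad_patterns filter.
grep -c '__SEEK_FRAUD_' /path/to/workingdata/out.cf
grep -c '__SEEK_FRAUD_' /path/to/workingdata/clean.cf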

-kgd
Index: run
===================================================================
--- run	(revision 1904524)
+++ run	(working copy)
@@ -4,60 +4,63 @@
 # being generated, and your local configuration.  
 
 # where the scripts live.  This will need to be changed for each site
-dir=/home/jm/ftp/spamassassin/masses/rule-dev/sought/example_backend
+#dir=/home/jm/ftp/spamassassin/masses/rule-dev/sought/example_backend
+dir=/home/kdeugau/dev/satrunk/masses/rule-dev/sought/example_backend
+
+# added:  directory for working data, so we don't clutter the source tree
+workingdata=/home/kdeugau/dev/sarulegen
 
 # change: where your svn checkout lives
-sasvndir=/home/jm/ftp/spamassassin
+#sasvndir=/home/jm/ftp/spamassassin
+sasvndir=/home/kdeugau/dev/satrunk
 
 # change: where to check in the output file to SVN
-outsvndir=/home/jm/ftp/spamassassin
+#outsvndir=/home/jm/ftp/spamassassin
+outsvndir=$workingdata/zzsa
 
 # change: the targets scanned
 ( command cd $sasvndir/masses && nice ./mass-check \
   --cf='loadplugin Dumptext plugins/Dumptext.pm' \
   --cf='loadplugin Mail::SpamAssassin::Plugin::Check' \
   -n -o -C=/dev/null \
-  spam:mbox:$dir/cor-fraud/corpus_spam_fraud \
-  spam:mbox:$dir/cor-guenther_fraud/spam/*fraud*.mbox \
-  > $dir/all_w.s )
+  spam:dir:/home/kdeugau/Maildir/.spam/cur/ \
+  > $workingdata/all_w.s )
 
 # change: the targets scanned
 ( command cd $sasvndir/masses && nice ./mass-check \
   --cf='loadplugin Dumptext plugins/Dumptext.pm' \
   --cf='loadplugin Mail::SpamAssassin::Plugin::Check' \
   -n -o -C=/dev/null \
-  ham:detect:$dir/cor-ham/* \
-  ham:mbox:$dir/cor-fraud/corpus_ham_fraud \
-  ham:mbox:$dir/cor-guenther_fraud/ham/*fraud*.mbox \
-  > $dir/w.h )
+  ham:dir:/home/kdeugau/Maildir/cur/ \
+  > $workingdata/w.h )
 
 # add already-compiled old logs to the ham collection
-cd $dir
-cat logs.h/* w.h > all_w.h
+#cd $dir
+#cat logs.h/* w.h > all_w.h
 
 # change: rule prefix, required pattern length, required hitrate thresholds
 nice $sasvndir/masses/rule-dev/seek-phrases-in-log \
-  --ham $dir/all_w.h \
-  --spam $dir/all_w.s \
+  --ham $workingdata/w.h \
+  --spam $workingdata/all_w.s \
   --rules --ruleprefix __SEEK_FRAUD_ --reqpatlength 40 --reqhitrate "2.0 0.3 1.0" \
-  > out.cf
-$dir/kill_bad_patterns < out.cf > clean.cf
+  > $workingdata/out.cf
+$dir/kill_bad_patterns < $workingdata/out.cf > $workingdata/clean.cf
 date
 
 # change: rule prefix, target file location
-$dir/mk_meta_rule JM_SOUGHT_FRAUD_ $dir/clean.cf > \
-  $outsvndir/rulesrc/sandbox/jm/20_sought_fraud.cf \
+$dir/mk_meta_rule JM_SOUGHT_FRAUD_ $workingdata/clean.cf > \
+  $outsvndir/rulesrc/20_sought_fraud.cf \
   || exit 1
 
 $sasvndir/spamassassin \
-  -C $outsvndir/rulesrc/sandbox/jm/20_sought_fraud.cf --lint \
-  || exit 1
-
-svn commit -m "auto-generated test rules" \
-  $outsvndir/rulesrc/sandbox/jm/20_sought_fraud.cf \
+  -C $outsvndir/rulesrc/20_sought_fraud.cf --lint \
   || exit 1
 
-# change: hostname, path to run on DNS host
-ssh dnspublishinghost.example.com \
-        /path/to/jm/svn/trunk/masses/rule-dev/sought/mkzone_remote_svn/run_part2
+#svn commit -m "auto-generated test rules" \
+#  $outsvndir/rulesrc/sandbox/jm/20_sought_fraud.cf \
+#  || exit 1
+#
+## change: hostname, path to run on DNS host
+#ssh dnspublishinghost.example.com \
+#        /path/to/jm/svn/trunk/masses/rule-dev/sought/mkzone_remote_svn/run_part2
 
