Matus UHLAR - fantomas wrote:
Hello,
I have a fairly large archive of phish mail (bank and mail accounts),
in which many words and phrases repeat.
I was thinking about processing them manually and creating rules, but
that would be a lot of work.
I remember that the SOUGHT ruleset used to contain phrases that appear
repeatedly, so I'd like to try using these, if possible.
So far I have found:
- a description of how it works: https://taint.org/2007/03/05/134447a.html
- scripts to search a corpus:
https://svn.apache.org/repos/asf/spamassassin/trunk/masses/rule-dev/seek-phrases-in-corpus
which seem to use plugins (Dumptext.pm, GrepRenderedBody.pm) that I found
at:
https://svn.apache.org/repos/asf/spamassassin/branches/3.3/masses/plugins/
Are these still working, or are there newer versions?
I'm a little hazy on the deep internals, but all the parts are still in
SVN trunk. I've been using this locally with a growing collection of
configuration wrappers to generate a number of rule sets for different
subgroups of spam.
I've just tried a test in a current trunk checkout and everything seems
to work without issue. Some components may need a little more tweaking
for local conditions.
Does anyone have hints on how to process a phish archive?
I mean, I could apparently weed out any repeated non-phish phrases by
hand to avoid FPs, or manually check which mail they hit, so that I
wouldn't need to keep much ham mail.
The minimal setup is to modify
masses/rule-dev/sought/example_backend/run for your local pathnames, and
change the rule fragment names however you like, and run that script.
I've attached a patch showing my own changes for my quick test above.
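
Before running the edited script, it can help to confirm that the paths it depends on actually exist in the checkout. The following is only a sketch: `check_sought_paths` is a hypothetical helper, not something shipped in the SpamAssassin tree, and the list of files is simply taken from what the example `run` script in the patch below invokes.

```shell
#!/bin/sh
# Hypothetical pre-flight check; NOT part of the SpamAssassin tree.
# Usage: check_sought_paths SASVNDIR BACKEND_DIR
#   SASVNDIR    - your svn checkout (sasvndir in the run script)
#   BACKEND_DIR - the example_backend directory (dir in the run script)
# Prints one "missing: <path>" line per absent file and returns
# non-zero if anything is missing.
check_sought_paths() {
    sasvndir=$1
    dir=$2
    missing=0
    # These are the files the example run script calls or loads.
    for f in \
        "$sasvndir/masses/mass-check" \
        "$sasvndir/masses/plugins/Dumptext.pm" \
        "$sasvndir/masses/rule-dev/seek-phrases-in-log" \
        "$dir/kill_bad_patterns" \
        "$dir/mk_meta_rule"
    do
        if [ ! -e "$f" ]; then
            printf 'missing: %s\n' "$f"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_sought_paths "$sasvndir" "$dir"` with the same values you put in `run` should print nothing if the checkout is complete.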
You *do* need a collection of ham, however; as-is it relies on that to
weed out patterns you don't want to actually be firing on, as well as
for sorting/grouping the patterns by hit-rate thresholds.
You could probably still use one of the intermediate files to bootstrap
what you might have done manually, but you risk including poor patterns
(either those that don't hit much, or those that also hit ham).
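
If you do go the manual route, a crude way to screen a list of candidate phrases is to count how many spam and ham files each one appears in. This is only a sketch and not what seek-phrases-in-log does: `filter_phrases` is a hypothetical helper, and it uses a plain fixed-string grep over raw message files (one message per file), so it will miss text that is MIME-encoded or wrapped differently from the rendered body.

```shell
#!/bin/sh
# Hypothetical helper; NOT part of the SpamAssassin tree.
# Usage: filter_phrases PHRASE_FILE SPAM_DIR HAM_DIR MIN_SPAM_HITS
# Reads one candidate phrase per line from PHRASE_FILE and prints only
# those that occur in at least MIN_SPAM_HITS files under SPAM_DIR and
# in no file under HAM_DIR.
filter_phrases() {
    while IFS= read -r phrase; do
        [ -n "$phrase" ] || continue
        # grep exits 1 when nothing matches; ignore that so the count is 0.
        s=$(( $( { grep -rlF -- "$phrase" "$2" 2>/dev/null || true; } | wc -l) ))
        h=$(( $( { grep -rlF -- "$phrase" "$3" 2>/dev/null || true; } | wc -l) ))
        if [ "$s" -ge "$4" ] && [ "$h" -eq 0 ]; then
            printf '%s\n' "$phrase"
        fi
    done < "$1"
}
```

The smaller your ham sample, the less the "no ham hits" check means, so anything this keeps still deserves a look by eye.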
-kgd
Index: run
===================================================================
--- run (revision 1904524)
+++ run (working copy)
@@ -4,60 +4,63 @@
# being generated, and your local configuration.
# where the scripts live. This will need to be changed for each site
-dir=/home/jm/ftp/spamassassin/masses/rule-dev/sought/example_backend
+#dir=/home/jm/ftp/spamassassin/masses/rule-dev/sought/example_backend
+dir=/home/kdeugau/dev/satrunk/masses/rule-dev/sought/example_backend
+
+# added: directory for working data, so we don't clutter the source tree
+workingdata=/home/kdeugau/dev/sarulegen
# change: where your svn checkout lives
-sasvndir=/home/jm/ftp/spamassassin
+#sasvndir=/home/jm/ftp/spamassassin
+sasvndir=/home/kdeugau/dev/satrunk
# change: where to check in the output file to SVN
-outsvndir=/home/jm/ftp/spamassassin
+#outsvndir=/home/jm/ftp/spamassassin
+outsvndir=$workingdata/zzsa
# change: the targets scanned
( command cd $sasvndir/masses && nice ./mass-check \
--cf='loadplugin Dumptext plugins/Dumptext.pm' \
--cf='loadplugin Mail::SpamAssassin::Plugin::Check' \
-n -o -C=/dev/null \
- spam:mbox:$dir/cor-fraud/corpus_spam_fraud \
- spam:mbox:$dir/cor-guenther_fraud/spam/*fraud*.mbox \
- > $dir/all_w.s )
+ spam:dir:/home/kdeugau/Maildir/.spam/cur/ \
+ > $workingdata/all_w.s )
# change: the targets scanned
( command cd $sasvndir/masses && nice ./mass-check \
--cf='loadplugin Dumptext plugins/Dumptext.pm' \
--cf='loadplugin Mail::SpamAssassin::Plugin::Check' \
-n -o -C=/dev/null \
- ham:detect:$dir/cor-ham/* \
- ham:mbox:$dir/cor-fraud/corpus_ham_fraud \
- ham:mbox:$dir/cor-guenther_fraud/ham/*fraud*.mbox \
- > $dir/w.h )
+ ham:dir:/home/kdeugau/Maildir/cur/ \
+ > $workingdata/w.h )
# add already-compiled old logs to the ham collection
-cd $dir
-cat logs.h/* w.h > all_w.h
+#cd $dir
+#cat logs.h/* w.h > all_w.h
# change: rule prefix, required pattern length, required hitrate thresholds
nice $sasvndir/masses/rule-dev/seek-phrases-in-log \
- --ham $dir/all_w.h \
- --spam $dir/all_w.s \
+ --ham $workingdata/w.h \
+ --spam $workingdata/all_w.s \
--rules --ruleprefix __SEEK_FRAUD_ --reqpatlength 40 --reqhitrate "2.0 0.3 1.0" \
- > out.cf
-$dir/kill_bad_patterns < out.cf > clean.cf
+ > $workingdata/out.cf
+$dir/kill_bad_patterns < $workingdata/out.cf > $workingdata/clean.cf
date
# change: rule prefix, target file location
-$dir/mk_meta_rule JM_SOUGHT_FRAUD_ $dir/clean.cf > \
- $outsvndir/rulesrc/sandbox/jm/20_sought_fraud.cf \
+$dir/mk_meta_rule JM_SOUGHT_FRAUD_ $workingdata/clean.cf > \
+ $outsvndir/rulesrc/20_sought_fraud.cf \
|| exit 1
$sasvndir/spamassassin \
- -C $outsvndir/rulesrc/sandbox/jm/20_sought_fraud.cf --lint \
- || exit 1
-
-svn commit -m "auto-generated test rules" \
- $outsvndir/rulesrc/sandbox/jm/20_sought_fraud.cf \
+ -C $outsvndir/rulesrc/20_sought_fraud.cf --lint \
|| exit 1
-# change: hostname, path to run on DNS host
-ssh dnspublishinghost.example.com \
- /path/to/jm/svn/trunk/masses/rule-dev/sought/mkzone_remote_svn/run_part2
+#svn commit -m "auto-generated test rules" \
+# $outsvndir/rulesrc/sandbox/jm/20_sought_fraud.cf \
+# || exit 1
+#
+## change: hostname, path to run on DNS host
+#ssh dnspublishinghost.example.com \
+# /path/to/jm/svn/trunk/masses/rule-dev/sought/mkzone_remote_svn/run_part2