I'm putting up a demo/prototype of some new techniques I'm building
for datamining and analysis.
This tool scans two large corpora of 500 MB or more of email to identify
any substrings that occur frequently in one but infrequently in the
other. You can choose the limits for 'frequently' and 'infrequently'.
It then reports all such substrings.
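
To make the idea concrete, here is a minimal sketch in Python. It is not the prototype's algorithm, which has to cope with corpora of 500 MB or more; it simply counts fixed-length substrings naively in memory. The function names, the fixed length n, and the limits min_a and max_b are all assumptions made up for illustration.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every (overlapping) length-n substring of text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def distinctive_substrings(corpus_a, corpus_b, n=15, min_a=100, max_b=3):
    """Return (count_a, count_b, substring) for substrings that are
    frequent in corpus_a (>= min_a) but infrequent in corpus_b (<= max_b).

    Naive illustration only: both corpora are held in memory and only
    one substring length is examined.
    """
    counts_a = ngram_counts(corpus_a, n)
    counts_b = ngram_counts(corpus_b, n)
    hits = [
        (count_a, counts_b.get(sub, 0), sub)
        for sub, count_a in counts_a.items()
        if count_a >= min_a and counts_b.get(sub, 0) <= max_b
    ]
    # Most frequent in corpus_a first; ties broken alphabetically.
    hits.sort(key=lambda hit: (-hit[0], hit[2]))
    return hits
```

Something like distinctive_substrings(spam_text, ham_text, n=15, min_a=100, max_b=3) would then give rows in the same "count count substring" shape as the samples further down, where min_a and max_b play the role of the 'frequently' and 'infrequently' limits.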
To use, please see my webpage on this work at:
http://www.cs.rice.edu/~scrosby/datamining/
I'd suggest using this program as inspiration for new rules. If you have
a gob of email and you want to know what is unique about it, this can
help find some suggestions. I've used it to look at the difference
between caught spam and missed spam, and between ham and spam.
Some ideas for using it:
1. Run two full corpora through the program.
2. Run just the headers of two corpora through the program.
3. Run just a particular header, e.g. 'X-Mailer', through the program.
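
For ideas 2 and 3, the input has to be cut down before it goes into the program. One possible way, sketched here with Python's standard email module, is shown below. The helper names are hypothetical, and it assumes you already have each message as a raw string; how you split your mail folders into messages is up to you.

```python
from email import message_from_string


def headers_only(raw_message):
    """Return just the header block of one raw message, one header per entry."""
    msg = message_from_string(raw_message)
    return "".join(f"{name}: {value}\n" for name, value in msg.items())


def single_header(raw_message, name="X-Mailer"):
    """Return every occurrence of one header (default X-Mailer) as text."""
    msg = message_from_string(raw_message)
    # get_all() returns the second argument when the header is absent.
    return "".join(f"{name}: {value}\n" for value in msg.get_all(name, []))
```

Concatenating the results over every message in a corpus gives a headers-only (or single-header) corpus to feed in.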
I cannot use this prototype because it immediately finds the spoor of
SA all over the place, in the folder classification, SA headers, and
even the artificial Received line that SA puts when it encapsulates a
message. So for now, a clean corpus is absolutely critical, and I do
not have that and cannot build one. Also, this program is unaware of
email boundaries, so a particular HTML element will be counted as many
times as it occurs, not by the number of messages in which it occurs. It
may be easier to use with HTML removed. I hope to fix these problems
in the future.
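
To illustrate the email-boundary point: counting per message instead of per occurrence could look roughly like the sketch below. This is not how the prototype works today; message_frequency and the fixed substring length n are assumptions for illustration, and the input is assumed to be a list of already-separated message strings.

```python
from collections import Counter


def message_frequency(messages, n):
    """For each length-n substring, count the number of messages that
    contain it at least once (not the total number of occurrences)."""
    counts = Counter()
    for text in messages:
        # A set collapses repeats within one message to a single hit.
        seen = {text[i:i + n] for i in range(len(text) - n + 1)}
        counts.update(seen)
    return counts
```

With this, an HTML element repeated fifty times in one message counts once for that message.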
Samples of the output include:
(in headers only)
1110 3 800\nX-Priority:
1108 3 0800\nX-Priority
1107 3 +0800\nX-Priorit
1106 3 +0800\nX-Priori
Timezone might be a good bayes token ^^^
402 0 -Mailer: FoxMai
402 0 X-Mailer: FoxMa
402 0 Mailer: FoxMail
Ratware? ^^^
820 2 iority: 3\nX-Mai
X-Priority: 3 header?
2155 8 y=\"----------=_
2154 8 ry=\"----------=
I don't get much MIME except spam, so this is probably that.
194 0 m (unknown [61.
Part of a popular faked Received line? Dunno. ^^^
121 0 2919.6900 DM\nMI
Portion of a particular Outlook version line followed by a MIME header. ^^^
85 0 essage-Id: <000
75 0 X-Priority: 4\nX
X-Priority = number? ^^^
162 2 : 3\nX-Library:
163 3 lain\nX-Priority
163 3 plain\nX-Priorit
163 3 xt/plain\nX-Prio
227 0 [61.51.
227 0 n [61.51
2160 9 ------=_
143 0 0000\nMessage-Id
(in headers and body)
3660 0 \161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161
1983 0 $$$$$$$$$$$$$$$$$$$$
1571 0 face=\"\183\194\203\206_GB2312\">
Asian ^^^^
1824 0 http://love.elong.co
28 1 looking statements,
28 1 your prompt response
28 1 ve hundred thousand
29 1 : Foxmail 4.2 [cn]\nM
29 1 -looking statements,
29 1 of this transaction
^^^ The hits for these Nigerian spams were from a false negative I didn't
remove from my clean corpus. Note the myriad phrases that are repeated
in all 38 of these emails.
44 2 how to stop further
30 1 in this transaction.
52 2 in\nX-Priority: 3\nX-M
35 1 '; mso-bidi-font-siz
37 1 excellent opportunit
37 1 for you to participa
37 1 is an\nexcellent oppo
37 1 ntinuing with this e
37 1 formation will help
37 1 pportunity for you\nt
37 1 understand that I ca
37 1 r you to participate
37 1 we have developed a
39 1 formation on mortgag
39 1 1001.lunchboxx.net>\n
41 1 <TR>\n <TD>\n
41 1 coupons, discounts
42 1 000 \n siz
43 1 -Type: MULTIPART/alt
^^^^ Capitalized MULTIPART
If you find this useful, please send me a heads-up.
Scott
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk