Hello,
On 12-10-05 08:43 AM, Martin Gregorie wrote:
> On Thu, 2012-10-04 at 20:56 -0700, Cathryn Mataga wrote:
>> I'm getting a lot of SPAM with words written like this. These are pretty
>> horrible, and I don't like
>> getting them every day.
>>
>> A:N ;A %L"
>> P:O ~R %N ( P &lCT U #R&E /
>>
>> Is there a way to make a rule for strings of characters that would
>> ignoring non-alpha characters embedded
>> in the string?

Not in a rule, but you may want to code a plugin that would first strip
all non-word characters, then guess whether parts of the resulting
(possibly very long) string match SEX words, then guess whether the mail
is a sexy joke or a pure annoyance. Let's say that it would be difficult
and painful, for a non-obvious result.

> Try this:
>
> describe MG_TWOLETTER_OBFUSCATION Two letter obfuscation (X:X X :X))
> body MG_TWOLETTER_OBFUSCATION /[A-Z]\W[A-Z] \W[A-Z]\W[A-Z]/
> score MG_TWOLETTER_OBFUSCATION 5.0

It works, but it matches only the second line ("T U #R&E "). If the spam
Cathryn receives always consists of those two lines, this rule is
effective enough. But if you see some variations (different words and
obfuscation characters, mixed case, etc.), you may want to work on the
regex a little more. Based on Martin's work, here is an example:

body ME_SPAM /[:;`(){}~\#&"%\$][a-z][:;`(){}~\#&"%\$]/im

(The # is escaped because an unescaped # starts a comment in a
SpamAssassin configuration file; I escaped the $ as well, to be safe.)

Use tflags multiple, and if you count, say, 2 to 4 hits, flag the mail
as spam. You may also want to combine this in a meta rule with some
header checks to avoid false positives (is it HTML_ONLY, FREEMAIL_FROM,
!SPF_PASS, etc.?). Bayesian learning should also be pretty helpful if
you use it.

> It matches the data you posted and does not match anything else in my
> spam corpus, so its quite specific the that type of spam, on the
> contents of my mail stream anyway, but of course ymmv.
>
>
> Martin
>
>

Alex, from prypiat. Yes, I recycle.
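
P.S. Here is a rough sketch of how the counting and meta ideas above
could fit together in local.cf. Treat it as a starting point only: the
rule names __ME_OBFU_CHAR and ME_OBFU_MANY, the threshold of 3, the
maxhits cap and the score are placeholders of mine and have not been run
against a corpus. MIME_HTML_ONLY stands in for the HTML-only check
mentioned above; it, FREEMAIL_FROM and SPF_PASS are assumed to be
available from the stock rules and the FreeMail and SPF plugins.

```
# Sub-rule (the leading __ means it gets no score of its own):
# one letter sandwiched between two punctuation characters.
# The # and $ inside the character class are backslash-escaped.
body      __ME_OBFU_CHAR  /[:;`(){}~\#&"%\$][a-z][:;`(){}~\#&"%\$]/i

# Count every hit instead of stopping at the first one;
# the cap of 5 is an arbitrary assumption.
tflags    __ME_OBFU_CHAR  multiple maxhits=5

# Fire only on several hits combined with one suspicious-sender hint.
meta      ME_OBFU_MANY    (__ME_OBFU_CHAR >= 3) && (MIME_HTML_ONLY || FREEMAIL_FROM || !SPF_PASS)
describe  ME_OBFU_MANY    Several punctuation-wrapped letters plus a sender hint
score     ME_OBFU_MANY    3.0
```

Run spamassassin --lint after adding it, and try it on a few known-good
mails before trusting the score.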