Hello,

On 12-10-05 08:43 AM, Martin Gregorie wrote:
> On Thu, 2012-10-04 at 20:56 -0700, Cathryn Mataga wrote:
>> I'm getting a lot of SPAM with words written like this. These are pretty
>> horrible, and I don't like
>> getting them every day.
>>
>> A:N ;A %L"
>> P:O ~R %N ( P &lCT U #R&E /
>>
>> Is there a way to make a rule for strings of characters that would
>> ignore non-alpha characters embedded
>> in the string?

Not in a rule, but you could write a plugin that first strips all
non-word characters, then guesses whether some part of the resulting
(possibly very long) string matches SEX words, then guesses whether
the mail is a sexy joke or a pure annoyance.

Let's say that it would be difficult and painful, for a far from obvious result.

> Try this:
>
> describe MG_TWOLETTER_OBFUSCATION Two letter obfuscation (X:X X :X))
> body     MG_TWOLETTER_OBFUSCATION /[A-Z]\W[A-Z] \W[A-Z]\W[A-Z]/
> score    MG_TWOLETTER_OBFUSCATION 5.0

It works, but matches only the second line ("T U #R&E "). If the spam
Cathryn receives always contains those two lines, this rule is
effective enough.

But if you see some variations (different words and obfuscation, case
differences, etc.), you may want to work on the regex a little more.

Based on Martin's work, here is an example:

body    ME_SPAM    /[:;`(){}~\#&"%$][a-z][:;`(){}~\#&"%$]/im

Add "tflags ME_SPAM multiple" to the rule and, if it hits, say, 2 to 4
times, flag the mail as spam.
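
As a rough, untested sketch of that counting idea: the rule names
__ME_OBFU_CHAR and ME_OBFU_MANY, the threshold of 3 and the score are
all made up for illustration, and "maxhits" assumes a SpamAssassin
version that supports it (3.3 or later, as far as I know).

```
# Sub-rule: the "__" prefix means it is not scored on its own.
# '#' starts a comment in .cf files, hence "\#" inside the regex;
# "\$" is escaped only to keep Perl from reading it as a variable.
body     __ME_OBFU_CHAR  /[:;`(){}~\#&"%\$][a-z][:;`(){}~\#&"%\$]/i
tflags   __ME_OBFU_CHAR  multiple maxhits=5

# With "multiple", the sub-rule's value inside a meta is its hit count.
meta     ME_OBFU_MANY    (__ME_OBFU_CHAR >= 3)
describe ME_OBFU_MANY    Several single letters wrapped in punctuation
score    ME_OBFU_MANY    2.0
```

Run "spamassassin --lint" after adding it, and test it against your own
ham before trusting the threshold.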

You also may want to meta this with some header checks to avoid false
positives (is it HTML_ONLY, FREEMAIL_FROM, !SPF_PASS etc. ?).
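
Something along these lines, as a sketch only: ME_SPAM_LIKELY and its
score are invented here, and I am assuming that the stock rules
MIME_HTML_ONLY, FREEMAIL_FROM and SPF_PASS are enabled in your setup
(the last two need the FreeMail and SPF plugins loaded).

```
# Fires only when the obfuscation rule hits AND at least one
# "suspicious" header condition is also true.
meta     ME_SPAM_LIKELY  (ME_SPAM && (MIME_HTML_ONLY || FREEMAIL_FROM || !SPF_PASS))
describe ME_SPAM_LIKELY  Punctuation-obfuscated text plus suspicious headers
score    ME_SPAM_LIKELY  3.0
```

Keep in mind that ME_SPAM keeps its own score, so the two add up when
both hit; lower the score of ME_SPAM if you only want the combination
to count.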

Bayesian learning should also be pretty helpful if you use it.

> It matches the data you posted and does not match anything else in my
> spam corpus, so it's quite specific to that type of spam, on the
> contents of my mail stream anyway, but of course ymmv.
>
>
> Martin
>
>




Alex, from prypiat.
Yes, I recycle.

