Dear fellow SpamAssassin users,
I'm contacting you as a member of ULYSSIS. ULYSSIS is a student
non-profit organisation at the University of Leuven trying to make
computers and technology more approachable and available to students. As
part of this objective, we run a hosting service within our university's
network for student organisations, student unions and individuals at our
university.
We've battled with spam from time to time, since we seem to attract a
lot of spam in exotic languages, which is rather good at circumventing
commonly used methods. This has led us to resort to some custom rulesets,
mostly to battle targeted French and SEO spam, which often comes from
very respectable servers and very normal addresses.
Now because SEO spam specifically has been adapting quite well to any
rule we think of (finding alternative ways of saying the same thing time
and time again), I was hoping to write a rule that basically boiled down
to "give some spam score to emails that contain the word SEO 3 or more
times" to push those already being detected by other rules over the
edge. To be clear, this will be a low-score rule; I'm aware that ham can
perfectly well contain that word 3 times, just like this email for
example. While investigating, I started wondering how to handle the fact
that some spam will just have a plain text body, while other spam will
also feature HTML, which means that the count may suddenly double or
halve.
Beyond that it seems quite hacky to use a regex that boils down to
something like /\bSEO\b.*\bSEO\b.*\bSEO\b/i instead of something that is
properly aware of the count of certain words.
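To make the contrast concrete, here is a rough sketch in Python rather
than in SpamAssassin rule syntax. The helper names (chained_match,
count_seo, mentions_seo_often) are made up for illustration, and the
newline behaviour noted in the comments is Python's, which may well
differ from how SpamAssassin hands text to body rules.

```python
import re

# Hypothetical names, for illustration only; this is not SpamAssassin syntax.
SEO_WORD = re.compile(r"\bSEO\b", re.IGNORECASE)
CHAINED = re.compile(r"\bSEO\b.*\bSEO\b.*\bSEO\b", re.IGNORECASE)


def chained_match(text: str) -> bool:
    """The 'hacky' approach: a single regex that needs three hits."""
    # In Python, '.' does not match a newline unless re.DOTALL is set,
    # so hits spread over several lines are not found by this pattern.
    return CHAINED.search(text) is not None


def count_seo(text: str) -> int:
    """Count whole-word, case-insensitive occurrences of 'SEO'."""
    return len(SEO_WORD.findall(text))


def mentions_seo_often(text: str, threshold: int = 3) -> bool:
    """The count-aware approach: compare the hit count to a threshold."""
    return count_seo(text) >= threshold
```

With the counting version the threshold is a single number that is easy
to tune, which is what I would hope a proper rule could express.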
Since I sort of expected SpamAssassin to have a solution for both the
text/text+html and the counting problems, I asked around on IRC but was
pointed here. So uhm, any suggestions or pointers are more than welcome.
Not too sure if any more information is required, but feel free to ask
questions or correct my assumptions if necessary.
Kind regards,
Bert Van de Poel
ULYSSIS
University of Leuven