Re: Help with a regex to catch spam with gibberish html tags

Kevin A. McGrail Thu, 30 Jan 2014 09:29:54 -0800

On 1/30/2014 11:23 AM, Andy Jezierski wrote:

Amir Caspi <ceph...@3phase.com> wrote on 01/29/2014 11:08:18 AM:


> From: Amir Caspi <ceph...@3phase.com>
> To: Andy Jezierski <ajezier...@stepan.com>,
> Cc: "users@spamassassin.apache.org" <users@spamassassin.apache.org>
> Date: 01/29/2014 11:08 AM
> Subject: Re: Help with a regex to catch spam with gibberish html tags
>

> On Jan 29, 2014, at 9:53 AM, "Andy Jezierski" <ajezier...@stepan.com> wrote:


> I've been noticing a lot of spam getting through with the same
> traits, a bunch of random words within brackets.  They all seem to
> come after the </body> or the </html> tag.  Anyone much more
> knowledgeable than me care to assist with a rule to detect them?
>
> What about something like:
>
> rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
>
> This will hit on 10 or more consecutive tags separated by nothing
> but white space. Only single-word tags will hit, so this should
> minimize FPs from heavy formatting such as nested divs.
>
> Completely untested, use at your own risk (but post back and tell us
> how well it worked).
>
> --- Amir
> thumbed via iPhone
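
For anyone who wants to poke at the pattern outside SpamAssassin first, here is a minimal sketch in Python. It assumes the regex behaves the same under Python's re module as under Perl, which should hold since it uses only syntax the two share. The sample strings and the names NONSENSE_TAGS and hits are made up for illustration and are not taken from any real message or from SpamAssassin itself.

```python
import re

# Same pattern as the rawbody rule quoted above. The constant name is
# hypothetical and exists only for this sketch.
NONSENSE_TAGS = re.compile(r"(?:<[A-Za-z0-9]{4,}>\s*){10,}")

# Hypothetical sample: ten made-up single-word "tags" trailing </html>.
words = ["lorem", "ipsum", "dolor", "amet", "magna",
         "tempor", "labore", "aliqua", "minim", "veniam"]
spam_tail = "</html>\n" + "\n".join(f"<{w}>" for w in words)

# Hypothetical sample: heavy but legitimate formatting. The attribute
# keeps each <div ...> from counting as a single-word tag.
nested_divs = '<div class="a">\n' * 12


def hits(text):
    """Return True if the pattern matches anywhere in text."""
    return NONSENSE_TAGS.search(text) is not None


print(hits(spam_tail))    # ten consecutive single-word tags
print(hits(nested_divs))  # tags with attributes only
```

On these samples the first call should report a hit and the second should not. Note that a bare opening tag of four or more characters, such as <span>, also fits the pattern, so ten of those in a row with nothing but white space between them would hit as well.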

That rule seems to be working fine. Has hit on every one of those pesky messages so far with no FPs. Will let it run for a while longer before I bump up the score.

If you want to share the complete rule, I can throw it into my sandbox and see what masscheck thinks as well.


regards,
KAM
