Amir Caspi <ceph...@3phase.com> wrote on 01/29/2014 11:08:18 AM:

> From: Amir Caspi <ceph...@3phase.com>
> To: Andy Jezierski <ajezier...@stepan.com>,
> Cc: "users@spamassassin.apache.org" <users@spamassassin.apache.org>
> Date: 01/29/2014 11:08 AM
> Subject: Re: Help with a regex to catch spam with gibberish html tags
>
> On Jan 29, 2014, at 9:53 AM, "Andy Jezierski" <ajezier...@stepan.com> wrote:
> > I've been noticing a lot of spam getting through with the same
> > traits, a bunch of random words within brackets. They all seem to
> > come after the </body> or the </html> tag. Anyone much more
> > knowledgeable than me care to assist with a rule to detect them?
>
> What about something like:
>
> rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
>
> This will hit on 10 or more consecutive tags separated by nothing
> but white space. Only single-word tags will hit, so this should
> minimize FPs from heavy formatting such as nested divs.
>
> Completely untested, use at your own risk (but post back and tell us
> how well it worked).
>
> --- Amir
> thumbed via iPhone

That rule seems to be working fine. It has hit on every one of those pesky messages so far with no FPs. I'll let it run for a while longer before I bump up the score.

Thanks

Andy
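
For anyone who wants to sanity-check the pattern offline before loading it into SpamAssassin, here is a minimal sketch in Python. It assumes Python's re module handles this pattern the same way Perl does, which should hold since it uses only syntax common to both. The helper name and the sample strings are made up for illustration and are not taken from real spam.

```python
import re

# Same pattern as the rawbody rule quoted above, minus the regex delimiters.
NONSENSE_TAGS = re.compile(r"(?:<[A-Za-z0-9]{4,}>\s*){10,}")


def hits(raw_body: str) -> bool:
    """Return True if raw_body contains 10+ consecutive bare single-word tags."""
    return NONSENSE_TAGS.search(raw_body) is not None


# Hypothetical samples, invented for illustration only.
words = ["lamp", "river", "quartz", "falcon", "meadow",
         "cobalt", "harbor", "timber", "violet", "anchor"]

# Ten gibberish tags trailing the closing </html>, one per line.
spam_tail = "</body></html>\n" + "\n".join(f"<{w}>" for w in words)

# Nested divs carry attributes, and <p> is shorter than four characters.
nested_divs = ('<div class="a"><div class="b"><div id="c">'
               "<p>Hello</p></div></div></div>")

# Nine tags: one short of the {10,} threshold.
nine_tags = " ".join(f"<{w}>" for w in words[:9])

if __name__ == "__main__":
    for name, sample in [("spam_tail", spam_tail),
                         ("nested_divs", nested_divs),
                         ("nine_tags", nine_tags)]:
        print(name, hits(sample))
```

This only exercises the regex itself. It says nothing about how SpamAssassin decodes or splits the raw body before the rule runs, so results on live mail may differ.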