Ahh, heck.  Here's a better one for all of the geneticists
on the list (one of them? :-):

/\b([ACGT]{1,}\s*[CGT]\s*[ACGT]{1,}\s*){3,}\b/

The addition of the word boundary test also avoids all of the
false matches from my corpus.  The rule requires that the sequence
be at least 9 bp long and contain at least 3 non-A bases.
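
For anyone who wants to poke at the difference before touching a
config file, here is a rough sketch in Python rather than in
SpamAssassin itself.  It assumes Python's re module treats these
particular constructs (\b, \s, {n,}) the same way Perl does, which
I believe holds for this pattern.  The sample lines are made up for
illustration and are not from my corpus.

```python
import re

# Earlier rule from the quoted message: no word-boundary anchors.
OLD_RULE = re.compile(r"([ACGT]{3,}[CGT][ACGT]?\s*){3,}")
# Revised rule: same idea, anchored on word boundaries at both ends.
NEW_RULE = re.compile(r"\b([ACGT]{1,}\s*[CGT]\s*[ACGT]{1,}\s*){3,}\b")


def hits(rule, text):
    """Return True if the rule matches anywhere in the text."""
    return rule.search(text) is not None


# Hypothetical sample lines, invented for this sketch.
SAMPLES = [
    "ACGT ACGT ACGT",    # spaced sequence data
    "ACGACGACG",         # exactly 9 bases, 3 of them non-A in mid-group
    "ACGACGAC",          # only 8 bases
    "AAAAAAAAAAAA",      # all A
    "xxACGTACGTACGTxx",  # ACGT run buried inside a longer token
]

if __name__ == "__main__":
    for line in SAMPLES:
        old, new = hits(OLD_RULE, line), hits(NEW_RULE, line)
        print(f"{line!r:20} old={old} new={new}")
```

On these samples the old rule fires on the run buried inside the
longer token and the new one does not, which is the kind of false
match the word boundary test is meant to stop.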

   -Dave

David G. Andersen just mooed:
> One thing to try, for your particular situation.  
> This rule could match in some strange base-64
> encoded files, but it's extremely unlikely -- I ran it through
> my spam corpus, and it hit 7 lines out of 260 megabytes, so
> you should be OK:
> 
> body     GENETICS_DATA                 /([ACGT]{3,}[CGT][ACGT]?\s*){3,}/
> describe GENETICS_DATA                 A, C, T, G, who do we appreciate?
> score    GENETICS_DATA -5
> 
> The rule, unfortunately, will match a long line of C, G, or T --
> but will not match all As.  It should be possible to craft it a
> bit better, but to do so, I believe, would make the regexp
> really slow.
> 
> I wouldn't recommend this rule for general consumption, obviously, but
> if you're in the habit of getting genetics data...
> 
>   -Dave
> 
> Geoff Gibbs just mooed:
> > David G. Andersen wrote:
> > 
> > > > > anyone else seeing false-positives more often with 2.11?
> > > > 
> > > > Yes, I have had to roll back to 2.01.
> > > 
> > > A bit of a suggestion, since you're seeing false positives in a highly
> > > specific domain.  I've been creating word-frequency-based whitelists
> > > from various mailing lists I'm on (alas, little genetics talk).
> > > > But I've had great success matching networking-geek-specific
> > > > terms, and would think the same approach would prove quite fruitful
> > > > for genetics-specific terms.  Spammers, happily, don't often say
> > > > adenosine. :-)
> > 
> > > That is an interesting suggestion, although most of the false positives
> > > were not related to genetics-specific terms. Solid blocks of ACGT do
> > > trigger the "whole line of shouting" test, but an empty Subject should
> > > not trigger "Subject is all in capitals". An e-mail with a base-64
> > > attachment should not count as spam with no other trigger.
> > > I also had one e-mail that triggered both the "ascii form" and "whole
> > > line of shouting" tests. I cannot see a whole line of shouting in it,
> > > and I have not yet had time to work out what triggered the form test;
> > > it is not obvious to the beginner (me).
> 
> -- 
> work: [EMAIL PROTECTED]                          me:  [EMAIL PROTECTED]
>       MIT Laboratory for Computer Science           http://www.angio.net/
> 
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk

-- 
work: [EMAIL PROTECTED]                          me:  [EMAIL PROTECTED]
      MIT Laboratory for Computer Science           http://www.angio.net/

