Excellent, I'll slap this in as an eval replacement for PORN_3 right now.  I
knew there was a reason I put up with this pain-in-the-ass mailing list :)  For
now I'll just do the hard-coded wordlist; someone can file a Bugzilla ticket if
they want the words to be in a config file instead of the .pm.

C

Daniel Pittman wrote:

DP> On Wed, 1 May 2002, Craig R. Hughes wrote:
DP> > Daniel Pittman wrote:
DP> >
DP> > DP> Break the rule up into individual tests for the different email
DP> > DP> packages and let it run. Aside from the better scoring for what is
DP> > DP> and isn't a real mail package, this will probably run faster in
DP> > DP> many cases as a simple string match, not a regexp, is used.
DP> >
DP> > Probably right.
DP>
DP> Given that I know that, at least in part, it's hits on Communigate from
DP> my corpus that drag that rule away from trapping SPAM, that seems a good
DP> idea to me. :)
DP>
DP> > All the same logic applies to PORN_3 too, and I want to break that one
DP> > up, but of course PORN_3 is trickier because of the triple-repeat
DP> > part. PORN_3 is far and away the worst performing rule in the book.
DP> > Fully 10% of the execution time per message is being consumed by
DP> > testing PORN_3.
DP>
DP> Yup. That rule looks ... inefficient. Using an eval and a series of word
DP> tests should be better. Something akin to:
DP>
DP> my @porn_words = ("lolita", "cum", "org[iy]", "wild", "fuck", "teen",
DP> "action", "spunk", "pussy", "pussies", "suck", "sucking", "hot",
DP> "hottest", "voyeur", "le[sz]b(?:ian|o)", "anal", "interracial", "asian",
DP> "amateur", "sex+", "slut", "explicit", "xxx", "live",
DP> "celebrity", "lick", "dorm", "webcam", "ass", "schoolgirl",
DP> "strip", "horny", "horniest", "erotic", "oral", "penis", "hardcore",
DP> "blow[ -]*job", "nast(?:y|iest)", "porn");
DP>
DP> sub porn_word_test {
DP>     my ($self, $fulltext) = @_;
DP>     my $hits = 0;
DP>     foreach my $word (@porn_words) {
DP>         $hits++ if $$fulltext =~ /\b$word\b/i;
DP>         return 1 if $hits == 3;
DP>     }
DP>     return 0;
DP> }
DP>
DP> If you got clever you could even have the set of words configurable
DP> somewhere in the test files; something like:
DP>
DP> my %word_set_tests = ( 'PORN_WORDS' => [ ... ], ... );
DP>
DP> WORDSET PORN_WORDS foo, bar, baz
DP> SCORE PORN_WORDS 3.0
DP>
DP>         Daniel
DP>
DP>
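
A minimal sketch of the word-test idea quoted above, with each pattern
compiled once via qr// so the per-message cost is only the matching. The
short word list, the sub name word_set_test, and the threshold argument are
assumptions made for illustration; none of them come from the SpamAssassin
code itself.

```perl
use strict;
use warnings;

# Hypothetical short word list for illustration; not the real PORN_3 list.
my @words = ('hardcore', 'webcam', 'le[sz]b(?:ian|o)', 'explicit', 'xxx');

# Compile each pattern once instead of on every message.
my @word_res = map { qr/\b(?:$_)\b/i } @words;

# Returns 1 as soon as $threshold distinct patterns match the text, else 0.
# $fulltext is a reference to the message text, as in the quoted sub.
sub word_set_test {
    my ($fulltext, $threshold) = @_;
    my $hits = 0;
    foreach my $re (@word_res) {
        next unless $$fulltext =~ $re;
        return 1 if ++$hits >= $threshold;
    }
    return 0;
}

my $msg = "Explicit hardcore webcam action";
print word_set_test(\$msg, 3), "\n";
```

Each pattern counts at most once, so a single word repeated many times in a
message cannot reach the threshold on its own.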

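The configurable variant suggested above could be read in roughly like this.
It is only a sketch: the WORDSET line format is the one proposed in the quoted
message, and parse_word_sets is a hypothetical helper, not an existing
SpamAssassin function.

```perl
use strict;
use warnings;

# Hypothetical parser for the proposed "WORDSET NAME a, b, c" lines.
# Returns a hash ref mapping each set name to an array ref of words.
sub parse_word_sets {
    my (@lines) = @_;
    my %word_sets;
    foreach my $line (@lines) {
        # Skip anything that is not a WORDSET line (e.g. SCORE lines).
        next unless $line =~ /^\s*WORDSET\s+(\w+)\s+(.+?)\s*$/;
        my ($name, $list) = ($1, $2);
        # Repeated WORDSET lines for the same name append to the set.
        push @{ $word_sets{$name} }, split /\s*,\s*/, $list;
    }
    return \%word_sets;
}

my $sets = parse_word_sets(
    'WORDSET PORN_WORDS foo, bar, baz',
    'SCORE PORN_WORDS 3.0',
);
print join('|', @{ $sets->{PORN_WORDS} }), "\n";
```

Keeping the sets in a hash of array refs matches the %word_set_tests shape
sketched in the quoted message, so one generic test sub could serve any
number of named word lists.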

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
