Re[6]: [SAtalk] [RD] Rule Philosophy

Robert Menschel Fri, 08 Aug 2003 14:38:47 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Yorkshire,

Thursday, August 7, 2003, 9:41:16 AM, you wrote:

YD> Hi, Excuse me if I contradict myself or say anything silly here, I'm
YD> thinking out loud as much as anything :)

I think that's what we're all doing. And I'm not going to apologize for
any of it.  Just let me know if I say anything silly, so I can laugh
along with everyone else.  Thanks.

YD>>> How do you verify / trust a submission? How do you know that a rule
YD>>> is right? 

RM>> Initially, you don't. I'm looking at this much like the Wiki at
RM>> http://www.exit0.us/ -- Anyone can contribute to that web site,
RM>> whether they know what they're doing or not. The only policing is
RM>> done by people who decide they can improve on what they see.   

YD> I wasn't thinking of it that way at all, I was thinking of it more
YD> like a blacklist based around URLs found in spam, where anyone can
YD> forward spam for processing. I guess they're both good ideas though :)

Agreed.

YD> I'm going to try and get some of my ideas down on paper, I think I
YD> have enough separate pieces of an idea now to form a whole idea.

Good.

YD>>> You can't verify them all by hand, spammers can register domains
YD>>> between them faster than a person can make rules. 

RM>> Quick answer: I'm not interested in verification. That's why I
RM>> suggested the scores be real low on initial submission (0.1). Rules
RM>> like that will match and appear in the SA headers, but unless a lot
RM>> of them are hit, these rules won't significantly change the spam/ham
RM>> rating of an email.    

RM>> We probably need someone to do a quick eyeball check ...

RM>> When the rules are distributed, the rules themselves should be in a
RM>> separate file from the scores. ...

RM>> Now, if we add a body of volunteers into this project, who are
RM>> willing and able to evaluate the rules both for validity and against
RM>> a reasonable corpus, then we could also develop standardized scores
RM>> for those rules, still low, but higher than 0.1   

YD> You're causing ideas here :)

YD> We could make more than one score file available for the same set of
YD> rules, let people have the rules at any score they choose.

I like that!  (a) One set of files for the rules: a file for each class
of rule (From, Subject, Body, URI), plus some files for large groups of
rules (scam, pirate, porn, etc). Then (b) one score file for each of a
variety of different audiences and philosophies (an ISP score set, Int'l
Business score set, family score set, conservative score set, aggressive
score set, etc).
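
A minimal sketch of (b), assuming each score set is just a mapping from
rule name to score. The set names and the values below are invented for
illustration and are not tuned against any corpus:

```python
# Hypothetical score sets for one shared rule set; values are illustrative only.
SCORE_SETS = {
    "conservative": {"L_s_CorelWPOffice": 0.1, "RM_bp_DontWait": 0.1},
    "aggressive": {"L_s_CorelWPOffice": 0.5, "RM_bp_DontWait": 0.3},
}


def render_score_file(audience):
    """Return the text of a score file for one audience, one rule per line."""
    scores = SCORE_SETS[audience]
    lines = [f"score {name} {value}" for name, value in sorted(scores.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The rules files stay the same; only the score file differs per audience.
    for audience in SCORE_SETS:
        print(f"# {audience}")
        print(render_score_file(audience))
```

The rules themselves would ship once, and a site would pick whichever
score file matches its philosophy.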

YD> Or even generate rule lists on a per-user basis from a database, let
YD> users create their own rule and score file by selecting rules or
YD> groups of rules from the database, start the rules off with a low but
YD> significant score, let them re-score individual rules to suit
YD> themselves.  Give users a web interface and let them switch off the
YD> rules they don't want (I'm thinking built-in user feedback, if enough
YD> people down-score or switch off a rule then it needs re-examining)

Do I understand you right? Would this be the type of thing you're
thinking of?

1) A database which stores each rule's name, and a brief summary of or
   intro to it, identified by a unique rule number.

2) Tables in that database which identify each registered user, with
   password, user type (audience as above), and scores for each
   identified rule. 

3) Web form(s) which display the brief rule info, current score (default
   if none yet specified), and allow the user to change their scores,
   those changes to be recorded in the database.

4) A Transmit button, which builds the rules files (excluding those rules
   the user has turned off), builds the scoring file (from the user's
   database scoring information), and packages them up as a *.gz or *.zip
   file for email or download or ftp or wget or other transmission?
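
The four steps above could be sketched roughly as follows. This is only
an outline under assumptions: the table and column names, the file names
rules.cf and scores.cf, and the use of SQLite are all hypothetical, and
the password and user-type columns from step 2 are left out for brevity:

```python
import io
import sqlite3
import zipfile


def make_db():
    """Create an in-memory database with hypothetical rule and score tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE rules (
            id INTEGER PRIMARY KEY,   -- unique rule number
            name TEXT, summary TEXT, definition TEXT, default_score REAL);
        CREATE TABLE user_scores (
            user TEXT, rule_id INTEGER, score REAL,
            enabled INTEGER DEFAULT 1,
            PRIMARY KEY (user, rule_id));
    """)
    return db


def build_package(db, user):
    """Build rules and score files for one user and return them as zip bytes."""
    rows = db.execute("""
        SELECT r.name, r.definition, COALESCE(u.score, r.default_score)
        FROM rules r
        LEFT JOIN user_scores u ON u.rule_id = r.id AND u.user = ?
        WHERE COALESCE(u.enabled, 1) = 1   -- skip rules the user turned off
        ORDER BY r.id
    """, (user,)).fetchall()
    rules_text = "".join(f"{definition}\n" for _, definition, _ in rows)
    scores_text = "".join(f"score {name} {score}\n" for name, _, score in rows)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("rules.cf", rules_text)
        zf.writestr("scores.cf", scores_text)
    return buf.getvalue()
```

A user with no stored scores simply gets every rule at its default
score; the web form in step 3 would only need to write rows into the
(hypothetical) user_scores table.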

RM>>>>>> header    L_s_CorelWPOffice  Subject =~
RM>>>>>>                          /(?:Corel|WordPerfect).{1,15}Office/i

RM>>>> As for an ISP, I would think it's still a valid rule; ...

YD>>> No way is that a valid rule for an ISP to use. A good rule looks
YD>>> for something which only appears in spam, WPOffice probably appears
YD>>> in as much ham as spam.  

RM>> Define "good rule".  Have you looked at FROM_ENDS_IN_NUMS lately? It
RM>> matches 523 ham in my corpus. FROM_NO_LOWER matches more ham than
RM>> spam in my corpus. Many of the HTML color rules also match a goodly
RM>> amount of ham. Based on the value of these rules, I call them good
RM>> rules despite the ham they match.    

YD> FROM_NO_LOWER scores 0 here. I'm still getting mileage out of
YD> FROM_ENDS_IN_NUMS but it's not what I'd call a good rule either.

I lowered FROM_NO_LOWER from 1.6 to 1.0 on June 23; it's doing well for
me at that score. FROM_ENDS_IN_NUMS was also reduced from 0.6 to 0.45.

YD> The only reason rules like that still exist in sa is because not
YD> enough people join in the mass checks, or the technically adept
YD> people who do join in the mass checks are more likely to exchange
YD> email with other technically adept people, who are more likely to
YD> have their own domain name and be [EMAIL PROTECTED] and are therefore less
YD> likely to be [EMAIL PROTECTED]

I disagree. Well, I agree that it'd be better if more people joined in
the mass checks. But I disagree that's why these rules stick around. They
stick around because they remain suggestive of spam (though not
definitive, which is why they get low scores).

RM>> To me, a "good rule" is a rule whose score is appropriate for the
RM>> ham and/or spam matched. That means that a rule which does match
RM>> only spam, and that can only match spam, scores high, a rule which
RM>> matches only ham scores negative, and a rule which is suggestive of
RM>> spam or ham provides a suggestive score which provides judgment only
RM>> when combined with several/many other rules.

YD> The better the rule, the less ham it matches, and vice versa. Taking
YD> a common word or phrase and capturing with it something which makes
YD> it spammy is what makes a good rule.

Fortunately there's enough room for us to agree to disagree here.  While
I'll certainly take advantage of some specific rules and score them high
because they match ONLY spam, the great majority of my rules are
suggestive rather than definitive.

Chris suggests on his page at
http://www.merchantsoverseas.com/wwwroot/gorilla/sa_rules.htm
C> ) Scoring is an art! But a general idea is to write many small rules
C> and score them low. Let 10 rules add up to 2 points rather than 1 rule
C> with all 10 things in it scoring 2 points. This will help minimize
C> false positives.
and I generally agree with that philosophy. I had a spam this morning
that hit 8.7 of 9.0 using default rules (including Bayes-80). It also
matched one 0.9 rule of mine, which pushed it over the threshold, and so
I'd never have seen it if I hadn't gone looking for this type of example.

And that's the way I like SA to work -- With my 9.0 threshold, a 9.001
score matching a dozen rules is as good as a 9.1 score matching but a
single rule. 
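
The arithmetic behind that can be sketched as below. This is a
simplification, assuming a message's score is just the sum of the scores
of the rules it hits and that reaching the threshold counts as spam; the
rule names are placeholders, with the 8.7 of default-rule hits lumped
into a single entry:

```python
def total_score(hits):
    """Sum the scores of all rules a message matched."""
    return sum(hits.values())


def is_spam(hits, threshold=9.0):
    """A message is tagged once its summed score reaches the threshold."""
    return total_score(hits) >= threshold


if __name__ == "__main__":
    # Placeholder names: the default-rule hits are aggregated into one entry.
    default_only = {"DEFAULT_RULES_TOTAL": 8.7}
    with_local = {"DEFAULT_RULES_TOTAL": 8.7, "MY_LOCAL_RULE": 0.9}
    print(is_spam(default_only))  # just under the 9.0 threshold
    print(is_spam(with_local))    # the 0.9 rule pushes it over

    # Many small rules can add up the same way one larger rule would.
    ten_small = {f"SMALL_{i}": 0.2 for i in range(10)}
    print(round(total_score(ten_small), 3))
```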

YD> I'm trying to produce only ISP grade rules I guess. ...

So you have a lot fewer rules in your rule set than I do in mine, I
guess.  :-)

RM>> Therefore, to me, a correctly scored WP Office rule would be a good
RM>> rule, while a badly scored WP Office rule would be a bad rule. It's
RM>> not the matching that determines good/bad, but the appropriateness
RM>> of the score.   

YD> A correct score for such a loose rule would vary so widely that it
YD> would become useless. Given enough time I can probably find someone
YD> who would score that at about -5.

It takes that long to find someone working at Corel? They *are* in
trouble then!!

YD> You're writing a rule for the brand name of a mainstream product
YD> which probably has millions of users worldwide, doesn't that ring
YD> alarm bells somewhere? :)

Nope. No more than does
> body      RM_bp_DontWait  /Don't wait/i
> describe  RM_bp_DontWait  Body says don't wait
> score     RM_bp_DontWait  0.1  # 211 Spam, 21 Ham, Aug 7, 2003
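
Counts like the "211 Spam, 21 Ham" in that score comment come from
running a rule against a corpus. A rough sketch of the idea, assuming
the messages are already available as plain strings (the tiny corpus in
the usage part is invented, and the Subject rule is applied to whole
strings here rather than to a parsed header):

```python
import re

# The two patterns quoted in this thread, translated to Python regexes.
RULES = {
    "RM_bp_DontWait": re.compile(r"Don't wait", re.IGNORECASE),
    "L_s_CorelWPOffice": re.compile(
        r"(?:Corel|WordPerfect).{1,15}Office", re.IGNORECASE
    ),
}


def count_hits(pattern, spam, ham):
    """Return (spam_hits, ham_hits) for one compiled pattern."""
    spam_hits = sum(1 for msg in spam if pattern.search(msg))
    ham_hits = sum(1 for msg in ham if pattern.search(msg))
    return spam_hits, ham_hits


if __name__ == "__main__":
    # Invented sample messages, for illustration only.
    spam = ["DON'T WAIT, order now", "Corel WordPerfect Office 11 for $20"]
    ham = ["Please don't wait up for me", "Lunch on Friday?"]
    for name, pattern in RULES.items():
        print(name, count_hits(pattern, spam, ham))
```

A rule that hits both columns, as "Don't wait" does here, is exactly the
kind that stays at a low, suggestive score.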

RM>>>>>> header  L_hr_lattelekom  Received =~ /lattelekom\.net/

RM>> ... However, I don't know how sensitive the blacklists are, nor how
RM>> long they take to update. 

YD> I wasn't sure either, so I emailed DSBL and asked. The answer I got
YD> back was more complex than I expected :) ... 

YD> So, a worst+worst case is 26 minutes with the normal case being below
YD> the 16 minutes. 

YD> I'd say that's fast enough :)

However, I can't run the software required to submit to DSBL.

RM>> Nor do I know how to submit updates. As an end user on a virtual
RM>> hosting server, I believe I don't have the access to submit updates
RM>> from the server, and I don't know how to do so from my home Windows
RM>> machine. Perhaps that information could be added to one of our SA
RM>> resources?    

YD> See the website for the blocklist you want to submit to. ...

I checked DSBL, and while as you claim,
YD> DSBL is easy though, just relay a message through the open relay to
YD> [EMAIL PROTECTED]
I don't have the ability to do that from my POP3 email client, at least
not that I was able to find from their web site.

YD> If you're using blocklists at all you should really find out a little
YD> about them, research their listing criteria, ensure you agree with
YD> their policies and methods, that way you'll never find reason to
YD> complain about them :)

I hear you, but there's only so much time available to do everything that
should be done. The benefit of participating in a community like this one
is that others do some of the work (eg: the SA developers have already
picked a set of reasonable network tests), and so I don't have to worry
about that side of the anti-spam activity. I can therefore put time into
those things I'm more skilled at.

YD> IMHO anything centralised could really do with a sample spam archived
YD> for each rule. proof, evidence, and useful reference all rolled into
YD> one. Include sequential numbers in the rulenames and cross reference
YD> to a text evidence file for each, or something like that, so nobody
YD> can ever have any doubts as to what a rule is for, why it exists, and
YD> that it relates to real spam.

Agreed. So any submission should include (a) the rule itself (body,
header, URI, etc), (b) a generic name (to which we'll prefix a unique
identifier), (c) one-line description, (d) Additional description as
appropriate, (e) complete spam with all headers. (a) through (d) can
easily be stored in the database, while (e) would be best kept as an
isolated file identified by the sequentially assigned rule number.
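
One possible shape for such a submission, as a sketch only. The field
names, the "SUB" prefix, the five-digit numbering, and the evidence file
naming are all assumptions made for the example:

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Submission:
    rule_type: str     # (a) "body", "header", "uri", ...
    pattern: str       # (a) the test itself, e.g. "/Don't wait/i"
    generic_name: str  # (b) name as submitted, before any prefix
    description: str   # (c) one-line description
    notes: str         # (d) additional description, may be empty
    sample_spam: str   # (e) complete spam with all headers


class Registry:
    """Assigns sequential rule numbers and splits a submission for storage."""

    def __init__(self, prefix="SUB"):
        self.prefix = prefix          # hypothetical unique-identifier prefix
        self._numbers = count(1)

    def register(self, sub):
        number = next(self._numbers)
        name = f"{self.prefix}{number:05d}_{sub.generic_name}"
        # (a) through (d) go into the database record ...
        record = {
            "number": number,
            "name": name,
            "rule": f"{sub.rule_type} {name} {sub.pattern}",
            "describe": f"describe {name} {sub.description}",
            "notes": sub.notes,
        }
        # ... while (e) is kept as a separate file keyed by the rule number.
        evidence_file = f"{number:05d}.txt"
        return record, evidence_file
```

The caller would write sub.sample_spam to the returned evidence file
name and store the record in the database.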

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0

iQA/AwUBPzMnI5ebK8E4qh1HEQLkjQCfW8XZhJjv/OrIno2Ir4gBhFwOwyoAnjGA
NXXWq+MYCy9qYE/Z5t0dChnA
=hWkw
-----END PGP SIGNATURE-----

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
