Re: Re[6]: [SAtalk] [RD] Rule Philosophy

Yorkshire Dave Fri, 08 Aug 2003 08:33:33 -0700

On Fri, 2003-08-08 at 05:29, Robert Menschel wrote:
----snip----
> 
> YD> We could make more than one score file available for the same set of
> YD> rules, let people have the rules at any score they choose.
> 
> I like that!  (a) One set of files for the rules (a file for each class
> of rule (From, Subject, Body, URI), with some files for large groups of
> rules (scam, pirate, porn, etc). Then (b) One scoring rule for each of a
> variety of different audiences and philosophies (an ISP score set, Int'l
> Business score set, family score set, conservative score set, aggressive
> score set, etc).
> 
> YD> Or even generate rule lists on a per-user basis from a database, let
> YD> users create their own rule and score file by selecting rules or
> YD> groups of rules from the database, start the rules off with a low but
> YD> significant score, let them re-score individual rules to suit
> YD> themselves.  Give users a web interface and let them switch off the
> YD> rules they don't want (I'm thinking built-in user feedback, if enough
> YD> people down-score or switch off a rule then it needs re-examining)
> 
> Do I understand you right? Would this be the type of thing you're
> thinking of?
> 
> 1) A database which stores each rule's name, and a brief summary of or
>    intro to it, identified by a unique rule number.
> 
> 2) Tables in that database which identify each registered user, with
>    password, user type (audience as above), and scores for each
>    identified rule. 
> 
> 3) Web form(s) which display the brief rule info, current score (default
>    if none yet specified), and allow the user to change their scores,
>    those changes to be recorded in the database.
> 
> 4) A Transmit button, which builds the rules files (excluding those rules
>    the user has turned off), builds the scoring file (from the user's
>    database scoring information), and packages them up as a *.gz or *.zip
>    file for email or download or ftp or wget or other transmission?


That's about what I was thinking, yes. I've been trying to think of a
way of getting direct feedback on the usefulness of a rule for a while
now. 

When the idea of a rule wiki came up, I went off and looked at trying to
build a 'vote for this rule' system to attach to a wiki, so users could
vote or moderate rules up or down. I'd like to see the cream of the crop
rise to the top, making it easier for users to find the best rules for
themselves and easier for the devs to find the good rules to incorporate
into SA itself.

If anything, this idea is an even better solution: it gives users a real
reason to tell the system what they think of a rule, beyond the usual
nice/polite/sharing motives. I don't think there's anything difficult
about it either; it's just simple SQL stuff.
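
As a rough sketch of how small the SQL side could be (not taken from
any existing SA tool): the table names, column names and sample rule
names below are all made up for illustration, and only the
"score NAME VALUE" output line is real SA config syntax. A NULL score
means "use the default", and the count of users who switched a rule
off is the feedback signal.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative.
SCHEMA = """
CREATE TABLE rules (
    rule_id       INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE,
    summary       TEXT,
    default_score REAL NOT NULL
);
CREATE TABLE user_scores (
    username TEXT NOT NULL,
    rule_id  INTEGER NOT NULL REFERENCES rules(rule_id),
    score    REAL,                       -- NULL means "use the default"
    enabled  INTEGER NOT NULL DEFAULT 1, -- 0 means the user switched it off
    PRIMARY KEY (username, rule_id)
);
"""


def build_score_file(db, username):
    """Return 'score NAME VALUE' lines for one user, skipping disabled rules."""
    rows = db.execute(
        """
        SELECT r.name, COALESCE(u.score, r.default_score)
        FROM rules r
        LEFT JOIN user_scores u
               ON u.rule_id = r.rule_id AND u.username = ?
        WHERE COALESCE(u.enabled, 1) = 1
        ORDER BY r.name
        """,
        (username,),
    ).fetchall()
    return "\n".join(f"score {name} {score:.2f}" for name, score in rows)


def disabled_counts(db):
    """Map rule name -> number of users who switched that rule off."""
    return dict(db.execute(
        """
        SELECT r.name, COUNT(*)
        FROM user_scores u JOIN rules r ON r.rule_id = u.rule_id
        WHERE u.enabled = 0
        GROUP BY r.name
        """
    ).fetchall())


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", [
    (1, "D_subject_0001", "Subject mentions a product name", 0.5),
    (2, "D_body_0002", "Body says don't wait", 0.1),
])
db.executemany("INSERT INTO user_scores VALUES (?, ?, ?, ?)", [
    ("alice", 1, None, 0),  # alice switched rule 1 off
    ("alice", 2, 0.3, 1),   # and re-scored rule 2
])

print(build_score_file(db, "alice"))  # alice's personal score file
print(build_score_file(db, "bob"))    # bob has no overrides: all defaults
print(disabled_counts(db))            # rules that may need re-examining
```

A rule that shows up with a high count in disabled_counts() is the
one to look at again.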

> RM>>>>>> header    L_s_CorelWPOffice  Subject =~
> RM>>>>>>                          /(?:Corel|WordPerfect).{1,15}Office/i
> 
> RM>>>> As for an ISP, I would think it's still a valid rule; ...
> 
> YD>>> No way is that a valid rule for an ISP to use. A good rule looks
> YD>>> for something which only appears in spam, WPOffice probably appears
> YD>>> in as much ham as spam.  
> 
> RM>> Define "good rule".  Have you looked at FROM_ENDS_IN_NUMS lately? It
> RM>> matches 523 ham in my corpus. FROM_NO_LOWER matches more ham than
> RM>> spam in my corpus. Many of the HTML color rules also match a goodly
> RM>> amount of ham. Based on the value of these rules, I call them good
> RM>> rules despite the ham they match.    
> 
> YD> FROM_NO_LOWER scores 0 here. I'm still getting mileage out of
> YD> FROM_ENDS_IN_NUMS but it's not what I'd call a good rule either.
> 
> I lowered FROM_NO_LOWER from 1.6 to 1.0 on June 23; it's doing well for
> me at that score. FROM_ENDS_IN_NUMS was also reduced from 0.6 to 0.45.
> 
> YD> The only reason rules like that still exist in sa is because not
> YD> enough people join in the mass checks, or the technically adept
> YD> people who do join in the mass checks are more likely to exchange
> YD> email with other technically adept people, who are more likely to
> YD> have their own domain name and be [EMAIL PROTECTED] and are therefore less
> YD> likely to be [EMAIL PROTECTED]
> 
> I disagree. Well, I agree that it'd be better if more people joined in
> the mass checks. But I disagree that's why these rules stick around. They
> stick around because they remain suggestive of spam (though not
> definitive, which is why they get low scores).

It fits with what I'm seeing. Newbie non-techy users are far more likely
to cause a false positive than more experienced users. Think about this.
If the only users who joined in the mass checks were AOL and Hotmail
users then FROM_ENDS_IN_NUMS would probably have a substantial negative
score.

> RM>> To me, a "good rule" is a rule whose score is appropriate for the ham
> RM>> and/or spam matched. That means that a rule which does match only spam,
> RM>> and that can only match spam, scores high, a rule which matches only ham
> RM>> scores negative, and a rule which is suggestive of spam or ham provides a
> RM>> suggestive score which provides judgment only when combined with
> RM>> several/many other rules.
> 
> YD> The better the rule, the less ham it matches, and vice versa. Taking a
> YD> common word or phrase and capturing with it something which makes it
> YD> spammy is what makes a good rule.
> 
> Fortunately there's enough room for us to agree to disagree here.  While
> I'll certainly take advantage of some specific rules and score them high
> because they match ONLY spam, the great majority of my rules are
> suggestive rather than definitive.

The big advantage of SA is that you can use suggestive rules, but the
closer to definitive, the better the rule. If a rule is too vague to give
it a worthwhile score, I end up questioning its reason for existing.
> 
> Chris suggests on his page at
> http://www.merchantsoverseas.com/wwwroot/gorilla/sa_rules.htm
> C> ) Scoring is an art! But a general idea is to write many small rules
> C> and score them low. Let 10 rules add up to 2 points rather then 1 rule
> C> with all 10 things in it scoring 2 points. This will help minimize
> C> false positives.
> and I generally agree with that philosophy. I had a spam this morning
> that hit 8.7 of 9.0 using default rules (including Bayes-80). It also
> matched one 0.9 rule of mine, which pushed it over the threshold, and so
> I'd never have seen it if I hadn't gone looking for this type of example.
> 
> And that's the way I like SA to work -- With my 9.0 threshold, a 9.001
> score matching a dozen rules is as good as a 9.1 score matching but a
> single rule. 
> 
> YD> I'm trying to produce only ISP grade rules I guess. ...
> 
> So you have a lot fewer rules in your rule set than I do in mine, I
> guess.  :-)

I still have more than enough rules to make it work well though. I just
don't see much point in writing rules that are extremely subjective. 

I like to see good solid rules that match mainly spam, rules that could
carry a couple of points without any risk. When there is a false
positive, I like to be able to give a good reason for it. I'm providing
SA filtering to a few local businesses. They don't want to lose
important mail, not even into the spam bucket, and when they do they
often ask why and I'm expected to give a good reason. "Because it
mentions Corel Office" is not a good reason, unless Corel themselves
start spamming.

 
> RM>> Therefore, to me, a correctly scored WP Office rule would be a good
> RM>> rule, while a badly scored WP Office rule would be a bad rule. It's
> RM>> not the matching that determines good/bad, but the appropriateness
> RM>> of the score.   
> 
> YD> A correct score for such a loose rule would vary so widely that it
> YD> would become useless. Given enough time I can probably find someone
> YD> who would score that at about -5.
> 
> It takes that long to find someone working at Corel? They *are* in
> trouble then!!

I meant find someone amongst my own circle of friends and acquaintances
:) I don't know anyone who works at Corel.

> YD> You're writing a rule for the brand name of a mainstream product which
> YD> probably has millions of users worldwide, doesn't that ring alarm bells
> YD> somewhere? :)
> 
> Nope. No more than does
> > body      RM_bp_DontWait  /Don't wait/i
> > describe  RM_bp_DontWait  Body says don't wait
> > score     RM_bp_DontWait  0.1  # 211 Spam, 21 Ham, Aug 7, 2003

That's another works-for-some rule; I really don't think you can
distribute that sort of thing.

A quick grep here gives me 78 ham and 23 spam from June to now. My ex-g/f's
corpus would be worse: I told her 'Don't wait up I'm working late'
almost every day for a year (that's why she's ex, I guess :)).

A rule to match words ending in z consistently hits a lot more spam than
ham in my corpus, mainly due to random obfuscation tags, biz domains and
rot13'd .com (.pbz), but does that mean I should have a rule for words
ending in z? Would you want a rule for z? Why is a rule for z any
different to a rule for "don't wait"?
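
For anyone who wants to run the same comparison on their own mail,
here is a minimal sketch of the counting involved. It assumes the ham
and spam are already loaded as lists of message text; the function
names and the tiny corpora are invented for illustration and are not
part of SA or its mass-check tools.

```python
import re


def count_hits(pattern, messages):
    """Number of messages in which the pattern matches at least once."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for text in messages if rx.search(text))


def spam_fraction(pattern, ham, spam):
    """Fraction of all hits that were spam; None if the pattern never hit."""
    ham_hits = count_hits(pattern, ham)
    spam_hits = count_hits(pattern, spam)
    total = ham_hits + spam_hits
    return None if total == 0 else spam_hits / total


# Made-up corpora, just enough to exercise the functions.
ham = ["Don't wait up, I'm working late", "Minutes from Tuesday's meeting"]
spam = ["Don't wait! Offer ends today", "Cheap pills at example.biz"]

# Hits for the "don't wait" pattern in each corpus.
print(count_hits(r"don't wait", ham), count_hits(r"don't wait", spam))
# Spam fraction for "words ending in z".
print(spam_fraction(r"\b\w+z\b", ham, spam))
```

Plugging in the grep counts quoted above for "Don't wait" (78 ham, 23
spam) gives 23/101, so only about 23% of the hits here are spam.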

Read through your mailboxes, find other mail that has the name of a
product in the subject line, and imagine having a rule for that and
how much mayhem it might cause you. The only thing that's different is
the user and which products are or are not relevant to him/her.

I'm not saying nobody should ever make such a rule, just that I suspect
a better rule is missed many times because of wrong thinking. Being too
busy thinking "I don't want WPOffice" means the real spam indicator, the
% sign, gets left out.

> 
> RM>>>>>> header  L_hr_lattelekom  Received =~ /lattelekom\.net/
> 
> RM>> ... However, I don't know how sensitive the blacklists are, nor how
> RM>> long they take to update. 
> 
> YD> I wasn't sure either, so I emailed DSBL and asked. The answer I got
> YD> back was more complex than I expected :) ... 
> 
> YD> So, a worst+worst case is 26 minutes with the normal case being below
> YD> the 16 minutes. 
> 
> YD> I'd say that's fast enough :)
> 
> However, I can't run the software required to submit to DSBL.
> 
I usually test open relays with telnet. The simplest case, a wide open
relay, goes something like this:

telnet <ipaddress> 25
helo relay
mail from: <[EMAIL PROTECTED]>
rcpt to: <[EMAIL PROTECTED]>
data
.
quit

and then see if the message arrives.

There are plenty of simple utils out there to do the same job, or you
could even use a normal email client, just set the suspected open relay
as the smtp server in your email client and send yourself an email.
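
The same check can also be scripted. This is a minimal sketch using
Python's standard smtplib, assuming plain SMTP on port 25 and a host
you are entitled to test. The function names are hypothetical and the
host and addresses in the usage comment are placeholders. The
smtp_factory parameter is only there so the function can be tried
without touching the network.

```python
import smtplib
from email.message import EmailMessage


def build_probe(sender, recipient, relay_host):
    """Build a small test message that names the relay being checked."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"relay test via {relay_host}"
    msg.set_content("Open relay test message.")
    return msg


def probe_relay(relay_host, sender, recipient, smtp_factory=smtplib.SMTP):
    """Offer relay_host a message for a recipient it should not serve.

    Returns True if the server accepted the message, False if it refused
    or could not be reached. Acceptance alone does not prove delivery:
    you still have to see whether the message arrives.
    """
    msg = build_probe(sender, recipient, relay_host)
    try:
        with smtp_factory(relay_host, 25, timeout=30) as smtp:
            smtp.send_message(msg)
        return True
    except (smtplib.SMTPException, OSError):
        return False


# Example call (placeholder host and addresses, not run here):
# probe_relay("192.0.2.10", "me@example.org", "me@example.org")
```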

> RM>> Nor do I know how to submit updates. As an end user on a virtual
> RM>> hosting server, I believe I don't have the access to submit updates
> RM>> from the server, and I don't know how to do so from my home Windows
> RM>> machine. Perhaps that information could be added to one of our SA
> RM>> resources?    
> 
> YD> See the website for the blocklist you want to submit to. ...
> 
> I checked DSBL, and while as you claim,
> YD> DSBL is easy though, just relay a message through the open relay to
> YD> [EMAIL PROTECTED]
> I don't have the ability to do that from my POP3 email client, at least
> not that I was able to find from their web site.

You do for a simple open relay: you can do it straight from your normal
mail client just by changing the smtp server setting to the address of
the open relay. I'm not sure how you'd do an open socks proxy though,
unless your mail client is socks enabled. Yet again, I just use telnet,
but I've been doing it for quite a few years so I've had time to get
used to it.

---- more snip ----

> YD> IMHO anything centralised could really do with a sample spam archived
> YD> for each rule. proof, evidence, and useful reference all rolled into
> YD> one. Include sequential numbers in the rulenames and cross reference to
> YD> a text evidence file for each, or something like that, so nobody can
> YD> ever have any doubts as to what a rule is for, why it exists, and that
> YD> it relates to real spam.
> 
> Agreed. So any submission should include (a) the rule itself (body,
> header, URI, etc), (b) a generic name (to which we'll prefix a unique
> identifier), (c) one-line description, (d) Additional description as
> appropriate, (e) complete spam with all headers. (a) through (d) can
> easily be stored in the database, while (e) would be best kept as an
> isolated file identified by the sequentially assigned rule number.
>  
> Bob Menschel
> 

Yes, that's about how I'd do it: make rule names something like
D_<category>_<uniquenumber> and make the sample spam available on a
website under /samplespam/<uniquenumber>.txt. Any future modifications
to the rule get a new sample spam appended to the txt file, so as a
spammer tries to avoid the filter you get a record of his techniques.
Since L stands for local rules, D suits distributed rules.
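
A small sketch of that naming scheme, with some assumptions labelled:
the four-digit zero padding, the separator line between samples and
the helper names are all invented for illustration. Only the
D_<category>_<uniquenumber> pattern and the /samplespam/ path come
from the description above.

```python
import re


def rule_name(category, number):
    """Build D_<category>_<uniquenumber>; the zero padding is an assumption."""
    if not re.fullmatch(r"[A-Za-z0-9]+", category):
        raise ValueError(f"bad category: {category!r}")
    return f"D_{category}_{number:04d}"


def sample_path(number):
    """Path of the evidence file for a rule number, relative to the site root."""
    return f"/samplespam/{number:04d}.txt"


def append_sample(existing, new_spam, note):
    """Append a newer sample to the evidence text, keeping the older ones."""
    separator = f"\n----- {note} -----\n"
    return existing.rstrip("\n") + separator + new_spam.rstrip("\n") + "\n"


print(rule_name("subject", 42))
print(sample_path(42))
```

Because samples are only ever appended, the evidence file doubles as
a history of how the spam changed each time the rule was revised.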

One possible addition would be a per-user "notes" form field for users
to comment on why they're disabling or re-scoring a rule, and the ability
for admin to display all users' notes for any given rule. More feedback
is always better.

-- 
Yorkshire Dave


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk