Re: Re[4]: [SAtalk] [RD] Rule Philosophy

Yorkshire Dave Fri, 08 Aug 2003 14:49:06 -0700

On Thu, 2003-08-07 at 04:21, Robert Menschel wrote:
> Hello Dave, Chris,
> 
Hi. Excuse me if I contradict myself or say anything silly here; I'm
thinking out loud as much as anything :)


> Wednesday, August 6, 2003, 6:06:01 AM, Dave wrote:
> 
> RM>> We could, however, set up a blacklist through a website, such that
> RM>> anyone can submit an entry, ... The web system would track
> RM>> submissions, and create a ruleset from them.  
> 
> YD> Already thought of doing that, ... but the idea of automating it and
> YD> opening it up to public submission has a couple of problems. 
> 
> YD> How do you verify / trust a submission? How do you know that a rule
> YD> is right?
> 
> Initially, you don't. I'm looking at this much like the Wiki at
> http://www.exit0.us/ -- Anyone can contribute to that web site, whether
> they know what they're doing or not. The only policing is done by people
> who decide they can improve on what they see.

I wasn't thinking of it that way at all. I was thinking of it more like
a blacklist based around URLs found in spam, where anyone can forward
spam for processing. I guess they're both good ideas though :)

I'm going to try to get some of my ideas down on paper. I think I have
enough separate pieces now to form a whole idea.

> YD> You can't verify them all by hand, spammers can register domains
> YD> between them faster than a person can make rules.
> 
> Quick answer: I'm not interested in verification. That's why I suggested
> the scores be real low on initial submission (0.1). Rules like that will
> match and appear in the SA headers, but unless a lot of them are hit,
> these rules won't significantly change the spam/ham rating of an email.
> 
> We probably need someone to do a quick eyeball check on submitted rules,
> to make sure we don't get a series of rules such as
> > body L_A /a/i
> > body L_B /b/i
> etc.  26 of these would significantly mis-identify emails. Less extreme,
> we'd probably want to catch
> > body animalporn /(d[o0]g|[EMAIL PROTECTED]|m[o0][uv]se)/i
> 
> However, given how much spam is circulating for supposed anti-spam
> systems, I wouldn't object to
> > body antispamspam /(anti-spam tool|(Filter|Blocking) Spam)/i
> It'd match a lot of emails which mention/discuss spam, but with a 0.1
> score, I don't have a problem with that.
> 
> When the rules are distributed, the rules themselves should be in a
> separate file from the scores. Each system or individual would then be
> able to rescore any specific rule according to their own needs or
> desires.
> 
> Now, if we add a body of volunteers into this project, who are willing
> and able to evaluate the rules both for validity and against a reasonable
> corpus, then we could also develop standardized scores for those rules,
> still low, but higher than 0.1
> 
You're causing ideas here :) 

We could make more than one score file available for the same set of
rules and let people have the rules at any score they choose.

Or even generate rule lists on a per-user basis from a database. Let
users create their own rule and score file by selecting rules or groups
of rules from the database. Start the rules off with a low but
significant score and let users re-score individual rules to suit
themselves. Give users a web interface and let them switch off the
rules they don't want. (I'm thinking built-in user feedback: if enough
people down-score or switch off a rule, then it needs re-examining.)
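
A rough sketch of the per-user idea, in Python. The rule database, the
"keep the default" convention (None) and the two-vote review threshold
are all assumptions made for illustration; only the rule names and the
0.1 starting score come from this thread.

```python
from collections import Counter

# Hypothetical shared rule database: rule name -> default (low) score.
DEFAULT_SCORES = {
    "L_s_CorelWPOffice": 0.1,
    "L_hr_lattelekom": 0.1,
}

def build_score_file(user_choices):
    """Render 'score NAME VALUE' lines for one user.

    user_choices maps rule name -> a number (the user's own score) or
    None (keep the default). A score of 0 switches the rule off.
    """
    lines = []
    for name in sorted(user_choices):
        chosen = user_choices[name]
        score = DEFAULT_SCORES[name] if chosen is None else chosen
        lines.append(f"score {name} {score:.1f}")
    return "\n".join(lines)

def rules_needing_review(all_users, min_votes=2):
    """Return rules that at least min_votes users scored below default."""
    votes = Counter()
    for choices in all_users:
        for name, chosen in choices.items():
            if chosen is not None and chosen < DEFAULT_SCORES[name]:
                votes[name] += 1
    return sorted(name for name, n in votes.items() if n >= min_votes)
```

Each user's selections become a small score file, and the same
selections double as feedback: a rule that several users down-score or
zero out gets flagged for another look.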

> Wednesday, August 6, 2003, 7:14:25 AM, Chris wrote:
> 
> CS> I guess we are kind of trying to get rules available to the public
> CS> faster than SA versions come out. Look at all the work the devs have
> CS> to do for a mass check!
> 
> Exactly. We rely and depend upon the distributed rule set, and the
> advancements made in that rule set from version to version are fantastic.
> But because the SA developers have to pay attention to quality, and the
> spammers obviously don't, the spammers move faster than the SA
> developers.
> 
> I see this user-contributed rule set as a method we can use to tip the
> balance in our favor. If Matt receives a spam on Wednesday, submits a
> rule for it which my system automatically applies, and I receive that
> spam on Thursday, then a) Maybe that's just enough to tip the spam scale
> on my system, and I never see a false negative. b) If not, then at least
> I see the new rule in the spam header, and I can determine for myself
> what score I want that rule to have, without having to duplicate Matt's
> effort in developing the rule.
> 
> CS> I guess SA is becoming so big, it kind of needs a rule consortium
> CS> now. Again, not that the devs aren't rocking, just to get rules out
> CS> between SA versions.  
> 
> I don't see it so much as SA becoming so big, but rather the user
> community is so large and active that we can contribute to each other in
> this way, without interfering with the development effort.
> 
> Indeed, an occasional GA run against the collected contributed rule set
> would be a good way to help the developers determine which of these rules
> to add to the distribution rule set.
> 
> CS> So....any volunteers? :-) I think we would need around 5-10 people.
> 
> Count me in.
> 
> RM>>>> header    L_s_CorelWPOffice  Subject =~
> RM>>>>                          /(?:Corel|WordPerfect).{1,15}Office/i
> 
> RM>> As for an ISP, I would think it's still a valid rule; they'd just
> RM>> need to be careful to score it low enough to be incremental rather
> RM>> than definitional.  
> 
> YD> No way is that a valid rule for an ISP to use. A good rule looks for
> YD> something which only appears in spam, WPOffice probably appears in as
> YD> much ham as spam.
> 
> Define "good rule".  Have you looked at FROM_ENDS_IN_NUMS lately? It
> matches 523 ham in my corpus. FROM_NO_LOWER matches more ham than spam in
> my corpus. Many of the HTML color rules also match a goodly amount of
> ham. Based on the value of these rules, I call them good rules despite
> the ham they match.

FROM_NO_LOWER scores 0 here. I'm still getting mileage out of
FROM_ENDS_IN_NUMS but it's not what I'd call a good rule either. 

The only reason rules like that still exist in SA is that not enough
people join in the mass checks, or the technically adept people who do
join in the mass checks are more likely to exchange email with other
technically adept people, who are more likely to have their own domain
name and be [EMAIL PROTECTED] and are therefore less likely to be
[EMAIL PROTECTED]

> To me, a "good rule" is a rule whose score is appropriate for the ham
> and/or spam matched. That means that a rule which does match only spam,
> and that can only match spam, scores high, a rule which matches only ham
> scores negative, and a rule which is suggestive of spam or ham provides a
> suggestive score which provides judgment only when combined with
> several/many other rules.

The better the rule, the less ham it matches, and vice versa. Taking a
common word or phrase and capturing with it something which makes it
spammy is what makes a good rule. 
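
One way to put a number on "the less ham it matches" is the share of a
rule's hits that were spam. A minimal sketch follows; the rule names
and hit counts are invented for illustration and do not come from
anyone's corpus.

```python
def spam_fraction(spam_hits, ham_hits):
    """Fraction of a rule's hits that were spam (0.0 if it never hit)."""
    total = spam_hits + ham_hits
    return spam_hits / total if total else 0.0

# Hypothetical hit counts as (spam, ham), for illustration only.
hits = {
    "BRAND_NAME_ONLY": (40, 60),
    "BRAND_NAME_PLUS_TOKEN": (38, 2),
}

# Rules ordered from most to least spam-specific.
ranked = sorted(hits, key=lambda name: spam_fraction(*hits[name]), reverse=True)
```

With these made-up numbers the rule that also captures a spammy token
ranks well above the bare brand-name match, even though it hits
slightly less spam.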

I'm trying to produce only ISP grade rules I guess. If you seriously
want to distribute that rule and have other people find it useful you
need to think about it in a wider way than just yourself. The WPOffice
rule would probably be a pain in the ass to anyone on a legit WPOffice
mailing list.   

> Therefore, to me, a correctly scored WP Office rule would be a good rule,
> while a badly scored WP Office rule would be a bad rule. It's not the
> matching that determines good/bad, but the appropriateness of the score.
> 
A correct score for such a loose rule would vary so widely that it would
become useless. Given enough time I can probably find someone who would
score that at about -5.

You're writing a rule for the brand name of a mainstream product which
probably has millions of users worldwide. Doesn't that ring alarm bells
somewhere? :)

> (Yes, your addition of a % and/or $ within the rule is a good addition.
> Thanks.)
> 
The %|$ is the token which makes WPOffice spammy; without that it's just
email about WPOffice. If there's any local per-user correlation between
WPOffice and spam then Bayes should be dealing with it.

Even with the % it still isn't a perfect rule. If I were using it I'd
want to add something else too: make it half of a meta && rule with some
other spam or bulkiness indicator as the other half.
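
The logic of that meta && combination, sketched in Python rather than
as an actual SA rule. The brand pattern is the one quoted in this
thread; the exact form of the %/$ check and the second indicator (a
plain boolean here) are assumptions.

```python
import re

# Brand pattern quoted earlier in the thread.
BRAND = re.compile(r"(?:Corel|WordPerfect).{1,15}Office", re.I)
# Assumed form of the extra token: a percent or dollar sign in the subject.
PRICE_TOKEN = re.compile(r"[%$]")

def wpoffice_meta(subject, other_indicator):
    """True only if brand, price token and a second indicator all hit."""
    brand_with_token = bool(BRAND.search(subject) and PRICE_TOKEN.search(subject))
    return brand_with_token and other_indicator
```

A subject like "WordPerfect Office macro question" never fires, however
the second indicator is set, which is the point: the brand name alone
says nothing about spam.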

> RM>>>> header    L_hr_lattelekom  Received =~ /lattelekom\.net/
> 
> RM>> This was a spam that didn't score from them -- apparently it's too
> RM>> new a pathway. This should probably be given a temporary name/flag,
> RM>> and removed once the DNSBLs catch up.  
> 
> YD> Do they need to catch up or for someone to submit it? It won't get
> YD> listed if nobody submits it, and if you submit it instead of writing
> YD> a rule for it you'll never have to remove that rule if/when it
> YD> becomes secure.
> 
> Yes, they need to catch up. To some extent some blacklists can catch up
> on their own, but mostly the catch-up depends on submissions.
> 
> I agree with your philosophy, and I use that philosophy also with Bayes;
> teach the spam to Bayes, and see if it modifies the score enough to trap
> it the next time through. If so, then there's probably no need to add a
> rule.
> 
> However, I don't know how sensitive the blacklists are, nor how long they
> take to update.

I wasn't sure either, so I emailed DSBL and asked. The answer I got back
was more complex than I expected :)

If you submit immediately after the last update, your submission will be
in their SQL database immediately, on the primary when it next updates
in 5 minutes' time, and on the secondaries 10 minutes after that. If
somebody queried for the IP before your update hit the secondary, a
negative entry will exist, which takes 10 minutes to expire.

So, a worst+worst case is 26 minutes, with the normal case being below
16 minutes.

I'd say that's fast enough :)
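
The arithmetic behind those figures, as a small sketch. It assumes the
three delays simply add up, which is a reading of the description above
rather than anything DSBL stated; the sums come out a minute under the
16 and 26 quoted, so those figures seem to include a little slack.

```python
# Intervals in minutes, as described in the reply from DSBL above.
PRIMARY_UPDATE = 5      # longest wait for the next primary update
SECONDARY_LAG = 10      # secondaries follow the primary
NEGATIVE_CACHE = 10     # lifetime of a cached "not listed" answer

def max_exposure(negative_cached):
    """Upper bound, in minutes, before a lookup sees the new listing."""
    total = PRIMARY_UPDATE + SECONDARY_LAG
    if negative_cached:
        total += NEGATIVE_CACHE
    return total
```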

> Nor do I know how to submit updates. As an end user on a virtual
> hosting server, I believe I don't have the access to submit updates from
> the server, and I don't know how to do so from my home Windows machine.
> Perhaps that information could be added to one of our SA resources?

See the website for the blocklist you want to submit to. They're all
different, and some don't accept submissions at all. DSBL is easy though:
just relay a message through the open relay to [EMAIL PROTECTED]

If you're using blocklists at all you should really find out a little
about them: research their listing criteria and ensure you agree with
their policies and methods. That way you'll never find reason to complain
about them :)

> Still, that leaves the question of how long it would take a BL to learn
> about a specific relay. A temporary rule is faster.

I'll live with being left exposed for <30 mins. It's easier to fire a
message to DSBL through an open relay which I'm already testing than it
is to write a simple rule.

> YD> One point I would like to make about all this rule-writing is
> YD> documenting the rules you make, not just date stamping them. A couple
> YD> of lines of comments reminding you why you made a rule is always a
> YD> good thing, including the line you're matching from the original spam
> YD> will help you improve the rule if the spammer morphs.
> 
> Good point. I have started documenting my rules to indicate how broadly
> it matches (spam and ham), but not *why* the rule is there. I figure it's
> easy for me to rescan my corpus for a questionable rule, and that will
> tell me why it's there, and whether it's still of value.

Now I just need to learn to practise what I preach a little better :)

> If/when we begin working with a centralized distribution set of
> contributed rules, that won't be as simple to do. Thanks.
> 
> Bob Menschel

IMHO anything centralised could really do with a sample spam archived
for each rule: proof, evidence, and useful reference all rolled into
one. Include sequential numbers in the rule names and cross-reference to
a text evidence file for each, or something like that, so nobody can
ever have any doubts as to what a rule is for, why it exists, and that
it relates to real spam.
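
Something like the following could tie the pieces together. The naming
scheme, the four-digit padding and the evidence/ directory layout are
all hypothetical; the sketch only shows a numbered rule name pointing
at its archived sample, with the matched line kept as a comment.

```python
def rule_name(prefix, seq):
    """Build a rule name carrying a zero-padded sequence number."""
    return f"{prefix}_{seq:04d}"

def evidence_path(name):
    """Path of the archived sample spam for a rule (layout is assumed)."""
    return f"evidence/{name}.txt"

def rule_with_comment(name, pattern, matched_line):
    """Render a body rule preceded by comments pointing at its evidence."""
    return "\n".join([
        f"# evidence: {evidence_path(name)}",
        f"# matched:  {matched_line}",
        f"body {name} {pattern}",
    ])
```

Because the sequence number is part of the rule name, anyone who sees
the rule in a spam header can go straight to the matching evidence
file.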
 

-- 
Yorkshire Dave
