Re[4]: [SAtalk] [RD] Rule Philosophy

Robert Menschel Wed, 06 Aug 2003 21:14:24 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Dave, Chris,

Wednesday, August 6, 2003, 6:06:01 AM, Dave wrote:

RM>> We could, however, set up a blacklist through a website, such that
RM>> anyone can submit an entry, ... The web system would track
RM>> submissions, and create a ruleset from them.  

YD> Already thought of doing that, ... but the idea of automating it and
YD> opening it up to public submission has a couple of problems. 

YD> How do you verify / trust a submission? How do you know that a rule
YD> is right?

Initially, you don't. I'm looking at this much like the Wiki at
http://www.exit0.us/ -- Anyone can contribute to that web site, whether
they know what they're doing or not. The only policing is done by people
who decide they can improve on what they see.

YD> You can't verify them all by hand, spammers can register domains
YD> between them faster than a person can make rules.

Quick answer: I'm not interested in verification. That's why I suggested
that scores be very low on initial submission (0.1). Rules like that will
match and appear in the SA headers, but unless a lot of them hit, they
won't significantly change the spam/ham rating of an email.

We probably need someone to do a quick eyeball check on submitted rules,
to make sure we don't get a series of rules such as
> body L_A /a/i
> body L_B /b/i
etc.  26 of these would significantly mis-identify emails. Less extreme,
we'd probably want to catch
> body animalporn /(d[o0]g|[EMAIL PROTECTED]|m[o0][uv]se)/i
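
To put rough numbers on the single-letter example, here is a minimal
Python sketch. It assumes the stock SpamAssassin threshold of 5.0 and the
0.1 starting score suggested above; the rule names follow the L_A, L_B
pattern, and the sample message is invented for illustration.

```python
import re
import string

# Assumptions: the stock SpamAssassin threshold of 5.0 and the 0.1
# starting score proposed above. The sample text is invented.
THRESHOLD = 5.0
INITIAL_SCORE = 0.1

# One hypothetical rule per letter, mirroring L_A, L_B, ... above.
rules = {f"L_{c.upper()}": re.compile(c, re.IGNORECASE)
         for c in string.ascii_lowercase}

def hits(body):
    """Return the names of all rules whose pattern occurs in body."""
    return [name for name, rx in rules.items() if rx.search(body)]

ham = "The quick brown fox jumps over the lazy dog."
matched = hits(ham)
total = round(len(matched) * INITIAL_SCORE, 1)
print(len(matched), total, total / THRESHOLD)
```

An ordinary sentence that happens to use every letter collects 26 hits
and 2.6 points, more than half the threshold, before any real rule has
fired. That is the kind of submission an eyeball check should catch.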

However, given how much spam is circulating for supposed anti-spam
systems, I wouldn't object to
> body antispamspam /(anti-spam tool|(Filter|Blocking) Spam)/i
It'd match a lot of emails which mention/discuss spam, but with a 0.1
score, I don't have a problem with that.

When the rules are distributed, the rules themselves should be in a
separate file from the scores. Each system or individual would then be
able to rescore any specific rule according to their own needs or
desires.
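
A sketch of what that split could look like if the web system generated
the two files. This is only an illustration: the function names and the
1.5 override are hypothetical, while the rule names and patterns come
from this thread and the output uses the ordinary body/header/score
line forms.

```python
# Hypothetical submissions; names and patterns are taken from this
# thread, and the 0.1 default is the starting score proposed above.
DEFAULT_SCORE = 0.1

submissions = [
    ("body", "antispamspam", r"/(anti-spam tool|(Filter|Blocking) Spam)/i"),
    ("header", "L_hr_lattelekom", r"Received =~ /lattelekom\.net/"),
]

def render_rules(subs):
    """Rule definitions only; no scores appear in this file."""
    return "\n".join(f"{kind} {name} {test}" for kind, name, test in subs) + "\n"

def render_scores(subs, overrides=None):
    """One score line per rule; local overrides win over the default."""
    overrides = overrides or {}
    return "\n".join(
        f"score {name} {overrides.get(name, DEFAULT_SCORE)}" for _, name, _ in subs
    ) + "\n"

rules_text = render_rules(submissions)
shared_scores = render_scores(submissions)
# A site that trusts the relay rule more can rescore just that one.
local_scores = render_scores(submissions, {"L_hr_lattelekom": 1.5})
print(rules_text + shared_scores)
```

Because the rule file never mentions a score, a site can pull in updated
rules without losing whatever rescoring it has done locally.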

Now, if we add a body of volunteers into this project, who are willing
and able to evaluate the rules both for validity and against a reasonable
corpus, then we could also develop standardized scores for those rules,
still low, but higher than 0.1.

Wednesday, August 6, 2003, 7:14:25 AM, Chris wrote:

CS> I guess we are kind of trying to get rules available to the public
CS> faster than SA versions come out. Look at all the work the devs have
CS> to do for a mass check!

Exactly. We rely and depend upon the distributed rule set, and the
advancements made in that rule set from version to version are fantastic.
But because the SA developers have to pay attention to quality, and the
spammers obviously don't, the spammers move faster than the SA
developers.

I see this user-contributed rule set as a method we can use to tip the
balance in our favor. If Matt receives a spam on Wednesday, submits a
rule for it which my system automatically applies, and I receive that
spam on Thursday, then:
 a) Maybe that's just enough to tip the spam scale on my system, and I
    never see a false negative.
 b) If not, then at least I see the new rule in the spam header, and I
    can determine for myself what score I want that rule to have,
    without having to duplicate Matt's effort in developing the rule.

CS> I guess SA is becoming so big, it kind of needs a rule consortium
CS> now. Again, not that the devs aren't rocking, just to get rules out
CS> between SA versions.  

I don't see it so much as SA becoming so big, but rather the user
community is so large and active that we can contribute to each other in
this way, without interfering with the development effort.

Indeed, an occasional GA run against the collected contributed rule set
would be a good way to help the developers determine which of these rules
to add to the distribution rule set.

CS> So....any volunteers? :-) I think we would need around 5-10 people.

Count me in.

RM>>>> header    L_s_CorelWPOffice  Subject =~
RM>>>>                          /(?:Corel|WordPerfect).{1,15}Office/i

RM>> As for an ISP, I would think it's still a valid rule; they'd just
RM>> need to be careful to score it low enough to be incremental rather
RM>> than definitional.  

YD> No way is that a valid rule for an ISP to use. A good rule looks for
YD> something which only appears in spam, WPOffice probably appears in as
YD> much ham as spam.

Define "good rule".  Have you looked at FROM_ENDS_IN_NUMS lately? It
matches 523 ham in my corpus. FROM_NO_LOWER matches more ham than spam in
my corpus. Many of the HTML color rules also match a goodly amount of
ham. Based on the value of these rules, I call them good rules despite
the ham they match.

To me, a "good rule" is a rule whose score is appropriate for the ham
and/or spam matched. That means a rule which matches only spam, and can
only match spam, scores high; a rule which matches only ham scores
negative; and a rule which is merely suggestive of spam or ham gets a
small score that decides the outcome only in combination with
several/many other rules.

Therefore, to me, a correctly scored WP Office rule would be a good rule,
while a badly scored WP Office rule would be a bad rule. It's not the
matching that determines good/bad, but the appropriateness of the score.
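
One way to make "appropriate score" concrete is sketched below. This is
an illustrative heuristic only, not how SpamAssassin or its GA assigns
scores: the function name, the cap of 3.0, and all the hit counts are
assumptions made up for the example.

```python
def suggested_score(spam_hits, ham_hits, spam_total, ham_total, cap=3.0):
    """Illustrative heuristic only, not SpamAssassin's own scoring.

    Compares the hit rate in spam with the hit rate in ham and maps
    the result linearly onto the range [-cap, +cap].
    """
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return 0.0  # the rule never fired, so there is no evidence
    spam_share = spam_rate / (spam_rate + ham_rate)
    return round(cap * (2 * spam_share - 1), 2)

# Made-up counts against a hypothetical corpus of 1000 spam, 1000 ham.
spam_only = suggested_score(200, 0, 1000, 1000)     # hits spam only
ham_only = suggested_score(0, 200, 1000, 1000)      # hits ham only
suggestive = suggested_score(300, 100, 1000, 1000)  # leans toward spam
even = suggested_score(150, 150, 1000, 1000)        # no signal either way
print(spam_only, ham_only, suggestive, even)
```

Under this sketch a rule like the WP Office one, which hits ham and spam
about equally, lands near zero instead of being thrown out, and a rule
that leans toward spam gets a modest score that only matters alongside
other hits.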

(Yes, your addition of a % and/or $ within the rule is a good addition.
Thanks.)

RM>>>> header    L_hr_lattelekom  Received =~ /lattelekom\.net/

RM>> This was a spam that didn't score from them -- apparently it's too
RM>> new a pathway. This should probably be given a temporary name/flag,
RM>> and removed once the DNSBLs catch up.  

YD> Do they need to catch up or for someone to submit it? It won't get
YD> listed if nobody submits it, and if you submit it instead of writing
YD> a rule for it you'll never have to remove that rule if/when it
YD> becomes secure.

Yes, they need to catch up. To some extent some blacklists can catch up
on their own, but mostly the catch-up depends on submissions.

I agree with your philosophy, and I use that philosophy also with Bayes;
teach the spam to Bayes, and see if it modifies the score enough to trap
it the next time through. If so, then there's probably no need to add a
rule.

However, I don't know how sensitive the blacklists are, nor how long they
take to update.

Nor do I know how to submit updates. As an end user on a virtual
hosting server, I believe I don't have the access to submit updates from
the server, and I don't know how to do so from my home Windows machine.
Perhaps that information could be added to one of our SA resources?

Still, that leaves the question of how long it would take a BL to learn
about a specific relay. A temporary rule is faster.

YD> One point I would like to make about all this rule-writing is
YD> documenting the rules you make, not just date stamping them. A couple
YD> of lines of comments reminding you why you made a rule is always a
YD> good thing, including the line you're matching from the original spam
YD> will help you improve the rule if the spammer morphs.

Good point. I have started documenting my rules to indicate how broadly
each one matches (spam and ham), but not *why* the rule is there. I
figure it's easy for me to rescan my corpus for a questionable rule, and
that will tell me why it's there, and whether it's still of value.

If/when we begin working with a centralized distribution set of
contributed rules, that won't be as simple to do. Thanks.
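
For a shared set, the documentation Dave describes could be generated
along with each rule. The sketch below is one possible layout, not an
existing convention: the comment field names, the helper function, and
the 30-day review window are all assumptions, and the sample line is
left as a placeholder.

```python
from datetime import date, timedelta

def documented_rule(kind, name, test, why, sample, added, review_days=None):
    """Render a rule preceded by comment lines recording why it exists."""
    lines = [
        f"# Added:  {added.isoformat()}",
        f"# Why:    {why}",
        f"# Sample: {sample}",
    ]
    if review_days is not None:
        # Temporary rules carry a date after which they should be rechecked.
        due = added + timedelta(days=review_days)
        lines.append(f"# Review: {due.isoformat()} (temporary rule)")
    lines.append(f"{kind} {name} {test}")
    return "\n".join(lines) + "\n"

text = documented_rule(
    "header", "L_hr_lattelekom", r"Received =~ /lattelekom\.net/",
    why="relay not yet listed by the DNSBLs",
    sample="(paste the matching Received line here)",
    added=date(2003, 8, 6), review_days=30,
)
print(text)
```

Keeping the reason and the matched line next to the rule means someone
without the original corpus can still judge whether it is worth keeping.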

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0

iQA/AwUBPzHF0pebK8E4qh1HEQKjiACfVCyAb2kGVuqOAUcrwnjxCaIQx+gAn3VW
ZBEUTzBBWMmLh6q1soPNsJ4b
=N68A
-----END PGP SIGNATURE-----

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
