-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Dave, Chris,

Wednesday, August 6, 2003, 6:06:01 AM, Dave wrote:

RM>> We could, however, set up a blacklist through a website, such that
RM>> anyone can submit an entry, ... The web system would track
RM>> submissions, and create a ruleset from them.

YD> Already thought of doing that, ... but the idea of automating it and
YD> opening it up to public submission has a couple of problems.

YD> How do you verify / trust a submission? How do you know that a rule
YD> is right?

Initially, you don't. I'm looking at this much like the Wiki at
http://www.exit0.us/ -- Anyone can contribute to that web site, whether
they know what they're doing or not. The only policing is done by people
who decide they can improve on what they see.

YD> You can't verify them all by hand, spammers can register domains
YD> between them faster than a person can make rules.

Quick answer: I'm not interested in verification. That's why I suggested
the scores be real low on initial submission (0.1). Rules like that will
match and appear in the SA headers, but unless a lot of them are hit,
these rules won't significantly change the spam/ham rating of an email.

We probably need someone to do a quick eyeball check on submitted rules,
to make sure we don't get a series of rules such as
> body L_A /a/i
> body L_B /b/i
etc. 26 of these would significantly mis-identify emails.

Less extreme, we'd probably want to catch
> body animalporn /(d[o0]g|[EMAIL PROTECTED]|m[o0][uv]se)/i

However, given how much spam is circulating for supposed anti-spam
systems, I wouldn't object to
> body antispamspam /(anti-spam tool|(Filter|Blocking) Spam)/i
It'd match a lot of emails which mention/discuss spam, but with a 0.1
score, I don't have a problem with that.

When the rules are distributed, the rules themselves should be in a
separate file from the scores. Each system or individual would then be
able to rescore any specific rule according to their own needs or
desires.

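The rules/scores split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the (kind, name, definition) tuple layout for submissions, and the 1.5 local override are hypothetical, and none of this is part of SpamAssassin or of any existing submission site. Only the rule names and the 0.1 starting score come from this thread.

```python
DEFAULT_SCORE = 0.1  # low starting score suggested for unreviewed submissions


def render_ruleset(submissions):
    """Build separate rules and scores texts from (kind, name, definition) tuples.

    Keeping the two apart lets a site rescore a rule without editing the rule.
    """
    rule_lines = []
    score_lines = []
    for kind, name, definition in submissions:
        rule_lines.append(f"{kind} {name} {definition}")
        score_lines.append(f"score {name} {DEFAULT_SCORE}")
    return "\n".join(rule_lines) + "\n", "\n".join(score_lines) + "\n"


def parse_scores(*texts):
    """Read 'score NAME VALUE' lines; later texts override earlier ones."""
    scores = {}
    for text in texts:
        for line in text.splitlines():
            parts = line.split()
            if len(parts) == 3 and parts[0] == "score":
                scores[parts[1]] = float(parts[2])
    return scores


def total_score(hits, scores):
    """Sum the scores of the rules that hit, rounded to two decimals."""
    return round(sum(scores[name] for name in hits), 2)


# Two rules taken from this thread, as a submission site might store them.
submissions = [
    ("body", "antispamspam", r"/(anti-spam tool|(Filter|Blocking) Spam)/i"),
    ("header", "L_hr_lattelekom", r"Received =~ /lattelekom\.net/"),
]
rules_text, scores_text = render_ruleset(submissions)

# Hypothetical local override: this site trusts the relay rule more.
local_overrides = "score L_hr_lattelekom 1.5\n"
scores = parse_scores(scores_text, local_overrides)

print(rules_text)
print(scores)
print(total_score(["antispamspam", "L_hr_lattelekom"], scores))
```

With the scores in their own file, a site changes one score line and leaves the shared rules untouched. The same arithmetic shows why the eyeball check matters: 26 single-letter rules at 0.1 each would add 2.6 points to nearly any message.
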
Now, if we add a body of volunteers into this project, who are willing
and able to evaluate the rules both for validity and against a
reasonable corpus, then we could also develop standardized scores for
those rules, still low, but higher than 0.1.

Wednesday, August 6, 2003, 7:14:25 AM, Chris wrote:

CS> I guess we are kind of trying to get rules available to the public
CS> faster than SA versions come out. Look at all the work the devs have
CS> to do for a mass check!

Exactly. We rely and depend upon the distributed rule set, and the
advancements made in that rule set from version to version are
fantastic. But because the SA developers have to pay attention to
quality, and the spammers obviously don't, the spammers move faster than
the SA developers.

I see this user-contributed rule set as a method we can use to tip the
balance in our favor. If Matt receives a spam on Wednesday, submits a
rule for it which my system automatically applies, and I receive that
spam on Thursday, then
a) Maybe that's just enough to tip the spam scale on my system, and I
   never see a false negative.
b) If not, then at least I see the new rule in the spam header, and I
   can determine for myself what score I want that rule to have, without
   having to duplicate Matt's effort in developing the rule.

CS> I guess SA is becoming so big, it kind of needs a rule consortium
CS> now. Again, not that the devs aren't rocking, just to get rules out
CS> between SA versions.

I don't see it so much as SA becoming so big, but rather the user
community is so large and active that we can contribute to each other in
this way, without interfering with the development effort.

Indeed, an occasional GA run against the collected contributed rule set
would be a good way to help the developers determine which of these
rules to add to the distribution rule set.

CS> So....any volunteers? :-) I think we would need around 5-10 people.

Count me in.

RM>>>> header L_s_CorelWPOffice Subject =~
RM>>>> /(?:Corel|WordPerfect).{1,15}Office/i

RM>> As for an ISP, I would think it's still a valid rule; they'd just
RM>> need to be careful to score it low enough to be incremental rather
RM>> than definitional.

YD> No way is that a valid rule for an ISP to use. A good rule looks for
YD> something which only appears in spam, WPOffice probably appears in as
YD> much ham as spam.

Define "good rule". Have you looked at FROM_ENDS_IN_NUMS lately? It
matches 523 ham in my corpus. FROM_NO_LOWER matches more ham than spam
in my corpus. Many of the HTML color rules also match a goodly amount of
ham. Based on the value of these rules, I call them good rules despite
the ham they match.

To me, a "good rule" is a rule whose score is appropriate for the ham
and/or spam matched. That means that a rule which does match only spam,
and that can only match spam, scores high, a rule which matches only ham
scores negative, and a rule which is suggestive of spam or ham provides
a suggestive score which provides judgment only when combined with
several/many other rules.

Therefore, to me, a correctly scored WP Office rule would be a good
rule, while a badly scored WP Office rule would be a bad rule. It's not
the matching that determines good/bad, but the appropriateness of the
score.

(Yes, your addition of a % and/or $ within the rule is a good addition.
Thanks.)

RM>>>> header L_hr_lattelekom Received =~ /lattelekom\.net/

RM>> This was a spam that didn't score from them -- apparently it's too
RM>> new a pathway. This should probably be given a temporary name/flag,
RM>> and removed once the DNSBLs catch up.

YD> Do they need to catch up or for someone to submit it? It won't get
YD> listed if nobody submits it, and if you submit it instead of writing
YD> a rule for it you'll never have to remove that rule if/when it
YD> becomes secure.

Yes, they need to catch up.
To some extent some blacklists can catch up on their own, but mostly the
catch-up depends on submissions.

I agree with your philosophy, and I use that philosophy also with Bayes:
teach the spam to Bayes, and see if it modifies the score enough to trap
it the next time through. If so, then there's probably no need to add a
rule.

However, I don't know how sensitive the blacklists are, nor how long
they take to update. Nor do I know how to submit updates. As an end user
on a virtual hosting server, I believe I don't have the access to submit
updates from the server, and I don't know how to do so from my home
Windows machine. Perhaps that information could be added to one of our
SA resources?

Still, that leaves the question of how long it would take a BL to learn
about a specific relay. A temporary rule is faster.

YD> One point I would like to make about all this rule-writing is
YD> documenting the rules you make, not just date stamping them. A couple
YD> of lines of comments reminding you why you made a rule is always a
YD> good thing, including the line you're matching from the original spam
YD> will help you improve the rule if the spammer morphs.

Good point. I have started documenting my rules to indicate how broadly
each one matches (spam and ham), but not *why* the rule is there. I
figure it's easy for me to rescan my corpus for a questionable rule, and
that will tell me why it's there, and whether it's still of value.
If/when we begin working with a centralized distribution set of
contributed rules, that won't be as simple to do. Thanks.

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0

iQA/AwUBPzHF0pebK8E4qh1HEQKjiACfVCyAb2kGVuqOAUcrwnjxCaIQx+gAn3VW
ZBEUTzBBWMmLh6q1soPNsJ4b
=N68A
-----END PGP SIGNATURE-----

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk