-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Yorkshire,

Thursday, August 7, 2003, 9:41:16 AM, you wrote:

YD> Hi, Excuse me if I contradict myself or say anything silly here, I'm
YD> thinking out loud as much as anything :)

I think that's what we're all doing. And I'm not going to apologize for
any of it. Just let me know if I say anything silly, so I can laugh
along with everyone else. Thanks.

YD>>> How do you verify / trust a submission? How do you know that a rule
YD>>> is right?

RM>> Initially, you don't. I'm looking at this much like the Wiki at
RM>> http://www.exit0.us/ -- Anyone can contribute to that web site,
RM>> whether they know what they're doing or not. The only policing is
RM>> done by people who decide they can improve on what they see.

YD> I wasn't thinking of it that way at all, I was thinking of it more
YD> like a blacklist based around URLs found in spam, where anyone can
YD> forward spam for processing. I guess they're both good ideas though :)

Agreed.

YD> I'm going to try and get some of my ideas down on paper, I think I
YD> have enough separate pieces of an idea now to form a whole idea.

Good.

YD>>> You can't verify them all by hand, spammers can register domains
YD>>> between them faster than a person can make rules.

RM>> Quick answer: I'm not interested in verification. That's why I
RM>> suggested the scores be real low on initial submission (0.1). Rules
RM>> like that will match and appear in the SA headers, but unless a lot
RM>> of them are hit, these rules won't significantly change the spam/ham
RM>> rating of an email.

RM>> We probably need someone to do a quick eyeball check ...

RM>> When the rules are distributed, the rules themselves should be in a
RM>> separate file from the scores. ...
RM>> Now, if we add a body of volunteers into this project, who are
RM>> willing and able to evaluate the rules both for validity and against
RM>> a reasonable corpus, then we could also develop standardized scores
RM>> for those rules, still low, but higher than 0.1

YD> You're causing ideas here :)

YD> We could make more than one score file available for the same set of
YD> rules, let people have the rules at any score they choose.

I like that!

(a) One set of files for the rules: a file for each class of rule (From,
    Subject, Body, URI), with some files for large groups of rules
    (scam, pirate, porn, etc). Then
(b) One scoring file for each of a variety of different audiences and
    philosophies (an ISP score set, Int'l Business score set, family
    score set, conservative score set, aggressive score set, etc).

YD> Or even generate rule lists on a per-user basis from a database, let
YD> users create their own rule and score file by selecting rules or
YD> groups of rules from the database, start the rules off with a low but
YD> significant score, let them re-score individual rules to suit
YD> themselves. Give users a web interface and let them switch off the
YD> rules they don't want (I'm thinking built-in user feedback, if enough
YD> people down-score or switch off a rule then it needs re-examining)

Do I understand you right? Would this be the type of thing you're
thinking of?

1) A database which stores each rule's name, and a brief summary of or
   intro to it, identified by a unique rule number.
2) Tables in that database which identify each registered user, with
   password, user type (audience as above), and scores for each
   identified rule.
3) Web form(s) which display the brief rule info, current score (default
   if none yet specified), and allow the user to change their scores,
   those changes to be recorded in the database.
4) A Transmit button, which builds the rules files (excluding those
   rules the user has turned off), builds the scoring file (from the
   user's database scoring information), and packages them up as a *.gz
   or *.zip file for email or download or ftp or wget or other
   transmission?

RM>>>>>> header L_s_CorelWPOffice Subject =~
RM>>>>>> /(?:Corel|WordPerfect).{1,15}Office/i

RM>>>> As for an ISP, I would think it's still a valid rule; ...

YD>>> No way is that a valid rule for an ISP to use. A good rule looks
YD>>> for something which only appears in spam, WPOffice probably appears
YD>>> in as much ham as spam.

RM>> Define "good rule". Have you looked at FROM_ENDS_IN_NUMS lately? It
RM>> matches 523 ham in my corpus. FROM_NO_LOWER matches more ham than
RM>> spam in my corpus. Many of the HTML color rules also match a goodly
RM>> amount of ham. Based on the value of these rules, I call them good
RM>> rules despite the ham they match.

YD> FROM_NO_LOWER scores 0 here. I'm still getting mileage out of
YD> FROM_ENDS_IN_NUMS but it's not what I'd call a good rule either.

I lowered FROM_NO_LOWER from 1.6 to 1.0 on June 23; it's doing well for
me at that score. FROM_ENDS_IN_NUMS was also reduced from 0.6 to 0.45.

YD> The only reason rules like that still exist in sa is because not
YD> enough people join in the mass checks, or the technically adept
YD> people who do join in the mass checks are more likely to exchange
YD> email with other technically adept people, who are more likely to
YD> have their own domain name and be [EMAIL PROTECTED] and are therefore less
YD> likely to be [EMAIL PROTECTED]

I disagree. Well, I agree that it'd be better if more people joined in
the mass checks. But I disagree that's why these rules stick around.
They stick around because they remain suggestive of spam (though not
definitive, which is why they get low scores).

RM>> To me, a "good rule" is a rule whose score is appropriate for the ham
RM>> and/or spam matched.
RM>> That means that a rule which does match only spam,
RM>> and that can only match spam, scores high, a rule which matches only ham
RM>> scores negative, and a rule which is suggestive of spam or ham provides a
RM>> suggestive score which provides judgment only when combined with
RM>> several/many other rules.

YD> The better the rule, the less ham it matches, and vice versa. Taking a
YD> common word or phrase and capturing with it something which makes it
YD> spammy is what makes a good rule.

Fortunately there's enough room for us to agree to disagree here. While
I'll certainly take advantage of some specific rules and score them high
because they match ONLY spam, the great majority of my rules are
suggestive rather than definitive.

Chris suggests on his page at
http://www.merchantsoverseas.com/wwwroot/gorilla/sa_rules.htm

C> ) Scoring is an art! But a general idea is to write many small rules
C> and score them low. Let 10 rules add up to 2 points rather then 1 rule
C> with all 10 things in it scoring 2 points. This will help minimize
C> false positives.

and I generally agree with that philosophy.

I had a spam this morning that hit 8.7 of 9.0 using default rules
(including Bayes-80). It also matched one 0.9 rule of mine, which pushed
it over the threshold, and so I'd never have seen it if I hadn't gone
looking for this type of example. And that's the way I like SA to work
-- With my 9.0 threshold, a 9.001 score matching a dozen rules is as
good as a 9.1 score matching but a single rule.

YD> I'm trying to produce only ISP grade rules I guess. ...

So you have a lot fewer rules in your rule set than I do in mine, I
guess. :-)

RM>> Therefore, to me, a correctly scored WP Office rule would be a good
RM>> rule, while a badly scored WP Office rule would be a bad rule. It's
RM>> not the matching that determines good/bad, but the appropriateness
RM>> of the score.

YD> A correct score for such a loose rule would vary so widely that it
YD> would become useless.
YD> Given enough time I can probably find someone
YD> who would score that at about -5.

It takes that long to find someone working at Corel? They *are* in
trouble then!!

YD> You're writing a rule for the brand name of a mainstream product which
YD> probably has millions of users worldwide, doesn't that ring alarm bells
YD> somewhere? :)

Nope. No more than does

> body RM_bp_DontWait /Don't wait/i
> describe RM_bp_DontWait Body says don't wait
> score RM_bp_DontWait 0.1 # 211 Spam, 21 Ham, Aug 7, 2003

RM>>>>>> header L_hr_lattelekom Received =~ /lattelekom\.net/

RM>> ... However, I don't know how sensitive the blacklists are, nor how
RM>> long they take to update.

YD> I wasn't sure either, so I emailed DSBL and asked. The answer I got
YD> back was more complex than I expected :) ...
YD> So, a worst+worst case is 26 minutes with the normal case being below
YD> the 16 minutes.
YD> I'd say that's fast enough :)

However, I can't run the software required to submit to DSBL.

RM>> Nor do I know how to submit updates. As an end user on a virtual
RM>> hosting server, I believe I don't have the access to submit updates
RM>> from the server, and I don't know how to do so from my home Windows
RM>> machine. Perhaps that information could be added to one of our SA
RM>> resources?

YD> See the website for the blocklist you want to submit to. ...

I checked DSBL, and while as you claim,

YD> DSBL is easy though, just relay a message through the open relay to
YD> [EMAIL PROTECTED]

I don't have the ability to do that from my POP3 email client, at least
not that I was able to find from their web site.

YD> If you're using blocklists at all you should really find out a little
YD> about them, research their listing criteria, ensure you agree with
YD> their policies and methods, that way you'll never find reason to
YD> complain about them :)

I hear you, but there's only so much time available to do everything
that should be done.
The benefit of participating in a community like this one is that others
do some of the work (eg: the SA developers have already picked a set of
reasonable network tests), and so I don't have to worry about that side
of the anti-spam activity. I can therefore put time into those things
I'm more skilled at.

YD> IMHO anything centralised could really do with a sample spam archived
YD> for each rule. proof, evidence, and useful reference all rolled into
YD> one. Include sequential numbers in the rulenames and cross reference to
YD> a text evidence file for each, or something like that, so nobody can
YD> ever have any doubts as to what a rule is for, why it exists, and that
YD> it relates to real spam.

Agreed. So any submission should include
(a) the rule itself (body, header, URI, etc),
(b) a generic name (to which we'll prefix a unique identifier),
(c) one-line description,
(d) Additional description as appropriate,
(e) complete spam with all headers.

(a) through (d) can easily be stored in the database, while (e) would be
best kept as an isolated file identified by the sequentially assigned
rule number.

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0

iQA/AwUBPzMnI5ebK8E4qh1HEQLkjQCfW8XZhJjv/OrIno2Ir4gBhFwOwyoAnjGA
NXXWq+MYCy9qYE/Z5t0dChnA
=hWkw
-----END PGP SIGNATURE-----

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
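
P.S. To make the database idea from steps 1) through 4) above a little
more concrete, here is a minimal Python sketch, not a worked-out design.
It assumes an SQLite database, and every table name, column name, the
user name "yd", and the one-line summary for the Corel rule are
hypothetical and made up for illustration. Passwords, user types, the
web form, and the *.gz / *.zip packaging are left out; the sketch only
shows how a per-user rules file and a separate score file could be
built, with switched-off rules excluded and the 0.1 default used when a
user has not set a score.

```python
import sqlite3

# Hypothetical schema: names are illustrative only, not from the thread.
SCHEMA = """
CREATE TABLE rules (
    rule_id       INTEGER PRIMARY KEY,      -- unique rule number
    name          TEXT NOT NULL,            -- e.g. RM_bp_DontWait
    definition    TEXT NOT NULL,            -- the rule line itself
    summary       TEXT NOT NULL,            -- one-line description
    default_score REAL NOT NULL DEFAULT 0.1 -- low score on submission
);
CREATE TABLE user_scores (
    user_name TEXT NOT NULL,
    rule_id   INTEGER NOT NULL REFERENCES rules(rule_id),
    score     REAL,                         -- NULL means "use the default"
    enabled   INTEGER NOT NULL DEFAULT 1,   -- 0 means switched off
    PRIMARY KEY (user_name, rule_id)
);
"""


def build_files(db, user_name):
    """Return (rules_text, scores_text) for one user, skipping disabled rules."""
    rows = db.execute(
        """
        SELECT r.name, r.definition, r.summary,
               COALESCE(u.score, r.default_score)
        FROM rules r
        LEFT JOIN user_scores u
               ON u.rule_id = r.rule_id AND u.user_name = ?
        WHERE COALESCE(u.enabled, 1) = 1
        ORDER BY r.rule_id
        """,
        (user_name,),
    ).fetchall()
    rules_lines, score_lines = [], []
    for name, definition, summary, score in rows:
        # Rules and descriptions go in one file, scores in another.
        rules_lines.append(definition)
        rules_lines.append(f"describe {name} {summary}")
        score_lines.append(f"score {name} {score:g}")
    return "\n".join(rules_lines) + "\n", "\n".join(score_lines) + "\n"


def demo():
    """Build files for a user with custom settings and for one without."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO rules (rule_id, name, definition, summary) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "RM_bp_DontWait",
             "body RM_bp_DontWait /Don't wait/i",
             "Body says don't wait"),
            (2, "L_s_CorelWPOffice",
             "header L_s_CorelWPOffice Subject =~ "
             "/(?:Corel|WordPerfect).{1,15}Office/i",
             "Subject mentions Corel or WordPerfect Office"),  # made-up summary
        ],
    )
    # Hypothetical user "yd" re-scores rule 1 and switches rule 2 off.
    db.executemany(
        "INSERT INTO user_scores (user_name, rule_id, score, enabled) "
        "VALUES (?, ?, ?, ?)",
        [("yd", 1, 0.3, 1), ("yd", 2, None, 0)],
    )
    return build_files(db, "yd"), build_files(db, "anyone_else")
```

Calling demo() builds both pairs of files in memory. The "enabled"
column is also where the feedback idea could hook in: counting how many
users have switched a rule off would flag rules that need re-examining.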