Charles Gregory wrote:
>
> Hi!
>
> I suggested this once before, and did not see any response.
> Many rules that I see suggested on this list share the characteristic
> of being a good test against e-mail that contains a large number of
> occurrences (a high 'count') of a particular 'trick' or 'obfuscation'.
> BUT these rules have to be scored very LOW because sometimes legitimate
> mail contains one or two occurrences of the same text/string.
>
> For example, someone might include a legitimate acronym, such as
> I.B.M. or I.B.E.W., and this would trigger a rule that checks for a single
> occurrence of 'period obfuscated text'. But if we were able to check the
> COUNT of how many times a particular rule was matched, we could easily
> distinguish runaway use of obfuscation.
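
To make the difference between "did the rule hit" and "how many times did it hit" concrete, here is a minimal Python sketch. The pattern PERIOD_SPACED and both sample messages are invented for illustration; this is not an existing SpamAssassin rule or API, only one way the counting idea might look.

```python
import re

# Hypothetical stand-in for a 'period obfuscation' test: three or more
# single letters, each followed by a period (e.g. "I.B.M.").
# This is an assumption for illustration, not a regex from the thread.
PERIOD_SPACED = re.compile(r"\b(?:[A-Za-z]\.){3,}")


def first_match_hit(text: str) -> bool:
    """Stop-at-first-match behaviour: only reports whether the rule fired."""
    return PERIOD_SPACED.search(text) is not None


def count_hits(text: str) -> int:
    """Keep scanning and count every non-overlapping match."""
    return sum(1 for _ in PERIOD_SPACED.finditer(text))


# Made-up sample messages.
ham = "Our local of the I.B.E.W. meets on Tuesday."
spam = "B.u.y. c.h.e.a.p. m.e.d.s. n.o.w. f.r.e.e. s.h.i.p.p.i.n.g."

# Both messages trigger the yes/no test, so it cannot tell them apart.
both_hit = first_match_hit(ham) and first_match_hit(spam)

# The counts differ: one acronym in the ham, six obfuscated words in the spam.
ham_count = count_hits(ham)
spam_count = count_hits(spam)
```

With a plain yes/no test both messages look the same; only the count separates the single acronym from the run of obfuscated words.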

It is an interesting idea. It is analysis of the analysis, or meta-analysis.
It strikes me as a bit problematic because of the infinite regress it
implies -- generally speaking, how do you determine when to stop analyzing
your analysis? -- but used with discretion, I would think it possibly
worthwhile.

Bryan

> Now, if the current rule-checking logic has been optimized to stop after
> it finds a successful match, then we would need an extra parameter to
> tell the test to keep going and count all occurrences. Then, we would need
> a parameter on the 'score' line to work with those counts.
> Here would be a coding example, based on Jennifer's period checker:
>
> body LOC_PERIODS count /\s[a-zA-Z]{9}\.[a-zA-Z]{1}[ ,'\?!]/i
> describe LOC_PERIODS Too many words with period spacing
> score LOC_PERIODS 5:0.5,10:1.2
>
> Meaning in this case, score 0.5 for a count of 5 or higher, and 1.2 for a
> count of 10 or higher. As per other scoring lines, you could have
> up to four space separated groups of scores.
>
> Note that we do not want to use a straight *multiplier* as there will be
> cases where we want to have no score until a certain minimum threshold is
> reached. In the above example, up to 4 instances of period spaced words
> would score nothing at all....
>
> In terms of program logic, the main change would be:
> - recognizing the 'count' parameter on the rule and accumulating the
>   count, as well as ensuring that testing doesn't stop on the first match.
> - on the scoring, recognizing the 'x:y' pairs as being count related.
> - A simple error condition check for:
>   - count-style scoring (x:y) for a rule that didn't use the 'count'
>     option.
>   - normal style scoring (x) for a rule that used the 'count' option.
>
> So, how's that grab people? This would be a fundamental change, affecting
> the basic behaviour of every test except for the 'evals' - and even then
> with clever coding it might be applied to those. But I don't think it
> would be a lot of code.
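
The proposed 'x:y' score syntax and the two error checks could be sketched roughly as follows, in Python. The function names (parse_count_score, score_for_count, check_rule) are hypothetical and do not correspond to anything in SpamAssassin. The sketch handles a single group of scores only, not the four space-separated groups mentioned in the proposal.

```python
from typing import List, Tuple

Tiers = List[Tuple[int, float]]


def parse_count_score(spec: str) -> Tiers:
    """Parse a spec such as '5:0.5,10:1.2' into sorted (threshold, score) pairs."""
    tiers = []
    for pair in spec.split(","):
        threshold, sep, score = pair.partition(":")
        if not sep:
            raise ValueError(f"expected 'count:score', got {pair!r}")
        tiers.append((int(threshold), float(score)))
    return sorted(tiers)


def score_for_count(tiers: Tiers, count: int) -> float:
    """Return the score of the highest threshold reached, or 0.0 below the lowest."""
    score = 0.0
    for threshold, value in tiers:
        if count >= threshold:
            score = value
    return score


def check_rule(uses_count: bool, score_spec: str) -> None:
    """Reject the two mismatches named in the proposal."""
    count_style = ":" in score_spec
    if count_style and not uses_count:
        raise ValueError("count-style score (x:y) on a rule without 'count'")
    if uses_count and not count_style:
        raise ValueError("normal score (x) on a rule that uses 'count'")


# The score line from the proposal: 0.5 at 5 or more hits, 1.2 at 10 or more.
tiers = parse_count_score("5:0.5,10:1.2")
scores = [score_for_count(tiers, n) for n in (4, 5, 9, 10, 25)]
# scores == [0.0, 0.5, 0.5, 1.2, 1.2]
```

A step function like this, rather than a multiplier, is what keeps the first four matches at zero.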
> It would probably take longer to document the new
> usage.... :-)
>
> - Charles

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk