> > would be looked up as "www.stearns.org" or "stearns.org".)
>
> The parser in the Bayes routine (tokenize_line in Bayes.pm) creates
> 'UD:' lookup tokens for each component of the domain name. So for
> the above example, it would create:
>
> UD:www.stearns.org
> UD:stearns.org
> UD:org
>
> Thus the DB would only need to contain one entry for the lowest
> common denominator [1], i.e. stearns.org.
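To make that concrete, the splitting looks roughly like this. (A
minimal Perl sketch of the idea only -- this is not the actual
tokenize_line code from Bayes.pm, which does a lot more.)

    #!/usr/bin/perl -w
    use strict;

    # Build 'UD:' style tokens by peeling labels off the left of a
    # hostname one at a time. Illustrative only, not Bayes.pm itself.
    sub ud_tokens {
        my ($host) = @_;
        my @parts = split /\./, $host;
        my @tokens;
        while (@parts) {
            push @tokens, 'UD:' . join('.', @parts);
            shift @parts;               # drop the leftmost label
        }
        return @tokens;
    }

    print "$_\n" for ud_tokens('www.stearns.org');
    # UD:www.stearns.org
    # UD:stearns.org
    # UD:org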
Actually I'm starting to think it would have to look for
'.stearns.org' with the leading '.' (the lookup sketch at the end of
this message handles both forms).

> > I suspect doing this with a DB lookup may not be such a win,
> > compared to using a local eval test that parses a config file and
> > creates an in-memory hash table.
> >
> > --j.
>
> Au contraire, a DB lookup is a big win compared to a regex match for
> speed/memory consumption. The Bayesian engine does hundreds of
> lookups per message against a database that has tens (or hundreds)
> of thousands of entries (50k~200k). Other people on this list have
> found that using regex matches (EG 'evilrules'), a set of just a few
> thousand patterns makes a major hit in processor load. This topic is
> even on other antispam lists.

I think a lot of people are coming to the realisation that this is
the next greatest thing ;)

> One of the big advantages of using a DB type system is that it can
> be updated 'hot' on a running system. A system based upon parsing a
> config file and creating an in-memory hash table would require
> restarting spamd every time an update was made.
>
> If we want to have any hope of automating such a system, it needs to
> be updatable 'hot' (note how Bayes operates).

Again, am I the only one that thinks this should operate like an AWL?
If it hits a spam threshold, then parse it for URLs. Match those
against a good-URL list so it can't be poisoned. Then adjust the
score. You got yourself an ABL. (Rough sketch of that flow at the end
of this message too.)

> Yes, you are right in that a URI DB cannot use regular expressions
> or patterns. However, if we're just looking for a 'catcher' for
> spammer sites in URIs, that's probably not necessary. We just want
> to grab a host/site name out of a spam and slam it in there. Ask
> people such as Chris how much time he spent "regex"ing each entry in
> his 'evilrules' set. Speed of update and search are far more
> important IMHO.

I spend _ZERO_ time doing regex thanks to Yorkshire Dave and his
wonderful reg2rule script! (Where ya been buddy!?)

> I envision this working in a couple of possible ways: either updated
> from a central site (EG the rules emporium) via wget/rsync etc., or
> by a local engine that would use some kind of heuristics on suspect
> host names found in potential spam (do DNS lookups, use IPs that
> point to spammer nets, look at 'whois' data for spammer hosting,
> look at DNS TTLs, etc.).

I think each machine should handle their own. However, it would be
nice to be able to import files from others into your own DB. (A toy
TTL heuristic is sketched below as well.)

> Part of my motivation is a local "competition".

I'm motivated by my poor grandfather on dialup having to DL 300+
spams a day.

*snip*

While we're on the subject, new evilrules update out today I think :)

--Chris
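P.S. Since the thread keeps circling the lookup mechanics, here's a
rough sketch of the DB side against a tied DB_File hash. The filename
'uribl.db' and the two-label cutoff are my own inventions for the
example; a real implementation would presumably live in an eval test.

    #!/usr/bin/perl -w
    use strict;
    use Fcntl;
    use DB_File;

    # Open the (hypothetical) URI DB read-only. Because it's a real DB
    # file on disk, another process can update it 'hot' -- no spamd
    # restart needed, which was the whole point above.
    tie my %uridb, 'DB_File', 'uribl.db', O_RDONLY, 0644, $DB_HASH
        or die "can't open uribl.db: $!";

    # Walk the UD:-style components of a host, checking both the bare
    # domain and the leading-dot form mentioned earlier.
    sub uridb_hit {
        my ($host) = @_;
        my @parts = split /\./, $host;
        while (@parts >= 2) {       # no point looking up bare TLDs
            my $dom = join '.', @parts;
            return $dom if exists $uridb{$dom} || exists $uridb{".$dom"};
            shift @parts;
        }
        return undef;
    }

    if (my $hit = uridb_hit('www.stearns.org')) {
        print "URI DB hit on $hit\n";
    }

Each exists() check is a single hash probe into the DB, so even a
couple of lookups per URL stays cheap next to running thousands of
regexes over the body.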
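And since I keep banging the AWL drum, here's the ABL flow in a dozen
lines. Everything in it (the goodlist and abl hashes, the threshold,
the crude URL regex) is made up just to show the shape of it:

    # Once a message already scores past the spam threshold, harvest
    # the hosts out of its URLs and learn them -- unless they're on a
    # local good-URL list, so the DB can't be poisoned.
    sub feed_abl {
        my ($score, $threshold, $body, $goodlist, $abl) = @_;
        return if $score < $threshold;      # only learn from spam
        while ($body =~ m{https?://([^/\s"'>]+)}gi) {
            my $host = lc $1;
            $host =~ s/:\d+$//;             # strip any :port
            next if $goodlist->{$host};     # never learn good sites
            $abl->{$host}++;                # bump the hit count
        }
    }

Scoring a later message is then just a lookup against that hash, same
as the DB_File sketch above.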
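Lastly, for the 'local engine' idea: of the heuristics listed, the
DNS TTL check is the cheapest to sketch. A toy version with Net::DNS
might look like this -- the 300-second cutoff is a number I pulled
out of the air:

    use Net::DNS;

    my $res = Net::DNS::Resolver->new;

    # Flag hosts whose A records carry a very short TTL; throwaway
    # spam hosting tends to keep TTLs low so it can move around fast.
    sub low_ttl {
        my ($host) = @_;
        my $reply = $res->query($host, 'A') or return 0;
        for my $rr ($reply->answer) {
            next unless $rr->type eq 'A';
            return 1 if $rr->ttl < 300;     # under 5 min: suspicious
        }
        return 0;
    }

No single heuristic like this should add a host by itself, of course;
it would just be one more point of evidence alongside the whois and
netblock checks mentioned above.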