> > would be looked up as "www.stearns.org" or "stearns.org".)
> 
> The parser in the Bayes routine (tokenize_line in Bayes.pm) creates
> 'UD:' lookup tokens for each component of the domain name. So for the
> above example, it would create:
>       UD:www.stearns.org
>       UD:stearns.org
>       UD:org
> 
> Thus the DB would only need to contain one entry for the lowest
> common denominator [1], i.e. stearns.org.


Actually, I'm starting to think it would have to look for '.stearns.org',
with the leading '.'.
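
Something like this sketch is what I have in mind (domain_tokens is a
made-up name; the real tokenize_line in Bayes.pm differs, this just
shows the token walk and where the leading-dot form would fit):

#!/usr/bin/perl
# Generate one lookup token per trailing component of a hostname,
# mimicking the 'UD:' tokens described above.
use strict;
use warnings;

sub domain_tokens {                 # hypothetical helper
    my ($host) = @_;
    my @parts  = split /\./, $host;
    my @tokens;
    while (@parts) {
        push @tokens, 'UD:' . join('.', @parts);
        shift @parts;               # drop the leftmost label each pass
    }
    return @tokens;
}

print "$_\n" for domain_tokens('www.stearns.org');
# UD:www.stearns.org
# UD:stearns.org
# UD:org
#
# A leading-dot DB key would then just be '.' . $domain, so a suffix
# check on '.stearns.org' can't also be hit by 'notstearns.org'.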

> 
> > I suspect doing this with a DB lookup may not be such a win,
> > compared to using a local eval test that parses a config file and
> > creates an in-memory hash table.
> >
> > - --j.
> 
> Au contraire, a DB lookup is a big win compared to a regex match for
> speed/memory consumption. The Bayesian engine does hundreds of lookups
> per message against a database that has tens or hundreds of thousands
> (50k~200k) of entries. Other people on this list have found that with
> regex matches (e.g. 'evilrules'), a set of just a few thousand
> patterns makes a major hit on processor load.

This topic is coming up even on other antispam lists. I think a lot of
people are coming to the realisation that this is the next greatest
thing ;)
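
To make the speed point concrete: the DB side is one constant-time hash
lookup per candidate host, while the regex side rescans the message for
every pattern. A rough sketch (the file name and tie setup are my
assumptions, not an existing SpamAssassin interface):

use strict;
use warnings;
use Fcntl;
use DB_File;

# One O(1) lookup per candidate host...
tie my %baddomains, 'DB_File', '/var/lib/spamassassin/baddomains.db',
    O_RDONLY, 0644, $DB_HASH
    or die "cannot open baddomains.db: $!";

my $host = 'stearns.org';
print "listed\n" if exists $baddomains{$host};
untie %baddomains;

# ...versus one pass over the body per pattern, thousands of times:
#   $body =~ /spamsite1\.example/i;
#   $body =~ /spamsite2\.example/i;
#   ...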

> 
> One of the big advantages of using a DB-type system is that it can be
> updated 'hot' on a running system. A system based upon parsing a
> config file and creating an in-memory hash table would require
> restarting spamd every time an update was made.
> 
> If we want to have any hope of automating such a system, it needs to
> be updatable 'hot' (note how Bayes operates).
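
The 'hot' update could be as simple as an external script writing into
the same DB file spamd reads. A sketch only (the path is made up, and
real code would need locking around the write, as Bayes does):

use strict;
use warnings;
use Fcntl;
use DB_File;

tie my %baddomains, 'DB_File', '/var/lib/spamassassin/baddomains.db',
    O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "cannot open baddomains.db: $!";

$baddomains{'stearns.org'} = time();   # note when the entry was added
untie %baddomains;                     # flush; spamd sees it next lookup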

Again, am I the only one who thinks this should operate like an AWL? If a
message hits a spam threshold, then parse it for URLs. Match those against
a good-URL list so it can't be poisoned. Then adjust the score. You've got
yourself an ABL.
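
Roughly, in code (every name here, parse_urls, %good, the score values,
is made up for illustration):

use strict;
use warnings;

my %good = map { $_ => 1 } qw(yahoo.com sourceforge.net);  # good-URL list
my %abl;                                                   # learned hosts

my $SPAM_THRESHOLD = 5.0;
my $ABL_SCORE      = 2.0;

sub parse_urls {                 # pull host names out of a message body
    my ($body) = @_;
    return $body =~ m{https?://([\w.-]+)}gi;
}

sub learn_from {                 # called once a message scores as spam
    my ($body, $score) = @_;
    return unless $score >= $SPAM_THRESHOLD;
    for my $host (parse_urls($body)) {
        next if $good{$host};    # the good-URL list blocks poisoning
        $abl{$host}++;
    }
}

sub abl_adjust {                 # score bump for a later message
    my ($body) = @_;
    return (grep { $abl{$_} } parse_urls($body)) ? $ABL_SCORE : 0;
}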

> 
> Yes, you are right in that a URI DB cannot use regular expressions or
> patterns. However, if we're just looking for a 'catcher' for spammer
> sites in URIs, that's probably not necessary. We just want to grab a
> host/site name out of a spam and slam it in there. Ask people such as
> Chris how much time they spent "regex"ing each entry in their
> 'evilrules' set. Speed of update and search are far more important IMHO.

I spend _ZERO_ time doing regex thanks to Yorkshire Dave and his wonderful
reg2rule script! (Where ya been buddy!?)
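
(For anyone who hasn't seen the output, an evilrules-style entry is
just an ordinary SpamAssassin uri rule plus a score; the rule name,
pattern, and score below are made up:)

uri   LOCAL_URI_SPAMSITE   /spamsite\.example/i
score LOCAL_URI_SPAMSITE   3.0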

> 
> I envision this working in a couple of possible ways: either updated
> from a central site (e.g. the rules emporium) via wget/rsync etc., or
> by a local engine that would use some kind of heuristics on suspect
> host names found in potential spam (do DNS lookups, use IPs that
> point to spammer nets, look at 'whois' data for spammer hosting, look
> at DNS TTLs, etc).
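
One of those heuristics, the DNS TTL check, might look something like
this (needs the Net::DNS module; the 300-second cutoff is an arbitrary
assumption):

use strict;
use warnings;
use Net::DNS;

sub low_ttl {                    # flag suspiciously short-lived A records
    my ($host) = @_;
    my $res    = Net::DNS::Resolver->new;
    my $packet = $res->query($host, 'A') or return 0;   # no answer
    for my $rr ($packet->answer) {
        next unless $rr->type eq 'A';
        return 1 if $rr->ttl < 300;   # the short-TTL signal noted above
    }
    return 0;
}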

I think each machine should handle its own. However, it would be nice
to be able to import files from others into your own DB.
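
Importing could be as simple as merging a one-host-per-line text file
into the local DB (both file names below are made up):

use strict;
use warnings;
use Fcntl;
use DB_File;

tie my %baddomains, 'DB_File', 'baddomains.db',
    O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "cannot open baddomains.db: $!";

open my $in, '<', 'imported-hosts.txt' or die "open: $!";
while (my $host = <$in>) {
    chomp $host;
    next unless length $host;
    $baddomains{lc $host} = time();   # normalize case, note import time
}
close $in;
untie %baddomains;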

> 
> Part of my motivation is a local "competition".

I'm motivated by my poor grandfather on dialup having to DL 300+ spams a
day. 

*snip*

While we're on the subject, I think a new evilrules update is out today :)

--Chris

