On Tue, 2002-02-19 at 03:56, Arpi wrote:
> so, my primary goal: make a small but very fast, efficient version to be
> used on very high traffic mail servers. and, by allowing several instances
> at the same time make possible to profit from SMP.
> (afaik spamd only processes a single mail at the same time)

spamd will fork for every new incoming message, having already precompiled its regular expressions. This means you definitely take advantage of SMP, and by having spamc connect through round-robin DNS or some smart switching hardware, you can even spread the load over many machines.

> for this goal, we'll may use asm optimization, and maybe switch from PCRE to
> libtre (faster but non back-referencing regexp lib).

There are some rules that use backrefs -- the SUSP_RECIPS type rules. I suppose you could just ignore these if speed were the big issue.

> i'll implement very fast code (using hashed tree-based search) for spam
> phrases matching. i found it very usefull, many of the spam mails are
> catched by this.

I haven't looked much at the spam-phrase matching code in the Perl version. I did briefly look at regenerating the spam phrases while doing the mass-check runs to rescore rules, but once again that requires a large non-spam corpus *sigh*

> > > Are you interested in such thing in CVS ?
> >
> > I'd love to see it.
> >
> ok. as soon as i got it in production level (currently it prints lots of
> debug stuff, and just analyzes mail (count score), doesn't edit headers
> and give back filtered mail, anyway it isn't really needed for us).
> yet another question: i've seen in docs some statistics running spamassassin
> on ~40.000 spam mails and similar amount of non-spam.
> can i access this spam collection/database? would be usefull for real-life
> benchmarking. (currently i'm running it on ~1800 spam and ~60000 non-spam
> mails for tests)

We have a corpus of spam messages available. I'll mail you instructions on how to download it in a separate message. We don't have ready access to large non-spam archives, though, because obviously people don't like sharing their private correspondence around :) 60,000 messages is probably a good starting point though.
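
To make the fork-per-message point concrete, here is a rough sketch in Python. It is not spamd's actual code (spamd is Perl). The rule patterns, the scores, and the names `RULES`, `score_message` and `score_in_child` are all made up for illustration. The idea is only that the patterns are compiled once in the parent, and every forked child inherits them without recompiling. The third rule uses a backreference, which is the kind of rule a non-backreferencing regexp library would have to skip.

```python
import os
import re

# Hypothetical rules as (pattern, score), compiled once before any fork.
RULES = [
    (re.compile(r"free money", re.I), 2.5),
    (re.compile(r"click here", re.I), 1.0),
    # Backreference rule: the same word three times in a row.
    (re.compile(r"\b(\w+) \1 \1\b", re.I), 1.5),
]


def score_message(text):
    """Sum the scores of every rule that matches the message text."""
    return sum(score for pattern, score in RULES if pattern.search(text))


def score_in_child(text):
    """Fork a child to score one message and return its result.

    The child inherits the already-compiled RULES from the parent.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: score the message, report through the pipe, then exit
        # immediately so it never falls through into the parent's code.
        try:
            os.close(read_fd)
            os.write(write_fd, str(score_message(text)).encode())
            os.close(write_fd)
        finally:
            os._exit(0)
    # Parent: read the child's answer and reap the child process.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        result = float(pipe.read())
    os.waitpid(pid, 0)
    return result
```

This version waits for each child before returning, so it only shows the inheritance part. A real daemon would fork from its accept loop and let several children run at once, which is where the SMP benefit comes from.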
C

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk