On Tue, 2002-02-19 at 03:56, Arpi wrote:
> so, my primary goal: make a small but very fast, efficient version to be
> used on very high traffic mail servers. and, by allowing several instances
> at the same time, make it possible to profit from SMP.
> (afaik spamd only processes a single mail at a time)

spamd will fork for every new incoming message, having already
precompiled its regular expressions.  This means you definitely take
advantage of SMP, and by having spamc connect through a round-robin DNS
or some smart switching hardware, you can even spread it over many
machines.
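
To make the fork-per-message model concrete, here is a rough sketch in
Python rather than spamd's actual Perl. The rule names, patterns and
scores are made up for illustration, and it assumes a Unix system where
os.fork() is available. The point is only that the patterns are compiled
once in the parent and every forked child inherits them for free.

```python
import os
import re

# Hypothetical rules (name, compiled pattern, score); these are NOT real
# SpamAssassin rules. They are compiled once, before any fork happens.
RULES = [
    ("FREE_MONEY", re.compile(r"free money", re.I), 2.5),
    ("CLICK_HERE", re.compile(r"click here", re.I), 1.0),
]

def score(message):
    """Sum the scores of all rules whose pattern occurs in the message."""
    return sum(pts for _, rx, pts in RULES if rx.search(message))

def score_in_child(message):
    """Fork a child to score one message; the child inherits RULES."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: score the message, report through the pipe, and exit.
        os.close(read_fd)
        os.write(write_fd, repr(score(message)).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: read the child's answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        result = float(pipe.read())
    os.waitpid(pid, 0)
    return result
```

A real daemon would accept connections and keep serving while children
run, instead of waiting on each child as this sketch does, which is what
lets the kernel spread the children over several CPUs.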

> for this goal, we may use asm optimization, and maybe switch from PCRE to
> libtre (faster but non back-referencing regexp lib).

There are some rules that use backrefs -- the SUSP_RECIPS type rules.  I
suppose you could just ignore these if speed were the big issue.
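
To show why such rules need backrefs, here is a hedged illustration in
Python. The pattern below is invented for this mail and is not the actual
SUSP_RECIPS rule; it flags two consecutive recipients whose local parts
start with the same four characters. The \1 has to match whatever the
first group captured, which is exactly what a non-backreferencing engine
cannot express.

```python
import re

# Hypothetical pattern, not an actual SpamAssassin rule: two consecutive
# recipients whose local parts share the same first four characters.
# \1 refers back to the text captured by the first group.
SIMILAR_RECIPS = re.compile(r"\b(\w{4})\w*@\S+,\s*\1\w*@")

def has_similar_recipients(to_header):
    """Return True if two adjacent addresses share a 4-character prefix."""
    return SIMILAR_RECIPS.search(to_header) is not None
```

For example, "john1@example.com, john2@example.com" would be flagged,
while "alice@example.com, bob@example.org" would not.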

> i'll implement very fast code (using hashed tree-based search) for spam
> phrase matching. i found it very useful, many of the spam mails are
> caught by this.

I haven't looked much at the spam-phrase matching code in the Perl
version.  I did briefly look at regenerating the spam phrases while
doing the mass-check stuff to rescore rules, but once again it requires
having a large non-spam corpus *sigh*.
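
For what it's worth, this is how I picture a hashed tree-based phrase
search, as a minimal Python sketch. It is my guess at the idea, not your
code and not what the Perl version does: a word-level trie in which each
node is a dict (a hash table) of child nodes. The function names and the
sample phrases are hypothetical, and it splits on whitespace only, so
punctuation handling is left out.

```python
def build_trie(phrases):
    """Build a word-level trie; each node is a dict (hash) of children."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[None] = phrase  # end-of-phrase marker (None is never a word)
    return root

def find_phrases(trie, text):
    """Return every phrase occurring in text, ordered by start position."""
    words = text.lower().split()
    hits = []
    for start in range(len(words)):
        node = trie
        for word in words[start:]:
            node = node.get(word)
            if node is None:
                break  # no phrase continues with this word
            if None in node:
                hits.append(node[None])
    return hits
```

Each word costs one hash lookup, so the time per message depends on the
message length and the longest phrase rather than on how many phrases
are loaded.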

> > > Are you interested in such thing in CVS ?
> > 
> > I'd love to see it.
> > 
> ok. as soon as i get it to production level (currently it prints lots of
> debug stuff, and just analyzes mail (counts the score); it doesn't edit
> headers or give back filtered mail, but that isn't really needed for us).

> yet another question: i've seen in docs some statistics from running
> spamassassin on ~40.000 spam mails and a similar number of non-spam.
> can i access this spam collection/database? would be useful for real-life
> benchmarking. (currently i'm running it on ~1800 spam and ~60000 non-spam
> mails for tests)

We have a corpus of spam messages available.  I'll mail you instructions
on how to download it in a separate message.  We don't have ready access
to large non-spam archives, though, because obviously people don't like
sharing their private correspondence around :)  60,000 messages is
probably a good starting point anyway.

C

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
