Hi,

> On Sat, 11 May 2002, Mail Admin wrote:
> > Hi, I want to use SpamAssassin on a system under really heavy
> > load: I have 540,000 incoming emails daily. I know spamc/spamd do
> > well under moderate load, but this is not enough. Did anybody think
> > of rewriting SpamAssassin in C,
> 
> Yup. It's been suggested here before and, in fact, someone said that
> they have done so.

That was me. I started on this and worked on it for a few weeks.
I used PCRE (the Perl-compatible regexp library) to match the rules. It was
really fast already, but I got about a 30x speedup by adding pre-match strings.
For example, there are long, complex regexps that take really long to
match against a medium-sized mail message, but most of them contain some
word that is required for a match. So I pre-test each regexp by doing a
strstr() for that word, and run the regexp only if the word was found.
People here said the gain may just be down to PCRE being slow. Maybe, but it
helped a lot.
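
A minimal sketch of the pre-match idea described above. It is not the code
from the snapshot: it uses POSIX regcomp()/regexec() instead of PCRE so that
it builds with libc alone, and struct rule, rule_init(), rule_matches() and
rule_free() are hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule record: a compiled regexp plus an optional literal
 * word that every match must contain. Not a SpamAssassin or PCRE type. */
struct rule {
    const char *required;   /* literal pre-match string, or NULL if none */
    regex_t     re;         /* compiled by regcomp() in rule_init() */
};

/* Compile a rule. Returns 0 on success, nonzero on a bad pattern.
 * The regexp is compiled case-sensitive on purpose: strstr() is
 * case-sensitive too, so the pre-test can never reject a message
 * that the regexp would have matched. */
static int rule_init(struct rule *r, const char *pattern, const char *required)
{
    assert(r != NULL && pattern != NULL);
    r->required = required;
    return regcomp(&r->re, pattern, REG_EXTENDED | REG_NOSUB);
}

/* Returns 1 if the rule matches msg, 0 otherwise. */
static int rule_matches(const struct rule *r, const char *msg)
{
    /* Cheap pre-test: skip the regexp when the required word is absent. */
    if (r->required != NULL && strstr(msg, r->required) == NULL)
        return 0;
    /* Expensive test: only reached when the word was found. */
    return regexec(&r->re, msg, 0, NULL, 0) == 0;
}

static void rule_free(struct rule *r)
{
    regfree(&r->re);
}
```

With the pattern "free +(money|cash) +now" and the required word "free", a
message without "free" is rejected by strstr() alone; the regexp only runs
on messages that contain the word.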

I didn't implement the eval() tests (they would all need to be rewritten in pure C).

I didn't implement the phrase (word pair) checks, but I planned to.
They could be done very fast with a hashed search over the text and a
counter matrix that counts word-pair matches and single-word hits at the
same time (many of the normal tests are just a single-word match).
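
Since this part was never written, the following is only one way the plan
could look, under my own assumptions: a small fixed vocabulary, an
open-addressing hash table mapping each word to an id, and "word pair"
meaning two vocabulary words directly next to each other. All names
(struct pair_counter, pc_init, pc_add_word, pc_lookup, pc_scan) and all
sizes are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORDS  16   /* ids run 1..MAX_WORDS-1; id 0 means "not in vocabulary" */
#define TABLE_SIZE 64   /* hash slots: a power of two, larger than the vocabulary */
#define MAX_TOKEN  32   /* longer tokens are treated as unknown words */

struct pair_counter {
    const char *slot_word[TABLE_SIZE];      /* open-addressing hash table */
    int         slot_id[TABLE_SIZE];
    int         nwords;
    unsigned    single[MAX_WORDS];          /* single-word hit counts */
    unsigned    pair[MAX_WORDS][MAX_WORDS]; /* pair[a][b]: b directly after a */
};

static unsigned hash_str(const char *s)
{
    unsigned h = 5381;                      /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static void pc_init(struct pair_counter *pc)
{
    memset(pc, 0, sizeof *pc);
}

/* Returns the id of word, or 0 if it is not in the vocabulary. */
static int pc_lookup(const struct pair_counter *pc, const char *word)
{
    unsigned i = hash_str(word) & (TABLE_SIZE - 1);
    while (pc->slot_word[i] != NULL) {
        if (strcmp(pc->slot_word[i], word) == 0)
            return pc->slot_id[i];
        i = (i + 1) & (TABLE_SIZE - 1);     /* linear probing */
    }
    return 0;
}

/* Registers a lowercase word (the pointer is stored, not copied).
 * Returns its id, or 0 if the vocabulary is full. */
static int pc_add_word(struct pair_counter *pc, const char *word)
{
    unsigned i = hash_str(word) & (TABLE_SIZE - 1);
    while (pc->slot_word[i] != NULL) {
        if (strcmp(pc->slot_word[i], word) == 0)
            return pc->slot_id[i];
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    if (pc->nwords >= MAX_WORDS - 1)
        return 0;
    pc->slot_word[i] = word;
    pc->slot_id[i] = ++pc->nwords;
    return pc->slot_id[i];
}

/* One pass over text: counts single-word hits and adjacent-pair hits. */
static void pc_scan(struct pair_counter *pc, const char *text)
{
    char tok[MAX_TOKEN];
    size_t len = 0;
    int prev = 0;                           /* id of the previous word, 0 if unknown */
    const char *p = text;

    for (;;) {
        unsigned char c = (unsigned char)*p;
        if (isalpha(c)) {
            if (len < MAX_TOKEN - 1)
                tok[len] = (char)tolower(c);
            len++;
        } else {
            if (len > 0) {                  /* a word just ended */
                int id = 0;
                if (len < MAX_TOKEN) {
                    tok[len] = '\0';
                    id = pc_lookup(pc, tok);
                }
                if (id != 0) {
                    pc->single[id]++;
                    if (prev != 0)
                        pc->pair[prev][id]++;
                }
                prev = id;
                len = 0;
            }
            if (c == '\0')
                break;
        }
        p++;
    }
}
```

After pc_scan(), a phrase rule reduces to reading pair[a][b] and a
single-word rule to reading single[a], so the message text is walked only
once no matter how many such rules exist.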

> * the Perl regexp engine running the rules.
> * the need to walk a message larger than L1 cache more than once.
> * forking, forking, forking.
> 
> For the first, *nothing* that you do is likely to improve things much
> other than rewriting the rules themselves; this can be done equally well
> with Perl.
Agreed. Still, adding pre-match rules can speed things up a lot:
test regexp X _only_ if another (much simpler) regexp Y matched.
I've found many places where matching regexp X took 50 times longer than
matching regexp Y.

> > If nothing has been done on that, I am ready to help with such a project.
> > If anybody is interested, please mail me and let's start.
> 
> Check the mailing list archives.

I'm interested, but I stopped my work for two reasons:
- time (I was busy with commercial work and work on MPlayer)
- the SA guys said regexp pre-matching will be added to the Perl
  version, so I put off rewriting the regexp ruleset.

> Every time you fork a process, you pay a huge cost. Avoiding that would
> improve your throughput dramatically. Using spam[cd] you pay at least

Sorry, I disagree.
SpamAssassin (even as the spamc/spamd pair) is very limited: <10 mails/sec on
a 1.8 GHz P4. IMHO the overhead of fork is negligible next to this slowness.

I've just uploaded the current snapshot of my version to:
ftp://ftp.mplayerhq.hu/spamassassin-c_0.2.tar.gz

It is not usable in production yet: it is full of timers and debug code
for testing and benchmarking.

Unfortunately, I'm on my own with this project here...
(Yes, I agree that Perl is a useful thing for text processing, but that no
longer holds when high performance matters. Then asm+C kicks in.)


A'rpi / Astral & ESP-team

--
Developer of MPlayer, the Movie Player for Linux - http://www.MPlayerHQ.hu

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
