On Sat, 11 May 2002, Craig R. Hughes wrote:
> Daniel Pittman wrote:
> 
> DP> For the first, *nothing* that you do is likely to improve things
> DP> much other than rewriting the rules themselves; this can be done
> DP> equally well with Perl.
> 
> Rule optimization is proceeding. You might find a better/faster regex
> engine, but you'll probably have to re-optimize the rules for that
> engine vs the perl engine. I think we're going to be focussing on
> optimizing the rules for the perl engine, which could improve things
> a lot from where they are now (things like prepending \b's where
> appropriate, etc).

*nod*  Also, things like replacing some of the more costly regexp tests
with things like the new porn words eval. :)
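
A rough sketch of both ideas, in Python rather than Perl purely for
brevity. The word list and function names are made up for illustration,
and this is not a claim about how the actual eval test is written: it
only shows an alternation anchored with \b next to an "eval"-style
test that tokenizes the text once and does set lookups.

```python
import re

# Hypothetical word list, for illustration only.
WORDS = {"viagra", "casino", "lottery"}

# One alternation anchored with \b on both sides, so the engine only
# accepts matches that start and end at a word boundary.
ANCHORED_RULE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in sorted(WORDS)) + r")\b",
    re.IGNORECASE,
)

# Used by the eval-style test to split the text into lowercase words.
TOKEN = re.compile(r"[a-z]+")


def count_hits_regex(text):
    """Count listed words using the single anchored regex."""
    return len(ANCHORED_RULE.findall(text))


def count_hits_eval(text):
    """Eval-style test: tokenize once, then do cheap set lookups."""
    return sum(1 for tok in TOKEN.findall(text.lower()) if tok in WORDS)
```

Both functions skip "casinos" when looking for "casino", which an
unanchored pattern would not.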

> DP> For the second, rewriting SpamAssassin to use a streaming, single
> DP> pass algorithm would be (ahem) challenging. I don't suggest it. :)
> 
> I've been tossing this idea around some. I think it bears thinking
> about, even though it is going to be (ahem) challenging. Maybe not do
> just a single pass, but at least do a lot *fewer* passes.

Sure. I don't envy you the task of working out how to achieve it,
though, without making it easier for SPAM to slip by through having
line-breaks that the MUA gets rid of in there...

*grin*  Good luck. :)

> DP> > If nothing is done in that. I am ready to help in such project. 
> DP> > If anybody is interested please mail me , and lets start.
> DP>
> DP> Check the mailing list archives.
> 
> I'm planning on picking up the C stuff which was mentioned here before
> and taking a look at it. I think the disadvantage of the C code is
> definitely portability, and also flexibility in terms of the non-regex
> rules. You not only have to write the scanner parts of SA, but also
> the EvalTests stuff, and also all the network tests, etc, etc.

It's probably worth pointing out that a good part of why I disparage the
idea of a C rewrite being "better" or "faster" is that Perl is, for this
sort of regexp-driven work, as fast as C in almost every case.

It's only in a few specific things, like some numerical work, that you
can actually tell the difference between calling a regexp library from C
and from the Perl bytecode loop.

It's a highly optimized language, Perl, and doesn't waste cycles really.

[...]

> DP> Every time you fork a process, you pay a huge cost. Avoiding that
> DP> would improve your throughput dramatically. Using spam[cd] you pay
> DP> at least *three* forks per message, at best, and probably closer
> DP> to five.
> 
> Why 3 forks? You'll have to fork spamc, and spamd forks (but probably
> will be a cheap fork cos it's fork only, not fork-and-exec). I count 2
> there.

spamc reads from stdin, then writes to stdout, yes? So, that needs to go
somewhere and for every filtering system I have seen so far that implies
one additional fork for the final delivery.

Well, at least, everything but the SendMail milter or procmail which, as
I understand it, checkpoints to disk instead...

[...]

> DP> That should give you one or, with extra work, less than one fork
> DP> per message. That is the best way to improve your performance.
> 
> Less than one fork per message seems unlikely, unless you get really
> clever. And as far as spamd->spamd child forks are concerned, those
> should be really, really cheap, so avoiding them probably won't gain
> you much (assuming a non-naive OS fork implementation where it's going
> to do copy-on-write for the process memory space).

spamd forking isn't likely to cost much. It's the client side stuff
that's costly, mostly because of the need to reinject the message into
the MTA or delivery system somewhere along the way.

If I was writing a high-performance SpamAssassin client I would have the
SMTP listener process, written in Perl, accept connections from the
outside world on port 25.

It would, before doing any forking, compile all the regexp rules and the
like in the master process, just like the existing spamd does, but store
the mtimes for later use.
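
Something along these lines, sketched in Python. The one-regex-per-line
file format and the names load_rules and rules_changed are assumptions
made for the example; this is not SpamAssassin's real rule format or
spamd's actual code.

```python
import os
import re


def load_rules(path):
    """Compile each non-blank, non-comment line of `path` as a regex.

    Returns (compiled_rules, mtime). Meant to run once in the master
    process, before any child is forked, so children inherit the
    compiled rules.
    """
    mtime = os.stat(path).st_mtime
    rules = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                rules.append(re.compile(line))
    return rules, mtime


def rules_changed(path, loaded_mtime):
    """True if the rule file was modified after it was compiled."""
    return os.stat(path).st_mtime != loaded_mtime
```

The master would call rules_changed now and then and, when it returns
true, recompile and let the old children drain.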

It would then pass that connected socket to one of a pool of pre-forked
children and expect them to deal with it. The child would process the
message and pass it on to the real MTA behind it (the "parent" SMTP
listener), sending the initial sender the same response that it
received from that parent.

That way I would get the reliability of Postfix or SendMail for free,
because I would be tied exactly to their reliability. No need to
checkpoint to disk because I would never ACK the email 'til my parent
had.
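
The core of that rule fits in a few lines. A sketch in Python, where
scan and forward are hypothetical callables standing in for the
SpamAssassin check and for the SMTP conversation with the real MTA;
forward is assumed to return a (code, text) reply pair.

```python
def relay_message(message, scan, forward):
    """Scan a message, hand it to the downstream MTA, and return the
    downstream reply unchanged.

    Nothing is queued locally: the sender only sees a 250 if the
    downstream MTA produced one. If anything fails on our side, answer
    with 451 (a temporary local error in SMTP) so the sender retries.
    """
    try:
        tagged = scan(message)
        return forward(tagged)
    except Exception:
        # Assumption for the sketch: any local failure is treated as
        # temporary rather than as a permanent rejection.
        return 451, "Temporary failure, please try again later"
```

A 550 from the downstream MTA goes back to the sender as a 550, so the
proxy never has to generate a bounce of its own.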

After processing somewhere between 50 and 250 messages the child would
exit; in times of high load (when the backlog in the master process grew
too much) additional children would be forked.


This should keep the fork load well below one fork per message, using
a well tested architecture to implement it and giving reliability for
"free", basically.
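
To make the fork arithmetic concrete, here is a deliberately small
Python sketch (POSIX only, since it uses os.fork). It is not the
pre-forked pool itself: children run one after another and read their
batch from memory instead of from a passed socket. The function name
and the pipe protocol are invented for the example, and handle must
return a string without newlines.

```python
import os


def process_in_batches(messages, per_child, handle):
    """Fork one child per batch of `per_child` messages.

    Returns (results, forks). Each child handles its whole batch and
    then exits, so the cost of a fork is spread over the batch.
    """
    results, forks = [], 0
    for start in range(0, len(messages), per_child):
        batch = messages[start:start + per_child]
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        forks += 1
        if pid == 0:
            # Child: write one result line per message, then exit
            # without returning into the caller's code.
            status = 1
            try:
                os.close(read_fd)
                with os.fdopen(write_fd, "w") as out:
                    for msg in batch:
                        out.write(handle(msg) + "\n")
                status = 0
            finally:
                os._exit(status)
        # Parent: read until the child closes its end, then reap it.
        os.close(write_fd)
        with os.fdopen(read_fd) as inp:
            results.extend(line.rstrip("\n") for line in inp)
        os.waitpid(pid, 0)
    return results, forks
```

Seven messages with per_child=3 cost three forks; with a child lifetime
of 50 to 250 messages the ratio drops to between 1/50 and 1/250 of a
fork per message.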

Oh, and I would insert the SpamAssassin headers into the outbound stream
before sending any of the SMTP data received from the original sender,
refusing to allow SpamAssassin to modify any of it.[1]
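
In code that amounts to prepending header lines and never touching the
original bytes. A minimal Python sketch; the function name is made up
and the header value shown in the test is only illustrative.

```python
def prepend_headers(raw_message, headers):
    """Return the new header lines followed by the untouched message.

    `raw_message` is the message exactly as received (bytes) and
    `headers` is a list of (name, value) pairs. The added lines go in
    front of the existing header block; the original bytes are copied
    through without any rewriting.
    """
    added = b"".join(
        ("%s: %s\r\n" % (name, value)).encode("ascii")
        for name, value in headers
    )
    return added + raw_message
```

Because the original message is a strict suffix of the result, it is
easy to check that nothing in it was modified.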

Since you asked. :)

        Daniel

Footnotes: 
[1]  This probably implies a bit of rework in SpamAssassin to stop it
     doing destructive things to the email. :)

-- 
There is no happiness in having or in getting, but only in giving.
        -- Henry Drummond
