On Sat, 11 May 2002, Craig R. Hughes wrote:
> Daniel Pittman wrote:
>
> DP> For the first, *nothing* that you do is likely to improve things
> DP> much other than rewriting the rules themselves; this can be done
> DP> equally well with Perl.
>
> Rule optimization is proceeding. You might find a better/faster regex
> engine, but you'll probably have to re-optimize the rules for that
> engine vs the perl engine. I think we're going to be focussing on
> optimizing the rules for the perl engine, which could improve things
> a lot from where they are now (things like prepending \b's where
> appropriate, etc).

*nod* Also, things like replacing some of the more costly regexp tests
with things like the new porn words eval. :)

> DP> For the second, rewriting SpamAssassin to use a streaming, single
> DP> pass algorithm would be (ahem) challenging. I don't suggest it. :)
>
> I've been tossing this idea around some. I think it bears thinking
> about, even though it is going to be (ahem) challenging. Maybe not do
> just a single pass, but at least do a lot *fewer* passes.

Sure. I don't envy you the task of working out how to achieve it,
though, without making it easier for SPAM to slip by through having
line-breaks in there that the MUA gets rid of... *grin*

Good luck. :)

> DP> > If nothing is done in that. I am ready to help in such project.
> DP> > If anybody is interested please mail me, and let's start.
> DP>
> DP> Check the mailing list archives.
>
> I'm planning on picking up the C stuff which was mentioned here before
> and taking a look at it. I think the disadvantage of the C code is
> definitely portability, and also flexibility in terms of the non-regex
> rules. You not only have to write the scanner parts of SA, but also
> the EvalTests stuff, and also all the network tests, etc, etc.

It's probably worth pointing out that a good deal of why I disparage the
idea of a C rewrite being "better" or "faster" is that Perl is, for this
sort of regexp-heavy work, about as fast as C in almost every case.

It's only in a few specific things, like some numerical work, that you
can actually tell the difference between calling a regexp library from C
and from the Perl bytecode loop.

It's a highly optimized language, Perl, and doesn't really waste cycles.

[...]

> DP> Every time you fork a process, you pay a huge cost. Avoiding that
> DP> would improve your throughput dramatically. Using spam[cd] you pay
> DP> at least *three* forks per message, at best, and probably closer
> DP> to five.
>
> Why 3 forks?
> You'll have to fork spamc, and spamd forks (but probably
> will be a cheap fork cos it's fork only, not fork-and-exec). I count 2
> there.

spamc reads from stdin, then writes to stdout, yes?

So, that needs to go somewhere and for every filtering system I have
seen so far that implies one additional fork for the final delivery.

Well, at least, everything but the SendMail milter or procmail which, as
I understand it, checkpoints to disk instead...

[...]

> DP> That should give you one or, with extra work, less than one fork
> DP> per message. That is the best way to improve your performance.
>
> Less than one fork per message seems unlikely, unless you get really
> clever. And as far as spamd->spamd child forks are concerned, those
> should be really, really cheap, so avoiding them probably won't gain
> you much (assuming a non-naive OS fork implementation where it's going
> to do copy-on-write for the process memory space).

spamd forking isn't likely to cost much. It's the client side stuff
that's costly, mostly because of the need to reinject the message into
the MTA or delivery system somewhere along the way.

If I was writing a high-performance SpamAssassin client I would have the
SMTP listener process, written in Perl, accept connections from the
outside world on port 25.

It would, before doing any forking, compile all the regexp rules and the
like in the master process, just like the existing spamd does, but store
the mtimes for later use.

It would then pass that connected socket to one of a pool of pre-forked
children and expect them to deal with it.

The child would process the message and pass it on to the parent SMTP
listener, using a model of sending the same response to the initial
sender as it received from the parent listener.

That way I would get the reliability of Postfix or SendMail for free,
because I would be tied exactly to their reliability. No need to
checkpoint to disk because I would never ACK the email 'til my parent
had.

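A very rough sketch of that pre-fork arrangement might look something
like the following. It is in Python rather than Perl purely to keep it
short, and it simplifies the design: the children accept directly on a
listening socket inherited from the master instead of being handed
connected sockets, and nothing is relayed to a real MTA. Every name in
it (compile_rules, score, serve_one, child_loop, prefork, the
X-Demo-Spam-Hits header and the MESSAGES_PER_CHILD threshold) is made up
for illustration and is not part of SpamAssassin.

```python
import os
import re
import socket

# Hypothetical recycle threshold: a child exits after this many messages.
MESSAGES_PER_CHILD = 100


def compile_rules(patterns):
    """Compile every rule once, in the master, before any fork."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def score(rules, text):
    """Count matching rules; a stand-in for real SpamAssassin scoring."""
    return sum(1 for rule in rules if rule.search(text))


def serve_one(conn, rules):
    """Read one message until EOF, then reply with a made-up header."""
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:
            break
        chunks.append(data)
    text = b"".join(chunks).decode("utf-8", "replace")
    conn.sendall(f"X-Demo-Spam-Hits: {score(rules, text)}\r\n".encode())


def child_loop(listener, rules, limit):
    """Handle `limit` connections, then return so the child can exit."""
    for _ in range(limit):
        conn, _addr = listener.accept()
        with conn:
            serve_one(conn, rules)


def prefork(listener, rules, workers, limit=MESSAGES_PER_CHILD):
    """Fork `workers` children sharing the listener and compiled rules."""
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:  # in the child
            status = 1
            try:
                child_loop(listener, rules, limit)
                status = 0
            finally:
                os._exit(status)  # never fall back into the master's code
        pids.append(pid)
    return pids
```

To try it, the master would bind and listen on a socket, call
compile_rules() once, then prefork(listener, rules, workers=4) and
os.waitpid() on the returned pids. Because the rules are compiled before
the fork, each child inherits them through copy-on-write without paying
for compilation again.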
After processing somewhere between 50 and 250 messages the child would
exit; in times of high load (when the backlog in the master process grew
too much) additional children would be forked.

This should keep the fork load well below one fork per message, using a
well tested architecture to implement it and giving reliability for
"free", basically.

Oh, and I would insert the SpamAssassin headers into the outbound stream
before sending any of the SMTP data received from the original sender,
refusing to allow SpamAssassin to modify any of it.[1]

Since you asked. :)
        Daniel

Footnotes:
[1]  This probably implies a bit of rework in SpamAssassin to stop it
     doing destructive things to the email. :)

-- 
There is no happiness in having or in getting, but only in giving.
        -- Henry Drummond

_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk