On Thu, 23 Sep 2004, Juhapekka Tolvanen spake:
> "Myth 4: PERL is designed for language processing, so
> SpamAssassin is written in a more appropriate language.
> Let me preface this with the fact that I've had about 10
> years of experience coding PERL.

... yet he hasn't even read the Perl FAQ, which states

: But never write "PERL", because perl is not an acronym, apocryphal
: folklore and post-facto expansions notwithstanding.

Not a good start.

> While PERL is very useful
> for language processing and web applications, it is also an
> extremely slow, interpreted language.

Benchmarks? Oops, I don't see any, just unsubstantiated opinion.

Yes, Perl is interpreted.

> The average overhead
> for a single PERL process is around 2MB of RAM. Even compiled
> PERL still requires the use of a bootstrapped interpreter and
> bytecode translation.

... which happens exactly once, when you start spamd.

> PERL is very slow compared to a compiled
> language, and the regular expression functions PERL supports
> for text extraction have their roots in the C implementation
> of regular expressions, which are much faster.

Nonsense. The Perl regexp implementation is *written* in C; so here the
DSPAM author is saying that C is intrinsically faster than itself.

> DSPAM has very
> low-level string functions coded in C which are extremely fast,
> effective, and don't even require the use of processor-intensive
> regular expressions.

So does Perl. However, if you want to do the sort of textual pattern
matching that requires the construction of a DFA and optional
backtracking to implement in C, you use a library that's good for that
sort of thing. We call those `regular expression libraries'.

Plus, the Bayesian part of SA (i.e., the only part which DSPAM could
replace) does not use regexps for anything more than identification of
tokens. If you can tokenize without using regular expressions or
anything reducible to them, or anything more formally powerful (which
would consume much more CPU power to match, and for which matching may
not ever terminate), then there's a whole lot of computer scientists
who'd like to talk to you...
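
To make the tokenizing point concrete, here is a minimal sketch (in Python, purely because it is easy to run). The token definition is invented for illustration and is not SA's real tokenizer, and the names `tokenize_regex` and `tokenize_by_hand` are hypothetical. The `regexp-free' scanner is just the same two-state automaton written out longhand, so it is still something reducible to a regular expression:

```python
import re

# Hypothetical token definition, for illustration only: runs of letters,
# digits, apostrophes, '$' or '-'. This is NOT SpamAssassin's real tokenizer.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")
TOKEN_CHARS = set("abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "0123456789$'-")


def tokenize_regex(text):
    """Tokenize by handing the pattern to the regular expression engine."""
    return TOKEN_RE.findall(text)


def tokenize_by_hand(text):
    """Tokenize with a hand-rolled scanner that has two states:
    inside a token (start is an index) or between tokens (start is None).
    """
    tokens = []
    start = None
    for i, ch in enumerate(text):
        if ch in TOKEN_CHARS:
            if start is None:
                start = i          # a new token begins here
        elif start is not None:
            tokens.append(text[start:i])  # the token just ended
            start = None
    if start is not None:
        tokens.append(text[start:])       # flush a token that runs to the end
    return tokens


sample = "Buy ch3ap meds now!!! Only $9.99 -- don't miss out"
assert tokenize_regex(sample) == tokenize_by_hand(sample)
```

Both functions recognise the same language; the second merely moves the automaton out of the library and into your own code, where you get to maintain it.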

It is true that SA spends the majority of its time inside the regular
expression matcher (written in *C*, note: it's not spending its time in
the interpreter proper). However, given the number of regexp-based rules
SA contains, that's unsurprising. If you rip all those rules out and
leave just a Bayes engine, SA would probably be rather faster.

> While PERL is useful for data extraction
> and reporting, it is the completely wrong choice for language
> processing, especially in a large-scale environment.

Disproved by reality; i.e., SpamAssassin itself.

> analyzing one mailbox, PERL would be acceptable...but if you
> plan on running this on a production system with live users, it
> is a death wish."

Likewise disproved by reality.

> I really don't care about attitudes of author of DSPAM. I just want to

It's not the `attitudes' that are relevant, it's more that everything he
says there is either irrelevant or the purest moonshine.

> know, how much faster SpamAssassin will be, if its Bayesian engine is
> replaced with something else, for example with DSPAM. It does not hurt,
> if we try it out and see what happens. And it does not hurt, if people
> have more alternatives.

No faster. The bottleneck with large SA installations using Bayes is not
CPU time: it is disk I/O (and RAM, of course). I can't see how
reimplementing the Bayesian engine can help fix this: replacing the
*storage mechanism* might be a good idea, though. (DB_File is the
fastest large-scale key->value storage mechanism available without setup
work: if you don't mind the setup work, you could try Bayes-in-SQL until
you find an RDBMS that's faster than Berkeley DB. You might be searching
for some time.)

> I even switched from plain SpamAssassin to spamd.

That's something that virtually everyone with the privileges to do
(i.e. root) should do, I think. Anything else is just being pointlessly
inefficient.
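
The reason a long-running daemon wins can be sketched in a few lines. This is a toy in Python, not how spamd is actually implemented: the rule names and patterns are made up, and `re.compile` merely stands in for the one-off startup and rule-compilation cost. The point is only that the setup is paid once per process rather than once per message:

```python
import re

# Hypothetical rules, invented for this sketch; not real SpamAssassin rules.
RULES = {
    "HYPOTHETICAL_SHOUTING": r"!{3,}",
    "HYPOTHETICAL_CHEAP_MEDS": r"(?i)\bch[e3]ap\s+meds\b",
}


class Scanner:
    """Stands in for a daemon: compile rules at startup, reuse per message."""

    def __init__(self, rules):
        self.compile_count = 0
        self.compiled = {}
        for name, pattern in rules.items():
            self.compiled[name] = re.compile(pattern)
            self.compile_count += 1  # setup cost, paid once at startup

    def scan(self, message):
        """Return the sorted names of the rules that match this message."""
        return sorted(name for name, rx in self.compiled.items()
                      if rx.search(message))


scanner = Scanner(RULES)  # "start the daemon"
messages = ["Buy ch3ap meds now!!!", "Lunch on Friday?", "CHEAP MEDS here"]
hits = [scanner.scan(m) for m in messages]

# Three messages scanned, but each rule was compiled exactly once.
assert scanner.compile_count == len(RULES)
```

Run the plain script once per message instead and you pay for that setup every single time, which is the pointless inefficiency in question.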

> After all these horrible experiences it is painful to read, when
> somebody tries to explain, how fast Perl is after all. Hell yes I know
> Perl-program is compiled when it starts, but it is not enough. Real
> compiled language like C is faster in many cases.

But SA spends nearly all its time in the regexp engine, and waiting (on
disk I/O, network traffic for net tests, and locks on the Bayes database
if a Bayes expiry is underway). Switching to C would just needlessly
hinder maintenance.

> I reiterate: It does not hurt, if we try out and see what happens.

Try it, then. :)

-- 
`I agree that school is a learning environment, and learning to
 intimidate others -- aka "social skills" -- is part of that.'
                                                     --- jabberwocky