On Mon, 2011-08-01 at 12:30 -0700, monolit wrote:
> I tried to measure performance of Spamassassin by using SDBM databse,
> because of improvement performance. This site
> http://wiki.apache.org/spamassassin/BayesBenchmarkResults
> BayesBenchmarkResults claims, that by using SDBM database instead of
> Berkeley DB, Spamassassin will be three times faster. Thats why I did the
> measurement.
>
> I expected when I converted database format from Berkeley DB to SDBM
> improvement of performance (as the link claims). But the tests didnt show
> that. So, now I dont know where is the problem.
>
If you have URIBL checks turned on, I'd expect the normal network delays for these to completely mask any performance difference you may get by swapping one fast database for another. Here are some numbers:
- the slowest single-record Berkeley DB operation in a 2006 Oracle benchmark (TDS no-sync writes with disk logs on a 2.0 GHz Windows XP box) ran at 45,748 ops/sec, or about 0.02 ms per operation
- pinging www.spamhaus.org just now took 30 ms.

Now consider that URIBL lookups are generally slower than that, but as they are all asynchronous, to a first approximation the time taken to handle the lot is the time taken by the slowest URIBL.

Let's assume that the longest URIBL lookup takes 30 ms. Let's further assume that each spam message contains 100 Bayes tokens, in which case looking them up in the Bayes database would take 2 ms, or 7% of the time needed to ping www.spamhaus.org.

The impact of using a database that's 3 times faster? The time taken for the lookups is now 0.7 ms, and the time for ping + 100 lookups has changed from 32 ms to 30.7 ms - a reduction of 4%!

Now consider that:
- the slowest URIBL lookup will take a lot longer than 30 ms
- we've entirely neglected the time taken by SA to scan a message and run the regexes in the rules collection

IOW, in real life the speedup will be quite a lot less than the 4% I estimated. You measured a speedup of 309 seconds in 87 minutes, or 0.6%, which, all things considered, seems about what I'd expect even if SDBM is really 3 times faster than Berkeley DB.

Running repeated tests on a fixed set of messages can tell you about the overall performance of SA, but very little about the time taken by any of its internal modules, and that's ignoring the artificially high cache hit rate that you'll see if you run repeated tests on the same data set. I think you'd get better data by running a single test with diagnostics turned on and looking at the execution time of the various spamd components.

BTW, your results from running 10,000,000 messages through spamc/spamd give an elapsed average processing time of around 0.5 ms per message, a figure I find hard to believe unless you're running a supercomputer.
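
As a rough sketch, the 4% estimate earlier in this message can be reproduced in a few lines of Python. The function name `estimate_saving` and its parameter names are made up for illustration; the inputs are the figures quoted above (45,748 ops/sec, a 30 ms slowest lookup, 100 tokens, a 3x faster database). Like the estimate itself, it models only the slowest network lookup plus the Bayes lookups and ignores everything else SA does.

```python
def estimate_saving(ops_per_sec, slowest_lookup_ms, tokens, db_speedup):
    """Return per-message timings in ms and the percentage saved.

    Assumes total time = slowest network lookup + Bayes token lookups,
    which is the same first approximation used in the message.
    """
    per_op_ms = 1000.0 / ops_per_sec           # time for one database lookup
    bayes_ms = tokens * per_op_ms              # all token lookups, current DB
    faster_bayes_ms = bayes_ms / db_speedup    # same lookups, faster DB
    before_ms = slowest_lookup_ms + bayes_ms
    after_ms = slowest_lookup_ms + faster_bayes_ms
    saving_pct = 100.0 * (before_ms - after_ms) / before_ms
    return {
        "per_op_ms": per_op_ms,
        "bayes_ms": bayes_ms,
        "faster_bayes_ms": faster_bayes_ms,
        "before_ms": before_ms,
        "after_ms": after_ms,
        "saving_pct": saving_pct,
    }


if __name__ == "__main__":
    # Figures quoted in the message; the 30 ms value is the ping time
    # used as a stand-in for the slowest URIBL lookup.
    result = estimate_saving(45_748, 30.0, 100, 3.0)
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

With unrounded intermediate values the saving comes out between 4% and 5%, slightly above the hand-rounded 4%, because 100 lookups take nearer 2.2 ms than 2 ms. Raising `slowest_lookup_ms` to 300 drops the saving below 1%, which is the point of the argument: the slower the network lookups, the less the database choice matters.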
Admittedly, my system is at the opposite end of the power scale. For comparison, after a few runs on my 500-message spam corpus, so that all caches in my box and those in various routers out on the 'net are likely to be full, I can get down to 800-900 ms per message on a 1.6 GHz Core Duo with 1 GB RAM. My typical scan times, on an 866 MHz P3 box with 512 MB RAM, range from 1.1 seconds to 48.5 seconds (averaging 3.4 seconds) over the last 2111 messages processed.

Martin