On Mon, 2011-08-01 at 12:30 -0700, monolit wrote:
> I tried to measure the performance of SpamAssassin using an SDBM database,
> hoping for a performance improvement. This page,
> http://wiki.apache.org/spamassassin/BayesBenchmarkResults
> claims that by using an SDBM database instead of
> Berkeley DB, SpamAssassin will be three times faster. That's why I did the
> measurement.
> 
> I expected a performance improvement when I converted the database format
> from Berkeley DB to SDBM (as the link claims), but the tests didn't show
> that. So now I don't know where the problem is.
>
If you have URIBL checks turned on I'd expect that the normal network
delays for these will completely mask any performance difference you may
get by swapping one fast database for another. Here are some numbers:

- the slowest single-record Berkeley DB operation in a 2006 Oracle
  benchmark (TDS no-sync writes with disk logs on a 2.0 GHz Windows
  XP box) ran at 45,748 ops/sec, or about 0.02 ms per operation

- pinging www.spamhaus.org just now took 30 ms.

Now consider that URIBL lookups are generally slower than that, but as
they are all asynchronous, to a first approximation the time taken to
handle the lot is the time taken by the slowest URIBL. Let's assume that
the longest URIBL lookup takes 30 ms.
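
To illustrate the "slowest lookup dominates" approximation, here is a
minimal Python sketch. It uses asyncio.sleep as a stand-in for real DNS
queries; the list names and delays are made up for illustration and are
not measurements, and this is not how SA itself issues its lookups.

```python
import asyncio
import time

# Hypothetical blocklists and latencies in seconds (illustrative, not measured).
SIMULATED_DELAYS = {"list-a": 0.05, "list-b": 0.10, "list-c": 0.15}


async def fake_lookup(name, delay):
    # Stand-in for an asynchronous DNS query: just wait for `delay` seconds.
    await asyncio.sleep(delay)
    return name


async def run_all(delays):
    # Start every lookup at once, wait for all of them, return elapsed seconds.
    start = time.perf_counter()
    await asyncio.gather(*(fake_lookup(n, d) for n, d in delays.items()))
    return time.perf_counter() - start


def measure(delays=SIMULATED_DELAYS):
    return asyncio.run(run_all(delays))


if __name__ == "__main__":
    elapsed = measure()
    print(f"elapsed: {elapsed:.3f} s")
    print(f"slowest: {max(SIMULATED_DELAYS.values()):.3f} s")
    print(f"sum:     {sum(SIMULATED_DELAYS.values()):.3f} s")
```

The elapsed time should land close to the slowest delay rather than the
sum of all three, which is why only the slowest URIBL matters here.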

Let's further assume that each spam message contains 100 Bayes tokens, in
which case looking them up in the Bayes database would take about 2 ms,
or 7% of the time needed to ping www.spamhaus.org.

The impact of using a database that's 3 times faster? The time taken for
the lookups is now 0.7 ms, and the time for ping + 100 lookups has
changed from 32 ms to 30.7 ms - a reduction of 4%!
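
The arithmetic can be reproduced with a short Python sketch. The inputs
are the figures quoted in this message (45,748 ops/sec, 100 tokens, a
30 ms network wait, a 3x faster database); the function names are only
for illustration. Because it does not round the per-operation time to
0.02 ms, it gives slightly larger figures than the rounded ones in the
text, but the conclusion is the same.

```python
OPS_PER_SEC = 45_748   # slowest Berkeley DB figure from the 2006 benchmark
TOKENS = 100           # assumed number of Bayes tokens per message
NETWORK_MS = 30.0      # assumed duration of the slowest URIBL lookup
SPEEDUP = 3.0          # claimed SDBM advantage over Berkeley DB


def bayes_ms(tokens=TOKENS, ops_per_sec=OPS_PER_SEC, speedup=1.0):
    """Time in ms to look up `tokens` records, one operation per token."""
    return tokens * 1000.0 / ops_per_sec / speedup


def reduction_percent(network_ms=NETWORK_MS, speedup=SPEEDUP):
    """Percentage of (network wait + Bayes lookups) saved by the faster DB."""
    before = network_ms + bayes_ms()
    after = network_ms + bayes_ms(speedup=speedup)
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    print(f"Bayes lookups, Berkeley DB: {bayes_ms():.2f} ms")
    print(f"Bayes lookups, 3x faster:   {bayes_ms(speedup=SPEEDUP):.2f} ms")
    print(f"Overall reduction:          {reduction_percent():.1f} %")
```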

Now consider that: 
- the slowest URIBL lookup will take a lot longer than 30 ms
- we've entirely neglected the time taken by SA to scan a message
  and run the regexes in the rules collection

IOW, in real life the speedup will be quite a lot less than the 4% I
estimated.
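
A rough Python sketch of how quickly the gain shrinks once those two
factors are included. Only the first scenario comes from the estimate
above; the 200 ms lookup and 500 ms scan time are made-up values chosen
purely for illustration, not measurements.

```python
BAYES_MS = 2.0   # 100 token lookups at roughly 0.02 ms each, as estimated above
SPEEDUP = 3.0    # claimed SDBM advantage over Berkeley DB


def reduction_percent(network_ms, scan_ms=0.0, bayes_ms=BAYES_MS, speedup=SPEEDUP):
    """Percentage of per-message time saved by speeding up only the Bayes lookups."""
    before = network_ms + scan_ms + bayes_ms
    after = network_ms + scan_ms + bayes_ms / speedup
    return 100.0 * (before - after) / before


# (slowest lookup in ms, scan time in ms). The first row is the estimate
# in the text; the other rows are hypothetical values for illustration.
SCENARIOS = [(30.0, 0.0), (200.0, 0.0), (200.0, 500.0)]

if __name__ == "__main__":
    for network_ms, scan_ms in SCENARIOS:
        saved = reduction_percent(network_ms, scan_ms)
        print(f"lookup {network_ms:5.0f} ms, scan {scan_ms:5.0f} ms: {saved:.2f} % saved")
```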

You measured a speed-up of 309 seconds in 87 minutes, or about 5.9%,
which, all things considered, is in the same range as my estimate even
if SDBM is really 3 times faster than Berkeley DB.

Running repeated tests on a fixed set of messages can tell you about the
overall performance of SA, but very little about the time taken by any
of its internal modules, and that's ignoring the artificially high cache
hit rate that you'll see if you run repeated tests on the same data set.
I think you'd get better data by running a single test with diagnostics
turned on and looking at the execution time of the various spamd
components.

BTW, your results from running 10,000,000 messages through spamc/spamd
give an average elapsed processing time of around 0.5 ms per message, a
figure I find hard to believe unless you're running a supercomputer.
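
For reference, the per-message figure can be checked with a few lines of
Python. This assumes the 87-minute elapsed time mentioned above belongs
to the 10,000,000-message run; if that run took a different time, adjust
the first argument.

```python
def avg_ms_per_message(elapsed_minutes, messages):
    """Average elapsed time per message in milliseconds."""
    return elapsed_minutes * 60.0 * 1000.0 / messages


def messages_per_second(elapsed_minutes, messages):
    """Average throughput in messages per second."""
    return messages / (elapsed_minutes * 60.0)


if __name__ == "__main__":
    # Assumption: 87 minutes is the elapsed time of the 10,000,000-message run.
    print(f"{avg_ms_per_message(87, 10_000_000):.3f} ms per message")
    print(f"{messages_per_second(87, 10_000_000):.0f} messages per second")
```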

Admittedly, my system is at the opposite end of the power scale. For
comparison, after a few runs on my 500-message spam corpus, by which
point all the caches in my box and in various routers out on the 'net
are likely to be full, I can get down to 800-900 ms per message on a
1.6 GHz Core Duo with 1 GB RAM. My typical scan times, on an 866 MHz P3
box with 512 MB RAM, range from 1.1 seconds to 48.5 seconds (averaging
3.4 seconds) over the last 2111 messages processed.
    

Martin

