Re: Mondo bayes_toks - millions of entries

Wes Fri, 30 Nov 2007 11:00:23 -0800

I'm running the "sa-learn --restore" into the PostgreSQL database now.
Performance is poor: only about 300 tokens per second are being loaded, so
reloading the several million tokens from the backup is going to take a while.


I am using Mail::SpamAssassin::BayesStore::PgSQL.

The PostgreSQL log shows a separate transaction for each token loaded:

11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\353\\\\244\\\\114\\\\145\\\\321}', 0,1,1196373684)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\164\\\\223\\\\254\\\\212\\\\016}', 0,2,1196379608)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\264\\\\260\\\\042\\\\254\\\\337}', 0,1,1196374147)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\144\\\\207\\\\105\\\\341\\\\202}', 0,1,1196374214)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\167\\\\116\\\\332\\\\321\\\\265}', 0,1,1196374269)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit

I'm guessing this is because the restore uses the same modules as spamd
instead of doing a bulk load, which would take only a few seconds?  Does it
do the same thing when updating the access times of existing tokens and
adding tokens from a message?  If so, this would seem to be a rather
significant bottleneck compared with updating everything in one transaction.
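
To illustrate the difference between the per-token pattern in the log above
and a batched load, here is a minimal sketch in Python.  It is not the
SpamAssassin code: an in-memory SQLite database stands in for PostgreSQL, and
the table layout, function names, and batch size are all assumptions made up
for the example.  It only counts commits; on a real server each commit
typically also costs a flush to disk, which is where the time would go.

```python
import sqlite3

# Hypothetical insert statement for a hypothetical token table.
INSERT = "INSERT INTO bayes_token VALUES (?, ?, ?, ?, ?)"


def make_db():
    """Create an in-memory SQLite database (a stand-in for PostgreSQL)."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/COMMIT statements below control the transactions.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE bayes_token ("
        " id INTEGER, token BLOB, spam_count INTEGER,"
        " ham_count INTEGER, atime INTEGER,"
        " PRIMARY KEY (id, token))"
    )
    return conn


def load_per_token(conn, rows):
    """One begin/insert/commit per token. Returns the number of commits."""
    commits = 0
    for row in rows:
        conn.execute("BEGIN")
        conn.execute(INSERT, row)
        conn.execute("COMMIT")
        commits += 1
    return commits


def load_batched(conn, rows, batch_size=1000):
    """One transaction per batch of tokens. Returns the number of commits."""
    commits = 0
    for start in range(0, len(rows), batch_size):
        conn.execute("BEGIN")
        conn.executemany(INSERT, rows[start:start + batch_size])
        conn.execute("COMMIT")
        commits += 1
    return commits


def sample_rows(n):
    """Generate n made-up rows with distinct 5-byte tokens."""
    return [(2, i.to_bytes(5, "big"), 0, 1, 1196373684) for i in range(n)]
```

Loading the same rows both ways leaves identical table contents; only the
number of commits differs (one per token versus one per batch).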

Is this being done to avoid deadlocks?  Deadlocks can be avoided by sorting
the keys to be updated so that they are always updated in the same order
(and/or retrying should a deadlock be detected).
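
A rough sketch of what I mean, in Python rather than in the actual module's
language.  Everything here is hypothetical: DeadlockDetected stands in for
whatever error the database driver raises on a deadlock, and run_transaction
is a caller-supplied function that updates the given tokens inside a single
transaction.

```python
import time


class DeadlockDetected(Exception):
    """Hypothetical stand-in for the driver's deadlock error."""


def update_tokens(tokens, run_transaction, max_retries=3, backoff=0.0):
    """Update tokens in one transaction, in a fixed order, retrying on deadlock.

    Returns the number of retries that were needed.
    """
    # Sorting (and de-duplicating) means every writer touches the rows in the
    # same order, so two writers cannot each hold a row the other is waiting on.
    ordered = sorted(set(tokens))
    for attempt in range(max_retries + 1):
        try:
            run_transaction(ordered)
            return attempt
        except DeadlockDetected:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Wait a little longer before each successive retry.
            time.sleep(backoff * (attempt + 1))
```

The sorting alone should prevent deadlocks between writers that all follow
the same rule; the retry is a fallback for anything else that locks the same
rows in a different order.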

Wes
