I'm running "sa-learn --restore" into the PostgreSQL database now. Performance is not so good: about 300 tokens per second loaded. It's going to take a while to reload the several million tokens from the backup.
I am using Mail::SpamAssassin::BayesStore::PgSQL. The PostgreSQL log shows it is doing a separate transaction per token loaded:

11-30-2007.18:38:52 postmaster-20565: LOG: statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG: statement: select put_tokens(2,'{\\\\353\\\\244\\\\114\\\\145\\\\321}', 0,1,1196373684)
11-30-2007.18:38:52 postmaster-20565: LOG: statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG: statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG: statement: select put_tokens(2,'{\\\\164\\\\223\\\\254\\\\212\\\\016}', 0,2,1196379608)
11-30-2007.18:38:52 postmaster-20565: LOG: statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG: statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG: statement: select put_tokens(2,'{\\\\264\\\\260\\\\042\\\\254\\\\337}', 0,1,1196374147)
11-30-2007.18:38:52 postmaster-20565: LOG: statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG: statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG: statement: select put_tokens(2,'{\\\\144\\\\207\\\\105\\\\341\\\\202}', 0,1,1196374214)
11-30-2007.18:38:52 postmaster-20565: LOG: statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG: statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG: statement: select put_tokens(2,'{\\\\167\\\\116\\\\332\\\\321\\\\265}', 0,1,1196374269)
11-30-2007.18:38:52 postmaster-20565: LOG: statement: commit

I'm guessing this is because the restore uses the same modules as spamd instead of doing a bulk load, which would take a few seconds?

Does it do the same thing when updating existing token access times and adding tokens from a message? If so, this would seem to be a rather significant bottleneck compared with updating everything in one transaction.

Is this being done to avoid deadlocks? Deadlocks can be avoided by sorting the keys to be updated so that they are always updated in the same order (and/or by retrying should a deadlock be detected).

Wes
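
To make the batching idea above concrete, here is a minimal sketch of loading tokens in sorted order with one transaction per batch rather than one per token. It is only an illustration, not SpamAssassin's code: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table name `tokens`, its columns, and the helpers `make_db` and `load_tokens` are all hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens ("
        "token BLOB PRIMARY KEY, spam_count INTEGER, "
        "ham_count INTEGER, atime INTEGER)"
    )
    return conn


def load_tokens(conn, rows, batch_size=1000):
    """Load (token, spam_count, ham_count, atime) rows in sorted batches.

    Returns the number of commits issued, i.e. one per batch.
    """
    # Sorting by token means every writer touches rows in the same order,
    # which is the deadlock-avoidance idea described in the post.
    ordered = sorted(rows, key=lambda row: row[0])
    commits = 0
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        # "with conn" wraps the whole batch in a single transaction:
        # it commits on success and rolls back if an exception is raised.
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO tokens "
                "(token, spam_count, ham_count, atime) VALUES (?, ?, ?, ?)",
                batch,
            )
        commits += 1
    return commits
```

With a batch size of 1000, a restore of several million tokens would issue a few thousand commits instead of several million. Against PostgreSQL the same shape would apply, with a retry around each batch in case a deadlock is still reported.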