BDB and errors...

Robert Mueller Tue, 14 Mar 2006 12:24:57 -0800

We're using cyrus 2.3 and everything works fine, except we seem to have intermittent problems with BDB 4.2 (specifically the RPM db4-4.2.52-3.1). We only use BDB for the delivery db.

In general it works fine. However, if for some reason a server has crashed and we reboot the server, we then seem to almost always have a problem with the DB.


Probably best to show a sequence of events.

1. Server froze up, so force a hard reset

2. Server boots up and remounts everything fine. All partitions are reiserfs and mount ok with journal playback

3. We start cyrus. Since the delivery DB is temporary and non-critical, the start script explicitly does:


    rm -f /var/imap/db/log.*
    rm -f /var/imap/db/__db*
    rm -f /var/imap/deliver.db

This is to clean out all existing BDB state and information. I can confirm that the only files left in the /var/imap/db dir are DB_CONFIG and skipstamp. There appears to be no BDB environment state

4. cyrus appears to start fine, but intermittently we see errors in the log like:

    Mar 14 13:47:25 server1 lmtp[2514]: DBERROR: mystore: error storing <[EMAIL PROTECTED]>: DB_PAGE_NOTFOUND: Requested page not found

Each time an error like this occurs, it seems to leave a transaction open. Running:

    (cd /var/imap/db; /usr/bin/db_stat -t -h .)

normally shows "Active transactions" as 0, but after each of the above errors appears in the log, the count increases and never decreases. Eventually this causes problems: processes appear to get stuck waiting for the transaction in a semi-busy loop inside BDB (continuous calls to select with a 1/10th of a second timeout), and the checkpointing process can't clean up old log files that have open transactions in them. In the end either the transaction count reaches the set_tx_max value and puts BDB into an error state, or the server load increases a lot because of the semi-busy wait loop BDB gets into.

5. Stopping cyrus, then starting it again with the exact same start script usually then fixes the problem
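For reference, the transaction check in step 4 could be scripted roughly as below. This is only a sketch: the function name active_txn_count is made up, and the awk pattern assumes db_stat prints a summary line consisting of a number followed by "Active transactions" or "Number of active transactions". The exact wording may differ between BDB versions, so check it against your own output first.

```shell
#!/bin/sh
# Hypothetical helper: read `db_stat -t` output on stdin and print the
# active-transaction count.
# ASSUMPTION: the summary line looks like "<number><whitespace>Active
# transactions" (optionally "Number of active transactions"). Lines such
# as "Maximum active transactions" are deliberately not matched.
active_txn_count() {
    awk 'tolower($0) ~ /^[0-9]+[ \t]+(number of )?active transactions[ \t]*$/ {
        print $1   # first field is the count
        exit       # only the first matching line is used
    }'
}

# Usage on a live server (paths taken from the commands above):
#   (cd /var/imap/db && /usr/bin/db_stat -t -h .) | active_txn_count
```

Logging that number every minute or so after a hard reset would show whether the count starts climbing at the first DB_PAGE_NOTFOUND error.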

That's the bit I don't get. Why would restarting again change anything? It seems that we're clearing out exactly the same data in each case, yet there's definitely some weird state getting left behind after a hard reboot that causes the errors, and I don't know where or why.

Has anyone seen anything similar with their servers, or does anyone have any idea what would be causing this?
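To rule out leftover files in the environment directory, the cleanup could also print whatever remains after the rm commands, so the two start-ups can be compared in the log. A minimal sketch, assuming the same file patterns as in the start script; the function name clean_bdb_env is hypothetical, and it does not touch /var/imap/deliver.db, which lives outside the db dir.

```shell
#!/bin/sh
# Hypothetical helper: remove BDB log and region files from the given
# directory, then list every entry that is still there (dotfiles included).
clean_bdb_env() {
    dbdir=$1
    # Same patterns as the start script; -f keeps rm quiet when nothing matches.
    rm -f "$dbdir"/log.* "$dbdir"/__db*
    # One name per line; per the post, only DB_CONFIG and skipstamp should remain.
    ls -A "$dbdir"
}

# Usage in the start script (path taken from the post):
#   clean_bdb_env /var/imap/db
```

If the listing is identical after a hard reset and after a normal restart, the leftover state is presumably somewhere other than the files in that directory.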


Rob

----
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
