Bugs item #775414, was opened at 2003-07-21 19:29
Message generated for change (Comment added) made by greg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=775414&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: Extension Modules
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Tim Peters (tim_one)
Assigned to: Gregory P. Smith (greg)
Summary: bsddb3 hash craps out with threads

Initial Comment:
Richie Hindle presented something like the attached (hammer.py) on the spambayes-dev mailing list. On Win98SE and Win2K w/ Python 2.3c1 I usually see this death pretty quickly:

Traceback (most recent call last):
  File "hammer.py", line 36, in ?
    main()
  File "hammer.py", line 33, in main
    hammer(db)
  File "hammer.py", line 15, in hammer
    x = db[str(int(random.random() * 100000))]
  File "C:\CODE\PYTHON\lib\bsddb\__init__.py", line 86, in __getitem__
    return self.db[key]
bsddb._db.DBRunRecoveryError: (-30982, 'DB_RUNRECOVERY: Fatal error, run database recovery -- fatal region error detected; run recovery')

Richie also reported "illegal operation" crashes on Win98SE.

It's not clear whether a bsddb3 hash *can* be used with threads like this. If it can't, there's a doc bug. If it should be able to, there's a more serious problem. Note that it looks like hashopen() always merges DB_THREAD into the flags, so the absence of specifying DB_THREAD probably isn't the problem.

----------------------------------------------------------------------

>Comment By: Gregory P. Smith (greg)
Date: 2005-11-05 08:54

Message:
Logged In: YES
user_id=413

modifying bsddb/__init__.py to wrap all calls with DeadlockWrap will be a bit of a pita (but would be doable). I've attached an example wrapped_hammer.py that demonstrates the _openDBEnv change as well as DeadlockWrap wrapping to work properly.

----------------------------------------------------------------------

Comment By: Gregory P.
Smith (greg)
Date: 2005-11-05 08:31

Message:
Logged In: YES
user_id=413

oh good i see you already suggested the simple thread calling lock_detect that I was about to suggest. :)

regardless a thread isn't needed. see dbenv.set_lk_detect which tells BerkeleyDB to run deadlock detection automatically anytime a lock conflict occurs.

http://www.sleepycat.com/docs/api_c/env_set_lk_detect.html

Just add e.set_lk_detect(db.DB_LOCK_DEFAULT) to bsddb/__init__.py's _openDBEnv() function. That causes hammer.py to get DBLockDeadlockError exceptions as expected (dying if the main thread gets one). No lock_detect thread needed.

The bsddb legacy interface in __init__.py could have all of its database accesses wrapped in the bsddb.dbutils.DeadlockWrap function to prevent this. (testing now)

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2005-11-03 20:20

Message:
Logged In: YES
user_id=14198

The db_deadlock program ends up being equivalent to a thread repeatedly calling:

dbenv.lock_detect(bsddb.db.DB_LOCK_DEFAULT, 0)

For completeness, I attach deadlock_hammer.py - a version that uses yet another thread to perform this lock detection. It also catches the deadlock exceptions, printing but ignoring them. Also, due to the way shutdown is less than graceful, I found I needed to add DB_RECOVER_FATAL to the env flags, otherwise I would often hang on open unless I clobbered the DB directory.

On both my box (where it took a little while to see a deadlock) and on a dual-processor box (which provoked it much quicker), this version seems to run forever (although with sporadic performance)

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2005-11-03 18:00

Message:
Logged In: YES
user_id=14198

Sadly, I believe bsddb is working "as designed".
Quoting from http://www.sleepycat.com/docs/api_c/env_open.html

"When the DB_INIT_LOCK flag is specified, it is usually necessary to run a deadlock detector, as well."

So I dug into my bsddb build tree, and found db_deadlock.exe. Sure enough, once studly_hammer.py had deadlocked, executing db_deadlock in the DB directory got things running again - although the threads all eventually died with:

bsddb._db.DBLockDeadlockError: (-30996, 'DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock')

Obviously it is PITA to need to run an external daemon, and as Python doesn't distribute db_deadlock.exe, the sleepycat license may mean not all applications are allowed to distribute it. This program also polls for deadlocks, meaning your app may hang as long as the poll period.

All in all, it seems to suck :)

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-10-05 18:17

Message:
Logged In: YES
user_id=413

if you believe your application is properly using BerkeleyDB and you are having DB_RUNRECOVERY issues I suggest contacting sleepycat.

----------------------------------------------------------------------

Comment By: Rick Bradley (roundeye)
Date: 2003-10-05 12:46

Message:
Logged In: YES
user_id=58334

This is also showing up in Syncato (http://www.syncato.org/), and the database isn't recoverable using the Berkeley DB db_recover utility (even using the "catastrophic" flag). Does anyone know of a reliable way to recover?

Rick Bradley

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-29 10:05

Message:
Logged In: YES
user_id=44345

Forgot to mention that without the DBEnv() object, the script on Solaris 8 seg faults pretty quickly (within 10,000 iterations for each thread) or raises bsddb._db.DBRunRecoveryError.
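
[Illustration] The 2005 comments above describe threads dying with DBLockDeadlockError once deadlock detection picks a victim, and suggest wrapping database calls in DeadlockWrap so the victim retries instead. A minimal sketch of that retry idea follows. It is not the bsddb.dbutils code and does not import bsddb at all: DeadlockError, retry_on_deadlock and FlakyStore are hypothetical stand-ins invented for this example, and the backoff numbers are arbitrary assumptions.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for bsddb's DBLockDeadlockError."""


def retry_on_deadlock(func, *args, max_retries=5, base_delay=0.001, **kwargs):
    """Call func(*args, **kwargs), retrying if it raises DeadlockError.

    Sleeps with exponential backoff plus jitter between attempts and
    re-raises the last DeadlockError once max_retries retries are used up.
    """
    attempt = 0
    while True:
        try:
            return func(*args, **kwargs)
        except DeadlockError:
            if attempt >= max_retries:
                raise
            # Back off so the competing locker has a chance to finish.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
            attempt += 1


class FlakyStore:
    """Toy store whose first `failures` writes are picked as deadlock victims."""

    def __init__(self, failures):
        self.failures = failures
        self.data = {}

    def put(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise DeadlockError("locker killed to resolve a deadlock")
        self.data[key] = value


store = FlakyStore(failures=3)
retry_on_deadlock(store.put, "spam", "eggs")
```

In this toy run the write succeeds on the fourth attempt. The retry only makes sense for operations that are safe to repeat, which is presumably part of why wrapping every call in the legacy interface was described as doable but tedious.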
----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-29 09:41

Message:
Logged In: YES
user_id=44345

I built from CVS head on a Solaris machine. bsddb.__version__ reports '4.2.1'. When run, the studly_hammer.py script completes the dbenv.open() call, but appears to hang during the hashopen() call. Adding some print statements to hashopen() indicates that it hangs during d.open(). I don't know what to make of this. If others have some suggestions, I'll be happy to try them out.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-29 07:15

Message:
Logged In: YES
user_id=31435

Greg, I'm in a constant state of debugging (in other apps) thread problems that *appear* unique to Win9x. But in years of this, I have yet to see one that actually is unique to Win9x -- in the end, they always turn out to be legit races in the app I'm debugging, and can always be reproduced on other platforms if the test is made stressful enough and/or run long enough. Win9x appears especially good at provoking thread problems just because its scheduling is erratic, often acting like a Linux system under extreme load that way.

IOW, unless there's a bug in Sleepycat's implementation of locking on Win9x, I bet dollars to doughnuts this program will eventually deadlock everywhere. In Python's lifetime, across dozens of miserable thread problems, we haven't pinned the blame once on Win9x. That wasn't for lack of trying <wink>.

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2003-09-29 00:42

Message:
Logged In: YES
user_id=29957

I'd be much happier with a documentation fix for 2.3.2.

Note that when I said "fails to complete" on Solaris, I meant that it crashes out, not that it deadlocks. I can post the tracebacks here if you'd like.
----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-29 00:02

Message:
Logged In: YES
user_id=413

anthony - if we don't put this patch into python 2.3.2, the python 2.3.x bsddb module documentation should be updated to say that multithreaded access is not supported and will cause problems, possibly even python interpreter crashes.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-28 23:57

Message:
Logged In: YES
user_id=413

Deadlocks only occurring under DOS-based "windows" (win95/98/me) aren't something the python module can prevent. I suggest submitting the sample code and info from studly_hammer.py to sleepycat. They're usually very responsive to questions of that nature.

btw, i'll give things a go on solaris later this week. if the test suite never completes i again suspect it is a berkeleydb library issue on that platform rather than python module.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-28 18:38

Message:
Logged In: YES
user_id=31435

Running the original hammer.py under current CVS Python freezes in the same way (as in my immediately preceding note) now too; again Win98SE.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-28 18:28

Message:
Logged In: YES
user_id=31435

About studly_hammer.py:

[Skip Montanaro]
> ...
> Attached is a modified version of the hammer.py script which seems to
> not fail for me on either Windows run from IDLE (Python 2.3, BDB
> 4.1.6) or Mac OS X (Python CVS, BDB 4.2.1). The original script
> failed for me on Windows but not Mac OS X. Can some other people for
> whom the original script fails please try it? (I also attached it to
> bug #775414.)

On Win98SE with current Python 2.3.1, it doesn't fail, but it never seemed to finish for me either.
Staring at WinTop showed that the Python process stopped accumulating cycles. Can't be killed with Ctrl+C (no visible effect). Can be killed with Ctrl+Break.

Dumping

    print "%s %s" % (thread.get_ident(), i)

at the top of the hammer loop showed that the threads get through several hundred iterations, then all printing stops.

Attaching to a debug-build Python from the debugger when a freeze occurs isn't terribly illuminating. One thread's stack shows

_BSDDB_D! __db_win32_mutex_lock + 134 bytes
_BSDDB_D! __lock_get + 2264 bytes
_BSDDB_D! __lock_get + 197 bytes
_BSDDB_D! __ham_get_meta + 120 bytes
_BSDDB_D! __ham_c_dup + 4201 bytes
_BSDDB_D! __db_c_put + 2544 bytes
_BSDDB_D! __db_put + 507 bytes
_DB_put(DBObject * 0x016cff88, __db_txn * 0x016d0000, __db_dbt * 0x016cc000, __db_dbt * 0x50d751fe, int 0) line 562 + 35 bytes

The main thread's stack shows

_BSDDB_D! __db_win32_mutex_lock + 134 bytes
_BSDDB_D! __lock_get + 2264 bytes
_BSDDB_D! __lock_get + 197 bytes
_BSDDB_D! __db_lget + 365 bytes
_BSDDB_D! __ham_lock_bucket + 105 bytes
_BSDDB_D! __ham_get_cpage + 195 bytes
_BSDDB_D! __ham_item_next + 25 bytes
_BSDDB_D! __ham_call_hash + 2479 bytes
_BSDDB_D! __ham_c_dup + 4307 bytes
_BSDDB_D! __db_c_put + 2544 bytes
_BSDDB_D! __db_put + 507 bytes
_DB_put(DBObject * 0x008fe2e8, __db_txn * 0x00000000, __db_dbt * 0x0062f230, __db_dbt * 0x0062f248, int 0) line 562 + 35 bytes
DB_ass_sub(DBObject * 0x008fe2e8, _object * 0x00b83178, _object * 0x00b83370) line 2330 + 23 bytes
PyObject_SetItem(_object * 0x008fe2e8, _object * 0x00b83178, _object * 0x00b83370) line 123 + 18 bytes
eval_frame(_frame * 0x00984948) line 1448 + 17 bytes
...

The other threads are somewhere in the OS kernel and don't have useful tracebacks. This varies from run to run, but all threads with a useful stack are always stuck at the same place in __db_win32_mutex_lock.

All in all, looks like it's simply deadlocked.
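
[Illustration] The diagnosis above rests on watching per-thread progress and noticing when it stops. A rough sketch of automating that check follows, written in current Python syntax rather than the 2.3 of this thread. The attached hammer.py is not reproduced here, so hammer() below is only a guess at its general shape, the store is a plain dict rather than a bsddb hash, and run_with_watchdog and its parameters are names made up for this example.

```python
import random
import threading
import time


def hammer(store, progress, slot, iterations):
    """Write and read random keys, recording progress after each iteration."""
    for i in range(iterations):
        key = str(random.randrange(100000))
        store[key] = str(i)
        store.get(key)
        progress[slot] = i + 1  # each thread writes only its own slot


def run_with_watchdog(store, n_threads=4, iterations=2000, stall_after=5.0):
    """Run the hammer threads and return (finished, progress).

    finished is False if no counter advanced for stall_after seconds while
    some thread was still alive, which is how a deadlock would show up.
    """
    progress = [0] * n_threads
    threads = [
        threading.Thread(target=hammer, args=(store, progress, n, iterations),
                         daemon=True)
        for n in range(n_threads)
    ]
    for t in threads:
        t.start()

    last_seen = list(progress)
    last_change = time.monotonic()
    while any(t.is_alive() for t in threads):
        time.sleep(0.01)
        snapshot = list(progress)
        if snapshot != last_seen:
            last_seen, last_change = snapshot, time.monotonic()
        elif time.monotonic() - last_change > stall_after:
            return False, snapshot
    return True, list(progress)


finished, counts = run_with_watchdog({})
```

With a plain dict nothing blocks, so every counter reaches the iteration count. A stalled run instead returns the counters frozen at whatever iteration each thread reached, which is the same information the print statement gave, without having to watch the console.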
----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2003-09-27 22:11

Message:
Logged In: YES
user_id=29957

Could you check that it (and the test_bsddb3) works on Solaris? There's a couple of solaris boxes on the SF compile farm (cf.sf.net). I was unable to get test_bsddb3 to complete at all on Solaris 2.6, 7 or 8, when using DB 4.1.25.

As far as 2.3.2, I really really don't think it's appropriate to throw it in at this late point. Particularly given the 2.3.1 screwups, I don't want to risk it.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-27 16:08

Message:
Logged In: YES
user_id=413

I just committed a change to bsddb/__init__.py (file rev 1.10) that adds the creation of a thread-safe DBEnv object for each hashopen, btopen or rnopen database. hammer.py has been running for 5 minutes on my linux/alpha system using BerkeleyDB 4.1.25. (admittedly my test is running on python 2.2.2, but as this isn't a python core related change i doubt that matters).

After others have tested this on other platforms with success I believe we can close this bug. This patch would probably be good for python 2.3.2.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-27 11:10

Message:
Logged In: YES
user_id=44345

If hammer.py fails for you, please try this slightly modified version (studly_hammer.py).

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-12 15:28

Message:
Logged In: YES
user_id=413

I don't see any problem in _bsddb.c:_DB_put(), what memory are you talking about? All of the DBT key and data parameters are allocated on the local stack on the various DB methods that call _DB_put. What do you see that could be clobbered?
----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-12 12:52

Message:
Logged In: YES
user_id=44345

The sleepycat mails (there are two of them - Keith's is second) are in the attached sleepy.txt file.

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-09-12 12:25

Message:
Logged In: YES
user_id=85414

Sorry to muddy the waters, but I'm 99% sure that this is not a threading issue. Today I had the same DBRunRecoveryError for my Spambayes POP3 proxy classifier database, which only ever gets accessed from the main program thread.

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-12 12:22

Message:
Logged In: YES
user_id=31392

I don't want to sound like a broken record, but I will: Can anyone comment on the lack of thread-safety in _DB_put()? It appears that there is nothing to prevent the memory used by one call from being stomped on by another call in a different thread. This problem would exist even in an application using the modern interface and specifying DB_THREAD.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-12 12:10

Message:
Logged In: YES
user_id=413

Looking at bsddb/__init__.py (where the old bsddb compatibility interface is implemented) I don't see why the hammer.py attached below should cause a problem. The database is opened with DB_THREAD using a private environment (no DBEnv passed to DB()). I definitely see potential threading problems with the _DBWithCursor class defined there if any of the methods using a cursor are used (the cursor could be shared across threads; that's a no-no). But in the context of hammer.py that doesn't happen so I wouldn't have expected a problem.

Unless perhaps creating the DB without a DBEnv implies that the DB_THREAD flag won't work.
The DB_RECOVER flag is only useful for opening existing DBEnv's; we have none.

I've got to pop offline for a bit now but i'll try a hammer.py modified to use direct DB calls (for easier playing around with and bug reporting to sleepycat if it turns out to be a bug on their end) later tonight.

PS Keith's response is in the sleepycat.txt attachment if you open the URL to this bug report on sourceforge.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-12 12:07

Message:
Logged In: YES
user_id=31435

Jeremy, Keith's response is in the sleepy.txt file attached to the bug report.

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-12 12:03

Message:
Logged In: YES
user_id=31392

I don't see Keith's response anywhere in this thread. Can you add it for the record?

The only call to db->put() that I see is in _DB_put(). It does not look thread-safe to me.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-12 12:00

Message:
Logged In: YES
user_id=44345

The bsddb module emulates the old bsddb module's 1.85-ish interface using modern DB/DBEnv objects underneath. So his comments about that not being threadsafe don't apply here.

But the low-level open() call isn't made with a DBEnv argument, is it? Nor is the DB_RECOVER flag set. Would the compatibility interface be able to do both things?

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-12 11:57

Message:
Logged In: YES
user_id=44345

In theory, yes, we could special case the bsddb stuff. However, the code currently is run indirectly via the anydbm module. It will take a little effort on our part to do something special for bsddb. It would be nice if other apps using the naive interface were able to use multiple threads.
----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-12 11:45

Message:
Logged In: YES
user_id=413

ah, Keith's response from sleepycat assumed that we were using the DB 1.85 compatibility interface. We do not. The bsddb module emulates the old bsddb module's 1.85-ish interface using modern DB/DBEnv objects underneath. So his comments about that not being threadsafe don't apply here.

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-12 11:37

Message:
Logged In: YES
user_id=31392

Are the DB_mapping methods only used by the old interface? My question is about those methods, which I assumed were used by the old and new interfaces.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-12 11:30

Message:
Logged In: YES
user_id=413

The old bsddb interface compatibility code could be modified to use a single DBEnv per process opened with the DB_SYSTEM_MEM flag. Do we want to do this? Shouldn't we encourage the use of the real pybsddb DB/DBEnv object interface for threads instead?

AFAIK the old bsddb module + libs were not thread safe.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-12 11:23

Message:
Logged In: YES
user_id=44345

From what I got back from Sleepycat on this, I'm pretty sure the old bsddb interface is not going to be thread safe. Attached are two messages from Sleepycat. Is there some way for the old interface to create a default environment shared by all the bsddb.*open() calls and then set the DB_RECOVER flag in the low-level open() call?

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-12 10:14

Message:
Logged In: YES
user_id=31392

How does the bsddb wrapper achieve thread safety?
I know very little about the wrapper or the underlying bsddb libraries.

I found the following comment in the C API docs:

http://www.sleepycat.com/docs/ref/program/mt.html#2

> When using the non-cursor Berkeley DB calls to retrieve
> key/data items (for example, DB->get), the memory to which the
> pointer stored into the Dbt refers is valid only until the next call
> using the DB handle returned by DB->open. This includes any
> use of the returned DB handle, including by another thread
> within the process.

This suggests that a call to self->db->get() must process its results (copy them into Python-owned memory) before any other operation on the same db object can proceed. Is that right? The bsddb wrapper releases the GIL before calling the low-level DB API functions and then reacquires it after the call returns. Is there some other lock that prevents multiple simultaneous calls from stomping on each other?

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-12 09:46

Message:
Logged In: YES
user_id=31392

I'm running this test with CVS Python (built on 9/11/03) on RH Linux 9 with bsddb 4.1.25. I see the same error although it takes a relatively long time to provoke -- a minute or two.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-12 09:08

Message:
Logged In: YES
user_id=31435

Greg, any luck? We're starting to see the same error ("fatal region error detected") in some ZODB tests using bsddb3, and that's an infinitely more complicated setup than this little program. Jeremy Hylton also sees "fatal region" errors on Linux, in the ZODB context.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-08-13 16:26

Message:
Logged In: YES
user_id=413

i'll try and reproduce this.
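
[Illustration] The question above asks whether any lock stops simultaneous calls on one handle from interfering once the GIL has been released. For reference, here is a sketch of what such a per-handle lock would look like if an application added it on its own side. Nothing in this thread says the bsddb wrapper does this: SerializedStore is a hypothetical class, and the wrapped object is an ordinary dict standing in for a database handle.

```python
import threading


class SerializedStore:
    """Wrap a dict-like object so only one thread at a time touches it."""

    def __init__(self, store):
        self._store = store
        self._lock = threading.Lock()

    def __getitem__(self, key):
        with self._lock:
            return self._store[key]

    def __setitem__(self, key, value):
        with self._lock:
            self._store[key] = value

    def __contains__(self, key):
        with self._lock:
            return key in self._store

    def __len__(self):
        with self._lock:
            return len(self._store)


def writer(store, thread_no, count):
    """Write `count` keys that are unique to this thread, then read them back."""
    for i in range(count):
        key = "%d-%d" % (thread_no, i)
        store[key] = str(i)
        assert store[key] == str(i)


shared = SerializedStore({})
workers = [threading.Thread(target=writer, args=(shared, n, 250))
           for n in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The cost is that every operation is serialized, so threads gain no concurrency on the store. It also would not explain or cure the single-threaded DBRunRecoveryError reported elsewhere in this thread.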
----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-07-22 01:50

Message:
Logged In: YES
user_id=85414

Minor correction: I'm on Plain Old Win98, not SE. For what it's worth, the script seems more often than not to provoke an application error when there's background load, and a DBRunRecoveryError when there isn't.