I'd like to thank everyone who replied on these bugs over the last couple years. Especially the people who prodded me into trying out kernel 2.6.18-6. I've tracked down the bug, or at least I think maybe I have. There is definitely a bug here, but it seems to be failing even in cases where it "ought to" work; I suspect that 2.6.18-6 is also a bit broken.
The problem stems from the fact that the input thread, for its entire execution, holds a lock on a mutex. The purpose of this mutex is to guard a condition variable and an associated member variable of the input thread, both of which are used to synchronize the input thread and the foreground code that actually reads from the keyboard. Reading input in the foreground was introduced in the patch that introduced this bug (previously we tried to read in the background; it turns out that curses updates the display when you read keyboard input, so this led to simultaneous calls to refresh() and corrupted output). Unfortunately, the threads wrapper used in cwidget does not provide any way to push thread cleanup handlers for mutex locks, because the authors of the POSIX threads spec cunningly made it impossible to embed these in an RAII C++ object due to their use of a macro. You probably see where this is going... Because of the flakiness of pthread_cancel(), cwidget mostly avoids it. However, it is used in a few places, including when we shut down the input thread. IIRC, this is done because it's necessary to interrupt a select(), and cancel() is the main tool at our disposal to do that. The problem is that, in the words of pthread_cleanup_push(3): ...if a thread exits OR IS CANCELLED while it owns a locked mutex, the mutex will remain locked forever and prevent other threads from executing normally. (emphasis mine) Recall that the only reason this mutex is locked is that we need it for the condition variable. But if you look at the code of the input thread, you'll see that the condition variable is only accessed in one sub-clause of that code. In fact, of the three cancellation points in that function (counting select(), which isn't one on Linux, and the bracketing pthread_testcancel() calls, which are), only one of them occurs in the area where the mutex actually has to be locked, and that's the line where we actually wait on the condition variable. And it turns out that wait() *DOES* handle cleanup from cancellation properly. One unexplained question is why the bug hit even when aptitude suspended in response to a keystroke. When aptitude was waiting for a keystroke, the input thread would be blocked in select(), so no cleanup handlers were installed and the scenario above makes sense. But when aptitude was waiting for a keystroke, I thought that the input thread would still be blocked waiting on the input event condition, and as I noted earlier, that should install a proper cleanup handler. Another question is why we didn't see this bug in 2.6.26. I can't think of anything, except that maybe the implementation of mutexes changed somehow so that they get unlocked when a thread is canceled. So, the long and the short of it is: if we just lock the mutex when we actually need it, that is, when we're about to send a get_input_event to the main thread, there's no danger of being canceled and leaving the mutex locked. I've attached a patch that does this; on my computer, it eliminates the hang. I'll ask the release managers for permission to apply this patch in Lenny. Now I just need to reboot into 2.6.26 so that my wireless works and I can actually send this mail. Daniel
diff --git a/src/cwidget/toplevel.cc b/src/cwidget/toplevel.cc index c912ad6..e0b6ab2 100644 --- a/src/cwidget/toplevel.cc +++ b/src/cwidget/toplevel.cc @@ -1,6 +1,6 @@ // toplevel.cc // -// Copyright 1999-2005, 2007-2008 Daniel Burrows +// Copyright 1999-2005, 2007-2009 Daniel Burrows // // This program is free software; you can redistribute it and/or modify // it under the terms of the GNU General Public License as published by @@ -446,7 +446,6 @@ namespace cwidget void operator()() { - threads::mutex::lock l(input_event_mutex); input_event_fired = false; // Important note: this routine only blocks indefinitely in @@ -480,6 +479,7 @@ namespace cwidget } else { + threads::mutex::lock l(input_event_mutex); post_event(new get_input_event(input_event_mutex, input_event_fired, input_event_condition));