Hi Mike,

you might be the guy StefanHa was referring to on the qemu-devel mailing list.
I just made some more tests, so…

On 02.08.2013 at 23:47, Mike Dawson <mike.daw...@cloudapt.com> wrote:

> Oliver,
>
> We've had a similar situation occur. For about three months, we've run
> several Windows 2008 R2 guests with virtio drivers that record video
> surveillance. We have long suffered an issue where the guest appears to hang
> indefinitely (or until we intervene). For the sake of this conversation, we
> call this state "wedged", because it appears something (rbd, qemu, virtio,
> etc) gets stuck on a deadlock. When a guest gets wedged, we see the following:
>
> - the guest will not respond to pings

When the hung_task message shows up, I can still ping the guest and establish
new ssh sessions; only the session running the while loop no longer accepts
any keyboard input.

> - the qemu-system-x86_64 process drops to 0% cpu
> - graphite graphs show the interface traffic dropping to 0bps
> - the guest will stay wedged forever (or until we intervene)
> - strace of qemu-system-x86_64 shows QEMU is making progress [1][2]

Nothing special here:

5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=6, events=POLLIN}, {fd=19, events=POLLIN}, {fd=15, events=POLLIN}, {fd=4, events=POLLIN}], 11, -1) = 1 ([{fd=12, revents=POLLIN}])
[pid 11793] read(5, 0x7fff16b61f00, 16) = -1 EAGAIN (Resource temporarily unavailable)
[pid 11793] read(12, "\2\0\0\0\0\0\0\0\0\0\0\0\0\361p\0\252\340\374\373\373!gH\10\0E\0\0Yq\374"..., 69632) = 115
[pid 11793] read(12, 0x7f0c1737fcec, 69632) = -1 EAGAIN (Resource temporarily unavailable)
[pid 11793] poll([{fd=27, events=POLLIN|POLLERR|POLLHUP}, {fd=26, events=POLLIN|POLLERR|POLLHUP}, {fd=24, events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=

and the same for many, many threads. Inside the VM I see 75% iowait, but I can
restart the spew test in a second session. All of this was tested with
rbd_cache=false and cache=none. I am also testing every qemu version with a
2-CPU, 2 GiB mem Windows 7 VM under some high load and have encountered no
problem so far; it runs smooth and fast.

> We can "un-wedge" the guest by opening a NoVNC session or running a 'virsh
> screenshot' command. After that, the guest resumes and runs as expected. At
> that point we can examine the guest. Each time we'll see:
>
> - No Windows error logs whatsoever while the guest is wedged
> - A time sync typically occurs right after the guest gets un-wedged
> - Scheduled tasks do not run while wedged
> - Windows error logs do not show any evidence of suspend, sleep, etc
>
> We had so many issues with guests becoming wedged, we wrote a script to 'virsh
> screenshot' them via cron. Then we installed some updates and had a month or
> so of higher stability (wedging happened maybe 1/10th as often). Until today
> we couldn't figure out why.
>
> Yesterday, I realized qemu was starting the instances without specifying
> cache=writeback. We corrected that, and let them run overnight. With RBD
> writeback re-enabled, wedging came back as often as we had seen in the past.
> I've counted ~40 occurrences in the past 12-hour period. So I feel like
> writeback caching in RBD certainly makes the deadlock more likely to occur.
>
> Joshd asked us to gather RBD client logs:
>
> "joshd> it could very well be the writeback cache not doing a callback at
> some point - if you could gather logs of a vm getting stuck with debug rbd =
> 20, debug ms = 1, and debug objectcacher = 30 that would be great"
>
> We'll do that over the weekend. If you could as well, we'd love the help!
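For reference, those debug settings would go into the [client] section of
ceph.conf on the qemu host, roughly like this (the log file path is only an
example, adjust to taste):

    [client]
        debug rbd = 20
        debug ms = 1
        debug objectcacher = 30
        log file = /var/log/ceph/client.$name.$pid.log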
> [1] http://www.gammacode.com/kvm/wedged-with-timestamps.txt
> [2] http://www.gammacode.com/kvm/not-wedged.txt

As I wrote above, we run without cache so far, so I'm omitting the verbose
debugging for the moment, but I will do it if requested. Thanks for your
report,

Oliver.

> Thanks,
>
> Mike Dawson
> Co-Founder & Director of Cloud Architecture
> Cloudapt LLC
> 6330 East 75th Street, Suite 170
> Indianapolis, IN 46250
>
> On 8/2/2013 6:22 AM, Oliver Francke wrote:
>> Well,
>>
>> I believe I'm the winner of buzzword bingo for today.
>>
>> But seriously speaking... as I don't have this particular problem with
>> qcow2 on kernel 3.2, nor with qemu-1.2.2, nor with newer kernels, I hope
>> I'm not alone here?
>> We have a rising number of tickets from people reinstalling from ISOs
>> with the 3.2 kernel.
>>
>> A fast fallback is to start all VMs with qemu-1.2.2, but then we lose
>> some features a la latency-free RBD cache ;)
>>
>> I just opened a bug for qemu:
>>
>> https://bugs.launchpad.net/qemu/+bug/1207686
>>
>> with all the dirty details.
>>
>> Installing a backport kernel 3.9.x or upgrading the Ubuntu kernel to 3.8.x
>> "fixes" it. So we have a bad combination for all distros with a 3.2 kernel
>> and rbd as storage backend, I assume.
>>
>> Any similar findings?
>> Any idea of tracing/debugging ( Josh? ;) ) very welcome,
>>
>> Oliver.
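P.S.: In case anyone wants to automate the "un-wedge" workaround Mike
describes above, here is a rough, untested sketch of a cron-driven 'virsh
screenshot' script; the output directory is just a placeholder:

#!/usr/bin/env python
# Rough sketch: take a screenshot of every running libvirt domain, which
# appears to be enough to un-wedge a stuck guest. Run it from cron, e.g.
# every few minutes. Untested; the output directory is a placeholder.
import datetime
import subprocess

def running_domains():
    # 'virsh list --name' prints one running domain name per line.
    out = subprocess.check_output(["virsh", "list", "--name"])
    return [name.strip() for name in out.decode().splitlines() if name.strip()]

def screenshot(domain):
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    target = "/var/tmp/unwedge-%s-%s.ppm" % (domain, stamp)  # placeholder path
    subprocess.call(["virsh", "screenshot", domain, target])

if __name__ == "__main__":
    for dom in running_domains():
        screenshot(dom)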