Stack from qemu_fill_buffer to qio_channel_socket_readv:

#0  qio_channel_socket_readv (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at ./io/channel-socket.c:477
#1  0x0000001486ec97e2 in qio_channel_read (ioc=ioc@entry=0x148a73a6c0, buf=buf@entry=\060\nLw", buflen=buflen@entry=28728, errp=errp@entry=0x0) at ./io/channel.c:112
#2  0x0000001486e005ec in channel_get_buffer (opaque=<optimized out>, buf=0x1489844c00 "\060\nLw", pos=<optimized out>, size=28728) at ./migration/qemu-file-channel.c:80
#3  0x0000001486dff095 in qemu_fill_buffer (f=f@entry=0x1489843c00) at ./migration/qemu-file.c:293

I checked that the fd in recvmsg(sioc->fd, &msg, sflags) is in fact the socket, e.g. with this fd being 27:

  tcp ESTAB 1405050 0 ::ffff:10.22.69.30:49152 ::ffff:10.22.69.157:49804 users:(("qemu-system-x86",pid=29273,fd=27)) ino:3345152 sk:30 <->
    skmem:(r1420644,rb1495660,t0,tb332800,f668,w0,o0,bl0,d14) ts sack cubic wscale:7,7 rto:200 rtt:0.04/0.02 ato:80 mss:8948 cwnd:10 bytes_received:1981460 segs_out:37 segs_in:247 data_segs_in:231 send 17896.0Mbps lastsnd:254728 lastrcv:250372 lastack:250372 rcv_rtt:0.205 rcv_space:115461 minrtt:0.04

I need to break on the failure of that recvmsg in qio_channel_socket_readv.

  # the following does not work: due to optimization the ret value is only around later
  b io/channel-socket.c:478 if ret < 0

But catching it "inside" the if works:

  b io/channel-socket.c:479

Take the following with a grain of salt; this is very threaded and noisy to debug.

Once I hit it, the recvmsg returned "-1"; that was at f->pos = 311641887. At the same time I could confirm (via ss) that the socket itself is still open on both source and target of the migration. The -1 is an EAGAIN, which gets returned as QIO_CHANNEL_ERR_BLOCK. That seems to arrive in nbd_rwv (nbd/common.c:44) and led to qio_channel_yield. There are a few coroutine switches in between, so I hope I'm not losing anything. But that first ret<0 actually worked; it seems the yield and retry got it working.

I got back to qemu_fill_buffer iterating further after this. It then hit ret<0 in qio_channel_socket_readv again at f->pos 311641887. This time, on returning the QIO_CHANNEL_ERR_BLOCK, it returned to "./migration/qemu-file-channel.c:81". That was interesting, as it is different than before.

After this it seemed to become a death spiral: recvmsg returned -1 every time (still on the same offset). It passed back through nbd_rwv, which called qio_channel_yield multiple times. Then it continued, and the last offset I saw was 321998304. It did not pass the breakpoint at io/channel-socket.c:479 any more at all, but then went on to the exit.
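
For reference, the EAGAIN / QIO_CHANNEL_ERR_BLOCK / yield-and-retry pattern described above can be sketched outside of QEMU. This is only a minimal Python illustration of the syscall behaviour, not QEMU's code: ERR_BLOCK, read_nonblocking and read_with_retry are hypothetical names, and select() merely stands in for whatever qio_channel_yield waits on.

```python
import errno
import select
import socket

# Hypothetical sentinel, standing in for QIO_CHANNEL_ERR_BLOCK.
ERR_BLOCK = object()


def read_nonblocking(sock, size):
    """Try one recvmsg(); return the data, or ERR_BLOCK if it would block."""
    try:
        data, _ancdata, _flags, _addr = sock.recvmsg(size)
        return data  # b"" here would mean the peer closed the connection
    except OSError as exc:
        # A failed recvmsg() with EAGAIN is not an I/O error: the socket is
        # still open, there is just nothing to read right now.
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return ERR_BLOCK
        raise


def read_with_retry(sock, size, timeout=1.0):
    """Retry the read, waiting for readability whenever it would block."""
    while True:
        result = read_nonblocking(sock, size)
        if result is not ERR_BLOCK:
            return result
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            raise TimeoutError("socket did not become readable")


# Small demonstration on a local socket pair.
a, b = socket.socketpair()
a.setblocking(False)
first = read_nonblocking(a, 16)   # nothing sent yet, so this would block
b.sendall(b"hello")
second = read_with_retry(a, 16)   # data is available now
a.close()
b.close()
```

The point of the sketch is only that a -1 with EAGAIN on a non-blocking socket is expected and recoverable, as long as the caller waits and retries; a caller that treats the "would block" result as a hard failure behaves differently.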

Hmm, I might have lost myself on the coroutine switches - but it is odd at least.
I am trying to redo this less interactively and with a bit more prep ...
Maybe the results are more reliable then ...

Getting back with more later ...

--
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1711602

Title:
  --copy-storage-all failing with qemu 2.10

Status in QEMU:
  New
Status in libvirt package in Ubuntu:
  Confirmed
Status in qemu package in Ubuntu:
  Confirmed

Bug description:
  We fixed an issue around disk locking already in regard to qemu-nbd
  [1], but there still seem to be issues.

  $ virsh migrate --live --copy-storage-all kvmguest-artful-normal qemu+ssh://10.22.69.196/system
  error: internal error: qemu unexpectedly closed the monitor:
  2017-08-18T12:10:29.800397Z qemu-system-x86_64: -chardev pty,id=charserial0: char device redirected to /dev/pts/0 (label charserial0)
  2017-08-18T12:10:48.545776Z qemu-system-x86_64: load of migration failed: Input/output error

  Source libvirt log for the guest:
  2017-08-18 12:09:08.251+0000: initiating migration
  2017-08-18T12:09:08.809023Z qemu-system-x86_64: Unable to read from socket: Connection reset by peer
  2017-08-18T12:09:08.809481Z qemu-system-x86_64: Unable to read from socket: Connection reset by peer

  Target libvirt log for the guest:
  2017-08-18T12:09:08.730911Z qemu-system-x86_64: load of migration failed: Input/output error
  2017-08-18 12:09:09.010+0000: shutting down, reason=crashed

  Given the timing, it seems that the actual copy now works (it is busy
  for ~10 seconds in my environment, which would be the copy). We also
  no longer see the old errors we saw before, but afterwards, on the
  actual take-over, it fails.

  Dmesg has no related denials, as often apparmor is in the mix.

  Need to check libvirt logs of source [2] and target [3] in detail.

  [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg02200.html
  [2]: http://paste.ubuntu.com/25339356/
  [3]: http://paste.ubuntu.com/25339358/

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1711602/+subscriptions