>-----Original Message-----
>From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
>Sent: Wednesday, September 18, 2013 06:12 AM
>To: 'Mark Trumpold'
>Cc: qemu-devel@nongnu.org, 'Paul Clements', nbd-gene...@lists.sourceforge.net,
>bonz...@stefanha-thinkpad.redhat.com, w...@uter.be
>Subject: Re: [Qemu-devel] Hibernate and qemu-nbd
>
>On Tue, Sep 17, 2013 at 07:10:44AM -0700, Mark Trumpold wrote:
>> I am using the kernel functionality directly with the commands:
>> echo platform >/sys/power/disk
>> echo disk >/sys/power/state
>>
>> The following appears in dmesg when I attempt to hibernate:
>>
>> ====================================================
>> [ 38.881397] nbd (pid 1473: qemu-nbd) got signal 0
>> [ 38.881401] block nbd0: shutting down socket
>> [ 38.881404] block nbd0: Receive control failed (result -4)
>> [ 38.881417] block nbd0: queue cleared
>> [ 87.463133] block nbd0: Attempted send on closed socket
>> [ 87.463137] end_request: I/O error, dev nbd0, sector 66824
>> ====================================================
>>
>> My environment:
>> Debian: 6.0.5
>> Kernel: 3.3.1
>> Qemu userspace: 1.2.0
>
>This could be a bug in the nbd client kernel module.
>drivers/block/nbd.c:sock_xmit() does the following:
>
>        result = kernel_recvmsg(sock, &msg, &iov, 1, size,
>                                msg.msg_flags);
>
>        if (signal_pending(current)) {
>                siginfo_t info;
>                printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
>                        task_pid_nr(current), current->comm,
>                        dequeue_signal_lock(current, &current->blocked, &info));
>                result = -EINTR;
>                sock_shutdown(nbd, !send);
>                break;
>        }
>
>The signal number in the log output looks bogus, we shouldn't get 0.
>sock_xmit() actually blocks all signals except SIGKILL before calling
>kernel_recvmsg(). I guess this is an artifact of the suspend-to-disk
>operation, maybe the signal pending flag is set on the process.
>
>Perhaps someone with a better understanding of the kernel internals can
>check this?
>
>What happens next is that the nbd kernel module shuts down the NBD connection.
>
>As a workaround, please try running a separate nbd-client(1) process and drop
>the qemu-nbd -c command-line argument. This way nbd-client(1) uses the
>nbd kernel module instead of the qemu-nbd process and you'll get the
>benefit of nbd-client's automatic reconnect.
>
>Stefan
>
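As a userspace illustration of the failure mode Stefan describes (not the nbd kernel code itself): a blocking recv() that is interrupted by a signal fails with EINTR, and the caller decides whether that is fatal. sock_xmit() treats a pending signal as fatal and shuts the socket down; the sketch below shows the opposite policy, retrying until the whole buffer has arrived. `recv_full()` is a hypothetical helper written only for this example, and it assumes a connected stream socket.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of nbd, nbd-client or qemu-nbd):
 * read exactly len bytes from the stream socket fd.
 * Returns len on success, 0 if the peer closed the connection before
 * len bytes arrived, or -1 with errno set on a genuine socket error.
 * An EINTR from recv() is treated as "try again", not as a reason to
 * tear the connection down. */
ssize_t recv_full(int fd, void *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = recv(fd, (char *)buf + done, len - done, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;          /* real error, errno left set by recv() */
        }
        if (n == 0)
            return 0;           /* orderly shutdown by the peer */
        done += (size_t)n;      /* partial read: keep filling the buffer */
    }
    return (ssize_t)done;
}
```

On a socketpair, for example, recv_full(fd, buf, 8) returns 8 once the other end has sent 8 bytes, and 0 if the other end shuts down its write side first. Whether a retry policy like this is appropriate inside the kernel while tasks are being frozen for suspend is a separate question; this only shows the userspace pattern.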
Hi Stefan,

Thank you for the information.

I did some experiments per your suggestion, though I wasn't sure whether the following is what you had in mind:

1) Configured 'nbd-server' and started it (/etc/nbd-server/config):

   [generic]
   [export]
   exportname = /root/qemu/q1.img
   port = 2000

2) Started 'nbd-client':
   -> nbd-client localhost 2000 /dev/nbd0

3) Verified that '/dev/nbd0' is in use (it will appear in the list):
   -> cat /proc/partitions

At this point I could mount '/dev/nbd0' as expected, but that is not necessary to demonstrate the problem.

If I now enter S1 (standby), S3 (suspend to RAM), or S4 (suspend to disk), I get the same dmesg output as before, indicating that 'nbd0' caught signal 0 and exited. When I resume, I simply repeat step 3 to verify.

==================

Also, before contacting the group, I had modified the same kernel source that you identified in 'drivers/block/nbd.c:sock_xmit()' so that it takes no action when a signal is pending (it only logs the message). This was strictly for troubleshooting:

199     result = kernel_recvmsg(sock, &msg, &iov, 1, size,
200                             msg.msg_flags);
201
202     if (signal_pending(current)) {
203             siginfo_t info;
204             printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
205                     task_pid_nr(current), current->comm,
206                     dequeue_signal_lock(current, &current->blocked, &info));
207
208             //result = -EINTR;
209             //sock_shutdown(nbd, !send);
210             //break;
211     }

We then got errors ("Wrong magic ...")
in the following section:

/* NULL returned = something went wrong, inform userspace */
static struct request *nbd_read_stat(struct nbd_device *lo)
{
        int result;
        struct nbd_reply reply;
        struct request *req;

        reply.magic = 0;
        result = sock_xmit(lo, 0, &reply, sizeof(reply), MSG_WAITALL);
        if (result <= 0) {
                dev_err(disk_to_dev(lo->disk),
                        "Receive control failed (result %d)\n", result);
                goto harderror;
        }

        if (ntohl(reply.magic) != NBD_REPLY_MAGIC) {
                dev_err(disk_to_dev(lo->disk), "Wrong magic (0x%lx)\n",
                        (unsigned long)ntohl(reply.magic));
                result = -EPROTO;
                goto harderror;

So it seemed to me that the call at line 199 above must be returning with an error after we commented out the signal-handling logic.

Thank you for your attention on this. Let me know whether I followed your suggestion correctly, and whether there are other tests I can run.

Regards,
Mark T.
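In the excerpt above, nbd_read_stat() reports "Receive control failed" only when sock_xmit() returns a value <= 0; any positive result falls through to the magic check, which is where "Wrong magic" comes from. The userspace sketch below mirrors that check to show how a header read from the wrong position in the byte stream fails it. It assumes the 16-byte NBD simple-reply layout (32-bit magic, 32-bit error, 8-byte handle, integers in network byte order) and the reply magic 0x67446698; `parse_nbd_reply()`, `struct reply_hdr`, `REPLY_MAGIC` and `REPLY_LEN` are hypothetical names for this example, not kernel or nbd-client APIs.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define REPLY_MAGIC 0x67446698u /* assumed NBD simple-reply magic value */
#define REPLY_LEN   16          /* 4-byte magic + 4-byte error + 8-byte handle */

/* Hypothetical decoded form of the 16-byte reply header. */
struct reply_hdr {
    uint32_t magic;    /* host byte order after decoding */
    uint32_t error;    /* 0 on success, errno-style value otherwise */
    char handle[8];    /* opaque cookie echoed back from the request */
};

/* Decode a raw reply header from raw[0..REPLY_LEN-1].
 * Returns 0 if the magic matches, -1 otherwise (the userspace
 * counterpart of the "Wrong magic" path in nbd_read_stat()). */
int parse_nbd_reply(const unsigned char raw[REPLY_LEN], struct reply_hdr *out)
{
    uint32_t be;

    memcpy(&be, raw, 4);            /* avoid unaligned 32-bit loads */
    out->magic = ntohl(be);
    memcpy(&be, raw + 4, 4);
    out->error = ntohl(be);
    memcpy(out->handle, raw + 8, 8);

    return out->magic == REPLY_MAGIC ? 0 : -1;
}
```

Feeding it a correctly aligned header succeeds, while the same bytes read starting four bytes late yield a magic of 0 and are rejected. This only illustrates the check; it does not show what the interrupted kernel_recvmsg() actually left in the buffer in the experiment above.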