On Thu, Sep 19, 2013 at 10:44 PM, Mark Trumpold <ma...@netqa.com> wrote:
>
>>-----Original Message-----
>>From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
>>Sent: Wednesday, September 18, 2013 06:12 AM
>>To: 'Mark Trumpold'
>>Cc: qemu-devel@nongnu.org, 'Paul Clements', nbd-gene...@lists.sourceforge.net,
>>bonz...@stefanha-thinkpad.redhat.com, w...@uter.be
>>Subject: Re: [Qemu-devel] Hibernate and qemu-nbd
>>
>>On Tue, Sep 17, 2013 at 07:10:44AM -0700, Mark Trumpold wrote:
>>> I am using the kernel functionality directly with the commands:
>>>     echo platform >/sys/power/disk
>>>     echo disk >/sys/power/state
>>>
>>> The following appears in dmesg when I attempt to hibernate:
>>>
>>> ====================================================
>>> [   38.881397] nbd (pid 1473: qemu-nbd) got signal 0
>>> [   38.881401] block nbd0: shutting down socket
>>> [   38.881404] block nbd0: Receive control failed (result -4)
>>> [   38.881417] block nbd0: queue cleared
>>> [   87.463133] block nbd0: Attempted send on closed socket
>>> [   87.463137] end_request: I/O error, dev nbd0, sector 66824
>>> ====================================================
>>>
>>> My environment:
>>>     Debian: 6.0.5
>>>     Kernel: 3.3.1
>>>     Qemu userspace: 1.2.0
>>
>>This could be a bug in the nbd client kernel module.
>>drivers/block/nbd.c:sock_xmit() does the following:
>>
>>        result = kernel_recvmsg(sock, &msg, &iov, 1, size,
>>                                msg.msg_flags);
>>
>>        if (signal_pending(current)) {
>>                siginfo_t info;
>>                printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
>>                        task_pid_nr(current), current->comm,
>>                        dequeue_signal_lock(current, &current->blocked, &info));
>>                result = -EINTR;
>>                sock_shutdown(nbd, !send);
>>                break;
>>        }
>>
>>The signal number in the log output looks bogus, we shouldn't get 0.
>>sock_xmit() actually blocks all signals except SIGKILL before calling
>>kernel_recvmsg().  I guess this is an artifact of the suspend-to-disk
>>operation, maybe the signal pending flag is set on the process.
>>
>>Perhaps someone with a better understanding of the kernel internals can
>>check this?
>>
>>What happens next is that the nbd kernel module shuts down the NBD
>>connection.
>>
>>As a workaround, please try running a separate nbd-client(1) process and
>>drop the qemu-nbd -c command-line argument.  This way nbd-client(1) uses
>>the nbd kernel module instead of the qemu-nbd process and you'll get the
>>benefit of nbd-client's automatic reconnect.
>>
>>Stefan
>>
>
> Hi Stefan,
>
> Thank you for the information.
>
> I did some experiments per your suggestion.  I wasn't sure if the
> following was what you had in mind:
>
> 1) Configured 'nbd-server' and started it (/etc/nbd-server/config):
>        [generic]
>        [export]
>            exportname = /root/qemu/q1.img
>            port = 2000

You can use qemu-nbd instead of nbd-server.  This way you'll be able to
serve up qcow2 and other image formats.  Just avoid the qemu-nbd -c
option.  Dropping -c makes qemu-nbd purely run the NBD network protocol
and skips simultaneously running the kernel NBD client.  (Since qemu-nbd
doesn't reconnect when ioctl(NBD_DO_IT) fails with EINTR, the workaround
is to use nbd-client(1) to drive the kernel NBD client instead.)
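For example, reusing the port and image path from your config (untested,
adjust as needed):

    # serve the image over TCP only; without -c no kernel device is attached
    qemu-nbd -p 2000 /root/qemu/q1.img

    # attach /dev/nbd0 through the kernel client, which handles reconnecting
    nbd-client localhost 2000 /dev/nbd0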
> 2) Started 'nbd-client':
>        -> nbd-client localhost 2000 /dev/nbd0
>
> 3) Verify '/dev/nbd0' is in use (will appear in list):
>        -> cat /proc/partitions
>
> At this point I could mount '/dev/nbd0' as expected, but that is not
> necessary to demonstrate the problem.
>
> Now if I enter S1 (standby), S3 (suspend to RAM), or S4 (suspend to disk)
> I get the same dmesg as before indicating 'nbd0' caught signal 0 and
> exited.
>
> When I resume I simply repeat step #3 to verify.

It's expected that you get the same kernel messages.  The difference
should be that /dev/nbd0 is still accessible after resuming from disk
because nbd-client automatically reconnects after the nbd kernel module
bails out with EINTR.

> ==================
>
> Also, previously before contacting the group I had modified the same
> kernel source that you had identified in 'drivers/block/nbd.c:sock_xmit()'
> to not take any action.  This was strictly for troubleshooting:
>
> 199         result = kernel_recvmsg(sock, &msg, &iov, 1, size,
> 200                 msg.msg_flags);
> 201
> 202         if (signal_pending(current)) {
> 203                 siginfo_t info;
> 204                 printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
> 205                         task_pid_nr(current), current->comm,
> 206                         dequeue_signal_lock(current, &current->blocked, &info));
> 207
> 208                 //result = -EINTR;
> 209                 //sock_shutdown(nbd, !send);
> 210                 //break;
> 211         }
>
> We then got errors ("Wrong magic ...") in the following section:
>
>         /* NULL returned = something went wrong, inform userspace */
>         static struct request *nbd_read_stat(struct nbd_device *lo)
>         {
>                 int result;
>                 struct nbd_reply reply;
>                 struct request *req;
>
>                 reply.magic = 0;
>                 result = sock_xmit(lo, 0, &reply, sizeof(reply), MSG_WAITALL);
>                 if (result <= 0) {
>                         dev_err(disk_to_dev(lo->disk),
>                                 "Receive control failed (result %d)\n", result);
>                         goto harderror;
>                 }
>
>                 if (ntohl(reply.magic) != NBD_REPLY_MAGIC) {
>                         dev_err(disk_to_dev(lo->disk), "Wrong magic (0x%lx)\n",
>                                 (unsigned long)ntohl(reply.magic));
>                         result = -EPROTO;
>                         goto harderror;
>
> So, it seemed to me the call at line #199 above must be returning with an
> error after we commented out the signal action logic.

I'm not familiar enough with the code to say what is happening.  As the
next step I would print out the kernel_recvmsg() return value when the
signal is pending and look into what happens during suspend-to-disk
(there's some sort of process freezing that takes place).

Sorry I can't be of more help.  Hopefully someone more familiar with the
nbd kernel module will have time to chime in.

Stefan
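P.S.  This is roughly the kind of debug instrumentation I mean.  An
untested sketch based on the sock_xmit() code you quoted, just logging
what the send/recv call returned before the existing error handling runs:

        if (signal_pending(current)) {
                siginfo_t info;
                /* debug only: what did the send/recv call actually return? */
                printk(KERN_WARNING "nbd (pid %d: %s): xmit result %d with signal pending\n",
                        task_pid_nr(current), current->comm, result);
                printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
                        task_pid_nr(current), current->comm,
                        dequeue_signal_lock(current, &current->blocked, &info));
                result = -EINTR;
                sock_shutdown(nbd, !send);
                break;
        }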