Hi all,
Recently, while testing COLO, I found that the client sometimes hangs on
the Primary side. At first I thought it was a COLO-related issue, but
after a lot of testing I suspect it may be an NBD issue (although I'm
not sure). So I'd like to share what I found:
Since commit 1c778ef7, we have converted to using the QIOChannel APIs
for the actual socket I/O.
Let's focus on nbd_reply_ready() here:
Before commit 1c778ef7:

nbd_reply_ready()
    nbd_receive_reply()
        nbd_wr_sync()
        {
            ...
            while (offset < size) {
                if (do_read) {
                    len = qemu_recv(fd, buffer + offset, size - offset, 0);
                } else {
                    ...
                }
                if (len < 0) {
                    err = socket_error();
                    if (err == EINTR ||
                        (offset > 0 && (err == EAGAIN || err == EWOULDBLOCK))) {
                        continue;
                    }
                    return -err;
                }
                ...
            }
            ...
        }
So if len < 0 and the error is EAGAIN, there are two possible outcomes:
1) If part of the data has already been read (offset > 0), continue to
recv until the read has finished.
2) Otherwise return -EAGAIN; nbd_receive_reply() checks this return value
and returns *successfully*.
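To make the old behaviour concrete, here is a minimal standalone sketch
(plain POSIX sockets, not QEMU code). old_style_read() and demo_old_read()
are hypothetical names used only for illustration, and recv()/errno stand
in for qemu_recv()/socket_error(). It assumes a non-blocking socket, as in
the excerpt above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the read path of the old nbd_wr_sync(). */
static ssize_t old_style_read(int fd, void *buffer, size_t size)
{
    size_t offset = 0;

    while (offset < size) {
        ssize_t len = recv(fd, (char *)buffer + offset, size - offset, 0);

        if (len < 0) {
            int err = errno;

            /* Retry when interrupted, or when a partial read is pending. */
            if (err == EINTR ||
                (offset > 0 && (err == EAGAIN || err == EWOULDBLOCK))) {
                continue;
            }
            /* Nothing read yet: hand the error back instead of waiting. */
            return -err;
        }
        if (len == 0) {
            break; /* peer closed the connection */
        }
        offset += (size_t)len;
    }
    return (ssize_t)offset;
}

/*
 * Hypothetical driver: queue `avail` bytes on one end of a socket pair,
 * then try to read `want` bytes from the non-blocking other end.
 * Note: 0 < avail < want would spin in the retry branch above.
 */
static ssize_t demo_old_read(size_t avail, size_t want)
{
    char data[64] = { 0 };
    char buf[64];
    int sv[2];
    ssize_t ret;

    assert(avail <= sizeof(data) && want <= sizeof(buf));
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -errno;
    }
    fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL) | O_NONBLOCK);
    if (avail > 0 && send(sv[1], data, avail, 0) < 0) {
        ret = -errno;
    } else {
        ret = old_style_read(sv[0], buf, want);
    }
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

In this sketch demo_old_read(0, 16) comes back immediately with -EAGAIN,
so the caller regains control and can simply try again later, while
demo_old_read(16, 16) reads the full 16 bytes.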
After commit 1c778ef7:

nbd_reply_ready()
    read_sync()
        nbd_wr_syncv()
        {
            ...
            while (nlocal_iov > 0) {
                ...
                if (do_read) {
                    len = qio_channel_readv(ioc, local_iov, nlocal_iov,
                                            &local_err);
                } else {
                    ...
                }
                if (len == QIO_CHANNEL_ERR_BLOCK) {
                    if (qemu_in_coroutine()) {
                        qemu_coroutine_yield();
                    } else {
                        qio_channel_wait(ioc,
                                         do_read ? G_IO_IN : G_IO_OUT);
                    }
                    continue;
                }
                ...
            }
        }
For NBD, the read goes through:

qio_channel_readv()
    qio_channel_readv_full()
        klass->io_readv()
            qio_channel_socket_readv()
            {
                for (...) {
                    ret = recv(xxx);
                    if (ret < 0) {
                        if (errno == EAGAIN) {
                            if (done) {
                                return done;
                            } else {
                                return QIO_CHANNEL_ERR_BLOCK;
                            }
                        }
                    }
                    ...
                }
            }
Here, if ret < 0 && errno == EAGAIN && !done, we return
QIO_CHANNEL_ERR_BLOCK. nbd_wr_syncv() then invokes qio_channel_wait(),
and the guest *hangs* until I kill the NBD server service.
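The new behaviour can be modelled in the same standalone way. This is
only a sketch under stated assumptions, not QEMU code: poll() stands in
for qio_channel_wait(), SKETCH_ERR_BLOCK for QIO_CHANNEL_ERR_BLOCK, and
all function names are hypothetical. SKETCH_ERR_TIMEOUT and the
timeout_ms parameter exist only so the demo terminates; passing -1 would
model an unbounded wait.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_ERR_BLOCK   (-2) /* stand-in for QIO_CHANNEL_ERR_BLOCK */
#define SKETCH_ERR_TIMEOUT (-3) /* demo-only: the wait gave up */

/* One read attempt: bytes read, 0 on EOF, or SKETCH_ERR_BLOCK if empty. */
static ssize_t sketch_read_once(int fd, void *buf, size_t size)
{
    ssize_t ret;

    do {
        ret = recv(fd, buf, size, 0);
    } while (ret < 0 && errno == EINTR);
    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return SKETCH_ERR_BLOCK;
    }
    return ret; /* -1 on any other error */
}

/* Hypothetical stand-in for the read path of nbd_wr_syncv(). */
static ssize_t new_style_read(int fd, void *buffer, size_t size,
                              int timeout_ms)
{
    size_t offset = 0;

    while (offset < size) {
        ssize_t len = sketch_read_once(fd, (char *)buffer + offset,
                                       size - offset);

        if (len == SKETCH_ERR_BLOCK) {
            /* Wait for readability instead of returning to the caller. */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            int n = poll(&pfd, 1, timeout_ms);

            if (n == 0) {
                return SKETCH_ERR_TIMEOUT;
            }
            if (n < 0 && errno != EINTR) {
                return -1;
            }
            continue;
        }
        if (len < 0) {
            return -1;
        }
        if (len == 0) {
            break; /* peer closed the connection */
        }
        offset += (size_t)len;
    }
    return (ssize_t)offset;
}

/* Hypothetical driver mirroring the three situations discussed here. */
static ssize_t demo_new_read(size_t avail, size_t want, int close_peer,
                             int timeout_ms)
{
    char data[64] = { 0 };
    char buf[64];
    int sv[2];
    ssize_t ret;

    assert(avail <= sizeof(data) && want <= sizeof(buf));
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL) | O_NONBLOCK);
    if (avail > 0 && send(sv[1], data, avail, 0) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (close_peer) {
        close(sv[1]); /* roughly what killing the server does */
    }
    ret = new_style_read(sv[0], buf, want, timeout_ms);
    close(sv[0]);
    if (!close_peer) {
        close(sv[1]);
    }
    return ret;
}
```

With an idle peer, demo_new_read(0, 16, 0, 50) sits in the wait until the
50 ms demo timeout expires; with an unbounded wait it would not return.
Closing the peer, as in demo_new_read(0, 16, 1, 50), lets the read see
EOF and return at once, which would be consistent with the hang ending
only when the server is killed.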
It's easy to reproduce. My question: is the behaviour I describe above
what we expect?
Thanks
-Xie