On Tue, Jan 25, 2011 at 11:27 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Jan 25, 2011 at 09:49:04AM +0000, Stefan Hajnoczi wrote:
>> On Tue, Jan 25, 2011 at 7:12 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
>> > On Mon, Jan 24, 2011 at 8:05 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> >> Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
>> >>> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
>> >>>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
>> >>>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>> >>>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>> >>>>>>> Virtqueue notify is currently handled synchronously in userspace virtio.
>> >>>>>>> This prevents the vcpu from executing guest code while hardware
>> >>>>>>> emulation code handles the notify.
>> >>>>>>>
>> >>>>>>> On systems that support KVM, the ioeventfd mechanism can be used to
>> >>>>>>> make virtqueue notify a lightweight exit by deferring hardware
>> >>>>>>> emulation to the iothread and allowing the VM to continue execution.
>> >>>>>>> This model is similar to how vhost receives virtqueue notifies.
>> >>>>>>>
>> >>>>>>> The result of this change is improved performance for userspace
>> >>>>>>> virtio devices. Virtio-blk throughput increases especially for
>> >>>>>>> multithreaded scenarios and virtio-net transmit throughput increases
>> >>>>>>> substantially.
>> >>>>>>>
>> >>>>>>> Some virtio devices are known to have guest drivers which expect a
>> >>>>>>> notify to be processed synchronously and spin waiting for completion.
>> >>>>>>> Only enable ioeventfd for virtio-blk and virtio-net for now.
>> >>>>>>>
>> >>>>>>> Care must be taken not to interfere with vhost-net, which uses host
>> >>>>>>> notifiers. If the set_host_notifier() API is used by a device
>> >>>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal with
>> >>>>>>> host notifiers as it wishes.
>> >>>>>>>
>> >>>>>>> After migration and on VM change state (running/paused)
>> >>>>>>> virtio-ioeventfd will enable/disable itself.
>> >>>>>>>
>> >>>>>>>  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
>> >>>>>>>  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
>> >>>>>>>  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
>> >>>>>>>  * vm_change_state(running=0) -> disable virtio-ioeventfd
>> >>>>>>>  * vm_change_state(running=1) -> enable virtio-ioeventfd
>> >>>>>>>
>> >>>>>>> Signed-off-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
>> >>>>>>
>> >>>>>> On current git master I'm getting hangs when running iozone on a
>> >>>>>> virtio-blk disk. "Hang" means that it's not responsive any more and
>> >>>>>> has 100% CPU consumption.
>> >>>>>>
>> >>>>>> I bisected the problem to this patch. Any ideas?
>> >>>>>>
>> >>>>>> Kevin
>> >>>>>
>> >>>>> Does it help if you set ioeventfd=off on command line?
>> >>>>
>> >>>> Yes, with ioeventfd=off it seems to work fine.
>> >>>>
>> >>>> Kevin
>> >>>
>> >>> Then it's the ioeventfd that is to blame.
>> >>> Is it the io thread that consumes 100% CPU?
>> >>> Or the vcpu thread?
>> >>
>> >> I was building with the default options, i.e. there is no IO thread.
>> >>
>> >> Now I'm just running the test with IO threads enabled, and so far
>> >> everything looks good. So I can only reproduce the problem with IO
>> >> threads disabled.
>> >
>> > Hrm...aio uses SIGUSR2 to force the vcpu to process aio completions
>> > (relevant when --enable-io-thread is not used). I will take a look at
>> > that again and see why we're spinning without checking for ioeventfd
>> > completion.
>>
>> Here's my understanding of --disable-io-thread. Added Anthony on CC,
>> please correct me.
>>
>> When I/O thread is disabled our only thread runs guest code until an
>> exit request is made. There are synchronous exit cases like a halt
>> instruction or single step. There are also asynchronous exit cases
>> when signal handlers use qemu_notify_event(), which does cpu_exit(),
>> to set env->exit_request = 1 and unlink the current tb.
>>
>> With this structure in mind, anything which needs to interrupt the
>> vcpu in order to process events must use signals and
>> qemu_notify_event(). Otherwise that event source may be starved and
>> never processed.
>>
>> virtio-ioeventfd currently does not use signals and will therefore
>> never interrupt the vcpu.
>>
>> However, you normally don't notice the missing signal handler because
>> some other event interrupts the vcpu and we enter select(2) to process
>> all pending handlers. So virtio-ioeventfd mostly gets a free ride on
>> top of timer events. This is suboptimal because it adds latency to
>> virtqueue kick - we're waiting for another event to interrupt the vcpu
>> before we can process the virtqueue kick.
>>
>> If any other vcpu interruption makes virtio-ioeventfd chug along, then
>> why are you seeing 100% CPU livelock? My theory is that dynticks has
>> a race condition which causes timers to stop working in QEMU. Here is
>> an strace of QEMU --disable-io-thread entering livelock. I can
>> trigger this by starting a VM and running "while true; do true; done"
>> at the shell. Then strace the QEMU process:
>>
>> 08:04:34.985177 ioctl(11, KVM_RUN, 0)   = 0
>> 08:04:34.985242 --- SIGALRM (Alarm clock) @ 0 (0) ---
>> 08:04:34.985319 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
>> 08:04:34.985368 rt_sigreturn(0x2758ad0) = 0
>> 08:04:34.985423 select(15, [5 8 14], [], [], {0, 0}) = 1 (in [5], left {0, 0})
>> 08:04:34.985484 read(5, "\1\0\0\0\0\0\0\0", 512) = 8
>> 08:04:34.985538 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
>> 08:04:34.985588 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 273000}}, NULL) = 0
>> 08:04:34.985646 ioctl(11, KVM_RUN, 0)   = -1 EINTR (Interrupted system call)
>> 08:04:34.985928 --- SIGALRM (Alarm clock) @ 0 (0) ---
>> 08:04:34.986007 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
>> 08:04:34.986063 rt_sigreturn(0x2758ad0) = -1 EINTR (Interrupted system call)
>> 08:04:34.986124 select(15, [5 8 14], [], [], {0, 0}) = 1 (in [5], left {0, 0})
>> 08:04:34.986188 read(5, "\1\0\0\0\0\0\0\0", 512) = 8
>> 08:04:34.986246 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
>> 08:04:34.986299 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0
>> 08:04:34.986359 ioctl(11, KVM_INTERRUPT, 0x7fff90404ef0) = 0
>> 08:04:34.986406 ioctl(11, KVM_RUN, 0)   = 0
>> 08:04:34.986465 ioctl(11, KVM_RUN, 0)   = 0    <--- guest finishes execution
>>
>>                 v--- dynticks_rearm_timer() returns early because
>>                      timer is already scheduled
>> 08:04:34.986533 timer_gettime(0, {it_interval={0, 0}, it_value={0, 24203}}) = 0
>> 08:04:34.986585 --- SIGALRM (Alarm clock) @ 0 (0) ---    <--- timer expires
>> 08:04:34.986661 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
>> 08:04:34.986710 rt_sigreturn(0x2758ad0) = 0
>>
>>                 v--- we re-enter the guest without rearming the timer!
>> 08:04:34.986754 ioctl(11, KVM_RUN^C <unfinished ...>
>> [QEMU hang, 100% CPU]
>>
>> So dynticks fails to rearm the timer before we enter the guest. This
>> is a race condition: we check that there is already a timer scheduled
>> and head on towards re-entering the guest; the timer expires before we
>> enter the guest; we re-enter the guest without realizing the timer has
>> expired. Now we're inside the guest with no hope of a timer
>> expiring - and the guest is running a CPU-bound workload that doesn't
>> need to perform I/O.
>>
>> The result is a hung QEMU (screen does not update) and a softlockup
>> inside the guest once we do kick it to life again (by detaching
>> strace).
>>
>> I think the only way to avoid this race condition in dynticks is to
>> mask SIGALRM, then check if the timer expired, and then ioctl(KVM_RUN)
>> with an atomic signal mask change back to SIGALRM enabled. Thoughts?
>>
>> Back to virtio-ioeventfd, we really shouldn't use virtio-ioeventfd
>> when there is no I/O thread.
>
> Can we make it work with SIGIO?
>
>> It doesn't make sense because there's no
>> opportunity to process the virtqueue while the guest code is executing
>> in parallel like there is with I/O thread. It will just degrade
>> performance when QEMU only has one thread.
>
> Probably. But it's really better to check this than theorise about
> it.
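On the dynticks race, here is a minimal sketch of the mask/check/atomic-unmask
idea proposed above, using the KVM_SET_SIGNAL_MASK ioctl. This is only a
sketch, not the actual QEMU code; process_pending_timers() is a hypothetical
stand-in for running expired QEMU timers and rearming the dynticks timer, and
it assumes a SIGALRM handler is already installed:

/*
 * Sketch only: keep SIGALRM blocked in the vcpu thread and let KVM unblock
 * it atomically for the duration of KVM_RUN via KVM_SET_SIGNAL_MASK.
 */
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

extern void process_pending_timers(void);   /* hypothetical helper */

static void setup_run_sigmask(int vcpu_fd)
{
    sigset_t blocked, run_mask;
    struct kvm_signal_mask *kmask;

    /* Block SIGALRM for all userspace code in this thread... */
    sigemptyset(&blocked);
    sigaddset(&blocked, SIGALRM);
    sigprocmask(SIG_BLOCK, &blocked, NULL);

    /* ...but have KVM unblock it while the guest runs, so an expiring
     * timer always interrupts KVM_RUN instead of firing in the window
     * between our expiry check and guest entry. */
    sigprocmask(SIG_BLOCK, NULL, &run_mask);
    sigdelset(&run_mask, SIGALRM);

    kmask = malloc(sizeof(*kmask) + sizeof(run_mask));
    kmask->len = 8;   /* size of the kernel's sigset_t on Linux x86-64;
                         the kernel only reads the first len bytes */
    memcpy(kmask->sigset, &run_mask, sizeof(run_mask));
    ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, kmask);
    free(kmask);
}

static void vcpu_loop(int vcpu_fd)
{
    sigset_t pending, alrm;
    struct timespec zero = { 0, 0 };

    sigemptyset(&alrm);
    sigaddset(&alrm, SIGALRM);

    for (;;) {
        /* SIGALRM cannot be delivered here, so this check cannot race
         * with the timer expiring just before guest entry. */
        sigpending(&pending);
        if (sigismember(&pending, SIGALRM)) {
            sigtimedwait(&alrm, NULL, &zero);   /* consume the pending signal */
            process_pending_timers();
        }

        /* SIGALRM is unblocked only inside this ioctl; if it fires, the
         * ioctl returns -1/EINTR and we go around the loop again. */
        ioctl(vcpu_fd, KVM_RUN, 0);
    }
}

Because SIGALRM is blocked everywhere except inside KVM_RUN, an expiry can no
longer slip in between the check and guest entry; at worst it is delivered
right after entering KVM_RUN, which then returns with EINTR.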
As for making it work with SIGIO: eventfd does not seem to support O_ASYNC.
After adding the necessary code into QEMU no signals were firing, so I wrote
a test:

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>       /* getpid(), sleep() */
#include <sys/wait.h>     /* wait() */
#include <sys/eventfd.h>

int main(int argc, char **argv)
{
    int fd = eventfd(0, 0);
    if (fd < 0) {
        perror("eventfd");
        exit(1);
    }

    /* Ask for SIGTERM (instead of the default SIGIO) to be sent to this
     * process when the eventfd becomes readable. */
    if (fcntl(fd, F_SETSIG, SIGTERM) < 0) {
        perror("fcntl(F_SETSIG)");
        exit(1);
    }
    if (fcntl(fd, F_SETOWN, getpid()) < 0) {
        perror("fcntl(F_SETOWN)");
        exit(1);
    }
    if (fcntl(fd, F_SETFL, O_NONBLOCK | O_ASYNC) < 0) {
        perror("fcntl(F_SETFL)");
        exit(1);
    }

    switch (fork()) {
    case -1:
        perror("fork");
        exit(1);
    case 0: /* child: make the eventfd readable */
        eventfd_write(fd, 1);
        exit(0);
    default: /* parent */
        break;
    }

    sleep(5);
    wait(NULL);
    close(fd);
    return 0;
}

I'd expect the parent to get a SIGTERM, but the process just sleeps and then
exits. When replacing the eventfd with a pipe in this program the parent does
receive the SIGTERM (a sketch of that pipe variant follows below).

Stefan
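For completeness, a sketch of that pipe variant: the same
F_SETSIG/F_SETOWN/O_ASYNC setup, but on the read end of a pipe instead of an
eventfd. It is not the exact program I ran; the SIGTERM handler is only there
so the delivery shows up as a message rather than terminating the parent:

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>

static void on_sigterm(int signum)
{
    /* write(2) is async-signal-safe, printf(3) is not */
    const char msg[] = "SIGTERM delivered\n";
    (void)signum;
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(int argc, char **argv)
{
    int fds[2];

    if (pipe(fds) < 0) {
        perror("pipe");
        exit(1);
    }
    signal(SIGTERM, on_sigterm);

    if (fcntl(fds[0], F_SETSIG, SIGTERM) < 0 ||
        fcntl(fds[0], F_SETOWN, getpid()) < 0 ||
        fcntl(fds[0], F_SETFL, O_NONBLOCK | O_ASYNC) < 0) {
        perror("fcntl");
        exit(1);
    }

    switch (fork()) {
    case -1:
        perror("fork");
        exit(1);
    case 0: /* child: make the read end readable */
        write(fds[1], "x", 1);
        exit(0);
    default: /* parent */
        break;
    }

    sleep(5);   /* interrupted early by SIGTERM once the child writes */
    wait(NULL);
    close(fds[0]);
    close(fds[1]);
    return 0;
}

Swapping the pipe back for an eventfd in this program reproduces the silence
described above.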