On Wed, Oct 05, 2022 at 02:50:06PM +0300, Andrey Zhadchenko wrote:
>
>
> On 10/4/22 22:00, Stefan Hajnoczi wrote:
> > On Mon, Jul 25, 2022 at 11:55:26PM +0300, Andrey Zhadchenko wrote:
> > > Although QEMU virtio-blk is quite fast, there is still some room for
> > > improvements. Disk latency can be reduced if we handle virtio-blk
> > > requests in the host kernel, so we avoid a lot of syscalls and context
> > > switches.
> > >
> > > The biggest disadvantage of this vhost-blk flavor is that it supports
> > > only raw format. Luckily, Kirill Thai has proposed a device-mapper
> > > driver for the QCOW2 format to attach files as block devices:
> > > https://www.spinics.net/lists/kernel/msg4292965.html
> > >
> > > Also, by using kernel modules we can bypass the IOThread limitation and
> > > finally scale block requests with CPUs for high-performance devices.
> > > This is planned to be implemented in the next version.
> > >
> > > Linux kernel module part:
> > > https://lore.kernel.org/kvm/20220725202753.298725-1-andrey.zhadche...@virtuozzo.com/
> > >
> > > Test setup and results:
> > > fio --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=128
> > > QEMU drive options: cache=none
> > > filesystem: xfs
> > >
> > > SSD:
> > >                  | randread, IOPS | randwrite, IOPS |
> > > Host             | 95.8k          | 85.3k           |
> > > QEMU virtio      | 57.5k          | 79.4k           |
> > > QEMU vhost-blk   | 95.6k          | 84.3k           |
> > >
> > > RAMDISK (vq == vcpu):
> > >                  | randread, IOPS | randwrite, IOPS |
> > > virtio, 1vcpu    | 123k           | 129k            |
> > > virtio, 2vcpu    | 253k (??)      | 250k (??)       |
> > > virtio, 4vcpu    | 158k           | 154k            |
> > > vhost-blk, 1vcpu | 110k           | 113k            |
> > > vhost-blk, 2vcpu | 247k           | 252k            |
> > > vhost-blk, 4vcpu | 576k           | 567k            |
> > >
> > > Andrey Zhadchenko (1):
> > >   block: add vhost-blk backend
> > >
> > >  configure                     |  13 ++
> > >  hw/block/Kconfig              |   5 +
> > >  hw/block/meson.build          |   1 +
> > >  hw/block/vhost-blk.c          | 395 ++++++++++++++++++++++++++++++++++
> > >  hw/virtio/meson.build         |   1 +
> > >  hw/virtio/vhost-blk-pci.c     | 102 +++++++++
> > >  include/hw/virtio/vhost-blk.h |  44 ++++
> > >  linux-headers/linux/vhost.h   |   3 +
> > >  8 files changed, 564 insertions(+)
> > >  create mode 100644 hw/block/vhost-blk.c
> > >  create mode 100644 hw/virtio/vhost-blk-pci.c
> > >  create mode 100644 include/hw/virtio/vhost-blk.h
> >
> > vhost-blk has been tried several times in the past. That doesn't mean it
> > cannot be merged this time, but past arguments should be addressed:
> >
> > - What makes it necessary to move the code into the kernel? In the past
> > the performance results were not very convincing. The fastest
> > implementations actually tend to be userspace NVMe PCI drivers that
> > bypass the kernel! Bypassing the VFS and submitting block requests
> > directly was not a huge boost. The syscall/context switch argument
> > sounds okay but the numbers didn't really show that kernel block I/O
> > is much faster than userspace block I/O.
> >
> > I've asked for more details on the QEMU command-line to understand
> > what your numbers show. Maybe something has changed since previous
> > times when vhost-blk has been tried.
> >
> > The only argument I see is QEMU's current 1 IOThread per virtio-blk
> > device limitation, which is currently being worked on. If that's the
> > only reason for vhost-blk then is it worth doing all the work of
> > getting vhost-blk shipped (kernel, QEMU, and libvirt changes)? It
> > seems like a short-term solution.
> >
> > - The security impact of bugs in kernel vhost-blk code is more serious
> > than bugs in a QEMU userspace process.
> >
> > - The management stack needs to be changed to use vhost-blk whereas
> > QEMU can be optimized without affecting other layers.
> >
> > Stefan
>
> Indeed there were several vhost-blk attempts, but from what I found in
> the mailing lists only the Asias attempt got some attention and
> discussion. Ramdisk performance results were great, but a ramdisk is
> more a benchmark than a real use. I didn't find out why Asias dropped
> his version except a vague "he concluded the performance results were
> not worth it". Storage speed is very important for vhost-blk
> performance, as there is no point in cutting CPU costs from 1ms to
> 0.1ms if the request needs 50ms to proceed in the actual disk. I think
> that 10 years ago NVMe was non-existent and SSD + SATA was probably a
> lot faster than HDD, but still not fast enough to utilize this
> technology.
Yes, it's possible that latency improvements are more noticeable now.
Thank you for posting the benchmark results. I will also run benchmarks
so we can compare vhost-blk with today's QEMU as well as multiqueue
IOThreads QEMU (for which I only have a hacky prototype) on a local NVMe
PCI SSD.

> The tests I did give me 60k IOPS randwrite for VM and 95k for host.
> And vhost-blk is able to negate the difference even using only
> 1 thread/vq/vcpu. And unlike the current QEMU single IOThread, it can
> easily be scaled with the number of cpus/vcpus. For sure this could be
> solved by lifting the IOThread limitations, but that will probably
> require an even more disastrous amount of changes (and adding
> vhost-blk won't break old setups!).
>
> Probably the only undisputed advantage of vhost-blk is syscall
> reduction. And again, the benefit really depends on storage speed, as
> it should be somewhat comparable with syscall time. Also I must note
> that this may be good for high-density servers with a lot of VMs. But
> for now I do not have exact numbers showing how much time we are
> really winning per request on average.
>
> Overall, vhost-blk will only become better as storage speed increases.
>
> Also I must note that all the arguments above apply to vdpa-blk too.
> And unlike vhost-blk, which needs its own QEMU code, vdpa-blk can be
> set up with the generic virtio-vdpa QEMU code (I am not sure if it is
> merged yet, but still). Although vdpa-blk has its own problems for
> now.

Yes, I think that's why Stefano hasn't pushed for a software vdpa-blk
device yet despite having played with it and is more focused on
hardware enablement. vdpa-blk has the same issues as vhost-blk.

Stefan
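For reference, a minimal sketch of the kind of benchmark setup described
in the quoted cover letter: the fio flags and cache=none are taken from
the text above, while the guest device path, fio job name, run time,
image path, and aio=native are assumptions added for illustration. Only
the plain virtio-blk baseline is shown, since the vhost-blk device
options are defined by the patch under review and are not restated here.

  # Inside the guest: the fio invocation quoted above, plus an assumed
  # job name, target device, and run time.
  fio --name=randread --filename=/dev/vda --direct=1 --rw=randread \
      --bs=4k --ioengine=libaio --iodepth=128 --runtime=60 --time_based

  # On the host: a virtio-blk baseline with cache=none as quoted;
  # the image path and aio=native are assumptions.
  qemu-system-x86_64 -enable-kvm -m 4G -smp 4 \
      -drive file=/path/to/disk.img,if=none,id=drive0,format=raw,cache=none,aio=native \
      -device virtio-blk-pci,drive=drive0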