Hi Sagi,

I just got back from a conference and am going to be offline for a week starting tomorrow. I haven't had time to look through your email but will reply when I'm back from vacation.
Stefan

On Sun, 11 Jun 2023 at 14:29, Sagi Grimberg <s...@grimberg.me> wrote:
>
> On 6/8/23 19:08, Stefan Hajnoczi wrote:
> > On Thu, Jun 08, 2023 at 10:40:57AM +0300, Sagi Grimberg wrote:
> >> Hey Stefan, Paolo,
> >>
> >> I just had a report from a user experiencing lower virtio-blk
> >> performance than he expected. This user is running virtio-blk on top
> >> of an nvme-tcp device. The guest is running 12 CPU cores.
> >>
> >> The guest read/write throughput is capped at around 30% of the
> >> available throughput from the host (~800MB/s from the guest vs.
> >> 2800MB/s from the host over a 25Gb/s NIC). The workload running on
> >> the guest is a multi-threaded fio workload.
> >>
> >> What we observe is that virtio-blk uses a single disk-wide iothread
> >> to process all the vqs. nvme-tcp in particular (like other TCP-based
> >> protocols) is hurt by the lack of thread concurrency that could
> >> distribute I/O requests across different TCP connections.
> >>
> >> We also attempted to move the iothread to a dedicated core, however
> >> that did not yield any meaningful performance improvement. The reason
> >> appears to be less about CPU utilization on the iothread core and
> >> more about serialization on a single TCP connection.
> >>
> >> Moving to io=threads does increase the throughput, however it
> >> sacrifices latency significantly.
> >>
> >> So the user finds itself with available host CPUs and TCP connections
> >> that could easily be used to reach maximum throughput, without the
> >> ability to leverage them. True, other guests will use different
> >> threads/contexts, however the goal here is to get the full
> >> performance from a single device.
> >>
> >> I've seen several discussions and attempts in the past to allow a
> >> virtio-blk device to leverage multiple iothreads, but around 2 years
> >> ago those discussions paused. So I wanted to ask, are there any plans
> >> or anything in the works to address this limitation?
> >>
> >> I've seen that the spdk folks are heading in this direction with
> >> their vhost-blk implementation:
> >> https://review.spdk.io/gerrit/c/spdk/spdk/+/16068
> >
> > Hi Sagi,
> > Yes, there is an ongoing QEMU multi-queue block layer effort to make
> > it possible for multiple IOThreads to process disk I/O for the same
> > --blockdev in parallel.
>
> Great to know.
>
> > Most of my recent QEMU patches have been part of this effort. There
> > is a work-in-progress branch that supports mapping virtio-blk
> > virtqueues to specific IOThreads:
> > https://gitlab.com/stefanha/qemu/-/commits/virtio-blk-iothread-vq-mapping
>
> Thanks for the pointer.
>
> > The syntax is:
> >
> > --device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'
> >
> > This says "assign virtqueues round-robin to iothread0 and iothread1".
> > Half the virtqueues will be processed by iothread0 and the other half
> > by iothread1. There is also syntax for assigning specific virtqueues
> > to each IOThread, but usually the automatic round-robin assignment is
> > all that's needed.
> >
> > This work is not finished yet. Basic I/O (e.g. fio) works without
> > crashes, but expect to hit issues if you use blockjobs, hotplug, etc.
> >
> > Performance optimization work has just begun, so it won't deliver all
> > the benefits yet. I ran a benchmark yesterday where going from 1 to 2
> > IOThreads increased performance by 25%. That's much less than we're
> > aiming for; attaching two independent virtio-blk devices improves
> > performance by ~100%. I know we can get there eventually. Some of the
> > bottlenecks are known (e.g. block statistics collection causes lock
> > contention) and others are yet to be investigated.
>
> Hmm, I rebased this branch on top of mainline master and ran a naive
> test, and it seems that performance regressed quite a bit :(
>
> I'm running this test on my laptop (Intel(R) Core(TM) i7-8650U CPU
> @ 1.90GHz), so this is more of a qualitative test for BW only.
> I use null_blk as the host device.
>
> With mainline master I get ~9GB/s 64k randread, and with your branch
> I get ~5GB/s, regardless of whether I assign iothreads (one or two)
> or not.
>
> my qemu command:
> taskset -c 0-3 build/qemu-system-x86_64 -cpu host -m 1G -enable-kvm -smp 4 \
>   -drive file=/var/lib/libvirt/images/ubuntu-22/root-disk-clone.qcow2,format=qcow2 \
>   -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=/dev/nullb0 \
>   -device virtio-blk-pci,drive=drive0,scsi=off -nographic
>
> my guest fio jobfile:
> --
> [global]
> group_reporting
> runtime=3000
> time_based
> loops=1
> direct=1
> invalidate=1
> randrepeat=0
> norandommap
> exitall
> cpus_allowed=0-3
> cpus_allowed_policy=split
>
> [read]
> filename=/dev/vda
> numjobs=4
> iodepth=32
> bs=64k
> rw=randread
> ioengine=io_uring
> --
>
> Maybe I'm doing something wrong? Didn't expect to find a regression
> against mainline on the default setup.
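For reference, a command line along the lines of the sketch below is one way the test above could exercise the vq mapping, combining Sagi's invocation with the round-robin syntax Stefan quoted. It is only a sketch under assumptions: the iothread object names and the explicit num-queues=4 are illustrative choices, not taken from the thread's actual runs, and the exact behavior depends on the state of the virtio-blk-iothread-vq-mapping branch.

--
# create two IOThread objects; the device spreads its virtqueues across
# them round-robin (half to iothread0, half to iothread1)
taskset -c 0-3 build/qemu-system-x86_64 -cpu host -m 1G -enable-kvm -smp 4 \
  -drive file=/var/lib/libvirt/images/ubuntu-22/root-disk-clone.qcow2,format=qcow2 \
  -object iothread,id=iothread0 \
  -object iothread,id=iothread1 \
  -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=/dev/nullb0 \
  -device '{"driver":"virtio-blk-pci","drive":"drive0","num-queues":4,"iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}]}' \
  -nographic
--

With round-robin assignment the device needs more than one virtqueue for the second IOThread to receive any work, which is why num-queues is set explicitly here to match the four vCPUs and the four fio jobs in the guest.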