Thanks Stefan and Dongli for your feedback and advice! I will investigate further per your suggestions and get back to you.
Thanks,
-Wei

On 4/16/19, 2:20 AM, "Stefan Hajnoczi" <stefa...@gmail.com> wrote:

On Tue, Apr 16, 2019 at 07:23:38AM +0800, Dongli Zhang wrote:
> On 4/16/19 1:34 AM, Wei Li wrote:
> > Hi @Paolo Bonzini & @Stefan Hajnoczi,
> >
> > Would you please help confirm whether @Paolo Bonzini's multiqueue
> > feature change will benefit virtio-scsi or not? Thanks!
> >
> > @Stefan Hajnoczi,
> > I also spent some time exploring the virtio-scsi multi-queue feature
> > via the num_queues parameter. Here is what we found:
> >
> > 1. Increasing the number of queues from one to the number of vCPUs
> >    yields a good IOPS increase.
> > 2. Increasing the number of queues to a number (e.g. 8) larger than
> >    the number of vCPUs (e.g. 2) yields an even better IOPS increase.
>
> As mentioned in the link below, when the number of hw queues is larger
> than nr_cpu_ids, the blk-mq layer limits it and uses at most nr_cpu_ids
> queues (e.g., /sys/block/sda/mq/).
>
> That is, with num_queues=4 and 2 vcpus, there should be only 2 queues
> available in /sys/block/sda/mq/.
>
> https://lore.kernel.org/lkml/1553682995-5682-1-git-send-email-dongli.zh...@oracle.com/
>
> I am just curious how increasing num_queues from 2 to 4 could double
> the iops while there are only 2 vcpus available...

I don't know the answer.  It's especially hard to guess without seeing
the benchmark (fio?) configuration and the QEMU command-line.

Common things to look at are:

1. Compare "iostat -dx 1" inside the guest and on the host.  Are the
   I/O patterns comparable?  blktrace(8) can give you even more detail
   on the exact I/O patterns.  If the guest and host have different I/O
   patterns (blocksize, IOPS, queue depth), then request merging or I/O
   scheduler effects could be responsible for the difference.

2. kvm_stat or "perf record -a -e kvm:\*" counters for vmexits and
   interrupt injections.  If these counters vary greatly between queue
   sizes, then that is usually a clue.
It's possible to get higher performance by spending more CPU cycles,
although your system doesn't have many CPUs available, so I'm not sure
if this is the case.

3. Power management and polling (kvm.ko halt_poll_ns, tuned profiles,
   and QEMU iothread poll-max-ns).  It's expensive to wake a CPU when
   it goes into a low power mode due to idle.  There are several
   features that can keep the CPU awake or even poll so that request
   latency is reduced.  The reason the number of queues may matter is
   that kicking multiple queues may keep the CPU awake more than
   batching multiple requests onto a small number of queues.

Stefan
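Dongli's point about the blk-mq cap can be sanity-checked from inside the guest. A minimal sketch; the device name sda and the queue counts are examples taken from the thread, not measurements:

```shell
#!/bin/sh
# Sketch of Dongli's point: blk-mq uses at most nr_cpu_ids hardware
# queues, so the effective count is min(num_queues, nr_cpu_ids).
# On a real guest you would compare these two numbers directly, e.g.:
#   ls /sys/block/sda/mq/   # one directory per active hw queue
#   nproc                   # number of online vCPUs
# (sda is an example device name.)

# The capping rule, demonstrated with the values from the thread:
num_queues=4
nr_cpu_ids=2
if [ "$num_queues" -lt "$nr_cpu_ids" ]; then
    effective=$num_queues
else
    effective=$nr_cpu_ids
fi
echo "num_queues=$num_queues vcpus=$nr_cpu_ids -> effective queues: $effective"
```

If ls /sys/block/sda/mq/ shows more directories than nproc reports vCPUs, something other than the plain blk-mq cap is going on and the IOPS difference deserves a closer look.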
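The polling knobs in item 3 can be made concrete. A hedged sketch using QEMU's iothread poll-max-ns property and the kvm.ko halt_poll_ns module parameter; the numeric values and the names io1, d0, and disk.img are illustrative, not recommendations:

```shell
#!/bin/sh
# Sketch of the tunables from item 3.  All values here are examples.

# Host side: the kvm.ko halt-polling window in nanoseconds (writing
# requires root; 0 disables halt polling):
#   cat /sys/module/kvm/parameters/halt_poll_ns

# QEMU side: give the virtio-scsi device a dedicated iothread and let
# that iothread busy-poll for up to poll-max-ns before sleeping:
qemu_cmd="qemu-system-x86_64 \
  -object iothread,id=io1,poll-max-ns=200000 \
  -device virtio-scsi-pci,iothread=io1,num_queues=4 \
  -drive if=none,id=d0,file=disk.img,format=raw \
  -device scsi-hd,drive=d0"

# Only printed here so the sketch runs without KVM or a disk image:
echo "$qemu_cmd"
```

Comparing runs with poll-max-ns=0 versus a nonzero value, at each queue count, would help separate the polling effect from the queue-count effect.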