On 10/23/19 5:17 PM, Stefan Hajnoczi wrote:
> On Tue, Oct 22, 2019 at 04:01:57AM +0000, Denis Lunev wrote:
>> On 10/21/19 4:24 PM, Stefan Hajnoczi wrote:
>>> On Fri, Oct 18, 2019 at 02:55:47PM +0300, Denis Plotnikov wrote:
>>>> From: "Denis V. Lunev" <d...@openvz.org>
>>>>
>>>> Linux guests submit IO requests no longer than PAGE_SIZE * max_seg,
>>>> the field reported by the SCSI controller. Thus a typical sequential
>>>> read with 1 MB size results in the following IO pattern from the
>>>> guest:
>>>> 8,16 1 15754 2.766095122 2071 D R 2095104 + 1008 [dd]
>>>> 8,16 1 15755 2.766108785 2071 D R 2096112 + 1008 [dd]
>>>> 8,16 1 15756 2.766113486 2071 D R 2097120 + 32 [dd]
>>>> 8,16 1 15757 2.767668961 0 C R 2095104 + 1008 [0]
>>>> 8,16 1 15758 2.768534315 0 C R 2096112 + 1008 [0]
>>>> 8,16 1 15759 2.768539782 0 C R 2097120 + 32 [0]
>>>> The IO was generated by
>>>> dd if=/dev/sda of=/dev/null bs=1024 iflag=direct
>>>>
>>>> This effectively means that on rotational disks we will observe
>>>> 3 IOPS for each 2 MB processed. This definitely affects both guest
>>>> and host IO performance negatively.
>>>>
>>>> The cure is relatively simple - we should report the longer
>>>> scatter-gather capability of the SCSI controller. Fortunately the
>>>> situation here is very good: the VirtIO transport layer can
>>>> accommodate 1024 items in one request while we are using only 128.
>>>> This has been the case almost from the very beginning. 2 items are
>>>> dedicated to request metadata, thus we should publish
>>>> VIRTQUEUE_MAX_SIZE - 2 as max_seg.
>>>>
>>>> The following pattern is observed after the patch:
>>>> 8,16 1 9921 2.662721340 2063 D R 2095104 + 1024 [dd]
>>>> 8,16 1 9922 2.662737585 2063 D R 2096128 + 1024 [dd]
>>>> 8,16 1 9923 2.665188167 0 C R 2095104 + 1024 [0]
>>>> 8,16 1 9924 2.665198777 0 C R 2096128 + 1024 [0]
>>>> which is much better.
>>>>
>>>> The dark side of this patch is that we are tweaking a guest-visible
>>>> parameter, though this should be relatively safe as the above
>>>> transport layer support has been present in QEMU/host Linux for a
>>>> very long time. The patch adds a configurable property for VirtIO
>>>> SCSI with a new default, and a hardcoded option for VirtBlock, which
>>>> does not provide a good configuration framework.
>>>>
>>>> Unfortunately the commit can not be applied as is. For the real cure
>>>> we need the guest to be fixed to accommodate that queue length,
>>>> which is done only in the latest 4.14 kernel. Thus we are going to
>>>> expose the property and tweak it at the machine type level.
>>>>
>>>> The problem with the old kernels is that they have a
>>>> max_segments <= virtqueue_size restriction, which causes the guest
>>>> to crash if it is violated.
>>>> To fix the case described above in the old kernels we can increase
>>>> virtqueue_size to 256 and max_segments to 254. The pitfall here is
>>>> that SeaBIOS only allows virtqueue_size-s < 128; however, the
>>>> SeaBIOS patch extending that value to 256 is pending.
>>> If I understand correctly you are relying on Indirect Descriptor
>>> support in the guest driver in order to exceed the Virtqueue
>>> Descriptor Table size.
>>>
>>> Unfortunately the "max_segments <= virtqueue_size restriction" is
>>> required by the VIRTIO 1.1 specification:
>>>
>>>   2.6.5.3.1 Driver Requirements: Indirect Descriptors
>>>
>>>   A driver MUST NOT create a descriptor chain longer than the Queue
>>>   Size of the device.
>>>
>>> So this idea seems to be in violation of the specification?
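A minimal standalone sketch (not part of the patch) that ties the
advertised seg_max to the request sizes in the traces above; the 4 KiB
page and 512-byte sector constants are assumptions, and the 128 / 1024
inputs are the virtqueue sizes discussed in the commit message:

/* build with: cc -o segmax segmax.c */
#include <stdio.h>

#define PAGE_SIZE    4096
#define SECTOR_SIZE  512

static void show(unsigned int virtqueue_size)
{
    /* 2 descriptors are reserved for the request header/footer */
    unsigned int seg_max = virtqueue_size - 2;
    unsigned int max_bytes = seg_max * PAGE_SIZE;

    printf("virtqueue size %4u -> seg_max %4u -> max request %u bytes (%u sectors)\n",
           virtqueue_size, seg_max, max_bytes, max_bytes / SECTOR_SIZE);
}

int main(void)
{
    show(128);   /* current default: 126 segs = 516096 bytes = 1008 sectors */
    show(1024);  /* VIRTQUEUE_MAX_SIZE: 1022 segs, enough for a 1 MB read */
    return 0;
}

With seg_max = 128 - 2 = 126, a request tops out at 126 * 4096 bytes =
1008 sectors, which is exactly the "+ 1008" split visible in the first
trace.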
>>>
>>> There is a bug in hw/block/virtio-blk.c:virtio_blk_update_config()
>>> and hw/scsi/virtio-scsi.c:virtio_scsi_get_config():
>>>
>>>   virtio_stl_p(vdev, &blkcfg.seg_max, 128 - 2);
>>>
>>> This number should be the minimum of blk_get_max_iov() and
>>> virtio_queue_get_num(), minus 2 for the header and footer.
>>>
>>> I looked at the Linux SCSI driver code and it seems each HBA has a
>>> single max_segments number - it does not vary on a per-device basis.
>>> This could be a problem if two host block devices with different
>>> max_segments are exposed to the guest through the same virtio-scsi
>>> controller. Another bug? :(
>>>
>>> Anyway, if you want ~1024 descriptors you should set Queue Size to
>>> 1024. I don't see a spec-compliant way of doing it otherwise.
>>> Hopefully I have overlooked something and there is a nice way to
>>> solve this.
>>>
>>> Stefan
>> You are perfectly correct. We actually need 3 changes to improve
>> guest behavior:
>> 1) this patch, which adds the property but does not change anything
>> useful
> This patch is problematic because it causes existing guest drivers to
> violate the VIRTIO specification (or fail) if the value is set too
> high. In addition, it does not take into account the virtqueue size,
> so the default value is too low when the user sets
> -device ...,queue-size=1024.
>
> Let's calculate blkcfg.seg_max based on the virtqueue size as
> mentioned in my previous email instead.
As far as I understand, the maximum number of segments could be larger
than the virtqueue size for indirect requests (allowed in VirtIO 1.0).
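A minimal standalone sketch of the MIN(...) - 2 calculation suggested
above; the numeric inputs are illustrative stand-ins for what
blk_get_max_iov() and virtio_queue_get_num() would return inside QEMU:

/* build with: cc -o segcalc segcalc.c */
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static unsigned int compute_seg_max(unsigned int max_iov,
                                    unsigned int virtqueue_size)
{
    /* minus 2 for the request header and footer descriptors */
    return MIN(max_iov, virtqueue_size) - 2;
}

int main(void)
{
    /* e.g. host backend allows 1024 iovecs, user set queue-size=1024 */
    printf("seg_max = %u\n", compute_seg_max(1024, 1024)); /* 1022 */
    /* default queue-size=128 keeps the old 126 value */
    printf("seg_max = %u\n", compute_seg_max(1024, 128));  /* 126 */
    return 0;
}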
> There is one caveat with my suggestion: drivers are allowed to access
> VIRTIO Configuration Space before virtqueue setup has determined the
> final size. Therefore the value of this field can change after
> virtqueue setup. Drivers that set a custom virtqueue size would need
> to read the value after virtqueue setup. (Linux drivers do not modify
> the virtqueue size so it won't affect them.)
>
> Stefan
I think that we should do this a little bit differently :)

We can not change max_segs just because the queue size is changed; this
should somehow be bound to the machine type. Thus I propose to add an
"automatic" value, i.e. if max_segs is set to 0 the code should set it
to queue size - 2. This should be the default. Otherwise the configured
max_segs value should be taken.

Will this work for you?

Please note that the specification could currently also be violated if
we reduce the queue size to 64 :)

Den
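For illustration, a minimal standalone sketch of the "automatic" value
proposed above, assuming 0 means "derive seg_max from the virtqueue
size"; the function and parameter names are hypothetical, not QEMU
code, and the real wiring would live in the device realize path with
the default chosen per machine type:

/* build with: cc -o autoseg autoseg.c */
#include <stdio.h>

static unsigned int resolve_seg_max(unsigned int seg_max_prop,
                                    unsigned int virtqueue_size)
{
    if (seg_max_prop == 0) {
        /* automatic: follow the virtqueue size, minus header/footer */
        return virtqueue_size - 2;
    }
    /* explicit value configured by the user / machine type */
    return seg_max_prop;
}

int main(void)
{
    printf("%u\n", resolve_seg_max(0, 1024));  /* automatic -> 1022 */
    printf("%u\n", resolve_seg_max(0, 128));   /* automatic -> 126  */
    printf("%u\n", resolve_seg_max(126, 256)); /* explicit override -> 126 */
    return 0;
}

The point of binding the default to the machine type, as argued above,
is that the guest-visible value stays stable for existing machine types
instead of silently changing whenever queue-size is changed.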