On Wed, Oct 08, 2014 at 01:37:25PM +0300, Avi Kivity wrote:
> 
> On 10/08/2014 01:14 PM, Michael S. Tsirkin wrote:
> >On Wed, Oct 08, 2014 at 12:51:21PM +0300, Avi Kivity wrote:
> >>On 10/08/2014 12:15 PM, Michael S. Tsirkin wrote:
> >>>On Wed, Oct 08, 2014 at 10:43:07AM +0300, Avi Kivity wrote:
> >>>>On 09/30/2014 12:33 PM, Michael S. Tsirkin wrote:
> >>>>>a single descriptor might use all of
> >>>>>the virtqueue. In this case we won't be able to pass the
> >>>>>descriptor directly to linux as a single iov, since
> >>>>>
> >>>>You could separate the maximum request scatter/gather list size from
> >>>>the virtqueue size. They are totally unrelated - even now you can
> >>>>have a larger request by using indirect descriptors.
> >>>We could add a feature to have a smaller or larger S/G length limit.
> >>>Is this something useful?
> >>>
> >>Having a larger ring size is useful, esp. with zero-copy transmit, and
> >>you would need the sglist length limit in order not to require
> >>linearization on linux hosts. So the limit is not useful in itself,
> >>only indirectly.
> >>
> >>Google Compute Engine exposes virtio ring sizes of 16384.
> >OK, this sounds useful; I'll queue this up for consideration.
> >Thanks!
> 
> Thanks.
> 
> >>Even more useful is getting rid of the desc array and instead passing
> >>descs inline in avail and used.
> >You expect this to improve performance?
> >Quite possibly, but this will have to be demonstrated.
> >
> 
> The top vhost function in small-packet workloads is vhost_get_vq_desc,
> and the top instruction within that (50%) is the one that reads the
> first 8 bytes of desc. It's a guaranteed cache line miss (and again on
> the guest side when it's time to reuse the descriptor).
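
[For reference: the descriptor being profiled above is the 16-byte
vring_desc from the virtio spec (include/uapi/linux/virtio_ring.h),
shown here with plain stdint types. The "first 8 bytes" are the addr
field, and four descriptors share one 64-byte cache line, which is what
the amortization argument below relies on.]

#include <stdint.h>

/* Layout of one ring descriptor: 16 bytes, four per 64-byte cache line. */
struct vring_desc {
        uint64_t addr;   /* guest-physical buffer address; reading these
                          * first 8 bytes is the cache miss cited above */
        uint32_t len;    /* buffer length */
        uint16_t flags;  /* VRING_DESC_F_NEXT / _WRITE / _INDIRECT */
        uint16_t next;   /* index of the next descriptor in a chain */
};
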
OK, so basically what you are pointing out is that we get 5 accesses:
a read of the available head, a read of the available ring, a read of
the descriptor, a write of the used ring, and a write of the used ring
head.

If processing is in order, we could build a much simpler design, with a
valid bit in the descriptor, cleared by the host as descriptors are
consumed. Basically, get rid of both the used and available rings.
Sounds good in theory.

> Inline descriptors will amortize the cache miss over 4 descriptors, and
> will allow the hardware to prefetch, since the descriptors are linear
> in memory.

If descriptors are used in order (as they are with current qemu), then
aren't they amortized already?

-- 
MST
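
[A minimal sketch of the in-order, valid-bit design described above, to
make the access pattern concrete. Nothing here is from a spec or from
qemu: struct inline_desc, INLINE_DESC_F_VALID, and both functions are
made-up illustrations, and the required memory barriers are only marked
in comments.]

#include <stdbool.h>
#include <stdint.h>

#define INLINE_DESC_F_VALID 0x8000 /* set by guest on post, cleared by host */

struct inline_desc {
        uint64_t addr;   /* guest-physical buffer address */
        uint32_t len;    /* buffer length */
        uint16_t flags;  /* valid bit plus next/write-style flags */
        uint16_t id;     /* opaque tag so the guest can match completions */
};

/* Guest side: post one buffer. The valid bit is written last so the
 * host never sees a half-initialized descriptor. */
static bool guest_post(struct inline_desc *ring, uint16_t *head,
                       uint16_t ring_size, uint64_t addr, uint32_t len)
{
        struct inline_desc *d = &ring[*head];

        if (d->flags & INLINE_DESC_F_VALID)
                return false;   /* ring full: host hasn't consumed this slot */
        d->addr = addr;
        d->len = len;
        /* smp_wmb() in real code: addr/len must be visible before the flag */
        d->flags = INLINE_DESC_F_VALID;
        *head = (*head + 1) % ring_size;
        return true;
}

/* Host side: consume descriptors strictly in order. Clearing the valid
 * bit in place is the only write back to shared memory. */
static bool host_poll(struct inline_desc *ring, uint16_t *tail,
                      uint16_t ring_size, struct inline_desc *out)
{
        struct inline_desc *d = &ring[*tail];

        if (!(d->flags & INLINE_DESC_F_VALID))
                return false;   /* nothing new posted */
        /* smp_rmb() in real code before reading addr/len */
        *out = *d;
        d->flags = 0;           /* hand the slot back to the guest */
        *tail = (*tail + 1) % ring_size;
        return true;
}

[Compared with the five accesses counted above, the host's hot path here
is one read and one write to the same descriptor line, and with in-order
consumption a single cache line yields four descriptors.]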