On 6/1/2016 2:53 PM, Yuanhan Liu wrote:
> On Wed, Jun 01, 2016 at 06:40:41AM +0000, Xie, Huawei wrote:
>>>      /* Retrieve all of the head indexes first to avoid caching issues. */
>>>      for (i = 0; i < count; i++) {
>>> -        desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
>>> -                    (vq->size - 1)];
>>> +        used_idx = (vq->last_used_idx + i) & (vq->size - 1);
>>> +        desc_indexes[i] = vq->avail->ring[used_idx];
>>> +
>>> +        vq->used->ring[used_idx].id = desc_indexes[i];
>>> +        vq->used->ring[used_idx].len = 0;
>>> +        vhost_log_used_vring(dev, vq,
>>> +                offsetof(struct vring_used, ring[used_idx]),
>>> +                sizeof(vq->used->ring[used_idx]));
>>>      }
>>>
>>>      /* Prefetch descriptor index. */
>>>      rte_prefetch0(&vq->desc[desc_indexes[0]]);
>>> -    rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
>>> -
>>>      for (i = 0; i < count; i++) {
>>>          int err;
>>>
>>> -        if (likely(i + 1 < count)) {
>>> +        if (likely(i + 1 < count))
>>>              rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
>>> -            rte_prefetch0(&vq->used->ring[(used_idx + 1) &
>>> -                              (vq->size - 1)]);
>>> -        }
>>>
>>>          pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
>>>          if (unlikely(pkts[i] == NULL)) {
>>> @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>>>              rte_pktmbuf_free(pkts[i]);
>>>              break;
>>>          }
>>> -
>>> -        used_idx = vq->last_used_idx++ & (vq->size - 1);
>>> -        vq->used->ring[used_idx].id = desc_indexes[i];
>>> -        vq->used->ring[used_idx].len = 0;
>>> -        vhost_log_used_vring(dev, vq,
>>> -                offsetof(struct vring_used, ring[used_idx]),
>>> -                sizeof(vq->used->ring[used_idx]));
>>>      }
>> Had tried post-updating used ring in batch, but forget the perf change.
> I would assume pre-updating gives better performance gain, as we are
> fiddling with avail and used ring together, which would be more cache
> friendly.

The distance between the avail ring entry and the used ring entry is at
least 8 cache lines. The benefit comes from batch updates, if applicable.

>
>> One optimization would be on vhost_log_used_ring.
>> I have two ideas,
>> a) In QEMU side, we always assume used ring will be changed, so that we
>> don't need to log used ring in VHOST.
>>
>> Michael: feasible in QEMU? comments on this?
>>
>> b) We could always mark the total used ring modified rather than entry
>> by entry.
> I doubt it's worthwhile. One fact is that vhost_log_used_ring is
> a non operation in most time: it will take action only in the short
> gap during live migration.
>
> And FYI, I even tried with all vhost_log_xxx being removed, it showed
> no performance boost at all. Therefore, it's not a factor that will
> impact performance.

I knew this.

> --yliu
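
For reference, idea (b) above (logging the used ring as a range rather than entry by entry) could look roughly like the sketch below. This is a standalone illustration, not the vhost code: `struct used_ring`, `log_range()`, `log_per_entry()`, `log_batched()`, `log_reset()`, the ring size and the one-byte-per-page dirty map are all made up for the example, and offsets are taken relative to the start of the used ring instead of a guest physical log address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE  1024u   /* hypothetical ring size, power of two */
#define PAGE_SHIFT 12u     /* assumes 4 KiB pages */
#define LOG_PAGES  4u      /* enough pages to cover the used ring below */

struct used_elem { uint32_t id; uint32_t len; };

struct used_ring {
    uint16_t flags;
    uint16_t idx;
    struct used_elem ring[RING_SIZE];
};

static uint8_t dirty[LOG_PAGES];   /* one byte per page, for clarity */
static unsigned log_calls;         /* counts calls into the logger */

static void log_reset(void)
{
    memset(dirty, 0, sizeof(dirty));
    log_calls = 0;
}

/* Mark every page overlapped by [offset, offset + len) as dirty. */
static void log_range(size_t offset, size_t len)
{
    size_t first = offset >> PAGE_SHIFT;
    size_t last = (offset + len - 1) >> PAGE_SHIFT;

    for (size_t p = first; p <= last && p < LOG_PAGES; p++)
        dirty[p] = 1;
    log_calls++;
}

/* Byte offset of ring[slot] from the start of the used ring. */
static size_t elem_offset(unsigned slot)
{
    return offsetof(struct used_ring, ring) + slot * sizeof(struct used_elem);
}

/* Entry-by-entry logging, one call per used element. */
static void log_per_entry(unsigned start, unsigned count)
{
    for (unsigned i = 0; i < count; i++)
        log_range(elem_offset((start + i) & (RING_SIZE - 1)),
                  sizeof(struct used_elem));
}

/* Range logging: at most two calls, split where the ring wraps. */
static void log_batched(unsigned start, unsigned count)
{
    unsigned slot = start & (RING_SIZE - 1);
    unsigned first = count;

    assert(count <= RING_SIZE);
    if (slot + count > RING_SIZE)
        first = RING_SIZE - slot;

    if (first > 0)
        log_range(elem_offset(slot), first * sizeof(struct used_elem));
    if (count > first)
        log_range(elem_offset(0), (count - first) * sizeof(struct used_elem));
}
```

For a burst that wraps around the end of the ring, `log_batched()` marks the same pages as `log_per_entry()` with two calls instead of one per entry. As noted above, the logging only does real work during live migration, so this shows the shape of the idea and says nothing about a measurable gain.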