On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
Hi!
I was just wondering: packed ring batches things naturally.
E.g.

user_access_begin
check descriptor valid
smp_rmb
copy descriptor
user_access_end
But without speculating on the descriptors (which may only work for in-order,
or may even violate the spec), only one or two accesses of a single descriptor
can be batched. For split ring, we can batch more since we know how many
descriptors are pending (avail_idx - last_avail_idx).

Anything I missed?

Thanks

just check more descriptors in a loop:

   user_access_begin
   for (i = 0; i < 16; ++i) {
         if (!descriptor valid)
                break;
         smp_rmb
         copy descriptor
   }
   user_access_end

You don't really need to know how many there are
ahead of time, as you still copy them one by one.
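
For illustration, the loop above could be sketched in plain userspace C roughly as follows. This is only a sketch under assumptions: the helper names (desc_is_avail, fetch_packed_batch) are made up for this example and are not vhost functions, a C11 acquire fence stands in for smp_rmb, and user_access_begin/user_access_end appear only as comments. The AVAIL/USED bit positions follow the virtio 1.1 packed layout.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_F_AVAIL (1u << 7)   /* AVAIL flag bit in the packed layout */
#define DESC_F_USED  (1u << 15)  /* USED flag bit in the packed layout */
#define BATCH_MAX    16

struct packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Available means: AVAIL matches the wrap counter and USED does not. */
static bool desc_is_avail(uint16_t flags, bool wrap)
{
    bool avail = (flags & DESC_F_AVAIL) != 0;
    bool used = (flags & DESC_F_USED) != 0;

    return avail == wrap && used != wrap;
}

/*
 * Hypothetical helper: copy up to BATCH_MAX available descriptors into out[]
 * and return how many were copied. *next and *wrap track the ring position.
 */
static size_t fetch_packed_batch(const struct packed_desc *ring,
                                 size_t ring_size,
                                 uint16_t *next, bool *wrap,
                                 struct packed_desc *out)
{
    size_t n = 0;

    /* user_access_begin() would go here in the kernel */
    while (n < BATCH_MAX) {
        const struct packed_desc *d = &ring[*next];

        if (!desc_is_avail(d->flags, *wrap))
            break;
        /* stand-in for smp_rmb(): read flags before the rest */
        atomic_thread_fence(memory_order_acquire);
        out[n++] = *d;

        if (++*next == ring_size) {
            *next = 0;
            *wrap = !*wrap;   /* wrap counter flips at the ring end */
        }
    }
    /* user_access_end() would go here in the kernel */
    return n;
}
```

Note that each iteration still touches the ring twice (flags check, then the copy) and issues one barrier, so the count of pending descriptors is never needed up front.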

So let's look at the split ring case:


user_access_begin

n = avail_idx - last_avail_idx (1)

n = MIN(n, 16)

smp_rmb

read n entries from avail_ring (2)

for (i =0; i <n; i++)

     copy descriptor (3)

user_access_end


Consider the case of a heavy workload. For packed ring, we have 32
userspace accesses and 16 smp_rmb() calls.

For split ring we have

(1) 1 time

(2) 2 times at most

(3) 16 times

That is 19 userspace accesses and 1 smp_rmb(). In fact (2) could be
eliminated with in-order. (3) could be batched completely with in-order and
partially when out of order.
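
To make the accounting above concrete, here is a rough userspace C sketch of the split ring batch. It is an illustration under assumptions, not vhost code: struct split_ring, user_copy and fetch_split_batch are hypothetical names, user_copy simply counts each simulated userspace access, and a C11 acquire fence stands in for smp_rmb.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BATCH_MAX 16

struct split_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical view of a split ring; size is assumed to be a power of two. */
struct split_ring {
    uint16_t size;
    const struct split_desc *desc;
    const uint16_t *avail_ring;
    const uint16_t *avail_idx;
};

static unsigned user_accesses;  /* number of simulated userspace accesses */

static void user_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    user_accesses++;
}

static size_t fetch_split_batch(const struct split_ring *r,
                                uint16_t *last_avail_idx,
                                struct split_desc *out)
{
    uint16_t heads[BATCH_MAX];
    uint16_t avail_idx, n, start, first, i;

    user_copy(&avail_idx, r->avail_idx, sizeof(avail_idx));      /* (1) */
    n = (uint16_t)(avail_idx - *last_avail_idx);
    if (n > BATCH_MAX)
        n = BATCH_MAX;
    if (n == 0)
        return 0;

    atomic_thread_fence(memory_order_acquire);  /* the single smp_rmb */

    /* (2): one copy, or two when the batch wraps past the ring end */
    start = (uint16_t)(*last_avail_idx & (r->size - 1));
    first = (uint16_t)(r->size - start);
    if (first > n)
        first = n;
    user_copy(heads, &r->avail_ring[start], first * sizeof(heads[0]));
    if (first < n)
        user_copy(heads + first, r->avail_ring,
                  (size_t)(n - first) * sizeof(heads[0]));

    /* (3): one copy per descriptor; real code must validate heads[i] */
    for (i = 0; i < n; i++)
        user_copy(&out[i], &r->desc[heads[i]], sizeof(out[i]));

    *last_avail_idx = (uint16_t)(*last_avail_idx + n);
    return n;
}
```

With a full batch of 16 that wraps around the end of the avail ring, the counter ends at 1 + 2 + 16 = 19.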

I don't see how packed ring helps here, especially considering that lfence on
x86 is more than a memory fence: it in fact prevents speculation.

Thanks
So on x86 at least RMB is free, this is why I never bothered optimizing
it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
more than the extra indirection in the split ring?


I don't know, but obviously RMB can hurt performance to some degree. And even on architectures where RMB is free, packed ring still does not show an obvious advantage.



But my point was really fundamental - if ring accesses are expensive
then we should batch them.


I don't object to batching. The reasons that ring accesses are expensive could be:

1) unnecessary overhead caused by speculation barriers and checks like SMAP

2) cache contention

So it does not conflict with the effort I made to remove 1). My plan is: for metadata, try to eliminate 1) completely. For data, we can do batch copying to amortize its cost. For avail/descriptor batching, we can try to do it on top.


  Right now we have an API that gets
an iovec directly. That limits the optimizations you can do.

The translation works like this:

ring -> valid descriptors -> iovecs

We should have APIs for each step that work in batches.
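
As a rough sketch of what such staged, batched APIs might look like (all function names here are hypothetical and none of this is existing vhost code; translation is simplified to treating the descriptor address as an offset into a flat buffer):

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

struct desc {
    uint64_t addr;
    uint32_t len;
};

/* Stage 1: ring -> head indices, up to max entries. */
static size_t batch_get_heads(const uint16_t *avail_ring, uint16_t ring_size,
                              uint16_t last, uint16_t avail_idx,
                              uint16_t *heads, size_t max)
{
    size_t n = (uint16_t)(avail_idx - last);
    size_t i;

    if (n > max)
        n = max;
    for (i = 0; i < n; i++)
        heads[i] = avail_ring[(uint16_t)(last + i) % ring_size];
    return n;
}

/* Stage 2: heads -> descriptors; stops at the first out-of-range head. */
static size_t batch_get_descs(const struct desc *table, uint16_t ring_size,
                              const uint16_t *heads, size_t n,
                              struct desc *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (heads[i] >= ring_size)
            break;
        out[i] = table[heads[i]];
    }
    return i;
}

/* Stage 3: descriptors -> iovecs; stops at the first one outside mem. */
static size_t batch_translate(uint8_t *mem, size_t mem_size,
                              const struct desc *descs, size_t n,
                              struct iovec *iov)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (descs[i].addr > mem_size ||
            descs[i].len > mem_size - descs[i].addr)
            break;
        iov[i].iov_base = mem + descs[i].addr;
        iov[i].iov_len = descs[i].len;
    }
    return i;
}
```

Each stage takes an array and returns how many entries it handled, so a caller can chain them and amortize per-call overhead over the whole batch.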


Yes.

Thanks




So packed layout should show the gain with this approach.
That could be motivation enough to finally enable vhost packed ring
support.

Thoughts?
