On Thu 20 Dec 2018 at 13:55, Vlad Buslov <vla...@mellanox.com> wrote:
> On Wed 19 Dec 2018 at 04:27, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>> On Mon, Dec 17, 2018 at 2:30 AM Jiri Pirko <j...@resnulli.us> wrote:
>>>
>>> Sun, Dec 16, 2018 at 07:52:18PM CET, xiyou.wangc...@gmail.com wrote:
>>> >On Sun, Dec 16, 2018 at 8:32 AM Vlad Buslov <vla...@mellanox.com> wrote:
>>> >>
>>> >> On Thu 13 Dec 2018 at 23:32, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>>> >> > On Tue, Dec 11, 2018 at 2:19 AM Vlad Buslov <vla...@mellanox.com> 
>>> >> > wrote:
>>> >> >>
>>> >> >> As a part of the effort to remove dependency on rtnl lock, cls API is 
>>> >> >> being
>>> >> >> converted to use fine-grained locking mechanisms instead of global 
>>> >> >> rtnl
>>> >> >> lock. However, chain_head_change callback for ingress Qdisc is a 
>>> >> >> sleeping
>>> >> >> function and cannot be executed while holding a spinlock.
>>> >> >
>>> >> >
>>> >> > Why does it have to be a spinlock not a mutex?
>>> >> >
>>> >> > I've read your cover letter and this changelog, I don't find any
>>> >> > answer.
>>> >>
>>> >> My initial implementation used mutex. However, it was changed to
>>> >> spinlock by Jiri's request during internal review.
>>> >>
>>> >
>>> >So what's the answer to my question? :)
>>>
>>> Yeah, my concern agains mutexes was that it would be needed to have one
>>> per every block and per every chain. I find it quite heavy and I believe
>>> it is better to use spinlock in those cases. This patch is a side effect
>>> of that. Do you think it would be better to have mutexes instead of
>>> spinlocks?
>>
>> My only concern with spinlock is we have to go async as we
>> can't block. This is almost always error-prone especially
>> when dealing with tcf block. I had to give up with spinlock
>> for idrinfo->lock, please take a look at:
>>
>> commit 95278ddaa15cfa23e4a06ee9ed7b6ee0197c500b
>> Author: Cong Wang <xiyou.wangc...@gmail.com>
>> Date:   Tue Oct 2 12:50:19 2018 -0700
>>
>>     net_sched: convert idrinfo->lock from spinlock to a mutex
>>
>>
>> There are indeed some cases in kernel we do take multiple
>> mutex'es, for example,
>>
>> /*
>>  * bd_mutex locking:
>>  *
>>  *  mutex_lock(part->bd_mutex)
>>  *    mutex_lock_nested(whole->bd_mutex, 1)
>>  */
>>
>> So, how heavy are they comparing with spinlocks?
>>
>> Thanks.
>
> Hi Cong,
>
> I did quick-and-dirty mutex-based implementation of my cls API patchset
> (removed async miniqp refactoring, changed block and chain lock types to
> mutex lock) and performed some benchmarks to determine exact mutex
> impact on cls API performance (both rule and chains API).
>
> 1. Parallel flower insertion rate (create 5 million flower filters on
> same chain/prio with 10 tc instances) - same performance:
>
> SPINLOCK
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real    0m12.183s
> user    0m35.013s
> sys     1m24.947s
>
> MUTEX
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real    0m12.246s
> user    0m35.417s
> sys     1m25.330s
>
> 2. Parallel flower insertion rate (create 100k flower filters, each on
> new instance of chain/tp, 10 tc instances) - mutex implementation is
> ~17% slower:
>
> SPINLOCK:
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real    2m19.285s
> user    0m1.863s
> sys     5m41.157s
>
> MUTEX:
> multichain$ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real    2m43.495s
> user    0m1.664s
> sys     8m32.483s
>
> 3. Chains insertion rate with cls chain API (create 100k empty chains, 1
> tc instance) - ~3% difference:
>
> SPINLOCK:
> $ time sudo tc -b add_chains
>
> real    0m46.822s
> user    0m0.239s
> sys     0m46.437s
>
> MUTEX:
> $ time sudo tc -b add_chains
>
> real    0m48.600s
> user    0m0.230s
> sys     0m48.227s
>
>
> Only case where performance is significantly impacted by mutex is second
> test. This happens because chain lookup is a linear search that is
> performed while holding highly contested block lock. Perf profile for
> mutex version confirms this:
>
> +   91.83%     0.00%  tc               [kernel.vmlinux]            [k] 
> entry_SYSCALL_64_after_hwframe
> +   91.83%     0.00%  tc               [kernel.vmlinux]            [k] 
> do_syscall_64
> +   91.80%     0.00%  tc               libc-2.25.so                [.] 
> __libc_sendmsg
> +   91.78%     0.00%  tc               [kernel.vmlinux]            [k] 
> __sys_sendmsg
> +   91.77%     0.00%  tc               [kernel.vmlinux]            [k] 
> ___sys_sendmsg
> +   91.77%     0.00%  tc               [kernel.vmlinux]            [k] 
> sock_sendmsg
> +   91.77%     0.00%  tc               [kernel.vmlinux]            [k] 
> netlink_sendmsg
> +   91.77%     0.01%  tc               [kernel.vmlinux]            [k] 
> netlink_unicast
> +   91.77%     0.00%  tc               [kernel.vmlinux]            [k] 
> netlink_rcv_skb
> +   91.71%     0.01%  tc               [kernel.vmlinux]            [k] 
> rtnetlink_rcv_msg
> +   91.69%     0.03%  tc               [kernel.vmlinux]            [k] 
> tc_new_tfilter
> +   90.96%    26.45%  tc               [kernel.vmlinux]            [k] 
> __tcf_chain_get
> +   64.30%     0.01%  tc               [kernel.vmlinux]            [k] 
> __mutex_lock.isra.7
> +   39.36%    39.18%  tc               [kernel.vmlinux]            [k] 
> osq_lock
> +   24.92%    24.81%  tc               [kernel.vmlinux]            [k] 
> mutex_spin_on_owner
>
> Based on these results we can conclude that use-cases with significant
> amount of chains on single block will be affected by using mutex in cls
> API.
>
> Regards,
> Vlad

Hi Cong,

Any comments regarding benchmark results? Are you okay with
spinlock-based implementation?

Thanks,
Vlad

Reply via email to