On Thu 20 Dec 2018 at 13:55, Vlad Buslov <vla...@mellanox.com> wrote:
> On Wed 19 Dec 2018 at 04:27, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>> On Mon, Dec 17, 2018 at 2:30 AM Jiri Pirko <j...@resnulli.us> wrote:
>>>
>>> Sun, Dec 16, 2018 at 07:52:18PM CET, xiyou.wangc...@gmail.com wrote:
>>> >On Sun, Dec 16, 2018 at 8:32 AM Vlad Buslov <vla...@mellanox.com> wrote:
>>> >>
>>> >> On Thu 13 Dec 2018 at 23:32, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>>> >> > On Tue, Dec 11, 2018 at 2:19 AM Vlad Buslov <vla...@mellanox.com> wrote:
>>> >> >>
>>> >> >> As part of the effort to remove the dependency on the rtnl lock, the
>>> >> >> cls API is being converted to use fine-grained locking mechanisms
>>> >> >> instead of the global rtnl lock. However, the chain_head_change
>>> >> >> callback for the ingress Qdisc is a sleeping function and cannot be
>>> >> >> executed while holding a spinlock.
>>> >> >
>>> >> > Why does it have to be a spinlock and not a mutex?
>>> >> >
>>> >> > I've read your cover letter and this changelog, and I don't find any
>>> >> > answer.
>>> >>
>>> >> My initial implementation used a mutex. However, it was changed to a
>>> >> spinlock at Jiri's request during internal review.
>>> >>
>>> >
>>> >So what's the answer to my question? :)
>>>
>>> Yeah, my concern against mutexes was that one would be needed per every
>>> block and per every chain. I find that quite heavy and I believe it is
>>> better to use spinlocks in those cases. This patch is a side effect of
>>> that. Do you think it would be better to have mutexes instead of
>>> spinlocks?
>>
>> My only concern with a spinlock is that we have to go async as we can't
>> block. This is almost always error-prone, especially when dealing with
>> tcf block. I had to give up on a spinlock for idrinfo->lock, please take
>> a look at:
>>
>> commit 95278ddaa15cfa23e4a06ee9ed7b6ee0197c500b
>> Author: Cong Wang <xiyou.wangc...@gmail.com>
>> Date:   Tue Oct 2 12:50:19 2018 -0700
>>
>>     net_sched: convert idrinfo->lock from spinlock to a mutex
>>
>> There are indeed some cases in the kernel where we take multiple
>> mutexes, for example:
>>
>> /*
>>  * bd_mutex locking:
>>  *
>>  *  mutex_lock(part->bd_mutex)
>>  *    mutex_lock_nested(whole->bd_mutex, 1)
>>  */
>>
>> So, how heavy are they compared with spinlocks?
>>
>> Thanks.
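For reference, the nested-mutex pattern quoted from the block layer above can be sketched roughly as below. This is only a minimal illustration with made-up names (demo_bdev, demo_lock_part_then_whole), not the actual fs/block_dev.c code; the point is that both locks belong to the same lock class, so the inner acquisition is annotated with mutex_lock_nested() and a lockdep subclass to mark the nesting as intentional rather than a self-deadlock.

    #include <linux/mutex.h>

    /* Illustrative stand-in for a partition / whole-disk object, each
     * protected by a bd_mutex-like mutex of the same lock class.
     */
    struct demo_bdev {
            struct mutex bd_mutex;
    };

    static void demo_lock_part_then_whole(struct demo_bdev *part,
                                          struct demo_bdev *whole)
    {
            /* Outer lock first, then the inner lock with subclass 1 so
             * lockdep knows this ordered nesting of two locks from the
             * same class is deliberate.
             */
            mutex_lock(&part->bd_mutex);
            mutex_lock_nested(&whole->bd_mutex, 1);
    }

    static void demo_unlock_whole_then_part(struct demo_bdev *part,
                                            struct demo_bdev *whole)
    {
            /* Release in reverse order of acquisition. */
            mutex_unlock(&whole->bd_mutex);
            mutex_unlock(&part->bd_mutex);
    }

Without the subclass annotation, lockdep would treat the second mutex_lock() on the same lock class as a potential recursive deadlock and warn.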
>
> Hi Cong,
>
> I did a quick-and-dirty mutex-based implementation of my cls API patchset
> (removed the async miniqp refactoring, changed the block and chain lock
> types to mutex) and ran some benchmarks to determine the exact impact of
> the mutex on cls API performance (both the rule and the chain API).
>
> 1. Parallel flower insertion rate (create 5 million flower filters on the
> same chain/prio with 10 tc instances) - same performance:
>
> SPINLOCK
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real    0m12.183s
> user    0m35.013s
> sys     1m24.947s
>
> MUTEX
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real    0m12.246s
> user    0m35.417s
> sys     1m25.330s
>
> 2. Parallel flower insertion rate (create 100k flower filters, each on a
> new instance of chain/tp, 10 tc instances) - the mutex implementation is
> ~17% slower:
>
> SPINLOCK:
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real    2m19.285s
> user    0m1.863s
> sys     5m41.157s
>
> MUTEX:
> multichain$ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real    2m43.495s
> user    0m1.664s
> sys     8m32.483s
>
> 3. Chain insertion rate with the cls chain API (create 100k empty chains,
> 1 tc instance) - ~3% difference:
>
> SPINLOCK:
> $ time sudo tc -b add_chains
>
> real    0m46.822s
> user    0m0.239s
> sys     0m46.437s
>
> MUTEX:
> $ time sudo tc -b add_chains
>
> real    0m48.600s
> user    0m0.230s
> sys     0m48.227s
>
> The only case where performance is significantly impacted by the mutex is
> the second test. This happens because chain lookup is a linear search that
> is performed while holding the highly contested block lock. The perf
> profile for the mutex version confirms this:
>
> +   91.83%  0.00%  tc  [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
> +   91.83%  0.00%  tc  [kernel.vmlinux]  [k] do_syscall_64
> +   91.80%  0.00%  tc  libc-2.25.so      [.] __libc_sendmsg
> +   91.78%  0.00%  tc  [kernel.vmlinux]  [k] __sys_sendmsg
> +   91.77%  0.00%  tc  [kernel.vmlinux]  [k] ___sys_sendmsg
> +   91.77%  0.00%  tc  [kernel.vmlinux]  [k] sock_sendmsg
> +   91.77%  0.00%  tc  [kernel.vmlinux]  [k] netlink_sendmsg
> +   91.77%  0.01%  tc  [kernel.vmlinux]  [k] netlink_unicast
> +   91.77%  0.00%  tc  [kernel.vmlinux]  [k] netlink_rcv_skb
> +   91.71%  0.01%  tc  [kernel.vmlinux]  [k] rtnetlink_rcv_msg
> +   91.69%  0.03%  tc  [kernel.vmlinux]  [k] tc_new_tfilter
> +   90.96% 26.45%  tc  [kernel.vmlinux]  [k] __tcf_chain_get
> +   64.30%  0.01%  tc  [kernel.vmlinux]  [k] __mutex_lock.isra.7
> +   39.36% 39.18%  tc  [kernel.vmlinux]  [k] osq_lock
> +   24.92% 24.81%  tc  [kernel.vmlinux]  [k] mutex_spin_on_owner
>
> Based on these results we can conclude that use cases with a significant
> number of chains on a single block will be affected by using a mutex in
> the cls API.
>
> Regards,
> Vlad
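To put the profile above in context, here is a simplified sketch of the pattern it points at. The names (demo_block, demo_chain, demo_chain_get) are invented and the real __tcf_chain_get() does considerably more, but it illustrates why a linear chain lookup performed under a single per-block lock serializes parallel inserts even when each insert targets a different chain:

    #include <linux/list.h>
    #include <linux/mutex.h>
    #include <linux/types.h>

    /* Illustrative stand-ins for the block and chain objects. */
    struct demo_block {
            struct mutex lock;              /* spinlock_t in the spinlock variant */
            struct list_head chain_list;
    };

    struct demo_chain {
            struct list_head list;
            u32 index;
            unsigned int refcnt;
    };

    /* Every filter insert that references a chain by index does a walk
     * like this with the block lock held.  With 100k chains the walk is
     * long, and because one lock covers all chains on the block, ten
     * parallel tc instances all contend on it.
     */
    static struct demo_chain *demo_chain_get(struct demo_block *block, u32 index)
    {
            struct demo_chain *chain;

            mutex_lock(&block->lock);
            list_for_each_entry(chain, &block->chain_list, list) {
                    if (chain->index == index) {
                            chain->refcnt++;
                            mutex_unlock(&block->lock);
                            return chain;
                    }
            }
            mutex_unlock(&block->lock);
            return NULL;    /* the real code would allocate a new chain here */
    }

As the chain list grows, the critical section grows with it, and in the mutex build that contention shows up as osq_lock/mutex_spin_on_owner time in the profile, which is consistent with the ~17% regression seen in test 2.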
Hi Cong,

Any comments regarding the benchmark results? Are you okay with the
spinlock-based implementation?

Thanks,
Vlad