Kirill Tkhai <ktk...@virtuozzo.com> writes:

> On 15.11.2017 09:25, Eric W. Biederman wrote:
>> Kirill Tkhai <ktk...@virtuozzo.com> writes:
>>
>>> Currently a mutex is used to protect the pernet operations list.  It
>>> makes cleanup_net() execute ->exit methods of the same operations set
>>> which was used at the time of ->init, even after the net namespace is
>>> unlinked from net_namespace_list.
>>>
>>> But the problem is that it needs to synchronize_rcu() after net is
>>> removed from net_namespace_list():
>>>
>>> Destroy net_ns:
>>>    cleanup_net()
>>>       mutex_lock(&net_mutex)
>>>       list_del_rcu(&net->list)
>>>       synchronize_rcu()                        <--- Sleep there for ages
>>>       list_for_each_entry_reverse(ops, &pernet_list, list)
>>>          ops_exit_list(ops, &net_exit_list)
>>>       list_for_each_entry_reverse(ops, &pernet_list, list)
>>>          ops_free_list(ops, &net_exit_list)
>>>       mutex_unlock(&net_mutex)
>>>
>>> This primitive is not fast, especially on systems with many processors
>>> and/or when preemptible RCU is enabled in the config.  So, all the time
>>> cleanup_net() is waiting for the RCU grace period, creation of new net
>>> namespaces is not possible; the tasks doing it sleep on the same mutex:
>>>
>>> Create net_ns:
>>>    copy_net_ns()
>>>       mutex_lock_killable(&net_mutex)          <--- Sleep there for ages
>>>
>>> The solution is to convert net_mutex to an rw_semaphore.  Then the
>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>> will require down_read() locking only, while down_write() will be used
>>> for changing pernet_list.
>>>
>>> This gives a significant performance increase, as you may see below.
>>> Measured below is sequential net namespace creation in a loop, in a
>>> single thread, with no other tasks (single user mode):
>>>
>>> 1)int main(int argc, char *argv[])
>>> {
>>>         unsigned nr;
>>>         if (argc < 2) {
>>>                 fprintf(stderr, "Provide nr iterations arg\n");
>>>                 return 1;
>>>         }
>>>         nr = atoi(argv[1]);
>>>         while (nr-- > 0) {
>>>                 if (unshare(CLONE_NEWNET)) {
>>>                         perror("Can't unshare");
>>>                         return 1;
>>>                 }
>>>         }
>>>         return 0;
>>> }
>>>
>>> Origin, 100000 unshare():
>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>
>>> Patched, 100000 unshare():
>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>
>>> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>>>
>>> Origin:
>>> real 1m24,190s
>>> user 0m6,225s
>>> sys  0m15,132s
>>>
>>> Patched:
>>> real 0m18,235s   (4.6 times faster)
>>> user 0m4,544s
>>> sys  0m13,796s
>>>
>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add
>>> down_read_killable()" from Linus tree (not in net-next yet).
>>
>> Using a rwsem to protect the list of operations makes sense.
>>
>> That should allow removing the sing
>>
>> I am not wild about taking the rwsem down_write in
>> rtnl_link_unregister and net_ns_barrier.  I think that works but it
>> goes from being a mild hack to being a pretty bad hack and something
>> else that can kill the parallelism you are seeking to add.
>>
>> There are about 204 instances of struct pernet_operations.  That is a
>> lot of code to have carefully audited to ensure it can run in parallel
>> all at once.  The existence of the exit_batch method, net_ns_barrier,
>> for_each_net and the taking of net_mutex in rtnl_link_unregister all
>> testify to the fact that there are data structures accessed by
>> multiple network namespaces.
>>
>> My preference would be to:
>>
>> - Add the net_sem in addition to net_mutex with down_write only held in
>>   register and unregister, and maybe net_ns_barrier and
>>   rtnl_link_unregister.
>>
>> - Factor struct pernet_ops out of struct pernet_operations.  With
>>   struct pernet_ops not having the exit_batch method.  With pernet_ops
>>   being embedded as an anonymous member of the old struct
>>   pernet_operations.
>>
>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>   pernet_ops and that don't take net_mutex.  Have them order the
>>   pernet_list as:
>>
>>     pernet_sys
>>     pernet_subsys
>>     pernet_device
>>     pernet_dev
>>
>>   With the chunk in the middle taking the net_mutex.
>
> I think this approach will work.  Thanks for the suggestion.  Some more
> thoughts on the plan below.
>
> The only difficult thing there will be to choose the right order
> to move ops from pernet_subsys to pernet_sys and from pernet_device
> to pernet_dev one by one.
>
> This is rather easy in the case of tristate drivers, as modules may be
> loaded at any time, and the only important order is the dependencies
> between them.  So it's possible to start from a module that has no
> dependencies, move it to pernet_sys, and then continue with modules
> that have no more dependencies in pernet_subsys.  For pernet_device
> it's vice versa.
>
> In the case of bool drivers, the situation may be worse, because the
> order is not so clear there.  Initcalls of the same priority (for
> example, initcalls registered via core_initcall()) may require a
> certain order, driven by linking order.  I know one example from the
> device mapper code, which lives here: drivers/md/Makefile.
> This problem is also solvable, even if such places do not contain
> comments about linking order.  We just need to respect the Makefile
> order when choosing a new candidate to move.
>
>> I think I would enforce the ordering with a failure to register
>> if a subsystem or a device tried to register out of order.
>>
>> - Disable use of the single threaded workqueue if nothing needs the
>>   net_mutex.
>
> We may use per-cpu workqueues in the future.  The idea of not using a
> workqueue doesn't seem good to me, because asynchronous net destruction
> looks very useful.

per-cpu workqueues are fine, and definitely what I am expecting.  If we
are doing this I want to get us off the single threaded workqueue that
serializes all of the cleanup.  That has a huge potential for
simplifying things and reducing maintenance if running everything in
parallel is actually safe.

I forget how the modern per-cpu workqueues work with respect to sleeps
and locking (I don't remember if, when a piece of work sleeps for a
long time, we spin up another per-cpu workqueue thread and thus avoid
priority inversion problems).  If locks between workqueues are not a
problem we could start the transition off of the single-threaded
serializing workqueue sooner rather than later.
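Roughly what I am picturing for the cleanup side, once the rwsem is in,
is one work item per dying net on an ordinary (unbound or per-cpu)
workqueue instead of one single threaded queue draining a shared list.
This is only a sketch, not a patch: net->cleanup_work, netns_cleanup_wq
and unhash_net() are invented names, and error handling is elided.

/* Sketch: each dying net schedules and cleans up itself, so two
 * namespaces no longer wait behind each other's synchronize_rcu().
 */
static struct workqueue_struct *netns_cleanup_wq;
			/* e.g. alloc_workqueue("netns", WQ_UNBOUND, 0) */

static void cleanup_net_one(struct work_struct *work)
{
	struct net *net = container_of(work, struct net, cleanup_work);
	const struct pernet_operations *ops;
	LIST_HEAD(net_exit_list);

	down_read(&net_sem);	/* pernet_list is stable, other nets proceed */

	unhash_net(net);	/* list_del_rcu() from net_namespace_list under
				 * whatever writer side lock that list keeps */
	synchronize_rcu();	/* only this namespace waits here */

	list_add(&net->exit_list, &net_exit_list);
	list_for_each_entry_reverse(ops, &pernet_list, list)
		ops_exit_list(ops, &net_exit_list);
	list_for_each_entry_reverse(ops, &pernet_list, list)
		ops_free_list(ops, &net_exit_list);

	up_read(&net_sem);

	put_user_ns(net->user_ns);
	net_drop_ns(net);
}

void __put_net(struct net *net)
{
	/* called once the last reference is gone, as today */
	INIT_WORK(&net->cleanup_work, cleanup_net_one);
	queue_work(netns_cleanup_wq, &net->cleanup_work);
}

With something like that every namespace pays for its own grace period
instead of queueing behind one worker, and the exit_batch machinery has
nothing left to batch.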
>> - Add a test mode that deliberately spawns threads on multiple
>>   processors and deliberately creates multiple network namespaces
>>   at the same time.
>>
>> - Add a test mode that deliberately spawns threads on multiple
>>   processors and deliberately destroys multiple network namespaces
>>   at the same time.
>>
>> - Convert the code to unlocked operation one pernet_operations at a
>>   time.  Being careful with the loopback device, as its order in the
>>   list strongly matters.
>>
>> - Finally remove the unnecessary code.
>>
>> At the end of the day, because all of the operations for one network
>> namespace will run in parallel with all of the operations for another
>> network namespace, all of the sophistication that goes into batching
>> the cleanup of multiple network namespaces can be removed.  As
>> different tasks (not sharing a lock) can wait in synchronize_rcu at
>> the same time without slowing each other down.
>>
>> I think we can remove the batching but I am afraid that will lead us
>> into rtnl_lock contention.
>
> I've looked into this lock.  It is used in many places for many
> reasons.  It's a little strange that nobody has tried to break it up
> into several smaller locks.

It gets tricky.  At this point getting net_mutex is enough to start
with.  Mostly the rtnl_lock covers the slow path, which tends to keep
it from rising to the top of the priority list.
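To make "getting net_mutex is enough to start with" concrete: during
the transition, setup_net() and cleanup_net() could take net_sem for
read unconditionally and keep taking net_mutex only around the
operations that have not been audited yet.  A sketch only: the
PERNET_UNLOCKED flag and undo_partial_setup() are invented, in practice
which entries still need the mutex would fall out of the ordering of
the list, and the mutex would be held across the whole middle chunk
rather than around each operation.

static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
{
	const struct pernet_operations *ops;
	int error;

	net->user_ns = user_ns;
	/* refcounts, idr and the rest of the per-net setup as today */

	down_read(&net_sem);
	list_for_each_entry(ops, &pernet_list, list) {
		bool needs_mutex = !(ops->flags & PERNET_UNLOCKED);  /* invented marker */

		if (needs_mutex)
			mutex_lock(&net_mutex);
		error = ops_init(ops, net);
		if (needs_mutex)
			mutex_unlock(&net_mutex);
		if (error < 0)
			goto out_undo;
	}
	up_read(&net_sem);
	return 0;

out_undo:
	/* hypothetical stand-in for the existing ops_exit_list()/
	 * ops_free_list() rollback of the entries that did get set up */
	undo_partial_setup(net, ops);
	up_read(&net_sem);
	return error;
}

The same shape works for cleanup_net(), and once an operation is moved
into pernet_sys or pernet_dev its mutex_lock() simply disappears.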
>> I am definitely not comfortable with changing the locking on so much
>> code without an explanation at all of why it is safe in the commit
>> comments in all 204 instances.  Which combined equal most of the well
>> maintained and regularly used part of the networking stack.
>
> David,
>
> could you please check the plan above and say whether 1) it is OK for
> you, and 2) if so, whether you expect all the changes to be made in one
> big 200+ patch set, or whether we may go sequentially (50 patches at a
> time, for example)?

I think I would start with something covering the core networking
pieces so the improvements can be seen.
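For that first series the piece I would pin down is the registration
interface.  Roughly, and only roughly, with the sections shown as four
separate list heads (a single list with section markers works just as
well), ops_init_existing_nets() standing in for the per-net init we do
at registration today, and the refuse-to-register-out-of-order check
left out:

static LIST_HEAD(pernet_sys);		/* converted: no net_mutex       */
static LIST_HEAD(pernet_subsys);	/* legacy: still under net_mutex */
static LIST_HEAD(pernet_device);	/* legacy: still under net_mutex */
static LIST_HEAD(pernet_dev);		/* converted: no net_mutex       */

static int __register_pernet_ops(struct list_head *section,
				 struct pernet_ops *ops)
{
	int error;

	down_write(&net_sem);	/* register/unregister are the only writers */
	error = ops_init_existing_nets(ops);
	if (!error)
		list_add_tail(&ops->list, section);
	up_write(&net_sem);
	return error;
}

int register_pernet_sys(struct pernet_ops *ops)
{
	return __register_pernet_ops(&pernet_sys, ops);
}

int register_pernet_dev(struct pernet_ops *ops)
{
	return __register_pernet_ops(&pernet_dev, ops);
}

setup_net() then walks the sections in that order and cleanup_net() in
reverse, which is what lets the two middle sections keep the net_mutex
while the converted ends run in parallel.

Eric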