On Tue, Nov 14, 2017 at 09:04:06PM +0300, Kirill Tkhai wrote:
> On 14.11.2017 20:44, Andrei Vagin wrote:
> > On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> >> Currently a mutex is used to protect the pernet operations list. It makes
> >> cleanup_net() execute the ->exit methods of the same set of operations
> >> that was used at the time of ->init, even after the net namespace is
> >> unlinked from net_namespace_list.
> >>
> >> But the problem is that we need to synchronize_rcu() after the net is
> >> removed from net_namespace_list():
> >>
> >> Destroy net_ns:
> >> cleanup_net()
> >>   mutex_lock(&net_mutex)
> >>   list_del_rcu(&net->list)
> >>   synchronize_rcu()                          <--- Sleep there for ages
> >>   list_for_each_entry_reverse(ops, &pernet_list, list)
> >>     ops_exit_list(ops, &net_exit_list)
> >>   list_for_each_entry_reverse(ops, &pernet_list, list)
> >>     ops_free_list(ops, &net_exit_list)
> >>   mutex_unlock(&net_mutex)
> >>
> >> This primitive is not fast, especially on systems with many processors
> >> and/or when preemptible RCU is enabled in the config. So, for the whole
> >> time cleanup_net() is waiting for the RCU grace period, creation of new
> >> net namespaces is not possible; the tasks doing it are sleeping on the
> >> same mutex:
> >>
> >> Create net_ns:
> >> copy_net_ns()
> >>   mutex_lock_killable(&net_mutex)            <--- Sleep there for ages
> >>
> >> The solution is to convert net_mutex to an rw_semaphore. Then the
> >> pernet_operations::init/::exit methods, which modify the net-related data,
> >> will require down_read() locking only, while down_write() will be used
> >> for changing pernet_list.
> >>
> >> This gives a significant performance increase, as you may see below.
> >> Measured is sequential net namespace creation in a loop, in a single
> >> thread, without other tasks (single user mode):
> >>
> >> 1)
> >> #define _GNU_SOURCE
> >> #include <sched.h>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >>
> >> int main(int argc, char *argv[])
> >> {
> >>         unsigned nr;
> >>
> >>         if (argc < 2) {
> >>                 fprintf(stderr, "Provide nr iterations arg\n");
> >>                 return 1;
> >>         }
> >>         nr = atoi(argv[1]);
> >>         while (nr-- > 0) {
> >>                 if (unshare(CLONE_NEWNET)) {
> >>                         perror("Can't unshare");
> >>                         return 1;
> >>                 }
> >>         }
> >>         return 0;
> >> }
> >>
> >> Origin, 100000 unshare():
> >> 0.03user 23.14system 1:39.85elapsed 23%CPU
> >>
> >> Patched, 100000 unshare():
> >> 0.03user 67.49system 1:08.34elapsed 98%CPU
> >>
> >> 2) for i in {1..10000}; do unshare -n bash -c exit; done
> >
> > Hi Kirill,
> >
> > This mutex has another role. You know that net namespaces are destroyed
> > asynchronously, and the net mutex guarantees that the backlog will not
> > be big. If we have something in the backlog, we know that it will be
> > handled before creating a new net ns.
> >
> > As far as I remember, net namespaces are created much faster than they
> > are destroyed, so with these changes we can create a really big backlog,
> > can't we?
>
> I don't think limitation is a goal, or a good goal, for the mutex,
> because it's very easy to create many net namespaces even while the
> mutex exists. You may open /proc/[pid]/ns/net like a file, and the
> net_ns counter will increment. Then do unshare(), and the mutex has
> no way to protect against that.
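
To make the /proc/[pid]/ns/net point concrete, here is a small sketch of my
own (not part of the patch set or this thread; the default iteration count,
error handling and the pause() at the end are arbitrary). Each
unshare(CLONE_NEWNET) creates a namespace, and keeping an fd open on
/proc/self/ns/net pins it, so the number of live namespaces is bounded only
by RLIMIT_NOFILE and the netns ucount, never by net_mutex:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        unsigned nr = (argc > 1) ? atoi(argv[1]) : 100; /* arbitrary default */

        while (nr-- > 0) {
                /* Create a new net namespace for this task. */
                if (unshare(CLONE_NEWNET)) {
                        perror("unshare");
                        return 1;
                }
                /* /proc/self/ns/net now refers to the namespace we just
                 * created; keeping the fd open holds a reference, so the
                 * namespace stays alive across the next unshare(). */
                if (open("/proc/self/ns/net", O_RDONLY) < 0) {
                        perror("open");
                        return 1;
                }
        }
        pause();        /* keep all the fds (and namespaces) pinned */
        return 0;
}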
You are right, but with the mutex a user cannot sustain a big backlog for a
long time; it is shrunk to zero periodically. With these changes he can keep
a big backlog alive for a long time.

A big backlog affects other users. If someone creates namespaces, he probably
expects that they will be destroyed within a reasonable time. But currently
someone else can push the destroy time up to really big values. This problem
existed before your patches, but they may make it worse.

The question here is: should we think about this problem in the context of
these patches?

> Anyway, a mutex
> can't limit the number of something in general; I've never seen
> a (good) example in the kernel.

I agree with you here.

> As I see, the real limitation happens in inc_net_namespaces(),
> which is decremented after the RCU grace period in cleanup_net(),
> and it has not changed.

The ucount limits are too big to handle this problem.

> > There was a discussion a few months ago:
> > https://lists.onap.org/pipermail/containers/2016-October/037509.html
> >
> >>
> >> Origin:
> >> real	1m24,190s
> >> user	0m6,225s
> >> sys	0m15,132s
> >
> > Here you measure the time of creating and destroying net namespaces.
> >
> >>
> >> Patched:
> >> real	0m18,235s     (4.6 times faster)
> >> user	0m4,544s
> >> sys	0m13,796s
> >
> > But here you measure the time of creating namespaces, and you know
> > nothing about when they will be destroyed.
>
> You're right, and I predict the sum of time spent on the CPU will remain
> the same, but the thing is that now creation and destruction may be
> executed in parallel.
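
To make that last point concrete -- that the win comes from creation and
destruction overlapping rather than from less total CPU time -- here is a
userspace analogy of the locking shape described in the cover letter, written
purely for illustration. A pthread rwlock stands in for the proposed
rw_semaphore, and all names (net_sem, fake_copy_net_ns, fake_cleanup_net,
fake_register_pernet_ops) and the sleep() calls are placeholders, not kernel
code:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t net_sem = PTHREAD_RWLOCK_INITIALIZER;

/* "copy_net_ns()": creating a namespace takes the lock for reading. */
static void *fake_copy_net_ns(void *arg)
{
        (void)arg;
        pthread_rwlock_rdlock(&net_sem);   /* was mutex_lock(&net_mutex) */
        printf("create: running ->init methods\n");
        sleep(1);                          /* pretend ops_init() takes a while */
        pthread_rwlock_unlock(&net_sem);
        return NULL;
}

/* "cleanup_net()": destruction also takes the lock for reading,
 * so it no longer excludes creation. */
static void *fake_cleanup_net(void *arg)
{
        (void)arg;
        pthread_rwlock_rdlock(&net_sem);
        printf("destroy: waiting for grace period, running ->exit methods\n");
        sleep(1);                          /* pretend synchronize_rcu() sleeps */
        pthread_rwlock_unlock(&net_sem);
        return NULL;
}

/* "register_pernet_subsys()": changing pernet_list still excludes everyone. */
static void fake_register_pernet_ops(void)
{
        pthread_rwlock_wrlock(&net_sem);
        printf("register: updating pernet_list\n");
        pthread_rwlock_unlock(&net_sem);
}

int main(void)
{
        pthread_t create, destroy;

        pthread_create(&create, NULL, fake_copy_net_ns, NULL);
        pthread_create(&destroy, NULL, fake_cleanup_net, NULL);
        /* With a plain mutex these two would serialize; with the rwlock
         * they overlap, so the run finishes in ~1s instead of ~2s. */
        pthread_join(create, NULL);
        pthread_join(destroy, NULL);
        fake_register_pernet_ops();
        return 0;
}

Build with "gcc -pthread". Swapping the rdlocks for a plain mutex makes the
two threads serialize again, which is exactly the behaviour the patch set is
trying to get rid of.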