On Thu, November 26, 2015 13:16:04 Martin Graesslin wrote:
> we are facing a problem during the startup of Plasma on Wayland. If OOM
> protection is enabled for kdeinit and we already have a running X server,
> kdeinit freezes dead.
>
> I'm sorry for having ignored the issue for too long and had just disabled
> OOM protection on my system, so I never hit it. Now I enabled it again to
> get the problem. On my system I have now two frozen kdeinit processes:
>
> martin 1960 1956 0 77832 26448 1 13:05 ? 00:00:00
> /opt/kf5/bin/kdeinit5 --oom-pipe 4 --kded +kcminit_startup
> martin 1961 1960 0 77832 2816 3 13:05 ? 00:00:00
> /opt/kf5/bin/kdeinit5 --oom-pipe 4 --kded +kcminit_startup
>
> One has the following stacktrace:
> It's frozen in this line of code:
> sigsuspend(&oldsigs); // wait for the signal to come
>
> The other one has the following stacktrace:
> which is:
> d.n = read(d.fd[0], &d.result, 1);
>
> Given that it looks to me like these two processes dead-lock. I do not
> understand why, why it only happens on Wayland, why the fact that an X
> server must already be running is relevant and what the OOM protection has
> to do with it.

I don't have the answer, but I think I can help explain the deadlock better.

You might start by looking at
frameworks/kinit/src/start_kdeinit/start_kdeinit.c:39, which describes how the
OOM protection plays into the kdeinit concept. AFAICS, the idea is that the
"start_kdeinit" program forks off a child (kdeinit), which itself will
eventually fork off children of its own. The OOM protection is intended for
the kdeinit child alone, not the grandchildren.

Instead of having the kdeinit child disable protection for its own children,
it uses a pipe to send the PID of each of its children (grandchildren of
start_kdeinit) back to start_kdeinit, and start_kdeinit disables the OOM
protection for them. It wouldn't do to have the grandchild exec() the actual
program before its inherited OOM protection has been removed, so kdeinit's
child (the grandchild) waits for SIGUSR1 to be sent before proceeding.

In this case, SIGUSR1 seems never to be sent, likely due to
start_kdeinit.c:200 (in the original parent process):

    if (set_protection(pid, 0)) {
        kill(pid, SIGUSR1);
    }

There's no else block here; if set_protection() (a static function in
start_kdeinit.c) fails for any reason, the process is neither resumed nor
killed and will simply hang. AFAICS the only reason set_protection() would
fail is if the process's UID is not as expected (since the PID is simply a
value fed over a pipe; it's intended to be a grandchild of start_kdeinit
itself, but if something else gets fed in somehow there's a UID check as a
safeguard).

In the meantime kdeinit itself waits to learn whether its child succeeded in
exec()'ing, so it can show an error message if needed. But kdeinit's child is
waiting on a SIGUSR1 that never gets sent, and can't proceed to the portion of
its code path where it would send its result back to kdeinit (using a separate
set of pipe fds). Since the grandchild never reports back to kdeinit, kdeinit
itself remains blocked. That would match the two frozen processes you see: one
sitting in sigsuspend(), the other in read().
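
To make the hang concrete, here is a minimal, self-contained sketch of the same handshake with the missing else branch filled in. This is not the kinit code: spawn_waiting_child(), resume_or_kill(), run_handshake() and the protection_ok flag are hypothetical names of mine, the flag merely stands in for the return value of set_protection(), and killing the waiting process on failure is just one possible policy (resuming it anyway would be another).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* No-op handler: its only job is to let sigsuspend() return. */
static void on_usr1(int sig) { (void)sig; }

/* Hypothetical stand-in for the grandchild: it parks in sigsuspend()
 * until SIGUSR1 arrives, then exits (real code would exec() here). */
static pid_t spawn_waiting_child(void)
{
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    /* Block SIGUSR1 before fork() so an early signal stays pending
     * instead of being lost before the child reaches sigsuspend(). */
    sigprocmask(SIG_BLOCK, &block, &old);

    pid_t pid = fork();
    if (pid == 0) {
        struct sigaction sa;
        sa.sa_handler = on_usr1;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGUSR1, &sa, NULL);
        sigdelset(&old, SIGUSR1);
        sigsuspend(&old);   /* wait for the signal to come */
        _exit(0);
    }
    sigprocmask(SIG_SETMASK, &old, NULL);  /* parent: restore its mask */
    return pid;
}

/* The branch with an else: resume the child when protection_ok is set
 * (assumed to mirror a successful set_protection()), otherwise kill it
 * so that nobody is left waiting forever. */
static int resume_or_kill(pid_t pid, int protection_ok)
{
    assert(pid > 0);
    return kill(pid, protection_ok ? SIGUSR1 : SIGKILL);
}

/* Returns 0 if the child was resumed and exited normally, 1 if it was
 * killed because the protection step failed, -1 on any other error. */
int run_handshake(int protection_ok)
{
    pid_t pid = spawn_waiting_child();
    if (pid < 0)
        return -1;
    if (resume_or_kill(pid, protection_ok) != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return 1;
    return -1;
}
```

Calling run_handshake(1) follows the normal path, and run_handshake(0) follows the failure path. Without the SIGKILL branch, the waitpid() in the failure path would block forever, which is essentially the situation the parent kdeinit ends up in.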

The immediate fix would seem to revolve around properly indicating the error
case from start_kdeinit.c:200, but it might be prudent to add timeouts around
some of the other indefinitely-blocking function calls in kinit.cpp so that
kdeinit itself is not left blocked forever.

There's also the question of why start_kdeinit is expected to disable OOM
protection instead of kdeinit doing it directly... in any event kdeinit has to
know OOM protection is in use and participates in the process. Perhaps it's a
kernel restriction, but it seems to me it would be easier to factor the code in
set_protection() out into a separate function used by both start_kdeinit and
kdeinit.

Regards,
 - Michael Pyne
_______________________________________________
Kde-frameworks-devel mailing list
Kde-frameworks-devel@kde.org
https://mail.kde.org/mailman/listinfo/kde-frameworks-devel