On 7 June 2016 at 06:11, Xinxing Hu <xinxing.hu...@gmail.com> wrote:
> Hi Guys,
>
> I have another idea about this issue. Maybe it is not kernel related,
> but uloop related. I read the procd and libubox code a little, and
> there seems to be a potential issue in uloop_run().
>
> In general, uloop_run() runs a while loop:
>
> while()
>     1, Process timeouts list
>     2, Handle terminated child processes
>     3, uloop_run_events(timeout) => calls epoll_wait()
> done
>
> During boot, procd_inittab_run("sysinit") is called in Step 1, which
> calls add_initd(). add_initd() adds an entry to the timeouts list whose
> callback function executes an rc.d/S* script.
>
> When the while loop gets back to Step 1, the timeouts list is
> processed, and an rc.d/S* script is executed in a child process while
> the parent process remains in the while loop. If everything goes fine,
> when the child process terminates, the parent process handles the
> terminated child by calling waitpid() in the while loop. A process
> callback function is also called, which adds another entry to the
> timeouts list. This new entry corresponds to the next rc.d/S* script to
> be executed. When the while loop reaches Step 1 again, the next
> rc.d/S* script is invoked.
>
> Everything looks OK so far. However, due to process scheduling,
> problems might happen when
> uloop_run_events(uloop_get_next_timeout(&tv)) is called. For instance:
> if the child process is still running when
> uloop_get_next_timeout(&tv) is called, the timeouts list is already
> empty at that time, so uloop_get_next_timeout(&tv) returns -1. If the
> child process then terminates and the signal handler is executed
> before epoll_wait() is called, epoll_wait() blocks the parent process
> until some other event it is listening for arrives. In this sense,
> other events arriving just hide the issue. During boot, as long as the
> /etc/rc.d/S* scripts have not all finished executing, epoll_wait()
> should never block indefinitely.
>
> I think a potential solution might be: during initialization, we let
> uloop listen for a kind of 'dummy' event. Every time a child process
> finishes executing an rc.d/S* script, we send a 'dummy' event. In this
> case, epoll_wait() would never block indefinitely during boot.
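
The 'dummy' event proposed above resembles the classic self-pipe pattern: the signal handler writes a byte to a pipe whose read end is registered with epoll, so a signal that arrives before epoll_wait() still makes the next call return. A minimal sketch in C follows. It is not the libubox API and not the attached patch; all names (waker_fds, waker_signal, waker_init, wait_once) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical names for illustration only; not the libubox API. */
static int waker_fds[2] = { -1, -1 };  /* [0] = read end, [1] = write end */

/* Signal handler: only calls write(), which is async-signal-safe. */
static void waker_signal(int signo)
{
    int saved_errno = errno;
    char byte = 'w';

    (void)signo;
    if (write(waker_fds[1], &byte, 1) < 0) {
        /* EAGAIN means the pipe is full, so a wake-up is already pending. */
    }
    errno = saved_errno;
}

/* Make one pipe end non-blocking and close-on-exec. */
static int waker_set_flags(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    return fcntl(fd, F_SETFD, FD_CLOEXEC);
}

/* Create the pipe and register its read end with the epoll instance. */
static int waker_init(int epfd)
{
    struct epoll_event ev = { .events = EPOLLIN };

    assert(epfd >= 0);
    if (pipe(waker_fds) < 0)
        return -1;
    if (waker_set_flags(waker_fds[0]) < 0 || waker_set_flags(waker_fds[1]) < 0)
        return -1;
    ev.data.fd = waker_fds[0];
    return epoll_ctl(epfd, EPOLL_CTL_ADD, waker_fds[0], &ev);
}

/* Wait once: returns 1 if an event arrived, 0 on timeout, -1 on error. */
static int wait_once(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    char buf[64];
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    if (n <= 0)
        return n;
    if (ev.data.fd == waker_fds[0]) {
        /* Drain the pipe so the next wait does not return spuriously. */
        while (read(waker_fds[0], buf, sizeof(buf)) > 0)
            ;
    }
    return 1;
}
```

In a loop shaped like uloop_run(), waker_signal() would be installed as the SIGCHLD handler. A child that exits between the timeout computation and epoll_wait() then leaves a byte in the pipe, and the wait returns immediately instead of blocking.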
Interesting. It looks like the same issue can also affect the
uloop_canceled check.

Python's tornado library uses a pipe() as a "waker" so that it "calls
the given callback on the next I/O loop iteration."

Can you give the attached patch a try to see whether it solves the
issue for you? It has only been run-tested on qemu malta to make sure
the patched libubox still runs.

                yousong

>
> Best Regards,
> Xinxing
>
>
> On 2016/5/17 18:03, Mats Karrman wrote:
> Hi Felix, others,
>
> I have been experiencing problems with the init scripts dispatch
> suddenly stopping (indefinitely).
> This happens maybe once in 100 reboots.
> After inserting a new start script that launches another daemon
> (cgrulesengd) very early in the boot process, the failures started to
> occur a lot more frequently, maybe once in 10 reboots, making this a
> real issue.
> I'm normally using the versions of procd and libubox selected by the
> OpenWRT BB branch, but I have tested the latest versions from the git
> repos with the same result.
> So far I have only seen this happen on a quite fast board (ARM dual
> Cortex-A9 @ 1GHz).
> Inserting trace prints in libubox changes the behavior, which also
> suggests the problem is timing dependent.
>
> When init hangs:
> - it is still possible to log in on the console
> - there is always a zombie start script, e.g. S11sysctl.
> - killing a process (e.g. ubusd or cgrulesengd) makes the init process
>   continue.
> - generating some other event, e.g. inserting something into a USB
>   port, also makes init continue.
>
> I have traced the problem down to the "epoll_wait" call in
> libubox::uloop.c::uloop_fetch_events().
> The following patch makes sure epoll_wait is never called without a
> timeout.
> My tests show that this solves the problem.
> I have been able to observe the case where the boot gets stuck and then
> continues after the 8s timeout.
> However, I'm not sure that this is the correct fix for the problem, as
> there may be other reasons why there is no event in the first place.
> Your feedback would be welcome!
>
> BR // Mats
> Currently working for Inteno Broadband Technology AB
>
> _______________________________________________
> Lede-dev mailing list
> Lede-dev@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/lede-dev
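
The workaround described in Mats's report, never calling epoll_wait() without a timeout, can be sketched roughly as below. This is a guess at the shape of such a change, not the actual patch, which is not reproduced in this thread. FALLBACK_WAIT_MS, effective_timeout() and fetch_events_bounded() are hypothetical names, and the 8000 ms value simply mirrors the 8s delay mentioned in the report.

```c
#include <sys/epoll.h>

/* Hypothetical fallback; mirrors the 8 s delay mentioned in the report. */
#define FALLBACK_WAIT_MS 8000

/* A negative timeout means "block forever"; replace it with the fallback. */
static int effective_timeout(int timeout_ms)
{
    return timeout_ms < 0 ? FALLBACK_WAIT_MS : timeout_ms;
}

/* Like epoll_wait(), but never waits without an upper bound. */
static int fetch_events_bounded(int epfd, struct epoll_event *events,
                                int max_events, int timeout_ms)
{
    return epoll_wait(epfd, events, max_events,
                      effective_timeout(timeout_ms));
}
```

A bound like this limits how long a missed wake-up can stall the loop, but it does not remove the race itself: boot can still pause for up to the fallback interval, which matches the behavior observed in the report.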
0001-uloop-use-a-waker-for-notifying-sigchld-and-loop-can.patch