Hi Yousong,

Thanks a lot for your patch. I have tried it a bit, and so far everything looks OK. The issue seems to be solved by your patch.

I am not sure whether Mats (the original reporter of this issue) still sees it.

Best Regards,
Xinxing

On 2016/6/7 20:49, Yousong Zhou wrote:
On 7 June 2016 at 06:11, Xinxing Hu <xinxing.hu...@gmail.com> wrote:
Hi Guys,

I have another idea about this issue. Maybe it is not kernel related, but
uloop related. I read the procd and libubox code a bit, and there seems to
be a potential issue in uloop_run().

In general, uloop_run() runs a while loop:

while()
        1, Process timeouts list

        2, Handle terminated child processes

        3, uloop_run_events(timeout) => calls epoll_wait()
done

During boot, procd_inittab_run("sysinit") is called in Step 1, which calls
add_initd(). add_initd() adds an entry to the timeouts list, whose callback
function executes an rc.d/S* script.

When the while loop comes back to Step 1, the timeouts list is processed,
and an rc.d/S* script is executed in a child process while the parent
process remains in the while loop. If everything goes fine, when the child
process terminates, the parent process handles it by calling waitpid() in
the while loop (Step 2). A process callback function is also called, which
adds another entry to the timeouts list. This new entry corresponds to the
next rc.d/S* script to be executed. When the while loop reaches Step 1
again, the next rc.d/S* script is invoked.

Everything looks OK so far. However, due to process scheduling, problems
might happen when uloop_run_events(uloop_get_next_timeout(&tv)) is called.
For instance, if the child process is still running when
uloop_get_next_timeout(&tv) is called, the timeouts list is already empty
at that time, so uloop_get_next_timeout(&tv) returns -1. If the child
process then terminates and the signal handler runs before epoll_wait() is
called, epoll_wait() blocks the parent process until some other event it is
listening for arrives. In this sense, other events arriving merely hide the
issue. During boot, as long as the /etc/rc.d/S* scripts have not all
finished, epoll_wait() should never block indefinitely.
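
The window described above can be illustrated with a small, deterministic
sketch. This is not libubox code: child_exited, on_sigchld() and
iterate_once() are hypothetical names, and the "late" signal is simulated by
calling the handler directly instead of relying on scheduling.

```c
#include <assert.h>
#include <signal.h>
#include <sys/epoll.h>

/* Hypothetical flag set by the SIGCHLD handler (illustrative name). */
static volatile sig_atomic_t child_exited;

static void on_sigchld(int signo)
{
    (void)signo;
    child_exited = 1;
}

/*
 * One loop iteration in the order described above. `late_signal` simulates
 * the child exiting inside the window between the child check and
 * epoll_wait(). Returns whatever epoll_wait() returned.
 */
static int iterate_once(int epfd, int late_signal, int timeout_ms)
{
    struct epoll_event ev;

    if (child_exited) {
        child_exited = 0;
        /* waitpid() and the process callback would run here. */
    }

    /* Race window: timeout already computed, epoll_wait() not yet entered. */
    if (late_signal)
        on_sigchld(SIGCHLD);

    /* Nothing in the epoll set reflects the flag, so this simply waits. */
    return epoll_wait(epfd, &ev, 1, timeout_ms);
}
```

With a finite timeout the call returns 0 events while the flag stays set;
with a timeout of -1 and no other event source it would not return at all.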

I think a potential solution might be: during initialization, we let uloop
listen for a kind of 'dummy' event. Every time the child process finishes
executing an rc.d/S* script, we send a 'dummy' event. That way, epoll_wait()
would never block forever during boot.

Interesting.  Looks like the same issue can also happen to the
uloop_canceled check.  Python's tornado library uses pipe() as a
"waker" that "calls the given callback on the next I/O loop iteration."

Can you give the attached patch a try to see if it can solve the issue
for you?  It has only been run-tested on qemu malta to make sure the
patched libubox still runs.
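
A minimal sketch of such a pipe-based waker on top of epoll might look like
the following. It is not the attached patch and not libubox API: waker_fd,
waker_init(), waker_signal() and waker_drain() are hypothetical names chosen
for illustration. The idea is that the signal handler writes one byte into a
pipe whose read end is in the epoll set, so a signal that arrives before
epoll_wait() still leaves the read end readable.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical waker pipe: [0] is the read end, [1] the write end. */
static int waker_fd[2] = { -1, -1 };

/* Signal handler: only write() is called, which is async-signal-safe. */
static void waker_signal(int signo)
{
    int saved_errno = errno;
    char c = (char)signo;

    if (write(waker_fd[1], &c, 1) < 0) {
        /* Pipe full: a wake-up is already pending, nothing to do. */
    }
    errno = saved_errno;
}

/* Create the pipe, make both ends non-blocking, watch the read end. */
static int waker_init(int epfd)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int i;

    if (pipe(waker_fd) < 0)
        return -1;
    for (i = 0; i < 2; i++) {
        if (fcntl(waker_fd[i], F_SETFL, O_NONBLOCK) < 0 ||
            fcntl(waker_fd[i], F_SETFD, FD_CLOEXEC) < 0)
            return -1;
    }
    ev.data.fd = waker_fd[0];
    return epoll_ctl(epfd, EPOLL_CTL_ADD, waker_fd[0], &ev);
}

/* Drain pending wake-up bytes; returns how many were read. */
static int waker_drain(void)
{
    char buf[32];
    ssize_t r;
    int n = 0;

    while ((r = read(waker_fd[0], buf, sizeof(buf))) > 0)
        n += (int)r;
    return n;
}
```

The handler would be installed for SIGCHLD with sigaction(), and the loop
would call waker_drain() whenever epoll_wait() reports the read end before
going on to reap children.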

                yousong


Best Regards,
Xinxing




On 2016/5/17 18:03, Mats Karrman wrote:
Hi Felix, others,

I have been experiencing problems with the init scripts dispatch
suddenly stopping (indefinitely).
This happens maybe once in 100 reboots.
After inserting a new start script that launches another daemon
(cgrulesengd) very early in the boot process, the failures started to
come a lot more frequently, maybe once in 10 reboots, making this a real
issue.
I'm normally using the versions of procd and libubox selected by OpenWRT
BB branch but I have tested the latest versions from the git repos with
the same result.
So far I have only got this to happen on a quite fast board (dual-core ARM
Cortex-A9 @ 1 GHz).
Inserting trace prints in libubox changes behavior, also suggesting the
problem is timing dependent.

When init hangs:
- it is still possible to log in on console
- there is always a zombie start script, e.g. S11sysctl.
- by killing a process (e.g. ubusd or cgrulesengd) the init process
continues.
- generating an event some other way, e.g. inserting something into a USB
port, also makes init continue.

I have traced the problem down to the "epoll_wait" call in
libubox::uloop.c::uloop_fetch_events().
The following patch makes sure epoll_wait is never called without a timeout.
My tests show that this solves the problem.
I have been able to observe the case when the boot gets stuck and then
continues after the 8s timeout.
However I'm not sure that this is the correct fix for the problem as
there may be other reasons that there is no event in the first place.
Your feedback would be welcome!
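
The workaround described above (never calling epoll_wait without a timeout)
could be sketched roughly as follows. This is not the patch itself, which is
not reproduced in this message: MAX_WAIT_MS, bounded_timeout() and
bounded_epoll_wait() are hypothetical names, and the 8000 ms cap is taken
from the 8s timeout mentioned above.

```c
#include <assert.h>
#include <sys/epoll.h>

/* Assumed cap, matching the 8 s timeout mentioned in the text. */
#define MAX_WAIT_MS 8000

/* Replace an "infinite" (negative) timeout with the cap. */
static int bounded_timeout(int timeout_ms)
{
    return timeout_ms < 0 ? MAX_WAIT_MS : timeout_ms;
}

/* epoll_wait() wrapper that can never block indefinitely. */
static int bounded_epoll_wait(int epfd, struct epoll_event *events,
                              int maxevents, int timeout_ms)
{
    return epoll_wait(epfd, events, maxevents, bounded_timeout(timeout_ms));
}
```

A cap like this bounds the stall rather than removing the race: a missed
child exit is still only noticed when the timeout expires.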

BR // Mats
Currently working for Inteno Broadband Technology AB


_______________________________________________
Lede-dev mailing list
Lede-dev@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/lede-dev
