On 2016-05-18 11:38, Mats Karrman wrote:
>
>
> On 2016-05-17 17:31, Mats Karrman wrote:
>>
>> On 2016-05-17 13:29, Felix Fietkau wrote:
>>> I just took a look at the code and uloop's processing of signals looked
>>> a bit racy to me. I've pushed a commit that makes it use signalfd if
>>> available. I also found that waitpid wasn't being retried on signal
>>> interrupt, so I added an extra check there. The changes are in libubox
>>> git, but not in OpenWrt/LEDE yet.
>>> Please test if this fixes your issue.
>>>
>>> Thanks,
>>>
>>> - Felix
>> Tried that but no immediate success, but it might have provided
>> some additional clues. Now the boot hangs early on *every* boot
>> but after logging in I found something different in the ps list.
>> There is a Broadcom utility (smd) that is called from one of the
>> start scripts (S10environment). Its purpose is to set scheduling
>> priority and cpu affinity for some of the Broadcom proprietary
>> processes. The smd program handles fork rather ugly. The
>> parent only loops until it receives SIGCHLD and then exits without
>> any wait. With the modified libubox I get a zombie smd child and
>> sleeping smd parent and S11environment (no other zombie).
>>
>> Not sure exactly how this happened but I got to think about
>> something written in the wait man page:
>>
>> """
>> If a parent process terminates, then its "zombie" children (if any)
>> are adopted by init(8), which automatically performs a wait to
>> remove the zombies.
>> """
>>
>> Is this wait really (unconditionally) implemented in procd or could
>> that be what I accomplished with the "forced timeout" patch?
>>
>> I fixed the ugly fork and got the system to boot once.
>> Then tried the original libubox with the fixed smd program but
>> this was not enough to get things working (25 reboots to hang).
>>
>> Now I'm running reboot tests with your new libubox and fixed smd...
> More than 250 reboots without problem :)
>
> Clearly the smd program is broken, but still it doesn't feel good that it
> manages to hang the init process. Considering that timing is involved
> it's difficult to draw any certain conclusions, but it seems like having
> uloop's epoll_wait time out occasionally isn't such a bad idea?

I agree, that definitely needs fixing. What kernel are you using?
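
For illustration only, here is a rough sketch of the two ideas discussed
in this thread: waiting for SIGCHLD through a signalfd instead of an
asynchronous handler, and retrying waitpid when it is interrupted by a
signal. This is not the libubox commit and not the smd fix; the names
waitpid_retry and run_child_and_reap are made up for this example, and
it assumes a Linux kernel with signalfd support.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Retry waitpid() when a signal interrupts it (EINTR). */
static pid_t waitpid_retry(pid_t pid, int *status, int options)
{
    pid_t ret;

    do {
        ret = waitpid(pid, status, options);
    } while (ret < 0 && errno == EINTR);

    return ret;
}

/*
 * Hypothetical helper: fork a child that exits with exit_code, wait for
 * SIGCHLD through a signalfd, then reap the child so no zombie is left.
 * Returns the child's exit status, or -1 on error.
 */
int run_child_and_reap(int exit_code)
{
    sigset_t mask, old;
    struct signalfd_siginfo si;
    int sfd, status, result = -1;
    pid_t pid;
    ssize_t n;

    /* Block SIGCHLD before fork() so the signal stays pending for the fd. */
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &mask, &old) < 0)
        return -1;

    sfd = signalfd(-1, &mask, SFD_CLOEXEC);
    if (sfd < 0)
        goto out_mask;

    pid = fork();
    if (pid < 0)
        goto out_fd;
    if (pid == 0)
        _exit(exit_code);       /* child: exit immediately */

    /* Parent: the signal arrives as an ordinary blocking read. */
    do {
        n = read(sfd, &si, sizeof(si));
    } while (n < 0 && errno == EINTR);
    if (n == (ssize_t)sizeof(si))
        assert(si.ssi_signo == SIGCHLD);

    /* Always reap the child, unlike a parent that exits right after SIGCHLD. */
    if (waitpid_retry(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

out_fd:
    close(sfd);
out_mask:
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

Calling run_child_and_reap(7) from a small test program should return
the child's exit status and leave no zombie behind. A real event loop
would add the signalfd to its epoll set rather than block in read().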

- Felix