[Bug 1168526] [NEW] race condition causing lxc to not detect container init process exit

2013-04-12 Thread Pavel Bennett
Public bug reported: For the purpose of the repro, my lxc init process is node.js v0.11.0 (built from source) with a single line: process.exit(0); When running it in lxc, sometimes lxc doesn't exit. lxc-start remains a parent of a defunct node process without reaping it or exiting. I've made a

[Bug 1168526] Re: race condition causing lxc to not detect container init process exit

2013-04-12 Thread Pavel Bennett
> Precisely which version of lxc were you using? I just put back version 0.9.0-0ubuntu2 (as opposed to the 0.9.0 I built from source) while on kernel 3.7.9-030709-generic and haven't yet run into this issue (I assume that's the patch you mentioned). However, when I update to kernel 3.8.6-030806-ge

[Bug 1168526] Re: race condition causing lxc to not detect container init process exit

2013-04-13 Thread Pavel Bennett
I should add that these "forwarded signal 2" lines are due to me pressing Ctrl+C and are not actually relevant. Have you been able to repro this bug on kernel 3.8.6? I'm thinking how to fix this as lxc_spawn is what gets the pid which is needed by lxc_poll to listen for SIGCHLD from the correct p

[Bug 1168526] Re: race condition causing lxc to not detect container init process exit

2013-04-14 Thread Pavel Bennett
Btw, that "queueing mode" would simply mean not calling epoll_wait until the pid is available. This shouldn't require managing a queue ourselves. Can you think of anything that this would break? Or we could go with the patch you've written, although I haven't looked into why the problem appears to
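A minimal sketch of the general idea behind that "queueing mode", written in Python rather than lxc's C and not taken from the lxc source: if SIGCHLD is blocked before the child is spawned, the kernel keeps the signal pending, so the parent cannot miss the exit even if it only starts waiting after it has learned the pid. The function name spawn_then_wait is hypothetical, and the sketch assumes Linux and a single-threaded parent.

```python
import os
import signal


def spawn_then_wait(timeout=5.0):
    """Fork a child that exits at once; return (pid, signalled_pid, exit_code)."""
    # Block SIGCHLD first, so an early exit stays pending instead of being lost.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately, like process.exit(0)
        # Parent: only now is the pid known. The signal may already be pending.
        info = signal.sigtimedwait({signal.SIGCHLD}, timeout)
        signalled_pid = info.si_pid if info is not None else None
        _, status = os.waitpid(pid, 0)  # reap, so no zombie is left behind
        return pid, signalled_pid, os.WEXITSTATUS(status)
    finally:
        # Restore whatever signal mask the caller had before.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Calling spawn_then_wait() should report the child's pid twice (once from fork, once from the pending signal) together with exit code 0. The point is only the ordering: block, spawn, then wait.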

[Bug 1168526] Re: race condition causing lxc to not detect container init process exit

2013-04-20 Thread Pavel Bennett
I've also tried it with a C++ app very similar to yours and was unable to repro. There is something about having node.js as the init process running a "process.exit(0);" js. The init process (node v0.11.0) does exit as "ps faux" shows it as a zombie and a child of lxc-start. I went back to kernel
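For reference, the zombie state that "ps faux" reports can also be read directly from /proc. The following is a small Python sketch, assuming Linux; the helper names proc_state and demo_zombie are hypothetical, and the demo child is a plain fork, not an lxc container.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat ('Z' means zombie)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is in parentheses and may contain spaces,
    # so split on the last ')' before reading the state field.
    return data.rsplit(")", 1)[1].split()[0]


def demo_zombie(timeout=5.0):
    """Fork a child that exits at once and is deliberately not reaped yet."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    # Poll until the exited child shows up as a zombie or the timeout passes.
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)
    os.waitpid(pid, 0)  # reap the child; the zombie entry disappears
    return state
```

A parent that never reaches the waitpid call leaves the child in state Z indefinitely, which is the symptom described for lxc-start here.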

[Bug 1168526] Re: race condition causing lxc to not detect container init process exit

2013-06-28 Thread Pavel Bennett
Hey Serge, let me know if that repro worked for you or when you're planning to give it a try. I'm keeping the VM image around in case you need it. > What's odd is that I can't even reproduce it with the daily ppa build, > which doesn't have the workaround which is in the ubuntu package. Did you t

[Bug 1196295] [NEW] lxc-start enters uninterruptible sleep

2013-06-30 Thread Pavel Bennett
Public bug reported: After running and terminating around 6000 containers overnight, something happened on my box that is affecting every new LXC container I try to start. The DEBUG log file looks like: lxc-start 1372615570.399 WARN lxc_start - inherited fd 9 lxc-start 1372615570.

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-06-30 Thread Pavel Bennett
Also, in dmesg: [54545.873460] unregister_netdevice: waiting for lo to become free. Usage count = 1 [54556.103535] unregister_netdevice: waiting for lo to become free. Usage count = 1 [54566.333609] unregister_netdevice: waiting for lo to become free. Usage count = 1 [54576.563664] unregister_n

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-06-30 Thread Pavel Bennett
Some basic environment details. I can post more if requested. Ubuntu Server 13.04 64-bit $ uname -r 3.8.0-25-generic $ dpkg -l | grep lxc ii liblxc0 0.9.0-0ubuntu3.3 amd64 Linux Containers userspace tools (library) ii lxc

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-07-02 Thread Pavel Bennett
Reproduced even with lxc-stop. dmesg: [178420.689704] unregister_netdevice: waiting for lo to become free. Usage count = 1 [178430.919783] unregister_netdevice: waiting for lo to become free. Usage count = 1 [178441.149854] unregister_netdevice: waiting for lo to become free. Usage count = 1 [

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-07-02 Thread Pavel Bennett
This should help. kern.log: Jul 2 05:41:32 server1 kernel: [136565.201601] device vethbJ4JsM left promiscuous mode Jul 2 05:41:32 server1 kernel: [136565.201603] vmbr: port 5(vethbJ4JsM) entered disabled state Jul 2 05:41:38 server1 kernel: [136570.551496] vmbr: port 2(veth49SiBX) entered f

[Bug 1196295] HookError_generic.txt

2013-07-02 Thread Pavel Bennett
apport information ** Attachment added: "HookError_generic.txt" https://bugs.launchpad.net/bugs/1196295/+attachment/3722561/+files/HookError_generic.txt -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bu

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-07-02 Thread Pavel Bennett
apport information ** Tags added: apport-collected ** Description changed: After running and terminating around 6000 containers overnight, something happened on my box that is affecting every new LXC container I try to start. The DEBUG log file looks like: lxc-start 1372615570.3

[Bug 1196295] HookError_cloud_archive.txt

2013-07-02 Thread Pavel Bennett
apport information ** Attachment added: "HookError_cloud_archive.txt" https://bugs.launchpad.net/bugs/1196295/+attachment/3722560/+files/HookError_cloud_archive.txt

[Bug 1196295] HookError_source_lxc.txt

2013-07-02 Thread Pavel Bennett
apport information ** Attachment added: "HookError_source_lxc.txt" https://bugs.launchpad.net/bugs/1196295/+attachment/3722563/+files/HookError_source_lxc.txt

[Bug 1196295] HookError_source_linux.txt

2013-07-02 Thread Pavel Bennett
apport information ** Attachment added: "HookError_source_linux.txt" https://bugs.launchpad.net/bugs/1196295/+attachment/3722562/+files/HookError_source_linux.txt

[Bug 1196295] HookError_ubuntu.txt

2013-07-02 Thread Pavel Bennett
apport information ** Attachment added: "HookError_ubuntu.txt" https://bugs.launchpad.net/bugs/1196295/+attachment/3722564/+files/HookError_ubuntu.txt

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-07-02 Thread Pavel Bennett
Looks like apport was missing some module to gather what it wanted. Let me know if this info would be valuable and I can re-run it. ** Changed in: lxc (Ubuntu) Status: Incomplete => Confirmed

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-07-02 Thread Pavel Bennett
Attaching a better apport file after installing the missing dependency. I will hide the ones from earlier as this will contain the same data and more. ** Attachment added: "apport.lxc.txt" https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1196295/+attachment/3722586/+files/apport.lxc.txt

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-07-02 Thread Pavel Bennett
Changing to Confirmed as per instructions in comment #7 ** Changed in: linux Status: Incomplete => Confirmed

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-07-05 Thread Pavel Bennett
** Tags added: kernel-bug-exists-upstream

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-07-05 Thread Pavel Bennett
Managed to repro with v3.10-saucy last night. What do you guys suspect it could be? I'm keeping the server in this state for now if you'd like me to gather some data.

[Bug 1196295] Re: lxc-start enters uninterruptible sleep

2013-07-17 Thread Pavel Bennett
The last one I'm aware of that did not exhibit this issue was 3.5.0-27. I wish I had a simpler repro though, since on our system it takes 10-15 hours of heavy processing to hit the uninterruptible sleeps. Could it be tracked by looking at the state of the OS? Every new lxc- start ends up hanging a

[Bug 1168526] Re: race condition causing lxc to not detect container init process exit

2013-06-20 Thread Pavel Bennett
Hey Serge, were you able to get a reliable repro for this? I have a reason to upgrade to Raring, and this seems to be the only blocker. We've reproduced the issue with the stock Linux Mint 15.

[Bug 1168526] Re: race condition causing lxc to not detect container init process exit

2013-06-21 Thread Pavel Bennett
I can't try Saucy right now, but the repro instructions with kernel versions are in the original post and in #2. We've tried node v0.11.2 as well on Raring and got the repro. Repro summary: Install any of the above kernels, such as the one with the Raring installer, then install lxc from apt. G

[Bug 1168526] Re: race condition causing lxc to not detect container init process exit

2013-06-21 Thread Pavel Bennett
Sure, run these inside the container: git clone https://github.com/joyent/node.git --depth 1 cd node ./configure make -j9 sudo make install Then the binary will be at /usr/local/bin/node It's v0.11.3-pre, but should still repro.

[Bug 1168526] Re: race condition causing lxc to not detect container init process exit

2013-06-21 Thread Pavel Bennett
I created a VM with Ubuntu Server 13.04 just for this bug. At first, I was able to run the steps outlined above 50 times with no issues. What was I missing? Concurrency! I rebooted the VM after adding 1 more core, and... bingo! Zombies on the 3rd try. The VM disk image I have here should be compat