Steve,

Can someone from Foundations evaluate this bug?  Core issue is that when
child process dies, the parent process is also killed.

                      Michael


On 09/30/2016 02:02 AM, Launchpad Bug Tracker wrote:
> bugproxy (bugproxy) has assigned this bug to you for Ubuntu:
>
> Problem Description
> ===========================
> I write a simple systemd service which will fork child processes fiercely. 
> But quickly the service failed:
>
> % sudo systemctl status reproducer.service
> ? reproducer.service - Reproducer of systemd services killed by ips
>    Loaded: loaded (/etc/systemd/system/reproducer.service; disabled; vendor 
> preset: enabled)
>    Active: failed (Result: exit-code) since Fri 2016-03-18 06:58:37 CDT; 2min 
> 43s ago
>   Process: 5103 ExecStart=/home/hpt/reproducer/reproducer.sh (code=exited, 
> status=0/SUCCESS)
>  Main PID: 5105 (code=exited, status=254)
>
> Mar 18 06:58:36 pinelp3 reproducer.sh[5103]: 
> /home/hpt/reproducer/reproducer.sh: fork: Resource temporarily unavailable
> Mar 18 06:58:36 pinelp3 reproducer.sh[5103]: 
> /home/hpt/reproducer/reproducer.sh: fork: Resource temporarily unavailable
> Mar 18 06:58:37 pinelp3 reproducer.sh[5103]: 
> /home/hpt/reproducer/reproducer.sh: fork: Resource temporarily unavailable
> Mar 18 06:58:37 pinelp3 reproducer.sh[5103]: 
> /home/hpt/reproducer/reproducer.sh: fork: Resource temporarily unavailable
> Mar 18 06:58:37 pinelp3 reproducer.sh[5103]: 
> /home/hpt/reproducer/reproducer.sh: fork: Resource temporarily unavailable
> Mar 18 06:58:37 pinelp3 reproducer.sh[5103]: 
> /home/hpt/reproducer/reproducer.sh: fork: Resource temporarily unavailable
> Mar 18 06:58:37 pinelp3 systemd[1]: reproducer.service: Main process exited, 
> code=exited, status=254/n/a
> Mar 18 06:58:37 pinelp3 reproducer.sh[5103]: 
> /home/hpt/reproducer/reproducer.sh: fork: Resource temporarily unavailable
> Mar 18 06:58:37 pinelp3 systemd[1]: reproducer.service: Unit entered failed 
> state.
> Mar 18 06:58:37 pinelp3 systemd[1]: reproducer.service: Failed with result 
> 'exit-code'.
>
> The default task limit of systemd services is 512. Looks like the
> service is terminated by the kernel's ips cgroup controller. I think
> this isn't correct. Child processes cannot be forked shouldn't cause
> parent to die.
>
>
> % cat /etc/systemd/system/reproducer.service 
> [Unit]
> Description=Reproducer of systemd services killed by ips
> After=multi-user.target
>
> [Service]
> ExecStart=/home/hpt/reproducer/reproducer.sh
> Type=forking
>
> [Install]
> WantedBy=multi-user.target
>
> % cat /home/hpt/reproducer/reproducer.sh
> #!/bin/bash
>
> foo()
> {
>         #exec sh -c "echo $1: \$\$;sleep 60"
>         echo $1: 
>         sleep 60
> }
>
> bar()
> {
>         c=1
>         while true
>         do
>                 for ((i=1;i<=2048;i++))
>                 do
>                         foo $c &
>                         ((c++))
>                 done
>
>                 wait
>                 c=1
>         done
> }
>
> # main
> bar &
>
> disown -a
>
> exit 0
>
>
> ---uname output---
> Linux pinelp3 4.4.0-12-generic #28-Ubuntu SMP Wed Mar 9 00:40:38 UTC 2016 
> ppc64le ppc64le ppc64le GNU/Linux
>  
> Machine Type = IBM,8408-E8E,lpar 
>
> Steps to Reproduce
> ================================
> 1. install the simple service in "Problem description"
> 2. sudo systemctl start reproducer.service
> 3. wait 2~3 minutes
>  
> == Comment: #3 - Vaishnavi Bhat <vaish...@in.ibm.com> - 2016-03-22 11:21:55 ==
> >From the machine,
> root@pinelp3:~# ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 48192
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 48192
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> root@pinelp3:~# ps aux | wc -l          --------->While the service is 
> running 
> 1084
> root@pinelp3:~# ps aux | wc -l                                   "
> 1084
> root@pinelp3:~# ps aux | wc -l                                   "
> 1084
> root@pinelp3:~# ps aux | wc -l                                   "
> 1084
> root@pinelp3:~# ps aux | wc -l    ---------->While the service is not 
> running. 
> 572
>
> root@pinelp3:~# free -m   --------------> While service is running
>               total        used        free      shared  buff/cache   
> available
> Mem:          12117         628         459          22       11029        
> 9541
> Swap:          2052           8        2044
>
> root@pinelp3:~# free -m   ------------> while the service is not running. 
>               total        used        free      shared  buff/cache   
> available
> Mem:          12117         308         809          22       10999        
> 9890
> Swap:          2052           8        2044
>
> == Comment: #4 - Breno Henrique Leitao <bren...@br.ibm.com> - 2016-03-22 
> 11:48:57 ==
> This is a new feature in Ubuntu and Systemd that limits the amount of 
> processes/child created.
>
> You can disable it as doucmented in
> https://wiki.ubuntu.com/ppc64el/Recommendations#Max_pids_on_Ubuntu_16.04
>
> == Comment: #5 - Ping Tian Han <pt...@cn.ibm.com> - 2016-03-22 20:24:33 ==
> (In reply to comment #4)
>> This is a new feature in Ubuntu and Systemd that limits the amount of
>> processes/child created.
>>
>> You can disable it as doucmented in
>> https://wiki.ubuntu.com/ppc64el/Recommendations#Max_pids_on_Ubuntu_16.04
> Hi,
>
> We knew this feature. But it is then main process of the service failed
> when it cannont fork chlid processes. This the problem we want to know
> how to fix. Thanks.
>
> == Comment: #6 - Breno Henrique Leitao <bren...@br.ibm.com> - 2016-03-24 
> 13:58:08 ==
> Hi,
>
>> We knew this feature. But it is then main process of the service failed when
>> it cannont fork chlid processes. This the problem we want to know how to
>> fix. Thanks.
> What do you want to fix exactly? There was an RFC limiting all systemd
> services to spawn, a maximum of process.
>
> This came from the Systemd creator at
> https://lists.freedesktop.org/archives/systemd-
> devel/2015-November/035006.html.
>
> If you wish to have a service that needs more than 512 process, just
> enable it on the service file, as:
>
> [Unit]
> Description=Reproducer of systemd services killed by ips
> After=multi-user.target
>
> [Service]
> ExecStart=/home/hpt/reproducer/reproducer.sh
> Type=forking
> TasksMax=1024 (or bigger)
>
> [Install]
> WantedBy=multi-user.target
>
> Anyway, what are you trying to test exactly, and why do you think it
> fails?
>
> == Comment: #7 - Ping Tian Han <pt...@cn.ibm.com> - 2016-03-24 21:26:35 ==
> (In reply to comment #6)
>> Hi,
>>
>>> We knew this feature. But it is then main process of the service failed when
>>> it cannont fork chlid processes. This the problem we want to know how to
>>> fix. Thanks.
>> What do you want to fix exactly? There was an RFC limiting all systemd
>> services to spawn, a maximum of process.
>>
>> This came from the Systemd creator at
>> https://lists.freedesktop.org/archives/systemd-devel/2015-November/035006.
>> html.
>>
>> If you wish to have a service that needs more than 512 process, just enable
>> it on the service file, as:
>>
>> [Unit]
>> Description=Reproducer of systemd services killed by ips
>> After=multi-user.target
>>
>> [Service]
>> ExecStart=/home/hpt/reproducer/reproducer.sh
>> Type=forking
>> TasksMax=1024 (or bigger)
>>
>> [Install]
>> WantedBy=multi-user.target
>>
>> Anyway, what are you trying to test exactly, and why do you think it fails?
> Hi,
>
> We saw that the main process of the example service quited (or killed?)
> by some reason when it cannot fork more chlid processes. This is the
> problem. We can tolerate that there is only 512 child processes but
> cannot if the main process being killed.
>
> We think that even though the main process cannot fork more than 512
> child processes, it shouldn't be killed.
>
> thanks.
>
> == Comment: #8 - Breno Henrique Leitao <bren...@br.ibm.com> - 2016-03-29 
> 10:16:04 ==
> Hi PIng,
>
>> We think that even though the main process cannot fork more than 512 child
>> processes, it shouldn't be killed.
> Right. Now I understand your point. So, what is the main process here?
> Is it your serivce process, or the systemd process?
>
> PS: I am decreasing the priority to normal, since this is a very corner
> case usage.
>
> == Comment: #9 - Kevin W. Rudd <ru...@us.ibm.com> - 2016-03-29 15:50:15 ==
> The process in question wasn't killed. It did its own exit:
>
> ...
> 19236 clone(child_stack=0, 
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> child_tidptr=0x3fff96c25aa0) = -1 EAGAIN (Resource temporarily unavailable)
> 19236 write(2, "/usr/local/bin/real_forkit.sh: f"..., 70) = 70
> 19236 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> 19236 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> 19236 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> 19236 exit_group(254)                   = ?
> 19236 +++ exited with 254 +++
>
> The problem here is that the replication script is not *handling* the
> error.  It is not even looking at the return code of the function it is
> trying to fork.  It is very possible that the shell itself triggered its
> own protection mechanism when faced with a script that appeared to be
> running away (the above exit was not after the first clone failure, but
> after several calls to clone had failed).
>
> Personally, if I were the admin of this system, this is exactly what I
> would want to happen to a run away process (smack it down quickly).
>
> == Comment: #10 - Ping Tian Han <pt...@cn.ibm.com> - 2016-03-29 21:17:16 ==
> (In reply to comment #9)
>> The process in question wasn't killed. It did its own exit:
>>
>> ...
>> 19236 clone(child_stack=0,
>> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
>> child_tidptr=0x3fff96c25aa0) = -1 EAGAIN (Resource temporarily unavailable)
>> 19236 write(2, "/usr/local/bin/real_forkit.sh: f"..., 70) = 70
>> 19236 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> 19236 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
>> 19236 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> 19236 exit_group(254)                   = ?
>> 19236 +++ exited with 254 +++
>>
>> The problem here is that the replication script is not *handling* the error.
>> It is not even looking at the return code of the function it is trying to
>> fork.  It is very possible that the shell itself triggered its own
>> protection mechanism when faced with a script that appeared to be  running
>> away (the above exit was not after the first clone failure, but after
>> several calls to clone had failed).
>>
> Thanks, Kevin. Looks like this is a problem of bash? I think there isn't
> any reason casuing a shell process as parents to quit just because it
> cannot fork more child processes...
>
>> Personally, if I were the admin of this system, this is exactly what I would
>> want to happen to a run away process (smack it down quickly).
> ** Affects: ubuntu
>      Importance: Undecided
>      Assignee: Taco Screen team (taco-screen-team)
>          Status: New
>
>
> ** Tags: architecture-ppc64le bugnameltc-139320 severity-medium 
> targetmilestone-inin---

-- 
Michael Hohnbaum
OIL Program Manager
Power (ppc64el) Development Project Manager
Canonical, Ltd.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1629226

Title:
  systemd's service killed by cgroup controller pids

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/bash/+bug/1629226/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to