I was able to reproduce the behaviour in a test VM.

I used a simple example nginx deployment with a high number of replicas.
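
For reference, the deployment was created and scaled roughly like this (the exact manifest and replica counts varied between runs, so treat this as a sketch rather than the exact commands):

```
# Create a basic nginx deployment (image and options are illustrative)
kubectl create deployment nginx-deployment --image=nginx

# Scale it to the replica count for the test run
kubectl scale deployment nginx-deployment --replicas=200

# Watch the pods being scheduled on the node
kubectl get pods -o wide --watch
```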

With the older version of runc, I was able to run 200 replicas
successfully, and even 250.

With the newer version of runc, I was only able to run 164 replicas out of 200
(the last 36 pods were stuck in the ContainerCreating state).
There were also 2 system pods on the node (weave + kube-proxy), so 166 running
pods in total.

```
nginx-deployment-66b6c48dd5-2hdgd   0/1     ContainerCreating   0          11m
nginx-deployment-66b6c48dd5-2ql8r   0/1     ContainerCreating   0          11m
nginx-deployment-66b6c48dd5-4z54r   0/1     ContainerCreating   0          11m
nginx-deployment-66b6c48dd5-5vvzl   0/1     ContainerCreating   0          11m
nginx-deployment-66b6c48dd5-6xh9g   0/1     ContainerCreating   0          11m
...
```
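
To get the counts of running versus stuck pods, something along these lines works (a sketch, not necessarily the exact command I ran):

```
# Tally pods by their status column (Running, ContainerCreating, ...)
kubectl get pods --no-headers | awk '{print $3}' | sort | uniq -c
```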

I saw the same containerd logs as on the production server, and nothing
new in dmesg / kern.log.
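
For completeness, these are roughly the places I checked on the node (assuming containerd and kubelet run as systemd services on this setup):

```
# containerd / kubelet service logs
journalctl -u containerd --no-pager | tail -n 100
journalctl -u kubelet --no-pager | tail -n 100

# kernel side
dmesg | tail -n 100
tail -n 100 /var/log/kern.log
```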

The load on the node was fine (mostly idle CPU, plenty of free memory):
```
root@node0:~# vmstat -S M 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0   2305     70    997    0    0   140   390 1242 2560  5  4 92  0  0
 1  0      0   2305     70    998    0    0     0     0 2312 4529  1  1 98  0  0
 0  0      0   2303     70    998    0    0     0     3 5055 11301  4  4 92  0  0
 0  0      0   2303     70    998    0    0     0     0 2171 4232  1  1 98  0  0
 0  0      0   2302     70    998    0    0     0     0 5512 11719  4  4 92  0  0
 2  0      0   2302     70    998    0    0     0     0 2182 4261  1  1 98  0  0
 1  0      0   2302     70    998    0    0     0     0 3822 7883  3  1 96  0  0
 0  0      0   2302     70    998    0    0     0     0 3724 7891  3  3 95  0  0
 0  0      0   2303     70    998    0    0     0     0 2729 5351  3  1 96  0  0
 0  0      0   2300     70    998    0    0     0     4 4674 9863  3  3 94  0  0
 0  0      0   2300     70    998    0    0     0     0 2476 4930  2  1 97  0  0
 1  0      0   2305     70    997    0    0     0    65 5219 10770  4  4 91  0  0
 0  0      0   2430     70    994    0    0     0  2267 4918 9713  6  5 89  0  0
 0  0      0   2558     71    990    0    0     0  2033 7205 15041  7  7 85  1  0
 0  0      0   2589     71    989    0    0     0  1364 5006 10203  6  5 88  1  0
 0  0      0   2429     71    993    0    0     0  1554 9502 20112 12 13 74  1  0
 0  0      0   2325     71    997    0    0     0  1043 9185 19772 11 11 78  1  0
 0  0      0   2303     71    998    0    0     0  1810 3723 7264  4  4 91  0  0
 2  0      0   2318     71    998    0    0     0     0 3181 6494  3  2 96  0  0
 0  0      0   2317     71    998    0    0     0    23 2789 5525  2  1 97  0  0
 3  0      0   2307     71    999    0    0    26    84 5406 11430  5  4 90  0  0
 0  0      0   2307     71    999    0    0     0    31 2631 5201  2  1 97  0  0
 0  0      0   2306     71    999    0    0     0     5 2687 5532  2  1 97  0  0
 1  0      0   2306     71    999    0    0     0   221 4689 9833  3  3 94  0  0
 1  0      0   2305     71    999    0    0     0    22 3072 6339  2  2 96  0  0
 0  0      0   2306     71    999    0    0     0     4 2482 4812  2  1 98  0  0
 0  0      0   2307     71    999    0    0     0    46 3528 7051  3  1 96  0  0
 0  0      0   2306     71    999    0    0     0    40 4363 9302  3  3 94  0  0
 0  0      0   2306     71    999    0    0     0    33 2537 5363  2  1 97  0  0
 2  0      0   2306     71    999    0    0     0    35 4215 8648  3  3 94  0  0
 0  0      0   2306     71    999    0    0     0     6 3303 6824  3  2 95  0  0
 1  0      0   2306     71    999    0    0     0     3 2122 4117  2  1 97  0  0
 1  0      0   2305     71    999    0    0     0     0 5339 11477  4  3 93  0  0
 0  0      0   2304     71    999    0    0     0     0 2193 4326  1  1 98  0  0
 0  0      0   2304     71    999    0    0     0     0 5424 11640  5  4 92  0  0
 4  0      0   2304     71    999    0    0     0     0 1859 3685  2  1 98  0  0
 2  0      0   2302     71    999    0    0     0     0 5260 11252  4  4 92  0  0
 3  0      0   2302     71    999    0    0     0     0 2044 4116  1  1 98  0  0
 0  0      0   2301     71    999    0    0     0     0 4949 11084  4  3 93  0  0
 2  0      0   2301     71    999    0    0     0     0 2252 4349  2  1 98  0  0
 0  0      0   2299     71    999    0    0     0     0 5329 11410  4  3 92  0  0
 2  0      0   2299     71    999    0    0     0     0 2077 4186  2  1 97  0  0


root@node0:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           5945        2577        2296          20        1071        3297
Swap:             0           0           0

root@node0:~# uptime 
 23:41:26 up 20 min,  1 user,  load average: 0.48, 0.75, 0.53

```

After a while, I scaled down to 150 replicas, which ran fine.
I then went back up to 160 replicas, which were also ok, then to 170, and hit
the limit again, this time at 163 replicas / 165 pods (plus 7 pods stuck in
ContainerCreating).
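
The scaling up and down was done with plain kubectl scale, e.g.:

```
kubectl scale deployment nginx-deployment --replicas=150
kubectl scale deployment nginx-deployment --replicas=160
kubectl scale deployment nginx-deployment --replicas=170
```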

I removed the deployment, rebooted the node, and started the deployment again.
I was blocked at 164 replicas (out of 170), plus the 2 other pods.

So there seems to be a hard limit of around 164-165 running pods per node with
the latest runc.
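
One thing worth double-checking (just a suggestion, not something runc-specific) is the pod capacity the kubelet reports for the node, to rule out a configured limit; node0 is simply my test node's hostname:

```
kubectl describe node node0 | grep -A 7 -i capacity
```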

The master server was kept on the older version of runc for all the tests;
only the runc version on the node changed.
I used Weave for container networking.
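
For reference, the component versions on each machine can be confirmed with something like:

```
runc --version
containerd --version
dpkg -l runc containerd | grep '^ii'
```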
