I was able to reproduce the behaviour in a test VM. I used a simple example nginx deployment with a high number of replicas.
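For anyone repeating the test, here is a minimal sketch of how the pods on the node could be tallied by status after each scale-up. The helper name `count_by_status` is hypothetical. The deployment name `nginx-deployment` and node name `node0` are assumptions inferred from the pod names and shell prompt in this report. The helper assumes the default single-namespace column order of `kubectl get pods --no-headers` (NAME READY STATUS RESTARTS AGE).

```shell
#!/bin/sh
# Tally pods by STATUS from `kubectl get pods --no-headers` output on stdin.
# Assumes the default column order (NAME READY STATUS RESTARTS AGE), i.e. a
# single-namespace listing without -o wide or --all-namespaces.
count_by_status() {
    # Column 3 is STATUS; count occurrences and print "STATUS COUNT" lines.
    awk 'NF >= 3 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Hypothetical usage against the test cluster (names are assumptions):
#   kubectl scale deployment/nginx-deployment --replicas=200
#   kubectl get pods --no-headers --field-selector spec.nodeName=node0 | count_by_status
```

The function only reads standard input, so it can also be run on a saved copy of the pod listing instead of a live cluster.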
With the older version of runc, I was able to run 200 replicas successfully, and even 250. With the newer version of runc, I was only able to run 164 replicas out of 200 (the last 36 pods were stuck in a ContainerCreating state). There were also 2 system pods on the node (weave + kube-proxy), so a total of 166 containers.

```
nginx-deployment-66b6c48dd5-2hdgd   0/1   ContainerCreating   0   11m
nginx-deployment-66b6c48dd5-2ql8r   0/1   ContainerCreating   0   11m
nginx-deployment-66b6c48dd5-4z54r   0/1   ContainerCreating   0   11m
nginx-deployment-66b6c48dd5-5vvzl   0/1   ContainerCreating   0   11m
nginx-deployment-66b6c48dd5-6xh9g   0/1   ContainerCreating   0   11m
...
```

I saw the same containerd logs as on the production server, and nothing more in dmesg / kern.log.

The load on the node was ok:

```
root@node0:~# vmstat -S M 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd  free buff cache  si  so   bi    bo    in    cs us sy id wa st
 0  0    0  2305   70   997   0   0  140   390  1242  2560  5  4 92  0  0
 1  0    0  2305   70   998   0   0    0     0  2312  4529  1  1 98  0  0
 0  0    0  2303   70   998   0   0    0     3  5055 11301  4  4 92  0  0
 0  0    0  2303   70   998   0   0    0     0  2171  4232  1  1 98  0  0
 0  0    0  2302   70   998   0   0    0     0  5512 11719  4  4 92  0  0
 2  0    0  2302   70   998   0   0    0     0  2182  4261  1  1 98  0  0
 1  0    0  2302   70   998   0   0    0     0  3822  7883  3  1 96  0  0
 0  0    0  2302   70   998   0   0    0     0  3724  7891  3  3 95  0  0
 0  0    0  2303   70   998   0   0    0     0  2729  5351  3  1 96  0  0
 0  0    0  2300   70   998   0   0    0     4  4674  9863  3  3 94  0  0
 0  0    0  2300   70   998   0   0    0     0  2476  4930  2  1 97  0  0
 1  0    0  2305   70   997   0   0    0    65  5219 10770  4  4 91  0  0
 0  0    0  2430   70   994   0   0    0  2267  4918  9713  6  5 89  0  0
 0  0    0  2558   71   990   0   0    0  2033  7205 15041  7  7 85  1  0
 0  0    0  2589   71   989   0   0    0  1364  5006 10203  6  5 88  1  0
 0  0    0  2429   71   993   0   0    0  1554  9502 20112 12 13 74  1  0
 0  0    0  2325   71   997   0   0    0  1043  9185 19772 11 11 78  1  0
 0  0    0  2303   71   998   0   0    0  1810  3723  7264  4  4 91  0  0
 2  0    0  2318   71   998   0   0    0     0  3181  6494  3  2 96  0  0
 0  0    0  2317   71   998   0   0    0    23  2789  5525  2  1 97  0  0
 3  0    0  2307   71   999   0   0   26    84  5406 11430  5  4 90  0  0
 0  0    0  2307   71   999   0   0    0    31  2631  5201  2  1 97  0  0
 0  0    0  2306   71   999   0   0    0     5  2687  5532  2  1 97  0  0
 1  0    0  2306   71   999   0   0    0   221  4689  9833  3  3 94  0  0
 1  0    0  2305   71   999   0   0    0    22  3072  6339  2  2 96  0  0
 0  0    0  2306   71   999   0   0    0     4  2482  4812  2  1 98  0  0
 0  0    0  2307   71   999   0   0    0    46  3528  7051  3  1 96  0  0
 0  0    0  2306   71   999   0   0    0    40  4363  9302  3  3 94  0  0
 0  0    0  2306   71   999   0   0    0    33  2537  5363  2  1 97  0  0
 2  0    0  2306   71   999   0   0    0    35  4215  8648  3  3 94  0  0
 0  0    0  2306   71   999   0   0    0     6  3303  6824  3  2 95  0  0
 1  0    0  2306   71   999   0   0    0     3  2122  4117  2  1 97  0  0
 1  0    0  2305   71   999   0   0    0     0  5339 11477  4  3 93  0  0
 0  0    0  2304   71   999   0   0    0     0  2193  4326  1  1 98  0  0
 0  0    0  2304   71   999   0   0    0     0  5424 11640  5  4 92  0  0
 4  0    0  2304   71   999   0   0    0     0  1859  3685  2  1 98  0  0
 2  0    0  2302   71   999   0   0    0     0  5260 11252  4  4 92  0  0
 3  0    0  2302   71   999   0   0    0     0  2044  4116  1  1 98  0  0
 0  0    0  2301   71   999   0   0    0     0  4949 11084  4  3 93  0  0
 2  0    0  2301   71   999   0   0    0     0  2252  4349  2  1 98  0  0
 0  0    0  2299   71   999   0   0    0     0  5329 11410  4  3 92  0  0
 2  0    0  2299   71   999   0   0    0     0  2077  4186  2  1 97  0  0
root@node0:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           5945        2577        2296          20        1071        3297
Swap:             0           0           0
root@node0:~# uptime
 23:41:26 up 20 min,  1 user,  load average: 0.48, 0.75, 0.53
```

After a while, I went down to 150 replicas, which were running fine. I went back to 160 replicas, which were ok, then 170, and reached the limit again, this time at 163 replicas / 165 pods (+ 7 pods stuck in ContainerCreating).

I removed the deployment, rebooted the node and started the deployment again. I was blocked at 164 replicas (out of 170) + 2 other pods.

So there seems to be a hard limit around 164 - 165 pods per machine with the latest runc.

The master server was kept on the older version of runc for all the tests; only the version on the node changed. I used Weave for container networking.

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1927219

Title:
  context deadline exceeded: unknown in containerd with latest runc version

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/runc/+bug/1927219/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs