Hi Hans,

Thanks for the pointer to the synthetic reproducer!
It provided accurate and consistent results across the kernel versions reported (not) to exhibit the issue. The Azure test kernel with the 3 patches [1] to address it shows the same (good) results as the Azure kernel prior to the regression being introduced.

P.S.: the issue isn't strictly having that one patch in, as it is included in later kernel versions without this issue (e.g., 5.15); it's having that patch in while not also having these other patches in (as, e.g., 5.15 does).

[1] https://lists.ubuntu.com/archives/kernel-team/2022-November/135069.html

...

Test Results from 4x VMs on Azure (2x 4vCPU/16G and 2x 8vCPU/32G)

Test Steps follow below; essentially, run the for-loop with curl 10x,
and count how many times it doesn't finish / is stuck
(i.e., the epoll wait didn't return/finish).

1) original/"good" kernel: 0% error rate
-- 5.4.0-1094-azure #100-Ubuntu SMP Mon Oct 17 03:14:36 UTC 2022

VM1: 0/10
VM2: 0/10
VM3: 0/10
VM4: 0/10

2) regression/"bad" kernel: 60%-80% error rate
-- 5.4.0-1095-azure #101-Ubuntu SMP Thu Oct 20 15:50:47 UTC 2022

VM1: 8/10
VM2: 7/10
VM3: 7/10
VM4: 6/10

3) candidate/"test" kernel: 0% error rate
-- 5.4.0-1098-azure #104-Ubuntu SMP Wed Nov 23 21:19:57 UTC 2022

VM1: 0/10
VM2: 0/10
VM3: 0/10
VM4: 0/10

...

Test Steps/Criteria on Focal:

Install go 1.19:

$ sudo snap install --channel=1.19/stable --classic go

Create the test programs:

$ cat <<EOF >main.go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/procfs/sysfs"
)

func main() {
	// Open the sysfs mount and enumerate the net class devices once at startup.
	fs, err := sysfs.NewFS("/sys")
	if err != nil {
		panic(err)
	}
	netDevices, err := fs.NetClassDevices()
	if err != nil {
		panic(err)
	}

	// Each HTTP request re-reads the sysfs attributes of every net device
	// before replying.
	http.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
		for _, device := range netDevices {
			_, err := fs.NetClassByIface(device)
			if err != nil {
				panic(err)
			}
		}
		fmt.Printf(". ")
		writer.WriteHeader(200)
		writer.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":9100", nil))
}
EOF

$ cat <<EOF >go.mod
module app

go 1.19

require (
	github.com/prometheus/procfs v0.8.0 // indirect
	golang.org/x/sync v0.0.0-20220601150217-0de741cfad7f // indirect
)
EOF

Fetch the test deps:

$ go mod download github.com/prometheus/procfs
$ go get github.com/prometheus/procfs/sysfs@v0.8.0

Start the test program:

$ go run main.go &

Exercise it:

$ for i in {0..10000} ; do curl localhost:9100/metrics ; done

Test Criteria:
PASS = for-loop finishes.
FAIL = for-loop doesn't finish.
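(For the record, a rough sketch of how a single pass/fail run can be wrapped, so the 10x counting above is easy to script. The coreutils `timeout` and the 10-minute budget per run are my own choices, not part of the original criteria, and the test program is presumably restarted between runs.)

$ cat <<'EOF' >check-run.sh
#!/bin/bash
# One pass/fail run of the reproducer's curl loop:
# PASS if the loop finishes, FAIL if it is still stuck when the budget expires.
if timeout 600 bash -c \
    'for i in {0..10000} ; do curl -s localhost:9100/metrics >/dev/null ; done'
then
    echo PASS
else
    echo FAIL
fi
EOF
$ bash check-run.sh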
# reference: https://github.com/prometheus/node_exporter/issues/2500#issuecomment-1304847221

Stack traces on FAIL / for-loop not finished:

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/33267/task/33287/stack:[<0>] ep_poll+0x3bb/0x410
/proc/33267/task/33287/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/33267/task/33287/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/33267/task/33287/stack:[<0>] do_syscall_64+0x5e/0x200
/proc/33267/task/33287/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/1193/task/1193/stack:[<0>] ep_poll+0x3bb/0x410
/proc/1193/task/1193/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/1193/task/1193/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/1193/task/1193/stack:[<0>] do_syscall_64+0x5e/0x200
/proc/1193/task/1193/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/1173/task/1193/stack:[<0>] ep_poll+0x3bb/0x410
/proc/1173/task/1193/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/1173/task/1193/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/1173/task/1193/stack:[<0>] do_syscall_64+0x5e/0x200
/proc/1173/task/1193/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
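(Side note, in case it helps correlate with the kernel stacks above: with the default GOTRACEBACK setting the Go runtime dumps all goroutine stacks to stderr on SIGQUIT. A minimal sketch, assuming the reproducer is built into a standalone binary, here named "app", instead of being launched via "go run", so the signal is delivered to the program itself; this terminates it, so it's a last step.)

$ # build/run the reproducer as a standalone binary (hypothetical name "app")
$ go build -o app . && ./app &
$ # on a FAIL, capture the user-space goroutine stacks alongside the kernel stacks
$ kill -QUIT $(pidof app)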
https://bugs.launchpad.net/bugs/1996678

Title:
  containerd sporadic timeouts

Status in containerd package in Ubuntu:
  Confirmed
Status in linux package in Ubuntu:
  Invalid
Status in containerd source package in Focal:
  New
Status in linux source package in Focal:
  In Progress

Bug description:
  This morning I routinely applied security upgrades on a number of
  machines. containerd was upgraded from `1.5.9-0ubuntu1~20.04.4` to
  `1.5.9-0ubuntu1~20.04.5` for me.

  What happened next: at some random time, on machines with the new
  containerd, something happens with containerd tasks and/or cgroups.
  This is how it's seen in syslog:

  containerd[710]: time="2022-11-16T03:21:21.004153517Z" level=error msg="get state for 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a" error="context deadline exceeded: unknown"

  And some ctr commands:

  # ctr --namespace k8s.io task ls|grep 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a
  2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a    0    UNKNOWN

  See that the status of the task is UNKNOWN (!!!)

  # ctr --namespace k8s.io container ls|grep 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a
  2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a    k8s.gcr.io/pause:3.5    io.containerd.runc.v2

  Cgroups:
  ├─kubepods-besteffort-pod3fdc0061_bbf6_47d1_97f5_b1f271b46e23.slice
  │ ├─cri-containerd-5f78e0cb957de97fd8465cc42c842bdd764d981ca7a903a2515bbc6bb06796a9.scope …
  │ │ └─2677 /csi-node-driver-registrar --v=0 --csi-address=/csi/csi.sock --k…
  │ ├─cri-containerd-af070f16c1f0ff22eb16661e787e85db3810727909abd23d69a6a43578c1dced.scope …
  │ │ └─3264 /usr/local/bin/cephcsi --type=liveness --endpoint=unix:///csi/cs…
  │ ├─cri-containerd-4e063ef0c8f768dbf34cf7a179bca5cc98a04fa7e00b29d20c17d3031d409f86.scope …
  │ │ └─2960 /usr/local/bin/cephcsi --nodeid=nas-decent-bobcat.dev-k8s-1.hq.w…
  │ └─cri-containerd-2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a.scope …
  │   └─2414 /pause

  # ps auxf|grep 2414 -B 2
  root    2279  0.1  0.0 114100  4956 ?  Sl  Nov15  0:42 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a -address /run/containerd/containerd.sock
  65535   2414  0.0  0.0    964     4 ?  Ss  Nov15  0:00  \_ /pause

  It doesn't happen immediately, but after some random time; sometimes
  several minutes, sometimes around an hour. Nonetheless, all machines
  with the new package get into this weird state. As soon as I revert
  the package, everything returns to running as expected.