Hi Hans,

Thanks for the pointer to the synthetic reproducer!

It produced accurate and consistent results across the kernel
versions reported (not) to exhibit the issue.

The Azure test kernel with the 3 patches [1] that address the
issue shows the same (good) results as the Azure kernel from
before the regression was introduced.

P.S.: the issue isn't strictly having that patch in, as it's
included in later kernel versions (e.g., 5.15) that don't exhibit
the problem; it's having that patch in without the other patches
that those kernels (e.g., 5.15) also carry.

[1] https://lists.ubuntu.com/archives/kernel-team/2022-November/135069.html

...

Test Results from 4x VMs on Azure (2x 4vCPU/16G and 2x 8vCPU/32G)

Test Steps follow below; essentially, run the curl for-loop 10
times, and count how many runs don't finish / get stuck
(i.e., epoll_wait didn't return).

1) original/"good" kernel: 0% error rate
-- 5.4.0-1094-azure #100-Ubuntu SMP Mon Oct 17 03:14:36 UTC 2022

VM1: 0/10
VM2: 0/10
VM3: 0/10
VM4: 0/10

2) regression/"bad" kernel: 60%-80% error rate
-- 5.4.0-1095-azure #101-Ubuntu SMP Thu Oct 20 15:50:47 UTC 2022

VM1: 8/10
VM2: 7/10
VM3: 7/10
VM4: 6/10

3) candidate/"test" kernel: 0% error rate
-- 5.4.0-1098-azure #104-Ubuntu SMP Wed Nov 23 21:19:57 UTC 2022

VM1: 0/10
VM2: 0/10
VM3: 0/10
VM4: 0/10

...

Test Steps/Criteria on Focal:

Install go 1.19:

$ sudo snap install --channel=1.19/stable --classic go

Create the test program and module file:

$ cat <<EOF >main.go
package main

import (
        "fmt"
        "log"
        "net/http"

        "github.com/prometheus/procfs/sysfs"
)

func main() {
        fs, err := sysfs.NewFS("/sys")
        if err != nil {
                panic(err)
        }

        netDevices, err := fs.NetClassDevices()
        if err != nil {
                panic(err)
        }

        // On every request, read the sysfs attributes of each network
        // device (as node_exporter's netclass collector does), then
        // reply with "ok".
        http.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
                for _, device := range netDevices {
                        _, err := fs.NetClassByIface(device)
                        if err != nil {
                                panic(err)
                        }
                }
                fmt.Printf(". ") // progress marker: one dot per served request
                writer.WriteHeader(200)
                writer.Write([]byte("ok"))
        })
        log.Fatal(http.ListenAndServe(":9100", nil))
}
EOF

$ cat <<EOF >go.mod
module app

go 1.19

require (
        github.com/prometheus/procfs v0.8.0 // indirect
        golang.org/x/sync v0.0.0-20220601150217-0de741cfad7f // indirect
)
EOF

Fetch test deps:

$ go mod download github.com/prometheus/procfs
$ go get github.com/prometheus/procfs/sysfs@v0.8.0

Start test program:

$ go run main.go &
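
An optional single request can confirm the server is up before
starting the loop:

$ curl -s localhost:9100/metrics ; echo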

Exercise it:

$ for i in {0..10000} ; do curl localhost:9100/metrics ; done

Test Criteria:

PASS = for-loop finishes.
FAIL = for-loop doesn't finish.
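
For reference, one way to script the 10 runs is to bound each loop with a
timeout and count the runs that hit it (a sketch only, not what I ran; the
120-second limit is an arbitrary assumption, anything well above a normal
run's duration works, and the test program needs a restart between runs
once it gets stuck):

$ fail=0
$ for run in {1..10} ; do
>         timeout 120 bash -c \
>                 'for i in {0..10000} ; do curl -s localhost:9100/metrics >/dev/null ; done' \
>                 || fail=$((fail+1))
> done
$ echo "$fail/10 runs stuck"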

# reference:
https://github.com/prometheus/node_exporter/issues/2500#issuecomment-1304847221

Stack traces on FAIL / for-loop not finished:

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/33267/task/33287/stack:[<0>] ep_poll+0x3bb/0x410
/proc/33267/task/33287/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/33267/task/33287/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/33267/task/33287/stack:[<0>] do_syscall_64+0x5e/0x200
/proc/33267/task/33287/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/1193/task/1193/stack:[<0>] ep_poll+0x3bb/0x410
/proc/1193/task/1193/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/1193/task/1193/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/1193/task/1193/stack:[<0>] do_syscall_64+0x5e/0x200
/proc/1193/task/1193/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/1173/task/1193/stack:[<0>] ep_poll+0x3bb/0x410
/proc/1173/task/1193/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/1173/task/1193/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/1173/task/1193/stack:[<0>] do_syscall_64+0x5e/0x200
/proc/1173/task/1193/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
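
A lighter-weight variant of the same check is to read each thread's wait
channel, which should name ep_poll for the stuck thread (assuming
/proc/<pid>/task/*/wchan reports symbol names on the kernel in use):

$ sudo grep -H . /proc/$(pidof go)/task/*/wchan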

https://bugs.launchpad.net/bugs/1996678

Title:
  containerd sporadic timeouts

Status in containerd package in Ubuntu:
  Confirmed
Status in linux package in Ubuntu:
  Invalid
Status in containerd source package in Focal:
  New
Status in linux source package in Focal:
  In Progress

Bug description:
  This morning I routinely applied security upgrades on a number of
  machines.

  Containerd was upgraded from `1.5.9-0ubuntu1~20.04.4` to
  `1.5.9-0ubuntu1~20.04.5` for me.

  What happened next:

  At some random time, on machines with the new containerd, something
  happens with containerd tasks and/or cgroups.

  This is how it's seen in syslog:

  containerd[710]: time="2022-11-16T03:21:21.004153517Z" level=error msg="get state for 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a" error="context deadline exceeded: unknown"

  
  And some ctr commands:

  # ctr --namespace k8s.io task ls|grep 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a
  2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a    0        UNKNOWN

  Note that the status of the task is UNKNOWN (!!!)

  # ctr --namespace k8s.io container ls|grep 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a
  2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a    k8s.gcr.io/pause:3.5    io.containerd.runc.v2

  Cgroups:

      ├─kubepods-besteffort-pod3fdc0061_bbf6_47d1_97f5_b1f271b46e23.slice
      │ ├─cri-containerd-5f78e0cb957de97fd8465cc42c842bdd764d981ca7a903a2515bbc6bb06796a9.scope …
      │ │ └─2677 /csi-node-driver-registrar --v=0 --csi-address=/csi/csi.sock --k…
      │ ├─cri-containerd-af070f16c1f0ff22eb16661e787e85db3810727909abd23d69a6a43578c1dced.scope …
      │ │ └─3264 /usr/local/bin/cephcsi --type=liveness --endpoint=unix:///csi/cs…
      │ ├─cri-containerd-4e063ef0c8f768dbf34cf7a179bca5cc98a04fa7e00b29d20c17d3031d409f86.scope …
      │ │ └─2960 /usr/local/bin/cephcsi --nodeid=nas-decent-bobcat.dev-k8s-1.hq.w…
      │ └─cri-containerd-2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a.scope …
      │   └─2414 /pause


  # ps auxf|grep 2414 -B 2
  root        2279  0.1  0.0 114100  4956 ?        Sl   Nov15   0:42 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a -address /run/containerd/containerd.sock
  65535       2414  0.0  0.0    964     4 ?        Ss   Nov15   0:00  \_ /pause


  It doesn't happen immediately, but after some random time. Sometimes
  it's several minutes, sometimes around an hour. Nonetheless, all
  machines with the new package get into this weird state.

  As soon as I revert the package, everything returns to running as
  expected.


