Hi Hans,

Thanks for the pointer to the synthetic reproducer!
It provided accurate and consistent results across the kernel versions reported (not) to exhibit the issue. The Azure test kernel with the 3 patches [1] to address it shows the same (good) results as the Azure kernel prior to the regression being introduced.

P.S.: the issue isn't strictly having that one patch in, as it is included in later kernel versions without this issue (e.g., 5.15); it's having that patch in while not also having these other patches in (as, e.g., 5.15 does).

[1] https://lists.ubuntu.com/archives/kernel-team/2022-November/135069.html

...

Test Results from 4x VMs on Azure (2x 4vCPU/16G and 2x 8vCPU/32G)

Test Steps follow below; essentially, run the for-loop with curl 10x,
and count how many times it doesn't finish / is stuck
(i.e., the epoll wait didn't return/finish).

1) original/"good" kernel: 0% error rate
-- 5.4.0-1094-azure #100-Ubuntu SMP Mon Oct 17 03:14:36 UTC 2022

VM1: 0/10
VM2: 0/10
VM3: 0/10
VM4: 0/10

2) regression/"bad" kernel: 60%-80% error rate
-- 5.4.0-1095-azure #101-Ubuntu SMP Thu Oct 20 15:50:47 UTC 2022

VM1: 8/10
VM2: 7/10
VM3: 7/10
VM4: 6/10

3) candidate/"test" kernel: 0% error rate
-- 5.4.0-1098-azure #104-Ubuntu SMP Wed Nov 23 21:19:57 UTC 2022

VM1: 0/10
VM2: 0/10
VM3: 0/10
VM4: 0/10

...

Test Steps/Criteria on Focal:

Install go 1.19:

$ sudo snap install --channel=1.19/stable --classic go

Create the test programs:

$ cat <<EOF >main.go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/procfs/sysfs"
)

func main() {
	// Open the sysfs mount and enumerate the net class devices once at startup.
	fs, err := sysfs.NewFS("/sys")
	if err != nil {
		panic(err)
	}
	netDevices, err := fs.NetClassDevices()
	if err != nil {
		panic(err)
	}

	// Each HTTP request re-reads the sysfs attributes of every net device
	// before replying.
	http.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
		for _, device := range netDevices {
			_, err := fs.NetClassByIface(device)
			if err != nil {
				panic(err)
			}
		}
		fmt.Printf(". ")
		writer.WriteHeader(200)
		writer.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":9100", nil))
}
EOF

$ cat <<EOF >go.mod
module app

go 1.19

require (
	github.com/prometheus/procfs v0.8.0 // indirect
	golang.org/x/sync v0.0.0-20220601150217-0de741cfad7f // indirect
)
EOF

Fetch the test deps:

$ go mod download github.com/prometheus/procfs
$ go get github.com/prometheus/procfs/sysfs@v0.8.0

Start the test program:

$ go run main.go &

Exercise it:

$ for i in {0..10000} ; do curl localhost:9100/metrics ; done

Test Criteria:
PASS = for-loop finishes.
FAIL = for-loop doesn't finish.
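(For the record, a rough sketch of how a single pass/fail run can be wrapped, so the 10x counting above is easy to script. The coreutils `timeout` and the 10-minute budget per run are my own choices, not part of the original criteria, and the test program is presumably restarted between runs.)

$ cat <<'EOF' >check-run.sh
#!/bin/bash
# One pass/fail run of the reproducer's curl loop:
# PASS if the loop finishes, FAIL if it is still stuck when the budget expires.
if timeout 600 bash -c \
    'for i in {0..10000} ; do curl -s localhost:9100/metrics >/dev/null ; done'
then
    echo PASS
else
    echo FAIL
fi
EOF
$ bash check-run.sh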
# reference: https://github.com/prometheus/node_exporter/issues/2500#issuecomment-1304847221

Stack traces on FAIL / for-loop not finished:

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/33267/task/33287/stack:[<0>] ep_poll+0x3bb/0x410
/proc/33267/task/33287/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/33267/task/33287/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/33267/task/33287/stack:[<0>] do_syscall_64+0x5e/0x200
/proc/33267/task/33287/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/1193/task/1193/stack:[<0>] ep_poll+0x3bb/0x410
/proc/1193/task/1193/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/1193/task/1193/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/1193/task/1193/stack:[<0>] do_syscall_64+0x5e/0x200
/proc/1193/task/1193/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/1173/task/1193/stack:[<0>] ep_poll+0x3bb/0x410
/proc/1173/task/1193/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/1173/task/1193/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/1173/task/1193/stack:[<0>] do_syscall_64+0x5e/0x200
/proc/1173/task/1193/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
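(Side note, in case it helps correlate with the kernel stacks above: with the default GOTRACEBACK setting the Go runtime dumps all goroutine stacks to stderr on SIGQUIT. A minimal sketch, assuming the reproducer is built into a standalone binary, here named "app", instead of being launched via "go run", so the signal is delivered to the program itself; this terminates it, so it's a last step.)

$ # build/run the reproducer as a standalone binary (hypothetical name "app")
$ go build -o app . && ./app &
$ # on a FAIL, capture the user-space goroutine stacks alongside the kernel stacks
$ kill -QUIT $(pidof app)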
https://bugs.launchpad.net/bugs/1996678

Title:
  containerd sporadic timeouts

Status in containerd package in Ubuntu:
  Confirmed
Status in linux package in Ubuntu:
  Invalid
Status in containerd source package in Focal:
  New
Status in linux source package in Focal:
  In Progress

Bug description:
  This morning I routinely applied security upgrades on a number of
  machines. containerd was upgraded from `1.5.9-0ubuntu1~20.04.4` to
  `1.5.9-0ubuntu1~20.04.5` for me.

  What happened next: at some random time, on machines with the new
  containerd, something happens with containerd tasks and/or cgroups.
  This is how it's seen in syslog:

  containerd[710]: time="2022-11-16T03:21:21.004153517Z" level=error msg="get state for 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a" error="context deadline exceeded: unknown"

  And some ctr commands:

  # ctr --namespace k8s.io task ls|grep 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a
  2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a    0    UNKNOWN

  See that the status of the task is UNKNOWN (!!!)

  # ctr --namespace k8s.io container ls|grep 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a
  2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a    k8s.gcr.io/pause:3.5    io.containerd.runc.v2

  Cgroups:
  ├─kubepods-besteffort-pod3fdc0061_bbf6_47d1_97f5_b1f271b46e23.slice
  │ ├─cri-containerd-5f78e0cb957de97fd8465cc42c842bdd764d981ca7a903a2515bbc6bb06796a9.scope …
  │ │ └─2677 /csi-node-driver-registrar --v=0 --csi-address=/csi/csi.sock --k…
  │ ├─cri-containerd-af070f16c1f0ff22eb16661e787e85db3810727909abd23d69a6a43578c1dced.scope …
  │ │ └─3264 /usr/local/bin/cephcsi --type=liveness --endpoint=unix:///csi/cs…
  │ ├─cri-containerd-4e063ef0c8f768dbf34cf7a179bca5cc98a04fa7e00b29d20c17d3031d409f86.scope …
  │ │ └─2960 /usr/local/bin/cephcsi --nodeid=nas-decent-bobcat.dev-k8s-1.hq.w…
  │ └─cri-containerd-2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a.scope …
  │   └─2414 /pause

  # ps auxf|grep 2414 -B 2
  root    2279  0.1  0.0 114100  4956 ?  Sl  Nov15  0:42 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a -address /run/containerd/containerd.sock
  65535   2414  0.0  0.0    964     4 ?  Ss  Nov15  0:00  \_ /pause

  It doesn't happen immediately, but after some random time; sometimes
  several minutes, sometimes around an hour. Nonetheless, all machines
  with the new package get into this weird state. As soon as I revert
  the package, everything returns to running as expected.