Thanks for your answer.

On Mon, Jun 29, 2020 at 9:10 PM Ian Lance Taylor <i...@golang.org> wrote:

> Thanks for the background.
>
> Earlier I suggested looking at the output of "strace -f" for the
> programs that fail.  Does that show anything of interest?
>

What follows is my analysis of one strace run (strace -fv -s 100)
attached to the docker daemon.
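
For completeness, the trace was collected roughly like this; the output
file name and the PID lookup are only illustrative, not the literal
command I used:

  # -f: follow forks, -v: unabbreviated output, -s 100: up to 100 bytes per string
  strace -fv -s 100 -o dockerd.strace -p "$(pidof dockerd)"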

The strace log shows the creation of a chain of processes: dockerd forks
containerd, which forks containerd-shim, which forks runc, which in turn
forks the command that runs inside the container ('/usr/bin/find .').
This is also as expected.

When the I/O of the /usr/bin/find process in the docker container is
blocked, strace shows that the Go schedulers are still active:

(line 298015)
[pid  2266] _newselect(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}
<unfinished ...>
[pid  2264] sched_yield( <unfinished ...>
[pid  2270] swapcontext(0x2190a580, 0 <unfinished ...>
[pid  2264] <... sched_yield resumed>)  = 0
[pid  2270] <... swapcontext resumed>)  = 0
[pid  2264] sched_yield( <unfinished ...>
[pid  2270] swapcontext(0, 0x214dda80 <unfinished ...>
[pid  2264] <... sched_yield resumed>)  = 0
[pid  2270] <... swapcontext resumed>)  = 558750336
[pid  2264] sched_yield( <unfinished ...>
[pid  2270] swapcontext(0, 0x2190a580 <unfinished ...>
[pid  2264] <... sched_yield resumed>)  = 0
[pid  2270] <... swapcontext resumed>)  = 563127680
[pid  2264] sched_yield( <unfinished ...>
[pid  2270] swapcontext(0x2190a580, 0 <unfinished ...>
[pid  2264] <... sched_yield resumed>)  = 0
[pid  2270] <... swapcontext resumed>)  = 0
...
(line 298134)
[pid  2266] <... _newselect resumed>)   = 0 (Timeout)
[pid  2264] <... sched_yield resumed>)  = 0
[pid  2270] <... swapcontext resumed>)  = 563127680
[pid  2266] epoll_wait(4,  <unfinished ...>
[pid  2264] sched_yield( <unfinished ...>
[pid  2270] swapcontext(0x2190a580, 0 <unfinished ...>
[pid  2266] <... epoll_wait resumed>[], 128, 0) = 0
[pid  2264] <... sched_yield resumed>)  = 0
[pid  2270] <... swapcontext resumed>)  = 0
[pid  2266] _newselect(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}
<unfinished ...>
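
To get a rough picture of which thread does what over the whole log, a
one-liner along these lines tallies the system calls per TID (the log
file name is just illustrative):

  grep -v resumed dockerd.strace \
    | awk '{tid=$2; sub(/]/,"",tid); call=$3; sub(/\(.*/,"",call); print tid, call}' \
    | sort | uniq -c | sort -rn | head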

TIDs 2264, 2266 and 2270 belong to the runc process.  The log contains
similar traces for the other processes (dockerd, containerd,
containerd-shim), so I assume their goroutine schedulers were active as
well.  I am wondering how to relate the arguments listed in the strace
file to the Go or C code.
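
My unverified assumption is that the sched_yield calls come from
runtime.osyield and that the _newselect with the 10 ms timeout is
runtime.usleep (i.e. the sysmon thread polling); the corresponding
system call wrappers should be visible in the runtime sources, e.g.:

  # look for the sched_yield / select-based usleep wrappers
  grep -n 'sched_yield\|select\|usleep' "$(go env GOROOT)"/src/runtime/sys_linux_*.s | head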

Just before the container blocks, it runs the command 'find .', which
should produce output on the terminal (but no output appears at first;
that is the problem).  The data is visible in the strace log through the
write() system calls:

[pid  2204] execve("/usr/bin/find", ["/usr/bin/find", "."],
["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=caae2fb3bccf", "TERM=xterm", "HOME=/root"] <unfinished ...>
...
[pid  2204] lstat64(".",  <unfinished ...>
...
[pid  2204] write(1, ".\n", 2 <unfinished ...>
...
[pid  2204] lstat64("./var",  <unfinished ...>
...
[pid  2204] write(1, "./var\n", 6 <unfinished ...>
...

The 'find' process writes 626 lines to stdout (18778 characters; this
seems to be reproducible).  The last lines are:

...
[pid  2204] lstat64("./sys/kernel/slab/rpc_buffers",  <unfinished ...>
...
[pid  2204] write(1, "./sys/kernel/slab/rpc_buffers\n", 30 <unfinished ...>
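
The 626 / 18778 figures come from counting the completed writes of PID
2204 in the log, roughly like this (the exact pattern depends on how
strace interleaves the 'resumed' lines, so take it as a sketch):

  grep -E '\[pid +2204\] <\.\.\. write resumed' dockerd.strace \
    | awk '{n++; sum+=$NF} END {print n " writes, " sum " bytes"}'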

All the write() system calls except the last one complete successfully.
The last one remains blocked, and from that point on PID 2204 hangs for
a long time.
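
To see where that blocked write is going, the open file can be
inspected from the host side; the procfs paths below are standard,
though /proc/<pid>/stack needs root and is not available on every
kernel:

  ls -l /proc/2204/fd/1      # which pipe/pty the blocked stdout points to
  cat /proc/2204/wchan; echo # kernel function the task is sleeping in
  cat /proc/2204/stack       # kernel stack of the blocked task, if enabled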

runc / dockerd / containerd / containerd-shim show continuous activity,
as in the first strace excerpt above.

When I terminate the docker process with SIGINT, the characters written
by PID 2204 are suddenly flushed to the terminal.

Are there any specific things to look for in the strace files (e.g.
specific epoll() calls)?  Is there a way to map the arguments and return
values of swapcontext() to goroutines, or would that be pointless to
try?
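
If it helps, I can also pull the poller and tty related activity out of
the same log with something like this (file name illustrative; I assume
the container was started with a tty, given TERM=xterm above):

  grep -E 'epoll_(create1?|ctl|wait|pwait)' dockerd.strace | head -50
  grep -E '(open|openat)\(.*(ptmx|pts/)' dockerd.strace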

This is the analysis of one trial.  Some of the other trials did not
start the full chain of processes; it looks like the behaviour of the
bug is also timing-dependent.

Hugo



>
> Ian
>
