On 16.04, the version of aufs looks correct, so maybe it's a different bug. Did you also try with Docker 1.12 on 16.04?

Looking at the other bug, there is a Docker container that can help diagnose whether you are encountering the aufs issue:

docker run -it --rm akihirosuda/test18180

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1633223

Title:
  rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and docker

Status in linux package in Ubuntu:
  New

Bug description:

---Problem Description---
Seeing occasional rcu_sched detected stalls on 14.04 LTS with kernel 3.19.0-58. The system is running docker containers, and has the NVIDIA GPU driver loaded. We've seen about 4 stalls in the last month, all with the 3.19.0-58 kernel, and with the NVIDIA 352.93 and 361.49 drivers.

---uname output---
Linux dldev1 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:01 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
2 x NVIDIA K80 GPU adapter:

$ lspci | grep NV
0002:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0002:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0006:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0006:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Machine Type = 8247-42L

---System Hang---
Usual symptom is that the system is unresponsive except maybe for ping and writing the stall-detection messages to the console. Login/getty isn't available either via ssh or on the console. System must be power cycled to recover.

Attached is the kernel log from a stall detection on May 18th. The detection first occurs at: May 18 15:17:55. The system is later rebooted and those messages indicate the kernel (3.19.0-58) and NVIDIA driver version (352.93) that were active at the time.

We've suffered 3 or 4 stalls since, all with the same kernel, but some with a newer NVIDIA driver (361.49). Unfortunately, information about the newer stalls wasn't preserved in the various log files (and we're not capturing the console constantly), so we don't have detailed data for those.
We'd welcome any suggestions for how to collect additional data for these occurrences.

I can't say for sure that we haven't seen the stalls on other systems, but they're occurring fairly frequently on this system, and it's unusual in that it's running both Docker and the NVIDIA GPU driver. So maybe aufs or the NVIDIA driver are somehow involved.

From the kern.log, the call trace points to some kind of deadlock in aufs:

May 18 15:17:55 dldev1 kernel: [713670.798624] Task dump for CPU 3:
May 18 15:17:55 dldev1 kernel: [713670.798628] cc1 R running task 0 99183 99173 0x00040004
May 18 15:17:55 dldev1 kernel: [713670.798633] Call Trace:
May 18 15:17:55 dldev1 kernel: [713670.798643] [c000000fa64673a0] [c0000000000cf004] wake_up_worker+0x44/0x60 (unreliable)
May 18 15:17:55 dldev1 kernel: [713670.798671] [c000000fa6467570] [c000000fa64675d0] 0xc000000fa64675d0
May 18 15:17:55 dldev1 kernel: [713670.798676] [c000000fa64675d0] [c000000000a1b050] __schedule+0x370/0x900
May 18 15:17:55 dldev1 kernel: [713670.798679] [c000000fa64677f0] [c000000fa6467850] 0xc000000fa6467850
May 18 15:17:55 dldev1 kernel: [713670.798682] Task dump for CPU 75:
May 18 15:17:55 dldev1 kernel: [713670.798684] cc1 D 00001000005d9410 0 99427 99405 0x00040004
May 18 15:17:55 dldev1 kernel: [713670.798688] Call Trace:
May 18 15:17:55 dldev1 kernel: [713670.798691] [c0000017efdd3460] [c0000017efdd34a0] 0xc0000017efdd34a0 (unreliable)
May 18 15:17:55 dldev1 kernel: [713670.798695] [c0000017efdd3630] [c0000017efdd3690] 0xc0000017efdd3690
May 18 15:17:55 dldev1 kernel: [713670.798698] [c0000017efdd3690] [c000000000a1b050] __schedule+0x370/0x900
May 18 15:17:55 dldev1 kernel: [713670.798702] [c0000017efdd38b0] [c000000000a1f128] rwsem_down_write_failed+0x288/0x400
May 18 15:17:55 dldev1 kernel: [713670.798706] [c0000017efdd3940] [c000000000a1e538] down_write+0x88/0x90
May 18 15:17:55 dldev1 kernel: [713670.798716] [c0000017efdd3970] [d00000001ead562c] do_ii_write_lock+0x8c/0xd0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798724] [c0000017efdd39a0] [d00000001eac0e98] aufs_read_lock+0xb8/0xd0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798733] [c0000017efdd39e0] [d00000001ead8208] aufs_d_revalidate+0x98/0x7a0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798737] [c0000017efdd3aa0] [c0000000002c88f8] lookup_fast+0x368/0x3b0
May 18 15:17:55 dldev1 kernel: [713670.798740] [c0000017efdd3b10] [c0000000002cb620] path_lookupat+0x180/0x970
May 18 15:17:55 dldev1 kernel: [713670.798743] [c0000017efdd3be0] [c0000000002cbe68] filename_lookup+0x58/0x140
May 18 15:17:55 dldev1 kernel: [713670.798746] [c0000017efdd3c30] [c0000000002cde04] user_path_at_empty+0x84/0xe0
May 18 15:17:55 dldev1 kernel: [713670.798749] [c0000017efdd3d20] [c0000000002be744] vfs_fstatat+0x84/0x140
May 18 15:17:55 dldev1 kernel: [713670.798753] [c0000017efdd3d80] [c0000000002bee14] SyS_newlstat+0x34/0x60
May 18 15:17:55 dldev1 kernel: [713670.798757] [c0000017efdd3e30] [c000000000009258] system_call+0x38/0xd0

Other info from the log:

May 18 16:00:00 dldev1 kernel: [ 11.802963] nvidia 0002:03:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803114] nvidia 0002:04:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803221] nvidia 0006:03:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803349] nvidia 0006:04:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803598] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0002:03:00.0 on minor 0
May 18 16:00:00 dldev1 kernel: [ 11.803654] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0002:04:00.0 on minor 1
May 18 16:00:00 dldev1 kernel: [ 11.803700] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0006:03:00.0 on minor 2
May 18 16:00:00 dldev1 kernel: [ 11.803742] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0006:04:00.0 on minor 3
May 18 16:00:00 dldev1 kernel: [ 11.803749] NVRM: loading NVIDIA UNIX ppc64le Kernel Module 352.93 Tue Apr 5 17:31:42 PDT 2016
....
May 18 16:00:01 dldev1 kernel: [ 278.416678] aufs 3.x-rcN-20150105
May 18 16:00:01 dldev1 kernel: [ 278.952335] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
May 18 16:00:01 dldev1 kernel: [ 278.953351] Bridge firewalling registered
May 18 16:00:01 dldev1 kernel: [ 278.957164] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
May 18 16:00:01 dldev1 kernel: [ 278.997765] ip_tables: (C) 2000-2006 Netfilter Core Team
May 18 16:00:01 dldev1 kernel: [ 279.013814] nvidia 0002:03:00.0: Using 64-bit DMA iommu bypass
May 18 16:00:03 dldev1 kernel: [ 280.480596] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
May 18 16:00:04 dldev1 kernel: [ 282.035907] nvidia 0002:04:00.0: Using 64-bit DMA iommu bypass
May 18 16:00:07 dldev1 kernel: [ 285.065144] nvidia 0006:03:00.0: Using 64-bit DMA iommu bypass
May 18 16:00:10 dldev1 kernel: [ 288.150339] nvidia 0006:04:00.0: Using 64-bit DMA iommu bypass
...
May 18 16:01:55 dldev1 kernel: [ 393.265785] aufs au_opts_verify:1612:docker[3444]: dirperm1 breaks the protection by the permission bits on the lower branch
May 18 16:01:56 dldev1 kernel: [ 393.328552] aufs au_opts_verify:1612:docker[3444]: dirperm1 breaks the protection by the permission bits on the lower branch
May 18 16:01:56 dldev1 kernel: [ 393.366023] device veth0beb24c entered promiscuous mode
May 18 16:01:56 dldev1 kernel: [ 393.367228] IPv6: ADDRCONF(NETDEV_UP): veth0beb24c: link is not ready
May 18 16:01:56 dldev1 kernel: [ 393.367235] docker0: port 1(veth0beb24c) entered forwarding state
May 18 16:01:56 dldev1 kernel: [ 393.367314] docker0: port 1(veth0beb24c) entered forwarding state
May 18 16:01:56 dldev1 kernel: [ 393.367825] docker0: port 1(veth0beb24c) entered disabled state
May 18 16:01:56 dldev1 kernel: [ 393.526038] docker0: port 1(veth0beb24c) entered disabled state
May 18 16:01:56 dldev1 kernel: [ 393.530143] device veth0beb24c left promiscuous mode
May 18 16:01:56 dldev1 kernel: [ 393.530148] docker0: port 1(veth0beb24c) entered disabled state

> Hi Breno,
>
> As per Ubuntu support, all new fixes will target 16.04 and Canonical will
> not automatically fix 14.04. If one needs a fix in 14.04, it has to go
> through as a special request to Canonical. So should we check if this
> problem happens on Ubuntu 16.04?

That is correct. Ubuntu 14.04 fixes will be SRUs, i.e., they need to be fixed in 16.10 first and then backported to 16.04 and 14.04.

Kernel 4.4 for Ubuntu 16.04.5 is already in proposed. Does it contain the bug also? I understand that kernel 3.19 is already EOL.

The requested docker info; we're using the docker packages provided by Canonical:

$ dpkg -l | egrep 'docker|aufs'
ii aufs-tools 1:3.2+20130722-1.1 ppc64el Tools to manage aufs filesystems
ii docker.io 1.9.1~git20151210-18bfacb ppc64el Linux container runtime

$ docker version
Client:
 Version: 1.9.1
 API version: 1.21
 Go version: go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-Toolchain-) [ibm/gcc-5-branch, revision: 229493 merged from gcc-5-branch, revision 228917]
 Git commit: 18bfacb
 Built: Thu Dec 17 19:19:02 UTC 2015
 OS/Arch: linux/ppc64le

Server:
 Version: 1.9.1
 API version: 1.21
 Go version: go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-Toolchain-) [ibm/gcc-5-branch, revision: 229493 merged from gcc-5-branch, revision 228917]
 Git commit: 18bfacb
 Built: Thu Dec 17 19:19:02 UTC 2015
 OS/Arch: linux/ppc64le

$ docker info
Containers: 8
Images: 27
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 43
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.0-58-generic
Operating System: Ubuntu 14.04.4 LTS
CPUs: 192
Total Memory: 127.5 GiB
Name: dldev1
ID: NKEF:PLNM:HHXN:3DAD:3CLU:Z2GP:AZLR:MZRO:ADYA:5GOV:YZRN:DWOS
WARNING: No swap limit support

I had said: "we're using the docker packages provided by Canonical". That's not true.
The 'docker.io' package we're using is from:

deb http://ftp.unicamp.br/pub/ppc64el/ubuntu/14_04/docker-ppc64el/ trusty main

The aufs.ko does come from Canonical:

$ dpkg -S aufs.ko
linux-image-extra-3.19.0-58-generic: /lib/modules/3.19.0-58-generic/kernel/ubuntu/aufs/aufs.ko

Another occurrence this afternoon, again while building a software project in a Docker container. Yesterday was building 'Caffe', which is mostly C or C++. Today was building 'TensorFlow' which includes a lot of Java.

Still seeing this hang on Ubuntu 16.04 with 4.4.0-38 kernel. Similar scenario--building TensorFlow in a container often causes the problem. I assume because the build is creating lots of filesystem artifacts. Seems that running 'sync' every few seconds during the build makes the problem less likely to strike.

Will attach kernel log from an instance on 16.04 / 4.4.0-38 from Oct 7th.
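
The "run 'sync' every few seconds during the build" workaround mentioned in the report could be scripted roughly as follows. This is only a sketch under assumptions: the helper name run_with_periodic_sync, its parameters, and the 5-second default are illustrative and do not come from the report, and nothing here is known to prevent the hang; it only automates the periodic sync.

```python
import os
import subprocess
import threading


def run_with_periodic_sync(cmd, interval=5.0, sync_fn=os.sync):
    """Run cmd and call sync_fn every `interval` seconds until it exits.

    Hypothetical helper (not from the bug report). os.sync() flushes
    filesystem buffers and is available on Unix with Python 3.3+.
    Returns (exit_code, number_of_sync_calls).
    """
    stop = threading.Event()
    calls = 0

    def syncer():
        nonlocal calls
        # Event.wait() returns False on timeout and True once stop is set,
        # so the loop syncs on every timeout and ends when the build ends.
        while not stop.wait(interval):
            sync_fn()
            calls += 1

    worker = threading.Thread(target=syncer, daemon=True)
    worker.start()
    try:
        exit_code = subprocess.call(cmd)
    finally:
        stop.set()
        worker.join()
    return exit_code, calls
```

For example, run_with_periodic_sync(["make", "-j8"]) would wrap a build command (the command shown is a placeholder) and return its exit code together with the number of sync calls made.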