Default Comment by Bridge ** Attachment added: "console 's log file" https://bugs.launchpad.net/bugs/1632471/+attachment/4759589/+files/small-08152016
** Changed in: ubuntu Assignee: (unassigned) => Taco Screen team (taco-screen-team) ** Package changed: ubuntu => linux (Ubuntu) -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1632471 Title: Briggs: ISST-LTE:UbuntuKVM: system hangs, console shows all of these messages raw_spin_lock_irqsave, rcu_sched, rcu_bh detected stalls on CPUs #166 Status in linux package in Ubuntu: New Bug description: == Comment: #0 - Chanh H. Nguyen <chngu...@us.ibm.com> - 2016-08-15 11:56:22 == We have the SuperMicro system "Briggs" that have 2 Qlogic cards on it (16Gb and 8Gb) connect to the V7000 storage. We install Ubuntu 16.04.1 kernel -34 on it and run test BASE/IO over weekend on the bare metal mode . System seems to hang now, it responses to ping only, nothing else. Open the console, it doesn't give out the login promt except all of these errors print out ....... [323294.686996] INFO: rcu_bh detected stalls on CPUs/tasks: [323294.687105] 3-...: (1 GPs behind) idle=813/140000000000000/0 softirq=9880043/9935887 fqs=48656494 [323294.687187] 71-...: (1 GPs behind) idle=a99/140000000000000/0 softirq=9875483/9935684 fqs=48656494 [323294.687268] 153-...: (1 GPs behind) idle=321/140000000000000/0 softirq=3948104/3976836 fqs=48656494 [323294.687346] (detected by 75, t=48672447 jiffies, g=79, c=78, q=706) [323294.687427] Task dump for CPU 3: [323294.687462] gmond R running task 0 16394 1 0x00040004 [323294.687542] Call Trace: [323294.687568] [c000003fb3bdb700] [c000003fb3bdb740] 0xc000003fb3bdb740 (unreliable) [323294.687648] [c000003fb3bdb8d0] [c000003fec8de900] 0xc000003fec8de900 [323294.687716] [c000003fb3bdb920] [0000000000000020] 0x20 [323294.687773] [c000003fb3bdb960] [c000003fb3bdb9b0] 0xc000003fb3bdb9b0 [323294.687842] [c000003fb3bdb9b0] [c0000000000f9d14] task_sched_runtime+0xc4/0x250 [323294.687922] [c000003fb3bdba20] [c0000000001003d8] thread_group_cputime+0xc8/0x1c0 [323294.688002] [c000003fb3bdba80] [c000000000100720] thread_group_cputime_adjusted+0x40/0x80 [323294.688082] [c000003fb3bdbae0] [c000000000385428] do_task_stat+0x2b8/0xaf0 [323294.688152] [c000003fb3bdbc60] [c00000000037c710] proc_single_show+0x80/0x100 [323294.688232] [c000003fb3bdbcb0] [c000000000317808] seq_read+0xe8/0x530 [323294.688302] [c000003fb3bdbd50] [c0000000002e170c] __vfs_read+0x6c/0xe0 [323294.688412] [c000003fb3bdbd90] [c0000000002e2464] vfs_read+0xa4/0x1c0 [323294.688481] [c000003fb3bdbde0] [c0000000002e356c] SyS_read+0x6c/0x110 [323294.688550] [c000003fb3bdbe30] [c000000000009204] system_call+0x38/0xb4 [323294.688618] Task dump for CPU 71: [323294.688653] fs_inod R running task 0 151173 78575 0x00040004 [323294.688735] Call Trace: [323294.688760] [c000003fbcd17890] [c000005cc0adee00] 0xc000005cc0adee00 (unreliable) [323294.688840] Task dump for CPU 153: [323294.688876] fs_inod D 00003fffafd1efb8 0 151172 79296 0x00040004 [323294.688956] Call Trace: [323294.688981] [c000005a398675e0] [00000000004352ca] 0x4352ca (unreliable) [323294.689049] [c000005a398677b0] [c000003feee5e900] 0xc000003feee5e900 [323294.689118] [c000005a39867800] [c000000000ae8604] __schedule+0x314/0x990 [323294.689187] [c000005a398678d0] [c000000000ae8cc8] schedule+0x48/0xc0 [323294.689257] [c000005a39867900] [c000000000aecc08] rwsem_down_write_failed+0x2b8/0x410 [323294.689337] [c000005a39867990] [c000000000aebdf8] down_write+0x78/0xa0 [323294.689407] [c000005a398679c0] [c0000000002797bc] unlink_file_vma+0x4c/0x90 [323294.689475] [c000005a39867a00] [c000000000270110] free_pgtables+0x100/0x1c0 [323294.689543] [c000005a39867a50] [c00000000027d4e0] exit_mmap+0x100/0x1a0 [323294.689612] [c000005a39867b10] [c0000000000b5098] mmput+0xa8/0x1b0 [323294.689681] [c000005a39867b40] [c0000000002ec740] flush_old_exec+0x680/0x840 [323294.689792] [c000005a39867bb0] [c00000000036412c] load_elf_binary+0x43c/0xf90 [323294.689872] [c000005a39867c90] [c0000000002ecbb8] search_binary_handler+0xe8/0x2a0 [323294.689951] [c000005a39867d20] [c0000000002edad0] do_execveat_common.isra.14+0x600/0x850 [323294.690031] [c000005a39867df0] [c0000000002ee13c] SyS_execve+0x5c/0x80 [323294.690098] [c000005a39867e30] [c000000000009204] system_call+0x38/0xb4 [323294.926994] INFO: rcu_sched self-detected stall on CPU [323294.927058] 113-...: (48720536 ticks this GP) idle=df7/140000000000001/0 softirq=3730603/3730614 fqs=48343855 [323294.927138] (t=48734941 jiffies g=6962468 c=6962467 q=43612779) [323294.927210] Task dump for CPU 71: [323294.927241] fs_inod R running task 0 151173 78575 0x00040004 [323294.927312] Call Trace: [323294.927333] [c000003fbcd17890] [c000005cc0adee00] 0xc000005cc0adee00 (unreliable) [323294.927405] Task dump for CPU 113: [323294.927436] fs_inod R running task 0 46071 148792 0x00042004 [323294.927507] Call Trace: [323294.927528] [c000003fcb0134b0] [c0000000000fbcc0] sched_show_task+0xe0/0x180 (unreliable) [323294.927599] [c000003fcb013520] [c00000000013ec34] rcu_dump_cpu_stacks+0xe4/0x150 [323294.927669] [c000003fcb013570] [c000000000144364] rcu_check_callbacks+0x6b4/0x9b0 [323294.927739] [c000003fcb0136a0] [c00000000014c1c8] update_process_times+0x58/0xa0 [323294.927809] [c000003fcb0136d0] [c0000000001638d8] tick_sched_handle.isra.6+0x48/0xe0 [323294.927880] [c000003fcb013710] [c0000000001639d4] tick_sched_timer+0x64/0xd0 [323294.927940] [c000003fcb013750] [c00000000014cc94] __hrtimer_run_queues+0x124/0x450 [323294.928011] [c000003fcb0137e0] [c00000000014dcbc] hrtimer_interrupt+0xec/0x2c0 [323294.928082] [c000003fcb0138a0] [c00000000001f5bc] __timer_interrupt+0x8c/0x290 [323294.928152] [c000003fcb0138f0] [c00000000001f970] timer_interrupt+0xa0/0xe0 [323294.928253] [c000003fcb013920] [c000000000002714] decrementer_common+0x114/0x180 [323294.928324] --- interrupt: 901 at _raw_spin_lock_irqsave+0xa4/0x130 [323294.928324] LR = _raw_spin_lock_irqsave+0x9c/0x130 [323294.928413] [c000003fcb013c10] [c000007942f936b0] 0xc000007942f936b0 (unreliable) [323294.928484] [c000003fcb013c50] [c000000000124dbc] rwsem_wake+0xcc/0x110 [323294.928545] [c000003fcb013c90] [c000000000121e58] up_write+0x78/0x90 [323294.928604] [c000003fcb013cc0] [c0000000000b6f4c] copy_process.isra.5+0x152c/0x1870 [323294.928675] [c000003fcb013da0] [c0000000000b7468] _do_fork+0xa8/0x4b0 [323294.928736] [c000003fcb013e30] [c000000000009460] ppc_clone+0x8/0xc [323294.928799] Task dump for CPU 153: [323294.928829] fs_inod D 00003fffafd1efb8 0 151172 79296 0x00040004 [323294.928901] Call Trace: [323294.928922] [c000005a398675e0] [00000000004352ca] 0x4352ca (unreliable) [323294.928982] [c000005a398677b0] [c000003feee5e900] 0xc000003feee5e900 [323294.929043] [c000005a39867800] [c000000000ae8604] __schedule+0x314/0x990 [323294.929103] [c000005a398678d0] [c000000000ae8cc8] schedule+0x48/0xc0 [323294.929165] [c000005a39867900] [c000000000aecc08] rwsem_down_write_failed+0x2b8/0x410 [323294.929236] [c000005a39867990] [c000000000aebdf8] down_write+0x78/0xa0 [323294.929298] [c000005a398679c0] [c0000000002797bc] unlink_file_vma+0x4c/0x90 [323294.929358] [c000005a39867a00] [c000000000270110] free_pgtables+0x100/0x1c0 [323294.929419] [c000005a39867a50] [c00000000027d4e0] exit_mmap+0x100/0x1a0 [323294.929478] [c000005a39867b10] [c0000000000b5098] mmput+0xa8/0x1b0 [323294.929540] [c000005a39867b40] [c0000000002ec740] flush_old_exec+0x680/0x840 [323294.929638] [c000005a39867bb0] [c00000000036412c] load_elf_binary+0x43c/0xf90 [323294.929709] [c000005a39867c90] [c0000000002ecbb8] search_binary_handler+0xe8/0x2a0 [323294.929779] [c000005a39867d20] [c0000000002edad0] do_execveat_common.isra.14+0x600/0x850 [323294.929848] [c000005a39867df0] [c0000000002ee13c] SyS_execve+0x5c/0x80 [323294.931065] [c000005a39867e30] [c000000000009204] system_call+0x38/0xb4[323294.934998] INFO: rcu_sched detected stalls on CPUs/tasks: [323294.935005] 71-...: (4 GPs behind) idle=a99/140000000000000/0 softirq=9935677/9935684 fqs=48343856 [323294.953222] [323294.954627] 113-...: (48720536 ticks this GP) idle=df7/140000000000001/0 softirq=3730603/3730614 fqs=48343856 [323294.964332] 153-...: (2 GPs behind) idle=321/140000000000000/0 softirq=3976834/3976836 fqs=48343857 [323294.974024] (detected by 17, t=48734953 jiffies, g=6962468, c=6962467, q=43612782) [323294.980967] Task dump for CPU 71: [323294.985119] fs_inod R running task 0 151173 78575 0x00040004 [323294.992058] Call Trace: [323294.994823] [c000003fbcd17890] [c000005cc0adee00] 0xc000005cc0adee00 (unreliable) [323295.001767] Task dump for CPU 113: [323295.005916] fs_inod R running task 0 46071 148792 0x00042004 [323295.012851] Call Trace: [323295.015627] [c000003fcb013a10] [0000000010160000] 0x10160000 (unreliable) [323295.022572] Task dump for CPU 153: [323295.025342] fs_inod D 00003fffafd1efb8 0 151172 79296 0x00040004 [323295.032259] Call Trace: [323295.035044] [c000005a398675e0] [00000000004352ca] 0x4352ca (unreliable) [323295.041977] [c000005a398677b0] [c000003feee5e900] 0xc000003feee5e900 [323295.048921] [c000005a39867800] [c000000000ae8604] __schedule+0x314/0x990 [323295.055839] [c000005a398678d0] [c000000000ae8cc8] schedule+0x48/0xc0 [323295.061404] [c000005a39867900] [c000000000aecc08] rwsem_down_write_failed+0x2b8/0x410 [323295.069715] [c000005a39867990] [c000000000aebdf8] down_write+0x78/0xa0 [323295.076640] [c000005a398679c0] [c0000000002797bc] unlink_file_vma+0x4c/0x90 [323295.083584] [c000005a39867a00] [c000000000270110] free_pgtables+0x100/0x1c0 [323295.090520] [c000005a39867a50] [c00000000027d4e0] exit_mmap+0x100/0x1a0 [323295.097434] [c000005a39867b10] [c0000000000b5098] mmput+0xa8/0x1b0 [323295.103006] [c000005a39867b40] [c0000000002ec740] flush_old_exec+0x680/0x840 [323295.109929] [c000005a39867bb0] [c00000000036412c] load_elf_binary+0x43c/0xf90 [323295.116877] [c000005a39867c90] [c0000000002ecbb8] search_binary_handler+0xe8/0x2a0 [323295.125178] [c000005a39867d20] [c0000000002edad0] do_execveat_common.isra.14+0x600/0x850 [323295.133501] [c000005a39867df0] [c0000000002ee13c] SyS_execve+0x5c/0x80 [323295.139059] [c000005a39867e30] [c000000000009204] system_call+0x38/0xb4 [323357.706980] INFO: rcu_bh detected stalls on CPUs/tasks: == Comment: #16 - Fei Fei Xu <feifei...@cn.ibm.com> - 2016-08-24 01:01:19 == Some findings from console log: 1) rcu_bh detected stalls were detected on CPU 3, CPU71, CPU113 and CPU 153. [323294.686996] INFO: rcu_bh detected stalls on CPUs/tasks: [323294.687105] 3-...: (1 GPs behind) idle=813/140000000000000/0 softirq=9880043/9935887 fqs=48656494 [323294.687187] 71-...: (1 GPs behind) idle=a99/140000000000000/0 softirq=9875483/9935684 fqs=48656494 [323294.687268] 153-...: (1 GPs behind) idle=321/140000000000000/0 softirq=3948104/3976836 fqs=48656494 [345225.881576] 113-...: (54201466 ticks this GP) idle=df7/140000000000001/0 softirq=3730603/3730614 fqs=53769037 The softirq number behind the "/" stays constant across repeated stall-warning messages(9935887, 9935684, 3976836, 3730614). This happen when the stalled CPU is spinning with interrupts are disabled. Which from the log, CPUs might be spinning in a spin lock. From the task dump: [321971.602704] Task dump for CPU 153: [321971.605481] fs_inod D 00003fffafd1efb8 0 151172 79296 0x00040004 task "fs_inod" (should be a task in BASE/IO test case) running on 153 is waiting for something. This might be the root task causing the locks on CPUs. Notice a warning : nfs: server 10.33.11.31 not responding, timed out Seems the BASE/IO tests run the cases mounting through NFS. fs_inod might be waiting for someting read/write from/to a non-responding NFS server. And rcu stall happens. So can we make sure the networking to the NFS server is stable, and then check if the issue can be reproduced? == Comment: #20 - Paul Mackerras <p...@au1.ibm.com> - 2016-08-31 00:30:36 == Seeing some CPUs stuck in rwsem_wake() and others stuck in schedule() from down_write() makes me wonder if this could be the same issue as that described in https://lkml.org/lkml/2016/8/30/114 (which includes a proposed fix). == Comment: #25 - Balbir Singh <balb...@au1.ibm.com> - 2016-09-13 20:48:20 == (In reply to comment #23) A pull request was sent out yesterday for 4.8 (https://lkml.org/lkml/2016/9/13/589). Once that is done, we can request any backport. The backport is straight forward, I presume a mirroring of this bug with a request will do the needful? == Comment: #31 - Leonardo Augusto Guimaraes Garcia <lagar...@br.ibm.com> - 2016-10-11 10:56:36 == What's the next step here? Should we mirror this bug so that Canonical gets the fix? == Comment: #33 - Scott E. Garfinkle <s...@us.ibm.com> - 2016-10-11 12:14:24 == Yes, this needs to get mirrored to Canonical 16.04 "kernel" component. I made an effort to set the correct comments to be mirrored or not. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632471/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp