> > Package: linux-2.6 > > Version: 2.6.32-38 > > Severity: important > > Tags: upstream > > > > the lockup happened on unlinking a large file (several GB) on XFS on fibre > > channel storage (qla2xxx module). We experience very simiral trouble on > > this machine though, using other kernels (from stable, stable-backports and > > sid as well) and other fs/storage combinations (ext4 on local Dell PERC > > controller). > > How long did the soft lockup state persist before you rebooted?
14 minutes > > The full boot-time dmesg output may be found on > > > > http://joni.heaven-industries.com/~egon/tornado-dmesg.txt > > > > The bug is not too easy to trigger, so I hope there's another way to look > > for it than bisecting 5 years of kernel commits... On the other hand, I'm > > of course ready to spend a time on it if it helps to make the kernel better > > :). > [...] > > When you say 'not too easy', do you mean that you don't know what > specific circumstances it occurs in, or that it occurs as part of a long > or complex sequence of operations? The former. > What sort of applications or services is this machine running? > > Given that you can reproduce this with the kernel version in sid, maybe > we should take this straight to the upstream developers - but it's hard > to guess which component or maintainer might be responsible. I'm sorry for taking so long to respond--we tried to reproduce that in the sid kernel and it didn't crash since then until today (well, the kernel from squeeze-backports crashed here once in between). So I thought the BIOS upgrade may have solved it and didn't want to bother you in that case. The today's crash may not even be related to the original, so I should perhaps open a new bug report... Anyway, here it goes (edited from netconsole output): [313919.501641] BUG: unable to handle kernel paging request at ffff88007f873133 [313919.501703] IP: [<ffffffff810ed6aa>] free_block+0x50/0xfc PGD 1604063 PUD 7f335067 PMD 0 Oops: 0002 [#1] SMP [313919.501827] CPU 11 [313919.501836] Modules linked in: dm_snapshot des_generic ecb md4 hmac nls_utf8 cifs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ext4 jbd2 crc16 xfs dm_round_robin scsi_dh_emc dm_multipath scsi_dh netconsole configfs snd_pcm snd_timer snd soundcore snd_page_alloc joydev tpm_tis pcspkr tpm tpm_bios dcdbas i7core_edac evdev psmouse ghes serio_raw edac_core acpi_power_meter hed button processor thermal_sys ext3 jbd mbcache dm_mod sg sr_mod cdrom sd_mod ata_generic ses crc_t10dif enclosure usbhid hid uhci_hcd qla2xxx scsi_transport_fc scsi_tgt ata_piix libata ehci_hcd usbcore megaraid_sas scsi_mod bnx2 [last unloaded: scsi_wait_scan] [313919.502423] Pid: 78, comm: kswapd1 Not tainted 3.0.0-2-amd64 #1 Dell Inc. PowerEdge R810/0TT6JF [313919.502486] RIP: 0010:[<ffffffff810ed6aa>] [<ffffffff810ed6aa>] free_block+0x50/0xfc [313919.502540] RSP: 0018:ffff88085e35f9d0 EFLAGS: 00010082 [313919.502569] RAX: ffff880004013000 RBX: ffff88107f04a340 RCX: ffff880004018000 [313919.502616] RDX: ffff88087f4761c0 RSI: ffff88007f87312b RDI: ffff880004013200 [313919.502663] RBP: 000000000000000c R08: dead000000200200 R09: dead000000100100 [313919.502710] R10: dead000000200200 R11: dead000000100100 R12: 0000000000000003 [313919.502757] R13: ffff88107f01d5f0 R14: 0000000000000000 R15: ffff880004013200 [313919.502804] FS: 0000000000000000(0000) GS:ffff88107f4a0000(0000) knlGS:0000000000000000 [313919.502853] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [313919.502883] CR2: ffff88007f873133 CR3: 0000000001603000 CR4: 00000000000006e0 [313919.502930] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [313919.502977] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [313919.503024] Process kswapd1 (pid: 78, threadinfo ffff88085e35e000, task ffff88085e35ae60) [313919.503072] Stack: [313919.503094] 0000000000000082 ffff88107f01d5c0 ffff88087f4761c0 ffff88107f04a340 [313919.503160] 0000000000000000 0000000000000001 0000000000000000 ffffffff810ed7ea [313919.503226] ffff8802f1a7c280 ffff88107f01d5c0 ffff88107f04a340 ffff880004716c90 [313919.503290] Call Trace: [313919.503318] [<ffffffff810ed7ea>] ? __drain_alien_cache+0x94/0xa9 [313919.503351] [<ffffffff810ed486>] ? __cache_free.clone.30+0x8f/0x196 [313919.503383] [<ffffffff810ed5b2>] ? kmem_cache_free+0x25/0x69 [313919.503420] [<ffffffff8111d252>] ? free_buffer_head+0x1f/0x30 [313919.503452] [<ffffffff8111d4d8>] ? try_to_free_buffers+0x94/0xa6 [313919.508629] [<ffffffffa031ae19>] ? xfs_vm_releasepage+0x82/0x8b [xfs] [313919.508668] [<ffffffff810c3b66>] ? shrink_page_list+0x2ef/0x496 [313919.508701] [<ffffffff810c412b>] ? shrink_inactive_list+0x23a/0x391 [313919.508739] [<ffffffff811a79ae>] ? radix_tree_gang_lookup_tag+0x7e/0xd2 [313919.508773] [<ffffffff810c4932>] ? shrink_zone+0x3ad/0x4ac [313919.508810] [<ffffffff81335b15>] ? _cond_resched+0x9/0x20 [313919.508841] [<ffffffff810c55f9>] ? balance_pgdat+0x28e/0x514 [313919.508873] [<ffffffff810c5b33>] ? kswapd+0x2b4/0x2cd [313919.508906] [<ffffffff8106023f>] ? wake_up_bit+0x23/0x23 [313919.508937] [<ffffffff810c587f>] ? balance_pgdat+0x514/0x514 [313919.508968] [<ffffffff810c587f>] ? balance_pgdat+0x514/0x514 [313919.508999] [<ffffffff8105fdc7>] ? kthread+0x7a/0x82 [313919.509032] [<ffffffff8133d1e4>] ? kernel_thread_helper+0x4/0x10 [313919.509064] [<ffffffff8105fd4d>] ? kthread_worker_fn+0x149/0x149 [313919.509096] [<ffffffff8133d1e0>] ? gs_change+0x13/0x13 [313919.509125] Code: 4c 89 ff e8 03 e8 ff ff 48 8b 48 08 48 8b 30 49 b9 00 01 10 00 00 00 ad de 49 b8 00 02 20 00 00 00 ad de 4a 8b 94 f3 68 10 00 00 [313919.509349] 89 4e 08 48 89 31 4c 89 08 4c 89 40 08 44 2b 78 18 8b 8b 10 [313919.509486] RIP [<ffffffff810ed6aa>] free_block+0x50/0xfc RSP <ffff88085e35f9d0> CR2: ffff88007f873133 [313919.509902] ---[ end trace 4e4c1859e166ba0c ]--- It was followed by (possibly irrelevant) backtrace from another CPU: [313936.280974] ------------[ cut here ]------------ [313936.281068] WARNING: at /build/buildd-linux-2.6_3.0.0-5-amd64-UzNnAD/linux-2.6-3.0.0/debian/build/source_amd64_none/kernel/watchdog.c:240 watchdog_overflow_callback+0x96/0xa4() [313936.281193] Hardware name: PowerEdge R810 [313936.281259] Watchdog detected hard LOCKUP on cpu 7 [313936.281314] Modules linked in: dm_snapshot des_generic ecb md4 hmac nls_utf8 cifs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ext4 jbd2 crc16 xfs dm_round_robin scsi_dh_emc dm_multipath scsi_dh netconsole configfs snd_pcm snd_timer snd soundcore snd_page_alloc joydev tpm_tis pcspkr tpm tpm_bios dcdbas i7core_edac evdev psmouse ghes serio_raw edac_core acpi_power_meter hed button processor thermal_sys ext3 jbd mbcache dm_mod sg sr_mod cdrom sd_mod ata_generic ses crc_t10dif enclosure usbhid hid uhci_hcd qla2xxx scsi_transport_fc scsi_tgt ata_piix libata ehci_hcd usbcore megaraid_sas scsi_mod bnx2 [last unloaded: scsi_wait_scan] [313936.284925] Pid: 38550, comm: sky-db Tainted: G D 3.0.0-2-amd64 #1 [313936.285001] Call Trace: [313936.285066] <NMI> [<ffffffff81046591>] ? warn_slowpath_common+0x78/0x8c [313936.285198] [<ffffffff81046646>] ? warn_slowpath_fmt+0x45/0x4a [313936.285276] [<ffffffff8100e28d>] ? paravirt_read_tsc+0x5/0x8 [313936.285347] [<ffffffff81091b60>] ? watchdog_overflow_callback+0x96/0xa4 [313936.285424] [<ffffffff810b1a5c>] ? __perf_event_overflow+0x101/0x198 [313936.285500] [<ffffffff8101429b>] ? paravirt_read_msr+0x7/0xa [313936.285571] [<ffffffff810142a9>] ? paravirt_write_msr+0xb/0xe [313936.285646] [<ffffffff81017d48>] ? intel_pmu_handle_irq+0x35d/0x3c0 [313936.285726] [<ffffffff813383fa>] ? perf_event_nmi_handler+0x3e/0x87 [313936.285800] [<ffffffff81339ee3>] ? notifier_call_chain+0x2e/0x5b [313936.285874] [<ffffffff81339f5f>] ? notify_die+0x2d/0x32 [313936.285945] [<ffffffff81337a3c>] ? do_nmi+0x61/0x202 [313936.286015] [<ffffffff81337750>] ? nmi+0x20/0x30 [313936.286089] [<ffffffff81072135>] ? do_raw_spin_lock+0x15/0x1b [313936.286159] <<EOE>> [<ffffffff810ed471>] ? __cache_free.clone.30+0x7a/0x196 [313936.286299] [<ffffffff810ed5b2>] ? kmem_cache_free+0x25/0x69 [313936.286376] [<ffffffff8111d252>] ? free_buffer_head+0x1f/0x30 [313936.286448] [<ffffffff8111d4d8>] ? try_to_free_buffers+0x94/0xa6 [313936.286544] [<ffffffffa031ae19>] ? xfs_vm_releasepage+0x82/0x8b [xfs] [313936.286623] [<ffffffff810c3b66>] ? shrink_page_list+0x2ef/0x496 [313936.286695] [<ffffffff810c412b>] ? shrink_inactive_list+0x23a/0x391 [313936.286767] [<ffffffff810c4932>] ? shrink_zone+0x3ad/0x4ac [313936.286839] [<ffffffff810c4da0>] ? do_try_to_free_pages+0x106/0x315 [313936.286918] [<ffffffff810bc6bf>] ? get_page_from_freelist+0x651/0x697 [313936.286990] [<ffffffff810c5235>] ? try_to_free_pages+0xb6/0xf6 [313936.287062] [<ffffffff810bcd41>] ? __alloc_pages_nodemask+0x4d9/0x765 [313936.287137] [<ffffffff810beef7>] ? page_zone.clone.4+0x21/0x21 [313936.287214] [<ffffffff810e7acd>] ? alloc_pages_vma+0xdc/0xe1 [313936.287292] [<ffffffff810d263c>] ? handle_pte_fault+0x162/0x79b [313936.287364] [<ffffffff810cfc31>] ? set_pmd+0x5/0x8 [313936.287434] [<ffffffff810cfd4c>] ? pte_offset_kernel+0x16/0x36 [313936.287506] [<ffffffff810d2ff1>] ? handle_mm_fault+0x1ea/0x22c [313936.287578] [<ffffffff81339e90>] ? do_page_fault+0x2e9/0x30e [313936.287651] [<ffffffff810d7c2e>] ? mmap_region+0x336/0x431 [313936.287722] [<ffffffff810d5c0e>] ? get_unmapped_area+0xe4/0x13e [313936.287797] [<ffffffff81337495>] ? page_fault+0x25/0x30 [313936.287869] ---[ end trace 4e4c1859e166ba0d ]--- Here's the last atop sample (2 minutes until the crash): ATOP - tornado 2011/10/25 03:45:02 600 seconds elapsed PRC | sys 80.95s | user 67m20s | #proc 201 | #zombie 0 | #exit 4773 | CPU | sys 14% | user 675% | irq 1% | idle 903% | wait 8% | cpu | sys 1% | user 71% | irq 0% | idle 27% | cpu002 w 1% | cpu | sys 1% | user 71% | irq 0% | idle 28% | cpu004 w 0% | cpu | sys 1% | user 64% | irq 0% | idle 35% | cpu007 w 0% | cpu | sys 1% | user 59% | irq 0% | idle 39% | cpu006 w 0% | cpu | sys 1% | user 59% | irq 0% | idle 40% | cpu005 w 0% | cpu | sys 1% | user 57% | irq 0% | idle 42% | cpu003 w 0% | cpu | sys 1% | user 52% | irq 0% | idle 45% | cpu001 w 2% | cpu | sys 1% | user 50% | irq 0% | idle 43% | cpu000 w 5% | cpu | sys 1% | user 30% | irq 0% | idle 69% | cpu008 w 0% | cpu | sys 1% | user 30% | irq 0% | idle 70% | cpu011 w 0% | cpu | sys 0% | user 29% | irq 0% | idle 70% | cpu015 w 0% | cpu | sys 1% | user 28% | irq 0% | idle 72% | cpu009 w 0% | cpu | sys 1% | user 24% | irq 0% | idle 75% | cpu012 w 0% | cpu | sys 1% | user 20% | irq 0% | idle 79% | cpu013 w 0% | cpu | sys 0% | user 19% | irq 0% | idle 81% | cpu010 w 0% | cpu | sys 0% | user 11% | irq 0% | idle 89% | cpu014 w 0% | CPL | avg1 2.57 | avg5 1.94 | avg15 2.15 | csw 9919970 | intr 6679518 | MEM | tot 63.0G | free 1.3G | cache 52.0G | buff 2.0M | slab 1.2G | SWP | tot 9.3G | free 9.3G | | vmcom 13.7G | vmlim 40.8G | PAG | scan 454328 | stall 0 | | swin 0 | swout 256 | DSK | sda | busy 8% | read 9391 | write 17782 | avio 1 ms | NET | transport | tcpi 738 | tcpo 684 | udpi 45 | udpo 45 | NET | network | ipi 2033 | ipo 722 | ipfrw 0 | deliv 2032 | NET | eth2 0% | pcki 3185 | pcko 601 | si 4 Kbps | so 2 Kbps | NET | lo ---- | pcki 198 | pcko 198 | si 0 Kbps | so 0 Kbps | PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/217 38629 8.36s 8m29s 74264K 73832K 11916K 301.7M -- - R 86% sky-db 39009 14.48s 8m22s -0.1G -0.1G 556K 337.9M -- - R 86% sky-db 38674 8.55s 8m26s -13.3M -13.7M 9220K 215.9M -- - R 86% sky-db 38652 9.33s 8m25s 25112K 25016K 5184K 136.9M -- - R 86% sky-db 38563 8.28s 8m25s 20124K 19924K 8568K 68804K -- - R 86% sky-db 38957 8.55s 8m25s 34588K 34112K 7316K 52356K -- - R 86% sky-db 38576 8.75s 8m24s 5912K 5696K 1172K 14784K -- - R 85% sky-db 38550 10.55s 8m19s -10.0M -8288K 646.9M 555.2M -- - R 85% sky-db 3752 0.10s 0.42s -640K -528K 1616K 8K -- - S 0% sky-hps 4491 0.02s 0.48s -640K -528K 1196K 28K -- - S 0% sky-hps 3836 0.06s 0.42s 0K 0K 0K 0K -- - S 0% sky-hps 4075 0.45s 0.00s 0K 0K 0K 0K -- - S 0% flush-254:1 3910 0.10s 0.34s 0K -40K 0K 0K -- - S 0% skytrick 3828 0.22s 0.22s 0K 0K 0K 0K -- - S 0% sky-hps 59708 0.14s 0.30s 0K -120K 0K 0K -- - S 0% proxy 4438 0.09s 0.34s 0K -244K 0K 0K -- - S 0% sky-db 37660 0.35s 0.07s 0K -40K 0K 0K -- - S 0% skytrick 4445 0.08s 0.34s 0K -244K 0K 0K -- - S 0% sky-db 4459 0.09s 0.33s 0K -244K 0K 0K -- - S 0% sky-db 4184 0.07s 0.35s 0K -1268K 0K 0K -- - S 0% sky-db 4452 0.09s 0.32s 0K -244K 0K 0K -- - S 0% sky-db 4474 0.07s 0.34s 0K -244K 0K 0K -- - S 0% sky-db 4481 0.07s 0.30s 0K -244K 0K 0K -- - S 0% sky-db All the sky-db process were messing (through libdb API) with the same big (57GB) file on the XFS (over LVM2 logical volume over a local HW RAID) filesystem. The application is being built locally against stock squeeze libdb4.8-dev pkg. I noticed a dramatic drop i the "buff" (2.0M value, see above) about an hour before the crash--until then it was many hours steadily over 200 megabytes. In the last hour it was only 1.8-2.0 MB as of the atop log. The machine had 4 days uptime, the respective boot dmesg output is at http://joni.heaven-industries.com/~egon/tornado-dmesg-111021.txt It turned out we have a very similar another machine here which crashed in the last days too with the outdated BIOS (no crash since the upgrade, but it's only 4 days since then) running CentOS 5.x (kernel 2.6.18). To clarify once again: the crash described above happened with a recent BIOS. Thanks! -- Egon Eckert, Heaven Industries, s.r.o. E-mail: egon.eck...@heaven-industries.com EARTH smog | bricks AIR -- mud -- FIRE soda water | tequila WATER -- with thanks to fortune -- To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20111025102124.gb19...@heaven-industries.com