On Tue, 2009-08-18 at 20:38 +0400, Nikita V. Youshchenko wrote: > Package: linux-image-2.6.26-2-xen-amd64 > Version: 2.6.26-17lenny2 > Severity: normal > > I'm running several servers (Dual Core AMD Opteron, 16G of ram) with > Xen. > > Dom0 is debian lenny adm64. > > Today I tried to run a large backup in dom0 - copy NFS-mounted partition > to a local volume group. > > buki:~# cd backup/ > buki:~/backup# mount homes:/srv/homes src > buki:~/backup# mount /dev/vg/homes-backup dst > buki:~/backup# (cd src && tar cpf - .) | (cd dst && tar xpf -) > > This resulted into server crash in a few minutes. > > Crash is reproducible - if after reboot I try the same, same crash > happens soon with similary-looking backtrace. > > [ 712.450538] ------------[ cut here ]------------ > [ 712.450538] kernel BUG at drivers/xen/core/spinlock.c:74! > [ 712.450538] invalid opcode: 0000 [1] SMP > [ 712.450538] CPU 0 > [ 712.450538] Modules linked in: nfs nfsd lockd nfs_acl auth_rpcgss sunrpc > exportfs sctp libcrc32c ipv6 lock_dlm gfs2 dlm configfs bridge 8021q dm_emc > dm_round_robin dm_multipath loop psmouse pcspkr serio_raw k8temp i2c_amd8111 > i2c_amd756 i2c_core button amd_rng rng_core shpchp pci_hotplug evdev reiserfs > dm_mirror dm_log dm_snapshot dm_mod raid456 async_xor async_memcpy async_tx > xor raid1 md_mod ide_cd_mod cdrom ide_pci_generic sd_mod amd74xx ide_core > floppy lpfc scsi_transport_fc scsi_tgt tg3 e100 mii sata_sil ata_generic > ohci_hcd libata scsi_mod dock thermal processor fan thermal_sys > [ 712.450538] Pid: 5636, comm: tar Not tainted 2.6.26-2-xen-amd64 #1 > [ 712.450538] RIP: e030:[<ffffffff8037fa9c>] [<ffffffff8037fa9c>] > xen_spin_wait+0x90/0x139 > [ 712.450538] RSP: e02b:ffffffff80595e38 EFLAGS: 00010246 > [ 712.450538] RAX: ffff88001b1adfd8 RBX: ffff880001bbda20 RCX: > ffff88008162e000 > [ 712.450538] RDX: ffffffffff5f7000 RSI: 0000000000008483 RDI: > 0000000000000000 > [ 712.450538] RBP: 0000000000000103 R08: 00000000dc38d979 R09: > ffffffffffffffff > [ 712.450538] R10: 0000000000000000 R11: 0000000000000246 R12: > 0000000000000000 > [ 712.450538] R13: ffff88001fe7f698 R14: ffff88001fe7f698 R15: > 0000000000000000 > [ 712.450538] FS: 00007f3ef40eb6e0(0000) GS:ffffffff80539000(0000) > knlGS:0000000000000000 > [ 712.450538] CS: e033 DS: 0000 ES: 0000 > [ 712.450538] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 712.450538] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: > 0000000000000400 > [ 712.450538] Process tar (pid: 5636, threadinfo ffff88001b1ac000, task > ffff88001e270280) > [ 712.450538] Stack: ffff88001da47480 ffff88001fe7f698 0000000000008483 > 0000000000000000 > [ 712.450538] 0000000000000013 ffffffff80435886 0000000000000000 > ffff88001d8f8700 > [ 712.450538] 0000000000000000 ffffffffa00854f4 0000000000000000 > ffffc2000001ac00 > [ 712.450538] Call Trace: > [ 712.450538] <IRQ> [<ffffffff80435886>] ? _spin_lock+0x3a/0x42 > [ 712.450538] [<ffffffffa00854f4>] ? :sata_sil:sil_interrupt+0x2b/0x275 > [ 712.450538] [<ffffffff8025f0e6>] ? handle_IRQ_event+0x47/0x8e > [ 712.450538] [<ffffffff802605a2>] ? handle_level_irq+0xae/0x116 > [ 712.450538] [<ffffffff8020e13e>] ? do_IRQ+0x4e/0x9a > [ 712.450538] [<ffffffff8037d4cc>] ? evtchn_do_upcall+0x13c/0x1fc > [ 712.450538] [<ffffffff8020bbde>] ? do_hypervisor_callback+0x1e/0x30 > [ 712.450538] <EOI> [<ffffffff802e48d4>] ? dummy_inode_setattr+0x0/0x3 > [ 712.450538] [<ffffffff8037cbf9>] ? xen_poll_irq+0x67/0x74 > [ 712.450538] [<ffffffff8037fb06>] ? xen_spin_wait+0xfa/0x139 > [ 712.450538] [<ffffffff80435cfd>] ? lock_kernel+0x4f/0x63 > [ 712.450538] [<ffffffffa01dcd12>] ? :reiserfs:reiserfs_setattr+0x25/0x27d > [ 712.450538] [<ffffffff8031156f>] ? __up_read+0x13/0x8a > [ 712.450538] [<ffffffffa01f4064>] ? :reiserfs:reiserfs_getxattr+0xa9/0xbe > [ 712.450538] [<ffffffff8029e7e3>] ? notify_change+0x174/0x2f5 > [ 712.450538] [<ffffffff80288db5>] ? chown_common+0x8c/0xa1 > [ 712.450538] [<ffffffff802a1923>] ? mnt_want_write+0x31/0x86 > [ 712.450538] [<ffffffff80288f83>] ? sys_chown+0x4b/0x6f > [ 712.450538] [<ffffffff8029a9d8>] ? dput+0x35/0x13e > [ 712.450538] [<ffffffff802ab10e>] ? sys_futimesat+0x88/0x9e > [ 712.450538] [<ffffffff8020b528>] ? system_call+0x68/0x6d > [ 712.450538] [<ffffffff8020b4c0>] ? system_call+0x0/0x6d > [ 712.450538] > [ 712.450538] > [ 712.450538] Code: 30 fa 58 80 4c 39 2c 08 75 04 0f 0b eb fe 48 c7 c0 40 fa > 58 80 eb 1f 65 48 8b 04 25 10 00 00 00 66 f7 80 44 e0 ff ff 00 ff 75 04 <0f> > 0b eb fe 48 c7 c0 30 fa 58 80 48 8d 1c 08 48 83 3b 00 74 04 > [ 712.450538] RIP [<ffffffff8037fa9c>] xen_spin_wait+0x90/0x139 > [ 712.450538] RSP <ffffffff80595e38> > [ 712.450538] ---[ end trace 73e60cdc01c1f34c ]--- > [ 712.450538] Kernel panic - not syncing: Aiee, killing interrupt handler! [...]
The kernel was spinning in process context, was interrupted by the SATA device, and its interrupt handler also started spinning. The BUG line is here: /* announce we're spinning */ spinning = &__get_cpu_var(spinning); if (spinning->lock) { BUG_ON(spinning->lock == lock); if(raw_irqs_disabled()) { BUG_ON(__get_cpu_var(spinning_bh).lock == lock); spinning = &__get_cpu_var(spinning_irq); } else { -> BUG_ON(!in_softirq()); spinning = &__get_cpu_var(spinning_bh); } BUG_ON(spinning->lock); } spinning->ticket = token; smp_wmb(); spinning->lock = lock; This asserts that if we spin on a lock after interrupting another spin, and interrupts are enabled, we must be in a softirq. This seems bogus to me - in general, interrupts are enabled during interrupt handlers once their specific IRQ has been masked. I'll have a look at whether & how this code has changed upstream and in other forward-ported branches. Ben. -- Ben Hutchings The generation of random numbers is too important to be left to chance. - Robert Coveyou
signature.asc
Description: This is a digitally signed message part