Raoul Bhatia [IPAX] <r.bha...@ipax.at> 2013-01-29 11:01:
> On 2012-10-12 20:52, Brian Kroth wrote:
>
>> Brian Paul Kroth <bpkr...@gmail.com> 2012-10-11 14:06:
>>
>>> Jonathan Nieder <jrnie...@gmail.com> 2012-10-01 01:25:
>>>> <snip/>
>>>
>>> Once again, very sorry for the delay :( I forgot to disable DEBUG_INFO
>>> and kept filling up my build VM's disk during compiles. Then I realized
>>> I had grabbed the 3.7 rc code, which these patches don't apply against.
>>> "git checkout remotes/stable/linux-3.2.y" (resulting in HEAD
>>> c74a5e1fe4d0672936c8fb63d7484dfeaa30669c, i.e. 3.2.28) fixed that.
>>>
>>> <snip/>
>>>
>>> Anyway, I just started running that on a machine, so I'll let you know
>>> if I notice anything there before I think about pushing it to further
>>> places.
>>>
>>> Thanks, Brian
>>
>> Got another panic using this kernel/set of patches. The dump is
>> attached. Let me know if you need anything else.
>
> Hi!
Hello!
> Has there been any progress regarding this issue?
Not really. At least not that I'm aware of.
> Brian, are you right now using the fsc facility or not?
Yes, with 54 mounts each on about 100 hosts.
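For anyone following along, here's a rough sketch of what an fsc-enabled NFS setup looks like. The server name and paths below are made up for illustration, not taken from our hosts; the fstab entry carries the `fsc` option and cachefilesd must be running to back it:

```shell
# Hypothetical fstab entry for an FS-Cache-backed NFS mount:
#
#   filesrv:/export/home  /home  nfs  rw,hard,intr,fsc  0 0
#
# Checking which live mounts actually request FS-Cache; demonstrated here
# against a sample line in the /proc/mounts format:
sample='filesrv:/export/home /home nfs4 rw,relatime,fsc,vers=4.0 0 0'
echo "$sample" | grep -E 'nfs.*fsc'
# Against a real system, the same filter runs over /proc/mounts:
#   grep -E 'nfs.*fsc' /proc/mounts
```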
> If yes, which patches / configure options are you currently using and how often do you see kernel panics?
Currently we're running this kernel most places:

  ii  linux-image-3.2.0-0.bpo.2-amd64  3.2.20-1~bpo60+1  Linux 3.2 for 64-bit PCs

with a few hosts gradually moving over to this:

  ii  linux-image-3.2.0-0.bpo.4-amd64  3.2.35-2~bpo60+1  Linux 3.2 for 64-bit PCs

and one host running 3.2.28 with the set of patches from here:

  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=682007#47

We've seen the panic on all of those kernels. Since it's fairly recent, I've attached another dump from the bpo.4 (3.2.35) kernel's panic.
Frequency and cause are a little difficult to tease out precisely. These are lab machines, and the workload may vary quite substantially based on what classes and compute jobs (e.g. from condor) happen to be running on them at any given time.
Recently (since the students returned a week and a half ago) we've seen this on 6 machines.
Before that, I see 37 other events in the last 90 days (our log rotation period). They're usually clustered together, so probably tied to a particular job's workload. Unfortunately, those jobs are usually gone by the time I see it.
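In case it's useful to reproduce the count elsewhere, this is roughly how I pull the number out of the rotated logs. The log paths and layout are assumptions about a typical Debian setup, not a description of anything specific in this thread:

```shell
# count_panics: count oops events across rotated (possibly gzipped) logs.
# zgrep transparently reads both plain and gzip-compressed files.
count_panics() {
    zgrep -h 'BUG: unable to handle kernel' "$@" 2>/dev/null | wc -l
}
# On a lab machine this would be run as:
#   count_panics /var/log/kern.log*
```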
> Are there any workarounds to this issue besides disabling fsc?
Not that I'm aware of. Let me know if you need anything else.

Thanks,
Brian
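For completeness, disabling fsc amounts to dropping the option from the mounts. A sketch (the fstab line and server below are hypothetical, and this isn't a command taken from the thread):

```shell
# Strip the fsc option from a sample fstab-style entry before remounting:
sample='filesrv:/export/home /home nfs rw,hard,intr,fsc 0 0'
echo "$sample" | sed 's/,fsc\b//'
# nfs(5) also documents an explicit nofsc mount option, so an alternative is:
#   mount -o remount,nofsc /home
# (a full umount/mount cycle may be needed for the cache change to take effect)
```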
Jan 19 02:08:44 tux-116 [120882.927408] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[120882.927421] IP: [<ffffffffa103c5f7>] __fscache_read_or_alloc_pages+0x194/0x262 [fscache]
[120882.927432] PGD 22120c067 PUD 22157d067 PMD 0
[120882.927440] Oops: 0000 [#1] SMP
[120882.927446] CPU 0
[120882.927449] Modules linked in: btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext2 cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_stats autofs4 cachefiles binfmt_misc kvm_intel kvm nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc netconsole configfs ext3 jbd dm_crypt sbs power_supply sbshc adt7475 hwmon_vid ipmi_watchdog ipmi_devintf ipmi_si ipmi_msghandler fuse uhci_hcd ohci_hcd snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss tpm_infineon snd_pcm nvidia(P) snd_seq_midi acpi_cpufreq mperf snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device hp_wmi sparse_keymap rfkill i2c_i801 coretemp snd tpm_tis tpm tpm_bios wmi button processor evdev thermal_sys psmouse i2c_core soundcore snd_page_alloc serio_raw ext4 mbcache jbd2 crc16 dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod usbhid hid sg sr_mod sd_mod cdrom crc_t10dif ahci libahci crc32c_intel ghash_clmulni_intel ehci_hcd ata_generic aesni_intel cryptd libata aes_x86_64 scsi_mod aes_generic e1000e usbcore usb_common [last unloaded: microcode]
[120882.927712]
[120882.927718] Pid: 16263, comm: run_EM.sh Tainted: P O 3.2.0-0.bpo.4-amd64 #1 Debian 3.2.35-2~bpo60+1 Hewlett-Packard HP Compaq 8200 Elite CMT PC/1494
[120882.927733] RIP: 0010:[<ffffffffa103c5f7>] [<ffffffffa103c5f7>] __fscache_read_or_alloc_pages+0x194/0x262 [fscache]
[120882.927745] RSP: 0018:ffff8801f63119d8 EFLAGS: 00010246
[120882.927749] RAX: 0000000000000000 RBX: ffff8802112b4198 RCX: ffff8801f6311968
[120882.927754] RDX: 0000000000000000 RSI: ffff8801f6311958 RDI: ffff88022dfbea20
[120882.927759] RBP: ffff8801f6311a94 R08: ffff8801f6310000 R09: ffff88022dc0eab0
[120882.927763] R10: 0000000000000286 R11: 0000000000000000 R12: ffff8801f6311b98
[120882.927767] R13: ffff8801f6019180 R14: ffff8801f6411940 R15: 00000000000200da
[120882.927772] FS: 00002acadfa6f700(0000) GS:ffff88022dc00000(0000) knlGS:0000000000000000
[120882.927778] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[120882.927785] CR2: 0000000000000040 CR3: 0000000221977000 CR4: 00000000000406f0
[120882.927790] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[120882.927794] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[120882.927799] Process run_EM.sh (pid: 16263, threadinfo ffff8801f6310000, task ffff8802210d0870)
[120882.927805] Stack:
[120882.927809]  ffff88022241b440 ffffffffa108eacf 0000000000000000 ffff8801f71bedb8
[120882.927828]  ffff8801f6311a94 ffff8801f6311b98 0000000000000004 ffff8801f71beef8
[120882.927843]  00000000ffffff97 ffffffffa108ea04 00000010000200da 0000000000000024
[120882.927864] Call Trace:
[120882.927876]  [<ffffffffa108eacf>] ? __nfs_readpages_from_fscache+0x146/0x146 [nfs]
[120882.927886]  [<ffffffffa108ea04>] ? __nfs_readpages_from_fscache+0x7b/0x146 [nfs]
[120882.927893]  [<ffffffffa106dceb>] ? nfs_readpages+0xe1/0x157 [nfs]
[120882.927902]  [<ffffffff810efd44>] ? alloc_pages_current+0xbb/0xd8
[120882.927909]  [<ffffffff810c608e>] ? __do_page_cache_readahead+0x124/0x1ca
[120882.927917]  [<ffffffff810c6150>] ? ra_submit+0x1c/0x20
[120882.927924]  [<ffffffff810be4b8>] ? generic_file_aio_read+0x299/0x5d0
[120882.927930]  [<ffffffff8103b982>] ? __wake_up+0x35/0x46
[120882.927939]  [<ffffffffa1063bf2>] ? nfs_file_read+0x9d/0xbe [nfs]
[120882.927945]  [<ffffffff8110655d>] ? do_sync_read+0xba/0xf3
[120882.927951]  [<ffffffff81106fb0>] ? vfs_read+0xa1/0xfb
[120882.927957]  [<ffffffff8110b34a>] ? get_user_arg_ptr+0x47/0x5b
[120882.927963]  [<ffffffff8110c215>] ? kernel_read+0x39/0x47
[120882.927970]  [<ffffffff8110ce61>] ? do_execve_common+0x161/0x30f
[120882.927975]  [<ffffffff811c5cdb>] ? strncpy_from_user+0x40/0x6d
[120882.927982]  [<ffffffff8101495c>] ? sys_execve+0x3f/0x54
[120882.927989]  [<ffffffff8136daac>] ? stub_execve+0x6c/0xc0
[120882.927994] Code: 85 c0 74 06 48 8b 7a 28 ff d0 48 c7 c1 7c 06 04 a1 48 c7 c2 84 06 04 a1 4c 89 ee 4c 89 f7 e8 36 fc ff ff 85 c0 78 59 49 8b 46 70 8b 40 40 a8 04 74 29 f0 ff 05 66 41 00 00 49 8b 46 68 44 89
[120882.928292] RIP  [<ffffffffa103c5f7>] __fscache_read_or_alloc_pages+0x194/0x262 [fscache]
[120882.928304]  RSP <ffff8801f63119d8>
[120882.928309] CR2: 0000000000000040
[120882.928348] ---[ end trace 87eba6a8e7d80e34 ]---
[120882.928355] Kernel panic - not syncing: Fatal exception
[120882.928360] Pid: 16263, comm: run_EM.sh Tainted: P D O 3.2.0-0.bpo.4-amd64 #1 Debian 3.2.35-2~bpo60+1
[120882.928367] Call Trace:
[120882.928373]  [<ffffffff8136627d>] ? panic+0x92/0x1aa
[120882.928438]  [<ffffffff81049e84>] ? kmsg_dump+0x41/0xdd
[120882.928446]  [<ffffffff81368f41>] ? oops_end+0xa9/0xb6
[120882.928470]  [<ffffffff8102fd85>] ? no_context+0x1ff/0x20c
[120882.928476]  [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[120882.928482]  [<ffffffff8136b112>] ? do_page_fault+0x215/0x34c
[120882.928487]  [<ffffffff81366b6c>] ? __schedule+0x5a0/0x5cd
[120882.928494]  [<ffffffffa1039ed2>] ? fscache_wait_bit+0xd/0xd [fscache]
[120882.928501]  [<ffffffffa1039ed2>] ? fscache_wait_bit+0xd/0xd [fscache]
[120882.928508]  [<ffffffff81368635>] ? page_fault+0x25/0x30
[120882.928514]  [<ffffffffa103c5f7>] ? __fscache_read_or_alloc_pages+0x194/0x262 [fscache]
[120882.928521]  [<ffffffffa103c5ef>] ? __fscache_read_or_alloc_pages+0x18c/0x262 [fscache]
[120882.928530]  [<ffffffffa108eacf>] ? __nfs_readpages_from_fscache+0x146/0x146 [nfs]
[120882.928601]  [<ffffffffa108ea04>] ? __nfs_readpages_from_fscache+0x7b/0x146 [nfs]
[120882.928611]  [<ffffffffa106dceb>] ? nfs_readpages+0xe1/0x157 [nfs]
[120882.928618]  [<ffffffff810efd44>] ? alloc_pages_current+0xbb/0xd8
[120882.928623]  [<ffffffff810c608e>] ? __do_page_cache_readahead+0x124/0x1ca
[120882.928629]  [<ffffffff810c6150>] ? ra_submit+0x1c/0x20
[120882.928635]  [<ffffffff810be4b8>] ? generic_file_aio_read+0x299/0x5d0
[120882.928641]  [<ffffffff8103b982>] ? __wake_up+0x35/0x46
[120882.928725]  [<ffffffffa1063bf2>] ? nfs_file_read+0x9d/0xbe [nfs]
[120882.928727]  [<ffffffff8110655d>] ? do_sync_read+0xba/0xf3
[120882.928729]  [<ffffffff81106fb0>] ? vfs_read+0xa1/0xfb
[120882.928731]  [<ffffffff8110b34a>] ? get_user_arg_ptr+0x47/0x5b
[120882.928732]  [<ffffffff8110c215>] ? kernel_read+0x39/0x47