On 12 June 2014 07:35, John Stultz <john.stu...@linaro.org> wrote: > I've been seeing some ext4 corruption with recent kernels under > qemu-system-arm. > > This issue seems to crop up after shutting down uncleanly (terminating > qemu), shortly after booting about 50% of the time. > > ext4/mmc related dmesg details are: > [ 3.206809] mmci-pl18x mb:mmci: mmc0: PL181 manf 41 rev0 at > 0x10005000 irq 41,42 (pio) > [ 3.268316] mmc0: new SDHC card at address 4567 > [ 3.281963] mmcblk0: mmc0:4567 QEMU! 2.00 GiB > [ 3.315699] mmcblk0: p1 p2 p3 p4 < p5 p6 > > ... > [ 11.806169] EXT4-fs (mmcblk0p5): Ignoring removed nomblk_io_submit option > [ 11.904714] EXT4-fs (mmcblk0p5): recovery complete > [ 11.905854] EXT4-fs (mmcblk0p5): mounted filesystem with ordered > data mode. Opts: nomblk_io_submit,errors=panic > ... > [ 91.558824] EXT4-fs error (device mmcblk0p5): > ext4_mb_generate_buddy:756: group 1, 2252 clusters in bitmap, 2284 in > gd; block bitmap corrupt. > [ 91.560641] Aborting journal on device mmcblk0p5-8. > [ 91.562589] Kernel panic - not syncing: EXT4-fs (device mmcblk0p5): > panic forced after error > [ 91.562589] > [ 91.563486] CPU: 0 PID: 1 Comm: init Not tainted 3.15.0-rc1 #560 > [ 91.564616] [<c00116e5>] (unwind_backtrace) from [<c000f3b1>] > (show_stack+0x11/0x14) > [ 91.565154] [<c000f3b1>] (show_stack) from [<c04262b1>] > (dump_stack+0x59/0x7c) > [ 91.565666] [<c04262b1>] (dump_stack) from [<c0423297>] (panic+0x67/0x178) > [ 91.566147] [<c0423297>] (panic) from [<c0134bb1>] > (ext4_handle_error+0x69/0x74) > [ 91.566659] [<c0134bb1>] (ext4_handle_error) from [<c0135437>] > (__ext4_grp_locked_error+0x6b/0x160) > [ 91.567223] [<c0135437>] (__ext4_grp_locked_error) from > [<c0143079>] (ext4_mb_generate_buddy+0x1b1/0x29c) > [ 91.567860] [<c0143079>] (ext4_mb_generate_buddy) from [<c01447e5>] > (ext4_mb_init_cache+0x219/0x4e0) > [ 91.568473] [<c01447e5>] (ext4_mb_init_cache) from [<c0144b67>] > (ext4_mb_init_group+0xbb/0x138) > [ 91.569021] [<c0144b67>] (ext4_mb_init_group) from [<c0144cd7>] > (ext4_mb_good_group+0xf3/0xfc) > [ 91.569659] [<c0144cd7>] (ext4_mb_good_group) from [<c0145c8f>] > (ext4_mb_regular_allocator+0x153/0x2c4) > [ 91.570250] [<c0145c8f>] (ext4_mb_regular_allocator) from > [<c0148095>] (ext4_mb_new_blocks+0x2fd/0x4e4) > [ 91.570868] [<c0148095>] (ext4_mb_new_blocks) from [<c013f931>] > (ext4_ext_map_blocks+0x965/0x10bc) > [ 91.571444] [<c013f931>] (ext4_ext_map_blocks) from [<c0122c8b>] > (ext4_map_blocks+0xfb/0x36c) > [ 91.571992] [<c0122c8b>] (ext4_map_blocks) from [<c01263b1>] > (mpage_map_and_submit_extent+0x99/0x5f0) > [ 91.572614] [<c01263b1>] (mpage_map_and_submit_extent) from > [<c0126bc1>] (ext4_writepages+0x2b9/0x4e8) > [ 91.573201] [<c0126bc1>] (ext4_writepages) from [<c0094ae9>] > (do_writepages+0x19/0x28) > [ 91.573709] [<c0094ae9>] (do_writepages) from [<c008c811>] > (__filemap_fdatawrite_range+0x3d/0x44) > [ 91.574265] [<c008c811>] (__filemap_fdatawrite_range) from > [<c008c883>] (filemap_flush+0x23/0x28) > [ 91.574854] [<c008c883>] (filemap_flush) from [<c012bf75>] > (ext4_rename+0x2f9/0x3e4) > [ 91.575360] [<c012bf75>] (ext4_rename) from [<c00c3363>] > (vfs_rename+0x183/0x45c) > [ 91.575911] [<c00c3363>] (vfs_rename) from [<c00c3867>] > (SyS_renameat2+0x22b/0x26c) > [ 91.576460] [<c00c3867>] (SyS_renameat2) from [<c00c38df>] > (SyS_rename+0x1f/0x24) > [ 91.576961] [<c00c38df>] (SyS_rename) from [<c000cd01>] > (ret_fast_syscall+0x1/0x5c) > > > Bisecting this points to: e7f3d22289e4307b3071cc18b1d8ecc6598c0be4 > (mmc: mmci: Handle CMD irq before DATA irq). Which I guess shouldn't > be surprising, as I saw problems with that patch earlier in the > 3.15-rc cycle: > https://lkml.org/lkml/2014/4/14/824 > > However that discussion petered out (possibly my fault for not > following up) as to if it was an issue with the patch or a issue with > qemu. Then the original issue disappeared for me, which I figured was > due to a fix upstream, but now I'm guessing coincided with me updating > my system and getting qemu v2.0 (where as previously I was on 1.5). > > $ qemu-system-arm -version > QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.1), Copyright > (c) 2003-2008 Fabrice Bellard > > While the previous behavior was annoying and kept my emulated > environments from booting, this while a bit more rare and subtle eats > the disks, which is much more painful for my testing. > > Unfortunately reverting the change (manually, as it doesn't revert > cleanly anymore) doesn't seem to completely avoid the issue, so the > bisection may have gone slightly astray (though it is interesting it > landed on the same commit I earlier had trouble with). So I'll > back-track and double check some of the last few "good" results to > validate I didn't just luck into 3 good boots accidentally. I'll also > review my revert in case I missed something subtle in doing it > manually. > > Anyway, if there is any thoughts on how to better chase this down and > debug it, I'd appreciate it! I can also provide reproduction > instructions with a pre-built Linaro android disk image and hand built > kernel if anyone wants to debug this themselves.
According to your log, the primecell-periphid is 0x00041181, which means mmci will be using the arm_variant. A simple fix; for the arm_variant, go back to use the old behaviour. A quite simple fix; Invent a new primecell-periphid and a new corresponding variant and use the old behaviour for this variant. The new primecell-periphid then needs to be provided through DT for the QEMU dtb. Is there any of the above solution you see as the preferred one? Kind regards Uffe -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/