On Tue, Apr 22, 2014 at 11:19:32AM +0530, Amit Sahrawat wrote:
> Hi Darrick,
>
> Thanks for the reply, sorry for responding late.
>
> On Wed, Apr 16, 2014 at 11:16 PM, Darrick J. Wong
> <darrick.w...@oracle.com> wrote:
> > On Wed, Apr 16, 2014 at 01:21:34PM +0530, Amit Sahrawat wrote:
> >> Sorry Ted, if it caused the confusion.
> >>
> >> There were actually 2 parts to the problem, the logs in the first mail
> >> were from the original situation – where in there were many block
> >> groups and error prints also showed that.
> >>
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
> >> clusters in bitmap, 58339 in gd
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
> >> clusters in bitmap, 3 in gd
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
> >> clusters in bitmap, 1 in gd
> >> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> >> a risk of filesystem corruption in case of system crash.
> >> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> >> a risk of filesystem corruption in case of system crash.
> >>
> >> 1) Original case – when the disk got corrupted and we only had the
> >> logs and the hung task messages. But not the HDD on which issue was
> >> observed.
> >> 2) In order to reproduce the problem as was coming through the logs
> >> (which highlighted the problem in the bitmap corruption). To minimize
> >> the environment and make a proper case, we created a smaller partition
> >> size and with only 2 groups. And intentionally corrupted the group 1
> >> (our intention was just to replicate the error scenario).
> >
> > I'm assuming that the original broken fs simply had a corrupt block bitmap,
> > and that the dd thing was just to simulate that corruption in a testing
> > environment?
>
> Yes, we did so in order to replicate the error scenario.
>
> >
> >> 3) After corruption we used ‘fsstress’ - we got the similar problem
> >> as was coming the original logs. – We shared our analysis after this
> >> point for looping in the writepages part the free blocks mismatch.
> >
> > Hm. I tried it with 3.15-rc1 and didn't see any hangs. Corrupt bitmaps
> > shut down allocations from the block group and the FS continues, as
> > expected.
> >
> We are using kernel version 3.8, so cannot switch to 3.15-rc1. It is a
> limitation currently.
>
> >> 4) We came across ‘Darrick’ patches(in which it also mentioned about
> >> how to corrupt to reproduce the problem) and applied on our
> >> environment. It solved the initial problem about the looping in
> >> writepages, but now we got hangs at other places.
> >
> > There are hundreds of Darrick patches ... to which one are you referring? :)
> > (What was the subject line?)
> >
> ext4: error out if verifying the block bitmap fails
> ext4: fix type declaration of ext4_validate_block_bitmap
> ext4: mark block group as corrupt on block bitmap error
> ext4: mark block group as corrupt on inode bitmap error
> ext4: mark group corrupt on group descriptor checksum
> ext4: don't count free clusters from a corrupt block group

Ok, thank you for clarifying. :)

> So, the patches helps in marking the block group as corrupt and avoids
> further allocation. But when we consider the normal write path using
> write_begin. Since, there is mismatch between the free cluster count
> from the group descriptor and the bitmap. In that case it marks the
> pages dirty by copying dirty but later it get ENOSPC from the
> writepages when it actually does the allocation.
>
> So, our doubt is if we are marking the block group as corrupt, we
> should also subtract the block group count from the
> s_freeclusters_counter. This will make sure we have the valid
> freecluster count and error ‘ENOSPC’ can be returned from the
> write_begin, instead of propagating such paths till the writepages.
>
> We made change like this:
>
> @@ -737,14 +737,18 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>         grp->bb_fragments = fragments;
>
>         if (free != grp->bb_free) {
> +               struct ext4_sb_info *sbi = EXT4_SB(sb);
>                 ext4_grp_locked_error(sb, group, 0, 0,
>                                       "%u clusters in bitmap, %u in gd; "
>                                       "block bitmap corrupt.",
>                                       free, grp->bb_free);
>                 /*
>                  * If we intend to continue, we consider group descriptor
>                  * corrupt and update bb_free using bitmap value
>                  */
> +               percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free);
>                 grp->bb_free = free;
>                 set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
>         }
>         mb_set_largest_free_order(sb, grp);
>
> Is this the correct method? Or are missing something in this? Please
> share your opinion.

I think this looks ok. If you send a proper patch doing this to the mailing
list, I'll officially review it.

> >> Using ‘tune2fs’ is not a viable solution in our case, we can only
> >> provide the solution via. the kernel changes. So, we made the changes
> >> as shared earlier.
> >
> > Would it help if you could set errors=remount-ro in mke2fs?
> >
> Sorry, we cannot reformat or use tune2fs to change the ‘errors’ value.

I apologize, my question was unclear; what I meant to ask is, would it have
been helpful if you could have set errors=remount-ro back when you ran mke2fs?

Now that the format's been done, I suppose the only recourse is mount -o
remount,errors=remount-ro (online) or tune2fs (offline).

--D

>
> > --D
> >> So the question isn't how the file system got corrupted, but that
> >> you'd prefer that the system recovers without hanging after this
> >> corruption.
> >> >> Yes, our priority is to keep the system running.
> >>
> >> Again, Sorry for the confusion. But the intention was just to show the
> >> original problem and what we did in order to replicate the problem.
> >>
> >> Thanks & Regards,
> >> Amit Sahrawat
> >>
> >>
> >> On Wed, Apr 16, 2014 at 10:37 AM, Theodore Ts'o <ty...@mit.edu> wrote:
> >> > On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
> >> >> 4) Corrupt the block group ‘1’ by writing all ‘1’, we had one file
> >> >> with all 1’s, so using ‘dd’ –
> >> >> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
> >> >> After this mount the partition – create few random size files and then
> >> >> ran ‘fsstress,
> >> >
> >> > Um, sigh. You didn't say that you were deliberately corrupting the
> >> > file system. That wasn't in the subject line, or anywhere else in the
> >> > original message.
> >> >
> >> > So the question isn't how the file system got corrupted, but that
> >> > you'd prefer that the system recovers without hanging after this
> >> > corruption.
> >> >
> >> > I wish you had *said* that. It would have saved me a lot of time,
> >> > since I was trying to figure out how the system had gotten so
> >> > corrupted (not realizing you had deliberately corrupted the file
> >> > system).
> >> >
> >> > So I think if you run "tune2fs -e remount-ro /dev/sdb1" before you
> >> > started the fsstress, the file system would have remounted the
> >> > filesystem read-only at the first EXT4-fs error message.
> >> > This would avoid the hang that you saw, since the file system would
> >> > hopefully "failed fast", before the user had the opportunity to put
> >> > data into the page cache that would be lost when the system discovered
> >> > there was no place to put the data.
> >> >
> >> > Regards,
> >> >
> >> > - Ted
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> >> the body of a message to majord...@vger.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
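
For reference, the accounting argument in this thread (subtract a corrupt
group's descriptor count from the global free-cluster counter so that ENOSPC
shows up at write_begin time rather than at writepages time) can be sketched
as a toy userspace model. This is only an illustration under stated
assumptions, not kernel code: every name below (toy_fs, toy_group,
toy_reserve, toy_allocate, toy_second_reserve) is hypothetical, a plain long
stands in for the per-cpu s_freeclusters_counter, and the reserve/allocate
split is a rough analogue of delayed allocation rather than the real ext4
paths.

```c
/* Toy userspace model of the free-cluster accounting; NOT kernel code.
 * All identifiers here are hypothetical and exist only for illustration. */
#include <assert.h>
#include <errno.h>

#define NGROUPS 2

struct toy_group {
    unsigned gd_free;      /* free clusters claimed by the group descriptor */
    unsigned bitmap_free;  /* free clusters actually present in the bitmap */
    int corrupt;           /* set once the mismatch has been detected */
};

struct toy_fs {
    struct toy_group grp[NGROUPS];
    long free_counter;        /* stands in for s_freeclusters_counter */
    int subtract_on_corrupt;  /* 1 = apply the change proposed in the thread */
};

static void toy_init(struct toy_fs *fs, int subtract_on_corrupt)
{
    /* Group 0 is full and consistent; group 1 claims 100 free clusters
     * in its descriptor, but its bitmap has been overwritten with ones. */
    fs->grp[0] = (struct toy_group){ .gd_free = 0, .bitmap_free = 0 };
    fs->grp[1] = (struct toy_group){ .gd_free = 100, .bitmap_free = 0 };
    fs->free_counter = 100;   /* sum of the descriptor counts */
    fs->subtract_on_corrupt = subtract_on_corrupt;
}

/* Rough analogue of the mismatch check in ext4_mb_generate_buddy(). */
static void toy_check_group(struct toy_fs *fs, int g)
{
    struct toy_group *grp = &fs->grp[g];

    if (grp->corrupt || grp->bitmap_free == grp->gd_free)
        return;
    if (fs->subtract_on_corrupt)
        fs->free_counter -= grp->gd_free;  /* the proposed subtraction */
    grp->gd_free = grp->bitmap_free;
    grp->corrupt = 1;
}

/* write_begin analogue: only consults the global counter. */
static int toy_reserve(const struct toy_fs *fs, unsigned n)
{
    return fs->free_counter >= (long)n ? 0 : -ENOSPC;
}

/* writepages analogue: needs real clusters from a non-corrupt group. */
static int toy_allocate(struct toy_fs *fs, unsigned n)
{
    for (int g = 0; g < NGROUPS; g++) {
        struct toy_group *grp = &fs->grp[g];

        toy_check_group(fs, g);
        if (!grp->corrupt && grp->bitmap_free >= n) {
            grp->bitmap_free -= n;
            grp->gd_free -= n;
            fs->free_counter -= n;
            return 0;
        }
    }
    return -ENOSPC;
}

/* Result of a second reservation made after the corruption was detected. */
static int toy_second_reserve(int apply_fix)
{
    struct toy_fs fs;

    toy_init(&fs, apply_fix);
    assert(toy_reserve(&fs, 10) == 0);          /* counter still says 100 */
    assert(toy_allocate(&fs, 10) == -ENOSPC);   /* detection happens here */
    return toy_reserve(&fs, 10);
}
```

In this model, toy_second_reserve(0) still succeeds, because the stale
counter keeps promising space that no usable group can supply, while
toy_second_reserve(1) fails with -ENOSPC at reservation time. That is the
behaviour the proposed percpu_counter_sub() call is aiming for; how closely
the real write_begin path follows it depends on kernel details not modelled
here.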