Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

Andrew Morton Mon, 14 Jan 2008 16:35:16 -0800

On Mon, 14 Jan 2008 08:39:01 -0500
Abhishek Rai <[EMAIL PROTECTED]> wrote:


> This is the patch for 2.6.24-rc6 -mm tree, please let me know if anyone would 
> like a patch against another recent kernel. Ingo Molnar has already posted a 
> patch for 2.6.24-rc7.
> 

Please always retain and maintain the full changelog with each version of a
patch.  The changelog in your earlier "Clustering indirect blocks in Ext3"
patch was good.  I retained that (after renaming it to "ext3: indirect
block clustering" rather than this dopey name) and then, err, dropped the
patch ;)

Please cc linux-ext4 on ext2/3/4-related work.

Please update Documentation/filesystems/ext3.txt when adding new mount
options.

Here's a quick low-levelish nitpickingish implementation review.  I haven't
thought about the design-level things yet.

This code will need careful checking regarding the journalling side of
things - to verify that if the machine crashes at any stage, recovery will
produce a consistent and secure fs.  It is all-nigh impossible to spot bugs
in this area via the usual kernel bug reporting process and it is very hard
to effectively runtime test recovery even in a development situation. 
Careful review unusually important here.

> +/
> + * Count number of free blocks in a block group that don't lie in the
> + * metacluster region of the block group.
> + */
> +static void
> +ext3_init_grp_free_nonmc_blocks(struct super_block *sb,
> +                             struct buffer_head *bitmap_bh,
> +                             unsigned long block_group)
> +{
> +     struct ext3_sb_info *sbi = EXT3_SB(sb);
> +     struct ext3_bg_info *bgi = &sbi->s_bginfo[block_group];
> +
> +     BUG_ON(!test_opt(sb, METACLUSTER));
> +
> +     spin_lock(sb_bgl_lock(sbi, block_group));
> +     if (bgi->bgi_free_nonmc_blocks_count >= 0)
> +             goto out;
> +
> +     bgi->bgi_free_nonmc_blocks_count =
> +             ext3_count_free(bitmap_bh, sbi->s_nonmc_blocks_per_group/8);
> +
> +out:
> +     spin_unlock(sb_bgl_lock(sbi, block_group));
> +     BUG_ON(bgi->bgi_free_nonmc_blocks_count >
> +             sbi->s_nonmc_blocks_per_group);
> +}

hm, so ext3_count_free() hits prime time.

ext3_count_free() should be converted to use the standard hweight8(). 
Really it should be converted to hweight_long() or hweight32() or something
- it is presently probably inefficient.

This function appears to be called frequently and it calls the slow-looking
ext3_count_free() under spinlock.  This might have scalability problems on
the right machine running the right workload.

Can we address that before we hit it?

> +/*
> + * ext3_update_nonmc_block_count:
> + *   Update bgi_free_nonmc_blocks_count for block group 'group_no' following
> + *   an allocation or deallocation.
> + *
> + *   @group_no:      affected block group
> + *   @start:         start of the [de]allocated range
> + *   @count:         number of blocks [de]allocated
> + *   @allocation:    1 if blocks were allocated, 0 otherwise.
> + */
> +static inline void
> +ext3_update_nonmc_block_count(struct ext3_sb_info *sbi, unsigned long 
> group_no,
> +                             ext3_grpblk_t start, unsigned long count,
> +                             int allocation)
> +{
> +     struct ext3_bg_info *bginfo = &sbi->s_bginfo[group_no];
> +     ext3_grpblk_t change;
> +
> +     BUG_ON(bginfo->bgi_free_nonmc_blocks_count < 0);
> +     BUG_ON(start >= sbi->s_nonmc_blocks_per_group);
> +
> +     change = min_t(ext3_grpblk_t, start + count,
> +                     sbi->s_nonmc_blocks_per_group) - start;
> +
> +     spin_lock(sb_bgl_lock(sbi, group_no));
> +     BUG_ON(bginfo->bgi_free_nonmc_blocks_count >
> +             sbi->s_nonmc_blocks_per_group);
> +     BUG_ON(allocation && bginfo->bgi_free_nonmc_blocks_count < change);
> +
> +     bginfo->bgi_free_nonmc_blocks_count += (allocation ? -change : change);
> +
> +     BUG_ON(bginfo->bgi_free_nonmc_blocks_count >
> +             sbi->s_nonmc_blocks_per_group);
> +     spin_unlock(sb_bgl_lock(sbi, group_no));
> +}

Far too large to inline.  Please just remove all inlines from the patch. 
The compiler will sort it out.

> +static ext3_grpblk_t
> +bitmap_find_prev_zero_bit(char *map, ext3_grpblk_t start, ext3_grpblk_t 
> lowest)
> +{
> +     ext3_grpblk_t k, blk;
> +
> +     k = start & ~7;
> +     while (lowest <= k) {
> +             if (map[k/8] != '\255' &&
> +                     (blk = ext3_find_next_zero_bit(map, k + 8, k))
> +                      < (k + 8))
> +                             return blk;
> +
> +             k -= 8;
> +     }
> +     return -1;
> +}

Please rename `k' to something meaningful.

The logic in here looks odd.  If (map[k/8] != 0xff) then the
ext3_find_next_zero_bit() cannot fail.  So why do we test for it in this
fashion?

Perhaps some commnetary is needed here.

> +static ext3_grpblk_t
> +bitmap_search_prev_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
> +                                     ext3_grpblk_t lowest)
> +{
> +     ext3_grpblk_t next;
> +     struct journal_head *jh = bh2jh(bh);
> +
> +     /*
> +      * The bitmap search --- search backward alternately through the actual
> +      * bitmap and the last-committed copy until we find a bit free in
> +      * both
> +      */
> +     while (start >= lowest) {

Little style nit: in bitmap_find_prev_zero_bit() you used (lowest < k)
which is a little surprising but I assumeed that you were following the (a
< b < c) layout.  But here you're not doing that.


> +             next = bitmap_find_prev_zero_bit(bh->b_data, start, lowest);
> +             if (next < lowest)
> +                     return -1;
> +             if (ext3_test_allocatable(next, bh))
> +                     return next;
> +             jbd_lock_bh_state(bh);
> +             if (jh->b_committed_data)
> +                     start = bitmap_find_prev_zero_bit(jh->b_committed_data,
> +                                                             next, lowest);
> +             jbd_unlock_bh_state(bh);
> +     }
> +     return -1;
> +}
>
> ...
>
> +int ext3_alloc_indirect_blocks(struct super_block *sb,
> +                     struct buffer_head *bitmap_bh,
> +                     struct ext3_group_desc *gdp,
> +                     int group_no, unsigned long indirect_blks,
> +                     ext3_fsblk_t new_blocks[])
> +{
> +     struct ext3_bg_info *bgi = &EXT3_SB(sb)->s_bginfo[group_no];
> +     ext3_grpblk_t blk = EXT3_BLOCKS_PER_GROUP(sb) - 1;
> +     ext3_grpblk_t mc_start = EXT3_SB(sb)->s_nonmc_blocks_per_group;
> +     ext3_fsblk_t group_first_block;
> +     int allocated = 0;
> +
> +     BUG_ON(!test_opt(sb, METACLUSTER));
> +
> +     /* This check is racy but that wouldn't harm us. */
> +     if (bgi->bgi_free_nonmc_blocks_count >=
> +             le16_to_cpu(gdp->bg_free_blocks_count))
> +             return 0;
> +
> +     group_first_block = ext3_group_first_block_no(sb, group_no);
> +     while (allocated < indirect_blks && blk >= mc_start) {
> +             if (!ext3_test_allocatable(blk, bitmap_bh)) {
> +                     blk = bitmap_search_prev_usable_block(blk, bitmap_bh,
> +                                                             mc_start);
> +                     continue;
> +             }
> +             if (claim_block(sb_bgl_lock(EXT3_SB(sb), group_no), blk,
> +                             bitmap_bh)) {
> +                     new_blocks[allocated++] = group_first_block + blk;
> +             } else {
> +                     /*
> +                      * The block was allocated by another thread, or it
> +                      * was allocated and then freed by another thread
> +                      */
> +                     cpu_relax();

Whoa.  The first ever use of cpu_relax in a filesystem (apart from sysfs,
but that's hardy endorsement).

What are you trying to do here?  Why was this needed?

> +             }
> +             if (allocated < indirect_blks)
> +                     blk = bitmap_search_prev_usable_block(blk, bitmap_bh,
> +                                                             mc_start);
> +     }
> +     return allocated;
> +}
> +
> +/*
> + * check_allocated_blocks:
> + *   Helper function for ext3_new_blocks. Checks newly allocated block
> + *   numbers.
> + */
> +int check_allocated_blocks(ext3_fsblk_t blk, unsigned long num,
> +                             struct super_block *sb, int group_no,
> +                             struct ext3_group_desc *gdp,
> +                             struct buffer_head *bitmap_bh)
> +{
> +     struct ext3_super_block *es = EXT3_SB(sb)->s_es;
> +     struct ext3_sb_info *sbi = EXT3_SB(sb);
> +     ext3_fsblk_t grp_blk = blk - ext3_group_first_block_no(sb, group_no);
> +
> +     if (in_range(le32_to_cpu(gdp->bg_block_bitmap), blk, num) ||
> +             in_range(le32_to_cpu(gdp->bg_inode_bitmap), blk, num) ||
> +             in_range(blk, le32_to_cpu(gdp->bg_inode_table),
> +                             EXT3_SB(sb)->s_itb_per_group) ||
> +             in_range(blk + num - 1, le32_to_cpu(gdp->bg_inode_table),
> +                             EXT3_SB(sb)->s_itb_per_group)) {
> +             ext3_error(sb, "ext3_new_blocks",
> +                             "Allocating block in system zone - "
> +                             "blocks from "E3FSBLK", length %lu",
> +                             blk, num);
> +             return 1;
> +     }
> +
> +#ifdef CONFIG_JBD_DEBUG
> +     {
> +             struct buffer_head *debug_bh;
> +
> +             /* Record bitmap buffer state in the newly allocated block */
> +             debug_bh = sb_find_get_block(sb, blk);
> +             if (debug_bh) {
> +                     BUFFER_TRACE(debug_bh, "state when allocated");
> +                     BUFFER_TRACE2(debug_bh, bitmap_bh, "bitmap state");
> +                     brelse(debug_bh);
> +             }
> +     }
> +     jbd_lock_bh_state(bitmap_bh);
> +     spin_lock(sb_bgl_lock(sbi, group_no));
> +     if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) {
> +             int i;
> +
> +             for (i = 0; i < num; i++) {
> +                     if (ext3_test_bit(grp_blk+i,
> +                                     bh2jh(bitmap_bh)->b_committed_data))
> +                             printk(KERN_ERR "%s: block was unexpectedly set"
> +                                     " in b_committed_data\n", __FUNCTION__);
> +             }
> +     }
> +     ext3_debug("found bit %d\n", grp_blk);
> +     spin_unlock(sb_bgl_lock(sbi, group_no));
> +     jbd_unlock_bh_state(bitmap_bh);
> +#endif

Has the CONFIG_JBD_DEBUG=y code been tested?

> +
> +     if (blk + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
> +             ext3_error(sb, "ext3_new_blocks",
> +                             "block("E3FSBLK") >= blocks count(%d) - "
> +                             "block_group = %d, es == %p ", blk,
> +                             le32_to_cpu(es->s_blocks_count), group_no, es);
> +             return 1;
> +     }
> +
> +     return 0;
> +}
>
> ...
>
> +                             - (indirect_blks_done + grp_mc_alloc);
>               grp_alloc_blk = ext3_try_to_allocate_with_rsv(sb, handle,
> -                                     group_no, bitmap_bh, -1, my_rsv,
> -                                     &num, &fatal);
> +                                     group_no, bitmap_bh, grp_target_blk,
> +                                     my_rsv, &grp_alloc);
> +             if (grp_alloc_blk < 0)
> +                     grp_alloc = 0;
> +
> +             /*
> +              * If we couldn't allocate anything, there is nothing more to
> +              * do with this block group, so move over to the next. But
> +              * before that We must release write access to the old one via
> +              * ext3_journal_release_buffer(), else we'll run out of credits.
> +              */
> +             if (grp_mc_alloc == 0 && grp_alloc == 0) {
> +                     BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
> +                     ext3_journal_release_buffer(handle, bitmap_bh);
> +                     goto next;
> +             }
> +
> +             BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for "
> +                                     "bitmap block");
> +             fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
>               if (fatal)
>                       goto out;
> -             if (grp_alloc_blk >= 0)
> +
> +             ext3_debug("using block group %d(%d)\n",
> +                             group_no, gdp->bg_free_blocks_count);
> +
> +             BUFFER_TRACE(gdp_bh, "get_write_access");
> +             fatal = ext3_journal_get_write_access(handle, gdp_bh);
> +             if (fatal)
> +                     goto out;
> +
> +             /* Should this be called before ext3_journal_dirty_metadata? */

If you're concerned here I'd suggest that you formulate a question for the
ext3/4 maintainers to think about.


> +             for (i = 0; i < grp_mc_alloc; i++) {
> +                     if (check_allocated_blocks(
> +                             new_blocks[indirect_blks_done + i], 1, sb,
> +                             group_no, gdp, bitmap_bh))
> +                             goto out;
> +             }
> +             if (grp_alloc > 0) {
> +                     ret_block = ext3_group_first_block_no(sb, group_no) +
> +                             grp_alloc_blk;
> +                     if (check_allocated_blocks(ret_block, grp_alloc, sb,
> +                                             group_no, gdp, bitmap_bh))
> +                             goto out;
> +             }
> +
> +             indirect_blks_done += grp_mc_alloc;
> +             performed_allocation = 1;
> +
> +             /* The caller will add the new buffer to the journal. */
> +             if (grp_alloc > 0)
> +                     ext3_debug("allocating block %lu. "
> +                                     "Goal hits %d of %d.\n",
> +                                     ret_block, goal_hits, goal_attempts);
> +
> +             spin_lock(sb_bgl_lock(sbi, group_no));
> +             gdp->bg_free_blocks_count =
> +                     cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) -
> +                                     (grp_mc_alloc + grp_alloc));
> +             spin_unlock(sb_bgl_lock(sbi, group_no));
> +             percpu_counter_sub(&sbi->s_freeblocks_counter,
> +                             (grp_mc_alloc + grp_alloc));
> +
> +             BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for "
> +                             "group descriptor");
> +             err = ext3_journal_dirty_metadata(handle, gdp_bh);
> +             if (!fatal)
> +                     fatal = err;
> +
> +             sb->s_dirt = 1;
> +             if (fatal)
> +                     goto out;
> +
> +             brelse(bitmap_bh);
> +             bitmap_bh = NULL;
> +
>
> ...
>
> +
> +/*
> + * ext3_ind_read_end_bio --
> + *
> + *   bio callback for read IO issued from ext3_read_indblocks.
> + *   May be called multiple times until the whole I/O completes at
> + *   which point bio->bi_size = 0 and it frees read_info and bio.
> + *   The first time it is called, first_bh is unlocked so that any sync
> + *   waier can unblock.

typo.

> + */
> +static void ext3_ind_read_end_bio(struct bio *bio, int err)
> +{
> +     struct ext3_ind_read_info *read_info = bio->bi_private;
> +     struct buffer_head *bh;
> +     int uptodate = !err && test_bit(BIO_UPTODATE, &bio->bi_flags);

It's pretty regrettable that ext3 now starts using bios directly.  All of
jbd and ext3 has thus far avoided this additional coupling.

Why was this necessary?

> +     int i;
> +
> +     if (err == -EOPNOTSUPP)
> +             set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
> +
> +     /* Wait for all buffers to finish - is this needed? */
> +     if (bio->bi_size)
> +             return;
> +
> +     for (i = 0; i < read_info->count; i++) {
> +             bh = read_info->bh[i];
> +             if (err == -EOPNOTSUPP)
> +                     set_bit(BH_Eopnotsupp, &bh->b_state);
> +
> +             if (uptodate) {
> +                     BUG_ON(buffer_uptodate(bh));
> +                     BUG_ON(ext3_buffer_prefetch(bh));
> +                     set_buffer_uptodate(bh);
> +                     if (read_info->seq_prefetch)
> +                             ext3_set_buffer_prefetch(bh);
> +             }
> +
> +             unlock_buffer(bh);
> +             brelse(bh);
> +     }
> +
> +     kfree(read_info);
> +     bio_put(bio);
> +}
> +
> +/*
> + * ext3_get_max_read --
> + *   @inode: inode of file.
> + *   @block: block number in file (starting from zero).
> + *   @offset_in_dind_block: offset of the indirect block inside it's
> + *   parent doubly-indirect block.
> + *
> + *      Compute the maximum no. of indirect blocks that can be read
> + *      satisfying following constraints:
> + *              - Don't read indirect blocks beyond the end of current
> + *              doubly-indirect block.
> + *              - Don't read beyond eof.
> + */
> +static inline unsigned long ext3_get_max_read(const struct inode *inode,
> +                                               int block,
> +                                               int offset_in_dind_block)

This is far too large to be inlined.

> +{
> +     const struct super_block *sb = inode->i_sb;
> +     unsigned long max_read;
> +     unsigned long ptrs = EXT3_ADDR_PER_BLOCK(inode->i_sb);
> +     unsigned long ptrs_bits = EXT3_ADDR_PER_BLOCK_BITS(inode->i_sb);
> +     unsigned long blocks_in_file =
> +             (inode->i_size + sb->s_blocksize - 1) >> sb->s_blocksize_bits;
> +     unsigned long remaining_ind_blks_in_dind =
> +             (ptrs >= offset_in_dind_block) ? (ptrs - offset_in_dind_block)
> +                                            : 0;
> +     unsigned long remaining_ind_blks_before_eof =
> +             ((blocks_in_file - EXT3_NDIR_BLOCKS + ptrs - 1) >> ptrs_bits) -
> +             ((block - EXT3_NDIR_BLOCKS) >> ptrs_bits);
> +
> +     BUG_ON(block >= blocks_in_file);
> +
> +     max_read = min_t(unsigned long, remaining_ind_blks_in_dind,
> +                      remaining_ind_blks_before_eof);

Can use plain old min() here.

> +     BUG_ON(max_read < 1);

Please prefer to use the (much) friendlier ext3_error() over the
user-hostile BUG.  Where it makes sense.

This is especially the case where the assertion could be triggered by
corrupted disk contents - doing a BUG() in response to that is a coding
bug.

> +     return max_read;
> +}
> +
> +static void ext3_read_indblocks_submit(struct bio **pbio,
> +                                     struct ext3_ind_read_info **pread_info,
> +                                     int *read_cnt, int seq_prefetch)
> +{
> +     struct bio *bio = *pbio;
> +     struct ext3_ind_read_info *read_info = *pread_info;
> +
> +     BUG_ON(*read_cnt < 1);
> +
> +     read_info->seq_prefetch = seq_prefetch;
> +     read_info->count = *read_cnt;
> +     read_info->size = bio->bi_size;
> +     bio->bi_private = read_info;
> +     bio->bi_end_io = ext3_ind_read_end_bio;
> +     submit_bio(READ, bio);
> +
> +     *pbio = NULL;
> +     *pread_info = NULL;
> +     *read_cnt = 0;
> +}
> +
> +struct ind_block_info {
> +     ext3_fsblk_t            blockno;
> +     struct buffer_head      *bh;
> +};
> +
> +static int ind_info_cmp(const void *a, const void *b)
> +{
> +     struct ind_block_info *info_a = (struct ind_block_info *)a;
> +     struct ind_block_info *info_b = (struct ind_block_info *)b;
> +
> +     return info_a->blockno - info_b->blockno;
> +}
> +
> +static void ind_info_swap(void *a, void *b, int size)
> +{
> +     struct ind_block_info *info_a = (struct ind_block_info *)a;
> +     struct ind_block_info *info_b = (struct ind_block_info *)b;
> +     struct ind_block_info tmp;
> +
> +     tmp = *info_a;
> +     *info_a = *info_b;
> +     *info_b = tmp;
> +}

Unneeded (and undesirable) casts of void*.

> +/*
> + * ext3_read_indblocks_async --
> + *      @sb:            super block
> + *      @ind_blocks[]:  array of indirect block numbers on disk
> + *      @count:         maximum number of indirect blocks to read
> + *      @first_bh:      buffer_head for indirect block ind_blocks[0], may be
> + *                      NULL
> + *      @seq_prefetch:  if this is part of a sequential prefetch and buffers'
> + *                      prefetch bit must be set.
> + *      @blocks_done:   number of blocks considered for prefetching.
> + *
> + *      Issue a single bio request to read upto count buffers identified in
> + *      ind_blocks[]. Fewer than count buffers may be read in some cases:
> + *      - If a buffer is found to be uptodate and it's prefetch bit is set, 
> we
> + *      don't look at any more buffers as they will most likely be in the 
> cache.
> + *      - We skip buffers we cannot lock without blocking (except for 
> first_bh
> + *      if specified).
> + *      - We skip buffers beyond a certain range on disk.
> + *
> + *      This function must issue read on first_bh if specified unless of 
> course
> + *      it's already uptodate.
> + */

This comment is almost-but-not-quite in kerneldoc format.  Please review
the /** comments in ext3 and convert newly-added commetns to match.

> +static int ext3_read_indblocks_async(struct super_block *sb,
> +                                  const __le32 ind_blocks[], int count,
> +                                  struct buffer_head *first_bh,
> +                                  int seq_prefetch,
> +                                  unsigned long *blocks_done)
> +{
> +     struct buffer_head *bh;
> +     struct bio *bio = NULL;
> +     struct ext3_ind_read_info *read_info = NULL;
> +     int read_cnt = 0, blk;
> +     ext3_fsblk_t prev_blk = 0, io_start_blk = 0, curr;
> +     struct ind_block_info *ind_info = NULL;
> +     int err = 0, ind_info_count = 0;
> +
> +     BUG_ON(count < 1);
> +     /* Don't move this to ext3_get_max_read() since callers often need to
> +      * trim the count returned by that function. So this bound must only
> +      * be imposed at the last moment. */
> +     count = min_t(unsigned long, count, EXT3_IND_READ_MAX);

Both count and EXT3_IND_READ_MAX are plain old int.  Forcing them to ulong
for this comparison is a little odd.

Please check whether those two things have appropriate types - can they
ever legitimately go negative?  I doubt it, in which case they should have
unsigned types.

Please review all newly-added min_t's for these things.

> +     *blocks_done = 0UL;
> +
> +     if (count == 1 && first_bh) {

This comparison could do with a comment explaining what's happening?

> +             lock_buffer(first_bh);
> +             get_bh(first_bh);
> +             first_bh->b_end_io = end_buffer_read_sync;
> +             submit_bh(READ, first_bh);
> +             *blocks_done = 1UL;
> +             return 0;
> +     }
> +
> +     ind_info = kmalloc(count * sizeof(*ind_info), GFP_KERNEL);
> +     if (unlikely(!ind_info))
> +             return -ENOMEM;
> +
> +     /*
> +      * First pass: sort block numbers for all indirect blocks that we'll
> +      * read. This allows us to scan blocks in sequenial order during the
> +      * second pass which helps coalasce requests to contiguous blocks.
> +      * Since we sort block numbers here instead of assuming any specific
> +      * layout on the disk, we have some protection against different
> +      * indirect block layout strategies as long as they keep all indirect
> +      * blocks close by.
> +      */
> +     for (blk = 0; blk < count; blk++) {
> +             curr = le32_to_cpu(ind_blocks[blk]);
> +             if (!curr)
> +                     continue;
> +
> +             /*
> +              * Skip this block if it lies too far from blocks we have
> +              * already decided to read. "Too far" should typically indicate
> +              * lying on a different track on the disk. EXT3_IND_READ_MAX
> +              * seems reasonable for most disks.
> +              */
> +             if (io_start_blk > 0 &&
> +                     (max(io_start_blk, curr) - min(io_start_blk, curr) >=
> +                             EXT3_IND_READ_MAX))
> +                     continue;
> +
> +             if (blk == 0 && first_bh) {
> +                     bh = first_bh;
> +                     get_bh(first_bh);
> +             } else {
> +                     bh = sb_getblk(sb, curr);
> +                     if (unlikely(!bh)) {
> +                             err = -ENOMEM;
> +                             goto failure;
> +                     }
> +             }
> +
> +             if (buffer_uptodate(bh)) {
> +                     if (ext3_buffer_prefetch(bh)) {
> +                             brelse(bh);
> +                             break;
> +                     }
> +                     brelse(bh);

brelse() is an old-fashioned thing which checks for a NULL arg.  Unless you
really have a bh* which might legitimately be NULL, please prefer put_bh().


> +                     continue;
> +             }
> +
> +             if (io_start_blk == 0)
> +                     io_start_blk = curr;
> +
> +             ind_info[ind_info_count].blockno = curr;
> +             ind_info[ind_info_count].bh = bh;
> +             ind_info_count++;
> +     }
> +     *blocks_done = blk;
> +
> +     sort(ind_info, ind_info_count, sizeof(*ind_info),
> +             ind_info_cmp, ind_info_swap);
> +
> +     /* Second pass: compose bio requests and issue them. */
> +     for (blk = 0; blk < ind_info_count; blk++) {
> +             bh = ind_info[blk].bh;
> +             curr = ind_info[blk].blockno;
> +
> +             if (prev_blk > 0 && curr != prev_blk + 1) {
> +                     ext3_read_indblocks_submit(&bio, &read_info,
> +                                             &read_cnt, seq_prefetch);
> +                     prev_blk = 0;
> +             }
> +
> +             /* Lock the buffer without blocking, skipping any buffers
> +              * which would require us to block. first_bh when specified is
> +              * an exception as caller typically wants it to be read for
> +              * sure (e.g., ext3_read_indblocks_sync).
> +              */
> +             if (bh == first_bh) {
> +                     lock_buffer(bh);
> +             } else if (test_set_buffer_locked(bh)) {
> +                     brelse(bh);
> +                     continue;
> +             }
> +
>
> ...
>
> +
> +     kfree(ind_info);
> +     return 0;
> +
> +failure:
> +     while (--read_cnt >= 0) {
> +             unlock_buffer(read_info->bh[read_cnt]);
> +             brelse(read_info->bh[read_cnt]);
> +     }
> +     *blocks_done = 0UL;
> +
> +done:
> +     kfree(read_info);
> +
> +     if (bio)
> +             bio_put(bio);
> +
> +     kfree(ind_info);
> +     return err;
> +}

Again, I really don't see why we needed to dive into the BIO layer for
this.  Why not use submit_bh() and friends for all of this?

Also, why is it necessary to do block sorting at the filesystem level?  The
block layer does that.

> +                                                     NULL, 1, &blocks_done);
> +
> +                             goto done;
> +                     }
> +                     brelse(prev_bh);
> +             }
> +
> +             /* Either random read, or sequential detection failed above.
> +              * We always prefetch the next indirect block in this case
> +              * whenever possible.
> +              * This is because for random reads of size ~512KB, there is
> +              * >12% chance that a read will span two indirect blocks.
> +              */
> +             *err = ext3_read_indblocks_sync(sb, ind_blocks,
> +                                             (max_read >= 2) ? 2 : 1,
> +                                             first_bh, 0, &blocks_done);
> +             if (*err)
> +                     goto out;
> +     }
> +
> +done:
> +     /* Reader: pointers */
> +     if (!verify_chain(chain, &chain[depth - 2])) {
> +             brelse(first_bh);
> +             goto changed;
> +     }
> +     add_chain(&chain[depth - 1], first_bh,
> +               (__le32 *)first_bh->b_data + offsets[depth - 1]);
> +     /* Reader: end */
> +     if (!chain[depth - 1].key)
> +             goto out;
> +
> +     BUG_ON(!buffer_uptodate(first_bh));
> +     return NULL;
> +
> +changed:
> +     *err = -EAGAIN;
> +     goto out;
> +failure:
> +     *err = -EIO;
> +out:
> +     if (*err) {
> +             ext3_debug("Error %d reading indirect blocks\n", *err);
> +             return &chain[depth - 2];
> +     } else
> +             return &chain[depth - 1];
> +}

I'm wondering about the interaction between this code and the
buffer_boundary() logic.  I guess we should disable the buffer_boundary()
handling when this code is in effect.  Have you reviewed and tested that
aspect?


> @@ -1692,6 +1699,13 @@ static int ext3_fill_super (struct super
>       }
>       sbi->s_frags_per_block = 1;
>       sbi->s_blocks_per_group = le32_to_cpu(es->s_blocks_per_group);
> +     if (test_opt(sb, METACLUSTER)) {
> +             sbi->s_nonmc_blocks_per_group = sbi->s_blocks_per_group -
> +                     sbi->s_blocks_per_group / 12;
> +             sbi->s_nonmc_blocks_per_group &= ~7;
> +     } else
> +             sbi->s_nonmc_blocks_per_group = sbi->s_blocks_per_group;
> +

hm, magic numbers.

> +             sbi->s_bginfo = kmalloc(sbi->s_groups_count *
> +                                     sizeof(*sbi->s_bginfo), GFP_KERNEL);
> +             if (!sbi->s_bginfo) {
> +                     printk(KERN_ERR "EXT3-fs: not enough memory\n");
> +                     goto failed_mount3;
> +             }
> +             for (i = 0; i < sbi->s_groups_count; i++)
> +                     sbi->s_bginfo[i].bgi_free_nonmc_blocks_count = -1;
> +     } else
> +             sbi->s_bginfo = NULL;
> +
>       /*
>        * set up enough so that it can read an inode
>        */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

Reply via email to