from:"Chris Mason"

Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Chris Mason

On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
>  wrote:
> >
> > But the fact that the code wants to do things like
> >
> > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using
> > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
> 
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> whole ilog2 thing, but at the same time, in other cases it actually
> seems to improve code generation (ie gets rid of the whole unnecessary
> two dereferences through page->mapping->host just to get the block
> size, when we have it in the buffer-head that we have to touch
> *anyway*).
> 
> Comments? Again, untested.
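
For reference, a minimal userspace sketch of the calculation being discussed: derive bbits from the buffer size with an ilog2-style loop, then shift the page index to get the first block of the page. MODEL_PAGE_SHIFT, ilog2_pow2() and first_block_of_page() are hypothetical names used only for illustration (the kernel has PAGE_CACHE_SHIFT and its own ilog2()), and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4096-byte pages, standing in for PAGE_CACHE_SHIFT. */
#define MODEL_PAGE_SHIFT 12

/* log2 of a power-of-two size, e.g. 1024 -> 10. */
static unsigned ilog2_pow2(uint32_t size)
{
    unsigned bits = 0;

    assert(size != 0 && (size & (size - 1)) == 0);
    while (size > 1) {
        size >>= 1;
        bits++;
    }
    return bits;
}

/* First block number covered by a page, given the buffer (block) size. */
static uint64_t first_block_of_page(uint64_t page_index, uint32_t block_size)
{
    unsigned bbits = ilog2_pow2(block_size);

    assert(bbits <= MODEL_PAGE_SHIFT);
    return page_index << (MODEL_PAGE_SHIFT - bbits);
}
```

With 1024-byte blocks, page index 3 starts at block 12; with 4096-byte blocks it starts at block 3.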

Jumping in based on Linus' original patch, which is doing something like
this:

set_blocksize() {
  block new calls to writepage, prepare/commit_write
  set the block size
  unblock

  < --- can race in here and find bad buffers --->

  sync_blockdev()
  kill_bdev()

  < --- now we're safe --- >
}

We could add a second semaphore and a page_mkwrite call:

set_blocksize() {

  block new calls to prepare/commit_write and page_mkwrite(), but
  leave writepage unblocked.

  sync_blockdev()

  <--- now we're safe.  There are no dirty pages and no ways to
  make new ones --->

  block new calls to readpage (writepage too for good luck?)

  kill_bdev()
  set the block size

  unblock readpage/writepage
  unblock prepare/commit_write and page_mkwrite

}
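
The ordering in the second outline can be modeled in userspace roughly as follows. This is only a single-threaded sketch that records which gates are closed at each step; struct bdev_model, its fields and the model_* helpers are made up for illustration and do not correspond to kernel interfaces or to real locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a block device, not kernel code. */
struct bdev_model {
    int block_size;
    int dirty_pages;
    bool modify_blocked;   /* prepare/commit_write and page_mkwrite held off */
    bool page_io_blocked;  /* readpage/writepage held off */
};

/* A writer can only dirty a page while modifications are not blocked. */
static bool model_dirty_page(struct bdev_model *b)
{
    if (b->modify_blocked)
        return false;
    b->dirty_pages++;
    return true;
}

/* Stand-in for sync_blockdev(): write back everything that is dirty. */
static void model_sync(struct bdev_model *b)
{
    b->dirty_pages = 0;
}

static int model_set_blocksize(struct bdev_model *b, int size)
{
    b->modify_blocked = true;     /* no new dirty pages from here on */
    model_sync(b);                /* writepage is still allowed to run */
    assert(b->dirty_pages == 0);  /* "now we're safe" */

    b->page_io_blocked = true;    /* block readpage/writepage */
    /* kill_bdev() would drop the old-size buffers here */
    b->block_size = size;

    b->page_io_blocked = false;   /* unblock in reverse order */
    b->modify_blocked = false;
    return 0;
}
```

The point of the model is the invariant checked in the middle: the block size only changes after the dirty count has reached zero with writers held off.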

Another way to look at things:

As Linus said in a different email, we don't need to drop the pages, just
the buffers.  Once we've blocked prepare/commit_write,
there is no way to make a partially up to date page with dirty data.
We may make fully uptodate dirty pages, but for those we can
just create dirty buffers for the whole page.

As long as we had prepare/commit_write blocked while we ran
sync_blockdev, we can blindly detach any buffers that are the wrong size
and just make new ones.
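
A rough userspace sketch of that "detach and recreate" idea, under the assumption stated above that a dirty page is always fully uptodate at this point. struct page_model, struct buffer_model and rebuild_stale_buffers() are hypothetical; real buffer heads are a linked ring hanging off the page, not an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_MIN_BLOCK 512u
#define MODEL_MAX_BUFFERS (MODEL_PAGE_SIZE / MODEL_MIN_BLOCK)

struct buffer_model {
    unsigned size;
    bool dirty;
};

struct page_model {
    bool dirty;
    bool uptodate;
    size_t nr_buffers;
    struct buffer_model buffers[MODEL_MAX_BUFFERS];
};

/*
 * If any buffer on the page has the wrong size, throw all of them away
 * and create new ones of block_size.  Returns true if buffers were rebuilt.
 */
static bool rebuild_stale_buffers(struct page_model *page, unsigned block_size)
{
    bool stale = false;
    size_t i;

    assert(block_size >= MODEL_MIN_BLOCK && block_size <= MODEL_PAGE_SIZE);
    assert((block_size & (block_size - 1)) == 0);

    for (i = 0; i < page->nr_buffers; i++)
        if (page->buffers[i].size != block_size)
            stale = true;
    if (!stale)
        return false;

    /* Only valid because partially uptodate dirty pages cannot exist here. */
    assert(!page->dirty || page->uptodate);

    page->nr_buffers = MODEL_PAGE_SIZE / block_size;
    for (i = 0; i < page->nr_buffers; i++) {
        page->buffers[i].size = block_size;
        /* A dirty page gets dirty buffers for the whole page. */
        page->buffers[i].dirty = page->dirty;
    }
    return true;
}
```

A dirty page carrying four 1024-byte buffers ends up with one dirty 4096-byte buffer; a second call with the same size does nothing.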

This may or may not apply to loop.c, I'd have to read that more
carefully.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Chris Mason

On Thu, Nov 29, 2012 at 07:12:49AM -0700, Chris Mason wrote:
> On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> > On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
> >  wrote:
> > >
> > > But the fact that the code wants to do things like
> > >
> > > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> > >
> > > seriously seems to be the main thing that keeps us using
> > > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > > enough to hurt (not everywhere, but on some machines).
> > >
> > > Very annoying.
> > 
> > Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> > whole ilog2 thing, but at the same time, in other cases it actually
> > seems to improve code generation (ie gets rid of the whole unnecessary
> > two dereferences through page->mapping->host just to get the block
> > size, when we have it in the buffer-head that we have to touch
> > *anyway*).
> > 
> > Comments? Again, untested.
> 
> Jumping in based on Linus original patch, which is doing something like
> this:
> 
> set_blocksize() {
>   block new calls to writepage, prepare/commit_write
>   set the block size
>   unblock
> 
>   < --- can race in here and find bad buffers --->
> 
>   sync_blockdev()
>   kill_bdev() 
>   
>   < --- now we're safe --- >
> }
> 
> We could add a second semaphore and a page_mkwrite call:
> 
> set_blocksize() {
> 
>   block new calls to prepare/commit_write and page_mkwrite(), but
>   leave writepage unblocked.
> 
>   sync_blockdev()
> 
>   <--- now we're safe.  There are no dirty pages and no ways to
>   make new ones ---> 
> 
>   block new calls to readpage (writepage too for good luck?)
> 
>   kill_bdev()

Whoops, kill_bdev needs the page lock, which sends us into ABBA when
readpage does the down_read.  So, slight modification: unblock
readpage/writepage before the kill_bdev.  The risk is that readpage can
then find buffers with the wrong size, so it would need to be changed to
discard them.

The patch below is based on Linus' original and doesn't deal with the
readpage race.  But it does get the rest of the idea across.  It boots
and survives banging on blockdev --setbsz with mkfs, but I definitely
wouldn't trust it.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..1377171 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -116,8 +116,6 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
-	struct address_space *mapping;
-
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -126,28 +124,40 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
-	/* Prevent starting I/O or mapping the device */
-	percpu_down_write(&bdev->bd_block_size_semaphore);
-
-	/* Check that the block device is not memory mapped */
-	mapping = bdev->bd_inode->i_mapping;
-	mutex_lock(&mapping->i_mmap_mutex);
-	if (mapping_mapped(mapping)) {
-		mutex_unlock(&mapping->i_mmap_mutex);
-		percpu_up_write(&bdev->bd_block_size_semaphore);
-		return -EBUSY;
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
+		/* block all modifications via writing and page_mkwrite */
+		percpu_down_write(&bdev->bd_block_size_semaphore);
+
+		/* write everything that was dirty */
 		sync_blockdev(bdev);
+
+		/* block readpage and writepage */
+		percpu_down_write(&bdev->bd_page_semaphore);
+
 		bdev->bd_block_size = size;
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
+
+		/* we can't call kill_bdev with the page_semaphore down
+		 * because we'll deadlock against readpage.
+		 * The block_size_semaphore should prevent any new
+		 * pages from being dirty, but readpage can jump
+		 * in once we up the bd_page_sem and find a
+		 * page with buffers from the old size.
+		 *
+		 * The kill_bdev call below is going to get rid
+		 * of those buffers, but we do have a r

Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Chris Mason

On Thu, Nov 29, 2012 at 10:26:56AM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason  wrote:
> >
> > Jumping in based on Linus original patch, which is doing something like
> > this:
> >
> > set_blocksize() {
> > block new calls to writepage, prepare/commit_write
> > set the block size
> > unblock
> >
> > < --- can race in here and find bad buffers --->
> >
> > sync_blockdev()
> > kill_bdev()
> >
> > < --- now we're safe --- >
> > }
> >
> > We could add a second semaphore and a page_mkwrite call:
> 
> Yeah, we could be fancy, but the more I think about it, the less I can
> say I care.
> 
> After all, the only things that do the whole set_blocksize() thing should be:
> 
>  - filesystems at mount-time
> 
>  - things like loop/md at block device init time.
> 
> and quite frankly, if there are any *concurrent* writes with either of
> the above, I really *really* don't think we should care. I mean,
> seriously.
> 
> So the _only_ real reason for the locking in the first place is to
> make sure of internal kernel consistency. We do not want to oops or
> corrupt memory if people do odd things. But we really *really* don't
> care if somebody writes to a partition at the same time as somebody
> else mounts it. Not enough to do extra work to please insane people.
> 
> It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
> The whole sequence always used to be unlocked. The locking is entirely
> new. There is certainly not any legacy users that can possibly rely on
> "I did writes at the same time as the mount with no serialization, and
> it worked". It never has worked.
> 
> So I think this is a case of "perfect is the enemy of good".
> Especially since I think that with the fs/buffer.c approach, we don't
> actually need any locking at all at higher levels.

The bigger question is do we have users that expect to be able to set
the blocksize after mmaping the block device (no writes required)?  I
actually feel a little bad for taking up internet bandwidth asking, but
it is a change in behaviour.

Regardless, changing mmap for a race in the page cache is just backwards, and
with the current 3.7 code, we can still trigger the race with fadvise ->
readpage in the middle of set_blocksize().

Obviously nobody does any of this, otherwise we'd have tons of reports
from those handy WARN_ONs in fs/buffer.c.  So it's definitely hard to be
worried one way or another.

-chris

Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason

On Thu, Nov 29, 2012 at 12:02:17PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds
>  wrote:
> >
> > I think I'll apply this for 3.7 (since it's too late to do anything
> > fancier), and then for 3.8 I will rip out all the locking entirely,
> > because looking at the fs/buffer.c patch I wrote up, it's all totally
> > unnecessary.
> >
> > Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> > creating new buffers) simply makes the whole locking thing pointless.
> > Just make the page lock protect the block size, and make it per-page,
> > and we're done.
> 
> There's a 'block-dev' branch in my git tree, if you guys want to play
> around with it.
> 
> It actually reverts fs/block_dev.c back to the 3.6 state (except for
> some whitespace damage that I refused to re-introduce), so that part
> of the changes should be pretty safe and well tested.
> 
> The fs/buffer.c changes, of course, are new. It's largely the same
> patch I already sent out, with a small helper function to simplify it,
> and to keep the whole ACCESS_ONCE() thing in just a single place.

The fs/buffer.c part makes sense during a quick read.  But
fs/direct-io.c plays with i_blkbits too.  The semaphore was fixing real
bugs there.

-chris

Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason

On Thu, Nov 29, 2012 at 12:26:06PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason  
> wrote:
> >
> > The fs/buffer.c part makes sense during a quick read.  But
> > fs/direct-io.c plays with i_blkbits too.  The semaphore was fixing real
> > bugs there.
> 
> Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be
> incestuously playing games that should be in fs/buffer.c. I thought it
> was doing the sane thing with the page cache.
> 
> (I now realize that Mikulas was talking about this mess, while I
> thought he was talking about the AIO code which is largely sane).

It was all a trick to get you to say the AIO code was sane.

It looks like we could use the private copy of i_blkbits that DIO is
already recording.

blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I
really don't get why that isn't byte based instead.  DIO is already
doing the shift & mask game.

I think only clean_blockdev_aliases is intentionally using the inode's
i_blkbits, but again that shouldn't be changing for filesystems so it
seems safe to use the DIO copy.
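
A minimal sketch of what "use the private copy" means: snapshot the block-size bits once when the request starts, and do every later shift and mask against that snapshot, so a concurrent change to the inode's value cannot mix units within one request. The struct and function names below are hypothetical and are not the fields fs/direct-io.c actually uses.

```c
#include <assert.h>
#include <stdint.h>

struct inode_model {
    unsigned blkbits;   /* may change under us, like inode->i_blkbits */
};

struct dio_model {
    unsigned blkbits;   /* private copy taken at the start of the request */
};

/* Record the block-size bits once, at the start of the request. */
static void dio_begin(struct dio_model *dio, const struct inode_model *inode)
{
    assert(inode->blkbits < 64);
    dio->blkbits = inode->blkbits;
}

/* The "shift" half of the shift & mask game. */
static uint64_t dio_offset_to_block(const struct dio_model *dio, uint64_t offset)
{
    return offset >> dio->blkbits;
}

/* The "mask" half: byte offset within the block. */
static uint64_t dio_offset_in_block(const struct dio_model *dio, uint64_t offset)
{
    return offset & ((UINT64_C(1) << dio->blkbits) - 1);
}
```

If the inode's value changes from 12 to 9 after dio_begin(), byte offset 8292 still maps to block 2, offset 100, because only the snapshot is consulted.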

-chris


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason

On Thu, Nov 29, 2012 at 01:52:22PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason  
> wrote:
> >
> > It was all a trick to get you to say the AIO code was sane.
> 
> It's only sane compared to the DIO code.
> 
> That said, I hate AIO much less these days that we've largely merged
> the code with the regular IO. It's still a horrible interface, but at
> least it is no longer a really disgusting separate implementation in
> the kernel of that horrible interface.
> 
> So yeah, I guess AIO really is pretty sane these days.
> 
> > It looks like we could use the private copy of i_blkbits that DIO is
> > already recording.
> 
> Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out.
> 
> I've pushed out two more commits to the 'block-dev' branch at
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev
> 
> in case anybody wants to take a look.
> 
> It is - as usual - entirely untested. It compiles, and I *think* that
> blkdev_get_blocks() makes a whole lot more sense this way - as you
> said, it should be byte-based (although it actually does the block
> number conversion because I worried about overflow - probably
> unnecessarily).
> 
> Comments?

Your blkdev_get_blocks emails were great reading while at the dentist,
thanks for helping me pass the time.

Just reading the new blkdev_get_blocks, it looks like we're mixing
shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
it has no relation at all to the actual block size of the device.  The
interface is abusing b_size to ask for as large a mapping as possible.

Most importantly, it has no relation to the fs_startblk that we pass in,
which is based on inode->i_blkbits.

So your new check in blkdev_get_blocks:

   if (iblock >= end_block) {

is wrong because iblock and end_block are based on different sizes.  I
think we have to do the eof checks inside fs/direct-io.c or change the
get_blocks interface completely.

I really thought fs/direct-io.c was already doing eof checks, but I'm
reading harder.
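
To make the unit mismatch concrete, here is a small sketch comparing a byte-based EOF test with one where end_block comes from a different shift than iblock. The function names, the 1 MiB device size and the shift of 16 are all made-up values for illustration; this is not the code from the block-dev branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte-based check: convert iblock using the same bits it was built with. */
static bool past_eof_bytes(uint64_t iblock, unsigned blkbits,
                           uint64_t dev_size_bytes)
{
    assert(blkbits < 32);
    return (iblock << blkbits) >= dev_size_bytes;
}

/*
 * Mixed-unit check: end_block is computed with some other shift
 * (map_shift, hypothetical) while iblock is still in i_blkbits units.
 */
static bool past_eof_mixed(uint64_t iblock, unsigned map_shift,
                           uint64_t dev_size_bytes)
{
    uint64_t end_block = dev_size_bytes >> map_shift;

    return iblock >= end_block;
}
```

On a 1 MiB device with 4096-byte blocks, block 200 is at byte 819200 and well inside the device, but the mixed check with a shift of 16 computes end_block as 16 and reports it as past the end.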

-chris


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason

On Thu, Nov 29, 2012 at 03:36:38PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds
>  wrote:
> >
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
> 
> It turns out fs/ioctl.c does the same - it fills in the buffer head
> with some random bh->b_size too. I think it's not even a power of two
> in that case.
> 
> And I guess it's understandable - they don't actually *use* the
> buffer, they just want the offset. So the b_size field really is just
> random crap to the users of the get_block interfaces, since they've
> never cared before.
> 
> Ugh, this was definitely a dark and disgusting underbelly of the VFS
> layer. We've not had to really touch it for a *looong* time..

I searched through filemap.c for the magic i_size check that would let
us get away with ignoring i_blkbits in get_blocks, but it's just not
there.  The whole fallback-to-buffered scheme seems to rely on
get_blocks checking for i_size.  I really hope I'm just missing
something.

If we're going to change this, I'd vote for something non-bh based.  I
didn't check every single FS, but I don't think direct-IO really wants
or needs buffer heads at all.

One less wart in direct-io.c would really be nice, but I'm assuming
it'll take us at least one full release to hammer out a shiny new
get_blocks.  Passing i_blkbits would be more mechanical, since all the
filesystems would just ignore it.

-chris

Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason

On Thu, Nov 29, 2012 at 07:13:02PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason  wrote:
> >
> > I searched through filemap.c for the magic i_size check that would let
> > us get away with ignoring i_blkbits in get_blocks, but its just not
> > there.  The whole fallback-to-buffered scheme seems to rely on
> > get_blocks checking for i_size.  I really hope I'm just missing
> > something.
> 
> So generic_write_checks() limits the size to i_size for writes (and
> for "isblk").

Great, that's what I was missing.

> 
> Sure, then it will do the buffered part after that, but that should
> all be fine anyway, since by then we use the normal page cache.
> 
> For reads, generic_file_aio_read() will check pos < size, but doesn't
> seem to actually limit the size of the iovec.

I couldn't explain that either.

> 
> I'm not sure why it doesn't just do "iov_shorten()".
> 
> Anyway, having looked at actually passing in the block size to
> get_block(), I can say that is a horrible idea. There are tons of
> get_block functions (for various filesystems), and *none* of them
> really want the block size, because they tend to work on block
> indexes. And if they do want the block size, they'll just get it from
> the inode or sb, since they are filesystems and it's all stable.
> 
> So the *only* place that would want the block size is
> fs/block_dev.c. And the callers really already seem to do the i_size
> check, although they sometimes do it badly. And since there are fewer
> callers than there are get_block() implementations, I think we should
> just fix the callers and be done with it.
> 
> Those generic_file_aio_read/write() functions in fs/direct-io.c really
> just seem to be badly written. The fact that they may depend on the
> i_size check in get_blocks() is sad, but I think we should fix it and
> just remove the check for block devices. That's going to simplify so
> much..
> 
> I updated the 'block-dev' branch to have that simpler fs/block_dev.c
> model instead. I'll look at the iovec shortening later. It's a
> non-fast-forward thing, look out!
> 
> (I actually think we should just add the max-offset check to
> rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing,
> and we could make it do a max_offset check as well).

This is definitely easier, and I can't see any reason not to do it.  I'm
used to get_block being expensive and so it didn't even cross my mind.

We can benchmark things just to make sure.
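
The "fix the callers" approach boils down to clamping the request against the device size before get_block is ever called. A minimal sketch of such a clamp, with clamp_io_to_size() being a hypothetical helper rather than an existing kernel function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many of the requested bytes fit before `size`,
 * starting at `pos`.  Returns 0 when pos is at or past the end.
 */
static size_t clamp_io_to_size(uint64_t pos, size_t count, uint64_t size)
{
    uint64_t room;

    if (pos >= size)
        return 0;
    room = size - pos;
    /* When count >= room, room is no larger than count and fits in size_t. */
    return count < room ? count : (size_t)room;
}
```

A 200-byte request at offset 900 on a 1000-byte device is shortened to 100 bytes; a request starting at or beyond the end gets 0.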

-chris


[GIT PULL] Btrfs fixes

2013-02-06 Thread Chris Mason

[ sorry, my lbdb seems to really like linux-ker...@vger.kerrnel.org,
fixed for real this time ]

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We've got corner cases for updating i_size that ceph was hitting, error
handling for quotas when we run out of space, a very subtle snapshot
deletion race, a crash while removing devices, and one deadlock between
subvolume creation and the sb_internal code (thanks lockdep).

Josef Bacik (3) commits (+12/-4):
Btrfs: do not merge logged extents if we've removed them from the tree (+2/-1)
Btrfs: fix possible stale data exposure (+1/-1)
Btrfs: fix missing i_size update (+9/-2)

Miao Xie (2) commits (+21/-9):
Btrfs: fix missing release of the space/qgroup reservation in start_transaction() (+19/-8)
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() (+2/-1)

Jan Schmidt (1) commits (+10/-12):
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata

Liu Bo (1) commits (+38/-9):
Btrfs: fix race between snapshot deletion and getting inode

Chris Mason (1) commits (+4/-1):
Btrfs: move d_instantiate outside the transaction during mksubvol

Eric Sandeen (1) commits (+2/-1):
btrfs: don't try to notify udev about missing devices

Total: (9) commits

 fs/btrfs/extent-tree.c  | 22 ++
 fs/btrfs/extent_map.c   |  3 ++-
 fs/btrfs/file.c         | 25 -
 fs/btrfs/ioctl.c        |  5 -
 fs/btrfs/ordered-data.c | 13 ++---
 fs/btrfs/scrub.c        | 25 -
 fs/btrfs/transaction.c  | 27 +++
 fs/btrfs/volumes.c      |  3 ++-
 8 files changed, 87 insertions(+), 36 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Oops when mounting btrfs partition

2013-02-02 Thread Chris Mason

Hi Arnd,

First things first, nospace_cache is a safe thing to use.  It is slow
because it's finding free extents, but it's just a cache and always safe
to discard.  With your other errors, I'd just mount it readonly
and then you won't waste time on atime updates.

I'll take a look at the BUG you got during log recovery.  We've fixed a
few of those during the 3.8 rc cycle.

> Feb  1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at 
> a01fdcf7 [verbose debug info unavailable]

> Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino 
> 15619835 off 454656 csum 2755731641 private 864823192
> Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 
> errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
> ...
> Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify 
> failed on 17006399488 wanted 54700 found 54764

These aren't good.  With a few exceptions for really tight races in fsx
use cases, csum errors are bad data from the disk.  The transid verify
failed shows we wanted to find a metadata block from generation 54700
but found 54764 instead:

54700 = 0xD5AC
54764 = 0xD5EC

This same bad block comes up a few different times.
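
The two generation numbers can be compared bit by bit with an XOR. A small sketch, where count_differing_bits() is just an illustrative helper:

```c
#include <assert.h>
#include <stdint.h>

/* Number of bit positions in which a and b differ. */
static unsigned count_differing_bits(uint64_t a, uint64_t b)
{
    uint64_t diff = a ^ b;
    unsigned n = 0;

    while (diff) {
        diff &= diff - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For 54700 and 54764 the XOR is 0x40, so the wanted and found values differ in exactly one bit.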

> Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error 
> corrected: ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288)

This shows we pulled from the second copy of this block and got the
right answer, and then wrote the right answer to the duplicate.
Inode 1 means it was metadata.

But for some reason it still aborted the transaction.  It could have been
an EIO on the correction, but the auto correction code in 3.5 did work
well.

I think your plan to pull the data off and reformat is a good one.  I'd
also look hard at your RAM since drives don't usually send back single-bit
errors.

-chris


Re: [GIT PULL] Btrfs fixes

2013-01-24 Thread Chris Mason

On Tue, Jan 22, 2013 at 05:48:33PM -0700, Chris Mason wrote:
> Hi Linus,
> 
> My for-linus branch has our batch of btrfs fixes:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus
> 
> We've been hammering away at a crc corruption as well, which I was
> really hoping to get into this pull.  It isn't nailed down yet, but we
> were finally able to get a solid way to reproduce.  The only good
> news is it isn't a recent regression.

Update on this, we've tracked down the crc errors and are doing final
checks on the patches.  Linus, are you planning on taking this pull?  If
not I can just fold the new stuff into a bigger request.

-chris

[GIT PULL] Btrfs fixes (v2)

2013-01-24 Thread Chris Mason

Hi Linus,

My for-linus branch has our batch of btrfs fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

It turns out that we had two crc bugs when running fsx-linux in a
loop.  Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
all down.  Miao also has a new OOM fix in this v2 pull as well.

Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
and resuming a running balance across drives.

Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.

Arne's patches address problems with subvolume quotas.  If the user
destroys quota groups incorrectly the FS will refuse to mount.

The rest are smaller fixes and plugs for memory leaks.

Miao Xie (8) commits (+76/-24):
Btrfs: fix missing write access release in btrfs_ioctl_resize() (+1/-0)
Btrfs: do not delete a subvolume which is in a R/O subvolume (+5/-5)
Btrfs: Add ACCESS_ONCE() to transaction->abort accesses (+3/-2)
Btrfs: fix wrong max device number for single profile (+1/-1)
Btrfs: fix repeated delalloc work allocation (+41/-14)
Btrfs: fix missed transaction->aborted check (+16/-0)
Btrfs: fix resize a readonly device (+4/-2)
Btrfs: disable qgroup id 0 (+5/-0)

Ilya Dryomov (6) commits (+94/-32):
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag (+9/-8)
Btrfs: fix "mutually exclusive op is running" error code (+4/-4)
Btrfs: fix a regression in balance usage filter (+8/-1)
Btrfs: bring back balance pause/resume logic (+71/-17)
Btrfs: fix unlock order in btrfs_ioctl_rm_dev (+1/-1)
Btrfs: fix unlock order in btrfs_ioctl_resize (+1/-1)

Liu Bo (5) commits (+23/-7):
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents (+14/-6)
Btrfs: use right range to find checksum for compressed extents (+5/-0)
Btrfs: let allocation start from the right raid type (+1/-1)
Btrfs: reset path lock state to zero (+2/-0)
Btrfs: fix off-by-one in lseek (+1/-0)

Josef Bacik (5) commits (+69/-29):
Btrfs: do not allow logged extents to be merged or removed (+16/-3)
Btrfs: add orphan before truncating pagecache (+38/-15)
Btrfs: set flushing if we're limited flushing (+1/-1)
Btrfs: put csums on the right ordered extent (+2/-2)
Btrfs: fix panic when recovering tree log (+12/-8)

Arne Jansen (2) commits (+19/-1):
Btrfs: prevent qgroup destroy when there are still relations (+12/-1)
Btrfs: ignore orphan qgroup relations (+7/-0)

Zach Brown (1) commits (+1/-0):
btrfs: fix btrfs_cont_expand() freeing IS_ERR em

Lukas Czerner (1) commits (+1/-1):
btrfs: get the device in write mode when deleting it

Eric Sandeen (1) commits (+14/-3):
btrfs: update timestamps on truncate()

Tsutomu Itoh (1) commits (+3/-1):
Btrfs: fix memory leak in name_cache_insert()

Total: (30) commits (+300/-98)

 fs/btrfs/extent-tree.c      |   6 +-
 fs/btrfs/extent_map.c       |  13 -
 fs/btrfs/extent_map.h       |   1 +
 fs/btrfs/file-item.c        |   4 +-
 fs/btrfs/file.c             |  10 +++-
 fs/btrfs/free-space-cache.c |  20 ---
 fs/btrfs/inode.c            | 137 +---
 fs/btrfs/ioctl.c            | 129 ++---
 fs/btrfs/qgroup.c           |  20 ++-
 fs/btrfs/send.c             |   4 +-
 fs/btrfs/super.c            |   2 +-
 fs/btrfs/transaction.c      |  19 +-
 fs/btrfs/tree-log.c         |  10 +++-
 fs/btrfs/volumes.c          |  23 ++--
 14 files changed, 300 insertions(+), 98 deletions(-)

[GIT PULL] Btrfs fixes

2013-03-17 Thread Chris Mason

Hi Linus,

My for-linus branch has some btrfs fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Eric's rcu barrier patch fixes a long standing problem with our unmount
code hanging on to devices in workqueue helpers.  Liu Bo nailed down a
difficult assertion for in-memory extent mappings.

Liu Bo (4) commits (+9/-7):
Btrfs: get better concurrency for snapshot-aware defrag work (+3/-0)
Btrfs: fix warning when creating snapshots (+5/-6)
Btrfs: fix warning of free_extent_map (+1/-0)
Btrfs: remove btrfs_try_spin_lock (+0/-1)

Josef Bacik (1) commits (+4/-1):
Btrfs: return EIO if we have extent tree corruption

Eric Sandeen (1) commits (+6/-0):
btrfs: use rcu_barrier() to wait for bdev puts at unmount

Wang Shilong (1) commits (+6/-4):
Btrfs: return as soon as possible when edquot happens

Total: (7) commits (+25/-12)

 fs/btrfs/extent-tree.c |  5 -
 fs/btrfs/file.c        |  1 +
 fs/btrfs/inode.c       |  3 +++
 fs/btrfs/locking.h     |  1 -
 fs/btrfs/qgroup.c      | 10 ++
 fs/btrfs/transaction.c | 11 +--
 fs/btrfs/volumes.c     |  6 ++
 7 files changed, 25 insertions(+), 12 deletions(-)

[no subject]

2012-08-29 Thread Chris Mason

Hi Linus,

I've split out the big send/receive update from my last pull request and
now have just the fixes in my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

For anyone who wants send/receive updates, they are maintained as well.
But it has enough cleanups (without fixes) that we shouldn't be asking
Linus to take it right now.  The send/recv branch will wander over to
linux-next shortly though.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git send-recv

The largest patches in this pull are Josef's patches to fix DIO locking
problems and his patch to fix a crash during balance.
They are both well tested.

The rest are smaller fixes that we've had queued.  The last rc came out
while I was hacking new and exciting ways to recover from a misplaced rm
-rf on my dev box, so these missed rc3.

Josef Bacik (9) commits (+322/-216):
Btrfs: don't allocate a seperate csums array for direct reads (+19/-32)
Btrfs: do not use missing devices when showing devname (+2/-0)
Btrfs: fix enospc problems when deleting a subvol (+1/-1)
Btrfs: increase the size of the free space cache (+7/-8)
Btrfs: lock extents as we map them in DIO (+127/-129)
Btrfs: fix deadlock with freeze and sync V2 (+9/-4)
Btrfs: allow delayed refs to be merged (+142/-27)
Btrfs: do not strdup non existent strings (+5/-3)
Btrfs: barrier before waitqueue_active (+10/-12)

Stefan Behrens (5) commits (+16/-77):
Btrfs: fix that repair code is spuriously executed for transid failures (+6/-2)
Btrfs: revert checksum error statistic which can cause a BUG() (+2/-39)
Btrfs: fix a misplaced address operator in a condition (+1/-1)
Btrfs: remove superblock writing after fatal error (+5/-33)
Btrfs: fix that error value is changed by mistake (+2/-2)

Dan Carpenter (4) commits (+16/-8):
Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() (+3/-1)
Btrfs: fix some error codes in btrfs_qgroup_inherit() (+6/-2)
Btrfs: fix some endian bugs handling the root times (+4/-4)
Btrfs: checking for NULL instead of IS_ERR (+3/-1)

Liu Bo (2) commits (+25/-6):
Btrfs: fix ordered extent leak when failing to start a transaction (+5/-2)
Btrfs: fix a dio write regression (+20/-4)

Arne Jansen (2) commits (+38/-73):
Btrfs: fix deadlock in wait_for_more_refs (+21/-73)
Btrfs: fix race in run_clustered_refs (+17/-0)

Chris Mason (1) commits (+3/-0):
Btrfs: don't run __tree_mod_log_free_eb on leaves

Fengguang Wu (1) commits (+3/-2):
btrfs: fix second lock in btrfs_delete_delayed_items()

Miao Xie (1) commits (+1/-0):
Btrfs: fix wrong mtime and ctime when creating snapshots

Total: (25) commits (+424/-382)

 fs/btrfs/backref.c       |   4 +-
 fs/btrfs/compression.c   |   1 +
 fs/btrfs/ctree.c         |   9 +-
 fs/btrfs/ctree.h         |   3 +-
 fs/btrfs/delayed-inode.c |  12 +-
 fs/btrfs/delayed-ref.c   | 163 +++-
 fs/btrfs/delayed-ref.h   |   4 +
 fs/btrfs/disk-io.c       |  53 ++--
 fs/btrfs/disk-io.h       |   2 +-
 fs/btrfs/extent-tree.c   | 123 +-
 fs/btrfs/extent_io.c     |  17 +--
 fs/btrfs/file-item.c     |   4 +-
 fs/btrfs/inode.c         | 326 ---
 fs/btrfs/ioctl.c         |   2 +-
 fs/btrfs/locking.c       |   2 +-
 fs/btrfs/qgroup.c        |  12 +-
 fs/btrfs/root-tree.c     |   4 +-
 fs/btrfs/super.c         |  15 ++-
 fs/btrfs/transaction.c   |   3 +-
 fs/btrfs/volumes.c       |  33 +
 fs/btrfs/volumes.h       |   2 -
 21 files changed, 418 insertions(+), 376 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-31 Thread Chris Mason

On Thursday 31 January 2008, Jan Kara wrote:
> On Thu 31-01-08 11:56:01, Chris Mason wrote:
> > On Thursday 31 January 2008, Al Boldi wrote:
> > > Andreas Dilger wrote:
> > > > On Wednesday 30 January 2008, Al Boldi wrote:
> > > > > And, a quick test of successive 1sec delayed syncs shows no hangs
> > > > > until about 1 minute (~180mb) of db-writeout activity, when the
> > > > > sync abruptly hangs for minutes on end, and io-wait shows almost
> > > > > 100%.
> > > >
> > > > How large is the journal in this filesystem?  You can check via
> > > > "debugfs -R 'stat <8>' /dev/XXX".
> > >
> > > 32mb.
> > >
> > > > Is this affected by increasing
> > > > the journal size?  You can set the journal size via "mke2fs -J
> > > > size=400" at format time, or on an unmounted filesystem by running
> > > > "tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400
> > > > /dev/XXX".
> > >
> > > Setting size=400 doesn't help, nor does size=4.
> > >
> > > > I suspect that the stall is caused by the journal filling up, and
> > > > then waiting while the entire journal is checkpointed back to the
> > > > filesystem before the next transaction can start.
> > > >
> > > > It is possible to improve this behaviour in JBD by reducing the
> > > > amount of space that is cleared if the journal becomes "full", and
> > > > also doing journal checkpointing before it becomes full.  While that
> > > > may reduce performance a small amount, it would help avoid such huge
> > > > latency problems. I believe we have such a patch in one of the Lustre
> > > > branches already, and while I'm not sure what kernel it is for the
> > > > JBD code rarely changes much
> > >
> > > The big difference between ordered and writeback is that once the
> > > slowdown starts, ordered goes into ~100% iowait, whereas writeback
> > > continues 100% user.
> >
> > Does data=ordered write buffers in the order they were dirtied?  This
> > might explain the extreme problems in transactional workloads.
>
>   Well, it does but we submit them to block layer all at once so elevator
> should sort the requests for us...

nr_requests is fairly small, so a long stream of random requests should still 
end up being random IO.
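
To make the nr_requests point concrete, here is a toy model, not the kernel's elevator. It assumes requests are sorted only within batches of at most `window` entries and measures total head movement in sectors. The names `total_seek` and `fill_random` are made up for this sketch, and any window size passed in (128, say) is only an assumed stand-in for nr_requests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/*
 * Toy model (assumption, not kernel code): requests arrive in array
 * order, each batch of at most `window` requests is sorted by sector,
 * and the head services them in that order.  Returns the total head
 * movement in sectors.  The array is sorted in place, batch by batch.
 */
static uint64_t total_seek(uint32_t *sectors, size_t count, size_t window)
{
	uint64_t moved = 0;
	uint32_t head = 0;
	size_t i, j;

	assert(window > 0);
	for (i = 0; i < count; i += window) {
		size_t n = count - i < window ? count - i : window;

		qsort(sectors + i, n, sizeof(*sectors), cmp_u32);
		for (j = 0; j < n; j++) {
			uint32_t s = sectors[i + j];

			moved += s > head ? s - head : head - s;
			head = s;
		}
	}
	return moved;
}

/* Deterministic pseudo-random sector numbers from a simple LCG. */
static void fill_random(uint32_t *sectors, size_t count, uint32_t seed)
{
	size_t i;

	for (i = 0; i < count; i++) {
		seed = seed * 1664525u + 1013904223u;
		sectors[i] = seed >> 8;
	}
}
```

Running total_seek() over the same random stream with a window of 1, a small window, and the full stream length should show the small window removing only part of the seeking, since each sorted batch still sweeps most of the disk.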

Al, could you please compare the write throughput from vmstat for the 
data=ordered vs data=writeback runs?  I would guess the data=ordered one has 
a lower overall write throughput.

-chris

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-31 Thread Chris Mason

On Thursday 31 January 2008, Al Boldi wrote:
> Andreas Dilger wrote:
> > On Wednesday 30 January 2008, Al Boldi wrote:
> > > And, a quick test of successive 1sec delayed syncs shows no hangs until
> > > about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
> > > hangs for minutes on end, and io-wait shows almost 100%.
> >
> > How large is the journal in this filesystem?  You can check via
> > "debugfs -R 'stat <8>' /dev/XXX".
>
> 32mb.
>
> > Is this affected by increasing
> > the journal size?  You can set the journal size via "mke2fs -J size=400"
> > at format time, or on an unmounted filesystem by running
> > "tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400 /dev/XXX".
>
> Setting size=400 doesn't help, nor does size=4.
>
> > I suspect that the stall is caused by the journal filling up, and then
> > waiting while the entire journal is checkpointed back to the filesystem
> > before the next transaction can start.
> >
> > It is possible to improve this behaviour in JBD by reducing the amount
> > of space that is cleared if the journal becomes "full", and also doing
> > journal checkpointing before it becomes full.  While that may reduce
> > performance a small amount, it would help avoid such huge latency
> > problems. I believe we have such a patch in one of the Lustre branches
> > already, and while I'm not sure what kernel it is for the JBD code rarely
> > changes much
>
> The big difference between ordered and writeback is that once the slowdown
> starts, ordered goes into ~100% iowait, whereas writeback continues 100%
> user.

Does data=ordered write buffers in the order they were dirtied?  This might 
explain the extreme problems in transactional workloads.

-chris

Re: [ANNOUNCE] Btrfs v0.12 released

2008-02-11 Thread Chris Mason

On Sunday 10 February 2008, David Miller wrote:
> From: Chris Mason <[EMAIL PROTECTED]>
> Date: Wed, 6 Feb 2008 12:00:13 -0500
>
> This function never returns an error, so the simplest fix was to
> return the hash value which avoids all of the issues.  In attempting
> other schemes to fix this, I found it very difficult to give gcc
> a packed attribute for that "u64 *" argument other than to create
> some new pseudo structure which would have been ugly.
>
Many thanks, I clearly didn't put enough thought into the unaligned access 
problems.

> Similar code lives in the btrfs kernel code too, I'll try to get a
> partition at least mounted and working minimally and if successful
> I'll send you patches for that too.

The kernel is actually worse, because the set/get macros are more complex.  
Some live in ctree.h like in the progs, but the nasty ones live in 
struct-funcs.c
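
For anyone following the unaligned access discussion, a minimal userspace sketch of an accessor that never issues a wide load from an arbitrary address. This is not David's patch and not btrfs code; `get_le64_unaligned` is a hypothetical name, and it only assumes the on-disk field is little endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from btrfs): read a little-endian u64 from a
 * byte buffer at any offset.  The value is assembled one byte at a time,
 * so no 64-bit load is ever done on a misaligned address, and the result
 * is the same on big- and little-endian hosts.
 */
static uint64_t get_le64_unaligned(const unsigned char *buf, size_t off)
{
	uint64_t val = 0;
	int i;

	assert(buf != NULL);
	for (i = 7; i >= 0; i--)
		val = (val << 8) | buf[off + i];
	return val;
}
```

Dereferencing a cast `u64 *` at the same address would be the shorter version, and is exactly what traps on strict-alignment machines.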

-chris

Re: BTRFS only works with PAGE_SIZE <= 4K

2008-02-12 Thread Chris Mason

On Tuesday 12 February 2008, David Miller wrote:
> From: Chris Mason <[EMAIL PROTECTED]>
> Date: Wed, 6 Feb 2008 12:00:13 -0500
>
> > So, here's v0.12.
>
> Any page size larger than 4K will not work with btrfs.  All of the
> extent stuff assumes that PAGE_SIZE <= sectorsize.

Yeah, there is definitely clean up to do in that area.

>
> I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on
> sparc64 and I was finally able to successfully mount a partition.

Nice

>
> With 4K there are zero's in the root tree node header, because it's
> extent's location on disk is at a sub-PAGE_SIZE multiple and the
> extent code doesn't handle that.
>
> You really need to start validating this stuff on other platforms.
> Something that isn't little endian and something that doesn't use 4K
> pages.  I'm sure you have some powerpc parts around somewhere. :)

Grin, I think around v0.4 I grabbed a ppc box for a day and got things 
working.  There has been some churn since then...

My first prio is the newest set of disk format changes, and then I'll sit down 
and work on stability on a bunch of arches.

>
> Anyways, here is a patch for the kernel bits which fixes most of the
> unaligned accesses on sparc64.

Many thanks, I'll try these out here and push them into the tree.

-chris

Re: BTRFS partition usage...

2008-02-12 Thread Chris Mason

On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 09:08, Chris Mason wrote:
> >> >So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
> >>
> >> Something looks wrong here. Why would btrfs need to zero at all?
> >> Superblock at 0, and done. Just like xfs.
> >> (Yes, I had xfs on sparc before, so it's not like you NEED the
> >> whitespace at the start of a partition.)
> >
> >I've had requests to move the super down to 64k to make room for
> > bootloaders, which may not matter for sparc, but I don't really plan on
> > different locations for different arches.
>
> In x86, there is even more space for a bootloader (some 28k or so)
> even if your partition table is as closely packed as possible,
> from 0 to 7e00 IIRC.
>
> For sparc you could have something like
>
>           startlba  endlba  type
>   sda1    0         2       1 Boot
>   sda2    2         58      3 Whole disk
>   sda3    58        9       83 Linux
>
> and slap the bootloader into "MBR", just like on x86.
> Or I am missing something..

It was a request from hpa, and he clearly had something in mind.  He kindly 
offered to review the disk format for bootloaders and other lower level 
issues but I asked him to wait until I firm it up a bit.

From my point of view, 0 is a bad idea because it is very likely to conflict 
with other things.  There are lots of things in the FS that need deep 
thought, and the perfect system to fully use the first 64k of a 1TB filesystem 
isn't quite at the top of my list right now ;)

Regardless of offset, it is a good idea to mop up previous filesystems where 
possible, and a very good idea to align things on some sector boundary.  Even 
going 1MB in wouldn't be a horrible idea to align with erasure blocks on SSD.

-chris

Re: BTRFS partition usage...

2008-02-12 Thread Chris Mason

On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 08:49, Chris Mason wrote:
> >> > This is a real issue on sparc where the default sun disk labels
> >> > created use an initial partition where block zero aliases the disk
> >> > label.  It took me a few iterations before I figured out why every
> >> > btrfs make would zero out my disk label :-/
> >>
> >> Actually it seems this is only a problem with mkfs.btrfs, it clears
> >> out the first 64 4K chunks of the disk for whatever reason.
> >
> >It is a good idea to remove supers from other filesystems.  I also need to
> > add zeroing at the end of the device as well.
> >
> >Looks like I misread the e2fs zeroing code.  It zeros the whole external
> > log device, and I assumed it also zero'd out the start of the main FS.
> >
> >So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
>
> Something looks wrong here. Why would btrfs need to zero at all?
> Superblock at 0, and done. Just like xfs.
> (Yes, I had xfs on sparc before, so it's not like you NEED the
> whitespace at the start of a partition.)

I've had requests to move the super down to 64k to make room for bootloaders, 
which may not matter for sparc, but I don't really plan on different 
locations for different arches.

4k aligned is important given that sector sizes are growing.

-chris

Re: BTRFS partition usage...

2008-02-12 Thread Chris Mason

On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 09:35, Chris Mason wrote:
> >> and slap the bootloader into "MBR", just like on x86.
> >> Or I am missing something..
> >
> >It was a request from hpa, and he clearly had something in mind.  He
> > kindly offered to review the disk format for bootloaders and other lower
> > level issues but I asked him to wait until I firm it up a bit.
> >
> >From my point of view, 0 is a bad idea because it is very likely to
> > conflict with other things.  There are lots of things in the FS that need
> > deep thought,and the perfect system to fully use the first 64k of a 1TB
> > filesystem isn't quite at the top of my list right now ;)
> >
> >Regardless of offset, it is a good idea to mop up previous filesystems
> > where possible, and a very good idea to align things on some sector
> > boundary.  Even going 1MB in wouldn't be a horrible idea to align with
> > erasure blocks on SSD.
>
> I still don't like the idea of btrfs trying to be smarter than a user
> who can partition up his system according to
>   (a) his likes
>   (b) system or hardware requirements or recommendations
> to align the superblock to a specific location.

Will all the users in the world who think about super block location when they 
partition their disks please raise their hands?

The location of the super block needs to be very simple in order for mount and 
friends to find and detect it.  It needs a simple algorithm to try multiple 
locations in case a given copy of the super is corrupt.
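
A rough sketch of what "simple" means here: walk a short fixed list of offsets and take the first copy that looks valid. Everything in it is a placeholder for illustration. The magic string, the second offset and the name `sketch_find_super` are made up, the device is modelled as an in-memory image, and a real probe would also verify a checksum rather than only the magic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholders only: not the btrfs disk format. */
#define SKETCH_SUPER_MAGIC	"SKETCHFS"
#define SKETCH_MAGIC_LEN	8

static const uint64_t sketch_super_offsets[] = {
	64ULL * 1024,	/* primary copy (assumed location) */
	256ULL * 1024,	/* backup copy (assumed location) */
};

/*
 * Return the byte offset of the first super copy whose magic matches,
 * or -1 if no candidate location holds a valid copy.
 */
static int64_t sketch_find_super(const unsigned char *dev, uint64_t dev_size)
{
	size_t nr = sizeof(sketch_super_offsets) / sizeof(sketch_super_offsets[0]);
	size_t i;

	assert(dev != NULL);
	for (i = 0; i < nr; i++) {
		uint64_t off = sketch_super_offsets[i];

		if (off + SKETCH_MAGIC_LEN > dev_size)
			continue;	/* this copy would lie past the device end */
		if (memcmp(dev + off, SKETCH_SUPER_MAGIC, SKETCH_MAGIC_LEN) == 0)
			return (int64_t)off;
	}
	return -1;
}
```

The fixed table is the whole point: mount and friends can probe without knowing anything about the partition layout.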

Design in this case is a bunch of compromises around other users of the 
hardware, ease of programming, and the benefits in performance or usability 
from doing something complex.

>
> 1MB alignment does not always mean 1MB alignment.
> Sector 1 begins at 0x7e00 on x86.
> And with the maximum CHS geometry (255/63), partitions begin
> at 0x7e00+n*8225280 bytes, so the SB is unlikely to ever be on
> a 1048576 boundary.

IO is already aligned on sectors, sometimes we'll have a perfect erasure block 
alignment and sometimes not.  When the location of the super is my biggest 
bottleneck, I'll be a very happy boy.
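
Jan's arithmetic above can be checked with a short sketch. It assumes his numbers as given (first partition at 0x7e00, later ones at 0x7e00 + n*8225280 bytes with 255 heads, 63 sectors and 512 byte sectors); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES	512ULL
#define FIRST_PART_OFF	0x7e00ULL			/* 63 sectors * 512 bytes */
#define CYLINDER_BYTES	(255ULL * 63 * SECTOR_BYTES)	/* 8225280 bytes */

/* Byte offset of a partition starting n cylinders after the first one. */
static uint64_t part_start(uint64_t n)
{
	return FIRST_PART_OFF + n * CYLINDER_BYTES;
}

/* Smallest n below limit whose start is a multiple of align, or -1. */
static int64_t first_aligned(uint64_t align, uint64_t limit)
{
	uint64_t n;

	assert(align != 0);
	for (n = 0; n < limit; n++)
		if (part_start(n) % align == 0)
			return (int64_t)n;
	return -1;
}
```

With these numbers the start offset is 512 * 63 * (1 + 255n), so a 4k boundary needs n = 1, 9, 17, ... and a 1MB boundary is first hit at n = 257 and then only every 2048 cylinders, which agrees with "unlikely".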

-chris

Re: BTRFS partition usage...

2008-02-12 Thread Chris Mason

On Tuesday 12 February 2008, David Miller wrote:
> From: David Miller <[EMAIL PROTECTED]>
> Date: Mon, 11 Feb 2008 23:21:39 -0800 (PST)
>
> > Filesystems like ext2 put their superblock 1 block into the partition
> > in order to avoid overwriting disk labels and other uglies.  UFS does
> > this too, as do several others.  One of the few exceptions I've been
> > able to find is XFS.
> >
> > This is a real issue on sparc where the default sun disk labels
> > created use an initial partition where block zero aliases the disk
> > label.  It took me a few iterations before I figured out why every
> > btrfs make would zero out my disk label :-/
>
> Actually it seems this is only a problem with mkfs.btrfs, it clears
> out the first 64 4K chunks of the disk for whatever reason.

It is a good idea to remove supers from other filesystems.  I also need to add 
zeroing at the end of the device as well.

Looks like I misread the e2fs zeroing code.  It zeros the whole external log 
device, and I assumed it also zero'd out the start of the main FS.

So, if Btrfs starts zeroing at 1k, will that be acceptable for you?

-chris

Re: [ANNOUNCE] Btrfs v0.12 released

2008-02-12 Thread Chris Mason

On Tuesday 12 February 2008, David Miller wrote:
> From: Chris Mason <[EMAIL PROTECTED]>
> Date: Mon, 11 Feb 2008 08:42:20 -0500
>
> > The kernel is actually worse, because the set/get macros are more
> > complex. Some live in ctree.h like in the progs, but the nasty ones live
> > in struct-funcs.c
>
> This is really problematic, because you've got these things called
> "btrfs_item_ptr()" which really isn't a pointer, it's a relative
> 'unsigned long' offset cast to a pointer.  The source of this
> seems to be btrfs_leaf_data().
>
> And then those things get passed down into the SETGET functions!

Explaining it won't make it pretty, but at least I can tell you what the code 
does.

This is all part of the btrfs code that supports tree block sizes larger than 
a page.  The extent_buffer code (extent_io.c) provides a read/write api into 
an extent_buffer based on offsets from the start of the multi-page buffer.  
That's where the relative unsigned long comes from.

The part where I cast it to pointers is me trying to maintain type checking 
throughout the code.  The pointers themselves are useless, they need to be 
matched with an extent_buffer to actually get to the bytes.

There are a few parts where the SETGET funcs are open coded, mostly in very 
performance critical functions.  Just look for lexxx_to_cpu
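
As a rough userspace illustration of the scheme described above (not the actual extent_io.c code): the tree block is a set of separate pages, the typed "pointer" is only an offset from the start of the block, and bytes are reached by handing that offset to a read helper together with the buffer. All `sketch_` names and the fixed four-page array are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical stand-in for an extent_buffer: one tree block spread
 * over several separately allocated pages. */
struct sketch_eb {
	unsigned char *pages[4];
	unsigned long len;	/* total bytes in the block */
};

/* Made-up item type, present only so the "pointer" is type checked. */
struct sketch_item;

/* The "pointer" is just a block-relative offset wearing a type. */
static struct sketch_item *sketch_item_ptr(unsigned long offset)
{
	return (struct sketch_item *)offset;
}

/*
 * Copy len bytes starting at a block-relative offset, crossing page
 * boundaries as needed.  The offset is meaningless without eb.
 */
static void sketch_read_eb(const struct sketch_eb *eb, void *dst,
			   unsigned long start, unsigned long len)
{
	unsigned char *out = dst;

	assert(start + len <= eb->len);
	while (len > 0) {
		unsigned long page = start / SKETCH_PAGE_SIZE;
		unsigned long in_page = start % SKETCH_PAGE_SIZE;
		unsigned long chunk = SKETCH_PAGE_SIZE - in_page;

		if (chunk > len)
			chunk = len;
		memcpy(out, eb->pages[page] + in_page, chunk);
		out += chunk;
		start += chunk;
		len -= chunk;
	}
}
```

A caller would do something like sketch_read_eb(eb, &val, (unsigned long)item, sizeof(val)) and then byte swap; dereferencing the item pointer directly is never valid.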

>
> Then deeper down we have terribly inconsistent things like
> btrfs_item_nr_offset() and
> btrfs_item_offset_nr().

Btree blocks have the offset of the item header from the start of the block 
and the offset of the item data.  And, I'm very bad at naming.

>
> Sigh... I'll see what I can do.

Thanks

-chris

[GIT PULL] a large btrfs update

2012-07-26 Thread Chris Mason

uce BTRFS_IOC_SEND for btrfs send/receive (+4717/-1)
Btrfs: introduce subvol uuids and times (+292/-15)
Btrfs: don't update atime on RO subvolumes (+7/-0)
Btrfs: add btrfs_compare_trees function (+440/-0)
Btrfs: make iref_to_path non static (+9/-5)

Chris Mason (5) commits (+22/-9):
Btrfs: call the ordered free operation without any locks held (+8/-1)
Btrfs: don't wait around for new log writers on an SSD (+2/-1)
Btrfs: add a barrier before a waitqueue_active check (+1/-0)
Btrfs: reduce calls to wake_up on uncontended locks (+9/-5)
Btrfs: uninit variable fixes in send/receive (+2/-2)

Stefan Behrens (3) commits (+9/-4):
Btrfs: avoid I/O repair BUG() from btree_read_extent_buffer_pages() (+1/-1)
Btrfs: remove unwanted printk() for btrfs device I/O stats (+0/-3)
Btrfs: suppress printk() if all device I/O stats are zero (+8/-0)

Li Zefan (3) commits (+159/-122):
Btrfs: kill free_space pointer from inode structure (+10/-19)
Btrfs: zero unused bytes in inode item (+3/-0)
Btrfs: rewrite BTRFS_SETGET_FUNCS (+146/-103)

Ilya Dryomov (2) commits (+3/-3):
Btrfs: do not ignore errors from btrfs_cleanup_fs_roots() when mounting (+2/-2)
Btrfs: do not return EINVAL instead of ENOMEM from open_ctree() (+1/-1)

Dan Carpenter (2) commits (+4/-3):
Btrfs: small naming cleanup in join_transaction() (+2/-2)
Btrfs: fix error handling in __add_reloc_root() (+2/-1)

David Sterba (2) commits (+23/-18):
btrfs: allow cross-subvolume file clone (+8/-3)
btrfs: join DEV_STATS ioctls to one (+15/-15)

Arnd Hannemann (1) commits (+8/-1):
Btrfs: allow mount -o remount,compress=no

Anand Jain (1) commits (+1/-1):
btrfs read error corrected message floods the console during recovery

Mitch Harder (1) commits (+20/-14):
Btrfs: Check INCOMPAT flags on remount and add helper function

Tsutomu Itoh (1) commits (+3/-3):
Btrfs: return error of btrfs_update_inode() to caller

Andrew Mahone (1) commits (+5/-3):
btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased

Total: (65) commits

 fs/btrfs/Makefile           |    2 +-
 fs/btrfs/async-thread.c     |    9 +-
 fs/btrfs/backref.c          |   40 +-
 fs/btrfs/backref.h          |    7 +-
 fs/btrfs/btrfs_inode.h      |   14 +-
 fs/btrfs/check-integrity.c  |    7 +-
 fs/btrfs/ctree.c            |  775 +++-
 fs/btrfs/ctree.h            |  368 +++-
 fs/btrfs/delayed-inode.c    |   23 +-
 fs/btrfs/delayed-inode.h    |    2 +
 fs/btrfs/delayed-ref.c      |   56 +-
 fs/btrfs/delayed-ref.h      |   62 +-
 fs/btrfs/disk-io.c          |  150 +-
 fs/btrfs/disk-io.h          |    6 +
 fs/btrfs/extent-tree.c      |  358 ++--
 fs/btrfs/extent_io.c        |   58 +-
 fs/btrfs/file-item.c        |    4 +-
 fs/btrfs/free-space-cache.c |    2 +-
 fs/btrfs/inode.c            |   42 +-
 fs/btrfs/ioctl.c            |  471 -
 fs/btrfs/ioctl.h            |   97 +-
 fs/btrfs/locking.c          |   14 +-
 fs/btrfs/qgroup.c           | 1571 +++
 fs/btrfs/relocation.c       |    3 +-
 fs/btrfs/root-tree.c        |  107 +-
 fs/btrfs/send.c             | 4570 +++
 fs/btrfs/send.h             |  133 ++
 fs/btrfs/struct-funcs.c     |  196 +-
 fs/btrfs/super.c            |   28 +-
 fs/btrfs/transaction.c      |  101 +-
 fs/btrfs/transaction.h      |   12 +
 fs/btrfs/tree-log.c         |    4 +-
 fs/btrfs/volumes.c          |   25 +-
 fs/btrfs/volumes.h          |    4 +-
 fs/inode.c                  |    2 +
 35 files changed, 8690 insertions(+), 633 deletions(-)

[GIT PULL] Btrfs fixes

2012-10-26 Thread Chris Mason

Hi Linus,

My for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Has our series of fixes for the next rc.  The biggest batch is from Jan
Schmidt, fixing up some problems in our subvolume quota code and fixing
btrfs send/receive to work with the new extended inode refs.

My git tree is against 3.6, but these were all retested against your
current git.

Jan Schmidt (7) commits (+149/-76):
Btrfs: don't put removals from push_node_left into tree mod log twice (+7/-2)
Btrfs: fix a tree mod logging issue for root replacement operations (+2/-8)
Btrfs: tree mod log's old roots could still be part of the tree (+21/-4)
Btrfs: fix extent buffer reference for tree mod log roots (+1/-1)
Btrfs: extended inode refs support for send mechanism (+94/-58)
Btrfs: comment for loop in tree_mod_log_insert_move (+5/-0)
Btrfs: determine level of old roots (+19/-3)

Josef Bacik (2) commits (+8/-6):
Btrfs: Use btrfs_update_inode_fallback when creating a snapshot (+6/-5)
Btrfs: do not bug when we fail to commit the transaction (+2/-1)

Stefan Behrens (1) commits (+2/-2):
Btrfs: Fix wrong error handling code

Lukas Czerner (1) commits (+2/-1):
btrfs: Return EINVAL when length to trim is less than FSB

Arne Jansen (1) commits (+2/-1):
Btrfs: send correct rdev and mode in btrfs-send

Gabriel de Perthuis (1) commits (+1/-1):
Fix a sign bug causing invalid memory access in the ino_paths ioctl.

Liu Bo (1) commits (+5/-3):
Btrfs: fix memory leak when cloning root's node

Alex Lyakas (1) commits (+13/-14):
Btrfs: Send: preserve ownership (uid and gid) also for symlinks.

Miao Xie (1) commits (+7/-0):
Btrfs: fix deadlock caused by the nested chunk allocation

Tsutomu Itoh (1) commits (+13/-4):
Btrfs: fix memory leak in btrfs_quota_enable()

Total: (17) commits (+202/-108)

 fs/btrfs/backref.c     |  28 -
 fs/btrfs/backref.h     |   4 ++
 fs/btrfs/ctree.c       |  70 +-
 fs/btrfs/ctree.h       |   3 +
 fs/btrfs/extent_io.c   |   4 +-
 fs/btrfs/inode.c       |   7 +--
 fs/btrfs/ioctl.c       |   6 +-
 fs/btrfs/qgroup.c      |  17 --
 fs/btrfs/send.c        | 156 ++---
 fs/btrfs/transaction.c |   2 +-
 fs/btrfs/volumes.c     |   7 +++
 11 files changed, 199 insertions(+), 105 deletions(-)

[GIT PULL] Btrfs

2013-10-05 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

I've merged my for-linus up to 3.12-rc3 because the top commit is only
meant for 3.12.  The rest of the fixes are also available in my master
branch on top of my last 3.11 based pull.

This is a small collection of fixes, including a regression fix from Liu
Bo that solves rare crashes with compression on.

Ilya Dryomov (2) commits (+28/-11):
Btrfs: fix a use-after-free bug in btrfs_dev_replace_finishing (+7/-5)
Btrfs: eliminate races in worker stopping code (+21/-6)

Darrick J. Wong (1) commits (+8/-0):
btrfs: Fix crash due to not allocating integrity data for a bioset

Liu Bo (1) commits (+1/-1):
Btrfs: fix crash of compressed writes

Josef Bacik (1) commits (+2/-5):
Btrfs: fix transid verify errors when recovering log tree

Total: (5) commits (+39/-17)

 fs/btrfs/async-thread.c | 25 +++--
 fs/btrfs/async-thread.h |  2 ++
 fs/btrfs/dev-replace.c  |  5 +
 fs/btrfs/extent_io.c    | 10 +-
 fs/btrfs/transaction.c  |  7 ++-
 fs/btrfs/volumes.c      |  7 ++-
 6 files changed, 39 insertions(+), 17 deletions(-)

[GIT PULL] Btrfs

2013-07-09 Thread Chris Mason

Hi Linus,

This Btrfs pull is available in two flavors:

First, my for-linus branch has it against all the btrfs pulls from 3.10:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Or, with a merge commit on top of 3.10 (master branch):

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git master

I did it this way because the 3.10 merge was pretty much empty and you
probably don't want my merge commit at the top.

There's a trivial conflict with your current master involving a printk spelling
fix, but these are otherwise clean and tested against your current tree.

These are the usual mixture of bugs, cleanups and performance fixes.  Miao has
some really nice tuning of our crc code as well as our transaction commits.

Josef is peeling off more and more problems related to early enospc, and has a
number of important bug fixes in here too.

The stats below are against my for-linus branch.

Josef Bacik (27) commits (+865/-590):
Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate (+35/-12)
Btrfs: check for actual acls rather than just xattrs when caching no acl (+16/-2)
Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc (+4/-5)
Btrfs: fix not being able to find skinny extents during relocate (+27/-8)
Btrfs: exclude logged extents before replying when we are mixed (+99/-37)
Btrfs: only do the tree_mod_log_free_eb if this is our last ref (+2/-1)
Btrfs: make the chunk allocator completely tree lockless (+166/-169)
Btrfs: check if we can nocow if we don't have data space (+148/-26)
Btrfs: use a percpu to keep track of possibly pinned bytes (+66/-5)
Btrfs: unlock extent range on enospc in compressed submit (+5/-1)
Btrfs: make backref walking code handle skinny metadata (+25/-6)
Btrfs: add some missing iput()'s in btrfs_orphan_cleanup (+4/-1)
Btrfs: hold the tree mod lock in __tree_mod_log_rewind (+6/-4)
Btrfs: cleanup backref search commit root flag stuff (+16/-27)
Btrfs: free csums when we're done scrubbing an extent (+1/-0)
Btrfs: fix transaction throttling for delayed refs (+69/-18)
Btrfs: wake up delayed ref flushing waiters on abort (+1/-0)
Btrfs: stop waiting on current trans if we aborted (+11/-4)
Btrfs: wait ordered range before doing direct io (+9/-1)
Btrfs: put our inode if orphan cleanup fails (+3/-1)
Btrfs: cleanup orphaned root orphan item (+29/-2)
Btrfs: optimize read_block_for_search (+20/-27)
Btrfs: do not pin while under spin lock (+8/-4)
Btrfs: simplify unlink reservations (+50/-191)
Btrfs: optimize reada_for_balance (+9/-37)
Btrfs: fix estale with btrfs send (+35/-0)
Btrfs: do delay iput in sync_fs (+1/-1)

Miao Xie (24) commits (+1043/-797):
Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction (+1/-8)
Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set (+18/-8)
Btrfs: just flush the delalloc inodes in the source tree before snapshot creation (+7/-9)
Btrfs: don't wait for all the writers circularly during the transaction commit (+65/-21)
Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure (+1/-9)
Btrfs: make the cleaner complete early when the fs is going to be umounted (+7/-5)
Btrfs: remove the code for the impossible case in cleanup_transaction() (+6/-5)
Btrfs: make the snap/subv deletion end more early when the fs is R/O (+15/-14)
Btrfs: introduce grab/put functions for the root of the fs/file tree (+26/-3)
Btrfs: fix several potential problems in copy_nocow_pages_for_inode (+22/-1)
Btrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot() (+9/-5)
Btrfs: fix oops when recovering the file data by scrub function (+1/-1)
Btrfs: remove the time check in btrfs_commit_transaction() (+6/-23)
Btrfs: remove unnecessary ->s_umount in cleaner_kthread() (+28/-12)
Btrfs: cleanup the code of copy_nocow_pages_for_inode() (+23/-25)
Btrfs: make the state of the transaction more readable (+116/-94)
Btrfs: cleanup the similar code of the fs root read (+228/-269)
Btrfs: cleanup redundant code in btrfs_submit_direct() (+1/-9)
Btrfs: introduce per-subvolume ordered extent list (+143/-58)
Btrfs: introduce per-subvolume delalloc inode list (+183/-67)
Btrfs: merge pending IO for tree log write back (+17/-6)
Btrfs: remove btrfs_sector_sum structure (+76/-142)
Btrfs: fix broken nocow after balance (+44/-0)
Btrfs: fix wrong mirror number tuning (+0/-3)

Liu Bo (8) commits (+65/-39):
Btrfs: check if leaf's parent exists before pushing items around (+1/-1)
Btrfs: kill replicate code in replay_one_buffer (+2/-7)
Btrfs: fix crash regarding to ulist_add_merge (+15/-0)
Btrfs: dont do log_removal in insert_new_root (+5/-5)
Btrfs: allow file data clone within a file (+19/-7)
Btrfs: remove unused code in btrfs_del_r

Re: Build failures due to commit 416161db (btrfs: offline dedupe)

2013-09-13 Thread Chris Mason

Quoting Mark Fasheh (2013-09-13 13:58:01)
> On Fri, Sep 13, 2013 at 01:00:22PM -0400, Chris Mason wrote:
> > Quoting Guenter Roeck (2013-09-13 12:35:35)
> > I'm happy to fix this with a bigger put of the info struct, just
> > let me know the preferred arch-happy solution.
> 
> In fact old versions of the patch were putting the whole struct but during
> review I was asked to change it. This should be very straightforward to fix
> so long as we all stay calm ;)
> --Mark

Mark, could you please send a patch for the whole-struct option until
the unaligned put is upstreamed?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[GIT PULL] Btrfs

2013-09-12 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This is against 3.11-rc7, but was pulled and tested against your tree as
of yesterday.  We do have two small incrementals queued up, but I wanted
to get this bunch out the door before I hop on an airplane.

This is a fairly large batch of fixes, performance improvements, and
cleanups from the usual Btrfs suspects.

We've included Stefan Behrens' work to index subvolume UUIDs, which is
targeted at speeding up send/receive with many subvolumes or snapshots
in place.  It closes a long-standing performance issue that was built
into the disk format.

Mark Fasheh's offline dedup work is also here.  In this case offline
means the FS is mounted and active, but the dedup work is not done
inline during file IO.  This is a building block where utilities are
able to ask the FS to dedup a series of extents.  The kernel takes
care of verifying the data involved really is the same.  Today this
involves reading both extents, but we'll continue to evolve the patches.

Anand Jain (3):
  btrfs: fix get set label blocking against balance
  btrfs: use BTRFS_SUPER_INFO_SIZE macro at btrfs_read_dev_super()
  btrfs: return btrfs error code for dev excl ops err

Andy Shevchenko (1):
  btrfs: reuse kbasename helper

Carey Underwood (1):
  Btrfs: Release uuid_mutex for shrink during device delete

Dan Carpenter (1):
  btrfs/raid56: fix and cleanup some error paths

Dave Jones (1):
  Fix leak in __btrfs_map_block error path

David Sterba (2):
  btrfs: make errors in btrfs_num_copies less noisy
  btrfs: add mount option to set commit interval

Filipe David Borba Manana (18):
  Btrfs: optimize btrfs_lookup_extent_info()
  Btrfs: add missing error checks to add_data_references
  Btrfs: optimize function btrfs_read_chunk_tree
  Btrfs: add missing error check to find_parent_nodes
  Btrfs: add missing error handling to read_tree_block
  Btrfs: fix inode leak on kmalloc failure in tree-log.c
  Btrfs: don't ignore errors from btrfs_run_delayed_items
  Btrfs: return ENOSPC when target space is full
  Btrfs: add missing error code to BTRFS_IOC_INO_LOOKUP handler
  Btrfs: don't miss inode ref items in BTRFS_IOC_INO_LOOKUP
  Btrfs: reset force_compress on btrfs_file_defrag failure
  Btrfs: fix memory leak of orphan block rsv
  Btrfs: fix printing of non NULL terminated string
  Btrfs: fix race between removing a dev and writing sbs
  Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
  Btrfs: fix memory leak of uuid_root in free_fs_info
  Btrfs: fix deadlock in uuid scan kthread
  Btrfs: optimize key searches in btrfs_search_slot

Geert Uytterhoeven (12):
  Btrfs: Remove superfluous casts from u64 to unsigned long long
  Btrfs: Make BTRFS_DEV_REPLACE_DEVID an unsigned long long constant
  Btrfs: Format PAGE_SIZE as unsigned long
  Btrfs: Format mirror_num as int
  Btrfs: Make btrfs_device_uuid() return unsigned long
  Btrfs: Make btrfs_device_fsid() return unsigned long
  Btrfs: Make btrfs_dev_extent_chunk_tree_uuid() return unsigned long
  Btrfs: Make btrfs_header_fsid() return unsigned long
  Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long
  Btrfs: PAGE_CACHE_SIZE is already unsigned long
  Btrfs: Do not truncate sector_t on 32-bit with CONFIG_LBDAF=y
  Btrfs: Use %z to format size_t

Ilya Dryomov (5):
  Btrfs: find_next_devid: root -> fs_info
  Btrfs: add btrfs_alloc_device and switch to it
  Btrfs: add alloc_fs_devices and switch to it
  Btrfs: rollback btrfs_device fields on umount
  Btrfs: stop refusing the relocation of chunk 0

Jeff Mahoney (1):
  btrfs: fall back to global reservation when removing subvolumes

Josef Bacik (30):
  Btrfs: stop using GFP_ATOMIC for the tree mod log allocations
  Btrfs: set lockdep class before locking new extent buffer
  Btrfs: reset ret in record_one_backref
  Btrfs: cleanup reloc roots properly on error
  Btrfs: don't bother autodefragging if our root is going away
  Btrfs: cleanup arguments to extent_clear_unlock_delalloc
  Btrfs: fix what bits we clear when erroring out from delalloc
  Btrfs: check to see if we have an inline item properly
  Btrfs: change how we queue blocks for backref checking
  Btrfs: don't bug_on when we fail when cleaning up transactions
  Btrfs: handle errors when doing slow caching
  Btrfs: check our parent dir when doing a compare send
  Btrfs: deal with enomem in the rewind path
  Btrfs: stop using GFP_ATOMIC when allocating rewind ebs
  Btrfs: skip subvol entries when checking if we've created a dir already
  Btrfs: don't allow a subvol to be deleted if it is the default subovl
  Btrfs: fix the error handling wrt orphan items
  Btrfs: fix heavy delalloc related deadlock
  Btrfs: av

Re: Build failures due to commit 416161db (btrfs: offline dedupe)

2013-09-13 Thread Chris Mason

Quoting Guenter Roeck (2013-09-13 12:35:35)
> On Fri, Sep 13, 2013 at 03:52:43PM +0200, Geert Uytterhoeven wrote:
> > On Fri, Sep 13, 2013 at 3:33 PM, Guenter Roeck wrote:
> > > fs/btrfs/ioctl.c: In function 'btrfs_ioctl_file_extent_same':
> > > fs/btrfs/ioctl.c:2802:3: error: implicit declaration of function '__put_user_unaligned' [-Werror=implicit-function-declaration]
> > > cc1: some warnings being treated as errors
> > > make[2]: *** [fs/btrfs/ioctl.o] Error 1
> > > make[2]: *** Waiting for unfinished jobs
> > >
> > > Seen with alpha:allmodconfig, arm:allmodconfig, m68k:allmodconfig, and
> > > xtensa:allmodconfig.
> > 
> > Known issue, cfr. my early warning 10 days ago:
> > 
> > "Btrfs is the first user of __put_user_unaligned() outside the compat code,
> > hence now all 32-bit architectures should make sure to implement this, too."
> > 
> > http://marc.info/?l=linux-arch&m=137820065929216&w=2
> > 
> > and today's thread https://lkml.org/lkml/2013/9/12/814
> > 
> 
> It doesn't seem right that a patch breaks the build for several platforms, and
> the problem is then blamed on the platform code instead of the code that is
> introducing the problem.
> 
> Maybe we should add BROKEN to the btrfs dependencies for the affected
> platforms.  After all, it _is_ broken.

I'm happy to fix this with a bigger put of the info struct, just
let me know the preferred arch-happy solution.

-chris
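
A note on the "bigger put" idea in the message above: instead of storing
individual fields through possibly unaligned pointers (which needs a
per-architecture helper such as __put_user_unaligned), the whole result
record can be copied in one go.  The sketch below is plain userspace C and
only illustrates the idea.  The struct layout and every sketch_* name are
assumptions made for this example, and memcpy() merely stands in for the
kernel's copy-to-user step; it is not the patch that was eventually merged.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-file result record.  The layout is an assumption
 * made for this sketch, not copied from the btrfs headers. */
struct sketch_same_info {
    int64_t  fd;
    uint64_t logical_offset;
    uint64_t bytes_deduped;
    int32_t  status;
    uint32_t reserved;
};

/* Copy the whole record to a destination of arbitrary alignment.
 * memcpy() makes no alignment assumptions about dst, so no
 * per-architecture unaligned-store helper is needed. */
static void sketch_put_info(unsigned char *dst,
                            const struct sketch_same_info *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof(*src));
}

/* Read the record back from a possibly unaligned source. */
static void sketch_get_info(struct sketch_same_info *dst,
                            const unsigned char *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof(*dst));
}
```

Copying the whole record costs a few more bytes per call than a single
field store, but it relies only on primitives every architecture already
provides.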

[GIT PULL] Btrfs

2013-09-22 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

These are mostly bug fixes and two small performance fixes.  The most
important of the bunch are Josef's fix for a snapshotting regression and
Mark's update to fix compile problems on arm.

These are on top of 3.11 + my first pull, but they were also tested
against your master as of last night.

Josef Bacik (13) commits (+219/-102):
Btrfs: check roots last log commit when checking if an inode has been logged (+4/-1)
Revert "Btrfs: rework the overcommit logic to be based on the total size" (+3/-12)
Btrfs: kill delay_iput arg to the wait_ordered functions (+14/-33)
Btrfs: drop dir i_size when adding new names on replay (+27/-0)
Btrfs: replay dir_index items before other items (+12/-3)
Btrfs: fix worst case calculator for space usage (+1/-1)
Btrfs: actually log directory we are fsync()'ing (+9/-1)
Btrfs: actually limit the size of delalloc range (+5/-3)
Btrfs: fixup error handling in btrfs_reloc_cow (+32/-22)
Btrfs: remove space_info->reservation_progress (+0/-12)
Btrfs: create the uuid tree on remount rw (+10/-0)
Btrfs: improve replacing nocow extents (+98/-14)
Btrfs: iput inode on allocation failure (+4/-0)

Stefan Behrens (2) commits (+4/-0):
btrfs: show compiled-in config features at module load time (+3/-0)
Btrfs: add the missing mutex unlock in write_all_supers() (+1/-0)

Filipe David Borba Manana (2) commits (+7/-7):
Btrfs: don't leak transaction in btrfs_sync_file() (+2/-2)
Btrfs: more efficient inode tree replace operation (+5/-5)

David Sterba (2) commits (+8/-0):
btrfs: add lockdep and tracing annotations for uuid tree (+2/-0)
btrfs: refuse to remount read-write after abort (+6/-0)

Guangyu Sun (1) commits (+2/-0):
Btrfs: dir_inode_operations should use btrfs_update_time also

Mark Fasheh (1) commits (+45/-31):
btrfs: change extent-same to copy entire argument struct

Ilya Dryomov (1) commits (+2/-1):
Btrfs: do not add replace target to the alloc_list

Frank Holton (1) commits (+2/-2):
btrfs: Add btrfs: prefix to kernel log output

chandan (1) commits (+1/-1):
Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0

Miao Xie (1) commits (+74/-31):
Btrfs: allocate the free space by the existed max extent size when ENOSPC

Total: (25) commits

 fs/btrfs/btrfs_inode.h       |   5 +-
 fs/btrfs/ctree.c             |   7 ++-
 fs/btrfs/ctree.h             |  17 ++-
 fs/btrfs/dev-replace.c       |   4 +-
 fs/btrfs/disk-io.c           |   2 +
 fs/btrfs/extent-tree.c       |  57 +++---
 fs/btrfs/extent_io.c         |   8 ++--
 fs/btrfs/file.c              |   4 +-
 fs/btrfs/free-space-cache.c  |  67 ++
 fs/btrfs/free-space-cache.h  |   5 +-
 fs/btrfs/inode.c             |  16 +--
 fs/btrfs/ioctl.c             |  80 ++-
 fs/btrfs/ordered-data.c      |  24 ++
 fs/btrfs/ordered-data.h      |   5 +-
 fs/btrfs/relocation.c        |  43 ++---
 fs/btrfs/scrub.c             | 112 +--
 fs/btrfs/super.c             |  21 +++-
 fs/btrfs/transaction.c       |   2 +-
 fs/btrfs/tree-log.c          |  52 ++--
 fs/btrfs/volumes.c           |   7 +--
 include/trace/events/btrfs.h |   1 +
 21 files changed, 364 insertions(+), 175 deletions(-)

Re: [PATCH] Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer

2012-10-09 Thread Chris Mason

On Mon, Oct 08, 2012 at 07:26:15AM -0600, Wang Sheng-Hui wrote:
> In csum_dirty_buffer, we first get eb from page->private.
> Then we check if the page is the first page of eb. Later
> we check it again. Remove the repeated check here.

You had the right idea here, two checks and one has a warning, so you
kept the warning.  But when the metadata block size is bigger than a
page, the WARN_ON triggers for any page that isn't the first one in the
extent buffer.

I kept this commit but removed the WARN_ON(1)

-chris
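
To make the point in the message above concrete: when a metadata block is
larger than a page, one extent buffer spans several pages, and only the
first of them should trigger checksumming of the whole buffer.  Every other
page legitimately takes the "not the first page" branch, so a warning on
that branch fires once per remaining page.  The sketch below is a
simplified userspace model; the sketch_* types and functions are
hypothetical and are not the btrfs structures.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_PAGES 16

/* Stand-in for a page; only its address matters here. */
struct sketch_page { int id; };

/* Stand-in for an extent buffer that spans num_pages pages. */
struct sketch_extent_buffer {
    size_t num_pages;
    struct sketch_page *pages[SKETCH_MAX_PAGES];
};

/* Checksumming is driven from the first page only; returns 1 if this
 * page is the first page of the buffer, 0 otherwise. */
static int sketch_should_csum(const struct sketch_extent_buffer *eb,
                              const struct sketch_page *page)
{
    assert(eb != NULL);
    return eb->num_pages > 0 && eb->pages[0] == page;
}

/* Count how many pages of one buffer would trip a warning placed on
 * the "not the first page" branch. */
static size_t sketch_count_spurious_warnings(
        const struct sketch_extent_buffer *eb)
{
    size_t i, warnings = 0;

    assert(eb != NULL && eb->num_pages <= SKETCH_MAX_PAGES);
    for (i = 0; i < eb->num_pages; i++)
        if (!sketch_should_csum(eb, eb->pages[i]))
            warnings++;
    return warnings;
}
```

With a 16K metadata block on 4K pages (four pages per buffer) this model
reports three spurious warnings per buffer; with one page per buffer it
reports none, which is why the problem only shows up with larger metadata
blocks.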

[GIT PULL] Btrfs

2012-10-09 Thread Chris Mason

k reference 
(+4/-0)
Btrfs: add a new "type" field into the block reservation structure (+39/-22)
Btrfs: fix wrong size for the reservation of the, snapshot creation (+4/-4)
Revert "Btrfs: do not do filemap_write_and_wait_range in fsync" (+11/-3)
Btrfs: fix file extent discount problem in the, snapshot (+25/-44)
Btrfs: fix orphan transaction on the freezed filesystem (+49/-23)
Btrfs: add a type field for the transaction handle (+21/-42)
Btrfs: fix error path in create_pending_snapshot() (+17/-23)
Btrfs: use a slab for ordered extents allocation (+31/-3)
Btrfs: fix wrong orphan count of the fs/file tree (+1/-1)
Btrfs: fix corrupted metadata in the snapshot (+32/-18)
Btrfs: fix the snapshot that should not exist (+53/-15)
Btrfs: fix memory leak in start_transaction() (+3/-1)
Btrfs: fix unprotected ->log_batch (+9/-11)

Liu Bo (13) commits (+150/-113):
Btrfs: fix a bug in checking whether a inode is already in log (+10/-8)
Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents (+7/-18)
Btrfs: fix a bug in parsing return value in logical resolve (+34/-20)
Btrfs: use larger limit for translation of logical to inode (+5/-4)
Btrfs: check if an inode has no checksum when logging it (+12/-11)
Btrfs: update delayed ref's tracepoints to show sequence (+10/-4)
Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag (+28/-14)
Btrfs: improve fsync by filtering extents that we want (+26/-3)
Btrfs: cleanup for duplicated code in find_free_extent (+0/-4)
Btrfs: cleanup extents after we finish logging inode (+6/-0)
Btrfs: use helper for logical resolve (+3/-16)
Btrfs: fix off-by-one in file clone (+9/-9)
Btrfs: cleanup fs_info->hashers (+0/-2)

Tsutomu Itoh (6) commits (+19/-20):
Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called (+2/-1)
Btrfs: remove unnecessary IS_ERR in bio_readpage_error() (+1/-1)
Btrfs: cleanup of error processing in btree_get_extent() (+5/-9)
Btrfs: fix error handling in delete_block_group_cache() (+2/-2)
Btrfs: remove unnecessary code in btree_get_extent() (+1/-7)
Btrfs: check return value of ulist_alloc() properly (+8/-0)

David Sterba (4) commits (+119/-62):
btrfs: allow setting NOCOW for a zero sized file via ioctl (+27/-4)
btrfs: move transaction aborts to the point of failure (+80/-47)
btrfs: return EPERM upon rmdir on a subvolume (+3/-2)
btrfs: polish names of kmem caches (+9/-9)

Sage Weil (3) commits (+18/-2):
Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() (+0/-2)
Btrfs: pass lockdep rwsem metadata to async commit transaction (+16/-0)
Btrfs: set journal_info in async trans commit worker (+2/-0)

Stefan Behrens (2) commits (+156/-21):
Btrfs: make filesystem read-only when submitting barrier fails (+142/-19)
Btrfs: detect corrupted filesystem after write I/O errors (+14/-2)

Robin Dong (2) commits (+12/-157):
btrfs: remove unused function btrfs_insert_some_items() (+0/-143)
btrfs: move inline function code to header file (+12/-14)

Mark Fasheh (2) commits (+848/-116):
btrfs: extended inode ref iteration (+138/-37)
btrfs: extended inode refs (+710/-79)

Wei Yongjun (2) commits (+3/-6):
Btrfs: fix possible memory leak in scrub_setup_recheck_block() (+1/-0)
Btrfs: using for_each_set_bit_from to simplify the code (+2/-6)

Chris Mason (2) commits (+38/-16):
Btrfs: fix btrfs send for inline items and compression (+37/-15)
btrfs: init ref_index to zero in add_inode_ref (+1/-1)

Jan Schmidt (2) commits (+129/-112):
btrfs: improved readablity for add_inode_ref (+97/-81)
Btrfs: fix gcc warnings for 32bit compiles (+32/-31)

Zach Brown (1) commits (+2/-1):
btrfs: fix min csum item size warnings in 32bit

Daniel J Blueman (1) commits (+11/-11):
btrfs: fix message printing

Anand Jain (1) commits (+7/-5):
Btrfs: write_buf is now callable outside send.c

Kent Overstreet (1) commits (+2/-17):
btrfs: Kill some bi_idx references

Andrei Popa (1) commits (+13/-1):
Btrfs: make compress and nodatacow mount options mutually exclusive

liubo (1) commits (+0/-8):
Btrfs: cleanup for unused ref cache stuff

Wang Sheng-Hui (1) commits (+0/-4):
Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer

Total: (121) commits

 fs/btrfs/backref.c         | 299 +++---
 fs/btrfs/backref.h         |  10 +-
 fs/btrfs/btrfs_inode.h     |  15 +-
 fs/btrfs/check-integrity.c |  16 +-
 fs/btrfs/compression.c     |  13 +-
 fs/btrfs/ctree.c           | 148 +--
 fs/btrfs/ctree.h           | 109 +-
 fs/btrfs/delayed-inode.c   |   6 +-
 fs/btrfs/disk-io.c         | 230 ++-
 fs/btrfs/disk-io.h         |   2 +
 fs/btrfs/extent-tree.c     | 376 +-
 fs/btrfs/extent_io.c       | 128 --
 fs/btrfs/extent_io.h       |  23 +-
 fs/btrfs/extent_map.c

Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-30 Thread Chris Mason

On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote:
> On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
> > On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason wrote:
> > >
> > > Just reading the new blkdev_get_blocks, it looks like we're mixing
> > > shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
> > > it has no relation at all to the actual block size of the device.  The
> > > interface is abusing b_size to ask for as large a mapping as possible.
> > 
> > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > the block number is very much defined to be in multiples of b_size
> > (see for example "submit_bh()" that turns it into a sector number).
> > 
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
> 
> Same with mpage_readpages(), so it's not just direct IO that has
> this problem

I guess the good news is that block devices don't have readpages.  The
bad news would be that we can't put readpages in without much bigger
changes.

-chris
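
The unit mismatch discussed in the quoted text above can be shown with two
small helpers.  The first turns a block number into a sector number in
multiples of b_size, which is the convention Linus points to in
submit_bh().  The second derives a block number from a page index using
the inode's block-size bits.  This is a userspace sketch assuming 512-byte
sectors; the sketch_* names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Block number to 512-byte sector number, with the block size taken
 * from b_size: sector = blocknr * (b_size / 512). */
static uint64_t sketch_block_to_sector(uint64_t blocknr, uint32_t b_size)
{
    assert(b_size >= 512);
    return blocknr * (uint64_t)(b_size >> 9);
}

/* Page index to block number, with the block size given as a shift
 * (blkbits): block = index << (page_shift - blkbits). */
static uint64_t sketch_page_to_block(uint64_t page_index,
                                     unsigned page_shift,
                                     unsigned blkbits)
{
    assert(blkbits <= page_shift);
    return page_index << (page_shift - blkbits);
}
```

With a real 4096-byte block size, block 10 maps to sector 80.  If b_size
has instead been overloaded to mean "please map up to 1MB", the same block
number would come out as sector 20480, which is why the two meanings of
b_size must not be mixed.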

Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Chris Mason

On Fri, Dec 07, 2012 at 11:18:00AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 7 Dec 2012, Ric Wheeler wrote:
> > 
> > Review is part of the way we work as a community and we should figure out
> > how to fix our review process so that we can have meaningful results from
> > the review or we lose confidence in the process and it makes it much harder
> > to get reviewers to spend time reviewing when their reviews are ultimately
> > ignored.
> 
> Christ, I promised myself to not respond any more to this thread, but the 
> insanity just continues, from people who damn well should know better.
> 
> The code wasn't merged. The review worked.
> 
> What you (and Dave, and Christoph) are trying to do is shut down a feature 
> that somebody else decided they needed. That's not what code review is all 
> about, and dammit, don't try to even claim it is.
> 
> So stop these dishonest and disingenuous arguments. They are full of crap.
> 
> No amount of "review" has any meaning what-so-ever on whether somebody 
> else decides they need a feature or not. You can review all you want, but 
> it's irrelevant - if some company decides they are going to ship or use a 
> feature, it's out of your hands.
> 
> What got merged was a ONE-LINER to make sure that possible future 
> development didn't unnecessarily make things any more confusing, with the 
> knowledge that there was a user of the code you didn't like. 
> 
> Every single argument I've heard of from the "please revert" camp has been 
> inane. And they've been *transparently* inane, to the point where I don't 
> understand how you can make them with a straight face and not be ashamed.

I really agree with Dave's statement that we should use ioctls for private
features and system calls for features other filesystems are likely to
implement.  So we really shouldn't have private bits in fallocate in use
in production systems.

That's not what happened though, and the right way forward from here is
to give the bit to the feature, maybe with a generic name like
FALLOCATE_WITHOUT_BEING_HORRIBLY_SLOW.  It should have been done
differently, but it wasn't.  And it's a problem we all have, so it makes
sense that we'll all want to address it somehow.

On a single flash drive doing random 4K writes, xfs does 950MB/s into
regular extents but only 400MB/s into preallocated extents.

http://masoncoding.com/presentation/perf-linuxcon12/fallocate.png

ext4 has a bigger hit, but there's a little room for improvement all
around.

Maybe we should use this thread as the starting point for the proper
12-18 months of bike shedding for a real fix?

-chris


Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Chris Mason

On Fri, Dec 07, 2012 at 01:43:25PM -0700, Theodore Ts'o wrote:
> On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:
> > 
> > That's not what happened though, and the right way forward from here is
> > to give the bit to the feature, maybe with a generic name like
> > FALLOCATE_WITHOUT_BEING_HORRIBLY_SLOW.  
> 
> I don't think that's a good idea, because the current name explicitly
> calls out the fact that we are making a tradeoff between what
> ***might*** be a security exposure in some cases (but which might be
> perfectly fine in others) for performance.  Using the generic name
> would hide the fact that this tradeoff is being made, and the
> argument (which I've never seen backed up with a specific design) is
> that it's possible to speed up random writes into preallocated space
> on a flash device without making any kind of tradeoff that might imply
> a security tradeoff.

Grin, we're really good at debating names.  But I do see what you mean.
I'd hope that whatever generic facility we put in doesn't have the
security implications.

> 
> If indeed it is possible to speed up this particular workload without
> making any kind of no-hide-stale tradeoff, then we won't need the bit
> --- writes into fallocated space will just get faster, with or without
> the bit
> 
> I am sure it will be possible to do this in some cases (for example if
> you have a device that supports persistent trim which can quickly
> zeroize the blocks in question), but I would be very surprised if it's
> possible to completely eliminate the performance degradation for all
> devices and workloads.  (Not all storage devices support persistent
> trim, just for starters.)

Persistent trim is what I had in mind, but there are other ideas that do
imply a change in behavior as well.  Can we safely assume this feature
won't matter on spinning media?  New features like persistent
trim do make it much easier to solve securely, and using a bit for it
means we can toss back an error to the app if the underlying storage
isn't safe.

If Google wants to have a block device patch that pretends to do persistent
trim on devices that can't, great.

> 
> In answer to Linus's question, the reason why people are
> hyperventilating so badly about this is that in some circles,
> revealing stale data is so horrible that anyone who even tries to
> suggest this should be excommunicated.  The mere existence of the
> code, or that people are using it, horribly offends those people.

So I've always said this was a real performance problem and that it
isn't just limited to ext4.  But can we please move past this part?  I
don't think it is completely accurate.

-chris

Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Chris Mason

On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
> On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
> > Persistent trim is what I had in mind, but there are other ideas that do
> > imply a change in behavior as well.  Can we safely assume this feature
> > won't matter on spinning media?  New features like persistent
> > trim do make it much easier to solve securely, and using a bit for it
> > means we can toss back an error to the app if the underlying storage
> > isn't safe.
> 
> We originally implemented no hide stale for spinning media.  Some
> folks have claimed that for XFS their superior technology means that
> no hide stale doesn't buy them anything for HDD's.  I'm not entirely
> sure I buy this, since if you need to update metadata, it means at
> least one extra seek for each random write into 4k preallocated space,
> and 7200 RPM disks only have about 200 seeks per second.

True, 7200 RPM disks are slow, but even allowing them to expose stale
data just makes them a little less slow.

I know it's against the rules to pretend that disks don't matter.  But
really, once you're doing random IO into a spindle you've given up on
performance anyway.

-chris
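
A back-of-the-envelope version of the seek arithmetic in the exchange
above, taking Ted's figure of about 200 seeks per second at face value.
The model is deliberately crude (it ignores caching, queueing and write
merging), and the sketch_* helpers are hypothetical names used only for
this illustration.

```c
#include <assert.h>

/* Random writes per second for a disk that can do seeks_per_sec seeks
 * when every write costs seeks_per_write seeks. */
static unsigned sketch_writes_per_sec(unsigned seeks_per_sec,
                                      unsigned seeks_per_write)
{
    assert(seeks_per_write > 0);
    return seeks_per_sec / seeks_per_write;
}

/* Throughput in KiB/s for writes of write_kib KiB each. */
static unsigned sketch_kib_per_sec(unsigned writes_per_sec,
                                   unsigned write_kib)
{
    return writes_per_sec * write_kib;
}
```

In this model 4K random writes reach 800 KiB/s at one seek per write and
400 KiB/s when a metadata update adds a second seek.  Avoiding the extra
seek doubles the number, but both results are tiny next to the flash
figures quoted elsewhere in this thread, which is the sense in which the
disk only gets "a little less slow".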

Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Chris Mason

On Fri, Dec 07, 2012 at 02:49:04PM -0700, Ric Wheeler wrote:
> On 12/07/2012 04:43 PM, Chris Mason wrote:
> > On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
> >> On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
> >>> Persistent trim is what I had in mind, but there are other ideas that do
> >>> imply a change in behavior as well.  Can we safely assume this feature
> >>> won't matter on spinning media?  New features like persistent
> >>> trim do make it much easier to solve securely, and using a bit for it
> >>> means we can toss back an error to the app if the underlying storage
> >>> isn't safe.
> >> We originally implemented no hide stale for spinning media.  Some
> >> folks have claimed that for XFS their superior technology means that
> >> no hide stale doesn't buy them anything for HDD's.  I'm not entirely
> >> sure I buy this, since if you need to update metadata, it means at
> >> least one extra seek for each random write into 4k preallocated space,
> >> and 7200 RPM disks only have about 200 seeks per second.
> > True, 7200 RPM disks are slow, but even allowing them to expose stale
> > data just makes them a little less slow.
> >
> > I know it's against the rules to pretend that disks don't matter.  But
> > really, once you're doing random IO into a spindle you've given up on
> > performance anyway.
> >
> > -chris
> 
> That's right.
> 
> And equally true, once you have moved the disk heads to that track, you can
> write a lot as cheaply as a little (i.e., do 1MB instead of 4KB). That will
> also avoid fragmentation of the extents.

When you do a 4K write, you have to remember that you've written just
those 4K.  When you do a 1MB write, you have to remember that you've
written just that 1MB.  It's the same operation, except with the 1MB
you've also had to set up all the bios and send down the zeros, and do
the proper locking to make sure you're not sending zeros down over
some concurrent IO.

The 1MB setup is actually more work, but it does greatly reduce the
amount of time the workload needs to run before it goes into a steady
state.  For smaller files it may work well, but for larger ones I don't
think it will be enough.

-chris
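
The trade-off described in the message above can be put into rough
numbers.  The sketch assumes a simplified model in which every conversion
of preallocated space covers one disjoint, aligned chunk, so the result is
an upper bound on the number of conversions needed before the whole file
is in the steady state.  The sketch_* helpers are hypothetical and do not
describe any filesystem's actual implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the number of conversions needed before a file of
 * file_bytes is fully converted, when each conversion covers one
 * chunk of chunk_bytes. */
static uint64_t sketch_conversions_to_steady_state(uint64_t file_bytes,
                                                   uint64_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return (file_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Extra zero bytes that must be written when a write of write_bytes
 * converts a whole chunk of chunk_bytes. */
static uint64_t sketch_extra_zero_bytes(uint64_t write_bytes,
                                        uint64_t chunk_bytes)
{
    return chunk_bytes > write_bytes ? chunk_bytes - write_bytes : 0;
}
```

For a 1GB file in this model, converting 4K at a time takes up to 262144
conversions and converting 1MB at a time takes 1024, at the price of
1044480 bytes of zeros written alongside each 4K of data.  The number of
conversions still grows with file size, which matches the concern that the
approach may not be enough for larger files.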

Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Chris Mason

On Fri, Dec 07, 2012 at 05:17:05PM -0700, Dave Chinner wrote:
> On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:

[ dead and beaten fallocate ponies ]

> 
> > On a single flash drive doing random 4K writes, xfs does 950MB/s into
> > regular extents but only 400MB/s into preallocated extents.
> > 
> > http://masoncoding.com/presentation/perf-linuxcon12/fallocate.png
> 
> This is bordering on irrelevancy, but can you provide the workload
> you were running to generate this graph?  Random 4k writes could be
> anything, really.

This one was fio aio/dio, I'll dig out the job file and rerun it on
3.7-rc on Monday.  Any real random write is going to show this with
enough load.

> 
> In my experience, applications that actually do processing between
> random write IOs don't see anywhere near the same degradation as
> such micro-benchmarks tend to indicate can occur with unwritten
> extents. Are you seeing this level of degradation in real-world applications?
> If you give me a reason to fix it (and the hardware to test it on),
> I'm pretty sure I can bring the overhead down to just a few percent
> on fully featured SSDs like FusionIO devices...

We should have a card I can send, drop me the address.

For the workload...that's harder.  We can talk all day about what a
normal random write workload is, but if you have a fio job that you
think represents real world, I can run that.

[ much nodding ;) ]

Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-10 Thread Chris Mason

On Fri, Dec 07, 2012 at 06:39:49PM -0700, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 05:17:05PM -0700, Dave Chinner wrote:
> > On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:
> > > On a single flash drive doing random 4K writes, xfs does 950MB/s into
> > > regular extents but only 400MB/s into preallocated extents.
> > > 
> > > http://masoncoding.com/presentation/perf-linuxcon12/fallocate.png
> > 
> > This is bordering on irrelevancy, but can you provide the workload
> > you were running to generate this graph?  Random 4k writes could be
> > anything, really.
> 
> This one was fio aio/dio, I'll dig out the job file and rerun it on
> 3.7-rc on Monday.  Any real random write is going to show this with
> enough load.

Ok, I ran this against 3.6.  Since my box has two iodrives in it now, I
tossed them into lvm and ran striped over both.  A single drive is iop
bound at 1GB/s, and we're able to push 2GB/s over both.  LVM slows it
down slightly, and if you let the runs go long enough, you can see the
little log structured squirrels jumping in from time to time.

Long story short, on the lvm block device we average about 1.7GB/s over
the two drives.  This is iop bound, the two cards can push about 2.6GB/s
doing streaming writes.

XFS without preallocation comes very close to the iops bound number.
This is really impressive, but it also means every additional IO to track
the preallocation is going to hurt the bottom line.

With preallocation on, the speed is the same with one drive as with two.
Eric had asked me to do a run with holes, and they come out a little
worse than preallocated.

Graphs:

http://masoncoding.com/mason/benchmark/xfs-fallocate/xfs-random-write-compare.png

The fio job is in that xfs-fallocate directory, and included below.

-chris

[global]
bs=4k
direct=1
ioengine=aio
size=12g
rw=randwrite
norandommap
runtime=30
iodepth=1024

# set overwrite=1 to force us to fully overwrite
# the preallocated files before the random IO starts
#
#overwrite=1

# set fallocate=none to use sparse files
#fallocate=none

# run 4 jobs where each job is operating on
# only one file.  This way there's no lock contention
# on the file itself.
#
[f1]
filename=/mnt/f1
[f2]
filename=/mnt/f2
[f3]
filename=/mnt/f3
[f4]
filename=/mnt/f4
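
The comments in the job file describe three starting states for each file: sparse (fallocate=none), preallocated (the job as written), and fully written before the random IO begins (overwrite=1). A minimal user-space sketch of how those three states could be produced follows. It is only an illustration, not what fio does internally; the helper names make_sparse(), make_preallocated(), make_written() and file_size() are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sparse file: the size is set but no blocks are allocated (holes). */
static int make_sparse(int fd, off_t size)
{
	return ftruncate(fd, size);
}

/*
 * Preallocated file: space is reserved up front.  On filesystems with
 * unwritten-extent support this typically yields extents that have to
 * be converted on first write, which is the extra work discussed above.
 * Returns 0 on success or an errno value, as posix_fallocate() does.
 */
static int make_preallocated(int fd, off_t size)
{
	return posix_fallocate(fd, 0, size);
}

/* Fully written file: every block holds real data before the test starts. */
static int make_written(int fd, off_t size)
{
	static const char zeros[4096];
	off_t off = 0;

	while (off < size) {
		size_t want = sizeof(zeros);
		ssize_t n;

		if ((off_t)want > size - off)
			want = (size_t)(size - off);
		n = pwrite(fd, zeros, want, off);
		if (n <= 0)
			return -1;
		off += n;
	}
	return 0;
}

/* Apparent size in bytes, or -1 on error. */
static off_t file_size(int fd)
{
	struct stat st;

	return fstat(fd, &st) == 0 ? st.st_size : -1;
}
```

All three helpers leave the file with the same apparent size; what differs is how much allocation and extent-conversion work is left for the first random write to each block.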


Re: memory corruption, possibly caused by i915

2013-01-02 Thread Chris Mason

On Wed, Jan 02, 2013 at 08:52:33AM -0700, Dave Jones wrote:
> We've had an increased number of reports in the last six months or so
> from Fedora users getting corrupted page tables.
> At first I wrote it off to bad hardware, but they started happening frequently
> enough that I began to wonder if it was a real problem.
> 
> The only common thing I could think of was that now that gnome-shell is
> our default desktop, we're making a lot more use of DRI than we used to.
> 
> To test a hypothesis, I played a whole lot of quake3 over the holidays,
> and was finally able to make it happen too.
> 
> After playing the game for a few hours, I exited it, and all was well.
> But when I then went to shut down the laptop, I saw this..
> 
> [52460.280346] BUG: Bad page map in process panel-6-systray  pte:8800b665a0e8 pmd:b6659067
> [52460.280848] addr:0038bf3fd000 vm_flags:0070 anon_vma:  (null) mapping:88011052fd98 index:1fd
> [52460.281547] vma->vm_ops->fault: filemap_fault+0x0/0x470
> [52460.281878] vma->vm_file->f_op->mmap: btrfs_file_mmap+0x0/0x60 [btrfs]
> [52460.286556] Pid: 1317, comm: panel-6-systray Not tainted 3.7.0+ #15
> [52460.286926] Call Trace:
> [52460.287086]  [] print_bad_pte+0x1e2/0x250
> [52460.287445]  [] unmap_single_vma+0x5dd/0x8a0
> [52460.287804]  [] unmap_vmas+0x51/0xa0
> [52460.288087]  [] exit_mmap+0x98/0x170
> [52460.288388]  [] mmput+0x78/0xe0
> [52460.288651]  [] do_exit+0x24e/0xa30
> [52460.288944]  [] ? fput+0xe/0x10
> [52460.289268]  [] ? task_work_run+0xac/0xe0
> [52460.289608]  [] do_group_exit+0x3f/0xa0
> [52460.289937]  [] sys_exit_group+0x17/0x20
> [52460.290288]  [] system_call_fastpath+0x16/0x1b
> 
> It's falling over in btrfs's mmap op, but I think it's just the victim here,
> of something else corrupting what had been mmaped in the panel process.

Hi Dave,

It's a btrfs file, but this isn't in our mmap op.  The traces are
finding bad pages at unmap time. 

> 
> Daniel, can you think of additional sanity checks that could be added to
> the i915 driver ? (Even if at the expense of speed: a CONFIG_DEBUG option
> to prove correctness would be very worthwhile imo)

If the bad pages are getting all the way to btrfs,
CONFIG_DEBUG_PAGEALLOC may help?  You've got lockdep on so maybe you
already enabled it.

-chris


[GIT PULL] Two btrfs reverts

2012-12-20 Thread Chris Mason

Hi Linus,

I had missed that for two of the patches in my last pull, we had
included different fixes during 3.7.  My for-linus has them reverted:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Chris Mason (2) commits (+6/-8):
Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems" (+2/-2)
Revert "Btrfs: reorder tree mod log operations in deleting a pointer" (+4/-6)

Total: (2) commits (+6/-8)

 fs/btrfs/ctree.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

Re: [BUG REPORT] Kernel panic on 3.9.0-rc7-4-gbb33db7

2013-04-19 Thread Chris Mason

Quoting Tejun Heo (2013-04-19 01:57:54)
> 
> Ewweehh
> 
> No wonder this thing crashes.  Chris, can't the original bio carry
> bbio in bi_private and let end_bio_extent_readpage() free the bbio
> instead of abusing bi_bdev like this?

Yes, we can definitely carry bbio up higher in the stack.  I'll patch it
up right now.  I do agree that it'll be too big for -final, but we'll
have it either way.
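
The shape Tejun suggests, where the context rides in the private pointer and the completion handler owns freeing it, can be modeled in user space. This is a sketch under stated assumptions only: struct fake_bio, struct fake_bbio, submit_read(), end_read() and complete_io() are hypothetical stand-ins, not the kernel's struct bio, the btrfs structures, or the eventual btrfs fix.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a bio. */
struct fake_bio {
	void *bi_private;                          /* owner-defined context */
	void (*bi_end_io)(struct fake_bio *bio, int err);
};

/* Hypothetical per-IO context; carries the int that needed a home. */
struct fake_bbio {
	int mirror_num;
};

static int last_mirror = -1;   /* what the completion handler observed */
static int bbio_frees;         /* how many contexts were freed */

/* Completion: read the context from bi_private, then free it exactly once. */
static void end_read(struct fake_bio *bio, int err)
{
	struct fake_bbio *bbio = bio->bi_private;

	(void)err;
	last_mirror = bbio->mirror_num;
	free(bbio);
	bbio_frees++;
	bio->bi_private = NULL;
}

/* Submission: attach the context instead of overloading another field. */
static int submit_read(struct fake_bio *bio, int mirror_num)
{
	struct fake_bbio *bbio = malloc(sizeof(*bbio));

	if (!bbio)
		return -1;
	bbio->mirror_num = mirror_num;
	bio->bi_private = bbio;
	bio->bi_end_io = end_read;
	return 0;
}

/* Stand-in for the block layer signalling completion. */
static void complete_io(struct fake_bio *bio, int err)
{
	bio->bi_end_io(bio, err);
}
```

The point of the pattern is that no field with another meaning (bi_bdev in the real bug) is ever overwritten, so code that inspects the bio after submission still sees valid data.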

-chris


Re: [BUG REPORT] Kernel panic on 3.9.0-rc7-4-gbb33db7

2013-04-19 Thread Chris Mason

Quoting Jens Axboe (2013-04-19 09:32:50)
> > 
> > No wonder this thing crashes.  Chris, can't the original bio carry
> > bbio in bi_private and let end_bio_extent_readpage() free the bbio
> > instead of abusing bi_bdev like this?
> 
> Ugh, wtf.
> 
> Chris, time for a swim in the bay :-)

Yeah, I can't really defend this one.  We needed a space for an int and
I assumed end_io meant the FS was free to do horrible things.

Really though, I'll just take a quick dip in the lake and patch this out
of btrfs. 

Jan is probably right about changing around our endio callbacks to
explicitly pass the mirror, it should be less complex and cleaner.

Many thanks to everyone here that tracked it down.

-chris

[GIT PULL] Btrfs

2013-06-13 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This is an assortment of crash fixes:

Josef Bacik (3) commits (+9/-8):
Btrfs: don't delete fs_roots until after we cleanup the transaction (+1/-1)
Btrfs: init relocate extent_io_tree with a mapping (+5/-4)
Btrfs: stop all workers before cleaning up roots (+3/-3)

Liu Bo (1) commits (+2/-2):
Btrfs: fix use-after-free bug during umount

Naohiro Aota (1) commits (+3/-0):
btrfs: Drop inode if inode root is NULL

Total: (5) commits (+14/-10)

 fs/btrfs/disk-io.c| 10 +-
 fs/btrfs/inode.c  |  3 +++
 fs/btrfs/relocation.c |  9 +
 3 files changed, 13 insertions(+), 9 deletions(-)

Re: [3.10-rc6] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

2013-06-17 Thread Chris Mason

Quoting Dave Jones (2013-06-17 09:49:55)
> Hit this while running this script in a loop..
> https://github.com/kernelslacker/io-tests/blob/master/setup.sh
> [34385.251507] [ cut here ]
> [34385.254068] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

Thanks Dave, how long did you have to run the script to trigger it?

-chris

Re: [PATCH 0/2] introduce list_for_each_entry_del

2013-06-04 Thread Chris Mason

Quoting Christoph Hellwig (2013-06-04 10:48:56)
> On Mon, Jun 03, 2013 at 03:55:55PM -0400, Jörn Engel wrote:
> > Actually, when I compare the two invocations, I prefer the
> > list_for_each_entry_del() variant over list_pop_entry().
> > 
> > while ((ref = list_pop_entry(&prefs, struct __prelim_ref, list))) {
> > list_for_each_entry_del(ref, &prefs, list) {
> > 
> > Christoph?
> 
> I really don't like something that looks like an iterator (*for_each*)
> to modify a list.  Maybe it's just me, so I'd love to hear others chime
> in.

Have to agree with Christoph.  I just couldn't put my finger on why I
didn't like it until I saw the list_pop_entry suggestion.
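
To make the readability argument concrete, here is a user-space sketch of the pop-style loop. It assumes a simplified list_head; list_pop(), pop_entry_helper() and this list_pop_entry() are illustrative stand-ins written for this example, not the macros from the posted patches.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Unlink and return the first node, or NULL if the list is empty. */
static struct list_head *list_pop(struct list_head *head)
{
	struct list_head *first = head->next;

	if (first == head)
		return NULL;
	first->prev->next = first->next;
	first->next->prev = first->prev;
	first->next = first->prev = first;
	return first;
}

/* container_of-style conversion that passes NULL through. */
static void *pop_entry_helper(struct list_head *node, size_t offset)
{
	return node ? (void *)((char *)node - offset) : NULL;
}

#define list_pop_entry(head, type, member) \
	pop_entry_helper(list_pop(head), offsetof(type, member))

struct prelim_ref {
	int id;
	struct list_head list;
};

/* The loop condition itself says "remove"; nothing looks like an iterator. */
static int drain_sum(struct list_head *prefs)
{
	struct prelim_ref *ref;
	int sum = 0;

	while ((ref = list_pop_entry(prefs, struct prelim_ref, list)))
		sum += ref->id;
	return sum;
}
```

A reader of drain_sum() can tell from the call site alone that the list is empty afterwards, which is the property a for_each-style name tends to hide.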

-chris


Re: [3.10-rc6] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

2013-06-17 Thread Chris Mason

Quoting Dave Jones (2013-06-17 14:20:06)
> On Mon, Jun 17, 2013 at 01:39:42PM -0400, Chris Mason wrote:
>  > Quoting Dave Jones (2013-06-17 09:49:55)
>  > > Hit this while running this script in a loop..
>  > > https://github.com/kernelslacker/io-tests/blob/master/setup.sh
>  > > [34385.251507] [ cut here ]
>  > > [34385.254068] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()
>  > 
>  > Thanks Dave, how long did you have to run the script to trigger it?
>  > 
>  > -chris
> 
> Judging by the timestamp, about 9 hours.  This is on a 3 disk (sata)
> (oldish) quad opteron. Might repro faster on a more modern machine.

Exactly how did you run it?  I want to make sure I'm matching your
config.

-chris


Re: [3.10-rc6] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

2013-06-19 Thread Chris Mason

Quoting Dave Jones (2013-06-17 14:58:10)
> On Mon, Jun 17, 2013 at 02:42:27PM -0400, Chris Mason wrote:
>  > Quoting Dave Jones (2013-06-17 14:20:06)
>  > > On Mon, Jun 17, 2013 at 01:39:42PM -0400, Chris Mason wrote:
>  > >  > Quoting Dave Jones (2013-06-17 09:49:55)
>  > >  > > Hit this while running this script in a loop..
>  > >  > > https://github.com/kernelslacker/io-tests/blob/master/setup.sh
>  > >  > > [34385.251507] [ cut here ]
>  > >  > > [34385.254068] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()
>  > >  > 
>  > >  > Thanks Dave, how long did you have to run the script to trigger it?
>  > >  > 
>  > >  > -chris
>  > > 
>  > > Judging by the timestamp, about 9 hours.  This is on a 3 disk (sata)
>  > > (oldish) quad opteron. Might repro faster on a more modern machine.
>  > 
>  > Exactly how did you run it?  I want to make sure I'm matching your
>  > config.
> 
> while [ 1 ];
> do
>   setup.sh
> done
> 
> You'll need to set DISK1 etc variables at the top of the script to point to
> at least 3 disks for it to scribble over.
> 
> you'll also need fsx and fsstress in /usr/local/bin.

I've tried with and without memory pressure, and let it run for about 30
hours.  So far, nothing here.  Have you seen this again?

-chris


Re: [3.10-rc6] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()

2013-06-19 Thread Chris Mason

Quoting Dave Jones (2013-06-19 14:34:50)
> On Wed, Jun 19, 2013 at 02:02:33PM -0400, Chris Mason wrote:
>  > Quoting Dave Jones (2013-06-17 14:58:10)
>  > > On Mon, Jun 17, 2013 at 02:42:27PM -0400, Chris Mason wrote:
>  > >  > Quoting Dave Jones (2013-06-17 14:20:06)
>  > >  > > On Mon, Jun 17, 2013 at 01:39:42PM -0400, Chris Mason wrote:
>  > >  > >  > Quoting Dave Jones (2013-06-17 09:49:55)
>  > >  > >  > > Hit this while running this script in a loop..
>  > >  > >  > > https://github.com/kernelslacker/io-tests/blob/master/setup.sh
>  > >  > >  > > [34385.251507] [ cut here ]
>  > >  > >  > > [34385.254068] WARNING: at fs/btrfs/inode.c:7961 btrfs_destroy_inode+0x265/0x2e0 [btrfs]()
>  > >  > >  > 
>  > >  > >  > Thanks Dave, how long did you have to run the script to trigger it?
>  > >  > > 
>  > >  > > Judging by the timestamp, about 9 hours.  This is on a 3 disk (sata)
>  > >  > > (oldish) quad opteron. Might repro faster on a more modern machine.
>  > >  > 
>  > >  > Exactly how did you run it?  I want to make sure I'm matching your
>  > >  > config.
>  > > 
>  > > while [ 1 ];
>  > > do
>  > >   setup.sh
>  > > done
>  > > 
>  > > You'll need to set DISK1 etc variables at the top of the script to point to
>  > > at least 3 disks for it to scribble over.
>  > > 
>  > > you'll also need fsx and fsstress in /usr/local/bin.
>  > 
>  > I've tried with and without memory pressure, and let it run for about 30
>  > hours.  So far, nothing here.  Have you seen this again?
> 
> yeah, one time I hit it within 30 minutes.

Ok, I'll try bumping the thread count on both fsstress and fsx

-chris


Re: btrfs triggered lockdep WARN.

2013-06-27 Thread Chris Mason

Quoting Dave Jones (2013-06-27 10:58:24)
> Another bug caused by this script. 
> https://github.com/kernelslacker/io-tests/blob/master/setup.sh

I'm still struggling to reproduce that one here.  I've tried every
variation I can think of but I'll try again.

I really hope you don't already have CONFIG_DEBUG_PAGEALLOC turned on;
maybe it will catch this?

-chris

Re: btrfs triggered lockdep WARN.

2013-06-27 Thread Chris Mason

Quoting Dave Jones (2013-06-27 11:19:22)
> On Thu, Jun 27, 2013 at 11:01:30AM -0400, Chris Mason wrote:
>  > Quoting Dave Jones (2013-06-27 10:58:24)
>  > > Another bug caused by this script. 
>  > > https://github.com/kernelslacker/io-tests/blob/master/setup.sh
>  > 
>  > I'm still struggling to reproduce that one here.  I've tried every
>  > variation I can think of but I'll try again.
>  
> Note that this is a different trace to the other post about that script.

Yeah, but I haven't hit anything unusual at all yet.

> 
>  > I really hope you don't already have CONFIG_DEBUG_PAGEALLOC turned on,
>  > maybe it will catch this?
> 
> I do. Though given this is lockdep complaining about what looks like
> memory corruption, it's probably not related.

Ok, could you please try this with some heavy memory pressure?  I'm
hoping to trigger a use-after-free that points us in the right
direction.

-chris


Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-08-19 Thread Chris Mason

Quoting Linus Torvalds (2013-08-19 17:16:36)
> On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter  wrote:
> > On Mon, 19 Aug 2013, Simon Kirby wrote:
> >
> >>[... ]  The
> >> alloc/free traces are always the same -- always alloc_pipe_info and
> >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> >>
> >> Object 880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  
> >> lkkk
> >
> > This looks like an increment after free in the second 32 bit value of the
> > structure. First 32 bit value's poison is unchanged.
> 
> Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> the "wait_lock" spinlock that is part of the mutex.
> 
> Doing a "spin_lock()" could indeed cause an increment operation. But
> it still sounds like a very odd case. And even for some wild pointer
> I'd then expect the spin_unlock to also happen, and to then increment
> the next byte (or word) too. More importantly, for a mutex, I'd expect
> the *other* fields to be corrupted too (the "waiter" field etc). That
> is, unless we're still spinning waiting for the mutex, but with that
> value we shouldn't, as far as I can see.
> 

Simon, is this box doing btrfs send/receive?  If so, it's probably where
this pipe is coming from.

Linus' CONFIG_DEBUG_PAGEALLOC suggestions are going to be the fastest
way to find it, I can give you a patch if it'll help.
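
Christoph's reading of the poison dump quoted earlier in this message can be checked mechanically. The sketch below assumes only that freed objects are filled with the 0x6b poison byte; first_poison_mismatch() is a made-up helper name, and the dump array is copied from the quoted "Object ..." line.

```c
#include <assert.h>
#include <stddef.h>

#define POISON_FREE 0x6b  /* byte pattern written into freed slab objects */

/* Return the offset of the first byte that is not POISON_FREE, or -1. */
static long first_poison_mismatch(const unsigned char *obj, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (obj[i] != POISON_FREE)
			return (long)i;
	return -1;
}

/* The 16 bytes from the quoted "Object ..." line. */
static const unsigned char dump[16] = {
	0x6b, 0x6b, 0x6b, 0x6b, 0x6c, 0x6b, 0x6b, 0x6b,
	0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b,
};
```

The mismatch lands at byte offset 4, the low byte of the second 32-bit value on a little-endian machine, and differs from the poison by exactly one, which is consistent with a single increment after free.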

It would be nice if you could trigger this on plain 3.11-rcX instead of
btrfs-next.

-chris

[GIT PULL] Btrfs

2013-08-10 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

These are assorted fixes, mostly from Josef nailing down xfstests runs.
Zach also has a long standing fix for problems with readdir wrapping
f_pos (or ctx->pos).

These patches were spread out over different bases, so I rebased things on
top of rc4 and retested overnight.

Josef Bacik (6) commits (+82/-52):
Btrfs: check to see if root_list is empty before adding it to dead roots (+5/-5)
Btrfs: make sure the backref walker catches all refs to our extent (+14/-11)
Btrfs: allow splitting of hole em's when dropping extent cache (+40/-22)
Btrfs: release both paths before logging dir/changed extents (+2/-3)
Btrfs: fix backref walking when we hit a compressed extent (+15/-8)
Btrfs: do not offset physical if we're compressed (+6/-3)

Liu Bo (2) commits (+12/-5):
Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents (+12/-4)
Btrfs: fix extent buffer leak after backref walking (+0/-1)

Zach Brown (1) commits (+25/-8):
btrfs: don't loop on large offsets in readdir

Jie Liu (1) commits (+0/-3):
btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified

Total: (10) commits (+119/-68)

 fs/btrfs/backref.c | 48 ++
 fs/btrfs/ctree.c   |  1 -
 fs/btrfs/extent_io.c   |  9 +---
 fs/btrfs/file.c| 62 --
 fs/btrfs/inode.c   | 52 ++
 fs/btrfs/transaction.c |  8 +++
 fs/btrfs/transaction.h |  2 +-
 fs/btrfs/tree-log.c|  5 ++--
 8 files changed, 119 insertions(+), 68 deletions(-)

[GIT PULL] Btrfs

2013-03-02 Thread Chris Mason

rors in compression submission path (+28/-10)
Btrfs: remove extent mapping if we fail to add chunk (+12/-2)
Btrfs: relax the block group size limit for bitmaps (+9/-3)
Btrfs: cleanup orphan reservation if truncate fails (+2/-0)
Btrfs: make sure NODATACOW also gets NODATASUM set (+2/-1)
Btrfs: don't re-enter when allocating a chunk (+9/-0)
Btrfs: remove unused extent io tree ops V2 (+11/-27)
Btrfs: fix chunk allocation error handling (+22/-10)

Liu Bo (14) commits (+796/-109):
Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay (+3/-6)
Btrfs: fix cleaner thread not working with inode cache option (+8/-1)
Btrfs: use token to avoid times mapping extent buffer (+35/-28)
Btrfs: extend the checksum item as much as possible (+46/-21)
Btrfs: fix NULL pointer after aborting a transaction (+7/-1)
Btrfs: use reserved space for creating a snapshot (+2/-0)
Btrfs: kill unused argument of update_block_group (+5/-7)
Btrfs: kill unused arguments of cache_block_group (+5/-8)
Btrfs: do not change inode flags in rename (+0/-25)
Btrfs: record first logical byte in memory (+20/-1)
Btrfs: fix memory leak of log roots (+9/-2)
Btrfs: remove deprecated comments (+0/-6)
Btrfs: snapshot-aware defrag (+654/-0)
Btrfs: save us a read_lock (+2/-3)

Eric Sandeen (11) commits (+58/-108):
btrfs: ensure we don't overrun devices_info[] in __btrfs_alloc_chunk (+5/-1)
btrfs: remove unused "item" in btrfs_insert_delayed_item() (+0/-2)
btrfs: remove unused fs_info from btrfs_decode_error() (+4/-5)
btrfs: remove cache only arguments from defrag path (+32/-82)
btrfs: remove unnecessary DEFINE_WAIT() declarations (+0/-2)
btrfs: annotate intentional switch case fallthroughs (+2/-0)
btrfs: add missing break in btrfs_print_leaf() (+1/-0)
btrfs: remove unused fd in btrfs_ioctl_send() (+0/-3)
btrfs: handle null fs_info in btrfs_panic() (+7/-4)
btrfs: fix varargs in __btrfs_std_error (+7/-7)
btrfs: list_entry can't return NULL (+0/-2)

Chris Mason (7) commits (+561/-30):
Btrfs: reduce CPU contention while waiting for delayed extent operations (+70/-5)
Btrfs: remove conflicting check for minimum number of devices in raid56 (+0/-8)
Btrfs: reduce lock contention on extent buffer locks (+16/-0)
Btrfs: add a plugging callback to raid56 writes (+124/-4)
Btrfs: fix cluster alignment for mount -o ssd (+6/-1)
Btrfs: fix max chunk size on raid5/6 (+21/-4)
Btrfs: Add a stripe cache to raid56 (+324/-8)

Wang Shilong (6) commits (+78/-68):
Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree (+0/-3)
Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic (+38/-44)
Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails (+9/-3)
Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails (+6/-5)
Btrfs: fix missing deleted items in btrfs_clean_quota_tree (+21/-13)
Btrfs: fix missing check before disabling quota (+4/-0)

David Sterba (6) commits (+131/-42):
btrfs: access superblock via pagecache in scan_one_device (+64/-6)
btrfs: put some enospc messages under enospc_debug (+15/-11)
btrfs: try harder to allocate raid56 stripe cache (+26/-7)
btrfs: use only inline_pages from extent buffer (+7/-17)
btrfs: remove a printk from scan_one_device (+0/-1)
btrfs: add cancellation points to defrag (+19/-0)

Zach Brown (2) commits (+9/-12):
btrfs: limit fallocate extent reservation to 256MB (+4/-3)
btrfs: define BTRFS_MAGIC as a u64 value (+5/-9)

David Woodhouse (2) commits (+2294/-113):
Btrfs: add rw argument to merge_bio_hook() (+11/-11)
Btrfs: RAID5 and RAID6 (+2283/-102)

Ilya Dryomov (2) commits (+6/-6):
Btrfs: allow for selecting only completely empty chunks (+1/-1)
Btrfs: eliminate a use-after-free in btrfs_balance() (+5/-5)

jeff.liu (2) commits (+67/-0):
Btrfs: Add a new ioctl to get the label of a mounted file system (+23/-0)
Btrfs: set/change the label of a mounted file system (+44/-0)

Filipe Brandenburger (1) commits (+19/-11):
Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h

Mark Fasheh (1) commits (+54/-4):
btrfs: add "no file data" flag to btrfs send ioctl

Alexandre Oliva (1) commits (+3/-3):
clear chunk_alloc flag on retryable failure

Thomas Gleixner (1) commits (+1/-0):
btrfs: Init io_lock after cloning btrfs device struct

Paul Gortmaker (1) commits (+1/-4):
btrfs: fixup/remove module.h usage as required

Tomasz Torcz (1) commits (+1/-0):
Btrfs: select XOR_BLOCKS in Kconfig

Jan Schmidt (1) commits (+1/-4):
Btrfs: fix backref walking race with tree deletions

Qu Wenruo (1) commits (+25/-38):
btrfs: cleanup for open-coded alignment

Kusanagi Kouichi (1) commits (+1/-1):
Btrfs: Check CAP_DAC_READ_SEARCH for BTRFS_IOC_INO_PATHS

Arne Jansen (1) commits (+

Re: [GIT PULL] Btrfs

2013-03-02 Thread Chris Mason

On Sat, Mar 02, 2013 at 05:45:41PM -0700, Linus Torvalds wrote:
> On Sat, Mar 2, 2013 at 7:15 AM, Chris Mason  wrote:
> >
> > Our set of btrfs features, fixes and cleanups are in my for-linus
> > branch:
> 
> I *really* wish that big pull requests like this would come in earlier
> in the merge window. I hate seeing them the day before I close the
> window - really.  A number of the latter commits are done in the last
> few days, which also smells bad.

Definitely, I wanted to send this earlier in the merge window.  But I
was out last week and also didn't want to send the big stuff (raid 5/6
and the fsync work) to you right before I left on vacation.

So instead I sent things off to linux-next, and everyone on the btrfs
list collected fixes while I was gone.

-chris

Re: [PATCH] btrfs/raid56: Add missing #include

2013-03-03 Thread Chris Mason

On Sun, Mar 03, 2013 at 04:44:41AM -0700, Geert Uytterhoeven wrote:
> tilegx_defconfig:
> 
> fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
> fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
> fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer without a cast [enabled by default]
> fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]

Thanks, I've got this one in my for-linus now.  It'll go with the next
pull.

-chris

[GIT PULL] Btrfs fixup

2013-03-03 Thread Chris Mason

Hi Linus,

Geert and James both sent this one in, sorry guys.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Geert Uytterhoeven (1) commits (+1/-0):
btrfs/raid56: Add missing #include 

Total: (1) commits (+1/-0)

 fs/btrfs/raid56.c | 1 +
 1 file changed, 1 insertion(+)

Re: [GIT PULL] SLAB changes for v3.10

2013-05-08 Thread Chris Mason

[ Sorry if I break the threading on this, I had to pull it off gmane ]

On Tue, 7 May 2013, Tony Lindgren wrote:
> OK got it narrowed down to CONFIG_DEBUG_SPINLOCK=y causing the problem
> with commit 8a965b3b. Ain't nothing like bisecting and booting and then
> diffing .config files on top of that.

I'm unable to boot with slab on current Linus -master, and bisected it
down almost as far as Tony did before trying SLUB and then finding this
thread.   My box is a standard x86-64, nothing exciting and spinlock
debugging isn't on.

A few printks stuffed into Christoph's code:

cache 88047f80 at index 5
creating slab cache at 6
create special cache #1 (this is kmalloc_caches[1])
cache 88047f0001c0 at index 7
creating slab cache at 8
creating slab cache at 9
creating slab cache at 10
... more get created

Pulling this into the code from commit 8a965b3b:

for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
	if (!kmalloc_caches[i]) {
		kmalloc_caches[i] = create_kmalloc_cache(NULL,
						1 << i, flags);

		/*
		 * Caches that are not of the two-to-the-power-of size.
		 * These have to be created immediately after the
		 * earlier power of two caches
		 */
		if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
			kmalloc_caches[1] = create_kmalloc_cache(NULL, 96, flags);

		if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i == 7)
			kmalloc_caches[2] = create_kmalloc_cache(NULL, 192, flags);
	}
}

kmalloc_caches[7] was not null, and so kmalloc_caches[2] was never
created.
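
The skip can be reproduced with a small user-space model of the loop. This is a sketch under assumptions: created[i] stands in for kmalloc_caches[i] being non-NULL, SHIFT_LOW and SHIFT_HIGH are illustrative bounds, the KMALLOC_MIN_SIZE tests are taken as satisfied, and the starting state (indexes 5 and 7 already present) is read off the printk output above. None of the names below are kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SHIFT_LOW   5    /* assumed: smallest power-of-two cache index */
#define SHIFT_HIGH  12   /* assumed upper bound, kept small for the model */
#define NR_CACHES   (SHIFT_HIGH + 1)

/* created[i] models kmalloc_caches[i] != NULL */
struct model {
	bool created[NR_CACHES];
};

/* Early-boot state: two power-of-two caches already exist. */
static struct model early_boot(int a, int b)
{
	struct model m;

	memset(&m, 0, sizeof(m));
	m.created[a] = true;
	m.created[b] = true;
	return m;
}

/* Odd-size checks nested inside the "not yet created" test. */
static void create_caches_nested(struct model *m)
{
	for (int i = SHIFT_LOW; i <= SHIFT_HIGH; i++) {
		if (!m->created[i]) {
			m->created[i] = true;
			if (!m->created[1] && i == 6)
				m->created[1] = true;   /* 96-byte cache */
			if (!m->created[2] && i == 7)
				m->created[2] = true;   /* 192-byte cache */
		}
	}
}

/* Odd-size checks hoisted out, so they run even when index i pre-exists. */
static void create_caches_hoisted(struct model *m)
{
	for (int i = SHIFT_LOW; i <= SHIFT_HIGH; i++) {
		if (!m->created[i])
			m->created[i] = true;
		if (!m->created[1] && i == 6)
			m->created[1] = true;           /* 96-byte cache */
		if (!m->created[2] && i == 7)
			m->created[2] = true;           /* 192-byte cache */
	}
}
```

Starting from early_boot(5, 7), the nested variant creates index 1 (because index 6 is new) but never reaches the index 2 check, matching the printk trace; the hoisted variant creates both.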

I get this oops (with early printk on)

[ cut here ]
kernel BUG at mm/slab.c:1635!
invalid opcode:  [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.9.0-josef+ #920
Hardware name: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 1.0a 03/06/2012
task: 8196a410 ti: 8195a000 task.ti: 8195a000
RIP: 0010:[]  [] kmem_cache_init_late+0x40/0x7d
RSP: :8195bf78  EFLAGS: 00010282
RAX: fff4 RBX: 88047f006480 RCX: ff31
RDX: 000e RSI: 0046 RDI: 81baf238
RBP: 8195bf80 R08: 0400 R09: 
R10: 2fa4 R11:  R12: 81ab58d0
R13: 81abd2c0 R14: 88047ffaf0c0 R15: 
FS:  () GS:88047fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 88047000 CR3: 01965000 CR4: 000406b0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Stack:
  8195bfc0 81a25b64 81a2573d
 81abd2c0 3000  
  8195bfd0 81a2547f 8195bfe8
Call Trace:
 [] start_kernel+0x235/0x394
 [] ? repair_env_string+0x58/0x58
 [] x86_64_start_reservations+0x2a/0x2c
 [] x86_64_start_kernel+0xc7/0xca
Code: 53 e8 40 57 bd ff 48 8b 05 21 f2 f4 ff 48 8d 58 a8 48 8d 43 58 48 3d a0 
8e 99 81 74 1a 31 f6 48 89 df e8 ba 7f 6d ff 85 c0 74 02 <0f> 0b 48 8b 5b 58 48 
83 eb 58 eb da 48 c7 c7 70 8e 99 81 e8 e6
RIP  [] kmem_cache_init_late+0x40/0x7d
 RSP 
---[ end trace 2e5587581263f881 ]---

This patch fixes things for me, but to maintain the rules from
Christoph's patch, kmalloc_caches[2] should have been created whenever
kmalloc_caches[7] was done.

diff --git a/mm/slab_common.c b/mm/slab_common.c
index d2517b0..ff3218a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -446,18 +446,18 @@ void __init create_kmalloc_caches(unsigned long flags)
if (!kmalloc_caches[i]) {
kmalloc_caches[i] = create_kmalloc_cache(NULL,
1 << i, flags);
+   }
 
-   /*
-* Caches that are not of the two-to-the-power-of size.
-* These have to be created immediately after the
-* earlier power of two caches
-*/
-   if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
-   kmalloc_caches[1] = create_kmalloc_cache(NULL, 96, flags);
+   /*
+* Caches that are not of the two-to-the-power-of size.
+* These have to be created immediately after the
+* earlier power of two caches
+*/
+   if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
+   kmalloc_c

Re: [GIT PULL] SLAB changes for v3.10

2013-05-08 Thread Chris Mason

Quoting Christoph Lameter (2013-05-08 14:25:49)
> On Wed, 8 May 2013, Chris Mason wrote:
> 
> > This patch fixes things for me, but to maintain the rules from
> > Christoph's patch,  kmalloc_caches[2] should have been created whenever
> > kmalloc_caches[7] was done.
> 
> Not necessary. The early slab bootstrap must create some slab caches of
> specific sizes, it will only use those during very early bootstrap.
> 
> The later creation of the array must skip those.
> 
> You correctly moved the checks out of the if (!kmalloc_cacheS())
> condition so that the caches are created properly.

But if the ordering is required at all, why is it ok to create cache 2
after cache 6 instead of after cache 7?

IOW if we can safely do cache 2 after cache 6, why can't we just do both
cache 1 and cache 2 after the loop?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] Fix crash during slab init

2013-05-08 Thread Chris Mason

Commit 8a965b3b introduced a regression that caused us to crash early
during boot.  The commit was introducing ordering of slab creation,
making sure two odd-sized slabs were created after specific powers of
two sizes.

But if any of the power-of-two slabs were created earlier during boot,
slabs at index 1 or 2 might not get created at all.  This patch makes
sure none of the slabs get skipped.
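
The skipped-slab case can be modeled in user space.  The following is only
a sketch under stated assumptions, not kernel code: the struct, the
function names and the SHIFT_LOW/SHIFT_HIGH limits are hypothetical
stand-ins, and the KMALLOC_MIN_SIZE conditions are left out for brevity.
A boolean array stands in for "kmalloc_caches[i] is non-NULL".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical limits; the real values depend on the kernel config. */
#define SHIFT_LOW   3
#define SHIFT_HIGH  13
#define NR_CACHES   (SHIFT_HIGH + 1)

/* created[i] models "kmalloc_caches[i] != NULL". */
struct caches {
	bool created[NR_CACHES];
};

/* Old shape: the odd-size checks sit inside the power-of-two branch. */
void create_nested(struct caches *c)
{
	for (int i = SHIFT_LOW; i <= SHIFT_HIGH; i++) {
		if (!c->created[i]) {
			c->created[i] = true;
			if (!c->created[1] && i == 6)
				c->created[1] = true;	/* 96-byte cache */
			if (!c->created[2] && i == 7)
				c->created[2] = true;	/* 192-byte cache */
		}
	}
}

/* Patched shape: the odd-size checks run on every iteration. */
void create_hoisted(struct caches *c)
{
	for (int i = SHIFT_LOW; i <= SHIFT_HIGH; i++) {
		if (!c->created[i])
			c->created[i] = true;
		if (!c->created[1] && i == 6)
			c->created[1] = true;	/* 96-byte cache */
		if (!c->created[2] && i == 7)
			c->created[2] = true;	/* 192-byte cache */
	}
}

/* Model early bootstrap having already created the cache at index early. */
struct caches bootstrap_with(int early)
{
	struct caches c;

	assert(early >= 0 && early < NR_CACHES);
	memset(&c, 0, sizeof(c));
	c.created[early] = true;
	return c;
}
```

Starting from bootstrap_with(7), create_nested() never enters the branch
for i == 7, so index 2 stays empty, while create_hoisted() fills it.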

Tony Lindgren bisected this down to the offending commit, which really
helped because bisect kept bringing me to almost but not quite this one.

Signed-off-by: Chris Mason 
Acked-by: Christoph Lameter 
Acked-by: Tony Lindgren 

---

v1->v2 reword description

diff --git a/mm/slab_common.c b/mm/slab_common.c
index d2517b0..ff3218a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -446,18 +446,18 @@ void __init create_kmalloc_caches(unsigned long flags)
 		if (!kmalloc_caches[i]) {
 			kmalloc_caches[i] = create_kmalloc_cache(NULL,
 							1 << i, flags);
+		}
 
-			/*
-			 * Caches that are not of the two-to-the-power-of size.
-			 * These have to be created immediately after the
-			 * earlier power of two caches
-			 */
-			if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
-				kmalloc_caches[1] = create_kmalloc_cache(NULL, 96, flags);
+		/*
+		 * Caches that are not of the two-to-the-power-of size.
+		 * These have to be created immediately after the
+		 * earlier power of two caches
+		 */
+		if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
+			kmalloc_caches[1] = create_kmalloc_cache(NULL, 96, flags);
 
-			if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i == 7)
-				kmalloc_caches[2] = create_kmalloc_cache(NULL, 192, flags);
-		}
+		if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i == 7)
+			kmalloc_caches[2] = create_kmalloc_cache(NULL, 192, flags);
 	}
 
 	/* Kmalloc array is now usable */

[GIT PULL] Btrfs

2013-05-09 Thread Chris Mason

ng start to defrag (+7/-4)
Btrfs: cleanup unused arguments of btrfs_csum_data (+15/-21)
Btrfs: return free space in cow error path (+9/-3)
Btrfs: improve the loop of scrub_stripe (+57/-26)
Btrfs: use helper to cleanup tree roots (+1/-14)
Btrfs: share stop worker code (+23/-32)
Btrfs: cleanup unused function (+0/-1)
Btrfs: pass NULL instead of 0 (+1/-1)

Jan Schmidt (7) commits (+682/-212):
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log (+63/-19)
Btrfs: fix accessing the root pointer in tree mod log functions (+20/-20)
Btrfs: split btrfs_qgroup_account_ref into four functions (+148/-105)
Btrfs: fix tree mod log regression on root split operations (+29/-26)
Btrfs: automatic rescan after "quota enable" command (+11/-0)
Btrfs: fix unlock after free on rewinded tree blocks (+11/-7)
Btrfs: rescan for qgroups (+400/-35)

Tsutomu Itoh (6) commits (+59/-79):
Btrfs: remove unused variable in __process_changed_new_xattr() (+0/-2)
Btrfs: cleanup of function where btrfs_extend_item() is called (+2/-3)
Btrfs: cleanup of function where fixup_low_keys() is called (+38/-51)
Btrfs: remove unused argument of btrfs_extend_item() (+9/-11)
Btrfs: remove unused argument of fixup_low_keys() (+8/-10)
Btrfs: fix error handling in btrfs_ioctl_send() (+2/-2)

Stefan Behrens (4) commits (+51/-23):
Btrfs: allow omitting stream header and end-cmd for btrfs send (+33/-11)
Btrfs: clear received_uuid field for new writable snapshots (+8/-4)
Btrfs: delete unused parameter to btrfs_read_root_item() (+6/-8)
Btrfs: set UUID in root_item for created trees (+4/-0)

Eric Sandeen (4) commits (+383/-458):
btrfs: document mount options in Documentation/fs/btrfs.txt (+173/-7)
btrfs: ignore device open failures in __btrfs_open_devices (+3/-3)
btrfs: make static code static & remove dead code (+135/-392)
btrfs: move leak debug code to functions (+72/-56)

Miao Xie (4) commits (+155/-53):
Btrfs: use a lock to protect incompat/compat flag of the super block (+26/-11)
Btrfs: allocate new chunks if the space is not enough for global rsv (+11/-8)
Btrfs: improve the performance of the csums lookup (+111/-31)
Btrfs: fix unblocked autodefraggers when remount (+7/-3)

Zhi Yong Wu (2) commits (+2/-7):
btrfs: Cleanup some redundant codes in btrfs_lookup_csums_range() (+2/-5)
btrfs: Cleanup some redundant codes in btrfs_log_inode() (+0/-2)

Zach Brown (1) commits (+2/-0):
btrfs: abort unlink trans in missed error case

Simon Kirby (1) commits (+133/-109):
Btrfs: Include the device in most error printk()s

Nathaniel Yazdani (1) commits (+1/-1):
btrfs: fix minor typo in comment

Chris Mason (1) commits (+5/-0):
Btrfs: allow superblock mismatch from older mkfs

Vincent (1) commits (+3/-2):
Btrfs: fix reada debug code compilation

Total: (101) commits

 Documentation/filesystems/btrfs.txt | 180 +++-
 fs/btrfs/Kconfig                    |  22 +-
 fs/btrfs/backref.c                  |  87 ++--
 fs/btrfs/backref.h                  |   3 -
 fs/btrfs/btrfs_inode.h              |   2 +-
 fs/btrfs/compression.c              |  14 +-
 fs/btrfs/compression.h              |   2 -
 fs/btrfs/ctree.c                    | 382 +++-
 fs/btrfs/ctree.h                    | 145 ---
 fs/btrfs/delayed-inode.c            |  66 +--
 fs/btrfs/delayed-ref.c              |  30 +-
 fs/btrfs/dir-item.c                 |  11 +-
 fs/btrfs/disk-io.c                  | 409 ++
 fs/btrfs/disk-io.h                  |   5 +-
 fs/btrfs/extent-tree.c              | 549 +++
 fs/btrfs/extent_io.c                | 310 ++---
 fs/btrfs/extent_io.h                |  44 +-
 fs/btrfs/extent_map.c               |  23 +-
 fs/btrfs/extent_map.h               |   3 +-
 fs/btrfs/file-item.c                | 102 ++---
 fs/btrfs/file.c                     |  37 +-
 fs/btrfs/free-space-cache.c         | 596 +++--
 fs/btrfs/free-space-cache.h         |   5 +
 fs/btrfs/inode-item.c               |  17 +-
 fs/btrfs/inode.c                    | 183 
 fs/btrfs/ioctl.c                    | 108 -
 fs/btrfs/locking.c                  |   4 +-
 fs/btrfs/ordered-data.c             |  28 +-
 fs/btrfs/ordered-data.h             |   3 +-
 fs/btrfs/print-tree.c               |   9 +-
 fs/btrfs/print-tree.h               |   2 +-
 fs/btrfs/qgroup.c                   | 840 
 fs/btrfs/raid56.c                   |  14 +-
 fs/btrfs/reada.c                    |   5 +-
 fs/btrfs/relocation.c               | 111 +++--
 fs/btrfs/root-tree.c                |   7 +-
 fs/btrfs/scrub.c                    | 130 +++---
 fs/btrfs/send.c                     |  32 +-
 fs/btrfs/send.h                     |   1 -
 fs/btrfs/super.c                    | 107 +++--
 fs/btrfs/transaction.c              |  95 ++--
 fs/btrfs

[GIT PULL] One more btrfs

2013-04-13 Thread Chris Mason

Hi Linus

My for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Has a recent fix from Josef for our tree log replay code.  It fixes
problems where the inode counter for the number of bytes in the file
wasn't getting updated properly during fsync replay.

The commit did get rebased this morning, but it was only to clean up the
subject line.  The code hasn't changed.

Josef Bacik (1) commits (+42/-6):
Btrfs: make sure nbytes are right after log replay

Total: (1) commits (+42/-6)

 fs/btrfs/tree-log.c | 48 ++--
 1 file changed, 42 insertions(+), 6 deletions(-)

[GIT PULL] Btrfs updates

2013-05-18 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Miao Xie has been very busy, fixing races and enospc problems and many
other small but important pieces.

Alexandre Oliva discovered some problems with how our error handling was
interacting with the block layer and for now has disabled our partial
handling of sub-page writes.  The real sub-page work is in a series of
patches from IBM that we still need to integrate and test.  The code
Alexandre has turned off was really incomplete.

Josef has more error handling fixes and an important fix for the new
skinny extent format.

This also has my fix for the tracepoint crash from late in 3.9.  It's
the first stage in a larger clean up to get rid of btrfs_bio and make
a proper bioset for all the items we need to tack into the bio.  For now
the bioset only holds our mirror_num and stripe_index, but for the next
merge window I'll shuffle more in.

Miao Xie (10) commits (+87/-69):
Btrfs: don't steal the reserved space from the global reserve if their space type is different (+4/-2)
Btrfs: don't abort the current transaction if there is no enough space for inode cache (+2/-1)
Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context (+6/-0)
Btrfs: don't use global block reservation for inode cache truncation (+34/-22)
Btrfs: fix unprotected root node of the subvolume's inode rb-tree (+3/-4)
Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix() (+0/-1)
Btrfs: pause the space balance when remounting to R/O (+1/-0)
Btrfs: optimize the error handle of use_block_rsv() (+28/-37)
Btrfs: update the global reserve if it is empty (+8/-1)
Btrfs: fix accessing a freed tree root (+1/-1)

Josef Bacik (4) commits (+35/-32):
Btrfs: make sure roots are assigned before freeing their nodes (+21/-18)
Btrfs: handle running extent ops with skinny metadata (+12/-10)
Btrfs: remove warn on in free space cache writeout (+1/-3)
Btrfs: don't null pointer deref on abort (+1/-1)

Stefan Behrens (3) commits (+8/-1):
Btrfs: explicitly use global_block_rsv for quota_tree (+2/-0)
Btrfs: fix possible memory leak in replace_path() (+1/-1)
Btrfs: don't allow device replace on RAID5/RAID6 (+5/-0)

Liu Bo (2) commits (+8/-4):
Btrfs: return errno if possible when we fail to allocate memory (+6/-2)
Btrfs: fix off-by-one in fiemap (+2/-2)

Gabriel de Perthuis (1) commits (+5/-5):
btrfs: don't stop searching after encountering the wrong item

Alexandre Oliva (1) commits (+30/-55):
btrfs: do away with non-whole_page extent I/O

Chris Mason (1) commits (+120/-72):
Btrfs: use a btrfs bioset instead of abusing bio internals

David Sterba (1) commits (+4/-4):
btrfs: annotate quota tree for lockdep

Wang Shilong (1) commits (+2/-1):
Btrfs: fix possible memory leak in the find_parent_nodes()

Andreas Philipp (1) commits (+6/-7):
Correct allowed raid levels on balance.

Total: (25) commits (+305/-250)

 fs/btrfs/backref.c          |   3 +-
 fs/btrfs/check-integrity.c  |   2 +-
 fs/btrfs/ctree.c            |   4 +-
 fs/btrfs/ctree.h            |   8 +--
 fs/btrfs/delayed-ref.h      |   1 +
 fs/btrfs/dev-replace.c      |   5 ++
 fs/btrfs/disk-io.c          |  52 ++---
 fs/btrfs/extent-tree.c      |  94 --
 fs/btrfs/extent_io.c        | 138 +++-
 fs/btrfs/extent_io.h        |   2 +
 fs/btrfs/free-space-cache.c |  43 +++---
 fs/btrfs/free-space-cache.h |   2 +
 fs/btrfs/inode-map.c        |   8 ++-
 fs/btrfs/inode.c            |  81 +-
 fs/btrfs/ioctl.c            |  10 ++--
 fs/btrfs/raid56.c           |   2 +-
 fs/btrfs/relocation.c       |   7 ++-
 fs/btrfs/scrub.c            |  10 ++--
 fs/btrfs/super.c            |   1 +
 fs/btrfs/volumes.c          |  54 -
 fs/btrfs/volumes.h          |  20 +++
 21 files changed, 301 insertions(+), 246 deletions(-)

Re: linux-next: manual merge of the akpm tree with Linus' tree

2013-05-20 Thread Chris Mason

Quoting Stephen Rothwell (2013-05-20 00:04:49)
> Hi Andrew,
> 
> Today's linux-next merge of the akpm tree got conflicts in
> fs/btrfs/inode.c and fs/btrfs/volumes.c between commit 9be3395bcd4a
> ("Btrfs: use a btrfs bioset instead of abusing bio internals") from
> Linus' tree and commit "block: prep work for batch completion" from the
> akpm tree.
> 
> I fixed it up (I think - see below) and can carry the fix as necessary
> (no action is required).
> 
> I also noticed that a single conversion of bio_endio to bio_endio_batch
> is done in the akpm patch but bio_endio_batch is not introduced until a
> later patch ... :-(

Thanks, this looks right and I've run linux-next through an aio/dio test on
btrfs.

Kent, reviewing the merge I see a missing bio_endio_batch conversion.  I
think this was missing from the original:

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index faf20f5..a47bc10 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7084,7 +7084,7 @@ static void btrfs_end_dio_bio(struct bio *bio, int err,
 		bio_io_error(dip->orig_bio);
 	} else {
 		set_bit(BIO_UPTODATE, &dip->dio_bio->bi_flags);
-		bio_endio(dip->orig_bio, 0);
+		bio_endio_batch(dip->orig_bio, 0, batch);
 	}
 out:
 	bio_put(bio);

[GIT PULL] Btrfs updates

2013-03-29 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We've had a busy two weeks of bug fixing.  The biggest patches in here
are some long standing early-enospc problems (Josef) and a very old race
where compression and mmap combine forces to lose writes (me).  I'm
fairly sure the mmap bug goes all the way back to the introduction of
the compression code, which is proof that fsx doesn't trigger every
possible mmap corner after all.

I'm sure you'll notice one of these is from this morning.  It's a small
and isolated use-after-free fix in our scrub error reporting.  I double
checked it here.

Josef Bacik (6) commits (+90/-18):
Btrfs: hold the ordered operations mutex when waiting on ordered extents (+2/-0)
Btrfs: don't drop path when printing out tree errors in scrub (+2/-1)
Btrfs: fix space leak when we fail to reserve metadata space (+41/-6)
Btrfs: fix space accounting for unlink and rename (+2/-4)
Btrfs: limit the global reserve to 512mb (+1/-1)
Btrfs: handle a bogus chunk tree nicely (+42/-6)

Jan Schmidt (2) commits (+24/-16):
Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes (+4/-6)
Btrfs: fix locking on ROOT_REPLACE operations in tree mod log (+20/-10)

Wang Shilong (2) commits (+10/-2):
Btrfs: fix double free in the btrfs_qgroup_account_ref() (+1/-2)
Btrfs: fix missing qgroup reservation before fallocating (+9/-0)

Miao Xie (2) commits (+5/-3):
Btrfs: fix wrong return value of btrfs_lookup_csum() (+3/-1)
Btrfs: fix wrong reservation of csums (+2/-2)

Chris Mason (1) commits (+49/-0):
Btrfs: fix race between mmap writes and compression

Liu Bo (1) commits (+1/-1):
Btrfs: update to use fs_state bit

Tsutomu Itoh (1) commits (+9/-3):
Btrfs: fix memory leak in btrfs_create_tree()

Total: (15) commits

 fs/btrfs/ctree.c        | 30 --
 fs/btrfs/disk-io.c      | 14 ++---
 fs/btrfs/extent-tree.c  | 84 ++---
 fs/btrfs/extent_io.c    | 33 +++
 fs/btrfs/extent_io.h    |  2 ++
 fs/btrfs/file-item.c    |  6 ++--
 fs/btrfs/file.c         |  9 ++
 fs/btrfs/inode.c        | 22 ++---
 fs/btrfs/ordered-data.c |  2 ++
 fs/btrfs/qgroup.c       |  3 +-
 fs/btrfs/scrub.c        |  3 +-
 fs/btrfs/send.c         | 10 +++---
 fs/btrfs/volumes.c      | 13 +++-
 13 files changed, 188 insertions(+), 43 deletions(-)

Re: [PATCH v2 0/4] ipc: reduce ipc lock contention

2013-03-07 Thread Chris Mason

On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
> On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
> 
> > Indeed.  Though how well my patches will work with Oracle will
> > depend a lot on what kind of semctl syscalls they are doing.
> > 
> > Does Oracle typically do one semop per semctl syscall, or does
> > it pass in a whole bunch at once?
> 
> https://oss.oracle.com/~mason/sembench.c
> 
> I think Chris wrote that to match a particular pattern of semaphore
> operations the database engine in question does. I haven't checked to
> see if it triggers the case in point though.
> 
> Also, Chris since left Oracle but maybe he knows who to poke.
> 

Dave Kleikamp (cc'd) took over my patches and did the most recent
benchmarking.  Ported against 3.0:

https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c

The current versions are still in the 2.6.32 oracle kernel, but it looks
like they reverted this 3.0 commit.  I think with Manfred's upstream
work my more complex approach wasn't required anymore, but hopefully
Dave can fill in details.

Here is some of the original discussion around the patch:

https://lkml.org/lkml/2010/4/12/257

In terms of how oracle uses IPC, the part that shows up in profiles is
using semtimedop for bulk wakeups.  They can configure things to use
either a bunch of small arrays or a huge single array (and anything in
between). 

There is one IPC semaphore per process and they use this to wait for
some event (like a log commit).  When the event comes in, everyone
waiting is woken in bulk via a semtimedop call.

So, single proc waking many waiters at once.
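
The pattern above can be sketched with System V semaphores as follows.
This is a minimal illustration, not code from the database or from
sembench.c: the function names are hypothetical, and plain semop() is used
in place of semtimedop() (it behaves like semtimedop() with no timeout) to
keep the example portable.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/*
 * Fill ops[0..nwaiters-1] so that one semop() call posts (+1) each
 * listed semaphore, releasing the process blocked on it.
 * waiter_sems[] holds the semaphore index owned by each waiter.
 */
void build_bulk_wakeup(struct sembuf *ops, const unsigned short *waiter_sems,
		       size_t nwaiters)
{
	assert(ops != NULL && waiter_sems != NULL);
	for (size_t k = 0; k < nwaiters; k++) {
		ops[k].sem_num = waiter_sems[k];
		ops[k].sem_op = 1;
		ops[k].sem_flg = 0;
	}
}

/* Each waiter blocks on its own semaphore with a single -1 operation. */
int wait_for_event(int semid, unsigned short my_sem)
{
	struct sembuf op = { .sem_num = my_sem, .sem_op = -1, .sem_flg = 0 };

	return semop(semid, &op, 1);
}

/* The waker releases every listed waiter with one system call. */
int wake_all(int semid, const unsigned short *waiter_sems, size_t nwaiters)
{
	struct sembuf *ops = calloc(nwaiters, sizeof(*ops));
	int ret;

	if (ops == NULL)
		return -1;
	build_bulk_wakeup(ops, waiter_sems, nwaiters);
	ret = semop(semid, ops, nwaiters);
	free(ops);
	return ret;
}
```

The number of operations accepted by one call is bounded by a system
limit (SEMOPM on Linux), so a very large wakeup would have to be split
into several calls.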

-chris


Re: [PATCH v2 0/4] ipc: reduce ipc lock contention

2013-03-07 Thread Chris Mason

On Thu, Mar 07, 2013 at 08:54:55AM -0700, Dave Kleikamp wrote:
> On 03/07/2013 06:55 AM, Chris Mason wrote:
> > On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
> >> On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
> >>
> >>> Indeed.  Though how well my patches will work with Oracle will
> >>> depend a lot on what kind of semctl syscalls they are doing.
> >>>
> >>> Does Oracle typically do one semop per semctl syscall, or does
> >>> it pass in a whole bunch at once?
> >>
> >> https://oss.oracle.com/~mason/sembench.c
> >>
> >> I think Chris wrote that to match a particular pattern of semaphore
> >> operations the database engine in question does. I haven't checked to
> >> see if it triggers the case in point though.
> >>
> >> Also, Chris since left Oracle but maybe he knows who to poke.
> >>
> > 
> > Dave Kleikamp (cc'd) took over my patches and did the most recent
> > benchmarking.  Ported against 3.0:
> > 
> > https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c
> > 
> > The current versions are still in the 2.6.32 oracle kernel, but it looks
> > like they reverted this 3.0 commit.  I think with Manfred's upstream
> > work my more complex approach wasn't required anymore, but hopefully
> > Dave can fill in details.
> 
> From what I recall, I could never get better performance from your
> patches that we saw with Manfred's work alone. I can't remember the
> reasons for including and then reverting the patches from the 3.0
> (2.6.39) Oracle kernel, but in the end we weren't able to justify their
> inclusion.

Ok, so after this commit, oracle was happy:

commit fd5db42254518fbf241dc454e918598fbe494fa2
Author: Manfred Spraul 
Date:   Wed May 26 14:43:40 2010 -0700

ipc/sem.c: optimize update_queue() for bulk wakeup calls

But that doesn't explain why Davidlohr saw semtimedop at the top of the
oracle profiles in his runs.

Looking through the patches in this thread, I don't see anything that
I'd expect to slow down oracle TPC numbers.

I dealt with the ipc_perm lock a little differently:

https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commitdiff;h=78fe45325c8e2e3f4b6ebb1ee15b6c2e8af5ddb1;hp=8102e1ff9d667661b581209323faaf7a84f0f528

My code switched the ipc_rcu_hdr refcount to an atomic, which changed
where I needed the spinlock.  It may make things easier in patches 3/4
and 4/4.
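
For readers who have not seen the idiom, the general shape of an atomic
refcount looks roughly like this.  It is a user-space sketch with C11
atomics, not the kernel's atomic_t API and not the code in the commit
linked above; struct obj_hdr and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object header, only loosely modeled on ipc_rcu_hdr. */
struct obj_hdr {
	atomic_int refcount;
};

/* A new object starts with one reference held by its creator. */
void obj_init(struct obj_hdr *h)
{
	atomic_init(&h->refcount, 1);
}

/*
 * Take a reference only if the object is still live (count > 0).
 * No lock is needed: the compare-and-swap reloads the count and
 * retries if another thread changed it in between.
 */
bool obj_get_unless_zero(struct obj_hdr *h)
{
	int old = atomic_load(&h->refcount);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&h->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool obj_put(struct obj_hdr *h)
{
	int prev = atomic_fetch_sub(&h->refcount, 1);

	assert(prev > 0);
	return prev == 1;
}
```

With the count handled this way, a lock is only needed around whatever
else the get and put paths touch, which is why the locking can move.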

(some of this code was Jens, but at the time he made me promise to
pretend he never touched it)

-chris


[GIT PULL] Btrfs updates

2013-03-08 Thread Chris Mason

Hi Linus,

Please grab my for-linus:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

These are scattered fixes and one performance improvement.  The biggest
functional change is in how we throttle metadata changes.  The new code
bumps our average file creation rate up by ~13% in fs_mark, and lowers
CPU usage.

Stefan bisected out a regression in our allocation code that made
balance loop on extents larger than 256MB.

Liu Bo (6) commits (+71/-19):
Btrfs: build up error handling for merge_reloc_roots (+35/-12)
Btrfs: check for NULL pointer in updating reloc roots (+2/-0)
Btrfs: avoid deadlock on transaction waiting list (+7/-0)
Btrfs: free all recorded tree blocks on error (+6/-3)
Btrfs: do not BUG_ON on aborted situation (+12/-3)
Btrfs: do not BUG_ON in prepare_to_reloc (+9/-1)

Chris Mason (2) commits (+96/-63):
Btrfs: enforce min_bytes parameter during extent allocation (+4/-2)
Btrfs: improve the delayed inode throttling (+92/-61)

Miao Xie (2) commits (+45/-39):
Btrfs: fix unclosed transaction handler when the async transaction commitment fails (+4/-0)
Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails (+41/-39)

Stefan Behrens (1) commits (+0/-8):
Btrfs: allow running defrag in parallel to administrative tasks

Ilya Dryomov (1) commits (+5/-0):
Btrfs: fix a mismerge in btrfs_balance()

Josef Bacik (1) commits (+4/-1):
Btrfs: use set_nlink if our i_nlink is 0

Total: (13) commits (+221/-130)

 fs/btrfs/delayed-inode.c | 151 ---
 fs/btrfs/delayed-inode.h |   2 +
 fs/btrfs/disk-io.c       |  16 +++--
 fs/btrfs/inode.c         |   6 +-
 fs/btrfs/ioctl.c         |  18 ++
 fs/btrfs/relocation.c    |  74 +--
 fs/btrfs/transaction.c   |  65 
 fs/btrfs/tree-log.c      |   5 +-
 fs/btrfs/volumes.c       |  14 -
 9 files changed, 221 insertions(+), 130 deletions(-)

[GIT PULL] a btrfs fix

2012-09-15 Thread Chris Mason

Hi Linus,

My for-linus branch has one revert in the new quota code:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We're building up more fixes for the next merge window, but I'm
keeping them out unless they fix bigger regressions or have a huge
impact.

Chris Mason (1):
  Revert "Btrfs: fix some error codes in btrfs_qgroup_inherit()"

 fs/btrfs/qgroup.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

[GIT PULL 1/2] Btrfs fixes

2012-08-09 Thread Chris Mason

Hi everyone,

This first pull is the bulk of our changes for the next rc.  It is
against the 3.5 kernel so people testing the new features have a stable
point to work against.  This was tested against Linus' current tree as
well.

The second pull is just one fix against 3.6-rc1 (in another email).

Linus, please grab my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Most of these fixes are against the new send/receive code.  Alexander
fixed a number of bugs in there and I found a few more while backing up my
laptop.  It does nightly incremental runs now about 3x faster than
rsync, so things are looking pretty good.

On top of that we have fixes for some long standing bugs in the delayed
reference code (a few more of these are still being worked on),
deadlocks and other small fixes.

Alexander Block (23) commits (+482/-419):
Btrfs: don't treat top/root directory inode as deleted/reused (+20/-1)
Btrfs: fix use of radix_tree for name_cache in send/receive (+37/-39)
Btrfs: rename backref_ctx::found_in_send_root to found_itself (+4/-4)
Btrfs: pass root instead of parent_root to iterate_inode_ref (+2/-2)
Btrfs: add correct parent to check_dirs when dir got moved (+11/-0)
Btrfs: add missing check for dir != tmp_dir to is_first_ref (+1/-1)
Btrfs: fix check for changed extent in is_extent_unchanged (+2/-2)
Btrfs: free nce and nce_head on error in name_cache_insert (+5/-1)
Btrfs: don't break in the final loop of find_extent_clone (+0/-1)
Btrfs: fix cur_ino < parent_ino case for send/receive (+146/-244)
Btrfs: add/fix comments/documentation for send/receive (+134/-6)
Btrfs: use normal return path for root == send_root case (+0/-6)
Btrfs: fix memory leak for name_cache in send/receive (+1/-0)
Btrfs: use kmalloc instead of stack for backref_ctx (+18/-11)
Btrfs: remove unused use_list from send/receive code (+0/-2)
Btrfs: remove unused tmp_path from iterate_dir_item (+0/-8)
Btrfs: add rdev to get_inode_info in send/receive (+17/-13)
Btrfs: use <= instead of < in is_extent_unchanged (+1/-1)
Btrfs: update send_progress at correct places (+20/-6)
Btrfs: ignore non-FS inodes for send/receive (+5/-0)
Btrfs: code cleanups for send/receive (+35/-48)
Btrfs: make aux field of ulist 64 bit (+21/-23)
Btrfs: remove unused code with #if 0 (+2/-0)

Josef Bacik (9) commits (+325/-215):
Btrfs: don't allocate a seperate csums array for direct reads (+19/-32)
Btrfs: do not use missing devices when showing devname (+2/-0)
Btrfs: fix enospc problems when deleting a subvol (+1/-1)
Btrfs: increase the size of the free space cache (+7/-8)
Btrfs: lock extents as we map them in DIO (+127/-129)
Btrfs: allow delayed refs to be merged (+142/-27)
Btrfs: do not strdup non existent strings (+5/-3)
Btrfs: barrier before waitqueue_active (+10/-12)
Btrfs: use a slab for btrfs_dio_private (+12/-3)

Dan Carpenter (4) commits (+16/-8):
Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() (+3/-1)
Btrfs: fix some error codes in btrfs_qgroup_inherit() (+6/-2)
Btrfs: fix some endian bugs handling the root times (+4/-4)
Btrfs: checking for NULL instead of IS_ERR (+3/-1)

Stefan Behrens (3) commits (+8/-36):
Btrfs: fix a misplaced address operator in a condition (+1/-1)
Btrfs: remove superblock writing after fatal error (+5/-33)
Btrfs: fix that error value is changed by mistake (+2/-2)

Chris Mason (2) commits (+40/-15):
Btrfs: fix btrfs send for inline items and compression (+37/-15)
Btrfs: don't run __tree_mod_log_free_eb on leaves (+3/-0)

Fengguang Wu (2) commits (+4/-6):
btrfs: fix second lock in btrfs_delete_delayed_items() (+3/-2)
btrfs: Use PTR_RET in btrfs_resume_balance_async() (+1/-4)

Arne Jansen (2) commits (+38/-73):
Btrfs: fix deadlock in wait_for_more_refs (+21/-73)
Btrfs: fix race in run_clustered_refs (+17/-0)

Miao Xie (1) commits (+1/-0):
Btrfs: fix wrong mtime and ctime when creating snapshots

Total: (46) commits

 fs/btrfs/backref.c       |  12 +-
 fs/btrfs/compression.c   |   1 +
 fs/btrfs/ctree.c         |  14 +-
 fs/btrfs/ctree.h         |   3 +-
 fs/btrfs/delayed-inode.c |  12 +-
 fs/btrfs/delayed-ref.c   | 163 +++--
 fs/btrfs/delayed-ref.h   |   4 +
 fs/btrfs/disk-io.c       |  45 +--
 fs/btrfs/disk-io.h       |   2 +-
 fs/btrfs/extent-tree.c   | 123 +++
 fs/btrfs/extent_io.c     |   1 -
 fs/btrfs/file-item.c     |   4 +-
 fs/btrfs/inode.c         | 318 -
 fs/btrfs/ioctl.c         |   2 +-
 fs/btrfs/locking.c       |   2 +-
 fs/btrfs/qgroup.c        |  32 +-
 fs/btrfs/root-tree.c     |   4 +-
 fs/btrfs/send.c          | 895 ++-
 fs/btrfs/super.c         |   2 +
 fs/btrfs/transaction.c   |   3 +-
 fs/btrfs/ulist.c         |   7 +-
 fs/btrfs/ulist.h         |   9 +-
 fs/btrfs/volumes.c       |  16 +-

[GIT PULL 2/2] Btrfs merge fix

2012-08-09 Thread Chris Mason

Hi Linus,

Please pull my for-linus-3.6 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-3.6

It fixes a merging error in rc1.  The calls to mnt_want_write should
have been removed.

Alexander Block (1):
  Btrfs: remove mnt_want_write call in btrfs_mksubvol

 fs/btrfs/ioctl.c | 5 -
 1 file changed, 5 deletions(-)

Re: very poor ext3 write performance on big filesystems?

2008-02-19 Thread Chris Mason

On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> Theodore Tso schrieb:
>
> (...)
>
> > The following ld_preload can help in some cases.  Mutt has this hack
> > encoded in for maildir directories, which helps.
>
> It doesn't work very reliable for me.
>
> For some reason, it hangs for me sometimes (doesn't remove any files, rm
> -rf just stalls), or segfaults.

You can go the low-tech route (assuming your file names don't have spaces
in them):

find . -printf "%i %p\n" | sort -n | awk '{print $2}' | xargs rm

>
>
> As most of the ideas here in this thread assume (re)creating a new
> filesystem from scratch - would perhaps playing with
> /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio help a
> bit?

Probably not.  You're seeking between all the inodes on the box, and probably 
not bound by the memory used.

-chris

Re: very poor ext3 write performance on big filesystems?

2008-02-19 Thread Chris Mason

On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> Chris Mason schrieb:
> > On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> >> Theodore Tso schrieb:
> >>
> >> (...)
> >>
> >>> The following ld_preload can help in some cases.  Mutt has this hack
> >>> encoded in for maildir directories, which helps.
> >>
> >> It doesn't work very reliable for me.
> >>
> >> For some reason, it hangs for me sometimes (doesn't remove any files, rm
> >> -rf just stalls), or segfaults.
> >
> > You can go the low-tech route (assuming your file names don't have spaces
> > in them)
> >
> > find . -printf "%i %p\n" | sort -n | awk '{print $2}' | xargs rm
>
> Why should it make a difference?

It does something similar to Ted's ld preload, sorting the results from 
readdir by inode number before using them.  You will still seek quite a lot 
between the directory entries, but operations on the files themselves will go 
in a much more optimal order.  It might help.

>
> Does "find" find filenames/paths faster than "rm -r"?
>
> Or is "find once/remove once" faster than "find files/rm files/find
> files/rm files/...", which I suppose "rm -r" does?

rm -r removes things in the order that readdir returns them.  In your hard 
linked tree (on almost any FS), this will be very random.  The sorting is 
probably the best you can do from userland to optimize the ordering.
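
A minimal Python sketch of the same idea that also copes with spaces in file
names.  The function name is made up for illustration; this is not Ted's
preload and not what rm does internally.

```python
import os

def remove_in_inode_order(top):
    """Unlink everything os.walk reports as a file under top, lowest inode first."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat() so a symlink is sorted by its own inode, not its target's
            entries.append((os.lstat(path).st_ino, path))
    entries.sort()                      # same job as "sort -n" on the inode column
    for _ino, path in entries:
        os.unlink(path)
    return len(entries)
```

Directories are left in place, so a follow-up rm -r over the emptied tree only
has directories left to remove.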

-chris


[ANNOUNCE] Btrfs v0.13

2008-02-21 Thread Chris Mason

Hello everyone,

Btrfs v0.13 is now available for download from:

http://oss.oracle.com/projects/btrfs/

We took another short break from the multi-device code to make the minor mods 
required to compile on 2.6.25, fix some problematic bugs and do more tuning.

The most important fix is for file data checksumming errors.  These might show 
up on .o files from compiles, or on other files that were filled in with seeky 
writes.  The end result was a bunch of zeros in the file where people expected 
their data to be.  Thanks to Yan Zheng for tracking it down.

GregKH provided most of the 2.6.25 port with some sysfs updates.  Since the 
sysfs files are not used much and Greg has offered additional cleanups, I've 
disabled the btrfs sysfs interface on kernels older than 2.6.25.  This way he 
won't have to back port any of his changes.

Optimizations and other fixes:

* File data checksumming done in larger chunks, resulting in fewer btree 
searches and fewer kmap calls.

* CPU Optimizations for back reference removal

* CPU Optimizations for block allocation, and much more efficient searching 
through the free space cache.

* Allocation optimizations: the free space clustering code was not properly 
allocating from a cluster once it found one.  For normal mounts the fix 
improves metadata writeback; for mount -o ssd it improves everything.

* Unaligned access fixes from Dave Miller

* Btree reads are done in larger bios when possible

* i_block accounting is fixed

-chris

Re: [ANNOUNCE] Btrfs v0.13

2008-02-21 Thread Chris Mason

On Thursday 21 February 2008, Chris Mason wrote:
> Hello everyone,
>
> Btrfs v0.13 is now available for download from:
>
> http://oss.oracle.com/projects/btrfs/
>
> We took another short break from the multi-device code to make the minor
> mods required to compile on 2.6.25, fix some problematic bugs and do more
> tuning.

Sorry, I should have added:

v0.13 has no disk format changes since v0.12.

-chris

[ANNOUNCE] Btrfs v0.12 released

2008-02-06 Thread Chris Mason

Hello everyone,

I wasn't planning on releasing v0.12 yet, and it was supposed to have some 
initial support for multiple devices.  But, I have made a number of 
performance fixes and small bug fixes, and I wanted to get them out there 
before the (destabilizing) work on multiple-devices took over.

So, here's v0.12.  It comes with a shiny new disk format (sorry), but the gain 
is dramatically better random writes to existing files.  In testing here, the 
random write phase of tiobench went from 1MB/s to 30MB/s.  The fix was to 
change the way back references for file extents were hashed.

Other changes:

Insert and delete multiple items at once in the btree where possible.  Back 
references added more tree balances, and it showed up in a few benchmarks.  
With v0.12, backrefs have no real impact on performance.

Optimize bio end_io routines.  Btrfs was spending way too much CPU time in the 
bio end_io routines, leading to lock contention and other problems.

Optimize read ahead during transaction commit.  The old code was trying to 
read far too much at once, which made the end_io problems really stand out.

mount -o ssd option, which clusters file data writes together regardless of 
the directory the files belong to.  There are a number of other performance 
tweaks for SSD, aimed at clustering metadata and data writes to better take 
advantage of the hardware.

mount -o max_inline=size option, to override the default max inline file data 
size (default is 8k).  Any value up to the leaf size is allowed (default 
16k).

Simple -ENOSPC handling.  Emphasis on simple, but it prevents accidentally 
filling the disk most of the time.  With enough threads/procs banging on 
things, you can still easily crash the box.

-chris

Re: [GIT PULL 1/2] Btrfs fixes

2012-08-21 Thread Chris Mason

On Mon, Aug 20, 2012 at 07:55:59PM -0600, Linus Torvalds wrote:
> On Mon, Aug 20, 2012 at 6:53 PM, Chris Samuel  wrote:
> >
> > This pull request with a whole heap of btrfs fixes (46 commits) appears
> > not to have been merged yet, does anyone know if it was rejected or just
> > missed ?
> 
> Read my -rc2 release notes.
> 
> TL;DR: I rejected big pull requests that didn't convince me. Make a
> damn good case for it, or send minimal fixes instead.
> 
> I'm tired of these "oops, what we sent you for -rc1 wasn't ready, so
> here's a thousand lines of changes" crap.

When just the second pull went in, I wasn't sure if it was waiting for
vacation or you felt it was too big, but when I saw rc2 it was pretty
clear.

So I'm working up an rc3 pull with longer explanations.  The bulk of my
last pull was send/receive fixes.  The rc1 send/recv worked fine for me
on my test box, but larger scale use on well aged filesystems showed
some problems.

It's fair to say send/receive wasn't ready.  I did expect some fixes for
rc2 but not that many.  More details will be in my pull this afternoon,
but with our current code it is working very well for me.

-chris


Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!

2012-07-12 Thread Chris Mason

On Thu, Jul 12, 2012 at 05:07:58AM -0600, Thomas Gleixner wrote:
> On Thu, 12 Jul 2012, Mike Galbraith wrote:
> > crash> struct rt_mutex 0x8801770601c8
> > struct rt_mutex {
> >   wait_lock = {
> > raw_lock = {
> >   slock = 7966
> > }
> >   }, 
> >   wait_list = {
> > node_list = {
> >   next = 0x880175eedbe0, 
> >   prev = 0x880175eedbe0
> > }, 
> > rawlock = 0x880175eedbd8, 
> 
> Urgh. Here is something completely wrong. That should point to
> wait_lock, i.e. the rt_mutex itself, but that points into lala land.

This is probably the memcpy you found later this morning, right?
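
A toy model of the failure under discussion, assuming the wait list stores a
pointer back to the lock embedded in the same structure (the rawlock field in
the dump above).  The class and function names are hypothetical, and
memcpy_like() only imitates what a byte copy does to embedded fields versus
pointers.

```python
class RawLock:
    """Stands in for the spinlock embedded in the mutex."""

class ToyMutex:
    def __init__(self):
        self.wait_lock = RawLock()      # embedded in the struct
        self.rawlock = self.wait_lock   # pointer kept by the wait list

def memcpy_like(src):
    """Copy a ToyMutex the way a byte copy would, without re-initializing it."""
    dst = ToyMutex.__new__(ToyMutex)
    dst.wait_lock = RawLock()           # embedded bytes land at a new address
    dst.rawlock = src.rawlock           # pointer value is copied verbatim
    return dst

original = ToyMutex()
clone = memcpy_like(original)
stale = clone.rawlock is not clone.wait_lock   # True: points into the original
clone.__init__()                               # what a proper init call would do
fixed = clone.rawlock is clone.wait_lock       # True: points at its own lock
```

Once the original structure is freed, the stale pointer in the copy refers to
memory that no longer holds a lock at all.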

>  
> > Reproducer2: dbench -t 30 8
> > 
> > [  692.857164] 
> > [  692.857165] 
> > [  692.863963] [ BUG: circular locking deadlock detected! ]
> > [  692.869264] Not tainted
> > [  692.871708] 
> > [  692.877008] btrfs-delayed-m/1404 is deadlocking current task dbench/7937
> > [  692.877009] 
> > [  692.885183] 
> > [  692.885184] 1) dbench/7937 is trying to acquire this lock:
> > [  692.892149]  [88014d6aea80] {&(&eb->lock)->lock}
> > [  692.897102] .. ->owner: 880175808501
> > [  692.901018] .. held by:   btrfs-delayed-m: 1404 [880175808500, 120]
> > [  692.907657] 
> > [  692.907657] 2) btrfs-delayed-m/1404 is blocked on this lock:
> > [  692.914797]  [88014bf58d60] {&(&eb->lock)->lock}
> > [  692.919751] .. ->owner: 880175186101
> > [  692.923672] .. held by:dbench: 7937 [880175186100, 120]
> > [  692.930309] 
> > [  692.930309] btrfs-delayed-m/1404's [blocked] stackdump:
> 
> Hrmm. Both locks are rw_locks and we prevent multiple readers for the
> known reasons in RT. No idea how to deal with that one :(

The reader/writer part in btrfs is just an optimization.  If we need
them to be all writer locks for RT purposes, that's not a problem.

But, before we go down that road, we do have annotations that try
to make sure lockdep doesn't get confused about lock classes.  Basically
the tree is locked level by level.  So it's safe to take eb->lock while
holding eb->lock as long as you follow the rules.
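
To illustrate the rule, here is a toy checker loosely modelled on the way
lockdep flags taking a lock of a class that is already held.  It assumes one
lock class per tree level; the class names and the checker itself are
hypothetical and far simpler than lockdep.

```python
class ToyLockdep:
    def __init__(self):
        self.held = []                  # lock classes currently held, in order

    def acquire(self, lock_class):
        """Record the lock; return False if that class is already held."""
        ok = lock_class not in self.held
        self.held.append(lock_class)
        return ok

def walk_down(checker, levels, class_for_level):
    """Lock a root-to-leaf path, holding each parent while taking the child."""
    results = [checker.acquire(class_for_level(level)) for level in levels]
    return all(results)

# One class for every extent buffer: nesting looks like recursive locking.
single = walk_down(ToyLockdep(), [2, 1, 0], lambda level: "eb->lock")
# One class per level (hypothetical names): top-down nesting is accepted.
per_level = walk_down(ToyLockdep(), [2, 1, 0], lambda level: f"eb-lock-L{level}")
```

With a single class the walk is reported as suspicious even though it follows
the rules; with a class per level it passes.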

Are additional annotations required for RT?

-chris


Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!

2012-07-13 Thread Chris Mason

On Fri, Jul 13, 2012 at 04:26:26AM -0600, Thomas Gleixner wrote:
> On Fri, 13 Jul 2012, Mike Galbraith wrote:
> > On Fri, 2012-07-13 at 11:52 +0200, Thomas Gleixner wrote: 
> > > On Fri, 13 Jul 2012, Mike Galbraith wrote:
> > > > On Thu, 2012-07-12 at 15:31 +0200, Thomas Gleixner wrote: 
> > > > > Bingo, that makes it more likely that this is caused by copying w/o
> > > > > initializing the lock and then freeing the original structure.
> > > > > 
> > > > > A quick check for memcpy finds that __btrfs_close_devices() does a
> > > > > memcpy of btrfs_device structs w/o initializing the lock in the new
> > > > > copy, but I have no idea whether that's the place we are looking for.
> > > > 
> > > > Thanks a bunch Thomas.  I doubt I would have ever figured out that lala
> > > > land resulted from _copying_ a lock.  That's one I won't be forgetting
> > > > any time soon.  Box not only survived a few thousand xfstests 006 runs,
> > > > dbench seemed disinterested in deadlocking virgin 3.0-rt.
> > > 
> > > Cute. I think that the lock copying caused the deadlock problem as
> > > the list pointed to the wrong place, so we might have ended up with
> > > following down the wrong chain when walking the list as long as the
> > > original struct was not freed. That beast is freed under RCU so there
> > > could be a rcu read side critical section fiddling with the old lock
> > > and cause utter confusion.
> > 
> > Virgin 3.0-rt appears to really be solid.  But then it doesn't have
> > pesky rwlocks.
> 
> Ah. So 3.0 is not having those rwlock thingies. Bummer.
>  
> > > /me goes and writes a nastigram^W proper changelog
> > > 
> > > > btrfs still locks up in my enterprise kernel, so I suppose I had better
> > > > plug your fix into 3.4-rt and see what happens, and go beat hell out of
> > > > virgin 3.0-rt again to be sure box really really survives dbench.
> > > 
> > > A test against 3.4-rt sans enterprise mess might be nice as well.
> > 
> > Enterprise is 3.0-stable with um 555 btrfs patches (oh dear).
> > 
> > Virgin 3.4-rt and 3.2-rt deadlock gripe.  Enterprise doesn't gripe, but
> > deadlocks, so I have another adventure in my future even if I figure out
> > wth to do about rwlocks.
> 
> Hrmpf. /me goes to stare into fs/btrfs/ some more.

Please post the deadlocks here, I'll help ;)

-chris


[GIT PULL] Btrfs fixes

2013-01-22 Thread Chris Mason

Hi Linus,

My for-linus branch has our batch of btrfs fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We've been hammering away at a crc corruption as well, which I was
really hoping to get into this pull.  It isn't nailed down yet, but we
were finally able to get a solid way to reproduce.  The only good
news is it isn't a recent regression.

The most important batch of fixes in here come from Ilya.  They address
a regression Liu Bo found in the balance ioctls for pausing and resuming
a running balance across drives.

Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.

Arne's patches address problems with subvolume quotas.  If the user
destroys quota groups incorrectly the FS will refuse to mount.

The rest are smaller fixes and plugs for memory leaks.

Ilya Dryomov (6) commits (+94/-32):
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag (+9/-8)
Btrfs: fix "mutually exclusive op is running" error code (+4/-4)
Btrfs: fix a regression in balance usage filter (+8/-1)
Btrfs: bring back balance pause/resume logic (+71/-17)
Btrfs: fix unlock order in btrfs_ioctl_rm_dev (+1/-1)
Btrfs: fix unlock order in btrfs_ioctl_resize (+1/-1)

Liu Bo (4) commits (+18/-7):
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents (+14/-6)
Btrfs: let allocation start from the right raid type (+1/-1)
Btrfs: reset path lock state to zero (+2/-0)
Btrfs: fix off-by-one in lseek (+1/-0)

Miao Xie (4) commits (+15/-7):
Btrfs: fix missing write access release in btrfs_ioctl_resize() (+1/-0)
Btrfs: do not delete a subvolume which is in a R/O subvolume (+5/-5)
Btrfs: fix resize a readonly device (+4/-2)
Btrfs: disable qgroup id 0 (+5/-0)

Arne Jansen (2) commits (+19/-1):
Btrfs: prevent qgroup destroy when there are still relations (+12/-1)
Btrfs: ignore orphan qgroup relations (+7/-0)

Josef Bacik (2) commits (+39/-16):
Btrfs: add orphan before truncating pagecache (+38/-15)
Btrfs: set flushing if we're limited flushing (+1/-1)

Zach Brown (1) commits (+1/-0):
btrfs: fix btrfs_cont_expand() freeing IS_ERR em

Lukas Czerner (1) commits (+1/-1):
btrfs: get the device in write mode when deleting it

Eric Sandeen (1) commits (+14/-3):
btrfs: update timestamps on truncate()

Tsutomu Itoh (1) commits (+3/-1):
Btrfs: fix memory leak in name_cache_insert()

Total: (22) commits

 fs/btrfs/extent-tree.c |   6 ++-
 fs/btrfs/file.c        |  10 ++--
 fs/btrfs/inode.c       |  82 +++
 fs/btrfs/ioctl.c       | 129 +++--
 fs/btrfs/qgroup.c      |  20 +++-
 fs/btrfs/send.c        |   4 +-
 fs/btrfs/volumes.c     |  21 ++--
 7 files changed, 204 insertions(+), 68 deletions(-)

Re: [GIT PULL] Btrfs fixes

2013-01-22 Thread Chris Mason

On Tue, Jan 22, 2013 at 06:28:21PM -0700, Liu Bo wrote:
> On Tue, Jan 22, 2013 at 07:48:33PM -0500, Chris Mason wrote:
> > Hi Linus,
> > 
> > My for-linus branch has our batch of btrfs fixes:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
> > for-linus
> > 
> > We've been hammering away at a crc corruption as well, which I was
> > really hoping to get into this pull.  It isn't nailed down yet, but we
> > were finally able to get a solid way to reproduce.  The only good
> > news is it isn't a recent regression.
> > 
> > The most important batch of fixes in here come from Ilya.  They address
> > a regression Liu Bo found in the balance ioctls for pausing and resuming
> > a running balance across drives.
> > 
> > Josef's orphan truncate patch fixes an obscure corruption we'd see
> > during xfstests.
> > 
> > Arne's patches address problems with subvolume quotas.  If the user
> > destroys quota groups incorrectly the FS will refuse to mount.
> > 
> > The rest are smaller fixes and plugs for memory leaks.
> 
> Hi,
> 
> Any chance to get these in this round?  I think they're good fixes,
> a memory leak and a warning fix, both found with xfstests.

I'll get these tested in the next pull.

-chris

[GIT PULL] Btrfs updates

2013-02-15 Thread Chris Mason

Hi Linus,

If you're doing another RC, please grab these two.  Otherwise I'll send
them off to -stable.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This fixes a long standing problem where the btrfs scan ioctl was racing
with mkfs.btrfs and dropping dirty pages created by mkfs.  It also fixes
a crash during tree log replay with quota enabled.

David Sterba (1) commits (+64/-6):
btrfs: access superblock via pagecache in scan_one_device

Arne Jansen (1) commits (+1/-1):
Btrfs: fix crash in log replay with qgroups enabled

Total: (2) commits (+65/-7)

 fs/btrfs/ctree.c   |  2 +-
 fs/btrfs/volumes.c | 70 +-
 2 files changed, 65 insertions(+), 7 deletions(-)


Re: linux-next: build failure after merge of the btrfs tree

2012-12-16 Thread Chris Mason

On Sun, Dec 16, 2012 at 04:00:22PM -0700, Stephen Rothwell wrote:
> Hi Chris,
> 
> After merging the btrfs tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> fs/btrfs/ioctl.c: In function 'btrfs_ioctl':
> fs/btrfs/ioctl.c:3940:7: error: case label does not reduce to an integer 
> constant

Thanks Stephen.  In my sources, this line is:

case BTRFS_IOC_DEV_REPLACE:

And the define is:

#define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
struct btrfs_ioctl_dev_replace_args)

Is there a way to see what ppc64 is doing with this macro?

> 
> Caused by commit 0aa7cbc7585a ("Btrfs: add support for device replace 
> ioctls").
> 
> I have used the btrfs tree from next-20121214 for today (which was empty).
> 
> I have to say that these btrfs commits have come to linux-next very late
> in the game (i.e. some of them have author dates back in September and
> October and yet they only appeared in linux-next today).

This is true, we've had these in testing for some time.  Especially when
new interfaces come in, we tend to delay them.

> Also, the
> committer of these commits is Josef Bacik  but there
> is no Signed-off-by from him.  There are other commits that are committed by
> you, Chris, that also do not have a Signed-off-by from you.

Josef and I have sob on all of our commits (at least all the ones not in
3.7, I didn't go back farther).  In this case the Author was Stefan and
it ended up rebased in either Josef's or my tree.  We usually try to
preserve merges on rebase, but this time was a bigger set of changes
than usual and it didn't work out.

-chris


Re: linux-next: build failure after merge of the btrfs tree

2012-12-16 Thread Chris Mason

On Sun, Dec 16, 2012 at 04:00:22PM -0700, Stephen Rothwell wrote:
> Hi Chris,
> 
> After merging the btrfs tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> fs/btrfs/ioctl.c: In function 'btrfs_ioctl':
> fs/btrfs/ioctl.c:3940:7: error: case label does not reduce to an integer 
> constant

Many thanks Stephen for helping debug this.

It turned out the ioctl arg was just too big.  Stefan,
I lowered the size by using BTRFS_DEVICE_PATH_NAME_MAX instead, and I also
put the char arrays after the u64s.
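
A minimal sketch of why the size matters.  _IOWR() packs the size of the
argument struct into a fixed-width field of the ioctl number.  The sketch
assumes that field is 14 bits wide in the generic encoding and 13 bits wide on
powerpc (worth checking against the arch headers), which would explain a
failure that only shows up on ppc64.  The struct sizes are invented for
illustration and were not measured from the btrfs headers.

```python
GENERIC_SIZEBITS = 14   # assumed width of the size field, generic layout
POWERPC_SIZEBITS = 13   # assumed width of the size field, powerpc layout

def ioc_size_fits(struct_size, sizebits):
    """True if struct_size can be encoded in an ioctl number's size field."""
    return 0 < struct_size < (1 << sizebits)

# Hypothetical sizes: two long path arrays plus a few u64s, before and after
# shrinking the arrays.
before = 2 * 4096 + 64
after = 2 * 1024 + 64

report = {
    "before/generic": ioc_size_fits(before, GENERIC_SIZEBITS),
    "before/powerpc": ioc_size_fits(before, POWERPC_SIZEBITS),
    "after/powerpc": ioc_size_fits(after, POWERPC_SIZEBITS),
}
```

Under these assumptions the larger struct fits the generic layout but not the
powerpc one, and the smaller struct fits both.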

It works here with the progs patch you sent (and matching struct
changes), including cancel and the btrfs replace -r mode.  I've pushed
the result out to my next branch and my for-linus branch, please take a
look and make sure I didn't miss anything.

-chris

Re: linux-next: build failure after merge of the btrfs tree

2012-12-16 Thread Chris Mason

On Sun, Dec 16, 2012 at 05:15:04PM -0700, Chris Mason wrote:
> On Sun, Dec 16, 2012 at 04:00:22PM -0700, Stephen Rothwell wrote:
> Josef and I have sob on all of our commits (at least all the ones not in
> 3.7, I didn't go back farther).  In this case the Author was Stefan and
> it ended up rebased in either Josef's or my tree.  We usually try to
> preserve merges on rebase, but this time was a bigger set of changes
> than usual and it didn't work out.

Just FYI, I've pushed out a fixed version of the tree with proper
signed-off-by on all the commits, thanks for catching that.

It also has the fix for the ppc compile problems.  (This isn't a pull
request yet, I've got the hash collision fix and a few others pending as
well).

-chris


Re: linux-next: build failure after merge of the btrfs tree

2012-12-16 Thread Chris Mason

On Sun, Dec 16, 2012 at 08:13:55PM -0700, Stephen Rothwell wrote:
> Hi Chris,
> 
> On Sun, 16 Dec 2012 21:52:41 -0500 Chris Mason  
> wrote:
> >
> > On Sun, Dec 16, 2012 at 05:15:04PM -0700, Chris Mason wrote:
> > > On Sun, Dec 16, 2012 at 04:00:22PM -0700, Stephen Rothwell wrote:
> > > Josef and I have sob on all of our commits (at least all the ones not in
> > > 3.7, I didn't go back farther).  In this case the Author was Stefan and
> > > it ended up rebased in either Josef's or my tree.  We usually try to
> > > preserve merges on rebase, but this time was a bigger set of changes
> > > than usual and it didn't work out.
> > 
> > Just FYI, I've pushed out a fixed version of the tree with proper
> > signed-off-by on all the commits, thanks for catching that.
> > 
> > It also has the fix for the ppc compile problems.  (This isn't a pull
> > request yet, I've got the hash collision fix and a few others pending as
> > well).
> 
> You do realise that I just fetch your tree/branch each day, right?  So it
> will be in tomorrow's linux-next.

Yes, but I had pushed to that for-linus branch for Stefan to more easily
see the changes, and I wanted to make it clear I had just redone things
for the signed-off-by.

-chris


[GIT PULL] Btrfs updates

2012-12-17 Thread Chris Mason

btrfs_allocate() (+4/-4)
Btrfs: make ordered extent be flushed by multi-task (+37/-9)
Btrfs: don't auto defrag a file when doing directIO (+0/-3)
Btrfs: fix unprotected defragable inode insertion (+55/-15)
Btrfs: restructure btrfs_run_defrag_inodes() (+109/-91)
Btrfs: get write access for qgroup operations (+48/-25)
Btrfs: get write access when removing a device (+8/-4)
Btrfs: cleanup duplicated division functions (+46/-40)
Btrfs: punch hole past the end of the file (+12/-10)
Btrfs: use slabs for auto defrag allocation (+34/-5)
Btrfs: get write access when doing resize fs (+8/-2)
Btrfs: fix wrong comment in can_overcommit() (+3/-3)
Btrfs: improve the noflush reservation (+97/-86)
Btrfs: fix the page that is beyond EOF (+9/-7)
Btrfs: fix wrong file extent length (+23/-9)
Btrfs: get write access for scrub (+13/-3)
Btrfs: fix freeze vs auto defrag (+3/-0)

Josef Bacik (18) commits (+805/-361):
Btrfs: don't take inode delalloc mutex if we're a free space inode (+19/-6)
Btrfs: only clear dirty on the buffer if it is marked as dirty (+4/-4)
Btrfs: recheck bio against block device when we map the bio (+131/-28)
Btrfs: do not mark ems as prealloc if we are writing to them (+5/-4)
Btrfs: don't bother copying if we're only logging the inode (+34/-6)
Btrfs: log changed inodes based on the extent map tree (+372/-210)
Btrfs: only log the inode item if we can get away with it (+11/-2)
Btrfs: keep track of the extents original block length (+24/-4)
Btrfs: fill the global reserve when unpinning space (+24/-5)
Btrfs: do not call file_update_time in aio_write (+48/-29)
Btrfs: use tokens where we can in the tree log (+73/-54)
Btrfs: move checks in set_page_dirty under DEBUG (+2/-0)
Btrfs: only unlock and relock if we have to (+4/-1)
Btrfs: fix autodefrag and umount lockup (+17/-2)
Btrfs: inline csums if we're fsyncing (+21/-1)
Btrfs: add path->really_keep_locks (+6/-2)
Btrfs: optimize leaf_space_used (+9/-2)
Btrfs: don't memset new tokens (+1/-1)

Liu Bo (17) commits (+124/-151):
Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems (+2/-2)
Btrfs: fix a double free on pending snapshots in error handling (+5/-1)
Btrfs: reorder tree mod log operations in deleting a pointer (+6/-4)
Btrfs: fix a deadlock in aborting transaction due to ENOSPC (+7/-0)
Btrfs: skip adding an acl attribute if we don't have to (+2/-0)
Btrfs: get right arguments for btrfs_wait_ordered_range (+1/-1)
Btrfs: parse parent 0 into correct value in tracepoint (+2/-1)
Btrfs: do not log extents when we only log new names (+2/-1)
Btrfs: put raid properties into global table (+29/-33)
Btrfs: cleanup for btrfs_btree_balance_dirty (+34/-81)
Btrfs: kill unnecessary arguments in del_ptr (+7/-9)
Btrfs: don't add a NULL extended attribute (+10/-0)
Btrfs: protect devices list with its mutex (+5/-4)
Btrfs: cleanup for btrfs_wait_order_range (+0/-3)
Btrfs: fix an while-loop of listxattr (+1/-1)
Btrfs: fix a bug of per-file nocow (+5/-3)
Btrfs: cleanup unused arguments (+6/-7)

Filipe Brandenburger (3) commits (+21/-12):
Btrfs: refactor error handling to drop inode in btrfs_create() (+11/-12)
Btrfs: fix permissions of empty files not affected by umask (+4/-0)
Btrfs: fix permissions of empty files not affected by umask (+6/-0)

Julia Lawall (2) commits (+26/-48):
fs/btrfs: drop if around WARN_ON (+5/-10)
fs/btrfs: use WARN (+21/-38)

Wang Sheng-Hui (2) commits (+9/-12):
Btrfs: use ctl->unit for free space calculation instead of block_group->sectorsize (+9/-11)
Btrfs: do not warn_on io_ctl->cur in io_ctl_map_page (+0/-1)

Tsutomu Itoh (2) commits (+11/-0):
Btrfs: set hole punching time properly (+3/-0)
Btrfs: add fiemap's flag check (+8/-0)

Lukas Czerner (1) commits (+16/-0):
btrfs: Notify udev when removing device

Anand Jain (1) commits (+16/-16):
Btrfs: rename root_times_lock to root_item_lock

Alexander Block (1) commits (+11/-2):
Btrfs: merge inode_list in __merge_refs

Chris Mason (1) commits (+95/-2):
Btrfs: fix hash overflow handling

Masanari Iida (1) commits (+2/-2):
Btrfs: Fix typo in fs/btrfs

jeff.liu (1) commits (+1/-4):
Btrfs: Remove the invalid shrink size check up from btrfs_shrink_dev()

Total: (115) commits (+5423/-1912)

 fs/btrfs/Makefile          |    2 +-
 fs/btrfs/acl.c             |    2 +
 fs/btrfs/backref.c         |   16 +-
 fs/btrfs/btrfs_inode.h     |    4 +
 fs/btrfs/check-integrity.c |   31 +-
 fs/btrfs/compression.c     |    6 +-
 fs/btrfs/ctree.c           |  241 --
 fs/btrfs/ctree.h           |  184 -
 fs/btrfs/delayed-inode.c   |   11 +-
 fs/btrfs/dev-replace.c     |  856
 fs/btrfs/dev-replace.h     |   44 +
 fs/btrfs/dir-item.c        |   59 +

dma engine bugs

2013-01-17 Thread Chris Mason

Hi Dan,

I'm doing some benchmarking on MD raid5/6 on 4 fusionio cards in an HP
DL380p.  I'm doing 128K random writes on a 4 drive raid6 with a 64K
stripe size per drive.  I have 4 fio processes sending down the aio/dio,
and a high queue depth (8192).

When I bump up the MD raid stripe cache size, I'm running into
soft lockups in the async memcopy code:

[34336.959645] BUG: soft lockup - CPU#6 stuck for 22s! [fio:38296]
[34336.959648] BUG: soft lockup - CPU#9 stuck for 22s! [md0_raid6:5172]
[34336.959704] Modules linked in: raid456 async_raid6_recov async_pq
async_xor async_memcpy async_tx iomemory_vsl(O) binfmt_misc
cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq
mperf loop dm_mod coretemp kvm_intel kvm ghash_clmulni_intel sr_mod
cdrom aesni_intel ablk_helper cryptd lrw aes_x86_64 ata_generic xts
gf128mul ioatdma sb_edac gpio_ich ata_piix hid_generic dca edac_core
lpc_ich microcode serio_raw mfd_core hpilo hpwdt button container tg3 sg
acpi_power_meter usbhid mgag200 ttm drm_kms_helper drm i2c_algo_bit
sysimgblt sysfillrect syscopyarea uhci_hcd crc32c_intel ehci_hcd hpsa
processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc
scsi_dh_alua scsi_dh btrfs raid6_pq zlib_deflate xor libcrc32c asix
usbnet usbcore usb_common

[34336.959709] CPU 9
[34336.959709] Pid: 5172, comm: md0_raid6 Tainted: GW  O 
3.7.1-1-default #2 HP ProLiant DL380p Gen8
[34336.959720] RIP: 0010:[]  [] 
_raw_spin_unlock_irqrestore+0xd/0x20
[34336.959721] RSP: 0018:8807af6db858  EFLAGS: 0292
[34336.959722] RAX: 1000 RBX: 8810176fd000 RCX: 0292
[34336.959723] RDX: 1000 RSI: 0292 RDI: 0292
[34336.959724] RBP: 8807af6db858 R08: 881017e40440 R09: 880f554fabc0
[34336.959725] R10: 2000 R11:  R12: 881017e40460
[34336.959726] R13: 0040 R14: 0001 R15: 881017e40480
[34336.959728] FS:  () GS:88103f66() 
knlGS:
[34336.959729] CS:  0010 DS:  ES:  CR0: 80050033
[34336.959730] CR2: 035cf458 CR3: 01a0b000 CR4: 000407e0
[34336.959731] DR0:  DR1:  DR2: 
[34336.959733] DR3:  DR6: 0ff0 DR7: 0400
[34336.959734] Process md0_raid6 (pid: 5172, threadinfo 8807af6da000, task 
88077d7725c0)
[34336.959735] Stack:
[34336.959738]  8807af6db898 8114f287 8807af6db8b8 

[34336.959740]   005bd84a 881015f2fa18 
881017632a38
[34336.959742]  8807af6db8e8 a057adf4  
881015f2fa18
[34336.959743] Call Trace:
[34336.959750]  [] dma_pool_alloc+0x67/0x270
[34336.959758]  [] ioat2_alloc_ring_ent+0x34/0xc0 [ioatdma]
[34336.959761]  [] reshape_ring+0x145/0x370 [ioatdma]
[34336.959764]  [] ? _raw_spin_lock_bh+0x2d/0x40
[34336.959767]  [] ioat2_check_space_lock+0xe9/0x240 [ioatdma]
[34336.959768]  [] ? _raw_spin_unlock_bh+0x11/0x20
[34336.959771]  [] ioat2_dma_prep_memcpy_lock+0x5c/0x280 
[ioatdma]
[34336.959773]  [] ? do_async_gen_syndrome+0x29f/0x3d0 
[async_pq]
[34336.959775]  [] ? _raw_spin_unlock_bh+0x11/0x20
[34336.959790]  [] ? ioat2_tx_submit_unlock+0x92/0x100 
[ioatdma]
[34336.959792]  [] async_memcpy+0x207/0x1000 [async_memcpy]
[34336.959795]  [] async_copy_data+0x9d/0x150 [raid456]
[34336.959797]  [] __raid_run_ops+0x4ca/0x990 [raid456]
[34336.959802]  [] ? __aio_put_req+0x102/0x150
[34336.959805]  [] ?  handle_stripe_dirtying+0x30e/0x440 
[raid456]
[34336.959807]  [] handle_stripe+0x528/0x10b0 [raid456]
[34336.959810]  [] handle_active_stripes+0x1e0/0x270 [raid456]
[34336.959814]  [] ? blk_flush_plug_list+0xb3/0x220
[34336.959817]  [] raid5d+0x220/0x3c0 [raid456]
[34336.959822]  [] md_thread+0x12e/0x160
[34336.959828]  [] ? wake_up_bit+0x40/0x40
[34336.959829]  [] ? md_rdev_init+0x110/0x110
[34336.959831]  [] kthread+0xc6/0xd0
[34336.959834]  [] ?  kthread_freezable_should_stop+0x70/0x70
[34336.959849]  [] ret_from_fork+0x7c/0xb0
[34336.959851]  [] ?  kthread_freezable_should_stop+0x70/0x70

Since I'm running on fast cards, I assumed MD was just hammering on this
path so much that MD needed a cond_resched().  But now that I've
sprinkled conditional pixie dust everywhere I'm still seeing exactly the
same trace, and the lockups keep flowing forever, even after I've
stopped all new IO.

Looking at ioat2_check_space_lock(), it is looping when the ring
allocation fails.  We're trying to grow our ring with atomic allocations
and not giving up the CPU?

I'm not sure what the right answer is for a patch.  If it is safe for
the callers we could add the cond_resched() but still we might fail to
grow the ring.

Would it be better to fall back to synchronous operations if we can't get
into the ring?
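
The fallback idea could look roughly like the sketch below: make one
non-blocking attempt to reserve a descriptor slot, and do the copy inline if
none is free instead of looping.  Everything here (the Ring class, its fields,
the function names) is hypothetical and is not the ioatdma code.

```python
class Ring:
    def __init__(self, slots):
        self.free = slots
        self.pending = []               # copies queued for the "engine"

    def try_reserve(self):
        """Non-blocking: take a slot if one is free, never spin."""
        if self.free == 0:
            return False
        self.free -= 1
        return True

def copy_data(ring, dst, src):
    """Queue an offloaded copy if a slot is free, otherwise copy inline."""
    if ring.try_reserve():
        ring.pending.append((dst, src))  # completed later, not here
        return "async"
    dst[:] = src                         # synchronous fallback
    return "sync"
```

The caller always makes forward progress; the cost is that a full ring turns
offloaded copies into CPU copies until slots are released.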

-chris

dma engine bugs

2013-01-17 Thread Chris Mason

[ Sorry resend with the right address for Dan ]

Hi Dan,

I'm doing some benchmarking on MD raid5/6 on 4 fusionio cards in an HP
DL380p.  I'm doing 128K random writes on a 4 drive raid6 with a 64K
stripe size per drive.  I have 4 fio processes sending down the aio/dio,
and a high queue depth (8192).

When I bump up the MD raid stripe cache size, I'm running into
soft lockups in the async memcopy code:

[34336.959645] BUG: soft lockup - CPU#6 stuck for 22s! [fio:38296]
[34336.959648] BUG: soft lockup - CPU#9 stuck for 22s! [md0_raid6:5172]
[34336.959704] Modules linked in: raid456 async_raid6_recov async_pq
async_xor async_memcpy async_tx iomemory_vsl(O) binfmt_misc
cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq
mperf loop dm_mod coretemp kvm_intel kvm ghash_clmulni_intel sr_mod
cdrom aesni_intel ablk_helper cryptd lrw aes_x86_64 ata_generic xts
gf128mul ioatdma sb_edac gpio_ich ata_piix hid_generic dca edac_core
lpc_ich microcode serio_raw mfd_core hpilo hpwdt button container tg3 sg
acpi_power_meter usbhid mgag200 ttm drm_kms_helper drm i2c_algo_bit
sysimgblt sysfillrect syscopyarea uhci_hcd crc32c_intel ehci_hcd hpsa
processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc
scsi_dh_alua scsi_dh btrfs raid6_pq zlib_deflate xor libcrc32c asix
usbnet usbcore usb_common

[34336.959709] CPU 9
[34336.959709] Pid: 5172, comm: md0_raid6 Tainted: GW  O 
3.7.1-1-default #2 HP ProLiant DL380p Gen8
[34336.959720] RIP: 0010:[]  [] 
_raw_spin_unlock_irqrestore+0xd/0x20
[34336.959721] RSP: 0018:8807af6db858  EFLAGS: 0292
[34336.959722] RAX: 1000 RBX: 8810176fd000 RCX: 0292
[34336.959723] RDX: 1000 RSI: 0292 RDI: 0292
[34336.959724] RBP: 8807af6db858 R08: 881017e40440 R09: 880f554fabc0
[34336.959725] R10: 2000 R11:  R12: 881017e40460
[34336.959726] R13: 0040 R14: 0001 R15: 881017e40480
[34336.959728] FS:  () GS:88103f66() 
knlGS:
[34336.959729] CS:  0010 DS:  ES:  CR0: 80050033
[34336.959730] CR2: 035cf458 CR3: 01a0b000 CR4: 000407e0
[34336.959731] DR0:  DR1:  DR2: 
[34336.959733] DR3:  DR6: 0ff0 DR7: 0400
[34336.959734] Process md0_raid6 (pid: 5172, threadinfo 8807af6da000, task 
88077d7725c0)
[34336.959735] Stack:
[34336.959738]  8807af6db898 8114f287 8807af6db8b8 

[34336.959740]   005bd84a 881015f2fa18 
881017632a38
[34336.959742]  8807af6db8e8 a057adf4  
881015f2fa18
[34336.959743] Call Trace:
[34336.959750]  [] dma_pool_alloc+0x67/0x270
[34336.959758]  [] ioat2_alloc_ring_ent+0x34/0xc0 [ioatdma]
[34336.959761]  [] reshape_ring+0x145/0x370 [ioatdma]
[34336.959764]  [] ? _raw_spin_lock_bh+0x2d/0x40
[34336.959767]  [] ioat2_check_space_lock+0xe9/0x240 [ioatdma]
[34336.959768]  [] ? _raw_spin_unlock_bh+0x11/0x20
[34336.959771]  [] ioat2_dma_prep_memcpy_lock+0x5c/0x280 
[ioatdma]
[34336.959773]  [] ? do_async_gen_syndrome+0x29f/0x3d0 
[async_pq]
[34336.959775]  [] ? _raw_spin_unlock_bh+0x11/0x20
[34336.959790]  [] ? ioat2_tx_submit_unlock+0x92/0x100 
[ioatdma]
[34336.959792]  [] async_memcpy+0x207/0x1000 [async_memcpy]
[34336.959795]  [] async_copy_data+0x9d/0x150 [raid456]
[34336.959797]  [] __raid_run_ops+0x4ca/0x990 [raid456]
[34336.959802]  [] ? __aio_put_req+0x102/0x150
[34336.959805]  [] ?  handle_stripe_dirtying+0x30e/0x440 
[raid456]
[34336.959807]  [] handle_stripe+0x528/0x10b0 [raid456]
[34336.959810]  [] handle_active_stripes+0x1e0/0x270 [raid456]
[34336.959814]  [] ? blk_flush_plug_list+0xb3/0x220
[34336.959817]  [] raid5d+0x220/0x3c0 [raid456]
[34336.959822]  [] md_thread+0x12e/0x160
[34336.959828]  [] ? wake_up_bit+0x40/0x40
[34336.959829]  [] ? md_rdev_init+0x110/0x110
[34336.959831]  [] kthread+0xc6/0xd0
[34336.959834]  [] ?  kthread_freezable_should_stop+0x70/0x70
[34336.959849]  [] ret_from_fork+0x7c/0xb0
[34336.959851]  [] ?  kthread_freezable_should_stop+0x70/0x70

Since I'm running on fast cards, I assumed MD was just hammering on this
path so much that MD needed a cond_resched().  But now that I've
sprinkled conditional pixie dust everywhere I'm still seeing exactly the
same trace, and the lockups keep flowing forever, even after I've
stopped all new IO.

Looking at ioat2_check_space_lock(), it is looping when the ring
allocation fails.  We're trying to grow our ring with atomic allocations
and not giving up the CPU?

I'm not sure what the right answer is for a patch.  If it is safe for
the callers we could add the cond_resched() but still we might fail to
grow the ring.

Would it be better to fall back to synchronous operations if we can't get
into the ring?
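
A minimal userspace sketch of that fallback idea, assuming a toy ring model
rather than the real ioatdma code.  struct ring, ring_reserve() and
copy_with_fallback() are hypothetical names made up for illustration: the
caller makes a bounded number of attempts to get a slot and then does the
copy on the CPU instead of looping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DMA descriptor ring; not the ioatdma structures. */
struct ring {
    size_t used;      /* descriptors currently in flight */
    size_t size;      /* current ring capacity */
    size_t max_size;  /* hard cap; growing past this always fails */
};

/* Try to reserve one slot, doubling the ring once if it is full. */
static bool ring_reserve(struct ring *r)
{
    if (r->used == r->size) {
        if (r->size == 0 || r->size * 2 > r->max_size)
            return false;   /* cannot grow: report failure, do not spin */
        r->size *= 2;
    }
    r->used++;
    assert(r->used <= r->size);
    return true;
}

enum copy_path { COPY_ASYNC, COPY_SYNC };

/*
 * Copy len bytes.  Make at most max_tries attempts to get a ring slot;
 * if none succeeds, copy synchronously on the CPU instead of looping.
 */
static enum copy_path copy_with_fallback(struct ring *r, void *dst,
                                         const void *src, size_t len,
                                         int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (ring_reserve(r)) {
            /* A real engine would queue a descriptor here; the sketch
             * copies immediately and leaves the slot marked in flight. */
            memcpy(dst, src, len);
            return COPY_ASYNC;
        }
    }
    memcpy(dst, src, len);
    return COPY_SYNC;
}
```

With a ring of size 2 capped at 4, the first four copies take the async
path and the fifth falls back to the synchronous copy.  Whether that is
safe for the real callers is exactly the open question above.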

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the b

Re: dma engine bugs

2013-01-17 Thread Chris Mason

On Thu, Jan 17, 2013 at 07:53:18PM -0700, Dan Williams wrote:
> On Thu, Jan 17, 2013 at 6:38 AM, Chris Mason  wrote:
> > [ Sorry resend with the right address for Dan ]
> >
> > Hi Dan,
> >
> > I'm doing some benchmarking on MD raid5/6 on 4 fusionio cards in an HP
> > DL380p.  I'm doing 128K random writes on a 4 drive raid6 with a 64K
> > stripe size per drive.  I have 4 fio processes sending down the aio/dio,
> > and a high queue depth (8192).
> >
> > When I bump up the MD raid stripe cache size, I'm running into
> > soft lockups in the async memcopy code:
> >
> > [34336.959645] BUG: soft lockup - CPU#6 stuck for 22s! [fio:38296]
> > [34336.959648] BUG: soft lockup - CPU#9 stuck for 22s! [md0_raid6:5172]
> > [34336.959704] Modules linked in: raid456 async_raid6_recov async_pq
> > async_xor async_memcpy async_tx iomemory_vsl(O) binfmt_misc
> > cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq
> > mperf loop dm_mod coretemp kvm_intel kvm ghash_clmulni_intel sr_mod
> > cdrom aesni_intel ablk_helper cryptd lrw aes_x86_64 ata_generic xts
> > gf128mul ioatdma sb_edac gpio_ich ata_piix hid_generic dca edac_core
> > lpc_ich microcode serio_raw mfd_core hpilo hpwdt button container tg3 sg
> > acpi_power_meter usbhid mgag200 ttm drm_kms_helper drm i2c_algo_bit
> > sysimgblt sysfillrect syscopyarea uhci_hcd crc32c_intel ehci_hcd hpsa
> > processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc
> > scsi_dh_alua scsi_dh btrfs raid6_pq zlib_deflate xor libcrc32c asix
> > usbnet usbcore usb_common
> >
> > [34336.959709] CPU 9
> > [34336.959709] Pid: 5172, comm: md0_raid6 Tainted: GW  O 
> > 3.7.1-1-default #2 HP ProLiant DL380p Gen8
> > [34336.959720] RIP: 0010:[]  [] 
> > _raw_spin_unlock_irqrestore+0xd/0x20
> > [34336.959721] RSP: 0018:8807af6db858  EFLAGS: 0292
> > [34336.959722] RAX: 1000 RBX: 8810176fd000 RCX: 
> > 0292
> > [34336.959723] RDX: 1000 RSI: 0292 RDI: 
> > 0292
> > [34336.959724] RBP: 8807af6db858 R08: 881017e40440 R09: 
> > 880f554fabc0
> > [34336.959725] R10: 2000 R11:  R12: 
> > 881017e40460
> > [34336.959726] R13: 0040 R14: 0001 R15: 
> > 881017e40480
> > [34336.959728] FS:  () GS:88103f66() 
> > knlGS:
> > [34336.959729] CS:  0010 DS:  ES:  CR0: 80050033
> > [34336.959730] CR2: 035cf458 CR3: 01a0b000 CR4: 
> > 000407e0
> > [34336.959731] DR0:  DR1:  DR2: 
> > 
> > [34336.959733] DR3:  DR6: 0ff0 DR7: 
> > 0400
> > [34336.959734] Process md0_raid6 (pid: 5172, threadinfo 8807af6da000, 
> > task 88077d7725c0)
> > [34336.959735] Stack:
> > [34336.959738]  8807af6db898 8114f287 8807af6db8b8 
> > 
> > [34336.959740]   005bd84a 881015f2fa18 
> > 881017632a38
> > [34336.959742]  8807af6db8e8 a057adf4  
> > 881015f2fa18
> > [34336.959743] Call Trace:
> > [34336.959750]  [] dma_pool_alloc+0x67/0x270
> > [34336.959758]  [] ioat2_alloc_ring_ent+0x34/0xc0 
> > [ioatdma]
> > [34336.959761]  [] reshape_ring+0x145/0x370 [ioatdma]
> > [34336.959764]  [] ? _raw_spin_lock_bh+0x2d/0x40
> > [34336.959767]  [] ioat2_check_space_lock+0xe9/0x240 
> > [ioatdma]
> > [34336.959768]  [] ? _raw_spin_unlock_bh+0x11/0x20
> > [34336.959771]  [] ioat2_dma_prep_memcpy_lock+0x5c/0x280 
> > [ioatdma]
> > [34336.959773]  [] ? do_async_gen_syndrome+0x29f/0x3d0 
> > [async_pq]
> > [34336.959775]  [] ? _raw_spin_unlock_bh+0x11/0x20
> > [34336.959790]  [] ? ioat2_tx_submit_unlock+0x92/0x100 
> > [ioatdma]
> > [34336.959792]  [] async_memcpy+0x207/0x1000 
> > [async_memcpy]
> > [34336.959795]  [] async_copy_data+0x9d/0x150 [raid456]
> > [34336.959797]  [] __raid_run_ops+0x4ca/0x990 [raid456]
> > [34336.959802]  [] ? __aio_put_req+0x102/0x150
> > [34336.959805]  [] ?  handle_stripe_dirtying+0x30e/0x440 
> > [raid456]
> > [34336.959807]  [] handle_stripe+0x528/0x10b0 [raid456]
> > [34336.959810]  [] handle_active_stripes+0x1e0/0x270 
> > [raid456]
> > [34336.959814]  [] ? blk_flush_plug_list+0xb3/0x220
> > [34336.959817]  [] raid5d+0x220/0x3c0 [raid456]
> > [34336.959822]  [] md_thread+0x12e/0x160
> > [34336.959828]  [] ? wake_up_bit+0x40/

Re: [PATCH 00/37] Permit filesystem local caching

2008-02-22 Thread Chris Mason

On Thursday 21 February 2008, David Howells wrote:
> David Howells <[EMAIL PROTECTED]> wrote:
> > > Have you got before/after benchmark results?
> >
> > See attached.
>
> Attached here are results using BTRFS (patched so that it'll work at all)
> rather than Ext3 on the client on the partition backing the cache.

Thanks for trying this.  Of course, I'll ask you to try again with the latest
v0.13 code; it has a number of optimizations, especially for CPU usage.

>
> Note that I didn't bother redoing the tests that didn't involve a cache as
> the choice of filesystem backing the cache should have no bearing on the
> result.
>
> Generally, completely cold caches shouldn't show much variation as all the
> writing can be done completely asynchronously, provided the client doesn't
> fill its RAM.
>
> The interesting case is where the disk cache is warm, but the pagecache is
> cold (ie: just after a reboot after filling the caches).  Here, for the two
> big files case, BTRFS appears quite a bit better than Ext3, showing a 21%
> reduction in time for the smaller case and a 13% reduction for the larger
> case.

I'm afraid I don't have a good handle on the filesystem operations that result 
from this workload.  Are we reading from the FS to fill the NFS page cache?

>
> For the many small/medium files case, BTRFS performed significantly better
> (15% reduction in time) in the case where the caches were completely cold.
> I'm not sure why, though - perhaps because it doesn't execute a
> write_begin() stage during the write_one_page() call and thus doesn't go
> allocating disk blocks to back the data, but instead allocates them later.

If your write_one_page call does parts of btrfs_file_write, you'll get delayed 
allocation for anything bigger than 8k by default.  <= 8k will get packed 
into the btree leaves.

>
> More surprising is that BTRFS performed significantly worse (15% increase
> in time) in the case where the cache on disk was fully populated and then
> the machine had been rebooted to clear the pagecaches.

Which FS operations are included here?  Finding all the files or just an 
unmount?  Btrfs defrags metadata in the background, and unmount has to wait 
for that defrag to finish.

Thanks again,
Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: data=guarded mode in ext3

2013-01-07 Thread Chris Mason

On Mon, Jan 07, 2013 at 08:46:56AM -0700, Ric Wheeler wrote:
> On 12/03/2012 09:34 PM, Keith Chew wrote:
> > Hi
> >
> > Just wanted to check if the "'Data=guarded' mode in Ext3" work started
> > by Chris Mason, is still being considered for merging to the mainline
> > kernel? Or has that effort stopped?
> >
> > Regards
> > Keith
> >
> 
> Hi Keith,
> 
> I think that Chris is spending pretty much all of his time on btrfs these
> days.  I believe this work is pretty much abandoned at this point...
> 
> What are you looking for specifically from this?

Ric has it right, the change was big enough that I didn't think it was
worth the risk of changing how ext3 works.  Most people are going to
want ext4 anyway for a lot of reasons, so it made sense to focus bigger
changes in the ext4 code base.

-chris

Re: [PATCH] reiserfs old data bug 2.2.x (was: ReiserFS? How reliable ...)

2001-04-05 Thread Chris Mason

On Thursday, April 05, 2001 02:13:55 AM +0100 Alan Cox
<[EMAIL PROTECTED]> wrote:

>> This is a reiserfs security issue, but only of theoretical nature (Even if
>> triggered, it won't harm you). But the reason for this bug is in NFS
>> (v2,
> 
> If the blocks contained my old /etc/shadow I'd be a bit upset.
> 

I think we're talking about different things here.  Alan, I think you are
referring to the ability to get old data in files during mmap, where the
exploit roughly looks like this:

truncate(file, 0)
truncate(file, X)
char *foo = mmap(file, X)
write(file, foo, X)

This should produce all zeros, but under 2.2.x reiserfs can instead include
old file data.  Turns out this is because during the write, the block
pointer is inserted before the newly allocated (and zero'd) buffer was set
up to date.  If a readpage is triggered when reiserfs_file_write calls
copy_from_user, you get the old data.  The fix is to mark the buffer up to
date right after zeroing.
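
To make the ordering concrete, here is a small userspace model, not the
reiserfs code.  struct model, model_readpage() and
model_write_with_racing_read() are hypothetical names, and the "race" is
simply a read issued after the block pointer becomes visible.  It only
illustrates why marking the zeroed buffer up to date before the pointer
is inserted hides the stale block contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 16

/* Hypothetical, much simplified model: one disk block and one buffer for it. */
struct model {
    char disk[MODEL_BLOCK_SIZE];  /* stale on-disk contents of the reused block */
    char buf[MODEL_BLOCK_SIZE];   /* in-memory buffer for that block */
    bool uptodate;                /* buffer contents are valid */
    bool mapped;                  /* block pointer is visible in the file */
};

/* A read: a buffer that is not up to date gets filled from disk first. */
static void model_readpage(struct model *m, char *out)
{
    assert(out != NULL);
    if (!m->mapped) {             /* still a hole: reads as zeros */
        memset(out, 0, MODEL_BLOCK_SIZE);
        return;
    }
    if (!m->uptodate) {
        memcpy(m->buf, m->disk, MODEL_BLOCK_SIZE);
        m->uptodate = true;
    }
    memcpy(out, m->buf, MODEL_BLOCK_SIZE);
}

/*
 * Allocate a block for a hole, then let a read slip in before the write
 * has copied any user data.  With fixed == true the zeroed buffer is
 * marked up to date before the block pointer becomes visible.
 */
static void model_write_with_racing_read(struct model *m, bool fixed, char *seen)
{
    memset(m->buf, 0, MODEL_BLOCK_SIZE);  /* newly allocated buffer is zeroed */
    if (fixed)
        m->uptodate = true;               /* the fix: right after zeroing */
    m->mapped = true;                     /* block pointer inserted */
    model_readpage(m, seen);              /* read arrives mid-write */
}
```

In the unfixed ordering the racing read returns whatever was in disk[];
in the fixed ordering it returns zeros.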

Two patches attached, one for 3.5.32 (uptodate_hole.diff.gz) and one for
older reiserfs versions (uptodate_hole-old.diff.gz).  Both are small,
gzip'd because my mailer is dumb.

3.5.33 should come out soon with this included.  2.4.x reiserfs doesn't
need this patch.

-chris

 uptodate_hole-old.diff.gz
 uptodate_hole.diff.gz

Re: gcc oopses with 2.4.3

2001-04-06 Thread Chris Mason




On Friday, April 06, 2001 05:44:42 PM +0200 Norbert Preining
<[EMAIL PROTECTED]> wrote:

> Hi!
> 
> I get frequent `internal compiler error', killed with Sig 4  or Sig 11
> and sometimes Ooops from compiling X or kernel. 
> 
> System: 2.4.3-vanilla, reiserfs, glibc-2.1.3
> [~] gcc -v
> Reading specs from /usr/lib/gcc-lib/i486-suse-linux/2.95.2/specs
> gcc version 2.95.2 19991024 (release)
> 
> 
> Here a decoded Ooops:
> 
> ksymoops 0.7c on i586 2.4.3.  Options used
>  -V (default)
>  -k /proc/ksyms (default)
>  -l /proc/modules (default)
>  -o /lib/modules/2.4.3/ (default)
>  -m /boot/System.map-2.4.3 (specified)
> 
> Unable to handle kernel NULL pointer dereference at virtual address
>  c0145e41
> *pde = 
> Oops: 
> CPU:0
> EIP:0010:[ext2_new_block+317/1808]
> EFLAGS: 00010282
> eax:    ebx: c7261de8   ecx:    edx: 
> esi: c6dab000   edi:    ebp: c7261dec   esp: c7261d9c
> ds: 0018   es: 0018   ss: 0018
> Process cc1 (pid: 20767, stackpage=c7261000)
> Stack: c2f280e0 0001 c7261e3c 0001 c01675d0  
> c6dab038  c6dab034 c7261e7c c7261e94 c020e91c 0001 0009 0008
>c7cc7c00  c1265800  c473c9e0 c1264120 c7261e40 c014755c
>c2f280e0 0001  Call Trace: [search_by_key+2028/3140]
> [ext2_alloc_block+120/128] [ext2_alloc_branch+41/456]
> [ext2_get_block+695/1152] [create_empty_buffers+23/108]
> [__block_prepare_write+234/560] [block_prepare_write+29/52]  Code: 74 04
> 31 d2 eb 52 83 be c8 00 00 00 08 77 20 8d 04 bd 00 00  Using defaults
> from ksymoops -t elf32-i386 -a i386

Neat, looks like you've installed the all new extreiser2fs.  Really though,
do you have ext2 on the box at all?

sigbus from gcc usually points to the ram, have you run a tester?

-chris


Re: PROBLEM: kernel oops in reiserfs under 2.4.2-ac28 and 2.4.3-ac3 when rming files

2001-04-09 Thread Chris Mason

On Sunday, April 08, 2001 03:43:19 PM -0500 xOr <[EMAIL PROTECTED]> wrote:

> [1.] kernel oops in reiserfs under 2.4.2-ac28 and 2.4.3-ac3 when rming
> files 

Ok, reiserfs must be picking the wrong member in an array of function
pointers, probably on a bad item from disk.  We're testing some code from
Alexander Zarochentcev that tries to detect this kind of thing, I'll
forward it to you off list.

thanks,
Chris


Re: VFS problem

2001-04-18 Thread Chris Mason




On Wednesday, April 18, 2001 01:44:04 PM +0200 Jaquemet Loic
<[EMAIL PROTECTED]> wrote:

> Jaquemet Loic wrote:
>
>> Sorry if this problem has already been discussed.
>>
>> I run a linux box with a 30GB HD/reiserfs.
>> I tried several 2.4 kernels (2.4.2, 2.4.3, 2.4.4-pre3, 2.4.3-ac7).
>> After a random time I get a fs problem which leads to:
>> - first, a segfault of a process which reads/writes on the partition
>> ex:
>> [jal@skippy prog]$ ./configure
>> 
>> ln -s dialects/linux/machine.h machine.h
>> Erreur de segmentation ( SEGFAULT )
>>
>> - and then the partition freezes.  Any attempt to read/write on it leads to


Hmmm, are you sure there aren't any reiserfs messages on screen or in the
log?

-chris




[PATCH] reiserfs transaction overflow

2001-04-18 Thread Chris Mason



Hi guys,

Under certain loads, the reiserfs journal can overflow the
max transaction size, leading to a crash (but not corruption).

When the transaction is too full for another writer to join,
the writer triggers a commit, and waits for the next transaction.
But, it doesn't properly check to make sure the next transaction
has enough room, which can lead to overflow.  It is hard to
hit because there is a large margin of error in the way log space
is reserved (this bug was probably in v.1 of the journal
code).
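
The shape of the bug can be sketched in userspace like this.  It is a toy
model, not the journal code: struct txn, commit_and_wait(),
join_unchecked() and join_rechecked() are hypothetical names, and
taken_by_others stands in for space other writers grabbed in the next
transaction while we slept.

```c
#include <assert.h>

/* Hypothetical model of a journal transaction; not the reiserfs structures. */
struct txn {
    unsigned long id;
    int used;   /* blocks already reserved by writers */
    int max;    /* maximum blocks per transaction */
};

/* Stand-in for "trigger a commit and wait for the next transaction". */
static void commit_and_wait(struct txn *t, int taken_by_others)
{
    t->id++;
    t->used = taken_by_others;
}

/* Buggy shape: after waiting, assume the new transaction has room. */
static void join_unchecked(struct txn *t, int nblocks, int taken_by_others)
{
    if (t->used + nblocks > t->max)
        commit_and_wait(t, taken_by_others);
    t->used += nblocks;             /* may now exceed t->max */
}

/* Fixed shape: after waiting, jump back and run the space check again. */
static int join_rechecked(struct txn *t, int nblocks, int taken_by_others)
{
    if (nblocks > t->max)
        return -1;                  /* can never fit; avoid looping forever */
relock:
    if (t->used + nblocks > t->max) {
        commit_and_wait(t, taken_by_others);
        taken_by_others = 0;        /* model: later transactions start empty */
        goto relock;
    }
    t->used += nblocks;
    assert(t->used <= t->max);
    return 0;
}
```

Starting from used = 9 of max = 10 and asking for 4 blocks while another
writer has already taken 8 in the next transaction, the unchecked join
ends at 12 blocks, and the rechecked join waits one more transaction and
ends at 4.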

A similar patch will be needed for 3.5.x reiserfs; that will
follow soon.

Anyway, this patch should fix 2.4.x, please apply:

-chris

--- linux/fs/reiserfs/journal.c.1   Tue Apr 17 09:36:36 2001
+++ linux/fs/reiserfs/journal.c Tue Apr 17 09:37:50 2001
@@ -2052,7 +2052,7 @@
sleep_on(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
   }
 }
-lock_journal(p_s_sb) ; /* relock to continue */
+goto relock ;
   }
 
   if (SB_JOURNAL(p_s_sb)->j_trans_start_time == 0) { /* we are the first writer, set trans_id */



[PATCH] ac only, allow reiserfs files > 4GB

2001-04-18 Thread Chris Mason



This patch should set s_maxbytes correctly for reiserfs in the
ac kernels, and adds a reiserfs_setattr call to catch expanding
truncates past the MAX_NON_LFS limit for old format files.

reiserfs_get_block already catches file writes and such for
this case.
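
For reference, the arithmetic behind the two limits as a standalone
sketch.  MAX_NON_LFS_BYTES is an assumed value (2GB minus one byte,
matching the "limited at 2GB" comment in the patch), and
new_format_maxbytes() and check_truncate_size() are hypothetical helpers
that fold the reiserfs_setattr check and the s_maxbytes check into one
place for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed value for the old-format limit: 2GB minus one byte. */
#define MAX_NON_LFS_BYTES ((INT64_C(1) << 31) - 1)

/*
 * New-format limit from the patch: i_blocks is 32 bits wide and counts
 * 512 byte units, so stay one full block below 512 * 2^32 bytes.
 */
static int64_t new_format_maxbytes(int64_t blocksize)
{
    assert(blocksize > 0);
    return (INT64_C(512) << 32) - blocksize;
}

enum item_version { ITEM_V1, ITEM_V2 };

/*
 * Simplified model of the checks on an expanding truncate: old format
 * files are refused past the non-LFS limit, and everything is refused
 * past the filesystem-wide maximum.  Returns 0 or -EFBIG.
 */
static int check_truncate_size(enum item_version v, int64_t new_size,
                               int64_t blocksize)
{
    if (v == ITEM_V1 && new_size > MAX_NON_LFS_BYTES)
        return -EFBIG;
    if (new_size > new_format_maxbytes(blocksize))
        return -EFBIG;
    return 0;
}
```

With a 4096 byte block size the new-format limit works out to
2^41 - 4096 = 2199023251456 bytes, just under 2TB.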

It also adds a generic_inode_setattr call, mostly because I
didn't want to copy/maintain that hunk of code in reiserfs.

Testing has been light, I'll beat on it more this evening.

patch against 2.4.3-ac7.

-chris

diff -Nru a/fs/attr.c b/fs/attr.c
--- a/fs/attr.c Wed Apr 18 18:33:44 2001
+++ b/fs/attr.c Wed Apr 18 18:33:44 2001
@@ -111,6 +111,21 @@
return dn_mask;
 }
 
+int generic_inode_setattr(struct inode *inode, struct iattr * attr) {
+   int error  ;
+   unsigned int ia_valid = attr->ia_valid;
+
+   error = inode_change_ok(inode, attr);
+   if (!error) {
+   if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) ||
+   (ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid))
+   error = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0;
+   if (!error)
+   error = inode_setattr(inode, attr);
+   }
+   return error ;
+}
+
 int notify_change(struct dentry * dentry, struct iattr * attr)
 {
struct inode *inode = dentry->d_inode;
@@ -131,14 +146,7 @@
if (inode->i_op && inode->i_op->setattr) 
error = inode->i_op->setattr(dentry, attr);
else {
-   error = inode_change_ok(inode, attr);
-   if (!error) {
-   if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) ||
-   (ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid))
-   error = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0;
-   if (!error)
-   error = inode_setattr(inode, attr);
-   }
+   error = generic_inode_setattr(inode, attr) ;
}
unlock_kernel();
if (!error) {
diff -Nru a/fs/reiserfs/file.c b/fs/reiserfs/file.c
--- a/fs/reiserfs/file.c Wed Apr 18 18:33:44 2001
+++ b/fs/reiserfs/file.c Wed Apr 18 18:33:44 2001
@@ -106,6 +106,18 @@
   return ( n_err < 0 ) ? -EIO : 0;
 }
 
+static int reiserfs_setattr(struct dentry *dentry, struct iattr *attr) {
+struct inode *inode = dentry->d_inode ;
+if (attr->ia_valid & ATTR_SIZE) {
+   /* version 2 items will be caught by the s_maxbytes check
+   ** done for us in vmtruncate
+   */
+if (inode_items_version(inode) == ITEM_VERSION_1 && 
+   attr->ia_size > MAX_NON_LFS)
+return -EFBIG ;
+}
+return generic_inode_setattr(inode, attr) ;
+}
 
 struct file_operations reiserfs_file_operations = {
 read:  generic_file_read,
@@ -119,6 +131,7 @@
 
 struct  inode_operations reiserfs_file_inode_operations = {
 truncate:  reiserfs_vfs_truncate_file,
+setattr:reiserfs_setattr,
 };
 
 
diff -Nru a/fs/reiserfs/super.c b/fs/reiserfs/super.c
--- a/fs/reiserfs/super.c   Wed Apr 18 18:33:44 2001
+++ b/fs/reiserfs/super.c   Wed Apr 18 18:33:44 2001
@@ -412,7 +412,7 @@
 SB_BUFFER_WITH_SB (s) = bh;
 SB_DISK_SUPER_BLOCK (s) = rs;
 s->s_op = &reiserfs_sops;
-s->s_maxbytes = MAX_NON_LFS;
+s->s_maxbytes = MAX_NON_LFS; /* old format is always limited at 2GB */
 return 0;
 }
 #endif
@@ -493,7 +493,11 @@
 SB_BUFFER_WITH_SB (s) = bh;
 SB_DISK_SUPER_BLOCK (s) = rs;
 s->s_op = &reiserfs_sops;
-s->s_maxbytes = 0x;/* 4Gig */
+
+/* new format is limited by the 32 bit wide i_blocks field, want to
+** be one full block below that.
+*/
+s->s_maxbytes = (512LL << 32) - s->s_blocksize ;
 return 0;
 }
 
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h Wed Apr 18 18:33:44 2001
+++ b/include/linux/fs.h Wed Apr 18 18:33:44 2001
@@ -1359,6 +1359,7 @@
 
 extern int inode_change_ok(struct inode *, struct iattr *);
 extern int inode_setattr(struct inode *, struct iattr *);
+extern int generic_inode_setattr(struct inode *, struct iattr *);
 
 /*
  * Common dentry functions for inclusion in the VFS
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c Wed Apr 18 18:33:44 2001
+++ b/kernel/ksyms.c Wed Apr 18 18:33:44 2001
@@ -180,6 +180,7 @@
 EXPORT_SYMBOL(permission);
 EXPORT_SYMBOL(vfs_permission);
 EXPORT_SYMBOL(inode_setattr);
+EXPORT_SYMBOL(generic_inode_setattr);
 EXPORT_SYMBOL(inode_change_ok);
 EXPORT_SYMBOL(write_inode_now);
 EXPORT_SYMBOL(notify_change);

