Re: Recent kernel "mount" slow
On Tue, 27 Nov 2012, Jens Axboe wrote: > On 2012-11-27 11:06, Jeff Chua wrote: > > On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe wrote: > >> On 2012-11-27 06:57, Jeff Chua wrote: > >>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua > >>> wrote: > >>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka > >>>> wrote: > >>>>> So it's better to slow down mount. > >>>> > >>>> I am quite proud of the linux boot time pitting against other OS. Even > >>>> with 10 partitions. Linux can boot up in just a few seconds, but now > >>>> you're saying that we need to do this semaphore check at boot up. By > >>>> doing so, it's inducing additional 4 seconds during boot up. > >>> > >>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU > >>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what > >>> kind of degradation would this cause or just the same? > >> > >> It'd likely be the same slow down time wise, but as a percentage it > >> would appear smaller on a slower disk. > >> > >> Could you please test Mikulas' suggestion of changing > >> synchronize_sched() in include/linux/percpu-rwsem.h to > >> synchronize_sched_expedited()? > > > > Tested. It seems as fast as before, but may be a "tick" slower. Just > > perception. I was getting pretty much 0.012s with everything reverted. > > With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s. > > So, it's good. > > Excellent > > >> linux-next also has a re-write of the per-cpu rw sems, out of Andrews > >> tree. It would be a good data point it you could test that, too. > > > > Tested. It's slower. 0.350s. But still faster than 0.500s without the patch. > > Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix > the real issue, which is having to do synchronize_sched() in the first > place. > > > # time mount /dev/sda1 /mnt; sync; sync; umount /mnt > > > > > > So, here's the comparison ... > > > > 0.500s 3.7.0-rc7 > > 0.168s 3.7.0-rc2 > > 0.012s 3.6.0 > > 0.013s 3.7.0-rc7 + synchronize_sched_expedited() > > 0.350s 3.7.0-rc7 + Oleg's patch. > > I wonder how many of them are due to changing to the same block size. > Does the below patch make a difference? This patch is wrong because you must check if the device is mapped while holding bdev->bd_block_size_semaphore (because bdev->bd_block_size_semaphore prevents new mappings from being created) I'm sending another patch that has the same effect. Note that ext[234] filesystems set blocksize to 1024 temporarily during mount, so it doesn't help much (it only helps for other filesystems, such as jfs). For ext[234], you have a device with default block size 4096, the filesystem sets block size to 1024 during mount, reads the super block and sets it back to 4096. If you want, you can fix ext[234] to avoid this behavior. 
Mikulas > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 1a1e5e3..f041c56 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size) > if (size < bdev_logical_block_size(bdev)) > return -EINVAL; > > - /* Prevent starting I/O or mapping the device */ > - percpu_down_write(&bdev->bd_block_size_semaphore); > - > /* Check that the block device is not memory mapped */ > mapping = bdev->bd_inode->i_mapping; > mutex_lock(&mapping->i_mmap_mutex); > if (mapping_mapped(mapping)) { > mutex_unlock(&mapping->i_mmap_mutex); > - percpu_up_write(&bdev->bd_block_size_semaphore); > return -EBUSY; > } > mutex_unlock(&mapping->i_mmap_mutex); > > /* Don't change the size if it is same as current */ > if (bdev->bd_block_size != size) { > - sync_blockdev(bdev); > - bdev->bd_block_size = size; > - bdev->bd_inode->i_blkbits = blksize_bits(size); > - kill_bdev(bdev); > + /* Prevent starting I/O */ > + percpu_down_write(&bdev->bd_block_size_semaphore); > + if (bdev->bd_block_size != size) { > + sync_blockdev(bdev); > + bdev->bd_block_size = size; > + bdev->bd_inode->i_blkbits = blksize_bits(size); > + ki
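To make the ext[234] remark above concrete, the mount-time pattern looks roughly like this. It is a minimal sketch with a hypothetical fill_super function, not the actual ext[234] code; the point is that both sb_set_blocksize() calls go through set_blocksize() and therefore take the write lock:

static int example_fill_super(struct super_block *sb, void *data, int silent)
{
        struct buffer_head *bh;

        if (!sb_set_blocksize(sb, 1024))        /* 1st block size change */
                return -EINVAL;

        bh = sb_bread(sb, 1);                   /* read the on-disk superblock */
        if (!bh)
                return -EIO;
        /* ... parse the superblock, determine the real block size ... */
        brelse(bh);

        if (!sb_set_blocksize(sb, 4096))        /* 2nd block size change */
                return -EINVAL;

        return 0;
}

So a single ext[234] mount changes the block size twice, which is why the fast-path check added by the patch below doesn't help these filesystems much.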
[PATCH 2/2] block_dev: don't take the write lock if block size doesn't change
block_dev: don't take the write lock if block size doesn't change Taking the write lock has a big performance impact on the whole system (because of synchronize_sched_expedited). This patch avoids taking the write lock if the block size doesn't change (i.e. when mounting filesystem with block size equal to the default block size). The logic to test if the block device is mapped was moved to a separate function is_bdev_mapped to avoid code duplication. Signed-off-by: Mikulas Patocka --- fs/block_dev.c | 25 ++--- 1 file changed, 18 insertions(+), 7 deletions(-) Index: linux-3.7-rc7/fs/block_dev.c === --- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 04:09:01.0 +0100 +++ linux-3.7-rc7/fs/block_dev.c2012-11-28 04:13:53.0 +0100 @@ -114,10 +114,18 @@ void invalidate_bdev(struct block_device } EXPORT_SYMBOL(invalidate_bdev); -int set_blocksize(struct block_device *bdev, int size) +static int is_bdev_mapped(struct block_device *bdev) { - struct address_space *mapping; + int ret_val; + struct address_space *mapping = bdev->bd_inode->i_mapping; + mutex_lock(&mapping->i_mmap_mutex); + ret_val = mapping_mapped(mapping); + mutex_unlock(&mapping->i_mmap_mutex); + return ret_val; +} +int set_blocksize(struct block_device *bdev, int size) +{ /* Size must be a power of two, and between 512 and PAGE_SIZE */ if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size)) return -EINVAL; @@ -126,18 +134,21 @@ int set_blocksize(struct block_device *b if (size < bdev_logical_block_size(bdev)) return -EINVAL; + /* +* If the block size doesn't change, don't take the write lock. +* We check for is_bdev_mapped anyway, for consistent behavior. +*/ + if (size == bdev->bd_block_size) + return is_bdev_mapped(bdev) ? -EBUSY : 0; + /* Prevent starting I/O or mapping the device */ percpu_down_write(&bdev->bd_block_size_semaphore); /* Check that the block device is not memory mapped */ - mapping = bdev->bd_inode->i_mapping; - mutex_lock(&mapping->i_mmap_mutex); - if (mapping_mapped(mapping)) { - mutex_unlock(&mapping->i_mmap_mutex); + if (is_bdev_mapped(bdev)) { percpu_up_write(&bdev->bd_block_size_semaphore); return -EBUSY; } - mutex_unlock(&mapping->i_mmap_mutex); /* Don't change the size if it is same as current */ if (bdev->bd_block_size != size) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
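Since the inline diff above is hard to read, here is roughly what set_blocksize() looks like with this patch applied. This is a reconstruction from the hunks above, not the verbatim patched function:

static int is_bdev_mapped(struct block_device *bdev)
{
        int ret_val;
        struct address_space *mapping = bdev->bd_inode->i_mapping;

        mutex_lock(&mapping->i_mmap_mutex);
        ret_val = mapping_mapped(mapping);
        mutex_unlock(&mapping->i_mmap_mutex);
        return ret_val;
}

int set_blocksize(struct block_device *bdev, int size)
{
        /* Size must be a power of two, and between 512 and PAGE_SIZE */
        if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
                return -EINVAL;

        /* Size cannot be smaller than the size supported by the device */
        if (size < bdev_logical_block_size(bdev))
                return -EINVAL;

        /*
         * Fast path: if the block size doesn't change, don't take the
         * write lock and don't pay for synchronize_sched_expedited().
         */
        if (size == bdev->bd_block_size)
                return is_bdev_mapped(bdev) ? -EBUSY : 0;

        /* Prevent starting I/O or mapping the device */
        percpu_down_write(&bdev->bd_block_size_semaphore);

        /* Check that the block device is not memory mapped */
        if (is_bdev_mapped(bdev)) {
                percpu_up_write(&bdev->bd_block_size_semaphore);
                return -EBUSY;
        }

        /* Don't change the size if it is same as current */
        if (bdev->bd_block_size != size) {
                sync_blockdev(bdev);
                bdev->bd_block_size = size;
                bdev->bd_inode->i_blkbits = blksize_bits(size);
                kill_bdev(bdev);
        }

        percpu_up_write(&bdev->bd_block_size_semaphore);
        return 0;
}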
[PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited
On Tue, 27 Nov 2012, Jeff Chua wrote: > On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe wrote: > > On 2012-11-27 06:57, Jeff Chua wrote: > >> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua > >> wrote: > >>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka > >>> wrote: > >>>> So it's better to slow down mount. > >>> > >>> I am quite proud of the linux boot time pitting against other OS. Even > >>> with 10 partitions. Linux can boot up in just a few seconds, but now > >>> you're saying that we need to do this semaphore check at boot up. By > >>> doing so, it's inducing additional 4 seconds during boot up. > >> > >> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU > >> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what > >> kind of degradation would this cause or just the same? > > > > It'd likely be the same slow down time wise, but as a percentage it > > would appear smaller on a slower disk. > > > > Could you please test Mikulas' suggestion of changing > > synchronize_sched() in include/linux/percpu-rwsem.h to > > synchronize_sched_expedited()? > > Tested. It seems as fast as before, but may be a "tick" slower. Just > perception. I was getting pretty much 0.012s with everything reverted. > With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s. > So, it's good. > > > > linux-next also has a re-write of the per-cpu rw sems, out of Andrews > > tree. It would be a good data point it you could test that, too. > > Tested. It's slower. 0.350s. But still faster than 0.500s without the patch. > > # time mount /dev/sda1 /mnt; sync; sync; umount /mnt > > > So, here's the comparison ... > > 0.500s 3.7.0-rc7 > 0.168s 3.7.0-rc2 > 0.012s 3.6.0 > 0.013s 3.7.0-rc7 + synchronize_sched_expedited() > 0.350s 3.7.0-rc7 + Oleg's patch. > > > Thanks, > Jeff. OK, I'm seinding two patches to reduce mount times. If it is possible to put them to 3.7.0, put them there. Mikulas --- percpu-rwsem: use synchronize_sched_expedited Use synchronize_sched_expedited() instead of synchronize_sched() to improve mount speed. This patch improves mount time from 0.500s to 0.013s. Note: if realtime people complain about the use synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest that they introduce an option CONFIG_REALTIME or /proc/sys/kernel/realtime and turn off these *_expedited functions if the option is enabled (i.e. turn synchronize_sched_expedited into synchronize_sched and synchronize_rcu_expedited into synchronize_rcu). 
Signed-off-by: Mikulas Patocka --- include/linux/percpu-rwsem.h |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-3.7-rc7/include/linux/percpu-rwsem.h === --- linux-3.7-rc7.orig/include/linux/percpu-rwsem.h 2012-11-28 02:41:03.0 +0100 +++ linux-3.7-rc7/include/linux/percpu-rwsem.h 2012-11-28 02:41:15.0 +0100 @@ -13,7 +13,7 @@ struct percpu_rw_semaphore { }; #define light_mb() barrier() -#define heavy_mb() synchronize_sched() +#define heavy_mb() synchronize_sched_expedited() static inline void percpu_down_read(struct percpu_rw_semaphore *p) { @@ -51,7 +51,7 @@ static inline void percpu_down_write(str { mutex_lock(&p->mtx); p->locked = true; - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ + synchronize_sched_expedited(); /* make sure that all readers exit the rcu_read_lock_sched region */ while (__percpu_count(p->counters)) msleep(1); heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
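The reason the writer needs synchronize_sched() (expedited or not) at all is visible from the reader side of the same header, which is not quoted above. The 3.7-era reader fast path looks roughly like this (paraphrased, not a verbatim quote):

static inline void percpu_down_read(struct percpu_rw_semaphore *p)
{
        rcu_read_lock_sched();
        if (unlikely(p->locked)) {
                /* slow path: a writer is active, fall back to the mutex */
                rcu_read_unlock_sched();
                mutex_lock(&p->mtx);
                this_cpu_inc(*p->counters);
                mutex_unlock(&p->mtx);
                return;
        }
        /* fast path: no atomics, no writes to shared cache lines */
        this_cpu_inc(*p->counters);
        rcu_read_unlock_sched();
        light_mb();     /* paired with heavy_mb() in percpu_up_write() */
}

Because readers are protected only by rcu_read_lock_sched(), the writer must wait for a sched-RCU grace period after setting p->locked: after that, every new reader is guaranteed to see locked == true and take the mutex path, and every reader that didn't has already incremented its per-CPU counter. That grace period is exactly the synchronize_sched() call this patch turns into synchronize_sched_expedited().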
[PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Jens Axboe wrote: > On 2012-11-28 04:57, Mikulas Patocka wrote: > > > > This patch is wrong because you must check if the device is mapped while > > holding bdev->bd_block_size_semaphore (because > > bdev->bd_block_size_semaphore prevents new mappings from being created) > > No it doesn't. If you read the patch, that was moved to i_mmap_mutex. Hmm, it was wrong before the patch and it is wrong after the patch too. The problem is that ->mmap method doesn't do the actual mapping, the caller of ->mmap (mmap_region) does it. So we must actually catch mmap_region and protect it with the lock, not ->mmap. Mikulas --- Introduce a method to catch mmap_region For block devices, we must prevent the device from being mapped while block size is being changed. This was implemented by catching the mmap method and taking bd_block_size_semaphore in it. The problem is that the generic_file_mmap method does almost nothing, it only sets vma->vm_ops = &generic_file_vm_ops, all the hard work is done by the caller, mmap_region. Consequently, locking the mmap method is insufficient, to prevent the race condition. The race condition can happen for example this way: process 1: Starts mapping a block device, it goes to mmap_region, calls file->f_op->mmap (blkdev_mmap - that acquires and releases bd_block_size_semaphore), and reschedule happens just after blkdev_mmap returns. process 2: Changes block device size, goes into set_blocksize, acquires percpu_down_write, calls mapping_mapped to test if the device is mapped (it still isn't). Then, it reschedules. process 1: Continues in mmap_region, successfully finishes the mapping. process 2: Continues in set_blocksize, changes the block size while the block device is mapped. (which is incorrect and may result in a crash or misbehavior). To fix this possible race condition, we need to catch the function that does real mapping - mmap_region. This patch adds a new method, file_operations->mmap_region. mmap_region calls the method file_operations->mmap_region if it is non-NULL, otherwise, it calls generic_mmap_region. 
Signed-off-by: Mikulas Patocka --- fs/block_dev.c | 12 include/linux/fs.h |4 include/linux/mm.h |3 +++ mm/mmap.c | 10 ++ 4 files changed, 25 insertions(+), 4 deletions(-) Index: linux-3.7-rc7/include/linux/fs.h === --- linux-3.7-rc7.orig/include/linux/fs.h 2012-11-28 17:45:55.0 +0100 +++ linux-3.7-rc7/include/linux/fs.h2012-11-28 18:02:45.0 +0100 @@ -27,6 +27,7 @@ #include #include #include +#include #include #include @@ -1528,6 +1529,9 @@ struct file_operations { unsigned int (*poll) (struct file *, struct poll_table_struct *); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); + unsigned long (*mmap_region)(struct file *, unsigned long, +unsigned long, unsigned long, vm_flags_t, +unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); Index: linux-3.7-rc7/include/linux/mm.h === --- linux-3.7-rc7.orig/include/linux/mm.h 2012-11-28 17:45:55.0 +0100 +++ linux-3.7-rc7/include/linux/mm.h2012-11-28 17:47:12.0 +0100 @@ -1444,6 +1444,9 @@ extern unsigned long get_unmapped_area(s extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff); +extern unsigned long generic_mmap_region(struct file *file, unsigned long addr, + unsigned long len, unsigned long flags, + vm_flags_t vm_flags, unsigned long pgoff); extern unsigned long do_mmap_pgoff(struct file *, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); Index: linux-3.7-rc7/mm/mmap.c === --- linux-3.7-rc7.orig/mm/mmap.c2012-11-28 17:45:55.0 +0100 +++ linux-3.7-rc7/mm/mmap.c 2012-11-28 17:57:40.0 +0100 @@ -1244,6 +1244,16 @@ unsigned long mmap_region(struct file *f unsigned long len, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff) { + if (file && file->f_op->mmap_region) + return file->f_op->mmap_region(file, addr, len, flags, vm_flags, + pgoff); + return generic_mmap_region(file, addr, len, flags, vm_flags, pgoff); +} + +unsigned long gene
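The fs/block_dev.c half of the diff is truncated above. Based on the changelog, the blkdev-side hook would look roughly like this (a reconstruction; the function name and details are assumptions, not the verbatim patch):

static unsigned long blkdev_mmap_region(struct file *file, unsigned long addr,
                                        unsigned long len, unsigned long flags,
                                        vm_flags_t vm_flags, unsigned long pgoff)
{
        unsigned long ret;
        struct block_device *bdev = I_BDEV(file->f_mapping->host);

        percpu_down_read(&bdev->bd_block_size_semaphore);
        /*
         * Unlike catching ->mmap, this covers the code that actually
         * creates the mapping, so set_blocksize() can no longer slip in
         * between the mapping_mapped() check and the mapping becoming
         * visible.
         */
        ret = generic_mmap_region(file, addr, len, flags, vm_flags, pgoff);
        percpu_up_read(&bdev->bd_block_size_semaphore);

        return ret;
}

The hook would then be wired into def_blk_fops as .mmap_region, next to the existing .mmap method.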
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote:

> No, this is crap.
>
> We don't introduce random hooks like this just because the block layer
> has shit-for-brains and cannot be bothered to do things right.
>
> The fact is, the whole locking in the block layer open routine is
> total and utter crap. It doesn't lock the right thing, even with your
> change *anyway* (or with the change Jens had). Absolutely nothing in
> "mmap_region()" cares at all about the block-size anywhere - it's
> generic, after all - so locking around it is f*cking pointless. There
> is no way in hell that the caller of ->mmap can *ever* care about the
> block size, since it never even looks at it.
>
> Don't do random crap like this.
>
> Why does the code think that mmap matters so much anyway? As you say,
> the mmap itself does *nothing*. It has no impact for the block size.
>
> Linus

mmap_region() doesn't care about the block size. But a lot of page-in/page-out
code does.

The problem is that once the block device is mapped, page faults or page
writeback can happen anytime - so the simplest solution is to not allow the
block device to be mapped while we change the block size.

The function set_blocksize takes bd_block_size_semaphore for write (that
blocks read/write/mmap), then it calls sync_blockdev (now we are sure that
there is no more writeback), then it changes the block size, then it calls
kill_bdev (now we are sure that there are no longer any pages with buffers of
the old blocksize).

If you want to allow the block size to change while the block device is
mapped, you'd have to add explicit locks around all mm callbacks (so that the
block size can't change while a callback is in progress) - and even then,
there are some unsolvable cases - i.e. what are you going to do if the user
mlocks a mapped block device and you change the block size of that device?
You can't drop the pages (that would violate mlock semantics) and you can't
leave them there (because they have buffers of the wrong size).

If you don't like what I sent, propose a different solution.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote: > On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds > wrote: > > > > mmap() is in *no* way special. The exact same thing happens for > > regular read/write. Yet somehow the mmap code is special-cased, while > > the normal read-write code is not. > > I just double-checked, because it's been a long time since I actually > looked at the code. > > But yeah, block device read/write uses the pure page cache functions. > IOW, it has the *exact* same IO engine as mmap() would have. > > So here's my suggestion: > > - get rid of *all* the locking in aio_read/write and the splice paths > - get rid of all the stupid mmap games > > - instead, add them to the functions that actually use > "blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else. > >That's a fairly limited number of functions: > blkdev_{read,write}page(), blkdev_direct_IO() and > blkdev_write_{begin,end}() > > Doesn't that sounds simpler? And more logical: it protects the actual > places that use the block size of the device. > > I dunno. Maybe there is some fundamental reason why the above is > broken, but it seems to be a much simpler approach. Sure, you need to > guarantee that the people who get the write-lock cannot possibly cause > IO while holding it, but since the only reason to get the write lock > would be to change the block size, that should be pretty simple, no? > > Yeah, yeah, I'm probably missing something fundamental, but the above > sounds like the simple approach to fixing things. Aiming for having > the block size read-lock be taken by the things that pass in the > block-size itself. > > It would be nice for things to be logical and straightforward. > >Linus The problem with this approach is that it is very easy to miss points where it is assumed that the block size doesn't change - and if you miss a point, it results in a hidden bug that has a little possibility of being found. For example, __block_write_full_page and __block_write_begin do if (!page_has_buffers(page)) { create_empty_buffers... } and then they do WARN_ON(bh->b_size != blocksize) err = get_block(inode, block, bh, 1) ... so if the buffers were left over from some previous call to create_empty_buffers with a different blocksize, that WARN_ON is trigged. And it's not only a harmless warning - now bh->b_size is left set to the old block size, but bh->b_blocknr is set to a number, that was calculated according to the new block size - and when you submit that buffer with submit_bh, it is written to the wrong place! Now, prove that there are no more bugs like this. Locking the whole read/write/mmap operations is crude, but at least it can be done without thorough review of all the memory management code. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
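For readers who don't have fs/buffer.c open, the pattern being quoted is, condensed into a sketch (hypothetical helper name; see __block_write_begin() for the real code):

static int stale_buffer_example(struct inode *inode, struct page *page,
                                sector_t block, get_block_t *get_block)
{
        unsigned blocksize = 1 << inode->i_blkbits;     /* the new block size */
        struct buffer_head *bh;

        if (!page_has_buffers(page))
                create_empty_buffers(page, blocksize, 0);
        bh = page_buffers(page);

        /*
         * If the buffers were created before the block size changed,
         * bh->b_size still holds the old size (so the WARN_ON fires),
         * while the block number passed in was computed from the new
         * i_blkbits - a later submit_bh() then hits the wrong sectors.
         */
        WARN_ON(bh->b_size != blocksize);
        return get_block(inode, block, bh, 1);
}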
[PATCH v2] Do a proper locking for mmap and block size change
On Wed, 28 Nov 2012, Al Viro wrote: > On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote: > > No, this is crap. > > > > We don't introduce random hooks like this just because the block layer > > has shit-for-brains and cannot be bothered to do things right. > > > > The fact is, the whole locking in the block layer open routine is > > total and utter crap. It doesn't lock the right thing, even with your > > change *anyway* (or with the change Jens had). Absolutely nothing in > > "mmap_region()" cares at all about the block-size anywhere - it's > > generic, after all - so locking around it is f*cking pointless. There > > is no way in hell that the caller of ->mmap can *ever* care about the > > block size, since it never even looks at it. > > > > Don't do random crap like this. > > > > Why does the code think that mmap matters so much anyway? As you say, > > the mmap itself does *nothing*. It has no impact for the block size. > > ... and here I was, trying to massage a reply into form that would be > at least borderline printable... Anyway, this is certainly the wrong > way to go. We want to catch if the damn thing is mapped and mapping_mapped() > leaves a race? Fine, so let's not use mapping_mapped() at all. Have a > private vm_operations - a copy of generic_file_vm_ops with ->open()/->close() > added to it. Incrementing/decrementing a per-block_device atomic counter, > with increment side done under your rwsem, to make sure that 0->positive > transition doesn't change in critical section of set_blocksize(). And don't > use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that > local copy. This sounds sensible. I'm sending this patch. Mikulas --- Do a proper locking for mmap and block size change For block devices, we must prevent the device from being mapped while block size is being changed. This was implemented by catching the mmap method and taking bd_block_size_semaphore in it. The problem is that the generic_file_mmap method does almost nothing, it only sets vma->vm_ops = &generic_file_vm_ops, all the hard work is done by the caller, mmap_region. Consequently, locking the mmap method is insufficient, to prevent the race condition. The race condition can happen for example this way: process 1: Starts mapping a block device, it goes to mmap_region, calls file->f_op->mmap (blkdev_mmap - that acquires and releases bd_block_size_semaphore), and reschedule happens just after blkdev_mmap returns. process 2: Changes block device size, goes into set_blocksize, acquires percpu_down_write, calls mapping_mapped to test if the device is mapped (it still isn't). Then, it reschedules. process 1: Continues in mmap_region, successfully finishes the mapping. process 2: Continues in set_blocksize, changes the block size while the block device is mapped. (which is incorrect and may result in a crash or misbehavior). To fix this possible race condition, we introduce a counter bd_mmap_count that counts the number of vmas that maps the block device. bd_mmap_count is incremented in blkdev_mmap and in blkdev_vm_open, it is decremented in blkdev_vm_close. We don't allow changing block size while bd_mmap_count is non-zero. 
Signed-off-by: Mikulas Patocka --- fs/block_dev.c | 49 - include/linux/fs.h |3 +++ 2 files changed, 35 insertions(+), 17 deletions(-) Index: linux-3.7-rc7/fs/block_dev.c === --- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 21:13:12.0 +0100 +++ linux-3.7-rc7/fs/block_dev.c2012-11-28 21:33:23.0 +0100 @@ -35,6 +35,7 @@ struct bdev_inode { struct inode vfs_inode; }; +static const struct vm_operations_struct def_blk_vm_ops; static const struct address_space_operations def_blk_aops; static inline struct bdev_inode *BDEV_I(struct inode *inode) @@ -114,16 +115,6 @@ void invalidate_bdev(struct block_device } EXPORT_SYMBOL(invalidate_bdev); -static int is_bdev_mapped(struct block_device *bdev) -{ - int ret_val; - struct address_space *mapping = bdev->bd_inode->i_mapping; - mutex_lock(&mapping->i_mmap_mutex); - ret_val = mapping_mapped(mapping); - mutex_unlock(&mapping->i_mmap_mutex); - return ret_val; -} - int set_blocksize(struct block_device *bdev, int size) { /* Size must be a power of two, and between 512 and PAGE_SIZE */ @@ -136,16 +127,16 @@ int set_blocksize(struct block_device *b /* * If the block size doesn't change, don't take the write lock. -* We check for is_bdev_mapped anyway, for consistent behavior. +* We check for bd_mmap_count anyway, for co
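The diff is cut off above; based on the changelog and Al Viro's suggestion, the new pieces look roughly like this (a sketch - exact names and types, e.g. whether bd_mmap_count is an atomic_t, are assumptions):

static void blkdev_vm_open(struct vm_area_struct *vma)
{
        struct block_device *bdev = I_BDEV(vma->vm_file->f_mapping->host);

        atomic_inc(&bdev->bd_mmap_count);
}

static void blkdev_vm_close(struct vm_area_struct *vma)
{
        struct block_device *bdev = I_BDEV(vma->vm_file->f_mapping->host);

        atomic_dec(&bdev->bd_mmap_count);
}

static const struct vm_operations_struct def_blk_vm_ops = {
        .open           = blkdev_vm_open,
        .close          = blkdev_vm_close,
        .fault          = filemap_fault,
};

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct block_device *bdev = I_BDEV(file->f_mapping->host);

        /*
         * The first increment happens under the semaphore, so a 0 -> 1
         * transition cannot race with set_blocksize(), which refuses to
         * change the block size while bd_mmap_count is non-zero.
         */
        percpu_down_read(&bdev->bd_block_size_semaphore);
        atomic_inc(&bdev->bd_mmap_count);
        file_accessed(file);
        vma->vm_ops = &def_blk_vm_ops;
        percpu_up_read(&bdev->bd_block_size_semaphore);

        return 0;
}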
Re: [PATCH 2/2] block_dev: don't take the write lock if block size doesn't change
On Wed, 28 Nov 2012, Jeff Chua wrote: > On Wed, Nov 28, 2012 at 12:01 PM, Mikulas Patocka wrote: > > block_dev: don't take the write lock if block size doesn't change > > > > Taking the write lock has a big performance impact on the whole system > > (because of synchronize_sched_expedited). This patch avoids taking the > > write lock if the block size doesn't change (i.e. when mounting > > filesystem with block size equal to the default block size). > > > > The logic to test if the block device is mapped was moved to a separate > > function is_bdev_mapped to avoid code duplication. > > > > Signed-off-by: Mikulas Patocka > > > > --- > > fs/block_dev.c | 25 ++--- > > 1 file changed, 18 insertions(+), 7 deletions(-) > > > > Index: linux-3.7-rc7/fs/block_dev.c > > === > > --- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 04:09:01.0 +0100 > > +++ linux-3.7-rc7/fs/block_dev.c2012-11-28 04:13:53.0 +0100 > > @@ -114,10 +114,18 @@ void invalidate_bdev(struct block_device > > } > > EXPORT_SYMBOL(invalidate_bdev); > > > > -int set_blocksize(struct block_device *bdev, int size) > > +static int is_bdev_mapped(struct block_device *bdev) > > { > > - struct address_space *mapping; > > + int ret_val; > > + struct address_space *mapping = bdev->bd_inode->i_mapping; > > + mutex_lock(&mapping->i_mmap_mutex); > > + ret_val = mapping_mapped(mapping); > > + mutex_unlock(&mapping->i_mmap_mutex); > > + return ret_val; > > +} > > > > +int set_blocksize(struct block_device *bdev, int size) > > +{ > > /* Size must be a power of two, and between 512 and PAGE_SIZE */ > > if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size)) > > return -EINVAL; > > @@ -126,18 +134,21 @@ int set_blocksize(struct block_device *b > > if (size < bdev_logical_block_size(bdev)) > > return -EINVAL; > > > > + /* > > +* If the block size doesn't change, don't take the write lock. > > +* We check for is_bdev_mapped anyway, for consistent behavior. > > +*/ > > + if (size == bdev->bd_block_size) > > + return is_bdev_mapped(bdev) ? -EBUSY : 0; > > + > > /* Prevent starting I/O or mapping the device */ > > percpu_down_write(&bdev->bd_block_size_semaphore); > > > > /* Check that the block device is not memory mapped */ > > - mapping = bdev->bd_inode->i_mapping; > > - mutex_lock(&mapping->i_mmap_mutex); > > - if (mapping_mapped(mapping)) { > > - mutex_unlock(&mapping->i_mmap_mutex); > > + if (is_bdev_mapped(bdev)) { > > percpu_up_write(&bdev->bd_block_size_semaphore); > > return -EBUSY; > > } > > - mutex_unlock(&mapping->i_mmap_mutex); > > > > /* Don't change the size if it is same as current */ > > if (bdev->bd_block_size != size) { > > > This patch didn't really make any difference for ext2/3/4 but for > reiserfs it does. > > With the synchronize_sched_expedited() patch applied, it didn't make > any difference. > > Thanks, > Jeff It could make difference for computers with many cores - synchronize_sched_expedited() is expensive there because it synchronizes all active cores, so if we can avoid it, just do it. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote: > On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds > wrote: > > > > Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably > > do unspeakable things to your family and pets. > > Btw, *if* this approach works, I suspect we could just switch the > bd_block_size_semaphore semaphore to be a regular rw-sem. > > Why? Because now it's no longer ever gotten in the cached IO paths, we > only get it when we're doing much more expensive things (ie actual IO, > and buffer head allocations etc etc). As long as we just work with the > page cache, we never get to the whole lock at all. > > Which means that the whole percpu-optimized thing is likely no longer > all that relevant. Using normal semaphore degrades direct-IO performance on raw devices, I measured that (45.1s with normal rw-semaphore vs. 42.8s with percpu-rw-semaphore). It could be measured on ramdisk or high-performance SSDs. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote:

> > For example, __block_write_full_page and __block_write_begin do
> > if (!page_has_buffers(page)) { create_empty_buffers... }
> > and then they do
> > WARN_ON(bh->b_size != blocksize)
> > err = get_block(inode, block, bh, 1)
>
> Right. And none of this is new.
>
> > ... so if the buffers were left over from some previous call to
> > create_empty_buffers with a different blocksize, that WARN_ON is trigged.
>
> None of this can happen.

It can happen. Take your patch (the one that moves bd_block_size_semaphore
into blkdev_readpage, blkdev_writepage and blkdev_write_begin). Insert
msleep(1000) into set_blocksize, just before sync_blockdev. Run this program:

#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

static char array[4096];

int main(void)
{
        int h;

        system("dmsetup remove test 2>/dev/null");
        if (system("dmsetup create test --table '0 1 zero'")) exit(1);
        h = open("/dev/mapper/test", O_RDWR);
        if (h < 0) perror("open"), exit(1);
        if (pread(h, array, 512, 0) != 512) perror("pread"), exit(1);
        if (system("dmsetup load test --table '0 8 zero'")) exit(1);
        if (system("dmsetup suspend test")) exit(1);
        if (system("dmsetup resume test")) exit(1);
        if (system("blockdev --setbsz 2048 /dev/mapper/test &")) exit(1);
        usleep(50);
        if (pwrite(h, array, 4096, 0) != 4096) perror("pwrite"), exit(1);
        return 0;
}

---

it triggers
WARNING: at fs/buffer.c:1830 in __block_write_begin
[ 1243.30] Backtrace:
[ 1243.33]  [<40230ba8>] block_write_begin+0x70/0xd0
[ 1243.40]  [<402350cc>] blkdev_write_begin+0xb4/0x208
[ 1243.48]  [<401a9f10>] generic_file_buffered_write+0x248/0x348
[ 1243.57]  [<401ac8c4>] __generic_file_aio_write+0x1fc/0x388
[ 1243.66]  [<40235e74>] blkdev_aio_write+0x64/0xf0
[ 1243.74]  [<401f2108>] do_sync_write+0xd0/0x128
[ 1243.81]  [<401f2930>] vfs_write+0xa0/0x180
[ 1243.88]  [<401f2ecc>] sys_pwrite64+0xb4/0xd8
[ 1243.95]  [<40122104>] parisc_pwrite64+0x1c/0x28
[ 1244.02]  [<40106060>] syscall_exit+0x0/0x14

I'm not saying that your approach is wrong, you just have to carefully review
all memory management code for assumptions that the block size doesn't change.

Locking the whole read/write/mmap operations is crude, but at least it can be
done without a thorough review of all the memory management code.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote:

> A bigger issue is for things that emulate what blkdev.c does, and
> doesn't do the locking. I see code in md/bitmap.c that seems a bit
> suspicious, for example. That said, it's not *new* breakage, and the
> "lock at mmap/read/write() time" approach doesn't fix it either (since
> the mapping will be different for the underlying MD device). So I do
> think that we should take a look at all the users of
> "alloc_page_buffers()" and "create_empty_buffers()" to see what *they*
> do to protect the block-size, but I think that's an independent issue
> from the raw device access case in fs/block_dev.c..
>
> I guess I have to actually test my patch. I don't have very
> interesting test-cases, though.
>
> Linus

Yes, md looks suspicious. It disables write access in
deny_bitmap_write_access (that function looks buggy on its own - what happens
if i_writecount == 0, or if it is already negative on entry?).

So we could disallow changing the block size if i_writecount < 0; that should
do the trick.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
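A sketch of the check suggested above (hypothetical, not a submitted patch): md's deny_bitmap_write_access() marks the inode by driving i_writecount negative, so set_blocksize() could simply refuse to change the block size in that case:

static int bdev_write_access_denied(struct block_device *bdev)
{
        /* md's deny_bitmap_write_access() leaves i_writecount negative */
        return atomic_read(&bdev->bd_inode->i_writecount) < 0;
}

set_blocksize() would then return -EBUSY when this helper reports true, alongside its existing -EBUSY checks for mapped devices.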
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, 29 Nov 2012, Linus Torvalds wrote: > On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka wrote: > > > > This sounds sensible. I'm sending this patch. > > This looks much better. > > I think I'll apply this for 3.7 (since it's too late to do anything > fancier), and then for 3.8 I will rip out all the locking entirely, > because looking at the fs/buffer.c patch I wrote up, it's all totally > unnecessary. > > Adding a ACCESS_ONCE() to the read of the i_blkbits value (when > creating new buffers) simply makes the whole locking thing pointless. > Just make the page lock protect the block size, and make it per-page, > and we're done. > > No RCU grace period crap, no expedited mess, no nothing. > > Linus Yes. If you remove that percpu rw lock, you also need to rewrite direct i/o code. In theory, block device direct i/o doesn't need buffer block size at all. But in practice, it shares a lot of code with filesystem direct i/o, it reads the block size multiple times and it crashes if it changes. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
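The kind of problem being referred to, in miniature (hypothetical code, not a quote of fs/direct-io.c): once the rw-semaphore is gone, nothing keeps two reads of i_blkbits consistent.

static void blkbits_double_read(struct inode *inode, loff_t pos,
                                sector_t *block, unsigned *offset)
{
        unsigned blkbits = inode->i_blkbits;            /* 1st read */

        /* set_blocksize() may change i_blkbits right here */

        *block  = pos >> inode->i_blkbits;              /* 2nd read */
        *offset = pos & ((1 << blkbits) - 1);           /* uses the 1st read */

        /*
         * If the two reads disagree, 'block' and 'offset' describe two
         * different block sizes.  Snapshotting the value once, e.g. with
         * ACCESS_ONCE(inode->i_blkbits), and deriving everything from
         * that snapshot is what Linus's suggestion amounts to.
         */
}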
Re: [PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited
On Thu, 29 Nov 2012, Andrew Morton wrote: > On Tue, 27 Nov 2012 22:59:52 -0500 (EST) > Mikulas Patocka wrote: > > > percpu-rwsem: use synchronize_sched_expedited > > > > Use synchronize_sched_expedited() instead of synchronize_sched() > > to improve mount speed. > > > > This patch improves mount time from 0.500s to 0.013s. > > > > Note: if realtime people complain about the use > > synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest > > that they introduce an option CONFIG_REALTIME or > > /proc/sys/kernel/realtime and turn off these *_expedited functions if > > the option is enabled (i.e. turn synchronize_sched_expedited into > > synchronize_sched and synchronize_rcu_expedited into synchronize_rcu). > > > > Signed-off-by: Mikulas Patocka > > So I read through this thread but I really didn't see a clear > description of why mount() got slower. The changelog for 4b05a1c74d1 > is spectacularly awful :( > > > Apparently the slowdown occurred because a blockdev mount patch > 62ac665ff9fc07497ca524 ("blockdev: turn a rw semaphore into a percpu rw > semaphore") newly uses percpu rwsems, and percpu rwsems are slow on the > down_write() path. > > And using synchronize_sched_expedited() rather than synchronize_sched() > makes percpu_down_write() somewhat less slow. Correct? Yes. > Why is it OK to use synchronize_sched_expedited() here? If it's > faster, why can't we use synchronize_sched_expedited() everywhere and > zap synchronize_sched()? Because synchronize_sched_expedited sends interrupts to all processors and it is bad for realtime workloads. Peter Zijlstra once complained when I used synchronize_rcu_expedited in bdi_remove_from_list (but he left it there). I suggest that if it really hurts real time response for someone, let they introduce a switch to turn it into non-expedited call. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
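A sketch of the opt-out suggested above (entirely hypothetical - no such option or sysctl exists in the tree): realtime users could trade a slower mount for not receiving an IPI on every CPU.

extern int sysctl_realtime;     /* hypothetical /proc/sys/kernel/realtime */

static void percpu_rwsem_sync(void)
{
        if (sysctl_realtime)
                synchronize_sched();            /* no IPIs, but can take many ms */
        else
                synchronize_sched_expedited();  /* fast, but interrupts every CPU */
}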
Re: [dm-devel] [PATCH 0/3 v3] add resync speed control for dm-raid1
On Wed, 23 Jan 2013, NeilBrown wrote: > On Tue, 22 Jan 2013 20:44:41 -0500 (EST) Mikulas Patocka > wrote: > > > > > > > On Wed, 16 Jan 2013, Guangliang Zhao wrote: > > > > > On Wed, Jan 09, 2013 at 12:43:21AM -0500, Mikulas Patocka wrote: > > > > Hi > > > Hi, > > > > > > I think it is very good that your patches could be used for other > > > targets(snapshot, thin) after reviewing yours, but I find some issues > > > (maybe not, please correct me if I am wrong). > > > > > > > > > > > I did this already some times ago. > > > > I'm sending my patches in the next mail. > > > > > > > > Basically, my and Guangliang's patches have the following differences: > > > > > > > > my patch: uses per-module throttle settings > > > > Guangliang's patch: uses per-device settings > > > > (my patch could be changed to use per-device throttle too, but without > > > > userspace support it isn't much useful because userspace lvm can > > > > reload the mirror and per-device settings would be lost) > > > > > > We couldn't force every devices in the system hold the same throttle, > > > IMHO, per-device settings couldn't be ignored. > > > Setting the global value by the parameters of module is a good way, and > > > it could also be used to set the default value in my patches. In this way, > > > the global setting wouldn't be lost, and we could also adjust every > > > device's > > > speed. > > > > It could be good to have per-device throttle. > > > > > > my patch: uses fine grained throttling of the individual IOs in kcopyd > > > > - > > > > it measures active/inactive ratio and if the disk is active more than > > > > the > > > > specified percentage of time, sleep is inserted. > > > > > > I think this policy might not be able to represent the exact write speed, > > > while other modules(such as md, drbd) monitor the real IO speed. > > > > But you don't want to limit raid resynchronization to a certain speed. A > > disk has varying speed, it is faster in the beginning and slower in the > > end. > > > > So if you want to limit raid resynchronization so that other tasks have > > faster access to the disk, you need to limit percentage of time that is > > spent on resynchronization, not resynchronization speed. > > Sounds good . not that easy though. > > But if the disk is otherwise idle, I want 100% of the time to be spend on > synchronisation. If it isn't otherwise idle, I want a much more modest > faction to be used. So, hack the i/o scheduler - the scheduler can make such decisions, it is impossible to do this decision in the dm-mirror layer, because the dm-mirror layer has no knowledge of how much is the disk loaded. It is clearly better solution to solve it in the scheduler, this throttle is a simple solution for the case when the i/o scheduler isn't right. > Getting this "right" is very hard. You want to resync aggressively if there > is no other traffic, but to back off quickly to some small background rate if > there is any other traffic. That is what md tries to do. > > dm-raid1 has an extra complication. It is used in clusters (clvm) where > multiple separate hosts might be accessing the device. So the host which is > driving the resync cannot know what other IO there might be. The same problem arises on a single computer - there may be multiple logical volumes or multiple partitions on the same disk. > In that case the only thing that seems to be practical is an maximum sync > speed that can be set by the admin. 
A maximum sync speed doesn't work well because the disk runs at different
speeds at the beginning and at the end. For example, if you set the maximum to
50% of the resync speed measured at the start of resynchronization, the
throttle stops having any effect near the end of the disk, because the disk
itself is 50% slower there.

A better solution is to limit the fraction of time spent doing I/O. If you set
it to 50%, the disk is 50% loaded and 50% idle for the whole duration of the
resynchronization, regardless of the transfer speed in each disk region.

> NeilBrown

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
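A sketch of the time-based throttle being argued for (hypothetical helper, not the actual dm-kcopyd throttle patch): after spending io_jiffies on resync I/O, sleep long enough that the busy fraction stays at roughly 'percent'.

static void resync_throttle(unsigned percent, unsigned long io_jiffies)
{
        unsigned long idle_jiffies;

        if (!percent || percent >= 100 || !io_jiffies)
                return;         /* treat 0 and >= 100 as "unthrottled" here */

        /* busy / (busy + idle) == percent / 100 */
        idle_jiffies = io_jiffies * (100 - percent) / percent;
        if (idle_jiffies)
                schedule_timeout_interruptible(idle_jiffies);
}

Because the sleep is proportional to the time the I/O actually took, a 50% setting keeps the disk half idle at the fast start of the disk and at the slow end alike, which a fixed MB/s cap cannot do.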
Re: [PATCH] Track block device users that created dirty pages
On Mon, 1 Apr 2013, Jeff Moyer wrote:

> Mikulas Patocka writes:
>
> > The new semantics is: if a process did some buffered writes to the block
> > device (with write or mmap), the cache is flushed when the process
> > closes the block device. Processes that didn't do any buffered writes to
> > the device don't cause cache flush. It has these advantages:
> > * processes that don't do buffered writes (such as "lvm") don't flush
> > other process's data.
> > * if the user runs "dd" on a block device, it is actually guaranteed
> > that the data is flushed when "dd" exits.
>
> Why don't applications that want data to go to disk just call fsync
> instead of relying on being the last process to have had the device
> open?
>
> Cheers,
> Jeff

Because the user may forget to specify "conv=fsync" on the dd command line.

Anyway, when using dd to copy partitions, it should either always flush
buffers on exit or never do it. The current state, where dd usually flushes
buffers but occasionally (if it races with lvm or udev) does not, is
confusing. If the admin sees that dd flushes buffers on block devices in all
his trials, he assumes that dd always flushes buffers on block devices. He
doesn't know that there is a tiny race condition that can make dd not flush
the buffers.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
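A sketch of the semantics described above (the flag is a placeholder; the real patch tracks per-opener state differently): only an opener that dirtied the block device's page cache pays for the flush on close.

static void blkdev_flush_on_close(struct file *filp, bool did_buffered_write)
{
        struct block_device *bdev = I_BDEV(filp->f_mapping->host);

        /*
         * 'did_buffered_write' stands in for whatever per-opener state
         * the patch records when write() or a shared mmap dirties the
         * device's page cache.
         */
        if (did_buffered_write)
                sync_blockdev(bdev);
        /* read-only users such as lvm or udev never reach the flush */
}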
Re: [PATCH] A possible deadlock with stacked devices (was: [PATCH v4 08/12] block: Introduce new bio_split())
On Wed, 15 Aug 2012, Kent Overstreet wrote: > > Both md and dm use __GFP_WAIT allocations from mempools in > > generic_make_request. > > > > I think you found an interesting bug here. Suppose that we have three > > stacked devices: d1 depends on d2 and d2 depends on d3. > > > > Now, a bio b1 comes to d1. d1 splits it to two bios: b2.1 and b2.2 and > > sends them to the device d2 - these bios end up in current->bio_list. The > > driver for d2 receives bio b2.1 and sends bio b3.1 to d3. Now, > > current->bio_list contains bios b2.2, b3.1. Now, bio b2.2 is popped off > > the bio list and the driver for d2 is called with b2.2 - suppose that for > > some reason mempool in d2 is exhausted and the driver needs to wait until > > b2.1 finishes. b2.1 never finishes, because b2.1 depends on b3.1 and b3.1 > > is still in current->bio_list. So it deadlocks. > > > > Turning off __GFP_WAIT fixes nothing - it just turns one bug (a possible > > deadlock) into another bug (a possible bio failure with -ENOMEM). > > > > Increasing mempool sizes doesn't fix it either, the bio would just have to > > be split to more pieces in the above example to make it deadlock. > > > > I think the above possible deadlock scenario could be fixed by reversing > > current->bio_list processing - i.e. when some device's make_request_fn > > adds some bios to current->bio_list, these bios are processed before other > > bios that were on the list before. This way, bio list would contain "b3.1, > > b2.2" instead of "b2.2, b3.1" in the above example and the deadlock would > > not happen. > > Your patch isn't sufficient in the case where a bio may be split > multiple times (I'm not sure if it's sufficient in the case where bios > are only split once; trying to work it all out makes my head hurt). > > You don't need multiple stacked drivers to see this; just the case where > a single driver is running that splits multiple times is sufficient, if > you have enough threads submitting at the same time. That is true. dm splits one bio to multiple bios, so it could still deadlock. Mikulas > Bcache works around this with the trick I mentioned previously, where it > masks out _GFP_WAIT if current->bio_list != NULL, and punts to workqueue > if the allocation fails. > > This works but it'd have to be done in each stacking driver... it's not > a generic solution, and it's a pain in the ass. > > I came up with another idea the other day. Conceptually, it inverts my > previous workaround - the punting to workqueue is done in the allocation > code when necessary, for the bios that would be blocked. > > It's lightly tested, gonna rig up some kind of fault injection and test > it more thoroughly later. > > commit d61bbb074cc8f2e089eb57e2bee8e84500f390a8 > Author: Kent Overstreet > Date: Mon Aug 13 18:11:01 2012 -0700 > > block: Avoid deadlocks with bio allocation by stacking drivers > > Previously, if we ever try to allocate more than once from the same bio > set while running under generic_make_request(), we risk deadlock. > > This would happen if e.g. a bio ever needed to be split more than once, > and it's difficult to handle correctly in the drivers - so in practice > it's not. 
> > This patch fixes this issue by allocating a rescuer workqueue for each > bio_set, and punting queued bios to said rescuer when necessary: > > diff --git a/fs/bio.c b/fs/bio.c > index bc4265a..7b4f655 100644 > --- a/fs/bio.c > +++ b/fs/bio.c > @@ -281,6 +281,23 @@ void bio_reset(struct bio *bio) > } > EXPORT_SYMBOL(bio_reset); > > +static void bio_alloc_rescue(struct work_struct *work) > +{ > + struct bio_set *bs = container_of(work, struct bio_set, rescue_work); > + struct bio *bio; > + > + while (1) { > + spin_lock(&bs->rescue_lock); > + bio = bio_list_pop(&bs->rescue_list); > + spin_unlock(&bs->rescue_lock); > + > + if (!bio) > + break; > + > + generic_make_request(bio); > + } > +} > + > /** > * bio_alloc_bioset - allocate a bio for I/O > * @gfp_mask: the GFP_ mask given to the slab allocator > @@ -294,6 +311,7 @@ EXPORT_SYMBOL(bio_reset); > **/ > struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set > *bs) > { > + gfp_t saved_gfp = gfp_mask; > unsigned front_pad; > unsigned inline_vecs; > unsigned long idx = BIO_POOL_NONE; > @@ -308,16 +326,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int > nr_iovecs, struct bio_set *bs) > p = kmalloc(sizeof(struct bio) + > nr_iovecs * sizeof(struct bio_vec), > gfp_mask); > + > front_pad = 0; > inline_vecs = nr_iovecs; > } else { > - p = mempool_alloc(bs->bio_pool, gfp_mask); > + /* > + * If we're running under generic_make_request() > + * (current->bio_list !
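The mechanism behind the deadlock both messages describe is the recursion-to-iteration logic in generic_make_request(), condensed here (not the full function in block/blk-core.c):

void generic_make_request_condensed(struct bio *bio)
{
        struct bio_list bio_list_on_stack;

        if (current->bio_list) {
                /*
                 * A make_request_fn is already running in this task:
                 * the bio is only queued and will not be processed
                 * until that make_request_fn returns.  Sleeping on its
                 * completion from inside the driver therefore deadlocks.
                 */
                bio_list_add(current->bio_list, bio);
                return;
        }

        bio_list_init(&bio_list_on_stack);
        current->bio_list = &bio_list_on_stack;
        do {
                struct request_queue *q = bdev_get_queue(bio->bi_bdev);

                q->make_request_fn(q, bio);
                bio = bio_list_pop(current->bio_list);
        } while (bio);
        current->bio_list = NULL;
}

Kent's rescuer workqueue breaks the cycle by handing the bios stuck on current->bio_list to another thread, whose own bio_list is NULL, so they really do get submitted and can complete.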
Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers
Hi This fixes the bio allocation problems, but doesn't fix a similar deadlock in device mapper when allocating from md->io_pool or other mempools in the target driver. The problem is that majority of device mapper code assumes that if we submit a bio, that bio will be finished in a finite time. The commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption. I suggest - instead of writing workarounds for this current->bio_list misbehavior, why not remove current->bio_list at all? We could revert d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue, test stack usage in generic_make_request, and if it is too high (more than half of the stack used, or so), put the bio to the target device's blockqueue. That could be simpler than allocating per-bioset workqueue and it also solves more problems (possible deadlocks in dm). Mikulas On Tue, 28 Aug 2012, Kent Overstreet wrote: > Previously, if we ever try to allocate more than once from the same bio > set while running under generic_make_request() (i.e. a stacking block > driver), we risk deadlock. > > This is because of the code in generic_make_request() that converts > recursion to iteration; any bios we submit won't actually be submitted > (so they can complete and eventually be freed) until after we return - > this means if we allocate a second bio, we're blocking the first one > from ever being freed. > > Thus if enough threads call into a stacking block driver at the same > time with bios that need multiple splits, and the bio_set's reserve gets > used up, we deadlock. > > This can be worked around in the driver code - we could check if we're > running under generic_make_request(), then mask out __GFP_WAIT when we > go to allocate a bio, and if the allocation fails punt to workqueue and > retry the allocation. > > But this is tricky and not a generic solution. This patch solves it for > all users by inverting the previously described technique. We allocate a > rescuer workqueue for each bio_set, and then in the allocation code if > there are bios on current->bio_list we would be blocking, we punt them > to the rescuer workqueue to be submitted. > > Tested it by forcing the rescue codepath to be taken (by disabling the > first GFP_NOWAIT) attempt, and then ran it with bcache (which does a lot > of arbitrary bio splitting) and verified that the rescuer was being > invoked. 
> > Signed-off-by: Kent Overstreet > CC: Jens Axboe > --- > fs/bio.c| 73 > ++--- > include/linux/bio.h | 9 +++ > 2 files changed, 79 insertions(+), 3 deletions(-) > > diff --git a/fs/bio.c b/fs/bio.c > index 31e637a..5d46318 100644 > --- a/fs/bio.c > +++ b/fs/bio.c > @@ -285,6 +285,23 @@ void bio_reset(struct bio *bio) > } > EXPORT_SYMBOL(bio_reset); > > +static void bio_alloc_rescue(struct work_struct *work) > +{ > + struct bio_set *bs = container_of(work, struct bio_set, rescue_work); > + struct bio *bio; > + > + while (1) { > + spin_lock(&bs->rescue_lock); > + bio = bio_list_pop(&bs->rescue_list); > + spin_unlock(&bs->rescue_lock); > + > + if (!bio) > + break; > + > + generic_make_request(bio); > + } > +} > + > /** > * bio_alloc_bioset - allocate a bio for I/O > * @gfp_mask: the GFP_ mask given to the slab allocator > @@ -307,6 +324,7 @@ EXPORT_SYMBOL(bio_reset); > */ > struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set > *bs) > { > + gfp_t saved_gfp = gfp_mask; > unsigned front_pad; > unsigned inline_vecs; > unsigned long idx = BIO_POOL_NONE; > @@ -324,13 +342,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int > nr_iovecs, struct bio_set *bs) > front_pad = 0; > inline_vecs = nr_iovecs; > } else { > + /* > + * generic_make_request() converts recursion to iteration; this > + * means if we're running beneath it, any bios we allocate and > + * submit will not be submitted (and thus freed) until after we > + * return. > + * > + * This exposes us to a potential deadlock if we allocate > + * multiple bios from the same bio_set() while running > + * underneath generic_make_request(). If we were to allocate > + * multiple bios (say a stacking block driver that was splitting > + * bios), we would deadlock if we exhausted the mempool's > + * reserve. > + * > + * We solve this, and guarantee forward progress, with a rescuer > + * workqueue per bio_set. If we go to allocate and there are > + * bios on current->bio_list, we first try the allocation > + * without __GFP_WAIT; if that fails, we punt those bios we > + * would be blocking to the resc
Re: [PATCH 5/8] hpfs: drop lock/unlock super
It looks ok. Mikulas On Thu, 30 Aug 2012, Marco Stornelli wrote: > Removed lock/unlock super. > > Signed-off-by: Marco Stornelli > --- > fs/hpfs/super.c |3 --- > 1 files changed, 0 insertions(+), 3 deletions(-) > > diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c > index 706a12c..8af2cdc 100644 > --- a/fs/hpfs/super.c > +++ b/fs/hpfs/super.c > @@ -389,7 +389,6 @@ static int hpfs_remount_fs(struct super_block *s, int > *flags, char *data) > *flags |= MS_NOATIME; > > hpfs_lock(s); > - lock_super(s); > uid = sbi->sb_uid; gid = sbi->sb_gid; > umask = 0777 & ~sbi->sb_mode; > lowercase = sbi->sb_lowercase; > @@ -422,12 +421,10 @@ static int hpfs_remount_fs(struct super_block *s, int > *flags, char *data) > > replace_mount_options(s, new_opts); > > - unlock_super(s); > hpfs_unlock(s); > return 0; > > out_err: > - unlock_super(s); > hpfs_unlock(s); > kfree(new_opts); > return -EINVAL; > -- > 1.7.3.4 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
Hi

This is a series of patches to prevent a crash when someone reads a block
device and the block size is changed at the same time. (The crash is already
happening in the production environment.)

The first patch adds a rw-lock to struct block_device, but doesn't use the
lock anywhere. The reason I submit this as a separate patch is that, on my
computer, adding an unused field to this structure affects performance much
more than any of the locking changes.

The second patch uses the rw-lock. The lock is taken for read when doing I/O
on the block device and taken for write when changing the block size.
Consequently, the block size can't be changed while I/O is being submitted.

The third patch converts the rw-lock to a percpu rw-lock for better
performance, to avoid cache line bouncing.

The fourth patch is an alternate percpu rw-lock implementation using RCU, by
Eric Dumazet. It avoids any atomic instruction in the hot path.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/4] Add a lock that will be needed by the next patch
Add a lock that will be needed by the next patch. I am submitting this as a separate patch, because the act of adding an extra field to "struct block_device" affects performance significantly. So that performance change from adding the lock field can be differentiated from the performance change of locking. Signed-off-by: Mikulas Patocka --- drivers/char/raw.c |2 - fs/block_dev.c | 60 +++-- include/linux/fs.h |4 +++ 3 files changed, 63 insertions(+), 3 deletions(-) Index: linux-3.5-rc6-devel/include/linux/fs.h === --- linux-3.5-rc6-devel.orig/include/linux/fs.h 2012-07-16 20:20:12.0 +0200 +++ linux-3.5-rc6-devel/include/linux/fs.h 2012-07-16 01:29:21.0 +0200 @@ -713,6 +713,8 @@ struct block_device { int bd_fsfreeze_count; /* Mutex for freeze */ struct mutexbd_fsfreeze_mutex; + /* A semaphore that prevents I/O while block size is being changed */ + struct rw_semaphore bd_block_size_semaphore; }; /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/4] blockdev: fix a crash when block size is changed and I/O is issued simultaneously
blockdev: fix a crash when block size is changed and I/O is issued simultaneously The kernel may crash when block size is changed and I/O is issued simultaneously. Because some subsystems (udev or lvm) may read any block device anytime, the bug actually puts any code that changes a block device size in jeopardy. The crash can be reproduced if you place "msleep(1000)" to blkdev_get_blocks just before "bh->b_size = max_blocks << inode->i_blkbits;". Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct" While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0" You get a BUG. The direct and non-direct I/O is written with the assumption that block size does not change. It doesn't seem practical to fix these crashes one-by-one there may be many crash possibilities when block size changes at a certain place and it is impossible to find them all and verify the code. This patch introduces a new rw-lock bd_block_size_semaphore. The lock is taken for read during I/O. It is taken for write when changing block size. Consequently, block size can't be changed while I/O is being submitted. For asynchronous I/O, the patch only prevents block size change while the I/O is being submitted. The block size can change when the I/O is in progress or when the I/O is being finished. This is acceptable because there are no accesses to block size when asynchronous I/O is being finished. The patch prevents block size changing while the device is mapped with mmap. Signed-off-by: Mikulas Patocka --- drivers/char/raw.c |2 - fs/block_dev.c | 60 +++-- include/linux/fs.h |2 + 3 files changed, 61 insertions(+), 3 deletions(-) Index: linux-3.5-fast/fs/block_dev.c === --- linux-3.5-fast.orig/fs/block_dev.c 2012-08-02 23:05:30.0 +0200 +++ linux-3.5-fast/fs/block_dev.c 2012-08-02 23:05:39.0 +0200 @@ -116,6 +116,8 @@ EXPORT_SYMBOL(invalidate_bdev); int set_blocksize(struct block_device *bdev, int size) { + struct address_space *mapping; + /* Size must be a power of two, and between 512 and PAGE_SIZE */ if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size)) return -EINVAL; @@ -124,6 +126,20 @@ int set_blocksize(struct block_device *b if (size < bdev_logical_block_size(bdev)) return -EINVAL; + /* Prevent starting I/O or mapping the device */ + down_write(&bdev->bd_block_size_semaphore); + + /* Check that the block device is not memory mapped */ + mapping = bdev->bd_inode->i_mapping; + mutex_lock(&mapping->i_mmap_mutex); + if (!prio_tree_empty(&mapping->i_mmap) || + !list_empty(&mapping->i_mmap_nonlinear)) { + mutex_unlock(&mapping->i_mmap_mutex); + up_write(&bdev->bd_block_size_semaphore); + return -EBUSY; + } + mutex_unlock(&mapping->i_mmap_mutex); + /* Don't change the size if it is same as current */ if (bdev->bd_block_size != size) { sync_blockdev(bdev); @@ -131,6 +147,9 @@ int set_blocksize(struct block_device *b bdev->bd_inode->i_blkbits = blksize_bits(size); kill_bdev(bdev); } + + up_write(&bdev->bd_block_size_semaphore); + return 0; } @@ -472,6 +491,7 @@ static void init_once(void *foo) inode_init_once(&ei->vfs_inode); /* Initialize mutex for freeze. 
*/ mutex_init(&bdev->bd_fsfreeze_mutex); + init_rwsem(&bdev->bd_block_size_semaphore); } static inline void __bd_forget(struct inode *inode) @@ -1567,6 +1587,22 @@ static long block_ioctl(struct file *fil return blkdev_ioctl(bdev, mode, cmd, arg); } +ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) +{ + ssize_t ret; + struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); + + down_read(&bdev->bd_block_size_semaphore); + + ret = generic_file_aio_read(iocb, iov, nr_segs, pos); + + up_read(&bdev->bd_block_size_semaphore); + + return ret; +} +EXPORT_SYMBOL_GPL(blkdev_aio_read); + /* * Write data to the block device. Only intended for the block device itself * and the raw driver which basically is a fake block device. @@ -1578,10 +1614,13 @@ ssize_t blkdev_aio_write(struct kiocb *i unsigned long nr_segs, loff_t pos) { struct file *file = iocb->ki_filp; + struct block_device *bdev = I_BDEV(file->f_mapping->host); ssize_t ret; BUG_ON(iocb->ki_pos != pos); + down_read(&bdev->bd_block_size_semaphore); +
[PATCH 3/4] blockdev: turn a rw semaphore into a percpu rw semaphore
blockdev: turn a rw semaphore into a percpu rw semaphore This avoids cache line bouncing when many processes lock the semaphore for read. Partially based on a patch by Jeff Moyer . Signed-off-by: Mikulas Patocka --- fs/block_dev.c | 30 -- include/linux/fs.h |3 ++- 2 files changed, 22 insertions(+), 11 deletions(-) Index: linux-3.5-fast/fs/block_dev.c === --- linux-3.5-fast.orig/fs/block_dev.c 2012-07-28 18:32:10.0 +0200 +++ linux-3.5-fast/fs/block_dev.c 2012-07-28 18:32:12.0 +0200 @@ -127,7 +127,7 @@ int set_blocksize(struct block_device *b return -EINVAL; /* Prevent starting I/O or mapping the device */ - down_write(&bdev->bd_block_size_semaphore); + percpu_down_write(&bdev->bd_block_size_semaphore); /* Check that the block device is not memory mapped */ mapping = bdev->bd_inode->i_mapping; @@ -135,7 +135,7 @@ int set_blocksize(struct block_device *b if (!prio_tree_empty(&mapping->i_mmap) || !list_empty(&mapping->i_mmap_nonlinear)) { mutex_unlock(&mapping->i_mmap_mutex); - up_write(&bdev->bd_block_size_semaphore); + percpu_up_write(&bdev->bd_block_size_semaphore); return -EBUSY; } mutex_unlock(&mapping->i_mmap_mutex); @@ -148,7 +148,7 @@ int set_blocksize(struct block_device *b kill_bdev(bdev); } - up_write(&bdev->bd_block_size_semaphore); + percpu_up_write(&bdev->bd_block_size_semaphore); return 0; } @@ -460,6 +460,12 @@ static struct inode *bdev_alloc_inode(st struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL); if (!ei) return NULL; + + if (unlikely(percpu_init_rwsem(&ei->bdev.bd_block_size_semaphore))) { + kmem_cache_free(bdev_cachep, ei); + return NULL; + } + return &ei->vfs_inode; } @@ -468,6 +474,8 @@ static void bdev_i_callback(struct rcu_h struct inode *inode = container_of(head, struct inode, i_rcu); struct bdev_inode *bdi = BDEV_I(inode); + percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore); + kmem_cache_free(bdev_cachep, bdi); } @@ -491,7 +499,6 @@ static void init_once(void *foo) inode_init_once(&ei->vfs_inode); /* Initialize mutex for freeze. 
*/ mutex_init(&bdev->bd_fsfreeze_mutex); - init_rwsem(&bdev->bd_block_size_semaphore); } static inline void __bd_forget(struct inode *inode) @@ -1592,12 +1599,13 @@ ssize_t blkdev_aio_read(struct kiocb *io { ssize_t ret; struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); + percpu_rwsem_ptr p; - down_read(&bdev->bd_block_size_semaphore); + p = percpu_down_read(&bdev->bd_block_size_semaphore); ret = generic_file_aio_read(iocb, iov, nr_segs, pos); - up_read(&bdev->bd_block_size_semaphore); + percpu_up_read(&bdev->bd_block_size_semaphore, p); return ret; } @@ -1616,10 +1624,11 @@ ssize_t blkdev_aio_write(struct kiocb *i struct file *file = iocb->ki_filp; struct block_device *bdev = I_BDEV(file->f_mapping->host); ssize_t ret; + percpu_rwsem_ptr p; BUG_ON(iocb->ki_pos != pos); - down_read(&bdev->bd_block_size_semaphore); + p = percpu_down_read(&bdev->bd_block_size_semaphore); ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); if (ret > 0 || ret == -EIOCBQUEUED) { @@ -1630,7 +1639,7 @@ ssize_t blkdev_aio_write(struct kiocb *i ret = err; } - up_read(&bdev->bd_block_size_semaphore); + percpu_up_read(&bdev->bd_block_size_semaphore, p); return ret; } @@ -1640,12 +1649,13 @@ int blkdev_mmap(struct file *file, struc { int ret; struct block_device *bdev = I_BDEV(file->f_mapping->host); + percpu_rwsem_ptr p; - down_read(&bdev->bd_block_size_semaphore); + p = percpu_down_read(&bdev->bd_block_size_semaphore); ret = generic_file_mmap(file, vma); - up_read(&bdev->bd_block_size_semaphore); + percpu_up_read(&bdev->bd_block_size_semaphore, p); return ret; } Index: linux-3.5-fast/include/linux/fs.h === --- linux-3.5-fast.orig/include/linux/fs.h 2012-07-28 18:32:10.0 +0200 +++ linux-3.5-fast/include/linux/fs.h 2012-07-28 18:32:12.0 +0200 @@ -10,6 +10,7 @@ #include #include #include +#include /* * It's silly to have NR_OPEN bigger than
[PATCH 4/4] New percpu lock implementation
New percpu lock implementation An alternative percpu lock implementation. The original idea by Eric Dumazet The lock consists of an array of percpu unsigned integers, a boolean variable and a mutex. When we take the lock for read, we enter rcu read section, check for a "locked" variable. If it is false, we increase a percpu counter on the current cpu and exit the rcu section. If "locked" is true, we exit the rcu section, take the mutex and drop it (this waits until a writer finished) and retry. Unlocking for read just decreases percpu variable. Note that we can unlock on a difference cpu than where we locked, in this case the counter underflows. The sum of all percpu counters represents the number of processes that hold the lock for read. When we need to lock for write, we take the mutex, set "locked" variable to true and synchronize rcu. Since RCU has been synchronized, no processes can create new read locks. We wait until the sum of percpu counters is zero - when it is, there are no readers in the critical section. Signed-off-by: Mikulas Patocka --- fs/block_dev.c | 15 ++ include/linux/percpu-rwsem.h | 95 +-- 2 files changed, 54 insertions(+), 56 deletions(-) Index: linux-3.5.2-fast/fs/block_dev.c === --- linux-3.5.2-fast.orig/fs/block_dev.c2012-08-18 01:23:32.0 +0200 +++ linux-3.5.2-fast/fs/block_dev.c 2012-08-18 01:23:32.0 +0200 @@ -1599,13 +1599,12 @@ ssize_t blkdev_aio_read(struct kiocb *io { ssize_t ret; struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); - percpu_rwsem_ptr p; - p = percpu_down_read(&bdev->bd_block_size_semaphore); + percpu_down_read(&bdev->bd_block_size_semaphore); ret = generic_file_aio_read(iocb, iov, nr_segs, pos); - percpu_up_read(&bdev->bd_block_size_semaphore, p); + percpu_up_read(&bdev->bd_block_size_semaphore); return ret; } @@ -1624,11 +1623,10 @@ ssize_t blkdev_aio_write(struct kiocb *i struct file *file = iocb->ki_filp; struct block_device *bdev = I_BDEV(file->f_mapping->host); ssize_t ret; - percpu_rwsem_ptr p; BUG_ON(iocb->ki_pos != pos); - p = percpu_down_read(&bdev->bd_block_size_semaphore); + percpu_down_read(&bdev->bd_block_size_semaphore); ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); if (ret > 0 || ret == -EIOCBQUEUED) { @@ -1639,7 +1637,7 @@ ssize_t blkdev_aio_write(struct kiocb *i ret = err; } - percpu_up_read(&bdev->bd_block_size_semaphore, p); + percpu_up_read(&bdev->bd_block_size_semaphore); return ret; } @@ -1649,13 +1647,12 @@ int blkdev_mmap(struct file *file, struc { int ret; struct block_device *bdev = I_BDEV(file->f_mapping->host); - percpu_rwsem_ptr p; - p = percpu_down_read(&bdev->bd_block_size_semaphore); + percpu_down_read(&bdev->bd_block_size_semaphore); ret = generic_file_mmap(file, vma); - percpu_up_read(&bdev->bd_block_size_semaphore, p); + percpu_up_read(&bdev->bd_block_size_semaphore); return ret; } Index: linux-3.5.2-fast/include/linux/percpu-rwsem.h === --- linux-3.5.2-fast.orig/include/linux/percpu-rwsem.h 2012-08-18 01:23:32.0 +0200 +++ linux-3.5.2-fast/include/linux/percpu-rwsem.h 2012-08-18 01:26:08.0 +0200 @@ -1,77 +1,78 @@ #ifndef _LINUX_PERCPU_RWSEM_H #define _LINUX_PERCPU_RWSEM_H -#include +#include #include - -#ifndef CONFIG_SMP - -#define percpu_rw_semaphorerw_semaphore -#define percpu_rwsem_ptr int -#define percpu_down_read(x)(down_read(x), 0) -#define percpu_up_read(x, y) up_read(x) -#define percpu_down_write down_write -#define percpu_up_writeup_write -#define percpu_init_rwsem(x) (({init_rwsem(x);}), 0) -#define percpu_free_rwsem(x) do { } while (0) - -#else +#include 
+#include struct percpu_rw_semaphore { - struct rw_semaphore __percpu *s; + unsigned __percpu *counters; + bool locked; + struct mutex mtx; }; -typedef struct rw_semaphore *percpu_rwsem_ptr; - -static inline percpu_rwsem_ptr percpu_down_read(struct percpu_rw_semaphore *sem) +static inline void percpu_down_read(struct percpu_rw_semaphore *p) { - struct rw_semaphore *s = __this_cpu_ptr(sem->s); - down_read(s); - return s; + rcu_read_lock(); + if (unlikely(p->locked)) { + rcu_read_unlock(); + mutex_lock(&p->mtx); + this_cpu_inc(*p->counters); + mutex_unlock(&p->
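The hunk above is cut off in this archive. As a reading aid, here is a compact sketch of how the four operations described in the changelog fit together, reconstructed from that description and the visible start of the patch; __percpu_count() is assumed to be a helper that sums the per-CPU counters, the msleep() polling follows the changelog wording, and the memory-ordering details (the light/heavy barrier pairing discussed later in this thread) are left out for brevity:

struct percpu_rw_semaphore {
	unsigned __percpu *counters;
	bool locked;
	struct mutex mtx;
};

/* assumed helper: the sum of all per-CPU counters = number of read holders */
static inline unsigned __percpu_count(unsigned __percpu *counters)
{
	unsigned total = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
	return total;
}

static inline void percpu_down_read(struct percpu_rw_semaphore *p)
{
	rcu_read_lock();
	if (unlikely(p->locked)) {
		rcu_read_unlock();
		/* a writer holds the mutex; blocking on it waits the writer out */
		mutex_lock(&p->mtx);
		this_cpu_inc(*p->counters);
		mutex_unlock(&p->mtx);
		return;
	}
	this_cpu_inc(*p->counters);
	rcu_read_unlock();
}

static inline void percpu_up_read(struct percpu_rw_semaphore *p)
{
	/* may run on a different CPU than the down_read; only the sum matters */
	this_cpu_dec(*p->counters);
}

static inline void percpu_down_write(struct percpu_rw_semaphore *p)
{
	mutex_lock(&p->mtx);
	p->locked = true;
	synchronize_rcu();	/* after this, no reader can pass the "locked" check */
	while (__percpu_count(p->counters))
		msleep(1);	/* wait for readers already inside the section */
}

static inline void percpu_up_write(struct percpu_rw_semaphore *p)
{
	p->locked = false;
	mutex_unlock(&p->mtx);
}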
Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
On Fri, 31 Aug 2012, Mikulas Patocka wrote:

> Hi
>
> This is a series of patches to prevent a crash when when someone is
> reading block device and block size is changed simultaneously. (the crash
> is already happening in the production environment)
>
> The first patch adds a rw-lock to struct block_device, but doesn't use the
> lock anywhere. The reason why I submit this as a separate patch is that on
> my computer adding an unused field to this structure affects performance
> much more than any locking changes.
>
> The second patch uses the rw-lock. The lock is locked for read when doing
> I/O on the block device and it is locked for write when changing block
> size.
>
> The third patch converts the rw-lock to a percpu rw-lock for better
> performance, to avoid cache line bouncing.
>
> The fourth patch is an alternate percpu rw-lock implementation using RCU
> by Eric Dumazet. It avoids any atomic instruction in the hot path.
>
> Mikulas

I tested the performance of the patches. I created a 4GB ramdisk and initially filled it with zeros (so that ramdisk allocation-on-demand doesn't affect the results).

I ran fio to perform 8 concurrent accesses on an 8 core machine (two Barcelona Opterons):

time fio --rw=randrw --size=4G --bs=512 --filename=/dev/ram0 --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 --name=job8

The results actually show that the size of struct block_device and the alignment of subsequent fields in struct inode have far more effect on the result than the type of locking used. (struct inode is placed just after struct block_device in "struct bdev_inode" in fs/block_dev.c.)

plain kernel 3.5.3:	57.9s
patch 1:		43.4s
patches 1,2:		43.7s
patches 1,2,3:		38.5s
patches 1,2,3,4:	58.6s

You can see that patch 1 improves the time by 14.5 seconds, but all that patch 1 does is add an unused structure field. Patch 3 is 4.9 seconds faster than patch 1, although patch 1 does no locking at all and patch 3 does per-cpu locking. So the reason for the speedup is the different size of struct block_device (and, subsequently, the different alignment of struct inode), rather than any locking improvement.

I would be interested if other people did performance testing of the patches too.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
On Fri, 31 Aug 2012, Jeff Moyer wrote: > Mikulas Patocka writes: > > > On Fri, 31 Aug 2012, Mikulas Patocka wrote: > > > >> Hi > >> > >> This is a series of patches to prevent a crash when when someone is > >> reading block device and block size is changed simultaneously. (the crash > >> is already happening in the production environment) > >> > >> The first patch adds a rw-lock to struct block_device, but doesn't use the > >> lock anywhere. The reason why I submit this as a separate patch is that on > >> my computer adding an unused field to this structure affects performance > >> much more than any locking changes. > >> > >> The second patch uses the rw-lock. The lock is locked for read when doing > >> I/O on the block device and it is locked for write when changing block > >> size. > >> > >> The third patch converts the rw-lock to a percpu rw-lock for better > >> performance, to avoid cache line bouncing. > >> > >> The fourth patch is an alternate percpu rw-lock implementation using RCU > >> by Eric Dumazet. It avoids any atomic instruction in the hot path. > >> > >> Mikulas > > > > I tested performance of patches. I created 4GB ramdisk, I initially filled > > it with zeros (so that ramdisk allocation-on-demand doesn't affect the > > results). > > > > I ran fio to perform 8 concurrent accesses on 8 core machine (two > > Barcelona Opterons): > > time fio --rw=randrw --size=4G --bs=512 --filename=/dev/ram0 --direct=1 > > --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 > > --name=job7 --name=job8 > > > > The results actually show that the size of struct block_device and > > alignment of subsequent fields in struct inode have far more effect on > > result that the type of locking used. (struct inode is placed just after > > struct block_device in "struct bdev_inode" in fs/block-dev.c) > > > > plain kernel 3.5.3: 57.9s > > patch 1: 43.4s > > patches 1,2: 43.7s > > patches 1,2,3: 38.5s > > patches 1,2,3,4: 58.6s > > > > You can see that patch 1 improves the time by 14.5 seconds, but all that > > patch 1 does is adding an unused structure field. > > > > Patch 3 is 4.9 seconds faster than patch 1, althogh patch 1 does no > > locking at all and patch 3 does per-cpu locking. So, the reason for the > > speedup is different sizeof of struct block_device (and subsequently, > > different alignment of struct inode), rather than locking improvement. > > How many runs did you do? Did you see much run to run variation? These results come from two runs (which differed by no more than 1s), but I observed the same phenomenon - difference in time due to the size of block_device - many times before when I was doing benchmarking when developing these patches. I actually had to apply something like this to make the results not depend on the size of block_dev. I would be interested if the same difference could be observed on other processors or if it is something specific to AMD K10 architecture. --- fs/block_dev.c |5 - 1 file changed, 4 insertions(+), 1 deletion(-) Index: linux-3.5.3-fast/fs/block_dev.c === --- linux-3.5.3-fast.orig/fs/block_dev.c2012-08-31 22:30:07.0 +0200 +++ linux-3.5.3-fast/fs/block_dev.c 2012-08-31 22:30:43.0 +0200 @@ -31,7 +31,10 @@ #include "internal.h" struct bdev_inode { - struct block_device bdev; + union { + struct block_device bdev; + char pad[0x140]; + }; struct inode vfs_inode; }; > > I would be interested if other people did performance testing of the > > patches too. > > I'll do some testing next week, but don't expect to get to it before > Wednesday. 
>
> Cheers,
> Jeff

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers
On Thu, 30 Aug 2012, Kent Overstreet wrote: > On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote: > > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote: > > > > [..] > > > > Performance aside, punting submission to per device worker in case of > > > > deep > > > > stack usage sounds cleaner solution to me. > > > > > > Agreed, but performance tends to matter in the real world. And either > > > way the tricky bits are going to be confined to a few functions, so I > > > don't think it matters that much. > > > > > > If someone wants to code up the workqueue version and test it, they're > > > more than welcome... > > > > Here is one quick and dirty proof of concept patch. It checks for stack > > depth and if remaining space is less than 20% of stack size, then it > > defers the bio submission to per queue worker. > > I can't think of any correctness issues. I see some stuff that could be > simplified (blk_drain_deferred_bios() is redundant, just make it a > wrapper around blk_deffered_bio_work()). > > Still skeptical about the performance impact, though - frankly, on some > of the hardware I've been running bcache on this would be a visible > performance regression - probably double digit percentages but I'd have > to benchmark it. That kind of of hardware/usage is not normal today, > but I've put a lot of work into performance and I don't want to make > things worse without good reason. > > Have you tested/benchmarked it? > > There's scheduling behaviour, too. We really want the workqueue thread's > cpu time to be charged to the process that submitted the bio. (We could > use a mechanism like that in other places, too... not like this is a new > issue). > > This is going to be a real issue for users that need strong isolation - > for any driver that uses non negligable cpu (i.e. dm crypt), we're > breaking that (not that it wasn't broken already, but this makes it > worse). ... or another possibility - start a timer when something is put to current->bio_list and use that timer to pop entries off current->bio_list and submit them to a workqueue. The timer can be cpu-local so only interrupt masking is required to synchronize against the timer. This would normally run just like the current kernel and in case of deadlock, the timer would kick in and resolve the deadlock. > I could be convinced, but right now I prefer my solution. It fixes bio allocation problem, but not other similar mempool problems in dm and md. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers
On Mon, 3 Sep 2012, Kent Overstreet wrote: > On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote: > > ... or another possibility - start a timer when something is put to > > current->bio_list and use that timer to pop entries off current->bio_list > > and submit them to a workqueue. The timer can be cpu-local so only > > interrupt masking is required to synchronize against the timer. > > > > This would normally run just like the current kernel and in case of > > deadlock, the timer would kick in and resolve the deadlock. > > Ugh. That's a _terrible_ idea. > > Remember the old plugging code? You ever have to debug performance > issues caused by it? Yes, I do remember it (and I fixed one bug that resulted in missed unplug and degraded performance). But currently, deadlocks due to exhausted mempools and bios being stalled in current->bio_list don't happen (or do happen below so rarely that they aren't reported). If we add a timer, it will turn a deadlock into an i/o delay, but it can't make things any worse. BTW. can these new-style timerless plugs introduce deadlocks too? What happens when some bios are indefinitely delayed because their requests are held in a plug and a mempool runs out? > > > I could be convinced, but right now I prefer my solution. > > > > It fixes bio allocation problem, but not other similar mempool problems in > > dm and md. > > I looked a bit more, and actually I think the rest of the problem is > pretty limited in scope - most of those mempool allocations are per > request, not per split. > > I'm willing to put some time into converting dm/md over to bioset's > front_pad. I'm having to learn the code for the immutable biovec work, > anyways. Currently, dm targets allocate request-specific data from target-specific mempool. mempools are in dm-crypt, dm-delay, dm-mirror, dm-snapshot, dm-thin, dm-verity. You can change it to allocate request-specific data with the bio. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2] dm: Use bioset's front_pad for dm_target_io
On Tue, 4 Sep 2012, Kent Overstreet wrote: > On Tue, Sep 04, 2012 at 03:26:19PM -0400, Mikulas Patocka wrote: > > > > > > On Mon, 3 Sep 2012, Kent Overstreet wrote: > > > > > On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote: > > > > ... or another possibility - start a timer when something is put to > > > > current->bio_list and use that timer to pop entries off > > > > current->bio_list > > > > and submit them to a workqueue. The timer can be cpu-local so only > > > > interrupt masking is required to synchronize against the timer. > > > > > > > > This would normally run just like the current kernel and in case of > > > > deadlock, the timer would kick in and resolve the deadlock. > > > > > > Ugh. That's a _terrible_ idea. > > > > > > Remember the old plugging code? You ever have to debug performance > > > issues caused by it? > > > > Yes, I do remember it (and I fixed one bug that resulted in missed unplug > > and degraded performance). > > > > But currently, deadlocks due to exhausted mempools and bios being stalled > > in current->bio_list don't happen (or do happen below so rarely that they > > aren't reported). > > > > If we add a timer, it will turn a deadlock into an i/o delay, but it can't > > make things any worse. > > This is all true. I'm not arguing your solution wouldn't _work_... I'd > try and give some real reasoning for my objections but it'll take me > awhile to figure out how to coherently explain it and I'm very sleep > deprived. > > > Currently, dm targets allocate request-specific data from target-specific > > mempool. mempools are in dm-crypt, dm-delay, dm-mirror, dm-snapshot, > > dm-thin, dm-verity. You can change it to allocate request-specific data > > with the bio. > > I wrote a patch for dm_target_io last night. I think I know an easy way > to go about converting the rest but it'll probably have to wait until > I'm further along with my immutable bvec stuff. > > Completely untested patch below: The patch didn't work (it has random allocation failures and crashes when the device is closed). The patch also contains some unrelated changes. I created this patch that does the same thing. Mikulas --- Use front pad to allocate dm_target_io dm_target_io was previously allocated from a mempool. For each dm_target_io, there is exactly one bio allocated from a bioset. This patch merges these two allocations into one allocating: we create a bioset with front_pad equal to the size of dm_target_io - so that every bio allocated from the bioset has sizeof(struct dm_target_io) bytes before it. We allocate a bio and use the bytes before the bio as dm_target_io. 
This idea was introdiced by Kent Overstreet Signed-off-by: Mikulas Patocka --- drivers/md/dm.c | 96 +++- 1 file changed, 41 insertions(+), 55 deletions(-) Index: linux-3.5.3-fast/drivers/md/dm.c === --- linux-3.5.3-fast.orig/drivers/md/dm.c 2012-09-10 16:06:30.0 +0200 +++ linux-3.5.3-fast/drivers/md/dm.c2012-09-11 19:32:36.0 +0200 @@ -71,6 +71,7 @@ struct dm_target_io { struct dm_io *io; struct dm_target *ti; union map_info info; + struct bio clone; }; /* @@ -464,7 +465,9 @@ static void free_io(struct mapped_device static void free_tio(struct mapped_device *md, struct dm_target_io *tio) { - mempool_free(tio, md->tio_pool); + tio->clone.bi_private = md->bs; + tio->clone.bi_end_io = NULL; + bio_put(&tio->clone); } static struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md, @@ -710,13 +713,7 @@ static void clone_endio(struct bio *bio, } } - /* -* Store md for cleanup instead of tio which is about to get freed. -*/ - bio->bi_private = md->bs; - free_tio(md, tio); - bio_put(bio); dec_pending(io, error); } @@ -1032,12 +1029,12 @@ int dm_set_target_max_io_len(struct dm_t } EXPORT_SYMBOL_GPL(dm_set_target_max_io_len); -static void __map_bio(struct dm_target *ti, struct bio *clone, - struct dm_target_io *tio) +static void __map_bio(struct dm_target *ti, struct dm_target_io *tio) { int r; sector_t sector; struct mapped_device *md; + struct bio *clone = &tio->clone; clone->bi_end_io = clone_endio; clone->bi_private = tio; @@ -1061,12 +1058,6 @@ static void __map_bio(struct dm_target * /* error the io and bail o
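To illustrate the scheme used above, here is a minimal sketch (illustrative only, not the exact dm code; the MIN_IOS name is an assumption): the bioset's front_pad is set to offsetof(struct dm_target_io, clone), so the clone bio sits at the end of its dm_target_io and container_of() recovers the tio from the bio.

struct dm_target_io {
	struct dm_io *io;
	struct dm_target *ti;
	union map_info info;
	struct bio clone;	/* must stay the last member */
};

/*
 * At mapped_device creation time, every bio allocated from md->bs then
 * carries a dm_target_io immediately in front of it:
 *
 *	md->bs = bioset_create(MIN_IOS, offsetof(struct dm_target_io, clone));
 */

/* allocation: one bio_alloc_bioset() yields both the tio and its clone bio */
static struct dm_target_io *alloc_tio(struct mapped_device *md,
				      unsigned nr_iovecs, gfp_t gfp_mask)
{
	struct bio *clone = bio_alloc_bioset(gfp_mask, nr_iovecs, md->bs);

	if (!clone)
		return NULL;
	/* the front pad is not zeroed, so the caller initializes io/ti/info */
	return container_of(clone, struct dm_target_io, clone);
}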
Re: [PATCH 2] dm: Use bioset's front_pad for dm_target_io
On Tue, 11 Sep 2012, Kent Overstreet wrote: > On Tue, Sep 11, 2012 at 03:28:57PM -0400, Mikulas Patocka wrote: > > > > > > On Tue, 4 Sep 2012, Kent Overstreet wrote: > > > > > On Tue, Sep 04, 2012 at 03:26:19PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > On Mon, 3 Sep 2012, Kent Overstreet wrote: > > > > > > > > > On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote: > > > > > > ... or another possibility - start a timer when something is put to > > > > > > current->bio_list and use that timer to pop entries off > > > > > > current->bio_list > > > > > > and submit them to a workqueue. The timer can be cpu-local so only > > > > > > interrupt masking is required to synchronize against the timer. > > > > > > > > > > > > This would normally run just like the current kernel and in case of > > > > > > deadlock, the timer would kick in and resolve the deadlock. > > > > > > > > > > Ugh. That's a _terrible_ idea. > > > > > > > > > > Remember the old plugging code? You ever have to debug performance > > > > > issues caused by it? > > > > > > > > Yes, I do remember it (and I fixed one bug that resulted in missed > > > > unplug > > > > and degraded performance). > > > > > > > > But currently, deadlocks due to exhausted mempools and bios being > > > > stalled > > > > in current->bio_list don't happen (or do happen below so rarely that > > > > they > > > > aren't reported). > > > > > > > > If we add a timer, it will turn a deadlock into an i/o delay, but it > > > > can't > > > > make things any worse. > > > > > > This is all true. I'm not arguing your solution wouldn't _work_... I'd > > > try and give some real reasoning for my objections but it'll take me > > > awhile to figure out how to coherently explain it and I'm very sleep > > > deprived. > > > > > > > Currently, dm targets allocate request-specific data from > > > > target-specific > > > > mempool. mempools are in dm-crypt, dm-delay, dm-mirror, dm-snapshot, > > > > dm-thin, dm-verity. You can change it to allocate request-specific data > > > > with the bio. > > > > > > I wrote a patch for dm_target_io last night. I think I know an easy way > > > to go about converting the rest but it'll probably have to wait until > > > I'm further along with my immutable bvec stuff. > > > > > > Completely untested patch below: > > > > The patch didn't work (it has random allocation failures and crashes when > > the device is closed). The patch also contains some unrelated changes. > > > > I created this patch that does the same thing. > > Cool! Thanks for finishing this. > > I like your approach with rolling the bio allocation into alloc_tio() - > it solves a problem I was having with the front_pad not being zeroed - > but it does prevent the use of bio_clone_bioset(), which is going to > cause minor issues with my immutable bvec work. > > Perhaps bio_alloc_bioset() should just be changed to zero the front_pad, > that'd probably be useful elsewhere anyways. > > You might want to rebase onto Jens' tree, it has my patches that get rid > of bi_destructor and the front_pad conversion for request based dm: > > git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next This is the patch based on this tree. Mikulas --- Use front pad to allocate dm_target_io dm_target_io was previously allocated from a mempool. For each dm_target_io, there is exactly one bio allocated from a bioset. 
This patch merges these two allocations into one allocating: we create a bioset with front_pad equal to the size of dm_target_io - so that every bio allocated from the bioset has sizeof(struct dm_target_io) bytes before it. We allocate a bio and use the bytes before the bio as dm_target_io. This idea was introdiced by Kent Overstreet Signed-off-by: Mikulas Patocka --- drivers/md/dm.c | 80 ++-- 1 file changed, 38 insertions(+), 42 deletions(-) Index: linux-block-copy/drivers/md/dm.c === --- linux-block-copy.orig/drivers/md/dm.c 2012-09-12 21:04:34.0 +0200 +++ linux-block-copy/drivers/md/dm
Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
On Tue, 23 Oct 2012, Oleg Nesterov wrote: > On 10/23, Oleg Nesterov wrote: > > > > Not really the comment, but the question... > > Damn. And another question. > > Mikulas, I am sorry for this (almost) off-topic noise. Let me repeat > just in case that I am not arguing with your patches. > > > > > So write_lock/write_unlock needs to call synchronize_sched() 3 times. > I am wondering if it makes any sense to try to make it a bit heavier > but faster. > > What if we change the reader to use local_irq_disable/enable around > this_cpu_inc/dec (instead of rcu read lock)? I have to admit, I have > no idea how much cli/sti is slower compared to preempt_disable/enable. > > Then the writer can use > > static void mb_ipi(void *arg) > { > smp_mb(); /* unneeded ? */ > } > > static void force_mb_on_each_cpu(void) > { > smp_mb(); > smp_call_function(mb_ipi, NULL, 1); > } > > to a) synchronise with irq_disable and b) to insert the necessary mb's. > > Of course smp_call_function() means more work for each CPU, but > write_lock() should be rare... > > This can also wakeup the idle CPU's, but probably we can do > on_each_cpu_cond(cond_func => !idle_cpu). Perhaps cond_func() can > also return false if rcu_user_enter() was called... > > Actually I was thinking about this from the very beginning, but I do > not feel this looks like a good idea. Still I'd like to ask what do > you think. > > Oleg. I think - if we can avoid local_irq_disable/enable, just avoid it (and use barrier-vs-synchronize_kernel). Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
On Tue, 23 Oct 2012, Paul E. McKenney wrote: > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > > * each CPU is guaranteed to have executed a full memory barrier since > > > > * the end of its last RCU read-side critical section > > > ^^ > > > > > > Ah wait... I misread this comment. > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > read-side critical section." So, for example, RCU-sched need not force > > a CPU that is idle, offline, or (eventually) executing in user mode to > > execute a memory barrier. Fixed this. Or you can write "each CPU that is executing a kernel code is guaranteed to have executed a full memory barrier". It would be consistent with the current implementation and it would make it possible to use barrier()-synchronize_sched() as biased memory barriers. --- In percpu-rwlocks, CPU 1 executes ...make some writes in the critical section... barrier(); this_cpu_dec(*p->counters); and the CPU 2 executes while (__percpu_count(p->counters)) msleep(1); synchronize_sched(); So, when CPU 2 finishes synchronize_sched(), we must make sure that all writes done by CPU 1 are visible to CPU 2. The current implementation fulfills this requirement, you can just add it to the specification so that whoever changes the implementation keeps it. Mikulas > And I should hasten to add that for synchronize_sched(), disabling > preemption (including disabling irqs, further including NMI handlers) > acts as an RCU-sched read-side critical section. (This is in the > comment header for synchronize_sched() up above my addition to it.) > > Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
On Wed, 24 Oct 2012, Paul E. McKenney wrote: > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > > > * Note that this guarantee implies a further memory-ordering > > > > > > guarantee. > > > > > > * On systems with more than one CPU, when synchronize_sched() > > > > > > returns, > > > > > > * each CPU is guaranteed to have executed a full memory barrier > > > > > > since > > > > > > * the end of its last RCU read-side critical section > > > > > ^^ > > > > > > > > > > Ah wait... I misread this comment. > > > > > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > > > read-side critical section." So, for example, RCU-sched need not force > > > > a CPU that is idle, offline, or (eventually) executing in user mode to > > > > execute a memory barrier. Fixed this. > > > > Or you can write "each CPU that is executing a kernel code is guaranteed > > to have executed a full memory barrier". > > Perhaps I could, but it isn't needed, nor is it particularly helpful. > Please see suggestions in preceding email. It is helpful, because if you add this requirement (that already holds for the current implementation), you can drop rcu_read_lock_sched() and rcu_read_unlock_sched() from the following code that you submitted. static inline void percpu_up_read(struct percpu_rw_semaphore *p) { /* * Decrement our count, but protected by RCU-sched so that * the writer can force proper serialization. */ rcu_read_lock_sched(); this_cpu_dec(*p->counters); rcu_read_unlock_sched(); } > > The current implementation fulfills this requirement, you can just add it > > to the specification so that whoever changes the implementation keeps it. > > I will consider doing that if and when someone shows me a situation where > adding that requirement makes things simpler and/or faster. From what I > can see, your example does not do so. > > Thanx, Paul If you do, the above code can be simplified to: { barrier(); this_cpu_dec(*p->counters); } Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
On Wed, 24 Oct 2012, Paul E. McKenney wrote: > On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote: > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > > > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > > > > > > > * Note that this guarantee implies a further memory-ordering > > > > > > > > guarantee. > > > > > > > > * On systems with more than one CPU, when synchronize_sched() > > > > > > > > returns, > > > > > > > > * each CPU is guaranteed to have executed a full memory > > > > > > > > barrier since > > > > > > > > * the end of its last RCU read-side critical section > > > > > > > ^^ > > > > > > > > > > > > > > Ah wait... I misread this comment. > > > > > > > > > > > > And I miswrote it. It should say "since the end of its last > > > > > > RCU-sched > > > > > > read-side critical section." So, for example, RCU-sched need not > > > > > > force > > > > > > a CPU that is idle, offline, or (eventually) executing in user mode > > > > > > to > > > > > > execute a memory barrier. Fixed this. > > > > > > > > Or you can write "each CPU that is executing a kernel code is > > > > guaranteed > > > > to have executed a full memory barrier". > > > > > > Perhaps I could, but it isn't needed, nor is it particularly helpful. > > > Please see suggestions in preceding email. > > > > It is helpful, because if you add this requirement (that already holds for > > the current implementation), you can drop rcu_read_lock_sched() and > > rcu_read_unlock_sched() from the following code that you submitted. > > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > { > > /* > > * Decrement our count, but protected by RCU-sched so that > > * the writer can force proper serialization. > > */ > > rcu_read_lock_sched(); > > this_cpu_dec(*p->counters); > > rcu_read_unlock_sched(); > > } > > > > > > The current implementation fulfills this requirement, you can just add > > > > it > > > > to the specification so that whoever changes the implementation keeps > > > > it. > > > > > > I will consider doing that if and when someone shows me a situation where > > > adding that requirement makes things simpler and/or faster. From what I > > > can see, your example does not do so. > > > > > > Thanx, Paul > > > > If you do, the above code can be simplified to: > > { > > barrier(); > > this_cpu_dec(*p->counters); > > } > > The readers are lightweight enough that you are worried about the overhead > of rcu_read_lock_sched() and rcu_read_unlock_sched()? Really??? > > Thanx, Paul There was no lock in previous kernels, so we should make it as simple as possible. Disabling and reenabling preemption is probably not a big deal, but if don't have to do it, why do it? Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
On Wed, 24 Oct 2012, Paul E. McKenney wrote: > On Wed, Oct 24, 2012 at 04:44:14PM -0400, Mikulas Patocka wrote: > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > > > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > > > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > > > > > > > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > > > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > > > > > > > > > > > * Note that this guarantee implies a further > > > > > > > > > > memory-ordering guarantee. > > > > > > > > > > * On systems with more than one CPU, when > > > > > > > > > > synchronize_sched() returns, > > > > > > > > > > * each CPU is guaranteed to have executed a full memory > > > > > > > > > > barrier since > > > > > > > > > > * the end of its last RCU read-side critical section > > > > > > > > > ^^ > > > > > > > > > > > > > > > > > > Ah wait... I misread this comment. > > > > > > > > > > > > > > > > And I miswrote it. It should say "since the end of its last > > > > > > > > RCU-sched > > > > > > > > read-side critical section." So, for example, RCU-sched need > > > > > > > > not force > > > > > > > > a CPU that is idle, offline, or (eventually) executing in user > > > > > > > > mode to > > > > > > > > execute a memory barrier. Fixed this. > > > > > > > > > > > > Or you can write "each CPU that is executing a kernel code is > > > > > > guaranteed > > > > > > to have executed a full memory barrier". > > > > > > > > > > Perhaps I could, but it isn't needed, nor is it particularly helpful. > > > > > Please see suggestions in preceding email. > > > > > > > > It is helpful, because if you add this requirement (that already holds > > > > for > > > > the current implementation), you can drop rcu_read_lock_sched() and > > > > rcu_read_unlock_sched() from the following code that you submitted. > > > > > > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > > > { > > > > /* > > > > * Decrement our count, but protected by RCU-sched so that > > > > * the writer can force proper serialization. > > > > */ > > > > rcu_read_lock_sched(); > > > > this_cpu_dec(*p->counters); > > > > rcu_read_unlock_sched(); > > > > } > > > > > > > > > > The current implementation fulfills this requirement, you can just > > > > > > add it > > > > > > to the specification so that whoever changes the implementation > > > > > > keeps it. > > > > > > > > > > I will consider doing that if and when someone shows me a situation > > > > > where > > > > > adding that requirement makes things simpler and/or faster. From > > > > > what I > > > > > can see, your example does not do so. > > > > > > > > > > Thanx, Paul > > > > > > > > If you do, the above code can be simplified to: > > > > { > > > > barrier(); > > > > this_cpu_dec(*p->counters); > > > > } > > > > > > The readers are lightweight enough that you are worried about the overhead > > > of rcu_read_lock_sched() and rcu_read_unlock_sched()? Really??? > > > > > > Thanx, Paul &g
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Wed, 24 Oct 2012, Dave Chinner wrote: > On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote: > > > > > > On Fri, 19 Oct 2012, Peter Zijlstra wrote: > > > > > > Yes, I tried this approach - it involves doing LOCK instruction on read > > > > lock, remembering the cpu and doing another LOCK instruction on read > > > > unlock (which will hopefully be on the same CPU, so no cacheline > > > > bouncing > > > > happens in the common case). It was slower than the approach without > > > > any > > > > LOCK instructions (43.3 seconds seconds for the implementation with > > > > per-cpu LOCKed access, 42.7 seconds for this implementation without > > > > atomic > > > > instruction; the benchmark involved doing 512-byte direct-io reads and > > > > writes on a ramdisk with 8 processes on 8-core machine). > > > > > > So why is that a problem? Surely that's already tons better then what > > > you've currently got. > > > > Percpu rw-semaphores do not improve performance at all. I put them there > > to avoid performance regression, not to improve performance. > > > > All Linux kernels have a race condition - when you change block size of a > > block device and you read or write the device at the same time, a crash > > may happen. This bug is there since ever. Recently, this bug started to > > cause major trouble - multiple high profile business sites report crashes > > because of this race condition. > > > > You can fix this race by using a read lock around I/O paths and write lock > > around block size changing, but normal rw semaphore cause cache line > > bouncing when taken for read by multiple processors and I/O performance > > degradation because of it is measurable. > > This doesn't sound like a new problem. Hasn't this global access, > single modifier exclusion problem been solved before in the VFS? > e.g. mnt_want_write()/mnt_make_readonly() > > Cheers, > > Dave. Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() to use percpu rw semaphores and remove the duplicated code. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
On Wed, 24 Oct 2012, Paul E. McKenney wrote: > On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote: > > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched > > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu. > > > > This is an optimization. The RCU-protected region is very small, so > > there will be no latency problems if we disable preempt in this region. > > > > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates > > to preempt_disable / preempt_disable. It is smaller (and supposedly > > faster) than preemptible rcu_read_lock / rcu_read_unlock. > > > > Signed-off-by: Mikulas Patocka > > OK, as promised/threatened, I finally got a chance to take a closer look. > > The light_mb() and heavy_mb() definitions aren't doing much for me, > the code would be cleared with them expanded inline. And while the > approach of pairing barrier() with synchronize_sched() is interesting, > it would be simpler to rely on RCU's properties. The key point is that > if RCU cannot prove that a given RCU-sched read-side critical section > is seen by all CPUs to have started after a given synchronize_sched(), > then that synchronize_sched() must wait for that RCU-sched read-side > critical section to complete. Also note that you can define both light_mb() and heavy_mb() to be smp_mb() and slow down the reader path a bit and speed up the writer path. On architectures with in-order memory access (and thus smp_mb() equals barrier()), it doesn't hurt the reader but helps the writer, for example: #ifdef ARCH_HAS_INORDER_MEMORY_ACCESS #define light_mb() smp_mb() #define heavy_mb() smp_mb() #else #define light_mb() barrier() #define heavy_mb() synchronize_sched() #endif Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
On Thu, 25 Oct 2012, Paul E. McKenney wrote: > On Thu, Oct 25, 2012 at 10:54:11AM -0400, Mikulas Patocka wrote: > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote: > > > > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched > > > > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu. > > > > > > > > This is an optimization. The RCU-protected region is very small, so > > > > there will be no latency problems if we disable preempt in this region. > > > > > > > > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates > > > > to preempt_disable / preempt_disable. It is smaller (and supposedly > > > > faster) than preemptible rcu_read_lock / rcu_read_unlock. > > > > > > > > Signed-off-by: Mikulas Patocka > > > > > > OK, as promised/threatened, I finally got a chance to take a closer look. > > > > > > The light_mb() and heavy_mb() definitions aren't doing much for me, > > > the code would be cleared with them expanded inline. And while the > > > approach of pairing barrier() with synchronize_sched() is interesting, > > > it would be simpler to rely on RCU's properties. The key point is that > > > if RCU cannot prove that a given RCU-sched read-side critical section > > > is seen by all CPUs to have started after a given synchronize_sched(), > > > then that synchronize_sched() must wait for that RCU-sched read-side > > > critical section to complete. > > > > Also note that you can define both light_mb() and heavy_mb() to be > > smp_mb() and slow down the reader path a bit and speed up the writer path. > > > > On architectures with in-order memory access (and thus smp_mb() equals > > barrier()), it doesn't hurt the reader but helps the writer, for example: > > #ifdef ARCH_HAS_INORDER_MEMORY_ACCESS > > #define light_mb() smp_mb() > > #define heavy_mb() smp_mb() > > #else > > #define light_mb() barrier() > > #define heavy_mb() synchronize_sched() > > #endif > > Except that there are no systems running Linux with in-order memory > access. Even x86 and s390 require a barrier instruction for smp_mb() > on SMP=y builds. > > Thanx, Paul PA-RISC is in-order. But it is used very rarely. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 26 Oct 2012, Oleg Nesterov wrote:

> On 10/26, Dave Chinner wrote:
> >
> > On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> > >
> > > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw
> > > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly()
> > > to use percpu rw semaphores and remove the duplicated code.
> >
> > I think you misunderstood my point - that rather than re-inventing
> > the wheel, why didn't you just copy something that is known to
> > work?

I didn't know about it. The code is not reusable, and it doesn't really do locking. And it has two barriers on the read path, while percpu rw semaphores have none.

> I don't understand why do you both think that __mnt_want_write()
> and mnt_make_readonly() provides the same functionality. I looked
> at this code before I started this patch, and unless I completely
> misread it this does very different things. It is not "lock" at all.
>
> Oleg.

mnt_want_write uses a percpu array of counters, just like percpu rw semaphores. The code is different, but it can be changed to use percpu rw semaphores (if we add percpu_down_write_trylock):

__mnt_want_write could call percpu_down_read and check if the mount is readonly (if it is, drop the lock and return -EROFS)

__mnt_drop_write could call percpu_up_read

mnt_make_readonly and sb_prepare_remount_readonly could call percpu_down_write_trylock instead of mnt_get_writers (if they get the write lock, set the mount readonly and drop the write lock)

... and that's it; then you can remove MNT_WRITE_HOLD, the barriers, the spinning and other complexity from fs/namespace.c.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
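To make the suggested conversion concrete, a rough sketch follows. It goes beyond what exists in the tree: percpu_down_write_trylock() does not exist yet, mnt_write_sem is a placeholder field in struct mount, and the cookie-less percpu_down_read()/percpu_up_read() API of the RCU-based implementation is assumed. This is not the real fs/namespace.c code.

/* reader side: taken around every write access to the mount */
int __mnt_want_write(struct vfsmount *m)
{
	struct mount *mnt = real_mount(m);

	percpu_down_read(&mnt->mnt_write_sem);		/* placeholder field */
	if (__mnt_is_readonly(m)) {
		percpu_up_read(&mnt->mnt_write_sem);
		return -EROFS;
	}
	return 0;	/* the read lock is held until __mnt_drop_write() */
}

void __mnt_drop_write(struct vfsmount *m)
{
	percpu_up_read(&real_mount(m)->mnt_write_sem);
}

/* writer side: replaces the MNT_WRITE_HOLD / mnt_get_writers() dance */
static int mnt_make_readonly(struct mount *mnt)
{
	/* hypothetical trylock: fails if any writer currently holds the lock */
	if (!percpu_down_write_trylock(&mnt->mnt_write_sem))
		return -EBUSY;
	mnt->mnt.mnt_flags |= MNT_READONLY;
	percpu_up_write(&mnt->mnt_write_sem);
	return 0;
}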
[PATCH] A possible deadlock with stacked devices (was: [PATCH v4 08/12] block: Introduce new bio_split())
Suppose we have three stacked devices: d1 depends on d2 and d2 depends on d3.

Now, a bio b1 comes to d1. d1 splits it to two bios: b2.1 and b2.2 and sends them to the device d2 - these bios end up in current->bio_list. The driver for d2 receives bio b2.1 and sends bio b3.1 to d3. Now, current->bio_list contains bios b2.2, b3.1. Now, bio b2.2 is popped off the bio list and the driver for d2 is called with b2.2 - suppose that for some reason the mempool in d2 is exhausted and the driver needs to wait until b2.1 finishes.

b2.1 never finishes, because b2.1 depends on b3.1 and b3.1 is still in current->bio_list. So it deadlocks.

Turning off __GFP_WAIT fixes nothing - it just turns one bug (a possible deadlock) into another bug (a possible bio failure with -ENOMEM). Increasing mempool sizes doesn't fix it either; the bio would just have to be split to more pieces in the above example to make it deadlock.

I think the above possible deadlock scenario could be fixed by reversing current->bio_list processing - i.e. when some device's make_request_fn adds some bios to current->bio_list, these bios are processed before other bios that were on the list before. This way, the bio list would contain "b3.1, b2.2" instead of "b2.2, b3.1" in the above example and the deadlock would not happen.

Signed-off-by: Mikulas Patocka

---
 block/blk-core.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Index: linux-3.5-fast/block/blk-core.c
===
--- linux-3.5-fast.orig/block/blk-core.c	2012-07-27 02:20:25.0 +0200
+++ linux-3.5-fast/block/blk-core.c	2012-07-27 02:34:37.0 +0200
@@ -1740,6 +1740,7 @@ end_io:
 void generic_make_request(struct bio *bio)
 {
 	struct bio_list bio_list_on_stack;
+	struct bio_list current_bio_list;
 
 	if (!generic_make_request_checks(bio))
 		return;
@@ -1789,14 +1790,22 @@ void generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
+	bio_list_init(&current_bio_list);
 	current->bio_list = &bio_list_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
+		bio_list_init(&bio_list_on_stack);
+
 		q->make_request_fn(q, bio);
 
-		bio = bio_list_pop(current->bio_list);
+		/*
+		 * To avoid a possible deadlock, bios that were added by
+		 * the most recent make_request_fn must be processed first.
+		 */
+		bio_list_merge_head(&current_bio_list, &bio_list_on_stack);
+
+		bio = bio_list_pop(&current_bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] Fix Crash when IO is being submitted and block size is changed
On Thu, 19 Jul 2012, Jeff Moyer wrote: > Mikulas Patocka writes: > > > On Tue, 17 Jul 2012, Jeff Moyer wrote: > > > > >> > This is the patch that fixes this crash: it takes a rw-semaphore around > >> > all direct-IO path. > >> > > >> > (note that if someone is concerned about performance, the rw-semaphore > >> > could be made per-cpu --- take it for read on the current CPU and take > >> > it > >> > for write on all CPUs). > >> > >> Here we go again. :-) I believe we had at one point tried taking a rw > >> semaphore around GUP inside of the direct I/O code path to fix the fork > >> vs. GUP race (that still exists today). When testing that, the overhead > >> of the semaphore was *way* too high to be considered an acceptable > >> solution. I've CC'd Larry Woodman, Andrea, and Kosaki Motohiro who all > >> worked on that particular bug. Hopefully they can give better > >> quantification of the slowdown than my poor memory. > >> > >> Cheers, > >> Jeff > > > > Both down_read and up_read together take 82 ticks on Core2, 69 ticks on > > AMD K10, 62 ticks on UltraSparc2 if the target is in L1 cache. So, if > > percpu rw_semaphores were used, it would slow down only by this amount. > > Sorry, I'm not familiar with per-cpu rw semaphores. Where are they > implemented? Here I'm resending the upstream patches with per rw-semaphores - percpu rw-semaphores are implemented in the next patch. (For Jeff: you can use your patch for RHEL-6 that you did for perfocmance testing, with the change that I proposed). Mikulas --- blockdev: fix a crash when block size is changed and I/O is issued simultaneously The kernel may crash when block size is changed and I/O is issued simultaneously. Because some subsystems (udev or lvm) may read any block device anytime, the bug actually puts any code that changes a block device size in jeopardy. The crash can be reproduced if you place "msleep(1000)" to blkdev_get_blocks just before "bh->b_size = max_blocks << inode->i_blkbits;". Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct" While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0" You get a BUG. The direct and non-direct I/O is written with the assumption that block size does not change. It doesn't seem practical to fix these crashes one-by-one there may be many crash possibilities when block size changes at a certain place and it is impossible to find them all and verify the code. This patch introduces a new rw-lock bd_block_size_semaphore. The lock is taken for read during I/O. It is taken for write when changing block size. Consequently, block size can't be changed while I/O is being submitted. For asynchronous I/O, the patch only prevents block size change while the I/O is being submitted. The block size can change when the I/O is in progress or when the I/O is being finished. This is acceptable because there are no accesses to block size when asynchronous I/O is being finished. The patch prevents block size changing while the device is mapped with mmap. 
Signed-off-by: Mikulas Patocka --- drivers/char/raw.c |2 - fs/block_dev.c | 60 +++-- include/linux/fs.h |4 +++ 3 files changed, 63 insertions(+), 3 deletions(-) Index: linux-3.5-rc6-devel/include/linux/fs.h === --- linux-3.5-rc6-devel.orig/include/linux/fs.h 2012-07-16 20:20:12.0 +0200 +++ linux-3.5-rc6-devel/include/linux/fs.h 2012-07-16 01:29:21.0 +0200 @@ -713,6 +713,8 @@ struct block_device { int bd_fsfreeze_count; /* Mutex for freeze */ struct mutexbd_fsfreeze_mutex; + /* A semaphore that prevents I/O while block size is being changed */ + struct rw_semaphore bd_block_size_semaphore; }; /* @@ -2414,6 +2416,8 @@ extern int generic_segment_checks(const unsigned long *nr_segs, size_t *count, int access_flags); /* fs/block_dev.c */ +extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos); extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end, Index: linux-3.5-rc6-devel/fs/block_dev.c === --- linux-3.5-rc6-devel.orig/fs/block_dev.c 2012-07-16 20:20:12.0 +0200 +++ linux-3.5-rc6-devel/fs/block_dev.c 2012-07-16 21:47:30.0 +0200 @@ -11
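[Editorial note: the fs/block_dev.c hunk of the patch above is cut off in this archive. Schematically, the locking pattern it introduces is the following - a simplified sketch, not the literal patch; submit_io() is a placeholder for the aio read/write/mmap paths that the real patch wraps.]

#include <linux/fs.h>
#include <linux/rwsem.h>
#include <linux/blkdev.h>

static ssize_t submit_io(struct block_device *bdev);	/* placeholder for the real I/O path */

/* Read side: every read/write/mmap submission holds the semaphore for read. */
static ssize_t sketch_blkdev_io(struct block_device *bdev)
{
	ssize_t ret;

	down_read(&bdev->bd_block_size_semaphore);	/* block size is stable here */
	ret = submit_io(bdev);
	up_read(&bdev->bd_block_size_semaphore);
	return ret;
}

/* Write side: changing the block size excludes all submitters. */
static int sketch_set_blocksize(struct block_device *bdev, int size)
{
	down_write(&bdev->bd_block_size_semaphore);
	/* no reads, writes or mmaps are in flight past this point */
	bdev->bd_block_size = size;
	bdev->bd_inode->i_blkbits = blksize_bits(size);
	up_write(&bdev->bd_block_size_semaphore);
	return 0;
}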
[PATCH 2/3] Introduce percpu rw semaphores
Introduce percpu rw semaphores When many CPUs are locking a rw semaphore for read concurrently, cache line bouncing occurs. When a CPU acquires rw semaphore for read, the CPU writes to the cache line holding the semaphore. Consequently, the cache line is being moved between CPUs and this slows down semaphore acquisition. This patch introduces new percpu rw semaphores. They are functionally identical to existing rw semaphores, but locking the percpu rw semaphore for read is faster and locking for write is slower. The percpu rw semaphore is implemented as a percpu array of rw semaphores, each semaphore for one CPU. When some thread needs to lock the semaphore for read, only semaphore on the current CPU is locked for read. When some thread needs to lock the semaphore for write, semaphores for all CPUs are locked for write. This avoids cache line bouncing. Note that the thread that is locking percpu rw semaphore may be rescheduled, it doesn't cause bug, but cache line bouncing occurs in this case. Signed-off-by: Mikulas Patocka --- include/linux/percpu-rwsem.h | 77 +++ 1 file changed, 77 insertions(+) Index: linux-3.5-fast/include/linux/percpu-rwsem.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-3.5-fast/include/linux/percpu-rwsem.h 2012-07-28 18:41:05.0 +0200 @@ -0,0 +1,77 @@ +#ifndef _LINUX_PERCPU_RWSEM_H +#define _LINUX_PERCPU_RWSEM_H + +#include +#include + +#ifndef CONFIG_SMP + +#define percpu_rw_semaphorerw_semaphore +#define percpu_rwsem_ptr int +#define percpu_down_read(x)(down_read(x), 0) +#define percpu_up_read(x, y) up_read(x) +#define percpu_down_write down_write +#define percpu_up_writeup_write +#define percpu_init_rwsem(x) (({init_rwsem(x);}), 0) +#define percpu_free_rwsem(x) do { } while (0) + +#else + +struct percpu_rw_semaphore { + struct rw_semaphore __percpu *s; +}; + +typedef struct rw_semaphore *percpu_rwsem_ptr; + +static inline percpu_rwsem_ptr percpu_down_read(struct percpu_rw_semaphore *sem) +{ + struct rw_semaphore *s = __this_cpu_ptr(sem->s); + down_read(s); + return s; +} + +static inline void percpu_up_read(struct percpu_rw_semaphore *sem, percpu_rwsem_ptr s) +{ + up_read(s); +} + +static inline void percpu_down_write(struct percpu_rw_semaphore *sem) +{ + int cpu; + for_each_possible_cpu(cpu) { + struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu); + down_write(s); + } +} + +static inline void percpu_up_write(struct percpu_rw_semaphore *sem) +{ + int cpu; + for_each_possible_cpu(cpu) { + struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu); + up_write(s); + } +} + +static inline int percpu_init_rwsem(struct percpu_rw_semaphore *sem) +{ + int cpu; + sem->s = alloc_percpu(struct rw_semaphore); + if (unlikely(!sem->s)) + return -ENOMEM; + for_each_possible_cpu(cpu) { + struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu); + init_rwsem(s); + } + return 0; +} + +static inline void percpu_free_rwsem(struct percpu_rw_semaphore *sem) +{ + free_percpu(sem->s); + sem->s = NULL; /* catch use after free bugs */ +} + +#endif + +#endif -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
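[Editorial note: a minimal usage sketch of the API introduced above. The surrounding init/reader/writer functions and their names are illustrative only; the key point is that the token returned by percpu_down_read() must be handed back to percpu_up_read(), because the thread may have migrated to another CPU in between and must unlock the same underlying rw_semaphore it locked.]

#include <linux/percpu-rwsem.h>

static struct percpu_rw_semaphore example_sem;		/* illustrative name */

static int example_init(void)
{
	return percpu_init_rwsem(&example_sem);
}

static void example_reader(void)
{
	percpu_rwsem_ptr p;

	p = percpu_down_read(&example_sem);	/* locks only this CPU's rwsem */
	/* ... fast-path work that must not race with a writer ... */
	percpu_up_read(&example_sem, p);	/* releases the rwsem actually taken */
}

static void example_writer(void)
{
	percpu_down_write(&example_sem);	/* locks the rwsem of every CPU */
	/* ... slow-path update, e.g. changing the block size ... */
	percpu_up_write(&example_sem);
}

static void example_exit(void)
{
	percpu_free_rwsem(&example_sem);
}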
[PATCH 3/3] blockdev: turn a rw semaphore into a percpu rw semaphore
blockdev: turn a rw semaphore into a percpu rw semaphore This avoids cache line bouncing when many processes lock the semaphore for read. Partially based on a patch by Jeff Moyer . Signed-off-by: Mikulas Patocka --- fs/block_dev.c | 30 -- include/linux/fs.h |3 ++- 2 files changed, 22 insertions(+), 11 deletions(-) Index: linux-3.5-fast/fs/block_dev.c === --- linux-3.5-fast.orig/fs/block_dev.c 2012-07-28 18:32:10.0 +0200 +++ linux-3.5-fast/fs/block_dev.c 2012-07-28 18:32:12.0 +0200 @@ -127,7 +127,7 @@ int set_blocksize(struct block_device *b return -EINVAL; /* Prevent starting I/O or mapping the device */ - down_write(&bdev->bd_block_size_semaphore); + percpu_down_write(&bdev->bd_block_size_semaphore); /* Check that the block device is not memory mapped */ mapping = bdev->bd_inode->i_mapping; @@ -135,7 +135,7 @@ int set_blocksize(struct block_device *b if (!prio_tree_empty(&mapping->i_mmap) || !list_empty(&mapping->i_mmap_nonlinear)) { mutex_unlock(&mapping->i_mmap_mutex); - up_write(&bdev->bd_block_size_semaphore); + percpu_up_write(&bdev->bd_block_size_semaphore); return -EBUSY; } mutex_unlock(&mapping->i_mmap_mutex); @@ -148,7 +148,7 @@ int set_blocksize(struct block_device *b kill_bdev(bdev); } - up_write(&bdev->bd_block_size_semaphore); + percpu_up_write(&bdev->bd_block_size_semaphore); return 0; } @@ -460,6 +460,12 @@ static struct inode *bdev_alloc_inode(st struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL); if (!ei) return NULL; + + if (unlikely(percpu_init_rwsem(&ei->bdev.bd_block_size_semaphore))) { + kmem_cache_free(bdev_cachep, ei); + return NULL; + } + return &ei->vfs_inode; } @@ -468,6 +474,8 @@ static void bdev_i_callback(struct rcu_h struct inode *inode = container_of(head, struct inode, i_rcu); struct bdev_inode *bdi = BDEV_I(inode); + percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore); + kmem_cache_free(bdev_cachep, bdi); } @@ -491,7 +499,6 @@ static void init_once(void *foo) inode_init_once(&ei->vfs_inode); /* Initialize mutex for freeze. 
*/ mutex_init(&bdev->bd_fsfreeze_mutex); - init_rwsem(&bdev->bd_block_size_semaphore); } static inline void __bd_forget(struct inode *inode) @@ -1592,12 +1599,13 @@ ssize_t blkdev_aio_read(struct kiocb *io { ssize_t ret; struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); + percpu_rwsem_ptr p; - down_read(&bdev->bd_block_size_semaphore); + p = percpu_down_read(&bdev->bd_block_size_semaphore); ret = generic_file_aio_read(iocb, iov, nr_segs, pos); - up_read(&bdev->bd_block_size_semaphore); + percpu_up_read(&bdev->bd_block_size_semaphore, p); return ret; } @@ -1616,10 +1624,11 @@ ssize_t blkdev_aio_write(struct kiocb *i struct file *file = iocb->ki_filp; struct block_device *bdev = I_BDEV(file->f_mapping->host); ssize_t ret; + percpu_rwsem_ptr p; BUG_ON(iocb->ki_pos != pos); - down_read(&bdev->bd_block_size_semaphore); + p = percpu_down_read(&bdev->bd_block_size_semaphore); ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); if (ret > 0 || ret == -EIOCBQUEUED) { @@ -1630,7 +1639,7 @@ ssize_t blkdev_aio_write(struct kiocb *i ret = err; } - up_read(&bdev->bd_block_size_semaphore); + percpu_up_read(&bdev->bd_block_size_semaphore, p); return ret; } @@ -1640,12 +1649,13 @@ int blkdev_mmap(struct file *file, struc { int ret; struct block_device *bdev = I_BDEV(file->f_mapping->host); + percpu_rwsem_ptr p; - down_read(&bdev->bd_block_size_semaphore); + p = percpu_down_read(&bdev->bd_block_size_semaphore); ret = generic_file_mmap(file, vma); - up_read(&bdev->bd_block_size_semaphore); + percpu_up_read(&bdev->bd_block_size_semaphore, p); return ret; } Index: linux-3.5-fast/include/linux/fs.h === --- linux-3.5-fast.orig/include/linux/fs.h 2012-07-28 18:32:10.0 +0200 +++ linux-3.5-fast/include/linux/fs.h 2012-07-28 18:32:12.0 +0200 @@ -10,6 +10,7 @@ #include #include #include +#include /* * It's silly to have NR_OPEN bigger than
Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores
On Sat, 28 Jul 2012, Eric Dumazet wrote: > On Sat, 2012-07-28 at 12:41 -0400, Mikulas Patocka wrote: > > Introduce percpu rw semaphores > > > > When many CPUs are locking a rw semaphore for read concurrently, cache > > line bouncing occurs. When a CPU acquires rw semaphore for read, the > > CPU writes to the cache line holding the semaphore. Consequently, the > > cache line is being moved between CPUs and this slows down semaphore > > acquisition. > > > > This patch introduces new percpu rw semaphores. They are functionally > > identical to existing rw semaphores, but locking the percpu rw semaphore > > for read is faster and locking for write is slower. > > > > The percpu rw semaphore is implemented as a percpu array of rw > > semaphores, each semaphore for one CPU. When some thread needs to lock > > the semaphore for read, only semaphore on the current CPU is locked for > > read. When some thread needs to lock the semaphore for write, semaphores > > for all CPUs are locked for write. This avoids cache line bouncing. > > > > Note that the thread that is locking percpu rw semaphore may be > > rescheduled, it doesn't cause bug, but cache line bouncing occurs in > > this case. > > > > Signed-off-by: Mikulas Patocka > > I am curious to see how this performs with 4096 cpus ? Each cpu should have its own rw semaphore in its cache, so I don't see a problem there. When you change block size, all 4096 rw semaphores are locked for write, but changing block size is not a performance sensitive operation. > Really you shouldnt use rwlock in a path if this might hurt performance. > > RCU is probably a better answer. RCU is meaningless here. RCU allows lockless dereference of a pointer. Here the problem is not pointer dereference, the problem is that integer bd_block_size may change. > (bdev->bd_block_size should be read exactly once ) Rewrite all direct and non-direct io code so that it reads block size just once ... Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
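[Editorial note: to make the objection concrete - RCU only guarantees that a single rcu_dereference() observes either the old or the new pointer; it does not make two reads of a plain integer agree with each other. A hypothetical sketch of the pattern that stays broken under RCU (function name invented for illustration):]

#include <linux/fs.h>
#include <linux/rcupdate.h>

/*
 * Illustrative only: even inside rcu_read_lock(), two reads of
 * bdev->bd_block_size may return different values if set_blocksize()
 * runs concurrently - and mixing the two values is what crashes.
 */
static void sketch_submit(struct block_device *bdev)
{
	unsigned size1, size2;

	rcu_read_lock();
	size1 = bdev->bd_block_size;	/* e.g. used to compute a mask */
	/* ... a concurrent set_blocksize() can change the size here ... */
	size2 = bdev->bd_block_size;	/* e.g. used to size a buffer */
	rcu_read_unlock();

	/* size1 == size2 is NOT guaranteed; the I/O paths implicitly assume it */
}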
Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores
On Mon, 30 Jul 2012, Paul E. McKenney wrote: > On Sun, Jul 29, 2012 at 01:13:34AM -0400, Mikulas Patocka wrote: > > On Sat, 28 Jul 2012, Eric Dumazet wrote: > > > On Sat, 2012-07-28 at 12:41 -0400, Mikulas Patocka wrote: > > [ . . . ] > > > > (bdev->bd_block_size should be read exactly once ) > > > > Rewrite all direct and non-direct io code so that it reads block size just > > once ... > > For whatever it is worth, the 3.5 Linux kernel only has about ten mentions > of bd_block_size, at least according to cscope. > > Thanx, Paul plus 213 uses of i_blkbits (which is derived directly from bd_block_size). 45 of them is in fs/*.c and mm/*.c Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] Fix a crash when block device is read and block size is changed at the same time
On Tue, 25 Sep 2012, Jens Axboe wrote: > On 2012-09-25 19:59, Jens Axboe wrote: > > On 2012-09-25 19:49, Jeff Moyer wrote: > >> Jeff Moyer writes: > >> > >>> Mikulas Patocka writes: > >>> > >>>> Hi Jeff > >>>> > >>>> Thanks for testing. > >>>> > >>>> It would be interesting ... what happens if you take the patch 3, leave > >>>> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct > >>>> block_device", but remove any use of the semaphore from fs/block_dev.c? > >>>> - > >>>> will the performance be like unpatched kernel or like patch 3? It could > >>>> be > >>>> that the change in the alignment affects performance on your CPU too, > >>>> just > >>>> differently than on my CPU. > >>> > >>> It turns out to be exactly the same performance as with the 3rd patch > >>> applied, so I guess it does have something to do with cache alignment. > >>> Here is the patch (against vanilla) I ended up testing. Let me know if > >>> I've botched it somehow. > >>> > >>> So, I next up I'll play similar tricks to what you did (padding struct > >>> block_device in all kernels) to eliminate the differences due to > >>> structure alignment and provide a clear picture of what the locking > >>> effects are. > >> > >> After trying again with the same padding you used in the struct > >> bdev_inode, I see no performance differences between any of the > >> patches. I tried bumping up the number of threads to saturate the > >> number of cpus on a single NUMA node on my hardware, but that resulted > >> in lower IOPS to the device, and hence consumption of less CPU time. > >> So, I believe my results to be inconclusive. > >> > >> After talking with Vivek about the problem, he had mentioned that it > >> might be worth investigating whether bd_block_size could be protected > >> using SRCU. I looked into it, and the one thing I couldn't reconcile is > >> updating both the bd_block_size and the inode->i_blkbits at the same > >> time. It would involve (afaiui) adding fields to both the inode and the > >> block_device data structures and using rcu_assign_pointer and > >> rcu_dereference to modify and access the fields, and both fields would > >> need to protected by the same struct srcu_struct. I'm not sure whether > >> that's a desirable approach. When I started to implement it, it got > >> ugly pretty quickly. What do others think? > >> > >> For now, my preference is to get the full patch set in. I will continue > >> to investigate the performance impact of the data structure size changes > >> that I've been seeing. > >> > >> So, for the four patches: > >> > >> Acked-by: Jeff Moyer > >> > >> Jens, can you have a look at the patch set? We are seeing problem > >> reports of this in the wild[1][2]. > > > > I'll queue it up for 3.7. I can run my regular testing on the 8-way, it > > has a nack for showing scaling problems very nicely in aio/dio. As long > > as we're not adding per-inode cache line dirtying per IO (and the > > per-cpu rw sem looks OK), then I don't think there's too much to worry > > about. > > I take that back. The series doesn't apply to my current tree. Not too > unexpected, since it's some weeks old. But more importantly, please send > this is a "real" patch series. I don't want to see two implementations > of rw semaphores. I think it's perfectly fine to first do a regular rw > sem, then a last patch adding the cache friendly variant from Eric and > converting to that. > > In other words, get rid of 3/4. > > -- > Jens Axboe Hi Jens Here I'm resending it as two patches. 
The first one uses existing semaphore, the second converts it to RCU-based percpu semaphore. Mikulas --- blockdev: fix a crash when block size is changed and I/O is issued simultaneously The kernel may crash when block size is changed and I/O is issued simultaneously. Because some subsystems (udev or lvm) may read any block device anytime, the bug actually puts any code that changes a block device size in jeopardy. The crash can be reproduced if you place "msleep(1000)" to blkdev_get_blocks just before "bh->b_size = max_blocks << inode->i_blkbits;". Then, run "dd if=/dev/ram0 of=/dev/nu
[PATCH 2/2] Fix a crash when block device is read and block size is changed at the same time
blockdev: turn a rw semaphore into a percpu rw semaphore This avoids cache line bouncing when many processes lock the semaphore for read. New percpu lock implementation The lock consists of an array of percpu unsigned integers, a boolean variable and a mutex. When we take the lock for read, we enter rcu read section, check for a "locked" variable. If it is false, we increase a percpu counter on the current cpu and exit the rcu section. If "locked" is true, we exit the rcu section, take the mutex and drop it (this waits until a writer finished) and retry. Unlocking for read just decreases percpu variable. Note that we can unlock on a difference cpu than where we locked, in this case the counter underflows. The sum of all percpu counters represents the number of processes that hold the lock for read. When we need to lock for write, we take the mutex, set "locked" variable to true and synchronize rcu. Since RCU has been synchronized, no processes can create new read locks. We wait until the sum of percpu counters is zero - when it is, there are no readers in the critical section. Signed-off-by: Mikulas Patocka --- Documentation/percpu-rw-semaphore.txt | 27 ++ fs/block_dev.c| 27 ++ include/linux/fs.h|3 - include/linux/percpu-rwsem.h | 89 ++ 4 files changed, 135 insertions(+), 11 deletions(-) --- Index: linux-2.6-copy/fs/block_dev.c === --- linux-2.6-copy.orig/fs/block_dev.c 2012-09-26 00:42:49.0 +0200 +++ linux-2.6-copy/fs/block_dev.c 2012-09-26 00:45:29.0 +0200 @@ -127,7 +127,7 @@ int set_blocksize(struct block_device *b return -EINVAL; /* Prevent starting I/O or mapping the device */ - down_write(&bdev->bd_block_size_semaphore); + percpu_down_write(&bdev->bd_block_size_semaphore); /* Check that the block device is not memory mapped */ mapping = bdev->bd_inode->i_mapping; @@ -135,7 +135,7 @@ int set_blocksize(struct block_device *b if (!prio_tree_empty(&mapping->i_mmap) || !list_empty(&mapping->i_mmap_nonlinear)) { mutex_unlock(&mapping->i_mmap_mutex); - up_write(&bdev->bd_block_size_semaphore); + percpu_up_write(&bdev->bd_block_size_semaphore); return -EBUSY; } mutex_unlock(&mapping->i_mmap_mutex); @@ -148,7 +148,7 @@ int set_blocksize(struct block_device *b kill_bdev(bdev); } - up_write(&bdev->bd_block_size_semaphore); + percpu_up_write(&bdev->bd_block_size_semaphore); return 0; } @@ -460,6 +460,12 @@ static struct inode *bdev_alloc_inode(st struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL); if (!ei) return NULL; + + if (unlikely(percpu_init_rwsem(&ei->bdev.bd_block_size_semaphore))) { + kmem_cache_free(bdev_cachep, ei); + return NULL; + } + return &ei->vfs_inode; } @@ -468,6 +474,8 @@ static void bdev_i_callback(struct rcu_h struct inode *inode = container_of(head, struct inode, i_rcu); struct bdev_inode *bdi = BDEV_I(inode); + percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore); + kmem_cache_free(bdev_cachep, bdi); } @@ -491,7 +499,6 @@ static void init_once(void *foo) inode_init_once(&ei->vfs_inode); /* Initialize mutex for freeze. 
*/ mutex_init(&bdev->bd_fsfreeze_mutex); - init_rwsem(&bdev->bd_block_size_semaphore); } static inline void __bd_forget(struct inode *inode) @@ -1593,11 +1600,11 @@ ssize_t blkdev_aio_read(struct kiocb *io ssize_t ret; struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); - down_read(&bdev->bd_block_size_semaphore); + percpu_down_read(&bdev->bd_block_size_semaphore); ret = generic_file_aio_read(iocb, iov, nr_segs, pos); - up_read(&bdev->bd_block_size_semaphore); + percpu_up_read(&bdev->bd_block_size_semaphore); return ret; } @@ -1622,7 +1629,7 @@ ssize_t blkdev_aio_write(struct kiocb *i blk_start_plug(&plug); - down_read(&bdev->bd_block_size_semaphore); + percpu_down_read(&bdev->bd_block_size_semaphore); ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); if (ret > 0 || ret == -EIOCBQUEUED) { @@ -1633,7 +1640,7 @@ ssize_t blkdev_aio_write(struct kiocb *i ret = err; } - up_read(&bdev->bd_block_size_semaphore); + percpu_up_read(&bdev->bd_block_size_semaphore); blk_finish_plug(&plug); @
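[Editorial note: the prose description above translates to roughly the following scheme. This is only a simplified sketch of the idea - the names are invented, the memory barriers that the real code needs (discussed later in this thread) are omitted, and the writer's wait loop is reduced to its simplest form.]

#include <linux/percpu.h>
#include <linux/mutex.h>
#include <linux/rcupdate.h>
#include <linux/delay.h>
#include <linux/errno.h>

struct sketch_percpu_rw_sem {
	unsigned int __percpu *counters;	/* per-CPU reader counts; may underflow per CPU */
	bool locked;				/* true while a writer holds the lock */
	struct mutex mtx;			/* serializes writers; readers wait on it */
};

static int sketch_init(struct sketch_percpu_rw_sem *s)
{
	s->counters = alloc_percpu(unsigned int);
	if (!s->counters)
		return -ENOMEM;
	s->locked = false;
	mutex_init(&s->mtx);
	return 0;
}

static void sketch_down_read(struct sketch_percpu_rw_sem *s)
{
	for (;;) {
		rcu_read_lock();
		if (likely(!s->locked)) {
			/*
			 * The increment happens inside the RCU read section,
			 * so a writer that sets "locked" and then calls
			 * synchronize_rcu() is guaranteed to see it.
			 */
			this_cpu_inc(*s->counters);
			rcu_read_unlock();
			return;
		}
		rcu_read_unlock();
		/* A writer is active: wait until it drops the mutex, then retry. */
		mutex_lock(&s->mtx);
		mutex_unlock(&s->mtx);
	}
}

static void sketch_up_read(struct sketch_percpu_rw_sem *s)
{
	/* May run on a different CPU than the down_read; only the sum matters. */
	this_cpu_dec(*s->counters);
}

static unsigned int sketch_count_readers(struct sketch_percpu_rw_sem *s)
{
	unsigned int sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(s->counters, cpu);	/* per-CPU underflows cancel out */
	return sum;
}

static void sketch_down_write(struct sketch_percpu_rw_sem *s)
{
	mutex_lock(&s->mtx);
	s->locked = true;
	synchronize_rcu();			/* no new readers can pass the "locked" check */
	while (sketch_count_readers(s))
		msleep(1);			/* existing readers drain out */
}

static void sketch_up_write(struct sketch_percpu_rw_sem *s)
{
	s->locked = false;
	mutex_unlock(&s->mtx);
}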
Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
On Tue, 25 Sep 2012, Jeff Moyer wrote: > Jeff Moyer writes: > > > Mikulas Patocka writes: > > > >> Hi Jeff > >> > >> Thanks for testing. > >> > >> It would be interesting ... what happens if you take the patch 3, leave > >> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct > >> block_device", but remove any use of the semaphore from fs/block_dev.c? - > >> will the performance be like unpatched kernel or like patch 3? It could be > >> that the change in the alignment affects performance on your CPU too, just > >> differently than on my CPU. > > > > It turns out to be exactly the same performance as with the 3rd patch > > applied, so I guess it does have something to do with cache alignment. > > Here is the patch (against vanilla) I ended up testing. Let me know if > > I've botched it somehow. > > > > So, I next up I'll play similar tricks to what you did (padding struct > > block_device in all kernels) to eliminate the differences due to > > structure alignment and provide a clear picture of what the locking > > effects are. > > After trying again with the same padding you used in the struct > bdev_inode, I see no performance differences between any of the > patches. I tried bumping up the number of threads to saturate the > number of cpus on a single NUMA node on my hardware, but that resulted > in lower IOPS to the device, and hence consumption of less CPU time. > So, I believe my results to be inconclusive. For me, the fourth patch with RCU-based locks performed better, so I am submitting that. > After talking with Vivek about the problem, he had mentioned that it > might be worth investigating whether bd_block_size could be protected > using SRCU. I looked into it, and the one thing I couldn't reconcile is > updating both the bd_block_size and the inode->i_blkbits at the same > time. It would involve (afaiui) adding fields to both the inode and the > block_device data structures and using rcu_assign_pointer and > rcu_dereference to modify and access the fields, and both fields would > need to protected by the same struct srcu_struct. I'm not sure whether > that's a desirable approach. When I started to implement it, it got > ugly pretty quickly. What do others think? Using RCU doesn't seem sensible to me (except for lock implementation, as it is in patch 4). The major problem is that the block layer reads blocksize multiple times and when different values are read, a crash may happen - RCU doesn't protect you against that - if you read a variable multiple times in a RCU-protected section, you can still get different results. If we wanted to use RCU, we would have to read blocksize just once and pass the value between all functions involved - that would result in a massive code change. > For now, my preference is to get the full patch set in. I will continue > to investigate the performance impact of the data structure size changes > that I've been seeing. Yes, we should get the patches to the kernel. Mikulas > So, for the four patches: > > Acked-by: Jeff Moyer > > Jens, can you have a look at the patch set? We are seeing problem > reports of this in the wild[1][2]. > > Cheers, > Jeff > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=824107 > [2] https://bugzilla.redhat.com/show_bug.cgi?id=812129 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
On Wed, 26 Sep 2012, Jeff Moyer wrote: > Mikulas Patocka writes: > > > On Tue, 25 Sep 2012, Jeff Moyer wrote: > > > >> Jeff Moyer writes: > >> > >> > Mikulas Patocka writes: > >> > > >> >> Hi Jeff > >> >> > >> >> Thanks for testing. > >> >> > >> >> It would be interesting ... what happens if you take the patch 3, leave > >> >> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct > >> >> block_device", but remove any use of the semaphore from fs/block_dev.c? > >> >> - > >> >> will the performance be like unpatched kernel or like patch 3? It could > >> >> be > >> >> that the change in the alignment affects performance on your CPU too, > >> >> just > >> >> differently than on my CPU. > >> > > >> > It turns out to be exactly the same performance as with the 3rd patch > >> > applied, so I guess it does have something to do with cache alignment. > >> > Here is the patch (against vanilla) I ended up testing. Let me know if > >> > I've botched it somehow. > >> > > >> > So, I next up I'll play similar tricks to what you did (padding struct > >> > block_device in all kernels) to eliminate the differences due to > >> > structure alignment and provide a clear picture of what the locking > >> > effects are. > >> > >> After trying again with the same padding you used in the struct > >> bdev_inode, I see no performance differences between any of the > >> patches. I tried bumping up the number of threads to saturate the > >> number of cpus on a single NUMA node on my hardware, but that resulted > >> in lower IOPS to the device, and hence consumption of less CPU time. > >> So, I believe my results to be inconclusive. > > > > For me, the fourth patch with RCU-based locks performed better, so I am > > submitting that. > > > >> After talking with Vivek about the problem, he had mentioned that it > >> might be worth investigating whether bd_block_size could be protected > >> using SRCU. I looked into it, and the one thing I couldn't reconcile is > >> updating both the bd_block_size and the inode->i_blkbits at the same > >> time. It would involve (afaiui) adding fields to both the inode and the > >> block_device data structures and using rcu_assign_pointer and > >> rcu_dereference to modify and access the fields, and both fields would > >> need to protected by the same struct srcu_struct. I'm not sure whether > >> that's a desirable approach. When I started to implement it, it got > >> ugly pretty quickly. What do others think? > > > > Using RCU doesn't seem sensible to me (except for lock implementation, as > > it is in patch 4). The major problem is that the block layer reads > > blocksize multiple times and when different values are read, a crash may > > happen - RCU doesn't protect you against that - if you read a variable > > multiple times in a RCU-protected section, you can still get different > > results. > > SRCU is sleepable, so could be (I think) used in the same manner as your > rw semaphore. The only difference is that it would require changing the > bd_blocksize and the i_blkbits to pointers and protecting them both with > the same srcu struct. Then, the inode i_blkbits would also need to be > special cased, so that we only require such handling when it is > associated with a block device. It got messy. No, it couldn't be used this way. If you do srcu_read_lock(&srcu) ptr1 = srcu_dereference(pointer, &srcu); ptr2 = srcu_dereference(pointer, &srcu); srcu_read_unlock(&srcu) it doesn't guarantee that ptr1 == ptr2. 
All that it guarantees is that when synchronize_srcu exits, there are no references to the old structure. But after rcu_assign_pointer and before synchronize_srcu exits, readers can read both old and new value of the pointer and it is not specified which value do they read. > > If we wanted to use RCU, we would have to read blocksize just once and > > pass the value between all functions involved - that would result in a > > massive code change. > > If we did that, we wouldn't need rcu at all, would we? Yes, we wouldn't need RCU then. Mikulas > Cheers, > Jeff > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 26 Oct 2012, Oleg Nesterov wrote: > > The code is different, but it can be changed to use percpu rw semaphores > > (if we add percpu_down_write_trylock). > > I don't really understand how you can make percpu_down_write_trylock() > atomic so that it can be called under br_write_lock(vfsmount_lock) in > sb_prepare_remount_readonly(). So I guess you also need to replace > vfsmount_lock at least. Or _trylock needs the barriers in _down_read. > Or I missed something. > > Oleg. That's true - that code is under spinlock and you can't implement non-blocking percpu_down_write_trylock. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dm: Make MIN_IOS, et al, tunable via sysctl.
On Tue, 20 Aug 2013, Frank Mayhar wrote: > On Tue, 2013-08-20 at 18:24 -0400, Mike Snitzer wrote: > > Mikulas' point is that you cannot reduce the size to smaller than 1. > > And aside from rq-based DM, 1 is sufficient to allow for forward > > progress even when memory is completely consumed. > > > > A patch that simply changes them to 1 but makes the rq-based DM > > mempool's size configurable should actually be fine. > > So you're saying that I should submit a patch to drop the pool size for > BIO_BASED to 1 and make the pool size for REQUEST_BASED configurable? > At the moment, in dm.c the former is hardcoded to 16 and the latter is > set via MIN_IOS (currently 256). There's also io_pool, a slab pool, > which is also set via MIN_IOS. In case of request-based io, yes, submit a patch that makes it configurable. Regarding bio-based io - there is: pools->io_pool = mempool_create_slab_pool(MIN_IOS, cachep); allocating a pool of 256 entries. I think it could be reduced to 16 for bio based devices, like other pools. As for reducing 16 further down to 1 - do you have a setup where it really helps? Please describe your system - the amount of memory, the number of dm devices, and how much memory is saved by reducing the mempools from 16 to 1. > How does this relate to the rest of the DM modules? Mpath also sets > MIN_IOS to 256 and creates a slab pool from that, and there are a number > of hardcoded constants in dm-io (MIN_IOS and MIN_BIOS), dm-snap > (MIN_IOS), In dm-snap you shouldn't reduce it because it can cause failure. > dm-crypt (MIN_IOS and MIN_POOL_PAGES), dm-bufio > (DM_BUFIO_HASH_BITS, which is allocated via vmalloc per client) and dm-bufio hash could be reduced - one possibility is to auto-tune it based on device size, another possibility is to allocate shared hash table for all devices. > dm-verity (DM_VERITY_MEMPOOL_SIZE, which is allocated per device). > > For the most part I can't imagine that people will want to change these > from their defaults, but when someone does need to change one of these, > they need to do so badly and there's currently no good way to do that > besides hacking the source and building a new kernel. > > By the way, I do appreciate the advice. I'm just trying to clear up > confusion on my part, make sure that our needs are met and, while I'm at > it, make things a bit better for those who come after me. > -- > Frank Mayhar > 310-460-4042 Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
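[Editorial note: one possible shape for such a knob - purely illustrative; the parameter name, default and helper are assumptions for this sketch, not an existing dm interface.]

#include <linux/module.h>
#include <linux/mempool.h>
#include <linux/slab.h>

/* Hypothetical tunable: how many IOs to reserve for request-based targets. */
static unsigned reserved_rq_based_ios = 256;
module_param(reserved_rq_based_ios, uint, 0644);
MODULE_PARM_DESC(reserved_rq_based_ios,
		 "Reserved IOs in request-based dm mempools");

/* In the pool-creation path, instead of a hardcoded MIN_IOS: */
static mempool_t *create_io_pool(struct kmem_cache *cachep, bool rq_based)
{
	unsigned nr = rq_based ? reserved_rq_based_ios : 16;

	return mempool_create_slab_pool(nr, cachep);
}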
Re: [dm-devel] [RFC] dm-lc: plan to go to staging
Another idea: Make the interface of dm-lc (the arguments to constructor, messages and the status line) the same as dm-cache, so that they can be driven by the same userspace code. Mikulas On Thu, 29 Aug 2013, Alasdair G Kergon wrote: > On Wed, Aug 28, 2013 at 07:05:55PM -0700, Greg KH wrote: > > For staging drivers, I need a TODO file that lists > > what needs to be done to the code to get it into a mergable state for > > the "real" part of the kernel, > > Two simple requirements before putting your proof-of-concept into staging > if you want to work that way: > > 1) Drop the major version number to 0. Version 1 is reserved for > supported modules. > > 2) Agree a new and meaningful target name with us so you don't have to > change it later. "lc" means nothing, I'm afraid. > > Then in general terms, you should continue to compare your device-mapper > target with the existing targets and where there are differences, either > change your target to be like something that already exists, or be ready > to explain why that can't or shouldn't be done. > > In particular, the interface and architecture will need substantial > changes and working these out should be your highest priority. > > For example, you write: > > Be careful, you MUST create all the LVs as the destinations of > the dirty blocks on the cache device before this operation. Otherwise, > the kernel may crash. > > I read a statement like that as an indication of an interface or > architectural problem. The device-mapper approach is to 'design out' > problems, rather than relying on users not doing bad things. > Study the existing interfaces used by other targets to understand > some approaches that proved successful, then decide which ones > come closest to your needs. > > (Your code also needs many more comments inline to explain what it does > and how it works.) > > The documentation file will eventually need rewriting to follow the same > format as the other targets recently added to the kernel. We document > the kernel interface rather than any particular userspace tools, which > just have the status of convenient examples. > > Another little thing I noticed: look into using something like > __dm_bless_for_disk() for your metadata and clearly segregate your > on-disk structures and document the layout. > > Alasdair > > -- > dm-devel mailing list > dm-de...@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Tue, 1 Oct 2013, Joe Thornber wrote: > > Alternatively, delaying them will stall the filesystem because it's > > waiting for said REQ_FUA IO to complete. For example, journal writes > > in XFS are extremely IO latency sensitive in workloads that have a > > signifincant number of ordering constraints (e.g. O_SYNC writes, > > fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the > > filesystem for the majority of that barrier_deadline_ms. > > Yes, this is a valid concern, but I assume Akira has benchmarked. > With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to > see if there are any other FUA requests on my queue that can be > aggregated into a single flush. I agree with you that the target > should never delay waiting for new io; that's asking for trouble. > > - Joe You could send the first REQ_FUA/REQ_FLUSH request directly to the disk and aggregate all the requests that were received while you processed the initial request. This way, you can do request batching without introducing artifical delays. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
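[Editorial note: a schematic sketch of that scheme, with invented names - not dm-thin or dm-writeboost code. The first FLUSH/FUA bio is issued immediately; anything arriving while it is in flight is parked and issued as one aggregated batch from the completion path, so batching never adds latency to an idle queue.]

#include <linux/bio.h>
#include <linux/spinlock.h>

struct barrier_batcher {
	spinlock_t	lock;
	bool		in_flight;	/* a FLUSH/FUA bio is currently on the wire */
	struct bio_list	pending;	/* barrier bios that arrived in the meantime */
};

static void issue_batch_as_single_flush(struct bio_list *batch);	/* hypothetical helper */

/* Entry point for REQ_FLUSH/REQ_FUA bios reaching the target. */
static void submit_barrier_bio(struct barrier_batcher *b, struct bio *bio)
{
	bool send_now;

	spin_lock_irq(&b->lock);
	send_now = !b->in_flight;
	if (send_now)
		b->in_flight = true;
	else
		bio_list_add(&b->pending, bio);	/* aggregate it, don't delay it artificially */
	spin_unlock_irq(&b->lock);

	if (send_now)
		generic_make_request(bio);	/* the first barrier goes out immediately */
}

/* Called from the endio path of the barrier bio that was sent directly. */
static void barrier_bio_done(struct barrier_batcher *b)
{
	struct bio_list batch;
	unsigned long flags;

	bio_list_init(&batch);

	spin_lock_irqsave(&b->lock, flags);
	bio_list_merge(&batch, &b->pending);
	bio_list_init(&b->pending);
	b->in_flight = !bio_list_empty(&batch);
	spin_unlock_irqrestore(&b->lock, flags);

	/*
	 * Everything that queued up while the first barrier was in flight is
	 * handed over as one aggregated batch.  (Simplified: a real target
	 * would clear in_flight only when this batch completes, and would
	 * defer the submission to process context rather than doing it from
	 * the completion path.)
	 */
	issue_batch_as_single_flush(&batch);
}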
dm-writeboost testing
Hi I tested dm-writeboost and I found these problems: Performance problems: I tested dm-writeboost with disk as backing device and ramdisk as cache device. When I run mkfs.ext4 on the dm-writeboost device, it writes data to the cache on the first time. However, on next mkfs.ext4 invocations, dm-writeboost writes data to the disk, not to the cache. mkfs.ext4 on raw disk: 1.5s mkfs.ext4 on dm-cache using raw disk and ramdisk: 1st time - 0.15s next time - 0.12s mkfs.ext4 on dm-writeboost using raw disk and ramdisk: 1st time - 0.11s next time - 1.71s, 1.31s, 0.91s, 0.86s, 0.82s - there seems to be some error in logic in dm-writeboost that makes it not cache writes if these writes are already placed in the cache. In real-world scenarios where the same piece of disk is overwritten over and over again (for example journal), this could cause performance problems. dm-cache doesn't have this problem, if you overwrite the same piece of data again and again, it goes to the cache device. Reliability problems (these problems are reproducible, they happen every time). Tests were done on 3.10.13 on Opteron and 3.11.3 on PA-RISC. On 3.10.13 on Opteron on preemptible kernel, I get "BUG: workqueue leaked lock or atomic" when unloading the device with dmsetup remove. Also, on this machine, I see a bug, if I load dm-writeboost and terminate X server - Xorg hangs flusing workqueue. On 3.11.3 on PA-RISC without preemption, the device unloads (although it takes many seconds and vmstat shows that the machine is idle during this time), but I get deadlock when I load the device the second time. The deadlock happens either on load on when writing to the newly loaded device. deadlock when the device is loaded the second time: [ 8336.212499] SysRq : Show Blocked State [ 8336.212499] taskPC stack pid father [ 8336.212499] dmsetup D 401040c0 0 1594 1572 0x0010 [ 8336.212499] Backtrace: [ 8336.212499] [<40111608>] __schedule+0x280/0x678 [ 8336.212499] [<40111e88>] schedule+0x38/0x90 [ 8336.212499] [<4010ed1c>] schedule_timeout+0x1b4/0x208 [ 8336.212499] [<40111c1c>] wait_for_common+0x124/0x1e8 [ 8336.212499] [<40111d04>] wait_for_completion+0x24/0x38 [ 8336.212499] [<107d9778>] wait_for_migration+0x38/0xb0 [dm_writeboost] [ 8336.212499] [<107d7cf8>] resume_cache+0x1100/0x16f8 [dm_writeboost] Another deadlock when loaded the second time and running mkfs.ex3 on the writeboost device (it got cut off in the scrollback buffer): [ 782.579921] [<40112280>] schedule_preempt_disabled+0x20/0x38 [ 782.579921] [<40110764>] __mutex_lock_slowpath+0x15c/0x290 [ 782.579921] [<40110928>] mutex_lock+0x90/0x98 [ 782.579921] [<107f8b74>] flush_current_buffer+0x2c/0xb0 [dm_writeboost] [ 782.579921] [<107fecbc>] sync_proc+0x7c/0xc8 [dm_writeboost] [ 782.579921] [<401591d0>] process_one_work+0x160/0x460 [ 782.579921] [<40159bb8>] worker_thread+0x300/0x478 [ 782.579921] [<40161a68>] kthread+0x118/0x128 [ 782.579921] [<40104020>] end_fault_vector+0x20/0x28 [ 782.579921] timer_interrupt(CPU 0): delayed! 
cycles A099A8C1 rem 22345C next/now CDFE401953/CDFE1DE4F7 [ 785.403254] [ 785.403254] mkfs.ext3 D 401040c0 0 2309 2237 0x0010 [ 785.403254] Backtrace: [ 785.403254] [<40111608>] __schedule+0x280/0x678 [ 785.403254] [<40111e88>] schedule+0x38/0x90 [ 785.403254] [<4010ed1c>] schedule_timeout+0x1b4/0x208 [ 785.403254] [<40111c1c>] wait_for_common+0x124/0x1e8 [ 785.403254] [<40111d04>] wait_for_completion+0x24/0x38 [ 785.403254] [<107fe778>] wait_for_migration+0x38/0xb0 [dm_writeboost] [ 785.403254] [<107f7fe8>] queue_current_buffer+0x78/0x3c8 [dm_writeboost] [ 785.403254] [<107f96b8>] writeboost_map+0x660/0x970 [dm_writeboost] [ 785.403254] [<1079477c>] __map_bio+0x9c/0x148 [dm_mod] [ 785.403254] [<10794cf0>] __clone_and_map_data_bio+0x188/0x288 [dm_mod] [ 785.403254] [<10795594>] __split_and_process_bio+0x474/0x6c8 [dm_mod] [ 785.403254] [<10796180>] dm_request+0x118/0x278 [dm_mod] [ 785.403254] [<402d8360>] generic_make_request+0x128/0x1a0 [ 785.403254] [<402d8448>] submit_bio+0x70/0x140 [ 785.403254] [<40231c68>] _submit_bh+0x200/0x3b8 [ 785.403254] [<40231e34>] submit_bh+0x14/0x20 A leaked prermpt count on unload (with preemptible kernel): BUG: workqueue leaked lock or atomic: kworker/u26:1/0x0001/1031 last function: flush_proc [dm_writeboost] CPU: 10 PID: 1031 Comm: kworker/u26:1 Tainted: P O 3.10.13 #9 Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009 Workqueue: flushwq flush_proc [dm_writeboost] 8134e746 81052c1d 3e5428c0 88023f578c00 88023f578c18 88044
Re: [dm-devel] dm-writeboost testing
On Fri, 4 Oct 2013, Akira Hayakawa wrote: > Hi, Mikulas, > > I am sorry to say that > I don't have such machines to reproduce the problem. > > But agree with that I am dealing with workqueue subsystem > in a little bit weird way. > I should clean them up. > > For example, > free_cache() routine below is > a deconstructor of the cache metadata > including all the workqueues. > > void free_cache(struct wb_cache *cache) > { > cache->on_terminate = true; > > /* Kill in-kernel daemons */ > cancel_work_sync(&cache->sync_work); > cancel_work_sync(&cache->recorder_work); > cancel_work_sync(&cache->modulator_work); > > cancel_work_sync(&cache->flush_work); > destroy_workqueue(cache->flush_wq); > > cancel_work_sync(&cache->barrier_deadline_work); > > cancel_work_sync(&cache->migrate_work); > destroy_workqueue(cache->migrate_wq); > free_migration_buffer(cache); > > /* Destroy in-core structures */ > free_ht(cache); > free_segment_header_array(cache); > > free_rambuf_pool(cache); > } > > cancel_work_sync() before destroy_workqueue() > can probably be removed because destroy_workqueue() first > flush all the works. > > Although I prepares independent workqueue > for each flush_work and migrate_work > other four works are queued into the system_wq > through schedule_work() routine. > This asymmetricity is not welcome for > architecture-portable code. > Dependencies to the subsystem should be minimized. > In detail, workqueue subsystem is really changing > about its concurrency support so > trusting only the single threaded workqueue > will be a good idea for stability. The problem is that you are using workqueues the wrong way. You submit a work item to a workqueue and the work item is active until the device is unloaded. If you submit a work item to a workqueue, it is required that the work item finishes in finite time. Otherwise, it may stall stall other tasks. The deadlock when I terminate Xserver is caused by this - the nvidia driver tries to flush system workqueue and it waits for all work items to terminate - but your work items don't terminate. If you need a thread that runs for a long time, you should use kthread_create, not workqueues (see this http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-encryption-threads.patch or this http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-offload-writes-to-thread.patch as an example how to use kthreads). Mikulas > To begin with, > these works are never out of queue > until the deconstructor is called > but they are repeating running and sleeping. > Queuing these kind of works to system_wq > may be unsupported. > > So, > my strategy is to clean them up in a way that > 1. all daemons are having their own workqueue > 2. never use cancel_work_sync() but only calls destroy_workqueue() >in the deconstructor free_cache() and error handling in resume_cache(). > > Could you please run the same test again > after I fixed these points > to see whether it is still reproducible? > > > > On 3.11.3 on PA-RISC without preemption, the device unloads (although it > > takes many seconds and vmstat shows that the machine is idle during this > > time) > This behavior is benign but probably should be improved. > In said free_cache() it first turns `on_terminate` flag to true > to notify all the daemons that we are shutting down. > Since the `update_interval` and `sync_interval` are 60 seconds by default > we must wait for them to finish for a while. 
> > Akira > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
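[Editorial note: for reference, the general shape of a long-running daemon implemented with kthreads looks roughly like this. This is a generic sketch, not code from dm-writeboost or from the linked dm-crypt patches; have_work() and process_work() are placeholders.]

#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/err.h>

struct daemon_ctx {
	struct task_struct	*task;
	wait_queue_head_t	wq;
	/* ... the queue/state the daemon operates on ... */
};

static bool have_work(struct daemon_ctx *ctx);		/* placeholder */
static void process_work(struct daemon_ctx *ctx);	/* placeholder */

static int daemon_fn(void *data)
{
	struct daemon_ctx *ctx = data;

	while (!kthread_should_stop()) {
		/* Sleep until there is work or we are asked to stop. */
		wait_event_interruptible(ctx->wq,
				have_work(ctx) || kthread_should_stop());

		if (kthread_should_stop())
			break;

		process_work(ctx);
	}
	return 0;
}

static int daemon_start(struct daemon_ctx *ctx)
{
	init_waitqueue_head(&ctx->wq);
	ctx->task = kthread_run(daemon_fn, ctx, "example_daemon");
	if (IS_ERR(ctx->task))
		return PTR_ERR(ctx->task);
	return 0;
}

static void daemon_stop(struct daemon_ctx *ctx)
{
	/* Wakes the thread if it is sleeping and waits for daemon_fn to return. */
	kthread_stop(ctx->task);
}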
Re: [dm-devel] dm-writeboost testing
On Fri, 4 Oct 2013, Akira Hayakawa wrote: > Mikulas, > > Thanks for your pointing out. > > > The problem is that you are using workqueues the wrong way. You submit a > > work item to a workqueue and the work item is active until the device is > > unloaded. > > > > If you submit a work item to a workqueue, it is required that the work > > item finishes in finite time. Otherwise, it may stall stall other tasks. > > The deadlock when I terminate Xserver is caused by this - the nvidia > > driver tries to flush system workqueue and it waits for all work items to > > terminate - but your work items don't terminate. > > > > If you need a thread that runs for a long time, you should use > > kthread_create, not workqueues (see this > > http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-encryption-threads.patch > > > > or this > > http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-offload-writes-to-thread.patch > > > > as an example how to use kthreads). > > But I see no reason why you recommend > using a kthread for looping job > instead of putting a looping work item > into a single-threaded not-system workqueue. > > For me, they both seem to be working. As I said, the system locks up when it tries to flush the system workqueue. This happens for example when terminating Xwindow with the nvidia binary driver, but it may happen in other parts of the kernel too. The fact that it works in your setup doesn't mean that it is correct. > Is it documented that > looping job should not be put into > any type of workqueue? It is general assumption when workqueues were created. Maybe it's not documented. > You are only mentioning that > putting a looping work item in system_wq > is the wrong way since > nvidia driver flush the workqueue. > > Akira Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] dm-writeboost testing
On Sat, 5 Oct 2013, Akira Hayakawa wrote: > Mikulas, > > > nvidia binary driver, but it may happen in other parts of the kernel too. > > The fact that it works in your setup doesn't mean that it is correct. > You are right. I am convinced. > > As far as I looked around the kernel code, > it seems to be choosing kthread when one needs looping in background. > There are other examples such as loop_thread in drivers/block/loop.c . > > And done. Please git pull. > commit b6d2d3892e09145cd0b9c5ad69ea2d8c410fdeab is tested in my environment. > All the looping daemons are running on kthread now. > > The diff between the older version (with wq) and the new version (with > kthread) > is shown below. I defined fancy CREATE_DAEMON macro. > > Another by-product is that > you are no longer in need to wait for long time in dmsetup remove > since kthread_stop() forcefully wakes the thread up. The change seems ok. Please, also move this piece of code in flush_proc out of the spinlock: if (kthread_should_stop()) return 0; It caused the workqueue warning I reported before and still causes warning with kthreads: note: flush_daemon[5145] exited with preempt_count 1 I will send you next email with more bugs that I found in your code. Mikulas > diff --git a/Driver/dm-writeboost-daemon.c b/Driver/dm-writeboost-daemon.c > index 6090661..65974e2 100644 > --- a/Driver/dm-writeboost-daemon.c > +++ b/Driver/dm-writeboost-daemon.c > @@ -10,12 +10,11 @@ > > /**/ > > -void flush_proc(struct work_struct *work) > +int flush_proc(void *data) > { > unsigned long flags; > > - struct wb_cache *cache = > - container_of(work, struct wb_cache, flush_work); > + struct wb_cache *cache = data; > > while (true) { > struct flush_job *job; > @@ -32,8 +31,12 @@ void flush_proc(struct work_struct *work) > msecs_to_jiffies(100)); > spin_lock_irqsave(&cache->flush_queue_lock, flags); > > - if (cache->on_terminate) > - return; > + /* > + * flush daemon can exit > + * only if no flush job is queued. > + */ > + if (kthread_should_stop()) > + return 0; > } > > /* > @@ -84,6 +87,7 @@ void flush_proc(struct work_struct *work) > > kfree(job); > } > + return 0; > } > > /**/ > @@ -353,19 +357,15 @@ migrate_write: > } > } > > -void migrate_proc(struct work_struct *work) > +int migrate_proc(void *data) > { > - struct wb_cache *cache = > - container_of(work, struct wb_cache, migrate_work); > + struct wb_cache *cache = data; > > - while (true) { > + while (!kthread_should_stop()) { > bool allow_migrate; > size_t i, nr_mig_candidates, nr_mig, nr_max_batch; > struct segment_header *seg, *tmp; > > - if (cache->on_terminate) > - return; > - > /* >* If reserving_id > 0 >* Migration should be immediate. 
> @@ -430,6 +430,7 @@ void migrate_proc(struct work_struct *work) > list_del(&seg->migrate_list); > } > } > + return 0; > } > > /* > @@ -453,19 +454,16 @@ void wait_for_migration(struct wb_cache *cache, u64 id) > > /**/ > > -void modulator_proc(struct work_struct *work) > +int modulator_proc(void *data) > { > - struct wb_cache *cache = > - container_of(work, struct wb_cache, modulator_work); > + struct wb_cache *cache = data; > struct wb_device *wb = cache->wb; > > struct hd_struct *hd = wb->device->bdev->bd_part; > unsigned long old = 0, new, util; > unsigned long intvl = 1000; > > - while (true) { > - if (cache->on_terminate) > - return; > + while (!kthread_should_stop()) { > > new = jiffies_to_msecs(part_stat_read(hd, io_ticks)); > > @@ -484,6 +482,7 @@ modulator_update: > > schedule_timeout_interruptible(msecs_to_jiffies(intvl)); > } > + return 0; > } > > /**/ > @@ -517,16 +516,12 @@ static void update_superblock_record(struct wb_cache > *cache) > kfree(buf); > } > > -void recorder_proc(struct work_struct *work) > +int recorder_proc(void *data) > { > - struct wb_cache *cache = > - container_of(work, struct wb_cache, recorder_work); > + struct wb_cache *cache = data; > unsigned long intvl; > > - while (true) { > -
A review of dm-writeboost
Hi

I looked at the dm-writeboost source code and here I'm sending the list of problems I found:

Polling for state
-----------------

Some of the kernel threads that you spawn poll for data in a one-second interval - see migrate_proc, modulator_proc, recorder_proc, sync_proc.

flush_proc correctly contains wait_event, but there is still a 100ms polling timeout in flush_proc that shouldn't be there.

Don't do this. You can do polling in a piece of code that is executed infrequently, such as driver unload, but general functionality must not depend on polling.

If you set the polling interval too low, it wastes CPU cycles and it wastes energy due to CPU wake-ups. If you set the polling interval too high, it causes artificial delays. For example, current middle-class SSDs can do writes at a rate of 500MB/s. Now, suppose that the user uses a 400MB partition for the cache - the partition is filled up in 0.8s and the kernel waits an additional 0.2s doing absolutely nothing, until your polling code wakes up and notices that it should start flushing the cache.

So - the polling code introduces artificial delays that can cause performance degradation. You may think about lowering the polling interval to lessen the performance impact, but if you lower the polling interval, it increases CPU and power consumption when the driver is idle. Either way, it is wrong. Just don't use polling.

Lack of locking/barriers
------------------------

migrate_proc lacks any lock or barrier. Note that processors may execute instructions out of order, and thus concurrent access without locks or barriers is always wrong.

Think of this piece of code:

	nr_mig_candidates = cache->last_flushed_segment_id -
			    cache->last_migrated_segment_id;
	...
	nr_mig = min(nr_mig_candidates, cache->nr_cur_batched_migration);
	...
	for (i = 1; i <= nr_mig; i++) {
		seg = get_segment_header_by_id(
				cache,
				cache->last_migrated_segment_id + i);
		list_add_tail(&seg->migrate_list, &cache->migrate_list);
	}

The processor may reorder all memory accesses - for example it may read the data accessed in the "for" cycle before reading the variables "cache->last_flushed_segment_id" and "cache->last_migrated_segment_id". If this happens, the code may work with invalid data.

Similarly, the processor that updates cache->last_flushed_segment_id can update it before updating the segment variables themselves.

You need to use smp_wmb() before incrementing cache->last_flushed_segment_id in the producer process and smp_rmb() after reading cache->last_flushed_segment_id in the consumer process. Read Documentation/memory-barriers.txt for a full explanation.

You can use locks instead of memory barriers; locks are simpler to use and simpler to verify, but they are usually slower than memory barriers.

Nonatomic 64-bit variables
--------------------------

Note that 64-bit variables are not atomic on 32-bit architectures.

Linux assumes that 32-bit variables are atomic on 32-bit architectures and 64-bit or 32-bit variables are atomic on 64-bit architectures. That is, variables having the "long" or "int" type are atomic. Atomicity means that if you read and modify the variable at the same time, you read either the old value or the new value.

64-bit variables on 32-bit architectures are usually read and written in two steps, and thus the atomicity requirement doesn't hold.

For example, suppose that you change cache->last_flushed_segment_id from 0x00000000ffffffff to 0x0000000100000000.

Now, the function migrate_proc that asynchronously reads cache->last_flushed_segment_id can read 0x00000000ffffffff or 0x0000000100000000 (if it reads one of these values, it behaves correctly), but it can also read a torn value such as 0x0000000000000000 or 0x00000001ffffffff - if migrate_proc reads one of these two values, it goes wild, migrating segments that weren't ever supposed to be migrated, and likely causes a crash or data corruption.

I found this bug in migrate_proc and update_superblock_record (but it may also be in other parts of the code).

You can use the atomic64_t type for atomic 64-bit variables (plus memory barriers as explained in the previous section). Or you can use locks.

reserving_segment_id is also affected. However, you never actually need the value of reserving_segment_id, you only compare it to zero. Thus, you can change this variable to the "int" type and set it to "0" or "1". (You must use int and not bool, see the next section.)

Variables smaller than 32 bits must not be asynchronously modified
------------------------------------------------------------------

Early Alpha processors can't access memory objects smaller than 32 bits - so, for example, if your code writes an 8-bit char or bool variable, the processor reads 32 bits to a register, modifies the 8-bit part of the register and writes 32 bits from the register back to memory. Now, if you define something like an 8-bit flag that one thread modifies while another thread concurrently modifies an adjacent sub-32-bit variable in the same word, one of the modifications can be lost on such processors - asynchronously modified flags should therefore use the "int" type.
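[Editorial note: to make the suggestions in the "Lack of locking/barriers" and "Nonatomic 64-bit variables" sections concrete, the producer/consumer pairing would look roughly like this. This is a sketch against hypothetical field names, not a drop-in patch for dm-writeboost.]

#include <linux/atomic.h>
#include <linux/types.h>

/* The id is declared atomic64_t so that 32-bit architectures never see a torn value. */
struct cache_state {
	atomic64_t last_flushed_segment_id;
	/* ... segment headers etc. ... */
};

/* Producer side (flush daemon): runs after a segment is completely written. */
static void publish_flushed_segment(struct cache_state *cache)
{
	smp_wmb();	/* make the segment contents visible first ... */
	atomic64_inc(&cache->last_flushed_segment_id);	/* ... then advance the id */
}

/* Consumer side (migrate daemon): pairs with the smp_wmb() above. */
static u64 read_flushed_id(struct cache_state *cache)
{
	u64 id = atomic64_read(&cache->last_flushed_segment_id);

	smp_rmb();	/* read the id before touching the segment data it covers */
	return id;
}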
Re: A review of dm-writeboost
On Sun, 6 Oct 2013, Akira Hayakawa wrote: > Mikulas, > > Thank you for your reviewing. > > I will reply to polling issue first. > For the rest, I will reply later. > > > Polling for state > > - > > > > Some of the kernel threads that you spawn poll for data in one-second > > interval - see migrate_proc, modulator_proc, recorder_proc, sync_proc. > > > > flush_proc correctly contains wait_event, but there is still some 100ms > > polling timeout in flush_proc that shouldn't be there. > > > > If you set the polling interval too low, it wastes CPU cycle and it wastes > > energy due to CPU wake-ups. If you set the polling interval too high, it > > causes artifical delays. For example, current middle-class SSDs can do > > writes at a rate 500MB/s. Now, think that the user uses 400MB partition > > for the cache - the partition is filled up in 0.8s and the kernel waits > > for additional 0.2s doing absolutely nothing, until your polling code > > wakes up and notices that it should start flusing the cache. > > > > So - the polling code introduces artifical delays that can cause > > performance degradation. You may think about lowering the polling interval > > to lessen the performance impact, but if you lower the polling interval, > > it increases CPU and power consumption when the driver is idle. Either > > way, it is wrong. Just don't use polling. > > You dislike the fact that looping thread sleeping > can delay to the newly coming jobs. > > I am afraid your understanding > on the sleeping of flush daemon is wrong. > Please see queue_flushing() it wakes up the daemon > if the queue is empty. > > spin_lock_irqsave(&cache->flush_queue_lock, flags); > empty = list_empty(&cache->flush_queue); > list_add_tail(&job->flush_queue, &cache->flush_queue); > spin_unlock_irqrestore(&cache->flush_queue_lock, flags); > if (empty) > wake_up_interruptible(&cache->flush_wait_queue); > > Even if this waking up lines are removed > the flush daemon wakes up by its self > 100ms at worst after the first 1MB written. flush_proc is woken up correctly. But the other threads aren't. migrate_proc, modulator_proc, recorder_proc, sync_proc all do polling. Waking up every 100ms in flush_proc is not good because it wastes CPU time and energy if the driver is idle. BTW. You should use wait_event_interruptible_lock_irq instead of wait_event_interruptible and wait_event_interruptible_lock_irq_timeout instead of wait_event_interruptible_timeout. The functions with "lock_irq" automatically drop and re-acquire the spinlock, so that the condition is tested while the lock is held - so that there are no asynchronous accesses to !list_empty(&cache->flush_queue). > > Don't do this. You can do polling in a piece of code that is executed > > infrequently, such as driver unload, but general functionality must not > > depend on polling. > 100ms is the time the user must wait in termination as your claim. > > However, I think this is a good time to > explain why I didn't choose to use workqueue. > > Keeping it a polling thread that consumes what in queue > instead of replacing it with workqueue > has these reasons: > > 1. CPU overhead is lower >queue_work overhead everytime flush job is to queue >is crucial for writeboost. queue_work is too CPU-consuming >as far as I looked at the code. >I have measured the theoretical maximum 4KB randwrite throughput >of writeboost using fio benchmark. It was 1.5GB/sec >with fast enough cache device. >I don't think this extraordinary score can not be achieved with >other caching softwares. 
>In this benchmark, the flush daemon didn't sleep at all >because the writes are ceaseless. >Using workqueue will not only lose throughput >but also wastes CPU cycles both regularly is the fact I dislike. > 2. Ease of Error Handling >Keeping it a single thread looping and self-managing the queue >makes it easy to handle error. >If we choose to use workqueue, we need a mechanism to >* notify the occurrence of a error to other work items >* play-back the work items in the same order since waiting until the > system recovers for long time is prohibited with workqueue > as we discussed in the previous posts. > Segments must be flushed and then migrated in sequential > order along its id. >Leaving the control of the queue which should be in desired order >to the workqueue subsystem is dangerous in this case. You don't have to use workqueues if you don't like them. You can use kernel threads and wait_event/wake_up instead. Workqueues are simpler to use in some circumstances (for example, you can submit the same work item multiple times) but it's up to you if you want that simplicity or not. But you shouldn't do looping and waiting for events in migrate_proc. > For migrate_proc, > I see no reason this daemon polling is bad > although it actually sleeps in leisure time. > I
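To illustrate the alternative suggested above, here is a minimal sketch of an event-driven migrate_proc built around the wait_event_interruptible_lock_irq() recommended earlier in this thread. All of the names below (wb_cache, migrate_job, do_migrate and the individual fields) are made up for the example and are not the actual dm-writeboost structures; the producer side would add a job to the list under the same spinlock and then call wake_up_interruptible() on the wait queue, so the daemon neither polls when idle nor adds an artificial delay when work arrives.

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Hypothetical structures, only for the sketch. */
struct migrate_job {
	struct list_head list;
};

struct wb_cache {
	spinlock_t migrate_lock;
	struct list_head migrate_list;
	wait_queue_head_t migrate_wait_queue;
};

static void do_migrate(struct wb_cache *cache, struct migrate_job *job)
{
	/* placeholder for the real migration work */
}

static int migrate_proc(void *data)
{
	struct wb_cache *cache = data;

	while (!kthread_should_stop()) {
		struct migrate_job *job;

		spin_lock_irq(&cache->migrate_lock);
		/*
		 * The condition is evaluated with the lock held; the macro
		 * drops the lock while sleeping and reacquires it before
		 * returning, so the list is never tested without the lock.
		 */
		wait_event_interruptible_lock_irq(cache->migrate_wait_queue,
				!list_empty(&cache->migrate_list) ||
				kthread_should_stop(),
				cache->migrate_lock);
		if (list_empty(&cache->migrate_list)) {
			spin_unlock_irq(&cache->migrate_lock);
			continue;
		}
		job = list_first_entry(&cache->migrate_list,
				       struct migrate_job, list);
		list_del(&job->list);
		spin_unlock_irq(&cache->migrate_lock);

		do_migrate(cache, job);
	}
	return 0;
}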
Re: [REGRESSION][BISECTED] skge: add dma_mapping check
On Tue, 24 Sep 2013, Joseph Salisbury wrote: > On 09/19/2013 05:03 AM, Igor Gnatenko wrote: > > Please, send patch. > > > The patch is in mainline as of 3.12-rc2 as commit: > > Author: Mikulas Patocka > Date: Thu Sep 19 14:13:17 2013 -0400 > > skge: fix broken driver > > I don't see that the commit was Cc'd to stable. Mikulas, we might need > to send a request directly to the stable maintainers and request that > the commit be pulled into stable, in case they didn't notice the request > in the commit message. The patch needs to be added to the 3.11 stable tree. David Miller's networking tree has a rule that it doesn't want "Cc: sta...@kernel.org" in patches submitted through it. So David will hopefully submit the patch to the stable tree on his own. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] percpu-rwsem: use barrier in unlock path
Hi On Wed, 17 Oct 2012, Lai Jiangshan wrote: > On 10/17/2012 10:23 AM, Linus Torvalds wrote: > > [ Architecture people, note the potential new SMP barrier! ] > > > > On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka > > wrote: > >> + /* > >> +* The lock is considered unlocked when p->locked is set to false. > >> +* Use barrier prevent reordering of operations around p->locked. > >> +*/ > >> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && > >> !defined(CONFIG_X86_OOSTORE)) > >> + barrier(); > >> +#else > >> + smp_mb(); > >> +#endif > >> p->locked = false; > > > > Ugh. The #if is too ugly to live. > > Even the previous patch is applied, percpu_down_read() still > needs mb() to pair with it. percpu_down_read uses rcu_read_lock which should guarantee that memory accesses don't escape in front of a rcu-protected section. If rcu_read_unlock has only an unlock barrier and not a full barrier, memory accesses could be moved in front of rcu_read_unlock and reordered with this_cpu_inc(*p->counters), but it doesn't matter because percpu_down_write does synchronize_rcu(), so it never sees these accesses halfway through. > > This is a classic case of "people who write their own serialization > > primitives invariably get them wrong". And this fix is just horrible, > > and code like this should not be allowed. > > One of the most major problems of 62ac665ff9fc07497ca524bd20d6a96893d11071 is > that > it is merged without Ackeds or Revieweds from Paul or Peter or someone else > who are expert at synchronization/arch memory models. > > I suggest any new synchronization should stay in -tip for 2 or more cycles > before merged to mainline. But the bug that this synchronization is fixing is quite serious (it causes random crashes when block size is being changed, the crash happens regularly at multiple important business sites) so it must be fixed soon and not wait half a year. > Thanks, > Lai Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] percpu-rwsem: use barrier in unlock path
On Thu, 18 Oct 2012, Lai Jiangshan wrote: > On 10/18/2012 04:28 AM, Steven Rostedt wrote: > > On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote: > >>> > >>> Even the previous patch is applied, percpu_down_read() still > >>> needs mb() to pair with it. > >> > >> percpu_down_read uses rcu_read_lock which should guarantee that memory > >> accesses don't escape in front of a rcu-protected section. > > > > You do realize that rcu_read_lock() does nothing more that a barrier(), > > right? > > > > Paul worked really hard to get rcu_read_locks() to not call HW barriers. > > > >> > >> If rcu_read_unlock has only an unlock barrier and not a full barrier, > >> memory accesses could be moved in front of rcu_read_unlock and reordered > >> with this_cpu_inc(*p->counters), but it doesn't matter because > >> percpu_down_write does synchronize_rcu(), so it never sees these accesses > >> halfway through. > > > > Looking at the patch, you are correct. The read side doesn't need the > > memory barrier as the worse thing that will happen is that it sees the > > locked = false, and will just grab the mutex unnecessarily. > > - > A memory barrier can be added iff these two things are known: > 1) it disables the disordering between what and what. > 2) what is the corresponding mb() that it pairs with. > > You tried to add a mb() in percpu_up_write(), OK, I know it disables the > disordering > between the writes to the protected data and the statement "p->locked = > false", > But I can't find out the corresponding mb() that it pairs with. Or alternativelly, instead of barrier, we can do this. Mikulas --- percpu-rwsem: use barrier in unlock path The lock is considered unlocked when p->locked is set to false. Use barrier prevent reordering of operations around p->locked. Signed-off-by: Mikulas Patocka --- include/linux/percpu-rwsem.h |6 ++ 1 file changed, 6 insertions(+) Index: linux-3.6.2-fast/include/linux/percpu-rwsem.h === --- linux-3.6.2-fast.orig/include/linux/percpu-rwsem.h 2012-10-17 20:48:40.0 +0200 +++ linux-3.6.2-fast/include/linux/percpu-rwsem.h 2012-10-18 17:19:24.0 +0200 @@ -66,6 +66,12 @@ static inline void percpu_down_write(str static inline void percpu_up_write(struct percpu_rw_semaphore *p) { + /* +* Make sure that other processes that are in rcu section and that +* may have seen partially modified state exit the rcu section and +* try to grab the mutex. +*/ + synchronize_rcu(); p->locked = false; mutex_unlock(&p->mtx); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
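For reference, a minimal usage sketch of the percpu_rw_semaphore discussed in this thread; example_sem and example_data are made-up names, and percpu_init_rwsem() is assumed to have been called during initialization. With the synchronize_rcu() above in place, a reader that observes p->locked == false is also guaranteed to observe the data the writer published.

#include <linux/percpu-rwsem.h>

static struct percpu_rw_semaphore example_sem;	/* percpu_init_rwsem() at init time */
static int example_data;

static int example_reader(void)
{
	int val;

	percpu_down_read(&example_sem);		/* fast path: no atomic instruction */
	val = example_data;			/* sees completely written data */
	percpu_up_read(&example_sem);
	return val;
}

static void example_writer(int val)
{
	percpu_down_write(&example_sem);	/* waits until all readers drain */
	example_data = val;
	percpu_up_write(&example_sem);		/* grace period, then unlock */
}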
Re: [PATCH] percpu-rwsem: use barrier in unlock path
On Wed, 17 Oct 2012, Steven Rostedt wrote: > On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote: > > > > > > Even the previous patch is applied, percpu_down_read() still > > > needs mb() to pair with it. > > > > percpu_down_read uses rcu_read_lock which should guarantee that memory > > accesses don't escape in front of a rcu-protected section. > > You do realize that rcu_read_lock() does nothing more that a barrier(), > right? > > Paul worked really hard to get rcu_read_locks() to not call HW barriers. > > > > > If rcu_read_unlock has only an unlock barrier and not a full barrier, > > memory accesses could be moved in front of rcu_read_unlock and reordered > > with this_cpu_inc(*p->counters), but it doesn't matter because > > percpu_down_write does synchronize_rcu(), so it never sees these accesses > > halfway through. > > Looking at the patch, you are correct. The read side doesn't need the > memory barrier as the worse thing that will happen is that it sees the > locked = false, and will just grab the mutex unnecessarily. It wasn't correct. CPU 1 is holding the write lock. CPU 2 could get to percpu_down_read past rcu_read_lock and prefetch some data that are accessed after percpu_down_read. CPU 1 goes into percpu_up_write(), calls a barrier, p->locked = false; and mutex_unlock(&p->mtx); CPU 2 now sees p->locked == false, so it goes through percpu_down_read. It exits percpu_down_read and uses the invalid prefetched data. It could be fixed by using synchronize_rcu() in percpu_up_write; I sent a patch for that. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] percpu-rwsem: use barrier in unlock path
On Thu, 18 Oct 2012, Steven Rostedt wrote: > On Thu, 2012-10-18 at 10:18 +0800, Lai Jiangshan wrote: > > > > > > Looking at the patch, you are correct. The read side doesn't need the > > > memory barrier as the worse thing that will happen is that it sees the > > > locked = false, and will just grab the mutex unnecessarily. > > > > - > > A memory barrier can be added iff these two things are known: > > 1) it disables the disordering between what and what. > > 2) what is the corresponding mb() that it pairs with. > > > > OK, I was just looking at the protection and actions of the locked flag, > but I see what you are saying with the data itself. > > > You tried to add a mb() in percpu_up_write(), OK, I know it disables the > > disordering > > between the writes to the protected data and the statement "p->locked = > > false", > > But I can't find out the corresponding mb() that it pairs with. > > > > percpu_down_read() writes to the data > > The cpu cache/prefetch the data writes to the data > > which is chaos writes to the data > > percpu_up_write() > > mb() > > p->locked = > > false; > > unlikely(p->locked) > > the cpu see p->lock = false, > > don't discard the cached/prefetch data > > this_cpu_inc(*p->counters); > > the code of read-access to the data > > and we use the chaos data* > > > > So you need to add a mb() after "unlikely(p->locked)". > > Does it need a full mb() or could it be just a rmb()? The down_read I > wouldn't think would need to protect against stores, would it? The > memory barrier should probably go in front of the unlikely() too. The > write to p->counters is handled by the synchronized sched, and adding a > rmb() in front of the unlikely check would keep prefetched data from > passing this barrier. > > This is a perfect example why this primitive should be vetted outside of > mainline before it gets merged. > > -- Steve If we do synchronize_rcu() in percpu_up_write, we don't need a barrier in percpu_down_read(). So I would do that. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] percpu-rwsem: use barrier in unlock path
On Tue, 16 Oct 2012, Linus Torvalds wrote: > [ Architecture people, note the potential new SMP barrier! ] > > On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka wrote: > > + /* > > +* The lock is considered unlocked when p->locked is set to false. > > +* Use barrier prevent reordering of operations around p->locked. > > +*/ > > +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && > > !defined(CONFIG_X86_OOSTORE)) > > + barrier(); > > +#else > > + smp_mb(); > > +#endif > > p->locked = false; > > Ugh. The #if is too ugly to live. One instance of this is already present in the code. I suggest that you drop this patch and use synchronize_rcu() that I have just posted. But another instance of this "#if defined(CONFIG_X86) && ..." is already there in percpu_up_read. What is the procedure for making changes that require support of architectures? It is trivial to make a patch that moves this into arch-specific includes, the problem is that the patch break all the architectures - I wrote support for x86, sparc, parisc, alpha (I can test those) but not the others. Are you going to apply this patch, break majority of architectures and wait until architecture maintainers fix it? Or is there some arch-specific git tree, where the patch should be applied and where the maintainers fix things? > This is a classic case of "people who write their own serialization > primitives invariably get them wrong". And this fix is just horrible, > and code like this should not be allowed. > > I suspect we could make the above x86-specific optimization be valid > by introducing a new barrier, called "smp_release_before_store()" or > something like that, which on x86 just happens to be a no-op (but > still a compiler barrier). That's because earlier reads will not pass > a later stores, and stores are viewed in program order, so all x86 > stores have "release consistency" modulo the theoretical PPro bugs > (that I don't think we actually ever saw in practice). > > But it is possible that that is true on other architectures too, so > your hand-crafted x86-specific #if is not just ugly, it's liable to > get worse. > > The alternative is to just say that we should use "smp_mb()" > unconditionally, depending on how critical this code actually ends up > being. > > Added linux-arch in case architecture-maintainers have comments on > "smp_release_before_store()" thing. It would be kind of similar to the > odd "smp_read_barrier_depends()", except it would normally be a full > memory barrier, except on architectures where a weaker barrier might > be sufficient. > > I suspect there may be many architectures where a "smp_wmb()" is > sufficient for this case, for the simple reason that no sane > microarchitecture would *ever* move earlier reads down past a later > write, Alpha can move reads down past writes (if the read is not in cache and the write hits the cache, it may make sense to do this reordering). > so release consistency really only needs the local writes to be > ordered, not the full memory ordering. > > Arch people? > > The more optimal solution may be to mark the store *itself* to be > "store with release consistency", which on x86 would be a regular > store (with the compiler barrier), but on other architectures may be a > special memory operation. On architectures with > release/aqcuire-consistency, there's not a separate barrier before the > store, the store instruction itself is done with special semantics. So > maybe the right thing to do is > >#define smp_release_consistency_store(val, ptr) ... 
> > where on x86, the implementation would be a simple > >do { barrier(); *(ptr)=(val); } while (0) smp_release_consistency_store doesn't work. In include/linux/percpu-rwsem.h it is required that "this_cpu_dec(*p->counters);" works as an unlock barrier. So we could do either. "smp_mb(); this_cpu_dec(*p->counters);" - generates pointless barrier on x86 "#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE)) barrier(); #else smp_mb(); #endif this_cpu_dec(*p->counters); " - removes the barrier on the most common architecture (X86), but the code looks dirty "smp_release_before_store();this_cpu_dec(*p->counters);" - the code is clean, but the downside is that it breaks architectures that don't have smp_release_before_store(). > but on other architectures it might be a inline asm with the required > magic store-wi
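To make the third option concrete, here is a sketch of what the discussed smp_release_before_store() helper could look like. The macro does not exist in the kernel; it only illustrates the idea from this thread, and the CONFIG_X86_PPRO_FENCE / CONFIG_X86_OOSTORE special cases are omitted for brevity.

/*
 * Hypothetical helper: order all earlier memory accesses before a
 * subsequent store. On x86 (TSO) a compiler barrier is sufficient,
 * elsewhere fall back to a full memory barrier.
 */
#if defined(CONFIG_X86)
#define smp_release_before_store()	barrier()
#else
#define smp_release_before_store()	smp_mb()
#endif

/* The unlock path would then read: */
static inline void percpu_up_read(struct percpu_rw_semaphore *p)
{
	smp_release_before_store();
	this_cpu_dec(*p->counters);
}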
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Thu, 18 Oct 2012, Oleg Nesterov wrote: > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > something similar. Certainly it was not in my tree when I started > this patch... percpu_down_write() doesn't allow multiple writers, > but the main problem it uses msleep(1). It should not, I think. synchronize_rcu() can sleep for a hundred milliseconds, so msleep(1) is not a big problem. > But. It seems that percpu_up_write() is equally wrong? Doesn't > it need synchronize_rcu() before "p->locked = false" ? Yes, it does ... and I sent a patch for that to Linus. > (add Mikulas) > > Oleg. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] percpu-rwsem: use barrier in unlock path
This patch looks sensible. I'd apply either this or my previous patch that adds synchronize_rcu() to percpu_up_write. This patch avoids the memory barrier on non-x86 cpus in percpu_up_read, so it is faster than the previous approach. Mikulas On Thu, 18 Oct 2012, Lai Jiangshan wrote: > --- > > a very draft example of paired-mb()s is here: > > > diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h > index cf80f7e..84a93c0 100644 > --- a/include/linux/percpu-rwsem.h > +++ b/include/linux/percpu-rwsem.h > @@ -12,6 +12,14 @@ struct percpu_rw_semaphore { > struct mutex mtx; > }; > > +#if 1 > +#define light_mb() barrier() > +#define heavy_mb() synchronize_sched() > +#else > +#define light_mb() smp_mb() > +#define heavy_mb() smp_mb(); > +#endif > + > static inline void percpu_down_read(struct percpu_rw_semaphore *p) > { > rcu_read_lock(); > @@ -24,22 +32,12 @@ static inline void percpu_down_read(struct > percpu_rw_semaphore *p) > } > this_cpu_inc(*p->counters); > rcu_read_unlock(); > + light_mb(); /* A, between read of p->locked and read of data, paired > with D */ > } > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > { > - /* > - * On X86, write operation in this_cpu_dec serves as a memory unlock > - * barrier (i.e. memory accesses may be moved before the write, but > - * no memory accesses are moved past the write). > - * On other architectures this may not be the case, so we need smp_mb() > - * there. > - */ > -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && > !defined(CONFIG_X86_OOSTORE)) > - barrier(); > -#else > - smp_mb(); > -#endif > + light_mb(); /* B, between read of the data and write to p->counter, > paired with C */ > this_cpu_dec(*p->counters); > } > > @@ -61,11 +59,12 @@ static inline void percpu_down_write(struct > percpu_rw_semaphore *p) > synchronize_rcu(); > while (__percpu_count(p->counters)) > msleep(1); > - smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */ > + heavy_mb(); /* C, between read of p->counter and write to data, paired > with B */ > } > > static inline void percpu_up_write(struct percpu_rw_semaphore *p) > { > + heavy_mb(); /* D, between write to data and write to p->locked, paired > with A */ > p->locked = false; > mutex_unlock(&p->mtx); > } > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 19 Oct 2012, Peter Zijlstra wrote: > On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote: > > > > On Thu, 18 Oct 2012, Oleg Nesterov wrote: > > > > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > > > something similar. Certainly it was not in my tree when I started > > > this patch... percpu_down_write() doesn't allow multiple writers, > > > but the main problem it uses msleep(1). It should not, I think. > > > > synchronize_rcu() can sleep for hundred milliseconds, so msleep(1) is not > > a big problem. > > That code is beyond ugly though.. it should really not have been merged. > > There's absolutely no reason for it to use RCU except to make it more So if you can do an alternative implementation without RCU, show it. The goal is - there should be no LOCK instructions on the read path and as few barriers as possible. > complicated. And as Oleg pointed out that msleep() is very ill > considered. > > The very worst part of it seems to be that nobody who's usually involved > with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo, > tglx, dhowells and me). It doesn't even have lockdep annotations :/ > > So the only reason you appear to use RCU is because you don't actually > have a sane way to wait for count==0. And I'm contesting rcu_sync() is > sane here -- for the very simple reason you still need while (count) > loop right after it. > > So it appears you want an totally reader biased, sleepable rw-lock like > thing? Yes. > So did you consider keeping the inc/dec on the same per-cpu variable? > Yes this adds a potential remote access to dec and requires you to use > atomics, but I would not be surprised if the inc/dec were mostly on the > same cpu most of the times -- which might be plenty fast for what you > want. Yes, I tried this approach - it involves doing LOCK instruction on read lock, remembering the cpu and doing another LOCK instruction on read unlock (which will hopefully be on the same CPU, so no cacheline bouncing happens in the common case). It was slower than the approach without any LOCK instructions (43.3 seconds seconds for the implementation with per-cpu LOCKed access, 42.7 seconds for this implementation without atomic instruction; the benchmark involved doing 512-byte direct-io reads and writes on a ramdisk with 8 processes on 8-core machine). > If you've got coherent per-cpu counts, you can better do the > waitqueue/wake condition for write_down. synchronize_rcu() is way slower than msleep(1) - so I don't see a reason why should it be complicated to avoid msleep(1). > It might also make sense to do away with the mutex, there's no point in > serializing the wakeups in the p->locked case of down_read. The advantage of a mutex is that it is already protected against starvation. If I replace the mutex with a wait queue and retry, there is no starvation protection. > Furthermore, > p->locked seems a complete duplicate of the mutex state, so removing the > mutex also removes that duplication. We could replace if (p->locked) with if (mutex_is_locked(p->mtx)) > Also, that CONFIG_x86 thing.. *shudder*... Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
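For illustration, a rough sketch of the per-cpu LOCKed-counter variant that was benchmarked and rejected above. The structure and function names are made up, and the writer side (waiting for all the counters to drain) is omitted.

#include <linux/atomic.h>
#include <linux/percpu.h>
#include <linux/smp.h>

struct percpu_atomic_read_ctr {
	atomic_t __percpu *ctr;
};

static inline int locked_down_read(struct percpu_atomic_read_ctr *p)
{
	int cpu = get_cpu();			/* pin to the current CPU */

	atomic_inc(per_cpu_ptr(p->ctr, cpu));	/* LOCK-prefixed increment */
	put_cpu();
	return cpu;				/* the caller remembers the CPU */
}

static inline void locked_up_read(struct percpu_atomic_read_ctr *p, int cpu)
{
	/*
	 * Usually still on the same CPU, so the cache line stays local,
	 * but it is a second LOCK-prefixed instruction per lock/unlock pair.
	 */
	atomic_dec(per_cpu_ptr(p->ctr, cpu));
}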
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 19 Oct 2012, Peter Zijlstra wrote: > > Yes, I tried this approach - it involves doing LOCK instruction on read > > lock, remembering the cpu and doing another LOCK instruction on read > > unlock (which will hopefully be on the same CPU, so no cacheline bouncing > > happens in the common case). It was slower than the approach without any > > LOCK instructions (43.3 seconds seconds for the implementation with > > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic > > instruction; the benchmark involved doing 512-byte direct-io reads and > > writes on a ramdisk with 8 processes on 8-core machine). > > So why is that a problem? Surely that's already tons better then what > you've currently got. Percpu rw-semaphores do not improve performance at all. I put them there to avoid performance regression, not to improve performance. All Linux kernels have a race condition - when you change block size of a block device and you read or write the device at the same time, a crash may happen. This bug is there since ever. Recently, this bug started to cause major trouble - multiple high profile business sites report crashes because of this race condition. You can fix this race by using a read lock around I/O paths and write lock around block size changing, but normal rw semaphore cause cache line bouncing when taken for read by multiple processors and I/O performance degradation because of it is measurable. So I put this percpu-rw-semaphore there to fix the crashes and minimize performance impact - on x86 it doesn't take any interlocked instructions in the read path. I don't quite understand why are people opposing to this and what do they want to do instead? If you pull percpu-rw-semaphores out of the kernel, you introduce a performance regression (raw device i/o will be slower on 3.7 than on 3.6, because on 3.6 it doesn't take any lock at all and on 3.7 it takes a read lock). So you have options: 1) don't lock i/o just like on 3.6 and previous versions - you get a fast kernel that randomly crashes 2) lock i/o with normal rw semaphore - you get a kernel that doesn't crash, but that is slower than previous versions 3) lock i/o with percpu rw semaphore - you get kernel that is almost as fast as previous kernels and that doesn't crash For the users, the option 3) is the best. The users don't care whether it looks ugly or not, they care about correctness and performance, that's all. Obviously, you can improve rw semaphores by adding lockdep annotations, or by other things (turning rcu_read_lock/sychronize_rcu into preempt_disable/synchronize_sched, by using barrier()-synchronize_sched() instead of smp_mb()...), but I don't see a reason why do you want to hurt users' experience by pulling it out and reverting to state 1) or 2) and then, two kernel cycles later, come up with percpu-rw-semaphores again. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
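To make option 3 concrete, this is roughly how the percpu rw semaphore is used on the block device path; a simplified sketch, not the actual fs/block_dev.c code, with submit_the_io() and apply_new_block_size() as placeholders. Every I/O entry point takes the semaphore for read and the rare block size change takes it for write, so the block size cannot change in the middle of an I/O.

static ssize_t submit_the_io(struct block_device *bdev, void *buf, size_t count);
static void apply_new_block_size(struct block_device *bdev, int size);

static ssize_t example_blkdev_io(struct block_device *bdev, void *buf,
				 size_t count)
{
	ssize_t ret;

	/* Read side: taken for every read/write, no LOCK instruction on x86. */
	percpu_down_read(&bdev->bd_block_size_semaphore);
	ret = submit_the_io(bdev, buf, count);
	percpu_up_read(&bdev->bd_block_size_semaphore);
	return ret;
}

static int example_set_blocksize(struct block_device *bdev, int size)
{
	/* Write side: rare, blocks new I/O while the block size is changed. */
	percpu_down_write(&bdev->bd_block_size_semaphore);
	apply_new_block_size(bdev, size);
	percpu_up_write(&bdev->bd_block_size_semaphore);
	return 0;
}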
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 19 Oct 2012, Oleg Nesterov wrote: > On 10/19, Mikulas Patocka wrote: > > > > synchronize_rcu() is way slower than msleep(1) - > > This depends, I guess. but this doesn't mmatter, > > > so I don't see a reason > > why should it be complicated to avoid msleep(1). > > I don't think this really needs complications. Please look at this > patch for example. Or initial (single writer) version below. It is > not finished and lacks the barriers too, but I do not think it is > more complex. Hi My implementation has a smaller structure (it doesn't have wait_queue_head_t). Using preempt_disable()/synchronize_sched() instead of RCU seems like a good idea. Here, the locked region is so small that it doesn't make sense to play tricks with preemptible RCU. Your implementation is prone to starvation - if the writer has a high priority and if it is doing back-to-back write unlocks/locks, it may happen that the readers have no chance to run. The use of mutex instead of a wait queue in my implementation is unusual, but I don't see anything wrong with it - it makes the structure smaller and it solves the starvation problem (which would otherwise be complicated to solve). Mikulas > Oleg. > > struct brw_sem { > long __percpu *read_ctr; > wait_queue_head_t read_waitq; > struct mutexwriter_mutex; > struct task_struct *writer; > }; > > int brw_init(struct brw_sem *brw) > { > brw->writer = NULL; > mutex_init(&brw->writer_mutex); > init_waitqueue_head(&brw->read_waitq); > brw->read_ctr = alloc_percpu(long); > return brw->read_ctr ? 0 : -ENOMEM; > } > > void brw_down_read(struct brw_sem *brw) > { > for (;;) { > bool done = false; > > preempt_disable(); > if (likely(!brw->writer)) { > __this_cpu_inc(*brw->read_ctr); > done = true; > } > preempt_enable(); > > if (likely(done)) > break; > > __wait_event(brw->read_waitq, !brw->writer); > } > } > > void brw_up_read(struct brw_sem *brw) > { > struct task_struct *writer; > > preempt_disable(); > __this_cpu_dec(*brw->read_ctr); > writer = ACCESS_ONCE(brw->writer); > if (unlikely(writer)) > wake_up_process(writer); > preempt_enable(); > } > > static inline long brw_read_ctr(struct brw_sem *brw) > { > long sum = 0; > int cpu; > > for_each_possible_cpu(cpu) > sum += per_cpu(*brw->read_ctr, cpu); Integer overflow on signed types is undefined - you should use unsigned long - you can use -fwrapv option to gcc to make signed overflow defined, but Linux doesn't use it. > > return sum; > } > > void brw_down_write(struct brw_sem *brw) > { > mutex_lock(&brw->writer_mutex); > brw->writer = current; > synchronize_sched(); > /* >* Thereafter brw_*_read() must see ->writer != NULL, >* and we should see the result of __this_cpu_inc(). >*/ > for (;;) { > set_current_state(TASK_UNINTERRUPTIBLE); > if (brw_read_ctr(brw) == 0) > break; > schedule(); > } > __set_current_state(TASK_RUNNING); > /* >* We can add another synchronize_sched() to avoid the >* spurious wakeups from brw_up_read() after return. >*/ > } > > void brw_up_write(struct brw_sem *brw) > { > brw->writer = NULL; > synchronize_sched(); That synchronize_sched should be put before brw->writer = NULL. This is incorrect, because brw->writer = NULL may be reordered with previous writes done by this process and the other CPU may see brw->writer == NULL (and think that the lock is unlocked) while it doesn't see previous writes done by the writer. I had this bug in my implementation too. 
> wake_up_all(&brw->read_waitq); > mutex_unlock(&brw->writer_mutex); > } Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
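In other words, the unlock path would have to look like this; a corrected sketch of the brw_up_write() above, with synchronize_sched() moved before the store that clears brw->writer:

void brw_up_write(struct brw_sem *brw)
{
	/*
	 * The grace period must come first: after synchronize_sched()
	 * returns, the writes done while the lock was held are visible to
	 * any reader that subsequently sees brw->writer == NULL, so the
	 * cleared pointer cannot be observed ahead of the protected data.
	 */
	synchronize_sched();
	brw->writer = NULL;
	wake_up_all(&brw->read_waitq);
	mutex_unlock(&brw->writer_mutex);
}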
[PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex)
> > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > > something similar. Certainly it was not in my tree when I started > > this patch... percpu_down_write() doesn't allow multiple writers, > > but the main problem it uses msleep(1). It should not, I think. > > > > But. It seems that percpu_up_write() is equally wrong? Doesn't > > it need synchronize_rcu() before "p->locked = false" ? > > > > (add Mikulas) > > Mikulas said something about doing an updated patch, so I figured I > would look at his next version. > > Thanx, Paul The best ideas proposed in this thread are: Using the heavy/light barriers proposed by Lai Jiangshan. This fixes the missing barrier bug, removes the ugly test "#if defined(X86) ..." and makes the read path use no barrier instruction on all architectures. Instead of rcu_read_lock, we can use rcu_read_lock_sched (or preempt_disable) - the resulting code is smaller. The critical section is so small that there is no problem disabling preemption. I am sending these two patches. Linus, please apply them if there are no objections. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu. This is an optimization. The RCU-protected region is very small, so there will be no latency problems if we disable preempt in this region. So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates to preempt_disable / preempt_disable. It is smaller (and supposedly faster) than preemptible rcu_read_lock / rcu_read_unlock. Signed-off-by: Mikulas Patocka --- include/linux/percpu-rwsem.h |8 1 file changed, 4 insertions(+), 4 deletions(-) Index: linux-3.6.3-fast/include/linux/percpu-rwsem.h === --- linux-3.6.3-fast.orig/include/linux/percpu-rwsem.h 2012-10-23 01:21:49.0 +0200 +++ linux-3.6.3-fast/include/linux/percpu-rwsem.h 2012-10-23 01:36:23.0 +0200 @@ -17,16 +17,16 @@ struct percpu_rw_semaphore { static inline void percpu_down_read(struct percpu_rw_semaphore *p) { - rcu_read_lock(); + rcu_read_lock_sched(); if (unlikely(p->locked)) { - rcu_read_unlock(); + rcu_read_unlock_sched(); mutex_lock(&p->mtx); this_cpu_inc(*p->counters); mutex_unlock(&p->mtx); return; } this_cpu_inc(*p->counters); - rcu_read_unlock(); + rcu_read_unlock_sched(); light_mb(); /* A, between read of p->locked and read of data, paired with D */ } @@ -51,7 +51,7 @@ static inline void percpu_down_write(str { mutex_lock(&p->mtx); p->locked = true; - synchronize_rcu(); + synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ while (__percpu_count(p->counters)) msleep(1); heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
This patch introduces new barrier pair light_mb() and heavy_mb() for percpu rw semaphores. This patch fixes a bug in percpu-rw-semaphores where a barrier was missing in percpu_up_write. This patch improves performance on the read path of percpu-rw-semaphores: on non-x86 cpus, there was a smp_mb() in percpu_up_read. This patch changes it to a compiler barrier and removes the "#if defined(X86) ..." condition. From: Lai Jiangshan Signed-off-by: Mikulas Patocka --- include/linux/percpu-rwsem.h | 20 +++- 1 file changed, 7 insertions(+), 13 deletions(-) Index: linux-3.6.3-fast/include/linux/percpu-rwsem.h === --- linux-3.6.3-fast.orig/include/linux/percpu-rwsem.h 2012-10-22 23:37:57.0 +0200 +++ linux-3.6.3-fast/include/linux/percpu-rwsem.h 2012-10-23 01:21:23.0 +0200 @@ -12,6 +12,9 @@ struct percpu_rw_semaphore { struct mutex mtx; }; +#define light_mb() barrier() +#define heavy_mb() synchronize_sched() + static inline void percpu_down_read(struct percpu_rw_semaphore *p) { rcu_read_lock(); @@ -24,22 +27,12 @@ static inline void percpu_down_read(stru } this_cpu_inc(*p->counters); rcu_read_unlock(); + light_mb(); /* A, between read of p->locked and read of data, paired with D */ } static inline void percpu_up_read(struct percpu_rw_semaphore *p) { - /* -* On X86, write operation in this_cpu_dec serves as a memory unlock -* barrier (i.e. memory accesses may be moved before the write, but -* no memory accesses are moved past the write). -* On other architectures this may not be the case, so we need smp_mb() -* there. -*/ -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE)) - barrier(); -#else - smp_mb(); -#endif + light_mb(); /* B, between read of the data and write to p->counter, paired with C */ this_cpu_dec(*p->counters); } @@ -61,11 +54,12 @@ static inline void percpu_down_write(str synchronize_rcu(); while (__percpu_count(p->counters)) msleep(1); - smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */ + heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ } static inline void percpu_up_write(struct percpu_rw_semaphore *p) { + heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ p->locked = false; mutex_unlock(&p->mtx); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: blk: bd_block_size_semaphore related lockdep warning
Hi It should be fixed in 3.7-rc8 Mikulas On Sat, 1 Dec 2012, Sasha Levin wrote: > Hi all, > > While fuzzing with trinity inside a KVM tools guest, running latest -next, > I've > stumbled on: > > [ 3130.099477] == > [ 3130.104862] [ INFO: possible circular locking dependency detected ] > [ 3130.104862] 3.7.0-rc7-next-20121130-sasha-00015-g06fcc7a-dirty #2 Tainted: > GW > [ 3130.104862] --- > [ 3130.104862] trinity-child77/12730 is trying to acquire lock: > [ 3130.104862] (&sb->s_type->i_mutex_key#17/1){+.+.+.}, at: > [] pipe_lock_nested.isra.2+0x15/0x20 > [ 3130.104862] > [ 3130.104862] but task is already holding lock: > [ 3130.104862] (sb_writers#16){.+.+..}, at: [] > generic_file_splice_write+0x6f/0x170 > [ 3130.104862] > [ 3130.104862] which lock already depends on the new lock. > [ 3130.104862] > [ 3130.104862] > [ 3130.104862] the existing dependency chain (in reverse order) is: > [ 3130.104862] > -> #3 (sb_writers#16){.+.+..}: > [ 3130.104862][] lock_acquire+0x1aa/0x240 > [ 3130.104862][] __sb_start_write+0x146/0x1b0 > [ 3130.104862][] > generic_file_splice_write+0x6f/0x170 > [ 3130.104862][] blkdev_splice_write+0x5a/0x80 > [ 3130.104862][] do_splice_from+0x83/0xb0 > [ 3130.104862][] sys_splice+0x492/0x690 > [ 3130.104862][] tracesys+0xe1/0xe6 > [ 3130.104862] > -> #2 (&ei->bdev.bd_block_size_semaphore){.+}: > [ 3130.104862][] lock_acquire+0x1aa/0x240 > [ 3130.104862][] percpu_down_read+0x55/0x90 > [ 3130.104862][] blkdev_mmap+0x33/0x60 > [ 3130.104862][] mmap_region+0x31b/0x600 > [ 3130.104862][] do_mmap_pgoff+0x2b7/0x330 > [ 3130.104862][] vm_mmap_pgoff+0x7a/0xa0 > [ 3130.104862][] sys_mmap_pgoff+0x16e/0x1b0 > [ 3130.104862][] sys_mmap+0x1d/0x20 > [ 3130.104862][] tracesys+0xe1/0xe6 > [ 3130.104862] > -> #1 (&mm->mmap_sem){++}: > [ 3130.104862][] lock_acquire+0x1aa/0x240 > [ 3130.104862][] might_fault+0x7b/0xa0 > [ 3130.104862][] sys_vmsplice+0xd5/0x220 > [ 3130.104862][] tracesys+0xe1/0xe6 > [ 3130.104862] > -> #0 (&sb->s_type->i_mutex_key#17/1){+.+.+.}: > [ 3130.104862][] __lock_acquire+0x147f/0x1c30 > [ 3130.104862][] lock_acquire+0x1aa/0x240 > [ 3130.104862][] __mutex_lock_common+0x59/0x5a0 > [ 3130.104862][] mutex_lock_nested+0x3f/0x50 > [ 3130.104862][] pipe_lock_nested.isra.2+0x15/0x20 > [ 3130.104862][] pipe_lock+0x15/0x20 > [ 3130.104862][] > generic_file_splice_write+0x77/0x170 > [ 3130.104862][] blkdev_splice_write+0x5a/0x80 > [ 3130.104862][] do_splice_from+0x83/0xb0 > [ 3130.104862][] sys_splice+0x492/0x690 > [ 3130.104862][] tracesys+0xe1/0xe6 > [ 3130.104862] > [ 3130.104862] other info that might help us debug this: > [ 3130.104862] > [ 3130.104862] Chain exists of: > &sb->s_type->i_mutex_key#17/1 --> &ei->bdev.bd_block_size_semaphore --> > sb_writers#16 > [ 3130.104862] Possible unsafe locking scenario: > [ 3130.104862] > [ 3130.104862]CPU0CPU1 > [ 3130.104862] > [ 3130.104862] lock(sb_writers#16); > [ 3130.104862] > lock(&ei->bdev.bd_block_size_semaphore); > [ 3130.104862]lock(sb_writers#16); > [ 3130.104862] lock(&sb->s_type->i_mutex_key#17/1); > [ 3130.104862] > [ 3130.104862] *** DEADLOCK *** > [ 3130.104862] > [ 3130.104862] 2 locks held by trinity-child77/12730: > [ 3130.104862] #0: (&ei->bdev.bd_block_size_semaphore){.+}, at: > [] blkdev_splice_write+0x45/0x80 > [ 3130.104862] #1: (sb_writers#16){.+.+..}, at: [] > generic_file_splice_write+0x6f/0x170 > [ 3130.104862] > [ 3130.104862] stack backtrace: > [ 3130.104862] Pid: 12730, comm: trinity-child77 Tainted: GW > 3.7.0-rc7-next-20121130-sasha-00015-g06fcc7a-dirty #2 > [ 3130.104862] Call Trace: > [ 
3130.104862] [] print_circular_bug+0x1fb/0x20c > [ 3130.104862] [] __lock_acquire+0x147f/0x1c30 > [ 3130.104862] [] lock_acquire+0x1aa/0x240 > [ 3130.104862] [] ? pipe_lock_nested.isra.2+0x15/0x20 > [ 3130.104862] [] __mutex_lock_common+0x59/0x5a0 > [ 3130.104862] [] ? pipe_lock_nested.isra.2+0x15/0x20 > [ 3130.104862] [] ? __lock_is_held+0x5a/0x80 > [ 3130.104862] [] ? pipe_lock_nested.isra.2+0x15/0x20 > [ 3130.104862] [] mutex_lock_nested+0x3f/0x50 > [ 3130.104862] [] pipe_lock_nested.isra.2+0x15/0x20 > [ 3130.104862] [] pipe_lock+0x15/0x20 > [ 3130.104862] [] generic_file_splice_write+0x77/0x170 > [ 3130.104862] [] blkdev_splice_write+0x5a/0x80 > [ 3130.104862] [] do_splice_from+0x83/0xb0 > [ 3130.104862] [] sys_splice+0x492/0x690 > [ 3130.104862] [] ? syscall_trace_enter+0x20/0x2e0 > [ 3130.104862] [] tracesys+0xe1/0xe6 > > > Thanks, > sasha > -- To unsubscribe from this lis
[PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
It looks sensible. Here I'm sending an improvement of the patch - I changed it so that there are not two-level nested functions for the fast path and so that both percpu_down_read and percpu_up_read use the same piece of code (to reduce cache footprint). --- Currently the writer does msleep() plus synchronize_sched() 3 times to acquire/release the semaphore, and during this time the readers are blocked completely. Even if the "write" section was not actually started or if it was already finished. With this patch down_write/up_write does synchronize_sched() twice and down_read/up_read are still possible during this time, just they use the slow path. percpu_down_write() first forces the readers to use rw_semaphore and increment the "slow" counter to take the lock for reading, then it takes that rw_semaphore for writing and blocks the readers. Also. With this patch the code relies on the documented behaviour of synchronize_sched(), it doesn't try to pair synchronize_sched() with barrier. Signed-off-by: Oleg Nesterov Signed-off-by: Mikulas Patocka --- include/linux/percpu-rwsem.h | 80 ++- lib/Makefile |2 lib/percpu-rwsem.c | 110 +++ 3 files changed, 127 insertions(+), 65 deletions(-) create mode 100644 lib/percpu-rwsem.c Index: linux-3.6.6-fast/include/linux/percpu-rwsem.h === --- linux-3.6.6-fast.orig/include/linux/percpu-rwsem.h 2012-11-05 16:21:29.0 +0100 +++ linux-3.6.6-fast/include/linux/percpu-rwsem.h 2012-11-07 16:44:04.0 +0100 @@ -2,82 +2,34 @@ #define _LINUX_PERCPU_RWSEM_H #include +#include #include -#include -#include +#include struct percpu_rw_semaphore { - unsigned __percpu *counters; - bool locked; - struct mutex mtx; + unsigned int __percpu *fast_read_ctr; + struct mutexwriter_mutex; + struct rw_semaphore rw_sem; + atomic_tslow_read_ctr; + wait_queue_head_t write_waitq; }; -#define light_mb() barrier() -#define heavy_mb() synchronize_sched() +extern void __percpu_down_up_read(struct percpu_rw_semaphore *, int); -static inline void percpu_down_read(struct percpu_rw_semaphore *p) -{ - rcu_read_lock_sched(); - if (unlikely(p->locked)) { - rcu_read_unlock_sched(); - mutex_lock(&p->mtx); - this_cpu_inc(*p->counters); - mutex_unlock(&p->mtx); - return; - } - this_cpu_inc(*p->counters); - rcu_read_unlock_sched(); - light_mb(); /* A, between read of p->locked and read of data, paired with D */ -} - -static inline void percpu_up_read(struct percpu_rw_semaphore *p) -{ - light_mb(); /* B, between read of the data and write to p->counter, paired with C */ - this_cpu_dec(*p->counters); -} - -static inline unsigned __percpu_count(unsigned __percpu *counters) -{ - unsigned total = 0; - int cpu; - - for_each_possible_cpu(cpu) - total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu)); +extern void percpu_down_write(struct percpu_rw_semaphore *); +extern void percpu_up_write(struct percpu_rw_semaphore *); - return total; -} - -static inline void percpu_down_write(struct percpu_rw_semaphore *p) -{ - mutex_lock(&p->mtx); - p->locked = true; - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ - while (__percpu_count(p->counters)) - msleep(1); - heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ -} - -static inline void percpu_up_write(struct percpu_rw_semaphore *p) -{ - heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ - p->locked = false; - mutex_unlock(&p->mtx); -} +extern int percpu_init_rwsem(struct percpu_rw_semaphore *); +extern void percpu_free_rwsem(struct percpu_rw_semaphore *); -static inline int 
percpu_init_rwsem(struct percpu_rw_semaphore *p) +static inline void percpu_down_read(struct percpu_rw_semaphore *s) { - p->counters = alloc_percpu(unsigned); - if (unlikely(!p->counters)) - return -ENOMEM; - p->locked = false; - mutex_init(&p->mtx); - return 0; + __percpu_down_up_read(s, 1); } -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) +static inline void percpu_up_read(struct percpu_rw_semaphore *s) { - free_percpu(p->counters); - p->counters = NULL; /* catch use after free bugs */ + __percpu_down_up_read(s, -1); } #endif Index: linux-3.6.6-fast/lib/Makefile === --- linux-3.6.6-fast.orig/lib/Makefile
Re: [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Wed, 7 Nov 2012, Oleg Nesterov wrote: > On 11/07, Mikulas Patocka wrote: > > > > It looks sensible. > > > > Here I'm sending an improvement of the patch - I changed it so that there > > are not two-level nested functions for the fast path and so that both > > percpu_down_read and percpu_up_read use the same piece of code (to reduce > > cache footprint). > > IOW, the only change is that you eliminate "static update_fast_ctr()" > and fold it into down/up_read which takes the additional argument. > > Honestly, personally I do not think this is better, but I won't argue. > I agree with everything but I guess we need the ack from Paul. If you look at generated assembly (for x86-64), the footprint of my patch is 78 bytes shared for both percpu_down_read and percpu_up_read. The footprint of your patch is 62 bytes for update_fast_ctr, 46 bytes for percpu_down_read and 20 bytes for percpu_up_read. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
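For reference, an approximate reconstruction of the shared __percpu_down_up_read() helper described above, with the fast and slow paths folded together from the pieces visible in this thread; treat it as a sketch of the idea, not the exact lib/percpu-rwsem.c body.

/* val is +1 for percpu_down_read() and -1 for percpu_up_read(). */
void __percpu_down_up_read(struct percpu_rw_semaphore *brw, int val)
{
	/* Fast path: no writer pending, adjust the per-cpu counter. */
	preempt_disable();
	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
		__this_cpu_add(*brw->fast_read_ctr, val);
		preempt_enable();
		return;
	}
	preempt_enable();

	if (val > 0) {
		/* Slow down_read: serialize against the writer. */
		down_read(&brw->rw_sem);
		atomic_inc(&brw->slow_read_ctr);
		up_read(&brw->rw_sem);
	} else {
		/* Slow up_read: a false-positive wakeup is harmless. */
		if (atomic_dec_and_test(&brw->slow_read_ctr))
			wake_up_all(&brw->write_waitq);
	}
}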
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Thu, 8 Nov 2012, Paul E. McKenney wrote: > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > > On Thu, 8 Nov 2012 14:48:49 +0100 > > Oleg Nesterov wrote: > > > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > > to acquire/release the semaphore, and during this time the readers > > > are blocked completely. Even if the "write" section was not actually > > > started or if it was already finished. > > > > > > With this patch down_write/up_write does synchronize_sched() twice > > > and down_read/up_read are still possible during this time, just they > > > use the slow path. > > > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > > increment the "slow" counter to take the lock for reading, then it > > > takes that rw_semaphore for writing and blocks the readers. > > > > > > Also. With this patch the code relies on the documented behaviour of > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > > barrier. > > > > > > ... > > > > > > include/linux/percpu-rwsem.h | 83 + > > > lib/Makefile |2 +- > > > lib/percpu-rwsem.c | 123 > > > ++ > > > > The patch also uninlines everything. > > > > And it didn't export the resulting symbols to modules, so it isn't an > > equivalent. We can export thing later if needed I guess. > > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > > avoid including the code altogether, methinks? > > > > > > > > ... > > > > > > --- /dev/null > > > +++ b/lib/percpu-rwsem.c > > > @@ -0,0 +1,123 @@ > > > > That was nice and terse ;) > > > > > +#include > > > +#include > > > +#include > > > > This list is nowhere near sufficient to support this file's > > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > > more. IOW, if it compiles, it was sheer luck. > > > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > > > +{ > > > + brw->fast_read_ctr = alloc_percpu(int); > > > + if (unlikely(!brw->fast_read_ctr)) > > > + return -ENOMEM; > > > + > > > + mutex_init(&brw->writer_mutex); > > > + init_rwsem(&brw->rw_sem); > > > + atomic_set(&brw->slow_read_ctr, 0); > > > + init_waitqueue_head(&brw->write_waitq); > > > + return 0; > > > +} > > > + > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > > > +{ > > > + free_percpu(brw->fast_read_ctr); > > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > > > +} > > > + > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned > > > int val) > > > +{ > > > + bool success = false; > > > + > > > + preempt_disable(); > > > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { > > > + __this_cpu_add(*brw->fast_read_ctr, val); > > > + success = true; > > > + } > > > + preempt_enable(); > > > + > > > + return success; > > > +} > > > + > > > +/* > > > + * Like the normal down_read() this is not recursive, the writer can > > > + * come after the first percpu_down_read() and create the deadlock. 
> > > + */ > > > +void percpu_down_read(struct percpu_rw_semaphore *brw) > > > +{ > > > + if (likely(update_fast_ctr(brw, +1))) > > > + return; > > > + > > > + down_read(&brw->rw_sem); > > > + atomic_inc(&brw->slow_read_ctr); > > > + up_read(&brw->rw_sem); > > > +} > > > + > > > +void percpu_up_read(struct percpu_rw_semaphore *brw) > > > +{ > > > + if (likely(update_fast_ctr(brw, -1))) > > > + return; > > > + > > > + /* false-positive is possible but harmless */ > > > + if (atomic_dec_and_test(&brw->slow_read_ctr)) > > > + wake_up_all(&brw->write_waitq); > > > +} > > > + > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > > > +{ > > > + unsigned int sum = 0; > > > + int cpu; > > > + > > > + for_each_possible_cpu(cpu) { > > > + sum += per_cpu(*brw->fast_read_ctr, cpu); > > > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > > > + } > > > + > > > + return sum; > > > +} > > > + > > > +/* > > > + * A writer takes ->writer_mutex to exclude other writers and to force > > > the > > > + * readers to switch to the slow mode, note the mutex_is_locked() check > > > in > > > + * update_fast_ctr(). > > > + * > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr > > > counter, > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > > > + * counter it represents the number of active readers. > > > + * > > > + * Finally the writer takes ->rw_sem for writing and blocks the new > > > readers, > > > + * then waits until the slow counter becomes zero. > > > + */ > > > > Some overview of how fast/slow_read_ctr are supposed to work would be > > useful. This comment seems to assume that the reader already knew > > that. > > > > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > > +{ > > > + /* also blocks update_fast_ctr() which checks mutex_is_lock
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Thu, 8 Nov 2012, Andrew Morton wrote: > On Thu, 8 Nov 2012 14:48:49 +0100 > Oleg Nesterov wrote: > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > to acquire/release the semaphore, and during this time the readers > > are blocked completely. Even if the "write" section was not actually > > started or if it was already finished. > > > > With this patch down_write/up_write does synchronize_sched() twice > > and down_read/up_read are still possible during this time, just they > > use the slow path. > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > increment the "slow" counter to take the lock for reading, then it > > takes that rw_semaphore for writing and blocks the readers. > > > > Also. With this patch the code relies on the documented behaviour of > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > barrier. > > > > ... > > > > include/linux/percpu-rwsem.h | 83 + > > lib/Makefile |2 +- > > lib/percpu-rwsem.c | 123 > > ++ > > The patch also uninlines everything. > > And it didn't export the resulting symbols to modules, so it isn't an > equivalent. We can export thing later if needed I guess. > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > avoid including the code altogether, methinks? If you want to use percpu-rwsem only for block devices then you can remove Oleg's patch at all. Oleg's optimizations are useless for block device use case (the contention between readers and writers is very rare and it doesn't matter if readers are blocked in case of contention). I suppose that Oleg made the optimizations because he wants to use percpu-rwsem for something else - if not, you can drop the patch and revert to the previois version that is simpler. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Tue, 16 Apr 2013, Tejun Heo wrote: > Hey, > > On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote: > > The patch is not bug-prone, because we already must make sure that the > > cloned bio has shorter lifetime than the master bio - so the patch doesn't > > introduce any new possibilities to make bugs. > > The whole world isn't composed of only your code. As I said > repeatedly, you're introducing an API which is misleading and can > easily cause subtle bugs which are very difficult to reproduce. > > Imagine it being used to tag a metatdata or checksum update bio being > sent down while processing another bio and used to "clone" the context > of the original bio. It'll work most of the time even if the original > bio gets completed first but it'll break when it gets really unlucky - > e.g. racing with other operations which can put the base css ref, and > it'll be hellish to reproduce and everyone would have to pay for your > silly hack. That's why the comment at the function says: "it is assumed that the lifestime of the new bio is shorter than the lifetime of the original bio. If the new bio can outlive the old bio, the caller must increment the reference counts." - do you think that it is so bad that someone will use the function without reading the comment? Anyway, the situation that you describe could only happen in dm-bufio or dm-kcopyd files, so it's easy to control and increment the reference counts there. There are no other places in device mapper where we create bios that live longer than original one. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [07/65] dm bufio: avoid a possible __vmalloc deadlock
On Mon, 3 Jun 2013, Steven Rostedt wrote: > 3.6.11.5 stable review patch. > If anyone has any objections, please let me know. > > -- > > From: Mikulas Patocka > > [ Upstream commit 502624bdad3dba45dfaacaf36b7d83e39e74b2d2 ] > > This patch uses memalloc_noio_save to avoid a possible deadlock in > dm-bufio. (it could happen only with large block size, at most > PAGE_SIZE << MAX_ORDER (typically 8MiB). > > __vmalloc doesn't fully respect gfp flags. The specified gfp flags are > used for allocation of requested pages, structures vmap_area, vmap_block > and vm_struct and the radix tree nodes. > > However, the kernel pagetables are allocated always with GFP_KERNEL. > Thus the allocation of pagetables can recurse back to the I/O layer and > cause a deadlock. > > This patch uses the function memalloc_noio_save to set per-process > PF_MEMALLOC_NOIO flag and the function memalloc_noio_restore to restore > it. When this flag is set, all allocations in the process are done with > implied GFP_NOIO flag, thus the deadlock can't happen. > > This should be backported to stable kernels, but they don't have the > PF_MEMALLOC_NOIO flag and memalloc_noio_save/memalloc_noio_restore > functions. So, PF_MEMALLOC should be set and restored instead. > > Signed-off-by: Mikulas Patocka > Cc: sta...@kernel.org > Signed-off-by: Alasdair G Kergon > [ Set and clear PF_MEMALLOC manually - SR ] > Signed-off-by: Steven Rostedt > --- > drivers/md/dm-bufio.c | 26 +- > 1 file changed, 25 insertions(+), 1 deletion(-) > > diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c > index c0fc827..a1f2487 100644 > --- a/drivers/md/dm-bufio.c > +++ b/drivers/md/dm-bufio.c > @@ -321,6 +321,9 @@ static void __cache_size_refresh(void) > static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask, > enum data_mode *data_mode) > { > + unsigned noio_flag; > + void *ptr; > + > if (c->block_size <= DM_BUFIO_BLOCK_SIZE_SLAB_LIMIT) { > *data_mode = DATA_MODE_SLAB; > return kmem_cache_alloc(DM_BUFIO_CACHE(c), gfp_mask); > @@ -334,7 +337,28 @@ static void *alloc_buffer_data(struct dm_bufio_client > *c, gfp_t gfp_mask, > } > > *data_mode = DATA_MODE_VMALLOC; > - return __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL); > + > + /* > + * __vmalloc allocates the data pages and auxiliary structures with > + * gfp_flags that were specified, but pagetables are always allocated > + * with GFP_KERNEL, no matter what was specified as gfp_mask. > + * > + * Consequently, we must set per-process flag PF_MEMALLOC_NOIO so that > + * all allocations done by this process (including pagetables) are done > + * as if GFP_NOIO was specified. > + */ > + > + if (gfp_mask & __GFP_NORETRY) { > + noio_flag = current->flags; There should be noio_flag = current->flags & PF_MEMALLOC; because we don't want to restore other flags. > + current->flags |= PF_MEMALLOC; > + } > + > + ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL); > + > + if (gfp_mask & __GFP_NORETRY) > + current->flags = (current->flags & ~PF_MEMALLOC) | noio_flag; > + > + return ptr; > } > > /* > -- > 1.7.10.4 Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
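For clarity, the corrected hunk would look roughly like this; a sketch of the vmalloc branch only, with just the PF_MEMALLOC bit saved and restored as pointed out above (the function name is shortened for the example).

static void *alloc_buffer_data_vmalloc(struct dm_bufio_client *c, gfp_t gfp_mask)
{
	unsigned noio_flag = 0;
	void *ptr;

	if (gfp_mask & __GFP_NORETRY) {
		/*
		 * Save only the PF_MEMALLOC bit, so that restoring it below
		 * cannot clobber any other flag that may have changed in
		 * the meantime.
		 */
		noio_flag = current->flags & PF_MEMALLOC;
		current->flags |= PF_MEMALLOC;
	}

	ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);

	if (gfp_mask & __GFP_NORETRY)
		current->flags = (current->flags & ~PF_MEMALLOC) | noio_flag;

	return ptr;
}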
Re: [dm-devel] [PATCH v2] dm: verity: Add support for emitting uevents on dm-verity errors.
Hi I think the idea is fine. There is architecture problem - that target specific routines are being pushed into generic dm core. I suggest that instead of dm_send_verity_uevent and dm_path_uevent you create just one generic function (for example dm_send_uevent) that takes variable argument list of target-specific uevent variables and that can be used for both multipath and verity. Something like: dm_send_uevent(ti, "DM_VERITY_BLOCK_NR=123", "DM_ACTION=VERITY_DATA_ERROR", NULL); There is GFP_ATOMIC allocation and this type of allocation can fail anytime. Could it be a problem if uevent is lost? It's probably not a problem for dm-verity, because uevents are generated only in errorneous state ... but what about multipath? Can it misbehave if we lose uevent? Mikulas On Thu, 20 Jun 2013, Geremy Condra wrote: > With this change dm-verity errors will cause uevents to be > sent to userspace, notifying it that an error has occurred > and potentially triggering recovery actions. > > Signed-off-by: Geremy Condra > --- > Changelog since v1: > - Removed the DM_VERITY_ERROR_NOTIFY config option > > drivers/md/dm-uevent.c | 40 > drivers/md/dm-uevent.h | 12 ++-- > drivers/md/dm-verity.c | 28 > 3 files changed, 78 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/dm-uevent.c b/drivers/md/dm-uevent.c > index 8efe033..17a6662 100644 > --- a/drivers/md/dm-uevent.c > +++ b/drivers/md/dm-uevent.c > @@ -36,6 +36,8 @@ static const struct { > } _dm_uevent_type_names[] = { > {DM_UEVENT_PATH_FAILED, KOBJ_CHANGE, "PATH_FAILED"}, > {DM_UEVENT_PATH_REINSTATED, KOBJ_CHANGE, "PATH_REINSTATED"}, > + {DM_UEVENT_VERITY_DATA_ERROR, KOBJ_CHANGE, "VERITY_DATA_ERROR"}, > + {DM_UEVENT_VERITY_HASH_ERROR, KOBJ_CHANGE, "VERITY_HASH_ERROR"}, > }; > > static struct kmem_cache *_dm_event_cache; > @@ -202,6 +204,44 @@ void dm_path_uevent(enum dm_uevent_type event_type, > struct dm_target *ti, > } > EXPORT_SYMBOL_GPL(dm_path_uevent); > > +void dm_send_verity_uevent(enum dm_uevent_type event_type, struct dm_target > *ti, > + unsigned long long block_nr) > +{ > + struct mapped_device *md = dm_table_get_md(ti->table); > + struct dm_uevent *event = dm_uevent_alloc(md); > + struct gendisk *disk = dm_disk(md); > + > + if (!event) { > + DMERR("%s: dm_uevent_alloc() failed", __func__); > + return; > + } > + > + if (event_type >= ARRAY_SIZE(_dm_uevent_type_names)) { > + DMERR("%s: Invalid event_type %d", __func__, event_type); > + goto out; > + } > + > + if (add_uevent_var(&event->ku_env, "DM_VERITY_BLOCK_NR=%llu", > block_nr)) { > + DMERR("%s: add_uevent_var() for DM_VERITY_BLOCK failed", > __func__); > + goto out; > + } > + > + if (add_uevent_var(&event->ku_env, "DM_ACTION=%s", > + _dm_uevent_type_names[event_type].name)) { > + DMERR("%s: add_uevent_var() for DM_ACTION failed", __func__); > + goto out; > + } > + > + if (kobject_uevent_env(&disk_to_dev(disk)->kobj, event_type, > + event->ku_env.envp)) { > + DMERR("%s: kobject_uevent_env failed", __func__); > + } > + > +out: > + dm_uevent_free(event); > +} > +EXPORT_SYMBOL_GPL(dm_send_verity_uevent); > + > int dm_uevent_init(void) > { > _dm_event_cache = KMEM_CACHE(dm_uevent, 0); > diff --git a/drivers/md/dm-uevent.h b/drivers/md/dm-uevent.h > index 2eccc8b..1c59dba 100644 > --- a/drivers/md/dm-uevent.h > +++ b/drivers/md/dm-uevent.h > @@ -24,6 +24,8 @@ > enum dm_uevent_type { > DM_UEVENT_PATH_FAILED, > DM_UEVENT_PATH_REINSTATED, > + DM_UEVENT_VERITY_DATA_ERROR, > + DM_UEVENT_VERITY_HASH_ERROR, > }; > > #ifdef CONFIG_DM_UEVENT > @@ -34,7 +36,9 @@ extern void dm_send_uevents(struct 
list_head *events, > struct kobject *kobj); > extern void dm_path_uevent(enum dm_uevent_type event_type, > struct dm_target *ti, const char *path, > unsigned nr_valid_paths); > - > +extern void dm_send_verity_uevent(enum dm_uevent_type event_type, > + struct dm_target *ti, > + unsigned long long block_nr); > #else > > static inline int dm_uevent_init(void) > @@ -53,7 +57,11 @@ static inline void dm_path_uevent(enum dm_uevent_type > event_type, > unsigned nr_valid_paths) > { > } > - > +static inline void dm_send_verity_uevent(enum dm_uevent_type event_type, > + struct dm_target *ti, > + unsigned long long block_nr) > +{ > +} > #endif /* CONFIG_DM_UEVENT */ > > #endif /* DM_UEVENT_H */ > diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c > index 68bf5c3..2e
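A minimal sketch of the single generic helper suggested above might look like the following. This is hypothetical code, not part of the posted patch: it reuses dm_uevent_alloc()/dm_uevent_free(), add_uevent_var() and kobject_uevent_env() exactly as the patch does, takes a NULL-terminated list of "KEY=value" strings as in the dm_send_uevent(ti, "DM_VERITY_BLOCK_NR=123", "DM_ACTION=VERITY_DATA_ERROR", NULL) example, and hard-codes KOBJ_CHANGE since both verity event types map to that action.

	#include <stdarg.h>

	void dm_send_uevent(struct dm_target *ti, ...)
	{
		struct mapped_device *md = dm_table_get_md(ti->table);
		struct gendisk *disk = dm_disk(md);
		struct dm_uevent *event = dm_uevent_alloc(md);
		const char *var;
		va_list args;

		if (!event) {
			DMERR("%s: dm_uevent_alloc() failed", __func__);
			return;
		}

		/* append each caller-supplied "KEY=value" string to the environment */
		va_start(args, ti);
		while ((var = va_arg(args, const char *)) != NULL) {
			if (add_uevent_var(&event->ku_env, "%s", var)) {
				DMERR("%s: add_uevent_var() failed", __func__);
				va_end(args);
				goto out;
			}
		}
		va_end(args);

		if (kobject_uevent_env(&disk_to_dev(disk)->kobj, KOBJ_CHANGE,
				       event->ku_env.envp))
			DMERR("%s: kobject_uevent_env failed", __func__);
	out:
		dm_uevent_free(event);
	}
	EXPORT_SYMBOL_GPL(dm_send_uevent);

Both dm-mpath and dm-verity could then build their own event strings and call this one function, keeping the target-specific variable names out of the dm core.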
Re: [dm-devel] [PATCH] dm: Make MIN_IOS, et al, tunable via sysctl.
On Fri, 16 Aug 2013, Frank Mayhar wrote: > The device mapper and some of its modules allocate memory pools at > various points when setting up a device. In some cases, these pools are > fairly large, for example the multipath module allocates a 256-entry > pool and the dm itself allocates three of that size. In a > memory-constrained environment where we're creating a lot of these > devices, the memory use can quickly become significant. Unfortunately, > there's currently no way to change the size of the pools other than by > changing a constant and rebuilding the kernel. > > This patch fixes that by changing the hardcoded MIN_IOS (and certain > other) #defines in dm-crypt, dm-io, dm-mpath, dm-snap and dm itself to > sysctl-modifiable values. This lets us change the size of these pools > on the fly, we can reduce the size of the pools and reduce memory > pressure. > > We tested performance of dm-mpath with smaller MIN_IOS sizes for both dm > and dm-mpath, from a value of 32 all the way down to zero. Bearing in > mind that the underlying devices were network-based, we saw essentially > no performance degradation; if there was any, it was down in the noise. > One might wonder why these sizes are the way they are; I investigated > and they've been unchanged since at least 2006. I think it would be better to set the values to some low value (1 should be enough, there is 16 in some places that is already low enough). There is no need to make it user-configurable, because the user will see no additional benefit from bigger mempools. Maybe multipath is the exception - but other dm targets don't really need big mempools so there is no need to make the size configurable. Mikulas > Signed-off-by: Frank Mayhar > --- > drivers/md/dm-crypt.c | 63 > +++ > drivers/md/dm-io.c| 58 +-- > drivers/md/dm-mpath.c | 48 ++- > drivers/md/dm-snap.c | 45 +++- > drivers/md/dm.c | 41 ++--- > 5 files changed, 244 insertions(+), 11 deletions(-) > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c > index 6d2d41a..d68f10a 100644 > --- a/drivers/md/dm-crypt.c > +++ b/drivers/md/dm-crypt.c > @@ -178,12 +178,55 @@ struct crypt_config { > #define MIN_IOS16 > #define MIN_POOL_PAGES 32 > > +static int sysctl_dm_crypt_min_ios = MIN_IOS; > +static int sysctl_dm_crypt_min_pool_pages = MIN_POOL_PAGES; > + > static struct kmem_cache *_crypt_io_pool; > > static void clone_init(struct dm_crypt_io *, struct bio *); > static void kcryptd_queue_crypt(struct dm_crypt_io *io); > static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request > *dmreq); > > +static struct ctl_table_header *dm_crypt_table_header; > +static ctl_table dm_crypt_ctl_table[] = { > + { > + .procname = "min_ios", > + .data = &sysctl_dm_crypt_min_ios, > + .maxlen = sizeof(int), > + .mode = S_IRUGO|S_IWUSR, > + .proc_handler = proc_dointvec, > + }, > + { > + .procname = "min_pool_pages", > + .data = &sysctl_dm_crypt_min_pool_pages, > + .maxlen = sizeof(int), > + .mode = S_IRUGO|S_IWUSR, > + .proc_handler = proc_dointvec, > + }, > + { } > +}; > + > +static ctl_table dm_crypt_dir_table[] = { > + { .procname = "crypt", > + .mode = 0555, > + .child= dm_crypt_ctl_table }, > + { } > +}; > + > +static ctl_table dm_crypt_dm_dir_table[] = { > + { .procname = "dm", > + .mode = 0555, > + .child= dm_crypt_dir_table }, > + { } > +}; > + > +static ctl_table dm_crypt_root_table[] = { > + { .procname = "dev", > + .mode = 0555, > + .child= dm_crypt_dm_dir_table }, > + { } > +}; > + > static struct crypt_cpu *this_crypt_config(struct crypt_config *cc) > { > return 
this_cpu_ptr(cc->cpu); > @@ -1571,7 +1614,8 @@ static int crypt_ctr(struct dm_target *ti, unsigned int > argc, char **argv) > goto bad; > > ret = -ENOMEM; > - cc->io_pool = mempool_create_slab_pool(MIN_IOS, _crypt_io_pool); > + cc->io_pool = mempool_create_slab_pool(sysctl_dm_crypt_min_ios, > +_crypt_io_pool); > if (!cc->io_pool) { > ti->error = "Cannot allocate crypt io mempool"; > goto bad; > @@ -1583,20 +1627,22 @@ static int crypt_ctr(struct dm_target *ti, unsigned > int argc, char **argv) > cc->dmreq_start += crypto_ablkcipher_alignmask(any_tfm(cc)) & > ~(crypto_tfm_ctx_alignment() - 1); > > - cc->req_pool = mempool_create_kmalloc_pool(MIN_IOS, cc->dmreq_start + > + cc->req_pool = mempool_create_
Re: [dm-devel] dm: Make MIN_IOS, et al, tunable via sysctl.
On Mon, 19 Aug 2013, Mike Snitzer wrote: > On Fri, Aug 16 2013 at 6:55pm -0400, > Frank Mayhar wrote: > > > The device mapper and some of its modules allocate memory pools at > > various points when setting up a device. In some cases, these pools are > > fairly large, for example the multipath module allocates a 256-entry > > pool and the dm itself allocates three of that size. In a > > memory-constrained environment where we're creating a lot of these > > devices, the memory use can quickly become significant. Unfortunately, > > there's currently no way to change the size of the pools other than by > > changing a constant and rebuilding the kernel. > > > > This patch fixes that by changing the hardcoded MIN_IOS (and certain > > other) #defines in dm-crypt, dm-io, dm-mpath, dm-snap and dm itself to > > sysctl-modifiable values. This lets us change the size of these pools > > on the fly, we can reduce the size of the pools and reduce memory > > pressure. > > These memory reserves are a long-standing issue with DM (made worse when > request-based mpath was introduced). Two years ago, I assembled a patch > series that took one approach to trying to fix it: > http://people.redhat.com/msnitzer/patches/upstream/dm-rq-based-mempool-sharing/series.html > > But in the end I wasn't convinced sharing the memory reserve would allow > for 100s of mpath devices to make forward progress if memory is > depleted. > > All said, I think adding the ability to control the size of the memory > reserves is reasonable. It allows for informed admins to establish > lower reserves (based on the awareness that rq-based mpath doesn't need > to support really large IOs, etc) without compromising the ability to > make forward progress. > > But, as mentioned in my porevious mail, I'd like to see this implemnted > in terms of module_param_named(). > > > We tested performance of dm-mpath with smaller MIN_IOS sizes for both dm > > and dm-mpath, from a value of 32 all the way down to zero. > > Bio-based can safely be reduced, as this older (uncommitted) patch did: > http://people.redhat.com/msnitzer/patches/upstream/dm-rq-based-mempool-sharing/-dm-lower-bio-based-reservation.patch > > > Bearing in mind that the underlying devices were network-based, we saw > > essentially no performance degradation; if there was any, it was down > > in the noise. One might wonder why these sizes are the way they are; > > I investigated and they've been unchanged since at least 2006. > > Performance isn't the concern. The concern is: does DM allow for > forward progress if the system's memory is completely exhausted? There is one possible deadlock that was introduced in commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22-rc1. Unfortunatelly, no one found that bug at that time and now it seems to be hard to revert that. The problem is this: * we send bio1 to the device dm-1, device mapper splits it to bio2 and bio3 and sends both of them to the device dm-2. These two bios are added to current->bio_list. * bio2 is popped off current->bio_list, a mempool entry from device dm-2's mempool is allocated, bio4 is created and sent to the device dm-3. bio4 is added to the end of current->bio_list. * bio3 is popped off current->bio_list, a mempool entry from device dm-2's mempool is allocated. Suppose that the mempool is exhausted, so we wait until some existing work (bio2) finishes and returns the entry to the mempool. So: bio3's request routine waits until bio2 finishes and refills the mempool. bio2 is waiting for bio4 to finish. 
bio4 is in current->bio_list and is waiting until bio3's request routine finishes. Deadlock. In practice, it is not so serious because in mempool_alloc there is: /* * FIXME: this should be io_schedule(). The timeout is there as a * workaround for some DM problems in 2.6.18. */ io_schedule_timeout(5*HZ); - so it waits for 5 seconds and retries. If there is something in the system that is able to free memory, it resumes. > This is why request-based has such an extensive reserve, because it > needs to account for cloning the largest possible request that comes in > (with multiple bios). Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
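As a concrete illustration of the module_param_named() approach mentioned above, the per-module sysctl tables in the posted patch could shrink to something like this sketch (variable and parameter names are illustrative only, not an agreed-on interface); each value would then appear as a writable file under /sys/module/dm_crypt/parameters/:

	static unsigned dm_crypt_min_ios = MIN_IOS;
	module_param_named(min_ios, dm_crypt_min_ios, uint, S_IRUGO | S_IWUSR);
	MODULE_PARM_DESC(min_ios, "Reserved number of dm_crypt_io structures per device");

	static unsigned dm_crypt_min_pool_pages = MIN_POOL_PAGES;
	module_param_named(min_pool_pages, dm_crypt_min_pool_pages, uint, S_IRUGO | S_IWUSR);
	MODULE_PARM_DESC(min_pool_pages, "Reserved number of pages per device");

This replaces the four nested ctl_table arrays and the sysctl registration boilerplate with a couple of declarations per value.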
Re: [dm-devel] dm: Make MIN_IOS, et al, tunable via sysctl.
On Mon, 19 Aug 2013, Frank Mayhar wrote: > On Mon, 2013-08-19 at 10:00 -0400, Mike Snitzer wrote: > > Performance isn't the concern. The concern is: does DM allow for > > forward progress if the system's memory is completely exhausted? > > > > This is why request-based has such an extensive reserve, because it > > needs to account for cloning the largest possible request that comes in > > (with multiple bios). > > Thanks for the response. In our particular case, I/O will be file > system based and over a network, which makes it pretty easy for us to be > sure that large I/Os never happen. That notwithstanding, however, as > you said it just seems reasonable to make these values configurable. > > I'm also looking at making some similar constants in dm-verity and > dm-bufio configurable in the same way and for similar reasons. Regarding dm-bufio: the user of dm-bufio sets the pool size as an argument in dm_bufio_client_create. There is no need to make it configurable - if the user selects too low value, deadlock is possible, if the user selects too high value, there is no additional advantage. Regarding dm-verity: the mempool size is 4, there is no need to make it bigger, there is no advantage from that. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
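For context, the reserve Mikulas refers to is the reserved-buffer count passed to dm_bufio_client_create(); dm-verity's constructor passes 1 there. Approximately (argument order as in the 3.x dm-bufio interface - treat the details as an assumption to check against drivers/md/dm-bufio.h):

	v->bufio = dm_bufio_client_create(v->hash_dev->bdev,
					  1 << v->hash_dev_block_bits,
					  1,	/* reserved buffers - the only "pool size" knob */
					  sizeof(struct buffer_aux),
					  dm_bufio_alloc_callback,
					  dm_bufio_write_callback);

So a dm-bufio user that wants a smaller reserve simply passes a smaller number at client creation time; there is nothing to tune at run time.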
Re: [dm-devel] [PATCH] dm: Make MIN_IOS, et al, tunable via sysctl.
On Tue, 20 Aug 2013, Frank Mayhar wrote: > On Tue, 2013-08-20 at 17:22 -0400, Mikulas Patocka wrote: > > On Fri, 16 Aug 2013, Frank Mayhar wrote: > > > The device mapper and some of its modules allocate memory pools at > > > various points when setting up a device. In some cases, these pools are > > > fairly large, for example the multipath module allocates a 256-entry > > > pool and the dm itself allocates three of that size. In a > > > memory-constrained environment where we're creating a lot of these > > > devices, the memory use can quickly become significant. Unfortunately, > > > there's currently no way to change the size of the pools other than by > > > changing a constant and rebuilding the kernel. > > I think it would be better to set the values to some low value (1 should > > be enough, there is 16 in some places that is already low enough). There > > is no need to make it user-configurable, because the user will see no > > additional benefit from bigger mempools. > > > > Maybe multipath is the exception - but other dm targets don't really need > > big mempools so there is no need to make the size configurable. > > The point is not to make the mempools bigger, the point is to be able to > make them _smaller_ for memory-constrained environments. In some cases, > even 16 can be too big, especially when creating a large number of > devices. > > In any event, it seems unfortunate to me that these values are > hard-coded. One shouldn't have to rebuild the kernel to change them, > IMHO. So make a patch that changes 16 to 1 if it helps on your system (I'd like to ask - what is the configuration where this kind of change helps?). There is no need for that to be tunable, anyone can live with mempool size being 1. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dm: Make MIN_IOS, et al, tunable via sysctl.
On Tue, 20 Aug 2013, Mike Snitzer wrote: > Mikulas' point is that you cannot reduce the size to smaller than 1. > And aside from rq-based DM, 1 is sufficient to allow for forward > progress even when memory is completely consumed. > > A patch that simply changes them to 1 but makes the rq-based DM > mempool's size configurable should actually be fine. Yes, I think it would be viable. Mikulas > Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH v2] dm ioctl: allow change device target type to error
On Wed, 21 Aug 2013, Joe Jin wrote: > commit a5664da "dm ioctl: make bio or request based device type immutable" > prevented "dmsetup wape_table" change the target type to "error". That commit a5664da is there for a reason (it is not possible to change bio-based device to request-based and vice versa) and I don't really see how this patch is supposed to work. If there are bios that are in flight and that already passed through blk_queue_bio, and you change the device from request-based to bio-based, what are you going to do with them? - The patch doesn't do anything about it. A better approach would be to create a new request-based target "error-rq" and change the multipath target to "error-rq" target. That way, you don't have to change device type from request based to bio based. Mikulas > -v2: setup md->queue even target type is "error". > > Signed-off-by: Joe Jin > --- > drivers/md/dm-ioctl.c | 4 > drivers/md/dm-table.c | 12 > drivers/md/dm.h | 1 + > 3 files changed, 17 insertions(+) > > diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c > index f1b7586..2a9b63d 100644 > --- a/drivers/md/dm-ioctl.c > +++ b/drivers/md/dm-ioctl.c > @@ -1280,6 +1280,9 @@ static int table_load(struct dm_ioctl *param, size_t > param_size) > goto out; > } > > + if (dm_is_error_target(t)) > + goto error_target; > + > /* Protect md->type and md->queue against concurrent table loads. */ > dm_lock_md_type(md); > if (dm_get_md_type(md) == DM_TYPE_NONE) > @@ -1293,6 +1296,7 @@ static int table_load(struct dm_ioctl *param, size_t > param_size) > goto out; > } > > +error_target: > /* setup md->queue to reflect md's type (may block) */ > r = dm_setup_md_queue(md); > if (r) { > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c > index f221812..27be46a 100644 > --- a/drivers/md/dm-table.c > +++ b/drivers/md/dm-table.c > @@ -184,6 +184,18 @@ static int alloc_targets(struct dm_table *t, unsigned > int num) > return 0; > } > > +bool dm_is_error_target(struct dm_table *t) > +{ > + unsigned i; > + > + for (i = 0; i < t->num_targets; i++) { > + struct dm_target *tgt = t->targets + i; > + if (strcmp(tgt->type->name, "error") == 0) > + return true; > + } > + return false; > +} > + > int dm_table_create(struct dm_table **result, fmode_t mode, > unsigned num_targets, struct mapped_device *md) > { > diff --git a/drivers/md/dm.h b/drivers/md/dm.h > index 45b97da..c7bceeb 100644 > --- a/drivers/md/dm.h > +++ b/drivers/md/dm.h > @@ -69,6 +69,7 @@ unsigned dm_table_get_type(struct dm_table *t); > struct target_type *dm_table_get_immutable_target_type(struct dm_table *t); > bool dm_table_request_based(struct dm_table *t); > bool dm_table_supports_discards(struct dm_table *t); > +bool dm_is_error_target(struct dm_table *t); > int dm_table_alloc_md_mempools(struct dm_table *t); > void dm_table_free_md_mempools(struct dm_table *t); > struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t); > -- > 1.8.3.1 > > -- > dm-devel mailing list > dm-de...@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
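To make the "error-rq" suggestion concrete, a minimal request-based error target could look roughly like the sketch below. This is hypothetical and untested, written against the 3.x map_rq interface, where returning a negative errno from the map function makes dm core kill the unmapped clone with that error:

	static int error_rq_ctr(struct dm_target *ti, unsigned argc, char **argv)
	{
		return 0;
	}

	static int error_rq_map_rq(struct dm_target *ti, struct request *clone,
				   union map_info *map_context)
	{
		/* fail every request without touching any underlying device */
		return -EIO;
	}

	static struct target_type error_rq_target = {
		.name    = "error-rq",
		.version = {1, 0, 0},
		.module  = THIS_MODULE,
		.ctr     = error_rq_ctr,
		.map_rq  = error_rq_map_rq,
	};

After dm_register_target(&error_rq_target), a table containing this target should still be treated as request-based, so wiping a multipath device's table to it would not require changing the device type at all - which is the point of the suggestion.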
Re: [PATCH] md: dm-verity: Fix to avoid a deadlock in dm-bufio
On Mon, 4 Mar 2013, Paul Taysom wrote: > On Mon, Mar 4, 2013 at 9:42 AM, Alasdair G Kergon wrote: > > On Mon, Mar 04, 2013 at 08:45:48AM -0800, Paul Taysom wrote: > >> @@ -449,8 +468,14 @@ static void verity_prefetch_io(struct dm_verity *v, > >> struct dm_verity_io *io) > >> hash_block_end = v->hash_blocks - 1; > >> } > >> no_prefetch_cluster: > >> - dm_bufio_prefetch(v->bufio, hash_block_start, > >> - hash_block_end - hash_block_start + 1); > >> + vw = kmalloc(sizeof(*vw), GFP_KERNEL); > > > > kmalloc? mempool? ... > > > > Alasdair > > > The use of mempool would be a separate patch that would have to be > measured for performance impact. > -Paul > You don't have to use mempool. Just avoid prefetching if there is not enough memory for the prefetch structure. I reworked the patch, is uses an allocation that can fail and it generates just one workqueue entry for one request (the original patch generated one workqueue entry for hash tree level). Please test the patch and if it works and performs well, let's submit it. Mikulas --- Changed the dm-verity prefetching to use a worker thread to avoid a deadlock in dm-bufio. If generic_make_request is called recursively, it queues the I/O request on the current->bio_list without making the I/O request and returns. The routine making the recursive call cannot wait for the I/O to complete. The deadlock occurred when one thread grabbed the bufio_client mutex and waited for an I/O to complete but the I/O was queued on another thread's current->bio_list and it was waiting to get the mutex held by the first thread. The fix allows only one I/O request from dm-verity to dm-bufio per thread. To do this, the prefetch requests were queued on worker threads. In addition to avoiding the deadlock, this fix made a slight improvement in performance. seconds_kernel_to_login: with prefetch:8.43s without prefetch: 9.2s worker prefetch: 8.28s Signed-off-by: Paul Taysom Signed-off-by: Mikulas Patocka Cc: sta...@kernel.org --- drivers/md/dm-bufio.c |2 ++ drivers/md/dm-verity.c | 37 + 2 files changed, 35 insertions(+), 4 deletions(-) Index: linux-3.8-fast/drivers/md/dm-verity.c === --- linux-3.8-fast.orig/drivers/md/dm-verity.c 2013-03-04 22:49:20.0 +0100 +++ linux-3.8-fast/drivers/md/dm-verity.c 2013-03-04 23:10:45.0 +0100 @@ -93,6 +93,13 @@ struct dm_verity_io { */ }; +struct dm_verity_prefetch_work { + struct work_struct work; + struct dm_verity *v; + sector_t block; + unsigned n_blocks; +}; + static struct shash_desc *io_hash_desc(struct dm_verity *v, struct dm_verity_io *io) { return (struct shash_desc *)(io + 1); @@ -424,15 +431,18 @@ static void verity_end_io(struct bio *bi * The root buffer is not prefetched, it is assumed that it will be cached * all the time. 
*/ -static void verity_prefetch_io(struct dm_verity *v, struct dm_verity_io *io) +static void verity_prefetch_io(struct work_struct *work) { + struct dm_verity_prefetch_work *pw = + container_of(work, struct dm_verity_prefetch_work, work); + struct dm_verity *v = pw->v; int i; for (i = v->levels - 2; i >= 0; i--) { sector_t hash_block_start; sector_t hash_block_end; - verity_hash_at_level(v, io->block, i, &hash_block_start, NULL); - verity_hash_at_level(v, io->block + io->n_blocks - 1, i, &hash_block_end, NULL); + verity_hash_at_level(v, pw->block, i, &hash_block_start, NULL); + verity_hash_at_level(v, pw->block + pw->n_blocks - 1, i, &hash_block_end, NULL); if (!i) { unsigned cluster = ACCESS_ONCE(dm_verity_prefetch_cluster); @@ -452,6 +462,25 @@ no_prefetch_cluster: dm_bufio_prefetch(v->bufio, hash_block_start, hash_block_end - hash_block_start + 1); } + + kfree(pw); +} + +static void verity_submit_prefetch(struct dm_verity *v, struct dm_verity_io *io) +{ + struct dm_verity_prefetch_work *pw; + + pw = kmalloc(sizeof(struct dm_verity_prefetch_work), + GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); + + if (!pw) + return; + + INIT_WORK(&pw->work, verity_prefetch_io); + pw->v = v; + pw->block = io->block; + pw->n_blocks = io->n_blocks; + queue_work(v->verify_wq, &pw->work); } /* @@ -498,7 +527,7 @@ static int verity_map(struct dm_target * m
dm-crypt parallelization patches
Hi I placed the dm-crypt parallization patches at: http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/ The patches paralellize dm-crypt and make it possible to use all processor cores. The patch dm-crypt-remove-percpu.patch removes some percpu variables and replaces them with per-request variables. The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the encryption workqueue, allowing the encryption to be distributed to all CPUs in the system. The patch dm-crypt-offload-writes-to-thread.patch moves submission of all write requests to a single thread. The patch dm-crypt-sort-requests.patch sorts write requests submitted by a single thread. The requests are sorted according to the sector number, rb-tree is used for efficient sorting. Some usage notes: * turn off automatic cpu frequency scaling (or set it to "performance" governor) - cpufreq doesn't recognize encryption workload correctly, sometimes it underclocks all the CPU cores when there is some encryption work to do, resulting in bad performance * when using filesystem on encrypted dm-crypt device, reduce maximum request size with "/sys/block/dm-2/queue/max_sectors_kb" (substitute "dm-2" with the real name of your dm-crypt device). Note that having too big requests means that there is a small number of requests and they cannot be distributed to all available processors in parallel - it results in worse performance. Having too small requests results in high request overhead and also reduced performance. So you must find the optimal request size for your system and workload. For me, when testing this on ramdisk, the optimal is 8KiB. --- Now, the problem with I/O scheduler: when doing performance testing, it turns out that the parallel version is sometimes worse than the previous implementation. When I create a 4.3GiB dm-crypt device on the top of dm-loop on the top of ext2 filesystem on 15k SCSI disk and run this command time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 --name=job12 the results are this: CFQ scheduler: -- no patches: 21.9s patch 1: 21.7s patches 1,2: 2:33s patches 1,2 (+ nr_requests = 128) 2:18s patches 1,2,3: 20.7s patches 1,2,3,4: 20.7s deadline scheduler: --- no patches: 27.4s patch 1: 27.4s patches 1,2: 27.8s patches 1,2,3: 29.6s patches 1,2,3,4: 29.6s We can see that CFQ performs badly with the patch 2, but improves with the patch 3. All that patch 3 does is that it moves write requests from encryption threads to a separate thread. So it seems that CFQ has some deficiency that it cannot merge adjacent requests done by different processes. The problem is this: - we have 256k write direct-i/o request - it is broken to 4k bios (because we run on dm-loop on a filesystem with 4k block size) - encryption of these 4k bios is distributed to 12 processes on a 12-core machine - encryption finishes out of order and in different processes, 4k bios with encrypted data are submitted to CFQ - CFQ doesn't merge them - the disk is flooded with random 4k write requests, and performs much worse than with 256k requests Increasing nr_requests to 128 helps a little, but not much - it is still order of magnitute slower. I'd like to ask if someone who knows the CFQ scheduler (Jens?) could look at it and find out why it doesn't merge requests from different processes. 
Why do I have to do a seemingly senseless operation (hand over write requests to a separate thread) in patch 3 to improve performance? Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
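For reference, the sorting in dm-crypt-sort-requests.patch amounts to something like the sketch below (names are hypothetical, the real code is in the patch at the URL above): every encrypted write bio is inserted into an rb-tree keyed by bi_sector, and the single write-offload thread pops the lowest sector first, so the bios reach the block layer in ascending order even though encryption finishes out of order.

	#include <linux/bio.h>
	#include <linux/rbtree.h>

	struct sorted_write {
		struct rb_node node;
		struct bio *bio;
	};

	static void write_tree_insert(struct rb_root *root, struct sorted_write *w)
	{
		struct rb_node **p = &root->rb_node, *parent = NULL;

		while (*p) {
			struct sorted_write *s;

			parent = *p;
			s = rb_entry(parent, struct sorted_write, node);
			if (w->bio->bi_sector < s->bio->bi_sector)
				p = &(*p)->rb_left;
			else
				p = &(*p)->rb_right;
		}
		rb_link_node(&w->node, parent, p);
		rb_insert_color(&w->node, root);
	}

	/* the offload thread calls this repeatedly and submits the bios in order */
	static struct bio *write_tree_pop(struct rb_root *root)
	{
		struct rb_node *n = rb_first(root);
		struct sorted_write *w;

		if (!n)
			return NULL;
		w = rb_entry(n, struct sorted_write, node);
		rb_erase(n, root);
		return w->bio;
	}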
Re: dm-crypt parallelization patches
On Tue, 9 Apr 2013, Tejun Heo wrote: > On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote: > > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a > > single thread. The requests are sorted according to the sector number, > > rb-tree is used for efficient sorting. > > Hmmm? Why not just keep the issuing order along with plugging > boundaries? What do you mean? I used to have a patch that keeps order of requests as they were introduced, but sorting the requests according to sector number is a bit simpler. > > So it seems that CFQ has some deficiency that it cannot merge adjacent > > requests done by different processes. > > As I wrote before, please use bio_associate_current(). Currently, > dm-crypt is completely messing up all the context information that cfq > depends on to schedule IOs. Of course, it doesn't perform well. bio_associate_current() is only valid on a system with cgroups and there are no cgroups on the kernel where I tested it. It is an empty function: static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } Mikulas > Thanks. > > -- > tejun > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] dm-crypt performance
On Tue, 26 Mar 2013, Milan Broz wrote: > - Are we sure we are not inroducing some another side channel in disc > encryption? (Unprivileged user can measure timing here). > (Perhaps stupid reason but please do not prefer performance to security > in encryption. Enough we have timing attacks for AES implementations...) So use Serpent - it is implemented without any data-dependent lookup tables, so it has no timing attacks. AES uses data-dependent lookup tables; on a CPU with hyperthreading, the second thread can observe the L1 cache footprint left by the first thread and get some information about the data being encrypted... Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dm-crypt parallelization patches
On Tue, 9 Apr 2013, Tejun Heo wrote: > Hey, > > On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > > Hmmm? Why not just keep the issuing order along with plugging > > > boundaries? > > > > What do you mean? > > > > I used to have a patch that keeps order of requests as they were > > introduced, but sorting the requests according to sector number is a bit > > simpler. > > You're still destroying the context information. Please just keep the > issuing order along with plugging boundaries. > > > > As I wrote before, please use bio_associate_current(). Currently, > > > dm-crypt is completely messing up all the context information that cfq > > > depends on to schedule IOs. Of course, it doesn't perform well. > > > > bio_associate_current() is only valid on a system with cgroups and there > > are no cgroups on the kernel where I tested it. It is an empty function: > > > > static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } > > Yeah, because blkcg was the only user. Please feel free to drop the > ifdefs. It covers both iocontext and cgroup association. > > Thanks. If I drop ifdefs, it doesn't compile (because other cgroup stuff it missing). So I enabled bio cgroups. bio_associate_current can't be used, because by the time we allocate the outgoing write bio, we are no longer in the process that submitted the original bio. Anyway, I tried to reproduce in dm-crypt what bio_associate_current does - in the submitting process I record "ioc" and "css" fields in "dm_crypt_io" structure and set these fields on all outgoing bios. It has no effect on performance, it is as bad as if I hadn't done it. Mikulas --- (this is the patch that I used, to be applied after dm-crypt-unbound-workqueue.patch) --- drivers/md/dm-crypt.c | 41 + 1 file changed, 41 insertions(+) Index: linux-3.8.6-fast/drivers/md/dm-crypt.c === --- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c 2013-04-09 20:32:41.0 +0200 +++ linux-3.8.6-fast/drivers/md/dm-crypt.c 2013-04-09 21:29:12.0 +0200 @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -60,6 +61,9 @@ struct dm_crypt_io { int error; sector_t sector; struct dm_crypt_io *base_io; + + struct io_context *ioc; + struct cgroup_subsys_state *css; }; struct dm_crypt_request { @@ -797,6 +801,14 @@ static struct bio *crypt_alloc_buffer(st if (!clone) return NULL; + if (unlikely(io->base_io != NULL)) { + clone->bi_ioc = io->base_io->ioc; + clone->bi_css = io->base_io->css; + } else { + clone->bi_ioc = io->ioc; + clone->bi_css = io->css; + } + clone_init(io, clone); *out_of_pages = 0; @@ -859,6 +871,9 @@ static struct dm_crypt_io *crypt_io_allo io->ctx.req = NULL; atomic_set(&io->io_pending, 0); + io->ioc = NULL; + io->css = NULL; + return io; } @@ -884,6 +899,14 @@ static void crypt_dec_pending(struct dm_ if (io->ctx.req) mempool_free(io->ctx.req, cc->req_pool); + + if (io->ioc) { + put_io_context(io->ioc); + } + if (io->css) { + css_put(io->css); + } + mempool_free(io, cc->io_pool); if (likely(!base_io)) @@ -927,6 +950,9 @@ static void crypt_endio(struct bio *clon if (rw == WRITE) crypt_free_buffer_pages(cc, clone); + clone->bi_ioc = NULL; + clone->bi_css = NULL; + bio_put(clone); if (rw == READ && !error) { @@ -1658,6 +1684,21 @@ static int crypt_map(struct dm_target *t io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector)); + if (current->io_context) { + struct io_context *ioc = current->io_context; + struct cgroup_subsys_state *css; + + get_io_context_active(ioc); + io->ioc = ioc; + + /* associate blkcg if 
exists */ + rcu_read_lock(); + css = task_subsys_state(current, blkio_subsys_id); + if (css && css_tryget(css)) + io->css = css; + rcu_read_unlock(); + } + if (bio_data_dir(io->base_bio) == READ) { if (kcryptd_io_read(io, GFP_NOWAIT)) kcryptd_queue_io(io); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dm-crypt parallelization patches
On Tue, 9 Apr 2013, Tejun Heo wrote: > On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote: > > If I drop ifdefs, it doesn't compile (because other cgroup stuff it > > missing). > > > > So I enabled bio cgroups. > > > > bio_associate_current can't be used, because by the time we allocate the > > outgoing write bio, we are no longer in the process that submitted the > > original bio. > > Oh, I suppose it'd need some massaging to selectively turn off the > cgroup part. > > > Anyway, I tried to reproduce in dm-crypt what bio_associate_current does - > > and we probably need to change that to bio_associate_task(). Generally, we shouldn't associate bios with "current" task in device mapper targets. For example suppose that we have two stacked dm-crypt targets: In the "current" process pointer in lower dm-crypt target's request function always points to the workqueue of the upper dm-crypt target that submits the bios. So if we associate the bio with "current" in the lower target, we are associating it with a preallocated workqueue and we already lost the information who submitted it. You should associate a bio with a task when you create the bio and "md" and "dm" midlayers should just forward this association to lower layer bios. > > in the submitting process I record "ioc" and "css" fields in "dm_crypt_io" > > structure and set these fields on all outgoing bios. It has no effect on > > performance, it is as bad as if I hadn't done it. > > A good way to verify that the tagging is correct would be configuring > io limits in block cgroup and see whether the limits are correctly > applied when going through dm-crypt (please test with direct-io or > reads, writeback is horribly broken, sorry).working correctly, maybe > plugging is the overriding factor? > > Thanks. It doesn't work because device mapper on the underlying layers ignores bi_ioc and bi_css. If I make device mapper forward bi_ioc and bi_css to outgoing bios, it improves performance (from 2:30 to 1:30), but it is still far from perfect. Mikulas --- dm: forward cgroup context This patch makes dm forward associated cgroup context to cloned bios. Signed-off-by: Mikulas Patocka --- drivers/md/dm.c |9 + fs/bio.c|2 ++ 2 files changed, 11 insertions(+) Index: linux-3.8.6-fast/drivers/md/dm.c === --- linux-3.8.6-fast.orig/drivers/md/dm.c 2013-04-09 22:00:36.0 +0200 +++ linux-3.8.6-fast/drivers/md/dm.c2013-04-09 22:19:40.0 +0200 @@ -453,6 +453,10 @@ static void free_io(struct mapped_device static void free_tio(struct mapped_device *md, struct dm_target_io *tio) { +#ifdef CONFIG_BLK_CGROUP + tio->clone.bi_ioc = NULL; + tio->clone.bi_css = NULL; +#endif bio_put(&tio->clone); } @@ -1124,6 +1128,11 @@ static struct dm_target_io *alloc_tio(st clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs); tio = container_of(clone, struct dm_target_io, clone); +#ifdef CONFIG_BLK_CGROUP + tio->clone.bi_ioc = ci->bio->bi_ioc; + tio->clone.bi_css = ci->bio->bi_css; +#endif + tio->io = ci->io; tio->ti = ti; memset(&tio->info, 0, sizeof(tio->info)); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dm-crypt parallelization patches
On Tue, 9 Apr 2013, Vivek Goyal wrote: > On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: > > [..] > > Generally, we shouldn't associate bios with "current" task in device > > mapper targets. For example suppose that we have two stacked dm-crypt > > targets: > > > > In the "current" process pointer in lower dm-crypt target's request > > function always points to the workqueue of the upper dm-crypt target that > > submits the bios. So if we associate the bio with "current" in the lower > > target, we are associating it with a preallocated workqueue and we already > > lost the information who submitted it. > > > > You should associate a bio with a task when you create the bio and "md" > > and "dm" midlayers should just forward this association to lower layer > > bios. > > bio_associate_current() return -EBUSY if bio has already been associated > with an io context. > > So in a stack if every driver calls bio_associate_current(), upon bio > submission, it will automatically amke sure bio gets associated with > submitter task in top level device and calls by lower level device > will be ignored. The stacking drivers do not pass the same bio to each other. The stacking driver receives a bio, allocates zero or more new bios and sends these new bios to the lower layer. So you need to propagate ownership from the received bio to newly allocated bios, you don't want to associate newly allocated bio with "current" process. > Lower level devices I think just need to make sure this context > info is propogated to cloned bios. > > > [..] > > +#ifdef CONFIG_BLK_CGROUP > > + tio->clone.bi_ioc = ci->bio->bi_ioc; > > + tio->clone.bi_css = ci->bio->bi_css; > > You also need to take references to ioc and css objects. I guess a helper > function will be better. May be something like. The lifetime of the "tio" structure is shorter than the lifetime of "ci->bio". So we don't need to increment reference. We only need to increment reference if we copy ownership to a new bio that has could have longer lifetime than the original bio. But this situation is very rare - in most stacking drivers the newly allocated bio has shorter lifetime than the original one. > bio_associate_bio_context(bio1, bio2) > > And this initialize bio2's context with bio1's context. Yes, that would be ok. > Thanks > Vivek Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
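The helper proposed above could look roughly like this sketch; it is simply the reference-taking copy discussed in the rest of the thread, using get_io_context_active() and css_tryget(), with bio_disassociate_task() dropping both references when the new bio completes:

	#ifdef CONFIG_BLK_CGROUP
	/* initialize dst's io context and blkcg association from src, taking references */
	static void bio_associate_bio_context(struct bio *dst, struct bio *src)
	{
		if (dst->bi_ioc || !src->bi_ioc)
			return;

		get_io_context_active(src->bi_ioc);
		dst->bi_ioc = src->bi_ioc;

		if (src->bi_css && css_tryget(src->bi_css))
			dst->bi_css = src->bi_css;
	}
	#endif

Whether taking those references on every clone is worth the cost is exactly what the following messages argue about.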
[PATCH] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Wed, 10 Apr 2013, Vivek Goyal wrote: > On Tue, Apr 09, 2013 at 05:18:25PM -0400, Mikulas Patocka wrote: > > [..] > > > bio_associate_current() return -EBUSY if bio has already been associated > > > with an io context. > > > > > > So in a stack if every driver calls bio_associate_current(), upon bio > > > submission, it will automatically amke sure bio gets associated with > > > submitter task in top level device and calls by lower level device > > > will be ignored. > > > > The stacking drivers do not pass the same bio to each other. > > > > The stacking driver receives a bio, allocates zero or more new bios and > > sends these new bios to the lower layer. So you need to propagate > > ownership from the received bio to newly allocated bios, you don't want to > > associate newly allocated bio with "current" process. > > Actually I was asking to call bio_associate_current() for the incoming > bio in the driver and not for the newly created bios by the driver. Yes, I think it's better to call it in the driver than in the upper layer, because if the driver doesn't forward the bio to a worker thread, we don't have to call bio_associate_current() and we save a few atomic instructions. > For any newly created bios on behalf of this incoming bio, we need to > copy the context so that this context info can be propogated down the > stack. See this patch. It implements cgroup associations for dm core and dm-crypt target. Do you think the interface is correct? (i.e. can I start modifying more dm dm-targets to use it?) > > We only need to increment reference if we copy ownership to a new bio that > > has could have longer lifetime than the original bio. But this situation > > is very rare - in most stacking drivers the newly allocated bio has > > shorter lifetime than the original one. > > I think it is not a good idea to rely on the fact that cloned bios or > newly created bios will have shorter lifetime than the original bio. In fact > when the bio completes and you free it up bio_disassociate_task() will try > to put io context and blkcg reference. So it is important to take > reference if you are copying context info in any newly created bio. We clear the associate with bio_clear_context before freeing the bio. > Thanks > Vivek --- dm: retain cgroup context This patch makes dm and dm-crypt target retain cgroup context. New functions bio_clone_context and bio_clear_context are introduced. bio_associate_current and bio_disassociate_task are exported to modules. dm core is changed so that it copies the context to cloned bio. dm associates the bio with current process if it is going to offload bio to a thread. dm-crypt copies the context to outgoing bios and associates the bio with current process. 
Signed-off-by: Mikulas Patocka --- drivers/md/dm-crypt.c | 17 ++--- drivers/md/dm.c |5 + fs/bio.c |2 ++ include/linux/bio.h | 39 +++ 4 files changed, 60 insertions(+), 3 deletions(-) Index: linux-3.8.6-fast/drivers/md/dm.c === --- linux-3.8.6-fast.orig/drivers/md/dm.c 2013-04-10 14:39:28.0 +0200 +++ linux-3.8.6-fast/drivers/md/dm.c2013-04-10 20:09:52.0 +0200 @@ -453,6 +453,7 @@ static void free_io(struct mapped_device static void free_tio(struct mapped_device *md, struct dm_target_io *tio) { + bio_clear_context(&tio->clone); bio_put(&tio->clone); } @@ -521,6 +522,8 @@ static void queue_io(struct mapped_devic { unsigned long flags; + bio_associate_current(bio); + spin_lock_irqsave(&md->deferred_lock, flags); bio_list_add(&md->deferred, bio); spin_unlock_irqrestore(&md->deferred_lock, flags); @@ -1124,6 +1127,8 @@ static struct dm_target_io *alloc_tio(st clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs); tio = container_of(clone, struct dm_target_io, clone); + bio_clone_context(ci->bio, &tio->clone); + tio->io = ci->io; tio->ti = ti; memset(&tio->info, 0, sizeof(tio->info)); Index: linux-3.8.6-fast/include/linux/bio.h === --- linux-3.8.6-fast.orig/include/linux/bio.h 2013-04-10 14:38:56.0 +0200 +++ linux-3.8.6-fast/include/linux/bio.h2013-04-10 20:14:08.0 +0200 @@ -291,6 +291,15 @@ extern void bvec_free_bs(struct bio_set extern unsigned int bvec_nr_vecs(unsigned short idx); #ifdef CONFIG_BLK_CGROUP +/* + * bio_associate_current associates the bio with the current process. It must be + * called by any block device driver
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Wed, 10 Apr 2013, Tejun Heo wrote: > On Wed, Apr 10, 2013 at 07:42:59PM -0400, Mikulas Patocka wrote: > > /* > > + * bio_clone_context copies cgroup context from the original bio to the > > new bio. > > + * It is used by bio midlayer drivers that create new bio based on an > > original > > + * bio and forward it to the lower layer. > > + * > > + * No reference counts are incremented - it is assumed that the lifestime > > of the > > + * new bio is shorter than the lifetime of the original bio. If the new > > bio can > > + * outlive the old bio, the caller must increment the reference counts. > > + * > > + * Before freeing the new bio, the caller must clear the context with > > + * bio_clear_context function. If bio_clear_context were not called, the > > + * reference counts would be decremented on both new and original bio, > > resulting > > + * in crash due to reference count underflow. > > + */ > > +static inline void bio_clone_context(struct bio *orig, struct bio *new) > > +{ > > +#ifdef CONFIG_BLK_CGROUP > > + new->bi_ioc = orig->bi_ioc; > > + new->bi_css = orig->bi_css; > > Hmmm... Let's not do this. Sure, you'd be saving several instructions > but the gain is unlikely to be significant given that those cachelines > are likely to be hot anyway. Also, please name it > bio_copy_association(). > > Thanks. > > -- > tejun If the bi_css pointer points to a structure that is shared between processes, using atomic instruction causes cache line boucing - it doesn't cost a few instructions, it costs 2-3 hundreds cycles. I modified the patch to use new flag BIO_DROP_CGROUP_REFCOUNT to note that the refcount must be decremented - if the flag is set, refcounts must be decremented when bio is destroyed, if it is not set, references are borrowed from upper layer bio. It is less bug-prone than the previous patch. Mikulas --- dm: retain cgroup context This patch makes dm and dm-crypt target retain cgroup context. New function bio_clone_context is introduced. It copies cgroup context from one bio to another without incrementing the reference count. A new bio flag BIO_DROP_CGROUP_REFCOUNT specifies that cgroup refcounts should be decremented when the bio is freed. bio_associate_current and bio_disassociate_task are exported to modules. dm core is changed so that it copies the context to cloned bio. dm associates the bio with current process if it is going to offload bio to a thread. dm-crypt associates the bio with the current process and copies the context to outgoing bios. 
Signed-off-by: Mikulas Patocka --- drivers/md/dm-crypt.c |4 drivers/md/dm.c |7 +-- fs/bio.c | 11 +-- include/linux/bio.h | 27 +++ include/linux/blk_types.h |3 +++ 5 files changed, 48 insertions(+), 4 deletions(-) Index: linux-3.8.6-fast/drivers/md/dm.c === --- linux-3.8.6-fast.orig/drivers/md/dm.c 2013-04-11 17:29:09.0 +0200 +++ linux-3.8.6-fast/drivers/md/dm.c2013-04-11 19:33:47.0 +0200 @@ -1124,6 +1124,8 @@ static struct dm_target_io *alloc_tio(st clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs); tio = container_of(clone, struct dm_target_io, clone); + bio_clone_context(&tio->clone, ci->bio); + tio->io = ci->io; tio->ti = ti; memset(&tio->info, 0, sizeof(tio->info)); @@ -1469,9 +1471,10 @@ static void _dm_request(struct request_q if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) { dm_put_live_table(md, srcu_idx); - if (bio_rw(bio) != READA) + if (bio_rw(bio) != READA) { + bio_associate_current(bio); queue_io(md, bio); - else + } else bio_io_error(bio); return; } Index: linux-3.8.6-fast/include/linux/bio.h === --- linux-3.8.6-fast.orig/include/linux/bio.h 2013-04-11 17:29:07.0 +0200 +++ linux-3.8.6-fast/include/linux/bio.h2013-04-11 19:34:11.0 +0200 @@ -291,6 +291,15 @@ extern void bvec_free_bs(struct bio_set extern unsigned int bvec_nr_vecs(unsigned short idx); #ifdef CONFIG_BLK_CGROUP +/* + * bio_associate_current associates the bio with the current process. It should + * be called by any block device driver that passes the bio to a different + * process to be processed. It must be called in the original process. + * bio_associate_current does nothing if the bio is already associated. + * + * bio_dissociate_task dissociates the bio from the
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, 11 Apr 2013, Tejun Heo wrote: > On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote: > > If this becomes an actual bottleneck, the right thing to do is making > > css ref per-cpu. Please stop messing around with refcounting. > > If you think this kind of hackery is acceptable, you really need to > re-evaluate your priorities in making engineering decisions. In > tightly coupled code, maybe, but you're trying to introduce utterly > broken error-prone thing as a generic block layer API. I mean, are > you for real? > > -- > tejun All that I can tell you is that adding an empty atomic operation "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" to bio_clone_context and bio_disassociate_task increases the time to run a benchmark from 23 to 40 seconds. Every single atomic reference in the block layer is measurable. How did I measure it: (1) use dm SRCU patches (http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/) that replace some atomic accesses in device mapper with SRCU. The patches will likely be included in the kernel to improve performance. (2) use the patch v2 that I posted in this thread (3) add bio_associate_current(bio) to _dm_request (so that each bio is associated with a process even if it is not offloaded to a workqueue) (4) change bio_clone_context to actually increase reference counts: static inline void bio_clone_context(struct bio *bio, struct bio *bio_src) { #ifdef CONFIG_BLK_CGROUP BUG_ON(bio->bi_ioc != NULL); if (bio_src->bi_ioc) { get_io_context_active(bio_src->bi_ioc); bio->bi_ioc = bio_src->bi_ioc; if (bio_src->bi_css && css_tryget(bio_src->bi_css)) { bio->bi_css = bio_src->bi_css; } bio->bi_flags |= 1UL << BIO_DROP_CGROUP_REFCOUNT; } #endif } (5) add "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt)" to bio_clone_context and bio_disassociate_task Now, measuring: - create 4GiB ramdisk, fill it with dd so that it is allocated - create 5 nested device mapper linear targets on it - run "time fio --rw=randrw --size=1G --bs=512 --filename=/dev/mapper/linear5 --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 --name=job12" (it was ran on 12-core machine, so there are 12 concurrent jobs) If I measure kernel (4), the benchmark takes 23 seconds. For kernel (5) it takes 40 seconds. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, 11 Apr 2013, Tejun Heo wrote: > On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote: > > If this becomes an actual bottleneck, the right thing to do is making > > css ref per-cpu. Please stop messing around with refcounting. > > If you think this kind of hackery is acceptable, you really need to > re-evaluate your priorities in making engineering decisions. In > tightly coupled code, maybe, but you're trying to introduce utterly > broken error-prone thing as a generic block layer API. I mean, are > you for real? > > -- > tejun Please describe what is wrong with the code. Why do you call it hackery? When device mapper creates a cloned bio for the lower layer, it is already assumed that the cloned bio has a shorter lifetime than the original bio it was created from. The device mapper copies a part of the bio vector from the original bio to the cloned bio; it copies pointers to pages without increasing the reference counts on those pages. As long as the original bio has not been returned with bio_endio, the pages must exist, so there is no need to increase their reference counts. Now, if copying pointers without increasing reference counts is OK for pages, why do you think it is not OK for the cgroup context? Why do you call this bug-prone - how do you think a bug could happen? If someone in device mapper erroneously ends the master bio while the cloned bio is still in progress, there is already a memory corruption bug (the cloned bio vector points to potentially freed pages) and safeguarding the cgroup pointers won't fix it. So if you think that reference counts should be incremented by every clone of the original bio, what kind of bug would that protect against? If we don't increment reference counts for pages, why should we do it for cgroup pointers? Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, 11 Apr 2013, Tejun Heo wrote: > On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote: > > All that I can tell you is that adding an empty atomic operation > > "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" > > to bio_clone_context and bio_disassociate_task increases the time to run a > > benchmark from 23 to 40 seconds. > > Right, linear target on ramdisk, very realistic, and you know what, > hell with dm, let's just hand code everything into submit_bio(). I'm > sure it will speed up your test case significantly. The purpose of this benchmarking is not to optimize linear target on ramdisk. The purpose is to optimize the kernel for upcoming massively parallel servers, with possibly hundreds of energy-efficient cores or so. On these systems every single atomic reference really becomes a bottleneck. And since I don't have such massively parallel server, I am testing it on 12-core machine with ramdisk - the performance drop due to atomic accesses can be measured even on that. I already eliminated most of the atomic operations with this patch http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/dm-optimize.patch And I don't see a sense in adding more for cgroup, especially, if it can be easily avoided. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Fri, 12 Apr 2013, Tejun Heo wrote: > On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote: > > So if you think that reference counts should be incremented by every clone > > of the original bio, what kind of bug should it protect against? If we > > don't increment reference counts for pages, why should we do it for cgroup > > pointers? > > These things are called trade-offs. You look at the overhead of the > things and how complex / fragile things get when certain shortcuts are > taken and how well contained and easy to verify / debug when things go > wrong and then make your choice. So what we are here trading for what? The patch eliminates the atomic references when passing the bio down the stack of bio drivers. The patch adds a flag BIO_DROP_CGROUP_REFCOUNT, modifies it at two places and tests it on two places - that is not big. The flag BIO_DROP_CGROUP_REFCOUNT is never supposed to be used outside bio cgroup functions, so it doesn't complicate interface to other subsystems. The patch is not bug-prone, because we already must make sure that the cloned bio has shorter lifetime than the master bio - so the patch doesn't introduce any new possibilities to make bugs. > Do the two really look the same to you? The page refs are much more > expensive, mostly contained in and the main focus of dm. ioc/css refs > aren't that expensive to begin with, css refcnting is widely scattered ioc is per-task, so it is likely to be cached (but there are processors that have slow atomic operations even on cached data - on Pentium 4 it takes about 100 cycles). But css is shared between tasks and produces the cache ping-pong effect. > across the kernel, the association interface is likely to be used by > any entity issuing IOs asynchronously soonish, and there is much saner > way to improve it - which would be beneficial not only to block / dm > but everyone else using it. > > Something being done in one place doesn't automatically make it okay > everywhere else. We can and do use hackery but *with* discretion. > > If you still can't understand, I'm not sure what more I can tell you. > > -- > tejun I don't know what's wrong with 4 lines of code to manipulate a flag. I understand that you don't want to do something complicated and bug-prone to improve performance. But the patch is neither complicated nor bug-prone. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Track block device users that created dirty pages
Hi

Here I'm sending a patch that changes the block device buffer flush
semantics: only processes that created some dirty data flush the buffer
cache on close.

Mikulas

---
Track block device users that created dirty pages

This patch changes the block device driver to use the field "private_data"
to track "file" structures that created some dirty data in the page cache.
When such a "file" structure is closed, we flush the block device cache.

This changes the previously defined flush semantics. Previously, the block
device cache was flushed when the last user closed the block device. That
has various problems:

* Because system tools and daemons (for example udev or lvm) can open any
  block device at any time, this semantics doesn't guarantee that the flush
  happens. For example, if the user runs "dd" to copy some data to a
  partition, then udev opens the device, and then "dd" finishes copying and
  closes the device, the data is not flushed when "dd" exits (because udev
  still keeps the device open).

* If the last user that closes the device ends up being the "lvm" process,
  it can introduce deadlocks. "lvm" would then have to flush dirty data
  created by some other process. If writeback of this dirty data waits for
  some other operation to be performed by "lvm", we get a deadlock.

The new semantics is: if a process did some buffered writes to the block
device (with write or mmap), the cache is flushed when that process closes
the block device. Processes that didn't do any buffered writes to the
device don't cause a cache flush. This has these advantages:

* Processes that don't do buffered writes (such as "lvm") don't flush other
  processes' data.

* If the user runs "dd" on a block device, it is actually guaranteed that
  the data is flushed when "dd" exits.

Signed-off-by: Mikulas Patocka

---
 fs/block_dev.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Index: linux-3.9-rc4-fast/fs/block_dev.c
===================================================================
--- linux-3.9-rc4-fast.orig/fs/block_dev.c	2013-03-26 01:59:13.0 +0100
+++ linux-3.9-rc4-fast/fs/block_dev.c	2013-03-28 20:58:05.0 +0100
@@ -295,10 +295,22 @@ static int blkdev_readpage(struct file *
 	return block_read_full_page(page, blkdev_get_block);
 }
 
+static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == (VM_WRITE | VM_SHARED)) {
+		if (!file->private_data)
+			file->private_data = (void *)1;
+	}
+	return generic_file_mmap(file, vma);
+}
+
 static int blkdev_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata)
 {
+	if (unlikely(!file->private_data))
+		file->private_data = (void *)1;
+
 	return block_write_begin(mapping, pos, len, flags, pagep,
 				 blkdev_get_block);
 }
@@ -1413,7 +1425,6 @@ static int __blkdev_put(struct block_dev
 
 	if (!--bdev->bd_openers) {
 		WARN_ON_ONCE(bdev->bd_holders);
-		sync_blockdev(bdev);
 		kill_bdev(bdev);
 		/* ->release can cause the old bdi to disappear,
 		 * so must switch it out first
@@ -1497,6 +1508,9 @@ static int blkdev_close(struct inode * i
 {
 	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
 
+	if (unlikely(filp->private_data))
+		sync_blockdev(bdev);
+
 	return blkdev_put(bdev, filp->f_mode);
 }
 
@@ -1595,7 +1609,7 @@ const struct file_operations def_blk_fop
 	.write		= do_sync_write,
 	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
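To illustrate the new semantics from user space: with this patch, a writer's
close() is what triggers the flush, and a reader-only opener never flushes
anybody else's data. A trivial demonstration (my example, not part of the
patch; /dev/sdXN is a placeholder you would replace with a scratch device):

/* writer.c - buffered write to a block device, then close.
 * With the patch, blkdev_write_begin() marks this struct file via
 * ->private_data, so the close() below runs sync_blockdev(). A process
 * that merely opened the device for reading would not trigger a flush. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/dev/sdXN", O_RDWR);	/* placeholder device node */

	if (fd < 0)
		return 1;
	memset(buf, 0xab, sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
		close(fd);
		return 1;
	}
	return close(fd);	/* this process's dirty pages are flushed here */
}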