Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-03-31 Thread Christoph Hellwig
You're calling memcpy_{to,from}_iomem on non-__iomem pointers.  This
is a fundamental no-go as we keep I/O memory separate from kernel
pointers.


Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
On Thu, Mar 30, 2017 at 07:15:50PM -0400, Mike Snitzer wrote:
> I got pretty far along with implementing the DM thinp support for
> WRITE_ZEROES in terms of thinp's DISCARD support (more of an
> implementation detail.. or so I thought).
> 
> But while discussing this effort with Jeff Moyer he asked: shouldn't the
> zeroed blocks be provisioned?  This is a fairly embarassing question not
> to be able to answer in the moment.  So I clearly need to learn what the
> overall intent of WRITE_ZEROES actually is.

It is to ensure the that the block range it's called on returns zeroes
on future reads.  Currently if REQ_UNMAP is set you may free the space
allocation, else not.

After looking at the callers I think this is the wrong way around, so
I'm going to invert the flag, so that the two callers that care that
blocks are allocated will set it instead.


Re: [PATCH 12/23] sd: handle REQ_UNMAP

2017-03-31 Thread h...@lst.de
On Thu, Mar 30, 2017 at 10:19:55PM -0400, Martin K. Petersen wrote:
> > If you manually change the provisioning mode to WS10 on a device that
> > must use WRITE SAME (16) to be able to address all blocks you're already
> > screwed right now, and with this patch you can screw yourself through
> > the WRITE_ZEROES path in addition to the DISCARD path.
> 
> Oh, I see. We only had the LBA sanity check in place for write same, not
> for discard.

And btw, I'd be happy to add such a check, I'd just rather keep it out
of this patch.  It might be a good idea if you give it a turn after
this series given that you have all the devices that have weird
provisioning modes while I don't..


Re: [PATCH 0/3] Improve block device testing coverage

2017-03-31 Thread Dmitry Monakhov
Christoph Hellwig  writes:

> On Thu, Mar 30, 2017 at 05:19:01PM +0400, Dmitry Monakhov wrote:
>> During LSFMM we have discussed how to test lower-backend of linux IO-stack.
>> Common opinion was that xfstests is the most obvious solution which cover
>> most of use cases filesystem care about.
>> 
>> I'm working on integration T10-DIF/DIF data integrity features to ext4,
>> for that reason we need to be shure that linux integrity framework is
>> in working state, which is currently broken in several places.
>> 
>> In fact, it is relatively simple to add basic coverage tests for basic
>> IO operations over virtual device with integrity support. All we need
>> is to add lio target support.
>
> First:  Thanks for adding block layer testing!
>
> Second: even more so than Darrick's blockdev fallocate test this is
> the wrong place.  If I run xfstests I want to test my file system,
> not random block device features.  Please start a proper block device
> testsuite instead, possibly by copy and pasting code from xfstests.
Fair enough. I also not happy to place blkdev feature  to tests/generic
namespace. But altearnative to fork xfstests infrastructure to dedicated
test-framework only for blkdevice seems not very good. Because fork is
always pain. I already maintain one internal fork of xfstests which
tests our Vituozzo's speciffic features.

May be it would be reasonable idea to add didicated namespace
'tests/blockdev' in xfstests, and move all blkdev related tests here?
IMHO this is good idea. Because filesystem relay on some basic
features from blkdev which should be tested explicitly, because
implicit testing is too hard to debug/investigation.

>
> That's how I started the test suite for qemu's block layer for example.
Do you mean qemu/tests/qemu-iotests ?





Re: [PATCH 0/3] Improve block device testing coverage

2017-03-31 Thread Eryu Guan
On Fri, Mar 31, 2017 at 10:43:19AM +0300, Dmitry Monakhov wrote:
> Christoph Hellwig  writes:
> 
> > On Thu, Mar 30, 2017 at 05:19:01PM +0400, Dmitry Monakhov wrote:
> >> During LSFMM we have discussed how to test lower-backend of linux IO-stack.
> >> Common opinion was that xfstests is the most obvious solution which cover
> >> most of use cases filesystem care about.
> >> 
> >> I'm working on integration T10-DIF/DIF data integrity features to ext4,
> >> for that reason we need to be shure that linux integrity framework is
> >> in working state, which is currently broken in several places.
> >> 
> >> In fact, it is relatively simple to add basic coverage tests for basic
> >> IO operations over virtual device with integrity support. All we need
> >> is to add lio target support.
> >
> > First:  Thanks for adding block layer testing!
> >
> > Second: even more so than Darrick's blockdev fallocate test this is
> > the wrong place.  If I run xfstests I want to test my file system,
> > not random block device features.  Please start a proper block device
> > testsuite instead, possibly by copy and pasting code from xfstests.
> Fair enough. I also not happy to place blkdev feature  to tests/generic
> namespace. But altearnative to fork xfstests infrastructure to dedicated
> test-framework only for blkdevice seems not very good. Because fork is
> always pain. I already maintain one internal fork of xfstests which
> tests our Vituozzo's speciffic features.
> 
> May be it would be reasonable idea to add didicated namespace
> 'tests/blockdev' in xfstests, and move all blkdev related tests here?
> IMHO this is good idea. Because filesystem relay on some basic
> features from blkdev which should be tested explicitly, because
> implicit testing is too hard to debug/investigation.

I'm not sure if xfstests is the right place for blockdev tests,
especially for the pure blockdev level features (at least Darrick's
blockdev tests are testing fallocate(2) interface).

But yeah, a new tests/blockdev dir would be good if we eventually decide
adding blockdev tests to xfstests, so they're not run by default when
people want to test filesystems.

Thanks,
Eryu


Re: [PATCH 0/3] Improve block device testing coverage

2017-03-31 Thread Dmitry Monakhov
Eryu Guan  writes:

> On Fri, Mar 31, 2017 at 10:43:19AM +0300, Dmitry Monakhov wrote:
>> Christoph Hellwig  writes:
>> 
>> > On Thu, Mar 30, 2017 at 05:19:01PM +0400, Dmitry Monakhov wrote:
>> >> During LSFMM we have discussed how to test lower-backend of linux 
>> >> IO-stack.
>> >> Common opinion was that xfstests is the most obvious solution which cover
>> >> most of use cases filesystem care about.
>> >> 
>> >> I'm working on integration T10-DIF/DIF data integrity features to ext4,
>> >> for that reason we need to be shure that linux integrity framework is
>> >> in working state, which is currently broken in several places.
>> >> 
>> >> In fact, it is relatively simple to add basic coverage tests for basic
>> >> IO operations over virtual device with integrity support. All we need
>> >> is to add lio target support.
>> >
>> > First:  Thanks for adding block layer testing!
>> >
>> > Second: even more so than Darrick's blockdev fallocate test this is
>> > the wrong place.  If I run xfstests I want to test my file system,
>> > not random block device features.  Please start a proper block device
>> > testsuite instead, possibly by copy and pasting code from xfstests.
>> Fair enough. I also not happy to place blkdev feature  to tests/generic
>> namespace. But altearnative to fork xfstests infrastructure to dedicated
>> test-framework only for blkdevice seems not very good. Because fork is
>> always pain. I already maintain one internal fork of xfstests which
>> tests our Vituozzo's speciffic features.
>> 
>> May be it would be reasonable idea to add didicated namespace
>> 'tests/blockdev' in xfstests, and move all blkdev related tests here?
>> IMHO this is good idea. Because filesystem relay on some basic
>> features from blkdev which should be tested explicitly, because
>> implicit testing is too hard to debug/investigation.
>
> I'm not sure if xfstests is the right place for blockdev tests,
> especially for the pure blockdev level features (at least Darrick's
> blockdev tests are testing fallocate(2) interface).
Another good example may be a bug with dirty page cache after blkdiscard
https://lkml.org/lkml/2017/3/22/789 . This simple bug  result in crappy
fsimage if mkfs relay on discard_zeroes_data behaviour.
So IMHO basic blkdev test coverage is important filesystem testing. i.e.
important for xfstests.
>
> But yeah, a new tests/blockdev dir would be good if we eventually decide
> adding blockdev tests to xfstests, so they're not run by default when
> people want to test filesystems.
>
> Thanks,
> Eryu


Re: [PATCH] scsi: mpt3sas: fix hang on ata passthrough commands

2017-03-31 Thread Greg Kroah-Hartman
On Fri, Mar 31, 2017 at 09:50:30AM -0400, Joe Korty wrote:
> [PATCH] scsi: mpt3sas: fix hang on ata passthrough commands
> 
> commit 16236802bfecb1082144a48b7d6fa60997824662 upstream
> commit ffb58456589443ca572221fabbdef3db8483a779 upstream
> 
> Lockdep complains that the base level-only hctx->lock
> is being grabbed from interrupt level, at the time the
> mpt3sas driver initializes.  This is a result of two
> of three tightly interconnected mpt3sas bugfix patches
> having been backported into 4.4.  The previously two ported
> patches are:
> 
> > From: Andrey Grodzovsky 
> > Date: Thu, 10 Nov 2016 09:35:27 -0500
> > Subject: [PATCH] scsi: mpt3sas: Fix secure erase premature termination
> > Git-Commit: c1ed47e76d13cafbc645792ca184331b3123
>   
> > From: Suganath Prabu S 
> > Date: Thu, 17 Nov 2016 16:15:58 +0530
> > Subject: [PATCH] scsi: mpt3sas: Unblock device after controller reset
> > Git-Commit: 6eddf5c993dd9bf4efcf2509e4ca633b9441a66a
> 
> The missing third patch is attached below.
> 
> Signed-off-by: Joe Korty 
> 
> 
> > From: James Bottomley 
> > Date: Sun, 1 Jan 2017 09:39:24 -0800
> > Subject: [PATCH] scsi: mpt3sas: fix hang on ata passthrough commands
> > Git-Commit: 16236802bfecb1082144a48b7d6fa60997824662
> 
> commit ffb58456589443ca572221fabbdef3db8483a779 upstream.
> 
> mpt3sas has a firmware failure where it can only handle one pass through
> ATA command at a time.  If another comes in, contrary to the SAT
> standard, it will hang until the first one completes (causing long
> commands like secure erase to timeout).  The original fix was to block
> the device when an ATA command came in, but this caused a regression
> with
> 
> commit 669f044170d8933c3d66d231b69ea97cb8447338
> Author: Bart Van Assche 
> Date:   Tue Nov 22 16:17:13 2016 -0800
> 
> scsi: srp_transport: Move queuecommand() wait code to SCSI core
> 
> So fix the original fix of the secure erase timeout by properly
> returning SAM_STAT_BUSY like the SAT recommends.  The original patch
> also had a concurrency problem since scsih_qcmd is lockless at that
> point (this is fixed by using atomic bitops to set and test the flag).
> 
> [mkp: addressed feedback wrt. test_bit and fixed whitespace]
> 
> Fixes: 18f6084a989ba1b (mpt3sas: Fix secure erase premature termination)
> Signed-off-by: James Bottomley 
> Acked-by: Sreekanth Reddy 
> Reviewed-by: Christoph Hellwig 
> Reported-by: Ingo Molnar 
> Tested-by: Ingo Molnar 
> Signed-off-by: Martin K. Petersen 
> Signed-off-by: Greg Kroah-Hartman 

I signed off on this patch?  When?  Where?

totall confused here,

greg k-h


Re: [PATCH] scsi: mpt3sas: fix hang on ata passthrough commands

2017-03-31 Thread Greg Kroah-Hartman
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

A: No.
Q: Should I include quotations after my reply?


http://daringfireball.net/2007/07/on_top

On Fri, Mar 31, 2017 at 10:16:47AM -0400, Joe Korty wrote:
> Hi Greg,
> This patch is not yet in 4.4.

What patch is that?  I can't find a git commit id here at all :(

> It is in 4.9 and upstream.  No one has submitted it to you before this
> email.  I only discovered the need for it because lockdep complains
> when it is missing.  I do not know how it was missed; perhaps there is
> a reason rather than it being missed by accident. Others in the know
> can comment if they like.

You need to say what you want to have happen for a stable patch, you did
read Documenation/stable_kernel_rules.txt, right?

still confused,

greg k-h


Re: [PATCH 0/3] Improve block device testing coverage

2017-03-31 Thread Bart Van Assche
On Fri, 2017-03-31 at 13:02 +0300, Dmitry Monakhov wrote:
> Another good example may be a bug with dirty page cache after blkdiscard
> https://lkml.org/lkml/2017/3/22/789 . This simple bug  result in crappy
> fsimage if mkfs relay on discard_zeroes_data behaviour.
> So IMHO basic blkdev test coverage is important filesystem testing. i.e.
> important for xfstests.

Mixing up filesystem tests and block layer / block driver tests in the same
directory is completely wrong. Block driver developers will be primarily
interested in the block tests and may want to skip the filesystem tests.
Filesystem developers will probably run the block tests only once and will
likely run the filesystem tests repeatedly. Mixing up different kinds of
tests in the same directory makes it unnecessarily hard to run block and
filesystem tests separately.

Bart.

Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-03-31 Thread Logan Gunthorpe


On 31/03/17 01:09 AM, Christoph Hellwig wrote:
> You're calling memcpy_{to,from}_iomem on non-__iomem pointers.  This
> is a fundamental no-go as we keep I/O memory separate from kernel
> pointers.

Yes, that's true, however I don't know how we could get around that when
the iomem is referenced by struct pages inside a scatter gather list. Do
we need to now have special __iomem sgls? And even still, I'm not sure
how that could work when the nvme target code is using the same sgls to
sometimes point to iomem and sometimes point to regular memory.

I'm certainly open to suggestions, though.

Logan



Re: [PATCH 0/3] Improve block device testing coverage

2017-03-31 Thread Darrick J. Wong
On Fri, Mar 31, 2017 at 03:11:28PM +, Bart Van Assche wrote:
> On Fri, 2017-03-31 at 13:02 +0300, Dmitry Monakhov wrote:
> > Another good example may be a bug with dirty page cache after blkdiscard
> > https://lkml.org/lkml/2017/3/22/789 . This simple bug  result in crappy
> > fsimage if mkfs relay on discard_zeroes_data behaviour.
> > So IMHO basic blkdev test coverage is important filesystem testing. i.e.
> > important for xfstests.
> 
> Mixing up filesystem tests and block layer / block driver tests in the same
> directory is completely wrong. Block driver developers will be primarily
> interested in the block tests and may want to skip the filesystem tests.
> Filesystem developers will probably run the block tests only once and will
> likely run the filesystem tests repeatedly. Mixing up different kinds of
> tests in the same directory makes it unnecessarily hard to run block and
> filesystem tests separately.

During LSF I had started to wonder if we should just create a new
FSTYP=blockdev fs type with a no-op mkfs & mount.  "_require_fs generic"
could be taught to ignore FSTYP=blockdev; blockdev tests that should
work on all block devices can stay in tests/generic, and blockdev tests
that require specific features or complicated setup can go in
tests/blockdev.

The benefit (for the fs developers, anyway) of having complex block
device setup code helper functions in common/ is that then we can also
start writing tests to see how the fs reacts with more complex storage
setups.  We already have some of that for dm_{thin,flakey,delay,error}.

That way we keep the tests together and make it easy to run them (when
applicable) as part of regular fs testing, and avoid the situation where
bdevtests and xfstests slowly drift apart in terms of behaviors and
command line switches.

The downside ofc is the potential for bloat. :)

(The blockdev fallocate tests fit the fs/block split awkwardly --
they call what is nominally a fs feature on something that isn't itself
a filesystem...)

 Just my 5c.

--D

> 
> Bart.--
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] virtio: wrap find_vqs

2017-03-31 Thread Michael S. Tsirkin
On Fri, Mar 31, 2017 at 12:04:55PM +0800, Jason Wang wrote:
> 
> 
> On 2017年03月30日 22:32, Michael S. Tsirkin wrote:
> > On Thu, Mar 30, 2017 at 02:00:08PM +0800, Jason Wang wrote:
> > > 
> > > On 2017年03月30日 04:48, Michael S. Tsirkin wrote:
> > > > We are going to add more parameters to find_vqs, let's wrap the call so
> > > > we don't need to tweak all drivers every time.
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin
> > > > ---
> > > A quick glance and it looks ok, but what the benefit of this series, is it
> > > required by other changes?
> > > 
> > > Thanks
> > Yes - to avoid touching all devices when doing the rest of
> > the patchset.
> 
> Maybe I'm not clear. I mean the benefit of this series not this single
> patch. I guess it may be used by you proposal that avoid reset when set XDP?

In particular, yes. It generally simplifies things significantly if
we can get the true buffer size back.

> If yes, do we really want to drop some packets after XDP is set?
> 
> Thanks

We would rather not drop packets. We could detect and copy them to make
XDP work.

-- 
MST


always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-31 Thread Christoph Hellwig
This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
supported by the block layer, and switches existing implementations
of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
removes incorrect discard_zeroes_data, and also switches WRITE SAME
based zeroing in SCSI to this new method.

The series is against the block for-next tree.

A git tree is also avaiable at:

git://git.infradead.org/users/hch/block.git discard-rework

Gitweb:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework


[PATCH 01/25] ѕd: split sd_setup_discard_cmnd

2017-03-31 Thread Christoph Hellwig
Split sd_setup_discard_cmnd into one function per provisioning type.  While
this creates some very slight duplication of boilerplate code it keeps the
code modular for additions of new provisioning types, and for reusing the
write same functions for the upcoming scsi implementation of the Write Zeroes
operation.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 153 ++
 1 file changed, 84 insertions(+), 69 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fcfeddc79331..b853f91fb3da 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -701,93 +701,97 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
 
-/**
- * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device
- * @sdp: scsi device to operate on
- * @rq: Request to prepare
- *
- * Will issue either UNMAP or WRITE SAME(16) depending on preference
- * indicated by target device.
- **/
-static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 {
-   struct request *rq = cmd->request;
struct scsi_device *sdp = cmd->device;
-   struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
-   sector_t sector = blk_rq_pos(rq);
-   unsigned int nr_sectors = blk_rq_sectors(rq);
-   unsigned int len;
-   int ret;
+   struct request *rq = cmd->request;
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+   unsigned int data_len = 24;
char *buf;
-   struct page *page;
 
-   sector >>= ilog2(sdp->sector_size) - 9;
-   nr_sectors >>= ilog2(sdp->sector_size) - 9;
-
-   page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
-   if (!page)
+   rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+   if (!rq->special_vec.bv_page)
return BLKPREP_DEFER;
+   rq->special_vec.bv_offset = 0;
+   rq->special_vec.bv_len = data_len;
+   rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
-   switch (sdkp->provisioning_mode) {
-   case SD_LBP_UNMAP:
-   buf = page_address(page);
-
-   cmd->cmd_len = 10;
-   cmd->cmnd[0] = UNMAP;
-   cmd->cmnd[8] = 24;
-
-   put_unaligned_be16(6 + 16, &buf[0]);
-   put_unaligned_be16(16, &buf[2]);
-   put_unaligned_be64(sector, &buf[8]);
-   put_unaligned_be32(nr_sectors, &buf[16]);
+   cmd->cmd_len = 10;
+   cmd->cmnd[0] = UNMAP;
+   cmd->cmnd[8] = 24;
 
-   len = 24;
-   break;
+   buf = page_address(rq->special_vec.bv_page);
+   put_unaligned_be16(6 + 16, &buf[0]);
+   put_unaligned_be16(16, &buf[2]);
+   put_unaligned_be64(sector, &buf[8]);
+   put_unaligned_be32(nr_sectors, &buf[16]);
 
-   case SD_LBP_WS16:
-   cmd->cmd_len = 16;
-   cmd->cmnd[0] = WRITE_SAME_16;
-   cmd->cmnd[1] = 0x8; /* UNMAP */
-   put_unaligned_be64(sector, &cmd->cmnd[2]);
-   put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
+   cmd->allowed = SD_MAX_RETRIES;
+   cmd->transfersize = data_len;
+   rq->timeout = SD_TIMEOUT;
+   scsi_req(rq)->resid_len = data_len;
 
-   len = sdkp->device->sector_size;
-   break;
+   return scsi_init_io(cmd);
+}
 
-   case SD_LBP_WS10:
-   case SD_LBP_ZERO:
-   cmd->cmd_len = 10;
-   cmd->cmnd[0] = WRITE_SAME;
-   if (sdkp->provisioning_mode == SD_LBP_WS10)
-   cmd->cmnd[1] = 0x8; /* UNMAP */
-   put_unaligned_be32(sector, &cmd->cmnd[2]);
-   put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+{
+   struct scsi_device *sdp = cmd->device;
+   struct request *rq = cmd->request;
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 data_len = sdp->sector_size;
 
-   len = sdkp->device->sector_size;
-   break;
+   rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+   if (!rq->special_vec.bv_page)
+   return BLKPREP_DEFER;
+   rq->special_vec.bv_offset = 0;
+   rq->special_vec.bv_len = data_len;
+   rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
-   default:
-   ret = BLKPREP_INVALID;
-   goto out;
-   }
+   cmd->cmd_len = 16;
+   cmd->cmnd[0] = WRITE_SAME_16;
+   cmd->cmnd[1] = 0x8; /* UNMAP */
+   put_unaligned_be64(sector, &cmd->cmnd[2]);
+   put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
 
+   cmd->allowed = SD_MAX_RETRIES;
+   cmd->transfersize = data_len;
rq->timeout = SD_TIMEOUT;
+   s

[PATCH 02/25] block: renumber REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
Make life easy for implementations that needs to send a data buffer
to the device (e.g. SCSI) by numbering it as a data out command.

Signed-off-by: Christoph Hellwig 
---
 include/linux/blk_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 67bcf8a5326e..4eae30bfbfca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -168,7 +168,7 @@ enum req_opf {
/* write the same sector many times */
REQ_OP_WRITE_SAME   = 7,
/* write the zero filled sector many times */
-   REQ_OP_WRITE_ZEROES = 8,
+   REQ_OP_WRITE_ZEROES = 9,
 
/* SCSI passthrough using struct scsi_request */
REQ_OP_SCSI_IN  = 32,
-- 
2.11.0



[PATCH 03/25] block: implement splitting of REQ_OP_WRITE_ZEROES bios

2017-03-31 Thread Christoph Hellwig
Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
that limit the write zeroes size.

Signed-off-by: Christoph Hellwig 
---
 block/blk-merge.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..3990ae406341 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -54,6 +54,20 @@ static struct bio *blk_bio_discard_split(struct 
request_queue *q,
return bio_split(bio, split_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
+   struct bio *bio, struct bio_set *bs, unsigned *nsegs)
+{
+   *nsegs = 1;
+
+   if (!q->limits.max_write_zeroes_sectors)
+   return NULL;
+
+   if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+   return NULL;
+
+   return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
@@ -200,8 +214,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
**bio,
split = blk_bio_discard_split(q, *bio, bs, &nsegs);
break;
case REQ_OP_WRITE_ZEROES:
-   split = NULL;
-   nsegs = (*bio)->bi_phys_segments;
+   split = blk_bio_write_zeroes_split(q, *bio, bs, &nsegs);
break;
case REQ_OP_WRITE_SAME:
split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
-- 
2.11.0



[PATCH 08/25] dm kcopyd: switch to use REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
It seems like the code currently passes whatever it was using for writes
to WRITE SAME.  Just switch it to WRITE ZEROES, although that doesn't
need any payload.

Untested, and confused by the code, maybe someone who understands it
better than me can help..

Not-yet-signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-kcopyd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 9e9d04cb7d51..f85846741d50 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -733,11 +733,11 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct 
dm_io_region *from,
job->pages = &zero_page_list;
 
/*
-* Use WRITE SAME to optimize zeroing if all dests support it.
+* Use WRITE ZEROES to optimize zeroing if all dests support it.
 */
-   job->rw = REQ_OP_WRITE_SAME;
+   job->rw = REQ_OP_WRITE_ZEROES;
for (i = 0; i < job->num_dests; i++)
-   if (!bdev_write_same(job->dests[i].bdev)) {
+   if (!bdev_write_zeroes_sectors(job->dests[i].bdev)) {
job->rw = WRITE;
break;
}
-- 
2.11.0



[PATCH 05/25] md: support REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/linear.c| 1 +
 drivers/md/md.h| 7 +++
 drivers/md/multipath.c | 1 +
 drivers/md/raid0.c | 2 ++
 drivers/md/raid1.c | 4 +++-
 drivers/md/raid10.c| 1 +
 drivers/md/raid5.c | 1 +
 7 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 3e38e0207a3e..377a8a3672e3 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -293,6 +293,7 @@ static void linear_make_request(struct mddev *mddev, struct 
bio *bio)
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
mddev_check_writesame(mddev, split);
+   mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index dde8ecb760c8..1e76d64ce180 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -709,4 +709,11 @@ static inline void mddev_check_writesame(struct mddev 
*mddev, struct bio *bio)
!bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
mddev->queue->limits.max_write_same_sectors = 0;
 }
+
+static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio 
*bio)
+{
+   if (bio_op(bio) == REQ_OP_WRITE_ZEROES &&
+   !bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors)
+   mddev->queue->limits.max_write_zeroes_sectors = 0;
+}
 #endif /* _MD_MD_H */
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 79a12b59250b..e95d521d93e9 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -139,6 +139,7 @@ static void multipath_make_request(struct mddev *mddev, 
struct bio * bio)
mp_bh->bio.bi_end_io = multipath_end_request;
mp_bh->bio.bi_private = mp_bh;
mddev_check_writesame(mddev, &mp_bh->bio);
+   mddev_check_write_zeroes(mddev, &mp_bh->bio);
generic_make_request(&mp_bh->bio);
return;
 }
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 93347ca7c7a6..ce7a6a56cf73 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -383,6 +383,7 @@ static int raid0_run(struct mddev *mddev)
 
blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 
mddev->chunk_sectors);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 
mddev->chunk_sectors);
blk_queue_max_discard_sectors(mddev->queue, 
mddev->chunk_sectors);
 
blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
@@ -504,6 +505,7 @@ static void raid0_make_request(struct mddev *mddev, struct 
bio *bio)
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
mddev_check_writesame(mddev, split);
+   mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a34f58772022..b59cc100320a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3177,8 +3177,10 @@ static int raid1_run(struct mddev *mddev)
if (IS_ERR(conf))
return PTR_ERR(conf);
 
-   if (mddev->queue)
+   if (mddev->queue) {
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
+   }
 
rdev_for_each(rdev, mddev) {
if (!mddev->gendisk)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e89a8d78a9ed..28ec3a93acee 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3749,6 +3749,7 @@ static int raid10_run(struct mddev *mddev)
blk_queue_max_discard_sectors(mddev->queue,
  mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
blk_queue_io_min(mddev->queue, chunk_size);
if (conf->geo.raid_disks % conf->geo.near_copies)
blk_queue_io_opt(mddev->queue, chunk_size * 
conf->geo.raid_disks);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ed5cd705b985..8cf1f86dcd05 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7272,6 +7272,7 @@ static int raid5_run(struct mddev *mddev)
mddev->queue->limits.discard_zeroes_data = 0;
 
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
 
rdev_for_each(rdev, mddev) {
disk_s

[PATCH 10/25] block: add a flags argument to (__)blkdev_issue_zeroout

2017-03-31 Thread Christoph Hellwig
Turn the existin discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c| 31 ++-
 block/ioctl.c  |  2 +-
 drivers/block/drbd/drbd_receiver.c |  9 ++---
 drivers/nvme/target/io-cmd.c   |  2 +-
 fs/block_dev.c |  2 +-
 fs/dax.c   |  2 +-
 fs/xfs/xfs_bmap_util.c |  2 +-
 include/linux/blkdev.h | 16 ++--
 8 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2a8d638544a7..f9f24ec69c27 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -282,14 +282,18 @@ static int __blkdev_issue_write_zeroes(struct 
block_device *bdev,
  * @nr_sects:  number of sectors to write
  * @gfp_mask:  memory allocation flags (for bio_alloc)
  * @biop:  pointer to anchor bio
- * @discard:   discard flag
+ * @flags: controls detailed behavior
  *
  * Description:
- *  Generate and issue number of bios with zerofiled pages.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.
+ *
+ *  If a device is using logical block provisioning, the underlying space will
+ *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-   bool discard)
+   unsigned flags)
 {
int ret;
int bi_size = 0;
@@ -337,28 +341,21 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout);
  * @sector:start sector
  * @nr_sects:  number of sectors to write
  * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @discard:   whether to discard the block range
+ * @flags: controls detailed behavior
  *
  * Description:
- *  Zero-fill a block range.  If the discard flag is set and the block
- *  device guarantees that subsequent READ operations to the block range
- *  in question will return zeroes, the blocks will be discarded. Should
- *  the discard request fail, if the discard flag is not set, or if
- *  discard_zeroes_data is not supported, this function will resort to
- *  zeroing the blocks manually, thus provisioning (allocating,
- *  anchoring) them. If the block device supports WRITE ZEROES or WRITE SAME
- *  command(s), blkdev_issue_zeroout() will use it to optimize the process of
- *  clearing the block range. Otherwise the zeroing will be performed
- *  using regular WRITE calls.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.  See __blkdev_issue_zeroout() for the
+ *  valid values for %flags.
  */
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
-sector_t nr_sects, gfp_t gfp_mask, bool discard)
+   sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
 {
int ret;
struct bio *bio = NULL;
struct blk_plug plug;
 
-   if (discard) {
+   if (!(flags & BLKDEV_ZERO_NOUNMAP)) {
if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
BLKDEV_DISCARD_ZERO))
return 0;
@@ -366,7 +363,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
 
blk_start_plug(&plug);
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
-   &bio, discard);
+   &bio, flags);
if (ret == 0 && bio) {
ret = submit_bio_wait(bio);
bio_put(bio);
diff --git a/block/ioctl.c b/block/ioctl.c
index 7b88820b93d9..8ea00a41be01 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -255,7 +255,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, 
fmode_t mode,
truncate_inode_pages_range(mapping, start, end);
 
return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
-   false);
+   BLKDEV_ZERO_NOUNMAP);
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index aa6bf9692eff..dc9a6dcd431c 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1499,19 +1499,22 @@ int drbd_issue_discard_or_zero_out(struct drbd_device 
*device, sector_t start, u
tmp = start + granularity - sector_div(tmp, granularity);
 
nr = tmp - start;
-   err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO, 0);
+   err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO,
+   BLKDEV_ZERO_NOUNMAP);
nr_sectors -= nr;
start = tmp;
}
while (nr_sectors >= granularity) {
nr = min_t(sector_

[PATCH 18/25] brd: remove discard support

2017-03-31 Thread Christoph Hellwig
It's just a in-driver reimplementation of writing zeroes to the pages,
which fails if the discards aren't page aligned.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/brd.c | 54 -
 1 file changed, 54 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..4ec84d504780 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -134,28 +134,6 @@ static struct page *brd_insert_page(struct brd_device 
*brd, sector_t sector)
return page;
 }
 
-static void brd_free_page(struct brd_device *brd, sector_t sector)
-{
-   struct page *page;
-   pgoff_t idx;
-
-   spin_lock(&brd->brd_lock);
-   idx = sector >> PAGE_SECTORS_SHIFT;
-   page = radix_tree_delete(&brd->brd_pages, idx);
-   spin_unlock(&brd->brd_lock);
-   if (page)
-   __free_page(page);
-}
-
-static void brd_zero_page(struct brd_device *brd, sector_t sector)
-{
-   struct page *page;
-
-   page = brd_lookup_page(brd, sector);
-   if (page)
-   clear_highpage(page);
-}
-
 /*
  * Free all backing store pages and radix tree. This must only be called when
  * there are no other users of the device.
@@ -212,24 +190,6 @@ static int copy_to_brd_setup(struct brd_device *brd, 
sector_t sector, size_t n)
return 0;
 }
 
-static void discard_from_brd(struct brd_device *brd,
-   sector_t sector, size_t n)
-{
-   while (n >= PAGE_SIZE) {
-   /*
-* Don't want to actually discard pages here because
-* re-allocating the pages can result in writeback
-* deadlocks under heavy load.
-*/
-   if (0)
-   brd_free_page(brd, sector);
-   else
-   brd_zero_page(brd, sector);
-   sector += PAGE_SIZE >> SECTOR_SHIFT;
-   n -= PAGE_SIZE;
-   }
-}
-
 /*
  * Copy n bytes from src to the brd starting at sector. Does not sleep.
  */
@@ -338,14 +298,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, 
struct bio *bio)
if (bio_end_sector(bio) > get_capacity(bdev->bd_disk))
goto io_error;
 
-   if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
-   if (sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1) ||
-   bio->bi_iter.bi_size & ~PAGE_MASK)
-   goto io_error;
-   discard_from_brd(brd, sector, bio->bi_iter.bi_size);
-   goto out;
-   }
-
bio_for_each_segment(bvec, bio, iter) {
unsigned int len = bvec.bv_len;
int err;
@@ -357,7 +309,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, 
struct bio *bio)
sector += len >> SECTOR_SHIFT;
}
 
-out:
bio_endio(bio);
return BLK_QC_T_NONE;
 io_error:
@@ -464,11 +415,6 @@ static struct brd_device *brd_alloc(int i)
 *  is harmless)
 */
blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);
-
-   brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
-   blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
-   brd->brd_queue->limits.discard_zeroes_data = 1;
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
 #ifdef CONFIG_BLK_DEV_RAM_DAX
queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
 #endif
-- 
2.11.0



[PATCH 12/25] block: add a new BLKDEV_ZERO_NOFALLBACK flag

2017-03-31 Thread Christoph Hellwig
This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if
the caller doesn't want them.

Also clean up the convoluted check for the return condition that this
new flag is added to.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c| 5 -
 include/linux/blkdev.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2f6d2cb2e1a2..2f882e22890b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -281,6 +281,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
  *
  *  If a device is using logical block provisioning, the underlying space will
  *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
+ *
+ *  If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return
+ *  -EOPNOTSUPP if no explicit hardware offload for zeroing is provided.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
@@ -298,7 +301,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
 
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
biop, flags);
-   if (ret == 0 || (ret && ret != -EOPNOTSUPP))
+   if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK))
goto out;
 
ret = 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e7513ce3dbde..a5055d760661 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1351,6 +1351,7 @@ extern int __blkdev_issue_discard(struct block_device 
*bdev, sector_t sector,
struct bio **biop);
 
 #define BLKDEV_ZERO_NOUNMAP(1 << 0)  /* do not free blocks */
+#define BLKDEV_ZERO_NOFALLBACK (1 << 1)  /* don't write explicit zeroes */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-- 
2.11.0



[PATCH 07/25] dm: support REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-core.h  |  1 +
 drivers/md/dm-io.c|  8 ++--
 drivers/md/dm-linear.c|  1 +
 drivers/md/dm-mpath.c |  1 +
 drivers/md/dm-rq.c| 11 ---
 drivers/md/dm-stripe.c|  2 ++
 drivers/md/dm-table.c | 30 ++
 drivers/md/dm.c   | 31 ---
 include/linux/device-mapper.h |  6 ++
 9 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..fea5bd52ada8 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -132,6 +132,7 @@ void dm_init_md_queue(struct mapped_device *md);
 void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
+void disable_write_zeroes(struct mapped_device *md);
 
 static inline struct completion *dm_get_completion_from_kobject(struct kobject 
*kobj)
 {
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index b808cbe22678..3702e502466d 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -312,9 +312,12 @@ static void do_region(int op, int op_flags, unsigned 
region,
 */
if (op == REQ_OP_DISCARD)
special_cmd_max_sectors = q->limits.max_discard_sectors;
+   else if (op == REQ_OP_WRITE_ZEROES)
+   special_cmd_max_sectors = q->limits.max_write_zeroes_sectors;
else if (op == REQ_OP_WRITE_SAME)
special_cmd_max_sectors = q->limits.max_write_same_sectors;
-   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_SAME) &&
+   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES ||
+op == REQ_OP_WRITE_SAME)  &&
special_cmd_max_sectors == 0) {
dec_count(io, region, -EOPNOTSUPP);
return;
@@ -330,6 +333,7 @@ static void do_region(int op, int op_flags, unsigned region,
 */
switch (op) {
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
num_bvecs = 0;
break;
case REQ_OP_WRITE_SAME:
@@ -347,7 +351,7 @@ static void do_region(int op, int op_flags, unsigned region,
bio_set_op_attrs(bio, op, op_flags);
store_io_and_region_in_bio(bio, io, region);
 
-   if (op == REQ_OP_DISCARD) {
+   if (op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) {
num_sectors = min_t(sector_t, special_cmd_max_sectors, 
remaining);
bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
remaining -= num_sectors;
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..e17fd44ceef5 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -59,6 +59,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
+   ti->num_write_zeroes_bios = 1;
ti->private = lc;
return 0;
 
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 7f223dbed49f..ab55955ed704 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1103,6 +1103,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
+   ti->num_write_zeroes_bios = 1;
if (m->queue_mode == DM_TYPE_BIO_BASED)
ti->per_io_data_size = multipath_per_bio_data_size();
else
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 28955b94d2b2..6006d5d4b06e 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -298,9 +298,14 @@ static void dm_done(struct request *clone, int error, bool 
mapped)
r = rq_end_io(tio->ti, clone, error, &tio->info);
}
 
-   if (unlikely(r == -EREMOTEIO && (req_op(clone) == REQ_OP_WRITE_SAME) &&
-!clone->q->limits.max_write_same_sectors))
-   disable_write_same(tio->md);
+   if (unlikely(r == -EREMOTEIO)) {
+   if (req_op(clone) == REQ_OP_WRITE_SAME &&
+   !clone->q->limits.max_write_same_sectors)
+   disable_write_same(tio->md);
+   if (req_op(clone) == REQ_OP_WRITE_ZEROES &&
+   !clone->q->limits.max_write_zeroes_sectors)
+   disable_write_zeroes(tio->md);
+   }
 
if (r <= 0)
/* The target wants to complete the I/O */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..5ef49c121d99 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -169,6 +169,7 @@ static int stripe_ctr(struct dm_target *ti, unsig

[PATCH 22/25] block: stop using discards for zeroing

2017-03-31 Thread Christoph Hellwig
Now that we have REQ_OP_WRITE_ZEROES implemented for all devices that
support efficient zeroing of devices we can remove the call to
blkdev_issue_discard.  This means we only have two ways of zeroing left
and can simply the code.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2f882e22890b..7c27211570fa 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -279,6 +279,11 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
  *  Zero-fill a block range, either using hardware offload or by explicitly
  *  writing zeroes to the device.
  *
+ *  Note that this function may fail with -EOPNOTSUPP if the driver supports
+ *  efficient zeroing operation, but the device capabilities can only be
+ *  discovered by trial and error.  In this case the caller should call the
+ *  function again, and it will use the fallback path.
+ *
  *  If a device is using logical block provisioning, the underlying space will
  *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
  *
@@ -349,12 +354,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
struct bio *bio = NULL;
struct blk_plug plug;
 
-   if (!(flags & BLKDEV_ZERO_NOUNMAP)) {
-   if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
-   BLKDEV_DISCARD_ZERO))
-   return 0;
-   }
-
blk_start_plug(&plug);
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
&bio, flags);
-- 
2.11.0



[PATCH 16/25] zram: implement REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
Just the same as discard if the block size equals the system page size.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/zram/zram_drv.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index dceb5edd1e54..1710b06f04a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -829,10 +829,14 @@ static void __zram_make_request(struct zram *zram, struct 
bio *bio)
offset = (bio->bi_iter.bi_sector &
  (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
 
-   if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+   switch (bio_op(bio)) {
+   case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
zram_bio_discard(zram, index, offset, bio);
bio_endio(bio);
return;
+   default:
+   break;
}
 
bio_for_each_segment(bvec, bio, iter) {
@@ -1192,6 +1196,8 @@ static int zram_add(void)
zram->disk->queue->limits.max_sectors = SECTORS_PER_PAGE;
zram->disk->queue->limits.chunk_sectors = 0;
blk_queue_max_discard_sectors(zram->disk->queue, UINT_MAX);
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+
/*
 * zram_bio_discard() will clear all logical blocks if logical block
 * size is identical with physical block size(PAGE_SIZE). But if it is
@@ -1201,10 +1207,7 @@ static int zram_add(void)
 * zeroed.
 */
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
-   zram->disk->queue->limits.discard_zeroes_data = 1;
-   else
-   zram->disk->queue->limits.discard_zeroes_data = 0;
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+   blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
add_disk(zram->disk);
 
-- 
2.11.0



[PATCH 20/25] rsxx: remove the discard_zeroes_data flag

2017-03-31 Thread Christoph Hellwig
rsxx only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/rsxx/dev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d10..9c566364ac9c 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -300,7 +300,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
RSXX_HW_BLK_SIZE >> 9);
card->queue->limits.discard_granularity = RSXX_HW_BLK_SIZE;
card->queue->limits.discard_alignment   = RSXX_HW_BLK_SIZE;
-   card->queue->limits.discard_zeroes_data = 1;
}
 
card->queue->queuedata = card;
-- 
2.11.0



[PATCH 24/25] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
It seems like DRBD assumes its on the wire TRIM request always zeroes data.
Use that fact to implement REQ_OP_WRITE_ZEROES.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_main.c | 3 ++-
 drivers/block/drbd/drbd_nl.c   | 2 ++
 drivers/block/drbd/drbd_receiver.c | 6 +++---
 drivers/block/drbd/drbd_req.c  | 7 +--
 drivers/block/drbd/drbd_worker.c   | 4 +++-
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..8e62d9f65510 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1668,7 +1668,8 @@ static u32 bio_flags_to_wire(struct drbd_connection 
*connection,
(bio->bi_opf & REQ_FUA ? DP_FUA : 0) |
(bio->bi_opf & REQ_PREFLUSH ? DP_FLUSH : 0) |
(bio_op(bio) == REQ_OP_WRITE_SAME ? DP_WSAME : 0) |
-   (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0);
+   (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0) |
+   (bio_op(bio) == REQ_OP_WRITE_ZEROES ? DP_DISCARD : 0);
else
return bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0;
 }
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 908c704e20aa..e4516d3b971d 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1217,10 +1217,12 @@ static void decide_on_discard_support(struct 
drbd_device *device,
blk_queue_discard_granularity(q, 512);
q->limits.max_discard_sectors = 
drbd_max_discard_sectors(connection);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+   q->limits.max_write_zeroes_sectors = 
drbd_max_discard_sectors(connection);
} else {
queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
blk_queue_discard_granularity(q, 0);
q->limits.max_discard_sectors = 0;
+   q->limits.max_write_zeroes_sectors = 0;
}
 }
 
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index bc1d296581f9..1b0a2be24f39 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -2285,7 +2285,7 @@ static unsigned long wire_flags_to_bio_flags(u32 dpf)
 static unsigned long wire_flags_to_bio_op(u32 dpf)
 {
if (dpf & DP_DISCARD)
-   return REQ_OP_DISCARD;
+   return REQ_OP_WRITE_ZEROES;
else
return REQ_OP_WRITE;
 }
@@ -2476,7 +2476,7 @@ static int receive_Data(struct drbd_connection 
*connection, struct packet_info *
op_flags = wire_flags_to_bio_flags(dp_flags);
if (pi->cmd == P_TRIM) {
D_ASSERT(peer_device, peer_req->i.size > 0);
-   D_ASSERT(peer_device, op == REQ_OP_DISCARD);
+   D_ASSERT(peer_device, op == REQ_OP_WRITE_ZEROES);
D_ASSERT(peer_device, peer_req->pages == NULL);
} else if (peer_req->pages == NULL) {
D_ASSERT(device, peer_req->i.size == 0);
@@ -4789,7 +4789,7 @@ static int receive_rs_deallocated(struct drbd_connection 
*connection, struct pac
 
if (get_ldev(device)) {
struct drbd_peer_request *peer_req;
-   const int op = REQ_OP_DISCARD;
+   const int op = REQ_OP_WRITE_ZEROES;
 
peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER, sector,
   size, 0, GFP_NOIO);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 6da9ea8c48b6..b5730e17b455 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -59,6 +59,7 @@ static struct drbd_request *drbd_req_new(struct drbd_device 
*device, struct bio
drbd_req_make_private_bio(req, bio_src);
req->rq_state = (bio_data_dir(bio_src) == WRITE ? RQ_WRITE : 0)
  | (bio_op(bio_src) == REQ_OP_WRITE_SAME ? RQ_WSAME : 0)
+ | (bio_op(bio_src) == REQ_OP_WRITE_ZEROES ? RQ_UNMAP : 0)
  | (bio_op(bio_src) == REQ_OP_DISCARD ? RQ_UNMAP : 0);
req->device = device;
req->master_bio = bio_src;
@@ -1180,7 +1181,8 @@ drbd_submit_req_private_bio(struct drbd_request *req)
if (get_ldev(device)) {
if (drbd_insert_fault(device, type))
bio_io_error(bio);
-   else if (bio_op(bio) == REQ_OP_DISCARD)
+   else if (bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+bio_op(bio) == REQ_OP_DISCARD)
drbd_process_discard_req(req);
else
generic_make_request(bio);
@@ -1234,7 +1236,8 @@ drbd_request_prepare(struct drbd_device *device, struct 
bio *bio, unsigned long
_drbd_start_io_acct(device, req);
 
/* process discards always from our submitter thread */
-   if (bio_op(b

[PATCH 25/25] block: remove the discard_zeroes_data flag

2017-03-31 Thread Christoph Hellwig
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.

Signed-off-by: Christoph Hellwig 
---
 Documentation/ABI/testing/sysfs-block | 10 ++-
 Documentation/block/queue-sysfs.txt   |  5 
 block/blk-lib.c   |  7 +
 block/blk-settings.c  |  3 ---
 block/blk-sysfs.c |  2 +-
 block/compat_ioctl.c  |  2 +-
 block/ioctl.c |  2 +-
 drivers/block/drbd/drbd_main.c|  2 --
 drivers/block/drbd/drbd_nl.c  |  7 +
 drivers/block/loop.c  |  2 --
 drivers/block/mtip32xx/mtip32xx.c |  1 -
 drivers/block/nbd.c   |  1 -
 drivers/md/dm-cache-target.c  |  1 -
 drivers/md/dm-crypt.c |  1 -
 drivers/md/dm-raid.c  |  6 ++---
 drivers/md/dm-raid1.c |  1 -
 drivers/md/dm-table.c | 19 -
 drivers/md/dm-thin.c  |  2 --
 drivers/md/raid5.c| 50 +++
 drivers/scsi/sd.c |  5 
 drivers/target/target_core_device.c   |  2 +-
 include/linux/blkdev.h| 15 ---
 include/linux/device-mapper.h |  5 
 23 files changed, 27 insertions(+), 124 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block 
b/Documentation/ABI/testing/sysfs-block
index 2da04ce6aeef..dea212db9df3 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -213,14 +213,8 @@ What:  
/sys/block//queue/discard_zeroes_data
 Date:  May 2011
 Contact:   Martin K. Petersen 
 Description:
-   Devices that support discard functionality may return
-   stale or random data when a previously discarded block
-   is read back. This can cause problems if the filesystem
-   expects discarded blocks to be explicitly cleared. If a
-   device reports that it deterministically returns zeroes
-   when a discarded area is read the discard_zeroes_data
-   parameter will be set to one. Otherwise it will be 0 and
-   the result of reading a discarded area is undefined.
+   Will always return 0.  Don't rely on any specific behavior
+   for discards, and don't read this file.
 
 What:  /sys/block//queue/write_same_max_bytes
 Date:  January 2012
diff --git a/Documentation/block/queue-sysfs.txt 
b/Documentation/block/queue-sysfs.txt
index b7f6bdc96d73..2c1e67058fd3 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -43,11 +43,6 @@ large discards are issued, setting this value lower will 
make Linux issue
 smaller discards and potentially help reduce latencies induced by large
 discard operations.
 
-discard_zeroes_data (RO)
-
-When read, this file will show if the discarded block are zeroed by the
-device or not. If its value is '1' the blocks are zeroed otherwise not.
-
 hw_sector_size (RO)
 ---
 This is the hardware sector size of the device, in bytes.
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7c27211570fa..268bc9e054c0 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -37,17 +37,12 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
return -ENXIO;
 
if (flags & BLKDEV_DISCARD_SECURE) {
-   if (flags & BLKDEV_DISCARD_ZERO)
-   return -EOPNOTSUPP;
if (!blk_queue_secure_erase(q))
return -EOPNOTSUPP;
op = REQ_OP_SECURE_ERASE;
} else {
if (!blk_queue_discard(q))
return -EOPNOTSUPP;
-   if ((flags & BLKDEV_DISCARD_ZERO) &&
-   !q->limits.discard_zeroes_data)
-   return -EOPNOTSUPP;
op = REQ_OP_DISCARD;
}
 
@@ -126,7 +121,7 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
&bio);
if (!ret && bio) {
ret = submit_bio_wait(bio);
-   if (ret == -EOPNOTSUPP && !(flags & BLKDEV_DISCARD_ZERO))
+   if (ret == -EOPNOTSUPP)
ret = 0;
bio_put(bio);
}
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 1e7174ffc9d4..4fa81ed383ca 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -103,7 +103,6 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->discard_granularity = 0;
lim->discard_alignment = 0;
lim->discard_misaligned = 0;
-   lim->discard_zeroes_data = 0;
lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
lim->alignment_offset = 0;
@@ -127,7 +126,6 @@ void blk_s

[PATCH 23/25] drbd: make intelligent use of blkdev_issue_zeroout

2017-03-31 Thread Christoph Hellwig
drbd always wants its discard wire operations to zero the blocks, so
use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
reinventing it poorly.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_debugfs.c  |   3 --
 drivers/block/drbd/drbd_int.h  |   6 ---
 drivers/block/drbd/drbd_receiver.c | 102 ++---
 drivers/block/drbd/drbd_req.c  |   6 +--
 4 files changed, 7 insertions(+), 110 deletions(-)

diff --git a/drivers/block/drbd/drbd_debugfs.c 
b/drivers/block/drbd/drbd_debugfs.c
index de5c3ee8a790..494837e59f23 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -236,9 +236,6 @@ static void seq_print_peer_request_flags(struct seq_file 
*m, struct drbd_peer_re
seq_print_rq_state_bit(m, f & EE_CALL_AL_COMPLETE_IO, &sep, "in-AL");
seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, &sep, "C");
seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, &sep, "set-in-sync");
-
-   if (f & EE_IS_TRIM)
-   __seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, &sep, 
"zero-out", "trim");
seq_print_rq_state_bit(m, f & EE_WRITE_SAME, &sep, "write-same");
seq_putc(m, '\n');
 }
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 724d1c50fc52..d5da45bb03a6 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -437,9 +437,6 @@ enum {
 
/* is this a TRIM aka REQ_DISCARD? */
__EE_IS_TRIM,
-   /* our lower level cannot handle trim,
-* and we want to fall back to zeroout instead */
-   __EE_IS_TRIM_USE_ZEROOUT,
 
/* In case a barrier failed,
 * we need to resubmit without the barrier flag. */
@@ -482,7 +479,6 @@ enum {
 #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
 #define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)
 #define EE_IS_TRIM (1<<__EE_IS_TRIM)
-#define EE_IS_TRIM_USE_ZEROOUT (1<<__EE_IS_TRIM_USE_ZEROOUT)
 #define EE_RESUBMITTED (1<<__EE_RESUBMITTED)
 #define EE_WAS_ERROR   (1<<__EE_WAS_ERROR)
 #define EE_HAS_DIGEST  (1<<__EE_HAS_DIGEST)
@@ -1561,8 +1557,6 @@ extern void start_resync_timer_fn(unsigned long data);
 extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
 
 /* drbd_receiver.c */
-extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
-   sector_t start, unsigned int nr_sectors, bool discard);
 extern int drbd_receiver(struct drbd_thread *thi);
 extern int drbd_ack_receiver(struct drbd_thread *thi);
 extern void drbd_send_ping_wf(struct work_struct *ws);
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index dc9a6dcd431c..bc1d296581f9 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1448,108 +1448,14 @@ void drbd_bump_write_ordering(struct drbd_resource 
*resource, struct drbd_backin
drbd_info(resource, "Method to ensure write ordering: %s\n", 
write_ordering_str[resource->write_ordering]);
 }
 
-/*
- * We *may* ignore the discard-zeroes-data setting, if so configured.
- *
- * Assumption is that it "discard_zeroes_data=0" is only because the backend
- * may ignore partial unaligned discards.
- *
- * LVM/DM thin as of at least
- *   LVM version: 2.02.115(2)-RHEL7 (2015-01-28)
- *   Library version: 1.02.93-RHEL7 (2015-01-28)
- *   Driver version:  4.29.0
- * still behaves this way.
- *
- * For unaligned (wrt. alignment and granularity) or too small discards,
- * we zero-out the initial (and/or) trailing unaligned partial chunks,
- * but discard all the aligned full chunks.
- *
- * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
- */
-int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, 
unsigned int nr_sectors, bool discard)
-{
-   struct block_device *bdev = device->ldev->backing_bdev;
-   struct request_queue *q = bdev_get_queue(bdev);
-   sector_t tmp, nr;
-   unsigned int max_discard_sectors, granularity;
-   int alignment;
-   int err = 0;
-
-   if (!discard)
-   goto zero_out;
-
-   /* Zero-sector (unknown) and one-sector granularities are the same.  */
-   granularity = max(q->limits.discard_granularity >> 9, 1U);
-   alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
-
-   max_discard_sectors = min(q->limits.max_discard_sectors, (1U << 22));
-   max_discard_sectors -= max_discard_sectors % granularity;
-   if (unlikely(!max_discard_sectors))
-   goto zero_out;
-
-   if (nr_sectors < granularity)
-   goto zero_out;
-
-   tmp = start;
-   if (sector_div(tmp, granularity) != alignment) {
-   if (nr_sectors < 2*granularity)
-   goto zero_out;
-   /* start + gran - (start + gran - align) % gran */
-   tmp = start + granularity

[PATCH 21/25] mmc: remove the discard_zeroes_data flag

2017-03-31 Thread Christoph Hellwig
mmc only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.

Signed-off-by: Christoph Hellwig 
---
 drivers/mmc/core/queue.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 493eb10ce580..4c54ad34e17a 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -167,8 +167,6 @@ static void mmc_queue_setup_discard(struct request_queue *q,
 
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
blk_queue_max_discard_sectors(q, max_discard);
-   if (card->erased_byte == 0 && !mmc_can_discard(card))
-   q->limits.discard_zeroes_data = 1;
q->limits.discard_granularity = card->pref_erase << 9;
/* granularity must not be greater than max. discard */
if (card->pref_erase > max_discard)
-- 
2.11.0



[PATCH 17/25] loop: implement REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
It's identical to discard as hole punches will always leave us with
zeroes on reads.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/loop.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0ecb6461ed81..265cd2e33ff0 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -528,6 +528,7 @@ static int do_req_filebacked(struct loop_device *lo, struct 
request *rq)
case REQ_OP_FLUSH:
return lo_req_flush(lo, rq);
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
return lo_discard(lo, rq, pos);
case REQ_OP_WRITE:
if (lo->transfer)
@@ -826,6 +827,7 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_granularity = 0;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, 0);
+   blk_queue_max_write_zeroes_sectors(q, 0);
q->limits.discard_zeroes_data = 0;
queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
return;
@@ -834,6 +836,7 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_granularity = inode->i_sb->s_blocksize;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+   blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
q->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -1660,6 +1663,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
switch (req_op(cmd->rq)) {
case REQ_OP_FLUSH:
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
cmd->use_aio = false;
break;
default:
-- 
2.11.0



[PATCH 14/25] sd: implement unmapping Write Zeroes

2017-03-31 Thread Christoph Hellwig
Try to use a write same with unmap bit variant if the device supports it
and the caller allows for it.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d8d9c0bdd93c..001593ed0444 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -803,6 +803,15 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd 
*cmd)
u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
 
+   if (!(rq->cmd_flags & REQ_NOUNMAP)) {
+   switch (sdkp->provisioning_mode) {
+   case SD_LBP_WS16:
+   return sd_setup_write_same16_cmnd(cmd, true);
+   case SD_LBP_WS10:
+   return sd_setup_write_same10_cmnd(cmd, true);
+   }
+   }
+
if (sdp->no_write_same)
return BLKPREP_INVALID;
if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
-- 
2.11.0



[PATCH 11/25] block: add a REQ_UNMAP flag for REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
If this flag is set logical provisioning capable device should
release space for the zeroed blocks if possible, if it is not set
devices should keep the blocks anchored.

Also remove an out of sync kerneldoc comment for a static function
that would have become even more out of data with this change.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c   | 19 +--
 include/linux/blk_types.h |  6 ++
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index f9f24ec69c27..2f6d2cb2e1a2 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -226,20 +226,9 @@ int blkdev_issue_write_same(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
 
-/**
- * __blkdev_issue_write_zeroes - generate number of bios with WRITE ZEROES
- * @bdev:  blockdev to issue
- * @sector:start sector
- * @nr_sects:  number of sectors to write
- * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @biop:  pointer to anchor bio
- *
- * Description:
- *  Generate and issue number of bios(REQ_OP_WRITE_ZEROES) with zerofiled 
pages.
- */
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
-   struct bio **biop)
+   struct bio **biop, unsigned flags)
 {
struct bio *bio = *biop;
unsigned int max_write_zeroes_sectors;
@@ -258,7 +247,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
bio = next_bio(bio, 0, gfp_mask);
bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
-   bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0);
+   bio->bi_opf = REQ_OP_WRITE_ZEROES;
+   if (flags & BLKDEV_ZERO_NOUNMAP)
+   bio->bi_opf |= REQ_NOUNMAP;
 
if (nr_sects > max_write_zeroes_sectors) {
bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
@@ -306,7 +297,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return -EINVAL;
 
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
-   biop);
+   biop, flags);
if (ret == 0 || (ret && ret != -EOPNOTSUPP))
goto out;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4eae30bfbfca..8eaa7dca7057 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -195,6 +195,10 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD,   /* read ahead, can fail anytime */
__REQ_BACKGROUND,   /* background IO */
+
+   /* command specific flags for REQ_OP_WRITE_ZEROES: */
+   __REQ_NOUNMAP,  /* do not free blocks when zeroing */
+
__REQ_NR_BITS,  /* stops here */
 };
 
@@ -212,6 +216,8 @@ enum req_flag_bits {
 #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
 
+#define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP)
+
 #define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
-- 
2.11.0



[PATCH 15/25] nvme: implement REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
But now for the real NVMe Write Zeroes yet, just to get rid of the
discard abuse for zeroing.  Also rename the quirk flag to be a bit
more self-explanatory.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 10 +-
 drivers/nvme/host/nvme.h |  6 +++---
 drivers/nvme/host/pci.c  |  6 +++---
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4a6d7f408769..94b41d847b01 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -335,6 +335,8 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
case REQ_OP_FLUSH:
nvme_setup_flush(ns, cmd);
break;
+   case REQ_OP_WRITE_ZEROES:
+   /* currently only aliased to deallocate for a few ctrls: */
case REQ_OP_DISCARD:
ret = nvme_setup_discard(ns, req, cmd);
break;
@@ -900,16 +902,14 @@ static void nvme_config_discard(struct nvme_ns *ns)
BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) <
NVME_DSM_MAX_RANGES);
 
-   if (ctrl->quirks & NVME_QUIRK_DISCARD_ZEROES)
-   ns->queue->limits.discard_zeroes_data = 1;
-   else
-   ns->queue->limits.discard_zeroes_data = 0;
-
ns->queue->limits.discard_alignment = logical_block_size;
ns->queue->limits.discard_granularity = logical_block_size;
blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+
+   if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
+   blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX);
 }
 
 static int nvme_revalidate_ns(struct nvme_ns *ns, struct nvme_id_ns **id)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2aa20e3e5675..07ebc4a1c8fc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -68,10 +68,10 @@ enum nvme_quirks {
NVME_QUIRK_IDENTIFY_CNS = (1 << 1),
 
/*
-* The controller deterministically returns O's on reads to discarded
-* logical blocks.
+* The controller deterministically returns O's on reads to
+* logical blocks that deallocate was called on.
 */
-   NVME_QUIRK_DISCARD_ZEROES   = (1 << 2),
+   NVME_QUIRK_DEALLOCATE_ZEROES= (1 << 2),
 
/*
 * The controller needs a delay before starts checking the device
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 26a5fd05fe88..0a28787267f0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2135,13 +2135,13 @@ static const struct pci_error_handlers nvme_err_handler 
= {
 static const struct pci_device_id nvme_id_table[] = {
{ PCI_VDEVICE(INTEL, 0x0953),
.driver_data = NVME_QUIRK_STRIPE_SIZE |
-   NVME_QUIRK_DISCARD_ZEROES, },
+   NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x0a53),
.driver_data = NVME_QUIRK_STRIPE_SIZE |
-   NVME_QUIRK_DISCARD_ZEROES, },
+   NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x0a54),
.driver_data = NVME_QUIRK_STRIPE_SIZE |
-   NVME_QUIRK_DISCARD_ZEROES, },
+   NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x5845),   /* Qemu emulated controller */
.driver_data = NVME_QUIRK_IDENTIFY_CNS, },
{ PCI_DEVICE(0x1c58, 0x0003),   /* HGST adapter */
-- 
2.11.0



[PATCH 19/25] rbd: remove the discard_zeroes_data flag

2017-03-31 Thread Christoph Hellwig
rbd only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/rbd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 517838b65964..0ec3b430e81d 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4380,7 +4380,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
q->limits.discard_granularity = segment_size;
q->limits.discard_alignment = segment_size;
blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE);
-   q->limits.discard_zeroes_data = 1;
 
if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
-- 
2.11.0



[PATCH 13/25] block_dev: use blkdev_issue_zerout for hole punches

2017-03-31 Thread Christoph Hellwig
This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
and prepare for removing the discard_zeroes_data flag.

Also remove a pointless discard support check, which is done in
blkdev_issue_discard already.

Signed-off-by: Christoph Hellwig 
---
 fs/block_dev.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2f704c3a816f..e405d8e58e31 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2069,7 +2069,6 @@ static long blkdev_fallocate(struct file *file, int mode, 
loff_t start,
 loff_t len)
 {
struct block_device *bdev = I_BDEV(bdev_file_inode(file));
-   struct request_queue *q = bdev_get_queue(bdev);
struct address_space *mapping;
loff_t end = start + len - 1;
loff_t isize;
@@ -2108,15 +2107,10 @@ static long blkdev_fallocate(struct file *file, int 
mode, loff_t start,
GFP_KERNEL, BLKDEV_ZERO_NOUNMAP);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-   /* Only punch if the device can do zeroing discard. */
-   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
-   return -EOPNOTSUPP;
-   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
-GFP_KERNEL, 0);
+   error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 
BLKDEV_ZERO_NOFALLBACK);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
-   if (!blk_queue_discard(q))
-   return -EOPNOTSUPP;
error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 GFP_KERNEL, 0);
break;
-- 
2.11.0



[PATCH 06/25] dm io: discards don't take a payload

2017-03-31 Thread Christoph Hellwig
Fix up do_region to not allocate a bio_vec for discards.  We've
got rid of the discard payload allocated by the caller years ago.

Obviously this wasn't actually harmful given how long it's been
there, but it's still good to avoid the pointless allocation.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-io.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..b808cbe22678 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -328,11 +328,17 @@ static void do_region(int op, int op_flags, unsigned 
region,
/*
 * Allocate a suitably sized-bio.
 */
-   if ((op == REQ_OP_DISCARD) || (op == REQ_OP_WRITE_SAME))
+   switch (op) {
+   case REQ_OP_DISCARD:
+   num_bvecs = 0;
+   break;
+   case REQ_OP_WRITE_SAME:
num_bvecs = 1;
-   else
+   break;
+   default:
num_bvecs = min_t(int, BIO_MAX_PAGES,
  dm_sector_div_up(remaining, 
(PAGE_SIZE >> SECTOR_SHIFT)));
+   }
 
bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios);
bio->bi_iter.bi_sector = where->sector + (where->count - 
remaining);
-- 
2.11.0



[PATCH 04/25] sd: implement REQ_OP_WRITE_ZEROES

2017-03-31 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 31 ++-
 drivers/scsi/sd_zbc.c |  1 +
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b853f91fb3da..d8d9c0bdd93c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -735,7 +735,7 @@ static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
return scsi_init_io(cmd);
 }
 
-static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, bool unmap)
 {
struct scsi_device *sdp = cmd->device;
struct request *rq = cmd->request;
@@ -752,13 +752,14 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd 
*cmd)
 
cmd->cmd_len = 16;
cmd->cmnd[0] = WRITE_SAME_16;
-   cmd->cmnd[1] = 0x8; /* UNMAP */
+   if (unmap)
+   cmd->cmnd[1] = 0x8; /* UNMAP */
put_unaligned_be64(sector, &cmd->cmnd[2]);
put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
 
cmd->allowed = SD_MAX_RETRIES;
cmd->transfersize = data_len;
-   rq->timeout = SD_TIMEOUT;
+   rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
scsi_req(rq)->resid_len = data_len;
 
return scsi_init_io(cmd);
@@ -788,12 +789,27 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmnd 
*cmd, bool unmap)
 
cmd->allowed = SD_MAX_RETRIES;
cmd->transfersize = data_len;
-   rq->timeout = SD_TIMEOUT;
+   rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
scsi_req(rq)->resid_len = data_len;
 
return scsi_init_io(cmd);
 }
 
+static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
+{
+   struct request *rq = cmd->request;
+   struct scsi_device *sdp = cmd->device;
+   struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+
+   if (sdp->no_write_same)
+   return BLKPREP_INVALID;
+   if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
+   return sd_setup_write_same16_cmnd(cmd, false);
+   return sd_setup_write_same10_cmnd(cmd, false);
+}
+
 static void sd_config_write_same(struct scsi_disk *sdkp)
 {
struct request_queue *q = sdkp->disk->queue;
@@ -823,6 +839,8 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 out:
blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
 (logical_block_size >> 9));
+   blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks *
+(logical_block_size >> 9));
 }
 
 /**
@@ -1163,7 +1181,7 @@ static int sd_init_command(struct scsi_cmnd *cmd)
case SD_LBP_UNMAP:
return sd_setup_unmap_cmnd(cmd);
case SD_LBP_WS16:
-   return sd_setup_write_same16_cmnd(cmd);
+   return sd_setup_write_same16_cmnd(cmd, true);
case SD_LBP_WS10:
return sd_setup_write_same10_cmnd(cmd, true);
case SD_LBP_ZERO:
@@ -1171,6 +1189,8 @@ static int sd_init_command(struct scsi_cmnd *cmd)
default:
return BLKPREP_INVALID;
}
+   case REQ_OP_WRITE_ZEROES:
+   return sd_setup_write_zeroes_cmnd(cmd);
case REQ_OP_WRITE_SAME:
return sd_setup_write_same_cmnd(cmd);
case REQ_OP_FLUSH:
@@ -1810,6 +1830,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 
switch (req_op(req)) {
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
if (!result) {
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 92620c8ea8ad..1994f7799fce 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -329,6 +329,7 @@ void sd_zbc_complete(struct scsi_cmnd *cmd,
 
switch (req_op(rq)) {
case REQ_OP_WRITE:
+   case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
 
-- 
2.11.0



[PATCH 09/25] block: stop using blkdev_issue_write_same for zeroing

2017-03-31 Thread Christoph Hellwig
We'll always use the WRITE ZEROES code for zeroing now.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e5b853f2b8a2..2a8d638544a7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -364,10 +364,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return 0;
}
 
-   if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
-   ZERO_PAGE(0)))
-   return 0;
-
blk_start_plug(&plug);
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
&bio, discard);
-- 
2.11.0



problem with discard granularity in sd

2017-03-31 Thread David Buckley
Hello,

Hopefully this is the right place for this, and apologies for the
lengthy mail.  I'm struggling with an issue with SCSI UNMAP/discard in
newer kernels, and I'm hoping to find a resolution or at least to
better understand why this has changed.

Some background info:
Our Linux boxes are primarily VMs running on VMware backed by NetApp
storage.  We have a fair number of systems that directly mount LUNs
(due to i/o requirements, snapshot scheduling, dedupe issues, etc.).
On newer LUNs, the 'space_alloc' option is enabled, which causes the
LUN to report unmap support and free unused blocks on the underlying
storage.

The problem:
I noticed multiple LUNs with space_alloc enabled reported 100%
utilization on the netapp but much less from the Linux. I verified
they were mounted with discard option and also ran fstrim, which
reported success but did not change the utilization reported by the
netapp.  I eventually was able to isolate kernel version as the only
factor in whether discard worked.  Further testing showed 3.10.x
handled discard correctly, but 4.4.x would never free blocks.  This
was verified on a single machine with the only change being the
kernel.

The only notable difference I could find was in
/sys/block/sdX/discard* values - on 3.10.x the discard granularity was
reported as 4096, while on 4.4.x it was 512 (logical block size is
512, physical is 4096 on the LUNs).  Eventually that led me to these
patches for sd.c:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/scsi/sd.c?id=397737223c59e89dca7305feb6528caef8fbef84
and 
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/scsi/sd.c?id=f4327a95dd080ed6aecb185478a88ce1ee4fa3c4.
They result in discard granularity being forced to logical block size
if the disk reports LBPRZ is enabled (which the netapp luns do).  It
seems that this change is responsible for the difference in discard
granularity, and my assumption is that because wafl is actually a 4k
block filesystem the netapp requires 4k granularity and ignores the
512b discard requests.

It's not clear to me whether this is a bug in sd or an issue in the
way the LUNs are presented from the netapp side (I've opened a case
with them as well and am waiting to hear back).  However,
minimum_io_size is 4096, so it seems a bit odd that
discard_granularity would be smaller.  And earlier kernel versions
work as expected, which seems to indicate the problem is in sd.

As far as fixes or workarounds, it seems that there are three potential options:

1) The netapp could change the reported logical block size to match
the physical block size
2) The netapp could report LBPRZ=0
3) The sd code could be updated to use max(logical_block_size,
physical_block_size) or  max(logical_block_size, minimum_io_size) or
otherwise changed to ensure discard_granularity is set to a supported
value

I'm not sure of the implications of either the netapp changes, though
reporting 4k logical blocks seems potential as this is supported in
newer OS at least.  The sd change potentially would at least partially
undo the patches referenced above.  But it would seem that (assuming
an aligned filesystem with 4k blocks and minimum_io_size=4096) there
is no possibility of a partial block discard or advantage to sending
the discard requests in 512 blocks?

Any help is greatly appreciated.

Thanks,
-David


Re: [PATCH v6 00/15] Replace PCI pool by DMA pool API

2017-03-31 Thread Romain Perier
ping


Le 19/03/2017 à 18:03, Romain Perier a écrit :
> The current PCI pool API are simple macro functions direct expanded to
> the appropriate dma pool functions. The prototypes are almost the same
> and semantically, they are very similar. I propose to use the DMA pool
> API directly and get rid of the old API.
>
> This set of patches, replaces the old API by the dma pool API
> and remove the defines.
>
> Changes in v6:
> - Fixed an issue reported by kbuild test robot about changes in DAC960
> - Removed patches 15/19,16/19,17/19,18/19. They have been merged by Greg
> - Added Acked-by Tags
>
> Changes in v5:
> - Re-worded the cover letter (remove sentence about checkpatch.pl)
> - Rebased series onto next-20170308
> - Fix typos in commit message
> - Added Acked-by Tags
>
> Changes in v4:
> - Rebased series onto next-20170301
> - Removed patch 20/20: checks done by checkpath.pl, no longer required.
>   Thanks to Peter and Joe for their feedbacks.
> - Added Reviewed-by tags
>
> Changes in v3:
> - Rebased series onto next-20170224
> - Fix checkpath.pl reports for patch 11/20 and patch 12/20
> - Remove prefix RFC
> Changes in v2:
> - Introduced patch 18/20
> - Fixed cosmetic changes: spaces before brace, live over 80 characters
> - Removed some of the check for NULL pointers before calling dma_pool_destroy
> - Improved the regexp in checkpatch for pci_pool, thanks to Joe Perches
> - Added Tested-by and Acked-by tags
>
> Romain Perier (15):
>   block: DAC960: Replace PCI pool old API
>   dmaengine: pch_dma: Replace PCI pool old API
>   IB/mthca: Replace PCI pool old API
>   net: e100: Replace PCI pool old API
>   mlx4: Replace PCI pool old API
>   mlx5: Replace PCI pool old API
>   wireless: ipw2200: Replace PCI pool old API
>   scsi: be2iscsi: Replace PCI pool old API
>   scsi: csiostor: Replace PCI pool old API
>   scsi: lpfc: Replace PCI pool old API
>   scsi: megaraid: Replace PCI pool old API
>   scsi: mpt3sas: Replace PCI pool old API
>   scsi: mvsas: Replace PCI pool old API
>   scsi: pmcraid: Replace PCI pool old API
>   PCI: Remove PCI pool macro functions
>
>  drivers/block/DAC960.c|  38 +
>  drivers/block/DAC960.h|   4 +-
>  drivers/dma/pch_dma.c |  12 +--
>  drivers/infiniband/hw/mthca/mthca_av.c|  10 +--
>  drivers/infiniband/hw/mthca/mthca_cmd.c   |   8 +-
>  drivers/infiniband/hw/mthca/mthca_dev.h   |   4 +-
>  drivers/net/ethernet/intel/e100.c |  12 +--
>  drivers/net/ethernet/mellanox/mlx4/cmd.c  |  10 +--
>  drivers/net/ethernet/mellanox/mlx4/mlx4.h |   2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c |  11 +--
>  drivers/net/wireless/intel/ipw2x00/ipw2200.c  |  13 ++--
>  drivers/scsi/be2iscsi/be_iscsi.c  |   6 +-
>  drivers/scsi/be2iscsi/be_main.c   |   6 +-
>  drivers/scsi/be2iscsi/be_main.h   |   2 +-
>  drivers/scsi/csiostor/csio_hw.h   |   2 +-
>  drivers/scsi/csiostor/csio_init.c |  11 +--
>  drivers/scsi/csiostor/csio_scsi.c |   6 +-
>  drivers/scsi/lpfc/lpfc.h  |  14 ++--
>  drivers/scsi/lpfc/lpfc_init.c |  16 ++--
>  drivers/scsi/lpfc/lpfc_mem.c  | 106 
> +-
>  drivers/scsi/lpfc/lpfc_nvme.c |   6 +-
>  drivers/scsi/lpfc/lpfc_nvmet.c|   4 +-
>  drivers/scsi/lpfc/lpfc_scsi.c |  12 +--
>  drivers/scsi/megaraid/megaraid_mbox.c |  33 
>  drivers/scsi/megaraid/megaraid_mm.c   |  32 
>  drivers/scsi/megaraid/megaraid_sas_base.c |  29 +++
>  drivers/scsi/megaraid/megaraid_sas_fusion.c   |  66 
>  drivers/scsi/mpt3sas/mpt3sas_base.c   |  73 +-
>  drivers/scsi/mvsas/mv_init.c  |   6 +-
>  drivers/scsi/mvsas/mv_sas.c   |   6 +-
>  drivers/scsi/pmcraid.c|  10 +--
>  drivers/scsi/pmcraid.h|   2 +-
>  include/linux/mlx5/driver.h   |   2 +-
>  include/linux/pci.h   |   9 ---
>  34 files changed, 280 insertions(+), 303 deletions(-)
>



Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device

2017-03-31 Thread Sinan Kaya
Hi Logan,

> +/**
> + * p2pmem_unregister() - unregister a p2pmem device
> + * @p: the device to unregister
> + *
> + * The device will remain until all users are done with it
> + */
> +void p2pmem_unregister(struct p2pmem_dev *p)
> +{
> + if (!p)
> + return;
> +
> + dev_info(&p->dev, "unregistered");
> + device_del(&p->dev);
> + ida_simple_remove(&p2pmem_ida, p->id);

Don't you need to clean up the p->pool here.

> + put_device(&p->dev);
> +}
> +EXPORT_SYMBOL(p2pmem_unregister);
> +

I don't like the ugliness around the switch port to be honest. 

Going to whitelist/blacklist looks simpler in my opinion.

Sinan


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH] scsi: mpt3sas: fix hang on ata passthrough commands

2017-03-31 Thread Greg Kroah-Hartman
On Fri, Mar 31, 2017 at 02:10:29PM -0400, Joe Korty wrote:
> On Fri, Mar 31, 2017 at 04:50:52PM +0200, Greg Kroah-Hartman wrote:
> > A: Because it messes up the order in which people normally read text.
> > Q: Why is top-posting such a bad thing?
> > A: Top-posting.
> > Q: What is the most annoying thing in e-mail?
> > 
> > A: No.
> > Q: Should I include quotations after my reply?
> > 
> > 
> > http://daringfireball.net/2007/07/on_top
> > 
> > On Fri, Mar 31, 2017 at 10:16:47AM -0400, Joe Korty wrote:
> > > Hi Greg,
> > > This patch is not yet in 4.4.
> > 
> > What patch is that?  I can't find a git commit id here at all :(
> > 
> > > It is in 4.9 and upstream.  No one has submitted it to you before this
> > > email.  I only discovered the need for it because lockdep complains
> > > when it is missing.  I do not know how it was missed; perhaps there is
> > > a reason rather than it being missed by accident. Others in the know
> > > can comment if they like.
> > 
> > You need to say what you want to have happen for a stable patch, you did
> > read Documenation/stable_kernel_rules.txt, right?
> > 
> > still confused,
> > 
> > greg k-h
> 
> 
> Hi Greg,
> The patch is one that has already been backported into 4.9,
> and as such I assumed it was already in a format acceptable
> for submission.
> 
> 4.4 is the only additional branch that needs this patch,
> and there it is needed only because in 4.4.28 two other
> mpt3sas patches had been backported.  Those two patches
> introduced the lock dependency problem that this, the
> third patch, fixes.
> 
> In summary, either all three patches should be backported,
> or none should be backported.  Porting just two is wrong.

What patches?  Please start over, resend the email in a format that can
be understand as to what commits you want added to what stable kernel
tree. As it is, I still have no clue...

greg k-h


Re: [PATCH] tcmu: Skip Data-Out blocks before gathering Data-In buffer for BIDI case

2017-03-31 Thread Ilias Tsitsimpis
Hi Xiubo,

On Fri, Mar 31, 2017 at 10:35AM, lixi...@cmss.chinamobile.com wrote:
> From: Xiubo Li 
> 
> For the bidirectional case, the Data-Out buffer blocks will always at
> the head of the tcmu_cmd's bitmap, and before gathering the Data-In
> buffer, first of all it should skip the Data-Out ones, or the device
> supporting BIDI commands won't work.
> 
> Fixed: 26418649eead ("target/user: Introduce data_bitmap, replace
>   data_length/data_head/data_tail")
> Reported-by: Ilias Tsitsimpis 
> Signed-off-by: Xiubo Li 

Thanks for taking care of this.

Tested-by: Ilias Tsitsimpis 

Best,

-- 
Ilias


Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device

2017-03-31 Thread Logan Gunthorpe


On 31/03/17 12:49 PM, Sinan Kaya wrote:
> Don't you need to clean up the p->pool here.

See Patch 7 in the series.

>> +put_device(&p->dev);
>> +}
>> +EXPORT_SYMBOL(p2pmem_unregister);
>> +
> 
> I don't like the ugliness around the switch port to be honest. 
> 
> Going to whitelist/blacklist looks simpler in my opinion.

What exactly would you white/black list? It can't be the NIC or the
disk. If it's going to be a white/black list on the switch or root port
then you'd need essentially the same code to ensure they are all behind
the same switch or root port. So you could add a white/black list on top
of the current scheme but you couldn't get rid of it.

Our original plan was to just punt the decision to userspace but we had
pushback on that at LSF.

Thanks,

Logan




[PATCH 0/2] qla2xxx: Bug Fixes for qla2xxx driver

2017-03-31 Thread Himanshu Madhani
Hi Martin,

Couple bug fixes for the scsi-fixes branch please apply to 4.11/scsi-fixes. 

Thanks,
Himanshu 

Milan P Gandhi (1):
  qla2xxx: Fix typo in driver

Sawan Chandak (1):
  qla2xxx: Add fix to read correct register value for ISP82xx.

 drivers/scsi/qla2xxx/qla_attr.c | 2 +-
 drivers/scsi/qla2xxx/qla_bsg.c  | 2 +-
 drivers/scsi/qla2xxx/qla_gs.c   | 2 +-
 drivers/scsi/qla2xxx/qla_init.c | 2 +-
 drivers/scsi/qla2xxx/qla_isr.c  | 6 +++---
 drivers/scsi/qla2xxx/qla_os.c   | 7 ++-
 6 files changed, 13 insertions(+), 8 deletions(-)

-- 
2.12.0



[PATCH 1/2] qla2xxx: Add fix to read correct register value for ISP82xx.

2017-03-31 Thread Himanshu Madhani
From: Sawan Chandak 

Add fix to read correct register value for ISP82xx, during
check for register disconnect.ISP82xx has different base register.

Fixes: a465537ad1a4 ("qla2xxx: Disable the adapter and skip error recovery in 
case of register disconnect")
Signed-off-by: Sawan Chandak 
Signed-off-by: Himanshu Madhani 
Cc: 
---
 drivers/scsi/qla2xxx/qla_os.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 3e7011757c82..83d61d2142e9 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1160,8 +1160,13 @@ static inline
 uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
 {
struct device_reg_24xx __iomem *reg = &ha->iobase->isp24;
+   struct device_reg_82xx __iomem *reg82 = &ha->iobase->isp82;
 
-   return ((RD_REG_DWORD(®->host_status)) == ISP_REG_DISCONNECT);
+   if (IS_P3P_TYPE(ha))
+   return ((RD_REG_DWORD(®82->host_int)) == ISP_REG_DISCONNECT);
+   else
+   return ((RD_REG_DWORD(®->host_status)) ==
+   ISP_REG_DISCONNECT);
 }
 
 /**
-- 
2.12.0



[PATCH 2/2] qla2xxx: Fix typo in driver

2017-03-31 Thread Himanshu Madhani
From: Milan P Gandhi 

Signed-off-by: Milan P Gandhi 
Signed-off-by: Himanshu Madhani 
---
 drivers/scsi/qla2xxx/qla_attr.c | 2 +-
 drivers/scsi/qla2xxx/qla_bsg.c  | 2 +-
 drivers/scsi/qla2xxx/qla_gs.c   | 2 +-
 drivers/scsi/qla2xxx/qla_init.c | 2 +-
 drivers/scsi/qla2xxx/qla_isr.c  | 6 +++---
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_attr.c b/drivers/scsi/qla2xxx/qla_attr.c
index 435ff7fd6384..7c8d6c54ab70 100644
--- a/drivers/scsi/qla2xxx/qla_attr.c
+++ b/drivers/scsi/qla2xxx/qla_attr.c
@@ -695,7 +695,7 @@ qla2x00_sysfs_write_reset(struct file *filp, struct kobject 
*kobj,
case 0x2025e:
if (!IS_P3P_TYPE(ha) || vha != base_vha) {
ql_log(ql_log_info, vha, 0x7071,
-   "FCoE ctx reset no supported.\n");
+   "FCoE ctx reset not supported.\n");
return -EPERM;
}
 
diff --git a/drivers/scsi/qla2xxx/qla_bsg.c b/drivers/scsi/qla2xxx/qla_bsg.c
index 84c9098cc089..069104fed267 100644
--- a/drivers/scsi/qla2xxx/qla_bsg.c
+++ b/drivers/scsi/qla2xxx/qla_bsg.c
@@ -1822,7 +1822,7 @@ qla24xx_process_bidir_cmd(struct bsg_job *bsg_job)
/* Check if operating mode is P2P */
if (ha->operating_mode != P2P) {
ql_log(ql_log_warn, vha, 0x70a4,
-   "Host is operating mode is not P2p\n");
+   "Host operating mode is not P2p\n");
rval = EXT_STATUS_INVALID_CFG;
goto done;
}
diff --git a/drivers/scsi/qla2xxx/qla_gs.c b/drivers/scsi/qla2xxx/qla_gs.c
index ab0f873fd6a1..9bc9aa9e164a 100644
--- a/drivers/scsi/qla2xxx/qla_gs.c
+++ b/drivers/scsi/qla2xxx/qla_gs.c
@@ -144,7 +144,7 @@ qla2x00_chk_ms_status(scsi_qla_host_t *vha, ms_iocb_entry_t 
*ms_pkt,
if (ct_rsp->header.response !=
cpu_to_be16(CT_ACCEPT_RESPONSE)) {
ql_dbg(ql_dbg_disc + ql_dbg_buffer, vha, 0x2077,
-   "%s failed rejected request on port_id: 
%02x%02x%02x Compeltion status 0x%x, response 0x%x\n",
+   "%s failed rejected request on port_id: 
%02x%02x%02x Completion status 0x%x, response 0x%x\n",
routine, vha->d_id.b.domain,
vha->d_id.b.area, vha->d_id.b.al_pa,
comp_status, ct_rsp->header.response);
diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
index f9d2fe7b1ade..034743309ada 100644
--- a/drivers/scsi/qla2xxx/qla_init.c
+++ b/drivers/scsi/qla2xxx/qla_init.c
@@ -2289,7 +2289,7 @@ qla2x00_chip_diag(scsi_qla_host_t *vha)
goto chip_diag_failed;
 
/* Check product ID of chip */
-   ql_dbg(ql_dbg_init, vha, 0x007d, "Checking product Id of chip.\n");
+   ql_dbg(ql_dbg_init, vha, 0x007d, "Checking product ID of chip.\n");
 
mb[1] = RD_MAILBOX_REG(ha, reg, 1);
mb[2] = RD_MAILBOX_REG(ha, reg, 2);
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 3203367a4f42..aac03504d9a3 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -2100,14 +2100,14 @@ qla25xx_process_bidir_status_iocb(scsi_qla_host_t *vha, 
void *pkt,
 
case CS_DATA_OVERRUN:
ql_dbg(ql_dbg_user, vha, 0x70b1,
-   "Command completed with date overrun thread_id=%d\n",
+   "Command completed with data overrun thread_id=%d\n",
thread_id);
rval = EXT_STATUS_DATA_OVERRUN;
break;
 
case CS_DATA_UNDERRUN:
ql_dbg(ql_dbg_user, vha, 0x70b2,
-   "Command completed with date underrun thread_id=%d\n",
+   "Command completed with data underrun thread_id=%d\n",
thread_id);
rval = EXT_STATUS_DATA_UNDERRUN;
break;
@@ -2134,7 +2134,7 @@ qla25xx_process_bidir_status_iocb(scsi_qla_host_t *vha, 
void *pkt,
 
case CS_BIDIR_RD_UNDERRUN:
ql_dbg(ql_dbg_user, vha, 0x70b6,
-   "Command completed with read data data underrun "
+   "Command completed with read data underrun "
"thread_id=%d\n", thread_id);
rval = EXT_STATUS_DATA_UNDERRUN;
break;
-- 
2.12.0



Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device

2017-03-31 Thread Sinan Kaya
On 3/31/2017 5:23 PM, Logan Gunthorpe wrote:
> What exactly would you white/black list? It can't be the NIC or the
> disk. If it's going to be a white/black list on the switch or root port
> then you'd need essentially the same code to ensure they are all behind
> the same switch or root port.

What is so special about being connected to the same switch?

Why don't we allow the feature by default and blacklist by the root ports
that don't work with a quirk.

I'm looking at this from portability perspective to be honest.

I'd rather see the feature enabled by default without any assumptions.
Using it with a switch is just a use case that you happened to test.
It can allow new architectures to use your code tomorrow.

Sinan
-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device

2017-03-31 Thread Logan Gunthorpe


On 31/03/17 03:38 PM, Sinan Kaya wrote:
> On 3/31/2017 5:23 PM, Logan Gunthorpe wrote:
>> What exactly would you white/black list? It can't be the NIC or the
>> disk. If it's going to be a white/black list on the switch or root port
>> then you'd need essentially the same code to ensure they are all behind
>> the same switch or root port.
> 
> What is so special about being connected to the same switch?
> 
> Why don't we allow the feature by default and blacklist by the root ports
> that don't work with a quirk.

Well root ports have the same issue here. There may be more than one
root port or other buses (ie QPI) between the devices in question. So
you can't just say "this system has root port X therefore we can always
use p2pmem". In the end, if you want to do any kind of restrictions
you're going to have to walk the tree, as the code currently does, and
figure out what's between the devices being used and black or white list
accordingly. Then seeing there's just such a vast number of devices out
there you'd almost certainly have to use some kind of white list and not
a black list. Then the question becomes which devices will be white
listed? The first to be listed would be switches seeing they will always
work. This is pretty much what we have (though it doesn't currently
cover multiple levels of switches). The next step, if someone wanted to
test with specific hardware, might be to allow the case where all the
devices are behind the same root port which Intel Ivy Bridge or newer.
However, I don't think a comprehensive white list should be a
requirement for this work to go forward and I don't think anything
you've suggested will remove any of the "ugliness".

What we discussed at LSF was that only allowing cases with a switch was
the simplest way to be sure any given setup would actually work.

> I'm looking at this from portability perspective to be honest.

I'm looking at this from the fact that there's a vast number of
topologies and devices involved, and figuring out which will work is
very complicated and could require a lot of hardware testing. The LSF
folks were primarily concerned with not having users enable the feature
and see breakage or terrible performance.

> I'd rather see the feature enabled by default without any assumptions.
> Using it with a switch is just a use case that you happened to test.
> It can allow new architectures to use your code tomorrow.

That's why I was advocating for letting userspace decide such that if
you're setting up a system with this you say to use a specific p2pmem
device and then you are responsible to test and benchmark it and decide
to use it in going forward. However, this has received a lot of push back.

Logan


Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device

2017-03-31 Thread Sinan Kaya
On 3/31/2017 6:42 PM, Logan Gunthorpe wrote:
> 
> 
> On 31/03/17 03:38 PM, Sinan Kaya wrote:
>> On 3/31/2017 5:23 PM, Logan Gunthorpe wrote:
>>> What exactly would you white/black list? It can't be the NIC or the
>>> disk. If it's going to be a white/black list on the switch or root port
>>> then you'd need essentially the same code to ensure they are all behind
>>> the same switch or root port.
>>
>> What is so special about being connected to the same switch?
>>
>> Why don't we allow the feature by default and blacklist by the root ports
>> that don't work with a quirk.
> 
> Well root ports have the same issue here. There may be more than one
> root port or other buses (ie QPI) between the devices in question. So
> you can't just say "this system has root port X therefore we can always
> use p2pmem". 

We only care about devices on the data path between two devices.

> In the end, if you want to do any kind of restrictions
> you're going to have to walk the tree, as the code currently does, and
> figure out what's between the devices being used and black or white list
> accordingly. Then seeing there's just such a vast number of devices out
> there you'd almost certainly have to use some kind of white list and not
> a black list. Then the question becomes which devices will be white
> listed? 

How about a combination of blacklist + time bomb + peer-to-peer feature?

You can put a restriction with DMI/SMBIOS such that all devices from 2016
work else they belong to blacklist.

> The first to be listed would be switches seeing they will always
> work. This is pretty much what we have (though it doesn't currently
> cover multiple levels of switches). The next step, if someone wanted to
> test with specific hardware, might be to allow the case where all the
> devices are behind the same root port which Intel Ivy Bridge or newer.

Sorry, I'm not familiar with Intel architecture. Based on what you just
wrote, I think I see your point. 

I'm trying to generalize what you are doing to a little
bigger context so that I can use it on another architecture like arm64
where I may or may not have a switch.

This text below is sort of repeating what you are writing above. 

How about this:

The goal is to find a common parent between any two devices that need to
use your code. 

- all bridges/switches on the data need to support peer-to-peer, otherwise
stop.

- Make sure that all devices on the data path are not blacklisted via your
code.

- If there is at least somebody blacklisted, we stop and the feature is
not allowed.

- If we find a common parent and no errors, you are good to go.

- We don't care about devices above the common parent whether they have
some feature X, Y, Z or not. 

Maybe, a little bit less code than what you have but it is flexible and
not that too hard to implement.

Well, the code is in RFC. I don't see why we can't remove some restrictions
and still have your code move forward. 

> However, I don't think a comprehensive white list should be a
> requirement for this work to go forward and I don't think anything
> you've suggested will remove any of the "ugliness".

I don't think the ask above is a very big deal. If you feel like
addressing this on another patchset like you suggested in your cover letter,
I'm fine with that too.

> 
> What we discussed at LSF was that only allowing cases with a switch was
> the simplest way to be sure any given setup would actually work.
> 
>> I'm looking at this from portability perspective to be honest.
> 
> I'm looking at this from the fact that there's a vast number of
> topologies and devices involved, and figuring out which will work is
> very complicated and could require a lot of hardware testing. The LSF
> folks were primarily concerned with not having users enable the feature
> and see breakage or terrible performance.
> 
>> I'd rather see the feature enabled by default without any assumptions.
>> Using it with a switch is just a use case that you happened to test.
>> It can allow new architectures to use your code tomorrow.
> 
> That's why I was advocating for letting userspace decide such that if
> you're setting up a system with this you say to use a specific p2pmem
> device and then you are responsible to test and benchmark it and decide
> to use it in going forward. However, this has received a lot of push back.

Yeah, we shouldn't trust the userspace for such things.

> 
> Logan
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device

2017-03-31 Thread Logan Gunthorpe


On 31/03/17 05:51 PM, Sinan Kaya wrote:
> You can put a restriction with DMI/SMBIOS such that all devices from 2016
> work else they belong to blacklist.

How do you get a manufacturing date for a given device within the
kernel? Is this actually something generically available?

Logan



Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device

2017-03-31 Thread okaya

On 2017-03-31 21:57, Logan Gunthorpe wrote:

On 31/03/17 05:51 PM, Sinan Kaya wrote:
You can put a restriction with DMI/SMBIOS such that all devices from 
2016

work else they belong to blacklist.


How do you get a manufacturing date for a given device within the
kernel? Is this actually something generically available?

Logan


Smbios calls are used all over the place in kernel for introducing new 
functionality while maintaining backwards compatibility.


See drivers/pci and drivers/acpi directory.