Re: [Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-20 Thread Bob Liu

On 03/16/2016 10:32 PM, David Vrabel wrote:
> On 16/03/16 13:59, Bob Liu wrote:
>>
>> But we'd also like to get the VPD information (of the underlying storage
>> device) in Linux blkfront, even though blkfront is not a SCSI device.
> 
> Why does blkback/blkfront need to be involved here?  This is just some
> xenstore keys that can be written by the toolstack and directly read by
> the relevant application in the guest.
> 

Exactly, let me check if they can directly read this xenstore node.

-- 
Regards,
-Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-22 Thread Bob Liu

On 03/17/2016 07:12 PM, Ian Jackson wrote:
> David Vrabel writes ("Re: [Xen-devel] [RFC PATCH] blkif.h: document 
> scsi/0x12/0x83 node"):
>> On 16/03/16 13:59, Bob Liu wrote:
>>> But we'd also like to get the VPD information (of the underlying storage
>>> device) in Linux blkfront, even though blkfront is not a SCSI device.
>>
>> Why does blkback/blkfront need to be involved here?  This is just some
>> xenstore keys that can be written by the toolstack and directly read by
>> the relevant application in the guest.
> 

They want a more generic way because the application may run in all kinds of
environments, including bare metal.
So they prefer to just call ioctl(SG_IO) against a storage device.

> I'm getting rather a different picture here than at first.  Previously
> I thought you had some 3rd-party application, not under your control,
> which expected to see this VPD data.
> 
> But now I think that you're saying the application is under your own
> control.  I don't understand why synthetic VPD data is the best way to
> give your application the information it needs.
> 
> What is the application doing with this VPD data ?  I mean,
> which specific application functions, and how do they depend on the
> VPD data ?
> 

From the feedback I just got, they do *not* want the details to be public.

Anyway, I think this is not a blocker for this patch.
In the Windows PV block driver, we already use the same mechanism to get the
raw INQUIRY data:
 * The Windows PV block driver accepts ioctl(SG_IO).
 * Then it reads this /scsi/0x12/0x83 node.
 * Then it returns the raw INQUIRY data back to the ioctl caller.

Since the Linux guest also wants to do the same thing, let's make this
mechanism a generic interface!
I'll post a patch adding ioctl(SG_IO) support to xen-blkfront, together with an
updated version of this patch, soon.

Thanks,
Bob



[Xen-devel] [PATCH] blkif.h: document scsi/0x12/0x&lt;page&gt; node

2016-03-23 Thread Bob Liu
This patch documents a xenstore node which is used by the XENVBD Windows PV
driver.

The use case is that XenServer may have OEM-specific storage backends, and
there is a requirement to run OEM software in the guest which relies on VPD
information supplied by the storage.
Adding a node to xenstore is the easiest way to get this VPD information from
the backend into the guest, where the XENVBD Windows PV driver can read the
INQUIRY VPD data from this node and return it to the OEM software.

Signed-off-by: Bob Liu 
---
 xen/include/public/io/blkif.h |   24 
 1 file changed, 24 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..afbcbff 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -182,6 +182,30 @@
  *  backend driver paired with a LIFO queue in the frontend will
  *  allow us to have better performance in this scenario.
  *
+ * scsi/0x12/0x&lt;page&gt;
+ * Values: base64 encoded string
+ *
+ * This optional node contains SCSI INQUIRY VPD information.
+ * &lt;page&gt; is the hexadecimal representation of the VPD page code.
+ * Currently only the XENVBD Windows PV driver uses this node.
+ *
+ * A frontend, e.g. the XENVBD Windows PV driver, which represents a Xen
+ * VBD to its containing operating system as a (virtual) SCSI target may
+ * return the specified data in response to INQUIRY commands from its containing OS.
+ *
+ * A frontend which supports this feature must return the backend-specified
+ * data for every INQUIRY command with the EVPD bit set.
+ * For EVPD=1 INQUIRY commands where the corresponding xenstore node
+ * does not exist, the frontend must report (to its containing OS) an
+ * appropriate failure condition.
+ *
+ * A frontend which does not support this feature may simply disregard these
+ * xenstore nodes.
+ *
+ * The data of this string node is base64 encoded. Base64 is a group of
+ * similar binary-to-text encoding schemes that represent binary data in an
+ * ASCII string format by translating it into a radix-64 representation.
+ *
  *--- Request Transport Parameters 
  *
  * max-ring-page-order
-- 
1.7.10.4




Re: [Xen-devel] [PATCH] blkif.h: document scsi/0x12/0x&lt;page&gt; node

2016-03-23 Thread Bob Liu

On 03/23/2016 08:33 PM, Roger Pau Monné wrote:
> On Wed, 23 Mar 2016, Bob Liu wrote:
> 
>> This patch documents a xenstore node which is used by the XENVBD Windows PV
>> driver.
>>
>> The use case is that XenServer may have OEM-specific storage backends, and
>> there is a requirement to run OEM software in the guest which relies on VPD
>> information supplied by the storage.
>> Adding a node to xenstore is the easiest way to get this VPD information from
>> the backend into the guest, where the XENVBD Windows PV driver can read the
>> INQUIRY VPD data from this node and return it to the OEM software.
>>
>> Signed-off-by: Bob Liu 
>> ---
>>  xen/include/public/io/blkif.h |   24 
>>  1 file changed, 24 insertions(+)
>>
>> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
>> index 99f0326..afbcbff 100644
>> --- a/xen/include/public/io/blkif.h
>> +++ b/xen/include/public/io/blkif.h
>> @@ -182,6 +182,30 @@
>>   *  backend driver paired with a LIFO queue in the frontend will
>>   *  allow us to have better performance in this scenario.
>>   *
>> + * scsi/0x12/0x&lt;page&gt;
>> + *  Values: base64 encoded string
>> + *
>> + *  This optional node contains SCSI INQUIRY VPD information.
>> + *  &lt;page&gt; is the hexadecimal representation of the VPD page code.
>> + *  Currently only the XENVBD Windows PV driver uses this node.
>> + *
>> + *  A frontend, e.g. the XENVBD Windows PV driver, which represents a Xen
>> + *  VBD to its containing operating system as a (virtual) SCSI target may
>> + *  return the specified data in response to INQUIRY commands from its containing OS.
>> + *
>> + *  A frontend which supports this feature must return the backend-specified
>> + *  data for every INQUIRY command with the EVPD bit set.
>> + *  For EVPD=1 INQUIRY commands where the corresponding xenstore node
>> + *  does not exist, the frontend must report (to its containing OS) an
>> + *  appropriate failure condition.
>> + *
>> + *  A frontend which does not support this feature may simply disregard these
>> + *  xenstore nodes.
>> + *
>> + *  The data of this string node is base64 encoded. Base64 is a group of
>> + *  similar binary-to-text encoding schemes that represent binary data in an
>> + *  ASCII string format by translating it into a radix-64 representation.
>> + *
> 
> I'm sorry, but I need to raise similar concerns as the ones expressed by 
> other people.
> 
> I understand that those pages that you plan to export to the guest contain 
> some kind of hardware specific information, but how is the guest going to 
> make use of this?
> 
> It can only interact with a Xen virtual block device, and there you can 
> only send read, write, flush and discard requests. Even the block size is 
> hardcoded to 512b by the protocol, so I'm not sure how you are going to 
> use this information.
> 

For this part, there is an ioctl() interface for all block devices.
Looking at virtio-blk in the KVM world, it accepts almost all SCSI commands
via ioctl() as well, even though they already have virtio-scsi.
But that's another story.

Thanks,
Bob

> Also, the fact that it's implemented in some drivers in some OSes isn't an 
> argument for having them added.  FreeBSD had for a very long time a 
> set of custom extensions that were never added to blkif.h, simply because 
> they were broken and unneeded, so the solution was to remove them from the 
> implementation, and the same could happen here IMHO.
> 
> Roger.
> 



[Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-07 Thread Bob Liu
* What's data integrity extension and why?
Modern filesystems feature checksumming of data and metadata to protect against
data corruption.  However, the detection of the corruption is done at read time
which could potentially be months after the data was written.  At that point the
original data that the application tried to write is most likely lost.

The solution in Linux is the data integrity framework which enables protection
information to be pinned to I/Os and sent to/received from controllers that
support it. struct bio has been extended with a pointer to a struct bip which
in turn contains the integrity metadata. The bip is essentially a trimmed down
bio with a bio_vec and some housekeeping.

* Issues when xen-block gets involved.
xen-blkfront only transmits the normal data of struct bio, while the integrity
metadata buffer (struct bio_integrity_payload in each bio) is ignored.

* Proposal for transmitting the bio integrity payload.
Add an extra request following the normal data request; this extra request
contains the integrity payload.
xen-blkback will then reconstruct a new bio from both the received normal data
and the integrity metadata.

Any better ideas are welcome, thank you!

[1] http://lwn.net/Articles/280023/
[2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt

Signed-off-by: Bob Liu 
---
 xen/include/public/io/blkif.h |   50 +
 1 file changed, 50 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..3d8d39f 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -635,6 +635,28 @@
 #define BLKIF_OP_INDIRECT  6
 
 /*
+ * Recognized only if "feature-extra-request" is present in backend xenbus info.
+ * A request with BLKIF_OP_EXTRA_FLAG indicates that an extra request follows
+ * in the shared ring buffer.
+ *
+ * In this way, extra data like a bio integrity payload can be transmitted from
+ * the frontend to the backend.
+ *
+ * The 'wire' format is like:
+ *  Request 1: xen_blkif_request
+ * [Request 2: xen_blkif_extra_request](only if request 1 has BLKIF_OP_EXTRA_FLAG)
+ *  Request 3: xen_blkif_request
+ *  Request 4: xen_blkif_request
+ * [Request 5: xen_blkif_extra_request](only if request 4 has BLKIF_OP_EXTRA_FLAG)
+ *  ...
+ *  Request N: xen_blkif_request
+ *
+ * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* create
+ * the "feature-extra-request" node!
+ */
+#define BLKIF_OP_EXTRA_FLAG (0x80)
+
+/*
  * Maximum scatter/gather segments per request.
  * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
  * NB. This could be 12 if the ring indexes weren't stored in the same page.
@@ -703,6 +725,34 @@ struct blkif_request_indirect {
 };
 typedef struct blkif_request_indirect blkif_request_indirect_t;
 
+enum blkif_extra_request_type {
+   BLKIF_EXTRA_TYPE_DIX = 1,   /* Data integrity extension payload.  */
+};
+
+struct bio_integrity_req {
+   /*
+    * Grant mapping for transmitting bio integrity payload to backend.
+    */
+   grant_ref_t *gref;
+   unsigned int nr_grefs;
+   unsigned int len;
+};
+
+/*
+ * An extra request must follow a normal request, and a normal request can
+ * only be followed by one extra request.
+ */
+struct blkif_request_extra {
+   uint8_t type;   /* BLKIF_EXTRA_TYPE_* */
+   uint16_t _pad1;
+#ifndef CONFIG_X86_32
+   uint32_t _pad2; /* offsetof(blkif_...,u.extra.id) == 8 */
+#endif
+   uint64_t id;
+   struct bio_integrity_req bi_req;
+} __attribute__((__packed__));
+typedef struct blkif_request_extra blkif_request_extra_t;
+
 struct blkif_response {
 uint64_tid;  /* copied from request */
 uint8_t operation;   /* copied from request */
-- 
1.7.10.4




Re: [Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-07 Thread Bob Liu

On 04/07/2016 11:55 PM, Juergen Gross wrote:
> On 07/04/16 12:00, Bob Liu wrote:
>> * What's data integrity extension and why?
>> Modern filesystems feature checksumming of data and metadata to protect 
>> against
>> data corruption.  However, the detection of the corruption is done at read 
>> time
>> which could potentially be months after the data was written.  At that point 
>> the
>> original data that the application tried to write is most likely lost.
>>
>> The solution in Linux is the data integrity framework which enables 
>> protection
>> information to be pinned to I/Os and sent to/received from controllers that
>> support it. struct bio has been extended with a pointer to a struct bip which
>> in turn contains the integrity metadata. The bip is essentially a trimmed 
>> down
>> bio with a bio_vec and some housekeeping.
>>
>> * Issues when xen-block get involved.
>> xen-blkfront only transmits the normal data of struct bio while the integrity
>> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
>>
>> * Proposal of transmitting bio integrity payload.
>> Adding an extra request following the normal data request, this extra request
>> contains the integrity payload.
>> The xen-blkback will reconstruct an new bio with both received normal data 
>> and
>> integrity metadata.
>>
>> Welcome any better ideas, thank you!
>>
>> [1] http://lwn.net/Articles/280023/
>> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>
>> Signed-off-by: Bob Liu 
>> ---
>>  xen/include/public/io/blkif.h |   50 
>> +
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
>> index 99f0326..3d8d39f 100644
>> --- a/xen/include/public/io/blkif.h
>> +++ b/xen/include/public/io/blkif.h
>> @@ -635,6 +635,28 @@
>>  #define BLKIF_OP_INDIRECT  6
>>  
>>  /*
>> + * Recognized only if "feature-extra-request" is present in backend xenbus 
>> info.
>> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is followed
>> + * in the shared ring buffer.
>> + *
>> + * By this way, extra data like bio integrity payload can be transmitted 
>> from
>> + * frontend to backend.
>> + *
>> + * The 'wire' format is like:
>> + *  Request 1: xen_blkif_request
>> + * [Request 2: xen_blkif_extra_request](only if request 1 has 
>> BLKIF_OP_EXTRA_FLAG)
>> + *  Request 3: xen_blkif_request
>> + *  Request 4: xen_blkif_request
>> + * [Request 5: xen_blkif_extra_request](only if request 4 has 
>> BLKIF_OP_EXTRA_FLAG)
>> + *  ...
>> + *  Request N: xen_blkif_request
>> + *
>> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* 
>> create the
>> + * "feature-extra-request" node!
>> + */
>> +#define BLKIF_OP_EXTRA_FLAG (0x80)
>> +
>> +/*
>>   * Maximum scatter/gather segments per request.
>>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>>   * NB. This could be 12 if the ring indexes weren't stored in the same page.
>> @@ -703,6 +725,34 @@ struct blkif_request_indirect {
>>  };
>>  typedef struct blkif_request_indirect blkif_request_indirect_t;
>>  
>> +enum blkif_extra_request_type {
>> +BLKIF_EXTRA_TYPE_DIX = 1,   /* Data integrity extension 
>> payload.  */
>> +};
>> +
>> +struct bio_integrity_req {
>> +/*
>> + * Grant mapping for transmitting bio integrity payload to backend.
>> + */
>> +grant_ref_t *gref;
>> +unsigned int nr_grefs;
>> +unsigned int len;
>> +};
> 
> What does the payload look like? Its structure should be defined here,
> or a reference to its definition should be given in case it is a
> standard.
> 

The payload is also described using struct bio_vec (the same as a bio).

/*
 * bio integrity payload
 */
struct bio_integrity_payload {
struct bio  *bip_bio;   /* parent bio */

struct bvec_iterbip_iter;

bio_end_io_t*bip_end_io;/* saved I/O completion fn */

unsigned short  bip_slab;   /* slab the bip came from */
unsigned short  bip_vcnt;   /* # of integrity bio_vecs */
unsigned short  bip_max_vcnt;   /* integrity bio_vec slots */
unsigned short  bip_flags;  /* control flags */

struct work_struct  bip_work;   

Re: [Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-08 Thread Bob Liu

On 04/08/2016 05:44 PM, Roger Pau Monné wrote:
> On Fri, 8 Apr 2016, Bob Liu wrote:
>>
>> On 04/07/2016 11:55 PM, Juergen Gross wrote:
>>> On 07/04/16 12:00, Bob Liu wrote:
>>>> * What's data integrity extension and why?
>>>> Modern filesystems feature checksumming of data and metadata to protect 
>>>> against
>>>> data corruption.  However, the detection of the corruption is done at read 
>>>> time
>>>> which could potentially be months after the data was written.  At that 
>>>> point the
>>>> original data that the application tried to write is most likely lost.
>>>>
>>>> The solution in Linux is the data integrity framework which enables 
>>>> protection
>>>> information to be pinned to I/Os and sent to/received from controllers that
>>>> support it. struct bio has been extended with a pointer to a struct bip 
>>>> which
>>>> in turn contains the integrity metadata. The bip is essentially a trimmed 
>>>> down
>>>> bio with a bio_vec and some housekeeping.
>>>>
>>>> * Issues when xen-block get involved.
>>>> xen-blkfront only transmits the normal data of struct bio while the 
>>>> integrity
>>>> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
>>>>
>>>> * Proposal of transmitting bio integrity payload.
>>>> Adding an extra request following the normal data request, this extra 
>>>> request
>>>> contains the integrity payload.
>>>> The xen-blkback will reconstruct an new bio with both received normal data 
>>>> and
>>>> integrity metadata.
>>>>
>>>> Welcome any better ideas, thank you!
>>>>
>>>> [1] http://lwn.net/Articles/280023/
>>>> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>>>
>>>> Signed-off-by: Bob Liu 
>>>> ---
>>>>  xen/include/public/io/blkif.h |   50 
>>>> +
>>>>  1 file changed, 50 insertions(+)
>>>>
>>>> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
>>>> index 99f0326..3d8d39f 100644
>>>> --- a/xen/include/public/io/blkif.h
>>>> +++ b/xen/include/public/io/blkif.h
>>>> @@ -635,6 +635,28 @@
>>>>  #define BLKIF_OP_INDIRECT  6
>>>>  
>>>>  /*
>>>> + * Recognized only if "feature-extra-request" is present in backend 
>>>> xenbus info.
>>>> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is 
>>>> followed
>>>> + * in the shared ring buffer.
>>>> + *
>>>> + * By this way, extra data like bio integrity payload can be transmitted 
>>>> from
>>>> + * frontend to backend.
>>>> + *
>>>> + * The 'wire' format is like:
>>>> + *  Request 1: xen_blkif_request
>>>> + * [Request 2: xen_blkif_extra_request](only if request 1 has 
>>>> BLKIF_OP_EXTRA_FLAG)
>>>> + *  Request 3: xen_blkif_request
>>>> + *  Request 4: xen_blkif_request
>>>> + * [Request 5: xen_blkif_extra_request](only if request 4 has 
>>>> BLKIF_OP_EXTRA_FLAG)
>>>> + *  ...
>>>> + *  Request N: xen_blkif_request
>>>> + *
>>>> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* 
>>>> create the
>>>> + * "feature-extra-request" node!
>>>> + */
>>>> +#define BLKIF_OP_EXTRA_FLAG (0x80)
>>>> +
>>>> +/*
>>>>   * Maximum scatter/gather segments per request.
>>>>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>>>>   * NB. This could be 12 if the ring indexes weren't stored in the same 
>>>> page.
>>>> @@ -703,6 +725,34 @@ struct blkif_request_indirect {
>>>>  };
>>>>  typedef struct blkif_request_indirect blkif_request_indirect_t;
>>>>  
>>>> +enum blkif_extra_request_type {
>>>> +  BLKIF_EXTRA_TYPE_DIX = 1,   /* Data integrity extension 
>>>> payload.  */
>>>> +};
>>>> +
>>>> +struct bio_integrity_req {
>>>> +  /*
>>>> +   * Grant mapping for transmitting bio integrity payload to backend.
>>>> +   */
>>>> +  grant_ref_t *gref;
>>>> +  

Re: [Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-11 Thread Bob Liu

On 04/08/2016 10:32 PM, David Vrabel wrote:
> On 08/04/16 15:20, Ian Jackson wrote:
>> David Vrabel writes ("Re: [RFC PATCH] Data integrity extension support for 
>> xen-block"):
>>> You need to read the relevant SCSI specification and find out what
>>> interfaces and behaviour the hardware has so you can specify compatible
>>> interfaces in blkif.
>>>
>>> My (brief) reading around this suggests that the integrity data has a
>>> specific format (a CRC of some form) and the integrity data written for
>>> sector S is retrieved verbatim when sector S is re-read.
>>
>> I think it's this:
>>
>> https://en.wikipedia.org/wiki/Data_Integrity_Field
>> https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>
>> In which case AFAICT the format is up to the guest (ie the operating
>> system or file system) and it's opaque to the host (the storage) -
>> unless the guest consents, of course.
> 
> I disagree, but I can't work out where to get the relevant T10 PI/DIF
> spec from to provide an authoritative link[1].  The DI metadata has a
> set of well-defined formats, most of which include a 16-bit GUARD CRC, a
> 32-bit REFERENCE tag and 16 bits for user-defined usage.
> 

Yes.

> The application cannot use all the bits for its own use since the
> hardware may check the GUARD and REFERENCE tags itself.
> 
> David
> 
> [1] Try: https://www.usenix.org/legacy/event/lsf07/tech/petersen.pdf
> 

And https://oss.oracle.com/projects/data-integrity/dist/documentation/dix.pdf

Whatever the actual format of the integrity metadata looks like, it can be
mapped to a scatterlist by using:
blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio, struct
scatterlist *sglist)
just like blk_rq_map_sg(struct request_queue *q, struct request *rq, struct
scatterlist *sglist) for normal data.

The extra scatterlist can be seen as the interface; we just need to find a
good way of transmitting this extra scatterlist between blkfront and blkback.

-- 
Regards,
-Bob



Re: [Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-13 Thread Bob Liu

On 04/07/2016 06:00 PM, Bob Liu wrote:
> * What's data integrity extension and why?
> Modern filesystems feature checksumming of data and metadata to protect 
> against
> data corruption.  However, the detection of the corruption is done at read 
> time
> which could potentially be months after the data was written.  At that point 
> the
> original data that the application tried to write is most likely lost.
> 
> The solution in Linux is the data integrity framework which enables protection
> information to be pinned to I/Os and sent to/received from controllers that
> support it. struct bio has been extended with a pointer to a struct bip which
> in turn contains the integrity metadata. The bip is essentially a trimmed down
> bio with a bio_vec and some housekeeping.
> 
> * Issues when xen-block get involved.
> xen-blkfront only transmits the normal data of struct bio while the integrity
> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
> 
> * Proposal of transmitting bio integrity payload.
> Adding an extra request following the normal data request, this extra request
> contains the integrity payload.
> The xen-blkback will reconstruct an new bio with both received normal data and
> integrity metadata.
> 
> Welcome any better ideas, thank you!
> 

A simpler possible solution:

bob@boliuliu:~/xen$ git diff xen/include/public/io/blkif.h
diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 3d8d39f..34581a5 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -689,6 +689,11 @@ struct blkif_request_segment {
 struct blkif_request {
 uint8_toperation;/* BLKIF_OP_??? */
 uint8_tnr_segments;  /* number of segments   */
+/*
+ * Recording how many segments are data integrity segments.
+ * raw data_segments + dix_segments = nr_segments
+ */
+uint8_t   dix_segments;
 blkif_vdev_t   handle;   /* only for read/write requests */
 uint64_t   id;   /* private guest value, echoed in resp  */
 blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
@@ -715,6 +720,11 @@ struct blkif_request_indirect {
 uint8_toperation;/* BLKIF_OP_INDIRECT*/
 uint8_tindirect_op;  /* BLKIF_OP_{READ/WRITE}*/
 uint16_t   nr_segments;  /* number of segments   */
+/*
+ * Recording how many segments are data integrity segments.
+ * raw data_segments + dix_segments = nr_segments
+ */
+uint16_t   dix_segments;
 uint64_t   id;   /* private guest value, echoed in resp  */
 blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
 blkif_vdev_t   handle;   /* same as for read/write requests  */



[Xen-devel] [RFC PATCH v2] Data integrity extension(DIX) support for xen-block

2016-04-20 Thread Bob Liu
* What's data integrity extension(DIX) and why?
Modern filesystems feature checksumming of data and metadata to protect against
data corruption.  However, the detection of the corruption is done at read time
which could potentially be months after the data was written.  At that point the
original data that the application tried to write is most likely lost.

The solution in Linux is the data integrity framework which enables protection
information to be pinned to I/Os and sent to/received from controllers that
support it. struct bio has been extended with a pointer to a struct bip which
in turn contains the integrity metadata.
Both raw data and integrity metadata are mapped to two separate scatterlists.

* Issues when xen-block gets involved.
xen-blkfront only transmits the raw data-segment scatterlist of each bio,
while the integrity-metadata-segment scatterlist is ignored.

* Proposal for transmitting the integrity-metadata-segment scatterlist.
Add an extra request following the normal data request; this extra request
contains integrity-metadata segments only.

xen-blkback will then reconstruct the bio from the received data and integrity
segments.

Signed-off-by: Bob Liu 
---
 xen/include/public/io/blkif.h |   19 +++
 1 file changed, 19 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..a0124b2 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -182,6 +182,15 @@
  *  backend driver paired with a LIFO queue in the frontend will
  *  allow us to have better performance in this scenario.
  *
+ * feature-data-integrity
+ *  Values: 0/1 (boolean)
+ *  Default Value:  0
+ *
+ *  A value of "1" indicates that the backend can process requests
+ *  containing the BLKIF_OP_DIX_FLAG request opcode.  A request with this
+ *  flag indicates that the following request is a special request which
+ *  only contains the integrity-metadata segments of the current request.
+ *
  *--- Request Transport Parameters 
  *
  * max-ring-page-order
@@ -635,6 +644,16 @@
 #define BLKIF_OP_INDIRECT  6
 
 /*
+ * Recognized only if "feature-data-integrity" is present in backend xenbus info.
+ * A request with BLKIF_OP_DIX_FLAG indicates that the following request is a special
+ * request containing only the integrity-metadata segments of the current request.
+ *
+ * If a backend does not recognize BLKIF_OP_DIX_FLAG, it should *not* create the
+ * "feature-data-integrity" node!
+ */
+#define BLKIF_OP_DIX_FLAG (0x80)
+
+/*
  * Maximum scatter/gather segments per request.
  * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
  * NB. This could be 12 if the ring indexes weren't stored in the same page.
-- 
1.7.10.4




Re: [Xen-devel] [RFC PATCH v2] Data integrity extension(DIX) support for xen-block

2016-04-20 Thread Bob Liu

On 04/20/2016 04:59 PM, David Vrabel wrote:
> On 20/04/16 08:26, Bob Liu wrote:
>>
>>  /*
>> + * Recognized only if "feature-data-integrity" is present in backend xenbus 
>> info.
>> + * A request with BLKIF_OP_DIX_FLAG indicates the following request is a 
>> special
>> + * request which only contains integrity-metadata segments of current 
>> request.
>> + *
>> + * If a backend does not recognize BLKIF_OP_DIX_FLAG, it should *not* 
>> create the
>> + * "feature-data-integrity" node!
>> + */
>> +#define BLKIF_OP_DIX_FLAG (0x80)
> 
> This looks fine as a mechanism for actually transferring the data but
> you do need to specify:
> 
> 1. The format of this DIX data.  You may reference external
> specifications for this.
> 

Sure!

> 2. A mechanism for reporting which DIX formats the backend supports and
> a way for the frontend to select one (if multiple are selected).
> 

The "feature-data-integrity" node could be extended to an "unsigned int"
instead of a "bool", so as to report all the DIX formats the backend supports.

> 3. The behaviour the frontend can expect from the backend.  (e.g., if
> the frontend writes sector S with DIX data D, a read of sector S with
> complete with DIX data D).
> 

Sorry, I didn't get the point of this example.

Thank you for your review!

-- 
Regards,
-Bob



[Xen-devel] [PATCH] block: xen-blkback: don't get/put blkif ref for each queue

2016-09-26 Thread Bob Liu
xen_blkif_get/put() for each queue is useless, and introduces a bug:

If there is I/O in flight when removing the device, xen_blkif_disconnect()
will return -EBUSY and xen_blkif_put() will not be called.
This means the references are leaked; then, even after the I/O completes,
xen_blkif_put() won't call xen_blkif_deferred_free() to free the resources.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/xenbus.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 3cc6d1d..2e1bb6d 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -159,7 +159,6 @@ static int xen_blkif_alloc_rings(struct xen_blkif *blkif)
init_waitqueue_head(&ring->shutdown_wq);
ring->blkif = blkif;
ring->st_print = jiffies;
-   xen_blkif_get(blkif);
}
 
return 0;
@@ -296,7 +295,6 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
BUG_ON(ring->free_pages_num != 0);
BUG_ON(ring->persistent_gnt_c != 0);
WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));
-   xen_blkif_put(blkif);
}
blkif->nr_ring_pages = 0;
/*
-- 
1.7.10.4




Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-25 Thread Bob Liu

On 07/25/2016 05:20 PM, Roger Pau Monné wrote:
> On Sat, Jul 23, 2016 at 06:18:23AM +0800, Bob Liu wrote:
>>
>> On 07/22/2016 07:45 PM, Roger Pau Monné wrote:
>>> On Fri, Jul 22, 2016 at 05:43:32PM +0800, Bob Liu wrote:
>>>>
>>>> On 07/22/2016 05:34 PM, Roger Pau Monné wrote:
>>>>> On Fri, Jul 22, 2016 at 04:17:48PM +0800, Bob Liu wrote:
>>>>>>
>>>>>> On 07/22/2016 03:45 PM, Roger Pau Monné wrote:
>>>>>>> On Thu, Jul 21, 2016 at 06:08:05PM +0800, Bob Liu wrote:
>>>>>>>>
>>>>>>>> On 07/21/2016 04:57 PM, Roger Pau Monné wrote:
>>>>>> ..[snip]..
>>>>>>>>>> +
>>>>>>>>>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, 
>>>>>>>>>> ssize_t count)
>>>>>>>>>> +{
>>>>>>>>>> +unsigned int i;
>>>>>>>>>> +int err = -EBUSY;
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * Make sure no migration in parallel, device lock is actually a
>>>>>>>>>> + * mutex.
>>>>>>>>>> + */
>>>>>>>>>> +if (!device_trylock(&info->xbdev->dev)) {
>>>>>>>>>> +pr_err("Fail to acquire dev:%s lock, may be in 
>>>>>>>>>> migration.\n",
>>>>>>>>>> +dev_name(&info->xbdev->dev));
>>>>>>>>>> +return err;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * Prevent new requests and guarantee no uncompleted reqs.
>>>>>>>>>> + */
>>>>>>>>>> +blk_mq_freeze_queue(info->rq);
>>>>>>>>>> +if (part_in_flight(&info->gd->part0))
>>>>>>>>>> +goto out;
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * FrontBackend
>>>>>>>>>> + * Switch to XenbusStateClosed
>>>>>>>>>> + *  frontend_changed():
>>>>>>>>>> + *   case XenbusStateClosed:
>>>>>>>>>> + *  
>>>>>>>>>> xen_blkif_disconnect()
>>>>>>>>>> + *  Switch to 
>>>>>>>>>> XenbusStateClosed
>>>>>>>>>> + * blkfront_resume():
>>>>>>>>>> + *  frontend_changed():
>>>>>>>>>> + *  reconnect
>>>>>>>>>> + * Wait until XenbusStateConnected
>>>>>>>>>> + */
>>>>>>>>>> +info->reconfiguring = true;
>>>>>>>>>> +xenbus_switch_state(info->xbdev, XenbusStateClosed);
>>>>>>>>>> +
>>>>>>>>>> +/* Poll every 100ms, 1 minute timeout. */
>>>>>>>>>> +for (i = 0; i < 600; i++) {
>>>>>>>>>> +/*
>>>>>>>>>> + * Wait backend enter XenbusStateClosed, 
>>>>>>>>>> blkback_changed()
>>>>>>>>>> + * will clear reconfiguring.
>>>>>>>>>> + */
>>>>>>>>>> +if (!info->reconfiguring)
>>>>>>>>>> +goto resume;
>>>>>>>>>> +schedule_timeout_interruptible(msecs_to_jiffies(100));
>>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> Instead of having this wait, could you just set info->reconfiguring = 
>>>>>>>>> 1, set 
>>>>>>>>> the frontend state to XenbusStateClosed and mimic exactly what a 
>>>>>>>>> resume

Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-25 Thread Bob Liu

On 07/25/2016 06:53 PM, Roger Pau Monné wrote:
..[snip..]
  * We get the device lock and blk_mq_freeze_queue() in 
 dynamic_reconfig_device(),
and then have to release it in blkif_recover() asynchronously, which makes
the code more difficult to read.
>>>
>>> I'm not sure I'm following here, do you mean that you will pick the lock in 
>>> dynamic_reconfig_device and release it in blkif_recover? Why wouldn't you 
>>
>> Yes.
>>
>>> release the lock in dynamic_reconfig_device after doing whatever is needed?
>>>
>>
>> Both 'dynamic configuration' and migration:xenbus_dev_resume() use { 
>> blkfront_resume(); blkif_recover() } and depends on the change of 
>> xbdev->state.
>> If they happen simultaneously, the State machine of xbdev->state is going to 
>> be a mess and very difficult to track.
>>
>> The lock(actually a mutex) is like a big lock to make sure no race would 
>> happen at all.
> 
> So the only thing that you should do is set the frontend state to closed and 
> wait for the backend to also switch to state closed, and then switch the
> frontend state to init again in order to trigger a reconnection.
> 

But if migration:xenbus_dev_resume() is triggered at the same time, any state
we set might be changed.
==================================================================
E.g.:

Dynamic_reconfig_device:           Migration:xenbus_dev_resume():

Set the frontend state to closed
                                   frontend state will be reset to
                                   XenbusStateInitialising by
                                   xenbus_dev_resume()
Wait for the backend to also
switch to state closed
==================================================================
A similar situation can happen anywhere else the state is set.

> You are right that all this process depends on the state being updated 
> correctly, but I don't see how's that different from a normal connection or 
> resume, and we don't seem to have races there.
> 



Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-25 Thread Bob Liu

On 07/25/2016 08:11 PM, Roger Pau Monné wrote:
> On Mon, Jul 25, 2016 at 07:08:36PM +0800, Bob Liu wrote:
>>
>> On 07/25/2016 06:53 PM, Roger Pau Monné wrote:
>> ..[snip..]
>>>>>>  * We get the device lock and blk_mq_freeze_queue() in 
>>>>>> dynamic_reconfig_device(),
>>>>>> and then have to release it in blkif_recover() asynchronously, which
>>>>>> makes the code more difficult to read.
>>>>>
>>>>> I'm not sure I'm following here, do you mean that you will pick the lock 
>>>>> in 
>>>>> dynamic_reconfig_device and release it in blkif_recover? Why wouldn't you 
>>>>
>>>> Yes.
>>>>
>>>>> release the lock in dynamic_reconfig_device after doing whatever is 
>>>>> needed?
>>>>>
>>>>
>>>> Both 'dynamic configuration' and migration:xenbus_dev_resume() use { 
>>>> blkfront_resume(); blkif_recover() } and depends on the change of 
>>>> xbdev->state.
>>>> If they happen simultaneously, the State machine of xbdev->state is going 
>>>> to be a mess and very difficult to track.
>>>>
>>>> The lock(actually a mutex) is like a big lock to make sure no race would 
>>>> happen at all.
>>>
>>> So the only thing that you should do is set the frontend state to closed 
>>> and 
>>> wait for the backend to also switch to state closed, and then switch the
>>> frontend state to init again in order to trigger a reconnection.
>>>
>>
>> But if migration:xenbus_dev_resume() is triggered at the same time, any
>> state we set might be changed.
>> ==================================================================
>> E.g.:
>>
>> Dynamic_reconfig_device:           Migration:xenbus_dev_resume():
>>
>> Set the frontend state to closed
>>                                    frontend state will be reset to
>>                                    XenbusStateInitialising by
>>                                    xenbus_dev_resume()
>>
>> Wait for the backend to also switch to state closed
> 
> This is not really a race, the backend will not switch to state closed, and 
> instead will appear at state init again, which is what we wanted anyway in 
> order to reconnect, so I don't see an issue with this flow.
> 
>> ==================================================================
>> Similar situation may happen at any other place regarding set the state.
> 
> Right, but I don't see how your proposed patch also avoids this issues. I 
> see that you pick the device lock in dynamic_reconfig_device, but I don't 
> see xenbus_dev_resume picking the lock at all, and in any case I don't think 

The lock is picked from the power management framework.

Anyway, I'm convinced and will follow your suggestion.
Thank you!

> you should prevent the VM from migrating.
> 
> Depending on the toolstack it might decide to kill the VM because it has not 
> been able to migrate it, in which case the result is not better.
> 



[Xen-devel] [PATCH v2 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-25 Thread Bob Liu
The current VBD layer reserves buffer space for each attached device based on
three statically configured settings which are read at boot time.
 * max_indirect_segs: Maximum number of indirect segments.
 * max_ring_page_order: Maximum order of pages to be used for the shared ring.
 * max_queues: Maximum number of queues (rings) to be used.

But the storage backend, workload, and guest memory produce very different
tuning requirements. It's impossible to predict application characteristics
centrally, so it's best to allow these settings to be adjusted dynamically
based on the workload inside the guest.

Usage:
Show current values:
cat /sys/devices/vbd-xxx/max_indirect_segs
cat /sys/devices/vbd-xxx/max_ring_page_order
cat /sys/devices/vbd-xxx/max_queues

Write new values:
echo <value> > /sys/devices/vbd-xxx/max_indirect_segs
echo <value> > /sys/devices/vbd-xxx/max_ring_page_order
echo <value> > /sys/devices/vbd-xxx/max_queues

Signed-off-by: Bob Liu 
--
v2: Rename to max_ring_page_order and remove the waiting code, as suggested by Roger.
---
 drivers/block/xen-blkfront.c |  275 +-
 1 file changed, 269 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 1b4c380..ff5ebe5 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -212,6 +212,11 @@ struct blkfront_info
/* Save uncomplete reqs and bios for migration. */
struct list_head requests;
struct bio_list bio_list;
+   /* For dynamic configuration. */
+   unsigned int reconfiguring:1;
+   int new_max_indirect_segments;
+   int max_ring_page_order;
+   int max_queues;
 };
 
 static unsigned int nr_minors;
@@ -1350,6 +1355,31 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
for (i = 0; i < info->nr_rings; i++)
blkif_free_ring(&info->rinfo[i]);
 
+   /* Remove old xenstore nodes. */
+   if (info->nr_ring_pages > 1)
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
+
+   if (info->nr_rings == 1) {
+   if (info->nr_ring_pages == 1) {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
+   } else {
+   for (i = 0; i < info->nr_ring_pages; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, 
"ring-ref%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
ring_ref_name);
+   }
+   }
+   } else {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
"multi-queue-num-queues");
+
+   for (i = 0; i < info->nr_rings; i++) {
+   char queuename[QUEUE_NAME_LEN];
+
+   snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
+   }
+   }
kfree(info->rinfo);
info->rinfo = NULL;
info->nr_rings = 0;
@@ -1763,15 +1793,21 @@ static int talk_to_blkback(struct xenbus_device *dev,
const char *message = NULL;
struct xenbus_transaction xbt;
int err;
-   unsigned int i, max_page_order = 0;
+   unsigned int i, backend_max_order = 0;
unsigned int ring_page_order = 0;
 
err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
-  "max-ring-page-order", "%u", &max_page_order);
+  "max-ring-page-order", "%u", &backend_max_order);
if (err != 1)
info->nr_ring_pages = 1;
else {
-   ring_page_order = min(xen_blkif_max_ring_order, max_page_order);
+   if (info->max_ring_page_order) {
+   /* Dynamic configured through /sys. */
+   BUG_ON(info->max_ring_page_order > backend_max_order);
+   ring_page_order = info->max_ring_page_order;
+   } else
+   /* Default. */
+   ring_page_order = min(xen_blkif_max_ring_order, 
backend_max_order);
info->nr_ring_pages = 1 << ring_page_order;
}
 
@@ -1894,7 +1930,14 @@ static int negotiate_mq(struct blkfront_info *info)
if (err < 0)
backend_max_queues = 1;
 
-   info->nr_rings = min(backend_max_queues, xen_blkif_max_queues);
+   if (info->max_queues) {
+   /* Dynamic configured through /sys */
+   BUG_ON(info->max_queues > backend_max_queues);
+   info->nr_rings = info->max_queues;
+   } else
+   /* Default. */
+   info->nr_rings = min(b

[Xen-devel] [PATCH v2 2/3] xen-blkfront: introduce blkif_set_queue_limits()

2016-07-25 Thread Bob Liu
blk_mq_update_nr_hw_queues() resets all queue limits to their defaults, which
is not what xen-blkfront expects. Introduce blkif_set_queue_limits() to
restore the limits to their correct initial values.

Signed-off-by: Bob Liu 
---
v2: Move blkif_set_queue_limits() after blkfront_gather_backend_features.
---
 drivers/block/xen-blkfront.c |   87 +++---
 1 file changed, 48 insertions(+), 39 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 032fc94..1b4c380 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -189,6 +189,8 @@ struct blkfront_info
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
+   u16 sector_size;
+   unsigned int physical_sector_size;
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
@@ -913,9 +915,45 @@ static struct blk_mq_ops blkfront_mq_ops = {
.map_queue = blk_mq_map_queue,
 };
 
+static void blkif_set_queue_limits(struct blkfront_info *info)
+{
+   struct request_queue *rq = info->rq;
+   struct gendisk *gd = info->gd;
+   unsigned int segments = info->max_indirect_segments ? :
+   BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
+
+   if (info->feature_discard) {
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
+   blk_queue_max_discard_sectors(rq, get_capacity(gd));
+   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_alignment = info->discard_alignment;
+   if (info->feature_secdiscard)
+   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
+   }
+
+   /* Hard sector size and max sectors impersonate the equiv. hardware. */
+   blk_queue_logical_block_size(rq, info->sector_size);
+   blk_queue_physical_block_size(rq, info->physical_sector_size);
+   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
+
+   /* Each segment in a request is up to an aligned page in size. */
+   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
+   blk_queue_max_segment_size(rq, PAGE_SIZE);
+
+   /* Ensure a merged request will fit in a single I/O ring slot. */
+   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
+
+   /* Make sure buffer addresses are sector-aligned. */
+   blk_queue_dma_alignment(rq, 511);
+
+   /* Make sure we don't use bounce buffers. */
+   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
+}
+
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
-   unsigned int physical_sector_size,
-   unsigned int segments)
+   unsigned int physical_sector_size)
 {
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
@@ -947,37 +985,11 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
}
 
rq->queuedata = info;
-   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
-
-   if (info->feature_discard) {
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
-   blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
-   rq->limits.discard_alignment = info->discard_alignment;
-   if (info->feature_secdiscard)
-   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
-   }
-
-   /* Hard sector size and max sectors impersonate the equiv. hardware. */
-   blk_queue_logical_block_size(rq, sector_size);
-   blk_queue_physical_block_size(rq, physical_sector_size);
-   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
-
-   /* Each segment in a request is up to an aligned page in size. */
-   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
-   blk_queue_max_segment_size(rq, PAGE_SIZE);
-
-   /* Ensure a merged request will fit in a single I/O ring slot. */
-   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
-
-   /* Make sure buffer addresses are sector-aligned. */
-   blk_queue_dma_alignment(rq, 511);
-
-   /* Make sure we don't use bounce buffers. */
-   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
-
-   gd->queue = rq;
-
+   info->rq = gd->queue = rq;
+   info->gd = gd;
+   info->sector_size = sector_size;
+   info->physical_sector_size = physical_sector_size;
+   blkif_set_queue_limits(info);
return 0;
 }
 
@@ -1142,16 +1154,11 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
gd->driverfs_dev = &(info->xbdev->dev);
set_capacity(gd, capacity);
 
-   if (xlvbd_init_blk_qu

[Xen-devel] [PATCH 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-25 Thread Bob Liu
Two places didn't get updated when 64KB page granularity was introduced;
this patch fixes them.

Signed-off-by: Bob Liu 
Acked-by: Roger Pau Monné 
---
 drivers/block/xen-blkfront.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index fcc5b4e..032fc94 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1321,7 +1321,7 @@ free_shadow:
rinfo->ring_ref[i] = GRANT_INVALID_REF;
}
}
-   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * PAGE_SIZE));
+   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
rinfo->ring.sring = NULL;
 
if (rinfo->irq)
@@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
 
blkfront_gather_backend_features(info);
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
-   blk_queue_max_segments(info->rq, segs);
+   blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
 
for (r_index = 0; r_index < info->nr_rings; r_index++) {
struct blkfront_ring_info *rinfo = &info->rinfo[r_index];
-- 
1.7.10.4




Re: [Xen-devel] [PATCH v2 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-26 Thread Bob Liu

On 07/26/2016 04:44 PM, Roger Pau Monné wrote:
> On Tue, Jul 26, 2016 at 01:19:37PM +0800, Bob Liu wrote:
>> The current VBD layer reserves buffer space for each attached device based on
>> three statically configured settings which are read at boot time.
>>  * max_indirect_segs: Maximum amount of segments.
>>  * max_ring_page_order: Maximum order of pages to be used for the shared 
>> ring.
>>  * max_queues: Maximum of queues(rings) to be used.
>>
>> But the storage backend, workload, and guest memory produce very different
>> tuning requirements. It's impossible to predict application characteristics
>> centrally, so it's best to allow these settings to be adjusted dynamically
>> based on the workload inside the guest.
>>
>> Usage:
>> Show current values:
>> cat /sys/devices/vbd-xxx/max_indirect_segs
>> cat /sys/devices/vbd-xxx/max_ring_page_order
>> cat /sys/devices/vbd-xxx/max_queues
>>
>> Write new values:
>> echo  > /sys/devices/vbd-xxx/max_indirect_segs
>> echo  > /sys/devices/vbd-xxx/max_ring_page_order
>> echo  > /sys/devices/vbd-xxx/max_queues
>>
>> Signed-off-by: Bob Liu 
>> --
>> v2: Rename to max_ring_page_order and rm the waiting code suggested by Roger.
>> ---
>>  drivers/block/xen-blkfront.c |  275 
>> +-
>>  1 file changed, 269 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 1b4c380..ff5ebe5 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -212,6 +212,11 @@ struct blkfront_info
>>  /* Save uncomplete reqs and bios for migration. */
>>  struct list_head requests;
>>  struct bio_list bio_list;
>> +/* For dynamic configuration. */
>> +unsigned int reconfiguring:1;
>> +int new_max_indirect_segments;
> 
> Can't you just use max_indirect_segments? Is it really needed to introduce a 
> new struct member?
> 
>> +int max_ring_page_order;
>> +int max_queues;

Do you mean to also get rid of these two new struct members?
I'll think about that.

>>  };
>>  
>>  static unsigned int nr_minors;
>> @@ -1350,6 +1355,31 @@ static void blkif_free(struct blkfront_info *info, 
>> int suspend)
>>  for (i = 0; i < info->nr_rings; i++)
>>  blkif_free_ring(&info->rinfo[i]);
>>  
>> +/* Remove old xenstore nodes. */
>> +if (info->nr_ring_pages > 1)
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
>> +
>> +if (info->nr_rings == 1) {
>> +if (info->nr_ring_pages == 1) {
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
>> +} else {
>> +for (i = 0; i < info->nr_ring_pages; i++) {
>> +char ring_ref_name[RINGREF_NAME_LEN];
>> +
>> +snprintf(ring_ref_name, RINGREF_NAME_LEN, 
>> "ring-ref%u", i);
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, 
>> ring_ref_name);
>> +}
>> +}
>> +} else {
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, 
>> "multi-queue-num-queues");
>> +
>> +for (i = 0; i < info->nr_rings; i++) {
>> +char queuename[QUEUE_NAME_LEN];
>> +
>> +snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
>> +}
>> +}
>>  kfree(info->rinfo);
>>  info->rinfo = NULL;
>>  info->nr_rings = 0;
>> @@ -1763,15 +1793,21 @@ static int talk_to_blkback(struct xenbus_device *dev,
>>  const char *message = NULL;
>>  struct xenbus_transaction xbt;
>>  int err;
>> -unsigned int i, max_page_order = 0;
>> +unsigned int i, backend_max_order = 0;
>>  unsigned int ring_page_order = 0;
>>  
>>  err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
>> -   "max-ring-page-order", "%u", &max_page_order);
>> +   "max-ring-page-order", "%u", &backend_max_order);
>>  if (err != 1)
>>  info->nr_ring_pages = 1;
>>  else {
>> -ring_page_order = min(xen_blkif_ma

[Xen-devel] [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-26 Thread Bob Liu
The current VBD layer reserves buffer space for each attached device based on
three statically configured settings which are read at boot time.
 * max_indirect_segs: Maximum number of indirect segments.
 * max_ring_page_order: Maximum order of pages to be used for the shared ring.
 * max_queues: Maximum number of queues (rings) to be used.

But the storage backend, workload, and guest memory produce very different
tuning requirements. It's impossible to predict application characteristics
centrally, so it's best to allow these settings to be adjusted dynamically
based on the workload inside the guest.

Usage:
Show current values:
cat /sys/devices/vbd-xxx/max_indirect_segs
cat /sys/devices/vbd-xxx/max_ring_page_order
cat /sys/devices/vbd-xxx/max_queues

Write new values:
echo <value> > /sys/devices/vbd-xxx/max_indirect_segs
echo <value> > /sys/devices/vbd-xxx/max_ring_page_order
echo <value> > /sys/devices/vbd-xxx/max_queues

Signed-off-by: Bob Liu 
--
v3:
 * Remove new_max_indirect_segments.
 * Fix BUG_ON().
v2:
 * Rename to max_ring_page_order.
 * Remove the waiting code suggested by Roger.
---
 drivers/block/xen-blkfront.c |  277 --
 1 file changed, 269 insertions(+), 8 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 1b4c380..57baa54 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -212,6 +212,10 @@ struct blkfront_info
/* Save uncomplete reqs and bios for migration. */
struct list_head requests;
struct bio_list bio_list;
+   /* For dynamic configuration. */
+   unsigned int reconfiguring:1;
+   unsigned int max_ring_page_order;
+   unsigned int max_queues;
 };
 
 static unsigned int nr_minors;
@@ -1350,6 +1354,31 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
for (i = 0; i < info->nr_rings; i++)
blkif_free_ring(&info->rinfo[i]);
 
+   /* Remove old xenstore nodes. */
+   if (info->nr_ring_pages > 1)
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
+
+   if (info->nr_rings == 1) {
+   if (info->nr_ring_pages == 1) {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
+   } else {
+   for (i = 0; i < info->nr_ring_pages; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, 
"ring-ref%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
ring_ref_name);
+   }
+   }
+   } else {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
"multi-queue-num-queues");
+
+   for (i = 0; i < info->nr_rings; i++) {
+   char queuename[QUEUE_NAME_LEN];
+
+   snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
+   }
+   }
kfree(info->rinfo);
info->rinfo = NULL;
info->nr_rings = 0;
@@ -1763,15 +1792,20 @@ static int talk_to_blkback(struct xenbus_device *dev,
const char *message = NULL;
struct xenbus_transaction xbt;
int err;
-   unsigned int i, max_page_order = 0;
+   unsigned int i, backend_max_order = 0;
unsigned int ring_page_order = 0;
 
err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
-  "max-ring-page-order", "%u", &max_page_order);
+  "max-ring-page-order", "%u", &backend_max_order);
if (err != 1)
info->nr_ring_pages = 1;
else {
-   ring_page_order = min(xen_blkif_max_ring_order, max_page_order);
+   if (info->max_ring_page_order)
+   /* Dynamic configured through /sys. */
+   ring_page_order = min(info->max_ring_page_order, 
backend_max_order);
+   else
+   /* Default. */
+   ring_page_order = min(xen_blkif_max_ring_order, 
backend_max_order);
info->nr_ring_pages = 1 << ring_page_order;
}
 
@@ -1894,7 +1928,13 @@ static int negotiate_mq(struct blkfront_info *info)
if (err < 0)
backend_max_queues = 1;
 
-   info->nr_rings = min(backend_max_queues, xen_blkif_max_queues);
+   if (info->max_queues)
+   /* Dynamic configured through /sys */
+   info->nr_rings = min(backend_max_queues, info->max_queues);
+   else
+   /* Default. */
+   info->nr_rings = min(backend_max_queues, xen_blkif_max_queues);
+
/* We need a

Re: [Xen-devel] [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-27 Thread Bob Liu

On 07/27/2016 04:07 PM, Roger Pau Monné wrote:
..[snip]..
>> @@ -2443,6 +2674,22 @@ static void blkfront_connect(struct blkfront_info 
>> *info)
>>  return;
>>  }
>>  
>> +err = device_create_file(&info->xbdev->dev, 
>> &dev_attr_max_ring_page_order);
>> +if (err)
>> +goto fail;
>> +
>> +err = device_create_file(&info->xbdev->dev, 
>> &dev_attr_max_indirect_segs);
>> +if (err) {
>> +device_remove_file(&info->xbdev->dev, 
>> &dev_attr_max_ring_page_order);
>> +goto fail;
>> +}
>> +
>> +err = device_create_file(&info->xbdev->dev, &dev_attr_max_queues);
>> +if (err) {
>> +device_remove_file(&info->xbdev->dev, 
>> &dev_attr_max_ring_page_order);
>> +device_remove_file(&info->xbdev->dev, 
>> &dev_attr_max_indirect_segs);
>> +goto fail;
>> +}
>>  xenbus_switch_state(info->xbdev, XenbusStateConnected);
>>  
>>  /* Kick pending requests. */
>> @@ -2453,6 +2700,12 @@ static void blkfront_connect(struct blkfront_info 
>> *info)
>>  add_disk(info->gd);
>>  
>>  info->is_ready = 1;
>> +return;
>> +
>> +fail:
>> +blkif_free(info, 0);
>> +xlvbd_release_gendisk(info);
>> +return;
> 
> Hm, I'm not sure whether this chunk should be in a separate patch, it seems 
> like blkfront_connect doesn't properly cleanup on error (if 
> xlvbd_alloc_gendisk fails blkif_free will not be called). Do you think you 
> could send the addition of the 'fail' label as a separate patch and fix the 
> error path of xlvbd_alloc_gendisk?
> 

Sure, will fix all of your comments above.

>>  }
>>  
>>  /**
>> @@ -2500,8 +2753,16 @@ static void blkback_changed(struct xenbus_device *dev,
>>  break;
>>  
>>  case XenbusStateClosed:
>> -if (dev->state == XenbusStateClosed)
>> +if (dev->state == XenbusStateClosed) {
>> +if (info->reconfiguring)
>> +if (blkfront_resume(info->xbdev)) {
> 
> Could you please join those two conditions:
> 
> if (info->reconfiguring && blkfront_resume(info->xbdev)) { ...
> 
> Also, I'm not sure this is correct, if blkfront sees the "Closing" state on 
> blkback it will try to close the frontend and destroy the block device (see 
> blkfront_closing), and this should be avoided. You should call 
> blkfront_resume as soon as you see the backend move to the Closed or Closing 
> states, without calling blkfront_closing.
> 

I don't see how this can happen: the backend state won't be changed to
'Closing' before blkfront_closing() is called.
So I think the current logic is fine.

Btw: could you please ack [PATCH v2 2/3] xen-blkfront: introduce 
blkif_set_queue_limits()?

Thank you!
Bob



Re: [Xen-devel] [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-27 Thread Bob Liu

On 07/27/2016 06:59 PM, Roger Pau Monné wrote:
> On Wed, Jul 27, 2016 at 11:21:25AM +0800, Bob Liu wrote:
> [...]
>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, ssize_t 
>> count)
>> +{
>> +/*
>> + * Prevent new requests even to software request queue.
>> + */
>> +blk_mq_freeze_queue(info->rq);
>> +
>> +/*
>> + * Guarantee no uncompleted reqs.
>> + */
> 
> I'm also wondering, why do you need to guarantee that there are no 
> uncompleted requests? The resume procedure is going to call blkif_recover 
> that will take care of requeuing any unfinished requests that are on the 
> ring.
> 

Because there may be requests in the software request queue with more
segments than we can handle (if info->max_indirect_segments is reduced).

blkif_recover() can't handle this since blk-mq was introduced, because
there is no way to iterate the software request queues (blk_fetch_request()
can't be used with blk-mq).

So there is a bug in blkif_recover(); I was thinking of implementing the
suspend hook of blkfront_driver like this:

+static int blkfront_suspend(struct xenbus_device *dev)
+{
+   blk_mq_freeze_queue(info->rq);
+   ..
+}
 static struct xenbus_driver blkfront_driver = {
.ids  = blkfront_ids,
.probe = blkfront_probe,
.remove = blkfront_remove,
+   .suspend = blkfront_suspend,
.resume = blkfront_resume,



Re: [Xen-devel] [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-27 Thread Bob Liu

On 07/27/2016 10:24 PM, Roger Pau Monné wrote:
> On Wed, Jul 27, 2016 at 07:21:05PM +0800, Bob Liu wrote:
>>
>> On 07/27/2016 06:59 PM, Roger Pau Monné wrote:
>>> On Wed, Jul 27, 2016 at 11:21:25AM +0800, Bob Liu wrote:
>>> [...]
>>>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, 
>>>> ssize_t count)
>>>> +{
>>>> +  /*
>>>> +   * Prevent new requests even to software request queue.
>>>> +   */
>>>> +  blk_mq_freeze_queue(info->rq);
>>>> +
>>>> +  /*
>>>> +   * Guarantee no uncompleted reqs.
>>>> +   */
>>>
>>> I'm also wondering, why do you need to guarantee that there are no 
>>> uncompleted requests? The resume procedure is going to call blkif_recover 
>>> that will take care of requeuing any unfinished requests that are on the 
>>> ring.
>>>
>>
>> Because there may be requests in the software request queue with more
>> segments than we can handle (if info->max_indirect_segments is reduced).
>>
>> blkif_recover() can't handle this since blk-mq was introduced, because
>> there is no way to iterate the software request queues (blk_fetch_request()
>> can't be used with blk-mq).
>>
>> So there is a bug in blkif_recover(); I was thinking of implementing the
>> suspend hook of blkfront_driver like this:
> 
> Hm, this is a regression and should be fixed ASAP. I'm still not sure I 
> follow, don't blk_queue_max_segments change the number of segments the 
> requests on the queue are going to have? So that you will only have to 
> re-queue the requests already on the ring?
> 

That's not enough: request queues have been split into software queues and
hardware queues since blk-mq was introduced.
We need to consider two more things:
 * Stop new requests from being added to the software queues before
   blk_queue_max_segments() is called (they would still be built with the
   old 'max-indirect-segments').
   I didn't see any other way except calling blk_mq_freeze_queue().

 * Requests already in the software queues but built with the old
   'max-indirect-segments' also have to be re-queued based on the new
   'max-indirect-segments'.
   No blk-mq API can do this either.

> Waiting for the whole queue to be flushed before suspending is IMHO not 
> acceptable, it introduces an unbounded delay during migration if the backend 
> is slow for some reason.
> 

Right, I also hope there is a better solution.

-- 
Regards,
-Bob



[Xen-devel] [PATCH v2 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-28 Thread Bob Liu
Two places didn't get updated when 64KB page granularity was introduced;
this patch fixes them.

Signed-off-by: Bob Liu 
Acked-by: Roger Pau Monné 
---
 drivers/block/xen-blkfront.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ca0536e..36d9a0d 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1318,7 +1318,7 @@ free_shadow:
rinfo->ring_ref[i] = GRANT_INVALID_REF;
}
}
-   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * PAGE_SIZE));
+   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
rinfo->ring.sring = NULL;
 
if (rinfo->irq)
@@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
 
blkfront_gather_backend_features(info);
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
-   blk_queue_max_segments(info->rq, segs);
+   blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
bio_list_init(&bio_list);
INIT_LIST_HEAD(&requests);
 
-- 
1.7.10.4




[Xen-devel] [PATCH v2 2/3] xen-blkfront: introduce blkif_set_queue_limits()

2016-07-28 Thread Bob Liu
blk_mq_update_nr_hw_queues() resets all queue limits to their defaults, which
is not what xen-blkfront expects. Introduce blkif_set_queue_limits() to restore
the limits to their correct initial values.

Signed-off-by: Bob Liu 
Acked-by: Roger Pau Monné 
---
 drivers/block/xen-blkfront.c |   87 +++---
 1 file changed, 48 insertions(+), 39 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 36d9a0d..d5ed60b 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -189,6 +189,8 @@ struct blkfront_info
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
+   u16 sector_size;
+   unsigned int physical_sector_size;
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
@@ -910,9 +912,45 @@ static struct blk_mq_ops blkfront_mq_ops = {
.map_queue = blk_mq_map_queue,
 };
 
+static void blkif_set_queue_limits(struct blkfront_info *info)
+{
+   struct request_queue *rq = info->rq;
+   struct gendisk *gd = info->gd;
+   unsigned int segments = info->max_indirect_segments ? :
+   BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
+
+   if (info->feature_discard) {
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
+   blk_queue_max_discard_sectors(rq, get_capacity(gd));
+   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_alignment = info->discard_alignment;
+   if (info->feature_secdiscard)
+   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
+   }
+
+   /* Hard sector size and max sectors impersonate the equiv. hardware. */
+   blk_queue_logical_block_size(rq, info->sector_size);
+   blk_queue_physical_block_size(rq, info->physical_sector_size);
+   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
+
+   /* Each segment in a request is up to an aligned page in size. */
+   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
+   blk_queue_max_segment_size(rq, PAGE_SIZE);
+
+   /* Ensure a merged request will fit in a single I/O ring slot. */
+   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
+
+   /* Make sure buffer addresses are sector-aligned. */
+   blk_queue_dma_alignment(rq, 511);
+
+   /* Make sure we don't use bounce buffers. */
+   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
+}
+
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
-   unsigned int physical_sector_size,
-   unsigned int segments)
+   unsigned int physical_sector_size)
 {
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
@@ -944,37 +982,11 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
}
 
rq->queuedata = info;
-   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
-
-   if (info->feature_discard) {
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
-   blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
-   rq->limits.discard_alignment = info->discard_alignment;
-   if (info->feature_secdiscard)
-   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
-   }
-
-   /* Hard sector size and max sectors impersonate the equiv. hardware. */
-   blk_queue_logical_block_size(rq, sector_size);
-   blk_queue_physical_block_size(rq, physical_sector_size);
-   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
-
-   /* Each segment in a request is up to an aligned page in size. */
-   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
-   blk_queue_max_segment_size(rq, PAGE_SIZE);
-
-   /* Ensure a merged request will fit in a single I/O ring slot. */
-   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
-
-   /* Make sure buffer addresses are sector-aligned. */
-   blk_queue_dma_alignment(rq, 511);
-
-   /* Make sure we don't use bounce buffers. */
-   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
-
-   gd->queue = rq;
-
+   info->rq = gd->queue = rq;
+   info->gd = gd;
+   info->sector_size = sector_size;
+   info->physical_sector_size = physical_sector_size;
+   blkif_set_queue_limits(info);
return 0;
 }
 
@@ -1139,16 +1151,11 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
gd->driverfs_dev = &(info->xbdev->dev);
set_capacity(gd, capacity);
 
-   if (xlvbd_init_blk_queue(gd, sector_size, physical_sector_size,
-info->

[Xen-devel] [PATCH 3/3] xen-blkfront: free resources if xlvbd_alloc_gendisk fails

2016-07-28 Thread Bob Liu
The current code forgets to free resources in the failure path of
xlvbd_alloc_gendisk(); this patch fixes it.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d5ed60b..d8429d4 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2446,7 +2446,7 @@ static void blkfront_connect(struct blkfront_info *info)
if (err) {
xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
 info->xbdev->otherend);
-   return;
+   goto fail;
}
 
xenbus_switch_state(info->xbdev, XenbusStateConnected);
@@ -2459,6 +2459,11 @@ static void blkfront_connect(struct blkfront_info *info)
add_disk(info->gd);
 
info->is_ready = 1;
+   return;
+
+fail:
+   blkif_free(info, 0);
+   return;
 }
 
 /**
-- 
1.7.10.4




Re: [Xen-devel] [PATCH 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-28 Thread Bob Liu

On 07/28/2016 09:19 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jul 26, 2016 at 01:19:35PM +0800, Bob Liu wrote:
>> Two places didn't get updated when 64KB page granularity was introduced, this
>> patch fix them.
>>
>> Signed-off-by: Bob Liu 
>> Acked-by: Roger Pau Monné 
> 
> Could you rebase this on xen-tip/for-linus-4.8 pls?

Done, sent the v2 for you to pick up.

> 
>> ---
>>  drivers/block/xen-blkfront.c |4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index fcc5b4e..032fc94 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -1321,7 +1321,7 @@ free_shadow:
>>  rinfo->ring_ref[i] = GRANT_INVALID_REF;
>>  }
>>  }
>> -free_pages((unsigned long)rinfo->ring.sring, 
>> get_order(info->nr_ring_pages * PAGE_SIZE));
>> +free_pages((unsigned long)rinfo->ring.sring, 
>> get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
>>  rinfo->ring.sring = NULL;
>>  
>>  if (rinfo->irq)
>> @@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
>>  
>>  blkfront_gather_backend_features(info);
>>  segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> -blk_queue_max_segments(info->rq, segs);
>> +blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
>>  
>>  for (r_index = 0; r_index < info->nr_rings; r_index++) {
>>  struct blkfront_ring_info *rinfo = &info->rinfo[r_index];
>> -- 
>> 1.7.10.4
>>



[Xen-devel] [PATCH] tools:libxl: return tty path for all serials

2016-08-02 Thread Bob Liu
When specifying a serial list in domain config, users of
libxl_console_get_tty cannot get the tty path of a second specified pty serial,
since right now it always returns the tty path of serial 0.

Signed-off-by: Bob Liu 
---
 tools/libxl/libxl.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 2cf7451..00af286 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -1795,7 +1795,7 @@ int libxl_console_get_tty(libxl_ctx *ctx, uint32_t domid, 
int cons_num,
 
 switch (type) {
 case LIBXL_CONSOLE_TYPE_SERIAL:
-tty_path = GCSPRINTF("%s/serial/0/tty", dom_path);
+tty_path = GCSPRINTF("%s/serial/%d/tty", dom_path, cons_num);
 break;
 case LIBXL_CONSOLE_TYPE_PV:
 if (cons_num == 0)
-- 
1.7.10.4




[Xen-devel] [PATCH v2] libxl: return any serial tty path in libxl_console_get_tty

2016-08-03 Thread Bob Liu
When specifying a serial list in domain config, users of
libxl_console_get_tty cannot get the tty path of a second specified pty serial,
since right now it always returns the tty path of serial 0.

Signed-off-by: Bob Liu 
---
v2: Rename the patch title.
---
 tools/libxl/libxl.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 2cf7451..00af286 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -1795,7 +1795,7 @@ int libxl_console_get_tty(libxl_ctx *ctx, uint32_t domid, 
int cons_num,
 
 switch (type) {
 case LIBXL_CONSOLE_TYPE_SERIAL:
-tty_path = GCSPRINTF("%s/serial/0/tty", dom_path);
+tty_path = GCSPRINTF("%s/serial/%d/tty", dom_path, cons_num);
 break;
 case LIBXL_CONSOLE_TYPE_PV:
 if (cons_num == 0)
-- 
1.7.10.4
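The one-line change above simply parameterizes the console number in the
xenstore path. As an illustration (a Python sketch mirroring the GCSPRINTF
format string; the domain path is hypothetical):

```python
def serial_tty_path(dom_path: str, cons_num: int) -> str:
    # Mirrors GCSPRINTF("%s/serial/%d/tty", dom_path, cons_num);
    # previously cons_num was hard-coded to 0.
    return f"{dom_path}/serial/{cons_num}/tty"

# With two serials configured, console 1 now resolves correctly:
print(serial_tty_path("/local/domain/7", 1))  # /local/domain/7/serial/1/tty
```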




Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-08-10 Thread Bob Liu

On 08/10/2016 08:33 PM, Evgenii Shatokhin wrote:
> On 14.07.2016 15:04, Bob Liu wrote:
>>
>> On 07/14/2016 07:49 PM, Evgenii Shatokhin wrote:
>>> On 11.07.2016 15:04, Bob Liu wrote:
>>>>
>>>>
>>>> On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
>>>>> On 06.06.2016 11:42, Dario Faggioli wrote:
>>>>>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>>>>>
>>>>>
>>>>> Ping.
>>>>>
>>>>> Any suggestions how to debug this or what might cause the problem?
>>>>>
>>>>> Obviously, we cannot control Xen on the Amazon's servers. But perhaps 
>>>>> there is something we can do at the kernel's side, is it?
>>>>>
>>>>>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>>>>>> (Resending this bug report because the message I sent last week did
>>>>>>> not
>>>>>>> make it to the mailing list somehow.)
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> One of our users gets kernel panics from time to time when he tries
>>>>>>> to
>>>>>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>>>>>> happens within minutes from the moment the instance starts. The
>>>>>>> problem
>>>>>>> does not show up every time, however.
>>>>>>>
>>>>>>> The user first observed the problem with a custom kernel, but it was
>>>>>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>>>>>> CentOS7 was affected as well.
>>>>
>>>> Please try this patch:
>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc
>>>>
>>>> Regards,
>>>> Bob
>>>>
>>>
>>> Unfortunately, it did not help. The same BUG_ON() in 
>>> blkfront_setup_indirect() still triggers in our kernel based on RHEL's 
>>> 3.10.0-327.18.2, where I added the patch.
>>>
>>> As far as I can see, the patch makes sure the indirect pages are added to 
>>> the list only if (!info->feature_persistent) holds. I suppose it holds in 
>>> our case and the pages are added to the list because the triggered BUG_ON() 
>>> is here:
>>>
>>>  if (!info->feature_persistent && info->max_indirect_segments) {
>>>  <...>
>>>  BUG_ON(!list_empty(&info->indirect_pages));
>>>  <...>
>>>  }
>>>
>>
>> That's odd.
>> Could you please try to reproduce this issue with a recent upstream kernel?
>>
>> Thanks,
>> Bob
> 
> No luck with the upstream kernel 4.7.0 so far due to unrelated issues (bad 
> initrd, I suppose, so the system does not even boot).
> 
> However, the problem reproduced with the stable upstream kernel 3.14.74. 
> After the system booted the second time with this kernel, that BUG_ON 
> triggered:
>  kernel BUG at drivers/block/xen-blkfront.c:1701
> 

Could you please provide more detail on how to reproduce this bug? I'd like to 
have a test.

Thanks!
Bob

>>
>>> So the problem is still out there somewhere, it seems.
>>>
>>> Regards,
>>> Evgenii
>>>
>>>>>>>
>>>>>>> The part of the system log he was able to retrieve is attached. Here
>>>>>>> is
>>>>>>> the bug info, for convenience:
>>>>>>>
>>>>>>> 
>>>>>>> [2.246912] kernel BUG at drivers/block/xen-blkfront.c:1711!
>>>>>>> [2.246912] invalid opcode:  [#1] SMP
>>>>>>> [2.246912] Modules linked in: ata_generic pata_acpi
>>>>>>> crct10dif_pclmul
>>>>>>> crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel
>>>>>>> xen_netfront xen_blkfront(+) aesni_intel lrw ata_piix gf128mul
>>>>>>> glue_helper ablk_helper cryptd libata serio_raw floppy sunrpc
>>>>>>> dm_mirror
>>>>>>> dm_region_hash dm_log dm_mod scsi_transport_iscsi
>>>>>>> [2.246912] CPU: 1 PID: 50 Comm: xenwatch Not tainted
>>>>>>> 3.10.0-327.18.2.el7.x86_64 #1

Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-08-10 Thread Bob Liu

On 08/10/2016 10:54 PM, Evgenii Shatokhin wrote:
> On 10.08.2016 15:49, Bob Liu wrote:
>>
>> On 08/10/2016 08:33 PM, Evgenii Shatokhin wrote:
>>> On 14.07.2016 15:04, Bob Liu wrote:
>>>>
>>>> On 07/14/2016 07:49 PM, Evgenii Shatokhin wrote:
>>>>> On 11.07.2016 15:04, Bob Liu wrote:
>>>>>>
>>>>>>
>>>>>> On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
>>>>>>> On 06.06.2016 11:42, Dario Faggioli wrote:
>>>>>>>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>>>>>>>
>>>>>>>
>>>>>>> Ping.
>>>>>>>
>>>>>>> Any suggestions how to debug this or what might cause the problem?
>>>>>>>
>>>>>>> Obviously, we cannot control Xen on the Amazon's servers. But perhaps 
>>>>>>> there is something we can do at the kernel's side, is it?
>>>>>>>
>>>>>>>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>>>>>>>> (Resending this bug report because the message I sent last week did
>>>>>>>>> not
>>>>>>>>> make it to the mailing list somehow.)
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> One of our users gets kernel panics from time to time when he tries
>>>>>>>>> to
>>>>>>>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>>>>>>>> happens within minutes from the moment the instance starts. The
>>>>>>>>> problem
>>>>>>>>> does not show up every time, however.
>>>>>>>>>
>>>>>>>>> The user first observed the problem with a custom kernel, but it was
>>>>>>>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>>>>>>>> CentOS7 was affected as well.
>>>>>>
>>>>>> Please try this patch:
>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc
>>>>>>
>>>>>> Regards,
>>>>>> Bob
>>>>>>
>>>>>
>>>>> Unfortunately, it did not help. The same BUG_ON() in 
>>>>> blkfront_setup_indirect() still triggers in our kernel based on RHEL's 
>>>>> 3.10.0-327.18.2, where I added the patch.
>>>>>
>>>>> As far as I can see, the patch makes sure the indirect pages are added to 
>>>>> the list only if (!info->feature_persistent) holds. I suppose it holds in 
>>>>> our case and the pages are added to the list because the triggered 
>>>>> BUG_ON() is here:
>>>>>
>>>>>   if (!info->feature_persistent && info->max_indirect_segments) {
>>>>>   <...>
>>>>>   BUG_ON(!list_empty(&info->indirect_pages));
>>>>>   <...>
>>>>>   }
>>>>>
>>>>
>>>> That's odd.
>>>> Could you please try to reproduce this issue with a recent upstream kernel?
>>>>
>>>> Thanks,
>>>> Bob
>>>
>>> No luck with the upstream kernel 4.7.0 so far due to unrelated issues (bad 
>>> initrd, I suppose, so the system does not even boot).
>>>
>>> However, the problem reproduced with the stable upstream kernel 3.14.74. 
>>> After the system booted the second time with this kernel, that BUG_ON 
>>> triggered:
>>>   kernel BUG at drivers/block/xen-blkfront.c:1701
>>>
>>
>> Could you please provide more detail on how to reproduce this bug? I'd like 
>> to have a test.
>>
>> Thanks!
>> Bob
> 
> As the user says, he uses an Amazon EC2 instance. Namely: HVM CentOS7 AMI on 
> a c3.large instance with EBS magnetic storage.
> 

Oh, then it would be difficult to debug this issue.
The xen-blkfront communicates with xen-blkback (in dom0 or a driver domain), but 
that part is a black box when running on Amazon EC2.
We can't see the source code of the backend side!

Can this bug be reproduced in your own environment (Xen + dom0)?

> At least 2 LVM partitions are needed:
> * /, 20-30 Gb should be enough, ext4
> * /vz, 5-10 Gb should be

Re: [Xen-devel] [PATCH v4 10/10] xen/blkback: make pool of persistent grants and free pages per-queue

2015-11-04 Thread Bob Liu

On 11/05/2015 10:43 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 02, 2015 at 12:21:46PM +0800, Bob Liu wrote:
>> Make pool of persistent grants and free pages per-queue/ring instead of
>> per-device to get better scalability.
> 
> How much better scalability do we get?
> 

They were already shown in [00/10]; I'll paste them here:

            domU(orig)   4 queues      8 queues   16 queues
iops:       690k         1024k(+30%)   800k       750k

After patch 9 and 10:
            domU(orig)   4 queues      8 queues   16 queues
iops:       690k         1600k(+100%)  1450k      1320k

Chart: https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0



Re: [Xen-devel] [PATCH v4 07/10] xen/blkback: pseudo support for multi hardware queues/rings

2015-11-04 Thread Bob Liu

On 11/05/2015 10:30 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 02, 2015 at 12:21:43PM +0800, Bob Liu wrote:
>> Preparatory patch for multiple hardware queues (rings). The number of
>> rings is unconditionally set to 1, larger number will be enabled in next
>> patch so as to make every single patch small and readable.
> 
> Instead of saying 'next patch' - spell out the title of the patch.
> 
> 
>>
>> Signed-off-by: Arianna Avanzini 
>> Signed-off-by: Bob Liu 
>> ---
>>  drivers/block/xen-blkback/common.h |   3 +-
>>  drivers/block/xen-blkback/xenbus.c | 292 
>> +++--
>>  2 files changed, 185 insertions(+), 110 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/common.h 
>> b/drivers/block/xen-blkback/common.h
>> index f0dd69a..4de1326 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -341,7 +341,8 @@ struct xen_blkif {
>>  struct work_struct  free_work;
>>  unsigned int nr_ring_pages;
>>  /* All rings for this device */
>> -struct xen_blkif_ring ring;
>> +struct xen_blkif_ring *rings;
>> +unsigned int nr_rings;
>>  };
>>  
>>  struct seg_buf {
>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>> b/drivers/block/xen-blkback/xenbus.c
>> index 7bdd5fd..ac4b458 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -84,11 +84,12 @@ static int blkback_name(struct xen_blkif *blkif, char 
>> *buf)
>>  
>>  static void xen_update_blkif_status(struct xen_blkif *blkif)
>>  {
>> -int err;
>> +int err, i;
> 
> unsigned int.
>>  char name[BLKBACK_NAME_LEN];
>> +struct xen_blkif_ring *ring;
>>  
>>  /* Not ready to connect? */
>> -if (!blkif->ring.irq || !blkif->vbd.bdev)
>> +if (!blkif->rings || !blkif->rings[0].irq || !blkif->vbd.bdev)
>>  return;
>>  
>>  /* Already connected? */
>> @@ -113,19 +114,57 @@ static void xen_update_blkif_status(struct xen_blkif 
>> *blkif)
>>  }
>>  invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);
>>  
>> -blkif->ring.xenblkd = kthread_run(xen_blkif_schedule, &blkif->ring, 
>> "%s", name);
>> -if (IS_ERR(blkif->ring.xenblkd)) {
>> -err = PTR_ERR(blkif->ring.xenblkd);
>> -blkif->ring.xenblkd = NULL;
>> -xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
>> -return;
>> +if (blkif->nr_rings == 1) {
>> +blkif->rings[0].xenblkd = kthread_run(xen_blkif_schedule, 
>> &blkif->rings[0], "%s", name);
>> +if (IS_ERR(blkif->rings[0].xenblkd)) {
>> +err = PTR_ERR(blkif->rings[0].xenblkd);
>> +blkif->rings[0].xenblkd = NULL;
>> +xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
>> +return;
>> +}
>> +} else {
>> +for (i = 0; i < blkif->nr_rings; i++) {
>> +ring = &blkif->rings[i];
>> +ring->xenblkd = kthread_run(xen_blkif_schedule, ring, 
>> "%s-%d", name, i);
>> +if (IS_ERR(ring->xenblkd)) {
>> +err = PTR_ERR(ring->xenblkd);
>> +ring->xenblkd = NULL;
>> +xenbus_dev_error(blkif->be->dev, err,
>> +"start %s-%d xenblkd", name, i);
>> +return;
>> +}
>> +}
> 
> Please collapse this together and just have one loop.
> 
> And just use the same name throughout ('%s-%d')
> 
> Also this does not take care of actually freeing the allocated
> ring->xenblkd if one of them fails to start.
> 
> Hmm, looking at this function .. we seem to forget to change the
> state if something fails.
> 
> As in, connect switches the state to Connected.. And if anything blows
> up after the connect() we don't switch the state. We do report an error
> in the XenBus, but the state is the same.
> 
> We should be using xenbus_dev_fatal I believe - so at least the state
> changes to Closing (and then the frontend can switch itself to
> Closing->Closed) - and we will call 'xen_blkif_disconnect'

Re: [Xen-devel] [PATCH 06/32] xen blkback: prepare for bi_rw split

2015-11-08 Thread Bob Liu

On 11/07/2015 06:17 PM, Christoph Hellwig wrote:
> A little offtopic for this patch, but can some explain this whole
> mess about bios in Xen blkfront?  We can happily do partial completions
> at the request later.
> 
> Also since the blk-mq conversion the call to blk_end_request_all is

This will be fixed by my next blk-mq patch series, which also modifies the
recovery path.

> completely broken, so it doesn't look like this code path is exactly
> well tested.
>

Thanks,
-Bob



[Xen-devel] [PATCH v5 00/10] xen-block: multi hardware-queues/rings support

2015-11-13 Thread Bob Liu
Note: These patches are based on the original work of Arianna's internship for
GNOME's Outreach Program for Women.

After switching to the blk-mq API, a guest has more than one (nr_vcpus)
software request queue associated with each block front. These queues can be
mapped over several rings (hardware queues) to the backend, making it very
easy for us to run multiple threads on the backend for a single virtual disk.

By having different threads issuing requests at the same time, the performance
of the guest can be improved significantly.
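The software-to-hardware queue fan-out described above can be pictured with a
toy mapping (illustrative Python only; blk-mq's real CPU-to-queue map is more
involved):

```python
def map_swq_to_hwq(nr_swq: int, nr_rings: int) -> dict:
    """Round-robin assignment of per-vCPU software queues to rings (sketch)."""
    return {swq: swq % nr_rings for swq in range(nr_swq)}

# 16 per-vCPU software queues spread over 4 backend rings,
# so each ring can be serviced by its own backend thread:
mapping = map_swq_to_hwq(16, 4)
```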

Test was done based on null_blk driver:
dom0: v4.3-rc7 16vcpus 10GB "modprobe null_blk"
domU: v4.3-rc7 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Results:
iops1: After commit("xen/blkfront: make persistent grants per-queue").
iops2: After commit("xen/blkback: make persistent grants and free pages pool 
per-queue").

Queues:         1     4           8           16
Iops orig(k):   810   1064        780         700
Iops1(k):       810   1230(~20%)  1024(~20%)  850(~20%)
Iops2(k):       810   1410(~35%)  1354(~75%)  1440(~100%)

With 4 queues we can get a ~75% increase in IOPS after this series, and
performance won't drop when increasing the number of queues.

Please find the respective chart in this link:
https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0

---
v5:
 * Rebase to xen/tip.git tags/for-linus-4.4-rc0-tag.
 * Comments from Konrad.

v4:
 * Rebase to v4.3-rc7.
 * Comments from Roger.

v3:
 * Rebased to v4.2-rc8.

Bob Liu (10):
  xen/blkif: document blkif multi-queue/ring extension
  xen/blkfront: separate per ring information out of device info
  xen/blkfront: pseudo support for multi hardware queues/rings
  xen/blkfront: split per device io_lock
  xen/blkfront: negotiate number of queues/rings to be used with backend
  xen/blkback: separate ring information out of struct xen_blkif
  xen/blkback: pseudo support for multi hardware queues/rings
  xen/blkback: get the number of hardware queues/rings from blkfront
  xen/blkfront: make persistent grants per-queue
  xen/blkback: make pool of persistent grants and free pages per-queue

 drivers/block/xen-blkback/blkback.c | 386 ++-
 drivers/block/xen-blkback/common.h  |  78 ++--
 drivers/block/xen-blkback/xenbus.c  | 359 --
 drivers/block/xen-blkfront.c| 718 ++--
 include/xen/interface/io/blkif.h|  48 +++
 5 files changed, 971 insertions(+), 618 deletions(-)

-- 
1.8.3.1




[Xen-devel] [PATCH v5 01/10] xen/blkif: document blkif multi-queue/ring extension

2015-11-13 Thread Bob Liu
Document the multi-queue/ring feature in terms of XenStore keys to be written by
the backend and by the frontend.

Signed-off-by: Bob Liu 
---
v2:
Add descriptions together with multi-page ring buffer.
---
 include/xen/interface/io/blkif.h |   48 ++
 1 file changed, 48 insertions(+)

diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
index c33e1c4..8b8cfad 100644
--- a/include/xen/interface/io/blkif.h
+++ b/include/xen/interface/io/blkif.h
@@ -28,6 +28,54 @@ typedef uint16_t blkif_vdev_t;
 typedef uint64_t blkif_sector_t;
 
 /*
+ * Multiple hardware queues/rings:
+ * If supported, the backend will write the key "multi-queue-max-queues" to
+ * the directory for that vbd, and set its value to the maximum supported
+ * number of queues.
+ * Frontends that are aware of this feature and wish to use it can write the
+ * key "multi-queue-num-queues" with the number they wish to use, which must be
+ * greater than zero, and no more than the value reported by the backend in
+ * "multi-queue-max-queues".
+ *
+ * For frontends requesting just one queue, the usual event-channel and
+ * ring-ref keys are written as before, simplifying the backend processing
+ * to avoid distinguishing between a frontend that doesn't understand the
+ * multi-queue feature, and one that does, but requested only one queue.
+ *
+ * Frontends requesting two or more queues must not write the toplevel
+ * event-channel and ring-ref keys, instead writing those keys under sub-keys
+ * having the name "queue-N" where N is the integer ID of the queue/ring for
+ * which those keys belong. Queues are indexed from zero.
+ * For example, a frontend with two queues must write the following set of
+ * queue-related keys:
+ *
+ * /local/domain/1/device/vbd/0/multi-queue-num-queues = "2"
+ * /local/domain/1/device/vbd/0/queue-0 = ""
+ * /local/domain/1/device/vbd/0/queue-0/ring-ref = ""
+ * /local/domain/1/device/vbd/0/queue-0/event-channel = ""
+ * /local/domain/1/device/vbd/0/queue-1 = ""
+ * /local/domain/1/device/vbd/0/queue-1/ring-ref = ""
+ * /local/domain/1/device/vbd/0/queue-1/event-channel = ""
+ *
+ * It is also possible to use multiple queues/rings together with
+ * feature multi-page ring buffer.
+ * For example, a frontend requests two queues/rings and the size of each ring
+ * buffer is two pages must write the following set of related keys:
+ *
+ * /local/domain/1/device/vbd/0/multi-queue-num-queues = "2"
+ * /local/domain/1/device/vbd/0/ring-page-order = "1"
+ * /local/domain/1/device/vbd/0/queue-0 = ""
+ * /local/domain/1/device/vbd/0/queue-0/ring-ref0 = ""
+ * /local/domain/1/device/vbd/0/queue-0/ring-ref1 = ""
+ * /local/domain/1/device/vbd/0/queue-0/event-channel = ""
+ * /local/domain/1/device/vbd/0/queue-1 = ""
+ * /local/domain/1/device/vbd/0/queue-1/ring-ref0 = ""
+ * /local/domain/1/device/vbd/0/queue-1/ring-ref1 = ""
+ * /local/domain/1/device/vbd/0/queue-1/event-channel = ""
+ *
+ */
+
+/*
  * REQUEST CODES.
  */
 #define BLKIF_OP_READ  0
-- 
1.7.10.4
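A hypothetical helper (Python, not part of the patch) that enumerates the
xenstore keys a frontend would write under the scheme documented above,
including the multi-page ring-ref variant:

```python
def frontend_queue_keys(vbd_path, num_queues, ring_page_order=0):
    """List the per-queue xenstore keys for a blkfront (sketch).

    With a single queue the legacy toplevel keys are kept, so the backend
    cannot tell an unaware frontend from one that asked for one queue.
    """
    refs = 1 << ring_page_order          # ring pages per queue
    keys = []
    if num_queues > 1:
        keys.append(f"{vbd_path}/multi-queue-num-queues")
    if ring_page_order > 0:
        keys.append(f"{vbd_path}/ring-page-order")
    for q in range(num_queues):
        prefix = vbd_path if num_queues == 1 else f"{vbd_path}/queue-{q}"
        if refs == 1:
            keys.append(f"{prefix}/ring-ref")
        else:
            keys.extend(f"{prefix}/ring-ref{i}" for i in range(refs))
        keys.append(f"{prefix}/event-channel")
    return keys
```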




[Xen-devel] [PATCH v5 05/10] xen/blkfront: negotiate number of queues/rings to be used with backend

2015-11-13 Thread Bob Liu
The max number of hardware queues for xen/blkfront is set by the parameter
'max_queues' (default 4), and is also capped by the maximum value that
xen/blkback exposes through the XenStore key 'multi-queue-max-queues'.

The negotiated number is the smaller of the two and is written back to
XenStore as "multi-queue-num-queues"; blkback needs to read this negotiated
number.
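The negotiation reduces to taking the smaller of the two advertised values;
sketched in Python (a hypothetical helper, not the driver code):

```python
def negotiate_num_queues(frontend_max: int, backend_max: int) -> int:
    # frontend_max: the 'max_queues' module parameter (default 4);
    # backend_max: backend's 'multi-queue-max-queues' (0 if absent).
    # At least one queue/ring is always used.
    return max(1, min(frontend_max, backend_max))
```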

Signed-off-by: Bob Liu 
---
v2:
 * Make 'i' be an unsigned int.
 * Other comments from Konrad.
---
 drivers/block/xen-blkfront.c |  160 +++---
 1 file changed, 119 insertions(+), 41 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 56c9ec6..84496be 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -99,6 +99,10 @@ static unsigned int xen_blkif_max_segments = 32;
 module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
 MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests 
(default is 32)");
 
+static unsigned int xen_blkif_max_queues = 4;
+module_param_named(max_queues, xen_blkif_max_queues, uint, S_IRUGO);
+MODULE_PARM_DESC(max_queues, "Maximum number of hardware queues/rings used per 
virtual disk");
+
 /*
  * Maximum order of pages to be used for the shared ring between front and
  * backend, 4KB page granularity is used.
@@ -118,6 +122,10 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
pages to be used for the
  * characters are enough. Define to 20 to keep consist with backend.
  */
 #define RINGREF_NAME_LEN (20)
+/*
+ * queue-%u would take 7 + 10(UINT_MAX) = 17 characters
+ */
+#define QUEUE_NAME_LEN (17)
 
 /*
  *  Per-ring info.
@@ -823,7 +831,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
 
memset(&info->tag_set, 0, sizeof(info->tag_set));
info->tag_set.ops = &blkfront_mq_ops;
-   info->tag_set.nr_hw_queues = 1;
+   info->tag_set.nr_hw_queues = info->nr_rings;
info->tag_set.queue_depth =  BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
@@ -1520,6 +1528,53 @@ fail:
return err;
 }
 
+/*
+ * Write out per-ring/queue nodes including ring-ref and event-channel, and 
each
+ * ring buffer may have multi pages depending on ->nr_ring_pages.
+ */
+static int write_per_ring_nodes(struct xenbus_transaction xbt,
+   struct blkfront_ring_info *rinfo, const char 
*dir)
+{
+   int err;
+   unsigned int i;
+   const char *message = NULL;
+   struct blkfront_info *info = rinfo->dev_info;
+
+   if (info->nr_ring_pages == 1) {
+   err = xenbus_printf(xbt, dir, "ring-ref", "%u", 
rinfo->ring_ref[0]);
+   if (err) {
+   message = "writing ring-ref";
+   goto abort_transaction;
+   }
+   } else {
+   for (i = 0; i < info->nr_ring_pages; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
i);
+   err = xenbus_printf(xbt, dir, ring_ref_name,
+   "%u", rinfo->ring_ref[i]);
+   if (err) {
+   message = "writing ring-ref";
+   goto abort_transaction;
+   }
+   }
+   }
+
+   err = xenbus_printf(xbt, dir, "event-channel", "%u", rinfo->evtchn);
+   if (err) {
+   message = "writing event-channel";
+   goto abort_transaction;
+   }
+
+   return 0;
+
+abort_transaction:
+   xenbus_transaction_end(xbt, 1);
+   if (message)
+   xenbus_dev_fatal(info->xbdev, err, "%s", message);
+
+   return err;
+}
 
 /* Common code used when first setting up, and when resuming. */
 static int talk_to_blkback(struct xenbus_device *dev,
@@ -1527,10 +1582,9 @@ static int talk_to_blkback(struct xenbus_device *dev,
 {
const char *message = NULL;
struct xenbus_transaction xbt;
-   int err, i;
-   unsigned int max_page_order = 0;
+   int err;
+   unsigned int i, max_page_order = 0;
unsigned int ring_page_order = 0;
-   struct blkfront_ring_info *rinfo;
 
err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
   "max-ring-page-order", "%u", &max_page_order);
@@ -1542,7 +1596,8 @@ static int talk_to_blkback(struct xenbus_device *dev,
}
 
for (i = 0; i < info->nr_rings; i++) {
-   rinfo = &info->rinfo[i];
+   struct blkfront_ring_info *rinfo 

[Xen-devel] [PATCH v5 10/10] xen/blkback: make pool of persistent grants and free pages per-queue

2015-11-13 Thread Bob Liu
Make the pools of persistent grants and free pages per-queue/ring instead of
per-device to get better scalability.

Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Results:
iops1: After commit("xen/blkfront: make persistent grants per-queue").
iops2: After this commit.

Queues:          1    4           8           16
Iops orig(k):   810  1064        780         700
Iops1(k):       810  1230(~20%)  1024(~20%)  850(~20%)
Iops2(k):       810  1410(~35%)  1354(~75%)  1440(~100%)

With 8 queues this commit gives a ~75% increase in IOPS, and performance
no longer drops as the number of queues increases.

Please find the respective chart in this link:
https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0
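The data-structure change behind these numbers can be sketched in userspace C. This is only an analogue under stated assumptions: `fake_page`, `ring_pool`, and the function names are hypothetical, malloc'd nodes stand in for `struct page`/`gnttab_alloc_pages`, and locking is omitted since each ring now owns its pool exclusively:

```c
#include <stdlib.h>
#include <assert.h>

/* Minimal userspace analogue of the per-ring free-page pool: each
 * ring keeps its own LIFO list of reusable pages, so rings never
 * contend on a shared per-device free_pages_lock. */
struct fake_page { struct fake_page *next; };

struct ring_pool {
    struct fake_page *free_head;   /* per-ring free list */
    unsigned int free_num;
};

/* Take a page from this ring's pool, or fall back to a fresh
 * allocation when the pool is empty (as get_free_page() falls back
 * to gnttab_alloc_pages() in the patch). */
static struct fake_page *pool_get_page(struct ring_pool *rp)
{
    if (!rp->free_head) {
        assert(rp->free_num == 0);
        return malloc(sizeof(struct fake_page));
    }
    struct fake_page *pg = rp->free_head;
    rp->free_head = pg->next;
    rp->free_num--;
    return pg;
}

/* Return a page to this ring's pool (mirrors put_free_pages()). */
static void pool_put_page(struct ring_pool *rp, struct fake_page *pg)
{
    pg->next = rp->free_head;
    rp->free_head = pg;
    rp->free_num++;
}
```

Because the pool is embedded in the ring, the hot path touches only cache lines owned by that ring's backend thread.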

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/blkback.c |  202 ---
 drivers/block/xen-blkback/common.h  |   32 +++---
 drivers/block/xen-blkback/xenbus.c  |   21 ++--
 3 files changed, 118 insertions(+), 137 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index acedc46..0e8a04d 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -122,60 +122,60 @@ module_param(log_stats, int, 0644);
 /* Number of free pages to remove on each call to gnttab_free_pages */
 #define NUM_BATCH_FREE_PAGES 10
 
-static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
+static inline int get_free_page(struct xen_blkif_ring *ring, struct page 
**page)
 {
unsigned long flags;
 
-   spin_lock_irqsave(&blkif->free_pages_lock, flags);
-   if (list_empty(&blkif->free_pages)) {
-   BUG_ON(blkif->free_pages_num != 0);
-   spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+   spin_lock_irqsave(&ring->free_pages_lock, flags);
+   if (list_empty(&ring->free_pages)) {
+   BUG_ON(ring->free_pages_num != 0);
+   spin_unlock_irqrestore(&ring->free_pages_lock, flags);
return gnttab_alloc_pages(1, page);
}
-   BUG_ON(blkif->free_pages_num == 0);
-   page[0] = list_first_entry(&blkif->free_pages, struct page, lru);
+   BUG_ON(ring->free_pages_num == 0);
+   page[0] = list_first_entry(&ring->free_pages, struct page, lru);
list_del(&page[0]->lru);
-   blkif->free_pages_num--;
-   spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+   ring->free_pages_num--;
+   spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 
return 0;
 }
 
-static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
+static inline void put_free_pages(struct xen_blkif_ring *ring, struct page 
**page,
   int num)
 {
unsigned long flags;
int i;
 
-   spin_lock_irqsave(&blkif->free_pages_lock, flags);
+   spin_lock_irqsave(&ring->free_pages_lock, flags);
for (i = 0; i < num; i++)
-   list_add(&page[i]->lru, &blkif->free_pages);
-   blkif->free_pages_num += num;
-   spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+   list_add(&page[i]->lru, &ring->free_pages);
+   ring->free_pages_num += num;
+   spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 }
 
-static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)
+static inline void shrink_free_pagepool(struct xen_blkif_ring *ring, int num)
 {
/* Remove requested pages in batches of NUM_BATCH_FREE_PAGES */
struct page *page[NUM_BATCH_FREE_PAGES];
unsigned int num_pages = 0;
unsigned long flags;
 
-   spin_lock_irqsave(&blkif->free_pages_lock, flags);
-   while (blkif->free_pages_num > num) {
-   BUG_ON(list_empty(&blkif->free_pages));
-   page[num_pages] = list_first_entry(&blkif->free_pages,
+   spin_lock_irqsave(&ring->free_pages_lock, flags);
+   while (ring->free_pages_num > num) {
+   BUG_ON(list_empty(&ring->free_pages));
+   page[num_pages] = list_first_entry(&ring->free_pages,
   struct page, lru);
list_del(&page[num_pages]->lru);
-   blkif->free_pages_num--;
+   ring->free_pages_num--;
if (++num_pages == NUM_BATCH_FREE_PAGES) {
-   spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+ 

[Xen-devel] [PATCH v5 03/10] xen/blkfront: pseudo support for multi hardware queues/rings

2015-11-13 Thread Bob Liu
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1; a larger number will be enabled in the
next patch ("xen/blkfront: negotiate number of queues/rings to be used
with backend") so as to keep every single patch small and readable.
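The core shape of the change, an embedded `struct blkfront_ring_info rinfo` becoming a dynamically sized array with back-pointers, can be sketched as below. This is a hypothetical userspace reduction: `ring_info`/`dev_info` and `alloc_rings` are stand-in names, and the real ring fields and error paths are omitted:

```c
#include <stdlib.h>
#include <assert.h>

struct dev_info;                 /* forward declaration for the back-pointer */

struct ring_info {
    struct dev_info *dev_info;   /* back-pointer, as in blkfront_ring_info */
    unsigned int index;
};

struct dev_info {
    unsigned int nr_rings;
    struct ring_info *rinfo;     /* was: an embedded single struct */
};

/* Allocate one ring_info per hardware queue; the preparatory patch
 * always passes nr_rings == 1. */
static int alloc_rings(struct dev_info *info, unsigned int nr_rings)
{
    info->rinfo = calloc(nr_rings, sizeof(*info->rinfo));
    if (!info->rinfo)
        return -1;
    info->nr_rings = nr_rings;
    for (unsigned int i = 0; i < nr_rings; i++) {
        info->rinfo[i].dev_info = info;
        info->rinfo[i].index = i;
    }
    return 0;
}
```

The back-pointer is what lets per-ring code (interrupt handlers, queue callbacks) reach shared device state without a global lookup.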

Signed-off-by: Bob Liu 
---
v2:
 * Fix memleak.
 * Other comments from Konrad.
---
 drivers/block/xen-blkfront.c |  341 --
 1 file changed, 195 insertions(+), 146 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0c3ad21..d73734f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -150,6 +150,7 @@ struct blkfront_info
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
+   /* Number of pages per ring buffer. */
unsigned int nr_ring_pages;
struct request_queue *rq;
struct list_head grants;
@@ -164,7 +165,8 @@ struct blkfront_info
unsigned int max_indirect_segments;
int is_ready;
struct blk_mq_tag_set tag_set;
-   struct blkfront_ring_info rinfo;
+   struct blkfront_ring_info *rinfo;
+   unsigned int nr_rings;
 };
 
 static unsigned int nr_minors;
@@ -209,7 +211,7 @@ static DEFINE_SPINLOCK(minor_lock);
 #define GREFS(_psegs)  ((_psegs) * GRANTS_PER_PSEG)
 
 static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
-static int blkfront_gather_backend_features(struct blkfront_info *info);
+static void blkfront_gather_backend_features(struct blkfront_info *info);
 
 static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
@@ -338,8 +340,8 @@ static struct grant *get_indirect_grant(grant_ref_t 
*gref_head,
struct page *indirect_page;
 
/* Fetch a pre-allocated page to use for indirect grefs */
-   BUG_ON(list_empty(&info->rinfo.indirect_pages));
-   indirect_page = list_first_entry(&info->rinfo.indirect_pages,
+   BUG_ON(list_empty(&info->rinfo->indirect_pages));
+   indirect_page = list_first_entry(&info->rinfo->indirect_pages,
 struct page, lru);
list_del(&indirect_page->lru);
gnt_list_entry->page = indirect_page;
@@ -597,7 +599,6 @@ static int blkif_queue_rw_req(struct request *req, struct 
blkfront_ring_info *ri
 * existing persistent grants, or if we have to get new grants,
 * as there are not sufficiently many free.
 */
-   bool new_persistent_gnts;
struct scatterlist *sg;
int num_sg, max_grefs, num_grant;
 
@@ -609,12 +610,12 @@ static int blkif_queue_rw_req(struct request *req, struct 
blkfront_ring_info *ri
 */
max_grefs += INDIRECT_GREFS(max_grefs);
 
-   /* Check if we have enough grants to allocate a requests */
-   if (info->persistent_gnts_c < max_grefs) {
-   new_persistent_gnts = 1;
-   if (gnttab_alloc_grant_references(
-   max_grefs - info->persistent_gnts_c,
-   &setup.gref_head) < 0) {
+   /*
+* We have to reserve 'max_grefs' grants at first because persistent
+* grants are shared by all rings.
+*/
+   if (max_grefs > 0)
+   if (gnttab_alloc_grant_references(max_grefs, &setup.gref_head) 
< 0) {
gnttab_request_free_callback(
&rinfo->callback,
blkif_restart_queue_callback,
@@ -622,8 +623,6 @@ static int blkif_queue_rw_req(struct request *req, struct 
blkfront_ring_info *ri
max_grefs);
return 1;
}
-   } else
-   new_persistent_gnts = 0;
 
/* Fill out a communications ring structure. */
ring_req = RING_GET_REQUEST(&rinfo->ring, rinfo->ring.req_prod_pvt);
@@ -712,7 +711,7 @@ static int blkif_queue_rw_req(struct request *req, struct 
blkfront_ring_info *ri
/* Keep a private copy so we can reissue requests when recovering. */
rinfo->shadow[id].req = *ring_req;
 
-   if (new_persistent_gnts)
+   if (max_grefs > 0)
gnttab_free_grant_references(setup.gref_head);
 
return 0;
@@ -791,7 +790,8 @@ static int blk_mq_init_hctx(struct blk_mq_hw_ctx *hctx, 
void *data,
 {
struct blkfront_info *info = (struct blkfront_info *)data;
 
-   hctx->driver_data = &info->rinfo;
+   BUG_ON(info->nr_rings <= index);
+   hctx->driver_data = &info->rinfo[index];
return 0;
 }
 
@@ -1050,8 +1050,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
 
 static void xlvbd_release_gendisk(struct blkfront_info *info)
 {
-   unsigned int minor, nr_minors;
-   struct blkfront_rin

[Xen-devel] [PATCH v5 04/10] xen/blkfront: split per device io_lock

2015-11-13 Thread Bob Liu
After commit "xen/blkfront: separate per ring information out of device
info", per-ring data is protected by a per-device lock ('io_lock').

This hurts scalability, so introduce a per-ring lock ('ring_lock').

The old 'io_lock' is renamed to 'dev_lock'; it protects the ->grants list and
persistent_gnts_c, which are shared by all rings.
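The benefit of the lock split can be sketched in userspace C, with pthread mutexes standing in for the kernel spinlocks (all names here are hypothetical; the point is only that each ring's submission path takes a lock private to that ring, so the rings no longer serialize against each other):

```c
#include <pthread.h>
#include <assert.h>

#define NR_RINGS     4
#define OPS_PER_RING 10000

/* Userspace sketch of the split: one lock per ring instead of one
 * per device, so concurrent submitters on different rings proceed
 * in parallel. */
struct ring {
    pthread_mutex_t ring_lock;   /* protects this ring's state only */
    long inflight;
};

static struct ring rings[NR_RINGS];

static void *ring_worker(void *arg)
{
    struct ring *r = arg;
    for (int i = 0; i < OPS_PER_RING; i++) {
        pthread_mutex_lock(&r->ring_lock);
        r->inflight++;           /* stand-in for queuing a request */
        pthread_mutex_unlock(&r->ring_lock);
    }
    return NULL;
}

/* Run one worker thread per ring and sum the per-ring counters. */
static long run_all_rings(void)
{
    pthread_t tids[NR_RINGS];
    long total = 0;

    for (int i = 0; i < NR_RINGS; i++) {
        pthread_mutex_init(&rings[i].ring_lock, NULL);
        rings[i].inflight = 0;
        pthread_create(&tids[i], NULL, ring_worker, &rings[i]);
    }
    for (int i = 0; i < NR_RINGS; i++) {
        pthread_join(tids[i], NULL);
        total += rings[i].inflight;
    }
    return total;
}
```

Anything still shared across rings (the grants list, persistent_gnts_c) keeps a device-wide lock, which is exactly the role 'dev_lock' plays in the patch.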

Signed-off-by: Bob Liu 
---
v2:
 * Introduce kick_pending_request_queues_locked().
 * Add comment for 'ring_lock'.
 * Move locks to more suitable place.
---
 drivers/block/xen-blkfront.c |   73 +++---
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d73734f..56c9ec6 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -125,6 +125,8 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
pages to be used for the
  *  depending on how many hardware queues/rings to be used.
  */
 struct blkfront_ring_info {
+   /* Lock to protect data in every ring buffer. */
+   spinlock_t ring_lock;
struct blkif_front_ring ring;
unsigned int ring_ref[XENBUS_MAX_RING_GRANTS];
unsigned int evtchn, irq;
@@ -143,7 +145,6 @@ struct blkfront_ring_info {
  */
 struct blkfront_info
 {
-   spinlock_t io_lock;
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
@@ -153,6 +154,11 @@ struct blkfront_info
/* Number of pages per ring buffer. */
unsigned int nr_ring_pages;
struct request_queue *rq;
+   /*
+* Lock to protect info->grants list and persistent_gnts_c shared by all
+* rings.
+*/
+   spinlock_t dev_lock;
struct list_head grants;
unsigned int persistent_gnts_c;
unsigned int feature_flush;
@@ -258,7 +264,9 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
}
 
gnt_list_entry->gref = GRANT_INVALID_REF;
+   spin_lock_irq(&info->dev_lock);
list_add(&gnt_list_entry->node, &info->grants);
+   spin_unlock_irq(&info->dev_lock);
i++;
}
 
@@ -267,7 +275,9 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
 out_of_memory:
list_for_each_entry_safe(gnt_list_entry, n,
 &info->grants, node) {
+   spin_lock_irq(&info->dev_lock);
list_del(&gnt_list_entry->node);
+   spin_unlock_irq(&info->dev_lock);
if (info->feature_persistent)
__free_page(gnt_list_entry->page);
kfree(gnt_list_entry);
@@ -280,7 +290,9 @@ out_of_memory:
 static struct grant *get_free_grant(struct blkfront_info *info)
 {
struct grant *gnt_list_entry;
+   unsigned long flags;
 
+   spin_lock_irqsave(&info->dev_lock, flags);
BUG_ON(list_empty(&info->grants));
gnt_list_entry = list_first_entry(&info->grants, struct grant,
  node);
@@ -288,6 +300,7 @@ static struct grant *get_free_grant(struct blkfront_info 
*info)
 
if (gnt_list_entry->gref != GRANT_INVALID_REF)
info->persistent_gnts_c--;
+   spin_unlock_irqrestore(&info->dev_lock, flags);
 
return gnt_list_entry;
 }
@@ -757,11 +770,11 @@ static inline bool blkif_request_flush_invalid(struct 
request *req,
 static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
   const struct blk_mq_queue_data *qd)
 {
+   unsigned long flags;
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info 
*)hctx->driver_data;
-   struct blkfront_info *info = rinfo->dev_info;
 
blk_mq_start_request(qd->rq);
-   spin_lock_irq(&info->io_lock);
+   spin_lock_irqsave(&rinfo->ring_lock, flags);
if (RING_FULL(&rinfo->ring))
goto out_busy;
 
@@ -772,15 +785,15 @@ static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
goto out_busy;
 
flush_requests(rinfo);
-   spin_unlock_irq(&info->io_lock);
+   spin_unlock_irqrestore(&rinfo->ring_lock, flags);
return BLK_MQ_RQ_QUEUE_OK;
 
 out_err:
-   spin_unlock_irq(&info->io_lock);
+   spin_unlock_irqrestore(&rinfo->ring_lock, flags);
return BLK_MQ_RQ_QUEUE_ERROR;
 
 out_busy:
-   spin_unlock_irq(&info->io_lock);
+   spin_unlock_irqrestore(&rinfo->ring_lock, flags);
blk_mq_stop_hw_queue(hctx);
return BLK_MQ_RQ_QUEUE_BUSY;
 }
@@ -1082,21 +1095,28 @@ static void xlvbd_release_gendisk(struct blkfront_info 
*info)
info->gd = NULL;
 }
 
-/* Must be called with io_lock holded

[Xen-devel] [PATCH v5 07/10] xen/blkback: pseudo support for multi hardware queues/rings

2015-11-13 Thread Bob Liu
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1; a larger number will be enabled in the
next patch ("xen/blkback: get the number of hardware queues/rings from
blkfront") so as to keep every single patch small and readable.

Signed-off-by: Arianna Avanzini 
Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/common.h |3 +-
 drivers/block/xen-blkback/xenbus.c |  277 ++--
 2 files changed, 175 insertions(+), 105 deletions(-)

diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index f4dfa5b..f2386e3 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -340,7 +340,8 @@ struct xen_blkif {
struct work_struct  free_work;
unsigned int nr_ring_pages;
/* All rings for this device. */
-   struct xen_blkif_ring ring;
+   struct xen_blkif_ring *rings;
+   unsigned int nr_rings;
 };
 
 struct seg_buf {
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index e4bfc92..6c6e048 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -86,9 +86,11 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
 {
int err;
char name[BLKBACK_NAME_LEN];
+   struct xen_blkif_ring *ring;
+   unsigned int i;
 
/* Not ready to connect? */
-   if (!blkif->ring.irq || !blkif->vbd.bdev)
+   if (!blkif->rings || !blkif->rings[0].irq || !blkif->vbd.bdev)
return;
 
/* Already connected? */
@@ -113,19 +115,55 @@ static void xen_update_blkif_status(struct xen_blkif 
*blkif)
}
invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);
 
-   blkif->ring.xenblkd = kthread_run(xen_blkif_schedule, &blkif->ring, 
"%s", name);
-   if (IS_ERR(blkif->ring.xenblkd)) {
-   err = PTR_ERR(blkif->ring.xenblkd);
-   blkif->ring.xenblkd = NULL;
-   xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
-   return;
+   for (i = 0; i < blkif->nr_rings; i++) {
+   ring = &blkif->rings[i];
+   ring->xenblkd = kthread_run(xen_blkif_schedule, ring, "%s-%d", 
name, i);
+   if (IS_ERR(ring->xenblkd)) {
+   err = PTR_ERR(ring->xenblkd);
+   ring->xenblkd = NULL;
+   xenbus_dev_fatal(blkif->be->dev, err,
+   "start %s-%d xenblkd", name, i);
+   goto out;
+   }
+   }
+   return;
+
+out:
+   while (--i >= 0) {
+   ring = &blkif->rings[i];
+   kthread_stop(ring->xenblkd);
}
+   return;
+}
+
+static int xen_blkif_alloc_rings(struct xen_blkif *blkif)
+{
+   unsigned int r;
+
+   blkif->rings = kzalloc(blkif->nr_rings * sizeof(struct xen_blkif_ring), 
GFP_KERNEL);
+   if (!blkif->rings)
+   return -ENOMEM;
+
+   for (r = 0; r < blkif->nr_rings; r++) {
+   struct xen_blkif_ring *ring = &blkif->rings[r];
+
+   spin_lock_init(&ring->blk_ring_lock);
+   init_waitqueue_head(&ring->wq);
+   INIT_LIST_HEAD(&ring->pending_free);
+
+   spin_lock_init(&ring->pending_free_lock);
+   init_waitqueue_head(&ring->pending_free_wq);
+   init_waitqueue_head(&ring->shutdown_wq);
+   ring->blkif = blkif;
+   xen_blkif_get(blkif);
+   }
+
+   return 0;
 }
 
 static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 {
struct xen_blkif *blkif;
-   struct xen_blkif_ring *ring;
 
BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
 
@@ -143,15 +181,11 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->st_print = jiffies;
INIT_WORK(&blkif->persistent_purge_work, xen_blkbk_unmap_purged_grants);
 
-   ring = &blkif->ring;
-   ring->blkif = blkif;
-   spin_lock_init(&ring->blk_ring_lock);
-   init_waitqueue_head(&ring->wq);
-
-   INIT_LIST_HEAD(&ring->pending_free);
-   spin_lock_init(&ring->pending_free_lock);
-   init_waitqueue_head(&ring->pending_free_wq);
-   init_waitqueue_head(&ring->shutdown_wq);
+   blkif->nr_rings = 1;
+   if (xen_blkif_alloc_rings(blkif)) {
+   kmem_cache_free(xen_blkif_cachep, blkif);
+   return ERR_PTR(-ENOMEM);
+   }
 
return blkif;
 }
@@ -216,50 +250,54 @@ static int xen_blkif_map(struct xen_blkif_ring *ring, 
grant_ref_t *gref,
 static int xen_blkif_disconnect(struct xen_blkif *blkif)
 {
 

[Xen-devel] [PATCH v5 09/10] xen/blkfront: make persistent grants pool per-queue

2015-11-13 Thread Bob Liu
Make persistent grants per-queue/ring instead of per-device, so that we can
drop the 'dev_lock' and get better scalability.

Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Queues:            1    4           8           16
Iops orig(k):     810  1064        780         700
Iops patched(k):  810  1230(~20%)  1024(~20%)  850(~20%)
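The per-ring grant free list this patch introduces can be reduced to a short userspace sketch (hypothetical names, a plain singly linked list instead of the kernel list API, and no locking because each ring now owns its list):

```c
#include <stdlib.h>
#include <assert.h>

#define GRANT_INVALID_REF 0

/* Userspace sketch of the per-ring grant pool: each ring owns its
 * own free list and persistent-grant counter, so no device-wide
 * 'dev_lock' is needed around get/put. */
struct grant {
    unsigned int gref;           /* GRANT_INVALID_REF if never granted */
    struct grant *next;
};

struct ring {
    struct grant *grants;        /* per-ring free list */
    unsigned int persistent_gnts_c;
};

/* Push a grant onto this ring's free list; a valid gref means it is
 * a persistent grant being kept for reuse. */
static void ring_add_grant(struct ring *r, struct grant *g)
{
    g->next = r->grants;
    r->grants = g;
    if (g->gref != GRANT_INVALID_REF)
        r->persistent_gnts_c++;
}

/* Mirrors get_free_grant(): pop the head; a still-valid gref means a
 * persistent grant is being consumed, so drop the per-ring count. */
static struct grant *ring_get_free_grant(struct ring *r)
{
    struct grant *g = r->grants;
    assert(g != NULL);
    r->grants = g->next;
    if (g->gref != GRANT_INVALID_REF)
        r->persistent_gnts_c--;
    return g;
}
```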

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c |  110 +-
 1 file changed, 43 insertions(+), 67 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 84496be..451f852 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -142,6 +142,8 @@ struct blkfront_ring_info {
struct gnttab_free_callback callback;
struct blk_shadow shadow[BLK_MAX_RING_SIZE];
struct list_head indirect_pages;
+   struct list_head grants;
+   unsigned int persistent_gnts_c;
unsigned long shadow_free;
struct blkfront_info *dev_info;
 };
@@ -162,13 +164,6 @@ struct blkfront_info
/* Number of pages per ring buffer. */
unsigned int nr_ring_pages;
struct request_queue *rq;
-   /*
-* Lock to protect info->grants list and persistent_gnts_c shared by all
-* rings.
-*/
-   spinlock_t dev_lock;
-   struct list_head grants;
-   unsigned int persistent_gnts_c;
unsigned int feature_flush;
unsigned int feature_discard:1;
unsigned int feature_secdiscard:1;
@@ -272,9 +267,7 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
}
 
gnt_list_entry->gref = GRANT_INVALID_REF;
-   spin_lock_irq(&info->dev_lock);
-   list_add(&gnt_list_entry->node, &info->grants);
-   spin_unlock_irq(&info->dev_lock);
+   list_add(&gnt_list_entry->node, &rinfo->grants);
i++;
}
 
@@ -282,10 +275,8 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
 
 out_of_memory:
list_for_each_entry_safe(gnt_list_entry, n,
-&info->grants, node) {
-   spin_lock_irq(&info->dev_lock);
+&rinfo->grants, node) {
list_del(&gnt_list_entry->node);
-   spin_unlock_irq(&info->dev_lock);
if (info->feature_persistent)
__free_page(gnt_list_entry->page);
kfree(gnt_list_entry);
@@ -295,20 +286,17 @@ out_of_memory:
return -ENOMEM;
 }
 
-static struct grant *get_free_grant(struct blkfront_info *info)
+static struct grant *get_free_grant(struct blkfront_ring_info *rinfo)
 {
struct grant *gnt_list_entry;
-   unsigned long flags;
 
-   spin_lock_irqsave(&info->dev_lock, flags);
-   BUG_ON(list_empty(&info->grants));
-   gnt_list_entry = list_first_entry(&info->grants, struct grant,
+   BUG_ON(list_empty(&rinfo->grants));
+   gnt_list_entry = list_first_entry(&rinfo->grants, struct grant,
  node);
list_del(&gnt_list_entry->node);
 
if (gnt_list_entry->gref != GRANT_INVALID_REF)
-   info->persistent_gnts_c--;
-   spin_unlock_irqrestore(&info->dev_lock, flags);
+   rinfo->persistent_gnts_c--;
 
return gnt_list_entry;
 }
@@ -324,9 +312,10 @@ static inline void grant_foreign_access(const struct grant 
*gnt_list_entry,
 
 static struct grant *get_grant(grant_ref_t *gref_head,
   unsigned long gfn,
-  struct blkfront_info *info)
+  struct blkfront_ring_info *rinfo)
 {
-   struct grant *gnt_list_entry = get_free_grant(info);
+   struct grant *gnt_list_entry = get_free_grant(rinfo);
+   struct blkfront_info *info = rinfo->dev_info;
 
if (gnt_list_entry->gref != GRANT_INVALID_REF)
return gnt_list_entry;
@@ -347,9 +336,10 @@ static struct grant *get_grant(grant_ref_t *gref_head,
 }
 
 static struct grant *get_indirect_grant(grant_ref_t *gref_head,
-   struct blkfront_info *info)
+   struct blkfront_ring_info *rinfo)
 {
-   struct grant *gnt_list_entry = get_free_grant(info);
+   struct grant *gnt_list_entry = get_free_grant(rinfo);
+   struct blkfront_info *info = rinfo->dev_info;
 
if (gnt_list_entry->gref != GRANT_

[Xen-devel] [PATCH v5 02/10] xen/blkfront: separate per ring information out of device info

2015-11-13 Thread Bob Liu
Split the per-ring information out into a new structure, "blkfront_ring_info".

A ring is the representation of a hardware queue; every vbd device can be
associated with one or more rings, depending on how many hardware
queues/rings are to be used.

This patch is a preparation for supporting real multi hardware queues/rings.
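One piece of state the patch moves into the ring structure is the shadow free list, where free request-slot ids are chained through the slots themselves (`shadow[free].req.u.rw.id` holds the next free index). A userspace sketch of that encoding, with hypothetical names and a simplified entry:

```c
#include <assert.h>

#define RING_SIZE 32

/* Sketch of the intrusive free list: while a slot is free, its id
 * field stores the index of the next free slot. */
struct shadow_entry {
    unsigned long id;            /* next free index while on the free list */
    int in_use;
};

struct ring_shadow {
    struct shadow_entry shadow[RING_SIZE];
    unsigned long shadow_free;
};

static void shadow_init(struct ring_shadow *r)
{
    for (unsigned long i = 0; i < RING_SIZE; i++)
        r->shadow[i].id = i + 1;
    r->shadow_free = 0;
}

/* Mirrors get_id_from_freelist(): pop the head index and advance
 * shadow_free to the index stored in that slot. */
static unsigned long get_id_from_freelist(struct ring_shadow *r)
{
    unsigned long free = r->shadow_free;
    assert(free < RING_SIZE);
    r->shadow_free = r->shadow[free].id;
    r->shadow[free].in_use = 1;
    return free;
}

/* Mirrors add_id_to_freelist(), including its -EINVAL-style checks. */
static int add_id_to_freelist(struct ring_shadow *r, unsigned long id)
{
    if (id >= RING_SIZE || !r->shadow[id].in_use)
        return -1;
    r->shadow[id].id = r->shadow_free;
    r->shadow[id].in_use = 0;
    r->shadow_free = id;
    return 0;
}
```

Because the list lives entirely inside the shadow array, moving the array into `blkfront_ring_info` makes id allocation a purely per-ring operation.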

Signed-off-by: Arianna Avanzini 
Signed-off-by: Bob Liu 
---
v2: Fix build error.
---
 drivers/block/xen-blkfront.c |  359 +++---
 1 file changed, 197 insertions(+), 162 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2fee2ee..0c3ad21 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -120,6 +120,23 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
pages to be used for the
 #define RINGREF_NAME_LEN (20)
 
 /*
+ *  Per-ring info.
+ *  Every blkfront device can associate with one or more blkfront_ring_info,
+ *  depending on how many hardware queues/rings to be used.
+ */
+struct blkfront_ring_info {
+   struct blkif_front_ring ring;
+   unsigned int ring_ref[XENBUS_MAX_RING_GRANTS];
+   unsigned int evtchn, irq;
+   struct work_struct work;
+   struct gnttab_free_callback callback;
+   struct blk_shadow shadow[BLK_MAX_RING_SIZE];
+   struct list_head indirect_pages;
+   unsigned long shadow_free;
+   struct blkfront_info *dev_info;
+};
+
+/*
  * We have one of these per vbd, whether ide, scsi or 'other'.  They
  * hang in private_data off the gendisk structure. We may end up
  * putting all kinds of interesting stuff here :-)
@@ -133,18 +150,10 @@ struct blkfront_info
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
-   int ring_ref[XENBUS_MAX_RING_GRANTS];
unsigned int nr_ring_pages;
-   struct blkif_front_ring ring;
-   unsigned int evtchn, irq;
struct request_queue *rq;
-   struct work_struct work;
-   struct gnttab_free_callback callback;
-   struct blk_shadow shadow[BLK_MAX_RING_SIZE];
struct list_head grants;
-   struct list_head indirect_pages;
unsigned int persistent_gnts_c;
-   unsigned long shadow_free;
unsigned int feature_flush;
unsigned int feature_discard:1;
unsigned int feature_secdiscard:1;
@@ -155,6 +164,7 @@ struct blkfront_info
unsigned int max_indirect_segments;
int is_ready;
struct blk_mq_tag_set tag_set;
+   struct blkfront_ring_info rinfo;
 };
 
 static unsigned int nr_minors;
@@ -198,33 +208,35 @@ static DEFINE_SPINLOCK(minor_lock);
 
 #define GREFS(_psegs)  ((_psegs) * GRANTS_PER_PSEG)
 
-static int blkfront_setup_indirect(struct blkfront_info *info);
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
 static int blkfront_gather_backend_features(struct blkfront_info *info);
 
-static int get_id_from_freelist(struct blkfront_info *info)
+static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
-   unsigned long free = info->shadow_free;
-   BUG_ON(free >= BLK_RING_SIZE(info));
-   info->shadow_free = info->shadow[free].req.u.rw.id;
-   info->shadow[free].req.u.rw.id = 0x0fee; /* debug */
+   unsigned long free = rinfo->shadow_free;
+
+   BUG_ON(free >= BLK_RING_SIZE(rinfo->dev_info));
+   rinfo->shadow_free = rinfo->shadow[free].req.u.rw.id;
+   rinfo->shadow[free].req.u.rw.id = 0x0fee; /* debug */
return free;
 }
 
-static int add_id_to_freelist(struct blkfront_info *info,
+static int add_id_to_freelist(struct blkfront_ring_info *rinfo,
   unsigned long id)
 {
-   if (info->shadow[id].req.u.rw.id != id)
+   if (rinfo->shadow[id].req.u.rw.id != id)
return -EINVAL;
-   if (info->shadow[id].request == NULL)
+   if (rinfo->shadow[id].request == NULL)
return -EINVAL;
-   info->shadow[id].req.u.rw.id  = info->shadow_free;
-   info->shadow[id].request = NULL;
-   info->shadow_free = id;
+   rinfo->shadow[id].req.u.rw.id  = rinfo->shadow_free;
+   rinfo->shadow[id].request = NULL;
+   rinfo->shadow_free = id;
return 0;
 }
 
-static int fill_grant_buffer(struct blkfront_info *info, int num)
+static int fill_grant_buffer(struct blkfront_ring_info *rinfo, int num)
 {
+   struct blkfront_info *info = rinfo->dev_info;
struct page *granted_page;
struct grant *gnt_list_entry, *n;
int i = 0;
@@ -326,8 +338,8 @@ static struct grant *get_indirect_grant(grant_ref_t 
*gref_head,
struct page *indirect_page;
 
/* Fetch a pre-allocated page to use for indirect grefs */
-   BUG_ON(list_empty(&info->indirect_pages));
-   indirect_page = list_first_entry(&info->indirect_pages,
+   BUG_ON(list_em

[Xen-devel] [PATCH v5 08/10] xen/blkback: get the number of hardware queues/rings from blkfront

2015-11-13 Thread Bob Liu
The backend advertises "multi-queue-max-queues" to the frontend, and reads
the negotiated number of queues from "multi-queue-num-queues", which is
written by blkfront.
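The validation that connect_ring() applies to the frontend's request can be condensed into one small function. This is a sketch with a hypothetical name and signature, not the kernel code itself; it captures only the decision logic (read error means a legacy single-queue frontend, while 0 or anything above the advertised maximum marks a buggy or malicious guest):

```c
#include <assert.h>

/* Returns the accepted queue count, or -1 to reject the connection.
 * read_err < 0 models xenbus_scanf() failing because the frontend
 * never wrote "multi-queue-num-queues". */
static int validate_num_queues(int read_err, unsigned int requested,
                               unsigned int backend_max)
{
    if (read_err < 0)
        return 1;                /* legacy single-queue frontend */
    if (requested == 0 || requested > backend_max)
        return -1;               /* buggy or malicious guest */
    return (int)requested;
}
```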

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/blkback.c |   12 
 drivers/block/xen-blkback/common.h  |1 +
 drivers/block/xen-blkback/xenbus.c  |   34 --
 3 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index fb5bfd4..acedc46 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -84,6 +84,15 @@ MODULE_PARM_DESC(max_persistent_grants,
  "Maximum number of grants to map persistently");
 
 /*
+ * Maximum number of rings/queues blkback supports, allow as many queues as 
there
+ * are CPUs if user has not specified a value.
+ */
+unsigned int xenblk_max_queues;
+module_param_named(max_queues, xenblk_max_queues, uint, 0644);
+MODULE_PARM_DESC(max_queues,
+"Maximum number of hardware queues per virtual disk");
+
+/*
  * Maximum order of pages to be used for the shared ring between front and
  * backend, 4KB page granularity is used.
  */
@@ -1483,6 +1492,9 @@ static int __init xen_blkif_init(void)
xen_blkif_max_ring_order = XENBUS_MAX_RING_GRANT_ORDER;
}
 
+   if (xenblk_max_queues == 0)
+   xenblk_max_queues = num_online_cpus();
+
rc = xen_blkif_interface_init();
if (rc)
goto failed_init;
diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index f2386e3..0833dc6 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -46,6 +46,7 @@
 #include 
 
 extern unsigned int xen_blkif_max_ring_order;
+extern unsigned int xenblk_max_queues;
 /*
  * This is the maximum number of segments that would be allowed in indirect
  * requests. This value will also be passed to the frontend.
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 6c6e048..d83b790 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -181,12 +181,6 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->st_print = jiffies;
INIT_WORK(&blkif->persistent_purge_work, xen_blkbk_unmap_purged_grants);
 
-   blkif->nr_rings = 1;
-   if (xen_blkif_alloc_rings(blkif)) {
-   kmem_cache_free(xen_blkif_cachep, blkif);
-   return ERR_PTR(-ENOMEM);
-   }
-
return blkif;
 }
 
@@ -595,6 +589,12 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
goto fail;
}
 
+   /* Multi-queue: write how many queues are supported by the backend. */
+   err = xenbus_printf(XBT_NIL, dev->nodename,
+   "multi-queue-max-queues", "%u", xenblk_max_queues);
+   if (err)
+   pr_warn("Error writing multi-queue-num-queues\n");
+
/* setup back pointer */
be->blkif->be = be;
 
@@ -980,6 +980,7 @@ static int connect_ring(struct backend_info *be)
char *xspath;
size_t xspathsize;
const size_t xenstore_path_ext_size = 11; /* sufficient for 
"/queue-NNN" */
+   unsigned int requested_num_queues = 0;
 
pr_debug("%s %s\n", __func__, dev->otherend);
 
@@ -1007,6 +1008,27 @@ static int connect_ring(struct backend_info *be)
be->blkif->vbd.feature_gnt_persistent = pers_grants;
be->blkif->vbd.overflow_max_grants = 0;
 
+   /*
+* Read the number of hardware queues from frontend.
+*/
+   err = xenbus_scanf(XBT_NIL, dev->otherend, "multi-queue-num-queues",
+  "%u", &requested_num_queues);
+   if (err < 0) {
+   requested_num_queues = 1;
+   } else {
+   if (requested_num_queues > xenblk_max_queues
+   || requested_num_queues == 0) {
+   /* buggy or malicious guest */
+   xenbus_dev_fatal(dev, err,
+   "guest requested %u queues, exceeding 
the maximum of %u.",
+   requested_num_queues, 
xenblk_max_queues);
+   return -1;
+   }
+   }
+   be->blkif->nr_rings = requested_num_queues;
+   if (xen_blkif_alloc_rings(be->blkif))
+   return -ENOMEM;
+
pr_info("%s: using %d queues, protocol %d (%s) %s\n", dev->nodename,
 be->blkif->nr_rings, be->blkif->blk_protocol, protocol,
 pers_grants ? "persistent grants" : "");
-- 
1.7.10.4




[Xen-devel] [PATCH v5 06/10] xen/blkback: separate ring information out of struct xen_blkif

2015-11-13 Thread Bob Liu
Split the per-ring information out into a new structure, "xen_blkif_ring", so
that one vbd device can be associated with one or more rings/hardware queues.

Introduce 'pers_gnts_lock' to protect the pool of persistent grants, since
there may now be multiple backend threads.

This patch is a preparation for supporting multi hardware queues/rings.

Signed-off-by: Arianna Avanzini 
Signed-off-by: Bob Liu 
---
v2:
 * Have an BUG_ON on the holding of the pers_gnts_lock.
---
 drivers/block/xen-blkback/blkback.c |  235 ---
 drivers/block/xen-blkback/common.h  |   54 
 drivers/block/xen-blkback/xenbus.c  |   96 +++---
 3 files changed, 214 insertions(+), 171 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index f909994..fb5bfd4 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -173,11 +173,11 @@ static inline void shrink_free_pagepool(struct xen_blkif 
*blkif, int num)
 
 #define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
 
-static int do_block_io_op(struct xen_blkif *blkif);
-static int dispatch_rw_block_io(struct xen_blkif *blkif,
+static int do_block_io_op(struct xen_blkif_ring *ring);
+static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
struct blkif_request *req,
struct pending_req *pending_req);
-static void make_response(struct xen_blkif *blkif, u64 id,
+static void make_response(struct xen_blkif_ring *ring, u64 id,
  unsigned short op, int st);
 
 #define foreach_grant_safe(pos, n, rbtree, node) \
@@ -189,14 +189,8 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 
 
 /*
- * We don't need locking around the persistent grant helpers
- * because blkback uses a single-thread for each backed, so we
- * can be sure that this functions will never be called recursively.
- *
- * The only exception to that is put_persistent_grant, that can be called
- * from interrupt context (by xen_blkbk_unmap), so we have to use atomic
- * bit operations to modify the flags of a persistent grant and to count
- * the number of used grants.
+ * pers_gnts_lock must be used around all the persistent grant helpers
+ * because blkback may use multi-thread/queue for each backend.
  */
 static int add_persistent_gnt(struct xen_blkif *blkif,
   struct persistent_gnt *persistent_gnt)
@@ -204,6 +198,7 @@ static int add_persistent_gnt(struct xen_blkif *blkif,
struct rb_node **new = NULL, *parent = NULL;
struct persistent_gnt *this;
 
+   BUG_ON(!spin_is_locked(&blkif->pers_gnts_lock));
if (blkif->persistent_gnt_c >= xen_blkif_max_pgrants) {
if (!blkif->vbd.overflow_max_grants)
blkif->vbd.overflow_max_grants = 1;
@@ -241,6 +236,7 @@ static struct persistent_gnt *get_persistent_gnt(struct 
xen_blkif *blkif,
struct persistent_gnt *data;
struct rb_node *node = NULL;
 
+   BUG_ON(!spin_is_locked(&blkif->pers_gnts_lock));
node = blkif->persistent_gnts.rb_node;
while (node) {
data = container_of(node, struct persistent_gnt, node);
@@ -265,6 +261,7 @@ static struct persistent_gnt *get_persistent_gnt(struct 
xen_blkif *blkif,
 static void put_persistent_gnt(struct xen_blkif *blkif,
struct persistent_gnt *persistent_gnt)
 {
+   BUG_ON(!spin_is_locked(&blkif->pers_gnts_lock));
if(!test_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags))
pr_alert_ratelimited("freeing a grant already unused\n");
set_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags);
@@ -286,6 +283,7 @@ static void free_persistent_gnts(struct xen_blkif *blkif, 
struct rb_root *root,
unmap_data.unmap_ops = unmap;
unmap_data.kunmap_ops = NULL;
 
+   BUG_ON(!spin_is_locked(&blkif->pers_gnts_lock));
foreach_grant_safe(persistent_gnt, n, root, node) {
BUG_ON(persistent_gnt->handle ==
BLKBACK_INVALID_HANDLE);
@@ -322,11 +320,13 @@ void xen_blkbk_unmap_purged_grants(struct work_struct 
*work)
int segs_to_unmap = 0;
struct xen_blkif *blkif = container_of(work, typeof(*blkif), 
persistent_purge_work);
struct gntab_unmap_queue_data unmap_data;
+   unsigned long flags;
 
unmap_data.pages = pages;
unmap_data.unmap_ops = unmap;
unmap_data.kunmap_ops = NULL;
 
+   spin_lock_irqsave(&blkif->pers_gnts_lock, flags);
while(!list_empty(&blkif->persistent_purge_list)) {
persistent_gnt = list_first_entry(&blkif->persistent_purge_list,
  struct persistent_gnt,
@@ -348,6 +348,7 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *wo

Re: [Xen-devel] [PATCH v5 05/10] xen/blkfront: negotiate number of queues/rings to be used with backend

2015-11-16 Thread Bob Liu

On 11/17/2015 05:27 AM, Konrad Rzeszutek Wilk wrote:
>>  /* Common code used when first setting up, and when resuming. */
>>  static int talk_to_blkback(struct xenbus_device *dev,
>> @@ -1527,10 +1582,9 @@ static int talk_to_blkback(struct xenbus_device *dev,
>>  {
>>  const char *message = NULL;
>>  struct xenbus_transaction xbt;
>> -int err, i;
>> -unsigned int max_page_order = 0;
>> +int err;
>> +unsigned int i, max_page_order = 0;
>>  unsigned int ring_page_order = 0;
>> -struct blkfront_ring_info *rinfo;
> 
> Why? You end up doing the 'struct blkfront_ring_info' decleration
> in two of the loops below?

Oh, that's because Roger mentioned we might be tempted to declare rinfo only
inside the for loop, to limit its scope.

>>  
>>  err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
>> "max-ring-page-order", "%u", &max_page_order);
>> @@ -1542,7 +1596,8 @@ static int talk_to_blkback(struct xenbus_device *dev,
>>  }
>>  
>>  for (i = 0; i < info->nr_rings; i++) {
>> -rinfo = &info->rinfo[i];
>> +struct blkfront_ring_info *rinfo = &info->rinfo[i];
>> +
> 
> Here..
> 
>> @@ -1617,7 +1677,7 @@ again:
>>  
>>  for (i = 0; i < info->nr_rings; i++) {
>>  int j;
>> -rinfo = &info->rinfo[i];
>> +struct blkfront_ring_info *rinfo = &info->rinfo[i];
> 
> And here?
> 
> It is not a big deal but I am curious of why add this change?
> 
>> @@ -1717,7 +1789,6 @@ static int blkfront_probe(struct xenbus_device *dev,
>>  
>>  mutex_init(&info->mutex);
>>  spin_lock_init(&info->dev_lock);
>> -info->xbdev = dev;
> 
> That looks like a spurious change? Ah, I see that we do the same exact
> operation earlier in the blkfront_probe.
> 

This line was moved because:

1738         info->xbdev = dev;
1739         /* Check if backend supports multiple queues. */
1740         err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
1741                            "multi-queue-max-queues", "%u", &backend_max_queues);
1742         if (err < 0)
1743                 backend_max_queues = 1;

We need xbdev to be set in advance of the xenbus_scanf() call at line 1740.


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v5 00/10] xen-block: multi hardware-queues/rings support

2015-11-25 Thread Bob Liu

On 11/26/2015 06:12 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Nov 25, 2015 at 03:56:03PM -0500, Konrad Rzeszutek Wilk wrote:
>> On Wed, Nov 25, 2015 at 02:25:07PM -0500, Konrad Rzeszutek Wilk wrote:
   xen/blkback: separate ring information out of struct xen_blkif
   xen/blkback: pseudo support for multi hardware queues/rings
   xen/blkback: get the number of hardware queues/rings from blkfront
   xen/blkback: make pool of persistent grants and free pages per-queue
>>>
>>> OK, got to those as well. I have put them in 'devel/for-jens-4.5' and
>>> are going to test them overnight before pushing them out.
>>>
>>> I see two bugs in the code that we MUST deal with:
>>>
>>>  - print_stats () is going to show zero values.
>>>  - the sysfs code (VBD_SHOW) aren't converted over to fetch data
>>>from all the rings.
>>
>> - kthread_run can't handle the two "name, i" arguments. I see:
>>
>> root  5101 2  0 20:47 ?00:00:00 [blkback.3.xvda-]
>> root  5102 2  0 20:47 ?00:00:00 [blkback.3.xvda-]
> 
> And doing save/restore:
> 
> xl save  /tmp/A;
> xl restore /tmp/A;
> 
> ends up us loosing the proper state and not getting the ring setup back.
> I see this is backend:
> 
> [ 2719.448600] vbd vbd-22-51712: -1 guest requested 0 queues, exceeding the 
> maximum of 3.
> 
> And XenStore agrees:
> tool = ""
>  xenstored = ""
> local = ""
>  domain = ""
>   0 = ""
>domid = "0"
>name = "Domain-0"
>device-model = ""
> 0 = ""
>  state = "running"
>error = ""
> backend = ""
>  vbd = ""
>   2 = ""
>51712 = ""
> error = "-1 guest requested 0 queues, exceeding the maximum of 3."
> 
> .. which also leads to a memory leak as xen_blkbk_remove never gets
> called.

I think that was already fixed by your patch:
[PATCH RFC 2/2] xen/blkback: Free resources if connect_ring failed.

P.S. I didn't see your git tree updated with these patches.

-- 
Regards,
-Bob



Re: [Xen-devel] [PATCH v5 00/10] xen-block: multi hardware-queues/rings support

2015-11-25 Thread Bob Liu

On 11/26/2015 10:57 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 26, 2015 at 10:28:10AM +0800, Bob Liu wrote:
>>
>> On 11/26/2015 06:12 AM, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Nov 25, 2015 at 03:56:03PM -0500, Konrad Rzeszutek Wilk wrote:
>>>> On Wed, Nov 25, 2015 at 02:25:07PM -0500, Konrad Rzeszutek Wilk wrote:
>>>>>>   xen/blkback: separate ring information out of struct xen_blkif
>>>>>>   xen/blkback: pseudo support for multi hardware queues/rings
>>>>>>   xen/blkback: get the number of hardware queues/rings from blkfront
>>>>>>   xen/blkback: make pool of persistent grants and free pages per-queue
>>>>>
>>>>> OK, got to those as well. I have put them in 'devel/for-jens-4.5' and
>>>>> are going to test them overnight before pushing them out.
>>>>>
>>>>> I see two bugs in the code that we MUST deal with:
>>>>>
>>>>>  - print_stats () is going to show zero values.
>>>>>  - the sysfs code (VBD_SHOW) aren't converted over to fetch data
>>>>>from all the rings.
>>>>
>>>> - kthread_run can't handle the two "name, i" arguments. I see:
>>>>
>>>> root  5101 2  0 20:47 ?00:00:00 [blkback.3.xvda-]
>>>> root  5102 2  0 20:47 ?00:00:00 [blkback.3.xvda-]
>>>
>>> And doing save/restore:
>>>
>>> xl save  /tmp/A;
>>> xl restore /tmp/A;
>>>
>>> ends up us loosing the proper state and not getting the ring setup back.
>>> I see this is backend:
>>>
>>> [ 2719.448600] vbd vbd-22-51712: -1 guest requested 0 queues, exceeding the 
>>> maximum of 3.
>>>
>>> And XenStore agrees:
>>> tool = ""
>>>  xenstored = ""
>>> local = ""
>>>  domain = ""
>>>   0 = ""
>>>domid = "0"
>>>name = "Domain-0"
>>>device-model = ""
>>>     0 = ""
>>>  state = "running"
>>>error = ""
>>> backend = ""
>>>  vbd = ""
>>>   2 = ""
>>>51712 = ""
>>> error = "-1 guest requested 0 queues, exceeding the maximum of 3."
>>>
>>> .. which also leads to a memory leak as xen_blkbk_remove never gets
>>> called.
>>
>> I think which was already fix by your patch:
>> [PATCH RFC 2/2] xen/blkback: Free resources if connect_ring failed.
> 
> Nope. I get that with or without the patch.
> 

The attached patch should fix this issue.

-- 
Regards,
-Bob
>From f297a05fc27fb0bc9a3ed15407f8cc6ffd5e2a00 Mon Sep 17 00:00:00 2001
From: Bob Liu 
Date: Wed, 25 Nov 2015 14:56:32 -0500
Subject: [PATCH 1/2] xen:blkfront: fix compile error
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix this build error:
drivers/block/xen-blkfront.c: In function ‘blkif_free’:
drivers/block/xen-blkfront.c:1234:6: error: ‘struct blkfront_info’ has no
member named ‘ring’ info->ring = NULL;

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 625604d..ef5ce43 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1231,7 +1231,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 		blkif_free_ring(&info->rinfo[i]);

 	kfree(info->rinfo);
-	info->ring = NULL;
+	info->rinfo = NULL;
 	info->nr_rings = 0;
 }

--
1.8.3.1

>From aab0bb1690213e665966ea22b021e0eeaacfc717 Mon Sep 17 00:00:00 2001
From: Bob Liu 
Date: Wed, 25 Nov 2015 17:52:55 -0500
Subject: [PATCH 2/2] xen/blkfront: realloc ring info in blkif_resume

We need to reallocate the ring info in the resume path, because info->rinfo was
freed in blkif_free() and the 'multi-queue-max-queues' value reported by the
backend may have changed.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ef5ce43..9634a65 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1926,12 +1926,38 @@ static int blkif_recover(struct blkfront_info *info)
 static int blkfront_resume(struct xenbus_device *dev)
 {
 	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
-	int err;
+	int err = 0;
+	unsigned int max_queues = 0, r_index;
 
 	dev_

Re: [Xen-devel] [PATCH 3/3] xen/block: add multi-page ring support

2015-06-09 Thread Bob Liu

On 06/03/2015 01:40 PM, Bob Liu wrote:
> Extend xen/block to support multi-page ring, so that more requests can be
> issued by using more than one pages as the request ring between blkfront
> and backend.
> As a result, the performance can get improved significantly.
> 
> We got some impressive improvements on our highend iscsi storage cluster
> backend. If using 64 pages as the ring, the IOPS increased about 15 times
> for the throughput testing and above doubled for the latency testing.
> 
> The reason was the limit on outstanding requests is 32 if use only one-page
> ring, but in our case the iscsi lun was spread across about 100 physical
> drives, 32 was really not enough to keep them busy.
> 
> Changes in v2:
>  - Rebased to 4.0-rc6.
>  - Document on how multi-page ring feature working to linux io/blkif.h.
> 
> Changes in v3:
>  - Remove changes to linux io/blkif.h and follow the protocol defined
>in io/blkif.h of XEN tree.
>  - Rebased to 4.1-rc3
> 
> Changes in v4:
>  - Turn to use 'ring-page-order' and 'max-ring-page-order'.
>  - A few comments from Roger.
> 
> Changes in v5:
>  - Clarify with 4k granularity to comment
>  - Address more comments from Roger
> 
> Signed-off-by: Bob Liu 

Also tested the Windows PV driver, which works fine when the multi-page ring
feature is enabled in the Linux backend.
http://www.xenproject.org/downloads/windows-pv-drivers.html

Regards,
-Bob

> ---
>  drivers/block/xen-blkback/blkback.c |   13 
>  drivers/block/xen-blkback/common.h  |2 +
>  drivers/block/xen-blkback/xenbus.c  |   89 +--
>  drivers/block/xen-blkfront.c|  135 
> +--
>  4 files changed, 180 insertions(+), 59 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/blkback.c 
> b/drivers/block/xen-blkback/blkback.c
> index 713fc9f..2126842 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -84,6 +84,13 @@ MODULE_PARM_DESC(max_persistent_grants,
>   "Maximum number of grants to map persistently");
>  
>  /*
> + * Maximum order of pages to be used for the shared ring between front and
> + * backend, 4KB page granularity is used.
> + */
> +unsigned int xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
> +module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, 
> S_IRUGO);
> +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used for 
> the shared ring");
> +/*
>   * The LRU mechanism to clean the lists of persistent grants needs to
>   * be executed periodically. The time interval between consecutive executions
>   * of the purge mechanism is set in ms.
> @@ -1438,6 +1445,12 @@ static int __init xen_blkif_init(void)
>   if (!xen_domain())
>   return -ENODEV;
>  
> + if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER) {
> + pr_info("Invalid max_ring_order (%d), will use default max: 
> %d.\n",
> + xen_blkif_max_ring_order, XENBUS_MAX_RING_PAGE_ORDER);
> + xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
> + }
> +
>   rc = xen_blkif_interface_init();
>   if (rc)
>   goto failed_init;
> diff --git a/drivers/block/xen-blkback/common.h 
> b/drivers/block/xen-blkback/common.h
> index 043f13b..8ccc49d 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  
> +extern unsigned int xen_blkif_max_ring_order;
>  /*
>   * This is the maximum number of segments that would be allowed in indirect
>   * requests. This value will also be passed to the frontend.
> @@ -320,6 +321,7 @@ struct xen_blkif {
>   struct work_struct  free_work;
>   /* Thread shutdown wait queue. */
>   wait_queue_head_t   shutdown_wq;
> + unsigned int nr_ring_pages;
>  };
>  
>  struct seg_buf {
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index c212d41..deb3f00 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -25,6 +25,7 @@
>  
>  /* Enlarge the array size in order to fully show blkback name. */
>  #define BLKBACK_NAME_LEN (20)
> +#define RINGREF_NAME_LEN (20)
>  
>  struct backend_info {
>   struct xenbus_device*dev;
> @@ -156,8 +157,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>   return blkif;
>  }
>  
> -static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
> -  unsigned int evtchn)
> +static int xen_blkif_map(struct x

Re: [Xen-devel] [PATCH 3/3] xen/block: add multi-page ring support

2015-06-09 Thread Bob Liu

On 06/09/2015 09:39 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jun 09, 2015 at 08:52:53AM +, Paul Durrant wrote:
>>> -Original Message-
>>> From: Bob Liu [mailto:bob@oracle.com]
>>> Sent: 09 June 2015 09:50
>>> To: Bob Liu
>>> Cc: xen-devel@lists.xen.org; David Vrabel; just...@spectralogic.com;
>>> konrad.w...@oracle.com; Roger Pau Monne; Paul Durrant; Julien Grall; linux-
>>> ker...@vger.kernel.org
>>> Subject: Re: [PATCH 3/3] xen/block: add multi-page ring support
>>>
>>>
>>> On 06/03/2015 01:40 PM, Bob Liu wrote:
>>>> Extend xen/block to support multi-page ring, so that more requests can be
>>>> issued by using more than one pages as the request ring between blkfront
>>>> and backend.
>>>> As a result, the performance can get improved significantly.
>>>>
>>>> We got some impressive improvements on our highend iscsi storage cluster
>>>> backend. If using 64 pages as the ring, the IOPS increased about 15 times
>>>> for the throughput testing and above doubled for the latency testing.
>>>>
>>>> The reason was the limit on outstanding requests is 32 if use only one-page
>>>> ring, but in our case the iscsi lun was spread across about 100 physical
>>>> drives, 32 was really not enough to keep them busy.
>>>>
>>>> Changes in v2:
>>>>  - Rebased to 4.0-rc6.
>>>>  - Document on how multi-page ring feature working to linux io/blkif.h.
>>>>
>>>> Changes in v3:
>>>>  - Remove changes to linux io/blkif.h and follow the protocol defined
>>>>in io/blkif.h of XEN tree.
>>>>  - Rebased to 4.1-rc3
>>>>
>>>> Changes in v4:
>>>>  - Turn to use 'ring-page-order' and 'max-ring-page-order'.
>>>>  - A few comments from Roger.
>>>>
>>>> Changes in v5:
>>>>  - Clarify with 4k granularity to comment
>>>>  - Address more comments from Roger
>>>>
>>>> Signed-off-by: Bob Liu 
>>>
>>> Also tested the windows PV driver which also works fine when multi-page
>>> ring feature
>>> was enabled in Linux backend.
>>> http://www.xenproject.org/downloads/windows-pv-drivers.html
>>>
>>
>> Great! Thanks for verifying that :-)
> 
> Woot! Bob, could you repost the blkif.h patch for the Xen tree
> pleas e and also mention the testing part in it please? I think this
> was the only big 'what if?!' question holding this up.
> 

There are no further changes to blkif.h in the Xen tree; I followed exactly
the protocol already defined there, which is why the Windows PV driver also
works well.

> 
> Roger, I put them (patches) on devel/for-jens-4.2 on
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> 
> I think these two patches:
> drivers: xen-blkback: delay pending_req allocation to connect_ring
> xen/block: add multi-page ring support
> 
> are the only ones that haven't been Acked by you (or maybe they
> have and I missed the Ack?)
> 

Thank you!
-Bob

> 
>>
>>   Paul
>>
>>> Regards,
>>> -Bob
>>>
>>>> ---
>>>>  drivers/block/xen-blkback/blkback.c |   13 
>>>>  drivers/block/xen-blkback/common.h  |2 +
>>>>  drivers/block/xen-blkback/xenbus.c  |   89 +--
>>>>  drivers/block/xen-blkfront.c|  135 +
>>> --
>>>>  4 files changed, 180 insertions(+), 59 deletions(-)
>>>>
>>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-
>>> blkback/blkback.c
>>>> index 713fc9f..2126842 100644
>>>> --- a/drivers/block/xen-blkback/blkback.c
>>>> +++ b/drivers/block/xen-blkback/blkback.c
>>>> @@ -84,6 +84,13 @@ MODULE_PARM_DESC(max_persistent_grants,
>>>>   "Maximum number of grants to map persistently");
>>>>
>>>>  /*
>>>> + * Maximum order of pages to be used for the shared ring between front
>>> and
>>>> + * backend, 4KB page granularity is used.
>>>> + */
>>>> +unsigned int xen_blkif_max_ring_order =
>>> XENBUS_MAX_RING_PAGE_ORDER;
>>>> +module_param_named(max_ring_page_order,
>>> xen_blkif_max_ring_order, int, S_IRUGO);
>>>> +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages
>>> to be used for the shared ring");
>>>> +

Re: [Xen-devel] [PATCH 3/3] xen/block: add multi-page ring support

2015-06-21 Thread Bob Liu

On 06/09/2015 10:07 PM, Roger Pau Monné wrote:
> El 09/06/15 a les 15.39, Konrad Rzeszutek Wilk ha escrit:
...
>> Roger, I put them (patches) on devel/for-jens-4.2 on
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
>>
>> I think these two patches:
>> drivers: xen-blkback: delay pending_req allocation to connect_ring
>> xen/block: add multi-page ring support
>>
>> are the only ones that haven't been Acked by you (or maybe they
>> have and I missed the Ack?)
> 
> Hello,
> 
> I was waiting to Ack those because the XenServer storage performance
> folks found out that these patches cause a performance regression on
> some of their tests. I'm adding them to the conversation so they can
> provide more details about the issues they found, and whether we should
> hold pushing this patches or not.
> 

Hey,

Are there any updates? What's the performance regression problem?

Thanks,
-Bob



[Xen-devel] [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands

2016-02-28 Thread Bob Liu
1) What is this patch about?
This patch introduces a new block operation (BLKIF_OP_EXTRA_FLAG).
A request with BLKIF_OP_EXTRA_FLAG set means the following request is an
extra request which is used to pass through SCSI commands.
This is like a simplified version of XEN_NETIF_EXTRA_* in netif.h.
It can be extended easily to transmit other per-request/bio data from frontend
to backend e.g Data Integrity Field per bio.

2) Why we need this?
Currently only raw data segments are transmitted from blkfront to blkback, which
means some advanced features are lost.
 * The guest knows nothing about the features of the real backend storage.
For example, in a bare-metal environment the INQUIRY SCSI command can be
used to query storage device information. If it's an SSD or flash device
we have the option to use the device as a fast cache.
But this can't happen in current domU guests, because blkfront only
knows it's a normal virtual disk.

 * Failover Clusters in Windows
Failover clusters require SCSI-3 persistent reservation target disks,
but now this can't work in domU.

3) Known issues:
 * Security issues: how to 'validate' this extra request payload.
   E.g. SCSI operates on a LUN basis (the whole disk) while we really just
   want to operate on partitions.

 * Can't pass SCSI commands through if the backend storage driver is bio-based
   instead of request-based.

4) Alternative approach: Using PVSCSI instead:
 * It's doubtful that PVSCSI can support as many types of backend storage
devices as xen-block.

 * Much longer path:
   ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> 
PVSCSI-backend -> Target framework(LIO?) ->

   With xen-block we only need:
   ioctl() -> blkfront -> blkback ->

 * xen-block has existed for many years, is widely used, and is more stable.

Welcome any input, thank you!

Signed-off-by: Bob Liu 
---
 xen/include/public/io/blkif.h |   73 +
 1 file changed, 73 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..7c10bce 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -635,6 +635,28 @@
 #define BLKIF_OP_INDIRECT  6
 
 /*
+ * Recognised only if "feature-extra-request" is present in the backend
+ * xenbus info.  A request with BLKIF_OP_EXTRA_FLAG indicates that an
+ * extra request follows in the shared ring buffer.
+ *
+ * In this way, extra data like a SCSI command, DIF/DIX and other
+ * per-request/bio data can be transmitted from the frontend to the backend.
+ *
+ * The 'wire' format is like:
+ *  Request 1: xen_blkif_request
+ * [Request 2: xen_blkif_extra_request] (only if request 1 has BLKIF_OP_EXTRA_FLAG)
+ *  Request 3: xen_blkif_request
+ *  Request 4: xen_blkif_request
+ * [Request 5: xen_blkif_extra_request] (only if request 4 has BLKIF_OP_EXTRA_FLAG)
+ *  ...
+ *  Request N: xen_blkif_request
+ *
+ * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not*
+ * create the "feature-extra-request" node!
+ */
+#define BLKIF_OP_EXTRA_FLAG (0x80)
+
+/*
  * Maximum scatter/gather segments per request.
  * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
  * NB. This could be 12 if the ring indexes weren't stored in the same page.
@@ -703,10 +725,61 @@ struct blkif_request_indirect {
 };
 typedef struct blkif_request_indirect blkif_request_indirect_t;
 
+enum blkif_extra_request_type {
+   BLKIF_EXTRA_TYPE_SCSI_CMD = 1,  /* Transmit SCSI command.  */
+};
+
+struct scsi_cmd_req {
+   /*
+* Grant mapping for transmitting the SCSI command to the backend, and
+* also for receiving sense data from the backend.
+* One 4KB page is enough.
+*/
+   grant_ref_t cmd_gref;
+   /* Length of SCSI command in the grant mapped page. */
+   unsigned int cmd_len;
+
+   /*
+* A SCSI command may require transmitting a data segment whose length is
+* less than a sector (512 bytes).
+* Record num_sg and the last segment length in the extra request so that
+* the backend knows about them.
+*/
+   unsigned int num_sg;
+   unsigned int last_sg_len;
+};
+
+/*
+ * An extra request must follow a normal request, and a normal request can
+ * only be followed by at most one extra request.
+ */
+struct blkif_request_extra {
+   uint8_t type;   /* BLKIF_EXTRA_TYPE_* */
+   uint16_t _pad1;
+#ifndef CONFIG_X86_32
+   uint32_t _pad2; /* offsetof(blkif_...,u.extra.id) == 8 */
+#endif
+   uint64_t id;
+   struct scsi_cmd_req scsi_cmd;
+} __attribute__((__packed__));
+typedef struct blkif_request_extra blkif_request_extra_t;
+
+struct scsi_cmd_res {
+   unsigned int resid_len;
+   /* Length of sense data returned in grant mapped page. */
+   unsigned int sense_len;
+};
+
+struct blkif_response_extra {
+   uint8_

Re: [Xen-devel] [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands

2016-02-29 Thread Bob Liu

On 03/01/2016 12:29 AM, Ian Jackson wrote:
> Ian Jackson writes ("Re: [RFC PATCH] xen-block: introduces extra request to 
> pass-through SCSI commands"):
>> [stuff suggesting use of PVSCSI instead]
> 
> For the avoidance of doubt:
> 
> 1. Thanks very much for bringing this proposal to us at the concept
> stage.  It is much easier to discuss these matters in a constructive
> way before a lot of effort has been put into an implementation.
> 
> 2. I should explain the downsides which I see in your proposal:
> 
> - Your suggestion has bad security properties: previously, the PV
>   block protocol would present only a very simple and narrow
>   interface.  Your SCSI CDB passthrough proposal means that guests
>   would be able to activate features in SCSI targets which would be
>   unexpected and unintended by the host administrator.  Such features
>   would perhaps even be unknown to the host administrator.
> 
>   This could be mitigated by making this feature configurable, of
>   course, defaulting to off, along with clear documentation.  But it's
>   not a desirable property.
> 
> - For similar reasons it will often be difficult to use such a feature
>   safely.  Guest software in particular might expect that it can
>   safely use whatever features it can see, and do all sorts of
>   exciting things.
> 
> - It involves duplicating multiplexing logic which already exists in
>   PVSCSI.
> 

One thing I'm still not sure about with PVSCSI is whether we have the same
security issue, since LIO can interface to any block device.
E.g. when using a partition /dev/sda1 as the PVSCSI backend, the
PVSCSI frontend may still send SCSI operations on a LUN basis (the whole disk).

P.S. Thanks to all of you, it helps a lot!

-- 
Regards,
-Bob





Re: [Xen-devel] [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands

2016-03-01 Thread Bob Liu
Hi Juergen,

On 03/02/2016 03:39 PM, Juergen Gross wrote:
> On 01/03/16 19:08, Ian Jackson wrote:
>> Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to 
>> pass-through SCSI commands"):
>>> One thing I'm still not sure about PVSCSI is do we have the same security 
>>> issue since LIO can interface to any block device.
>>> E.g when using a partition /dev/sda1 as the PVSCSI-backend, but the 
>>> PVSCSI-frontend may still send SCSI operates on LUN bases (the whole disk).
>>
>> I don't think you can use pvscsi to passthrough a partition such as
>> /dev/sda1.  Such a thing is not a SCSI command target.
> 
> It might be possible via the fileio target backend. In this case LUN
> based SCSI operations are ignored/refused/emulated by LIO.
> 

Do you know whether PVSCSI can work on top of multipath (the device-mapper
framework) or LVM?
Thank you!

Bob



Re: [Xen-devel] [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands

2016-03-02 Thread Bob Liu

On 03/02/2016 07:40 PM, Ian Jackson wrote:
> Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to 
> pass-through SCSI commands"):
>> Do you know whether pvscsi can work on top of multipath(the device-mapper 
>> framework) or LVMs?
> 
> No, it can't.  devmapper and LVM work with the block device
> abstraction.
> 
> Implicitly you seem to be suggesting that you want to use dm-multipath
> and LVM, but also send other SCSI CDBs from the upper layers through
> to the underlying SCSI storage target.
> 

Exactly!

> I can't see how that could cause anything but pain.  In many cases
> "the underlying SCSI storage target" wouldn't be well defined.  Even
> if it was, these side channel SCSI commands are likely to Go Wrong in
> exciting ways.
> 
> What SCSI commands do you want to send ?
> 

* INQUIRY

* PERSISTENT RESERVE IN
* PERSISTENT RESERVE OUT
These are for Failover Clusters in Windows; I'm not sure whether more commands
are required. I didn't find a list of required SCSI commands in the failover
documentation.

-- 
Regards,
-Bob



[Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-15 Thread Bob Liu
Sometimes we need to query VPD page 0x83 data from the underlying storage so
that vendor-supplied software can run inside the VM and believe it's talking to
the vendor's own storage.
But different vendors may have different special features, so it's not suitable
to export through "feature-".

One solution is to query the whole VPD page through a Xenstore node, a
mechanism already used by the Windows PV driver.
http://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenvbd.git;a=blob;f=src/xenvbd/pdoinquiry.c

This patch documents the Xenstore node in blkif.h, so that blkfront in Linux and
other frontends can use the same mechanism.

Signed-off-by: Bob Liu 
---
 xen/include/public/io/blkif.h |8 
 1 file changed, 8 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..30a6e46 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -182,6 +182,14 @@
  *  backend driver paired with a LIFO queue in the frontend will
  *  allow us to have better performance in this scenario.
  *
+ * scsi/0x12/0x83
+ * Values: string
+ *
+ * A base64-formatted string providing the VPD page read out from the
+ * backend device.
+ * The backend driver or the toolstack should write this node with the
+ * VPD information when attaching the device.
+ *
  *--- Request Transport Parameters 
  *
  * max-ring-page-order
-- 
1.7.10.4




Re: [Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-19 Thread Bob Liu

On 03/16/2016 08:36 PM, Ian Jackson wrote:
> Bob Liu writes ("[RFC PATCH] blkif.h: document scsi/0x12/0x83 node"):
>> Sometimes, we need to query VPD page=0x83 data from underlying
>> storage so that vendor supplied software can run inside the VM and
>> believe it's talking to the vendor's own storage.  But different
>> vendors may have different special features, so it's not suitable to
>> export through "feature-".
>>
>> One solution is to query the whole VPD page through a xenstore node, which has
>> already been used by the Windows PV driver.
>> http://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenvbd.git;a=blob;f=src/xenvbd/pdoinquiry.c
> 
> Thanks for your contribution.
> 
> Thanks also to Konrad for decoding the numbers, which really helps me
> understand what is going on here and helped me find the relevant
> references.
> 
> (For background: I have just double-checked the SCSI spec and: INQUIRY
> lets you query either the standard page, or one of a number of `vital
> product data' pages, each identified by an 8-bit page number.  The VPD
> pages are mostly full of vendor-specific data in vendor-specific
> format.)
> 
> I have some qualms about the approach you have adopted.  It is
> difficult to see how this feature could be used safely without
> knowledge specific to the storage vendor.
> 
> But I think it is probably OK to define a specification along these
> lines provided that it is very clear that if you aren't the storage
> vendor and you use this and something breaks, you get to keep all the
> pieces.
> 
>> + * scsi/0x12/0x83
>> + *  Values: string
>> + *  A base64 formatted string providing VPD pages read out from backend
>> + *  device.
> 
> I think this probably isn't the prettiest name for this node or
> necessarily the best format but given that this protocol is already
> deployed, and this syntax will do, I don't want to quibble.
> 
> I would like the base64 encoding to specified much more explicitly.
> Just `base64 formatted' is too vague.
> 
> 
>> + *  The backend driver or the toolstack should write this node with VPD
>> + *  information when attaching devices.
> 
> I think this is the wrong semantics.  I certainly don't want to
> encourage backends to use this feature.
> 
> Rather, I would prefer something like this:
> 
>  * scsi/0x12/0x
> 
>This optional node contains SCSI INQUIRY VPD information.
> is the hexadecimal representation of the VPD page code.
> 
>A frontend which represents a Xen VBD to its containing operating
>system as a (virtual) SCSI target may return the specified data in
>response to INQUIRY commands from its containing OS.
> 
>A frontend which supports this feature must return the backend-
>specified data for every INQUIRY command with the EVPD bit set.
>For EVPD=1 INQUIRY commands where the corresponding xenstore node
>does not exist, the frontend must report (to its containing OS) an
>appropriate failure condition.
> 
>A frontend which does not support this feature (ie, which does not
>use these xenstore nodes), and which presents as a SCSI target to
>its containing OS, should support and provide whatever VPD
>information it considers appropriate, and should disregard these
>xenstore nodes.
> 
>A frontend need not - and often will not - present to its
>containing OS as a device addressable with SCSI CDBs.  Such a
>frontend has no use for SCSI INQUIRY VPD information.
> 
>A backend should set this information with caution.  Pages
>containing device-vendor-specific information should not be
>specified without the appropriate device-vendor-specific knowledge.
> 

That's much clearer, thank you very much!

> 
> Also I have two other observations:
> 
> Firstly, AFAICT you have not provided any way to set the standard
> INQUIRY response.  Is it not necessary in your application to provide

If backends are not encouraged to use this node, then we must have the 
toolstack write it with the right VPD information.
Paul mentioned there should be corresponding code in the xapi project, but I 
haven't found it yet.


> synthetic vendorid and productid, at the very least ?
> 
> Secondly, I think your hope that
> 
>> blkfront in Linux ... can use the same mechanism.
> 
> is I think misguided.  blkfront does not present the disk (to the rest
> of the Linux storage system) as a SCSI device.  Rather, Linux allows
> blkfront to present as a block device, directly, and this is what
> blkfront does.
> 

But we'd like to get the VPD information (of the underlying storage device) also
in Linux blkfront, even though blkfront is not a SCSI device.

That's because our underlying storage device has some vendor-specific features 
which can be recognized through information in the VPD pages,
and our applications in the guest want to be aware of these vendor-specific features.

Regards,
Bob





Re: [Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-19 Thread Bob Liu

On 03/16/2016 10:07 PM, Paul Durrant wrote:
>> -Original Message-
>> From: Bob Liu [mailto:bob@oracle.com]
..snip..
>>>
>>
>> But we'd like to get the VPD information (of the underlying storage device)
>> also in
>> Linux blkfront, even though blkfront is not a SCSI device.
>>
>> That's because our underlying storage device has some vendor-specific
>> features which can be recognized through information in the VPD pages,
>> and our applications in the guest want to be aware of these vendor-specific
>> features.
> 
> I think the missing piece of the puzzle is how the applications get this 
> information. 
> In Windows, since everything is a SCSI LUN (or has to emulate one) 
> applications just send down 'scsi pass-through' IOCTLs and get the raw 
> INQUIRY data back. 
> In Linux there would need to be some alternative scheme that presumably 
> blkfront would have to support.
> 

They plan to send a REQ_TYPE_BLOCK_PC request down to blkfront, hoping 
blkfront can handle this request and return the VPD information.
I'll confirm whether they can read the xenstore node directly.

-- 
Regards,
-Bob



[Xen-devel] [PATCH 2/2] xen-blkfront: fix resume issues

2016-05-31 Thread Bob Liu
After migrating to another host, the number of rings (block hardware queues)
may change and the ring info structure will also be reallocated.

This patch fixes two related places:
 * Call blk_mq_update_nr_hw_queues() so that blk-core knows the number
   of hardware queues has changed.
 * Don't store the rinfo pointer in hctx->driver_data; because rinfo may be
   reallocated, use hctx->queue_num to look up the rinfo structure instead.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 01aa460..83e36c5 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -874,8 +874,12 @@ static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
  const struct blk_mq_queue_data *qd)
 {
unsigned long flags;
-   struct blkfront_ring_info *rinfo = (struct blkfront_ring_info 
*)hctx->driver_data;
+   int qid = hctx->queue_num;
+   struct blkfront_info *info = hctx->queue->queuedata;
+   struct blkfront_ring_info *rinfo = NULL;
 
+   BUG_ON(info->nr_rings <= qid);
+   rinfo = &info->rinfo[qid];
blk_mq_start_request(qd->rq);
spin_lock_irqsave(&rinfo->ring_lock, flags);
if (RING_FULL(&rinfo->ring))
@@ -901,20 +905,9 @@ out_busy:
return BLK_MQ_RQ_QUEUE_BUSY;
 }
 
-static int blk_mq_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
-   unsigned int index)
-{
-   struct blkfront_info *info = (struct blkfront_info *)data;
-
-   BUG_ON(info->nr_rings <= index);
-   hctx->driver_data = &info->rinfo[index];
-   return 0;
-}
-
 static struct blk_mq_ops blkfront_mq_ops = {
.queue_rq = blkif_queue_rq,
.map_queue = blk_mq_map_queue,
-   .init_hctx = blk_mq_init_hctx,
 };
 
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
@@ -950,6 +943,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
return PTR_ERR(rq);
}
 
+   rq->queuedata = info;
queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
 
if (info->feature_discard) {
@@ -2149,6 +2143,8 @@ static int blkfront_resume(struct xenbus_device *dev)
return err;
 
err = talk_to_blkback(dev, info);
+   if (!err)
+   blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
 
/*
 * We have to wait for the backend to switch to
-- 
2.7.4




[Xen-devel] [PATCH 1/2] xen-blkfront: don't call talk_to_blkback when already connected to blkback

2016-05-31 Thread Bob Liu
Sometimes blkfront may receive the blkback_changed() notification twice after
migration; talk_to_blkback() will then be called twice too, confusing
xen-blkback.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ca13df8..01aa460 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2485,7 +2485,8 @@ static void blkback_changed(struct xenbus_device *dev,
break;
 
case XenbusStateConnected:
-   if (dev->state != XenbusStateInitialised) {
+   if ((dev->state != XenbusStateInitialised) &&
+   (dev->state != XenbusStateConnected)) {
if (talk_to_blkback(dev, info))
break;
}
-- 
2.7.4




Re: [Xen-devel] [PATCH 1/2] xen-blkfront: don't call talk_to_blkback when already connected to blkback

2016-05-31 Thread Bob Liu

On 06/01/2016 04:33 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, May 31, 2016 at 04:59:16PM +0800, Bob Liu wrote:
>> Sometimes blkfront may receive the blkback_changed() notification twice after
>> migration; talk_to_blkback() will then be called twice too, confusing
>> xen-blkback.
> 
> Could you enlighten the patch description by having some form of
> state transition here? I am curious how you got the frontend
> to get in XenbusStateConnected (via blkif_recover right) and then
> the backend triggering the update once more?
> 
> Or is just a simple race - the backend moves from XenbusStateConnected->
> XenbusStateConnected - which retriggers the frontend to hit in
> blkback_changed the XenbusStateConnected state and go in there?
> (That would be in connect_ring changing the state). But I don't
> see how the frontend_changed code get there as we have:
> 
>  770 /*
>  771  * Ensure we connect even when two watches fire in
>  772  * close succession and we miss the intermediate value
>  773  * of frontend_state.
>  774  */
>  775 if (dev->state == XenbusStateConnected)
>  776 break;
>  777 
> 
> ?
> 
> Now what about 'blkfront_connect' being called on the second time?
> 
> Ah, info->connected is probably by then in BLKIF_STATE_CONNECTED
> (as blkif_recover changed) and we just reread the size of the disk.
> 
> Is that how about the flow goes?

blkfront                                 blkback
blkfront_resume()
 > talk_to_blkback()
  > Set blkfront to XenbusStateInitialised
                                         front_changed()
                                          > Connect()
                                           > Set blkback to XenbusStateConnected

blkback_changed()
 > Skip talk_to_blkback()
   because frontstate == XenbusStateInitialised
 > blkfront_connect()
  > Set blkfront to XenbusStateConnected


--
But sometimes blkfront receives the
blkback_changed() event more than once!
Not sure why.

blkback_changed()
 > Because now frontstate != XenbusStateInitialised,
   talk_to_blkback() is called again
  > blkfront state changed from
    XenbusStateConnected to XenbusStateInitialised
    (which is not correct!)

                                         front_changed():
                                          > Do nothing because blkback is
                                            already in XenbusStateConnected


Now blkback is XenbusStateConnected but blkfront is still XenbusStateInitialised.

>>
>> Signed-off-by: Bob Liu 
>> ---
>>  drivers/block/xen-blkfront.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index ca13df8..01aa460 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -2485,7 +2485,8 @@ static void blkback_changed(struct xenbus_device *dev,
>>  break;
>>  
>>  case XenbusStateConnected:
>> -if (dev->state != XenbusStateInitialised) {
>> +if ((dev->state != XenbusStateInitialised) &&
>> +(dev->state != XenbusStateConnected)) {
>>  if (talk_to_blkback(dev, info))
>>  break;
>>  }
>> -- 
>> 2.7.4
>>



Re: [Xen-devel] [PATCH 1/2] xen-blkfront: don't call talk_to_blkback when already connected to blkback

2016-06-07 Thread Bob Liu

On 06/07/2016 11:25 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 01, 2016 at 01:49:23PM +0800, Bob Liu wrote:
>>
>> On 06/01/2016 04:33 AM, Konrad Rzeszutek Wilk wrote:
>>> On Tue, May 31, 2016 at 04:59:16PM +0800, Bob Liu wrote:
>>>> Sometimes blkfront may receive the blkback_changed() notification twice after
>>>> migration; talk_to_blkback() will then be called twice too, confusing
>>>> xen-blkback.
>>>
>>> Could you enlighten the patch description by having some form of
>>> state transition here? I am curious how you got the frontend
>>> to get in XenbusStateConnected (via blkif_recover right) and then
>>> the backend triggering the update once more?
>>>
>>> Or is just a simple race - the backend moves from XenbusStateConnected->
>>> XenbusStateConnected - which retriggers the frontend to hit in
>>> blkback_changed the XenbusStateConnected state and go in there?
>>> (That would be in connect_ring changing the state). But I don't
>>> see how the frontend_changed code get there as we have:
>>>
>>>  770 /*
>>>  771  * Ensure we connect even when two watches fire in
>>>  772  * close succession and we miss the intermediate value
>>>  773  * of frontend_state.
>>>  774  */
>>>  775 if (dev->state == XenbusStateConnected)
>>>  776 break;
>>>  777 
>>>
>>> ?
>>>
>>> Now what about 'blkfront_connect' being called on the second time?
>>>
>>> Ah, info->connected is probably by then in BLKIF_STATE_CONNECTED
>>> (as blkif_recover changed) and we just reread the size of the disk.
>>>
>>> Is that how about the flow goes?
>>
>>  blkfront                              blkback
>> blkfront_resume()
>>  > talk_to_blkback()
>>   > Set blkfront to XenbusStateInitialised
>>                                        front_changed()
>>                                         > Connect()
>>                                          > Set blkback to XenbusStateConnected
>>
>> blkback_changed()
>>  > Skip talk_to_blkback()
>>    because frontstate == XenbusStateInitialised
>>  > blkfront_connect()
>>   > Set blkfront to XenbusStateConnected
>>
>>
>> --
>> But sometimes blkfront receives the
>> blkback_changed() event more than once!
> 
> I think I know why. The udev scripts that get invoked when
> we attach a disk are a bit custom. As such I think they just
> revalidate the size, leading to this.
> 
> And this 'poke-at-XenbusStateConnected' state multiple times
> is allowed. It is used to signal disk changes (or just to revalidate).
> Hence it does not matter why really - we need to deal with this.
> 
> I modified your patch a bit and are testing it:
> 

Looks much better, thank you very much!

Bob

> From e49dc9fc65eda4923b41d903ac51a7ddee182bcd Mon Sep 17 00:00:00 2001
> From: Bob Liu 
> Date: Tue, 7 Jun 2016 10:43:15 -0400
> Subject: [PATCH] xen-blkfront: don't call talk_to_blkback when already
>  connected to blkback
> 
> Sometimes blkfront may receive the blkback_changed() notification
> (XenbusStateConnected) twice after migration, which will cause
> talk_to_blkback() to be called twice too and confuse xen-blkback.
> 
> The flow is as follow:
>blkfrontblkback
> blkfront_resume()
>  > talk_to_blkback()
>   > Set blkfront to XenbusStateInitialised
> front changed()
>  > Connect()
>   > Set blkback to 
> XenbusStateConnected
> 
> blkback_changed()
>  > Skip talk_to_blkback()
>because frontstate == XenbusStateInitialised
>  > blkfront_connect()
>   > Set blkfront to XenbusStateConnected
> 
> -
> And here we get another XenbusStateConnected notification leading
> to:
> -
> blkback_changed()
>  > because now frontstate != XenbusStateInitialised
>talk_to_blkback() is also called again
>   > blkfront state changed from
>   XenbusStateConnected to XenbusStateInitialised
> (Which is not correct!)
> 
>   front_changed():
>  &

Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-06-30 Thread Bob Liu

On 06/30/2015 10:21 PM, Marcus Granado wrote:
> On 13/05/15 11:29, Bob Liu wrote:
>>
>> On 04/28/2015 03:46 PM, Arianna Avanzini wrote:
>>> Hello Christoph,
>>>
>>> Il 28/04/2015 09:36, Christoph Hellwig ha scritto:
>>>> What happened to this patchset?
>>>>
>>>
>>> It was passed on to Bob Liu, who published a follow-up patchset here: 
>>> https://lkml.org/lkml/2015/2/15/46
>>>
>>
>> Right, and then I was interrupted by another xen-block feature: the
>> 'multi-page' ring.
>> I'll be back on this patchset soon. Thank you!
>>
>> -Bob
>>
> 
> Hi,
> 
> Our measurements for the multiqueue patch indicate a clear improvement in 
> iops when more queues are used.
> 
> The measurements were obtained under the following conditions:
> 
> - using blkback as the dom0 backend with the multiqueue patch applied to a 
> dom0 kernel 4.0 on 8 vcpus.
> 
> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend applied to 
> be used as a guest on 4 vcpus
> 
> - using a micron RealSSD P320h as the underlying local storage on a Dell 
> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
> 
> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest. We 
> used direct_io to skip caching in the guest and ran fio for 60s reading a 
> number of block sizes ranging from 512 bytes to 4MiB. Queue depth of 32 for 
> each queue was used to saturate individual vcpus in the guest.
> 
> We were interested in observing storage iops for different values of block 
> sizes. Our expectation was that iops would improve when increasing the number 
> of queues, because both the guest and dom0 would be able to make use of more 
> vcpus to handle these requests.
> 
> These are the results (as aggregate iops for all the fio threads) that we got 
> for the conditions above with sequential reads:
> 
> fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
> 8            32        512         158K          264K
> 8            32        1K          157K          260K
> 8            32        2K          157K          258K
> 8            32        4K          148K          257K
> 8            32        8K          124K          207K
> 8            32        16K         84K           105K
> 8            32        32K         50K           54K
> 8            32        64K         24K           27K
> 8            32        128K        11K           13K
> 
> 8-queue iops was better than single queue iops for all the block sizes. There 
> were very good improvements as well for sequential writes with block size 4K 
> (from 80K iops with single queue to 230K iops with 8 queues), and no 
> regressions were visible in any measurement performed.
> 

Great! Thank you very much for the test.

I'm trying to rebase these patches onto the latest kernel version (v4.1) and
will send them out in the following days.

-- 
Regards,
-Bob



[Xen-devel] BUG: unable to handle kernel NULL pointer in __netdev_pick_tx()

2015-07-06 Thread Bob Liu
Hi,

I tried to run the latest kernel, v4.2-rc1, but often got the panic below during
system boot.

[   42.118983] BUG: unable to handle kernel paging request at 003f
[   42.119008] IP: [] __netdev_pick_tx+0x70/0x120
[   42.119023] PGD 0 
[   42.119026] Oops:  [#1] PREEMPT SMP 
[   42.119031] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support 
x86_pkg_temp_thermal coretemp pcspkr crc32_pclmul crc32c_intel 
ghash_clmulni_intel ixgbe ptp pps_core cdc_ether usbnet mii mdio sb_edac dca 
edac_core wmi i2c_i801 tpm_tis tpm lpc_ich mfd_core ipmi_si ipmi_msghandler 
shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput usb_storage mgag200 
i2c_algo_bit drm_kms_helper ttm drm i2c_core nvme mpt2sas raid_class 
scsi_transport_sas
[   42.119073] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.2.0-rc1 #80
[   42.119077] Hardware name: Oracle Corporation SUN SERVER X4-4/ASSY,MB WITH 
TRAY, BIOS 24030400 08/22/2014
[   42.119081] task: 880300b84000 ti: 880300b9 task.ti: 
880300b9
[   42.119085] RIP: e030:[]  [] 
__netdev_pick_tx+0x70/0x120
[   42.119091] RSP: e02b:880306d03868  EFLAGS: 00010206
[   42.119093] RAX: 8802f676b6b0 RBX: 003f RCX: 8161cf60
[   42.119097] RDX: 001c RSI: 8802fe24c900 RDI: 8802f96c
[   42.119100] RBP: 880306d038a8 R08: 00023240 R09: 8160fb1c
[   42.119104] R10:  R11:  R12: 8802fe24c900
[   42.119107] R13:  R14:  R15: 8802f96c
[   42.119121] FS:  () GS:880306d0() 
knlGS:
[   42.119124] CS:  e033 DS: 002b ES: 002b CR0: 80050033
[   42.119127] CR2: 003f CR3: 01c1c000 CR4: 00042660
[   42.119130] Stack:
[   42.119132]  81d63850 8802f63040a0 880306d03888 
8802fe24c900
[   42.119137]  000e  8802f96c 
8802fe24c400
[   42.119141]  880306d038e8 a028bea4 8189cfe0 
81d1b900
[   42.119146] Call Trace:
[   42.119149]   
[   42.119160]  [] ixgbe_select_queue+0xc4/0x150 [ixgbe]
[   42.119167]  [] netdev_pick_tx+0x5e/0xf0
[   42.119170]  [] __dev_queue_xmit+0x90/0x560
[   42.119174]  [] dev_queue_xmit_sk+0x13/0x20
[   42.119181]  [] br_dev_queue_push_xmit+0x4a/0x80 [bridge]
[   42.119186]  [] br_forward_finish+0x2a/0x80 [bridge]
[   42.119191]  [] __br_forward+0x88/0x110 [bridge]
[   42.119198]  [] ? __skb_clone+0x2e/0x140
[   42.119202]  [] ? skb_clone+0x63/0xa0
[   42.119206]  [] ? br_forward_finish+0x80/0x80 [bridge]
[   42.119211]  [] deliver_clone+0x37/0x60 [bridge]
[   42.119215]  [] br_flood+0xc8/0x130 [bridge]
[   42.119220]  [] ? br_forward_finish+0x80/0x80 [bridge]
[   42.119255]  [] br_flood_forward+0x19/0x20 [bridge]
[   42.119260]  [] br_handle_frame_finish+0x258/0x590 [bridge]
[   42.119266]  [] ? get_partial_node.isra.63+0x1b7/0x1d4
[   42.119272]  [] br_handle_frame+0x146/0x270 [bridge]
[   42.119277]  [] ? udp_gro_receive+0x129/0x150
[   42.119281]  [] __netif_receive_skb_core+0x1d6/0xa20
[   42.119286]  [] ? inet_gro_receive+0x9d/0x230
[   42.119290]  [] __netif_receive_skb+0x18/0x60
[   42.119294]  [] netif_receive_skb_internal+0x33/0xb0
[   42.119297]  [] napi_gro_receive+0xbf/0x110
[   42.119303]  [] ixgbe_clean_rx_irq+0x490/0x9e0 [ixgbe]
[   42.119308]  [] ixgbe_poll+0x420/0x790 [ixgbe]
[   42.119312]  [] net_rx_action+0x15d/0x340
[   42.119321]  [] __do_softirq+0xe6/0x2f0
[   42.119324]  [] irq_exit+0xf4/0x100
[   42.119333]  [] xen_evtchn_do_upcall+0x39/0x50
[   42.119340]  [] xen_do_hypervisor_callback+0x1e/0x30
[   42.119343]   
[   42.119348]  [] ? xen_hypercall_sched_op+0xa/0x20
[   42.119351]  [] ? xen_hypercall_sched_op+0xa/0x20
[   42.119356]  [] ? xen_safe_halt+0x10/0x20
[   42.119362]  [] ? default_idle+0x1b/0xf0
[   42.119365]  [] ? arch_cpu_idle+0xf/0x20
[   42.119370]  [] ? default_idle_call+0x3b/0x50
[   42.119374]  [] ? cpu_startup_entry+0x2bf/0x350
[   42.119379]  [] ? cpu_bringup_and_idle+0x2a/0x40
[   42.119382] Code: 8b 87 e8 03 00 00 48 85 c0 0f 84 af 00 00 00 41 8b 94 24 
ac 00 00 00 83 ea 01 48 8d 44 d0 10 48 8b 18 48 85 db 0f 84 93 00 00 00 <8b> 03 
83 f8 01 74 6b 41 f6 84 24 91 00 00 00 30 74 66 41 8b 94 
[   42.119414] RIP  [] __netdev_pick_tx+0x70/0x120
[   42.119418]  RSP 
[   42.119420] CR2: 003f
[   42.119425] ---[ end trace cbc4abc4d5c3f8b2 ]---
[   43.391014] BUG: unable to handle kernel paging request at 003f
[   43.391023] IP: [] __netdev_pick_tx+0x70/0x120
[   43.391030] PGD 0 
[   43.391032] Oops:  [#2] PREEMPT SMP 
[   43.391036] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support 
x86_pkg_temp_thermal coretemp pcspkr crc32_pclmul crc32c_intel 
ghash_clmulni_intel ixgbe ptp pps_core cdc_ether usbnet mii mdio sb_edac dca 
edac_core wmi i2c_i801 tpm_tis tpm lpc_ich mfd_core ipmi_si ipmi_msghandler 
shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput usb

[Xen-devel] [RESEND PATCH] xen/blkfront: convert to blk-mq APIs

2015-07-06 Thread Bob Liu
From: Arianna Avanzini 

This patch converts xen-blkfront driver to use the block multiqueue APIs.
Only one hardware queue is used now, so there is no performance change.

The legacy non-mq code was deleted completely, which matches other drivers
such as virtio, mtip, and nvme.

Also dropped unnecessary holding of info->io_lock when calling into blk-mq APIs.

Signed-off-by: Arianna Avanzini 
Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c |  173 ++
 1 file changed, 73 insertions(+), 100 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 6d89ed3..831a577 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -37,6 +37,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -148,6 +149,7 @@ struct blkfront_info
unsigned int feature_persistent:1;
unsigned int max_indirect_segments;
int is_ready;
+   struct blk_mq_tag_set tag_set;
 };
 
 static unsigned int nr_minors;
@@ -616,54 +618,45 @@ static inline bool blkif_request_flush_invalid(struct 
request *req,
 !(info->feature_flush & REQ_FUA)));
 }
 
-/*
- * do_blkif_request
- *  read a block; request is in a request queue
- */
-static void do_blkif_request(struct request_queue *rq)
+static int blk_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
+   const struct blk_mq_queue_data *qd)
 {
-   struct blkfront_info *info = NULL;
-   struct request *req;
-   int queued;
-
-   pr_debug("Entered do_blkif_request\n");
-
-   queued = 0;
-
-   while ((req = blk_peek_request(rq)) != NULL) {
-   info = req->rq_disk->private_data;
-
-   if (RING_FULL(&info->ring))
-   goto wait;
-
-   blk_start_request(req);
+   struct blkfront_info *info = qd->rq->rq_disk->private_data;
+   int ret = BLK_MQ_RQ_QUEUE_OK;
 
-   if (blkif_request_flush_invalid(req, info)) {
-   __blk_end_request_all(req, -EOPNOTSUPP);
-   continue;
-   }
+   blk_mq_start_request(qd->rq);
+   spin_lock_irq(&info->io_lock);
+   if (RING_FULL(&info->ring)) {
+   spin_unlock_irq(&info->io_lock);
+   blk_mq_stop_hw_queue(hctx);
+   ret = BLK_MQ_RQ_QUEUE_BUSY;
+   goto out;
+   }
 
-   pr_debug("do_blk_req %p: cmd %p, sec %lx, "
-"(%u/%u) [%s]\n",
-req, req->cmd, (unsigned long)blk_rq_pos(req),
-blk_rq_cur_sectors(req), blk_rq_sectors(req),
-rq_data_dir(req) ? "write" : "read");
-
-   if (blkif_queue_request(req)) {
-   blk_requeue_request(rq, req);
-wait:
-   /* Avoid pointless unplugs. */
-   blk_stop_queue(rq);
-   break;
-   }
+   if (blkif_request_flush_invalid(qd->rq, info)) {
+   spin_unlock_irq(&info->io_lock);
+   ret = BLK_MQ_RQ_QUEUE_ERROR;
+   goto out;
+   }
 
-   queued++;
+   if (blkif_queue_request(qd->rq)) {
+   spin_unlock_irq(&info->io_lock);
+   blk_mq_stop_hw_queue(hctx);
+   ret = BLK_MQ_RQ_QUEUE_BUSY;
+   goto out;
}
 
-   if (queued != 0)
-   flush_requests(info);
+   flush_requests(info);
+   spin_unlock_irq(&info->io_lock);
+out:
+   return ret;
 }
 
+static struct blk_mq_ops blkfront_mq_ops = {
+   .queue_rq = blk_mq_queue_rq,
+   .map_queue = blk_mq_map_queue,
+};
+
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
unsigned int physical_sector_size,
unsigned int segments)
@@ -671,9 +664,22 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
 
-   rq = blk_init_queue(do_blkif_request, &info->io_lock);
-   if (rq == NULL)
+   memset(&info->tag_set, 0, sizeof(info->tag_set));
+   info->tag_set.ops = &blkfront_mq_ops;
+   info->tag_set.nr_hw_queues = 1;
+   info->tag_set.queue_depth =  BLK_RING_SIZE(info);
+   info->tag_set.numa_node = NUMA_NO_NODE;
+   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   info->tag_set.cmd_size = 0;
+   info->tag_set.driver_data = info;
+
+   if (blk_mq_alloc_tag_set(&info->tag_set))
+   return -1;
+   rq = blk_mq_init_queue(&info->tag_set);
+   if (IS_ERR(rq)) {
+   blk_mq_free_tag_set(&info->tag_set);

Re: [Xen-devel] [PATCH] net/bridge: Add missing in6_dev_put in br_validate_ipv6

2015-07-06 Thread Bob Liu

On 07/04/2015 02:01 AM, Julien Grall wrote:
> The commit efb6de9b4ba0092b2c55f6a52d16294a8a698edd "netfilter: bridge:
> forward IPv6 fragmented packets" introduced a new function
> br_validate_ipv6 which take a reference on the inet6 device. Although,
> the reference is not released at the end.
> 
> This will result to the impossibility to destroy any netdevice using
> ipv6 and bridge.
> 
> Spotted while trying to destroy a Xen guest on the upstream Linux:
> "unregister_netdevice: waiting for vif1.0 to become free. Usage count = 1"
> 
> Signed-off-by: Julien Grall 

Also hit the same issue, thank you for the fix.

Tested-by: Bob Liu 

> Cc: Bernhard Thaler 
> Cc: Pablo Neira Ayuso 
> Cc: f...@strlen.de
> Cc: ian.campb...@citrix.com
> Cc: wei.l...@citrix.com
> 
> ---
> Note that it's impossible to create a new guest after this message.
> I'm not sure if that's expected.
> ---
>  net/bridge/br_netfilter_ipv6.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
> index 6d12d26..7046e19 100644
> --- a/net/bridge/br_netfilter_ipv6.c
> +++ b/net/bridge/br_netfilter_ipv6.c
> @@ -140,11 +140,16 @@ int br_validate_ipv6(struct sk_buff *skb)
>   /* No IP options in IPv6 header; however it should be
>* checked if some next headers need special treatment
>*/
> +
> + in6_dev_put(idev);
> +
>   return 0;
>  
>  inhdr_error:
>   IP6_INC_STATS_BH(dev_net(dev), idev, IPSTATS_MIB_INHDRERRORS);
>  drop:
> + in6_dev_put(idev);
> +
>   return -1;
>  }
>  
> 



Re: [Xen-devel] BUG: unable to handle kernel NULL pointer in __netdev_pick_tx()

2015-07-06 Thread Bob Liu

On 07/06/2015 06:41 PM, Eric Dumazet wrote:
> On Mon, 2015-07-06 at 16:26 +0800, Bob Liu wrote:
>> Hi,
>>
>> I tried to run the latest kernel v4.2-rc1, but often got below panic during 
>> system boot.
>>
>> [   42.118983] BUG: unable to handle kernel paging request at 
>> 003f
>> [   42.119008] IP: [] __netdev_pick_tx+0x70/0x120
>> [   42.119023] PGD 0 
>> [   42.119026] Oops:  [#1] PREEMPT SMP 
>> [   42.119031] Modules linked in: bridge stp llc iTCO_wdt 
>> iTCO_vendor_support x86_pkg_temp_thermal coretemp pcspkr crc32_pclmul 
>> crc32c_intel ghash_clmulni_intel ixgbe ptp pps_core cdc_ether usbnet mii 
>> mdio sb_edac dca edac_core wmi i2c_i801 tpm_tis tpm lpc_ich mfd_core ipmi_si 
>> ipmi_msghandler shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput 
>> usb_storage mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core nvme 
>> mpt2sas raid_class scsi_transport_sas
>> [   42.119073] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.2.0-rc1 #80
>> [   42.119077] Hardware name: Oracle Corporation SUN SERVER X4-4/ASSY,MB 
>> WITH TRAY, BIOS 24030400 08/22/2014
>> [   42.119081] task: 880300b84000 ti: 880300b9 task.ti: 
>> 880300b9
>> [   42.119085] RIP: e030:[]  [] 
>> __netdev_pick_tx+0x70/0x120
>> [   42.119091] RSP: e02b:880306d03868  EFLAGS: 00010206
>> [   42.119093] RAX: 8802f676b6b0 RBX: 003f RCX: 
>> 8161cf60
>> [   42.119097] RDX: 001c RSI: 8802fe24c900 RDI: 
>> 8802f96c
>> [   42.119100] RBP: 880306d038a8 R08: 00023240 R09: 
>> 8160fb1c
>> [   42.119104] R10:  R11:  R12: 
>> 8802fe24c900
>> [   42.119107] R13:  R14:  R15: 
>> 8802f96c
>> [   42.119121] FS:  () GS:880306d0() 
>> knlGS:
>> [   42.119124] CS:  e033 DS: 002b ES: 002b CR0: 80050033
>> [   42.119127] CR2: 003f CR3: 01c1c000 CR4: 
>> 00042660
>> [   42.119130] Stack:
>> [   42.119132]  81d63850 8802f63040a0 880306d03888 
>> 8802fe24c900
>> [   42.119137]  000e  8802f96c 
>> 8802fe24c400
>> [   42.119141]  880306d038e8 a028bea4 8189cfe0 
>> 81d1b900
>> [   42.119146] Call Trace:
>> [   42.119149]   
>> [   42.119160]  [] ixgbe_select_queue+0xc4/0x150 [ixgbe]
>> [   42.119167]  [] netdev_pick_tx+0x5e/0xf0
>> [   42.119170]  [] __dev_queue_xmit+0x90/0x560
>> [   42.119174]  [] dev_queue_xmit_sk+0x13/0x20
>> [   42.119181]  [] br_dev_queue_push_xmit+0x4a/0x80 
>> [bridge]
>> [   42.119186]  [] br_forward_finish+0x2a/0x80 [bridge]
>> [   42.119191]  [] __br_forward+0x88/0x110 [bridge]
>> [   42.119198]  [] ? __skb_clone+0x2e/0x140
>> [   42.119202]  [] ? skb_clone+0x63/0xa0
>> [   42.119206]  [] ? br_forward_finish+0x80/0x80 [bridge]
>> [   42.119211]  [] deliver_clone+0x37/0x60 [bridge]
>> [   42.119215]  [] br_flood+0xc8/0x130 [bridge]
>> [   42.119220]  [] ? br_forward_finish+0x80/0x80 [bridge]
>> [   42.119255]  [] br_flood_forward+0x19/0x20 [bridge]
>> [   42.119260]  [] br_handle_frame_finish+0x258/0x590 
>> [bridge]
>> [   42.119266]  [] ? get_partial_node.isra.63+0x1b7/0x1d4
>> [   42.119272]  [] br_handle_frame+0x146/0x270 [bridge]
>> [   42.119277]  [] ? udp_gro_receive+0x129/0x150
>> [   42.119281]  [] __netif_receive_skb_core+0x1d6/0xa20
>> [   42.119286]  [] ? inet_gro_receive+0x9d/0x230
>> [   42.119290]  [] __netif_receive_skb+0x18/0x60
>> [   42.119294]  [] netif_receive_skb_internal+0x33/0xb0
>> [   42.119297]  [] napi_gro_receive+0xbf/0x110
>> [   42.119303]  [] ixgbe_clean_rx_irq+0x490/0x9e0 [ixgbe]
>> [   42.119308]  [] ixgbe_poll+0x420/0x790 [ixgbe]
>> [   42.119312]  [] net_rx_action+0x15d/0x340
>> [   42.119321]  [] __do_softirq+0xe6/0x2f0
>> [   42.119324]  [] irq_exit+0xf4/0x100
>> [   42.119333]  [] xen_evtchn_do_upcall+0x39/0x50
>> [   42.119340]  [] xen_do_hypervisor_callback+0x1e/0x30
>> [   42.119343]   
>> [   42.119348]  [] ? xen_hypercall_sched_op+0xa/0x20
>> [   42.119351]  [] ? xen_hypercall_sched_op+0xa/0x20
>> [   42.119356]  [] ? xen_safe_halt+0x10/0x20
>> [   42.119362]  [] ? default_idle+0x1b/0xf0
>> [   42.119365]  [] ? arch_cpu_idle+0xf/0x20
>> [   42.119370]  [] ? default_idle_call+0x3b/0x50
>> [   42.119374]  [] ? cpu_startup_entry+0x2bf/0x350
>> [   42.119379]  [] ? cpu_bringup_and_idle+0x2a/0x40
>> [  

[Xen-devel] [PATCH] xen: blkif.h: document linux xen-block multi-page ring implementation

2015-05-12 Thread Bob Liu
After commit 1b1586eeeb8c ("xenbus_client: Extend interface to
support multi-page ring"), the Linux xenbus driver can support multi-page rings.

Based on this interface, we saw impressive improvements from using a multi-page
ring in the xen-block driver: with a 64-page ring, IOPS increased about 15 times
in throughput testing.

The Linux implementation reuses two 'DEPRECATED' nodes ('max-ring-pages' and
'num-ring-pages'), so that nothing is broken.
It also removes the power-of-2 limit and updates the default/max values accordingly.

Signed-off-by: Bob Liu 
---
 xen/include/public/io/blkif.h |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 6baf7fb..0e34ae6 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -189,11 +189,11 @@
  *
  * max-ring-pages
  *  Values: 
- *  Default Value:  1
- *  Notes:  DEPRECATED, 2, 3
+ *  Default Value:  32
+ *  Notes:  2, 3
  *
  *  The maximum supported size of the request ring buffer in units of
- *  machine pages.  The value must be a power of 2.
+ *  machine pages.
  *
  *- Backend Device Properties -
  *
@@ -302,11 +302,11 @@
  * num-ring-pages
  *  Values: 
  *  Default Value:  1
- *  Maximum Value:  MAX(max-ring-pages,(0x1 << max-ring-page-order))
- *  Notes:  DEPRECATED, 2, 3
+ *  Maximum Value:  max-ring-pages
+ *  Notes:  2, 3
  *
  *  The size of the frontend allocated request ring buffer in units of
- *  machine pages.  The value must be a power of 2.
+ *  machine pages.
  *
  * feature-persistent
  *  Values: 0/1 (boolean)
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 1/2] driver: xen-blkfront: move talk_to_blkback to the correct place

2015-05-12 Thread Bob Liu
The right place for talk_to_blkback() to query backend features and transport
parameters is after the backend has entered XenbusStateInitWait. There is no
problem with this yet, but it is a violation of the design, and furthermore it
would not allow the frontend/backend to negotiate the 'multi-page' and
'multi-queue' features, which require this.

This patch moves talk_to_blkback() to blkback_changed(), which runs after the
backend has entered XenbusStateInitWait, just as blkif.h defines:

See: xen/include/public/io/blkif.h
FrontBack
==
XenbusStateInitialising  XenbusStateInitialising
 o Query virtual device   o Query backend device identification
   properties.  data.
 o Setup OS device instance.  o Open and validate backend device.
  o Publish backend features and
transport parameters.
 |
 |
 V
 XenbusStateInitWait

o Query backend features and
  transport parameters.
o Allocate and initialize the
  request ring.
o Publish transport parameters
  that will be in effect during
  this connection.
 |
 |
 V
XenbusStateInitialised

  o Query frontend transport parameters.
  o Connect to the request ring and
event channel.
  o Publish backend device properties.
 |
 |
 V
 XenbusStateConnected

 o Query backend device properties.
 o Finalize OS virtual device
   instance.
 |
 |
 V
XenbusStateConnected

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2c61cf8..88e23fd 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1430,13 +1430,6 @@ static int blkfront_probe(struct xenbus_device *dev,
info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
dev_set_drvdata(&dev->dev, info);
 
-   err = talk_to_blkback(dev, info);
-   if (err) {
-   kfree(info);
-   dev_set_drvdata(&dev->dev, NULL);
-   return err;
-   }
-
return 0;
 }
 
@@ -1906,8 +1899,13 @@ static void blkback_changed(struct xenbus_device *dev,
dev_dbg(&dev->dev, "blkfront:blkback_changed to state %d.\n", 
backend_state);
 
switch (backend_state) {
-   case XenbusStateInitialising:
case XenbusStateInitWait:
+   if (talk_to_blkback(dev, info)) {
+   kfree(info);
+   dev_set_drvdata(&dev->dev, NULL);
+   break;
+   }
+   case XenbusStateInitialising:
case XenbusStateInitialised:
case XenbusStateReconfiguring:
case XenbusStateReconfigured:
-- 
1.8.3.1




[Xen-devel] [PATCH v3 2/2] xen/block: add multi-page ring support

2015-05-12 Thread Bob Liu
Extend xen/block to support a multi-page ring, so that more requests can be issued
by using more than one page as the request ring between blkfront and the backend.
As a result, performance improves significantly.

We saw impressive improvements on our high-end iSCSI storage cluster
backend:
with a 64-page ring, IOPS increased about 15 times in throughput testing and
more than doubled in latency testing.

The reason is that the limit on outstanding requests is 32 with a one-page
ring, but in our case the iSCSI LUN was spread across about 100 physical drives,
and 32 requests were not enough to keep them busy.
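The 32-request limit of a single-page ring and the gain from larger rings can be sanity-checked with a small userspace model. The header and entry sizes below are assumptions about the shared-ring layout (a 64-byte header followed by 112-byte request/response union entries, entry count rounded down to a power of two, as the `__RING_SIZE` macro family does), not values read from the kernel headers:

```c
#include <assert.h>

/* Assumed layout constants; stand-ins, not kernel definitions. */
#define RING_HDR_SIZE   64u   /* assumed size of the sring header */
#define RING_ENTRY_SIZE 112u  /* assumed size of a request/response union entry */
#define PAGE_SZ         4096u

/* Round down to the nearest power of two, like __RD32 does. */
static unsigned int rd_pow2(unsigned int x)
{
	unsigned int p = 1;

	while (p * 2 <= x)
		p *= 2;
	return p;
}

/* Number of outstanding requests a ring of nr_pages pages can hold. */
unsigned int blk_ring_size(unsigned int nr_pages)
{
	return rd_pow2((nr_pages * PAGE_SZ - RING_HDR_SIZE) / RING_ENTRY_SIZE);
}
```

Under these assumed sizes a one-page ring holds 32 requests, and a 64-page ring holds 2048, which is the kind of depth needed to keep ~100 spindles busy.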

Changes in v2:
 - Rebased to 4.0-rc6
 - Added description on how this protocol works into io/blkif.h

Changes in v3:
 - Follow the protocol defined in io/blkif.h on XEN tree

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/blkback.c |  14 -
 drivers/block/xen-blkback/common.h  |   4 +-
 drivers/block/xen-blkback/xenbus.c  |  83 ++---
 drivers/block/xen-blkfront.c| 102 +++-
 4 files changed, 156 insertions(+), 47 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index 713fc9f..f191083 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -84,6 +84,12 @@ MODULE_PARM_DESC(max_persistent_grants,
  "Maximum number of grants to map persistently");
 
 /*
+ * Maximum number of pages to be used as the ring between front and backend
+ */
+unsigned int xen_blkif_max_ring_pages = XENBUS_MAX_RING_PAGES;
+module_param_named(max_ring_pages, xen_blkif_max_ring_pages, int, S_IRUGO);
+MODULE_PARM_DESC(max_ring_pages, "Maximum amount of pages to be used as the 
ring");
+/*
  * The LRU mechanism to clean the lists of persistent grants needs to
  * be executed periodically. The time interval between consecutive executions
  * of the purge mechanism is set in ms.
@@ -630,7 +636,7 @@ purge_gnt_list:
}
 
/* Shrink if we have more than xen_blkif_max_buffer_pages */
-   shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages);
+   shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages * 
blkif->nr_ring_pages);
 
if (log_stats && time_after(jiffies, blkif->st_print))
print_stats(blkif);
@@ -1435,6 +1441,12 @@ static int __init xen_blkif_init(void)
 {
int rc = 0;
 
+   if (xen_blkif_max_ring_pages > XENBUS_MAX_RING_PAGES) {
+   pr_info("Invalid max_ring_pages (%d), will use default max: 
%d.\n",
+   xen_blkif_max_ring_pages, XENBUS_MAX_RING_PAGES);
+   xen_blkif_max_ring_pages = XENBUS_MAX_RING_PAGES;
+   }
+
if (!xen_domain())
return -ENODEV;
 
diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index f620b5d..84a964c 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -44,6 +44,7 @@
 #include 
 #include 
 
+extern unsigned int xen_blkif_max_ring_pages;
 /*
  * This is the maximum number of segments that would be allowed in indirect
  * requests. This value will also be passed to the frontend.
@@ -248,7 +249,7 @@ struct backend_info;
 #define PERSISTENT_GNT_WAS_ACTIVE  1
 
 /* Number of requests that we can fit in a ring */
-#define XEN_BLKIF_REQS 32
+#define XEN_BLKIF_REQS (32 * XENBUS_MAX_RING_PAGES)
 
 struct persistent_gnt {
struct page *page;
@@ -320,6 +321,7 @@ struct xen_blkif {
struct work_struct  free_work;
/* Thread shutdown wait queue. */
wait_queue_head_t   shutdown_wq;
+   int nr_ring_pages;
 };
 
 struct seg_buf {
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 6ab69ad..909babd 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -198,8 +198,8 @@ fail:
return ERR_PTR(-ENOMEM);
 }
 
-static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
-unsigned int evtchn)
+static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
+unsigned int nr_grefs, unsigned int evtchn)
 {
int err;
 
@@ -207,7 +207,7 @@ static int xen_blkif_map(struct xen_blkif *blkif, 
grant_ref_t gref,
if (blkif->irq)
return 0;
 
-   err = xenbus_map_ring_valloc(blkif->be->dev, &gref, 1,
+   err = xenbus_map_ring_valloc(blkif->be->dev, gref, nr_grefs,
 &blkif->blk_ring);
if (err < 0)
return err;
@@ -217,21 +217,21 @@ static int xen_blkif_map(struct xen_blkif *blkif, 
grant_ref_t gref,
{
struct blkif_sring *sring;
sring = (struct blkif_sring *)blk

Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-05-13 Thread Bob Liu

On 04/28/2015 03:46 PM, Arianna Avanzini wrote:
> Hello Christoph,
> 
> Il 28/04/2015 09:36, Christoph Hellwig ha scritto:
>> What happened to this patchset?
>>
> 
> It was passed on to Bob Liu, who published a follow-up patchset here: 
> https://lkml.org/lkml/2015/2/15/46
> 

Right, and then I was interrupted by another xen-block feature: the 'multi-page'
ring.
I will be back on this patchset soon. Thank you!

-Bob



Re: [Xen-devel] [PATCH] xen: blkif.h: document linux xen-block multi-page ring implementation

2015-05-15 Thread Bob Liu

On 05/15/2015 05:51 PM, David Vrabel wrote:
> On 12/05/15 11:58, Bob Liu wrote:
>> After commit 1b1586eeeb8c ("xenbus_client: Extend interface to
>> support multi-page ring"), Linux xenbus driver can support multi-page ring.
>>
>> Based on this interface, we got some impressive improvements by using 
>> multi-page
>> ring in xen-block driver. If using 64 pages as the ring, the IOPS increased
>> about 15 times for the throughput testing.
>>
>> The Linux implementation reuses two 'DEPRECATED' nodes('max-ring-pages' and
>> 'num-ring-pages), so that nothing would be broken.
>> Also removed the power of 2 limit and updated the default/max value 
>> accordingly.
> 
> You can't drop the power of 2 restriction as there may be frontends that
> support this old option (from before it was deprecated) and these may
> not support non-powers of 2.
> 

After taking a closer look, I think we can fully reuse the current protocol,
which only uses 'ring-page-order' and 'max-ring-page-order',
and leave 'max-ring-pages' and 'num-ring-pages' as DEPRECATED.

In conclusion, blkif.h doesn't need to be modified, and I'll update the Linux
implementation to use 'ring-page-order' and 'max-ring-page-order' too.
What do you think? Thank you!
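One reason the order-based nodes avoid the power-of-2 concern raised in the review is that they carry log2 of the page count, so the resulting number of ring pages is a power of two by construction. A one-line sketch (hypothetical helper name, not a kernel function):

```c
#include <assert.h>

/* 'ring-page-order' / 'max-ring-page-order' hold log2(pages), so the
 * derived page count is always a power of two by construction. */
unsigned int pages_from_order(unsigned int order)
{
	return 1u << order;
}
```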

Regards,
-Bob



Re: [Xen-devel] [PATCH 1/2] driver: xen-blkfront: move talk_to_blkback to the correct place

2015-05-15 Thread Bob Liu

On 05/15/2015 06:01 PM, Roger Pau Monné wrote:
> El 12/05/15 a les 13.01, Bob Liu ha escrit:
>> The right place for talk_to_blkback() to query backend features and transport
>> parameters is after backend entered XenbusStateInitWait. There is no problem
> 
> talk_to_blkback doesn't gather any backend features, it just publishes
> the features supported by the frontend, which AFAICT can be done at any

1) But talk_to_blkback() will also allocate and initialize the request ring,
which should be done after the backend has entered XenbusStateInitWait.

Please see the protocol defined in xen/include/public/io/blkif.h:
 *
 *   Startup *
 *
 *
 * Tool stack creates front and back nodes with state XenbusStateInitialising.
 *
 * FrontBack
 * ==
 * XenbusStateInitialising  XenbusStateInitialising
 *  o Query virtual device   o Query backend device identification
 *properties.  data.
 *  o Setup OS device instance.  o Open and validate backend device.
 *   o Publish backend features and
 * transport parameters.
 *  |
 *  |
 *  V
 *  XenbusStateInitWait
 *
 * o Query backend features and
 *   transport parameters.
 * o Allocate and initialize the
 *   request ring.


2) Another problem is that after the 'multi-page' ring feature is introduced,
we have to know the maximum number of ring pages supported by the backend in
setup_blkring().
If the backend hasn't entered XenbusStateInitWait yet, we may not query the
right value. E.g.:

Frontend  Backend

in .probe:
talk_to_blkback()
 > setup_blkring()
  > xenbus_scanf(max_ring_pages)



   in .probe:
   xenbus_printf(max_ring_pages)
    Too late to write the real 
value
   xenbus_switch_state(dev, 
XenbusStateInitWait)
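The failure mode in the diagram above can be modeled in a few lines of userspace C. The helper name and the pointer-as-xenstore-key are purely illustrative: the point is that a read attempted before the backend has written the key must fall back to the legacy single-page default rather than use a bogus value.

```c
#include <assert.h>
#include <stddef.h>

/* Model: the xenstore key is a nullable pointer; NULL means the
 * backend has not written 'max-ring-pages' yet (the race above). */
unsigned int read_max_ring_pages(const unsigned int *xs_value)
{
	return xs_value ? *xs_value : 1; /* 1 page: the legacy default */
}
```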


Thank you for reviewing these patches!

Regards,
-Bob

> time provided that it's before switching to state XenbusStateInitWait.
> Blkfront doesn't have to wait for the backend to switch to state
> XenbusStateInitWait before publishing the features supported by the
> frontend, which is what talk_to_blkback does.
> 
> Roger.
> 




Re: [Xen-devel] [PATCH 1/2] driver: xen-blkfront: move talk_to_blkback to the correct place

2015-05-15 Thread Bob Liu

On 05/15/2015 07:14 PM, Roger Pau Monné wrote:
> El 15/05/15 a les 13.03, Bob Liu ha escrit:
>>
>> On 05/15/2015 06:01 PM, Roger Pau Monné wrote:
>>> El 12/05/15 a les 13.01, Bob Liu ha escrit:
>>>> The right place for talk_to_blkback() to query backend features and 
>>>> transport
>>>> parameters is after backend entered XenbusStateInitWait. There is no 
>>>> problem
>>>
>>> talk_to_blkback doesn't gather any backend features, it just publishes
>>> the features supported by the frontend, which AFAICT can be done at any
>>
>> 1) But talk_tlkback will also allocate and initialize the request ring which
>> should be done after backend entered XenbusStateInitWait.
> 
> Maybe setup_blkring should be moved to a more suitable location instead
> of moving the whole function?
> 

Most of the other parts of talk_to_blkback() depend on setup_blkring(), such as
writing out ring-ref and event-channel.

Only notifying 'feature-persistent' and 'protocol: XEN_IO_PROTO_ABI_NATIVE' can
be left in the frontend probe().

Then the patch would look like this:
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2c61cf8..6b918e0 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1318,17 +1318,6 @@ again:
message = "writing event-channel";
goto abort_transaction;
}
-   err = xenbus_printf(xbt, dev->nodename, "protocol", "%s",
-   XEN_IO_PROTO_ABI_NATIVE);
-   if (err) {
-   message = "writing protocol";
-   goto abort_transaction;
-   }
-   err = xenbus_printf(xbt, dev->nodename,
-   "feature-persistent", "%u", 1);
-   if (err)
-   dev_warn(&dev->dev,
-"writing persistent grants feature to xenbus");
 
err = xenbus_transaction_end(xbt, 0);
if (err) {
@@ -1430,13 +1419,17 @@ static int blkfront_probe(struct xenbus_device *dev,
info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
dev_set_drvdata(&dev->dev, info);
 
-   err = talk_to_blkback(dev, info);
+   err = xenbus_printf(XBT_NIL, dev->nodename, "protocol", "%s",
+   XEN_IO_PROTO_ABI_NATIVE);
if (err) {
kfree(info);
dev_set_drvdata(&dev->dev, NULL);
return err;
}
-
+   err = xenbus_printf(xbt, dev->nodename,
+   "feature-persistent", "%u", 1);
+   if (err)
+   dev_warn(&dev->dev,
+"writing persistent grants feature to xenbus");
return 0;
 }
 
@@ -1906,8 +1899,13 @@ static void blkback_changed(struct xenbus_device *dev,
dev_dbg(&dev->dev, "blkfront:blkback_changed to state %d.\n", 
backend_state);
 
switch (backend_state) {
-   case XenbusStateInitialising:
case XenbusStateInitWait:
+   if (talk_to_blkback(dev, info)) {
+   kfree(info);
+   dev_set_drvdata(&dev->dev, NULL);
+   break;
+   }
+   case XenbusStateInitialising:
case XenbusStateInitialised:
case XenbusStateReconfiguring:
case XenbusStateReconfigured:



Re: [Xen-devel] [PATCH v3 2/2] xen/block: add multi-page ring support

2015-05-15 Thread Bob Liu

On 05/15/2015 07:13 PM, Roger Pau Monné wrote:
> El 12/05/15 a les 13.01, Bob Liu ha escrit:
>> Extend xen/block to support multi-page ring, so that more requests can be 
>> issued
>> by using more than one pages as the request ring between blkfront and 
>> backend.
>> As a result, the performance can get improved significantly.
>   ^ s/can get improved/improves/
> 
>>
>> We got some impressive improvements on our highend iscsi storage cluster 
>> backend.
>> If using 64 pages as the ring, the IOPS increased about 15 times for the
>> throughput testing and above doubled for the latency testing.
>>
>> The reason was the limit on outstanding requests is 32 if use only one-page
>> ring, but in our case the iscsi lun was spread across about 100 physical 
>> drives,
>> 32 was really not enough to keep them busy.
>>
>> Changes in v2:
>>  - Rebased to 4.0-rc6
>>  - Added description on how this protocol works into io/blkif.h
> 
> I don't see any changes to io/blkif.h in this patch, is something missing?
> 

Sorry, I should have mentioned in v3 that these changes were removed because I
followed the protocol already defined in the XEN git tree:
xen/include/public/io/blkif.h

> Also you use XENBUS_MAX_RING_PAGES which AFAICT it's not defined anywhere.
> 

It was defined in include/xen/xenbus.h.

>>
>> Changes in v3:
>>  - Follow the protocol defined in io/blkif.h on XEN tree
>>
>> Signed-off-by: Bob Liu 
>> ---
>>  drivers/block/xen-blkback/blkback.c |  14 -
>>  drivers/block/xen-blkback/common.h  |   4 +-
>>  drivers/block/xen-blkback/xenbus.c  |  83 ++---
>>  drivers/block/xen-blkfront.c| 102 
>> +++-
>>  4 files changed, 156 insertions(+), 47 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/blkback.c 
>> b/drivers/block/xen-blkback/blkback.c
>> index 713fc9f..f191083 100644
>> --- a/drivers/block/xen-blkback/blkback.c
>> +++ b/drivers/block/xen-blkback/blkback.c
>> @@ -84,6 +84,12 @@ MODULE_PARM_DESC(max_persistent_grants,
>>   "Maximum number of grants to map persistently");
>>  
>>  /*
>> + * Maximum number of pages to be used as the ring between front and backend
>> + */
>> +unsigned int xen_blkif_max_ring_pages = XENBUS_MAX_RING_PAGES;
>> +module_param_named(max_ring_pages, xen_blkif_max_ring_pages, int, S_IRUGO);
>> +MODULE_PARM_DESC(max_ring_pages, "Maximum amount of pages to be used as the 
>> ring");
>> +/*
>>   * The LRU mechanism to clean the lists of persistent grants needs to
>>   * be executed periodically. The time interval between consecutive 
>> executions
>>   * of the purge mechanism is set in ms.
>> @@ -630,7 +636,7 @@ purge_gnt_list:
>>  }
>>  
>>  /* Shrink if we have more than xen_blkif_max_buffer_pages */
>> -shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages);
>> +shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages * 
>> blkif->nr_ring_pages);
> 
> You are greatly increasing the buffer of free (ballooned) pages.
> Possibly making it 32 times bigger than it used to be, is this really
> needed?
> 

Hmm, it's a bit aggressive.
How about (xen_blkif_max_buffer_pages * blkif->nr_ring_pages) / 2?

>>  
>>  if (log_stats && time_after(jiffies, blkif->st_print))
>>  print_stats(blkif);
>> @@ -1435,6 +1441,12 @@ static int __init xen_blkif_init(void)
>>  {
>>  int rc = 0;
>>  
>> +if (xen_blkif_max_ring_pages > XENBUS_MAX_RING_PAGES) {
>> +pr_info("Invalid max_ring_pages (%d), will use default max: 
>> %d.\n",
>> +xen_blkif_max_ring_pages, XENBUS_MAX_RING_PAGES);
>> +xen_blkif_max_ring_pages = XENBUS_MAX_RING_PAGES;
>> +}
>> +
>>  if (!xen_domain())
>>  return -ENODEV;
>>  
>> diff --git a/drivers/block/xen-blkback/common.h 
>> b/drivers/block/xen-blkback/common.h
>> index f620b5d..84a964c 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -44,6 +44,7 @@
>>  #include 
>>  #include 
>>  
>> +extern unsigned int xen_blkif_max_ring_pages;
>>  /*
>>   * This is the maximum number of segments that would be allowed in indirect
>>   * requests. This value will also be passed to the frontend.
>> @@ -248,7 +249,7 @@ struct backend_info;
>>  #define 

[Xen-devel] [PATCH v4 2/2] xen/block: add multi-page ring support

2015-05-20 Thread Bob Liu
Extend xen/block to support a multi-page ring, so that more requests can be issued
by using more than one page as the request ring between blkfront and the backend.
As a result, performance improves significantly.

We saw impressive improvements on our high-end iSCSI storage cluster
backend:
with a 64-page ring, IOPS increased about 15 times in throughput testing and
more than doubled in latency testing.

The reason is that the limit on outstanding requests is 32 with a one-page
ring, but in our case the iSCSI LUN was spread across about 100 physical drives,
and 32 requests were not enough to keep them busy.

Changes in v2:
 - Rebased to 4.0-rc6.
 - Document how the multi-page ring feature works in linux io/blkif.h.

Changes in v3:
 - Remove changes to linux io/blkif.h and follow the protocol defined
   in io/blkif.h of XEN tree.
 - Rebased to 4.1-rc3

Changes in v4:
 - Turn to use 'ring-page-order' and 'max-ring-page-order'.
 - A few comments from Roger.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/blkback.c |  12 
 drivers/block/xen-blkback/common.h  |   3 +-
 drivers/block/xen-blkback/xenbus.c  |  85 +---
 drivers/block/xen-blkfront.c| 110 ++--
 4 files changed, 161 insertions(+), 49 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index 713fc9f..057890f 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -84,6 +84,12 @@ MODULE_PARM_DESC(max_persistent_grants,
  "Maximum number of grants to map persistently");
 
 /*
+ * Maximum number of pages to be used as the ring between front and backend
+ */
+unsigned int xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
+module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, 
S_IRUGO);
+MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used as 
the ring");
+/*
  * The LRU mechanism to clean the lists of persistent grants needs to
  * be executed periodically. The time interval between consecutive executions
  * of the purge mechanism is set in ms.
@@ -1435,6 +1441,12 @@ static int __init xen_blkif_init(void)
 {
int rc = 0;
 
+   if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER) {
+   pr_info("Invalid max_ring_order (%d), will use default max: 
%d.\n",
+   xen_blkif_max_ring_order, XENBUS_MAX_RING_PAGE_ORDER);
+   xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
+   }
+
if (!xen_domain())
return -ENODEV;
 
diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index f620b5d..edc0992 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -44,6 +44,7 @@
 #include 
 #include 
 
+extern unsigned int xen_blkif_max_ring_order;
 /*
  * This is the maximum number of segments that would be allowed in indirect
  * requests. This value will also be passed to the frontend.
@@ -248,7 +249,7 @@ struct backend_info;
 #define PERSISTENT_GNT_WAS_ACTIVE  1
 
 /* Number of requests that we can fit in a ring */
-#define XEN_BLKIF_REQS 32
+#define XEN_BLKIF_REQS (32 * XENBUS_MAX_RING_PAGES)
 
 struct persistent_gnt {
struct page *page;
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 6ab69ad..1ec05eb 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -25,6 +25,7 @@
 
 /* Enlarge the array size in order to fully show blkback name. */
 #define BLKBACK_NAME_LEN (20)
+#define RINGREF_NAME_LEN (20)
 
 struct backend_info {
struct xenbus_device*dev;
@@ -198,8 +199,8 @@ fail:
return ERR_PTR(-ENOMEM);
 }
 
-static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
-unsigned int evtchn)
+static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
+unsigned int nr_grefs, unsigned int evtchn)
 {
int err;
 
@@ -207,7 +208,7 @@ static int xen_blkif_map(struct xen_blkif *blkif, 
grant_ref_t gref,
if (blkif->irq)
return 0;
 
-   err = xenbus_map_ring_valloc(blkif->be->dev, &gref, 1,
+   err = xenbus_map_ring_valloc(blkif->be->dev, gref, nr_grefs,
 &blkif->blk_ring);
if (err < 0)
return err;
@@ -217,21 +218,21 @@ static int xen_blkif_map(struct xen_blkif *blkif, 
grant_ref_t gref,
{
struct blkif_sring *sring;
sring = (struct blkif_sring *)blkif->blk_ring;
-   BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE);
+   BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE * 
nr_grefs);
break

[Xen-devel] [PATCH v2 1/2] driver: xen-blkfront: move talk_to_blkback to a more suitable place

2015-05-20 Thread Bob Liu
The major responsibility of talk_to_blkback() is to allocate and initialize the
request ring and write the ring info out.
But this work should be done after the backend has entered 'XenbusStateInitWait',
as defined in the protocol file.
See xen/include/public/io/blkif.h in XEN git tree:
FrontBack
==
XenbusStateInitialising  XenbusStateInitialising
 o Query virtual device   o Query backend device identification
   properties.  data.
 o Setup OS device instance.  o Open and validate backend device.
  o Publish backend features and
transport parameters.
 |
 |
 V
 XenbusStateInitWait

o Query backend features and
  transport parameters.
o Allocate and initialize the
  request ring.

There is no problem with this yet, but it is a violation of the design, and
furthermore it would not allow the frontend/backend to negotiate the 'multi-page'
and 'multi-queue' features.

Changes in v2:
 - Re-write the commit message to be more clear.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2c61cf8..88e23fd 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1430,13 +1430,6 @@ static int blkfront_probe(struct xenbus_device *dev,
info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
dev_set_drvdata(&dev->dev, info);
 
-   err = talk_to_blkback(dev, info);
-   if (err) {
-   kfree(info);
-   dev_set_drvdata(&dev->dev, NULL);
-   return err;
-   }
-
return 0;
 }
 
@@ -1906,8 +1899,13 @@ static void blkback_changed(struct xenbus_device *dev,
dev_dbg(&dev->dev, "blkfront:blkback_changed to state %d.\n", 
backend_state);
 
switch (backend_state) {
-   case XenbusStateInitialising:
case XenbusStateInitWait:
+   if (talk_to_blkback(dev, info)) {
+   kfree(info);
+   dev_set_drvdata(&dev->dev, NULL);
+   break;
+   }
+   case XenbusStateInitialising:
case XenbusStateInitialised:
case XenbusStateReconfiguring:
case XenbusStateReconfigured:
-- 
1.8.3.1




Re: [Xen-devel] [PATCH v4 2/2] xen/block: add multi-page ring support

2015-05-20 Thread Bob Liu


On 05/20/2015 11:00 PM, Julien Grall wrote:
> On 20/05/15 15:56, Roger Pau Monné wrote:
>> El 20/05/15 a les 15.21, Julien Grall ha escrit:
>>> Hi,
>>>
>>> On 20/05/15 14:10, Bob Liu wrote:
>>>> ---
>>>>  drivers/block/xen-blkback/blkback.c |  12 
>>>>  drivers/block/xen-blkback/common.h  |   3 +-
>>>>  drivers/block/xen-blkback/xenbus.c  |  85 +---
>>>>  drivers/block/xen-blkfront.c| 110 
>>>> ++--
>>>>  4 files changed, 161 insertions(+), 49 deletions(-)
>>>>
>>>> diff --git a/drivers/block/xen-blkback/blkback.c 
>>>> b/drivers/block/xen-blkback/blkback.c
>>>> index 713fc9f..057890f 100644
>>>> --- a/drivers/block/xen-blkback/blkback.c
>>>> +++ b/drivers/block/xen-blkback/blkback.c
>>>> @@ -84,6 +84,12 @@ MODULE_PARM_DESC(max_persistent_grants,
>>>>   "Maximum number of grants to map persistently");
>>>>  
>>>>  /*
>>>> + * Maximum number of pages to be used as the ring between front and 
>>>> backend
>>>> + */
>>>> +unsigned int xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
>>>
>>> We will soon support 64KB page granularity with ARM64, although the PV
>>> protocol will keep a 4KB page granularity.
>>>
>>> Can you clarify which granularity is used here? The one of the host or
>>> the one of the PV protocol?
>>
>> It's using 4K pages, because those are then granted to the domain
>> handling the backend.
> 
> It would be nice to add a word in the comment.
> 

Sure, I'll make an update.

Roger, can I get your ack on these two patches, aside from this comment update?

-- 
Regards,
-Bob



Re: [Xen-devel] [PATCH v4 2/2] xen/block: add multi-page ring support

2015-05-21 Thread Bob Liu

On 05/21/2015 07:22 PM, Roger Pau Monné wrote:
> El 20/05/15 a les 15.10, Bob Liu ha escrit:
...
>> +} else {
>> +unsigned int i;
>> +
>> +if (ring_page_order > xen_blkif_max_ring_order) {
>> +err = -EINVAL;
>> +xenbus_dev_fatal(dev, err, "%s/request %d ring page 
>> order exceed max:%d",
>> + dev->otherend, ring_page_order, 
>> xen_blkif_max_ring_order);
>> +return err;
>> +}
>> +
>> +nr_grefs = 1 << ring_page_order;
>> +for (i = 0; i < nr_grefs; i++) {
>> +char ring_ref_name[RINGREF_NAME_LEN];
>> +
>> +snprintf(ring_ref_name, sizeof(ring_ref_name), 
>> "ring-ref%u", i);
> ^ RINGREF_NAME_LEN
>> +err = xenbus_scanf(XBT_NIL, dev->otherend,
>> +   ring_ref_name, "%u", &ring_ref[i]);
> 
> No need to split lines unless they are > 100 chars (here and elsewhere).
> 

Not 82 chars?

>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 88e23fd..3d1c6fb 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -98,7 +98,17 @@ static unsigned int xen_blkif_max_segments = 32;
>>  module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
>>  MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests 
>> (default is 32)");
>>  
>> -#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
>> +static unsigned int xen_blkif_max_ring_order;
>> +module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, 
>> S_IRUGO);
>> +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used as 
>> the ring");
>> +
>> +#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE * 
>> info->nr_ring_pages)
> 
> Didn't we agreed that this macro should take a explicit info parameter?
> 

Do you mean define like this?
#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, PAGE_SIZE * 
info->nr_ring_pages)

Thanks,
-Bob



[Xen-devel] [PATCH v2 1/2] driver: xen-blkfront: move talk_to_blkback to a more suitable place

2015-05-21 Thread Bob Liu
The major responsibility of talk_to_blkback() is allocate and initialize
the request ring and write the ring info to xenstore.
But this work should be done after backend entered 'XenbusStateInitWait' as
defined in the protocol file.
See xen/include/public/io/blkif.h in XEN git tree:
Front                    Back
=====                    ====
XenbusStateInitialising  XenbusStateInitialising
 o Query virtual device   o Query backend device identification
   properties.  data.
 o Setup OS device instance.  o Open and validate backend device.
  o Publish backend features and
transport parameters.
 |
 |
 V
 XenbusStateInitWait

o Query backend features and
  transport parameters.
o Allocate and initialize the
  request ring.

There is no problem with this yet, but it is a violation of the design and
furthermore it would not allow frontend/backend to negotiate 'multi-page'
and 'multi-queue' features.

Changes in v2:
 - Re-write the commit message to be more clear.

Signed-off-by: Bob Liu 
Acked-by: Roger Pau Monné 
---
 drivers/block/xen-blkfront.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2c61cf8..88e23fd 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1430,13 +1430,6 @@ static int blkfront_probe(struct xenbus_device *dev,
info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
dev_set_drvdata(&dev->dev, info);
 
-   err = talk_to_blkback(dev, info);
-   if (err) {
-   kfree(info);
-   dev_set_drvdata(&dev->dev, NULL);
-   return err;
-   }
-
return 0;
 }
 
@@ -1906,8 +1899,13 @@ static void blkback_changed(struct xenbus_device *dev,
dev_dbg(&dev->dev, "blkfront:blkback_changed to state %d.\n", 
backend_state);
 
switch (backend_state) {
-   case XenbusStateInitialising:
case XenbusStateInitWait:
+   if (talk_to_blkback(dev, info)) {
+   kfree(info);
+   dev_set_drvdata(&dev->dev, NULL);
+   break;
+   }
+   case XenbusStateInitialising:
case XenbusStateInitialised:
case XenbusStateReconfiguring:
case XenbusStateReconfigured:
-- 
1.8.3.1




[Xen-devel] [PATCH v5 2/2] xen/block: add multi-page ring support

2015-05-21 Thread Bob Liu
Extend xen/block to support multi-page rings, so that more requests can be
issued by using more than one page as the request ring between blkfront
and the backend.
As a result, the performance can be improved significantly.

We got some impressive improvements on our high-end iSCSI storage cluster
backend. When using 64 pages as the ring, the IOPS increased about 15 times
in the throughput testing and more than doubled in the latency testing.

The reason is that the limit on outstanding requests is 32 when using a
one-page ring, but in our case the iSCSI LUN was spread across about 100
physical drives, and 32 was really not enough to keep them busy.

Changes in v2:
 - Rebased to 4.0-rc6.
 - Document how the multi-page ring feature works in linux io/blkif.h.

Changes in v3:
 - Remove changes to linux io/blkif.h and follow the protocol defined
   in io/blkif.h of XEN tree.
 - Rebased to 4.1-rc3

Changes in v4:
 - Turn to use 'ring-page-order' and 'max-ring-page-order'.
 - A few comments from Roger.

Changes in v5:
 - Clarify 4k granularity in the comment.
 - Address more comments from Roger.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/blkback.c |  13 
 drivers/block/xen-blkback/common.h  |   3 +-
 drivers/block/xen-blkback/xenbus.c  |  88 +--
 drivers/block/xen-blkfront.c| 135 +---
 4 files changed, 179 insertions(+), 60 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index 713fc9f..2126842 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -84,6 +84,13 @@ MODULE_PARM_DESC(max_persistent_grants,
  "Maximum number of grants to map persistently");
 
 /*
+ * Maximum order of pages to be used for the shared ring between front and
+ * backend, 4KB page granularity is used.
+ */
+unsigned int xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
+module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, 
S_IRUGO);
+MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used for 
the shared ring");
+/*
  * The LRU mechanism to clean the lists of persistent grants needs to
  * be executed periodically. The time interval between consecutive executions
  * of the purge mechanism is set in ms.
@@ -1438,6 +1445,12 @@ static int __init xen_blkif_init(void)
if (!xen_domain())
return -ENODEV;
 
+   if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER) {
+   pr_info("Invalid max_ring_order (%d), will use default max: 
%d.\n",
+   xen_blkif_max_ring_order, XENBUS_MAX_RING_PAGE_ORDER);
+   xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
+   }
+
rc = xen_blkif_interface_init();
if (rc)
goto failed_init;
diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index f620b5d..919a1ab 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -44,6 +44,7 @@
 #include 
 #include 
 
+extern unsigned int xen_blkif_max_ring_order;
 /*
  * This is the maximum number of segments that would be allowed in indirect
  * requests. This value will also be passed to the frontend.
@@ -248,7 +249,7 @@ struct backend_info;
 #define PERSISTENT_GNT_WAS_ACTIVE  1
 
 /* Number of requests that we can fit in a ring */
-#define XEN_BLKIF_REQS 32
+#define XEN_MAX_BLKIF_REQS (32 * XENBUS_MAX_RING_PAGES)
 
 struct persistent_gnt {
struct page *page;
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 6ab69ad..bc33888 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -25,6 +25,7 @@
 
 /* Enlarge the array size in order to fully show blkback name. */
 #define BLKBACK_NAME_LEN (20)
+#define RINGREF_NAME_LEN (20)
 
 struct backend_info {
struct xenbus_device*dev;
@@ -152,7 +153,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
INIT_LIST_HEAD(&blkif->pending_free);
INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);
 
-   for (i = 0; i < XEN_BLKIF_REQS; i++) {
+   for (i = 0; i < XEN_MAX_BLKIF_REQS; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);
if (!req)
goto fail;
@@ -198,8 +199,8 @@ fail:
return ERR_PTR(-ENOMEM);
 }
 
-static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
-unsigned int evtchn)
+static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
+unsigned int nr_grefs, unsigned int evtchn)
 {
int err;
 
@@ -207,7 +208,7 @@ static int xen_blkif_map(struct xen_blkif *blkif, 
grant_ref_t gref,
if (blkif->irq)
return 0;
 
-   err = xenbus_map_ring_valloc(blkif-&

Re: [Xen-devel] [PATCH v5 2/2] xen/block: add multi-page ring support

2015-05-22 Thread Bob Liu

On 05/22/2015 04:31 PM, Paul Durrant wrote:
>> -Original Message-
>> From: Bob Liu [mailto:bob@oracle.com]
>> Sent: 22 May 2015 01:00
>> To: xen-devel@lists.xen.org
>> Cc: David Vrabel; just...@spectralogic.com; konrad.w...@oracle.com; Roger
>> Pau Monne; Paul Durrant; Julien Grall; boris.ostrov...@oracle.com; linux-
>> ker...@vger.kernel.org; Bob Liu
>> Subject: [PATCH v5 2/2] xen/block: add multi-page ring support
>>
>> Extend xen/block to support multi-page ring, so that more requests can be
>> issued by using more than one pages as the request ring between blkfront
>> and backend.
>> As a result, the performance can get improved significantly.
>>
>> We got some impressive improvements on our highend iscsi storage cluster
>> backend. If using 64 pages as the ring, the IOPS increased about 15 times
>> for the throughput testing and above doubled for the latency testing.
>>
>> The reason was the limit on outstanding requests is 32 if use only one-page
>> ring, but in our case the iscsi lun was spread across about 100 physical
>> drives, 32 was really not enough to keep them busy.
>>
>> Changes in v2:
>>  - Rebased to 4.0-rc6.
>>  - Document on how multi-page ring feature working to linux io/blkif.h.
>>
>> Changes in v3:
>>  - Remove changes to linux io/blkif.h and follow the protocol defined
>>in io/blkif.h of XEN tree.
>>  - Rebased to 4.1-rc3
>>
>> Changes in v4:
>>  - Turn to use 'ring-page-order' and 'max-ring-page-order'.
>>  - A few comments from Roger.
>>
>> Changes in v5:
>>  - Clarify 4k granularity to comment.
>>  - Address more comments from Roger.
>>
>> Signed-off-by: Bob Liu 
>> ---
>>  drivers/block/xen-blkback/blkback.c |  13 
>>  drivers/block/xen-blkback/common.h  |   3 +-
>>  drivers/block/xen-blkback/xenbus.c  |  88 +--
>>  drivers/block/xen-blkfront.c| 135 +---
>> 
>>  4 files changed, 179 insertions(+), 60 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-
>> blkback/blkback.c
>> index 713fc9f..2126842 100644
>> --- a/drivers/block/xen-blkback/blkback.c
>> +++ b/drivers/block/xen-blkback/blkback.c
>> @@ -84,6 +84,13 @@ MODULE_PARM_DESC(max_persistent_grants,
>>   "Maximum number of grants to map persistently");
>>
>>  /*
>> + * Maximum order of pages to be used for the shared ring between front
>> and
>> + * backend, 4KB page granularity is used.
>> + */
>> +unsigned int xen_blkif_max_ring_order =
>> XENBUS_MAX_RING_PAGE_ORDER;
>> +module_param_named(max_ring_page_order, xen_blkif_max_ring_order,
>> int, S_IRUGO);
>> +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages
>> to be used for the shared ring");
>> +/*
>>   * The LRU mechanism to clean the lists of persistent grants needs to
>>   * be executed periodically. The time interval between consecutive
>> executions
>>   * of the purge mechanism is set in ms.
>> @@ -1438,6 +1445,12 @@ static int __init xen_blkif_init(void)
>>  if (!xen_domain())
>>  return -ENODEV;
>>
>> +if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER)
>> {
>> +pr_info("Invalid max_ring_order (%d), will use default max:
>> %d.\n",
>> +xen_blkif_max_ring_order,
>> XENBUS_MAX_RING_PAGE_ORDER);
>> +xen_blkif_max_ring_order =
>> XENBUS_MAX_RING_PAGE_ORDER;
>> +}
>> +
>>  rc = xen_blkif_interface_init();
>>  if (rc)
>>  goto failed_init;
>> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-
>> blkback/common.h
>> index f620b5d..919a1ab 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -44,6 +44,7 @@
>>  #include 
>>  #include 
>>
>> +extern unsigned int xen_blkif_max_ring_order;
>>  /*
>>   * This is the maximum number of segments that would be allowed in
>> indirect
>>   * requests. This value will also be passed to the frontend.
>> @@ -248,7 +249,7 @@ struct backend_info;
>>  #define PERSISTENT_GNT_WAS_ACTIVE   1
>>
>>  /* Number of requests that we can fit in a ring */
>> -#define XEN_BLKIF_REQS  32
>> +#define XEN_MAX_BLKIF_REQS  (32 *
>> XENBUS_MAX_RING_PAGES)
>>
>>  struct persistent_gnt {
>>  struct page *

[Xen-devel] [PATCH] drivers: xen-blkback: delay pending_req allocation to connect_ring

2015-05-25 Thread Bob Liu
In connect_ring, we know exactly how many pages are used for the shared
ring and also whether feature-persistent is enabled, so delay pending_req
allocation until then so that we don't waste too much memory.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/common.h |  3 +-
 drivers/block/xen-blkback/xenbus.c | 95 --
 2 files changed, 51 insertions(+), 47 deletions(-)

diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index 919a1ab..e1d605d 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -249,7 +249,7 @@ struct backend_info;
 #define PERSISTENT_GNT_WAS_ACTIVE  1
 
 /* Number of requests that we can fit in a ring */
-#define XEN_MAX_BLKIF_REQS (32 * XENBUS_MAX_RING_PAGES)
+#define XEN_BLKIF_REQS 32
 
 struct persistent_gnt {
struct page *page;
@@ -321,6 +321,7 @@ struct xen_blkif {
struct work_struct  free_work;
/* Thread shutdown wait queue. */
wait_queue_head_t   shutdown_wq;
+   unsigned int nr_ring_pages;
 };
 
 struct seg_buf {
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index bc33888..48336a3 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -125,8 +125,6 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
 static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 {
struct xen_blkif *blkif;
-   struct pending_req *req, *n;
-   int i, j;
 
BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
 
@@ -153,50 +151,11 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
INIT_LIST_HEAD(&blkif->pending_free);
INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);
 
-   for (i = 0; i < XEN_MAX_BLKIF_REQS; i++) {
-   req = kzalloc(sizeof(*req), GFP_KERNEL);
-   if (!req)
-   goto fail;
-   list_add_tail(&req->free_list,
- &blkif->pending_free);
-   for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
-   req->segments[j] = kzalloc(sizeof(*req->segments[0]),
-  GFP_KERNEL);
-   if (!req->segments[j])
-   goto fail;
-   }
-   for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
-   req->indirect_pages[j] = 
kzalloc(sizeof(*req->indirect_pages[0]),
-GFP_KERNEL);
-   if (!req->indirect_pages[j])
-   goto fail;
-   }
-   }
spin_lock_init(&blkif->pending_free_lock);
init_waitqueue_head(&blkif->pending_free_wq);
init_waitqueue_head(&blkif->shutdown_wq);
 
return blkif;
-
-fail:
-   list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
-   list_del(&req->free_list);
-   for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
-   if (!req->segments[j])
-   break;
-   kfree(req->segments[j]);
-   }
-   for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
-   if (!req->indirect_pages[j])
-   break;
-   kfree(req->indirect_pages[j]);
-   }
-   kfree(req);
-   }
-
-   kmem_cache_free(xen_blkif_cachep, blkif);
-
-   return ERR_PTR(-ENOMEM);
 }
 
 static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
@@ -313,7 +272,7 @@ static void xen_blkif_free(struct xen_blkif *blkif)
i++;
}
 
-   WARN_ON(i != XEN_MAX_BLKIF_REQS);
+   WARN_ON(i != XEN_BLKIF_REQS * blkif->nr_ring_pages);
 
kmem_cache_free(xen_blkif_cachep, blkif);
 }
@@ -868,9 +827,10 @@ static int connect_ring(struct backend_info *be)
struct xenbus_device *dev = be->dev;
unsigned int ring_ref[XENBUS_MAX_RING_PAGES];
unsigned int evtchn, nr_grefs, ring_page_order;
-   unsigned int pers_grants;
+   unsigned int pers_grants, i, j;
+   struct pending_req *req, *n;
char protocol[64] = "";
-   int err;
	int err, nr_indirect_pages, nr_segs;
 
pr_debug("%s %s\n", __func__, dev->otherend);
 
@@ -899,8 +859,6 @@ static int connect_ring(struct backend_info *be)
pr_info("%s:using single page: ring-ref %d\n", dev->otherend,
ring_ref[0]);
} else {
-   unsigned int i;
-
if (ring_page_order > xen_blkif_max_ring_order) {
err = -EINVAL;
xenb

[Xen-devel] [PATCH] drivers: xen-blkfront: blkif_recover: recheck feature-persistent

2015-05-25 Thread Bob Liu
When migrating from a !feature-persistent host to a feature-persistent host,
domU still thinks the new host/backend doesn't support persistent grants.
Dmesg like:
backed has not unmapped grant: 839
backed has not unmapped grant: 773
backed has not unmapped grant: 773
backed has not unmapped grant: 773
backed has not unmapped grant: 839

We should recheck whether the new backend supports feature-persistent during
blkif_recover().

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d3c1a95..cad4d8c 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1504,7 +1504,7 @@ static int blkif_recover(struct blkfront_info *info)
int i;
struct request *req, *n;
struct blk_shadow *copy;
-   int rc;
+   int rc, persistent;
struct bio *bio, *cloned_bio;
struct bio_list bio_list, merge_bio;
unsigned int segs, offset;
@@ -1525,6 +1525,14 @@ static int blkif_recover(struct blkfront_info *info)
info->shadow_free = info->ring.req_prod_pvt;
info->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fff;
 
+   /* Should check whether the new backend support feature-persistent */
+   rc = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+   "feature-persistent", "%u", &persistent,
+   NULL);
+   if (rc)
+   info->feature_persistent = 0;
+   else
+   info->feature_persistent = persistent;
rc = blkfront_setup_indirect(info);
if (rc) {
kfree(copy);
-- 
1.8.3.1




Re: [Xen-devel] [PATCH] drivers: xen-blkfront: blkif_recover: recheck feature-persistent

2015-06-01 Thread Bob Liu

On 06/01/2015 03:50 PM, Roger Pau Monné wrote:
> El 26/05/15 a les 2.11, Bob Liu ha escrit:
>> When migrate from !feature-persistent host to feature-persistent host, domU
>> still think new host/backend don't support persistent.
>> Dmesg like:
>> backed has not unmapped grant: 839
>> backed has not unmapped grant: 773
>> backed has not unmapped grant: 773
>> backed has not unmapped grant: 773
>> backed has not unmapped grant: 839
>>
>> We should recheck whether the new backend support feature-persistent during
>> blkif_recover().
> 
> Right, we recheck for indirect-descriptors but not persistent grants.
> 
> Do you think it makes sense to split the part of blkfront_connect that
> checks for optional features, like persistent grants, indirect
> descriptors and flush/barrier features to a separate function and call
> it from both blkfront_connect and blkif_recover?
> 

Yep, that would be better.

Thanks,
-Bob



Re: [Xen-devel] [PATCH] drivers: xen-blkback: delay pending_req allocation to connect_ring

2015-06-01 Thread Bob Liu

On 06/01/2015 04:36 PM, Roger Pau Monné wrote:
> El 26/05/15 a les 2.06, Bob Liu ha escrit:
>> In connect_ring, we can know exactly how many pages are used for the shared
>> ring and also whether feature-persistent is enabled, delay pending_req
>> allocation here so that we won't waste too much memory.
> 
> I would very much prefer for this to be a pre-patch for your multipage
> ring series. Do you think you can include it in the next iteration?
> 

I think it's unnecessary to send a new iteration if there aren't any new
comments about the multi-page ring series.

This patch is meaningful only after the multi-page ring series; otherwise
there is no difference (no memory can be saved).
So I think it's fine for this patch to come after the multi-page ring series.

>> Signed-off-by: Bob Liu 
>> ---
>>  drivers/block/xen-blkback/common.h |  3 +-
>>  drivers/block/xen-blkback/xenbus.c | 95 
>> --
>>  2 files changed, 51 insertions(+), 47 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/common.h 
>> b/drivers/block/xen-blkback/common.h
>> index 919a1ab..e1d605d 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -249,7 +249,7 @@ struct backend_info;
>>  #define PERSISTENT_GNT_WAS_ACTIVE   1
>>  
>>  /* Number of requests that we can fit in a ring */
>> -#define XEN_MAX_BLKIF_REQS  (32 * XENBUS_MAX_RING_PAGES)
>> +#define XEN_BLKIF_REQS  32
> 
> This should be XEN_BLKIF_REQS_PER_PAGE (or a similar name of your choice
> that reflects that those are the number of requests per ring page).
> 
>>  
>>  struct persistent_gnt {
>>  struct page *page;
>> @@ -321,6 +321,7 @@ struct xen_blkif {
>>  struct work_struct  free_work;
>>  /* Thread shutdown wait queue. */
>>  wait_queue_head_t   shutdown_wq;
>> +unsigned int nr_ring_pages;
>>  };
>>  
>>  struct seg_buf {
>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>> b/drivers/block/xen-blkback/xenbus.c
>> index bc33888..48336a3 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -125,8 +125,6 @@ static void xen_update_blkif_status(struct xen_blkif 
>> *blkif)
>>  static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>>  {
>>  struct xen_blkif *blkif;
>> -struct pending_req *req, *n;
>> -int i, j;
>>  
>>  BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
>>  
>> @@ -153,50 +151,11 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>>  INIT_LIST_HEAD(&blkif->pending_free);
>>  INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);
>>  
>> -for (i = 0; i < XEN_MAX_BLKIF_REQS; i++) {
>> -req = kzalloc(sizeof(*req), GFP_KERNEL);
>> -if (!req)
>> -goto fail;
>> -list_add_tail(&req->free_list,
>> -  &blkif->pending_free);
>> -for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
>> -req->segments[j] = kzalloc(sizeof(*req->segments[0]),
>> -   GFP_KERNEL);
>> -if (!req->segments[j])
>> -goto fail;
>> -}
>> -for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
>> -req->indirect_pages[j] = 
>> kzalloc(sizeof(*req->indirect_pages[0]),
>> - GFP_KERNEL);
>> -if (!req->indirect_pages[j])
>> -goto fail;
>> -}
>> -}
>>  spin_lock_init(&blkif->pending_free_lock);
>>  init_waitqueue_head(&blkif->pending_free_wq);
>>  init_waitqueue_head(&blkif->shutdown_wq);
>>  
>>  return blkif;
>> -
>> -fail:
>> -list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
>> -list_del(&req->free_list);
>> -for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
>> -if (!req->segments[j])
>> -break;
>> -kfree(req->segments[j]);
>> -}
>> -for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
>> -if (!req->indirect_pages[j])
>> -break;
>> -kfree(req->indir

[Xen-devel] [PATCH 3/3] xen/block: add multi-page ring support

2015-06-02 Thread Bob Liu
Extend xen/block to support multi-page rings, so that more requests can be
issued by using more than one page as the request ring between blkfront
and the backend.
As a result, the performance can be improved significantly.

We got some impressive improvements on our high-end iSCSI storage cluster
backend. When using 64 pages as the ring, the IOPS increased about 15 times
in the throughput testing and more than doubled in the latency testing.

The reason is that the limit on outstanding requests is 32 when using a
one-page ring, but in our case the iSCSI LUN was spread across about 100
physical drives, and 32 was really not enough to keep them busy.

Changes in v2:
 - Rebased to 4.0-rc6.
 - Document how the multi-page ring feature works in linux io/blkif.h.

Changes in v3:
 - Remove changes to linux io/blkif.h and follow the protocol defined
   in io/blkif.h of XEN tree.
 - Rebased to 4.1-rc3

Changes in v4:
 - Turn to use 'ring-page-order' and 'max-ring-page-order'.
 - A few comments from Roger.

Changes in v5:
 - Clarify 4k granularity in the comment
 - Address more comments from Roger

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/blkback.c |   13 
 drivers/block/xen-blkback/common.h  |2 +
 drivers/block/xen-blkback/xenbus.c  |   89 +--
 drivers/block/xen-blkfront.c|  135 +--
 4 files changed, 180 insertions(+), 59 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index 713fc9f..2126842 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -84,6 +84,13 @@ MODULE_PARM_DESC(max_persistent_grants,
  "Maximum number of grants to map persistently");
 
 /*
+ * Maximum order of pages to be used for the shared ring between front and
+ * backend, 4KB page granularity is used.
+ */
+unsigned int xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
+module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, 
S_IRUGO);
+MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used for 
the shared ring");
+/*
  * The LRU mechanism to clean the lists of persistent grants needs to
  * be executed periodically. The time interval between consecutive executions
  * of the purge mechanism is set in ms.
@@ -1438,6 +1445,12 @@ static int __init xen_blkif_init(void)
if (!xen_domain())
return -ENODEV;
 
+   if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER) {
+   pr_info("Invalid max_ring_order (%d), will use default max: 
%d.\n",
+   xen_blkif_max_ring_order, XENBUS_MAX_RING_PAGE_ORDER);
+   xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
+   }
+
rc = xen_blkif_interface_init();
if (rc)
goto failed_init;
diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index 043f13b..8ccc49d 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -44,6 +44,7 @@
 #include 
 #include 
 
+extern unsigned int xen_blkif_max_ring_order;
 /*
  * This is the maximum number of segments that would be allowed in indirect
  * requests. This value will also be passed to the frontend.
@@ -320,6 +321,7 @@ struct xen_blkif {
struct work_struct  free_work;
/* Thread shutdown wait queue. */
wait_queue_head_t   shutdown_wq;
+   unsigned int nr_ring_pages;
 };
 
 struct seg_buf {
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index c212d41..deb3f00 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -25,6 +25,7 @@
 
 /* Enlarge the array size in order to fully show blkback name. */
 #define BLKBACK_NAME_LEN (20)
+#define RINGREF_NAME_LEN (20)
 
 struct backend_info {
struct xenbus_device*dev;
@@ -156,8 +157,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
return blkif;
 }
 
-static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
-unsigned int evtchn)
+static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
+unsigned int nr_grefs, unsigned int evtchn)
 {
int err;
 
@@ -165,7 +166,7 @@ static int xen_blkif_map(struct xen_blkif *blkif, 
grant_ref_t gref,
if (blkif->irq)
return 0;
 
-   err = xenbus_map_ring_valloc(blkif->be->dev, &gref, 1,
+   err = xenbus_map_ring_valloc(blkif->be->dev, gref, nr_grefs,
 &blkif->blk_ring);
if (err < 0)
return err;
@@ -175,21 +176,21 @@ static int xen_blkif_map(struct xen_blkif *blkif, 
grant_ref_t gref,
{
struct blkif_sring *sring;
sring = (struct blkif_sring *)blkif->blk_ring;
-   BACK_RING_IN

[Xen-devel] [PATCH 1/3] drivers: xen-blkback: delay pending_req allocation to connect_ring

2015-06-02 Thread Bob Liu
This is a pre-patch for the multi-page ring feature.
In connect_ring, we know exactly how many pages are used for the shared
ring, so delay pending_req allocation until then so that we won't waste too
much memory.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkback/common.h |2 +-
 drivers/block/xen-blkback/xenbus.c |   82 +---
 2 files changed, 39 insertions(+), 45 deletions(-)

diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index f620b5d..043f13b 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -248,7 +248,7 @@ struct backend_info;
 #define PERSISTENT_GNT_WAS_ACTIVE  1
 
 /* Number of requests that we can fit in a ring */
-#define XEN_BLKIF_REQS 32
+#define XEN_BLKIF_REQS_PER_PAGE32
 
 struct persistent_gnt {
struct page *page;
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 6ab69ad..c212d41 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -124,8 +124,6 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
 static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 {
struct xen_blkif *blkif;
-   struct pending_req *req, *n;
-   int i, j;
 
BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
 
@@ -151,51 +149,11 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 
INIT_LIST_HEAD(&blkif->pending_free);
INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);
-
-   for (i = 0; i < XEN_BLKIF_REQS; i++) {
-   req = kzalloc(sizeof(*req), GFP_KERNEL);
-   if (!req)
-   goto fail;
-   list_add_tail(&req->free_list,
- &blkif->pending_free);
-   for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
-   req->segments[j] = kzalloc(sizeof(*req->segments[0]),
-  GFP_KERNEL);
-   if (!req->segments[j])
-   goto fail;
-   }
-   for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
-   req->indirect_pages[j] = 
kzalloc(sizeof(*req->indirect_pages[0]),
-GFP_KERNEL);
-   if (!req->indirect_pages[j])
-   goto fail;
-   }
-   }
spin_lock_init(&blkif->pending_free_lock);
init_waitqueue_head(&blkif->pending_free_wq);
init_waitqueue_head(&blkif->shutdown_wq);
 
return blkif;
-
-fail:
-   list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
-   list_del(&req->free_list);
-   for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
-   if (!req->segments[j])
-   break;
-   kfree(req->segments[j]);
-   }
-   for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
-   if (!req->indirect_pages[j])
-   break;
-   kfree(req->indirect_pages[j]);
-   }
-   kfree(req);
-   }
-
-   kmem_cache_free(xen_blkif_cachep, blkif);
-
-   return ERR_PTR(-ENOMEM);
 }
 
 static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
@@ -312,7 +270,7 @@ static void xen_blkif_free(struct xen_blkif *blkif)
i++;
}
 
-   WARN_ON(i != XEN_BLKIF_REQS);
+   WARN_ON(i != XEN_BLKIF_REQS_PER_PAGE);
 
kmem_cache_free(xen_blkif_cachep, blkif);
 }
@@ -864,7 +822,8 @@ static int connect_ring(struct backend_info *be)
unsigned int evtchn;
unsigned int pers_grants;
char protocol[64] = "";
-   int err;
+   struct pending_req *req, *n;
+   int err, i, j;
 
pr_debug("%s %s\n", __func__, dev->otherend);
 
@@ -905,6 +864,24 @@ static int connect_ring(struct backend_info *be)
ring_ref, evtchn, be->blkif->blk_protocol, protocol,
pers_grants ? "persistent grants" : "");
 
+   for (i = 0; i < XEN_BLKIF_REQS_PER_PAGE; i++) {
+   req = kzalloc(sizeof(*req), GFP_KERNEL);
+   if (!req)
+   goto fail;
+   list_add_tail(&req->free_list, &be->blkif->pending_free);
+   for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
+   req->segments[j] = kzalloc(sizeof(*req->segments[0]), 
GFP_KERNEL);
+   if (!req->segments[j])
+   goto fail;
+   }
+   for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
+  

[Xen-devel] [PATCH 2/3] driver: xen-blkfront: move talk_to_blkback to a more suitable place

2015-06-02 Thread Bob Liu
The major responsibility of talk_to_blkback() is allocate and initialize
the request ring and write the ring info to xenstore.
But this work should be done after backend entered 'XenbusStateInitWait' as
defined in the protocol file.
See xen/include/public/io/blkif.h in XEN git tree:
Front                    Back
=====                    ====
XenbusStateInitialising  XenbusStateInitialising
 o Query virtual device   o Query backend device identification
   properties.  data.
 o Setup OS device instance.  o Open and validate backend device.
  o Publish backend features and
transport parameters.
 |
 |
 V
 XenbusStateInitWait

o Query backend features and
  transport parameters.
o Allocate and initialize the
  request ring.

There is no problem with this yet, but it is a violation of the design and
furthermore it would not allow frontend/backend to negotiate 'multi-page'
and 'multi-queue' features.

Changes in v2:
 - Re-write the commit message to be more clear.

Signed-off-by: Bob Liu 
Acked-by: Roger Pau Monné 
---
 drivers/block/xen-blkfront.c |   14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2c61cf8..88e23fd 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1430,13 +1430,6 @@ static int blkfront_probe(struct xenbus_device *dev,
info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
dev_set_drvdata(&dev->dev, info);
 
-   err = talk_to_blkback(dev, info);
-   if (err) {
-   kfree(info);
-   dev_set_drvdata(&dev->dev, NULL);
-   return err;
-   }
-
return 0;
 }
 
@@ -1906,8 +1899,13 @@ static void blkback_changed(struct xenbus_device *dev,
dev_dbg(&dev->dev, "blkfront:blkback_changed to state %d.\n", 
backend_state);
 
switch (backend_state) {
-   case XenbusStateInitialising:
case XenbusStateInitWait:
+   if (talk_to_blkback(dev, info)) {
+   kfree(info);
+   dev_set_drvdata(&dev->dev, NULL);
+   break;
+   }
+   case XenbusStateInitialising:
case XenbusStateInitialised:
case XenbusStateReconfiguring:
case XenbusStateReconfigured:
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH] xen-blkfront: save uncompleted reqs in blkfront_resume()

2016-06-27 Thread Bob Liu
Uncompleted reqs used to be 'saved and resubmitted' in blkfront_recover() during
migration, but that's too late now that multi-queue has been introduced.

After a migration to another host (which may not have multiqueue support), the
number of rings (block hardware queues) may change, and the ring and shadow
structures will also be reallocated.
As a result, blkfront_recover() can't 'save and resubmit' the real uncompleted
reqs, because the shadow structure has been reallocated by then.

This patch fixes the issue by moving the 'save and resubmit' logic out of
blkfront_recover() to an earlier place: blkfront_resume().
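
In rough terms, the 'save' step that moves into blkfront_resume() walks every
ring's shadow state before the rings are torn down and stashes the in-flight
bios and requests in the new blkfront_info fields. A minimal sketch
(illustrative only; the helper name blkfront_save_inflight is made up, and the
real patch open-codes this logic):

```c
/* Sketch, not the exact patch: save in-flight work before the rings and
 * shadow state are reallocated for the (possibly different) new backend. */
static void blkfront_save_inflight(struct blkfront_info *info)
{
	unsigned int i, j;

	bio_list_init(&info->bio_list);
	INIT_LIST_HEAD(&info->requests);

	for (i = 0; i < info->nr_rings; i++) {
		struct blkfront_ring_info *rinfo = &info->rinfo[i];

		for (j = 0; j < BLK_RING_SIZE(info); j++) {
			struct request *req = rinfo->shadow[j].request;
			struct bio_list merge_bio;

			if (!req)
				continue;

			if (req->cmd_flags &
			    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
				/* Flush/discard reqs carry no bios, so the
				 * whole request must be requeued later. */
				list_add(&req->queuelist, &info->requests);
				continue;
			}

			/* Steal the bios so blkif_recover() can resubmit
			 * them once the new rings are connected. */
			merge_bio.head = req->bio;
			merge_bio.tail = req->biotail;
			bio_list_merge(&info->bio_list, &merge_bio);
			req->bio = NULL;
			blk_end_request_all(req, 0);
		}
	}
}
```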

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 91 +++-
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2e6d1e9..fcc5b4e 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -207,6 +207,9 @@ struct blkfront_info
struct blk_mq_tag_set tag_set;
struct blkfront_ring_info *rinfo;
unsigned int nr_rings;
+   /* Save uncomplete reqs and bios for migration. */
+   struct list_head requests;
+   struct bio_list bio_list;
 };
 
 static unsigned int nr_minors;
@@ -2002,69 +2005,22 @@ static int blkif_recover(struct blkfront_info *info)
 {
unsigned int i, r_index;
struct request *req, *n;
-   struct blk_shadow *copy;
int rc;
struct bio *bio, *cloned_bio;
-   struct bio_list bio_list, merge_bio;
unsigned int segs, offset;
int pending, size;
struct split_bio *split_bio;
-   struct list_head requests;
 
blkfront_gather_backend_features(info);
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
blk_queue_max_segments(info->rq, segs);
-   bio_list_init(&bio_list);
-   INIT_LIST_HEAD(&requests);
 
for (r_index = 0; r_index < info->nr_rings; r_index++) {
-   struct blkfront_ring_info *rinfo;
-
-   rinfo = &info->rinfo[r_index];
-   /* Stage 1: Make a safe copy of the shadow state. */
-   copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
-  GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
-   if (!copy)
-   return -ENOMEM;
-
-   /* Stage 2: Set up free list. */
-   memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
-   for (i = 0; i < BLK_RING_SIZE(info); i++)
-   rinfo->shadow[i].req.u.rw.id = i+1;
-   rinfo->shadow_free = rinfo->ring.req_prod_pvt;
-   rinfo->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fff;
+   struct blkfront_ring_info *rinfo = &info->rinfo[r_index];
 
rc = blkfront_setup_indirect(rinfo);
-   if (rc) {
-   kfree(copy);
+   if (rc)
return rc;
-   }
-
-   for (i = 0; i < BLK_RING_SIZE(info); i++) {
-   /* Not in use? */
-   if (!copy[i].request)
-   continue;
-
-   /*
-* Get the bios in the request so we can re-queue them.
-*/
-   if (copy[i].request->cmd_flags &
-   (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
-   /*
-* Flush operations don't contain bios, so
-* we need to requeue the whole request
-*/
-   list_add(&copy[i].request->queuelist, &requests);
-   continue;
-   }
-   merge_bio.head = copy[i].request->bio;
-   merge_bio.tail = copy[i].request->biotail;
-   bio_list_merge(&bio_list, &merge_bio);
-   copy[i].request->bio = NULL;
-   blk_end_request_all(copy[i].request, 0);
-   }
-
-   kfree(copy);
}
xenbus_switch_state(info->xbdev, XenbusStateConnected);
 
@@ -2079,7 +2035,7 @@ static int blkif_recover(struct blkfront_info *info)
kick_pending_request_queues(rinfo);
}
 
-   list_for_each_entry_safe(req, n, &requests, queuelist) {
+   list_for_each_entry_safe(req, n, &info->requests, queuelist) {
/* Requeue pending requests (flush or discard) */
list_del_init(&req->queuelist);
BUG_ON(req->nr_phys_segments > segs);
@@ -2087,7 +2043,7 @@ static int blkif_recover(struct blkfront_info *info)
}
blk_mq_kick_requeue_list(inf

Re: [Xen-devel] [PATCH] xen-blkfront: save uncompleted reqs in blkfront_resume()

2016-06-27 Thread Bob Liu

On 06/27/2016 04:33 PM, Bob Liu wrote:
> Uncompleted reqs used to be 'saved and resubmitted' in blkfront_recover() 
> during
> migration, but that's too later after multi-queue introduced.
> 
> After a migrate to another host (which may not have multiqueue support), the
> number of rings (block hardware queues) may be changed and the ring and shadow
> structure will also be reallocated.
> So that blkfront_recover() can't 'save and resubmit' the real uncompleted reqs
> because shadow structure has been reallocated.
> 
> This patch fixes this issue by moving the 'save and resubmit' logic out of

Fix: only the 'save' logic is moved to an earlier place, blkfront_resume();
the 'resubmit' logic is unchanged and still lives in blkfront_recover().

> blkfront_recover() to earlier place:blkfront_resume().
> 
> Signed-off-by: Bob Liu 
> ---
>  drivers/block/xen-blkfront.c | 91 
> +++-
>  1 file changed, 40 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 2e6d1e9..fcc5b4e 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -207,6 +207,9 @@ struct blkfront_info
>   struct blk_mq_tag_set tag_set;
>   struct blkfront_ring_info *rinfo;
>   unsigned int nr_rings;
> + /* Save uncomplete reqs and bios for migration. */
> + struct list_head requests;
> + struct bio_list bio_list;
>  };
>  
>  static unsigned int nr_minors;
> @@ -2002,69 +2005,22 @@ static int blkif_recover(struct blkfront_info *info)
>  {
>   unsigned int i, r_index;
>   struct request *req, *n;
> - struct blk_shadow *copy;
>   int rc;
>   struct bio *bio, *cloned_bio;
> - struct bio_list bio_list, merge_bio;
>   unsigned int segs, offset;
>   int pending, size;
>   struct split_bio *split_bio;
> - struct list_head requests;
>  
>   blkfront_gather_backend_features(info);
>   segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
>   blk_queue_max_segments(info->rq, segs);
> - bio_list_init(&bio_list);
> - INIT_LIST_HEAD(&requests);
>  
>   for (r_index = 0; r_index < info->nr_rings; r_index++) {
> - struct blkfront_ring_info *rinfo;
> -
> - rinfo = &info->rinfo[r_index];
> - /* Stage 1: Make a safe copy of the shadow state. */
> - copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
> -GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
> - if (!copy)
> - return -ENOMEM;
> -
> - /* Stage 2: Set up free list. */
> - memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
> - for (i = 0; i < BLK_RING_SIZE(info); i++)
> - rinfo->shadow[i].req.u.rw.id = i+1;
> - rinfo->shadow_free = rinfo->ring.req_prod_pvt;
> - rinfo->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fff;
> + struct blkfront_ring_info *rinfo = &info->rinfo[r_index];
>  
>   rc = blkfront_setup_indirect(rinfo);
> - if (rc) {
> - kfree(copy);
> + if (rc)
>   return rc;
> - }
> -
> - for (i = 0; i < BLK_RING_SIZE(info); i++) {
> - /* Not in use? */
> - if (!copy[i].request)
> - continue;
> -
> - /*
> -  * Get the bios in the request so we can re-queue them.
> -  */
> - if (copy[i].request->cmd_flags &
> - (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
> - /*
> -  * Flush operations don't contain bios, so
> -  * we need to requeue the whole request
> -  */
> - list_add(&copy[i].request->queuelist, &requests);
> - continue;
> - }
> - merge_bio.head = copy[i].request->bio;
> - merge_bio.tail = copy[i].request->biotail;
> - bio_list_merge(&bio_list, &merge_bio);
> - copy[i].request->bio = NULL;
> - blk_end_request_all(copy[i].request, 0);
> - }
> -
> - kfree(copy);
>   }
>   xenbus_switch_state(info->xbdev, XenbusStateConnected);
>

Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-11 Thread Bob Liu


On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
> On 06.06.2016 11:42, Dario Faggioli wrote:
>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>
> 
> Ping.
> 
> Any suggestions how to debug this or what might cause the problem?
> 
> Obviously, we cannot control Xen on the Amazon's servers. But perhaps there 
> is something we can do at the kernel's side, is it?
> 
>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>> (Resending this bug report because the message I sent last week did
>>> not
>>> make it to the mailing list somehow.)
>>>
>>> Hi,
>>>
>>> One of our users gets kernel panics from time to time when he tries
>>> to
>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>> happens within minutes from the moment the instance starts. The
>>> problem
>>> does not show up every time, however.
>>>
>>> The user first observed the problem with a custom kernel, but it was
>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>> CentOS7 was affected as well.

Please try this patch:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc

Regards,
Bob

>>>
>>> The part of the system log he was able to retrieve is attached. Here
>>> is
>>> the bug info, for convenience:
>>>
>>> 
>>> [2.246912] kernel BUG at drivers/block/xen-blkfront.c:1711!
>>> [2.246912] invalid opcode:  [#1] SMP
>>> [2.246912] Modules linked in: ata_generic pata_acpi
>>> crct10dif_pclmul
>>> crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel
>>> xen_netfront xen_blkfront(+) aesni_intel lrw ata_piix gf128mul
>>> glue_helper ablk_helper cryptd libata serio_raw floppy sunrpc
>>> dm_mirror
>>> dm_region_hash dm_log dm_mod scsi_transport_iscsi
>>> [2.246912] CPU: 1 PID: 50 Comm: xenwatch Not tainted
>>> 3.10.0-327.18.2.el7.x86_64 #1
>>> [2.246912] Hardware name: Xen HVM domU, BIOS 4.2.amazon
>>> 12/07/2015
>>> [2.246912] task: 8800e9fcb980 ti: 8800e98bc000 task.ti:
>>> 8800e98bc000
>>> [2.246912] RIP: 0010:[]  []
>>> blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
>>> [2.246912] RSP: 0018:8800e98bfcd0  EFLAGS: 00010283
>>> [2.246912] RAX: 8800353e15c0 RBX: 8800e98c52c8 RCX:
>>> 0020
>>> [2.246912] RDX: 8800353e15b0 RSI: 8800e98c52b8 RDI:
>>> 8800353e15d0
>>> [2.246912] RBP: 8800e98bfd20 R08: 8800353e15b0 R09:
>>> 8800eb403c00
>>> [2.246912] R10: a0155532 R11: ffe8 R12:
>>> 8800e98c4000
>>> [2.246912] R13: 8800e98c52b8 R14: 0020 R15:
>>> 8800353e15c0
>>> [2.246912] FS:  () GS:8800efc2()
>>> knlGS:
>>> [2.246912] CS:  0010 DS:  ES:  CR0: 80050033
>>> [2.246912] CR2: 7f1b615ef000 CR3: e2b44000 CR4:
>>> 001406e0
>>> [2.246912] DR0:  DR1:  DR2:
>>> 
>>> [2.246912] DR3:  DR6: 0ff0 DR7:
>>> 0400
>>> [2.246912] Stack:
>>> [2.246912]  0020 0001 0020a0157217
>>> 0100e98bfdbc
>>> [2.246912]  27efa3ef 8800e98bfdbc 8800e98ce000
>>> 8800e98c4000
>>> [2.246912]  8800e98ce040 0001 8800e98bfe08
>>> a0155d4c
>>> [2.246912] Call Trace:
>>> [2.246912]  [] blkback_changed+0x4ec/0xfc8
>>> [xen_blkfront]
>>> [2.246912]  [] ? xenbus_gather+0x170/0x190
>>> [2.246912]  [] ? __slab_free+0x10e/0x277
>>> [2.246912]  []
>>> xenbus_otherend_changed+0xad/0x110
>>> [2.246912]  [] ? xenwatch_thread+0x77/0x180
>>> [2.246912]  [] backend_changed+0x13/0x20
>>> [2.246912]  [] xenwatch_thread+0x66/0x180
>>> [2.246912]  [] ? wake_up_atomic_t+0x30/0x30
>>> [2.246912]  [] ?
>>> unregister_xenbus_watch+0x1f0/0x1f0
>>> [2.246912]  [] kthread+0xcf/0xe0
>>> [2.246912]  [] ?
>>> kthread_create_on_node+0x140/0x140
>>> [2.246912]  [] ret_from_fork+0x58/0x90
>>> [2.246912]  [] ?
>>> kthread_create_on_node+0x140/0x140
>>> [2.246912] Code: e1 48 85 c0 75 ce 49 8d 84 24 40 01 00 00 48 89
>>> 45
>>> b8 e9 91 fd ff ff 4c 89 ff e8 8d ae 06 e1 e9 f2 fc ff ff 31 c0 e9 2e
>>> fe
>>> ff ff <0f> 0b e8 9a 57 f2 e0 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44
>>> 00
>>> [2.246912] RIP  []
>>> blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
>>> [2.246912]  RSP 
>>> [2.491574] ---[ end trace 8a9b992812627c71 ]---
>>> [2.495618] Kernel panic - not syncing: Fatal exception
>>> 
>>>
>>> Xen version 4.2.
>>>
>>> EC2 instance type: c3.large with EBS magnetic storage, if that
>>> matters.
>>>
>>> Here is the code where the BUG_ON triggers (drivers/block/xen-
>>> blkfront.c):
>>> 
>>> if (!info->feature_persistent && info->max_indirect_segments) {
>>>   /

Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-14 Thread Bob Liu

On 07/14/2016 07:49 PM, Evgenii Shatokhin wrote:
> On 11.07.2016 15:04, Bob Liu wrote:
>>
>>
>> On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
>>> On 06.06.2016 11:42, Dario Faggioli wrote:
>>>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>>>
>>>
>>> Ping.
>>>
>>> Any suggestions how to debug this or what might cause the problem?
>>>
>>> Obviously, we cannot control Xen on the Amazon's servers. But perhaps there 
>>> is something we can do at the kernel's side, is it?
>>>
>>>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>>>> (Resending this bug report because the message I sent last week did
>>>>> not
>>>>> make it to the mailing list somehow.)
>>>>>
>>>>> Hi,
>>>>>
>>>>> One of our users gets kernel panics from time to time when he tries
>>>>> to
>>>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>>>> happens within minutes from the moment the instance starts. The
>>>>> problem
>>>>> does not show up every time, however.
>>>>>
>>>>> The user first observed the problem with a custom kernel, but it was
>>>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>>>> CentOS7 was affected as well.
>>
>> Please try this patch:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc
>>
>> Regards,
>> Bob
>>
> 
> Unfortunately, it did not help. The same BUG_ON() in 
> blkfront_setup_indirect() still triggers in our kernel based on RHEL's 
> 3.10.0-327.18.2, where I added the patch.
> 
> As far as I can see, the patch makes sure the indirect pages are added to the 
> list only if (!info->feature_persistent) holds. I suppose it holds in our 
> case and the pages are added to the list because the triggered BUG_ON() is 
> here:
> 
> if (!info->feature_persistent && info->max_indirect_segments) {
> <...>
> BUG_ON(!list_empty(&info->indirect_pages));
> <...>
> }
> 

That's odd.
Could you please try to reproduce this issue with a recent upstream kernel?

Thanks,
Bob

> So the problem is still out there somewhere, it seems.
> 
> Regards,
> Evgenii
> 
>>>>>
>>>>> The part of the system log he was able to retrieve is attached. Here
>>>>> is
>>>>> the bug info, for convenience:
>>>>>
>>>>> 
>>>>> [2.246912] kernel BUG at drivers/block/xen-blkfront.c:1711!
>>>>> [2.246912] invalid opcode:  [#1] SMP
>>>>> [2.246912] Modules linked in: ata_generic pata_acpi
>>>>> crct10dif_pclmul
>>>>> crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel
>>>>> xen_netfront xen_blkfront(+) aesni_intel lrw ata_piix gf128mul
>>>>> glue_helper ablk_helper cryptd libata serio_raw floppy sunrpc
>>>>> dm_mirror
>>>>> dm_region_hash dm_log dm_mod scsi_transport_iscsi
>>>>> [2.246912] CPU: 1 PID: 50 Comm: xenwatch Not tainted
>>>>> 3.10.0-327.18.2.el7.x86_64 #1
>>>>> [2.246912] Hardware name: Xen HVM domU, BIOS 4.2.amazon
>>>>> 12/07/2015
>>>>> [2.246912] task: 8800e9fcb980 ti: 8800e98bc000 task.ti:
>>>>> 8800e98bc000
>>>>> [2.246912] RIP: 0010:[]  []
>>>>> blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
>>>>> [2.246912] RSP: 0018:8800e98bfcd0  EFLAGS: 00010283
>>>>> [2.246912] RAX: 8800353e15c0 RBX: 8800e98c52c8 RCX:
>>>>> 0020
>>>>> [2.246912] RDX: 8800353e15b0 RSI: 8800e98c52b8 RDI:
>>>>> 8800353e15d0
>>>>> [2.246912] RBP: 8800e98bfd20 R08: 8800353e15b0 R09:
>>>>> 8800eb403c00
>>>>> [2.246912] R10: a0155532 R11: ffe8 R12:
>>>>> 8800e98c4000
>>>>> [2.246912] R13: 8800e98c52b8 R14: 0020 R15:
>>>>> 8800353e15c0
>>>>> [2.246912] FS:  () GS:8800efc2()
>>>>> knlGS:
>>>>> [2.246912] CS:  0010 DS:  ES:  CR0: 80050033
>>

[Xen-devel] [PATCH 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-15 Thread Bob Liu
Two places didn't get updated when 64KB page granularity was introduced; this
patch fixes them.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index fcc5b4e..032fc94 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1321,7 +1321,7 @@ free_shadow:
rinfo->ring_ref[i] = GRANT_INVALID_REF;
}
}
-   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * PAGE_SIZE));
+   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
rinfo->ring.sring = NULL;
 
if (rinfo->irq)
@@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
 
blkfront_gather_backend_features(info);
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
-   blk_queue_max_segments(info->rq, segs);
+   blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
 
for (r_index = 0; r_index < info->nr_rings; r_index++) {
struct blkfront_ring_info *rinfo = &info->rinfo[r_index];
-- 
2.7.4




[Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-15 Thread Bob Liu
The current VBD layer reserves buffer space for each attached device based on
three statically configured settings which are read at boot time.
 * max_indirect_segs: Maximum number of segments.
 * max_ring_page_order: Maximum order of pages to be used for the shared ring.
 * max_queues: Maximum number of queues (rings) to be used.

But the storage backend, workload, and guest memory result in very different
tuning requirements. It's impossible to centrally predict application
characteristics, so it's best to allow the settings to be adjusted dynamically
based on the workload inside the guest.

Usage:
Show current values:
cat /sys/devices/vbd-xxx/max_indirect_segs
cat /sys/devices/vbd-xxx/max_ring_page_order
cat /sys/devices/vbd-xxx/max_queues

Write new values:
echo <num> > /sys/devices/vbd-xxx/max_indirect_segs
echo <num> > /sys/devices/vbd-xxx/max_ring_page_order
echo <num> > /sys/devices/vbd-xxx/max_queues
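
As a sketch of the sysfs side (function names and the reconnect sequence here
are illustrative; the real patch's reconfiguration path is more careful and
waits for the backend to finish closing), one of the three attributes could
look like:

```c
static ssize_t max_queues_show(struct device *dev,
			       struct device_attribute *attr, char *buf)
{
	struct blkfront_info *info = dev_get_drvdata(dev);

	return sprintf(buf, "%u\n", info->nr_rings);
}

static ssize_t max_queues_store(struct device *dev,
				struct device_attribute *attr,
				const char *buf, size_t count)
{
	struct blkfront_info *info = dev_get_drvdata(dev);
	unsigned int new_queues;
	int err;

	err = kstrtouint(buf, 10, &new_queues);
	if (err)
		return err;

	device_lock(dev);
	/* Record the request; it takes effect when the frontend
	 * reconnects and renegotiates with the backend. */
	info->new_max_queues = new_queues;
	info->reconfiguring = 1;
	xenbus_switch_state(info->xbdev, XenbusStateClosed);
	device_unlock(dev);

	return count;
}
static DEVICE_ATTR_RW(max_queues);
```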

Signed-off-by: Bob Liu 
--
v2: Add device lock and other comments from Konrad.
---
 drivers/block/xen-blkfront.c | 285 ++-
 1 file changed, 283 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 10f46a8..9a5ed22 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -212,6 +213,11 @@ struct blkfront_info
/* Save uncomplete reqs and bios for migration. */
struct list_head requests;
struct bio_list bio_list;
+   /* For dynamic configuration. */
+   unsigned int reconfiguring:1;
+   int new_max_indirect_segments;
+   int new_max_ring_page_order;
+   int new_max_queues;
 };
 
 static unsigned int nr_minors;
@@ -1350,6 +1356,31 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
for (i = 0; i < info->nr_rings; i++)
blkif_free_ring(&info->rinfo[i]);
 
+   /* Remove old xenstore nodes. */
+   if (info->nr_ring_pages > 1)
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
+
+   if (info->nr_rings == 1) {
+   if (info->nr_ring_pages == 1) {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
+   } else {
+   for (i = 0; i < info->nr_ring_pages; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, 
"ring-ref%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
ring_ref_name);
+   }
+   }
+   } else {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
"multi-queue-num-queues");
+
+   for (i = 0; i < info->nr_rings; i++) {
+   char queuename[QUEUE_NAME_LEN];
+
+   snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
+   }
+   }
kfree(info->rinfo);
info->rinfo = NULL;
info->nr_rings = 0;
@@ -1772,6 +1803,10 @@ static int talk_to_blkback(struct xenbus_device *dev,
info->nr_ring_pages = 1;
else {
ring_page_order = min(xen_blkif_max_ring_order, max_page_order);
+   if (info->new_max_ring_page_order) {
+   BUG_ON(info->new_max_ring_page_order > max_page_order);
+   ring_page_order = info->new_max_ring_page_order;
+   }
info->nr_ring_pages = 1 << ring_page_order;
}
 
@@ -1895,6 +1930,10 @@ static int negotiate_mq(struct blkfront_info *info)
backend_max_queues = 1;
 
info->nr_rings = min(backend_max_queues, xen_blkif_max_queues);
+   if (info->new_max_queues) {
+   BUG_ON(info->new_max_queues > backend_max_queues);
+   info->nr_rings = info->new_max_queues;
+   }
/* We need at least one ring. */
if (!info->nr_rings)
info->nr_rings = 1;
@@ -2352,11 +2391,227 @@ static void blkfront_gather_backend_features(struct 
blkfront_info *info)
NULL);
if (err)
info->max_indirect_segments = 0;
-   else
+   else {
info->max_indirect_segments = min(indirect_segments,
  xen_blkif_max_segments);
+   if (info->new_max_indirect_segments) {
+   BUG_ON(info->new_max_indirect_segments > 
indirect_segments);
+   info->max_indirect_segments = 
info->new_max_indirect_segments;
+   }
+   }
+}
+
+static ssize_t max_ring_pa

[Xen-devel] [PATCH 2/3] xen-blkfront: introduce blkif_set_queue_limits()

2016-07-15 Thread Bob Liu
blk_mq_update_nr_hw_queues() resets all queue limits to their defaults, which is
not what xen-blkfront expects. Introduce blkif_set_queue_limits() to restore the
limits to their correct initial values.

Signed-off-by: Bob Liu 
---
 drivers/block/xen-blkfront.c | 91 
 1 file changed, 50 insertions(+), 41 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 032fc94..10f46a8 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -189,6 +189,8 @@ struct blkfront_info
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
+   u16 sector_size;
+   unsigned int physical_sector_size;
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
@@ -913,9 +915,45 @@ static struct blk_mq_ops blkfront_mq_ops = {
.map_queue = blk_mq_map_queue,
 };
 
+static void blkif_set_queue_limits(struct blkfront_info *info)
+{
+   struct request_queue *rq = info->rq;
+   struct gendisk *gd = info->gd;
+   unsigned int segments = info->max_indirect_segments ? :
+   BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
+
+   if (info->feature_discard) {
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
+   blk_queue_max_discard_sectors(rq, get_capacity(gd));
+   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_alignment = info->discard_alignment;
+   if (info->feature_secdiscard)
+   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
+   }
+
+   /* Hard sector size and max sectors impersonate the equiv. hardware. */
+   blk_queue_logical_block_size(rq, info->sector_size);
+   blk_queue_physical_block_size(rq, info->physical_sector_size);
+   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
+
+   /* Each segment in a request is up to an aligned page in size. */
+   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
+   blk_queue_max_segment_size(rq, PAGE_SIZE);
+
+   /* Ensure a merged request will fit in a single I/O ring slot. */
+   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
+
+   /* Make sure buffer addresses are sector-aligned. */
+   blk_queue_dma_alignment(rq, 511);
+
+   /* Make sure we don't use bounce buffers. */
+   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
+}
+
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
-   unsigned int physical_sector_size,
-   unsigned int segments)
+   unsigned int physical_sector_size)
 {
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
@@ -947,37 +985,11 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
}
 
rq->queuedata = info;
-   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
-
-   if (info->feature_discard) {
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
-   blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
-   rq->limits.discard_alignment = info->discard_alignment;
-   if (info->feature_secdiscard)
-   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
-   }
-
-   /* Hard sector size and max sectors impersonate the equiv. hardware. */
-   blk_queue_logical_block_size(rq, sector_size);
-   blk_queue_physical_block_size(rq, physical_sector_size);
-   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
-
-   /* Each segment in a request is up to an aligned page in size. */
-   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
-   blk_queue_max_segment_size(rq, PAGE_SIZE);
-
-   /* Ensure a merged request will fit in a single I/O ring slot. */
-   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
-
-   /* Make sure buffer addresses are sector-aligned. */
-   blk_queue_dma_alignment(rq, 511);
-
-   /* Make sure we don't use bounce buffers. */
-   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
-
-   gd->queue = rq;
-
+   info->rq = gd->queue = rq;
+   info->gd = gd;
+   info->sector_size = sector_size;
+   info->physical_sector_size = physical_sector_size;
+   blkif_set_queue_limits(info);
return 0;
 }
 
@@ -1142,16 +1154,11 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
gd->driverfs_dev = &(info->xbdev->dev);
set_capacity(gd, capacity);
 
-   if (xlvbd_init_blk_queue(gd, sector_size, physical_sector_size,
-info->max_indirect_segments ? :
- 

Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen

2016-07-18 Thread Bob Liu
Hey Haozhong,

On 07/18/2016 08:29 AM, Haozhong Zhang wrote:
> Hi,
> 
> Following is version 2 of the design doc for supporting vNVDIMM in

This version is really good: very clear, and it includes almost everything I'd
like to know.

> Xen. It's basically the summary of discussion on previous v1 design
> (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg6.html).
> Any comments are welcome. The corresponding patches are WIP.
> 

So are you (or Intel) going to write all the patches? Are there any tasks the 
community can take part in?

[..snip..]
> 3. Usage Example of vNVDIMM in Xen
> 
>  Our design is to provide virtual pmem devices to HVM domains. The
>  virtual pmem devices are backed by host pmem devices.
> 
>  Dom0 Linux kernel can detect the host pmem devices and create
>  /dev/pmemXX for each detected devices. Users in Dom0 can then create
>  DAX file system on /dev/pmemXX and create several pre-allocate files
>  in the DAX file system.
> 
>  After setup the file system on the host pmem, users can add the
>  following lines in the xl configuration files to assign the host pmem
>  regions to domains:
>  vnvdimm = [ 'file=/dev/pmem0' ]
>  or
>  vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]
> 

Could you please also consider the case where a driver domain gets involved?
E.g. vnvdimm = [ 'file=/dev/pmem0', backend='xxx' ]?

>   The first type of configuration assigns the entire pmem device
>   (/dev/pmem0) to the domain, while the second assigns the space
>   allocated to /mnt/dax/pre_allocated_file on the host pmem device to
>   the domain.
> 
..[snip..]
> 
> 4.2.2 Detection of Host pmem Devices
> 
>  The detection and initialize host pmem devices require a non-trivial
>  driver to interact with the corresponding ACPI namespace devices,
>  parse namespace labels and make necessary recovery actions. Instead
>  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
>  our designs leaves it to Dom0 Linux and let Dom0 Linux report
>  detected host pmem devices to Xen hypervisor.
> 
>  Our design takes following steps to detect host pmem devices when Xen
>  boots.
>  (1) As booting on bare metal, host pmem devices are detected by Dom0
>  Linux NVDIMM driver.
> 
>  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
>  of the pmem devices and reserved areas to Xen hypervisor via a
>  new hypercall.
> 
>  (3) Xen hypervisor then checks
>  - whether SPA and size of the newly reported pmem device is overlap
>with any previously reported pmem devices;
>  - whether the reserved area can fit in the pmem device and is
>large enough to hold page_info structs for itself.
> 
>  If any checks fail, the reported pmem device will be ignored by
>  Xen hypervisor and hence will not be used by any
>  guests. Otherwise, Xen hypervisor will recorded the reported
>  parameters and create page_info structs in the reserved area.
> 
>  (4) Because the reserved area is now used by Xen hypervisor, it
>  should not be accessible by Dom0 any more. Therefore, if a host
>  pmem device is recorded by Xen hypervisor, Xen will unmap its
>  reserved area from Dom0. Our design also needs to extend Linux
>  NVDIMM driver to "balloon out" the reserved area after it
>  successfully reports a pmem device to Xen hypervisor.
> 
> 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> 
>  Before a pmem file is assigned to a domain, we need to know the host
>  SPA ranges that are allocated to this file. We do this work in xl.
> 
>  If a pmem device /dev/pmem0 is given, xl will read
>  /sys/block/pmem0/device/{resource,size} respectively for the start
>  SPA and size of the pmem device.
> 
>  If a pre-allocated file /mnt/dax/file is given,
>  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
>  it uses the method above to get the start SPA of the host pmem
>  device.
>  (2) xl then uses fiemap ioctl to get the extend mappings of
>  /mnt/dax/file, and adds the corresponding physical offsets and
>  lengths in each mapping entries to above start SPA to get the SPA
>  ranges pre-allocated for this file.
> 

Looks like PMEM can't be passed through to a driver domain directly like e.g. PCI 
devices.

So suppose a driver domain is created with: vnvdimm = [ 'file=/dev/pmem0' ], and a 
DAX file system is made inside that driver domain.

Then new guests are created with vnvdimm = [ 'file=dax file in driver domain', 
backend = 'driver domain' ].
Is this going to work? In my understanding, fiemap can only get the GPFN 
instead of the real SPA of the PMEM in this case.


>  The resulting host SPA ranges will be passed to QEMU which allocates
>  guest address space for vNVDIMM devices and calls Xen hypervisor to
>  map the guest address to the host SPA ranges.
> 

Can Dom0 still access the same SPA range when Xen decides to assign it to a new 
domU?
I assume the range will be unmapped from Dom0 automatically in the hypercall?

Th

Re: [Xen-devel] [PATCH 2/3] xen-blkfront: introduce blkif_set_queue_limits()

2016-07-21 Thread Bob Liu

On 07/21/2016 04:29 PM, Roger Pau Monné wrote:
> On Fri, Jul 15, 2016 at 05:31:48PM +0800, Bob Liu wrote:
>> blk_mq_update_nr_hw_queues() reset all queue limits to default which it's not
>> as xen-blkfront expected, introducing blkif_set_queue_limits() to reset 
>> limits
>> with initial correct values.
> 
> Hm, great, and as usual in Linux there isn't even a comment in the function 
> that explains what it is supposed to do, or what are the side-effects of 
> calling blk_mq_update_nr_hw_queues.
>  
>> Signed-off-by: Bob Liu 
>>
>>  drivers/block/xen-blkfront.c | 91 
>> 
>>  1 file changed, 50 insertions(+), 41 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 032fc94..10f46a8 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -189,6 +189,8 @@ struct blkfront_info
>>  struct mutex mutex;
>>  struct xenbus_device *xbdev;
>>  struct gendisk *gd;
>> +u16 sector_size;
>> +unsigned int physical_sector_size;
>>  int vdevice;
>>  blkif_vdev_t handle;
>>  enum blkif_state connected;
>> @@ -913,9 +915,45 @@ static struct blk_mq_ops blkfront_mq_ops = {
>>  .map_queue = blk_mq_map_queue,
>>  };
>>  
>> +static void blkif_set_queue_limits(struct blkfront_info *info)
>> +{
>> +struct request_queue *rq = info->rq;
>> +struct gendisk *gd = info->gd;
>> +unsigned int segments = info->max_indirect_segments ? :
>> +BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +
>> +queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
>> +
>> +if (info->feature_discard) {
>> +queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
>> +blk_queue_max_discard_sectors(rq, get_capacity(gd));
>> +rq->limits.discard_granularity = info->discard_granularity;
>> +rq->limits.discard_alignment = info->discard_alignment;
>> +if (info->feature_secdiscard)
>> +queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
>> +}
> 
> AFAICT, at the point this function is called (in blkfront_resume), the 
> value of info->feature_discard is still from the old backend; maybe this 
> should be called from blkif_recover, after blkfront_gather_backend_features?
> 

Thank you for pointing that out; it will be fixed.
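The ordering bug Roger describes can be shown with a toy model. The names below
mirror the kernel functions, but this is a standalone illustration under
assumed semantics, not the actual xen-blkfront code:

```c
#include <stdio.h>

/* Toy model of the resume-path ordering issue: if the queue limits are
 * applied before the new backend's features are re-read, the queue keeps
 * stale settings from the old backend. */
struct toy_info  { int feature_discard; };
struct toy_queue { int discard_enabled; };

/* stands in for blkfront_gather_backend_features(): re-reads features
 * from the (possibly different) new backend */
static void gather_backend_features(struct toy_info *info, int backend_discard)
{
    info->feature_discard = backend_discard;
}

/* stands in for blkif_set_queue_limits(): applies features to the queue */
static void set_queue_limits(const struct toy_info *info, struct toy_queue *q)
{
    q->discard_enabled = info->feature_discard;
}
```

Calling set_queue_limits() before gather_backend_features() leaves the stale
value in place, which is why the call belongs in blkif_recover after the
features have been gathered.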

-- 
Regards,
-Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-21 Thread Bob Liu

On 07/21/2016 04:57 PM, Roger Pau Monné wrote:
> On Fri, Jul 15, 2016 at 05:31:49PM +0800, Bob Liu wrote:
>> The current VBD layer reserves buffer space for each attached device based on
>> three statically configured settings which are read at boot time.
>>  * max_indirect_segs: Maximum number of indirect segments.
>>  * max_ring_page_order: Maximum order of pages to be used for the shared ring.
>>  * max_queues: Maximum number of queues (rings) to be used.
>>
>> But the storage backend, workload, and guest memory result in very different
>> tuning requirements. It's impossible to centrally predict application
>> characteristics, so it's best to allow the settings to be adjusted dynamically
>> based on the workload inside the guest.
>>
>> Usage:
>> Show current values:
>> cat /sys/devices/vbd-xxx/max_indirect_segs
>> cat /sys/devices/vbd-xxx/max_ring_page_order
>> cat /sys/devices/vbd-xxx/max_queues
>>
>> Write new values:
>> echo  > /sys/devices/vbd-xxx/max_indirect_segs
>> echo  > /sys/devices/vbd-xxx/max_ring_page_order
>> echo  > /sys/devices/vbd-xxx/max_queues
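Since these attributes are plain-text sysfs files, driving them from a
userspace tool is straightforward. A minimal sketch (the vbd-xxx path in the
commit message is a placeholder, and these helper names are made up):

```c
#include <stdio.h>

/* Userspace helpers for attribute-style files such as the proposed
 * /sys/devices/vbd-xxx/max_indirect_segs.  Purely illustrative; they
 * work on any file holding a single unsigned integer. */
static int sysfs_read_uint(const char *path, unsigned int *val)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%u", val) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

static int sysfs_write_uint(const char *path, unsigned int val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = (fprintf(f, "%u\n", val) > 0);
    fclose(f);
    return ok ? 0 : -1;
}
```

On a real guest this would be used as, e.g.,
sysfs_write_uint("/sys/devices/vbd-xxx/max_queues", 4), with the kernel side
deciding when the new value actually takes effect.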
>>
>> Signed-off-by: Bob Liu 
>> --
>> v2: Add device lock and other comments from Konrad.
>> ---
>>  drivers/block/xen-blkfront.c | 285 ++-
>>  1 file changed, 283 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 10f46a8..9a5ed22 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -46,6 +46,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -212,6 +213,11 @@ struct blkfront_info
>>  /* Save incomplete reqs and bios for migration. */
>>  struct list_head requests;
>>  struct bio_list bio_list;
>> +/* For dynamic configuration. */
>> +unsigned int reconfiguring:1;
>> +int new_max_indirect_segments;
>> +int new_max_ring_page_order;
>> +int new_max_queues;
>>  };
>>  
>>  static unsigned int nr_minors;
>> @@ -1350,6 +1356,31 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>>  for (i = 0; i < info->nr_rings; i++)
>>  blkif_free_ring(&info->rinfo[i]);
>>  
>> +/* Remove old xenstore nodes. */
>> +if (info->nr_ring_pages > 1)
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
>> +
>> +if (info->nr_rings == 1) {
>> +if (info->nr_ring_pages == 1) {
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
>> +} else {
>> +for (i = 0; i < info->nr_ring_pages; i++) {
>> +char ring_ref_name[RINGREF_NAME_LEN];
>> +
>> snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
>> xenbus_rm(XBT_NIL, info->xbdev->nodename, ring_ref_name);
>> +}
>> +}
>> +} else {
>> xenbus_rm(XBT_NIL, info->xbdev->nodename, "multi-queue-num-queues");
>> +
>> +for (i = 0; i < info->nr_rings; i++) {
>> +char queuename[QUEUE_NAME_LEN];
>> +
>> +snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
>> +}
>> +}
>>  kfree(info->rinfo);
>>  info->rinfo = NULL;
>>  info->nr_rings = 0;
>> @@ -1772,6 +1803,10 @@ static int talk_to_blkback(struct xenbus_device *dev,
>>  info->nr_ring_pages = 1;
>>  else {
>>  ring_page_order = min(xen_blkif_max_ring_order, max_page_order);
>> +if (info->new_max_ring_page_order) {
> 
> Instead of calling this "new_max_ring_page_order", could you just call it 
> max_ring_page_order, initialize it to xen_blkif_max_ring_order by default 


Sure, I can do that.


> and use it everywhere instead of xen_blkif_max_ring_order?


But "xen_blkif_max_ring_order" still has to be used here; this is the only 
place "xen_blkif_max_ring_order" is used (except for checking its value in 
xlblk_init()).
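One way Roger's suggestion could look, modelled in standalone C (the function
and parameter names here are hypothetical; the real logic lives in
talk_to_blkback() in xen-blkfront.c):

```c
#include <stdio.h>

/* Illustrative model of ring-order negotiation: the effective order is
 * the smallest of the backend's advertised maximum, the module-wide
 * parameter (xen_blkif_max_ring_order), and, if set via sysfs, the
 * per-device override. */
static unsigned int negotiate_ring_order(unsigned int backend_max_page_order,
                                         unsigned int module_max,
                                         unsigned int per_dev_max /* 0 = unset */)
{
    unsigned int order = module_max < backend_max_page_order ?
                         module_max : backend_max_page_order;

    /* A per-device value can only tighten the negotiated order. */
    if (per_dev_max && per_dev_max < order)
        order = per_dev_max;
    return order;
}
```

With this shape, the patch's BUG_ON(new value > maximum) becomes unnecessary:
an out-of-range sysfs write is simply clamped (or rejected at write time)
rather than crashing the guest.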


> 
>> +BUG_ON(info->new_max_ring_page_order > max_page_order);
>> +
