[PATCH] hw/virtio: qmp: add RING_RESET to 'info virtio-status'
Signed-off-by: David Edmondson
---
 hw/virtio/virtio-qmp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/virtio/virtio-qmp.c b/hw/virtio/virtio-qmp.c
index 3d32dbec8d..7515b0947b 100644
--- a/hw/virtio/virtio-qmp.c
+++ b/hw/virtio/virtio-qmp.c
@@ -79,6 +79,8 @@ static const qmp_virtio_feature_map_t virtio_transport_map[] = {
                   "VIRTIO_F_ORDER_PLATFORM: Memory accesses ordered by platform"),
     FEATURE_ENTRY(VIRTIO_F_SR_IOV, \
                   "VIRTIO_F_SR_IOV: Device supports single root I/O virtualization"),
+    FEATURE_ENTRY(VIRTIO_F_RING_RESET, \
+                  "VIRTIO_F_RING_RESET: Driver can reset a queue individually"),
     /* Virtio ring transport features */
     FEATURE_ENTRY(VIRTIO_RING_F_INDIRECT_DESC, \
                   "VIRTIO_RING_F_INDIRECT_DESC: Indirect descriptors supported"),
-- 
2.34.1
[PATCH 0/2] block/curl: Improve HTTP header parsing
An HTTP object store of my acquaintance returns "accept-ranges: bytes"
(all lower case) as a header, causing the QEMU curl backend to refuse
to talk to it. RFC 7230 says that HTTP headers are case insensitive,
so update the curl backend accordingly.

At the same time, allow for arbitrary white space around the HTTP
header field value, as required by the RFC.

David Edmondson (2):
  block/curl: HTTP header fields allow whitespace around values
  block/curl: HTTP header field names are case insensitive

 block/curl.c | 31 +++
 1 file changed, 27 insertions(+), 4 deletions(-)

-- 
2.24.1
[PATCH 2/2] block/curl: HTTP header field names are case insensitive
RFC 7230 section 3.2 indicates that HTTP header field names are case
insensitive.

Signed-off-by: David Edmondson
---
 block/curl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/curl.c b/block/curl.c
index 0cf99a4b31b8..4256659cd85b 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -216,11 +216,11 @@ static size_t curl_header_cb(void *ptr, size_t size, size_t nmemb, void *opaque)
     size_t realsize = size * nmemb;
     const char *header = (char *)ptr;
     const char *end = header + realsize;
-    const char *accept_ranges = "Accept-Ranges:";
+    const char *accept_ranges = "accept-ranges:";
     const char *bytes = "bytes";
 
     if (realsize >= strlen(accept_ranges)
-        && strncmp(header, accept_ranges, strlen(accept_ranges)) == 0) {
+        && strncasecmp(header, accept_ranges, strlen(accept_ranges)) == 0) {
 
         char *p = strchr(header, ':') + 1;
-- 
2.24.1
[PATCH 1/2] block/curl: HTTP header fields allow whitespace around values
RFC 7230 section 3.2 indicates that whitespace is permitted between
the field name and field value and after the field value.

Signed-off-by: David Edmondson
---
 block/curl.c | 31 +++
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/block/curl.c b/block/curl.c
index f86299378e38..0cf99a4b31b8 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -214,11 +214,34 @@ static size_t curl_header_cb(void *ptr, size_t size, size_t nmemb, void *opaque)
 {
     BDRVCURLState *s = opaque;
     size_t realsize = size * nmemb;
-    const char *accept_line = "Accept-Ranges: bytes";
+    const char *header = (char *)ptr;
+    const char *end = header + realsize;
+    const char *accept_ranges = "Accept-Ranges:";
+    const char *bytes = "bytes";
 
-    if (realsize >= strlen(accept_line)
-        && strncmp((char *)ptr, accept_line, strlen(accept_line)) == 0) {
-        s->accept_range = true;
+    if (realsize >= strlen(accept_ranges)
+        && strncmp(header, accept_ranges, strlen(accept_ranges)) == 0) {
+
+        char *p = strchr(header, ':') + 1;
+
+        /* Skip whitespace between the header name and value. */
+        while (p < end && *p && isspace(*p)) {
+            p++;
+        }
+
+        if (end - p >= strlen(bytes)
+            && strncmp(p, bytes, strlen(bytes)) == 0) {
+
+            /* Check that there is nothing but whitespace after the value. */
+            p += strlen(bytes);
+            while (p < end && *p && isspace(*p)) {
+                p++;
+            }
+
+            if (p == end || !*p) {
+                s->accept_range = true;
+            }
+        }
     }
 
     return realsize;
-- 
2.24.1
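[For anyone wanting to poke at the parsing outside of QEMU, here is a small
standalone harness mirroring the logic of the patch above. It is purely
illustrative and not part of the series; the helper name and test string
are invented.]

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Mirror of the curl_header_cb() logic: match "Accept-Ranges:" and then
 * accept a "bytes" value with optional whitespace on either side. */
static bool accepts_ranges(const char *header, size_t realsize)
{
    const char *end = header + realsize;
    const char *accept_ranges = "Accept-Ranges:";
    const char *bytes = "bytes";

    if (realsize < strlen(accept_ranges) ||
        strncmp(header, accept_ranges, strlen(accept_ranges)) != 0) {
        return false;
    }

    const char *p = strchr(header, ':') + 1;

    /* Skip whitespace between the header name and value. */
    while (p < end && *p && isspace((unsigned char)*p)) {
        p++;
    }

    if (end - p < strlen(bytes) || strncmp(p, bytes, strlen(bytes)) != 0) {
        return false;
    }

    /* Only whitespace may follow the value. */
    p += strlen(bytes);
    while (p < end && *p && isspace((unsigned char)*p)) {
        p++;
    }
    return p == end || !*p;
}

int main(void)
{
    const char *h = "Accept-Ranges:   bytes  \r\n";
    printf("%d\n", accepts_ranges(h, strlen(h))); /* prints 1 */
    return 0;
}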
Re: [PULL 06/18] qemu-img: Add --target-is-zero to convert
On Thursday, 2020-02-20 at 10:36:04 -06, Eric Blake wrote:

> On 2/20/20 10:06 AM, Max Reitz wrote:
>> From: David Edmondson
>>
>> In many cases the target of a convert operation is a newly provisioned
>> target that the user knows is blank (reads as zero). In this situation
>> there is no requirement for qemu-img to wastefully zero out the entire
>> device.
>>
>> Add a new option, --target-is-zero, allowing the user to indicate that
>> an existing target device will return zeros for all reads.
>>
>> Signed-off-by: David Edmondson
>> Message-Id: <20200205110248.2009589-2-david.edmond...@oracle.com>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy
>> Reviewed-by: Eric Blake
>> Signed-off-by: Max Reitz
>> ---
>>  docs/interop/qemu-img.rst |  9 -
>>  qemu-img-cmds.hx          |  4 ++--
>>  qemu-img.c                | 26 +++---
>>  3 files changed, 33 insertions(+), 6 deletions(-)
>>
>> diff --git a/docs/interop/qemu-img.rst b/docs/interop/qemu-img.rst
>> index 42e4451db4..5f40137c10 100644
>> --- a/docs/interop/qemu-img.rst
>> +++ b/docs/interop/qemu-img.rst
>> @@ -214,6 +214,13 @@ Parameters to convert subcommand:
>>    will still be printed. Areas that cannot be read from the source will be
>>    treated as containing only zeroes.
>>
>> +.. option:: --target-is-zero
>> +
>> +  Assume that reading the destination image will always return
>> +  zeros. This parameter is mutually exclusive with a destination image
>
> Late tweak now that this is in a pull request, so we may want a followup
> patch, but:
>
> The image doesn't always return zeros after we write to it, maybe we
> should tweak this sentence:
>
> Assume that reading the destination image will initially return all zeros.

I will send a patch for this.

> Also, my earlier comment about 'zeroes' one line before 'zeros' still
> applies - although both spellings are valid, we look inconsistent when
> we can't make up our mind within two adjacent paragraphs.

If we can agree on one of "zeros" or "zeroes" then I'm happy to send a
patch making it consistent everywhere. I think that given there are
existing functions with "zeroes" in the name, I'd be inclined to go
that way.

dme.
-- 
Why stay in college? Why go to night school?
[PATCH v3 0/1] qemu-img: Add --target-is-zero to indicate that a target is blank
qemu-img: Add --target-is-zero to indicate that a target is blank

v3:
 - Merge with the rST docs.
 - No more need to fix @var -> @code.

v2:
 - Remove target_is_zero, preferring to set has_zero_init directly
   (Mark Kanda).
 - Disallow --target-is-zero in the presence of a backing file (Max Reitz).
 - Add relevant documentation (Max Reitz).
 - @var -> @code for options in qemu-img.texi.

David Edmondson (1):
  qemu-img: Add --target-is-zero to convert

 docs/interop/qemu-img.rst |  8 +++-
 qemu-img-cmds.hx          |  4 ++--
 qemu-img.c                | 25 ++---
 3 files changed, 31 insertions(+), 6 deletions(-)

-- 
2.24.1
Re: [PATCH v2 1/2] qemu-img: Add --target-is-zero to convert
On Monday, 2020-02-03 at 21:20:16 +03, Vladimir Sementsov-Ogievskiy wrote:

> 24.01.2020 13:34, David Edmondson wrote:
>> In many cases the target of a convert operation is a newly provisioned
>> target that the user knows is blank (filled with zeroes). In this
>> situation there is no requirement for qemu-img to wastefully zero out
>> the entire device.
>>
>> Add a new option, --target-is-zero, allowing the user to indicate that
>> an existing target device is already zero filled.
>
> Hi! qemu-img.c part looks OK for me, but other doesn't apply on master now.

Updated v3 just sent.

dme.
-- 
But he said, leave me alone, I'm a family man.
[PATCH v3 1/1] qemu-img: Add --target-is-zero to convert
In many cases the target of a convert operation is a newly provisioned
target that the user knows is blank (filled with zeroes). In this
situation there is no requirement for qemu-img to wastefully zero out
the entire device.

Add a new option, --target-is-zero, allowing the user to indicate that
an existing target device is already zero filled.

Signed-off-by: David Edmondson
---
 docs/interop/qemu-img.rst |  8 +++-
 qemu-img-cmds.hx          |  4 ++--
 qemu-img.c                | 25 ++---
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/docs/interop/qemu-img.rst b/docs/interop/qemu-img.rst
index fa27e5c7b453..99bdbe4740ee 100644
--- a/docs/interop/qemu-img.rst
+++ b/docs/interop/qemu-img.rst
@@ -214,6 +214,12 @@ Parameters to convert subcommand:
   will still be printed. Areas that cannot be read from the source will be
   treated as containing only zeroes.
 
+.. option:: --target-is-zero
+
+  Assume that the destination is filled with zeros. This parameter is
+  mutually exclusive with the use of a backing file. It is required to
+  also use the ``-n`` parameter to skip image creation.
+
 Parameters to dd subcommand:
 
 .. program:: qemu-img-dd
@@ -366,7 +372,7 @@ Command description:
   4
     Error on reading data
 
-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
 
   Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
   to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 3fd836ca9090..e6f98b75473f 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -39,9 +39,9 @@ SRST
 ERST
 
 DEF("convert", img_convert,
-    "convert [--object objectdef] [--image-opts] [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T src_cache] [-O output_fmt] [-B backing_file] [-o options] [-l snapshot_param] [-S sparse_size] [-m num_coroutines] [-W] [--salvage] filename [filename2 [...]] output_filename")
+    "convert [--object objectdef] [--image-opts] [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T src_cache] [-O output_fmt] [-B backing_file] [-o options] [-l snapshot_param] [-S sparse_size] [-m num_coroutines] [-W] [--salvage] filename [filename2 [...]] output_filename")
 SRST
-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] [--salvage] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] [--salvage] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
 ERST
 
 DEF("create", img_create,
diff --git a/qemu-img.c b/qemu-img.c
index 2b4562b9d9f2..e0bfc33ef4f6 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -70,6 +70,7 @@ enum {
     OPTION_PREALLOCATION = 265,
     OPTION_SHRINK = 266,
     OPTION_SALVAGE = 267,
+    OPTION_TARGET_IS_ZERO = 268,
 };
 
 typedef enum OutputFormat {
@@ -1984,10 +1985,9 @@ static int convert_do_copy(ImgConvertState *s)
     int64_t sector_num = 0;
 
     /* Check whether we have zero initialisation or can get it efficiently */
-    if (s->target_is_new && s->min_sparse && !s->target_has_backing) {
+    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
+        !s->target_has_backing) {
         s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
-    } else {
-        s->has_zero_init = false;
     }
 
     if (!s->has_zero_init && !s->target_has_backing &&
@@ -2086,6 +2086,7 @@ static int img_convert(int argc, char **argv)
         {"force-share", no_argument, 0, 'U'},
         {"target-image-opts", no_argument, 0, OPTION_TARGET_IMAGE_OPTS},
         {"salvage", no_argument, 0, OPTION_SALVAGE},
+        {"target-is-zero", no_argument, 0, OPTION_TARGET_IS_ZERO},
         {0, 0, 0, 0}
     };
     c = getopt_long(argc, argv, ":hf:O:B:Cco:l:S:pt:T:
Re: [PATCH v3 1/1] qemu-img: Add --target-is-zero to convert
On Tuesday, 2020-02-04 at 07:31:52 -06, Eric Blake wrote:

> On 2/4/20 3:52 AM, David Edmondson wrote:
>> In many cases the target of a convert operation is a newly provisioned
>> target that the user knows is blank (filled with zeroes). In this
>
> 'filled with zeroes' is accurate for a preallocated image, but reads
> awkwardly for a sparse image; it might be better to state 'reads as zero'.

Okay.

>> situation there is no requirement for qemu-img to wastefully zero out
>> the entire device.
>>
>> Add a new option, --target-is-zero, allowing the user to indicate that
>> an existing target device is already zero filled.
>>
>> Signed-off-by: David Edmondson
>> ---
>>  docs/interop/qemu-img.rst |  8 +++-
>>  qemu-img-cmds.hx          |  4 ++--
>>  qemu-img.c                | 25 ++---
>>  3 files changed, 31 insertions(+), 6 deletions(-)
>>
>> diff --git a/docs/interop/qemu-img.rst b/docs/interop/qemu-img.rst
>> index fa27e5c7b453..99bdbe4740ee 100644
>> --- a/docs/interop/qemu-img.rst
>> +++ b/docs/interop/qemu-img.rst
>> @@ -214,6 +214,12 @@ Parameters to convert subcommand:
>>    will still be printed. Areas that cannot be read from the source will be
>>    treated as containing only zeroes.
>>
>> +.. option:: --target-is-zero
>> +
>> +  Assume that the destination is filled with zeros. This parameter is
>
> Spelled 'zeroes' just three lines earlier.

My understanding is that "zeros" is the correct plural of "zero" (and
that "zeroes" relates to the use of "zero" as a verb), but perhaps
that's another British English thing? I don't care enough to fight
about it.

>> +  mutually exclusive with the use of a backing file. It is required to
>> +  also use the ``-n`` parameter to skip image creation.
>
> I understand that it is safer to have restrictions now and lift them
> later, than to allow use of the option at any time and leave room for
> the user to shoot themselves in the foot with no way to add safety
> later. The argument against no backing file is somewhat understandable
> (technically, as long as the backing file also reads as all zeroes, then
> the overall image reads as all zeroes - but why have a backing file that
> has no content?); the argument requiring -n is a bit weaker (if I'm
> creating an image, I _know_ it reads as all zeroes, so the
> --target-is-zero argument is redundant, but it shouldn't hurt to allow it).

Given your patchset that improves the general behaviour of zero
detection, I'm definitely inclined to leave the backing file case alone
in this patch.

Convincing myself that a newly created image can only ever read as zero
will take a little more time, which I'm happy to spend if it's
considered significant.

>> +++ b/qemu-img.c
>
>> @@ -2247,6 +2256,11 @@ static int img_convert(int argc, char **argv)
>>          warn_report("This will become an error in future QEMU versions.");
>>      }
>>
>> +    if (s.has_zero_init && !skip_create) {
>> +        error_report("--target-is-zero requires use of -n flag");
>> +        goto fail_getopt;
>> +    }
>
> So I think we could drop this hunk with no change in behavior.
>
>> +
>>      s.src_num = argc - optind - 1;
>>      out_filename = s.src_num >= 1 ? argv[argc - 1] : NULL;
>>
>> @@ -2380,6 +2394,11 @@ static int img_convert(int argc, char **argv)
>>      }
>>      s.target_has_backing = (bool) out_baseimg;
>>
>> +    if (s.has_zero_init && s.target_has_backing) {
>> +        error_report("Cannot use --target-is-zero with a backing file");
>> +        goto out;
>> +    }
>> +
>
> while keeping this one makes sense. At any rate, typo fixes are minor,
> and whether or not we drop the hunk I claim is a needless restriction
> against redundancy,
>
> Reviewed-by: Eric Blake

Thanks.

dme.
-- 
Everyone I know, goes away in the end.
Re: [PATCH v3 1/1] qemu-img: Add --target-is-zero to convert
On Tuesday, 2020-02-04 at 16:59:39 +03, Vladimir Sementsov-Ogievskiy wrote:

> 04.02.2020 12:52, David Edmondson wrote:
>> In many cases the target of a convert operation is a newly provisioned
>> target that the user knows is blank (filled with zeroes). In this
>> situation there is no requirement for qemu-img to wastefully zero out
>> the entire device.
>>
>> Add a new option, --target-is-zero, allowing the user to indicate that
>> an existing target device is already zero filled.
>>
>> Signed-off-by: David Edmondson
>> ---
>>  docs/interop/qemu-img.rst |  8 +++-
>>  qemu-img-cmds.hx          |  4 ++--
>>  qemu-img.c                | 25 ++---
>>  3 files changed, 31 insertions(+), 6 deletions(-)
>>
>> diff --git a/docs/interop/qemu-img.rst b/docs/interop/qemu-img.rst
>> index fa27e5c7b453..99bdbe4740ee 100644
>> --- a/docs/interop/qemu-img.rst
>> +++ b/docs/interop/qemu-img.rst
>> @@ -214,6 +214,12 @@ Parameters to convert subcommand:
>>    will still be printed. Areas that cannot be read from the source will be
>>    treated as containing only zeroes.
>>
>> +.. option:: --target-is-zero
>> +
>> +  Assume that the destination is filled with zeros. This parameter is
>> +  mutually exclusive with the use of a backing file. It is required to
>
> Should we mention, that s/backing file/backing file for destination/, to make
> it clean
> that source backing file is unrelated?

Yes, that makes sense, similarly in the error message.

>> +  also use the ``-n`` parameter to skip image creation.
>> +
>>  Parameters to dd subcommand:
>>
>>  .. program:: qemu-img-dd
>> @@ -366,7 +372,7 @@ Command description:
>>    4
>>      Error on reading data
>>
>> -.. option:: convert [--object OBJECTDEF] [--image-opts]
>> [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T
>> SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l
>> SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] FILENAME
>> [FILENAME2 [...]] OUTPUT_FILENAME
>> +.. option:: convert [--object OBJECTDEF] [--image-opts]
>> [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f
>> FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o
>> OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W]
>> FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
>>
>>  Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
>>  to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
>> diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
>> index 3fd836ca9090..e6f98b75473f 100644
>> --- a/qemu-img-cmds.hx
>> +++ b/qemu-img-cmds.hx
>> @@ -39,9 +39,9 @@ SRST
>>  ERST
>>
>>  DEF("convert", img_convert,
>> -"convert [--object objectdef] [--image-opts] [--target-image-opts] [-U]
>> [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T src_cache] [-O output_fmt]
>> [-B backing_file] [-o options] [-l snapshot_param] [-S sparse_size] [-m
>> num_coroutines] [-W] [--salvage] filename [filename2 [...]] output_filename")
>> +"convert [--object objectdef] [--image-opts] [--target-image-opts]
>> [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T
>> src_cache] [-O output_fmt] [-B backing_file] [-o options] [-l
>> snapshot_param] [-S sparse_size] [-m num_coroutines] [-W] [--salvage]
>> filename [filename2 [...]] output_filename")
>>  SRST
>> -.. option:: convert [--object OBJECTDEF] [--image-opts]
>> [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T
>> SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l
>> SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] [--salvage]
>> FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
>> +.. option:: convert [--object OBJECTDEF] [--image-opts]
>> [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f
>> FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o
>> OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W]
>> [--salvage] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
>>  ERST
>>
>>  DEF("create", img_create,
>
> Side question: is there plan to get rid of this duplication, and generate
> everything from rst?
>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index 2b4562b9d9f2..e0bfc33ef4f6 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -70,6 +70,7 @@ enum {
>>      OPTION_PREALLOCATION =
[PATCH v4 1/1] qemu-img: Add --target-is-zero to convert
In many cases the target of a convert operation is a newly provisioned
target that the user knows is blank (reads as zero). In this situation
there is no requirement for qemu-img to wastefully zero out the entire
device.

Add a new option, --target-is-zero, allowing the user to indicate that
an existing target device will return zeros for all reads.

Signed-off-by: David Edmondson
---
 docs/interop/qemu-img.rst |  9 -
 qemu-img-cmds.hx          |  4 ++--
 qemu-img.c                | 26 +++---
 3 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/docs/interop/qemu-img.rst b/docs/interop/qemu-img.rst
index fa27e5c7b453..763036857451 100644
--- a/docs/interop/qemu-img.rst
+++ b/docs/interop/qemu-img.rst
@@ -214,6 +214,13 @@ Parameters to convert subcommand:
   will still be printed. Areas that cannot be read from the source will be
   treated as containing only zeroes.
 
+.. option:: --target-is-zero
+
+  Assume that reading the destination image will always return
+  zeros. This parameter is mutually exclusive with a destination image
+  that has a backing file. It is required to also use the ``-n``
+  parameter to skip image creation.
+
 Parameters to dd subcommand:
 
 .. program:: qemu-img-dd
@@ -366,7 +373,7 @@ Command description:
   4
     Error on reading data
 
-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
 
   Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
   to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 3fd836ca9090..e6f98b75473f 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -39,9 +39,9 @@ SRST
 ERST
 
 DEF("convert", img_convert,
-    "convert [--object objectdef] [--image-opts] [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T src_cache] [-O output_fmt] [-B backing_file] [-o options] [-l snapshot_param] [-S sparse_size] [-m num_coroutines] [-W] [--salvage] filename [filename2 [...]] output_filename")
+    "convert [--object objectdef] [--image-opts] [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T src_cache] [-O output_fmt] [-B backing_file] [-o options] [-l snapshot_param] [-S sparse_size] [-m num_coroutines] [-W] [--salvage] filename [filename2 [...]] output_filename")
 SRST
-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] [--salvage] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] [--salvage] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
 ERST
 
 DEF("create", img_create,
diff --git a/qemu-img.c b/qemu-img.c
index 2b4562b9d9f2..0faf2cd2f530 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -70,6 +70,7 @@ enum {
     OPTION_PREALLOCATION = 265,
     OPTION_SHRINK = 266,
     OPTION_SALVAGE = 267,
+    OPTION_TARGET_IS_ZERO = 268,
 };
 
 typedef enum OutputFormat {
@@ -1984,10 +1985,9 @@ static int convert_do_copy(ImgConvertState *s)
     int64_t sector_num = 0;
 
     /* Check whether we have zero initialisation or can get it efficiently */
-    if (s->target_is_new && s->min_sparse && !s->target_has_backing) {
+    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
+        !s->target_has_backing) {
         s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
-    } else {
-        s->has_zero_init = false;
     }
 
     if (!s->has_zero_init && !s->target_has_backing &&
@@ -2086,6 +2086,7 @@ static int img_convert(int argc, char **argv)
         {"force-share", no_argument, 0, 'U'},
         {"target-image-opts", no_argument, 0, OPTION_TARGET_IMAGE_OPTS},
         {"salvage", no_argument, 0, OPTION_SALVAGE},
+        {"target-is-zero", no_argument, 0, OPTION_TARGET_IS_ZERO},
         {0, 0, 0, 0}
     };
     c = getopt_long(argc
[PATCH v4 0/1] qemu-img: Add --target-is-zero to indicate that a target is blank
qemu-img: Add --target-is-zero to indicate that a target is blank

v4:
 - Wording in the doc and error message.

v3:
 - Merge with the rST docs.
 - No more need to fix @var -> @code.

v2:
 - Remove target_is_zero, preferring to set has_zero_init directly
   (Mark Kanda).
 - Disallow --target-is-zero in the presence of a backing file (Max Reitz).
 - Add relevant documentation (Max Reitz).
 - @var -> @code for options in qemu-img.texi.

David Edmondson (1):
  qemu-img: Add --target-is-zero to convert

 docs/interop/qemu-img.rst |  9 -
 qemu-img-cmds.hx          |  4 ++--
 qemu-img.c                | 26 +++---
 3 files changed, 33 insertions(+), 6 deletions(-)

-- 
2.24.1
Re: [PATCH] virtio-net: clear guest_announce feature if no cvq backend
On Tuesday, 2023-01-24 at 17:11:59 +01, Eugenio Pérez wrote:

> Since GUEST_ANNOUNCE is emulated the feature bit could be set without
> backend support. This happens in the vDPA case.
>
> However, backend vDPA parent may not have CVQ support. This causes an
> incoherent feature set, and the driver may refuse to start. This
> happens in virtio-net Linux driver.

Could you now simplify the tests in virtio_net_announce() and
virtio_net_post_load_device() to look only for the presence of
GUEST_ANNOUNCE, given that you can now presume that it implies CTRL_VQ?

But anyway:

Reviewed-by: David Edmondson

> This may be solved differently in the future. Qemu is able to emulate a
> CVQ just for guest_announce purposes, helping guest to notify the new
> location with vDPA devices that does not support it. However, this is
> left as a TODO as it is way more complex to backport.
>
> Tested with vdpa_net_sim, toggling manually VIRTIO_NET_F_CTRL_VQ in the
> driver and migrating it with x-svq=on.
>
> Fixes: 980003debddd ("vdpa: do not handle VIRTIO_NET_F_GUEST_ANNOUNCE in
> vhost-vdpa")
> Reported-by: Dawar, Gautam
> Signed-off-by: Eugenio Pérez
> ---
>  hw/net/virtio-net.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 3ae909041a..09d5c7a664 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -820,6 +820,21 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
>          features |= (1ULL << VIRTIO_NET_F_MTU);
>      }
>
> +    /*
> +     * Since GUEST_ANNOUNCE is emulated the feature bit could be set without
> +     * enabled. This happens in the vDPA case.
> +     *
> +     * Make sure the feature set is not incoherent, as the driver could refuse
> +     * to start.
> +     *
> +     * TODO: QEMU is able to emulate a CVQ just for guest_announce purposes,
> +     * helping guest to notify the new location with vDPA devices that does not
> +     * support it.
> +     */
> +    if (!virtio_has_feature(vdev->backend_features, VIRTIO_NET_F_CTRL_VQ)) {
> +        virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_ANNOUNCE);
> +    }
> +
>      return features;
>  }

-- 
Why stay in college? Why go to night school?
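[For illustration, the simplification being suggested would look something
like the following — a sketch, not code from the thread, assuming the
existing guards in virtio_net_announce()/virtio_net_post_load_device()
test both feature bits via virtio_vdev_has_feature():]

    /* Today (sketch): announce only when both bits are present. */
    if (virtio_vdev_has_feature(vdev, VIRTIO_NET_F_GUEST_ANNOUNCE) &&
        virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ)) {
        /* ... trigger the announce timer ... */
    }

    /* With this patch, GUEST_ANNOUNCE implies CTRL_VQ, so one test would do. */
    if (virtio_vdev_has_feature(vdev, VIRTIO_NET_F_GUEST_ANNOUNCE)) {
        /* ... trigger the announce timer ... */
    }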
Re: [PATCH v2 01/16] migration: Don't use INT64_MAX for unlimited rate
Juan Quintela writes:

> Define and use RATE_LIMIT_MAX instead.

Suggest "RATE_LIMIT_MAX_NONE".

> Signed-off-by: Juan Quintela
> ---
>  migration/migration-stats.h | 6 ++
>  migration/migration.c       | 4 ++--
>  migration/qemu-file.c       | 6 +-
>  3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index cf8a4f0410..e782f1b0df 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -15,6 +15,12 @@
>
>  #include "qemu/stats64.h"
>
> +/*
> + * If rate_limit_max is 0, there is special code to remove the rate
> + * limit.
> + */
> +#define RATE_LIMIT_MAX 0
> +
>  /*
>   * These are the ram migration statistic counters. It is loosely
>   * based on MigrationStats. We change to Stat64 any counter that
> diff --git a/migration/migration.c b/migration/migration.c
> index 039bba4804..c41c7491bb 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2304,7 +2304,7 @@ static void migration_completion(MigrationState *s)
>           * them if migration fails or is cancelled.
>           */
>          s->block_inactive = !migrate_colo();
> -        qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
> +        qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);
>          ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                                   s->block_inactive);
>      }
> @@ -3048,7 +3048,7 @@ static void *bg_migration_thread(void *opaque)
>      rcu_register_thread();
>      object_ref(OBJECT(s));
>
> -    qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
> +    qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);
>
>      setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>      /*
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 597054759d..4bc875b452 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -27,6 +27,7 @@
>  #include "qemu/error-report.h"
>  #include "qemu/iov.h"
>  #include "migration.h"
> +#include "migration-stats.h"
>  #include "qemu-file.h"
>  #include "trace.h"
>  #include "options.h"
> @@ -732,7 +733,10 @@ int qemu_file_rate_limit(QEMUFile *f)

Given that qemu_file_rate_limit() is really a boolean, could it be
declared as such?

>      if (qemu_file_get_error(f)) {
>          return 1;
>      }
> -    if (f->rate_limit_max > 0 && f->rate_limit_used > f->rate_limit_max) {
> +    if (f->rate_limit_max == RATE_LIMIT_MAX) {
> +        return 0;
> +    }
> +    if (f->rate_limit_used > f->rate_limit_max) {
>          return 1;
>      }
>      return 0;
> --
> 2.40.1

-- 
All those lines and circles, to me, a mystery.
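[Sketched out, the suggestion amounts to the following — the same logic as
the quoted hunk with only the type changed; callers comparing the result
against an int would need a matching tweak. This is not a patch from the
thread:]

bool qemu_file_rate_limit(QEMUFile *f)
{
    if (qemu_file_get_error(f)) {
        return true;
    }
    if (f->rate_limit_max == RATE_LIMIT_MAX) {
        return false;
    }
    /* Over budget for this iteration? */
    return f->rate_limit_used > f->rate_limit_max;
}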
Re: [PATCH v2 02/16] migration: Correct transferred bytes value
Juan Quintela writes:

> We forget several places to add to trasferred amount of data.

With "transferred".

> this fixes I get:
>
>    qemu_file_transferred() + multifd_bytes == transferred
>
> The only place whrer this is not true is during devices sending. But

"where"

> going all through the full tree searching for devices that use
> QEMUFile directly is a bit too much.
>
> Multifd, precopy and xbzrle work as expected. Postocpy still misses 35
> bytes, but searching for them is getting complicated, so I stop here.
>
> Signed-off-by: Juan Quintela
> ---
>  migration/ram.c       | 14 ++
>  migration/savevm.c    | 19 +--
>  migration/vmstate.c   |  3 +++
>  migration/meson.build |  2 +-
>  4 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index f69d8d42b0..fd5a8db0f8 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -337,6 +337,7 @@ int64_t ramblock_recv_bitmap_send(QEMUFile *file,
>
>      g_free(le_bitmap);
>
> +    stat64_add(&mig_stats.transferred, 8 + size + 8);

Push this up after the qemu_fflush() call?

>      if (qemu_file_get_error(file)) {
>          return qemu_file_get_error(file);
>      }
> @@ -1392,6 +1393,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
>              return ret;
>          }
>          qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> +        stat64_add(&mig_stats.transferred, 8);
>          qemu_fflush(f);
>      }
>      /*
> @@ -3020,6 +3022,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      RAMState **rsp = opaque;
>      RAMBlock *block;
>      int ret;
> +    size_t size = 0;
>
>      if (compress_threads_save_setup()) {
>          return -1;
> @@ -3038,16 +3041,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>          qemu_put_be64(f, ram_bytes_total_with_ignored()
>                           | RAM_SAVE_FLAG_MEM_SIZE);
>
> +        size += 8;

Move the blank line down.

>          RAMBLOCK_FOREACH_MIGRATABLE(block) {
>              qemu_put_byte(f, strlen(block->idstr));
>              qemu_put_buffer(f, (uint8_t *)block->idstr,
>                              strlen(block->idstr));
>              qemu_put_be64(f, block->used_length);
> +            size += 1 + strlen(block->idstr) + 8;
>              if (migrate_postcopy_ram() && block->page_size !=
>                  qemu_host_page_size) {
>                  qemu_put_be64(f, block->page_size);
> +                size += 8;
>              }
>              if (migrate_ignore_shared()) {
>                  qemu_put_be64(f, block->mr->addr);
> +                size += 8;
>              }
>          }
>      }
> @@ -3064,11 +3071,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>
>      if (!migrate_multifd_flush_after_each_section()) {
>          qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> +        size += 8;
>      }
>
>      qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +    size += 8;
>      qemu_fflush(f);
>
> +    stat64_add(&mig_stats.transferred, size);
>      return 0;
>  }
>
> @@ -3209,6 +3219,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>      RAMState **temp = opaque;
>      RAMState *rs = *temp;
>      int ret = 0;
> +    size_t size = 0;
>
>      rs->last_stage = !migration_in_colo_state();
>
> @@ -3253,8 +3264,11 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>
>      if (!migrate_multifd_flush_after_each_section()) {
>          qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> +        size += 8;
>      }
>      qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +    size += 8;
> +    stat64_add(&mig_stats.transferred, size);

This is after qemu_fflush() in the other cases.

>      qemu_fflush(f);
>
>      return 0;
> diff --git a/migration/savevm.c b/migration/savevm.c
> index e33788343a..c7af9050c2 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -952,6 +952,7 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se,
>      qemu_put_byte(f, section_type);
>      qemu_put_be32(f, se->section_id);
>
> +    size_t size = 1 + 4;

Move the blank line down.

>      if (section_type == QEMU_VM_SECTION_FULL ||
>          section_type == QEMU_VM_SECTION_START) {
>          /* ID string */
> @@ -961,7 +962,9 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se,
>
>          qemu_put_be32(f, se->instance_id);
>          qemu_put_be32(f, se->version_id);
> +        size += 1 + len + 4 + 4;
>      }
> +    stat64_add(&mig_stats.transferred, size);
>  }
>
>  /*
> @@ -973,6 +976,7 @@ static void save_section_footer(QEMUFile *f, SaveStateEntry *se)
>      if (migrate_get_current()->send_section_footer) {
>          qemu_put_byte(f, QEMU_VM_SECTION_FOOTER);
>          qemu_put_be32(f, se->section_id);
> +        stat64_add(&mig_stats.transferred, 1 + 4);
>      }
>  }
>
> @@ -1032,6 +1036,7 @@ static void qemu_savevm_
Re: [PATCH v2 03/16] migration: Move setup_time to mig_stats
Juan Quintela writes:

> It is a time that needs to be cleaned each time cancel migration.
> Once there create migration_time_since() to calculate how time since a
> time in the past.
>
> Signed-off-by: Juan Quintela
>
> ---
>
> Rename to migration_time_since (cédric)
> ---
>  migration/migration-stats.h | 13 +
>  migration/migration.h       |  1 -
>  migration/migration-stats.c |  7 +++
>  migration/migration.c       |  9 -
>  4 files changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index e782f1b0df..21402af9e4 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -75,6 +75,10 @@ typedef struct {
>       * Number of bytes sent during precopy stage.
>       */
>      Stat64 precopy_bytes;
> +    /*
> +     * How long has the setup stage took.
> +     */
> +    Stat64 setup_time;

Is this really a Stat64? It doesn't appear to need the atomic update
feature.

>      /*
>       * Total number of bytes transferred.
>       */
> @@ -87,4 +91,13 @@ typedef struct {
>
>  extern MigrationAtomicStats mig_stats;
>
> +/**
> + * migration_time_since: Calculate how much time has passed
> + *
> + * @stats: migration stats
> + * @since: reference time since we want to calculate
> + *
> + * Returns: Nothing. The time is stored in val.

"stored in stats->setup_time"

> + */
> +void migration_time_since(MigrationAtomicStats *stats, int64_t since);
> #endif
> diff --git a/migration/migration.h b/migration/migration.h
> index 48a46123a0..27aa3b1035 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -316,7 +316,6 @@ struct MigrationState {
>      int64_t downtime;
>      int64_t expected_downtime;
>      bool capabilities[MIGRATION_CAPABILITY__MAX];
> -    int64_t setup_time;
>      /*
>       * Whether guest was running when we enter the completion stage.
>       * If migration is interrupted by any reason, we need to continue
> diff --git a/migration/migration-stats.c b/migration/migration-stats.c
> index 2f2cea965c..3431453c90 100644
> --- a/migration/migration-stats.c
> +++ b/migration/migration-stats.c
> @@ -12,6 +12,13 @@
>
>  #include "qemu/osdep.h"
>  #include "qemu/stats64.h"
> +#include "qemu/timer.h"
>  #include "migration-stats.h"
>
>  MigrationAtomicStats mig_stats;
> +
> +void migration_time_since(MigrationAtomicStats *stats, int64_t since)
> +{
> +    int64_t now = qemu_clock_get_ms(QEMU_CLOCK_HOST);
> +    stat64_set(&stats->setup_time, now - since);
> +}
> diff --git a/migration/migration.c b/migration/migration.c
> index c41c7491bb..e9466273bb 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -887,7 +887,7 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
>  {
>      info->has_status = true;
>      info->has_setup_time = true;
> -    info->setup_time = s->setup_time;
> +    info->setup_time = stat64_get(&mig_stats.setup_time);
>
>      if (s->state == MIGRATION_STATUS_COMPLETED) {
>          info->has_total_time = true;
> @@ -1390,7 +1390,6 @@ void migrate_init(MigrationState *s)
>      s->pages_per_second = 0.0;
>      s->downtime = 0;
>      s->expected_downtime = 0;
> -    s->setup_time = 0;
>      s->start_postcopy = false;
>      s->postcopy_after_devices = false;
>      s->migration_thread_running = false;
> @@ -2647,7 +2646,7 @@ static void migration_calculate_complete(MigrationState *s)
>          s->downtime = end_time - s->downtime_start;
>      }
>
> -    transfer_time = s->total_time - s->setup_time;
> +    transfer_time = s->total_time - stat64_get(&mig_stats.setup_time);
>      if (transfer_time) {
>          s->mbps = ((double) bytes * 8.0) / transfer_time / 1000;
>      }
> @@ -2969,7 +2968,7 @@ static void *migration_thread(void *opaque)
>      qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>                              MIGRATION_STATUS_ACTIVE);
>
> -    s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
> +    migration_time_since(&mig_stats, setup_start);
>
>      trace_migration_thread_setup_complete();
>
> @@ -3081,7 +3080,7 @@ static void *bg_migration_thread(void *opaque)
>      qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>                              MIGRATION_STATUS_ACTIVE);
>
> -    s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
> +    migration_time_since(&mig_stats, setup_start);
>
>      trace_migration_thread_setup_complete();
>      s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> --
> 2.40.1

-- 
Walk without rhythm and it won't attract the worm.
Re: [PATCH v2 01/16] migration: Don't use INT64_MAX for unlimited rate
Juan Quintela writes:

> David Edmondson wrote:
>> Juan Quintela writes:
>>
>>> Define and use RATE_LIMIT_MAX instead.
>>
>> Suggest "RATE_LIMIT_MAX_NONE".
>
> Then even better
>
> RATE_LIMIT_DISABLED?
> RATE_LIMIT_NONE?

RATE_LIMIT_NONE sounds good to me.

> Using MAX and NONE at the same time looks strange.
>
>>> Signed-off-by: Juan Quintela
>>> ---
>>>  migration/migration-stats.h | 6 ++
>>>  migration/migration.c       | 4 ++--
>>>  migration/qemu-file.c       | 6 +-
>>>  3 files changed, 13 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
>>> index cf8a4f0410..e782f1b0df 100644
>>> --- a/migration/migration-stats.h
>>> +++ b/migration/migration-stats.h
>>> @@ -15,6 +15,12 @@
>>>
>>>  #include "qemu/stats64.h"
>>>
>>> +/*
>>> + * If rate_limit_max is 0, there is special code to remove the rate
>>> + * limit.
>>> + */
>>> +#define RATE_LIMIT_MAX 0
>>> +
>>>  /*
>>>   * These are the ram migration statistic counters. It is loosely
>>>   * based on MigrationStats. We change to Stat64 any counter that
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 039bba4804..c41c7491bb 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2304,7 +2304,7 @@ static void migration_completion(MigrationState *s)
>>>           * them if migration fails or is cancelled.
>>>           */
>>>          s->block_inactive = !migrate_colo();
>>> -        qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>>> +        qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);
>>>          ret = qemu_savevm_state_complete_precopy(s->to_dst_file,
>>>                                                   false,
>>>                                                   s->block_inactive);
>>>      }
>>> @@ -3048,7 +3048,7 @@ static void *bg_migration_thread(void *opaque)
>>>      rcu_register_thread();
>>>      object_ref(OBJECT(s));
>>>
>>> -    qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>>> +    qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);
>>>
>>>      setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>>>      /*
>>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>>> index 597054759d..4bc875b452 100644
>>> --- a/migration/qemu-file.c
>>> +++ b/migration/qemu-file.c
>>> @@ -27,6 +27,7 @@
>>>  #include "qemu/error-report.h"
>>>  #include "qemu/iov.h"
>>>  #include "migration.h"
>>> +#include "migration-stats.h"
>>>  #include "qemu-file.h"
>>>  #include "trace.h"
>>>  #include "options.h"
>>> @@ -732,7 +733,10 @@ int qemu_file_rate_limit(QEMUFile *f)
>>
>> Given that qemu_file_rate_limit() is really a boolean, could it be
>> declared as such?
>
> I wanted to do on this patch justn $Subject.
>
> You can see that when I move this function to
> migration/migration-stats.c already do the type change.

Thank you 👍

> That is patch:
>
> [PATCH v2 05/16] migration: Move rate_limit_max and rate_limit_used to
> migration_stats
>
> diff --git a/migration/migration-stats.c b/migration/migration-stats.c
> index 3431453c90..1b16edae7d 100644
> --- a/migration/migration-stats.c
> +++ b/migration/migration-stats.c
> @@ -22,3 +23,46 @@ void migration_time_since(MigrationAtomicStats *stats, int64_t since)
>      int64_t now = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>      stat64_set(&stats->setup_time, now - since);
>  }
> +
> +bool migration_rate_exceeded(QEMUFile *f)
> +{
> +    if (qemu_file_get_error(f)) {
> +        return true;
> +    }
> +
> +    uint64_t rate_limit_used = stat64_get(&mig_stats.rate_limit_used);
> +    uint64_t rate_limit_max = stat64_get(&mig_stats.rate_limit_max);
> +
> +    if (rate_limit_max == RATE_LIMIT_MAX) {
> +        return false;
> +    }
> +    if (rate_limit_max > 0 && rate_limit_used > rate_limit_max) {
> +        return true;
> +    }
> +    return false;
> +}
>
> Thanks, Juan.

-- 
I was better off when I was on your side, and I was holding on.
Re: [PATCH v2 03/16] migration: Move setup_time to mig_stats
Juan Quintela writes:

> David Edmondson wrote:
>> Juan Quintela writes:
>>
>>> It is a time that needs to be cleaned each time cancel migration.
>>> Once there create migration_time_since() to calculate how time since a
>>> time in the past.
>>>
>>> Signed-off-by: Juan Quintela
>>>
>>> ---
>>>
>>> Rename to migration_time_since (cédric)
>>> ---
>>>  migration/migration-stats.h | 13 +
>>>  migration/migration.h       |  1 -
>>>  migration/migration-stats.c |  7 +++
>>>  migration/migration.c       |  9 -
>>>  4 files changed, 24 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
>>> index e782f1b0df..21402af9e4 100644
>>> --- a/migration/migration-stats.h
>>> +++ b/migration/migration-stats.h
>>> @@ -75,6 +75,10 @@ typedef struct {
>>>       * Number of bytes sent during precopy stage.
>>>       */
>>>      Stat64 precopy_bytes;
>>> +    /*
>>> +     * How long has the setup stage took.
>>> +     */
>>> +    Stat64 setup_time;
>>
>> Is this really a Stat64? It doesn't appear to need the atomic update
>> feature.
>
> What this whole Migration Atomic Counters series try to do is that
> everything becomes atomic and then we can use everything everywhere.
>
> Before this series we had (I am simplifying here):
>
> - transferred, precopy_bytes, postcopy_bytes, downtime_bytes -> atomic,
>   you can use it anywhere
>
> - qemu_file transferred -> you can only use it from the main migration
>   thread
>
> - qemu_file rate_limit -> you can only use it from the main migration
>   thread
>
> And we had to update the three counters in every place that we did a
> write wehad to update all of them.
>
> You can see the contorsions that we go to to update the rate_limit and
> the qemu_file transferred fields.
>
> After the series (you need to get what it is already on the tree, this
> series, QEMUFileHooks cleanup, and another serie on my tree waiting for
> this to be commited), you got three counters:
>
> - qemu_file: atomic, everytime we do a qemu_file write we update it
> - multifd_bytes: atomic, everytime that we do a write in a multifd
>   channel, we update it.
> - rdma_bytes: atomic, everytime we do a write through RDMA we update it.
>
> And that is it.
>
> Both rate_limit and transferred are derived from these three counters:
>
> - at any point in time migration_transferred_bytes() returns the amount
>   of bytes written since the start of the migration:
>   qemu_file_bytes + multifd_bytes + rdma_bytes.
>
> - transferred on this period:
>   at_start_of_period = migration_transferred_bytes().
>   trasferred_in_this_period = migration_transferred_bytes() - at_start_of_period;
>
> - Similar for precopy_bytes, postcopy_bytes and downtime_bytes. When we
>   move from one stage to the next, we store what is the value of the
>   previous stage.
>
> The counters that we use to calculate the rate limit are updated around
> 10 times per second (can be a bit bigger at the end of periods,
> iterations, ...) So performance is not extra critical.
>
> But as we have way less atomic operations (really one per real write),
> we don't really care a lot if we do some atomic operations when a normal
> operation will do.
>
> I.e. I think we have two options:
>
> - have the remaining counters that are only used in the main migration
>   thread not be atomic. Document them and remember to do the correct
>   thing everytime we use it. If we need to use it in another thread,
>   just change it to atomic.
>
> - Make all counters atomic. No need to document anything. And you can
>   call any operation/counter/... in migration-stats.c from anywhere.
>
> I think that the second option is better. But I can hear reasons from
> people that think that the 1st one is better.

For the counters, no argument - making them all atomic seems like the
right way forward.

setup_time isn't a counter, and isn't manipulated at multiple points in
the code by different actors. I don't hate it being a Stat64, it just
seems odd when the other 'time' related variables are not.

> Comments?
>
> Later, Juan.

-- 
You can't hide from the flipside.
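[For reference, the derivation described above written out as a sketch —
the mig_stats field names follow Juan's description, not necessarily the
committed code:]

/* Total bytes written since the start of migration: the sum of the
 * three atomic write-path counters. */
static uint64_t migration_transferred_bytes(void)
{
    return stat64_get(&mig_stats.qemu_file_bytes) +
           stat64_get(&mig_stats.multifd_bytes) +
           stat64_get(&mig_stats.rdma_bytes);
}

/* Per-period accounting then falls out of the total: */
uint64_t at_start_of_period = migration_transferred_bytes();
/* ... migrate for a while ... */
uint64_t transferred_in_this_period =
    migration_transferred_bytes() - at_start_of_period;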
Re: [PATCH 2/2] migration: Put zero_pages in alphabetical order
Juan Quintela writes:

> I forgot to move it when I rename it from duplicated_pages.
>
> Signed-off-by: Juan Quintela

Reviewed-by: David Edmondson

> ---
>  migration/migration-stats.h | 8
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 0e49c236fa..b87ba30ea8 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -48,10 +48,6 @@ typedef struct {
>       * guest is stopped.
>       */
>      Stat64 downtime_bytes;
> -    /*
> -     * number of pages transferred that were full of zeros.
> -     */
> -    Stat64 zero_pages;
>      /*
>       * number of bytes sent through multifd channels.
>       */
> @@ -77,6 +73,10 @@ typedef struct {
>       * total number of bytes transferred.
>       */
>      Stat64 transferred;
> +    /*
> +     * number of pages transferred that were full of zeros.
> +     */
> +    Stat64 zero_pages;
>  } MigrationAtomicStats;
>
>  extern MigrationAtomicStats mig_stats;
> --
> 2.40.0

-- 
I've got a little black book with my poems in.
Re: [PATCH 1/2] migration: Document all migration_stats
Juan Quintela writes:

> Signed-off-by: Juan Quintela

Reviewed-by: David Edmondson

> ---
>  migration/migration-stats.h | 43 +
>  1 file changed, 43 insertions(+)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 149af932d7..0e49c236fa 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -22,17 +22,60 @@
>   * one thread).
>   */
>  typedef struct {
> +    /*
> +     * number of bytes that were dirty last time that we sync with the

Minor, but given that you are writing sentences, please start with a
capital letter.

> +     * guest memory. We use that to calculate the downtime. As the
> +     * remaining dirty amounts to what we know that is still dirty
> +     * since last iteration, not counting what the guest has dirtied
> +     * sync we synchronize bitmaps.
> +     */
>      Stat64 dirty_bytes_last_sync;
> +    /*
> +     * number of pages dirtied by second.
> +     */
>      Stat64 dirty_pages_rate;
> +    /*
> +     * number of times we have synchronize guest bitmaps.
> +     */
>      Stat64 dirty_sync_count;
> +    /*
> +     * number of times zero copy failed to send any page using zero
> +     * copy.
> +     */
>      Stat64 dirty_sync_missed_zero_copy;
> +    /*
> +     * number of bytes sent at migration completion stage while the
> +     * guest is stopped.
> +     */
>      Stat64 downtime_bytes;
> +    /*
> +     * number of pages transferred that were full of zeros.
> +     */
>      Stat64 zero_pages;
> +    /*
> +     * number of bytes sent through multifd channels.
> +     */
>      Stat64 multifd_bytes;
> +    /*
> +     * number of pages transferred that were not full of zeros.
> +     */
>      Stat64 normal_pages;
> +    /*
> +     * number of bytes sent during postcopy.
> +     */
>      Stat64 postcopy_bytes;
> +    /*
> +     * number of postcopy page faults that we have handled during
> +     * postocpy stage.
> +     */
>      Stat64 postcopy_requests;
> +    /*
> +     *  number of bytes sent during precopy stage.

Spurious double space before 'number'.

> +     */
>      Stat64 precopy_bytes;
> +    /*
> +     * total number of bytes transferred.
> +     */
>      Stat64 transferred;
>  } MigrationAtomicStats;
>
> --
> 2.40.0

-- 
I'm not the reason you're looking for redemption.
Re: [PATCH v7 1/6] memory: Reference as->current_map directly in memory commit
Chuang Xu writes:

> From: Peter Xu
>
> Calling RCU variance of address_space_get|to_flatview() during memory

"variants" rather than "variance", perhaps?

> commit (flatview updates, triggering memory listeners, or updating
> ioeventfds, etc.) is not 100% accurate, because commit() requires BQL
> rather than RCU read lock, so the context exclusively owns current_map and
> can be directly referenced.
>
> Neither does it need a refcount to current_map because it cannot be freed
> from under the caller.
>
> Add address_space_get_flatview_raw() for the case where the context holds
> BQL rather than RCU read lock and use it across the core memory updates,
> Drop the extra refcounts on FlatView*.
>
> Signed-off-by: Peter Xu
> ---
>  softmmu/memory.c | 28
>  1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 4699ba55ec..a992a365d9 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -61,6 +61,13 @@ struct AddrRange {
>      Int128 size;
>  };
>
> +/* Called with BQL held */
> +static inline FlatView *address_space_to_flatview_raw(AddressSpace *as)
> +{
> +    assert(qemu_mutex_iothread_locked());
> +    return as->current_map;
> +}
> +
>  static AddrRange addrrange_make(Int128 start, Int128 size)
>  {
>      return (AddrRange) { start, size };
> @@ -155,7 +162,7 @@ enum ListenerDirection { Forward, Reverse };
>  #define MEMORY_LISTENER_UPDATE_REGION(fr, as, dir, callback, _args...) \
>      do {                                                                \
>          MemoryRegionSection mrs = section_from_flat_range(fr,           \
> -                address_space_to_flatview(as));                         \
> +                address_space_to_flatview_raw(as));                     \
>          MEMORY_LISTENER_CALL(as, callback, dir, &mrs, ##_args);         \
>      } while(0)
>
> @@ -753,6 +760,7 @@ static FlatView *generate_memory_topology(MemoryRegion *mr)
>  }
>
>  static void address_space_add_del_ioeventfds(AddressSpace *as,
> +                                             FlatView *view,
>                                               MemoryRegionIoeventfd *fds_new,
>                                               unsigned fds_new_nb,
>                                               MemoryRegionIoeventfd *fds_old,
> @@ -774,7 +782,7 @@ static void address_space_add_del_ioeventfds(AddressSpace *as,
>                                            &fds_new[inew]))) {
>              fd = &fds_old[iold];
>              section = (MemoryRegionSection) {
> -                .fv = address_space_to_flatview(as),
> +                .fv = view,
>                  .offset_within_address_space = int128_get64(fd->addr.start),
>                  .size = fd->addr.size,
>              };
> @@ -787,7 +795,7 @@ static void address_space_add_del_ioeventfds(AddressSpace *as,
>                                           &fds_old[iold]))) {
>              fd = &fds_new[inew];
>              section = (MemoryRegionSection) {
> -                .fv = address_space_to_flatview(as),
> +                .fv = view,
>                  .offset_within_address_space = int128_get64(fd->addr.start),
>                  .size = fd->addr.size,
>              };
> @@ -833,7 +841,7 @@ static void address_space_update_ioeventfds(AddressSpace *as)
>      ioeventfd_max = QEMU_ALIGN_UP(as->ioeventfd_nb, 4);
>      ioeventfds = g_new(MemoryRegionIoeventfd, ioeventfd_max);
>
> -    view = address_space_get_flatview(as);
> +    view = address_space_to_flatview_raw(as);
>      FOR_EACH_FLAT_RANGE(fr, view) {
>          for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
>              tmp = addrrange_shift(fr->mr->ioeventfds[i].addr,
> @@ -852,13 +860,12 @@ static void address_space_update_ioeventfds(AddressSpace *as)
>          }
>      }
>
> -    address_space_add_del_ioeventfds(as, ioeventfds, ioeventfd_nb,
> +    address_space_add_del_ioeventfds(as, view, ioeventfds, ioeventfd_nb,
>                                       as->ioeventfds, as->ioeventfd_nb);
>
>      g_free(as->ioeventfds);
>      as->ioeventfds = ioeventfds;
>      as->ioeventfd_nb = ioeventfd_nb;
> -    flatview_unref(view);
>  }
>
>  /*
> @@ -1026,7 +1033,7 @@ static void flatviews_reset(void)
>
>  static void address_space_set_flatview(AddressSpace *as)
>  {
> -    FlatView *old_view = address_space_to_flatview(as);
> +    FlatView *old_view = address_space_to_flatview_raw(as);
>      MemoryRegion *physmr = memory_region_get_flatview_root(as->root);
>      FlatView *new_view = g_hash_table_lookup(flat_views, physmr);
>
> @@ -2979,8 +2986,7 @@ static void listener_add_address_space(MemoryListener *listener,
>              listener->log_global_start(listener);
>          }
>      }
> -
> -    view = address_space_get_flatview(as);
> +    view = address_space_to_flatview_raw(as);
>      FOR_EACH_FLAT_RANGE(fr, view) {
>          MemoryReg
Re: [PATCH v2 5/8] x86: Add AMX CPUIDs enumeration
On Wednesday, 2022-02-16 at 22:04:31 -08, Yang Zhong wrote: > From: Jing Liu > > Add AMX primary feature bits XFD and AMX_TILE to > enumerate the CPU's AMX capability. Meanwhile, add > AMX TILE and TMUL CPUID leaf and subleaves which > exist when AMX TILE is present to provide the maximum > capability of TILE and TMUL. > > Signed-off-by: Jing Liu > Signed-off-by: Yang Zhong Reviewed-by: David Edmondson > --- > target/i386/cpu.c | 55 --- > target/i386/kvm/kvm.c | 4 +++- > 2 files changed, 55 insertions(+), 4 deletions(-) > > diff --git a/target/i386/cpu.c b/target/i386/cpu.c > index 5a7ee8c7e1..2465bed5df 100644 > --- a/target/i386/cpu.c > +++ b/target/i386/cpu.c > @@ -576,6 +576,18 @@ static CPUCacheInfo legacy_l3_cache = { > #define INTEL_PT_CYCLE_BITMAP0x1fff /* Support 0,2^(0~11) */ > #define INTEL_PT_PSB_BITMAP (0x003f << 16) /* Support > 2K,4K,8K,16K,32K,64K */ > > +/* CPUID Leaf 0x1D constants: */ > +#define INTEL_AMX_TILE_MAX_SUBLEAF 0x1 > +#define INTEL_AMX_TOTAL_TILE_BYTES 0x2000 > +#define INTEL_AMX_BYTES_PER_TILE 0x400 > +#define INTEL_AMX_BYTES_PER_ROW0x40 > +#define INTEL_AMX_TILE_MAX_NAMES 0x8 > +#define INTEL_AMX_TILE_MAX_ROWS0x10 > + > +/* CPUID Leaf 0x1E constants: */ > +#define INTEL_AMX_TMUL_MAX_K 0x10 > +#define INTEL_AMX_TMUL_MAX_N 0x40 > + > void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1, >uint32_t vendor2, uint32_t vendor3) > { > @@ -845,8 +857,8 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = { > "avx512-vp2intersect", NULL, "md-clear", NULL, > NULL, NULL, "serialize", NULL, > "tsx-ldtrk", NULL, NULL /* pconfig */, NULL, > -NULL, NULL, NULL, "avx512-fp16", > -NULL, NULL, "spec-ctrl", "stibp", > +NULL, NULL, "amx-bf16", "avx512-fp16", > +"amx-tile", "amx-int8", "spec-ctrl", "stibp", > NULL, "arch-capabilities", "core-capability", "ssbd", > }, > .cpuid = { > @@ -911,7 +923,7 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = { > .type = CPUID_FEATURE_WORD, > .feat_names = { > "xsaveopt", "xsavec", "xgetbv1", "xsaves", > -NULL, NULL, NULL, NULL, > +"xfd", NULL, NULL, NULL, > NULL, NULL, NULL, NULL, > NULL, NULL, NULL, NULL, > NULL, NULL, NULL, NULL, > @@ -5587,6 +5599,43 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, > uint32_t count, > } > break; > } > +case 0x1D: { > +/* AMX TILE */ > +*eax = 0; > +*ebx = 0; > +*ecx = 0; > +*edx = 0; > +if (!(env->features[FEAT_7_0_EDX] & CPUID_7_0_EDX_AMX_TILE)) { > +break; > +} > + > +if (count == 0) { > +/* Highest numbered palette subleaf */ > +*eax = INTEL_AMX_TILE_MAX_SUBLEAF; > +} else if (count == 1) { > +*eax = INTEL_AMX_TOTAL_TILE_BYTES | > + (INTEL_AMX_BYTES_PER_TILE << 16); > +*ebx = INTEL_AMX_BYTES_PER_ROW | (INTEL_AMX_TILE_MAX_NAMES << > 16); > +*ecx = INTEL_AMX_TILE_MAX_ROWS; > +} > +break; > +} > +case 0x1E: { > +/* AMX TMUL */ > +*eax = 0; > +*ebx = 0; > +*ecx = 0; > +*edx = 0; > +if (!(env->features[FEAT_7_0_EDX] & CPUID_7_0_EDX_AMX_TILE)) { > +break; > +} > + > +if (count == 0) { > +/* Highest numbered palette subleaf */ > +*ebx = INTEL_AMX_TMUL_MAX_K | (INTEL_AMX_TMUL_MAX_N << 8); > +} > +break; > +} > case 0x4000: > /* > * CPUID code in kvm_arch_init_vcpu() ignores stuff > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c > index 3bdcd724c4..8562d3d138 100644 > --- a/target/i386/kvm/kvm.c > +++ b/target/i386/kvm/kvm.c > @@ -1779,7 +1779,9 @@ int kvm_arch_init_vcpu(CPUState *cs) > c = &cpuid_data.entries[cpuid_i++]; > } > break; > -case 0x14: { > +case 0x14: > +case 0x1d: > +case 0x1e: { > uint32_t times; > > c->function = i; dme. -- Why does it have to be like this? 
I can never tell.
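A minimal guest-side sketch of how the leaf 0x1D encoding above would be consumed (assuming GCC's cpuid.h; the field boundaries mirror the INTEL_AMX_* constants in the patch, which pack two 16-bit values per register):

  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
      unsigned int eax, ebx, ecx, edx;

      /* Subleaf 0: EAX reports the highest numbered palette subleaf. */
      __cpuid_count(0x1D, 0, eax, ebx, ecx, edx);
      printf("max palette subleaf: %u\n", eax);

      /* Subleaf 1: geometry of palette 1, packed as in the patch. */
      __cpuid_count(0x1D, 1, eax, ebx, ecx, edx);
      printf("total tile bytes: %u\n", eax & 0xffff);  /* 0x2000 */
      printf("bytes per tile:   %u\n", eax >> 16);     /* 0x400 */
      printf("bytes per row:    %u\n", ebx & 0xffff);  /* 0x40 */
      printf("max tile names:   %u\n", ebx >> 16);     /* 0x8 */
      printf("max rows:         %u\n", ecx & 0xffff);  /* 0x10 */
      return 0;
  }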
Re: [RFC PATCH 1/1] i386: Remove features from Epyc-Milan cpu
On Saturday, 2022-01-29 at 07:23:37 -03, Leonardo Bras wrote: > While trying to bring a VM with EPYC-Milan cpu on a host with > EPYC-Milan cpu (EPYC 7313), the following warning can be seen: > > qemu-system-x86_64: warning: host doesn't support requested feature: > CPUID.07H:EBX.erms [bit 9] > qemu-system-x86_64: warning: host doesn't support requested feature: > CPUID.07H:EDX.fsrm [bit 4] > > Even with this warning, the host goes up. > > Then, grep'ing cpuid output on both guest and host, outputs: > > extended feature flags (7): > enhanced REP MOVSB/STOSB = false > fast short REP MOV = false > (simple synth) = AMD EPYC (3rd Gen) (Milan B1) [Zen 3], 7nm >brand = "AMD EPYC 7313 16-Core Processor " > > This means that for the same -cpu model (EPYC-Milan), the vcpu may or may > not have the above feature bits set, which is usually not a good idea for > live migration: > Migrating from a host with these features to a host without them can > be troublesome for the guest. > > Remove the "optional" features (erms, fsrm) from Epyc-Milan, in order to > avoid possible after-migration guest issues. > > Signed-off-by: Leonardo Bras > --- > > Does this make sense? Or maybe I am missing something here. We have encountered some Milan CPUs (7J13) that did not initially declare support for either ERMS or FSRM. A firmware update caused these features to appear, which definitely causes potential problems with migration of VMs from hosts with updated firmware to those without. It would be interesting to know if there is any expectation that the features might be enabled on the 7313 CPUs that you have with a future firmware update. I *think* that the expectation is that Milan CPUs will have the features, and if that is correct then they should remain present in the EPYC-Milan definition on QEMU. > Having a kvm guest running with a feature bit, while the host > does not support it seems to cause a possible break the guest. As Daniel said, this should not happen in this case. > target/i386/cpu.c | 5 + > 1 file changed, 1 insertion(+), 4 deletions(-) > > diff --git a/target/i386/cpu.c b/target/i386/cpu.c > index aa9e636800..a4bbd38ed0 100644 > --- a/target/i386/cpu.c > +++ b/target/i386/cpu.c > @@ -4160,12 +4160,9 @@ static const X86CPUDefinition builtin_x86_defs[] = { > CPUID_7_0_EBX_FSGSBASE | CPUID_7_0_EBX_BMI1 | CPUID_7_0_EBX_AVX2 > | > CPUID_7_0_EBX_SMEP | CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_RDSEED | > CPUID_7_0_EBX_ADX | CPUID_7_0_EBX_SMAP | > CPUID_7_0_EBX_CLFLUSHOPT | > -CPUID_7_0_EBX_SHA_NI | CPUID_7_0_EBX_CLWB | CPUID_7_0_EBX_ERMS | > -CPUID_7_0_EBX_INVPCID, > +CPUID_7_0_EBX_SHA_NI | CPUID_7_0_EBX_CLWB | > CPUID_7_0_EBX_INVPCID, > .features[FEAT_7_0_ECX] = > CPUID_7_0_ECX_UMIP | CPUID_7_0_ECX_RDPID | CPUID_7_0_ECX_PKU, > -.features[FEAT_7_0_EDX] = > -CPUID_7_0_EDX_FSRM, > .features[FEAT_XSAVE] = > CPUID_XSAVE_XSAVEOPT | CPUID_XSAVE_XSAVEC | > CPUID_XSAVE_XGETBV1 | CPUID_XSAVE_XSAVES, dme. -- I don't care 'bout your other girls, just be good to me.
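For reference, the two contested bits can be probed directly on a host; a minimal sketch, assuming GCC's cpuid.h and using the bit positions quoted in the warnings above (CPUID.07H:EBX bit 9 for erms, CPUID.07H:EDX bit 4 for fsrm):

  #include <stdbool.h>
  #include <cpuid.h>

  static bool host_has_erms_and_fsrm(void)
  {
      unsigned int eax, ebx, ecx, edx;

      /* Structured extended feature flags: leaf 7, subleaf 0. */
      __cpuid_count(7, 0, eax, ebx, ecx, edx);
      return (ebx & (1u << 9))   /* erms */
          && (edx & (1u << 4));  /* fsrm */
  }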
Re: [PATCH v1 1/1] target/i386: Mask xstate_bv based on the cpu enabled features
On Saturday, 2022-01-29 at 06:46:45 -03, Leonardo Bras wrote: > The following steps describe a migration bug: > 1 - Bring up a VM with -cpu EPYC on a host with EPYC-Milan cpu > 2 - Migrate to a host with EPYC-Naples cpu > > The guest kernel crashes shortly after the migration. > > The crash happens due to a fault caused by XRSTOR: > A set bit in XSTATE_BV is not set in XCR0. > The faulting bit is FEATURE_PKRU (enabled in Milan, but not in Naples) I'm trying to understand how this happens. If we boot on EPYC-Milan with "-cpu EPYC", the PKRU feature should not be exposed to the VM (it is not available in the EPYC CPU). Given this, how would bit 0x200 (representing PKRU) end up set in xstate_bv? > To avoid this kind of bug: > In kvm_get_xsave, mask-out from xstate_bv any bits that are not set in > current vcpu's features. > > This keeps cpu->env->xstate_bv with feature bits compatible with any > host machine capable of running the vcpu model. > > Signed-off-by: Leonardo Bras > --- > target/i386/xsave_helper.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/target/i386/xsave_helper.c b/target/i386/xsave_helper.c > index ac61a96344..0628226234 100644 > --- a/target/i386/xsave_helper.c > +++ b/target/i386/xsave_helper.c > @@ -167,7 +167,7 @@ void x86_cpu_xrstor_all_areas(X86CPU *cpu, const void > *buf, uint32_t buflen) > env->xmm_regs[i].ZMM_Q(1) = ldq_p(xmm + 8); > } > > -env->xstate_bv = header->xstate_bv; > +env->xstate_bv = header->xstate_bv & env->features[FEAT_XSAVE_COMP_LO]; > > e = &x86_ext_save_areas[XSTATE_YMM_BIT]; > if (e->size && e->offset) { dme. -- You have underestimated my power, as you shortly will discover.
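The invariant behind the fix is small enough to state standalone; a sketch with illustrative names (bit 9, i.e. 0x200, is the PKRU component named in the report):

  #include <stdint.h>

  /* XRSTOR raises #GP if XSTATE_BV sets a bit that XCR0 does not, so an
   * incoming header must be clamped to the components the vCPU model
   * actually supports before it is restored. */
  static uint64_t clamp_xstate_bv(uint64_t xstate_bv, uint64_t supported)
  {
      return xstate_bv & supported; /* drops PKRU (1u << 9) on Naples */
  }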
Re: [PATCH v1 1/1] target/i386: Mask xstate_bv based on the cpu enabled features
On Tuesday, 2022-02-01 at 16:09:57 -03, Leonardo Brás wrote: > Hello David, thanks for this feedback! > > On Mon, 2022-01-31 at 12:53 +0000, David Edmondson wrote: >> On Saturday, 2022-01-29 at 06:46:45 -03, Leonardo Bras wrote: >> >> > The following steps describe a migration bug: >> > 1 - Bring up a VM with -cpu EPYC on a host with EPYC-Milan cpu >> > 2 - Migrate to a host with EPYC-Naples cpu >> > >> > The guest kernel crashes shortly after the migration. >> > >> > The crash happens due to a fault caused by XRSTOR: >> > A set bit in XSTATE_BV is not set in XCR0. >> > The faulting bit is FEATURE_PKRU (enabled in Milan, but not in >> > Naples) >> >> I'm trying to understand how this happens. >> >> If we boot on EPYC-Milan with "-cpu EPYC", the PKRU feature should >> not >> be exposed to the VM (it is not available in the EPYC CPU). >> >> Given this, how would bit 0x200 (representing PKRU) end up set in >> xstate_bv? > > During my debug, I noticed this bit gets set before the kernel even > starts. > > It's possible Seabios and/or IPXE are somehow setting 0x200 using the > xrstor command. I am not sure if qemu is able to stop this in KVM mode. I don't believe that this should be possible. If the CPU is set to EPYC in QEMU then .features[FEAT_7_0_ECX] does not include CPUID_7_0_ECX_PKU, which in turn means that when x86_cpu_enable_xsave_components() generates FEAT_XSAVE_COMP_LO it should not set XSTATE_PKRU_BIT. Given that, KVM's vcpu->arch.guest_supported_xcr0 will not include XSTATE_PKRU_BIT, and __kvm_set_xcr() should not allow that bit to be set when it intercepts the guest xsetbv instruction. dme. -- Please forgive me if I act a little strange, for I know not what I do.
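A sketch of the gate described above, with simplified names standing in for KVM's internals (guest_supported_xcr0 is derived from the CPU model's feature words, so "-cpu EPYC" should never include PKRU):

  #include <stdint.h>

  /* xsetbv is intercepted; bits outside the advertised component set
   * are refused, which is why PKRU should never reach XCR0, and hence
   * never be set in XSTATE_BV by the guest. */
  static int xsetbv_intercept(uint64_t guest_supported_xcr0,
                              uint64_t new_xcr0)
  {
      if (new_xcr0 & ~guest_supported_xcr0) {
          return 1; /* the real code injects #GP */
      }
      return 0;
  }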
Re: [PATCH] migration: Don't return for postcopy_send_discard_bm_ram()
On Thursday, 2021-12-30 at 17:05:25 +01, Philippe Mathieu-Daudé wrote: > postcopy_send_discard_bm_ram() always return zero. Since it can't > fail, simplify and do not return anything. > > Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: David Edmondson > --- > Based-on: <20211224065000.97572-1-pet...@redhat.com> > --- > migration/ram.c | 6 +- > 1 file changed, 1 insertion(+), 5 deletions(-) > > diff --git a/migration/ram.c b/migration/ram.c > index 5234d1ece11..e241ce98461 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -2433,14 +2433,12 @@ void > ram_postcopy_migrated_memory_release(MigrationState *ms) > /** > * postcopy_send_discard_bm_ram: discard a RAMBlock > * > - * Returns zero on success > - * > * Callback from postcopy_each_ram_send_discard for each RAMBlock > * > * @ms: current migration state > * @block: RAMBlock to discard > */ > -static int postcopy_send_discard_bm_ram(MigrationState *ms, RAMBlock *block) > +static void postcopy_send_discard_bm_ram(MigrationState *ms, RAMBlock *block) > { > unsigned long end = block->used_length >> TARGET_PAGE_BITS; > unsigned long current; > @@ -2464,8 +2462,6 @@ static int postcopy_send_discard_bm_ram(MigrationState > *ms, RAMBlock *block) > postcopy_discard_send_range(ms, one, discard_length); > current = one + discard_length; > } > - > -return 0; > } > > static void postcopy_chunk_hostpages_pass(MigrationState *ms, RAMBlock > *block); dme. -- Tell me sweet little lies.
Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
On Monday, 2022-02-07 at 20:24:21 GMT, Joao Martins wrote: > Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough > to address 1Tb (0xff ). On AMD platforms, if a > ram-above-4g relocation happens and the CPU wasn't configured > with a big enough phys-bits, warn the user. There isn't a > catastrophic failure exactly, the guest will still boot, but > most likely won't be able to use more than ~4G of RAM. > > Signed-off-by: Joao Martins > --- > hw/i386/pc.c | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c > index b060aedd38f3..f8712eb8427e 100644 > --- a/hw/i386/pc.c > +++ b/hw/i386/pc.c > @@ -842,6 +842,7 @@ static void relocate_4g(MachineState *machine, > PCMachineState *pcms) > X86MachineState *x86ms = X86_MACHINE(pcms); > ram_addr_t device_mem_size = 0; > uint32_t eax, vendor[3]; > +hwaddr maxphysaddr; > > host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); > if (!IS_AMD_VENDOR(vendor)) { > @@ -858,6 +859,12 @@ static void relocate_4g(MachineState *machine, > PCMachineState *pcms) > return; > } > > +maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1; > +if (maxphysaddr < AMD_ABOVE_1TB_START) Braces around the block are required, I believe. > +warn_report("Relocated RAM above 4G to start at %lu " Should use PRIu64? > +"phys-bits too low (%u)", > +AMD_ABOVE_1TB_START, X86_CPU(first_cpu)->phys_bits); > + > x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START; And a real nit - until above_4g_mem_start is modified, the number of phys_bits is fine, so I would have put the warning after the assignment. > } dme. -- Tonight I'm gonna bury that horse in the ground.
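Folding the three comments together, the hunk might read roughly as follows (a sketch against the posted patch, not a tested change):

  maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
  x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;

  if (maxphysaddr < AMD_ABOVE_1TB_START) {
      warn_report("Relocated RAM above 4G to start at %" PRIu64 ", "
                  "phys-bits too low (%u)",
                  AMD_ABOVE_1TB_START, X86_CPU(first_cpu)->phys_bits);
  }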
Re: [PATCH v2 2/8] x86: Add AMX XTILECFG and XTILEDATA components
On Wednesday, 2022-02-16 at 22:04:28 -08, Yang Zhong wrote: > From: Jing Liu > > The AMX TILECFG register and the TMMx tile data registers are > saved/restored via XSAVE, respectively in state component 17 > (64 bytes) and state component 18 (8192 bytes). > > Add AMX feature bits to x86_ext_save_areas array to set > up AMX components. Add structs that define the layout of > AMX XSAVE areas and use QEMU_BUILD_BUG_ON to validate the > structs sizes. > > Signed-off-by: Jing Liu > Signed-off-by: Yang Zhong Reviewed-by: David Edmondson > --- > target/i386/cpu.h | 18 +- > target/i386/cpu.c | 8 > 2 files changed, 25 insertions(+), 1 deletion(-) > > diff --git a/target/i386/cpu.h b/target/i386/cpu.h > index de1dc124ab..06d2d6bccf 100644 > --- a/target/i386/cpu.h > +++ b/target/i386/cpu.h > @@ -537,6 +537,8 @@ typedef enum X86Seg { > #define XSTATE_ZMM_Hi256_BIT6 > #define XSTATE_Hi16_ZMM_BIT 7 > #define XSTATE_PKRU_BIT 9 > +#define XSTATE_XTILE_CFG_BIT17 > +#define XSTATE_XTILE_DATA_BIT 18 > > #define XSTATE_FP_MASK (1ULL << XSTATE_FP_BIT) > #define XSTATE_SSE_MASK (1ULL << XSTATE_SSE_BIT) > @@ -845,6 +847,8 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS]; > #define CPUID_7_0_EDX_TSX_LDTRK (1U << 16) > /* AVX512_FP16 instruction */ > #define CPUID_7_0_EDX_AVX512_FP16 (1U << 23) > +/* AMX tile (two-dimensional register) */ > +#define CPUID_7_0_EDX_AMX_TILE (1U << 24) > /* Speculation Control */ > #define CPUID_7_0_EDX_SPEC_CTRL (1U << 26) > /* Single Thread Indirect Branch Predictors */ > @@ -1348,6 +1352,16 @@ typedef struct XSavePKRU { > uint32_t padding; > } XSavePKRU; > > +/* Ext. save area 17: AMX XTILECFG state */ > +typedef struct XSaveXTILECFG { > +uint8_t xtilecfg[64]; > +} XSaveXTILECFG; > + > +/* Ext. save area 18: AMX XTILEDATA state */ > +typedef struct XSaveXTILEDATA { > +uint8_t xtiledata[8][1024]; > +} XSaveXTILEDATA; > + > QEMU_BUILD_BUG_ON(sizeof(XSaveAVX) != 0x100); > QEMU_BUILD_BUG_ON(sizeof(XSaveBNDREG) != 0x40); > QEMU_BUILD_BUG_ON(sizeof(XSaveBNDCSR) != 0x40); > @@ -1355,6 +1369,8 @@ QEMU_BUILD_BUG_ON(sizeof(XSaveOpmask) != 0x40); > QEMU_BUILD_BUG_ON(sizeof(XSaveZMM_Hi256) != 0x200); > QEMU_BUILD_BUG_ON(sizeof(XSaveHi16_ZMM) != 0x400); > QEMU_BUILD_BUG_ON(sizeof(XSavePKRU) != 0x8); > +QEMU_BUILD_BUG_ON(sizeof(XSaveXTILECFG) != 0x40); > +QEMU_BUILD_BUG_ON(sizeof(XSaveXTILEDATA) != 0x2000); > > typedef struct ExtSaveArea { > uint32_t feature, bits; > @@ -1362,7 +1378,7 @@ typedef struct ExtSaveArea { > uint32_t ecx; > } ExtSaveArea; > > -#define XSAVE_STATE_AREA_COUNT (XSTATE_PKRU_BIT + 1) > +#define XSAVE_STATE_AREA_COUNT (XSTATE_XTILE_DATA_BIT + 1) > > extern ExtSaveArea x86_ext_save_areas[XSAVE_STATE_AREA_COUNT]; > > diff --git a/target/i386/cpu.c b/target/i386/cpu.c > index 37f06b0b1a..ea7e8f9081 100644 > --- a/target/i386/cpu.c > +++ b/target/i386/cpu.c > @@ -1401,6 +1401,14 @@ ExtSaveArea x86_ext_save_areas[XSAVE_STATE_AREA_COUNT] > = { > [XSTATE_PKRU_BIT] = >{ .feature = FEAT_7_0_ECX, .bits = CPUID_7_0_ECX_PKU, > .size = sizeof(XSavePKRU) }, > +[XSTATE_XTILE_CFG_BIT] = { > +.feature = FEAT_7_0_EDX, .bits = CPUID_7_0_EDX_AMX_TILE, > +.size = sizeof(XSaveXTILECFG), > +}, > +[XSTATE_XTILE_DATA_BIT] = { > +.feature = FEAT_7_0_EDX, .bits = CPUID_7_0_EDX_AMX_TILE, > +.size = sizeof(XSaveXTILEDATA) > +}, > }; > > static uint32_t xsave_area_size(uint64_t mask) dme. -- Would you offer your throat to the wolf with the red roses?
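The two new layouts are simple enough to restate outside QEMU; a self-contained sketch of the size checks, with C11 static_assert standing in for QEMU_BUILD_BUG_ON:

  #include <assert.h>
  #include <stdint.h>

  /* Component 17: 64 bytes of tile configuration. */
  typedef struct XSaveXTILECFG {
      uint8_t xtilecfg[64];
  } XSaveXTILECFG;

  /* Component 18: eight tile registers of 1 KiB each. */
  typedef struct XSaveXTILEDATA {
      uint8_t xtiledata[8][1024];
  } XSaveXTILEDATA;

  static_assert(sizeof(XSaveXTILECFG) == 0x40, "XTILECFG is 64 bytes");
  static_assert(sizeof(XSaveXTILEDATA) == 0x2000, "XTILEDATA is 8 KiB");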
Re: [PATCH v2 1/8] x86: Fix the 64-byte boundary enumeration for extended state
On Wednesday, 2022-02-16 at 22:04:27 -08, Yang Zhong wrote: > From: Jing Liu > > The extended state subleaves (EAX=0Dh, ECX=n, n>1).ECX[1] > indicate whether the extended state component locates > on the next 64-byte boundary following the preceding state > component when the compacted format of an XSAVE area is > used. > > Right now, they are all zero because no supported component > needed the bit to be set, but the upcoming AMX feature will > use it. Fix the subleaves value according to KVM's supported > cpuid. > > Signed-off-by: Jing Liu > Signed-off-by: Yang Zhong Reviewed-by: David Edmondson > --- > target/i386/cpu.h | 6 ++ > target/i386/cpu.c | 1 + > target/i386/kvm/kvm-cpu.c | 1 + > 3 files changed, 8 insertions(+) > > diff --git a/target/i386/cpu.h b/target/i386/cpu.h > index 9911d7c871..de1dc124ab 100644 > --- a/target/i386/cpu.h > +++ b/target/i386/cpu.h > @@ -548,6 +548,11 @@ typedef enum X86Seg { > #define XSTATE_Hi16_ZMM_MASK(1ULL << XSTATE_Hi16_ZMM_BIT) > #define XSTATE_PKRU_MASK(1ULL << XSTATE_PKRU_BIT) > > +#define ESA_FEATURE_ALIGN64_BIT 1 > + > +#define ESA_FEATURE_ALIGN64_MASK(1U << ESA_FEATURE_ALIGN64_BIT) > + > + > /* CPUID feature words */ > typedef enum FeatureWord { > FEAT_1_EDX, /* CPUID[1].EDX */ > @@ -1354,6 +1359,7 @@ QEMU_BUILD_BUG_ON(sizeof(XSavePKRU) != 0x8); > typedef struct ExtSaveArea { > uint32_t feature, bits; > uint32_t offset, size; > +uint32_t ecx; > } ExtSaveArea; > > #define XSAVE_STATE_AREA_COUNT (XSTATE_PKRU_BIT + 1) > diff --git a/target/i386/cpu.c b/target/i386/cpu.c > index aa9e636800..37f06b0b1a 100644 > --- a/target/i386/cpu.c > +++ b/target/i386/cpu.c > @@ -5487,6 +5487,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, > uint32_t count, > const ExtSaveArea *esa = &x86_ext_save_areas[count]; > *eax = esa->size; > *ebx = esa->offset; > +*ecx = esa->ecx & ESA_FEATURE_ALIGN64_MASK; > } > } > break; > diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c > index d95028018e..ce27d3b1df 100644 > --- a/target/i386/kvm/kvm-cpu.c > +++ b/target/i386/kvm/kvm-cpu.c > @@ -104,6 +104,7 @@ static void kvm_cpu_xsave_init(void) > if (sz != 0) { > assert(esa->size == sz); > esa->offset = kvm_arch_get_supported_cpuid(s, 0xd, i, R_EBX); > +esa->ecx = kvm_arch_get_supported_cpuid(s, 0xd, i, R_ECX); > } > } > } dme. -- We're deep in discussion, the party's on mute.
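For context, the align64 bit matters when a consumer computes component offsets in the compacted XSAVE format; a sketch of the layout rule, assuming every component in the range is enabled in XCOMP_BV:

  #include <stdint.h>

  struct esa { uint32_t size; int align64; };

  /* Compacted format: components 0 and 1 live in the 512-byte legacy
   * area, component 2 starts right after the 64-byte XSAVE header, and
   * each later component follows the previous one, rounded up to the
   * next 64-byte boundary when its ECX[1] (align64) bit is set. */
  static uint32_t compacted_offset(const struct esa *c, int target)
  {
      uint32_t off = 512 + 64;

      for (int i = 2; i <= target; i++) {
          if (c[i].align64) {
              off = (off + 63) & ~63u;
          }
          if (i == target) {
              break;
          }
          off += c[i].size;
      }
      return off;
  }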
Re: [PATCH v2 4/8] x86: Add XFD faulting bit for state components
On Wednesday, 2022-02-16 at 22:04:30 -08, Yang Zhong wrote: > From: Jing Liu > > Intel introduces XFD faulting mechanism for extended > XSAVE features to dynamically enable the features in > runtime. If CPUID (EAX=0Dh, ECX=n, n>1).ECX[2] is set > as 1, it indicates support for XFD faulting of this > state component. > > Signed-off-by: Jing Liu > Signed-off-by: Yang Zhong Small comment below... Reviewed-by: David Edmondson > --- > target/i386/cpu.h | 2 ++ > target/i386/cpu.c | 3 ++- > 2 files changed, 4 insertions(+), 1 deletion(-) > > diff --git a/target/i386/cpu.h b/target/i386/cpu.h > index d4ad0f56bd..f7fc2e97a6 100644 > --- a/target/i386/cpu.h > +++ b/target/i386/cpu.h > @@ -558,8 +558,10 @@ typedef enum X86Seg { > #define ARCH_REQ_XCOMP_GUEST_PERM 0x1025 > > #define ESA_FEATURE_ALIGN64_BIT 1 > +#define ESA_FEATURE_XFD_BIT 2 > > #define ESA_FEATURE_ALIGN64_MASK(1U << ESA_FEATURE_ALIGN64_BIT) > +#define ESA_FEATURE_XFD_MASK(1U << ESA_FEATURE_XFD_BIT) > > /* CPUID feature words */ > diff --git a/target/i386/cpu.c b/target/i386/cpu.c > index 377d993438..5a7ee8c7e1 100644 > --- a/target/i386/cpu.c > +++ b/target/i386/cpu.c > @@ -5497,7 +5497,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, > uint32_t count, > const ExtSaveArea *esa = &x86_ext_save_areas[count]; > *eax = esa->size; > *ebx = esa->offset; > -*ecx = esa->ecx & ESA_FEATURE_ALIGN64_MASK; > +*ecx = (esa->ecx & ESA_FEATURE_ALIGN64_MASK) | > + (esa->ecx & ESA_FEATURE_XFD_MASK); Is: *ecx = esa->ecx & (ESA_FEATURE_ALIGN64_MASK | ESA_FEATURE_XFD_MASK); not more usual? > } > } > break; dme. -- All of us, we're going out tonight. We're gonna walk all over your cars.
Re: [PATCH v2 6/8] x86: add support for KVM_CAP_XSAVE2 and AMX state migration
On Wednesday, 2022-02-16 at 22:04:32 -08, Yang Zhong wrote: > From: Jing Liu > > When dynamic xfeatures (e.g. AMX) are used by the guest, the xsave > area would be larger than 4KB. KVM_GET_XSAVE2 and KVM_SET_XSAVE > under KVM_CAP_XSAVE2 works with a xsave buffer larger than 4KB. > Always use the new ioctls under KVM_CAP_XSAVE2 when KVM supports it. > > Signed-off-by: Jing Liu > Signed-off-by: Zeng Guang > Signed-off-by: Wei Wang > Signed-off-by: Yang Zhong > --- > target/i386/cpu.h | 4 > target/i386/kvm/kvm.c | 42 -- > target/i386/xsave_helper.c | 33 ++ > 3 files changed, 64 insertions(+), 15 deletions(-) > > diff --git a/target/i386/cpu.h b/target/i386/cpu.h > index f7fc2e97a6..de9da38e42 100644 > --- a/target/i386/cpu.h > +++ b/target/i386/cpu.h > @@ -1528,6 +1528,10 @@ typedef struct CPUX86State { > uint64_t opmask_regs[NB_OPMASK_REGS]; > YMMReg zmmh_regs[CPU_NB_REGS]; > ZMMReg hi16_zmm_regs[CPU_NB_REGS]; > +#ifdef TARGET_X86_64 > +uint8_t xtilecfg[64]; > +uint8_t xtiledata[8192]; > +#endif Can we have defined constants for these sizes? They also appear in patch 2. > > /* sysenter registers */ > uint32_t sysenter_cs; > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c > index 8562d3d138..ff064e3d8f 100644 > --- a/target/i386/kvm/kvm.c > +++ b/target/i386/kvm/kvm.c > @@ -122,6 +122,7 @@ static uint32_t num_architectural_pmu_gp_counters; > static uint32_t num_architectural_pmu_fixed_counters; > > static int has_xsave; > +static int has_xsave2; > static int has_xcrs; > static int has_pit_state2; > static int has_sregs2; > @@ -1585,6 +1586,26 @@ static Error *invtsc_mig_blocker; > > #define KVM_MAX_CPUID_ENTRIES 100 > > +static void kvm_init_xsave(CPUX86State *env) > +{ > +if (has_xsave2) { > +env->xsave_buf_len = QEMU_ALIGN_UP(has_xsave2, 4096); Idle curiosity - why do we round this up? > +} else if (has_xsave) { > +env->xsave_buf_len = sizeof(struct kvm_xsave); > +} else { > +return; > +} > + > +env->xsave_buf = qemu_memalign(4096, env->xsave_buf_len); > +memset(env->xsave_buf, 0, env->xsave_buf_len); > + /* > + * The allocated storage must be large enough for all of the > + * possible XSAVE state components. > + */ > +assert(kvm_arch_get_supported_cpuid(kvm_state, 0xd, 0, R_ECX) <= > + env->xsave_buf_len); > +} > + > int kvm_arch_init_vcpu(CPUState *cs) > { > struct { > @@ -1614,6 +1635,8 @@ int kvm_arch_init_vcpu(CPUState *cs) > > cpuid_i = 0; > > +has_xsave2 = kvm_check_extension(cs->kvm_state, KVM_CAP_XSAVE2); > + > r = kvm_arch_set_tsc_khz(cs); > if (r < 0) { > return r; > @@ -2003,19 +2026,7 @@ int kvm_arch_init_vcpu(CPUState *cs) > if (r) { > goto fail; > } > - > -if (has_xsave) { > -env->xsave_buf_len = sizeof(struct kvm_xsave); > -env->xsave_buf = qemu_memalign(4096, env->xsave_buf_len); > -memset(env->xsave_buf, 0, env->xsave_buf_len); > - > -/* > - * The allocated storage must be large enough for all of the > - * possible XSAVE state components. > - */ > -assert(kvm_arch_get_supported_cpuid(kvm_state, 0xd, 0, R_ECX) > - <= env->xsave_buf_len); > -} > +kvm_init_xsave(env); > > max_nested_state_len = kvm_max_nested_state_length(); > if (max_nested_state_len > 0) { > @@ -3319,13 +3330,14 @@ static int kvm_get_xsave(X86CPU *cpu) > { > CPUX86State *env = &cpu->env; > void *xsave = env->xsave_buf; > -int ret; > +int type, ret; > > if (!has_xsave) { > return kvm_get_fpu(cpu); > } > > -ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_XSAVE, xsave); > +type = has_xsave2 ? 
KVM_GET_XSAVE2 : KVM_GET_XSAVE; > +ret = kvm_vcpu_ioctl(CPU(cpu), type, xsave); > if (ret < 0) { > return ret; > } > diff --git a/target/i386/xsave_helper.c b/target/i386/xsave_helper.c > index ac61a96344..b6a004505f 100644 > --- a/target/i386/xsave_helper.c > +++ b/target/i386/xsave_helper.c > @@ -5,6 +5,7 @@ > #include "qemu/osdep.h" > > #include "cpu.h" > +#include > > void x86_cpu_xsave_all_areas(X86CPU *cpu, void *buf, uint32_t buflen) > { > @@ -126,6 +127,22 @@ void x86_cpu_xsave_all_areas(X86CPU *cpu, void *buf, > uint32_t buflen) > > memcpy(pkru, &env->pkru, sizeof(env->pkru)); > } > + > +e = &x86_ext_save_areas[XSTATE_XTILE_CFG_BIT]; > +if (e->size && e->offset) { > +XSaveXTILECFG *tilecfg = buf + e->offset; > + > +memcpy(tilecfg, &env->xtilecfg, sizeof(env->xtilecfg)); > +} > + > +if (buflen > sizeof(struct kvm_xsave)) { > +e = &x86_ext_save_areas[XSTATE_XTILE_DATA_BIT]; > +if (e->size && e->offset && buflen >= e->size + e->offset) { > +XSaveXTILEDATA *tiledata = buf + e->offset; > + > +memcpy(tiledata,
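The userspace side of the new capability can be sketched independently of QEMU (assuming kernel headers new enough to define KVM_CAP_XSAVE2; the value KVM_CHECK_EXTENSION returns for it is the buffer size that KVM_GET_XSAVE2 needs, and the 4 KiB alignment below simply mirrors the patch's QEMU_ALIGN_UP):

  #include <linux/kvm.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>

  /* Size and allocate the xsave buffer the way the patch does: use the
   * KVM_CAP_XSAVE2 value when present, else the fixed 4 KiB struct. */
  static void *alloc_xsave_buf(int kvm_fd, size_t *len)
  {
      int sz = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XSAVE2);

      if (sz > 0) {
          *len = ((size_t)sz + 4095) & ~(size_t)4095;
      } else {
          *len = sizeof(struct kvm_xsave);
      }
      return calloc(1, *len);
  }

The matching vcpu ioctl is then KVM_GET_XSAVE2 when the capability reported a size, KVM_GET_XSAVE otherwise.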
Re: [PATCH v2 7/8] x86: Support XFD and AMX xsave data migration
On Wednesday, 2022-02-16 at 22:04:33 -08, Yang Zhong wrote: > From: Zeng Guang > > XFD(eXtended Feature Disable) allows to enable a > feature on xsave state while preventing specific > user threads from using the feature. > > Support save and restore XFD MSRs if CPUID.D.1.EAX[4] > enumerate to be valid. Likewise migrate the MSRs and > related xsave state necessarily. > > Signed-off-by: Zeng Guang > Signed-off-by: Wei Wang > Signed-off-by: Yang Zhong Reviewed-by: David Edmondson > --- > target/i386/cpu.h | 9 + > target/i386/kvm/kvm.c | 18 ++ > target/i386/machine.c | 42 ++ > 3 files changed, 69 insertions(+) > > diff --git a/target/i386/cpu.h b/target/i386/cpu.h > index de9da38e42..509c16323a 100644 > --- a/target/i386/cpu.h > +++ b/target/i386/cpu.h > @@ -505,6 +505,9 @@ typedef enum X86Seg { > > #define MSR_VM_HSAVE_PA 0xc0010117 > > +#define MSR_IA32_XFD0x01c4 > +#define MSR_IA32_XFD_ERR0x01c5 > + > #define MSR_IA32_BNDCFGS0x0d90 > #define MSR_IA32_XSS0x0da0 > #define MSR_IA32_UMWAIT_CONTROL 0xe1 > @@ -873,6 +876,8 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS]; > #define CPUID_7_1_EAX_AVX_VNNI (1U << 4) > /* AVX512 BFloat16 Instruction */ > #define CPUID_7_1_EAX_AVX512_BF16 (1U << 5) > +/* XFD Extend Feature Disabled */ > +#define CPUID_D_1_EAX_XFD (1U << 4) > > /* Packets which contain IP payload have LIP values */ > #define CPUID_14_0_ECX_LIP (1U << 31) > @@ -1617,6 +1622,10 @@ typedef struct CPUX86State { > uint64_t msr_rtit_cr3_match; > uint64_t msr_rtit_addrs[MAX_RTIT_ADDRS]; > > +/* Per-VCPU XFD MSRs */ > +uint64_t msr_xfd; > +uint64_t msr_xfd_err; > + > /* exception/interrupt handling */ > int error_code; > int exception_is_int; > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c > index ff064e3d8f..3dd24b6b0e 100644 > --- a/target/i386/kvm/kvm.c > +++ b/target/i386/kvm/kvm.c > @@ -3275,6 +3275,13 @@ static int kvm_put_msrs(X86CPU *cpu, int level) >env->msr_ia32_sgxlepubkeyhash[3]); > } > > +if (env->features[FEAT_XSAVE] & CPUID_D_1_EAX_XFD) { > +kvm_msr_entry_add(cpu, MSR_IA32_XFD, > + env->msr_xfd); > +kvm_msr_entry_add(cpu, MSR_IA32_XFD_ERR, > + env->msr_xfd_err); > +} > + > /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see > * kvm_put_msr_feature_control. 
*/ > } > @@ -3667,6 +3674,11 @@ static int kvm_get_msrs(X86CPU *cpu) > kvm_msr_entry_add(cpu, MSR_IA32_SGXLEPUBKEYHASH3, 0); > } > > +if (env->features[FEAT_XSAVE] & CPUID_D_1_EAX_XFD) { > +kvm_msr_entry_add(cpu, MSR_IA32_XFD, 0); > +kvm_msr_entry_add(cpu, MSR_IA32_XFD_ERR, 0); > +} > + > ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_MSRS, cpu->kvm_msr_buf); > if (ret < 0) { > return ret; > @@ -3963,6 +3975,12 @@ static int kvm_get_msrs(X86CPU *cpu) > env->msr_ia32_sgxlepubkeyhash[index - MSR_IA32_SGXLEPUBKEYHASH0] > = > msrs[i].data; > break; > +case MSR_IA32_XFD: > +env->msr_xfd = msrs[i].data; > +break; > +case MSR_IA32_XFD_ERR: > +env->msr_xfd_err = msrs[i].data; > +break; > } > } > > diff --git a/target/i386/machine.c b/target/i386/machine.c > index 6202f47793..1f9d0c46f1 100644 > --- a/target/i386/machine.c > +++ b/target/i386/machine.c > @@ -1483,6 +1483,46 @@ static const VMStateDescription vmstate_pdptrs = { > } > }; > > +static bool xfd_msrs_needed(void *opaque) > +{ > +X86CPU *cpu = opaque; > +CPUX86State *env = &cpu->env; > + > +return !!(env->features[FEAT_XSAVE] & CPUID_D_1_EAX_XFD); > +} > + > +static const VMStateDescription vmstate_msr_xfd = { > +.name = "cpu/msr_xfd", > +.version_id = 1, > +.minimum_version_id = 1, > +.needed = xfd_msrs_needed, > +.fields = (VMStateField[]) { > +VMSTATE_UINT64(env.msr_xfd, X86CPU), > +VMSTATE_UINT64(env.msr_xfd_err, X86CPU), > +VMSTATE_END_OF_LIST() > +} > +}; > + > +static bool amx_xtile_needed(void *opaque) > +{ > +X86CPU *cpu = opaque; > +CPUX86State *env = &cpu->env; > + > +return !
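The enumeration the patch keys on, CPUID.D.1.EAX[4], is easy to probe; a minimal sketch assuming GCC's cpuid.h:

  #include <stdbool.h>
  #include <cpuid.h>

  /* XFD support: CPUID leaf 0xD, subleaf 1, EAX bit 4. When this is
   * clear, the MSR_IA32_XFD/MSR_IA32_XFD_ERR pair is neither saved nor
   * migrated. */
  static bool cpu_has_xfd(void)
  {
      unsigned int eax, ebx, ecx, edx;

      __cpuid_count(0xD, 1, eax, ebx, ecx, edx);
      return eax & (1u << 4);
  }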
[PATCH 1/2] scripts/qemu-gdb/mtree.py: Int128 are decimal rather than hex
When parsing QEMU's native Int128 type, do not attempt to convert from hexadecimal. Fixes: 8037fa55ac ("scripts/qemugdb/mtree.py: fix up mtree dump") Signed-off-by: David Edmondson --- scripts/qemugdb/mtree.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/qemugdb/mtree.py b/scripts/qemugdb/mtree.py index 8fe42c3c12..c1557d44fa 100644 --- a/scripts/qemugdb/mtree.py +++ b/scripts/qemugdb/mtree.py @@ -25,7 +25,7 @@ def int128(p): if p.type.code == gdb.TYPE_CODE_STRUCT: return int(p['lo']) + (int(p['hi']) << 64) else: -return int(("%s" % p), 16) +return int("%s" % p) class MtreeCommand(gdb.Command): '''Display the memory tree hierarchy''' -- 2.34.1
[PATCH 2/2] scripts/qemu-gdb/timers.py: the 'last' attribute is no more
The 'last' member of QEMUClock was removed some time ago, but the python gdb helper did not notice. Fixes: 3c2d4c8aa6 ("timer: last, remove last bits of last") Signed-off-by: David Edmondson --- scripts/qemugdb/timers.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/scripts/qemugdb/timers.py b/scripts/qemugdb/timers.py index 46537b27cf..0538677288 100644 --- a/scripts/qemugdb/timers.py +++ b/scripts/qemugdb/timers.py @@ -37,10 +37,9 @@ def dump_timers(self, timer): def process_timerlist(self, tlist, ttype): gdb.write("Processing %s timers\n" % (ttype)) -gdb.write(" clock %s is enabled:%s, last:%s\n" % ( +gdb.write(" clock %s is enabled:%s\n" % ( tlist['clock']['type'], -tlist['clock']['enabled'], -tlist['clock']['last'])) +tlist['clock']['enabled'])) if int(tlist['active_timers']) > 0: self.dump_timers(tlist['active_timers']) -- 2.34.1
[PATCH 0/2] Minor fixes for the Python GDB plugins
In attempting to use the Python GDB plugins a couple of problems were encountered. David Edmondson (2): scripts/qemu-gdb/mtree.py: Int128 are decimal rather than hex scripts/qemu-gdb/timers.py: the 'last' attribute is no more scripts/qemugdb/mtree.py | 2 +- scripts/qemugdb/timers.py | 5 ++--- 2 files changed, 3 insertions(+), 4 deletions(-) -- 2.34.1
Re: [PULL 15/22] x86: Grant AMX permission for guest
On Wednesday, 2022-03-16 at 16:05:01 GMT, Daniel P. Berrangé wrote: > On Wed, Mar 16, 2022 at 04:57:39PM +0100, Peter Krempa wrote: >> On Tue, Mar 08, 2022 at 12:34:38 +0100, Paolo Bonzini wrote: >> > From: Yang Zhong >> > >> > Kernel allocates 4K xstate buffer by default. For XSAVE features >> > which require large state component (e.g. AMX), Linux kernel >> > dynamically expands the xstate buffer only after the process has >> > acquired the necessary permissions. Those are called dynamically- >> > enabled XSAVE features (or dynamic xfeatures). >> > >> > There are separate permissions for native tasks and guests. >> > >> > Qemu should request the guest permissions for dynamic xfeatures >> > which will be exposed to the guest. This only needs to be done >> > once before the first vcpu is created. >> > >> > KVM implemented one new ARCH_GET_XCOMP_SUPP system attribute API to >> > get host side supported_xcr0 and Qemu can decide if it can request >> > dynamically enabled XSAVE features permission. >> > https://lore.kernel.org/all/20220126152210.3044876-1-pbonz...@redhat.com/ >> > >> > Suggested-by: Paolo Bonzini >> > Signed-off-by: Yang Zhong >> > Signed-off-by: Jing Liu >> > Message-Id: <20220217060434.52460-4-yang.zh...@intel.com> >> > Signed-off-by: Paolo Bonzini >> > --- >> > target/i386/cpu.c | 7 + >> > target/i386/cpu.h | 4 +++ >> > target/i386/kvm/kvm-cpu.c | 12 >> > target/i386/kvm/kvm.c | 57 ++ >> > target/i386/kvm/kvm_i386.h | 1 + >> > 5 files changed, 75 insertions(+), 6 deletions(-) >> >> With this commit qemu crashes for me when invoking the following >> QMP command: > > It is way worse than that even. If you remove '-S' you get an > immediate kaboom on startup on AMD hosts Which AMD CPU is in this host? > $ ./build/qemu-system-x86_64 -accel kvm > Unable to init server: Could not connect: Connection refused > qemu-system-x86_64: ../target/i386/kvm/kvm-cpu.c:105: kvm_cpu_xsave_init: > Assertion `esa->size == eax' failed. > Aborted (core dumped) > > With regards, > Daniel dme. -- My girl Friday, she no square.
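For reference, the permission request the cover letter describes is an arch_prctl that must happen before the first vCPU exists; a sketch assuming an x86-64 Linux host (ARCH_REQ_XCOMP_GUEST_PERM, 0x1025, is defined earlier in the series; bit 18 is XTILEDATA):

  #include <sys/syscall.h>
  #include <unistd.h>

  #define ARCH_REQ_XCOMP_GUEST_PERM 0x1025
  #define XSTATE_XTILE_DATA_BIT     18

  /* Ask the kernel to expand this process's guest xstate permissions
   * to include the dynamically-enabled AMX tile data component. */
  static long request_amx_guest_perm(void)
  {
      return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_GUEST_PERM,
                     XSTATE_XTILE_DATA_BIT);
  }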
Re: [PATCH v5 1/8] migration: Export ram_transferred_ram()
On Thursday, 2022-03-10 at 16:34:47 +01, Juan Quintela wrote: > Subject: Re: [PATCH v5 1/8] migration: Export ram_transferred_ram() The function is "ram_transferred_add()". > Signed-off-by: Juan Quintela Reviewed-by: David Edmondson > --- > migration/ram.h | 2 ++ > migration/ram.c | 2 +- > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/migration/ram.h b/migration/ram.h > index 2c6dc3675d..2e27c49f90 100644 > --- a/migration/ram.h > +++ b/migration/ram.h > @@ -64,6 +64,8 @@ int ram_postcopy_incoming_init(MigrationIncomingState *mis); > > void ram_handle_compressed(void *host, uint8_t ch, uint64_t size); > > +void ram_transferred_add(uint64_t bytes); > + > int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr); > bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t > byte_offset); > void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr); > diff --git a/migration/ram.c b/migration/ram.c > index 170e522a1f..947ed44c89 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -394,7 +394,7 @@ uint64_t ram_bytes_remaining(void) > > MigrationStats ram_counters; > > -static void ram_transferred_add(uint64_t bytes) > +void ram_transferred_add(uint64_t bytes) > { > if (runstate_is_running()) { > ram_counters.precopy_bytes += bytes; dme. -- Oh well, I'm dressed up so nice and I'm doing my best.
Re: [PATCH 1/2] scripts/qemu-gdb/mtree.py: Int128 are decimal rather than hex
Ping? On Monday, 2022-02-21 at 16:49:47 GMT, David Edmondson wrote: > When parsing QEMU's native Int128 type, do not attempt to convert from > hexadecimal. > > Fixes: 8037fa55ac ("scripts/qemugdb/mtree.py: fix up mtree dump") > Signed-off-by: David Edmondson > --- > scripts/qemugdb/mtree.py | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/scripts/qemugdb/mtree.py b/scripts/qemugdb/mtree.py > index 8fe42c3c12..c1557d44fa 100644 > --- a/scripts/qemugdb/mtree.py > +++ b/scripts/qemugdb/mtree.py > @@ -25,7 +25,7 @@ def int128(p): > if p.type.code == gdb.TYPE_CODE_STRUCT: > return int(p['lo']) + (int(p['hi']) << 64) > else: > -return int(("%s" % p), 16) > +return int("%s" % p) > > class MtreeCommand(gdb.Command): > '''Display the memory tree hierarchy''' dme. -- The band is just fantastic, that is really what I think.
Re: [PATCH v2] docs/specs/tpm: Clarify command line parameters for network migration
On Wednesday, 2021-10-13 at 23:27:00 -04, Stefan Berger wrote: > Clarify the command line parameters for swtpm migration across the network > for the case of shared storage and non-shared storage. > > Signed-off-by: Stefan Berger > --- > docs/specs/tpm.rst | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/docs/specs/tpm.rst b/docs/specs/tpm.rst > index 3be190343a..956f2c13dc 100644 > --- a/docs/specs/tpm.rst > +++ b/docs/specs/tpm.rst > @@ -482,7 +482,8 @@ VM migration across the network: > - QEMU command line parameters should be identical apart from the > '-incoming' option on the destination side > > - - swtpm command line parameters should be identical > + - swtpm command line parameters can be identical if storage is not > + shared and should be different for shared storage I would interpret this as relating to the storage of the VM (virtual disk devices), but you are referring to the storage used by swtpm, presumably. How about: - typically, swtpm command line parameters should be different if the vTPM state directory is on shared storage, and should be the same if the vTPM state directory is not shared. > > VM Snapshotting: > - QEMU command line parameters should be identical
[PATCH] migration: Report the error returned when save_live_iterate fails
Should qemu_savevm_state_iterate() encounter a failure when calling a particular save_live_iterate function, report the error code returned by the function. Signed-off-by: David Edmondson --- migration/savevm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/migration/savevm.c b/migration/savevm.c index d59e976d50..26aac04d91 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -1298,8 +1298,9 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy) save_section_footer(f, se); if (ret < 0) { -error_report("failed to save SaveStateEntry with id(name): %d(%s)", - se->section_id, se->idstr); +error_report("failed to save SaveStateEntry with id(name): " + "%d(%s): %d", + se->section_id, se->idstr, ret); qemu_file_set_error(f, ret); } if (ret <= 0) { -- 2.33.0
Re: [PATCH 1/3] migration: Merge ram_counters and ram_atomic_counters
On Wednesday, 2023-03-01 at 13:40:24 +01, Juan Quintela wrote: > Using MgrationStats as type for ram_counters mean that we didn't have > to re-declare each value in another struct. The need of atomic > counters have make us to create MigrationAtomicStats for this atomic > counters. > > Create RAMStats type which is a merge of MigrationStats and > MigrationAtomicStats removing unused members. > > Signed-off-by: Juan Quintela > --- > migration/ram.h | 26 ++ > migration/migration.c | 8 > migration/multifd.c | 4 ++-- > migration/ram.c | 39 --- > 4 files changed, 36 insertions(+), 41 deletions(-) > > diff --git a/migration/ram.h b/migration/ram.h > index 81cbb0947c..ca9adcb2ad 100644 > --- a/migration/ram.h > +++ b/migration/ram.h > @@ -35,25 +35,27 @@ > #include "qemu/stats64.h" > > /* > - * These are the migration statistic counters that need to be updated using > - * atomic ops (can be accessed by more than one thread). Here since we > - * cannot modify MigrationStats directly to use Stat64 as it was defined in > - * the QAPI scheme, we define an internal structure to hold them, and we > - * propagate the real values when QMP queries happen. > - * > - * IOW, the corresponding fields within ram_counters on these specific > - * fields will be always zero and not being used at all; they're just > - * placeholders to make it QAPI-compatible. > + * These are the ram migration statistic counters. It is lousely s/lousely/loosely/ > + * based on MigrationStats. We change to Stat64 any counter that need s/need/needs/ > + * to be updated using atomic ops (can be accessed by more than one > + * thread). > */ > typedef struct { > Stat64 transferred; > Stat64 duplicate; > Stat64 normal; > Stat64 postcopy_bytes; > -} MigrationAtomicStats; > +int64_t remaining; > +int64_t dirty_pages_rate; > +int64_t dirty_sync_count; > +int64_t postcopy_requests; > +uint64_t multifd_bytes; > +uint64_t precopy_bytes; > +uint64_t downtime_bytes; > +uint64_t dirty_sync_missed_zero_copy; > +} RAMStats; > > -extern MigrationAtomicStats ram_atomic_counters; > -extern MigrationStats ram_counters; > +extern RAMStats ram_counters; > extern XBZRLECacheStats xbzrle_counters; > extern CompressionStats compression_counters; > > diff --git a/migration/migration.c b/migration/migration.c > index ae2025d9d8..923f4762f4 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -1140,12 +1140,12 @@ static void populate_ram_info(MigrationInfo *info, > MigrationState *s) > size_t page_size = qemu_target_page_size(); > > info->ram = g_malloc0(sizeof(*info->ram)); > -info->ram->transferred = stat64_get(&ram_atomic_counters.transferred); > +info->ram->transferred = stat64_get(&ram_counters.transferred); > info->ram->total = ram_bytes_total(); > -info->ram->duplicate = stat64_get(&ram_atomic_counters.duplicate); > +info->ram->duplicate = stat64_get(&ram_counters.duplicate); > /* legacy value. 
It is not used anymore */ > info->ram->skipped = 0; > -info->ram->normal = stat64_get(&ram_atomic_counters.normal); > +info->ram->normal = stat64_get(&ram_counters.normal); > info->ram->normal_bytes = info->ram->normal * page_size; > info->ram->mbps = s->mbps; > info->ram->dirty_sync_count = ram_counters.dirty_sync_count; > @@ -1157,7 +1157,7 @@ static void populate_ram_info(MigrationInfo *info, > MigrationState *s) > info->ram->pages_per_second = s->pages_per_second; > info->ram->precopy_bytes = ram_counters.precopy_bytes; > info->ram->downtime_bytes = ram_counters.downtime_bytes; > -info->ram->postcopy_bytes = > stat64_get(&ram_atomic_counters.postcopy_bytes); > +info->ram->postcopy_bytes = stat64_get(&ram_counters.postcopy_bytes); > > if (migrate_use_xbzrle()) { > info->xbzrle_cache = g_malloc0(sizeof(*info->xbzrle_cache)); > diff --git a/migration/multifd.c b/migration/multifd.c > index 5e85c3ea9b..7cb2326d03 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -433,7 +433,7 @@ static int multifd_send_pages(QEMUFile *f) > transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len; > qemu_file_acct_rate_limit(f, transferred); > ram_counters.multifd_bytes += transferred; > -stat64_add(&ram_atomic_counters.transferred, transferred); > +stat64_add(&ram_counters.transferred, transferred); > qemu_mutex_unlock(&p->mutex); > qemu_sem_post(&p->sem); > > @@ -628,7 +628,7 @@ int multifd_send_sync_main(QEMUFile *f) > p->pending_job++; > qemu_file_acct_rate_limit(f, p->packet_len); > ram_counters.multifd_bytes += p->packet_len; > -stat64_add(&ram_atomic_counters.transferred, p->packet_len); > +stat64_add(&ram_counters.transferred, p->packet_len); > qemu_mutex_unlock(&p->mutex); > q
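The shape of the merged struct is easy to mimic outside QEMU; a sketch of the idea using C11 atomics in place of Stat64 (field selection illustrative):

  #include <stdatomic.h>
  #include <stdint.h>

  /* Counters touched by several threads (e.g. the multifd senders)
   * become atomics; counters with a single writer can stay plain
   * fields of the same struct. */
  typedef struct {
      _Atomic uint64_t transferred;  /* updated from multiple threads */
      uint64_t precopy_bytes;        /* migration thread only */
  } RAMStatsSketch;

  static void stat_add(_Atomic uint64_t *s, uint64_t v)
  {
      atomic_fetch_add_explicit(s, v, memory_order_relaxed);
  }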
Re: [PATCH 2/3] migration: Update atomic stats out of the mutex
On Wednesday, 2023-03-01 at 13:40:25 +01, Juan Quintela wrote: > Signed-off-by: Juan Quintela Reviewed-by: David Edmondson > --- > migration/multifd.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/migration/multifd.c b/migration/multifd.c > index 7cb2326d03..f558169e37 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -433,8 +433,8 @@ static int multifd_send_pages(QEMUFile *f) > transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len; > qemu_file_acct_rate_limit(f, transferred); > ram_counters.multifd_bytes += transferred; > +qemu_mutex_unlock(&p->mutex); > stat64_add(&ram_counters.transferred, transferred); > -qemu_mutex_unlock(&p->mutex); > qemu_sem_post(&p->sem); > > return 1; > @@ -628,8 +628,8 @@ int multifd_send_sync_main(QEMUFile *f) > p->pending_job++; > qemu_file_acct_rate_limit(f, p->packet_len); > ram_counters.multifd_bytes += p->packet_len; > +qemu_mutex_unlock(&p->mutex); > stat64_add(&ram_counters.transferred, p->packet_len); > -qemu_mutex_unlock(&p->mutex); > qemu_sem_post(&p->sem); > } > for (i = 0; i < migrate_multifd_channels(); i++) { -- I walk like a building, I never get wet.
Re: [PATCH 3/3] migration: Make multifd_bytes atomic
On Wednesday, 2023-03-01 at 13:40:26 +01, Juan Quintela wrote: > In the spirit of: > > commit 394d323bc3451e4d07f13341cb8817fac8dfbadd > Author: Peter Xu > Date: Tue Oct 11 17:55:51 2022 -0400 > > migration: Use atomic ops properly for page accountings > > Signed-off-by: Juan Quintela Reviewed-by: David Edmondson > --- > migration/ram.h | 2 +- > migration/migration.c | 4 ++-- > migration/multifd.c | 4 ++-- > 3 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/migration/ram.h b/migration/ram.h > index ca9adcb2ad..41fcec73ba 100644 > --- a/migration/ram.h > +++ b/migration/ram.h > @@ -45,11 +45,11 @@ typedef struct { > Stat64 duplicate; > Stat64 normal; > Stat64 postcopy_bytes; > +Stat64 multifd_bytes; > int64_t remaining; > int64_t dirty_pages_rate; > int64_t dirty_sync_count; > int64_t postcopy_requests; > -uint64_t multifd_bytes; > uint64_t precopy_bytes; > uint64_t downtime_bytes; > uint64_t dirty_sync_missed_zero_copy; > diff --git a/migration/migration.c b/migration/migration.c > index 923f4762f4..ca52c8aab3 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -1153,7 +1153,7 @@ static void populate_ram_info(MigrationInfo *info, > MigrationState *s) > ram_counters.dirty_sync_missed_zero_copy; > info->ram->postcopy_requests = ram_counters.postcopy_requests; > info->ram->page_size = page_size; > -info->ram->multifd_bytes = ram_counters.multifd_bytes; > +info->ram->multifd_bytes = stat64_get(&ram_counters.multifd_bytes); > info->ram->pages_per_second = s->pages_per_second; > info->ram->precopy_bytes = ram_counters.precopy_bytes; > info->ram->downtime_bytes = ram_counters.downtime_bytes; > @@ -3774,7 +3774,7 @@ static MigThrError > migration_detect_error(MigrationState *s) > static uint64_t migration_total_bytes(MigrationState *s) > { > return qemu_file_total_transferred(s->to_dst_file) + > -ram_counters.multifd_bytes; > +stat64_get(&ram_counters.multifd_bytes); > } > > static void migration_calculate_complete(MigrationState *s) > diff --git a/migration/multifd.c b/migration/multifd.c > index f558169e37..91552d33bf 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -432,9 +432,9 @@ static int multifd_send_pages(QEMUFile *f) > p->pages = pages; > transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len; > qemu_file_acct_rate_limit(f, transferred); > -ram_counters.multifd_bytes += transferred; > qemu_mutex_unlock(&p->mutex); > stat64_add(&ram_counters.transferred, transferred); > +stat64_add(&ram_counters.multifd_bytes, transferred); > qemu_sem_post(&p->sem); > > return 1; > @@ -627,9 +627,9 @@ int multifd_send_sync_main(QEMUFile *f) > p->flags |= MULTIFD_FLAG_SYNC; > p->pending_job++; > qemu_file_acct_rate_limit(f, p->packet_len); > -ram_counters.multifd_bytes += p->packet_len; > qemu_mutex_unlock(&p->mutex); > stat64_add(&ram_counters.transferred, p->packet_len); > +stat64_add(&ram_counters.multifd_bytes, p->packet_len); > qemu_sem_post(&p->sem); > } > for (i = 0; i < migrate_multifd_channels(); i++) { -- I can't explain, you would not understand. This is not how I am.
Re: [PATCH v2 1/3] migration: Merge ram_counters and ram_atomic_counters
Juan Quintela writes: > Using MgrationStats as type for ram_counters mean that we didn't have Minor typo - MigrationStats. > to re-declare each value in another struct. The need of atomic > counters have make us to create MigrationAtomicStats for this atomic > counters. > > Create RAMStats type which is a merge of MigrationStats and > MigrationAtomicStats removing unused members. > > Signed-off-by: Juan Quintela > > --- > > Fix typos found by David Edmondson > --- > migration/ram.h | 26 ++ > migration/migration.c | 8 > migration/multifd.c | 4 ++-- > migration/ram.c | 39 --- > 4 files changed, 36 insertions(+), 41 deletions(-) > > diff --git a/migration/ram.h b/migration/ram.h > index 81cbb0947c..64fb92ccd4 100644 > --- a/migration/ram.h > +++ b/migration/ram.h > @@ -35,25 +35,27 @@ > #include "qemu/stats64.h" > > /* > - * These are the migration statistic counters that need to be updated using > - * atomic ops (can be accessed by more than one thread). Here since we > - * cannot modify MigrationStats directly to use Stat64 as it was defined in > - * the QAPI scheme, we define an internal structure to hold them, and we > - * propagate the real values when QMP queries happen. > - * > - * IOW, the corresponding fields within ram_counters on these specific > - * fields will be always zero and not being used at all; they're just > - * placeholders to make it QAPI-compatible. > + * These are the ram migration statistic counters. It is loosely > + * based on MigrationStats. We change to Stat64 any counter that Given that the comment should reflect the final state rather than any progression, how about: "We use Stat64 for any counter..." > + * needs to be updated using atomic ops (can be accessed by more than > + * one thread). > */ > typedef struct { > Stat64 transferred; > Stat64 duplicate; > Stat64 normal; > Stat64 postcopy_bytes; > -} MigrationAtomicStats; > +int64_t remaining; > +int64_t dirty_pages_rate; > +int64_t dirty_sync_count; > +int64_t postcopy_requests; > +uint64_t multifd_bytes; > +uint64_t precopy_bytes; > +uint64_t downtime_bytes; > +uint64_t dirty_sync_missed_zero_copy; > +} RAMStats; > > -extern MigrationAtomicStats ram_atomic_counters; > -extern MigrationStats ram_counters; > +extern RAMStats ram_counters; > extern XBZRLECacheStats xbzrle_counters; > extern CompressionStats compression_counters; > > diff --git a/migration/migration.c b/migration/migration.c > index ae2025d9d8..923f4762f4 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -1140,12 +1140,12 @@ static void populate_ram_info(MigrationInfo *info, > MigrationState *s) > size_t page_size = qemu_target_page_size(); > > info->ram = g_malloc0(sizeof(*info->ram)); > -info->ram->transferred = stat64_get(&ram_atomic_counters.transferred); > +info->ram->transferred = stat64_get(&ram_counters.transferred); > info->ram->total = ram_bytes_total(); > -info->ram->duplicate = stat64_get(&ram_atomic_counters.duplicate); > +info->ram->duplicate = stat64_get(&ram_counters.duplicate); > /* legacy value. 
It is not used anymore */ > info->ram->skipped = 0; > -info->ram->normal = stat64_get(&ram_atomic_counters.normal); > +info->ram->normal = stat64_get(&ram_counters.normal); > info->ram->normal_bytes = info->ram->normal * page_size; > info->ram->mbps = s->mbps; > info->ram->dirty_sync_count = ram_counters.dirty_sync_count; > @@ -1157,7 +1157,7 @@ static void populate_ram_info(MigrationInfo *info, > MigrationState *s) > info->ram->pages_per_second = s->pages_per_second; > info->ram->precopy_bytes = ram_counters.precopy_bytes; > info->ram->downtime_bytes = ram_counters.downtime_bytes; > -info->ram->postcopy_bytes = > stat64_get(&ram_atomic_counters.postcopy_bytes); > +info->ram->postcopy_bytes = stat64_get(&ram_counters.postcopy_bytes); > > if (migrate_use_xbzrle()) { > info->xbzrle_cache = g_malloc0(sizeof(*info->xbzrle_cache)); > diff --git a/migration/multifd.c b/migration/multifd.c > index 5e85c3ea9b..7cb2326d03 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -433,7 +433,7 @@ static int multifd_send_pages(QEMUFile *f) > transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len; > qemu_file_acct_rate_limit(f, transferred); >
Re: [PATCH] migration: move migration_global_dump() to migration-hmp-cmds.c
Juan Quintela writes: > It is only used there, so we can make it static. > Once there, remove spice.h that it is not used. The removal of ui/qemu-spice.h seems like an unrelated change - should be a different changeset? > > Signed-off-by: Juan Quintela > --- > include/migration/misc.h | 1 - > migration/migration-hmp-cmds.c | 23 +-- > migration/migration.c | 19 --- > 3 files changed, 21 insertions(+), 22 deletions(-) > > diff --git a/include/migration/misc.h b/include/migration/misc.h > index 8b49841016..5ebe13b4b9 100644 > --- a/include/migration/misc.h > +++ b/include/migration/misc.h > @@ -66,7 +66,6 @@ bool migration_has_finished(MigrationState *); > bool migration_has_failed(MigrationState *); > /* ...and after the device transmission */ > bool migration_in_postcopy_after_devices(MigrationState *); > -void migration_global_dump(Monitor *mon); > /* True if incoming migration entered POSTCOPY_INCOMING_DISCARD */ > bool migration_in_incoming_postcopy(void); > /* True if incoming migration entered POSTCOPY_INCOMING_ADVISE */ > diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c > index 72519ea99f..7dcb289c05 100644 > --- a/migration/migration-hmp-cmds.c > +++ b/migration/migration-hmp-cmds.c > @@ -15,7 +15,6 @@ > > #include "qemu/osdep.h" > #include "block/qapi.h" > -#include "migration/misc.h" > #include "migration/snapshot.h" > #include "monitor/hmp.h" > #include "monitor/monitor.h" > @@ -29,7 +28,27 @@ > #include "qemu/error-report.h" > #include "qemu/sockets.h" > #include "sysemu/runstate.h" > -#include "ui/qemu-spice.h" > +#include "sysemu/sysemu.h" > +#include "migration.h" > + > +static void migration_global_dump(Monitor *mon) > +{ > +MigrationState *ms = migrate_get_current(); > + > +monitor_printf(mon, "globals:\n"); > +monitor_printf(mon, "store-global-state: %s\n", > + ms->store_global_state ? "on" : "off"); > +monitor_printf(mon, "only-migratable: %s\n", > + only_migratable ? "on" : "off"); > +monitor_printf(mon, "send-configuration: %s\n", > + ms->send_configuration ? "on" : "off"); > +monitor_printf(mon, "send-section-footer: %s\n", > + ms->send_section_footer ? "on" : "off"); > +monitor_printf(mon, "decompress-error-check: %s\n", > + ms->decompress_error_check ? "on" : "off"); > +monitor_printf(mon, "clear-bitmap-shift: %u\n", > + ms->clear_bitmap_shift); > +} > > void hmp_info_migrate(Monitor *mon, const QDict *qdict) > { > diff --git a/migration/migration.c b/migration/migration.c > index ca52c8aab3..c0584481c7 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -4410,25 +4410,6 @@ void migrate_fd_connect(MigrationState *s, Error > *error_in) > s->migration_thread_running = true; > } > > -void migration_global_dump(Monitor *mon) > -{ > -MigrationState *ms = migrate_get_current(); > - > -monitor_printf(mon, "globals:\n"); > -monitor_printf(mon, "store-global-state: %s\n", > - ms->store_global_state ? "on" : "off"); > -monitor_printf(mon, "only-migratable: %s\n", > - only_migratable ? "on" : "off"); > -monitor_printf(mon, "send-configuration: %s\n", > - ms->send_configuration ? "on" : "off"); > -monitor_printf(mon, "send-section-footer: %s\n", > - ms->send_section_footer ? "on" : "off"); > -monitor_printf(mon, "decompress-error-check: %s\n", > - ms->decompress_error_check ? "on" : "off"); > -monitor_printf(mon, "clear-bitmap-shift: %u\n", > - ms->clear_bitmap_shift); > -} > - > #define DEFINE_PROP_MIG_CAP(name, x) \ > DEFINE_PROP_BOOL(name, MigrationState, enabled_capabilities[x], false) > > -- > 2.39.2
[RFC 0/2] migration: Tally pre-copy, downtime and post-copy bytes independently
When examining a report of poor migration behaviour, it would often be useful to understand how much data was transferred in different phases of the migration process. For example, if the downtime limit is exceeded, to know how much data was transferred during the downtime. RFC because the name "ram_transferred_add" doesn't seem great, and I'm unsure whether the tests to determine the phase in the second patch are the most appropriate. David Edmondson (2): migration: Introduce ram_transferred_add() migration: Tally pre-copy, downtime and post-copy bytes independently migration/migration.c | 3 +++ migration/ram.c | 30 +- migration/ram.h | 1 + monitor/hmp-cmds.c| 12 qapi/migration.json | 4 +++- 5 files changed, 40 insertions(+), 10 deletions(-) -- 2.33.0
[RFC 1/2] migration: Introduce ram_transferred_add()
...and use it. Signed-off-by: David Edmondson --- migration/ram.c | 23 ++- migration/ram.h | 1 + 2 files changed, 15 insertions(+), 9 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 57efa67f20..48ef2819f6 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -386,6 +386,11 @@ uint64_t ram_bytes_remaining(void) MigrationStats ram_counters; +void ram_transferred_add(uint64_t bytes) +{ +ram_counters.transferred += bytes; +} + /* used by the search for pages to send */ struct PageSearchStatus { /* Current block being searched */ @@ -767,7 +772,7 @@ static int save_xbzrle_page(RAMState *rs, uint8_t **current_data, * RAM_SAVE_FLAG_CONTINUE. */ xbzrle_counters.bytes += bytes_xbzrle - 8; -ram_counters.transferred += bytes_xbzrle; +ram_transferred_add(bytes_xbzrle); return 1; } @@ -1198,7 +1203,7 @@ static int save_zero_page(RAMState *rs, RAMBlock *block, ram_addr_t offset) if (len) { ram_counters.duplicate++; -ram_counters.transferred += len; +ram_transferred_add(len); return 1; } return -1; @@ -1234,7 +1239,7 @@ static bool control_save_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, } if (bytes_xmit) { -ram_counters.transferred += bytes_xmit; +ram_transferred_add(bytes_xmit); *pages = 1; } @@ -1265,8 +1270,8 @@ static bool control_save_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, uint8_t *buf, bool async) { -ram_counters.transferred += save_page_header(rs, rs->f, block, - offset | RAM_SAVE_FLAG_PAGE); +ram_transferred_add(save_page_header(rs, rs->f, block, + offset | RAM_SAVE_FLAG_PAGE)); if (async) { qemu_put_buffer_async(rs->f, buf, TARGET_PAGE_SIZE, migrate_release_ram() & @@ -1274,7 +1279,7 @@ static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, } else { qemu_put_buffer(rs->f, buf, TARGET_PAGE_SIZE); } -ram_counters.transferred += TARGET_PAGE_SIZE; +ram_transferred_add(TARGET_PAGE_SIZE); ram_counters.normal++; return 1; } @@ -1373,7 +1378,7 @@ exit: static void update_compress_thread_counts(const CompressParam *param, int bytes_xmit) { -ram_counters.transferred += bytes_xmit; +ram_transferred_add(bytes_xmit); if (param->zero_page) { ram_counters.duplicate++; @@ -2298,7 +2303,7 @@ void acct_update_position(QEMUFile *f, size_t size, bool zero) ram_counters.duplicate += pages; } else { ram_counters.normal += pages; -ram_counters.transferred += size; +ram_transferred_add(size); qemu_update_position(f, size); } } @@ -3133,7 +3138,7 @@ out: multifd_send_sync_main(rs->f); qemu_put_be64(f, RAM_SAVE_FLAG_EOS); qemu_fflush(f); -ram_counters.transferred += 8; +ram_transferred_add(8); ret = qemu_file_get_error(f); } diff --git a/migration/ram.h b/migration/ram.h index c515396a9a..a5b2ffdc18 100644 --- a/migration/ram.h +++ b/migration/ram.h @@ -51,6 +51,7 @@ int xbzrle_cache_resize(uint64_t new_size, Error **errp); uint64_t ram_bytes_remaining(void); uint64_t ram_bytes_total(void); void mig_throttle_counter_reset(void); +void ram_transferred_add(uint64_t bytes); uint64_t ram_pagesize_summary(void); int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len); -- 2.33.0
[RFC 2/2] migration: Tally pre-copy, downtime and post-copy bytes independently
Provide information on the number of bytes copied in the pre-copy, downtime and post-copy phases of migration. Signed-off-by: David Edmondson --- migration/migration.c | 3 +++ migration/ram.c | 7 +++ monitor/hmp-cmds.c| 12 qapi/migration.json | 4 +++- 4 files changed, 25 insertions(+), 1 deletion(-) diff --git a/migration/migration.c b/migration/migration.c index 3de11ae921..3950510be7 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -1013,6 +1013,9 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s) info->ram->page_size = page_size; info->ram->multifd_bytes = ram_counters.multifd_bytes; info->ram->pages_per_second = s->pages_per_second; +info->ram->precopy_bytes = ram_counters.precopy_bytes; +info->ram->downtime_bytes = ram_counters.downtime_bytes; +info->ram->postcopy_bytes = ram_counters.postcopy_bytes; if (migrate_use_xbzrle()) { info->has_xbzrle_cache = true; diff --git a/migration/ram.c b/migration/ram.c index 48ef2819f6..6c5c8441fd 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -388,6 +388,13 @@ MigrationStats ram_counters; void ram_transferred_add(uint64_t bytes) { +if (runstate_is_running()) { +ram_counters.precopy_bytes += bytes; +} else if (migration_in_postcopy()) { +ram_counters.postcopy_bytes += bytes; +} else { +ram_counters.downtime_bytes += bytes; +} ram_counters.transferred += bytes; } diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c index 9c91bf93e9..6049772178 100644 --- a/monitor/hmp-cmds.c +++ b/monitor/hmp-cmds.c @@ -293,6 +293,18 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) monitor_printf(mon, "postcopy request count: %" PRIu64 "\n", info->ram->postcopy_requests); } +if (info->ram->precopy_bytes) { +monitor_printf(mon, "precopy ram: %" PRIu64 " kbytes\n", + info->ram->precopy_bytes >> 10); +} +if (info->ram->downtime_bytes) { +monitor_printf(mon, "downtime ram: %" PRIu64 " kbytes\n", + info->ram->downtime_bytes >> 10); +} +if (info->ram->postcopy_bytes) { +monitor_printf(mon, "postcopy ram: %" PRIu64 " kbytes\n", + info->ram->postcopy_bytes >> 10); +} } if (info->has_disk) { diff --git a/qapi/migration.json b/qapi/migration.json index bbfd48cf0b..69ec9dc3a6 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -54,7 +54,9 @@ 'normal-bytes': 'int', 'dirty-pages-rate' : 'int', 'mbps' : 'number', 'dirty-sync-count' : 'int', 'postcopy-requests' : 'int', 'page-size' : 'int', - 'multifd-bytes' : 'uint64', 'pages-per-second' : 'uint64' } } + 'multifd-bytes' : 'uint64', 'pages-per-second' : 'uint64', + 'precopy-bytes' : 'uint64', 'downtime-bytes' : 'uint64', + 'postcopy-bytes' : 'uint64' } } ## # @XBZRLECacheStats: -- 2.33.0
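The phase test in ram_transferred_add() boils down to a three-way classification; restated standalone with illustrative names:

  #include <stdbool.h>

  enum mig_phase { PHASE_PRECOPY, PHASE_DOWNTIME, PHASE_POSTCOPY };

  /* Pre-copy while the guest still runs, post-copy once post-copy is
   * active, and downtime for the stopped window in between. */
  static enum mig_phase classify(bool guest_running, bool in_postcopy)
  {
      if (guest_running) {
          return PHASE_PRECOPY;
      } else if (in_postcopy) {
          return PHASE_POSTCOPY;
      }
      return PHASE_DOWNTIME;
  }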
Re: [RFC 1/2] migration: Introduce ram_transferred_add()
On Friday, 2021-12-17 at 20:09:12 +01, Philippe Mathieu-Daudé wrote: > On 12/16/21 13:34, David Edmondson wrote: >> ...and use it. >> >> Signed-off-by: David Edmondson >> --- >> migration/ram.c | 23 ++- >> migration/ram.h | 1 + >> 2 files changed, 15 insertions(+), 9 deletions(-) > >> diff --git a/migration/ram.h b/migration/ram.h >> index c515396a9a..a5b2ffdc18 100644 >> --- a/migration/ram.h >> +++ b/migration/ram.h >> @@ -51,6 +51,7 @@ int xbzrle_cache_resize(uint64_t new_size, Error **errp); >> uint64_t ram_bytes_remaining(void); >> uint64_t ram_bytes_total(void); >> void mig_throttle_counter_reset(void); >> +void ram_transferred_add(uint64_t bytes); > > Why make the method public? It seems an internal operation. Do you > plan to use it elsewhere? No such plan. Will fix in v2. dme. -- We wanna wait, but here we go again.
[RFC v2 1/2] migration: Introduce ram_transferred_add()
...and use it. Signed-off-by: David Edmondson --- migration/ram.c | 23 ++- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 57efa67f20..bd53e50a7f 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -386,6 +386,11 @@ uint64_t ram_bytes_remaining(void) MigrationStats ram_counters; +static void ram_transferred_add(uint64_t bytes) +{ +ram_counters.transferred += bytes; +} + /* used by the search for pages to send */ struct PageSearchStatus { /* Current block being searched */ @@ -767,7 +772,7 @@ static int save_xbzrle_page(RAMState *rs, uint8_t **current_data, * RAM_SAVE_FLAG_CONTINUE. */ xbzrle_counters.bytes += bytes_xbzrle - 8; -ram_counters.transferred += bytes_xbzrle; +ram_transferred_add(bytes_xbzrle); return 1; } @@ -1198,7 +1203,7 @@ static int save_zero_page(RAMState *rs, RAMBlock *block, ram_addr_t offset) if (len) { ram_counters.duplicate++; -ram_counters.transferred += len; +ram_transferred_add(len); return 1; } return -1; @@ -1234,7 +1239,7 @@ static bool control_save_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, } if (bytes_xmit) { -ram_counters.transferred += bytes_xmit; +ram_transferred_add(bytes_xmit); *pages = 1; } @@ -1265,8 +1270,8 @@ static bool control_save_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, uint8_t *buf, bool async) { -ram_counters.transferred += save_page_header(rs, rs->f, block, - offset | RAM_SAVE_FLAG_PAGE); +ram_transferred_add(save_page_header(rs, rs->f, block, + offset | RAM_SAVE_FLAG_PAGE)); if (async) { qemu_put_buffer_async(rs->f, buf, TARGET_PAGE_SIZE, migrate_release_ram() & @@ -1274,7 +1279,7 @@ static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, } else { qemu_put_buffer(rs->f, buf, TARGET_PAGE_SIZE); } -ram_counters.transferred += TARGET_PAGE_SIZE; +ram_transferred_add(TARGET_PAGE_SIZE); ram_counters.normal++; return 1; } @@ -1373,7 +1378,7 @@ exit: static void update_compress_thread_counts(const CompressParam *param, int bytes_xmit) { -ram_counters.transferred += bytes_xmit; +ram_transferred_add(bytes_xmit); if (param->zero_page) { ram_counters.duplicate++; @@ -2298,7 +2303,7 @@ void acct_update_position(QEMUFile *f, size_t size, bool zero) ram_counters.duplicate += pages; } else { ram_counters.normal += pages; -ram_counters.transferred += size; +ram_transferred_add(size); qemu_update_position(f, size); } } @@ -3133,7 +3138,7 @@ out: multifd_send_sync_main(rs->f); qemu_put_be64(f, RAM_SAVE_FLAG_EOS); qemu_fflush(f); -ram_counters.transferred += 8; +ram_transferred_add(8); ret = qemu_file_get_error(f); } -- 2.33.0
[RFC v2 2/2] migration: Tally pre-copy, downtime and post-copy bytes independently
Provide information on the number of bytes copied in the pre-copy, downtime and post-copy phases of migration. Signed-off-by: David Edmondson --- migration/migration.c | 3 +++ migration/ram.c | 7 +++ monitor/hmp-cmds.c| 12 qapi/migration.json | 13 - 4 files changed, 34 insertions(+), 1 deletion(-) diff --git a/migration/migration.c b/migration/migration.c index 3de11ae921..3950510be7 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -1013,6 +1013,9 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s) info->ram->page_size = page_size; info->ram->multifd_bytes = ram_counters.multifd_bytes; info->ram->pages_per_second = s->pages_per_second; +info->ram->precopy_bytes = ram_counters.precopy_bytes; +info->ram->downtime_bytes = ram_counters.downtime_bytes; +info->ram->postcopy_bytes = ram_counters.postcopy_bytes; if (migrate_use_xbzrle()) { info->has_xbzrle_cache = true; diff --git a/migration/ram.c b/migration/ram.c index bd53e50a7f..389868c988 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -388,6 +388,13 @@ MigrationStats ram_counters; static void ram_transferred_add(uint64_t bytes) { +if (runstate_is_running()) { +ram_counters.precopy_bytes += bytes; +} else if (migration_in_postcopy()) { +ram_counters.postcopy_bytes += bytes; +} else { +ram_counters.downtime_bytes += bytes; +} ram_counters.transferred += bytes; } diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c index 9c91bf93e9..6049772178 100644 --- a/monitor/hmp-cmds.c +++ b/monitor/hmp-cmds.c @@ -293,6 +293,18 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) monitor_printf(mon, "postcopy request count: %" PRIu64 "\n", info->ram->postcopy_requests); } +if (info->ram->precopy_bytes) { +monitor_printf(mon, "precopy ram: %" PRIu64 " kbytes\n", + info->ram->precopy_bytes >> 10); +} +if (info->ram->downtime_bytes) { +monitor_printf(mon, "downtime ram: %" PRIu64 " kbytes\n", + info->ram->downtime_bytes >> 10); +} +if (info->ram->postcopy_bytes) { +monitor_printf(mon, "postcopy ram: %" PRIu64 " kbytes\n", + info->ram->postcopy_bytes >> 10); +} } if (info->has_disk) { diff --git a/qapi/migration.json b/qapi/migration.json index bbfd48cf0b..5975a0e104 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -46,6 +46,15 @@ # @pages-per-second: the number of memory pages transferred per second #(Since 4.0) # +# @precopy-bytes: The number of bytes sent in the pre-copy phase +# (since 7.0). +# +# @downtime-bytes: The number of bytes sent while the guest is paused +# (since 7.0). +# +# @postcopy-bytes: The number of bytes sent during the post-copy phase +# (since 7.0). +# # Since: 0.14 ## { 'struct': 'MigrationStats', @@ -54,7 +63,9 @@ 'normal-bytes': 'int', 'dirty-pages-rate' : 'int', 'mbps' : 'number', 'dirty-sync-count' : 'int', 'postcopy-requests' : 'int', 'page-size' : 'int', - 'multifd-bytes' : 'uint64', 'pages-per-second' : 'uint64' } } + 'multifd-bytes' : 'uint64', 'pages-per-second' : 'uint64', + 'precopy-bytes' : 'uint64', 'downtime-bytes' : 'uint64', + 'postcopy-bytes' : 'uint64' } } ## # @XBZRLECacheStats: -- 2.33.0
[RFC v2 0/2] migration: Tally pre-copy, downtime and post-copy bytes independently
When examining a report of poor migration behaviour, it would often be useful to understand how much data was transferred in different phases of the migration process. For example, if the downtime limit is exceeded, to know how much data was transferred during the downtime. RFC because the name "ram_transferred_add" doesn't seem great, and I'm unsure whether the tests to determine the phase in the second patch are the most appropriate. v2: - ram_transferred_add() should be static (Philippe) - Document the new MigrationStats fields (dme) David Edmondson (2): migration: Introduce ram_transferred_add() migration: Tally pre-copy, downtime and post-copy bytes independently migration/migration.c | 3 +++ migration/ram.c | 30 +- monitor/hmp-cmds.c| 12 qapi/migration.json | 13 - 4 files changed, 48 insertions(+), 10 deletions(-) -- 2.33.0
[PATCH v3 1/2] migration: Introduce ram_transferred_add()
Replace direct manipulation of ram_counters.transferred with a function. Signed-off-by: David Edmondson Reviewed-by: Philippe Mathieu-Daudé --- migration/ram.c | 23 ++- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 57efa67f20..bd53e50a7f 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -386,6 +386,11 @@ uint64_t ram_bytes_remaining(void) MigrationStats ram_counters; +static void ram_transferred_add(uint64_t bytes) +{ +ram_counters.transferred += bytes; +} + /* used by the search for pages to send */ struct PageSearchStatus { /* Current block being searched */ @@ -767,7 +772,7 @@ static int save_xbzrle_page(RAMState *rs, uint8_t **current_data, * RAM_SAVE_FLAG_CONTINUE. */ xbzrle_counters.bytes += bytes_xbzrle - 8; -ram_counters.transferred += bytes_xbzrle; +ram_transferred_add(bytes_xbzrle); return 1; } @@ -1198,7 +1203,7 @@ static int save_zero_page(RAMState *rs, RAMBlock *block, ram_addr_t offset) if (len) { ram_counters.duplicate++; -ram_counters.transferred += len; +ram_transferred_add(len); return 1; } return -1; @@ -1234,7 +1239,7 @@ static bool control_save_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, } if (bytes_xmit) { -ram_counters.transferred += bytes_xmit; +ram_transferred_add(bytes_xmit); *pages = 1; } @@ -1265,8 +1270,8 @@ static bool control_save_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, uint8_t *buf, bool async) { -ram_counters.transferred += save_page_header(rs, rs->f, block, - offset | RAM_SAVE_FLAG_PAGE); +ram_transferred_add(save_page_header(rs, rs->f, block, + offset | RAM_SAVE_FLAG_PAGE)); if (async) { qemu_put_buffer_async(rs->f, buf, TARGET_PAGE_SIZE, migrate_release_ram() & @@ -1274,7 +1279,7 @@ static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, } else { qemu_put_buffer(rs->f, buf, TARGET_PAGE_SIZE); } -ram_counters.transferred += TARGET_PAGE_SIZE; +ram_transferred_add(TARGET_PAGE_SIZE); ram_counters.normal++; return 1; } @@ -1373,7 +1378,7 @@ exit: static void update_compress_thread_counts(const CompressParam *param, int bytes_xmit) { -ram_counters.transferred += bytes_xmit; +ram_transferred_add(bytes_xmit); if (param->zero_page) { ram_counters.duplicate++; @@ -2298,7 +2303,7 @@ void acct_update_position(QEMUFile *f, size_t size, bool zero) ram_counters.duplicate += pages; } else { ram_counters.normal += pages; -ram_counters.transferred += size; +ram_transferred_add(size); qemu_update_position(f, size); } } @@ -3133,7 +3138,7 @@ out: multifd_send_sync_main(rs->f); qemu_put_be64(f, RAM_SAVE_FLAG_EOS); qemu_fflush(f); -ram_counters.transferred += 8; +ram_transferred_add(8); ret = qemu_file_get_error(f); } -- 2.33.0
[PATCH v3 0/2] migration: Tally pre-copy, downtime and post-copy bytes independently
When examining a report of poor migration behaviour, it would often be useful to understand how much data was transferred in different phases of the migration process. For example, if the downtime limit is exceeded, to know how much data was transferred during the downtime. RFC because the name "ram_transferred_add" doesn't seem great, and I'm unsure whether the tests to determine the phase in the second patch are the most appropriate. v3: - Add r-by (Philippe) - Improve a commit message (Philippe) v2: - ram_transferred_add() should be static (Philippe) - Document the new MigrationStats fields (dme) David Edmondson (2): migration: Introduce ram_transferred_add() migration: Tally pre-copy, downtime and post-copy bytes independently migration/migration.c | 3 +++ migration/ram.c | 30 +- monitor/hmp-cmds.c| 12 qapi/migration.json | 13 - 4 files changed, 48 insertions(+), 10 deletions(-) -- 2.33.0
[PATCH v3 2/2] migration: Tally pre-copy, downtime and post-copy bytes independently
Provide information on the number of bytes copied in the pre-copy, downtime and post-copy phases of migration. Signed-off-by: David Edmondson Reviewed-by: Philippe Mathieu-Daudé --- migration/migration.c | 3 +++ migration/ram.c | 7 +++ monitor/hmp-cmds.c| 12 qapi/migration.json | 13 - 4 files changed, 34 insertions(+), 1 deletion(-) diff --git a/migration/migration.c b/migration/migration.c index 3de11ae921..3950510be7 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -1013,6 +1013,9 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s) info->ram->page_size = page_size; info->ram->multifd_bytes = ram_counters.multifd_bytes; info->ram->pages_per_second = s->pages_per_second; +info->ram->precopy_bytes = ram_counters.precopy_bytes; +info->ram->downtime_bytes = ram_counters.downtime_bytes; +info->ram->postcopy_bytes = ram_counters.postcopy_bytes; if (migrate_use_xbzrle()) { info->has_xbzrle_cache = true; diff --git a/migration/ram.c b/migration/ram.c index bd53e50a7f..389868c988 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -388,6 +388,13 @@ MigrationStats ram_counters; static void ram_transferred_add(uint64_t bytes) { +if (runstate_is_running()) { +ram_counters.precopy_bytes += bytes; +} else if (migration_in_postcopy()) { +ram_counters.postcopy_bytes += bytes; +} else { +ram_counters.downtime_bytes += bytes; +} ram_counters.transferred += bytes; } diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c index 9c91bf93e9..6049772178 100644 --- a/monitor/hmp-cmds.c +++ b/monitor/hmp-cmds.c @@ -293,6 +293,18 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) monitor_printf(mon, "postcopy request count: %" PRIu64 "\n", info->ram->postcopy_requests); } +if (info->ram->precopy_bytes) { +monitor_printf(mon, "precopy ram: %" PRIu64 " kbytes\n", + info->ram->precopy_bytes >> 10); +} +if (info->ram->downtime_bytes) { +monitor_printf(mon, "downtime ram: %" PRIu64 " kbytes\n", + info->ram->downtime_bytes >> 10); +} +if (info->ram->postcopy_bytes) { +monitor_printf(mon, "postcopy ram: %" PRIu64 " kbytes\n", + info->ram->postcopy_bytes >> 10); +} } if (info->has_disk) { diff --git a/qapi/migration.json b/qapi/migration.json index bbfd48cf0b..5975a0e104 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -46,6 +46,15 @@ # @pages-per-second: the number of memory pages transferred per second #(Since 4.0) # +# @precopy-bytes: The number of bytes sent in the pre-copy phase +# (since 7.0). +# +# @downtime-bytes: The number of bytes sent while the guest is paused +# (since 7.0). +# +# @postcopy-bytes: The number of bytes sent during the post-copy phase +# (since 7.0). +# # Since: 0.14 ## { 'struct': 'MigrationStats', @@ -54,7 +63,9 @@ 'normal-bytes': 'int', 'dirty-pages-rate' : 'int', 'mbps' : 'number', 'dirty-sync-count' : 'int', 'postcopy-requests' : 'int', 'page-size' : 'int', - 'multifd-bytes' : 'uint64', 'pages-per-second' : 'uint64' } } + 'multifd-bytes' : 'uint64', 'pages-per-second' : 'uint64', + 'precopy-bytes' : 'uint64', 'downtime-bytes' : 'uint64', + 'postcopy-bytes' : 'uint64' } } ## # @XBZRLECacheStats: -- 2.33.0
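The decision tree above is compact enough to model in isolation. Below is a minimal, standalone sketch of the accounting split this patch introduces - the two flags stand in for QEMU's runstate_is_running() and migration_in_postcopy() predicates, and everything here is illustrative rather than QEMU code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t precopy_bytes;
    uint64_t downtime_bytes;
    uint64_t postcopy_bytes;
    uint64_t transferred;
} Counters;

/* Hypothetical stand-ins for the real QEMU predicates. */
static bool guest_running;
static bool in_postcopy;

static void transferred_add(Counters *c, uint64_t bytes)
{
    if (guest_running) {            /* pre-copy: guest still runs on source */
        c->precopy_bytes += bytes;
    } else if (in_postcopy) {       /* post-copy: guest runs on target */
        c->postcopy_bytes += bytes;
    } else {                        /* neither: the downtime window */
        c->downtime_bytes += bytes;
    }
    c->transferred += bytes;        /* aggregate is always updated */
}

int main(void)
{
    Counters c = {0};

    guest_running = true;
    transferred_add(&c, 4096);      /* counted as pre-copy */

    guest_running = false;
    transferred_add(&c, 512);       /* counted as downtime */

    in_postcopy = true;
    transferred_add(&c, 4096);      /* counted as post-copy */

    printf("pre=%llu down=%llu post=%llu total=%llu\n",
           (unsigned long long)c.precopy_bytes,
           (unsigned long long)c.downtime_bytes,
           (unsigned long long)c.postcopy_bytes,
           (unsigned long long)c.transferred);
    return 0;
}

A byte is counted as pre-copy while the guest still runs on the source, as post-copy once it runs on the destination, and as downtime in the window where it runs on neither - which is exactly the property the new counters expose.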
Re: [PATCH v2 6/8] migration: Dump sub-cmd name in loadvm_process_command tp
On Monday, 2021-12-20 at 16:53:53 +08, Peter Xu wrote: > It'll be easier to read the name rather than index of sub-cmd when debugging. > > Signed-off-by: Peter Xu > --- > migration/savevm.c | 3 ++- > migration/trace-events | 2 +- > 2 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/migration/savevm.c b/migration/savevm.c > index 0bef031acb..7f7af6f750 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -2272,12 +2272,13 @@ static int loadvm_process_command(QEMUFile *f) > return qemu_file_get_error(f); > } > > -trace_loadvm_process_command(cmd, len); > if (cmd >= MIG_CMD_MAX || cmd == MIG_CMD_INVALID) { > error_report("MIG_CMD 0x%x unknown (len 0x%x)", cmd, len); > return -EINVAL; > } > > +trace_loadvm_process_command(mig_cmd_args[cmd].name, len); > + > if (mig_cmd_args[cmd].len != -1 && mig_cmd_args[cmd].len != len) { > error_report("%s received with bad length - expecting %zu, got %d", > mig_cmd_args[cmd].name, > diff --git a/migration/trace-events b/migration/trace-events > index b48d873b8a..d63a5915f5 100644 > --- a/migration/trace-events > +++ b/migration/trace-events > @@ -22,7 +22,7 @@ loadvm_postcopy_handle_resume(void) "" > loadvm_postcopy_ram_handle_discard(void) "" > loadvm_postcopy_ram_handle_discard_end(void) "" > loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) > "%s: %ud" > -loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d" > +loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d" "cmd" rather than "com", to match the code. > loadvm_process_command_ping(uint32_t val) "0x%x" > postcopy_ram_listen_thread_exit(void) "" > postcopy_ram_listen_thread_start(void) "" dme. -- I went starin' out of my window, been caught doin' it once or twice.
Re: [PATCH v2 7/8] migration: Finer grained tracepoints for POSTCOPY_LISTEN
On Monday, 2021-12-20 at 16:53:54 +08, Peter Xu wrote: > The enablement of postcopy listening has a few steps, add a few tracepoints to > be there ready for some basic measurements for them. > > Signed-off-by: Peter Xu > --- > migration/savevm.c | 9 - > migration/trace-events | 2 +- > 2 files changed, 9 insertions(+), 2 deletions(-) > > diff --git a/migration/savevm.c b/migration/savevm.c > index 7f7af6f750..25face6de0 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -1947,9 +1947,10 @@ static void *postcopy_ram_listen_thread(void *opaque) > static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis) > { > PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_LISTENING); > -trace_loadvm_postcopy_handle_listen(); > Error *local_err = NULL; > > +trace_loadvm_postcopy_handle_listen("enter"); > + > if (ps != POSTCOPY_INCOMING_ADVISE && ps != POSTCOPY_INCOMING_DISCARD) { > error_report("CMD_POSTCOPY_LISTEN in wrong postcopy state (%d)", ps); > return -1; > @@ -1964,6 +1965,8 @@ static int > loadvm_postcopy_handle_listen(MigrationIncomingState *mis) > } > } > > +trace_loadvm_postcopy_handle_listen("after disgard"); s/disgard/discard/ > + > /* > * Sensitise RAM - can now generate requests for blocks that don't exist > * However, at this point the CPU shouldn't be running, and the IO > @@ -1976,6 +1979,8 @@ static int > loadvm_postcopy_handle_listen(MigrationIncomingState *mis) > } > } > > +trace_loadvm_postcopy_handle_listen("after uffd"); > + > if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_LISTEN, &local_err)) { > error_report_err(local_err); > return -1; > @@ -1990,6 +1995,8 @@ static int > loadvm_postcopy_handle_listen(MigrationIncomingState *mis) > qemu_sem_wait(&mis->listen_thread_sem); > qemu_sem_destroy(&mis->listen_thread_sem); > > +trace_loadvm_postcopy_handle_listen("exit"); > + "return" rather than "exit"? > return 0; > } > > diff --git a/migration/trace-events b/migration/trace-events > index d63a5915f5..77d1237d89 100644 > --- a/migration/trace-events > +++ b/migration/trace-events > @@ -14,7 +14,7 @@ loadvm_handle_cmd_packaged_main(int ret) "%d" > loadvm_handle_cmd_packaged_received(int ret) "%d" > loadvm_handle_recv_bitmap(char *s) "%s" > loadvm_postcopy_handle_advise(void) "" > -loadvm_postcopy_handle_listen(void) "" > +loadvm_postcopy_handle_listen(const char *str) "%s" > loadvm_postcopy_handle_run(void) "" > loadvm_postcopy_handle_run_cpu_sync(void) "" > loadvm_postcopy_handle_run_vmstart(void) "" dme. -- When you were the brightest star, who were the shadows?
Re: [PATCH v3 6/8] migration: Dump sub-cmd name in loadvm_process_command tp
On Friday, 2021-12-24 at 14:49:58 +08, Peter Xu wrote: > It'll be easier to read the name rather than index of sub-cmd when debugging. > > Signed-off-by: Peter Xu Reviewed-by: David Edmondson > --- > migration/savevm.c | 3 ++- > migration/trace-events | 2 +- > 2 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/migration/savevm.c b/migration/savevm.c > index 0bef031acb..7f7af6f750 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -2272,12 +2272,13 @@ static int loadvm_process_command(QEMUFile *f) > return qemu_file_get_error(f); > } > > -trace_loadvm_process_command(cmd, len); > if (cmd >= MIG_CMD_MAX || cmd == MIG_CMD_INVALID) { > error_report("MIG_CMD 0x%x unknown (len 0x%x)", cmd, len); > return -EINVAL; > } > > +trace_loadvm_process_command(mig_cmd_args[cmd].name, len); > + > if (mig_cmd_args[cmd].len != -1 && mig_cmd_args[cmd].len != len) { > error_report("%s received with bad length - expecting %zu, got %d", > mig_cmd_args[cmd].name, > diff --git a/migration/trace-events b/migration/trace-events > index b48d873b8a..d63a5915f5 100644 > --- a/migration/trace-events > +++ b/migration/trace-events > @@ -22,7 +22,7 @@ loadvm_postcopy_handle_resume(void) "" > loadvm_postcopy_ram_handle_discard(void) "" > loadvm_postcopy_ram_handle_discard_end(void) "" > loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) > "%s: %ud" > -loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d" > +loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d" > loadvm_process_command_ping(uint32_t val) "0x%x" > postcopy_ram_listen_thread_exit(void) "" > postcopy_ram_listen_thread_start(void) "" dme. -- I'm catching up with myself!
Re: [PATCH v3 7/8] migration: Finer grained tracepoints for POSTCOPY_LISTEN
On Friday, 2021-12-24 at 14:49:59 +08, Peter Xu wrote: > The enablement of postcopy listening has a few steps, add a few tracepoints to > be there ready for some basic measurements for them. > > Signed-off-by: Peter Xu Reviewed-by: David Edmondson > --- > migration/savevm.c | 9 - > migration/trace-events | 2 +- > 2 files changed, 9 insertions(+), 2 deletions(-) > > diff --git a/migration/savevm.c b/migration/savevm.c > index 7f7af6f750..592d550a2f 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -1947,9 +1947,10 @@ static void *postcopy_ram_listen_thread(void *opaque) > static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis) > { > PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_LISTENING); > -trace_loadvm_postcopy_handle_listen(); > Error *local_err = NULL; > > +trace_loadvm_postcopy_handle_listen("enter"); > + > if (ps != POSTCOPY_INCOMING_ADVISE && ps != POSTCOPY_INCOMING_DISCARD) { > error_report("CMD_POSTCOPY_LISTEN in wrong postcopy state (%d)", ps); > return -1; > @@ -1964,6 +1965,8 @@ static int > loadvm_postcopy_handle_listen(MigrationIncomingState *mis) > } > } > > +trace_loadvm_postcopy_handle_listen("after discard"); > + > /* > * Sensitise RAM - can now generate requests for blocks that don't exist > * However, at this point the CPU shouldn't be running, and the IO > @@ -1976,6 +1979,8 @@ static int > loadvm_postcopy_handle_listen(MigrationIncomingState *mis) > } > } > > +trace_loadvm_postcopy_handle_listen("after uffd"); > + > if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_LISTEN, &local_err)) { > error_report_err(local_err); > return -1; > @@ -1990,6 +1995,8 @@ static int > loadvm_postcopy_handle_listen(MigrationIncomingState *mis) > qemu_sem_wait(&mis->listen_thread_sem); > qemu_sem_destroy(&mis->listen_thread_sem); > > +trace_loadvm_postcopy_handle_listen("return"); > + > return 0; > } > > diff --git a/migration/trace-events b/migration/trace-events > index d63a5915f5..77d1237d89 100644 > --- a/migration/trace-events > +++ b/migration/trace-events > @@ -14,7 +14,7 @@ loadvm_handle_cmd_packaged_main(int ret) "%d" > loadvm_handle_cmd_packaged_received(int ret) "%d" > loadvm_handle_recv_bitmap(char *s) "%s" > loadvm_postcopy_handle_advise(void) "" > -loadvm_postcopy_handle_listen(void) "" > +loadvm_postcopy_handle_listen(const char *str) "%s" > loadvm_postcopy_handle_run(void) "" > loadvm_postcopy_handle_run_cpu_sync(void) "" > loadvm_postcopy_handle_run_vmstart(void) "" dme. -- If I could buy my reasoning, I'd pay to lose.
Re: [PATCH v3 8/8] migration: Tracepoint change in postcopy-run bottom half
On Friday, 2021-12-24 at 14:50:00 +08, Peter Xu wrote: > Remove the old two tracepoints and they're even near each other: > > trace_loadvm_postcopy_handle_run_cpu_sync() > trace_loadvm_postcopy_handle_run_vmstart() > > Add trace_loadvm_postcopy_handle_run_bh() with a finer granule trace. > > Signed-off-by: Peter Xu Reviewed-by: David Edmondson > --- > migration/savevm.c | 12 +--- > migration/trace-events | 3 +-- > 2 files changed, 10 insertions(+), 5 deletions(-) > > diff --git a/migration/savevm.c b/migration/savevm.c > index 592d550a2f..3b8f565b14 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -2005,13 +2005,19 @@ static void loadvm_postcopy_handle_run_bh(void > *opaque) > Error *local_err = NULL; > MigrationIncomingState *mis = opaque; > > +trace_loadvm_postcopy_handle_run_bh("enter"); > + > /* TODO we should move all of this lot into postcopy_ram.c or a shared > code > * in migration.c > */ > cpu_synchronize_all_post_init(); > > +trace_loadvm_postcopy_handle_run_bh("after cpu sync"); > + > qemu_announce_self(&mis->announce_timer, migrate_announce_params()); > > +trace_loadvm_postcopy_handle_run_bh("after announce"); > + > /* Make sure all file formats flush their mutable metadata. > * If we get an error here, just don't restart the VM yet. */ > bdrv_invalidate_cache_all(&local_err); > @@ -2021,9 +2027,7 @@ static void loadvm_postcopy_handle_run_bh(void *opaque) > autostart = false; > } > > -trace_loadvm_postcopy_handle_run_cpu_sync(); > - > -trace_loadvm_postcopy_handle_run_vmstart(); > +trace_loadvm_postcopy_handle_run_bh("after invalidate cache"); > > dirty_bitmap_mig_before_vm_start(); > > @@ -2036,6 +2040,8 @@ static void loadvm_postcopy_handle_run_bh(void *opaque) > } > > qemu_bh_delete(mis->bh); > + > +trace_loadvm_postcopy_handle_run_bh("return"); > } > > /* After all discards we can start running and asking for pages */ > diff --git a/migration/trace-events b/migration/trace-events > index 77d1237d89..e165687af2 100644 > --- a/migration/trace-events > +++ b/migration/trace-events > @@ -16,8 +16,7 @@ loadvm_handle_recv_bitmap(char *s) "%s" > loadvm_postcopy_handle_advise(void) "" > loadvm_postcopy_handle_listen(const char *str) "%s" > loadvm_postcopy_handle_run(void) "" > -loadvm_postcopy_handle_run_cpu_sync(void) "" > -loadvm_postcopy_handle_run_vmstart(void) "" > +loadvm_postcopy_handle_run_bh(const char *str) "%s" > loadvm_postcopy_handle_resume(void) "" > loadvm_postcopy_ram_handle_discard(void) "" > loadvm_postcopy_ram_handle_discard_end(void) "" dme. -- Jane was in her turtle neck, I was much happier then.
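The pattern adopted across these three patches - replacing several zero-argument tracepoints with one tracepoint that takes a stage string - generalises well. A hedged sketch of the idea outside QEMU's tracetool, with the generated trace call modelled by a plain printf:

#include <stdio.h>

/* Stand-in for a tracetool-generated tracepoint such as
   loadvm_postcopy_handle_listen(const char *str) "%s". */
static void trace_stage(const char *fn, const char *stage)
{
    printf("%s %s\n", fn, stage);
}

static int handle_listen(void)
{
    trace_stage("loadvm_postcopy_handle_listen", "enter");
    /* ... validate the postcopy state ... */
    trace_stage("loadvm_postcopy_handle_listen", "after discard");
    /* ... set up userfault handling ... */
    trace_stage("loadvm_postcopy_handle_listen", "after uffd");
    /* ... start the listen thread ... */
    trace_stage("loadvm_postcopy_handle_listen", "return");
    return 0;
}

int main(void)
{
    return handle_listen();
}

The pay-off is that consecutive trace timestamps bracket each stage, so the cost of the discard, userfault and thread-start steps can be measured individually.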
[RFC PATCH 4/9] block/curl: Perform IO in fixed size chunks
Do all IO requests to the remote server in 256kB chunks. Signed-off-by: David Edmondson --- block/curl.c | 151 - block/trace-events | 2 + 2 files changed, 109 insertions(+), 44 deletions(-) diff --git a/block/curl.c b/block/curl.c index d2f4de46c9..cfc518efda 100644 --- a/block/curl.c +++ b/block/curl.c @@ -78,6 +78,14 @@ static CURLMcode __curl_multi_socket_action(CURLM *multi_handle, #define CURL_BLOCK_OPT_SSLVERIFY_DEFAULT true #define CURL_BLOCK_OPT_TIMEOUT_DEFAULT 5 +/* Must be a non-zero power of 2. */ +#define CURL_BLOCK_SIZE (256 * 1024) + +/* Align "n" to the start of the containing block. */ +#define CURL_BLOCK_ALIGN(n) ((n) & ~(CURL_BLOCK_SIZE - 1)) +/* The offset of "n" within its block. */ +#define CURL_BLOCK_OFFSET(n) ((n) & (CURL_BLOCK_SIZE - 1)) + struct BDRVCURLState; struct CURLState; @@ -86,11 +94,18 @@ static bool libcurl_initialized; typedef struct CURLAIOCB { Coroutine *co; QEMUIOVector *qiov; +uint64_t qiov_offset; /* Offset in qiov to place data. */ uint64_t offset; uint64_t bytes; int ret; +/* + * start and end indicate the subset of the surrounding + * CURL_BLOCK_SIZE sized block that is the subject of this + * IOCB. They are offsets from the beginning of the underlying + * buffer. + */ size_t start; size_t end; } CURLAIOCB; @@ -110,7 +125,6 @@ typedef struct CURLState char *orig_buf; uint64_t buf_start; size_t buf_off; -size_t buf_len; char range[128]; char errmsg[CURL_ERROR_SIZE]; char in_use; @@ -259,11 +273,11 @@ static size_t curl_read_cb(void *ptr, size_t size, size_t nmemb, void *opaque) goto read_end; } -if (s->buf_off >= s->buf_len) { +if (s->buf_off >= CURL_BLOCK_SIZE) { /* buffer full, read nothing */ goto read_end; } -realsize = MIN(realsize, s->buf_len - s->buf_off); +realsize = MIN(realsize, CURL_BLOCK_SIZE - s->buf_off); memcpy(s->orig_buf + s->buf_off, ptr, realsize); s->buf_off += realsize; @@ -281,35 +295,44 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len, uint64_t clamped_end = MIN(end, s->len); uint64_t clamped_len = clamped_end - start; -for (i=0; i<CURL_NUM_STATES; i++) { +for (i = 0; i < CURL_NUM_STATES; i++) { CURLState *state = &s->states[i]; -uint64_t buf_end = (state->buf_start + state->buf_off); -uint64_t buf_fend = (state->buf_start + state->buf_len); +/* The end of the currently valid data. */ +uint64_t buf_end = state->buf_start + state->buf_off; +/* The end of the valid data when the IO completes. */ +uint64_t buf_fend = state->buf_start + CURL_BLOCK_SIZE; if (!state->orig_buf) continue; if (!state->buf_off) continue; -// Does the existing buffer cover our section? +/* + * Does the existing buffer cover our section? + */ if ((start >= state->buf_start) && (start <= buf_end) && (clamped_end >= state->buf_start) && (clamped_end <= buf_end)) { -char *buf = state->orig_buf + (start - state->buf_start); +char *buf = state->orig_buf + CURL_BLOCK_OFFSET(start); trace_curl_pending_hit(qemu_coroutine_self(), start, len); -qemu_iovec_from_buf(acb->qiov, 0, buf, clamped_len); +qemu_iovec_from_buf(acb->qiov, acb->qiov_offset, buf, clamped_len); if (clamped_len < len) { -qemu_iovec_memset(acb->qiov, clamped_len, 0, len - clamped_len); +qemu_iovec_memset(acb->qiov, acb->qiov_offset + clamped_len, + 0, len - clamped_len); } acb->ret = 0; return true; } -// Wait for unfinished chunks +/* + * If an in-progress IO will provide the required data, wait + * for it to complete - the initiator will complete this + * aiocb.
+ */ if (state->in_use && (start >= state->buf_start) && (start <= buf_fend) && @@ -320,10 +343,10 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len, trace_curl_pending_piggyback(qemu_coroutine_self(), start, len); -acb->start = start - state->buf_start; +acb->start = CURL_BLOCK_OFFSET(start); acb->end = acb->start + clamped_len; -for (j=0; j<CURL_NUM_ACB; j++) { +for (j = 0; j < CURL_NUM_ACB; j++) { if (!state->acb[j]) { state->acb[j] = acb; return true; } } @@ -377,7 +400,7 @@ static void curl_multi_check_completion(BDRVCURLState *s)
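The two macros introduced above do all of the alignment work. A small self-contained demonstration of how a request is carved into fixed chunks (the 256kB constant mirrors CURL_BLOCK_SIZE; the rest is illustrative, not driver code):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE (256 * 1024)              /* must be a power of two */
#define BLOCK_ALIGN(n)  ((n) & ~(uint64_t)(BLOCK_SIZE - 1))
#define BLOCK_OFFSET(n) ((n) & (uint64_t)(BLOCK_SIZE - 1))

int main(void)
{
    uint64_t start = 700 * 1024;             /* request offset */
    uint64_t len = 900 * 1024;               /* request length */

    /* Walk the request block by block, as the driver does. */
    for (uint64_t off = start; off < start + len; ) {
        uint64_t n = BLOCK_SIZE - BLOCK_OFFSET(off);
        if (n > start + len - off) {
            n = start + len - off;
        }
        printf("block @%8llu: copy %6llu bytes from offset %6llu\n",
               (unsigned long long)BLOCK_ALIGN(off),
               (unsigned long long)n,
               (unsigned long long)BLOCK_OFFSET(off));
        off += n;
    }
    return 0;
}

A 900kB read starting 700kB into the image touches five 256kB blocks; only the first and last are partially consumed, and each block is a candidate for reuse by a later request.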
[RFC PATCH 1/9] block/curl: Add an 'offset' parameter, affecting all range requests
A new 'offset' parameter affects all range requests that are sent to the remote server. The value, in bytes, is simply added to any byte offset values passed in to the driver. Signed-off-by: David Edmondson --- block/curl.c | 12 +++- docs/system/device-url-syntax.rst.inc | 4 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/block/curl.c b/block/curl.c index 4f907c47be..32ec760f66 100644 --- a/block/curl.c +++ b/block/curl.c @@ -74,6 +74,7 @@ static CURLMcode __curl_multi_socket_action(CURLM *multi_handle, #define CURL_BLOCK_OPT_PASSWORD_SECRET "password-secret" #define CURL_BLOCK_OPT_PROXY_USERNAME "proxy-username" #define CURL_BLOCK_OPT_PROXY_PASSWORD_SECRET "proxy-password-secret" +#define CURL_BLOCK_OPT_OFFSET "offset" #define CURL_BLOCK_OPT_READAHEAD_DEFAULT (256 * 1024) #define CURL_BLOCK_OPT_SSLVERIFY_DEFAULT true #define CURL_BLOCK_OPT_TIMEOUT_DEFAULT 5 @@ -135,6 +136,7 @@ typedef struct BDRVCURLState { char *password; char *proxyusername; char *proxypassword; +size_t offset; } BDRVCURLState; static void curl_clean_state(CURLState *s); @@ -658,6 +660,11 @@ static QemuOptsList runtime_opts = { .type = QEMU_OPT_STRING, .help = "ID of secret used as password for HTTP proxy auth", }, +{ +.name = CURL_BLOCK_OPT_OFFSET, +.type = QEMU_OPT_SIZE, +.help = "Offset into underlying content" +}, { /* end of list */ } }, }; @@ -769,6 +776,8 @@ static int curl_open(BlockDriverState *bs, QDict *options, int flags, } } +s->offset = qemu_opt_get_size(opts, CURL_BLOCK_OPT_OFFSET, 0); + trace_curl_open(file); qemu_co_queue_init(&s->free_state_waitq); s->aio_context = bdrv_get_aio_context(bs); @@ -899,7 +908,8 @@ static void curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb) } state->acb[0] = acb; -snprintf(state->range, 127, "%" PRIu64 "-%" PRIu64, start, end); +snprintf(state->range, 127, "%" PRIu64 "-%" PRIu64, + s->offset + start, s->offset + end); trace_curl_setup_preadv(acb->bytes, start, state->range); curl_easy_setopt(state->curl, CURLOPT_RANGE, state->range); diff --git a/docs/system/device-url-syntax.rst.inc b/docs/system/device-url-syntax.rst.inc index 88d7a372a7..33f1ddfe6d 100644 --- a/docs/system/device-url-syntax.rst.inc +++ b/docs/system/device-url-syntax.rst.inc @@ -197,6 +197,10 @@ These are specified using a special URL syntax. get the size of the image to be downloaded. If not set, the default timeout of 5 seconds is used. + ``offset`` + Add an offset, in bytes, to all range requests sent to the + remote server. + Note that when passing options to qemu explicitly, ``driver`` is the value of <protocol>. -- 2.27.0
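The guest-visible effect of the option is confined to the range string handed to libcurl (which sends it as "Range: bytes=..."). A sketch of the computation, with an illustrative helper name rather than the driver's:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Build the value passed to CURLOPT_RANGE for a guest-visible
   request [start, end], shifted by the user-supplied offset. */
static void format_range(char *buf, size_t size, uint64_t offset,
                         uint64_t start, uint64_t end)
{
    snprintf(buf, size, "%" PRIu64 "-%" PRIu64,
             offset + start, offset + end);
}

int main(void)
{
    char range[128];

    /* e.g. an image embedded at byte 1536 of a tar file */
    format_range(range, sizeof(range), 1536, 0, 65535);
    printf("Range: bytes=%s\n", range);   /* 1536-67071 */
    return 0;
}

For an image embedded 1536 bytes into a tar file, a read of the first 64kB becomes the on-the-wire range 1536-67071, with no copy of the container ever written to local storage.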
[RFC PATCH 7/9] block/curl: Allow the user to control the number of cache blocks
Rather than using a fixed number, allow the user to specify the number of cache blocks allocated. This cannot be less than the number of Curl states and defaults to that value. Signed-off-by: David Edmondson --- block/curl.c | 20 +--- docs/system/device-url-syntax.rst.inc | 4 qapi/block-core.json | 4 3 files changed, 25 insertions(+), 3 deletions(-) diff --git a/block/curl.c b/block/curl.c index 0ea9eedebd..27fa77c351 100644 --- a/block/curl.c +++ b/block/curl.c @@ -75,14 +75,15 @@ static CURLMcode __curl_multi_socket_action(CURLM *multi_handle, #define CURL_BLOCK_OPT_PROXY_PASSWORD_SECRET "proxy-password-secret" #define CURL_BLOCK_OPT_OFFSET "offset" #define CURL_BLOCK_OPT_BLOCKSIZE "blocksize" +#define CURL_BLOCK_OPT_BLOCKCOUNT "blockcount" #define CURL_BLOCK_OPT_SSLVERIFY_DEFAULT true #define CURL_BLOCK_OPT_TIMEOUT_DEFAULT 5 /* Must be a non-zero power of 2. */ #define CURL_BLOCK_OPT_BLOCKSIZE_DEFAULT (256 * 1024) +/* The default number of blocks to store in the cache. */ +#define CURL_BLOCK_OPT_BLOCKCOUNT_DEFAULT (CURL_NUM_STATES) -/* The maximum number of blocks to store in the cache. */ -#define CURL_BLOCK_CACHE_MAX_BLOCKS 100 /* The number of heads in the hash table. */ #define CURL_BLOCK_CACHE_HASH 37 @@ -161,6 +162,7 @@ typedef struct BDRVCURLState { char *proxypassword; size_t offset; size_t blocksize; +int cache_max; int cache_allocated; /* The number of block_t currently allocated. */ QLIST_HEAD(, block) cache_free; QTAILQ_HEAD(, block) cache_lru; @@ -287,7 +289,7 @@ static block_t *curl_cache_get(BDRVCURLState *s) } /* If not at the limit, try get a new one. */ -if (s->cache_allocated < CURL_BLOCK_CACHE_MAX_BLOCKS) { +if (s->cache_allocated < s->cache_max) { b = curl_cache_alloc(s); if (b) { b->use++; @@ -929,6 +931,11 @@ static QemuOptsList runtime_opts = { .type = QEMU_OPT_SIZE, .help = "Block size for IO requests" }, +{ +.name = CURL_BLOCK_OPT_BLOCKCOUNT, +.type = QEMU_OPT_SIZE, +.help = "Maximum number of cached blocks" +}, { /* end of list */ } }, }; @@ -1039,6 +1046,13 @@ static int curl_open(BlockDriverState *bs, QDict *options, int flags, error_setg(errp, "blocksize must be a non-zero power of two"); goto out_noclean; } +s->cache_max = qemu_opt_get_size(opts, CURL_BLOCK_OPT_BLOCKCOUNT, + CURL_BLOCK_OPT_BLOCKCOUNT_DEFAULT); +if (s->cache_max < CURL_NUM_STATES) { +error_setg(errp, "blockcount must be larger than %d", +CURL_NUM_STATES - 1); +goto out_noclean; +} trace_curl_open(file); qemu_co_queue_init(&s->free_state_waitq); diff --git a/docs/system/device-url-syntax.rst.inc b/docs/system/device-url-syntax.rst.inc index ee504ee41a..56843cb38f 100644 --- a/docs/system/device-url-syntax.rst.inc +++ b/docs/system/device-url-syntax.rst.inc @@ -201,6 +201,10 @@ These are specified using a special URL syntax. bytes. The value must be a non-zero power of two. It defaults to 256kB. + ``blockcount`` + The number of ``blocksize`` blocks that the system may allocate + to store data read from the remote server. + Note that when passing options to qemu explicitly, ``driver`` is the value of <protocol>. diff --git a/qapi/block-core.json b/qapi/block-core.json index cd16197e1e..91888166fa 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -3767,11 +3767,15 @@ # @blocksize: Size of all IO requests sent to the remote server; must # be a non-zero power of two (defaults to 256kB) # +# @blockcount: The number of IO blocks used to cache data from the +# remote server.
+# # Since: 2.9 ## { 'struct': 'BlockdevOptionsCurlBase', 'data': { 'url': 'str', '*blocksize': 'int', +'*blockcount': 'int', '*timeout': 'int', '*username': 'str', '*password-secret': 'str', -- 2.27.0
[RFC PATCH 5/9] block/curl: Allow the blocksize to be specified by the user
Rather than a fixed 256kB blocksize, allow the user to specify the size used. It must be a non-zero power of two, defaulting to 256kB. Signed-off-by: David Edmondson --- block/curl.c | 73 +-- docs/system/device-url-syntax.rst.inc | 7 +++ qapi/block-core.json | 4 ++ 3 files changed, 58 insertions(+), 26 deletions(-) diff --git a/block/curl.c b/block/curl.c index cfc518efda..b2d02818a9 100644 --- a/block/curl.c +++ b/block/curl.c @@ -74,17 +74,12 @@ static CURLMcode __curl_multi_socket_action(CURLM *multi_handle, #define CURL_BLOCK_OPT_PROXY_USERNAME "proxy-username" #define CURL_BLOCK_OPT_PROXY_PASSWORD_SECRET "proxy-password-secret" #define CURL_BLOCK_OPT_OFFSET "offset" +#define CURL_BLOCK_OPT_BLOCKSIZE "blocksize" #define CURL_BLOCK_OPT_SSLVERIFY_DEFAULT true #define CURL_BLOCK_OPT_TIMEOUT_DEFAULT 5 - /* Must be a non-zero power of 2. */ -#define CURL_BLOCK_SIZE (256 * 1024) - -/* Align "n" to the start of the containing block. */ -#define CURL_BLOCK_ALIGN(n) ((n) & ~(CURL_BLOCK_SIZE - 1)) -/* The offset of "n" within its block. */ -#define CURL_BLOCK_OFFSET(n) ((n) & (CURL_BLOCK_SIZE - 1)) +#define CURL_BLOCK_OPT_BLOCKSIZE_DEFAULT (256 * 1024) struct BDRVCURLState; struct CURLState; @@ -102,7 +97,7 @@ typedef struct CURLAIOCB { /* * start and end indicate the subset of the surrounding - * CURL_BLOCK_SIZE sized block that is the subject of this + * BDRVCURLState.blocksize sized block that is the subject of this * IOCB. They are offsets from the beginning of the underlying * buffer. */ @@ -148,11 +143,24 @@ typedef struct BDRVCURLState { char *proxyusername; char *proxypassword; size_t offset; +size_t blocksize; } BDRVCURLState; static void curl_clean_state(CURLState *s); static void curl_multi_do(void *arg); +/* Align "n" to the start of the containing block. */ +static inline uint64_t curl_block_align(BDRVCURLState *s, uint64_t n) +{ +return n & ~(s->blocksize - 1); +} + +/* The offset of "n" within its block. */ +static inline uint64_t curl_block_offset(BDRVCURLState *s, uint64_t n) +{ +return n & (s->blocksize - 1); +} + #ifdef NEED_CURL_TIMER_CALLBACK /* Called from curl_multi_do_locked, with s->mutex held. */ static int curl_timer_cb(CURLM *multi, long timeout_ms, void *opaque) @@ -264,22 +272,23 @@ static size_t curl_header_cb(void *ptr, size_t size, size_t nmemb, void *opaque) /* Called from curl_multi_do_locked, with s->mutex held. */ static size_t curl_read_cb(void *ptr, size_t size, size_t nmemb, void *opaque) { -CURLState *s = ((CURLState*)opaque); +CURLState *state = (CURLState *)opaque; +BDRVCURLState *s = state->s; size_t realsize = size * nmemb; trace_curl_read_cb(realsize); -if (!s || !s->orig_buf) { +if (!state || !state->orig_buf) { goto read_end; } -if (s->buf_off >= CURL_BLOCK_SIZE) { +if (state->buf_off >= s->blocksize) { /* buffer full, read nothing */ goto read_end; } -realsize = MIN(realsize, CURL_BLOCK_SIZE - s->buf_off); -memcpy(s->orig_buf + s->buf_off, ptr, realsize); -s->buf_off += realsize; +realsize = MIN(realsize, s->blocksize - state->buf_off); +memcpy(state->orig_buf + state->buf_off, ptr, realsize); +state->buf_off += realsize; read_end: /* curl will error out if we do not return this value */ @@ -300,7 +309,7 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len, /* The end of the currently valid data. */ uint64_t buf_end = state->buf_start + state->buf_off; /* The end of the valid data when the IO completes.
*/ -uint64_t buf_fend = state->buf_start + CURL_BLOCK_SIZE; +uint64_t buf_fend = state->buf_start + s->blocksize; if (!state->orig_buf) continue; if (!state->buf_off) continue; @@ -315,7 +324,7 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len, (clamped_end >= state->buf_start) && (clamped_end <= buf_end)) { -char *buf = state->orig_buf + CURL_BLOCK_OFFSET(start); +char *buf = state->orig_buf + curl_block_offset(s, start); trace_curl_pending_hit(qemu_coroutine_self(), start, len); @@ -343,7 +352,7 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len, trace_curl_pending_piggyback(qemu_coroutine_self(), start, len); -acb->start = CURL_BLOCK_OFFSET(start); +acb->start = curl_block_offset(s, start); acb->end = acb->start + clamped_len; for
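The "non-zero power of two" restriction is what keeps curl_block_align() and curl_block_offset() valid as single mask operations. The standard bit trick used for this kind of validation, as a sketch rather than the driver's actual check:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A non-zero power of two has exactly one bit set, so clearing the
   lowest set bit (n & (n - 1)) must leave zero. */
static bool is_valid_blocksize(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

int main(void)
{
    printf("%d %d %d\n",
           is_valid_blocksize(256 * 1024),   /* 1: the default */
           is_valid_blocksize(0),            /* 0: rejected */
           is_valid_blocksize(3 * 1024));    /* 0: rejected */
    return 0;
}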
[RFC PATCH 0/9] block/curl: Add caching of data downloaded from the remote server
When using qemu-img to convert an image that is hosted on an HTTP server to some faster local (or pseudo-local) storage, the overall performance can be improved by reading data from the HTTP server in larger blocks and by caching and re-using blocks already read. This set of patches implements both of these, and adds a further patch allowing an offset to be added to all of the HTTP requests. The first patch (block/curl: Add an 'offset' parameter, affecting all range requests) allows the user to add an arbitrary offset to all range requests sent to the HTTP server. This is useful if the image to be read from the HTTP server is embedded in another file (for example an uncompressed tar file). It avoids the need to first download the file containing the source image and extract it (both of which will require writing the image to local storage). It is logically distinct from the rest of the patches and somewhat use-case specific. The remaining patches implement block based retrieval of data from the HTTP server and, optionally, caching of those blocks in memory. The existing HTTP implementation simply reads whatever data is requested by the caller, with the option for a user-specified amount of readahead. This is poor for performance because most IO requests (from QCOW2, for example) are for relatively small amounts of data, typically no more than 64kB. This does not allow the underlying TCP connections to achieve peak throughput. The existing readahead mechanism is also intended to work in conjunction with the HTTP driver's attempt to piggy-back a new IO request on one that is already in flight. This works, but is often defeated because it relies on the existing IO request *completely* satisfying any subsequent request that might piggy-back onto it. This is rarely the case and, particularly when used with "readahead", can result in the same data being downloaded repeatedly. The observed performance will depend greatly on the environment, but when using qemu-img to retrieve a 1GiB QCOW2 image from an HTTPS server, the following was observed:

| approach                                   | time (hh:mm:ss) |
|--------------------------------------------+-----------------|
| QCOW2 over HTTPS (existing implementation) |        00:00:59 |
| 256kB blocks, 8 cached blocks              |        00:00:42 |
| 2MB blocks, 100 cached blocks              |        00:00:34 |

By way of comparison, aria2c (a dedicated HTTP download client) can retrieve the same image in 19 seconds. Obviously this is without any QCOW2 layer. David Edmondson (9): block/curl: Add an 'offset' parameter, affecting all range requests block/curl: Remove readahead support block/curl: Tracing block/curl: Perform IO in fixed size chunks block/curl: Allow the blocksize to be specified by the user block/curl: Cache downloaded blocks block/curl: Allow the user to control the number of cache blocks block/curl: Allow 16 sockets/ACB block/curl: Add readahead support block/curl.c | 515 ++ block/io.c| 4 + block/linux-aio.c | 6 + block/trace-events| 18 +- docs/system/device-url-syntax.rst.inc | 15 + qapi/block-core.json | 11 +- 6 files changed, 488 insertions(+), 81 deletions(-) -- 2.27.0
[RFC PATCH 6/9] block/curl: Cache downloaded blocks
In the hope that they will be referred to multiple times, cache the blocks downloaded from the remote server. Signed-off-by: David Edmondson --- block/curl.c | 321 +++-- block/trace-events | 3 + 2 files changed, 287 insertions(+), 37 deletions(-) diff --git a/block/curl.c b/block/curl.c index b2d02818a9..0ea9eedebd 100644 --- a/block/curl.c +++ b/block/curl.c @@ -81,11 +81,29 @@ static CURLMcode __curl_multi_socket_action(CURLM *multi_handle, /* Must be a non-zero power of 2. */ #define CURL_BLOCK_OPT_BLOCKSIZE_DEFAULT (256 * 1024) +/* The maximum number of blocks to store in the cache. */ +#define CURL_BLOCK_CACHE_MAX_BLOCKS 100 +/* The number of heads in the hash table. */ +#define CURL_BLOCK_CACHE_HASH 37 + struct BDRVCURLState; struct CURLState; static bool libcurl_initialized; +typedef struct block { +QLIST_ENTRY(block) hash; /* Blocks with the same hash value. */ +QLIST_ENTRY(block) free; /* Block free list. */ +QTAILQ_ENTRY(block) lru; /* LRU list. */ +bool hashed; /* block_t contains data and is hashed. */ +int use; /* Use count. */ + +uint64_t start; /* Offset of first byte. */ +uint64_t count; /* Valid bytes. */ + +char *buf; /* Data. */ +} block_t; + typedef struct CURLAIOCB { Coroutine *co; QEMUIOVector *qiov; @@ -117,12 +135,11 @@ typedef struct CURLState CURLAIOCB *acb[CURL_NUM_ACB]; CURL *curl; QLIST_HEAD(, CURLSocket) sockets; -char *orig_buf; -uint64_t buf_start; size_t buf_off; char range[128]; char errmsg[CURL_ERROR_SIZE]; char in_use; +block_t *cache_block; } CURLState; typedef struct BDRVCURLState { @@ -144,11 +161,17 @@ typedef struct BDRVCURLState { char *proxypassword; size_t offset; size_t blocksize; +int cache_allocated; /* The number of block_t currently allocated. */ +QLIST_HEAD(, block) cache_free; +QTAILQ_HEAD(, block) cache_lru; +QLIST_HEAD(, block) * cache_hash; } BDRVCURLState; static void curl_clean_state(CURLState *s); static void curl_multi_do(void *arg); +static void curl_cache_free(BDRVCURLState *s, block_t *b); + /* Align "n" to the start of the containing block. */ static inline uint64_t curl_block_align(BDRVCURLState *s, uint64_t n) { @@ -161,6 +184,198 @@ static inline uint64_t curl_block_offset(BDRVCURLState *s, uint64_t n) return n & (s->blocksize - 1); } +static uint64_t curl_cache_hash(BDRVCURLState *s, uint64_t n) +{ +return curl_block_align(s, n) % CURL_BLOCK_CACHE_HASH; +} + +static bool curl_cache_init(BDRVCURLState *s) +{ +s->cache_allocated = 0; + +QLIST_INIT(&s->cache_free); +QTAILQ_INIT(&s->cache_lru); + +s->cache_hash = g_try_malloc(CURL_BLOCK_CACHE_HASH * sizeof(s->cache_hash)); +if (!s->cache_hash) { +return false; +} + +for (int i = 0; i < CURL_BLOCK_CACHE_HASH; i++) { +QLIST_INIT(&s->cache_hash[i]); +} + +return true; +} + +static void curl_cache_deinit(BDRVCURLState *s) +{ +block_t *b; + +/* + * Cache blocks are either in the hash table or on the free list. 
+ */ +for (int i = 0; i < CURL_BLOCK_CACHE_HASH; i++) { +while (!QLIST_EMPTY(&s->cache_hash[i])) { +b = QLIST_FIRST(&s->cache_hash[i]); +QLIST_REMOVE(b, hash); +b->hashed = false; +curl_cache_free(s, b); +} +} + +while (!QLIST_EMPTY(&s->cache_free)) { +b = QLIST_FIRST(&s->cache_free); +QLIST_REMOVE(b, free); +curl_cache_free(s, b); +} + +assert(s->cache_allocated == 0); + +g_free(s->cache_hash); +s->cache_hash = NULL; +} + +static block_t *curl_cache_alloc(BDRVCURLState *s) +{ +block_t *b = g_try_malloc0(sizeof(*b)); + +if (!b) { +return NULL; +} + +b->buf = g_try_malloc(s->blocksize); +if (!b->buf) { +g_free(b); +return NULL; +} + +s->cache_allocated++; + +trace_curl_cache_alloc(s->cache_allocated); + +return b; +} + +static void curl_cache_free(BDRVCURLState *s, block_t *b) +{ +assert(b->use == 0); +assert(!b->hashed); + +g_free(b->buf); +g_free(b); + +s->cache_allocated--; + +trace_curl_cache_free(s->cache_allocated); +} + +static block_t *curl_cache_get(BDRVCURLState *s) +{ +block_t *b = NULL; + +/* If there is one on the free list, use it. */ +if (!QLIST_EMPTY(&s->cache_free)) { +b = QLIST_FIRST(&s->cache_free); +QLIST_REMOVE(b, free); + +assert(b->use == 0); +assert(!b->hashed); + +b->use++; +goto done; +} + +/* If not at the limit, try get a new one. */ +if (s->cache_allocated < CURL_BLOCK_CACHE_MAX_BLOCKS) { +b = curl_cache_alloc(s); +if (b) { +b->u
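The cache in this patch threads each block onto three lists: a per-head hash chain for lookup, a free list, and an LRU queue for eviction. A much-simplified, self-contained sketch of the lookup and eviction behaviour (fixed-size arrays and a use counter replace the QLIST/QTAILQ machinery; illustrative only):

#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 4          /* stand-in for the cache block limit */

typedef struct {
    uint64_t start;        /* aligned offset of the cached block */
    int valid;
    uint64_t last_use;     /* larger = more recently used */
} Block;

static Block blocks[NBLOCKS];
static uint64_t tick;

static Block *cache_lookup(uint64_t aligned_start)
{
    /* The real code walks only one hash chain; a linear scan is
       enough to show the behaviour. */
    for (int i = 0; i < NBLOCKS; i++) {
        if (blocks[i].valid && blocks[i].start == aligned_start) {
            blocks[i].last_use = ++tick;   /* move to the LRU head */
            return &blocks[i];
        }
    }
    return NULL;
}

static Block *cache_get(uint64_t aligned_start)
{
    Block *victim = &blocks[0];

    /* Prefer an unused slot; otherwise evict the least recently used. */
    for (int i = 0; i < NBLOCKS; i++) {
        if (!blocks[i].valid) {
            victim = &blocks[i];
            break;
        }
        if (blocks[i].last_use < victim->last_use) {
            victim = &blocks[i];
        }
    }
    victim->valid = 1;
    victim->start = aligned_start;
    victim->last_use = ++tick;
    return victim;
}

int main(void)
{
    for (uint64_t i = 0; i < 6; i++) {     /* 6 blocks into a 4-slot cache */
        uint64_t start = i * 256 * 1024;
        if (!cache_lookup(start)) {
            cache_get(start);
            printf("miss @%llu\n", (unsigned long long)start);
        }
    }
    /* The two oldest blocks were evicted; the last four remain. */
    printf("block 0 cached: %s\n", cache_lookup(0) ? "yes" : "no");
    return 0;
}

The real implementation additionally pins a block with a use count while an IO is filling it, so that an in-flight block can neither be evicted nor freed under a concurrent reader.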
[RFC PATCH 8/9] block/curl: Allow 16 sockets/ACB
qemu-img allows up to 16 coroutines when performing IO. Ensure that there is a Curl socket and ACB available to each of them. Signed-off-by: David Edmondson --- block/curl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/block/curl.c b/block/curl.c index 27fa77c351..8ee314739a 100644 --- a/block/curl.c +++ b/block/curl.c @@ -60,8 +60,8 @@ static CURLMcode __curl_multi_socket_action(CURLM *multi_handle, #define PROTOCOLS (CURLPROTO_HTTP | CURLPROTO_HTTPS | \ CURLPROTO_FTP | CURLPROTO_FTPS) -#define CURL_NUM_STATES 8 -#define CURL_NUM_ACB 8 +#define CURL_NUM_STATES 16 +#define CURL_NUM_ACB 16 #define CURL_TIMEOUT_MAX 10000 #define CURL_BLOCK_OPT_URL "url" -- 2.27.0
[RFC PATCH 2/9] block/curl: Remove readahead support
Block based caching and the current readahead support do not interact well, so remove readahead support before adding block caching. Readahead will be re-added later. Signed-off-by: David Edmondson --- block/curl.c | 23 --- docs/system/device-url-syntax.rst.inc | 7 --- qapi/block-core.json | 4 3 files changed, 4 insertions(+), 30 deletions(-) diff --git a/block/curl.c b/block/curl.c index 32ec760f66..d0c74d7de5 100644 --- a/block/curl.c +++ b/block/curl.c @@ -65,7 +65,6 @@ static CURLMcode __curl_multi_socket_action(CURLM *multi_handle, #define CURL_TIMEOUT_MAX 10000 #define CURL_BLOCK_OPT_URL "url" -#define CURL_BLOCK_OPT_READAHEAD "readahead" #define CURL_BLOCK_OPT_SSLVERIFY "sslverify" #define CURL_BLOCK_OPT_TIMEOUT "timeout" #define CURL_BLOCK_OPT_COOKIE "cookie" @@ -76,7 +75,6 @@ static CURLMcode __curl_multi_socket_action(CURLM *multi_handle, #define CURL_BLOCK_OPT_PROXY_PASSWORD_SECRET "proxy-password-secret" #define CURL_BLOCK_OPT_OFFSET "offset" -#define CURL_BLOCK_OPT_READAHEAD_DEFAULT (256 * 1024) #define CURL_BLOCK_OPT_SSLVERIFY_DEFAULT true #define CURL_BLOCK_OPT_TIMEOUT_DEFAULT 5 @@ -124,7 +122,6 @@ typedef struct BDRVCURLState { uint64_t len; CURLState states[CURL_NUM_STATES]; char *url; -size_t readahead_size; bool sslverify; uint64_t timeout; char *cookie; @@ -615,11 +612,6 @@ static QemuOptsList runtime_opts = { .type = QEMU_OPT_STRING, .help = "URL to open", }, -{ -.name = CURL_BLOCK_OPT_READAHEAD, -.type = QEMU_OPT_SIZE, -.help = "Readahead size", -}, { .name = CURL_BLOCK_OPT_SSLVERIFY, .type = QEMU_OPT_BOOL, @@ -705,14 +697,6 @@ static int curl_open(BlockDriverState *bs, QDict *options, int flags, goto out_noclean; } -s->readahead_size = qemu_opt_get_size(opts, CURL_BLOCK_OPT_READAHEAD, - CURL_BLOCK_OPT_READAHEAD_DEFAULT); -if ((s->readahead_size & 0x1ff) != 0) { -error_setg(errp, "HTTP_READAHEAD_SIZE %zd is not a multiple of 512", - s->readahead_size); -goto out_noclean; -} - s->timeout = qemu_opt_get_number(opts, CURL_BLOCK_OPT_TIMEOUT, CURL_BLOCK_OPT_TIMEOUT_DEFAULT); if (s->timeout > CURL_TIMEOUT_MAX) { @@ -898,7 +882,7 @@ static void curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb) state->buf_off = 0; g_free(state->orig_buf); state->buf_start = start; -state->buf_len = MIN(acb->end + s->readahead_size, s->len - start); +state->buf_len = MIN(acb->end, s->len - start); end = start + state->buf_len - 1; state->orig_buf = g_try_malloc(state->buf_len); if (state->buf_len && state->orig_buf == NULL) { @@ -971,8 +955,9 @@ static void curl_refresh_filename(BlockDriverState *bs) { BDRVCURLState *s = bs->opaque; -/* "readahead" and "timeout" do not change the guest-visible data, - * so ignore them */ +/* + * "timeout" does not change the guest-visible data, so ignore it. + */ if (s->sslverify != CURL_BLOCK_OPT_SSLVERIFY_DEFAULT || s->cookie || s->username || s->password || s->proxyusername || s->proxypassword) diff --git a/docs/system/device-url-syntax.rst.inc b/docs/system/device-url-syntax.rst.inc index 33f1ddfe6d..bc38b9df38 100644 --- a/docs/system/device-url-syntax.rst.inc +++ b/docs/system/device-url-syntax.rst.inc @@ -174,13 +174,6 @@ These are specified using a special URL syntax. ``url`` The full URL when passing options to the driver explicitly. - ``readahead`` The amount of data to read ahead with each range request to the remote server. This value may optionally have the suffix 'T', 'G', 'M', 'K', 'k' or 'b'. If it does not have a suffix, it will be assumed to be in bytes. The value must be a multiple of 512 bytes. It defaults to 256k.
- ``sslverify`` Whether to verify the remote server's certificate when connecting over SSL. It can have the value 'on' or 'off'. It defaults to diff --git a/qapi/block-core.json b/qapi/block-core.json index 197bdc1c36..d6f5e91cc3 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -3752,9 +3752,6 @@ # # @url: URL of the image file # -# @readahead: Size of the read-ahead cache; must be a multiple of -# 512 (defaults to 256 kB) -# # @timeout: Timeout for connections, in seconds (defaults to 5) # # @username: Username for authentication (defaults to none) @@ -3771,7 +3768,6 @@ ## { 'struct': 'BlockdevOptionsCurlBase', 'data': { 'url': 'str', -'*readahead': 'int', '*timeout': 'int', '*username': 'str', '*password-secret': 'str', -- 2.27.0
[RFC PATCH 3/9] block/curl: Tracing
Add some more trace functions to the IO path. Signed-off-by: David Edmondson --- block/curl.c | 10 +- block/io.c | 4 block/linux-aio.c | 6 ++ block/trace-events | 13 - 4 files changed, 31 insertions(+), 2 deletions(-) diff --git a/block/curl.c b/block/curl.c index d0c74d7de5..d2f4de46c9 100644 --- a/block/curl.c +++ b/block/curl.c @@ -299,6 +299,8 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len, { char *buf = state->orig_buf + (start - state->buf_start); +trace_curl_pending_hit(qemu_coroutine_self(), + start, len); qemu_iovec_from_buf(acb->qiov, 0, buf, clamped_len); if (clamped_len < len) { qemu_iovec_memset(acb->qiov, clamped_len, 0, len - clamped_len); @@ -316,6 +318,8 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len, { int j; +trace_curl_pending_piggyback(qemu_coroutine_self(), + start, len); acb->start = start - state->buf_start; acb->end = acb->start + clamped_len; @@ -328,6 +332,8 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len, } } +trace_curl_pending_miss(qemu_coroutine_self(), start, len); + return false; } @@ -894,7 +900,7 @@ static void curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb) snprintf(state->range, 127, "%" PRIu64 "-%" PRIu64, s->offset + start, s->offset + end); -trace_curl_setup_preadv(acb->bytes, start, state->range); +trace_curl_setup_preadv(qemu_coroutine_self(), start, acb->bytes); curl_easy_setopt(state->curl, CURLOPT_RANGE, state->range); if (curl_multi_add_handle(s->multi, state->curl) != CURLM_OK) { @@ -923,10 +929,12 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs, .bytes = bytes }; +trace_curl_co_preadv(qemu_coroutine_self(), offset, bytes); curl_setup_preadv(bs, &acb); while (acb.ret == -EINPROGRESS) { qemu_coroutine_yield(); } +trace_curl_co_preadv_done(qemu_coroutine_self()); return acb.ret; } diff --git a/block/io.c b/block/io.c index ad3a51ed53..f816a46bf0 100644 --- a/block/io.c +++ b/block/io.c @@ -1194,6 +1194,8 @@ static int coroutine_fn bdrv_driver_pwritev(BlockDriverState *bs, return -ENOMEDIUM; } +trace_bdrv_driver_pwritev(qemu_coroutine_self(), offset, bytes); + if (drv->bdrv_co_pwritev_part) { ret = drv->bdrv_co_pwritev_part(bs, offset, bytes, qiov, qiov_offset, flags & bs->supported_write_flags); @@ -1253,6 +1255,8 @@ emulate_flags: qemu_iovec_destroy(&local_qiov); } +trace_bdrv_driver_pwritev_done(qemu_coroutine_self()); + return ret; } diff --git a/block/linux-aio.c b/block/linux-aio.c index 3c0527c2bf..811e9ff94c 100644 --- a/block/linux-aio.c +++ b/block/linux-aio.c @@ -15,6 +15,7 @@ #include "qemu/event_notifier.h" #include "qemu/coroutine.h" #include "qapi/error.h" +#include "trace.h" #include <libaio.h> @@ -391,6 +392,8 @@ int coroutine_fn laio_co_submit(BlockDriverState *bs, LinuxAioState *s, int fd, .qiov = qiov, }; +trace_laio_co_submit(qemu_coroutine_self(), offset, qiov->size); + ret = laio_do_submit(fd, &laiocb, offset, type); if (ret < 0) { return ret; @@ -399,6 +402,9 @@ int coroutine_fn laio_co_submit(BlockDriverState *bs, LinuxAioState *s, int fd, if (laiocb.ret == -EINPROGRESS) { qemu_coroutine_yield(); } + +trace_laio_co_submit_done(qemu_coroutine_self()); + return laiocb.ret; } diff --git a/block/trace-events b/block/trace-events index 9158335061..0b52d2ca1d 100644 --- a/block/trace-events +++ b/block/trace-events @@ -17,6 +17,8 @@ bdrv_co_pwrite_zeroes(void *bs, int64_t offset, int count, int flags) "bs %p off bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, int64_t cluster_bytes) "bs
%p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %"PRId64 bdrv_co_copy_range_from(void *src, uint64_t src_offset, void *dst, uint64_t dst_offset, uint64_t bytes, int read_flags, int write_flags) "src %p offset %"PRIu64" dst %p offset %"PRIu64" bytes %"PRIu64" rw flags 0x%x 0x%x" bdrv_co_copy_range_to(void *src, uint64_t src_offset, void *dst, uint64_t dst_offset, uint64_t bytes, int read_flags, int write_flags) "src %p offset %"PRIu64" dst %p offset %"PRIu64" bytes %"PRIu64" rw flags 0x%x 0x%x" +bdrv_driver_pwritev(void *co, uint64_t offset, uint64_
[RFC PATCH 9/9] block/curl: Add readahead support
Re-add support for a readahead parameter, which is the number of bytes added to the request from the upper layer before breaking the request into blocks. The default is zero. The number of bytes specified has no alignment requirements. Signed-off-by: David Edmondson --- block/curl.c | 11 +++ docs/system/device-url-syntax.rst.inc | 7 +++ qapi/block-core.json | 3 +++ 3 files changed, 21 insertions(+) diff --git a/block/curl.c b/block/curl.c index 8ee314739a..a182a55b93 100644 --- a/block/curl.c +++ b/block/curl.c @@ -65,6 +65,7 @@ static CURLMcode __curl_multi_socket_action(CURLM *multi_handle, #define CURL_TIMEOUT_MAX 10000 #define CURL_BLOCK_OPT_URL "url" +#define CURL_BLOCK_OPT_READAHEAD "readahead" #define CURL_BLOCK_OPT_SSLVERIFY "sslverify" #define CURL_BLOCK_OPT_TIMEOUT "timeout" #define CURL_BLOCK_OPT_COOKIE "cookie" @@ -149,6 +150,7 @@ typedef struct BDRVCURLState { uint64_t len; CURLState states[CURL_NUM_STATES]; char *url; +size_t readahead_size; bool sslverify; uint64_t timeout; char *cookie; @@ -881,6 +883,11 @@ static QemuOptsList runtime_opts = { .type = QEMU_OPT_STRING, .help = "URL to open", }, +{ +.name = CURL_BLOCK_OPT_READAHEAD, +.type = QEMU_OPT_SIZE, +.help = "Readahead size", +}, { .name = CURL_BLOCK_OPT_SSLVERIFY, .type = QEMU_OPT_BOOL, @@ -976,6 +983,8 @@ static int curl_open(BlockDriverState *bs, QDict *options, int flags, goto out_noclean; } +s->readahead_size = qemu_opt_get_size(opts, CURL_BLOCK_OPT_READAHEAD, 0); + s->timeout = qemu_opt_get_number(opts, CURL_BLOCK_OPT_TIMEOUT, CURL_BLOCK_OPT_TIMEOUT_DEFAULT); if (s->timeout > CURL_TIMEOUT_MAX) { @@ -1247,6 +1256,8 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs, trace_curl_co_preadv(qemu_coroutine_self(), offset, bytes); +bytes += s->readahead_size; + while (bytes > 0) { uint64_t len = MIN(bytes, s->blocksize - curl_block_offset(s, off)); CURLAIOCB acb = { diff --git a/docs/system/device-url-syntax.rst.inc b/docs/system/device-url-syntax.rst.inc index 56843cb38f..58245e017c 100644 --- a/docs/system/device-url-syntax.rst.inc +++ b/docs/system/device-url-syntax.rst.inc @@ -174,6 +174,13 @@ These are specified using a special URL syntax. ``url`` The full URL when passing options to the driver explicitly. + ``readahead`` + The amount of data to read ahead with each range request to the + remote server. This value may optionally have the suffix 'T', 'G', + 'M', 'K', 'k' or 'b'. If it does not have a suffix, it will be + assumed to be in bytes. There is no alignment requirement and it + defaults to 0 (no read-ahead). + ``sslverify`` Whether to verify the remote server's certificate when connecting over SSL. It can have the value 'on' or 'off'. It defaults to diff --git a/qapi/block-core.json b/qapi/block-core.json index 91888166fa..f4092ccc14 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -3752,6 +3752,8 @@ # # @url: URL of the image file # +# @readahead: Amount of read-ahead (defaults to 0) +# # @timeout: Timeout for connections, in seconds (defaults to 5) # # @username: Username for authentication (defaults to none) # @@ -3776,6 +3778,7 @@ 'data': { 'url': 'str', '*blocksize': 'int', '*blockcount': 'int', +'*readahead': 'int', '*timeout': 'int', '*username': 'str', '*password-secret': 'str', -- 2.27.0
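One detail worth noting: as posted, curl_co_preadv() extends every request by readahead_size unconditionally. A minimal sketch of how the extension might be clamped so it never reaches past the end of the remote object (this clamping is not in the patch; it assumes the driver's existing s->len object-length field and the offset/bytes arguments of curl_co_preadv):

    /* Hypothetical refinement: extend the request by the readahead
     * amount, but never beyond the end of the remote object. */
    if (s->readahead_size) {
        uint64_t end = MIN(offset + bytes + s->readahead_size, s->len);
        bytes = end - offset;
    }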
Re: [RFC PATCH 0/9] block/curl: Add caching of data downloaded from the remote server
On Wednesday, 2020-08-19 at 15:11:37 +01, Stefan Hajnoczi wrote: > On Tue, Aug 18, 2020 at 12:08:36PM +0100, David Edmondson wrote: >> When using qemu-img to convert an image that is hosted on an HTTP >> server to some faster local (or pseudo-local) storage, the overall >> performance can be improved by reading data from the HTTP server in >> larger blocks and by caching and re-using blocks already read. This >> set of patches implements both of these, and adds a further patch >> allowing an offset to be added to all of the HTTP requests. > > Hi David, > Thanks for posting this! Kevin and Max are the maintainers in this area, > but I wanted to ask an initial question: > > Is caching curl-specific or could this be implemented as a block filter > driver so that it can be stacked on top of other network protocols too? This implementation is curl specific, as you probably surmised. I will look at implementing something similar as a block filter. dme. -- Facts don't do what I want them to.
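For readers unfamiliar with the suggestion above: a block filter driver sits between a guest-visible node and its child and intercepts I/O, so it works regardless of the underlying protocol. A minimal pass-through sketch, modelled on block/copy-on-read.c (the names here are hypothetical; a real caching filter would add a lookup before forwarding):

    static int coroutine_fn cache_co_preadv(BlockDriverState *bs,
                                            uint64_t offset, uint64_t bytes,
                                            QEMUIOVector *qiov, int flags)
    {
        /* A caching filter would satisfy hits here; on a miss, read
         * through to the child node and populate the cache. */
        return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
    }

    static BlockDriver bdrv_cache = {
        .format_name    = "cache",
        .is_filter      = true,
        .bdrv_co_preadv = cache_co_preadv,
    };

Stacking it on curl, NBD or any other protocol node would then be a matter of inserting the filter above that node in the block graph.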
Re: [PATCH 2/3] block: add logging facility for long standing IO requests
On Monday, 2020-08-10 at 13:14:46 +03, Denis V. Lunev wrote: > +strftime(buf, sizeof(buf), "%m-%d %H:%M:%S", "%F %T" would include the year, which can be useful. > + localtime_r(&start_time_host_s, &t)); > + > +bs = blk_bs(blk_stats2blk(stats)); > +qemu_log("long %s[%ld] IO request: %d.03%d since %s.%03d bs: %s(%s, > %s)\n", > + block_account_type(cookie->type), cookie->bytes, > + (int)(latency_ms / 1000), (int)(latency_ms % 1000), buf, > + (int)((cookie->start_time_ns / 100) % 1000), Is there a reason not to use %u rather than casting? dme. -- We wanna wait, but here we go again.
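Putting both suggestions above together, the logging call might end up looking something like this (variable names are taken from the patch under review; a sketch, not a drop-in replacement):

    char buf[32];
    struct tm t;

    /* "%F %T" expands to "YYYY-MM-DD HH:MM:SS", so the year is retained. */
    strftime(buf, sizeof(buf), "%F %T", localtime_r(&start_time_host_s, &t));

    /* With latency_ms held in a uint64_t there is no need for casts. */
    qemu_log("long IO request: %"PRIu64".%03"PRIu64" since %s\n",
             latency_ms / 1000, latency_ms % 1000, buf);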
Re: Bug? qemu-img convert to preallocated image makes it sparse
On Thursday, 2020-01-16 at 15:37:22 +01, Max Reitz wrote: > So I suppose the best idea I can come up with is indeed a > --target-is-zero flag for qemu-img convert -n. Would that work for you? I was looking at adding this for a slightly different reason (converting sparse images to newly provisioned LUNs). A naive patch follows (I'm new to hacking on QEMU, and the block layer is complex, as mature code tends to be); it works in my tests. Feedback much appreciated. The patch specifically targets the initial blanking of the image rather than any other attempt to write zeroes, as I couldn't convince myself that there was no control flow where qemu-img convert might need to overwrite (with zeroes) data that it itself had previously written. dme. -- Stop the music and go home.
[PATCH] qemu-img: Add --target-is-zero to convert
In many cases the target of a convert operation is a newly provisioned target that the user knows is blank (filled with zeroes). In this situation there is no requirement for qemu-img to wastefully zero out the entire device. Add a new option, --target-is-zero, allowing the user to indicate that an existing target device is already zero filled. --- Apologies if this arrives twice - this From address wasn't subscribed the first time around. qemu-img.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/qemu-img.c b/qemu-img.c index 95a24b9762..56ca727e8c 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -70,6 +70,7 @@ enum { OPTION_PREALLOCATION = 265, OPTION_SHRINK = 266, OPTION_SALVAGE = 267, +OPTION_TARGET_IS_ZERO = 268, }; typedef enum OutputFormat { @@ -1593,6 +1594,7 @@ typedef struct ImgConvertState { bool copy_range; bool salvage; bool quiet; +bool target_is_zero; int min_sparse; int alignment; size_t cluster_sectors; @@ -1984,10 +1986,11 @@ static int convert_do_copy(ImgConvertState *s) int64_t sector_num = 0; /* Check whether we have zero initialisation or can get it efficiently */ -if (s->target_is_new && s->min_sparse && !s->target_has_backing) { +s->has_zero_init = s->target_is_zero; + +if (!s->has_zero_init && s->target_is_new && s->min_sparse && +!s->target_has_backing) { s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target)); -} else { -s->has_zero_init = false; } if (!s->has_zero_init && !s->target_has_backing && @@ -2076,6 +2079,7 @@ static int img_convert(int argc, char **argv) .buf_sectors= IO_BUF_SIZE / BDRV_SECTOR_SIZE, .wr_in_order= true, .num_coroutines = 8, +.target_is_zero = false, }; for(;;) { @@ -2086,6 +2090,7 @@ static int img_convert(int argc, char **argv) {"force-share", no_argument, 0, 'U'}, {"target-image-opts", no_argument, 0, OPTION_TARGET_IMAGE_OPTS}, {"salvage", no_argument, 0, OPTION_SALVAGE}, +{"target-is-zero", no_argument, 0, OPTION_TARGET_IS_ZERO}, {0, 0, 0, 0} }; c = getopt_long(argc, argv, ":hf:O:B:Cco:l:S:pt:T:qnm:WU", @@ -2209,6 +2214,9 @@ static int img_convert(int argc, char **argv) case OPTION_TARGET_IMAGE_OPTS: tgt_image_opts = true; break; +case OPTION_TARGET_IS_ZERO: +s.target_is_zero = true; +break; } } @@ -2247,6 +2255,11 @@ static int img_convert(int argc, char **argv) warn_report("This will become an error in future QEMU versions."); } +if (s.target_is_zero && !skip_create) { +error_report("--target-is-zero requires use of -n flag"); +goto fail_getopt; +} + s.src_num = argc - optind - 1; out_filename = s.src_num >= 1 ? argv[argc - 1] : NULL; -- 2.24.1
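For illustration, the intended use is against a target that already exists and is known to be blank, for example a freshly provisioned LUN (paths and formats here are hypothetical):

    qemu-img convert -n --target-is-zero -f qcow2 -O raw sparse.qcow2 /dev/sdg

Without --target-is-zero, convert would have to write zeroes over every region of the target that is unallocated in the source.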
Re: [PATCH] qemu-img: Add --target-is-zero to convert
On Monday, 2020-01-20 at 19:33:43 -05, John Snow wrote: > CC qemu-block and block maintainers > > On 1/17/20 5:34 AM, David Edmondson wrote: >> In many cases the target of a convert operation is a newly provisioned >> target that the user knows is blank (filled with zeroes). In this >> situation there is no requirement for qemu-img to wastefully zero out >> the entire device. >> > > Is there no way to convince bdrv_has_zero_init to return what we want > already in this case? In the current HEAD code, bdrv_has_zero_init will never be called for “convert -n” (skip target volume creation). If -n is not specified the host_device block driver doesn't provide a bdrv_has_zero_init function, so it's assumed not supported. > I cannot recall off hand, but wonder if there's an advanced syntax > method of specifying the target image that can set this flag already. I couldn't see one, but I'd be happy to learn of its existence. My first approach was to add a raw specific option and add it using --target-image-opts, resulting in something like: qemu-img convert -n --target-image-opts sparse.qcow2 \ driver=raw,file.filename=/dev/sdg,assume-blank=on (assume-blank=on is the new bit). This worked, but is only useful for raw targets. The discussion here led me to switch to --target-is-zero. Mark Kanda sent me some comments suggesting that I get rid of the new target_is_zero boolean and simply set has_zero_init, which I will do before sending out another patch if this overall approach is considered appropriate. >> Add a new option, --target-is-zero, allowing the user to indicate that >> an existing target device is already zero filled. >> --- >> qemu-img.c | 19 --- >> 1 file changed, 16 insertions(+), 3 deletions(-) >> >> diff --git a/qemu-img.c b/qemu-img.c >> index 95a24b9762..56ca727e8c 100644 >> --- a/qemu-img.c >> +++ b/qemu-img.c >> @@ -70,6 +70,7 @@ enum { >> OPTION_PREALLOCATION = 265, >> OPTION_SHRINK = 266, >> OPTION_SALVAGE = 267, >> +OPTION_TARGET_IS_ZERO = 268, >> }; >> >> typedef enum OutputFormat { >> @@ -1593,6 +1594,7 @@ typedef struct ImgConvertState { >> bool copy_range; >> bool salvage; >> bool quiet; >> +bool target_is_zero; >> int min_sparse; >> int alignment; >> size_t cluster_sectors; >> @@ -1984,10 +1986,11 @@ static int convert_do_copy(ImgConvertState *s) >> int64_t sector_num = 0; >> >> /* Check whether we have zero initialisation or can get it efficiently >> */ >> -if (s->target_is_new && s->min_sparse && !s->target_has_backing) { >> +s->has_zero_init = s->target_is_zero; >> + >> +if (!s->has_zero_init && s->target_is_new && s->min_sparse && >> +!s->target_has_backing) { >> s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target)); >> -} else { >> -s->has_zero_init = false; >> } >> >> if (!s->has_zero_init && !s->target_has_backing && >> @@ -2076,6 +2079,7 @@ static int img_convert(int argc, char **argv) >> .buf_sectors= IO_BUF_SIZE / BDRV_SECTOR_SIZE, >> .wr_in_order= true, >> .num_coroutines = 8, >> +.target_is_zero = false, >> }; >> >> for(;;) { >> @@ -2086,6 +2090,7 @@ static int img_convert(int argc, char **argv) >> {"force-share", no_argument, 0, 'U'}, >> {"target-image-opts", no_argument, 0, OPTION_TARGET_IMAGE_OPTS}, >> {"salvage", no_argument, 0, OPTION_SALVAGE}, >> +{"target-is-zero", no_argument, 0, OPTION_TARGET_IS_ZERO}, >> {0, 0, 0, 0} >> }; >> c = getopt_long(argc, argv, ":hf:O:B:Cco:l:S:pt:T:qnm:WU", >> @@ -2209,6 +2214,9 @@ static int img_convert(int argc, char **argv) >> case OPTION_TARGET_IMAGE_OPTS: >> tgt_image_opts = true; >> break; >> +case 
OPTION_TARGET_IS_ZERO: >> +s.target_is_zero = true; >> +break; >> } >> } >> >> @@ -2247,6 +2255,11 @@ static int img_convert(int argc, char **argv) >> warn_report("This will become an error in future QEMU versions."); >> } >> >> +if (s.target_is_zero && !skip_create) { >> +error_report("--target-is-zero requires use of -n flag"); >> +goto fail_getopt; >> +} >> + >> s.src_num = argc - optind - 1; >> out_filename = s.src_num >= 1 ? argv[argc - 1] : NULL; >> >> dme. -- Please forgive me if I act a little strange, for I know not what I do.
Re: [PATCH] qemu-img: Add --target-is-zero to convert
On Tuesday, 2020-01-21 at 16:02:16 +01, Max Reitz wrote: > Hi, > > On 17.01.20 11:34, David Edmondson wrote: >> In many cases the target of a convert operation is a newly provisioned >> target that the user knows is blank (filled with zeroes). In this >> situation there is no requirement for qemu-img to wastefully zero out >> the entire device. >> >> Add a new option, --target-is-zero, allowing the user to indicate that >> an existing target device is already zero filled. >> --- >> qemu-img.c | 19 --- >> 1 file changed, 16 insertions(+), 3 deletions(-) >> >> diff --git a/qemu-img.c b/qemu-img.c >> index 95a24b9762..56ca727e8c 100644 >> --- a/qemu-img.c >> +++ b/qemu-img.c >> @@ -70,6 +70,7 @@ enum { >> OPTION_PREALLOCATION = 265, >> OPTION_SHRINK = 266, >> OPTION_SALVAGE = 267, >> +OPTION_TARGET_IS_ZERO = 268, >> }; >> >> typedef enum OutputFormat { >> @@ -1593,6 +1594,7 @@ typedef struct ImgConvertState { >> bool copy_range; >> bool salvage; >> bool quiet; >> +bool target_is_zero; > > As you already said, we probably don’t need this and it’d be sufficient > to set the has_zero_init value directly. > >> int min_sparse; >> int alignment; >> size_t cluster_sectors; >> @@ -1984,10 +1986,11 @@ static int convert_do_copy(ImgConvertState *s) >> int64_t sector_num = 0; >> >> /* Check whether we have zero initialisation or can get it efficiently >> */ >> -if (s->target_is_new && s->min_sparse && !s->target_has_backing) { >> +s->has_zero_init = s->target_is_zero; > > We cannot has_zero_init to true if the target has a backing file, > because convert_co_write() asserts that the target must not have a > backing file if has_zero_init is true. (It’s impossible for a file to > be initialized to zeroes if it has a backing file; or at least it > doesn’t make sense then to have a backing file.) I'll add a check causing (has_zero_init && target_has_backing) to throw an error after the target_has_backing is determined. > Case in point: > > $ ./qemu-img create -f qcow2 src.qcow2 64M > Formatting 'src.qcow2', fmt=qcow2 size=67108864 cluster_size=65536 > lazy_refcounts=off refcount_bits=16 > $ ./qemu-img create -f qcow2 backing.qcow2 64M > Formatting 'backing.qcow2', fmt=qcow2 size=67108864 cluster_size=65536 > lazy_refcounts=off refcount_bits=16 > $ ./qemu-img create -f qcow2 -b backing.qcow2 dst.qcow2 64M > > Formatting 'dst.qcow2', fmt=qcow2 size=67108864 > backing_file=backing.qcow2 cluster_size=65536 lazy_refcounts=off > refcount_bits=16 > $ ./qemu-img convert -n -B backing.qcow2 -f qcow2 -O qcow2 > --target-is-zero src.qcow2 dst.qcow2 > qemu-img: qemu-img.c:1812: convert_co_write: Assertion > `!s->target_has_backing' failed. > [1]80813 abort (core dumped) ./qemu-img convert -n -B backing.qcow2 > -f qcow2 -O qcow2 --target-is-zero > >> + >> +if (!s->has_zero_init && s->target_is_new && s->min_sparse && >> +!s->target_has_backing) { > > (This will be irrelevant after target_has_backing is gone, but because > has_zero_init and target_has_backing are equivalent here, there is no > need to check both.) I don't understand this comment - I must be missing something. If both has_zero_init and target_has_backing are false here, we should go and check bdrv_has_zero_init(). They can't both be true (when the above mentioned test is added) and if either is true, we don't want to call brv_has_zero_init(), as either the user has asserted that the target is blank or we have a backing file. 
>> s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target)); >> -} else { >> -s->has_zero_init = false; >> } >> >> if (!s->has_zero_init && !s->target_has_backing && >> @@ -2076,6 +2079,7 @@ static int img_convert(int argc, char **argv) >> .buf_sectors= IO_BUF_SIZE / BDRV_SECTOR_SIZE, >> .wr_in_order= true, >> .num_coroutines = 8, >> +.target_is_zero = false, >> }; >> >> for(;;) { >> @@ -2086,6 +2090,7 @@ static int img_convert(int argc, char **argv) >> {"force-share", no_argument, 0, 'U'}, >> {"target-image-opts", no_argument, 0, OPTION_TARGET_IMAGE_OPTS}, >> {"salvage", no_argument, 0, OPTION_SALVAGE}, >> +{&
[PATCH v2 0/2] qemu-img: Add --target-is-zero to indicate that a target is blank
qemu-img: Add --target-is-zero to indicate that a target is blank v2: - Remove target_is_zero, preferring to set has_zero_init directly (Mark Kanda). - Disallow --target-is-zero in the presence of a backing file (Max Reitz). - Add relevant documentation (Max Reitz). - @var -> @code for options in qemu-img.texi. David Edmondson (2): qemu-img: Add --target-is-zero to convert doc: Use @code rather than @var for options qemu-img-cmds.hx | 4 ++-- qemu-img.c | 25 ++--- qemu-img.texi| 20 3 files changed, 36 insertions(+), 13 deletions(-) -- 2.24.1
[PATCH v2 1/2] qemu-img: Add --target-is-zero to convert
In many cases the target of a convert operation is a newly provisioned target that the user knows is blank (filled with zeroes). In this situation there is no requirement for qemu-img to wastefully zero out the entire device. Add a new option, --target-is-zero, allowing the user to indicate that an existing target device is already zero filled. Signed-off-by: David Edmondson --- qemu-img-cmds.hx | 4 ++-- qemu-img.c | 25 ++--- qemu-img.texi| 4 3 files changed, 28 insertions(+), 5 deletions(-) diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx index 1c93e6d185..6f958a0a48 100644 --- a/qemu-img-cmds.hx +++ b/qemu-img-cmds.hx @@ -44,9 +44,9 @@ STEXI ETEXI DEF("convert", img_convert, -"convert [--object objectdef] [--image-opts] [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T src_cache] [-O output_fmt] [-B backing_file] [-o options] [-l snapshot_param] [-S sparse_size] [-m num_coroutines] [-W] [--salvage] filename [filename2 [...]] output_filename") +"convert [--object objectdef] [--image-opts] [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T src_cache] [-O output_fmt] [-B backing_file] [-o options] [-l snapshot_param] [-S sparse_size] [-m num_coroutines] [-W] [--salvage] filename [filename2 [...]] output_filename") STEXI -@item convert [--object @var{objectdef}] [--image-opts] [--target-image-opts] [-U] [-C] [-c] [-p] [-q] [-n] [-f @var{fmt}] [-t @var{cache}] [-T @var{src_cache}] [-O @var{output_fmt}] [-B @var{backing_file}] [-o @var{options}] [-l @var{snapshot_param}] [-S @var{sparse_size}] [-m @var{num_coroutines}] [-W] [--salvage] @var{filename} [@var{filename2} [...]] @var{output_filename} +@item convert [--object @var{objectdef}] [--image-opts] [--target-image-opts] [--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f @var{fmt}] [-t @var{cache}] [-T @var{src_cache}] [-O @var{output_fmt}] [-B @var{backing_file}] [-o @var{options}] [-l @var{snapshot_param}] [-S @var{sparse_size}] [-m @var{num_coroutines}] [-W] [--salvage] @var{filename} [@var{filename2} [...]] @var{output_filename} ETEXI DEF("create", img_create, diff --git a/qemu-img.c b/qemu-img.c index 6233b8ca56..46db72dbe0 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -70,6 +70,7 @@ enum { OPTION_PREALLOCATION = 265, OPTION_SHRINK = 266, OPTION_SALVAGE = 267, +OPTION_TARGET_IS_ZERO = 268, }; typedef enum OutputFormat { @@ -1984,10 +1985,9 @@ static int convert_do_copy(ImgConvertState *s) int64_t sector_num = 0; /* Check whether we have zero initialisation or can get it efficiently */ -if (s->target_is_new && s->min_sparse && !s->target_has_backing) { +if (!s->has_zero_init && s->target_is_new && s->min_sparse && +!s->target_has_backing) { s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target)); -} else { -s->has_zero_init = false; } if (!s->has_zero_init && !s->target_has_backing && @@ -2086,6 +2086,7 @@ static int img_convert(int argc, char **argv) {"force-share", no_argument, 0, 'U'}, {"target-image-opts", no_argument, 0, OPTION_TARGET_IMAGE_OPTS}, {"salvage", no_argument, 0, OPTION_SALVAGE}, +{"target-is-zero", no_argument, 0, OPTION_TARGET_IS_ZERO}, {0, 0, 0, 0} }; c = getopt_long(argc, argv, ":hf:O:B:Cco:l:S:pt:T:qnm:WU", @@ -2209,6 +2210,14 @@ static int img_convert(int argc, char **argv) case OPTION_TARGET_IMAGE_OPTS: tgt_image_opts = true; break; +case OPTION_TARGET_IS_ZERO: +/* + * The user asserting that the target is blank has the + * same effect as the target driver supporting zero + * initialisation. 
+ */ +s.has_zero_init = true; +break; } } @@ -2247,6 +2256,11 @@ static int img_convert(int argc, char **argv) warn_report("This will become an error in future QEMU versions."); } +if (s.has_zero_init && !skip_create) { +error_report("--target-is-zero requires use of -n flag"); +goto fail_getopt; +} + s.src_num = argc - optind - 1; out_filename = s.src_num >= 1 ? argv[argc - 1] : NULL; @@ -2380,6 +2394,11 @@ static int img_convert(int argc, char **argv) } s.target_has_backing = (bool) out_baseimg; +if (s.has_zero_init && s.target_has_backing) { +error_report("Cannot use --target-is-zero with a backing file"); +goto out; +} + if (s.src_num > 1 && out_baseimg) { error_report("Having a backing file for the target makes no sense when "
[PATCH v2 2/2] doc: Use @code rather than @var for options
Texinfo defines @var for metasyntactic variables and such terms are shown in upper-case or italics in the output of makeinfo. When considering an option to a command, such as "-n", upper-casing is undesirable as it may confuse the reader or be in conflict with the equivalent upper-case option. Replace the use of @var for options with @code to avoid this. Signed-off-by: David Edmondson --- qemu-img.texi | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/qemu-img.texi b/qemu-img.texi index 3b6dfd8682..6b4a1ac961 100644 --- a/qemu-img.texi +++ b/qemu-img.texi @@ -74,13 +74,13 @@ keys. @item --image-opts Indicates that the source @var{filename} parameter is to be interpreted as a full option string, not a plain filename. This parameter is mutually -exclusive with the @var{-f} parameter. +exclusive with the @code{-f} parameter. @item --target-image-opts Indicates that the @var{output_filename} parameter(s) are to be interpreted as a full option string, not a plain filename. This parameter is mutually -exclusive with the @var{-O} parameters. It is currently required to also use -the @var{-n} parameter to skip image creation. This restriction may be relaxed +exclusive with the @code{-O} parameters. It is currently required to also use +the @code{-n} parameter to skip image creation. This restriction may be relaxed in a future release. @item --force-share (-U) @@ -103,13 +103,13 @@ with or without a command shows help and lists the supported formats @item -p display progress bar (compare, convert and rebase commands only). -If the @var{-p} option is not used for a command that supports it, the +If the @code{-p} option is not used for a command that supports it, the progress is reported when the process receives a @code{SIGUSR1} or @code{SIGINFO} signal. @item -q Quiet mode - do not print any output (except errors). There's no progress bar -in case both @var{-q} and @var{-p} options are used. +in case both @code{-q} and @code{-p} options are used. @item -S @var{size} indicates the consecutive number of bytes that must contain only zeros @@ -298,14 +298,14 @@ the top image stays valid). Check if two images have the same content. You can compare images with different format or settings. -The format is probed unless you specify it by @var{-f} (used for -@var{filename1}) and/or @var{-F} (used for @var{filename2}) option. +The format is probed unless you specify it by @code{-f} (used for +@var{filename1}) and/or @code{-F} (used for @var{filename2}) option. By default, images with different size are considered identical if the larger image contains only unallocated and/or zeroed sectors in the area after the end of the other image. In addition, if any sector is not allocated in one image and contains only zero bytes in the second one, it is evaluated as equal. You -can use Strict mode by specifying the @var{-s} option. When compare runs in +can use Strict mode by specifying the @code{-s} option. When compare runs in Strict mode, it fails in case image size differs or a sector is allocated in one image and is not allocated in the second one. -- 2.24.1
Re: Ways to deal with broken machine types
On Tuesday, 2021-03-23 at 15:40:24 -04, Michael S. Tsirkin wrote: > On Tue, Mar 23, 2021 at 05:40:36PM +0000, Daniel P. Berrangé wrote: >> On Tue, Mar 23, 2021 at 05:54:47PM +0100, Igor Mammedov wrote: >> > Let me hijack this thread for beyond this case scope. >> > >> > I agree that for this particular bug we've done all we could, but >> > there is broader issue to discuss here. >> > >> > We have machine versions to deal with hw compatibility issues and that >> > covers most of the cases, >> > but occasionally we notice problem well after release(s), >> > so users may be stuck with broken VM and need to manually fix >> > configuration (and/or VM). >> > Figuring out what's wrong and how to fix it is far from trivial. So lets >> > discuss if we >> > can help to ease this pain, yes it will be late for first victims but it's >> > still >> > better than never. >> >> To summarize the problem situation >> >> - We rely on a machine type version to encode a precise guest ABI. >> - Due to a bug, we are in a situation where the same machine type >>encodes two distinct guest ABIs due to a mistake introduced >>between QEMU N-2 and N-1 >> - We want to fix the bug in QEMU N >> - For incoming migration there is no way to distinguish between >>the ABIs used in N-2 and N-1, to pick the right one > > > Not just incoming migration. Same applies to a guest restart. > > >> So we're left with an unwinnable problem: >> >> - Not fixing the bug => >> >>a) user migrating N-2 to N-1 have ABI change >>b) user migrating N-2 to N have ABI change >>c) user migrating N-1 to N are fine >> >> No mitigation for (a) or (b) >> >> - Fixing the bug => >> >>a) user migrating N-2 to N-1 have ABI change. >>b) user migrating N-2 to N are fine >>c) user migrating N-1 to N have ABI change >> >> Bad situations (a) and (c) are mitigated by >> backporting fix to N-1-stable too. >> >> Generally we have preferred to fix the bug, because we have >> usually identified them fairly quickly after release, and >> backporting the fix to stable has been sufficient mitigation >> against ill effects. Basically the people left broken are a >> relatively small set out of the total userbase. >> >> The real challenge arises when we are slow to identify the >> problem, such that we have a large number of people impacted. >> >> >> > I'll try to sum up idea Michael suggested (here comes my unorganized >> > brain-dump), >> > >> > 1. We can keep in VM's config QEMU version it was created on >> >and as minimum warn user with a pointer to known issues if version in >> >config mismatches version of actually used QEMU, with a knob to silence >> >it for particular mismatch. >> > >> > When an issue becomes known and resolved we know for sure how and what >> > changed and embed instructions on what options to use for fixing up VM's >> > config to preserve old HW config depending on QEMU version VM was >> > installed on. >> >> > some more ideas: >> >2. let mgmt layer to keep fixup list and apply them to config if >> > available >> >(user would need to upgrade mgmt or update fixup list somehow) >> >3. let mgmt layer to pass VM's QEMU version to currently used QEMU, so >> > that QEMU could maintain and apply fixups based on QEMU version + >> > machine type. >> > The user will have to upgrade to newer QEMU to get/use new fixups. >> >> The nice thing about machine type versioning is that we are treating the >> versions as opaque strings which represent a specific ABI, regardless of >> the QEMU version. 
This means that even if distros backport fixes >> or even new features, the machine type compatibility check remains a >> simple equality comparison. >> >> As soon as you introduce the QEMU version though, we have created a >> large matrix for compatibility. > > > Yes but. If we explicitly handle them all the same then > mechanically testing them all is an overkill. > We just need to test the ones that have bugs which we > care about fixing. > > >> This matrix is expanded if a distro >> chooses to backport fixes for any of the machine type bugs to their >> stable streams. This can get particularly expensive when there are >> multiple streams a distro is maintaining. >> >> *IF* the original N-1 qemu has a property that could be queried by >> the mgmt app to identify a machine type bug, then we could potentially >> apply a fixup automatically. >> >> eg query-machines command in QEMU version N could report against >> "pc-i440fx-5.0", that there was a regression fix that has to be >> applied if property "foo" had value "bar". >> >> Now, the mgmt app wants to migrate from QEMU N-2 or N-1 to QEMU N. >> It can query the value of "foo" on the source QEMU with qom-get. >> It now knows whether it has to override this property "foo" when >> spawning QEMU N on the target host. >> >> Of course this doesn't help us if neither N-1 or N-2 QEMU had a
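To make the qom-get/override idea above concrete: once the mgmt app knows that property "foo" was "bar" on the source, it can pin that value when spawning the destination QEMU with -global (the device and property names here are placeholders carried over from the example):

    qemu-system-x86_64 -machine pc-i440fx-5.0 -global some-device.foo=bar ...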
Re: [PATCH] i386: Make 'hv-reenlightenment' require explicit 'tsc-frequency' setting
On Tuesday, 2021-03-30 at 14:36:03 +02, Vitaly Kuznetsov wrote: > Commit 561dbb41b1d7 "i386: Make migration fail when Hyper-V reenlightenment > was enabled but 'user_tsc_khz' is unset" forbade migrations with when guest > has opted for reenlightenment notifications but 'tsc-frequency' wasn't set > explicitly on the command line. This works but the migration fail late and s/fail/fails/ > this may come as an unpleasant surprise. To make things more explicit, > require 'tsc-frequency=' on the command line when 'hv-reenlightenment' was > enabled. Make the change affect 6.0+ machine types only to preserve > previously-valid configurations valid. s/ valid// > > Signed-off-by: Vitaly Kuznetsov > --- > docs/hyperv.txt | 1 + > hw/i386/pc.c | 1 + > target/i386/cpu.c | 23 +-- > target/i386/cpu.h | 1 + > 4 files changed, 24 insertions(+), 2 deletions(-) > > diff --git a/docs/hyperv.txt b/docs/hyperv.txt > index e53c581f4586..5b02d341ab25 100644 > --- a/docs/hyperv.txt > +++ b/docs/hyperv.txt > @@ -165,6 +165,7 @@ emulate TSC accesses after migration so 'tsc-frequency=' > CPU option also has to > be specified to make migration succeed. The destination host has to either > have > the same TSC frequency or support TSC scaling CPU feature. > > +Requires: tsc-frequency > Recommended: hv-frequencies > > 3.16. hv-evmcs > diff --git a/hw/i386/pc.c b/hw/i386/pc.c > index 8a84b25a031e..47b79e949ad7 100644 > --- a/hw/i386/pc.c > +++ b/hw/i386/pc.c > @@ -98,6 +98,7 @@ > > GlobalProperty pc_compat_5_2[] = { > { "ICH9-LPC", "x-smi-cpu-hotunplug", "off" }, > +{ TYPE_X86_CPU, "x-hv-reenlightenment-requires-tscfreq", "off"}, > }; > const size_t pc_compat_5_2_len = G_N_ELEMENTS(pc_compat_5_2); > > diff --git a/target/i386/cpu.c b/target/i386/cpu.c > index 6b3e9467f177..751636bafac5 100644 > --- a/target/i386/cpu.c > +++ b/target/i386/cpu.c > @@ -6647,10 +6647,23 @@ static void x86_cpu_filter_features(X86CPU *cpu, bool > verbose) > } > } > > -static void x86_cpu_hyperv_realize(X86CPU *cpu) > +static void x86_cpu_hyperv_realize(X86CPU *cpu, Error **errp) > { > +CPUX86State *env = &cpu->env; > size_t len; > > +/* > + * Reenlightenment requires explicit 'tsc-frequency' setting for > successful > + * migration (see hyperv_reenlightenment_post_load(). As 'hv-passthrough' s/post_load()/post_load())/ > + * mode is not migratable, we can loosen the restriction. 
> + */ > +if (hyperv_feat_enabled(cpu, HYPERV_FEAT_REENLIGHTENMENT) && > +!cpu->hyperv_passthrough && !env->user_tsc_khz && > +cpu->hyperv_reenlightenment_requires_tscfreq) { > +error_setg(errp, "'hv-reenlightenment' requires 'tsc-frequency=' to > be set"); s/=// > +return; > +} > + > /* Hyper-V vendor id */ > if (!cpu->hyperv_vendor) { > memcpy(cpu->hyperv_vendor_id, "Microsoft Hv", 12); > @@ -6846,7 +6859,11 @@ static void x86_cpu_realizefn(DeviceState *dev, Error > **errp) > } > > /* Process Hyper-V enlightenments */ > -x86_cpu_hyperv_realize(cpu); > +x86_cpu_hyperv_realize(cpu, &local_err); > +if (local_err != NULL) { > +error_propagate(errp, local_err); > +return; > +} > > cpu_exec_realizefn(cs, &local_err); > if (local_err != NULL) { > @@ -7374,6 +7391,8 @@ static Property x86_cpu_properties[] = { > DEFINE_PROP_INT32("x-hv-max-vps", X86CPU, hv_max_vps, -1), > DEFINE_PROP_BOOL("x-hv-synic-kvm-only", X86CPU, hyperv_synic_kvm_only, > false), > +DEFINE_PROP_BOOL("x-hv-reenlightenment-requires-tscfreq", X86CPU, > + hyperv_reenlightenment_requires_tscfreq, true), > DEFINE_PROP_BOOL("x-intel-pt-auto-level", X86CPU, intel_pt_auto_level, > true), > DEFINE_PROP_END_OF_LIST() > diff --git a/target/i386/cpu.h b/target/i386/cpu.h > index 570f916878f9..0196a300f018 100644 > --- a/target/i386/cpu.h > +++ b/target/i386/cpu.h > @@ -1677,6 +1677,7 @@ struct X86CPU { > uint32_t hyperv_spinlock_attempts; > char *hyperv_vendor; > bool hyperv_synic_kvm_only; > +bool hyperv_reenlightenment_requires_tscfreq; > uint64_t hyperv_features; > bool hyperv_passthrough; > OnOffAuto hyperv_no_nonarch_cs; > -- > 2.30.2 dme. -- No proper time of day.
Re: [PATCH] docs: Add a QEMU Code of Conduct and Conflict Resolution Policy document
On Wednesday, 2021-03-31 at 17:05:27 +02, Paolo Bonzini wrote: > In an ideal world, we would all get along together very well, always be > polite and never end up in huge conflicts. And even if there are conflicts, > we would always handle each other fair and respectfully. Unfortunately, > this is not an ideal world and sometimes people forget how to interact with > each other in a professional and respectful way. Fortunately, this seldom > happens in the QEMU community, but for such rare cases it is preferable > to have a basic code of conduct document available to show to people > who are misbehaving. In case that does not help yet, we should also have > a conflict resolution policy ready that can be applied in the worst case. > > The Code of Conduct document tries to be short and to the point while > trying to remain friendly and welcoming; it is based on the Fedora Code > of Conduct[1] with extra detail added based on the Contributor Covenant > 1.3.0[2]. Other proposals included the Contributor Covenant 1.3.0 itself > or the Django Code of Conduct[3] (which is also a derivative of Fedora's) > but, in any case, there was agreement on keeping the conflict resolution > policy separate from the CoC itself. > > An important point is whether to apply the code of conduct to violations > that occur outside public spaces. The text herein restricts that to > individuals acting as a representative or a member of the project or > its community. This is intermediate between the Contributor Covenant > (which only mentions representatives of the community, for example using > an official project e-mail address or posting via an official social media > account), and the Django Code of Conduct, which says that violations of > this code outside these spaces "may" be considered but does not limit > this further. > > The conflict resolution policy is based on the Drupal Conflict Resolution > Policy[4] and its derivative, the Mozilla Consequence Ladder[5]. > > [1] https://www.fedoraproject.com/code-of-conduct/ > [2] https://www.contributor-covenant.org/version/1/3/0/code-of-conduct/ > [3] https://www.djangoproject.com/conduct/ > [4] https://www.drupal.org/conflict-resolution > [5] > https://github.com/mozilla/diversity/blob/master/code-of-conduct-enforcement/consequence-ladder.md > > Co-developed-by: Thomas Huth > Signed-off-by: Paolo Bonzini > --- > docs/devel/code-of-conduct.rst | 60 ++ > docs/devel/conflict-resolution.rst | 80 ++ > docs/devel/index.rst | 2 + > 3 files changed, 142 insertions(+) > create mode 100644 docs/devel/code-of-conduct.rst > create mode 100644 docs/devel/conflict-resolution.rst > > diff --git a/docs/devel/code-of-conduct.rst b/docs/devel/code-of-conduct.rst > new file mode 100644 > index 0000000000..83e8855250 > --- /dev/null > +++ b/docs/devel/code-of-conduct.rst > @@ -0,0 +1,60 @@ > +Code of Conduct > +=== > + > +The QEMU community is made up of a mixture of professionals and > +volunteers from all over the world. Diversity is one of our strengths, > +but it can also lead to communication issues and unhappiness. > +To that end, we have a few ground rules that we ask people to adhere to. > + > +* Be welcoming. We are committed to making participation in this project > + a harassment-free experience for everyone, regardless of level of > + experience, gender, gender identity and expression, sexual orientation, > + disability, personal appearance, body size, race, ethnicity, age, religion, > + or nationality. > + > +* Be respectful. Not all of us will agree all the time. 
Disagreements, both > + social and technical, happen all the time and the QEMU community is no > + exception. When we disagree, we try to understand why. It is important > that > + we resolve disagreements and differing views constructively. Members of > the > + QEMU community should be respectful when dealing with other contributors as > + well as with people outside the QEMU community and with users of QEMU. > + > +Harassment and other exclusionary behavior are not acceptable. A community > +where people feel uncomfortable or threatened is neither welcoming nor > +respectful. Examples of unacceptable behavior by participants include: > + > +* The use of sexualized language or imagery > + > +* Personal attacks > + > +* Trolling or insulting/derogatory comments > + > +* Public or private harassment > + > +* Publishing other's private information, such as physical or electronic > +addresses, without explicit permission > + > +This isn't an exhaustive list of things that you can't do. Rather, take > +it in the spirit in which it's intended—a guide to make it easier to > +be excellent to each other. > + > +This code of conduct applies to all spaces managed by the QEMU project. > +This includes IRC, the mailing lists, the issue tracker, community > +events, and any other forums created by the project team which the > +community
Re: [PATCH 4/5] coroutine-lock: reimplement CoRwLock to fix downgrade bug
On Tuesday, 2021-03-16 at 17:00:06 +01, Paolo Bonzini wrote: > A feature of the current rwlock is that if multiple coroutines hold a > reader lock, all must be runnable. The unlock implementation relies on > this, choosing to wake a single coroutine when the final read lock > holder exits the critical section, assuming that it will wake a > coroutine attempting to acquire a write lock. > > The downgrade implementation violates this assumption by creating a > read lock owning coroutine that is exclusively runnable - any other > coroutines that are waiting to acquire a read lock are *not* made > runnable when the write lock holder converts its ownership to read > only. > > To fix this, keep the queue of waiters explicitly in the CoRwLock > instead of using CoQueue, and store for each whether it is a > potential reader or a writer. This way, downgrade can look at the > first queued coroutines and wake it if it is a reader, causing > all other readers to be released in turn. > > Reported-by: David Edmondson > Signed-off-by: Paolo Bonzini > --- > include/qemu/coroutine.h | 10 ++- > util/qemu-coroutine-lock.c | 150 - > 2 files changed, 106 insertions(+), 54 deletions(-) > > diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h > index 84eab6e3bf..ae62d4bc8d 100644 > --- a/include/qemu/coroutine.h > +++ b/include/qemu/coroutine.h > @@ -237,11 +237,15 @@ bool qemu_co_enter_next_impl(CoQueue *queue, > QemuLockable *lock); > bool qemu_co_queue_empty(CoQueue *queue); > > > +typedef struct CoRwTicket CoRwTicket; > typedef struct CoRwlock { > -int pending_writer; > -int reader; > CoMutex mutex; > -CoQueue queue; > + > +/* Number of readers, of -1 if owned for writing. */ s/, of/, or/ > +int owner; > + > +/* Waiting coroutines. */ > +QSIMPLEQ_HEAD(, CoRwTicket) tickets; > } CoRwlock; Isn't this... * ... Also, @qemu_co_rwlock_upgrade * only overrides CoRwlock fairness if there are no concurrent readers, so * another writer might run while @qemu_co_rwlock_upgrade blocks. ...now incorrect? > /** > diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c > index eb73cf11dc..655634d185 100644 > --- a/util/qemu-coroutine-lock.c > +++ b/util/qemu-coroutine-lock.c > @@ -327,11 +327,70 @@ void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex) > trace_qemu_co_mutex_unlock_return(mutex, self); > } > > +struct CoRwTicket { > +bool read; > +Coroutine *co; > +QSIMPLEQ_ENTRY(CoRwTicket) next; > +}; > + > void qemu_co_rwlock_init(CoRwlock *lock) > { > -memset(lock, 0, sizeof(*lock)); > -qemu_co_queue_init(&lock->queue); > qemu_co_mutex_init(&lock->mutex); > +lock->owner = 0; > +QSIMPLEQ_INIT(&lock->tickets); > +} > + > +/* Releases the internal CoMutex. */ > +static void qemu_co_rwlock_maybe_wake_one(CoRwlock *lock) > +{ > +CoRwTicket *tkt = QSIMPLEQ_FIRST(&lock->tickets); > +Coroutine *co = NULL; > + > +/* > + * Setting lock->owner here prevents rdlock and wrlock from > + * sneaking in between unlock and wake. > + */ > + > +if (tkt) { > +if (tkt->read) { > +if (lock->owner >= 0) { > +lock->owner++; > +co = tkt->co; > +} > +} else { > +if (lock->owner == 0) { > +lock->owner = -1; > +co = tkt->co; > +} > +} > +} > + > +if (co) { > +QSIMPLEQ_REMOVE_HEAD(&lock->tickets, next); > +qemu_co_mutex_unlock(&lock->mutex); > +aio_co_wake(co); > +} else { > +qemu_co_mutex_unlock(&lock->mutex); > +} > +} > + > +/* Releases the internal CoMutex. 
*/ > +static void qemu_co_rwlock_sleep(bool read, CoRwlock *lock) > +{ > +CoRwTicket my_ticket = { read, qemu_coroutine_self() }; > + > +QSIMPLEQ_INSERT_TAIL(&lock->tickets, &my_ticket, next); > +qemu_co_mutex_unlock(&lock->mutex); > +qemu_coroutine_yield(); > + > +if (read) { > +/* Possibly wake another reader, which will wake the next in line. > */ > +assert(lock->owner >= 1); > +qemu_co_mutex_lock(&lock->mutex); > +qemu_co_rwlock_maybe_wake_one(lock); > +} else { > +assert(lock->owner == -1); > +} > } > > void qemu_co_rwlock_rdlock(CoRwlock *lock) > @@ -339,13 +398,13 @@ void qemu_co_rwlock_rdlock(CoRwlock *lock) > > qemu_co_mutex_lock(&lock->mutex);
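To illustrate the downgrade problem that the commit message above describes, consider two coroutines (a sketch for exposition, not code from the patch; the coroutine bodies are hypothetical):

    static CoRwlock lock;

    static void coroutine_fn reader(void *opaque)
    {
        /* Queued behind the writer; with the old implementation this
         * coroutine was not woken when the writer merely downgraded. */
        qemu_co_rwlock_rdlock(&lock);
        qemu_co_rwlock_unlock(&lock);
    }

    static void coroutine_fn writer(void *opaque)
    {
        qemu_co_rwlock_wrlock(&lock);
        qemu_coroutine_yield();            /* let the reader queue up */
        qemu_co_rwlock_downgrade(&lock);   /* old code woke nobody here */
        qemu_co_rwlock_unlock(&lock);
    }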
Re: [PATCH v4 1/2] migration/ram: Reduce unnecessary rate limiting
On Wednesday, 2021-03-17 at 09:37:11 +08, Kunkun Jiang wrote: > Hi Peter, > > On 2021/3/17 5:39, Peter Xu wrote: >> On Tue, Mar 16, 2021 at 08:57:15PM +0800, Kunkun Jiang wrote: >>> When the host page is a huge page and something is sent in the >>> current iteration, migration_rate_limit() should be executed. >>> If not, it can be omitted. >>> >>> Signed-off-by: Keqian Zhu >>> Signed-off-by: Kunkun Jiang >>> Reviewed-by: David Edmondson >>> --- >>> migration/ram.c | 9 +++-- >>> 1 file changed, 7 insertions(+), 2 deletions(-) >>> >>> diff --git a/migration/ram.c b/migration/ram.c >>> index 72143da0ac..3eb5b0d7a7 100644 >>> --- a/migration/ram.c >>> +++ b/migration/ram.c >>> @@ -2015,8 +2015,13 @@ static int ram_save_host_page(RAMState *rs, >>> PageSearchStatus *pss, >>> >>> pages += tmppages; >>> pss->page++; >>> -/* Allow rate limiting to happen in the middle of huge pages */ >>> -migration_rate_limit(); >>> +/* >>> + * Allow rate limiting to happen in the middle of huge pages if >>> + * something is sent in the current iteration. >>> + */ >>> +if (pagesize_bits > 1 && tmppages > 0) { >>> +migration_rate_limit(); >>> +} >> Sorry I'm still not a fan of this - I'd even prefer calling that once more >> just >> to make sure it won't be forgotten to be called.. Not to say it's merely a >> noop. >> >> I'll leave this to Dave.. Maybe I'm too harsh! :) >> > You are very serious and meticulous. I like your character very much.😉 > This patch was previously reviewed by David. So, I want to know what > his opinion is. > > @David > Hi David, what is your opinion on this patch? I suspect that this referred to David Gilbert rather than me :-) > Thanks, > Kunkun Jiang dme. -- Time is waiting to explain, why refuse?
Re: [PATCH 4/6] coroutine-lock: reimplement CoRwLock to fix downgrade bug
On Wednesday, 2021-03-17 at 13:16:39 +01, Paolo Bonzini wrote: > An invariant of the current rwlock is that if multiple coroutines hold a > reader lock, all must be runnable. The unlock implementation relies on > this, choosing to wake a single coroutine when the final read lock > holder exits the critical section, assuming that it will wake a > coroutine attempting to acquire a write lock. > > The downgrade implementation violates this assumption by creating a > read lock owning coroutine that is exclusively runnable - any other > coroutines that are waiting to acquire a read lock are *not* made > runnable when the write lock holder converts its ownership to read > only. > > More in general, the old implementation had lots of other fairness bugs. > The root cause of the bugs was that CoQueue would wake up readers even > if there were pending writers, and would wake up writers even if there > were readers. In that case, the coroutine would go back to sleep *at > the end* of the CoQueue, losing its place at the head of the line. > > To fix this, keep the queue of waiters explicitly in the CoRwLock > instead of using CoQueue, and store for each whether it is a > potential reader or a writer. This way, downgrade can look at the > first queued coroutines and wake it only if it is a reader, causing > all other readers in line to be released in turn. > > Reported-by: David Edmondson > Signed-off-by: Paolo Bonzini A couple of nits below, but it looks good overall. Reviewed-by: David Edmondson > --- > v3->v4: clean up the code and fix upgrade logic. Fix upgrade comment too. > > include/qemu/coroutine.h | 17 +++-- > util/qemu-coroutine-lock.c | 148 - > 2 files changed, 106 insertions(+), 59 deletions(-) > > diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h > index 84eab6e3bf..7919d3bb62 100644 > --- a/include/qemu/coroutine.h > +++ b/include/qemu/coroutine.h > @@ -237,11 +237,15 @@ bool qemu_co_enter_next_impl(CoQueue *queue, > QemuLockable *lock); > bool qemu_co_queue_empty(CoQueue *queue); > > > +typedef struct CoRwTicket CoRwTicket; > typedef struct CoRwlock { > -int pending_writer; > -int reader; > CoMutex mutex; > -CoQueue queue; > + > +/* Number of readers, of -1 if owned for writing. */ > +int owners; s/, of/, or/ > + > +/* Waiting coroutines. */ > +QSIMPLEQ_HEAD(, CoRwTicket) tickets; > } CoRwlock; > > /** > @@ -260,10 +264,9 @@ void qemu_co_rwlock_rdlock(CoRwlock *lock); > /** > * Write Locks the CoRwlock from a reader. This is a bit more efficient than > * @qemu_co_rwlock_unlock followed by a separate @qemu_co_rwlock_wrlock. > - * However, if the lock cannot be upgraded immediately, control is > transferred > - * to the caller of the current coroutine. Also, @qemu_co_rwlock_upgrade > - * only overrides CoRwlock fairness if there are no concurrent readers, so > - * another writer might run while @qemu_co_rwlock_upgrade blocks. > + * Note that if the lock cannot be upgraded immediately, control is > transferred > + * to the caller of the current coroutine; another writer might run while > + * @qemu_co_rwlock_upgrade blocks. This is better, thanks. 
> */ > void qemu_co_rwlock_upgrade(CoRwlock *lock); > > diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c > index eb73cf11dc..2669403839 100644 > --- a/util/qemu-coroutine-lock.c > +++ b/util/qemu-coroutine-lock.c > @@ -327,11 +327,51 @@ void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex) > trace_qemu_co_mutex_unlock_return(mutex, self); > } > > +struct CoRwTicket { > +bool read; > +Coroutine *co; > +QSIMPLEQ_ENTRY(CoRwTicket) next; > +}; > + > void qemu_co_rwlock_init(CoRwlock *lock) > { > -memset(lock, 0, sizeof(*lock)); > -qemu_co_queue_init(&lock->queue); > qemu_co_mutex_init(&lock->mutex); > +lock->owners = 0; > +QSIMPLEQ_INIT(&lock->tickets); > +} > + > +/* Releases the internal CoMutex. */ > +static void qemu_co_rwlock_maybe_wake_one(CoRwlock *lock) > +{ > +CoRwTicket *tkt = QSIMPLEQ_FIRST(&lock->tickets); > +Coroutine *co = NULL; > + > +/* > + * Setting lock->owners here prevents rdlock and wrlock from > + * sneaking in between unlock and wake. > + */ > + > +if (tkt) { > +if (tkt->read) { > +if (lock->owners >= 0) { > +lock->owners++; > +co = tkt->co; > +} > +} else { > +if (lock->owners == 0) { > +lock->owners = -1; > +
Re: [PATCH 5/6] test-coroutine: add rwlock upgrade test
On Wednesday, 2021-03-17 at 13:16:40 +01, Paolo Bonzini wrote: > Test that rwlock upgrade is fair, and readers go back to sleep if a writer > is in line. > > Signed-off-by: Paolo Bonzini Reviewed-by: David Edmondson > --- > tests/unit/test-coroutine.c | 62 + > 1 file changed, 62 insertions(+) > > diff --git a/tests/unit/test-coroutine.c b/tests/unit/test-coroutine.c > index e946d93a65..6e6f51d480 100644 > --- a/tests/unit/test-coroutine.c > +++ b/tests/unit/test-coroutine.c > @@ -264,6 +264,67 @@ static void test_co_mutex_lockable(void) > g_assert(QEMU_MAKE_LOCKABLE(null_pointer) == NULL); > } > > +static CoRwlock rwlock; > + > +/* Test that readers are properly sent back to the queue when upgrading, > + * even if they are the sole readers. The test scenario is as follows: > + * > + * > + * | c1         | c2         | > + * |------------+------------| > + * | rdlock     |            | > + * | yield      |            | > + * |            | wrlock     | > + * |            | <queued>   | > + * | upgrade    |            | > + * | <queued>   | <dequeued> | > + * |            | unlock     | > + * | <dequeued> |            | > + * | unlock     |            | > + */ > + > +static void coroutine_fn rwlock_yield_upgrade(void *opaque) > +{ > +qemu_co_rwlock_rdlock(&rwlock); > +qemu_coroutine_yield(); > + > +qemu_co_rwlock_upgrade(&rwlock); > +qemu_co_rwlock_unlock(&rwlock); > + > +*(bool *)opaque = true; > +} > + > +static void coroutine_fn rwlock_wrlock_yield(void *opaque) > +{ > +qemu_co_rwlock_wrlock(&rwlock); > +qemu_coroutine_yield(); > + > +qemu_co_rwlock_unlock(&rwlock); > +*(bool *)opaque = true; > +} > + > +static void test_co_rwlock_upgrade(void) > +{ > +bool c1_done = false; > +bool c2_done = false; > +Coroutine *c1, *c2; > + > +qemu_co_rwlock_init(&rwlock); > +c1 = qemu_coroutine_create(rwlock_yield_upgrade, &c1_done); > +c2 = qemu_coroutine_create(rwlock_wrlock_yield, &c2_done); > + > +qemu_coroutine_enter(c1); > +qemu_coroutine_enter(c2); > + > +/* c1 now should go to sleep. */ > +qemu_coroutine_enter(c1); > +g_assert(!c1_done); > + > +qemu_coroutine_enter(c2); > +g_assert(c1_done); > +g_assert(c2_done); > +} > + > /* > * Check that creation, enter, and return work > */ > @@ -501,6 +562,7 @@ int main(int argc, char **argv) > g_test_add_func("/basic/order", test_order); > g_test_add_func("/locking/co-mutex", test_co_mutex); > g_test_add_func("/locking/co-mutex/lockable", test_co_mutex_lockable); > +g_test_add_func("/locking/co-rwlock/upgrade", test_co_rwlock_upgrade); > if (g_test_perf()) { > g_test_add_func("/perf/lifecycle", perf_lifecycle); > g_test_add_func("/perf/nesting", perf_nesting); > -- > 2.29.2 dme. -- Oh by the way, which one's Pink?
Re: [PATCH 4/6] coroutine-lock: reimplement CoRwLock to fix downgrade bug
On Wednesday, 2021-03-17 at 18:19:58 +01, Paolo Bonzini wrote: > On 17/03/21 16:17, David Edmondson wrote: >>> +if (tkt) { >>> +if (tkt->read) { >>> +if (lock->owners >= 0) { >>> +lock->owners++; >>> +co = tkt->co; >>> +} >>> +} else { >>> +if (lock->owners == 0) { >>> +lock->owners = -1; >>> +co = tkt->co; >>> +} >>> +} >>> +} >>> + >>> +if (co) { >>> +QSIMPLEQ_REMOVE_HEAD(&lock->tickets, next); >>> +qemu_co_mutex_unlock(&lock->mutex); >>> +aio_co_wake(co); >>> +} else { >>> +qemu_co_mutex_unlock(&lock->mutex); >>> +} >> >> This block could be pushed up into the earlier block, but I imagine that >> the compiler will do it for you. > > I guess I could do > > if (!tkt || (tkt->read ? lock->owners < 0 : lock->owners != 0)) { > qemu_co_mutex_unlock(&lock->mutex); > return; > } > if (tkt->read) { > lock->owners++; > } else { > lock->owners = -1; > } > > co = tkt->co; > QSIMPLEQ_REMOVE_HEAD(&lock->tickets, next); > qemu_co_mutex_unlock(&lock->mutex); > aio_co_wake(co); > > but I find it less readable. Agreed. > So that leaves only the of/or typo, right? Yes. dme. -- In heaven there is no beer, that's why we drink it here.
Re: [PATCH 5/6] test-coroutine: add rwlock upgrade test
On Wednesday, 2021-03-17 at 19:00:12 +01, Paolo Bonzini wrote: > Test that rwlock upgrade is fair, and readers go back to sleep if a writer > is in line. > > Signed-off-by: Paolo Bonzini Reviewed-by: David Edmondson > --- > tests/unit/test-coroutine.c | 62 + > 1 file changed, 62 insertions(+) > > diff --git a/tests/unit/test-coroutine.c b/tests/unit/test-coroutine.c > index e946d93a65..6e6f51d480 100644 > --- a/tests/unit/test-coroutine.c > +++ b/tests/unit/test-coroutine.c > @@ -264,6 +264,67 @@ static void test_co_mutex_lockable(void) > g_assert(QEMU_MAKE_LOCKABLE(null_pointer) == NULL); > } > > +static CoRwlock rwlock; > + > +/* Test that readers are properly sent back to the queue when upgrading, > + * even if they are the sole readers. The test scenario is as follows: > + * > + * > + * | c1         | c2         | > + * |------------+------------| > + * | rdlock     |            | > + * | yield      |            | > + * |            | wrlock     | > + * |            | <queued>   | > + * | upgrade    |            | > + * | <queued>   | <dequeued> | > + * |            | unlock     | > + * | <dequeued> |            | > + * | unlock     |            | > + */ > + > +static void coroutine_fn rwlock_yield_upgrade(void *opaque) > +{ > +qemu_co_rwlock_rdlock(&rwlock); > +qemu_coroutine_yield(); > + > +qemu_co_rwlock_upgrade(&rwlock); > +qemu_co_rwlock_unlock(&rwlock); > + > +*(bool *)opaque = true; > +} > + > +static void coroutine_fn rwlock_wrlock_yield(void *opaque) > +{ > +qemu_co_rwlock_wrlock(&rwlock); > +qemu_coroutine_yield(); > + > +qemu_co_rwlock_unlock(&rwlock); > +*(bool *)opaque = true; > +} > + > +static void test_co_rwlock_upgrade(void) > +{ > +bool c1_done = false; > +bool c2_done = false; > +Coroutine *c1, *c2; > + > +qemu_co_rwlock_init(&rwlock); > +c1 = qemu_coroutine_create(rwlock_yield_upgrade, &c1_done); > +c2 = qemu_coroutine_create(rwlock_wrlock_yield, &c2_done); > + > +qemu_coroutine_enter(c1); > +qemu_coroutine_enter(c2); > + > +/* c1 now should go to sleep. */ > +qemu_coroutine_enter(c1); > +g_assert(!c1_done); > + > +qemu_coroutine_enter(c2); > +g_assert(c1_done); > +g_assert(c2_done); > +} > + > /* > * Check that creation, enter, and return work > */ > @@ -501,6 +562,7 @@ int main(int argc, char **argv) > g_test_add_func("/basic/order", test_order); > g_test_add_func("/locking/co-mutex", test_co_mutex); > g_test_add_func("/locking/co-mutex/lockable", test_co_mutex_lockable); > +g_test_add_func("/locking/co-rwlock/upgrade", test_co_rwlock_upgrade); > if (g_test_perf()) { > g_test_add_func("/perf/lifecycle", perf_lifecycle); > g_test_add_func("/perf/nesting", perf_nesting); > -- > 2.29.2 dme. -- Don't you know you're never going to get to France.
[RFC PATCH 1/7] target/i386: Declare constants for XSAVE offsets
Declare and use manifest constants for the XSAVE state component
offsets.

Signed-off-by: David Edmondson
---
 target/i386/cpu.h | 30 +++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index e6836393f7..1fb732f366 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1305,6 +1305,22 @@ typedef struct XSavePKRU {
     uint32_t padding;
 } XSavePKRU;
 
+#define XSAVE_FCW_FSW_OFFSET    0x000
+#define XSAVE_FTW_FOP_OFFSET    0x004
+#define XSAVE_CWD_RIP_OFFSET    0x008
+#define XSAVE_CWD_RDP_OFFSET    0x010
+#define XSAVE_MXCSR_OFFSET      0x018
+#define XSAVE_ST_SPACE_OFFSET   0x020
+#define XSAVE_XMM_SPACE_OFFSET  0x0a0
+#define XSAVE_XSTATE_BV_OFFSET  0x200
+#define XSAVE_AVX_OFFSET        0x240
+#define XSAVE_BNDREG_OFFSET     0x3c0
+#define XSAVE_BNDCSR_OFFSET     0x400
+#define XSAVE_OPMASK_OFFSET     0x440
+#define XSAVE_ZMM_HI256_OFFSET  0x480
+#define XSAVE_HI16_ZMM_OFFSET   0x680
+#define XSAVE_PKRU_OFFSET       0xa80
+
 typedef struct X86XSaveArea {
     X86LegacyXSaveArea legacy;
     X86XSaveHeader header;
@@ -1325,19 +1341,19 @@ typedef struct X86XSaveArea {
     XSavePKRU pkru_state;
 } X86XSaveArea;
 
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, avx_state) != 0x240);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, avx_state) != XSAVE_AVX_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveAVX) != 0x100);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, bndreg_state) != 0x3c0);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, bndreg_state) != XSAVE_BNDREG_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveBNDREG) != 0x40);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, bndcsr_state) != 0x400);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, bndcsr_state) != XSAVE_BNDCSR_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveBNDCSR) != 0x40);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, opmask_state) != 0x440);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, opmask_state) != XSAVE_OPMASK_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveOpmask) != 0x40);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, zmm_hi256_state) != 0x480);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, zmm_hi256_state) != XSAVE_ZMM_HI256_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveZMM_Hi256) != 0x200);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, hi16_zmm_state) != 0x680);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, hi16_zmm_state) != XSAVE_HI16_ZMM_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveHi16_ZMM) != 0x400);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, pkru_state) != 0xA80);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, pkru_state) != XSAVE_PKRU_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSavePKRU) != 0x8);
 
 typedef enum TPRAccess {
-- 
2.30.2
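To see the technique in isolation: the constants only pay off because
QEMU_BUILD_BUG_ON() pins each structure member to its architectural
offset at compile time. A minimal, self-contained C sketch of the same
pattern - the Demo* types and values are hypothetical stand-ins, not
the QEMU structures:

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical, trimmed-down layout used only for illustration. */
    #define DEMO_HEADER_OFFSET  0x200
    #define DEMO_AVX_OFFSET     0x240

    typedef struct DemoLegacyArea { uint8_t data[0x200]; } DemoLegacyArea;
    typedef struct DemoHeader { uint64_t xstate_bv; uint8_t rsvd[56]; } DemoHeader;
    typedef struct DemoAVX { uint8_t ymmh[16][16]; } DemoAVX;

    typedef struct DemoXSaveArea {
        DemoLegacyArea legacy;   /* 0x200 bytes */
        DemoHeader header;       /* 0x40 bytes, starts at 0x200 */
        DemoAVX avx_state;       /* starts at 0x240 */
    } DemoXSaveArea;

    /* The moral equivalent of QEMU_BUILD_BUG_ON(): the build fails if
     * the struct layout ever drifts from the declared offsets. */
    static_assert(offsetof(DemoXSaveArea, header) == DEMO_HEADER_OFFSET,
                  "header at its architectural offset");
    static_assert(offsetof(DemoXSaveArea, avx_state) == DEMO_AVX_OFFSET,
                  "AVX state at its architectural offset");

    int main(void) { return 0; }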
[RFC PATCH 7/7] target/i386: Manipulate only AMD XSAVE state on AMD
On AMD CPUs, ensure that only the relevant XSAVE state is saved and
loaded.

Signed-off-by: David Edmondson
---
 target/i386/tcg/fpu_helper.c | 12 ++++++++++--
 target/i386/xsave_helper.c   | 70 ++++++++++++++++++++++--------------
 2 files changed, 54 insertions(+), 28 deletions(-)

diff --git a/target/i386/tcg/fpu_helper.c b/target/i386/tcg/fpu_helper.c
index fba2de5b04..f1d4704b34 100644
--- a/target/i386/tcg/fpu_helper.c
+++ b/target/i386/tcg/fpu_helper.c
@@ -2643,7 +2643,11 @@ static void do_xsave(CPUX86State *env, target_ulong ptr, uint64_t rfbm,
         do_xsave_bndcsr(env, ptr + XO(intel.bndcsr_state), ra);
     }
     if (opt & XSTATE_PKRU_MASK) {
-        do_xsave_pkru(env, ptr + XO(intel.pkru_state), ra);
+        if (IS_AMD_CPU(env)) {
+            do_xsave_pkru(env, ptr + XO(amd.pkru_state), ra);
+        } else {
+            do_xsave_pkru(env, ptr + XO(intel.pkru_state), ra);
+        }
     }
 
     /* Update the XSTATE_BV field. */
@@ -2854,7 +2858,11 @@ void helper_xrstor(CPUX86State *env, target_ulong ptr, uint64_t rfbm)
     if (rfbm & XSTATE_PKRU_MASK) {
         uint64_t old_pkru = env->pkru;
         if (xstate_bv & XSTATE_PKRU_MASK) {
-            do_xrstor_pkru(env, ptr + XO(intel.pkru_state), ra);
+            if (IS_AMD_CPU(env)) {
+                do_xrstor_pkru(env, ptr + XO(amd.pkru_state), ra);
+            } else {
+                do_xrstor_pkru(env, ptr + XO(intel.pkru_state), ra);
+            }
         } else {
             env->pkru = 0;
         }
diff --git a/target/i386/xsave_helper.c b/target/i386/xsave_helper.c
index 97dbab85d1..6b4501cf29 100644
--- a/target/i386/xsave_helper.c
+++ b/target/i386/xsave_helper.c
@@ -10,6 +10,7 @@ void x86_cpu_xsave_all_areas(X86CPU *cpu, X86XSaveArea *buf)
 {
     CPUX86State *env = &cpu->env;
     X86XSaveArea *xsave = buf;
+    const bool is_amd = IS_AMD_CPU(env);
 
     uint16_t cwd, swd, twd;
     int i;
@@ -31,30 +32,38 @@ void x86_cpu_xsave_all_areas(X86CPU *cpu, X86XSaveArea *buf)
             sizeof env->fpregs);
     xsave->legacy.mxcsr = env->mxcsr;
     xsave->header.xstate_bv = env->xstate_bv;
-    memcpy(&xsave->intel.bndreg_state.bnd_regs, env->bnd_regs,
-            sizeof env->bnd_regs);
-    xsave->intel.bndcsr_state.bndcsr = env->bndcs_regs;
-    memcpy(&xsave->intel.opmask_state.opmask_regs, env->opmask_regs,
-            sizeof env->opmask_regs);
+    if (!is_amd) {
+        memcpy(&xsave->intel.bndreg_state.bnd_regs, env->bnd_regs,
+               sizeof env->bnd_regs);
+        xsave->intel.bndcsr_state.bndcsr = env->bndcs_regs;
+        memcpy(&xsave->intel.opmask_state.opmask_regs, env->opmask_regs,
+               sizeof env->opmask_regs);
+    }
 
     for (i = 0; i < CPU_NB_REGS; i++) {
         uint8_t *xmm = xsave->legacy.xmm_regs[i];
         uint8_t *ymmh = xsave->avx_state.ymmh[i];
-        uint8_t *zmmh = xsave->intel.zmm_hi256_state.zmm_hi256[i];
         stq_p(xmm,     env->xmm_regs[i].ZMM_Q(0));
         stq_p(xmm+8,   env->xmm_regs[i].ZMM_Q(1));
         stq_p(ymmh,    env->xmm_regs[i].ZMM_Q(2));
         stq_p(ymmh+8,  env->xmm_regs[i].ZMM_Q(3));
-        stq_p(zmmh,    env->xmm_regs[i].ZMM_Q(4));
-        stq_p(zmmh+8,  env->xmm_regs[i].ZMM_Q(5));
-        stq_p(zmmh+16, env->xmm_regs[i].ZMM_Q(6));
-        stq_p(zmmh+24, env->xmm_regs[i].ZMM_Q(7));
+        if (!is_amd) {
+            uint8_t *zmmh = xsave->intel.zmm_hi256_state.zmm_hi256[i];
+            stq_p(zmmh,    env->xmm_regs[i].ZMM_Q(4));
+            stq_p(zmmh+8,  env->xmm_regs[i].ZMM_Q(5));
+            stq_p(zmmh+16, env->xmm_regs[i].ZMM_Q(6));
+            stq_p(zmmh+24, env->xmm_regs[i].ZMM_Q(7));
+        }
     }
 
 #ifdef TARGET_X86_64
-    memcpy(&xsave->intel.hi16_zmm_state.hi16_zmm, &env->xmm_regs[16],
-            16 * sizeof env->xmm_regs[16]);
-    memcpy(&xsave->intel.pkru_state, &env->pkru, sizeof env->pkru);
+    if (is_amd) {
+        memcpy(&xsave->amd.pkru_state, &env->pkru, sizeof env->pkru);
+    } else {
+        memcpy(&xsave->intel.hi16_zmm_state.hi16_zmm, &env->xmm_regs[16],
+               16 * sizeof env->xmm_regs[16]);
+        memcpy(&xsave->intel.pkru_state, &env->pkru, sizeof env->pkru);
+    }
 #endif
 }
 
@@ -64,6 +73,7 @@ void x86_cpu_xrstor_all_areas(X86CPU *cpu, const X86XSaveArea *buf)
 {
     CPUX86State *env = &cpu->env;
     const X86XSaveArea *xsave = buf;
+    const bool is_amd = IS_AMD_CPU(env);
 
     int i;
     uint16_t cwd, swd, twd;
@@ -83,30 +93,38 @@ void x86_cpu_xrstor_all_areas(X86CPU *cpu, const X86XSaveArea *buf)
     memcpy(env->fpregs, &xsave->legacy.fpregs, sizeof env->fpregs);
     env->xstate_bv = xsave->header.xstate_bv;
-    memcpy(env->bnd_regs, &xsave->intel.bndreg_state.bnd_regs,
[RFC PATCH 2/7] target/i386: Use constants for XSAVE offsets
Where existing constants for XSAVE offsets exist, use them.

Signed-off-by: David Edmondson
---
 target/i386/kvm/kvm.c | 56 ++++++++++++++-----------------------------
 1 file changed, 18 insertions(+), 38 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index d972eb4705..aff0774fef 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2397,44 +2397,24 @@ static int kvm_put_fpu(X86CPU *cpu)
     return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_FPU, &fpu);
 }
 
-#define XSAVE_FCW_FSW     0
-#define XSAVE_FTW_FOP     1
-#define XSAVE_CWD_RIP     2
-#define XSAVE_CWD_RDP     4
-#define XSAVE_MXCSR       6
-#define XSAVE_ST_SPACE    8
-#define XSAVE_XMM_SPACE   40
-#define XSAVE_XSTATE_BV   128
-#define XSAVE_YMMH_SPACE  144
-#define XSAVE_BNDREGS     240
-#define XSAVE_BNDCSR      256
-#define XSAVE_OPMASK      272
-#define XSAVE_ZMM_Hi256   288
-#define XSAVE_Hi16_ZMM    416
-#define XSAVE_PKRU        672
-
-#define XSAVE_BYTE_OFFSET(word_offset) \
-    ((word_offset) * sizeof_field(struct kvm_xsave, region[0]))
-
-#define ASSERT_OFFSET(word_offset, field) \
-    QEMU_BUILD_BUG_ON(XSAVE_BYTE_OFFSET(word_offset) != \
-                      offsetof(X86XSaveArea, field))
-
-ASSERT_OFFSET(XSAVE_FCW_FSW, legacy.fcw);
-ASSERT_OFFSET(XSAVE_FTW_FOP, legacy.ftw);
-ASSERT_OFFSET(XSAVE_CWD_RIP, legacy.fpip);
-ASSERT_OFFSET(XSAVE_CWD_RDP, legacy.fpdp);
-ASSERT_OFFSET(XSAVE_MXCSR, legacy.mxcsr);
-ASSERT_OFFSET(XSAVE_ST_SPACE, legacy.fpregs);
-ASSERT_OFFSET(XSAVE_XMM_SPACE, legacy.xmm_regs);
-ASSERT_OFFSET(XSAVE_XSTATE_BV, header.xstate_bv);
-ASSERT_OFFSET(XSAVE_YMMH_SPACE, avx_state);
-ASSERT_OFFSET(XSAVE_BNDREGS, bndreg_state);
-ASSERT_OFFSET(XSAVE_BNDCSR, bndcsr_state);
-ASSERT_OFFSET(XSAVE_OPMASK, opmask_state);
-ASSERT_OFFSET(XSAVE_ZMM_Hi256, zmm_hi256_state);
-ASSERT_OFFSET(XSAVE_Hi16_ZMM, hi16_zmm_state);
-ASSERT_OFFSET(XSAVE_PKRU, pkru_state);
+#define ASSERT_OFFSET(offset, field) \
+    QEMU_BUILD_BUG_ON(offset != offsetof(X86XSaveArea, field))
+
+ASSERT_OFFSET(XSAVE_FCW_FSW_OFFSET, legacy.fcw);
+ASSERT_OFFSET(XSAVE_FTW_FOP_OFFSET, legacy.ftw);
+ASSERT_OFFSET(XSAVE_CWD_RIP_OFFSET, legacy.fpip);
+ASSERT_OFFSET(XSAVE_CWD_RDP_OFFSET, legacy.fpdp);
+ASSERT_OFFSET(XSAVE_MXCSR_OFFSET, legacy.mxcsr);
+ASSERT_OFFSET(XSAVE_ST_SPACE_OFFSET, legacy.fpregs);
+ASSERT_OFFSET(XSAVE_XMM_SPACE_OFFSET, legacy.xmm_regs);
+ASSERT_OFFSET(XSAVE_XSTATE_BV_OFFSET, header.xstate_bv);
+ASSERT_OFFSET(XSAVE_AVX_OFFSET, avx_state);
+ASSERT_OFFSET(XSAVE_BNDREG_OFFSET, bndreg_state);
+ASSERT_OFFSET(XSAVE_BNDCSR_OFFSET, bndcsr_state);
+ASSERT_OFFSET(XSAVE_OPMASK_OFFSET, opmask_state);
+ASSERT_OFFSET(XSAVE_ZMM_HI256_OFFSET, zmm_hi256_state);
+ASSERT_OFFSET(XSAVE_HI16_ZMM_OFFSET, hi16_zmm_state);
+ASSERT_OFFSET(XSAVE_PKRU_OFFSET, pkru_state);
 
 static int kvm_put_xsave(X86CPU *cpu)
 {
-- 
2.30.2
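A quick worked check of the conversion happening here: the old
constants were word indices into struct kvm_xsave's region[], an array
of 32-bit words, so each new byte-offset constant is the old word
offset multiplied by four. A throwaway sketch, assuming C11
static_assert and using only values quoted in these patches:

    #include <assert.h>
    #include <stdint.h>

    /* Old word offset * sizeof(__u32) == new byte offset. */
    static_assert(144 * sizeof(uint32_t) == 0x240,
                  "XSAVE_YMMH_SPACE -> XSAVE_AVX_OFFSET");
    static_assert(240 * sizeof(uint32_t) == 0x3c0,
                  "XSAVE_BNDREGS -> XSAVE_BNDREG_OFFSET");
    static_assert(672 * sizeof(uint32_t) == 0xa80,
                  "XSAVE_PKRU -> XSAVE_PKRU_OFFSET");

    int main(void) { return 0; }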
[RFC PATCH 0/7] Support protection keys in an AMD EPYC-Milan VM
AMD EPYC-Milan CPUs introduced support for protection keys, previously
available only with Intel CPUs.

AMD chose to place the XSAVE state component for the protection keys
at a different offset in the XSAVE state area than that chosen by
Intel.

To accommodate this, modify QEMU to behave appropriately on AMD
systems, allowing a VM to properly take advantage of the new feature.

Further, avoid manipulating XSAVE state components that are not
present on AMD systems.

The code in patch 6 that changes the CPUID 0x0d leaf is mostly dumped
somewhere that seemed to work - I'm not sure where it really belongs.

David Edmondson (7):
  target/i386: Declare constants for XSAVE offsets
  target/i386: Use constants for XSAVE offsets
  target/i386: Clarify the padding requirements of X86XSaveArea
  target/i386: Prepare for per-vendor X86XSaveArea layout
  target/i386: Introduce AMD X86XSaveArea sub-union
  target/i386: Adjust AMD XSAVE PKRU area offset in CPUID leaf 0xd
  target/i386: Manipulate only AMD XSAVE state on AMD

 target/i386/cpu.c            | 19 ++++++++-
 target/i386/cpu.h            | 80 ++++++++++++++++++------------
 target/i386/kvm/kvm.c        | 57 ++++++++-------------
 target/i386/tcg/fpu_helper.c | 20 +++++---
 target/i386/xsave_helper.c   | 70 +++++++++++++++-----------
 5 files changed, 152 insertions(+), 94 deletions(-)

-- 
2.30.2
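The heart of the series, reduced to a sketch: the PKRU component lives
at a vendor-dependent byte offset, so any code touching it in a raw
XSAVE image has to pick the offset at run time. The helper names below
are hypothetical illustrations, not QEMU code; only the two offset
values come from the patches themselves:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define XSAVE_INTEL_PKRU_OFFSET 0xa80   /* Intel SDM layout */
    #define XSAVE_AMD_PKRU_OFFSET   0x980   /* EPYC-Milan layout */

    static size_t pkru_offset(bool is_amd)
    {
        return is_amd ? XSAVE_AMD_PKRU_OFFSET : XSAVE_INTEL_PKRU_OFFSET;
    }

    /* Copy the 4-byte PKRU value out of a raw XSAVE image. */
    static uint32_t read_pkru(const uint8_t *xsave_image, bool is_amd)
    {
        uint32_t pkru;
        memcpy(&pkru, xsave_image + pkru_offset(is_amd), sizeof(pkru));
        return pkru;
    }

    int main(void)
    {
        uint8_t image[0x1000] = { 0 };
        uint32_t val = 0x55;

        /* Pretend an AMD host wrote PKRU into the image... */
        memcpy(image + pkru_offset(true), &val, sizeof(val));
        /* ...and read it back from the vendor-appropriate place. */
        printf("pkru = 0x%x\n", read_pkru(image, true));
        return 0;
    }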
[RFC PATCH 6/7] target/i386: Adjust AMD XSAVE PKRU area offset in CPUID leaf 0xd
AMD stores the pkru_state at a different offset from Intel, so update
the CPUID leaf that reports the offset.

Signed-off-by: David Edmondson
---
 target/i386/cpu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 4f481691b4..9340a477a3 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1397,7 +1397,7 @@ typedef struct ExtSaveArea {
     uint32_t offset, size;
 } ExtSaveArea;
 
-static const ExtSaveArea x86_ext_save_areas[] = {
+static ExtSaveArea x86_ext_save_areas[] = {
     [XSTATE_FP_BIT] = {
         /* x87 FP state component is always enabled if XSAVE is supported */
         .feature = FEAT_1_ECX, .bits = CPUID_EXT_XSAVE,
@@ -6088,6 +6088,11 @@ static void x86_cpu_filter_features(X86CPU *cpu, bool verbose)
             mark_unavailable_features(cpu, FEAT_7_0_EBX, CPUID_7_0_EBX_INTEL_PT, prefix);
         }
     }
+
+    if (IS_AMD_CPU(env)) {
+        x86_ext_save_areas[XSTATE_PKRU_BIT].offset =
+            offsetof(X86XSaveArea, amd.pkru_state);
+    }
 }
 
 static void x86_cpu_hyperv_realize(X86CPU *cpu)
-- 
2.30.2
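For reference, this is how a guest would observe the change: CPUID
leaf 0xd with sub-leaf 9 (the PKRU state component, XSTATE_PKRU_BIT)
reports the component's size in EAX and its byte offset within the
XSAVE area in EBX. A guest-side sketch, assuming GCC/clang's <cpuid.h>
and a CPU that exposes the leaf - on Intel the offset should read back
as 0xa80, on EPYC-Milan as 0x980:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 0xd, sub-leaf 9: PKRU state component. */
        if (!__get_cpuid_count(0xd, 9, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 0xd not supported\n");
            return 1;
        }
        printf("PKRU component: size 0x%x, offset 0x%x\n", eax, ebx);
        return 0;
    }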
[RFC PATCH 3/7] target/i386: Clarify the padding requirements of X86XSaveArea
Replace the hard-coded size of offsets or structure elements with
defined constants or sizeof().

Signed-off-by: David Edmondson
---
 target/i386/cpu.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 1fb732f366..0bb365bddf 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1329,7 +1329,13 @@ typedef struct X86XSaveArea {
 
     /* AVX State: */
     XSaveAVX avx_state;
-    uint8_t padding[960 - 576 - sizeof(XSaveAVX)];
+
+    /* Ensure that XSaveBNDREG is properly aligned. */
+    uint8_t padding[XSAVE_BNDREG_OFFSET
+                    - sizeof(X86LegacyXSaveArea)
+                    - sizeof(X86XSaveHeader)
+                    - sizeof(XSaveAVX)];
+
     /* MPX State: */
     XSaveBNDREG bndreg_state;
     XSaveBNDCSR bndcsr_state;
-- 
2.30.2
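The arithmetic behind the old magic numbers, for reference: 576 is
0x240 (the 512-byte legacy area plus the 64-byte XSAVE header, i.e.
where AVX state begins) and 960 is 0x3c0, the architectural BNDREG
offset. A throwaway sketch confirming the two spellings of the padding
size agree, given sizeof(XSaveAVX) == 0x100:

    #include <assert.h>

    static_assert(0x200 + 0x40 == 576, "legacy area + header == 0x240");
    static_assert(960 == 0x3c0, "BNDREG offset");
    /* Old: 960 - 576 - sizeof(XSaveAVX); new: offset-based.
     * Both come to 128 bytes of padding. */
    static_assert(960 - 576 - 0x100 == 0x80, "padding is 128 bytes");

    int main(void) { return 0; }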
[RFC PATCH 4/7] target/i386: Prepare for per-vendor X86XSaveArea layout
Move Intel-specific components of X86XSaveArea into a sub-union.

Signed-off-by: David Edmondson
---
 target/i386/cpu.c            | 12 +++---
 target/i386/cpu.h            | 55 +++++++++++++++----------
 target/i386/kvm/kvm.c        | 12 +++---
 target/i386/tcg/fpu_helper.c | 12 +++---
 target/i386/xsave_helper.c   | 24 +++++------
 5 files changed, 63 insertions(+), 52 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index c496bfa1c2..4f481691b4 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1418,27 +1418,27 @@ static const ExtSaveArea x86_ext_save_areas[] = {
               .size = sizeof(XSaveAVX) },
     [XSTATE_BNDREGS_BIT] =
           { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_MPX,
-            .offset = offsetof(X86XSaveArea, bndreg_state),
+            .offset = offsetof(X86XSaveArea, intel.bndreg_state),
             .size = sizeof(XSaveBNDREG) },
     [XSTATE_BNDCSR_BIT] =
           { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_MPX,
-            .offset = offsetof(X86XSaveArea, bndcsr_state),
+            .offset = offsetof(X86XSaveArea, intel.bndcsr_state),
             .size = sizeof(XSaveBNDCSR) },
     [XSTATE_OPMASK_BIT] =
           { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_AVX512F,
-            .offset = offsetof(X86XSaveArea, opmask_state),
+            .offset = offsetof(X86XSaveArea, intel.opmask_state),
             .size = sizeof(XSaveOpmask) },
     [XSTATE_ZMM_Hi256_BIT] =
           { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_AVX512F,
-            .offset = offsetof(X86XSaveArea, zmm_hi256_state),
+            .offset = offsetof(X86XSaveArea, intel.zmm_hi256_state),
             .size = sizeof(XSaveZMM_Hi256) },
     [XSTATE_Hi16_ZMM_BIT] =
           { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_AVX512F,
-            .offset = offsetof(X86XSaveArea, hi16_zmm_state),
+            .offset = offsetof(X86XSaveArea, intel.hi16_zmm_state),
             .size = sizeof(XSaveHi16_ZMM) },
     [XSTATE_PKRU_BIT] =
           { .feature = FEAT_7_0_ECX, .bits = CPUID_7_0_ECX_PKU,
-            .offset = offsetof(X86XSaveArea, pkru_state),
+            .offset = offsetof(X86XSaveArea, intel.pkru_state),
             .size = sizeof(XSavePKRU) },
 };
 
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 0bb365bddf..f1ce4e3008 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1330,36 +1330,47 @@ typedef struct X86XSaveArea {
 
     /* AVX State: */
     XSaveAVX avx_state;
 
-    /* Ensure that XSaveBNDREG is properly aligned. */
-    uint8_t padding[XSAVE_BNDREG_OFFSET
-                    - sizeof(X86LegacyXSaveArea)
-                    - sizeof(X86XSaveHeader)
-                    - sizeof(XSaveAVX)];
-
-    /* MPX State: */
-    XSaveBNDREG bndreg_state;
-    XSaveBNDCSR bndcsr_state;
-    /* AVX-512 State: */
-    XSaveOpmask opmask_state;
-    XSaveZMM_Hi256 zmm_hi256_state;
-    XSaveHi16_ZMM hi16_zmm_state;
-    /* PKRU State: */
-    XSavePKRU pkru_state;
+    union {
+        struct {
+            /* Ensure that XSaveBNDREG is properly aligned. */
+            uint8_t padding[XSAVE_BNDREG_OFFSET
+                            - sizeof(X86LegacyXSaveArea)
+                            - sizeof(X86XSaveHeader)
+                            - sizeof(XSaveAVX)];
+
+            /* MPX State: */
+            XSaveBNDREG bndreg_state;
+            XSaveBNDCSR bndcsr_state;
+            /* AVX-512 State: */
+            XSaveOpmask opmask_state;
+            XSaveZMM_Hi256 zmm_hi256_state;
+            XSaveHi16_ZMM hi16_zmm_state;
+            /* PKRU State: */
+            XSavePKRU pkru_state;
+        } intel;
+    };
 } X86XSaveArea;
 
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, avx_state) != XSAVE_AVX_OFFSET);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, avx_state)
+                  != XSAVE_AVX_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveAVX) != 0x100);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, bndreg_state) != XSAVE_BNDREG_OFFSET);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, intel.bndreg_state)
+                  != XSAVE_BNDREG_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveBNDREG) != 0x40);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, bndcsr_state) != XSAVE_BNDCSR_OFFSET);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, intel.bndcsr_state)
+                  != XSAVE_BNDCSR_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveBNDCSR) != 0x40);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, opmask_state) != XSAVE_OPMASK_OFFSET);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, intel.opmask_state)
+                  != XSAVE_OPMASK_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveOpmask) != 0x40);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, zmm_hi256_state) != XSAVE_ZMM_HI256_OFFSET);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, intel.zmm_hi256_state)
+                  != XSAVE_ZMM_HI256_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveZMM_Hi256) != 0x200);
-QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, hi16_zmm_state) != XSAVE_HI16_ZMM_OFFSET);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, intel.hi16_zmm_state)
+                  != XSAVE_HI16_ZMM_OFFSET);
[RFC PATCH 5/7] target/i386: Introduce AMD X86XSaveArea sub-union
AMD stores the pkru_state at a different offset from Intel.

Signed-off-by: David Edmondson
---
 target/i386/cpu.h     | 17 +++++++++++++++--
 target/i386/kvm/kvm.c |  3 ++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index f1ce4e3008..99f0d5d851 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1319,7 +1319,8 @@ typedef struct XSavePKRU {
 #define XSAVE_OPMASK_OFFSET     0x440
 #define XSAVE_ZMM_HI256_OFFSET  0x480
 #define XSAVE_HI16_ZMM_OFFSET   0x680
-#define XSAVE_PKRU_OFFSET       0xa80
+#define XSAVE_INTEL_PKRU_OFFSET 0xa80
+#define XSAVE_AMD_PKRU_OFFSET   0x980
 
 typedef struct X86XSaveArea {
     X86LegacyXSaveArea legacy;
@@ -1348,6 +1349,16 @@ typedef struct X86XSaveArea {
             /* PKRU State: */
             XSavePKRU pkru_state;
         } intel;
+        struct {
+            /* Ensure that XSavePKRU is properly aligned. */
+            uint8_t padding[XSAVE_AMD_PKRU_OFFSET
+                            - sizeof(X86LegacyXSaveArea)
+                            - sizeof(X86XSaveHeader)
+                            - sizeof(XSaveAVX)];
+
+            /* PKRU State: */
+            XSavePKRU pkru_state;
+        } amd;
     };
 } X86XSaveArea;
 
@@ -1370,7 +1381,9 @@ QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, intel.hi16_zmm_state)
                   != XSAVE_HI16_ZMM_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSaveHi16_ZMM) != 0x400);
 QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, intel.pkru_state)
-                  != XSAVE_PKRU_OFFSET);
+                  != XSAVE_INTEL_PKRU_OFFSET);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, amd.pkru_state)
+                  != XSAVE_AMD_PKRU_OFFSET);
 QEMU_BUILD_BUG_ON(sizeof(XSavePKRU) != 0x8);
 
 typedef enum TPRAccess {
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 417776a635..9dd7db060d 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2414,7 +2414,8 @@ ASSERT_OFFSET(XSAVE_BNDCSR_OFFSET, intel.bndcsr_state);
 ASSERT_OFFSET(XSAVE_OPMASK_OFFSET, intel.opmask_state);
 ASSERT_OFFSET(XSAVE_ZMM_HI256_OFFSET, intel.zmm_hi256_state);
 ASSERT_OFFSET(XSAVE_HI16_ZMM_OFFSET, intel.hi16_zmm_state);
-ASSERT_OFFSET(XSAVE_PKRU_OFFSET, intel.pkru_state);
+ASSERT_OFFSET(XSAVE_INTEL_PKRU_OFFSET, intel.pkru_state);
+ASSERT_OFFSET(XSAVE_AMD_PKRU_OFFSET, amd.pkru_state);
 
 static int kvm_put_xsave(X86CPU *cpu)
 {
-- 
2.30.2
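The sub-union trick in isolation: two vendor structs overlay the same
storage, each padded so that its pkru field lands at that vendor's
architectural offset, and the build fails if either drifts. A
stripped-down model of the idea - the Demo* names are hypothetical,
not the QEMU types, and 0x340 stands in for legacy + header + AVX:

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define INTEL_PKRU_OFFSET 0xa80
    #define AMD_PKRU_OFFSET   0x980
    #define COMMON_SIZE       0x340

    typedef struct DemoArea {
        uint8_t common[COMMON_SIZE];
        union {
            struct {
                uint8_t padding[INTEL_PKRU_OFFSET - COMMON_SIZE];
                uint32_t pkru;
            } intel;
            struct {
                uint8_t padding[AMD_PKRU_OFFSET - COMMON_SIZE];
                uint32_t pkru;
            } amd;
        };
    } DemoArea;

    /* Same field name, two different architectural offsets. */
    static_assert(offsetof(DemoArea, intel.pkru) == INTEL_PKRU_OFFSET,
                  "Intel PKRU offset");
    static_assert(offsetof(DemoArea, amd.pkru) == AMD_PKRU_OFFSET,
                  "AMD PKRU offset");

    int main(void) { return 0; }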
Re: [PATCH] hw/elf_ops: Fix a typo
On Thursday, 2021-02-25 at 19:13:44 +01, Philippe Mathieu-Daudé wrote:

> g_mapped_file_new_from_fd()'s parameter is named 'writable'.
>
> Signed-off-by: Philippe Mathieu-Daudé

Reviewed-by: David Edmondson

> ---
>  include/hw/elf_ops.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/hw/elf_ops.h b/include/hw/elf_ops.h
> index 8e8436831d2..304f266bf3b 100644
> --- a/include/hw/elf_ops.h
> +++ b/include/hw/elf_ops.h
> @@ -417,7 +417,7 @@ static int glue(load_elf, SZ)(const char *name, int fd,
>  
>      /*
>       * Since we want to be able to modify the mapped buffer, we set the
> -     * 'writeble' parameter to 'true'. Modifications to the buffer are not
> +     * 'writable' parameter to 'true'. Modifications to the buffer are not
>       * written back to the file.
>       */
>      mapped_file = g_mapped_file_new_from_fd(fd, true, NULL);
> -- 
> 2.26.2

dme.
-- 
I've always been mad, I know I've been mad, like the most of us are.
Re: [PATCH v2 1/3] migration/ram: Modify the code comment of ram_save_host_page()
On Monday, 2021-03-01 at 16:21:30 +08, Kunkun Jiang wrote:

> ram_save_host_page() has been modified several times since its birth,
> but the comment hasn't kept pace. Update the comment to explain
> ram_save_host_page() more clearly.

I don't think that it's reasonable for me to send Reviewed-by for
this, given that I suggested the text. Could someone else check the
sense and correctness?

> Signed-off-by: Keqian Zhu
> Signed-off-by: Kunkun Jiang
> ---
>  migration/ram.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 72143da0ac..24967cb970 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1970,15 +1970,13 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
>  }
>  
>  /**
> - * ram_save_host_page: save a whole host page
> + * ram_save_host_page: save a whole host page or the rest of a RAMBlock
>   *
> - * Starting at *offset send pages up to the end of the current host
> - * page. It's valid for the initial offset to point into the middle of
> - * a host page in which case the remainder of the hostpage is sent.
> - * Only dirty target pages are sent. Note that the host page size may
> - * be a huge page for this block.
> - * The saving stops at the boundary of the used_length of the block
> - * if the RAMBlock isn't a multiple of the host page size.
> + * Send dirty pages between pss->page and either the end of that page
> + * or the used_length of the RAMBlock, whichever is smaller.
> + *
> + * Note that if the host page is a huge page, pss->page may be in the
> + * middle of that page.
>   *
>   * Returns the number of pages written or negative on error
>   *
> -- 
> 2.23.0

dme.
-- 
Leaves are falling all around, it's time I was on my way.
Re: [PATCH v2 3/3] migration/ram: Optimize ram_save_host_page()
On Monday, 2021-03-01 at 16:21:32 +08, Kunkun Jiang wrote:

> Starting from pss->page, ram_save_host_page() will check every page
> and send the dirty pages up to the end of the current host page or
> the boundary of used_length of the block. If the host page size is
> a huge page, the step "check" will take a lot of time.
>
> Using migration_bitmap_find_dirty() to skip clean pages improves
> performance.

This is cleaner, thank you.

I was hoping to just invert the body of the loop - something like
(completely untested):

    do {
        int pages_this_iteration = 0;

        /* Check if the page is dirty and, if so, send it. */
        if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
            pages_this_iteration = ram_save_target_page(rs, pss, last_stage);
            if (pages_this_iteration < 0) {
                return pages_this_iteration;
            }

            pages += pages_this_iteration;

            /*
             * Allow rate limiting to happen in the middle of huge
             * pages if the current iteration sent something.
             */
            if (pagesize_bits > 1 && pages_this_iteration > 0) {
                migration_rate_limit();
            }
        }
        pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
    } while ((pss->page < hostpage_boundary) &&
             offset_in_ramblock(pss->block,
                                ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
    /* The offset we leave with is the min boundary of host page and block */
    pss->page = MIN(pss->page, hostpage_boundary) - 1;

> Signed-off-by: Keqian Zhu
> Signed-off-by: Kunkun Jiang
> ---
>  migration/ram.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 3a9115b6dc..a1374db356 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1991,6 +1991,8 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>      int tmppages, pages = 0;
>      size_t pagesize_bits =
>          qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
> +    unsigned long hostpage_boundary =
> +        QEMU_ALIGN_UP(pss->page + 1, pagesize_bits);
>      unsigned long start_page = pss->page;
>      int res;
>  
> @@ -2002,7 +2004,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>      do {
>          /* Check the pages is dirty and if it is send it */
>          if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
> -            pss->page++;
> +            pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
>              continue;
>          }
>  
> @@ -2012,16 +2014,16 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>          }
>  
>          pages += tmppages;
> -        pss->page++;
> +        pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
>          /* Allow rate limiting to happen in the middle of huge pages */
>          if (pagesize_bits > 1) {
>              migration_rate_limit();
>          }
> -    } while ((pss->page & (pagesize_bits - 1)) &&
> +    } while ((pss->page < hostpage_boundary) &&
>               offset_in_ramblock(pss->block,
>                                  ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
> -    /* The offset we leave with is the last one we looked at */
> -    pss->page--;
> +    /* The offset we leave with is the min boundary of host page and block */
> +    pss->page = MIN(pss->page, hostpage_boundary) - 1;
>  
>      res = ram_save_release_protection(rs, pss, start_page);
>      return (res < 0 ? res : pages);
> -- 
> 2.23.0

dme.
-- 
Don't you know you're never going to get to France.
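For background on why a helper like migration_bitmap_find_dirty() beats
stepping one page at a time: it can skip clean pages a word at a time
instead of testing a single bit per loop iteration. A toy sketch of
that style of scan - not the QEMU implementation; it assumes
start < nbits and uses the GCC/clang __builtin_ctzll builtin:

    #include <stddef.h>
    #include <stdint.h>

    #define BITS_PER_WORD 64

    /* Return the index of the next set ("dirty") bit at or after
     * 'start', or 'nbits' if there is none. */
    static size_t find_next_dirty(const uint64_t *bitmap, size_t nbits,
                                  size_t start)
    {
        size_t nwords = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
        size_t word = start / BITS_PER_WORD;
        /* Mask off the bits below 'start' in the first word. */
        uint64_t cur = bitmap[word] & (~0ULL << (start % BITS_PER_WORD));

        while (cur == 0) {
            if (++word >= nwords) {
                return nbits;       /* nothing dirty remains */
            }
            cur = bitmap[word];
        }

        size_t bit = word * BITS_PER_WORD + __builtin_ctzll(cur);
        return bit < nbits ? bit : nbits;
    }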