Re: [PATCH 1/2] pylint: fix errors and warnings from qemu-tests test 297
On 06/10/2021 18:46, Kevin Wolf wrote:
> Am 06.10.2021 um 15:00 hat Emanuele Giuseppe Esposito geschrieben:
> > Test 297 in qemu-iotests folder currently fails: pylint has
> > learned new things to check, or we simply missed them.
> >
> > All fixes in this patch are related to additional spaces used
> > or wrong indentation.
> >
> > No functional change intended.
> >
> > Signed-off-by: Emanuele Giuseppe Esposito
>
> > @@ -87,13 +87,14 @@ class TestStopWithBlockJob(iotests.QMPTestCase):
> >          iotests.qemu_img('create', '-f', iotests.imgfmt, self.overlay_img,
> >                           '1G')
> > -        result = self.vm.qmp('blockdev-add', **{
> > +        result = self.vm.qmp('blockdev-add',
> > +                             **{
> >              'node-name': 'overlay',
> >              'driver': iotests.imgfmt,
> >              'file': {
> >                  'driver': 'file',
> >                  'filename': self.overlay_img
> > -            }
> > +                }
> >          })
> >          self.assert_qmp(result, 'return', {})
>
> Am I the only one to think that the new indentation for the closing
> brace there is horrible? PEP-8 explicitly allows things like:
>
>     my_list = [
>         1, 2, 3,
>         4, 5, 6,
>     ]
>
> Some of the other changes in this patch should be made, but at least if
> these are behind different switches, I would consider just disabling the
> one that complains about nicely formatted dicts.

The error is "C0330: Wrong hanging indentation" so it is not about
dicts. I guess we can disable the error, but the problem is that we
will disable it for the whole file, which doesn't seem right.

Alternatively, this also works fine:

-        result = self.vm.qmp('blockdev-add',
-                             **{
-            'node-name': 'overlay',
-            'driver': iotests.imgfmt,
-            'file': {
-                'driver': 'file',
-                'filename': self.overlay_img
-                }
-        })
+        result = self.vm.qmp('blockdev-add', **{
+            'node-name': 'overlay',
+            'driver': iotests.imgfmt,
+            'file': {
+                'driver': 'file',
+                'filename': self.overlay_img
+        }})

What do you think? Otherwise I am happy to disable the error altogether.

Emanuele
Re: [PATCH 2/2] qemu-iotests: fix image-fleecing pylint errors
On 06/10/2021 18:51, Kevin Wolf wrote: Am 06.10.2021 um 15:01 hat Emanuele Giuseppe Esposito geschrieben: The problem here is that some variables are formatted with unnecessary spaces to make it prettier and easier to read. However, pylint complains about those additional spaces. A solution is to transform them as string with arbitrary spaces, and then convert it back into a tuple. Removing the spaces makes it a little bit ugly, and directly using the string forces us to change the test reference output accordingly, which will 1) contain ugly weird formatted strings, 2) is not portable if somebody changes the formatting in the test string. Signed-off-by: Emanuele Giuseppe Esposito Changing our logic because of a style checker feels wrong. I'd rather stick in a line like this before the definitions: # pylint: disable=bad-whitespace (Not sure if the syntax of this is entirely correct, but from the comment in your patch and existing uses in iotests, I think this would be the line.) Ok, I will add the line. Same remarks from the previous patch applies: unfortunately then we disable the warning for the whole file. Since here (like the previous patch) the error spans on multiple lines, adding a # pylint: disable= comment on each line is infeasible and ugly. Emanuele
Re: [PATCH v5 05/13] job: @force parameter for job_cancel_sync()
10/6/21 18:19, Hanna Reitz wrote: Callers should be able to specify whether they want job_cancel_sync() to force-cancel the job or not. In fact, almost all invocations do not care about consistency of the result and just want the job to terminate as soon as possible, so they should pass force=true. The replication block driver is the exception, specifically the active commit job it runs. As for job_cancel_sync_all(), all callers want it to force-cancel all jobs, because that is the point of it: To cancel all remaining jobs as quickly as possible (generally on process termination). So make it invoke job_cancel_sync() with force=true. This changes some iotest outputs, because quitting qemu while a mirror job is active will now lead to it being cancelled instead of completed, which is what we want. (Cancelling a READY mirror job with force=false may take an indefinite amount of time, which we do not want when quitting. If users want consistent results, they must have all jobs be done before they quit qemu.) Buglink:https://gitlab.com/qemu-project/qemu/-/issues/462 Signed-off-by: Hanna Reitz Reviewed-by: Vladimir Sementsov-Ogievskiy -- Best regards, Vladimir
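A minimal sketch of the call-site split described in that commit message
(the job variables below are hypothetical; only the added bool force
argument comes from the patch itself):

    /* Almost every caller just wants the job to terminate as soon as
     * possible and does not care about a consistent result: */
    job_cancel_sync(job, true);

    /* The one exception, the active commit job run by
     * block/replication.c, still asks for a graceful cancel: */
    job_cancel_sync(active_commit_job, false);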
Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> At the moment the maximum transfer size with virtio is limited to 4M
> (1024 * PAGE_SIZE). This series raises this limit to its maximum
> theoretical possible transfer size of 128M (32k pages) according to the
> virtio specs:
>
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006

Hi Christian,
I took a quick look at the code:

- The Linux 9p driver restricts descriptor chains to 128 elements
  (net/9p/trans_virtio.c:VIRTQUEUE_NUM)

- The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
  with EINVAL when called with more than IOV_MAX iovecs
  (hw/9pfs/9p.c:v9fs_read())

Unless I misunderstood the code, neither side can take advantage of the
new 32k descriptor chain limit?

Thanks,
Stefan
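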
Re: [PATCH v5 07/13] job: Do not soft-cancel after a job is done
10/6/21 18:19, Hanna Reitz wrote: The only job that supports a soft cancel mode is the mirror job, and in such a case it resets its .cancelled field before it leaves its .run() function, so it does not really count as cancelled. However, it is possible to cancel the job after .run() returns and before job_exit() (which is run in the main loop) is executed. Then, .cancelled would still be true and the job would count as cancelled. This does not seem to be in the interest of the mirror job, so adjust job_cancel_async() to not set .cancelled in such a case, and job_cancel() to not invoke job_completed_txn_abort(). Signed-off-by: Hanna Reitz Reviewed-by: Vladimir Sementsov-Ogievskiy -- Best regards, Vladimir
Re: [PATCH v5 08/13] job: Add job_cancel_requested()
10/6/21 18:19, Hanna Reitz wrote: Most callers of job_is_cancelled() actually want to know whether the job is on its way to immediate termination. For example, we refuse to pause jobs that are cancelled; but this only makes sense for jobs that are really actually cancelled. A mirror job that is cancelled during READY with force=false should absolutely be allowed to pause. This "cancellation" (which is actually a kind of completion) may take an indefinite amount of time, and so should behave like any job during normal operation. For example, with on-target-error=stop, the job should stop on write errors. (In contrast, force-cancelled jobs should not get write errors, as they should just terminate and not do further I/O.) Therefore, redefine job_is_cancelled() to only return true for jobs that are force-cancelled (which as of HEAD^ means any job that interprets the cancellation request as a request for immediate termination), and add job_cancel_requested() as the general variant, which returns true for any jobs which have been requested to be cancelled, whether it be immediately or after an arbitrarily long completion phase. Finally, here is a justification for how different job_is_cancelled() invocations are treated by this patch: - block/mirror.c (mirror_run()): - The first invocation is a while loop that should loop until the job has been cancelled or scheduled for completion. What kind of cancel does not matter, only the fact that the job is supposed to end. - The second invocation wants to know whether the job has been soft-cancelled. Calling job_cancel_requested() is a bit too broad, but if the job were force-cancelled, we should leave the main loop as soon as possible anyway, so this should not matter here. - The last two invocations already check force_cancel, so they should continue to use job_is_cancelled(). - block/backup.c, block/commit.c, block/stream.c, anything in tests/: These jobs know only force-cancel, so there is no difference between job_is_cancelled() and job_cancel_requested(). We can continue using job_is_cancelled(). - job.c: - job_pause_point(), job_yield(), job_sleep_ns(): Only force-cancelled jobs should be prevented from being paused. Continue using job_is_cancelled(). - job_update_rc(), job_finalize_single(), job_finish_sync(): These functions are all called after the job has left its main loop. The mirror job (the only job that can be soft-cancelled) will clear .cancelled before leaving the main loop if it has been soft-cancelled. Therefore, these functions will observe .cancelled to be true only if the job has been force-cancelled. We can continue to use job_is_cancelled(). (Furthermore, conceptually, a soft-cancelled mirror job should not report to have been cancelled. It should report completion (see also the block-job-cancel QAPI documentation). Therefore, it makes sense for these functions not to distinguish between a soft-cancelled mirror job and a job that has completed as normal.) - job_completed_txn_abort(): All jobs other than @job have been force-cancelled. job_is_cancelled() must be true for them. Regarding @job itself: job_completed_txn_abort() is mostly called when the job's return value is not 0. A soft-cancelled mirror has a return value of 0, and so will not end up here then. However, job_cancel() invokes job_completed_txn_abort() if the job has been deferred to the main loop, which is mostly the case for completed jobs (which skip the assertion), but not for sure. To be safe, use job_cancel_requested() in this assertion. 
- job_complete(): This is function eventually invoked by the user (through qmp_block_job_complete() or qmp_job_complete(), or job_complete_sync(), which comes from qemu-img). The intention here is to prevent a user from invoking job-complete after the job has been cancelled. This should also apply to soft cancelling: After a mirror job has been soft-cancelled, the user should not be able to decide otherwise and have it complete as normal (i.e. pivoting to the target). - job_cancel(): Both functions are equivalent (see comment there), but we want to use job_is_cancelled(), because this shows that we call job_completed_txn_abort() only for force-cancelled jobs. (As explained for job_update_rc(), soft-cancelled jobs should be treated as if they have completed as normal.) Buglink:https://gitlab.com/qemu-project/qemu/-/issues/462 Signed-off-by: Hanna Reitz Reviewed-by: Vladimir Sementsov-Ogievskiy -- Best regards, Vladimir
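To make the distinction concrete, here is a reduced sketch of the two
predicates in terms of the fields discussed in this series (.cancelled
and .force_cancel). It illustrates the described semantics and is not
the QEMU source:

    #include <stdbool.h>

    /* Reduced illustration, not the actual implementation. */
    typedef struct Job {
        bool cancelled;     /* some kind of cancel was requested     */
        bool force_cancel;  /* ...and it was a forced (hard) cancel  */
    } Job;

    /* True only for jobs on their way to immediate termination. */
    static bool job_is_cancelled(Job *job)
    {
        return job->cancelled && job->force_cancel;
    }

    /* True for any cancel request, soft or forced. */
    static bool job_cancel_requested(Job *job)
    {
        return job->cancelled;
    }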
Re: [PATCH 2/2] qemu-iotests: fix image-fleecing pylint errors
Am 07.10.2021 um 09:53 hat Emanuele Giuseppe Esposito geschrieben: > > > On 06/10/2021 18:51, Kevin Wolf wrote: > > Am 06.10.2021 um 15:01 hat Emanuele Giuseppe Esposito geschrieben: > > > The problem here is that some variables are formatted with > > > unnecessary spaces to make it prettier and easier to read. > > > > > > However, pylint complains about those additional spaces. > > > A solution is to transform them as string with arbitrary spaces, > > > and then convert it back into a tuple. > > > > > > Removing the spaces makes it a little bit ugly, and directly > > > using the string forces us to change the test reference output > > > accordingly, which will 1) contain ugly weird formatted strings, > > > 2) is not portable if somebody changes the formatting in the test > > > string. > > > > > > Signed-off-by: Emanuele Giuseppe Esposito > > > > Changing our logic because of a style checker feels wrong. I'd rather > > stick in a line like this before the definitions: > > > > # pylint: disable=bad-whitespace > > > > (Not sure if the syntax of this is entirely correct, but from the > > comment in your patch and existing uses in iotests, I think this would > > be the line.) > > Ok, I will add the line. Same remarks from the previous patch applies: > unfortunately then we disable the warning for the whole file. > > Since here (like the previous patch) the error spans on multiple lines, > adding a # pylint: disable= comment on each line is infeasible and ugly. It doesn't fail with my pylint version, so I can't try it out, but does the following work? # pylint: disable=bad-whitespace ... definitions ... # pylint: enable=bad-whitespace Kevin
Re: [PATCH v5 00/13] mirror: Handle errors after READY cancel
10/6/21 18:19, Hanna Reitz wrote: Hi, v1 cover letter: https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00705.html v2 cover letter: https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00747.html v3 cover letter: https://lists.nongnu.org/archive/html/qemu-block/2021-08/msg00127.html v4 cover letter: https://lists.nongnu.org/archive/html/qemu-block/2021-09/msg00314.html Changes in v5: - Added patch 7: When it was soft-cancelled, the mirror job clears .cancelled before leaving its main loop. The clear intention is to have the job be treated as having completed successfully (and this is also supported by the QMP documentation of block-job-cancel). It is however possible to cancel the job after it has left its main loop, before it can be unwound, and then the job would again be treated as cancelled. We don’t want that, so make a soft job-cancel a no-op when .deferred_to_main_loop is true. (This fixes the test-replication failure.) - Patch 8: Add a comment in job_cancel() that job_cancel_requested() and job_is_cancelled() are equivalent here, but why we want to inquire job_is_cancelled() instead of job_cancel_requested(). git-backport-diff against v4: Key: [] : patches are identical [] : number of functional differences between upstream/downstream patch [down] : patch is downstream-only The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively 001/13:[] [--] 'job: Context changes in job_completed_txn_abort()' 002/13:[] [--] 'mirror: Keep s->synced on error' 003/13:[] [--] 'mirror: Drop s->synced' 004/13:[] [--] 'job: Force-cancel jobs in a failed transaction' 005/13:[] [--] 'job: @force parameter for job_cancel_sync()' 006/13:[] [--] 'jobs: Give Job.force_cancel more meaning' 007/13:[down] 'job: Do not soft-cancel after a job is done' 008/13:[0005] [FC] 'job: Add job_cancel_requested()' 009/13:[] [--] 'mirror: Use job_is_cancelled()' 010/13:[] [--] 'mirror: Check job_is_cancelled() earlier' 011/13:[] [--] 'mirror: Stop active mirroring after force-cancel' 012/13:[] [--] 'mirror: Do not clear .cancelled' 013/13:[] [--] 'iotests: Add mirror-ready-cancel-error test' Hanna Reitz (13): job: Context changes in job_completed_txn_abort() mirror: Keep s->synced on error mirror: Drop s->synced job: Force-cancel jobs in a failed transaction job: @force parameter for job_cancel_sync() jobs: Give Job.force_cancel more meaning job: Do not soft-cancel after a job is done job: Add job_cancel_requested() mirror: Use job_is_cancelled() mirror: Check job_is_cancelled() earlier mirror: Stop active mirroring after force-cancel mirror: Do not clear .cancelled iotests: Add mirror-ready-cancel-error test include/qemu/job.h| 29 +++- block/backup.c| 3 +- block/mirror.c| 56 --- block/replication.c | 4 +- blockdev.c| 4 +- job.c | 94 ++-- tests/unit/test-blockjob.c| 2 +- tests/qemu-iotests/109.out| 60 +++- .../tests/mirror-ready-cancel-error | 143 ++ .../tests/mirror-ready-cancel-error.out | 5 + tests/qemu-iotests/tests/qsd-jobs.out | 2 +- 11 files changed, 312 insertions(+), 90 deletions(-) create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out Thanks, applied to my jobs branch. -- Best regards, Vladimir
Re: [PATCH v7 3/8] qmp: add QMP command x-debug-query-virtio
On Tue, Oct 05, 2021 at 04:24:21PM -0500, Eric Blake wrote: > On Tue, Oct 05, 2021 at 12:45:48PM -0400, Jonah Palmer wrote: > > From: Laurent Vivier > > > > This new command lists all the instances of VirtIODevice with > > their QOM paths and virtio type/name. > > > > Signed-off-by: Jonah Palmer > > --- > > hw/virtio/meson.build | 2 ++ > > hw/virtio/virtio-stub.c| 14 ++ > > hw/virtio/virtio.c | 27 +++ > > include/hw/virtio/virtio.h | 1 + > > qapi/meson.build | 1 + > > qapi/qapi-schema.json | 1 + > > qapi/virtio.json | 66 > > ++ > > tests/qtest/qmp-cmd-test.c | 1 + > > 8 files changed, 113 insertions(+) > > create mode 100644 hw/virtio/virtio-stub.c > > create mode 100644 qapi/virtio.json > > > > [Jonah: VirtioInfo member 'type' is now of type string and no longer > > relies on defining a QAPI list of virtio device type enumerations > > to match the VirtIODevice name with qapi_enum_parse().] > > Hmm; depending on how much information you want to cram in strings, we > may want to rebase this series on top of Dan's work to add the > HumanReadableText QAPI type: > https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg07717.html The HumanReadableText type is only needed for commands that return entirely unstructured data, eg returns a HumanReadableText, or an array of HumanReadableText. The command here is returning semi-structured data, so I don't think it needs to depend on HumanReadableText. > > > +++ b/qapi/virtio.json > > @@ -0,0 +1,66 @@ > > +# -*- Mode: Python -*- > > +# vim: filetype=python > > +# > > + > > +## > > +# = Virtio devices > > +## > > + > > +## > > +# @VirtioInfo: > > +# > > +# Information about a given VirtIODevice > > +# > > +# @path: VirtIO device canonical QOM path. > > +# > > +# @type: VirtIO device name. > > +# > > +# Since: 6.2 > > +# > > +## > > +{ 'struct': 'VirtioInfo', > > +'data': { > > +'path': 'str', > > +'type': 'str' > > +} > > +} > > + > > +## > > +# @x-debug-query-virtio: > > +# > > +# Return a list of all initalized VirtIO devices > > +# > > +# Returns: list of gathered @VirtioInfo devices > > +# > > +# Since: 6.2 > > +# > > +# Example: > > +# > > +# -> { "execute": "x-debug-query-virtio" } > > +# <- { "return": [ > > +#{ > > +#"path": "/machine/peripheral-anon/device[4]/virtio-backend", > > +#"type": "virtio-input" > > +#}, > > +#{ > > +#"path": "/machine/peripheral/crypto0/virtio-backend", > > +#"type": "virtio-crypto" > > +#}, > > +#{ > > +#"path": "/machine/peripheral-anon/device[2]/virtio-backend", > > +#"type": "virtio-scsi" > > +#}, > > +#{ > > +#"path": "/machine/peripheral-anon/device[1]/virtio-backend", > > +#"type": "virtio-net" > > +#}, > > +#{ > > +#"path": "/machine/peripheral-anon/device[0]/virtio-backend", > > +#"type": "virtio-serial" > > +#} > > +# ] > > +#} > > +# > > +## > > + > > +{ 'command': 'x-debug-query-virtio', 'returns': ['VirtioInfo'] } > > But for now, it looks like 'str' is the correct type. > > > -- > Eric Blake, Principal Software Engineer > Red Hat, Inc. +1-919-301-3266 > Virtualization: qemu.org | libvirt.org > Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH 1/2] pylint: fix errors and warnings from qemu-tests test 297
Am 07.10.2021 um 09:51 hat Emanuele Giuseppe Esposito geschrieben: > > > On 06/10/2021 18:46, Kevin Wolf wrote: > > Am 06.10.2021 um 15:00 hat Emanuele Giuseppe Esposito geschrieben: > > > Test 297 in qemu-iotests folder currently fails: pylint has > > > learned new things to check, or we simply missed them. > > > > > > All fixes in this patch are related to additional spaces used > > > or wrong indentation. > > > > > > No functional change intended. > > > > > > Signed-off-by: Emanuele Giuseppe Esposito > > > > > @@ -87,13 +87,14 @@ class TestStopWithBlockJob(iotests.QMPTestCase): > > > iotests.qemu_img('create', '-f', iotests.imgfmt, > > > self.overlay_img, > > >'1G') > > > -result = self.vm.qmp('blockdev-add', **{ > > > +result = self.vm.qmp('blockdev-add', > > > + **{ > > >'node-name': 'overlay', > > >'driver': iotests.imgfmt, > > >'file': { > > >'driver': 'file', > > >'filename': self.overlay_img > > > - } > > > + } > > >}) > > > self.assert_qmp(result, 'return', {}) > > > > Am I the only one to think that the new indentation for the closing > > brace there is horrible? PEP-8 explictly allows things like: > > > > my_list = [ > > 1, 2, 3, > > 4, 5, 6, > > ] > > > > Some of the other changes in this patch should be made, but at least if > > these are behind different switches, I would consider just disabling the > > one that complains about nicely formatted dicts. > > The error is "C0330: Wrong hanging indentation" > so it is not about dicts. I guess we can disable the error, but the problem > is that we will disable it for the whole file, which doesn't seem right. Actually, I would disable it globally in pylintrc because building dictionaries for JSON is something that we do a lot. But then I'm surprised that this is the only instance that actually fails. I wonder what the difference is. For example, 129 doesn't seem to be skipped and has this code: result = self.vm.qmp('blockdev-add', **{ 'node-name': 'overlay', 'driver': iotests.imgfmt, 'file': { 'driver': 'file', 'filename': self.overlay_img } }) Yet you don't report a pylint error for this file. Oooh... I think I do see a difference: The final line is indented by one space more in the case that fails for you. It should be vertically aligned with the "'" in the first line, but it is actually aligned with the "b" of "blockdev-add". Does removing one space of indentation in the last line fix the report? Kevin
Re: [RFC PATCH v2 01/25] main-loop.h: introduce qemu_in_main_thread()
On Tue, Oct 05, 2021 at 10:31:51AM -0400, Emanuele Giuseppe Esposito wrote:
> When invoked from the main loop, this function is the same
> as qemu_mutex_iothread_locked, and returns true if the BQL is held.
> When invoked from iothreads or tests, it returns true only
> if the current AioContext is the Main Loop.
>
> This essentially just extends qemu_mutex_iothread_locked to work
> also in unit tests or other users like storage-daemon, that run
> in the Main Loop but end up using the implementation in
> stubs/iothread-lock.c.
>
> Using qemu_mutex_iothread_locked in unit tests defaults to false
> because they use the implementation in stubs/iothread-lock,
> making all assertions added in next patches fail despite the
> AioContext is still the main loop.
>
> Signed-off-by: Emanuele Giuseppe Esposito
> ---
>  include/qemu/main-loop.h | 13 +
>  softmmu/cpus.c           |  5 +
>  stubs/iothread-lock.c    |  5 +
>  3 files changed, 23 insertions(+)

Reviewed-by: Stefan Hajnoczi
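As a rough model of what the commit message describes (a sketch, not the
patch itself, which touches softmmu/cpus.c and stubs/iothread-lock.c):

    /* System emulator flavour: the main loop holds the BQL, so the
     * existing check is enough. */
    bool qemu_in_main_thread(void)
    {
        return qemu_mutex_iothread_locked();
    }

    /* Stub flavour for unit tests and tools such as the storage
     * daemon, which have no BQL: compare the current AioContext with
     * the main loop's context instead. */
    bool qemu_in_main_thread(void)
    {
        return qemu_get_current_aio_context() == qemu_get_aio_context();
    }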
Re: [RFC PATCH v2 02/25] include/sysemu/block-backend: split header into I/O and global state (GS) API
On Tue, Oct 05, 2021 at 10:31:52AM -0400, Emanuele Giuseppe Esposito wrote:
> +/*
> + * Global state (GS) API. These functions run under the BQL lock.
> + *
> + * If a function modifies the graph, it also uses drain and/or
> + * aio_context_acquire/release to be sure it has unique access.
> + * aio_context locking is needed together with BQL because of
> + * the thread-safe I/O API that concurrently runs and accesses
> + * the graph without the BQL.
> + *
> + * It is important to note that not all of these functions are
> + * necessarily limited to running under the BQL, but they would
> + * require additional auditing and may small thread-safety changes

s/may/many/

> diff --git a/include/sysemu/block-backend-io.h
> b/include/sysemu/block-backend-io.h
> new file mode 100644
> index 00..a77b2080ce
> --- /dev/null
> +++ b/include/sysemu/block-backend-io.h
> @@ -0,0 +1,130 @@
> +/*
> + * QEMU Block backends
> + *
> + * Copyright (C) 2014-2016 Red Hat, Inc.
> + *
> + * Authors:
> + *  Markus Armbruster ,
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1
> + * or later. See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#ifndef BLOCK_BACKEND_IO_H
> +#define BLOCK_BACKEND_IO_H
> +
> +#include "block-backend-common.h"
> +
> +/*
> + * I/O API functions. These functions are thread-safe, and therefore
> + * can run in any thread as long as they have called
> + * aio_context_acquire/release().

The meaning of "they" is not 100% clear. It could be the thread or the
I/O API functions. I suggest making it explicit:

s/they have/the thread has/

Otherwise:
Reviewed-by: Stefan Hajnoczi
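To illustrate the locking rule spelled out in that header comment, a
call into the I/O API from outside the main loop would be bracketed
roughly like this (an illustrative fragment, not taken from the patch;
blk, offset, buf and bytes are placeholders):

    AioContext *ctx = blk_get_aio_context(blk);
    int ret;

    aio_context_acquire(ctx);
    ret = blk_pread(blk, offset, buf, bytes);   /* thread-safe I/O API call */
    aio_context_release(ctx);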
Re: [RFC PATCH v2 04/25] include/block/block: split header into I/O and global state API
On Tue, Oct 05, 2021 at 10:31:54AM -0400, Emanuele Giuseppe Esposito wrote:
> +int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
> +                     const void *buf, int64_t bytes);

This one is a bit of a surprise, since the other synchronous I/O
functions aren't included in this header. Why did you put it here? This
one may be safe to move to the I/O API.

> +int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> +                      int64_t bytes, int64_t *pnum, int64_t *map,
> +                      BlockDriverState **file);

This function just calls bdrv_block_status_above(), which is in the I/O
API. I think it's safe to move this to the I/O API or else
bdrv_block_status_above() shouldn't be there :).
Re: [PATCH 1/2] pylint: fix errors and warnings from qemu-tests test 297
> > The error is "C0330: Wrong hanging indentation"
> > so it is not about dicts. I guess we can disable the error, but the
> > problem is that we will disable it for the whole file, which doesn't
> > seem right.
>
> Actually, I would disable it globally in pylintrc because building
> dictionaries for JSON is something that we do a lot.
>
> But then I'm surprised that this is the only instance that actually
> fails. I wonder what the difference is. For example, 129 doesn't seem
> to be skipped and has this code:
>
>     result = self.vm.qmp('blockdev-add', **{
>         'node-name': 'overlay',
>         'driver': iotests.imgfmt,
>         'file': {
>             'driver': 'file',
>             'filename': self.overlay_img
>         }
>     })
>
> Yet you don't report a pylint error for this file.

Well, unless I am misunderstanding something... 129 *is* the file I am
reporting. And that is exactly the function where pylint complains.

> Oooh... I think I do see a difference: The final line is indented by
> one space more in the case that fails for you. It should be vertically
> aligned with the "'" in the first line, but it is actually aligned
> with the "b" of "blockdev-add".
>
> Does removing one space of indentation in the last line fix the report?

It's not only the final line, it's from "**{" till the ending ")".
'node-name' is under "ock" of 'blockdev-add'. It is clearly badly
indented, regardless of the new style and pylint new rules. Pylint
itself suggests to move it 4 spaces more than "result =", ie 21 spaces
before.

Still, applying your suggestion to all the lines and removing 1 space
from all lines still does not make pylint happy, as it asks to remove
20 spaces.

To simplify things, this is the error I get:

=== pylint ===
+* Module 129
+129:91:0: C0330: Wrong hanging indentation (remove 21 spaces).
+                               'node-name': 'overlay',
+              |                ^ (bad-continuation)
+129:92:0: C0330: Wrong hanging indentation (remove 21 spaces).
+                               'driver': iotests.imgfmt,
+              |                ^ (bad-continuation)
+129:93:0: C0330: Wrong hanging indentation (remove 21 spaces).
+                               'file': {
+              |                ^ (bad-continuation)
+129:97:0: C0330: Wrong hanging indentation.
+                              })
+              |               ^ (bad-continuation)

So unless you want to disable it overall, one way of fixing 129 is to
follow what pylint suggests, and do like I wrote in the previous email:

Either:

    result = self.vm.qmp('blockdev-add', **{
        'node-name': 'overlay',            <-- 21 spaces less
        'driver': iotests.imgfmt,          <-- 21 spaces less
        'file': {                          <-- 21 spaces less
            'driver': 'file',              <-- 21 spaces less
            'filename': self.overlay_img   <-- 21 spaces less
        }                                  <-- 21 spaces less
    })                                     <-- 21 spaces less

or (difference is in the last line only):

    result = self.vm.qmp('blockdev-add', **{
        'node-name': 'overlay',            <-- 21 spaces less
        'driver': iotests.imgfmt,          <-- 21 spaces less
        'file': {                          <-- 21 spaces less
            'driver': 'file',              <-- 21 spaces less
            'filename': self.overlay_img   <-- 21 spaces less
        }})                                <-- 21 spaces less

Emanuele
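For reference, silencing the check globally, as suggested earlier in the
thread, would amount to something like the following in the iotests
pylintrc (assuming a pylint version that still implements C0330 /
bad-continuation):

    [MESSAGES CONTROL]
    disable=bad-continuation

The alternative is to keep the check enabled and reindent the dict as
shown above.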
Re: [PATCH 2/2] qemu-iotests: fix image-fleecing pylint errors
On 07/10/2021 10:36, Kevin Wolf wrote: Am 07.10.2021 um 09:53 hat Emanuele Giuseppe Esposito geschrieben: On 06/10/2021 18:51, Kevin Wolf wrote: Am 06.10.2021 um 15:01 hat Emanuele Giuseppe Esposito geschrieben: The problem here is that some variables are formatted with unnecessary spaces to make it prettier and easier to read. However, pylint complains about those additional spaces. A solution is to transform them as string with arbitrary spaces, and then convert it back into a tuple. Removing the spaces makes it a little bit ugly, and directly using the string forces us to change the test reference output accordingly, which will 1) contain ugly weird formatted strings, 2) is not portable if somebody changes the formatting in the test string. Signed-off-by: Emanuele Giuseppe Esposito Changing our logic because of a style checker feels wrong. I'd rather stick in a line like this before the definitions: # pylint: disable=bad-whitespace (Not sure if the syntax of this is entirely correct, but from the comment in your patch and existing uses in iotests, I think this would be the line.) Ok, I will add the line. Same remarks from the previous patch applies: unfortunately then we disable the warning for the whole file. Since here (like the previous patch) the error spans on multiple lines, adding a # pylint: disable= comment on each line is infeasible and ugly. It doesn't fail with my pylint version, so I can't try it out, but does the following work? # pylint: disable=bad-whitespace ... definitions ... # pylint: enable=bad-whitespace You are right, this is valid and looks very good. In this way I can just temporarily disable the error for the variables. Kevin
Re: [RFC PATCH v2 05/25] assertions for block global state API
On Tue, Oct 05, 2021 at 10:31:55AM -0400, Emanuele Giuseppe Esposito wrote:
> Signed-off-by: Emanuele Giuseppe Esposito
> ---
>  block.c        | 135 +++--
>  block/commit.c |   2 +
>  block/io.c     |  20
>  blockdev.c     |   1 +
>  4 files changed, 155 insertions(+), 3 deletions(-)

Reviewed-by: Stefan Hajnoczi
Re: [RFC PATCH v2 03/25] block/block-backend.c: assertions for block-backend
On Tue, Oct 05, 2021 at 10:31:53AM -0400, Emanuele Giuseppe Esposito wrote:
> All the global state (GS) API functions will check that
> qemu_in_main_thread() returns true. If not, it means
> that the safety of BQL cannot be guaranteed, and
> they need to be moved to I/O.
>
> Signed-off-by: Emanuele Giuseppe Esposito
> ---
>  block/block-backend.c  | 89 +-
>  softmmu/qdev-monitor.c |  2 +
>  2 files changed, 90 insertions(+), 1 deletion(-)

Modulo Eric's comment:

Reviewed-by: Stefan Hajnoczi
Re: [RFC PATCH v2 06/25] include/block/block_int: split header into I/O and global state API
On Tue, Oct 05, 2021 at 10:31:56AM -0400, Emanuele Giuseppe Esposito wrote:
> +int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, int64_t src_offset,
> +                                         BdrvChild *dst, int64_t dst_offset,
> +                                         int64_t bytes,
> +                                         BdrvRequestFlags read_flags,
> +                                         BdrvRequestFlags write_flags);
> +int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t src_offset,
> +                                       BdrvChild *dst, int64_t dst_offset,
> +                                       int64_t bytes,
> +                                       BdrvRequestFlags read_flags,
> +                                       BdrvRequestFlags write_flags);

These look like I/O APIs to me. Are they in the GS API because only
qemu-img.c calls copy_range? I thought SCSI emulation might call it too,
but grep says otherwise.

Otherwise:
Reviewed-by: Stefan Hajnoczi
Re: [RFC PATCH v2 04/25] include/block/block: split header into I/O and global state API
On 07/10/2021 11:33, Stefan Hajnoczi wrote:
> On Tue, Oct 05, 2021 at 10:31:54AM -0400, Emanuele Giuseppe Esposito wrote:
> > +int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
> > +                     const void *buf, int64_t bytes);
>
> This one is a bit of a surprise, since the other synchronous I/O
> functions aren't included in this header. Why did you put it here?
> This one may be safe to move to the I/O API.

Considering that in the next patch I did not even add an assertion for
it, I am confident enough that it was a copy-paste mistake. This goes
into I/O.

> > +int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> > +                      int64_t bytes, int64_t *pnum, int64_t *map,
> > +                      BlockDriverState **file);
>
> This function just calls bdrv_block_status_above(), which is in the
> I/O API. I think it's safe to move this to the I/O API or else
> bdrv_block_status_above() shouldn't be there :).

It *seems* that while bdrv_block_status_above() is an I/O function,
probably running in some coroutine (hence its internal
qemu_in_coroutine() check), bdrv_block_status() might be called from
the main loop (or alternatively the function is never invoked in the
tests, so the assertion never triggered).

Maybe bdrv_block_status_above() is one of the few functions that are
both I/O and Main loop? I put it in I/O as it can't have the assertion.

Thank you,
Emanuele
Re: [RFC PATCH v2 06/25] include/block/block_int: split header into I/O and global state API
On 07/10/2021 12:52, Stefan Hajnoczi wrote: On Tue, Oct 05, 2021 at 10:31:56AM -0400, Emanuele Giuseppe Esposito wrote: +int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, int64_t src_offset, + BdrvChild *dst, int64_t dst_offset, + int64_t bytes, + BdrvRequestFlags read_flags, + BdrvRequestFlags write_flags); +int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t src_offset, + BdrvChild *dst, int64_t dst_offset, + int64_t bytes, + BdrvRequestFlags read_flags, + BdrvRequestFlags write_flags); These look like I/O APIs to me. Are they in the GS API because only qemu-img.c call copy_range? I thought SCSI emulation might call it too, but grep says otherwise. SCSI (iscsi.c) implements the function pointer (*bdrv_co_copy_range_from/to), but does not call the function itself. However, later in the patches I put the function pointer as I/O. These two functions are only tested by test-replication and thus are always under BQL when tested. But after looking at them again, and since they internally use only I/O APIs, I think we can move them to the I/O API. Thank you, Emanuele
Re: [RFC PATCH v2 04/25] include/block/block: split header into I/O and global state API
On Tue, Oct 05, 2021 at 10:31:54AM -0400, Emanuele Giuseppe Esposito wrote:
> Similarly to the previous patch, split block.h
> in block-io.h and block-global-state.h
>
> block-common.h contains the structures shared between
> the two headers, and the functions that can't be categorized as
> I/O or global state.

This is nice from a code organization POV, but it doesn't do all that
much from a code reviewer / author POV, as I doubt anyone will remember
which header file the respective APIs/structures/constants are in,
without having to look it up each time.

It would make life easier if we had distinct naming conventions for
APIs/structs/constants in the respective headers.

eg instead of "bdrv_" have "bdrv_state_" and "bdrv_io_" as the two
naming conventions for -global-state.h and -io.h respectively, and only
use the bare 'bdrv_' for -common.h

Yes, this would be major code churn, but I think it'd make the code
clearer to understand, which will be a win over the long term.

NB, I'm not suggesting doing a rename as part of this patch though. Any
rename would have to be separate, and likely split over many patches to
make it manageable.

Regards,
Daniel

--
|: https://berrange.com      -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o- https://fstop138.berrange.com           :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange    :|
Re: [PATCH] qemu-iotests: flush after every test
On 06/10/21 17:42, Philippe Mathieu-Daudé wrote: On 10/6/21 11:27, Paolo Bonzini wrote: This makes it possible to see what is happening, even if the output of "make check-block" is not sent to a tty (for example if it is sent to grep or tee). Signed-off-by: Paolo Bonzini --- tests/qemu-iotests/testrunner.py | 1 + 1 file changed, 1 insertion(+) diff --git a/tests/qemu-iotests/testrunner.py b/tests/qemu-iotests/testrunner.py index 4a6ec421ed..b76db57e4c 100644 --- a/tests/qemu-iotests/testrunner.py +++ b/tests/qemu-iotests/testrunner.py @@ -340,6 +340,7 @@ def run_tests(self, tests: List[str]) -> bool: elif res.status == 'not run': notrun.append(name) +sys.stdout.flush() Shouldn't we flush stderr too? It's never used by the program. Paolo Reviewed-by: Philippe Mathieu-Daudé
Re: [RFC PATCH v2 04/25] include/block/block: split header into I/O and global state API
On 07/10/21 12:54, Emanuele Giuseppe Esposito wrote: +int bdrv_block_status(BlockDriverState *bs, int64_t offset, + int64_t bytes, int64_t *pnum, int64_t *map, + BlockDriverState **file); This function just called bdrv_block_status_above(), which is in the I/O API. I think it's safe to move this to the I/O API or else bdrv_block_status_above() shouldn't be there :). It *seems* that while bdrv_block_status_above() is an I/O, probably running in some coroutine (from here its internal qemu_in_coroutine check), bdrv_block_status might be called from the main loop (or alternatively the function is never invoked in the tests, so the assertion never triggered). Maybe bdrv_block_status_above is one of the few functions that are both I/O and Main loop? I put it in I/O as it can't have the assertion. No, they are both I/O. Callers of bdrv_block_status are hw/nvme and qemu-img.c; while the latter can be either (it does not have iothreads), hw/nvme is definitely I/O. Paolo
Re: [RFC PATCH v2 18/25] block/coroutines: I/O API
On 05/10/21 16:32, Emanuele Giuseppe Esposito wrote: block coroutines functions run in different aiocontext, and are not protected by the BQL. Therefore are I/O. Signed-off-by: Emanuele Giuseppe Esposito --- block/coroutines.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/block/coroutines.h b/block/coroutines.h index 514d169d23..105e0ce2a9 100644 --- a/block/coroutines.h +++ b/block/coroutines.h @@ -27,6 +27,12 @@ #include "block/block_int.h" +/* + * I/O API functions. These functions are thread-safe, and therefore + * can run in any thread as long as they have called + * aio_context_acquire/release(). + */ + int coroutine_fn bdrv_co_check(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix); int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp); Reviewed-by: Paolo Bonzini except for the same comment about "they" that was in patch 2. Paolo
Re: [RFC PATCH v2 13/25] include/systemu/blockdev.h: global state API
On 05/10/21 16:32, Emanuele Giuseppe Esposito wrote: DriveInfo *drive_get_next(BlockInterfaceType type); +DriveInfo *drive_new(QemuOpts *arg, BlockInterfaceType block_default_type, + Error **errp); + +/* Common functions that are neither I/O nor Global State */ + +DriveInfo *blk_legacy_dinfo(BlockBackend *blk); +int drive_get_max_devs(BlockInterfaceType type); + QemuOpts *drive_def(const char *optstr); + QemuOpts *drive_add(BlockInterfaceType type, int index, const char *file, const char *optstr); -DriveInfo *drive_new(QemuOpts *arg, BlockInterfaceType block_default_type, - Error **errp); drive_add and drive_def touch global state (QemuOpts). But really neither should be in this header: drive_add can be moved to softmmu/vl.c, while drive_def can be inlined into its two callers. With that changed, Reviewed-by: Paolo Bonzini
Re: [RFC PATCH v2 08/25] block: introduce assert_bdrv_graph_writable
On 05/10/21 16:31, Emanuele Giuseppe Esposito wrote: We want to be sure that the functions that write the child and parent list of a bs are either under BQL or drain. If this guarantee holds, then we can read the list also in the I/O APIs. Signed-off-by: Emanuele Giuseppe Esposito --- block.c| 5 + block/io.c | 5 + include/block/block_int-global-state.h | 8 +++- 3 files changed, 17 insertions(+), 1 deletion(-) diff --git a/block.c b/block.c index b912f517a4..7d1eb847a4 100644 --- a/block.c +++ b/block.c @@ -2711,12 +2711,14 @@ static void bdrv_replace_child_noperm(BdrvChild *child, if (child->klass->detach) { child->klass->detach(child); } +assert_bdrv_graph_writable(old_bs); QLIST_REMOVE(child, next_parent); } child->bs = new_bs; if (new_bs) { +assert_bdrv_graph_writable(new_bs); QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent); /* @@ -2917,6 +2919,7 @@ static int bdrv_attach_child_noperm(BlockDriverState *parent_bs, return ret; } +assert_bdrv_graph_writable(parent_bs); QLIST_INSERT_HEAD(&parent_bs->children, *child, next); /* * child is removed in bdrv_attach_child_common_abort(), so don't care to @@ -3117,6 +3120,7 @@ static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child, void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child) { g_assert(qemu_in_main_thread()); +assert_bdrv_graph_writable(parent); if (child == NULL) { return; } @@ -4878,6 +4882,7 @@ static void bdrv_remove_filter_or_cow_child_abort(void *opaque) BdrvRemoveFilterOrCowChild *s = opaque; BlockDriverState *parent_bs = s->child->opaque; +assert_bdrv_graph_writable(parent_bs); QLIST_INSERT_HEAD(&parent_bs->children, s->child, next); if (s->is_backing) { parent_bs->backing = s->child; diff --git a/block/io.c b/block/io.c index 21dcc5d962..d184183b07 100644 --- a/block/io.c +++ b/block/io.c @@ -739,6 +739,11 @@ void bdrv_drain_all(void) bdrv_drain_all_end(); } +void assert_bdrv_graph_writable(BlockDriverState *bs) +{ +g_assert(qatomic_read(&bs->quiesce_counter) > 0 || qemu_in_main_thread()); +} + /** * Remove an active request from the tracked requests list * diff --git a/include/block/block_int-global-state.h b/include/block/block_int-global-state.h index aad549cb85..a53ab146fc 100644 --- a/include/block/block_int-global-state.h +++ b/include/block/block_int-global-state.h @@ -343,4 +343,10 @@ void bdrv_remove_aio_context_notifier(BlockDriverState *bs, */ void bdrv_drain_all_end_quiesce(BlockDriverState *bs); -#endif /* BLOCK_INT_GLOBAL_STATE*/ +/** + * Make sure that the function is either running under + * drain, or under BQL. + */ +void assert_bdrv_graph_writable(BlockDriverState *bs); + +#endif /* BLOCK_INT_GLOBAL_STATE */ Reviewed-by: Paolo Bonzini
Re: [RFC PATCH v2 08/25] block: introduce assert_bdrv_graph_writable
On 05/10/21 16:31, Emanuele Giuseppe Esposito wrote: We want to be sure that the functions that write the child and parent list of a bs are either under BQL or drain. If this guarantee holds, then we can read the list also in the I/O APIs. Signed-off-by: Emanuele Giuseppe Esposito --- block.c| 5 + block/io.c | 5 + include/block/block_int-global-state.h | 8 +++- 3 files changed, 17 insertions(+), 1 deletion(-) diff --git a/block.c b/block.c index b912f517a4..7d1eb847a4 100644 --- a/block.c +++ b/block.c @@ -2711,12 +2711,14 @@ static void bdrv_replace_child_noperm(BdrvChild *child, if (child->klass->detach) { child->klass->detach(child); } +assert_bdrv_graph_writable(old_bs); QLIST_REMOVE(child, next_parent); } child->bs = new_bs; if (new_bs) { +assert_bdrv_graph_writable(new_bs); QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent); /* @@ -2917,6 +2919,7 @@ static int bdrv_attach_child_noperm(BlockDriverState *parent_bs, return ret; } +assert_bdrv_graph_writable(parent_bs); QLIST_INSERT_HEAD(&parent_bs->children, *child, next); /* * child is removed in bdrv_attach_child_common_abort(), so don't care to @@ -3117,6 +3120,7 @@ static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child, void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child) { g_assert(qemu_in_main_thread()); +assert_bdrv_graph_writable(parent); if (child == NULL) { return; } @@ -4878,6 +4882,7 @@ static void bdrv_remove_filter_or_cow_child_abort(void *opaque) BdrvRemoveFilterOrCowChild *s = opaque; BlockDriverState *parent_bs = s->child->opaque; +assert_bdrv_graph_writable(parent_bs); QLIST_INSERT_HEAD(&parent_bs->children, s->child, next); if (s->is_backing) { parent_bs->backing = s->child; diff --git a/block/io.c b/block/io.c index 21dcc5d962..d184183b07 100644 --- a/block/io.c +++ b/block/io.c @@ -739,6 +739,11 @@ void bdrv_drain_all(void) bdrv_drain_all_end(); } +void assert_bdrv_graph_writable(BlockDriverState *bs) +{ +g_assert(qatomic_read(&bs->quiesce_counter) > 0 || qemu_in_main_thread()); +} Hmm, wait - I think this should be an "&&", not an OR? Paolo + /** * Remove an active request from the tracked requests list * diff --git a/include/block/block_int-global-state.h b/include/block/block_int-global-state.h index aad549cb85..a53ab146fc 100644 --- a/include/block/block_int-global-state.h +++ b/include/block/block_int-global-state.h @@ -343,4 +343,10 @@ void bdrv_remove_aio_context_notifier(BlockDriverState *bs, */ void bdrv_drain_all_end_quiesce(BlockDriverState *bs); -#endif /* BLOCK_INT_GLOBAL_STATE*/ +/** + * Make sure that the function is either running under + * drain, or under BQL. + */ +void assert_bdrv_graph_writable(BlockDriverState *bs); + +#endif /* BLOCK_INT_GLOBAL_STATE */
Re: [RFC PATCH v2 14/25] assertions for blockdev.h global state API
On 05/10/21 16:32, Emanuele Giuseppe Esposito wrote: Signed-off-by: Emanuele Giuseppe Esposito --- block/block-backend.c | 2 ++ blockdev.c| 12 2 files changed, 14 insertions(+) diff --git a/block/block-backend.c b/block/block-backend.c index 9f09245069..18791c4fdc 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -805,6 +805,7 @@ DriveInfo *blk_legacy_dinfo(BlockBackend *blk) DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo) { assert(!blk->legacy_dinfo); +g_assert(qemu_in_main_thread()); return blk->legacy_dinfo = dinfo; } @@ -815,6 +816,7 @@ DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo) BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo) { BlockBackend *blk = NULL; +g_assert(qemu_in_main_thread()); while ((blk = blk_next(blk)) != NULL) { if (blk->legacy_dinfo == dinfo) { diff --git a/blockdev.c b/blockdev.c index 5608b78f8f..917bcf8cbc 100644 --- a/blockdev.c +++ b/blockdev.c @@ -114,6 +114,8 @@ void override_max_devs(BlockInterfaceType type, int max_devs) BlockBackend *blk; DriveInfo *dinfo; +g_assert(qemu_in_main_thread()); + if (max_devs <= 0) { return; } @@ -230,6 +232,8 @@ DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit) BlockBackend *blk; DriveInfo *dinfo; +g_assert(qemu_in_main_thread()); + for (blk = blk_next(NULL); blk; blk = blk_next(blk)) { dinfo = blk_legacy_dinfo(blk); if (dinfo && dinfo->type == type @@ -252,6 +256,8 @@ void drive_check_orphaned(void) Location loc; bool orphans = false; +g_assert(qemu_in_main_thread()); + for (blk = blk_next(NULL); blk; blk = blk_next(blk)) { dinfo = blk_legacy_dinfo(blk); /* @@ -285,6 +291,7 @@ void drive_check_orphaned(void) DriveInfo *drive_get_by_index(BlockInterfaceType type, int index) { +g_assert(qemu_in_main_thread()); return drive_get(type, drive_index_to_bus_id(type, index), drive_index_to_unit_id(type, index)); @@ -296,6 +303,8 @@ int drive_get_max_bus(BlockInterfaceType type) BlockBackend *blk; DriveInfo *dinfo; +g_assert(qemu_in_main_thread()); + max_bus = -1; for (blk = blk_next(NULL); blk; blk = blk_next(blk)) { dinfo = blk_legacy_dinfo(blk); @@ -312,6 +321,7 @@ int drive_get_max_bus(BlockInterfaceType type) DriveInfo *drive_get_next(BlockInterfaceType type) { static int next_block_unit[IF_COUNT]; +g_assert(qemu_in_main_thread()); return drive_get(type, 0, next_block_unit[type]++); } @@ -792,6 +802,8 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type, const char *filename; int i; +g_assert(qemu_in_main_thread()); + /* Change legacy command line options into QMP ones */ static const struct { const char *from; Reviewed-by: Paolo Bonzini
Re: [RFC PATCH v2 16/25] block/backup-top.h: global state API + assertions
On 05/10/21 16:32, Emanuele Giuseppe Esposito wrote: backup-top functions always run under BQL lock. Signed-off-by: Emanuele Giuseppe Esposito --- block/backup-top.c | 2 ++ block/backup-top.h | 11 +++ 2 files changed, 13 insertions(+) diff --git a/block/backup-top.c b/block/backup-top.c index 425e3778be..8b58a909f7 100644 --- a/block/backup-top.c +++ b/block/backup-top.c @@ -182,6 +182,7 @@ BlockDriverState *bdrv_backup_top_append(BlockDriverState *source, bool appended = false; assert(source->total_sectors == target->total_sectors); +g_assert(qemu_in_main_thread()); top = bdrv_new_open_driver(&bdrv_backup_top_filter, filter_node_name, BDRV_O_RDWR, errp); @@ -244,6 +245,7 @@ fail: void bdrv_backup_top_drop(BlockDriverState *bs) { BDRVBackupTopState *s = bs->opaque; +g_assert(qemu_in_main_thread()); bdrv_drop_filter(bs, &error_abort); diff --git a/block/backup-top.h b/block/backup-top.h index b28b0031c4..8cb6f62869 100644 --- a/block/backup-top.h +++ b/block/backup-top.h @@ -29,6 +29,17 @@ #include "block/block_int.h" #include "block/block-copy.h" +/* + * Graph API. These functions run under the BQL lock. + * + * If a function modifies the graph, it uses drain and/or + * aio_context_acquire/release to be sure it has unique access. + * + * All functions in this header must use this assertion: + * g_assert(qemu_in_main_thread()); + * to be sure they belong here. + */ + BlockDriverState *bdrv_backup_top_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, This is now bdrv_cbw_append, but anyway: Reviewed-by: Paolo Bonzini
Re: [RFC PATCH v2 15/25] include/block/snapshot: global state API + assertions
On 05/10/21 16:32, Emanuele Giuseppe Esposito wrote: Snapshots run also under the BQL lock, so they all are in the global state API. The aiocontext lock that they hold is currently an overkill and in future could be removed. Signed-off-by: Emanuele Giuseppe Esposito --- block/snapshot.c | 28 include/block/snapshot.h | 21 + migration/savevm.c | 2 ++ 3 files changed, 51 insertions(+) diff --git a/block/snapshot.c b/block/snapshot.c index ccacda8bd5..e8756f7f90 100644 --- a/block/snapshot.c +++ b/block/snapshot.c @@ -57,6 +57,8 @@ int bdrv_snapshot_find(BlockDriverState *bs, QEMUSnapshotInfo *sn_info, QEMUSnapshotInfo *sn_tab, *sn; int nb_sns, i, ret; +g_assert(qemu_in_main_thread()); + ret = -ENOENT; nb_sns = bdrv_snapshot_list(bs, &sn_tab); if (nb_sns < 0) { @@ -105,6 +107,7 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs, bool ret = false; assert(id || name); +g_assert(qemu_in_main_thread()); nb_sns = bdrv_snapshot_list(bs, &sn_tab); if (nb_sns < 0) { @@ -200,6 +203,7 @@ static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs) int bdrv_can_snapshot(BlockDriverState *bs) { BlockDriver *drv = bs->drv; +g_assert(qemu_in_main_thread()); if (!drv || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs)) { return 0; } @@ -220,6 +224,9 @@ int bdrv_snapshot_create(BlockDriverState *bs, { BlockDriver *drv = bs->drv; BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs); + +g_assert(qemu_in_main_thread()); + if (!drv) { return -ENOMEDIUM; } @@ -240,6 +247,8 @@ int bdrv_snapshot_goto(BlockDriverState *bs, BdrvChild **fallback_ptr; int ret, open_ret; +g_assert(qemu_in_main_thread()); + if (!drv) { error_setg(errp, "Block driver is closed"); return -ENOMEDIUM; @@ -348,6 +357,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs, BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs); int ret; +g_assert(qemu_in_main_thread()); + if (!drv) { error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, bdrv_get_device_name(bs)); return -ENOMEDIUM; @@ -380,6 +391,8 @@ int bdrv_snapshot_list(BlockDriverState *bs, { BlockDriver *drv = bs->drv; BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs); + +g_assert(qemu_in_main_thread()); if (!drv) { return -ENOMEDIUM; } @@ -419,6 +432,8 @@ int bdrv_snapshot_load_tmp(BlockDriverState *bs, { BlockDriver *drv = bs->drv; +g_assert(qemu_in_main_thread()); + if (!drv) { error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, bdrv_get_device_name(bs)); return -ENOMEDIUM; @@ -447,6 +462,8 @@ int bdrv_snapshot_load_tmp_by_id_or_name(BlockDriverState *bs, int ret; Error *local_err = NULL; +g_assert(qemu_in_main_thread()); + ret = bdrv_snapshot_load_tmp(bs, id_or_name, NULL, &local_err); if (ret == -ENOENT || ret == -EINVAL) { error_free(local_err); @@ -515,6 +532,8 @@ bool bdrv_all_can_snapshot(bool has_devices, strList *devices, g_autoptr(GList) bdrvs = NULL; GList *iterbdrvs; +g_assert(qemu_in_main_thread()); + if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) { return false; } @@ -549,6 +568,8 @@ int bdrv_all_delete_snapshot(const char *name, g_autoptr(GList) bdrvs = NULL; GList *iterbdrvs; +g_assert(qemu_in_main_thread()); + if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) { return -1; } @@ -588,6 +609,8 @@ int bdrv_all_goto_snapshot(const char *name, g_autoptr(GList) bdrvs = NULL; GList *iterbdrvs; +g_assert(qemu_in_main_thread()); + if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) { return -1; } @@ -622,6 +645,8 @@ int bdrv_all_has_snapshot(const char *name, g_autoptr(GList) bdrvs = NULL; 
GList *iterbdrvs; +g_assert(qemu_in_main_thread()); + if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) { return -1; } @@ -663,6 +688,7 @@ int bdrv_all_create_snapshot(QEMUSnapshotInfo *sn, { g_autoptr(GList) bdrvs = NULL; GList *iterbdrvs; +g_assert(qemu_in_main_thread()); if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) { return -1; @@ -703,6 +729,8 @@ BlockDriverState *bdrv_all_find_vmstate_bs(const char *vmstate_bs, g_autoptr(GList) bdrvs = NULL; GList *iterbdrvs; +g_assert(qemu_in_main_thread()); + if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) { return NULL; } diff --git a/include/block/snapshot.h b/include/block/snapshot.h index 940345692f..3a848493
Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > At the moment the maximum transfer size with virtio is limited to 4M
> > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > theoretical possible transfer size of 128M (32k pages) according to the
> > virtio specs:
> >
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > x1-240006
>
> Hi Christian,
> I took a quick look at the code:
>
> - The Linux 9p driver restricts descriptor chains to 128 elements
>   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)

Yes, that's the limitation that I am about to remove (WIP); current kernel
patches:
https://lore.kernel.org/netdev/cover.1632327421.git.linux_...@crudebyte.com/

> - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
>   with EINVAL when called with more than IOV_MAX iovecs
>   (hw/9pfs/9p.c:v9fs_read())

Hmm, which makes me wonder why I never encountered this error during
testing.

Most people will use the 9p qemu 'local' fs driver backend in practice, so
that v9fs_read() call would translate for most people to this implementation
on QEMU side (hw/9p/9p-local.c):

static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
                            const struct iovec *iov,
                            int iovcnt, off_t offset)
{
#ifdef CONFIG_PREADV
    return preadv(fs->fd, iov, iovcnt, offset);
#else
    int err = lseek(fs->fd, offset, SEEK_SET);
    if (err == -1) {
        return err;
    } else {
        return readv(fs->fd, iov, iovcnt);
    }
#endif
}

> Unless I misunderstood the code, neither side can take advantage of the
> new 32k descriptor chain limit?
>
> Thanks,
> Stefan

I need to check that when I have some more time. One possible explanation
might be that preadv() already has this wrapped into a loop in its
implementation to circumvent a limit like IOV_MAX. It might be another
"it works, but not portable" issue, but not sure.

There are still a bunch of other issues I have to resolve. If you look at
net/9p/client.c on kernel side, you'll notice that it basically does this ATM

    kmalloc(msize);

for every 9p request. So not only does it allocate much more memory for every
request than actually required (i.e. say 9pfs was mounted with msize=8M, then
a 9p request that actually would just need 1k would nevertheless allocate 8M),
but also it allocates > PAGE_SIZE, which obviously may fail at any time.

With those kernel patches above and QEMU being patched with these series as
well, I can go above 4M msize now, and the test system runs stable if 9pfs
was mounted with an msize not being "too high". If I try to mount 9pfs with
msize being very high, the upper described kmalloc() issue would kick in and
cause an immediate kernel oops when mounting. So that's a high priority issue
that I still need to resolve.

Best regards,
Christian Schoenebeck
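If, as Stefan's analysis above says, preadv() rejects vectors with more
than IOV_MAX elements with EINVAL instead of looping internally, the
splitting has to be done by the caller. A self-contained sketch of such
a wrapper (not QEMU code, and not what hw/9pfs does today) could look
like this:

    #include <limits.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef IOV_MAX
    #define IOV_MAX 1024        /* fallback, the usual Linux value */
    #endif

    /*
     * Illustrative wrapper, not QEMU code: issue preadv() in chunks of
     * at most IOV_MAX iovecs so that an oversized vector does not fail
     * with EINVAL. A short read ends the loop early.
     */
    static ssize_t preadv_chunked(int fd, const struct iovec *iov,
                                  int iovcnt, off_t offset)
    {
        ssize_t total = 0;

        while (iovcnt > 0) {
            int n = iovcnt < IOV_MAX ? iovcnt : IOV_MAX;
            ssize_t len = preadv(fd, iov, n, offset);

            if (len < 0) {
                return total > 0 ? total : -1;
            }
            total += len;
            offset += len;

            /* Did this chunk transfer less than it asked for? */
            ssize_t full = 0;
            for (int i = 0; i < n; i++) {
                full += iov[i].iov_len;
            }
            if (len < full) {
                break;
            }

            iov += n;
            iovcnt -= n;
        }
        return total;
    }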
Re: [RFC PATCH v2 04/25] include/block/block: split header into I/O and global state API
On Thu, Oct 07, 2021 at 12:43:43PM +0100, Daniel P. Berrangé wrote:
> On Tue, Oct 05, 2021 at 10:31:54AM -0400, Emanuele Giuseppe Esposito wrote:
> > Similarly to the previous patch, split block.h
> > in block-io.h and block-global-state.h
> >
> > block-common.h contains the structures shared between
> > the two headers, and the functions that can't be categorized as
> > I/O or global state.
>
> This is nice from a code organization POV, but it doesn't do all
> that much from a code reviewer / author POV as I doubt anyone
> will remember which header file the respective APIs/structures/
> constants are in, without having to look it up each time.
>
> It would make life easier if we had distinct naming conventions
> for APIs/structs/constants in the respective headers.
>
> eg instead of "bdrv_" have "bdrv_state_" and "bdrv_io_" as
> the two naming conventions for -global-state.h and -io.h
> respectively, and only use the bare 'bdrv_' for -common.h
>
> Yes, this would be major code churn, but I think it'd make
> the code clearer to understand which will be a win over the
> long term.
>
> NB, I'm not suggesting doing a rename as part of this patch
> though. Any rename would have to be separate, and likely
> split over many patches to make it manageable.

Yes. Taking it one step further, BlockDriverState could be split into
two structs so that I/O code doesn't even have access to the struct
needed to invoke GS APIs. This is a type-safe way of enforcing the API
split.

Unfortunately that's a lot of code churn and I think the separation is
not very clean. For example, block drivers need to forward requests to
their children, so they need to traverse the graph (which we think of
as global state).

Stefan
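Purely to illustrate the type-safety idea in that last suggestion (all
names below are hypothetical, none of this exists in QEMU): if I/O code
were only ever handed the inner struct, it could not even express a
call into the GS API.

    /* Hypothetical types, for illustration only. */
    typedef struct BdrvIOState {
        int in_flight;          /* fields thread-safe I/O code may touch */
    } BdrvIOState;

    typedef struct BdrvGlobalState {
        BdrvIOState io;         /* embedded, so GS code reaches both parts */
        void *graph;            /* graph pointers, options: GS-only state */
    } BdrvGlobalState;

    /* I/O API: takes only the inner struct, callable from any thread. */
    int bdrv_io_example(BdrvIOState *s);

    /* GS API: takes the outer struct, so code that only ever received a
     * BdrvIOState pointer cannot call it. */
    int bdrv_gs_example(BdrvGlobalState *s);

The drawback mentioned above shows up immediately: a block driver that
forwards a request to its children needs the graph, which lives in the
outer struct.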
Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote: > On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote: > > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote: > > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote: > > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote: > > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote: > > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote: > > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck > > > > wrote: > > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime > > > > > > > > variable per virtio user. > > > > > > > > > > > > > > virtio user == virtio device model? > > > > > > > > > > > > Yes > > > > > > > > > > > > > > Reasons: > > > > > > > > > > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical > > > > > > > > > > > > > > > > maximum queue size possible. Which is actually the maximum > > > > > > > > queue size allowed by the virtio protocol. The appropriate > > > > > > > > value for VIRTQUEUE_MAX_SIZE would therefore be 32768: > > > > > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio > > > > > > > > -v1. > > > > > > > > 1-cs > > > > > > > > 01.h > > > > > > > > tml#x1-240006 > > > > > > > > > > > > > > > > Apparently VIRTQUEUE_MAX_SIZE was instead defined with a > > > > > > > > more or less arbitrary value of 1024 in the past, which > > > > > > > > limits the maximum transfer size with virtio to 4M > > > > > > > > (more precise: 1024 * PAGE_SIZE, with the latter typically > > > > > > > > being 4k). > > > > > > > > > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more > > > > > > > iovecs > > > > > > > than > > > > > > > that cannot be passed to host system calls (sendmsg(2), > > > > > > > pwritev(2), > > > > > > > etc). > > > > > > > > > > > > Yes, that's use case dependent. Hence the solution to opt-in if it > > > > > > is > > > > > > desired and feasible. > > > > > > > > > > > > > > (2) Additionally the current value of 1024 poses a hidden > > > > > > > > limit, > > > > > > > > > > > > > > > > invisible to guest, which causes a system hang with the > > > > > > > > following QEMU error if guest tries to exceed it: > > > > > > > > > > > > > > > > virtio: too many write descriptors in indirect table > > > > > > > > > > > > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor > > > > > > > Table > > > > > > > > says: > > > > > > > The number of descriptors in the table is defined by the queue > > > > > > > size > > > > > > > for > > > > > > > > > > > > > > this virtqueue: this is the maximum possible descriptor chain > > > > > > > length. > > > > > > > > > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says: > > > > > > > A driver MUST NOT create a descriptor chain longer than the > > > > > > > Queue > > > > > > > Size > > > > > > > of > > > > > > > > > > > > > > the device. > > > > > > > > > > > > > > Do you mean a broken/malicious guest driver that is violating > > > > > > > the > > > > > > > spec? > > > > > > > That's not a hidden limit, it's defined by the spec. 
> > > > > > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.htm > > > > > > l > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.htm > > > > > > l > > > > > > > > > > > > You can already go beyond that queue size at runtime with the > > > > > > indirection > > > > > > table. The only actual limit is the currently hard coded value of > > > > > > 1k > > > > > > pages. > > > > > > Hence the suggestion to turn that into a variable. > > > > > > > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that > > > > > operate > > > > > outsided the spec do so at their own risk. They may not be > > > > > compatible > > > > > with all device implementations. > > > > > > > > Yes, I am ware about that. And still, this practice is already done, > > > > which > > > > apparently is not limited to 9pfs. > > > > > > > > > The limit is not hidden, it's Queue Size as defined by the spec :). > > > > > > > > > > If you have a driver that is exceeding the limit, then please fix > > > > > the > > > > > driver. > > > > > > > > I absolutely understand your position, but I hope you also understand > > > > that > > > > this violation of the specs is a theoretical issue, it is not a > > > > real-life > > > > problem right now, and due to lack of man power unfortunately I have > > > > to > > > > prioritize real-life problems over theoretical ones ATM. Keep in mind > > > > that > > > > right now I am the only person working on 9pfs actively, I do this > > > > voluntarily whenever I find a free time slice, and I am not paid for > > > > it > > > > either. > > > > > > > > I don
Re: [RFC PATCH 0/4] aio: AIO_CONTEXT_ACQUIRE_GUARD() macro experiment
On Tue, Oct 05, 2021 at 08:58:03PM +0200, Philippe Mathieu-Daudé wrote: > Experiment to use glib g_autoptr/autofree features with > AIO context. > Since this is a RFC, only few examples are provided. > > TODO: Document the macros in docs/devel/multiple-iothreads.txt > > Philippe Mathieu-Daudé (4): > block/aio: Add automatically released aio_context variants > hw/scsi/scsi-disk: Use automatic AIO context lock > hw/scsi/scsi-generic: Use automatic AIO context lock > hw/block/virtio-blk: Use automatic AIO context lock > > include/block/aio.h| 24 > hw/block/virtio-blk.c | 26 -- > hw/scsi/scsi-disk.c| 13 - > hw/scsi/scsi-generic.c | 6 +++--- > 4 files changed, 43 insertions(+), 26 deletions(-) This is nice. Two things: 1. Emanuele is working on eliminating aio_context_acquire/release(), so a big effort to convert existing code to using guards could be wasted energy and cause conflicts with his patches. 2. A few callers anticipate that the AioContext of their BDS may change between acquire/release. Care needs to be taken when converting to preserve the semantics but most instances should be safe to convert. Stefan signature.asc Description: PGP signature
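For readers unfamiliar with the glib feature the RFC builds on, the following self-contained toy shows the mechanism: g_autoptr() attaches a cleanup function that runs when the variable leaves scope, which is what lets acquire/release be paired automatically. The Ctx type and helpers are invented for this sketch and are not the RFC's actual macros.

#include <glib.h>

typedef struct Ctx {
    GMutex lock;
} Ctx;

static Ctx *ctx_acquire(Ctx *ctx)
{
    g_mutex_lock(&ctx->lock);
    return ctx;
}

static void ctx_release(Ctx *ctx)
{
    g_mutex_unlock(&ctx->lock);
}

G_DEFINE_AUTOPTR_CLEANUP_FUNC(Ctx, ctx_release)

int main(void)
{
    Ctx c;
    g_mutex_init(&c.lock);

    {
        /* lock held for this scope, released automatically on exit */
        g_autoptr(Ctx) held = ctx_acquire(&c);
        (void)held;
    }

    g_mutex_clear(&c.lock);
    return 0;
}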
Re: [RFC PATCH v2 04/25] include/block/block: split header into I/O and global state API
On Thu, Oct 07, 2021 at 01:51:33PM +0200, Paolo Bonzini wrote: > On 07/10/21 12:54, Emanuele Giuseppe Esposito wrote: > > > > > > > +int bdrv_block_status(BlockDriverState *bs, int64_t offset, > > > > + int64_t bytes, int64_t *pnum, int64_t *map, > > > > + BlockDriverState **file); > > > > > > This function just called bdrv_block_status_above(), which is in the I/O > > > API. I think it's safe to move this to the I/O API or else > > > bdrv_block_status_above() shouldn't be there :). > > > > > > > It *seems* that while bdrv_block_status_above() is an I/O, probably > > running in some coroutine (from here its internal qemu_in_coroutine > > check), bdrv_block_status might be called from the main loop (or > > alternatively the function is never invoked in the tests, so the > > assertion never triggered). > > > > Maybe bdrv_block_status_above is one of the few functions that are both > > I/O and Main loop? I put it in I/O as it can't have the assertion. > > No, they are both I/O. Callers of bdrv_block_status are hw/nvme and > qemu-img.c; while the latter can be either (it does not have iothreads), > hw/nvme is definitely I/O. nbd/server.c also uses bdrv_block_status_above as part of its I/O path to serve NBD_CMD_BLOCK_STATUS. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v7 7/8] qmp: add QMP command x-debug-virtio-queue-element
On Tue, Oct 05, 2021 at 12:45:52PM -0400, Jonah Palmer wrote: > From: Laurent Vivier > > This new command shows the information of a VirtQueue element. > > Signed-off-by: Jonah Palmer > --- > +++ b/qapi/virtio.json > +## > +# @VirtioRingAvail: > +# > +# @flags: VRingAvail flags > +# > +# @idx: VRingAvail idx Is it worth being consistent... > +## > +# @VirtioQueueElement: > +# > +# @device-name: VirtIODevice name (for reference) > +# > +# @index: index of the element in the queue ...and spelling things 'index' everywhere instead of sometimes abbreviating? But overall, it looks like you did a nice job of making the command machine-parseable, while still leaving the flexibility to alter it as needed since it is only for developers under the x-debug- namespace. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH 3/5] block/nvme: Extract nvme_free_queue() from nvme_free_queue_pair()
On Wed, Oct 06, 2021 at 06:49:29PM +0200, Philippe Mathieu-Daudé wrote: > Instead of duplicating code, extract the common helper to free > a single queue. > > Signed-off-by: Philippe Mathieu-Daudé > --- > block/nvme.c | 9 +++-- > 1 file changed, 7 insertions(+), 2 deletions(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [PATCH 5/5] block/nvme: Fix memory leak from nvme_init_queue()
On Wed, Oct 06, 2021 at 06:49:31PM +0200, Philippe Mathieu-Daudé wrote: > nvme_create_queue_pair() allocates resources with qemu_vfio_dma_map(), > but we never release them. Do it in nvme_free_queue() which is called > from nvme_free_queue_pair(). > > Reported by valgrind: > > ==252858== 520,192 bytes in 1 blocks are still reachable in loss record > 8,293 of 8,302 > ==252858==at 0x4846803: memalign (vg_replace_malloc.c:1265) > ==252858==by 0x484691F: posix_memalign (vg_replace_malloc.c:1429) > ==252858==by 0xB8AFE4: qemu_try_memalign (oslib-posix.c:210) > ==252858==by 0xA9E315: nvme_create_queue_pair (nvme.c:229) > ==252858==by 0xAA0125: nvme_init (nvme.c:799) > ==252858==by 0xAA081C: nvme_file_open (nvme.c:953) > ==252858==by 0xA23DDD: bdrv_open_driver (block.c:1550) > ==252858==by 0xA24806: bdrv_open_common (block.c:1827) > ==252858==by 0xA2889B: bdrv_open_inherit (block.c:3747) > ==252858==by 0xA28DE4: bdrv_open (block.c:3840) > ==252858==by 0x9E0F8E: bds_tree_init (blockdev.c:675) > ==252858==by 0x9E7C74: qmp_blockdev_add (blockdev.c:3551) > > Fixes: bdd6a90a9e5 ("block: Add VFIO based NVMe driver") > Signed-off-by: Philippe Mathieu-Daudé > --- > block/nvme.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/block/nvme.c b/block/nvme.c > index 6e476f54b9f..903c8ffa060 100644 > --- a/block/nvme.c > +++ b/block/nvme.c > @@ -185,6 +185,7 @@ static bool nvme_init_queue(BDRVNVMeState *s, NVMeQueue > *q, > > static void nvme_free_queue(BDRVNVMeState *s, NVMeQueue *q) > { > +qemu_vfio_dma_unmap(s->vfio, q->queue); > qemu_vfree(q->queue); > } I can't figure out the issue. qemu_vfree(q->queue) was already called before this patch. How does adding qemu_vfio_dma_unmap() help with the valgrind report in the commit description? Stefan signature.asc Description: PGP signature
Re: [PATCH 4/5] block/nvme: Pass BDRVNVMeState* handle to nvme_free_queue_pair()
On Wed, Oct 06, 2021 at 06:49:30PM +0200, Philippe Mathieu-Daudé wrote: > In the next commit we want to access BDRVNVMeState from > nvme_free_queue_pair() callee. Pass it along first. > > Signed-off-by: Philippe Mathieu-Daudé > --- > block/nvme.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [PATCH 1/5] block/nvme: Automatically free qemu_memalign() with QEMU_AUTO_VFREE
On Wed, Oct 06, 2021 at 06:49:27PM +0200, Philippe Mathieu-Daudé wrote: > Since commit 4d324c0bf65 ("introduce QEMU_AUTO_VFREE") buffers > allocated by qemu_memalign() can automatically freed when using > the QEMU_AUTO_VFREE macro. Use it to simplify a bit. > > Signed-off-by: Philippe Mathieu-Daudé > --- > block/nvme.c | 11 --- > 1 file changed, 4 insertions(+), 7 deletions(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [PATCH 2/5] block/nvme: Display CQ/SQ pointer in nvme_free_queue_pair()
On Wed, Oct 06, 2021 at 06:49:28PM +0200, Philippe Mathieu-Daudé wrote: > For debugging purpose it is helpful to know the CQ/SQ pointers. > We already have a trace event in nvme_free_queue_pair(), extend > it to report these pointer addresses. > > Signed-off-by: Philippe Mathieu-Daudé > --- > block/nvme.c | 2 +- > block/trace-events | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [PATCH 5/5] block/nvme: Fix memory leak from nvme_init_queue()
On 10/7/21 15:29, Stefan Hajnoczi wrote: > On Wed, Oct 06, 2021 at 06:49:31PM +0200, Philippe Mathieu-Daudé wrote: >> nvme_create_queue_pair() allocates resources with qemu_vfio_dma_map(), >> but we never release them. Do it in nvme_free_queue() which is called >> from nvme_free_queue_pair(). >> >> Reported by valgrind: >> >> ==252858== 520,192 bytes in 1 blocks are still reachable in loss record >> 8,293 of 8,302 >> ==252858==at 0x4846803: memalign (vg_replace_malloc.c:1265) >> ==252858==by 0x484691F: posix_memalign (vg_replace_malloc.c:1429) >> ==252858==by 0xB8AFE4: qemu_try_memalign (oslib-posix.c:210) >> ==252858==by 0xA9E315: nvme_create_queue_pair (nvme.c:229) >> ==252858==by 0xAA0125: nvme_init (nvme.c:799) >> ==252858==by 0xAA081C: nvme_file_open (nvme.c:953) >> ==252858==by 0xA23DDD: bdrv_open_driver (block.c:1550) >> ==252858==by 0xA24806: bdrv_open_common (block.c:1827) >> ==252858==by 0xA2889B: bdrv_open_inherit (block.c:3747) >> ==252858==by 0xA28DE4: bdrv_open (block.c:3840) >> ==252858==by 0x9E0F8E: bds_tree_init (blockdev.c:675) >> ==252858==by 0x9E7C74: qmp_blockdev_add (blockdev.c:3551) >> >> Fixes: bdd6a90a9e5 ("block: Add VFIO based NVMe driver") >> Signed-off-by: Philippe Mathieu-Daudé >> --- >> block/nvme.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/block/nvme.c b/block/nvme.c >> index 6e476f54b9f..903c8ffa060 100644 >> --- a/block/nvme.c >> +++ b/block/nvme.c >> @@ -185,6 +185,7 @@ static bool nvme_init_queue(BDRVNVMeState *s, NVMeQueue >> *q, >> >> static void nvme_free_queue(BDRVNVMeState *s, NVMeQueue *q) >> { >> +qemu_vfio_dma_unmap(s->vfio, q->queue); >> qemu_vfree(q->queue); >> } > > I can't figure out the issue. qemu_vfree(q->queue) was already called > before this patch. How does adding qemu_vfio_dma_unmap() help with the > valgrind report in the commit description? You are right, I think I didn't select the correct record between the 8302 reported by valgrind. I will revisit, thanks.
Re: [RFC PATCH v2 06/25] include/block/block_int: split header into I/O and global state API
On Thu, Oct 07, 2021 at 01:30:42PM +0200, Emanuele Giuseppe Esposito wrote: > > > On 07/10/2021 12:52, Stefan Hajnoczi wrote: > > On Tue, Oct 05, 2021 at 10:31:56AM -0400, Emanuele Giuseppe Esposito wrote: > > > +int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, int64_t > > > src_offset, > > > + BdrvChild *dst, int64_t > > > dst_offset, > > > + int64_t bytes, > > > + BdrvRequestFlags read_flags, > > > + BdrvRequestFlags write_flags); > > > +int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t > > > src_offset, > > > + BdrvChild *dst, int64_t > > > dst_offset, > > > + int64_t bytes, > > > + BdrvRequestFlags read_flags, > > > + BdrvRequestFlags write_flags); > > > > These look like I/O APIs to me. Are they in the GS API because only > > qemu-img.c call copy_range? I thought SCSI emulation might call it too, > > but grep says otherwise. > > SCSI (iscsi.c) implements the function pointer > (*bdrv_co_copy_range_from/to), but does not call the function itself. > However, later in the patches I put the function pointer as I/O. I meant the SCSI target emulation in hw/scsi/ where the SCSI commands that require copy_range could be handled. > These two functions are only tested by test-replication and thus are always > under BQL when tested. But after looking at them again, and since they > internally use only I/O APIs, I think we can move them to the I/O API. Okay, great. Stefan signature.asc Description: PGP signature
Re: [RFC PATCH v2 08/25] block: introduce assert_bdrv_graph_writable
On 07/10/2021 14:02, Paolo Bonzini wrote: --- a/block/io.c +++ b/block/io.c @@ -739,6 +739,11 @@ void bdrv_drain_all(void) bdrv_drain_all_end(); } +void assert_bdrv_graph_writable(BlockDriverState *bs) +{ + g_assert(qatomic_read(&bs->quiesce_counter) > 0 || qemu_in_main_thread()); +} Hmm, wait - I think this should be an "&&", not an OR? You are right, && makes more sense. But as you said, using an AND will make the assertion fail in multiple places, because the necessary drains are missing. So I am going to remove the check for drains and leave it as a TODO. I will handle this case in another patch. Thank you, Emanuele
Re: [RFC PATCH 0/4] aio: AIO_CONTEXT_ACQUIRE_GUARD() macro experiment
On 10/7/21 15:15, Stefan Hajnoczi wrote: > On Tue, Oct 05, 2021 at 08:58:03PM +0200, Philippe Mathieu-Daudé wrote: >> Experiment to use glib g_autoptr/autofree features with >> AIO context. >> Since this is a RFC, only few examples are provided. >> >> TODO: Document the macros in docs/devel/multiple-iothreads.txt >> >> Philippe Mathieu-Daudé (4): >> block/aio: Add automatically released aio_context variants >> hw/scsi/scsi-disk: Use automatic AIO context lock >> hw/scsi/scsi-generic: Use automatic AIO context lock >> hw/block/virtio-blk: Use automatic AIO context lock >> >> include/block/aio.h| 24 >> hw/block/virtio-blk.c | 26 -- >> hw/scsi/scsi-disk.c| 13 - >> hw/scsi/scsi-generic.c | 6 +++--- >> 4 files changed, 43 insertions(+), 26 deletions(-) > > This is nice. Two things: > > 1. Emanuele is working on eliminating aio_context_acquire/release(), so >a big effort to convert existing code to using guards could be wasted >energy and cause conflicts with his patches. Thanks for the update, I'll wait for Emanuele's effort to land. > 2. A few callers anticipate that the AioContext of their BDS may change >between acquire/release. Care needs to be taken when converting to >preserve the semantics but most instances should be safe to convert. > > Stefan >
Re: [RFC PATCH v2 08/25] block: introduce assert_bdrv_graph_writable
On Thu, Oct 07, 2021 at 03:47:42PM +0200, Emanuele Giuseppe Esposito wrote: > > > On 07/10/2021 14:02, Paolo Bonzini wrote: > > > --- a/block/io.c > > > +++ b/block/io.c > > > @@ -739,6 +739,11 @@ void bdrv_drain_all(void) > > > bdrv_drain_all_end(); > > > } > > > +void assert_bdrv_graph_writable(BlockDriverState *bs) > > > +{ > > > + g_assert(qatomic_read(&bs->quiesce_counter) > 0 || > > > qemu_in_main_thread()); > > > +} > > > > Hmm, wait - I think this should be an "&&", not an OR? > > > > You are right, && makes more sense. But as you said, using an AND will make > the assertion fail in multiple places, because the necessary drains are > missing. So I am going to remove the check for drains and leave it as todo. > I will handle this case in another patch. BTW when an assertion expression tests unrelated things it helps to use separate assert() calls instead of &&. That way it's obvious which sub-expression failed from the error message and it's not necessary to inspect the coredump. In other words: assert(a && b) -> Assertion `a && b` failed. This doesn't tell me whether it was a or b that was false. The assertion failure is easier to diagnose if you split it: assert(a); -> Assertion `a` failed. assert(b); -> Assertion `b` failed. Stefan signature.asc Description: PGP signature
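Applied to the assertion under discussion, the split version would look roughly like this (a sketch of the suggestion only, not the actual patch):

void assert_bdrv_graph_writable(BlockDriverState *bs)
{
    /* Two separate asserts, so a failure message points at the exact condition. */
    g_assert(qatomic_read(&bs->quiesce_counter) > 0);
    g_assert(qemu_in_main_thread());
}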
Re: [RFC PATCH v2 07/25] assertions for block_int global state API
On Tue, Oct 05, 2021 at 10:31:57AM -0400, Emanuele Giuseppe Esposito wrote: > Signed-off-by: Emanuele Giuseppe Esposito > --- > block.c | 17 + > block/backup.c | 1 + > block/block-backend.c | 3 +++ > block/commit.c | 2 ++ > block/dirty-bitmap.c| 2 ++ > block/io.c | 6 ++ > block/mirror.c | 4 > block/monitor/bitmap-qmp-cmds.c | 6 ++ > block/stream.c | 2 ++ > blockdev.c | 7 +++ > 10 files changed, 50 insertions(+) Except copy_range: Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 09/25] include/block/blockjob_int.h: split header into I/O and GS API
On Tue, Oct 05, 2021 at 10:31:59AM -0400, Emanuele Giuseppe Esposito wrote: > Since the I/O functions are not many, keep a single file. > Also split the function pointers in BlockJobDriver. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > include/block/blockjob_int.h | 55 > 1 file changed, 55 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 10/25] assertions for blockjob_int.h
On Tue, Oct 05, 2021 at 10:32:00AM -0400, Emanuele Giuseppe Esposito wrote: > Signed-off-by: Emanuele Giuseppe Esposito > --- > blockjob.c | 4 > 1 file changed, 4 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 11/25] include/block/blockjob.h: global state API
On Tue, Oct 05, 2021 at 10:32:01AM -0400, Emanuele Giuseppe Esposito wrote: > blockjob functions run always under the BQL lock. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > include/block/blockjob.h | 23 +++ > 1 file changed, 23 insertions(+) > > diff --git a/include/block/blockjob.h b/include/block/blockjob.h > index d200f33c10..3bf384f8bf 100644 > --- a/include/block/blockjob.h > +++ b/include/block/blockjob.h > @@ -77,6 +77,27 @@ typedef struct BlockJob { > GSList *nodes; > } BlockJob; > > +/* > + * Global state (GS) API. These functions run under the BQL lock. > + * > + * If a function modifies the graph, it also uses drain and/or > + * aio_context_acquire/release to be sure it has unique access. > + * aio_context locking is needed together with BQL because of > + * the thread-safe I/O API that concurrently runs and accesses > + * the graph without the BQL. > + * > + * It is important to note that not all of these functions are > + * necessarily limited to running under the BQL, but they would > + * require additional auditing and may small thread-safety changes > + * to move them into the I/O API. Often it's not worth doing that > + * work since the APIs are only used with the BQL held at the > + * moment, so they have been placed in the GS API (for now). > + * > + * All functions below must use this assertion: > + * g_assert(qemu_in_main_thread()); > + * to catch when they are accidentally called without the BQL. > + */ This comment is duplicated in many places. I suggest explaining it in one place and using references in the other files: /* * Global state (GS) API. These functions run under the BQL lock. * * See include/block/block.h for more information about the GS API. */ signature.asc Description: PGP signature
Re: [RFC PATCH v2 12/25] assertions for blockjob.h global state API
On Tue, Oct 05, 2021 at 10:32:02AM -0400, Emanuele Giuseppe Esposito wrote: > Signed-off-by: Emanuele Giuseppe Esposito > --- > blockjob.c | 10 ++ > 1 file changed, 10 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 14/25] assertions for blockdev.h global state API
On Tue, Oct 05, 2021 at 10:32:04AM -0400, Emanuele Giuseppe Esposito wrote: > Signed-off-by: Emanuele Giuseppe Esposito > --- > block/block-backend.c | 2 ++ > blockdev.c| 12 > 2 files changed, 14 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 13/25] include/systemu/blockdev.h: global state API
On Tue, Oct 05, 2021 at 10:32:03AM -0400, Emanuele Giuseppe Esposito wrote: > blockdev functions run always under the BQL lock. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > include/sysemu/blockdev.h | 35 ++- > 1 file changed, 30 insertions(+), 5 deletions(-) > > diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h > index 32c2d6023c..28233f6b63 100644 > --- a/include/sysemu/blockdev.h > +++ b/include/sysemu/blockdev.h > @@ -38,24 +38,49 @@ struct DriveInfo { > QTAILQ_ENTRY(DriveInfo) next; > }; > > -DriveInfo *blk_legacy_dinfo(BlockBackend *blk); > +/* > + * Global state (GS) API. These functions run under the BQL lock. > + * > + * If a function modifies the graph, it also uses drain and/or > + * aio_context_acquire/release to be sure it has unique access. > + * aio_context locking is needed together with BQL because of > + * the thread-safe I/O API that concurrently runs and accesses > + * the graph without the BQL. > + * > + * It is important to note that not all of these functions are > + * necessarily limited to running under the BQL, but they would > + * require additional auditing and may small thread-safety changes > + * to move them into the I/O API. Often it's not worth doing that > + * work since the APIs are only used with the BQL held at the > + * moment, so they have been placed in the GS API (for now). > + * > + * All functions in this header must use this assertion: > + * g_assert(qemu_in_main_thread()); > + * to catch when they are accidentally called without the BQL. > + */ > + > DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo); > BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo); > > void override_max_devs(BlockInterfaceType type, int max_devs); > > DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit); > -void drive_mark_claimed_by_board(void); This function prototype is no longer used. Please make a note of this in the commit description so reviewers know the deletion is intentional and the reason for it. (It could have been an accident so I had to grep the code to figure out why you did this.) Otherwise: Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 16/25] block/backup-top.h: global state API + assertions
On Tue, Oct 05, 2021 at 10:32:06AM -0400, Emanuele Giuseppe Esposito wrote: > backup-top functions always run under BQL lock. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > block/backup-top.c | 2 ++ > block/backup-top.h | 11 +++ > 2 files changed, 13 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 15/25] include/block/snapshot: global state API + assertions
On Tue, Oct 05, 2021 at 10:32:05AM -0400, Emanuele Giuseppe Esposito wrote: > Snapshots run also under the BQL lock, so they all are > in the global state API. The aiocontext lock that they hold > is currently an overkill and in future could be removed. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > block/snapshot.c | 28 > include/block/snapshot.h | 21 + > migration/savevm.c | 2 ++ > 3 files changed, 51 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 17/25] include/block/transactions: global state API + assertions
On Tue, Oct 05, 2021 at 10:32:07AM -0400, Emanuele Giuseppe Esposito wrote: > transactions run always under the BQL lock, so they are all > in the global state API. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > include/qemu/transactions.h | 24 > util/transactions.c | 4 > 2 files changed, 28 insertions(+) Hmm...not sure about this. Maybe Vladimir can share his thoughts. This seems like a library that can be used in various situations. It has no connection to the BQL. There's no need to declare it GS or I/O because it's just a utility just like QEMUIOVector, etc. signature.asc Description: PGP signature
Re: [RFC PATCH v2 18/25] block/coroutines: I/O API
On Tue, Oct 05, 2021 at 10:32:08AM -0400, Emanuele Giuseppe Esposito wrote: > block coroutines functions run in different aiocontext, and are > not protected by the BQL. Therefore are I/O. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > block/coroutines.h | 6 ++ > 1 file changed, 6 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 19/25] block_int-common.h: split function pointers in BlockDriver
On Tue, Oct 05, 2021 at 10:32:09AM -0400, Emanuele Giuseppe Esposito wrote: > Similar to the header split, also the function pointers in BlockDriver > can be split in I/O and global state. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > include/block/block_int-common.h | 472 --- > 1 file changed, 251 insertions(+), 221 deletions(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers
On Tue, Oct 05, 2021 at 10:32:12AM -0400, Emanuele Giuseppe Esposito wrote: > Signed-off-by: Emanuele Giuseppe Esposito > --- > block.c | 10 ++ > 1 file changed, 10 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 23/25] block-backend-common.h: split function pointers in BlockDevOps
On Tue, Oct 05, 2021 at 10:32:13AM -0400, Emanuele Giuseppe Esposito wrote: > Assertions in the callers of the function pointers are already > added by previous patches. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > include/sysemu/block-backend-common.h | 42 +++ > 1 file changed, 37 insertions(+), 5 deletions(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [PATCH v2 05/17] python/aqmp: Add dict conversion method to Greeting object
On Wed, Sep 22, 2021 at 08:49:26PM -0400, John Snow wrote: > The iotests interface expects to return the greeting as a dict; AQMP > offers it as a rich object. > > Signed-off-by: John Snow > --- > python/qemu/aqmp/models.py | 13 + > 1 file changed, 13 insertions(+) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v2 04/17] python/aqmp: add send_fd_scm
On Wed, Sep 22, 2021 at 08:49:25PM -0400, John Snow wrote: > The single space is indeed required to successfully transmit the file > descriptor to QEMU. Sending fds requires a payload of at least one byte, but I don't think that qemu cares which byte. Thus, while your choice of space is fine, the commit message may be a bit misleading at implying it must be space. > > Python 3.11 removes support for calling sendmsg directly from a > transport's socket. There is no other interface for doing this, our use > case is, I suspect, "quite unique". > > As far as I can tell, this is safe to do -- send_fd_scm is a synchronous > function and we can be guaranteed that the async coroutines will *not* be > running when it is invoked. In testing, it works correctly. > > I investigated quite thoroughly the possibility of creating my own > asyncio Transport (The class that ultimately manages the raw socket > object) so that I could manage the socket myself, but this is so wildly > invasive and unportable I scrapped the idea. It would involve a lot of > copy-pasting of various python utilities and classes just to re-create > the same infrastructure, and for extremely little benefit. Nah. > > Just boldly void the warranty instead, while I try to follow up on > https://bugs.python.org/issue43232 Bummer that we have to do that, but at least you are documenting the problems and pursuing a remedy upstream. > > Signed-off-by: John Snow > --- > python/qemu/aqmp/qmp_client.py | 22 ++ > 1 file changed, 22 insertions(+) > > diff --git a/python/qemu/aqmp/qmp_client.py b/python/qemu/aqmp/qmp_client.py > index d2ad7459f9f..f987da02eb0 100644 > --- a/python/qemu/aqmp/qmp_client.py > +++ b/python/qemu/aqmp/qmp_client.py > @@ -9,6 +9,8 @@ > > import asyncio > import logging > +import socket > +import struct > from typing import ( > Dict, > List, > @@ -624,3 +626,23 @@ async def execute(self, cmd: str, > """ > msg = self.make_execute_msg(cmd, arguments, oob=oob) > return await self.execute_msg(msg) > + > +@upper_half > +@require(Runstate.RUNNING) > +def send_fd_scm(self, fd: int) -> None: > +""" > +Send a file descriptor to the remote via SCM_RIGHTS. > +""" > +assert self._writer is not None > +sock = self._writer.transport.get_extra_info('socket') > + > +if sock.family != socket.AF_UNIX: > +raise AQMPError("Sending file descriptors requires a UNIX > socket.") > + > +# Void the warranty sticker. > +# Access to sendmsg in asyncio is scheduled for removal in Python > 3.11. > +sock = sock._sock # pylint: disable=protected-access > +sock.sendmsg( > +[b' '], > +[(socket.SOL_SOCKET, socket.SCM_RIGHTS, struct.pack('@i', fd))] > +) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [RFC PATCH v2 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
On Tue, Oct 05, 2021 at 10:32:10AM -0400, Emanuele Giuseppe Esposito wrote: > Signed-off-by: Emanuele Giuseppe Esposito > --- > block.c | 16 > 1 file changed, 16 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 21/25] block_int-common.h: split function pointers in BdrvChildClass
On Tue, Oct 05, 2021 at 10:32:11AM -0400, Emanuele Giuseppe Esposito wrote: > Signed-off-by: Emanuele Giuseppe Esposito > --- > include/block/block_int-common.h | 65 ++-- > 1 file changed, 46 insertions(+), 19 deletions(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [RFC PATCH v2 24/25] job.h: split function pointers in JobDriver
On Tue, Oct 05, 2021 at 10:32:14AM -0400, Emanuele Giuseppe Esposito wrote: > The job API will be handled separately in another serie. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > include/qemu/job.h | 31 +++ > 1 file changed, 31 insertions(+) > > diff --git a/include/qemu/job.h b/include/qemu/job.h > index 41162ed494..c236c43026 100644 > --- a/include/qemu/job.h > +++ b/include/qemu/job.h > @@ -169,12 +169,21 @@ typedef struct Job { > * Callbacks and other information about a Job driver. > */ > struct JobDriver { > + > +/* Fields initialized in struct definition and never changed. */ > + > /** Derived Job struct size */ > size_t instance_size; > > /** Enum describing the operation */ > JobType job_type; > > +/* > + * I/O API functions. These functions are thread-safe, and therefore > + * can run in any thread as long as they have called > + * aio_context_acquire/release(). > + */ This comment described the block layer well but job.h is more generic. I don't think these functions are necessarily thread-safe (they shouldn't be invoked from multiple threads at the same time). It's just that they are run without regard to the BQL and may run in an arbitrary thread. Other than that: Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
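A possible rewording of the quoted comment along the lines of that remark might be (a sketch only, not the committed text):

struct JobDriver {
    /* ... */

    /*
     * Functions in this section are not called with the BQL held and may
     * run in an arbitrary AioContext thread.  They are not thread-safe in
     * the general sense: they are simply never invoked concurrently for
     * the same job.
     */
    int coroutine_fn (*run)(Job *job, Error **errp);

    /* ... */
};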
Re: [RFC PATCH v2 25/25] job.h: assertions in the callers of JobDriver function pointers
On Tue, Oct 05, 2021 at 10:32:15AM -0400, Emanuele Giuseppe Esposito wrote: > Signed-off-by: Emanuele Giuseppe Esposito > --- > job.c | 9 + > 1 file changed, 9 insertions(+) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [PATCH v2 06/17] python/aqmp: Reduce severity of EOFError-caused loop terminations
On Wed, Sep 22, 2021 at 08:49:27PM -0400, John Snow wrote: > When we encounter an EOFError, we don't know if it's an "error" in the > perspective of the user of the library yet. Therefore, we should not log > it as an error. Reduce the severity of this logging message to "INFO" to > indicate that it's something that we expect to occur during the normal > operation of the library. > > Signed-off-by: John Snow > --- > python/qemu/aqmp/protocol.py | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v2 12/17] python/machine: Handle QMP errors on close more meticulously
On Wed, Sep 22, 2021 at 08:49:33PM -0400, John Snow wrote: > To use the AQMP backend, Machine just needs to be a little more diligent > about what happens when closing a QMP connection. The operation is no > longer a freebie in the async world; it may return errors encountered in > the async bottom half on incoming message receipt, etc. > > (AQMP's disconnect, ultimately, serves as the quiescence point where all > async contexts are gathered together, and any final errors reported at > that point.) > > Because async QMP continues to check for messages asynchronously, it's > almost certainly likely that the loop will have exited due to EOF after > issuing the last 'quit' command. That error will ultimately be bubbled > up when attempting to close the QMP connection. The manager class here > then is free to discard it -- if it was expected. > > Signed-off-by: John Snow > > --- > > Yes, I regret that this class has become quite a dumping ground for > complexity around the exit path. It's in need of a refactor to help > separate out the exception handling and cleanup mechanisms from the > VM-related stuff, but it's not a priority to do that just yet -- but > it's on the list. > > Signed-off-by: John Snow This second S-o-b won't matter. Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v2 07/17] python/aqmp: Disable logging messages by default
On Wed, Sep 22, 2021 at 08:49:28PM -0400, John Snow wrote: > AQMP is a library, and ideally it should not print error diagnostics > unless a user opts into seeing them. By default, Python will print all > WARNING, ERROR or CRITICAL messages to screen if no logging > configuration has been created by a client application. > > In AQMP's case, ERROR logging statements are used to report additional > detail about runtime failures that will also eventually be reported to the > client library via an Exception, so these messages should not be > rendered by default. > > (Why bother to have them at all, then? In async contexts, there may be > multiple Exceptions and we are only able to report one of them back to > the client application. It is not reasonably easy to predict ahead of > time if one or more of these Exceptions will be squelched. Therefore, > it's useful to log intermediate failures to help make sense of the > ultimate, resulting failure.) > > Add a NullHandler that will suppress these messages until a client > application opts into logging via logging.basicConfig or similar. Note > that upon calling basicConfig(), this handler will *not* suppress these > messages from being displayed by the client's configuration. > > Signed-off-by: John Snow > --- > python/qemu/aqmp/__init__.py | 4 > 1 file changed, 4 insertions(+) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
On Thu, Oct 07, 2021 at 03:09:16PM +0200, Christian Schoenebeck wrote: > On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote: > > On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote: > > > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote: > > > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote: > > > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote: > > > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck > wrote: > > > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote: > > > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck > > > > > > wrote: > > > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime > > > > > > > > > variable per virtio user. > > > > > > > > > > > > > > > > virtio user == virtio device model? > > > > > > > > > > > > > > Yes > > > > > > > > > > > > > > > > Reasons: > > > > > > > > > > > > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical > > > > > > > > > > > > > > > > > > maximum queue size possible. Which is actually the maximum > > > > > > > > > queue size allowed by the virtio protocol. The appropriate > > > > > > > > > value for VIRTQUEUE_MAX_SIZE would therefore be 32768: > > > > > > > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio > > > > > > > > > -v1. > > > > > > > > > 1-cs > > > > > > > > > 01.h > > > > > > > > > tml#x1-240006 > > > > > > > > > > > > > > > > > > Apparently VIRTQUEUE_MAX_SIZE was instead defined with a > > > > > > > > > more or less arbitrary value of 1024 in the past, which > > > > > > > > > limits the maximum transfer size with virtio to 4M > > > > > > > > > (more precise: 1024 * PAGE_SIZE, with the latter typically > > > > > > > > > being 4k). > > > > > > > > > > > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more > > > > > > > > iovecs > > > > > > > > than > > > > > > > > that cannot be passed to host system calls (sendmsg(2), > > > > > > > > pwritev(2), > > > > > > > > etc). > > > > > > > > > > > > > > Yes, that's use case dependent. Hence the solution to opt-in if it > > > > > > > is > > > > > > > desired and feasible. > > > > > > > > > > > > > > > > (2) Additionally the current value of 1024 poses a hidden > > > > > > > > > limit, > > > > > > > > > > > > > > > > > > invisible to guest, which causes a system hang with the > > > > > > > > > following QEMU error if guest tries to exceed it: > > > > > > > > > > > > > > > > > > virtio: too many write descriptors in indirect table > > > > > > > > > > > > > > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor > > > > > > > > Table > > > > > > > > > > says: > > > > > > > > The number of descriptors in the table is defined by the queue > > > > > > > > size > > > > > > > > for > > > > > > > > > > > > > > > > this virtqueue: this is the maximum possible descriptor chain > > > > > > > > length. > > > > > > > > > > > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says: > > > > > > > > A driver MUST NOT create a descriptor chain longer than the > > > > > > > > Queue > > > > > > > > Size > > > > > > > > of > > > > > > > > > > > > > > > > the device. > > > > > > > > > > > > > > > > Do you mean a broken/malicious guest driver that is violating > > > > > > > > the > > > > > > > > spec? > > > > > > > > That's not a hidden limit, it's defined by the spec. 
> > > > > > > > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.htm > > > > > > > l > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.htm > > > > > > > l > > > > > > > > > > > > > > You can already go beyond that queue size at runtime with the > > > > > > > indirection > > > > > > > table. The only actual limit is the currently hard coded value of > > > > > > > 1k > > > > > > > pages. > > > > > > > Hence the suggestion to turn that into a variable. > > > > > > > > > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that > > > > > > operate > > > > > > outsided the spec do so at their own risk. They may not be > > > > > > compatible > > > > > > with all device implementations. > > > > > > > > > > Yes, I am ware about that. And still, this practice is already done, > > > > > which > > > > > apparently is not limited to 9pfs. > > > > > > > > > > > The limit is not hidden, it's Queue Size as defined by the spec :). > > > > > > > > > > > > If you have a driver that is exceeding the limit, then please fix > > > > > > the > > > > > > driver. > > > > > > > > > > I absolutely understand your position, but I hope you also understand > > > > > that > > > > > this violation of the specs is a theoretical issue, it is not a > > > > > real-life > > > > > problem right now, and due to lack of man power unfortunately I have > >
[PULL v2 01/15] block/backup: avoid integer overflow of `max-workers`
From: Stefano Garzarella QAPI generates `struct BackupPerf` where `max-workers` value is stored in an `int64_t` variable. But block_copy_async(), and the underlying code, uses an `int` parameter. At the end that variable is used to initialize `max_busy_tasks` in block/aio_task.c causing the following assertion failure if a value greater than INT_MAX(2147483647) is used: ../block/aio_task.c:63: aio_task_pool_wait_one: Assertion `pool->busy_tasks > 0' failed. Let's check that `max-workers` doesn't exceed INT_MAX and print an error in that case. Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2009310 Signed-off-by: Stefano Garzarella Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <20211005161157.282396-2-sgarz...@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy --- block/backup.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/block/backup.c b/block/backup.c index 687d2882bc..8b072db5d9 100644 --- a/block/backup.c +++ b/block/backup.c @@ -407,8 +407,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, return NULL; } -if (perf->max_workers < 1) { -error_setg(errp, "max-workers must be greater than zero"); +if (perf->max_workers < 1 || perf->max_workers > INT_MAX) { +error_setg(errp, "max-workers must be between 1 and %d", INT_MAX); return NULL; } -- 2.31.1
[PULL v2 00/15] jobs: mirror: Handle errors after READY cancel
The following changes since commit 9618c5badaa8eed25259cf095ff880efb939fbe7: Merge remote-tracking branch 'remotes/vivier/tags/trivial-branch-for-6.2-pull-request' into staging (2021-10-04 16:27:35 -0700) are available in the Git repository at: https://src.openvz.org/scm/~vsementsov/qemu.git tags/pull-jobs-2021-10-07-v2 for you to fetch changes up to 2451f72527d8760566a499b7513e17aaceb0f131: iotests: Add mirror-ready-cancel-error test (2021-10-07 10:42:50 +0200) mirror: Handle errors after READY cancel v2: add small fix by Stefano, Hanna's series fixed Hanna Reitz (13): job: Context changes in job_completed_txn_abort() mirror: Keep s->synced on error mirror: Drop s->synced job: Force-cancel jobs in a failed transaction job: @force parameter for job_cancel_sync() jobs: Give Job.force_cancel more meaning job: Do not soft-cancel after a job is done job: Add job_cancel_requested() mirror: Use job_is_cancelled() mirror: Check job_is_cancelled() earlier mirror: Stop active mirroring after force-cancel mirror: Do not clear .cancelled iotests: Add mirror-ready-cancel-error test Stefano Garzarella (2): block/backup: avoid integer overflow of `max-workers` block/aio_task: assert `max_busy_tasks` is greater than 0 include/qemu/job.h | 29 ++--- block/aio_task.c | 2 + block/backup.c | 7 ++- block/mirror.c | 56 + block/replication.c| 4 +- blockdev.c | 4 +- job.c | 94 +++- tests/unit/test-blockjob.c | 2 +- tests/qemu-iotests/109.out | 60 -- tests/qemu-iotests/tests/mirror-ready-cancel-error | 143 +++ tests/qemu-iotests/tests/mirror-ready-cancel-error.out | 5 ++ tests/qemu-iotests/tests/qsd-jobs.out | 2 +- 12 files changed, 316 insertions(+), 92 deletions(-) create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out -- 2.31.1
[PULL v2 02/15] block/aio_task: assert `max_busy_tasks` is greater than 0
From: Stefano Garzarella All code in block/aio_task.c expects `max_busy_tasks` to always be greater than 0. Assert this condition during the AioTaskPool creation where `max_busy_tasks` is set. Signed-off-by: Stefano Garzarella Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <20211005161157.282396-3-sgarz...@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy --- block/aio_task.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/block/aio_task.c b/block/aio_task.c index 88989fa248..9bd17ea2c1 100644 --- a/block/aio_task.c +++ b/block/aio_task.c @@ -98,6 +98,8 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks) { AioTaskPool *pool = g_new0(AioTaskPool, 1); +assert(max_busy_tasks > 0); + pool->main_co = qemu_coroutine_self(); pool->max_busy_tasks = max_busy_tasks; -- 2.31.1
[PULL v2 10/15] job: Add job_cancel_requested()
From: Hanna Reitz Most callers of job_is_cancelled() actually want to know whether the job is on its way to immediate termination. For example, we refuse to pause jobs that are cancelled; but this only makes sense for jobs that are really actually cancelled. A mirror job that is cancelled during READY with force=false should absolutely be allowed to pause. This "cancellation" (which is actually a kind of completion) may take an indefinite amount of time, and so should behave like any job during normal operation. For example, with on-target-error=stop, the job should stop on write errors. (In contrast, force-cancelled jobs should not get write errors, as they should just terminate and not do further I/O.) Therefore, redefine job_is_cancelled() to only return true for jobs that are force-cancelled (which as of HEAD^ means any job that interprets the cancellation request as a request for immediate termination), and add job_cancel_requested() as the general variant, which returns true for any jobs which have been requested to be cancelled, whether it be immediately or after an arbitrarily long completion phase. Finally, here is a justification for how different job_is_cancelled() invocations are treated by this patch: - block/mirror.c (mirror_run()): - The first invocation is a while loop that should loop until the job has been cancelled or scheduled for completion. What kind of cancel does not matter, only the fact that the job is supposed to end. - The second invocation wants to know whether the job has been soft-cancelled. Calling job_cancel_requested() is a bit too broad, but if the job were force-cancelled, we should leave the main loop as soon as possible anyway, so this should not matter here. - The last two invocations already check force_cancel, so they should continue to use job_is_cancelled(). - block/backup.c, block/commit.c, block/stream.c, anything in tests/: These jobs know only force-cancel, so there is no difference between job_is_cancelled() and job_cancel_requested(). We can continue using job_is_cancelled(). - job.c: - job_pause_point(), job_yield(), job_sleep_ns(): Only force-cancelled jobs should be prevented from being paused. Continue using job_is_cancelled(). - job_update_rc(), job_finalize_single(), job_finish_sync(): These functions are all called after the job has left its main loop. The mirror job (the only job that can be soft-cancelled) will clear .cancelled before leaving the main loop if it has been soft-cancelled. Therefore, these functions will observe .cancelled to be true only if the job has been force-cancelled. We can continue to use job_is_cancelled(). (Furthermore, conceptually, a soft-cancelled mirror job should not report to have been cancelled. It should report completion (see also the block-job-cancel QAPI documentation). Therefore, it makes sense for these functions not to distinguish between a soft-cancelled mirror job and a job that has completed as normal.) - job_completed_txn_abort(): All jobs other than @job have been force-cancelled. job_is_cancelled() must be true for them. Regarding @job itself: job_completed_txn_abort() is mostly called when the job's return value is not 0. A soft-cancelled mirror has a return value of 0, and so will not end up here then. However, job_cancel() invokes job_completed_txn_abort() if the job has been deferred to the main loop, which is mostly the case for completed jobs (which skip the assertion), but not for sure. To be safe, use job_cancel_requested() in this assertion. 
- job_complete(): This is function eventually invoked by the user (through qmp_block_job_complete() or qmp_job_complete(), or job_complete_sync(), which comes from qemu-img). The intention here is to prevent a user from invoking job-complete after the job has been cancelled. This should also apply to soft cancelling: After a mirror job has been soft-cancelled, the user should not be able to decide otherwise and have it complete as normal (i.e. pivoting to the target). - job_cancel(): Both functions are equivalent (see comment there), but we want to use job_is_cancelled(), because this shows that we call job_completed_txn_abort() only for force-cancelled jobs. (As explained for job_update_rc(), soft-cancelled jobs should be treated as if they have completed as normal.) Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462 Signed-off-by: Hanna Reitz Reviewed-by: Eric Blake Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <20211006151940.214590-9-hre...@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy --- include/qemu/job.h | 8 +++- block/mirror.c | 10 -- job.c | 14 -- 3 files changed, 23 insertions(+), 9 deletions(-) diff --git a/include/qemu/job.h b/include/qemu/job.h index 90f6abbd6a..6e67b69
[PULL v2 09/15] job: Do not soft-cancel after a job is done
From: Hanna Reitz The only job that supports a soft cancel mode is the mirror job, and in such a case it resets its .cancelled field before it leaves its .run() function, so it does not really count as cancelled. However, it is possible to cancel the job after .run() returns and before job_exit() (which is run in the main loop) is executed. Then, .cancelled would still be true and the job would count as cancelled. This does not seem to be in the interest of the mirror job, so adjust job_cancel_async() to not set .cancelled in such a case, and job_cancel() to not invoke job_completed_txn_abort(). Signed-off-by: Hanna Reitz Reviewed-by: Eric Blake Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <20211006151940.214590-8-hre...@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy --- job.c | 25 + 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/job.c b/job.c index 81c016eb10..44e741ebd4 100644 --- a/job.c +++ b/job.c @@ -734,9 +734,19 @@ static void job_cancel_async(Job *job, bool force) assert(job->pause_count > 0); job->pause_count--; } -job->cancelled = true; -/* To prevent 'force == false' overriding a previous 'force == true' */ -job->force_cancel |= force; + +/* + * Ignore soft cancel requests after the job is already done + * (We will still invoke job->driver->cancel() above, but if the + * job driver supports soft cancelling and the job is done, that + * should be a no-op, too. We still call it so it can override + * @force.) + */ +if (force || !job->deferred_to_main_loop) { +job->cancelled = true; +/* To prevent 'force == false' overriding a previous 'force == true' */ +job->force_cancel |= force; +} } static void job_completed_txn_abort(Job *job) @@ -963,7 +973,14 @@ void job_cancel(Job *job, bool force) if (!job_started(job)) { job_completed(job); } else if (job->deferred_to_main_loop) { -job_completed_txn_abort(job); +/* + * job_cancel_async() ignores soft-cancel requests for jobs + * that are already done (i.e. deferred to the main loop). We + * have to check again whether the job is really cancelled. + */ +if (job_is_cancelled(job)) { +job_completed_txn_abort(job); +} } else { job_enter(job); } -- 2.31.1
Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote: > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote: > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote: > > > At the moment the maximum transfer size with virtio is limited to 4M > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum > > > theoretical possible transfer size of 128M (32k pages) according to the > > > virtio specs: > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html# > > > x1-240006 > > Hi Christian, > > I took a quick look at the code: > > > > - The Linux 9p driver restricts descriptor chains to 128 elements > > (net/9p/trans_virtio.c:VIRTQUEUE_NUM) > > Yes, that's the limitation that I am about to remove (WIP); current kernel > patches: > https://lore.kernel.org/netdev/cover.1632327421.git.linux_...@crudebyte.com/ I haven't read the patches yet but I'm concerned that today the driver is pretty well-behaved and this new patch series introduces a spec violation. Not fixing existing spec violations is okay, but adding new ones is a red flag. I think we need to figure out a clean solution. > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail > > with EINVAL when called with more than IOV_MAX iovecs > > (hw/9pfs/9p.c:v9fs_read()) > > Hmm, which makes me wonder why I never encountered this error during testing. > > Most people will use the 9p qemu 'local' fs driver backend in practice, so > that v9fs_read() call would translate for most people to this implementation > on QEMU side (hw/9p/9p-local.c): > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs, > const struct iovec *iov, > int iovcnt, off_t offset) > { > #ifdef CONFIG_PREADV > return preadv(fs->fd, iov, iovcnt, offset); > #else > int err = lseek(fs->fd, offset, SEEK_SET); > if (err == -1) { > return err; > } else { > return readv(fs->fd, iov, iovcnt); > } > #endif > } > > > Unless I misunderstood the code, neither side can take advantage of the > > new 32k descriptor chain limit? > > > > Thanks, > > Stefan > > I need to check that when I have some more time. One possible explanation > might be that preadv() already has this wrapped into a loop in its > implementation to circumvent a limit like IOV_MAX. It might be another "it > works, but not portable" issue, but not sure. > > There are still a bunch of other issues I have to resolve. If you look at > net/9p/client.c on kernel side, you'll notice that it basically does this ATM > > kmalloc(msize); > > for every 9p request. So not only does it allocate much more memory for every > request than actually required (i.e. say 9pfs was mounted with msize=8M, then > a 9p request that actually would just need 1k would nevertheless allocate > 8M), > but also it allocates > PAGE_SIZE, which obviously may fail at any time. The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation. I saw zerocopy code in the 9p guest driver but didn't investigate when it's used. Maybe that should be used for large requests (file reads/writes)? virtio-blk/scsi don't memcpy data into a new buffer, they directly access page cache or O_DIRECT pinned pages. Stefan signature.asc Description: PGP signature
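As a rough illustration of the kmalloc()-vs-vmalloc() point: the kernel's kvmalloc() helper falls back to vmalloc() when a large physically contiguous allocation is not available, which would avoid the failure mode of kmalloc(msize) for large msize values. Whether such memory can then be handed to the virtio scatter-gather path is exactly the kind of open question in this thread; the helper names below are invented for the sketch and this is not a patch against net/9p/client.c.

#include <linux/mm.h>
#include <linux/slab.h>

static void *p9_alloc_req_buffer(size_t msize)
{
    /* kmalloc() for small sizes, transparent vmalloc() fallback for large ones */
    return kvmalloc(msize, GFP_NOFS);
}

static void p9_free_req_buffer(void *buf)
{
    kvfree(buf);    /* correct for both kmalloc()ed and vmalloc()ed memory */
}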
[PATCH V4] block/rbd: implement bdrv_co_block_status
the qemu rbd driver currently lacks support for bdrv_co_block_status. This results mainly in incorrect progress during block operations (e.g. qemu-img convert with an rbd image as source). This patch utilizes the rbd_diff_iterate2 call from librbd to detect allocated and unallocated (all zero areas). To avoid querying the ceph OSDs for the answer this is only done if the image has the fast-diff feature which depends on the object-map and exclusive-lock features. In this case it is guaranteed that the information is present in memory in the librbd client and thus very fast. If fast-diff is not available all areas are reported to be allocated which is the current behaviour if bdrv_co_block_status is not implemented. Signed-off-by: Peter Lieven --- block/rbd.c | 111 1 file changed, 111 insertions(+) V3->V4: - make req.exists a bool [Ilya] - simplify callback under the assuption that we never receive a cb for a hole since we do not diff against a snapshot [Ilya] - remove out label [Ilya] - rename ret to status [Ilya] V2->V3: - check rbd_flags every time (they can change during runtime) [Ilya] - also check for fast-diff invalid flag [Ilya] - *map and *file cant be NULL [Ilya] - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an unallocated area [Ilya] - typo: catched -> caught [Ilya] - changed wording about fast-diff, object-map and exclusive lock in commit msg [Ilya] V1->V2: - add commit comment [Stefano] - use failed_post_open [Stefano] - remove redundant assert [Stefano] - add macro+comment for the magic -9000 value [Stefano] - always set *file if its non NULL [Stefano] diff --git a/block/rbd.c b/block/rbd.c index 701fbf2b0c..b9fa8e78eb 100644 --- a/block/rbd.c +++ b/block/rbd.c @@ -1259,6 +1259,116 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs, return spec_info; } +typedef struct rbd_diff_req { +uint64_t offs; +uint64_t bytes; +bool exists; +} rbd_diff_req; + +/* + * rbd_diff_iterate2 allows to interrupt the exection by returning a negative + * value in the callback routine. Choose a value that does not conflict with + * an existing exitcode and return it if we want to prematurely stop the + * execution because we detected a change in the allocation status. + */ +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000 + +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len, + int exists, void *opaque) +{ +struct rbd_diff_req *req = opaque; + +assert(req->offs + req->bytes <= offs); +/* + * we do not diff against a snapshot so we should never receive a callback + * for a hole. + */ +assert(exists); + +if (!req->exists && offs > req->offs) { +/* + * we started in an unallocated area and hit the first allocated + * block. req->bytes must be set to the length of the unallocated area + * before the allocated area. stop further processing. + */ +req->bytes = offs - req->offs; +return QEMU_RBD_EXIT_DIFF_ITERATE2; +} + +if (req->exists && offs > req->offs + req->bytes) { +/* + * we started in an allocated area and jumped over an unallocated area, + * req->bytes contains the length of the allocated area before the + * unallocated area. stop further processing. 
+ */ +return QEMU_RBD_EXIT_DIFF_ITERATE2; +} + +req->bytes += len; +req->exists = true; + +return 0; +} + +static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs, + bool want_zero, int64_t offset, + int64_t bytes, int64_t *pnum, + int64_t *map, + BlockDriverState **file) +{ +BDRVRBDState *s = bs->opaque; +int status, r; +struct rbd_diff_req req = { .offs = offset }; +uint64_t features, flags; + +assert(offset + bytes <= s->image_size); + +/* default to all sectors allocated */ +status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; +*map = offset; +*file = bs; +*pnum = bytes; + +/* check if RBD image supports fast-diff */ +r = rbd_get_features(s->image, &features); +if (r < 0) { +return status; +} +if (!(features & RBD_FEATURE_FAST_DIFF)) { +return status; +} + +/* check if RBD fast-diff result is valid */ +r = rbd_get_flags(s->image, &flags); +if (r < 0) { +return status; +} +if (flags & RBD_FLAG_FAST_DIFF_INVALID) { +return status; +} + +r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true, + qemu_rbd_co_block_status_cb, &req); +if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) { +return status; +} +assert(req.bytes <= bytes); +if (!req.exists) { +
Re: [PATCH V3] block/rbd: implement bdrv_co_block_status
Am 05.10.21 um 10:36 schrieb Ilya Dryomov: On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven wrote: Am 05.10.21 um 09:54 schrieb Ilya Dryomov: On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven wrote: the qemu rbd driver currently lacks support for bdrv_co_block_status. This results mainly in incorrect progress during block operations (e.g. qemu-img convert with an rbd image as source). This patch utilizes the rbd_diff_iterate2 call from librbd to detect allocated and unallocated (all zero areas). To avoid querying the ceph OSDs for the answer this is only done if the image has the fast-diff feature which depends on the object-map and exclusive-lock features. In this case it is guaranteed that the information is present in memory in the librbd client and thus very fast. If fast-diff is not available all areas are reported to be allocated which is the current behaviour if bdrv_co_block_status is not implemented. Signed-off-by: Peter Lieven --- V2->V3: - check rbd_flags every time (they can change during runtime) [Ilya] - also check for fast-diff invalid flag [Ilya] - *map and *file cant be NULL [Ilya] - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an unallocated area [Ilya] - typo: catched -> caught [Ilya] - changed wording about fast-diff, object-map and exclusive lock in commit msg [Ilya] V1->V2: - add commit comment [Stefano] - use failed_post_open [Stefano] - remove redundant assert [Stefano] - add macro+comment for the magic -9000 value [Stefano] - always set *file if its non NULL [Stefano] block/rbd.c | 126 1 file changed, 126 insertions(+) diff --git a/block/rbd.c b/block/rbd.c index dcf82b15b8..3cb24f9981 100644 --- a/block/rbd.c +++ b/block/rbd.c @@ -1259,6 +1259,131 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs, return spec_info; } +typedef struct rbd_diff_req { +uint64_t offs; +uint64_t bytes; +int exists; Hi Peter, Nit: make exists a bool. The one in the callback has to be an int because of the callback signature but let's not spread that. +} rbd_diff_req; + +/* + * rbd_diff_iterate2 allows to interrupt the exection by returning a negative + * value in the callback routine. Choose a value that does not conflict with + * an existing exitcode and return it if we want to prematurely stop the + * execution because we detected a change in the allocation status. + */ +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000 + +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len, + int exists, void *opaque) +{ +struct rbd_diff_req *req = opaque; + +assert(req->offs + req->bytes <= offs); + +if (req->exists && offs > req->offs + req->bytes) { +/* + * we started in an allocated area and jumped over an unallocated area, + * req->bytes contains the length of the allocated area before the + * unallocated area. stop further processing. + */ +return QEMU_RBD_EXIT_DIFF_ITERATE2; +} +if (req->exists && !exists) { +/* + * we started in an allocated area and reached a hole. req->bytes + * contains the length of the allocated area before the hole. + * stop further processing. + */ +return QEMU_RBD_EXIT_DIFF_ITERATE2; Do you have a test case for when this branch is taken? That would happen if you diff from a snapshot, the question is if it can also happen if the image is a clone from a snapshot? +} +if (!req->exists && exists && offs > req->offs) { +/* + * we started in an unallocated area and hit the first allocated + * block. req->bytes must be set to the length of the unallocated area + * before the allocated area. stop further processing. 
+ */ +req->bytes = offs - req->offs; +return QEMU_RBD_EXIT_DIFF_ITERATE2; +} + +/* + * assert that we caught all cases above and allocation state has not + * changed during callbacks. + */ +assert(exists == req->exists || !req->bytes); +req->exists = exists; + +/* + * assert that we either return an unallocated block or have got callbacks + * for all allocated blocks present. + */ +assert(!req->exists || offs == req->offs + req->bytes); +req->bytes = offs + len - req->offs; + +return 0; +} + +static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs, + bool want_zero, int64_t offset, + int64_t bytes, int64_t *pnum, + int64_t *map, + BlockDriverState **file) +{ +BDRVRBDState *s = bs->opaque; +int ret, r; Nit: I would rename ret to status or something like that to make it clear(er) that it is an actual value and never an error. Or, even better, dro
Re: [PATCH v2 04/17] python/aqmp: add send_fd_scm
On Thu, Oct 7, 2021 at 10:52 AM Eric Blake wrote: > On Wed, Sep 22, 2021 at 08:49:25PM -0400, John Snow wrote: > > The single space is indeed required to successfully transmit the file > > descriptor to QEMU. > > Sending fds requires a payload of at least one byte, but I don't think > that qemu cares which byte. Thus, while your choice of space is fine, > the commit message may be a bit misleading at implying it must be > space. > > OK, I'll rephrase. (Space winds up being useful in particular because it doesn't mess with the parsing for subsequent JSON objects sent over the wire.) (Idle curiosity: Is it possible to make QEMU accept an empty payload here? I understand that for compatibility reasons it wouldn't change much for the python lib even if we did, but I'm curious.) > > > > Python 3.11 removes support for calling sendmsg directly from a > > transport's socket. There is no other interface for doing this, our use > > case is, I suspect, "quite unique". > > > > As far as I can tell, this is safe to do -- send_fd_scm is a synchronous > > function and we can be guaranteed that the async coroutines will *not* be > > running when it is invoked. In testing, it works correctly. > > > > I investigated quite thoroughly the possibility of creating my own > > asyncio Transport (The class that ultimately manages the raw socket > > object) so that I could manage the socket myself, but this is so wildly > > invasive and unportable I scrapped the idea. It would involve a lot of > > copy-pasting of various python utilities and classes just to re-create > > the same infrastructure, and for extremely little benefit. Nah. > > > > Just boldly void the warranty instead, while I try to follow up on > > https://bugs.python.org/issue43232 > > Bummer that we have to do that, but at least you are documenting the > problems and pursuing a remedy upstream. > > Yeah. I suspect our use case is so niche that it's not likely to get traction, but I'll try again. This sort of thing might make it harder to use projects like pypy, so it does feel like a defeat. Still, where there's a will, there's a way, right? :) --js
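For reference, the mechanism under discussion is ordinary SCM_RIGHTS file-descriptor passing. A minimal C sketch (not the Python library code; send_fd() is a hypothetical helper) shows why at least one byte of regular data has to ride along with the ancillary message; the single ' ' byte plays that role here:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd)
{
    char payload = ' ';                  /* must send at least one byte */
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;

    memset(u.buf, 0, sizeof(u.buf));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}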
Re: [PATCH v3 00/17] iotests: support zstd
9/14/21 20:08, Hanna Reitz wrote: On 14.09.21 12:25, Vladimir Sementsov-Ogievskiy wrote: This series makes the tests pass with IMGOPTS='compression_type=zstd' Also, the Python iotests start to support IMGOPTS (they didn't before). v3: 02-04,06,08,14,17: add Hanna's r-b 07 iotests.py: filter out successful output of qemu-img create fix subject handle 149, 237 and 296 iotests (note, 149 is handled intuitively, as it fails :( It was also reviewed intuitively. :) Thanks, applied to my block branch: https://github.com/XanClic/qemu/commits/block Hmm, the patches are neither here nor in master. Were they missed somehow? -- Best regards, Vladimir
Re: [PATCH v2 12/17] python/machine: Handle QMP errors on close more meticulously
On Thu, Oct 7, 2021 at 11:08 AM Eric Blake wrote: > On Wed, Sep 22, 2021 at 08:49:33PM -0400, John Snow wrote: > > To use the AQMP backend, Machine just needs to be a little more diligent > > about what happens when closing a QMP connection. The operation is no > > longer a freebie in the async world; it may return errors encountered in > > the async bottom half on incoming message receipt, etc. > > > > (AQMP's disconnect, ultimately, serves as the quiescence point where all > > async contexts are gathered together, and any final errors reported at > > that point.) > > > > Because async QMP continues to check for messages asynchronously, it's > > almost certainly likely that the loop will have exited due to EOF after > > issuing the last 'quit' command. That error will ultimately be bubbled > > up when attempting to close the QMP connection. The manager class here > > then is free to discard it -- if it was expected. > > > > Signed-off-by: John Snow > > > > --- > > > > Yes, I regret that this class has become quite a dumping ground for > > complexity around the exit path. It's in need of a refactor to help > > separate out the exception handling and cleanup mechanisms from the > > VM-related stuff, but it's not a priority to do that just yet -- but > > it's on the list. > > > > Signed-off-by: John Snow > > This second S-o-b won't matter. > > Reviewed-by: Eric Blake > Sorry, it's a bug with git-publish I've just not invested time into fixing. It happens when I use a '---' to add an additional note for reviewers. git-publish likes to add the second line because it doesn't "see" the first one anymore. It's harmless, ultimately, but ... it sure does make me look like I have no idea what I'm doing ;) --js
Re: [PATCH v2 12/17] python/machine: Handle QMP errors on close more meticulously
On Wed, Sep 22, 2021 at 8:50 PM John Snow wrote: > To use the AQMP backend, Machine just needs to be a little more diligent > about what happens when closing a QMP connection. The operation is no > longer a freebie in the async world; it may return errors encountered in > the async bottom half on incoming message receipt, etc. > > (AQMP's disconnect, ultimately, serves as the quiescence point where all > async contexts are gathered together, and any final errors reported at > that point.) > > Because async QMP continues to check for messages asynchronously, it's > almost certainly likely that the loop will have exited due to EOF after > issuing the last 'quit' command. That error will ultimately be bubbled > up when attempting to close the QMP connection. The manager class here > then is free to discard it -- if it was expected. > > Signed-off-by: John Snow > > --- > > Yes, I regret that this class has become quite a dumping ground for > complexity around the exit path. It's in need of a refactor to help > separate out the exception handling and cleanup mechanisms from the > VM-related stuff, but it's not a priority to do that just yet -- but > it's on the list. > > Signed-off-by: John Snow > --- > python/qemu/machine/machine.py | 48 +- > 1 file changed, 42 insertions(+), 6 deletions(-) > > diff --git a/python/qemu/machine/machine.py > b/python/qemu/machine/machine.py > index 0bd40bc2f76..c33a78a2d9f 100644 > --- a/python/qemu/machine/machine.py > +++ b/python/qemu/machine/machine.py > @@ -342,9 +342,15 @@ def _post_shutdown(self) -> None: > # Comprehensive reset for the failed launch case: > self._early_cleanup() > > -if self._qmp_connection: > -self._qmp.close() > -self._qmp_connection = None > +try: > +self._close_qmp_connection() > +except Exception as err: # pylint: disable=broad-except > +LOG.warning( > +"Exception closing QMP connection: %s", > +str(err) if str(err) else type(err).__name__ > +) > +finally: > +assert self._qmp_connection is None > > self._close_qemu_log_file() > > @@ -420,6 +426,31 @@ def _launch(self) -> None: > close_fds=False) > self._post_launch() > > +def _close_qmp_connection(self) -> None: > +""" > +Close the underlying QMP connection, if any. > + > +Dutifully report errors that occurred while closing, but assume > +that any error encountered indicates an abnormal termination > +process and not a failure to close. > +""" > +if self._qmp_connection is None: > +return > + > +try: > +self._qmp.close() > +except EOFError: > +# EOF can occur as an Exception here when using the Async > +# QMP backend. It indicates that the server closed the > +# stream. If we successfully issued 'quit' at any point, > +# then this was expected. If the remote went away without > +# our permission, it's worth reporting that as an abnormal > +# shutdown case. > +if not self._quit_issued: > I strongly suspect that this line needs to actually be "if not (self._user_killed or self._quit_issued):" to also handle the force-kill cases. I wrote a different sync compatibility layer in a yet-to-be-published branch and noticed this code still allows EOFError through. Why it did not seem to come up in *this* series I still don't know -- I think the timing of when exactly the coroutines get scheduled can be finnicky depending on what else is running. So, sometimes, we manage to close the loop before we get EOF and sometimes we don't. I am mulling over adding a "tolerate_eof: bool = False" argument to disconnect() to allow the caller to handle this stuff with a little less boilerplate. 
Anyone have strong feelings on that? (Ultimately, because there's no canonical "goodbye" in-band message, the QMP layer itself can never really know if EOF was expected or not -- that's really up to whomever is managing the connection, but it does add a layer of complexity around the exit path that I am starting to find is a nuisance. Not to mention that there are likely many possible cases in which we might expect the remote to disappear on us that don't involve using QMP at all -- kill is one, using the guest agent to coerce the guest into shutdown is another. Networking and migration are other theoretical(?) avenues for intended disconnects.) > +raise > +finally: > +self._qmp_connection = None > + > def _early_cleanup(self) -> None: > """ > Perform any cleanup that needs to happen before the VM exits. > @@ -460,9 +491,14 @@ def _soft_shutdown(self, timeout: Optional[int]) -> > None: > self._early_cleanup() > > if self._qmp
Re: [PATCH 1/2] pylint: fix errors and warnings from qemu-tests test 297
Am 07.10.2021 um 12:34 hat Emanuele Giuseppe Esposito geschrieben: > > > > The error is "C0330: Wrong hanging indentation" > > > so it is not about dicts. I guess we can disable the error, but the > > > problem > > > is that we will disable it for the whole file, which doesn't seem right. > > > > Actually, I would disable it globally in pylintrc because building > > dictionaries for JSON is something that we do a lot. > > > > But then I'm surprised that this is the only instance that actually > > fails. I wonder what the difference is. > > > > For example, 129 doesn't seem to be skipped and has this code: > > > > > > result = self.vm.qmp('blockdev-add', **{ > > 'node-name': 'overlay', > > 'driver': iotests.imgfmt, > > 'file': { > > 'driver': 'file', > > 'filename': self.overlay_img > > } > > }) > > > > Yet you don't report a pylint error for this file. > > Well, unless I am misunderstanding something... 129 *is* the file I am > reporting. And that is exactly the function where pylint complains. Indeed, my bad. I got confused there. And the other files that do something similar are all in SKIP_FILES in 297. So it looks like we don't have another case to copy. > > > > Oooh... I think I do see a difference: The final line is indented by one > > space more in the case that fails for you. It should be vertically > > aligned with the "'" in the first line, but it is actually aligned with > > the "b" of "blockdev-add" > > > > Does removing one space of indentation in the last line fix the report? > > It's not only the final line, it's from "**{" till the ending ")". > 'node-name' is under "ock" of 'blockdev-add'. It is clearly bad indented, > regardless of the new style and pylint new rules. > > Pylint itself suggests to move it 4 spaces more than "result =", ie 21 > spaces before. > > Still, applying your suggestion to all the lines and removing 1 space from > all lines still does not make pylint happy, as it asks to remove 20 spaces. > > To simplify things, this is the error I get: > > === pylint === > +* Module 129 > +129:91:0: C0330: Wrong hanging indentation (remove 21 spaces). > + 'node-name': 'overlay', > +|^ (bad-continuation) > +129:92:0: C0330: Wrong hanging indentation (remove 21 spaces). > + 'driver': iotests.imgfmt, > +|^ (bad-continuation) > +129:93:0: C0330: Wrong hanging indentation (remove 21 spaces). > + 'file': { > +|^ (bad-continuation) > +129:97:0: C0330: Wrong hanging indentation. > + }) > +| |^ (bad-continuation) > > So unless you want to disable it overall, one way of fixing 129 is to follow > what pylint suggests, and do like I wrote in the previous email: > > Either: > result = self.vm.qmp('blockdev-add', **{ > 'node-name': 'overlay', <-- 21 spaces less > 'driver': iotests.imgfmt, <-- 21 spaces less > 'file': { <-- 21 spaces less > 'driver': 'file', <-- 21 spaces less > 'filename': self.overlay_img <-- 21 spaces less > } <-- 21 spaces less > })<-- 21 spaces less Yes, this looks reasonble enough. Kevin
[PATCH 03/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
From: Knut Omang Add a small intro + minimal documentation for how to implement SR/IOV support for an emulated device. Signed-off-by: Knut Omang --- docs/pcie_sriov.txt | 115 1 file changed, 115 insertions(+) create mode 100644 docs/pcie_sriov.txt diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt new file mode 100644 index 00..f5e891e1d4 --- /dev/null +++ b/docs/pcie_sriov.txt @@ -0,0 +1,115 @@ +PCI SR/IOV EMULATION SUPPORT + + +Description +=== +SR/IOV (Single Root I/O Virtualization) is an optional extended capability +of a PCI Express device. It allows a single physical function (PF) to appear as multiple +virtual functions (VFs) for the main purpose of eliminating software +overhead in I/O from virtual machines. + +Qemu now implements the basic common functionality to enable an emulated device +to support SR/IOV. Yet no fully implemented devices exists in Qemu, but a +proof-of-concept hack of the Intel igb can be found here: + +git://github.com/knuto/qemu.git sriov_patches_v5 + +Implementation +== +Implementing emulation of an SR/IOV capable device typically consists of +implementing support for two types of device classes; the "normal" physical device +(PF) and the virtual device (VF). From Qemu's perspective, the VFs are just +like other devices, except that some of their properties are derived from +the PF. + +A virtual function is different from a physical function in that the BAR +space for all VFs are defined by the BAR registers in the PFs SR/IOV +capability. All VFs have the same BARs and BAR sizes. + +Accesses to these virtual BARs then is computed as + ++ * + + +From our emulation perspective this means that there is a separate call for +setting up a BAR for a VF. + +1) To enable SR/IOV support in the PF, it must be a PCI Express device so + you would need to add a PCI Express capability in the normal PCI + capability list. You might also want to add an ARI (Alternative + Routing-ID Interpretation) capability to indicate that your device + supports functions beyond it's "own" function space (0-7), + which is necessary to support more than 7 functions, or + if functions extends beyond offset 7 because they are placed at an + offset > 1 or have stride > 1. + + ... + #include "hw/pci/pcie.h" + #include "hw/pci/pcie_sriov.h" + + pci_your_pf_dev_realize( ... ) + { + ... + int ret = pcie_endpoint_cap_init(d, 0x70); + ... + pcie_ari_init(d, 0x100, 1); + ... + + /* Add and initialize the SR/IOV capability */ + pcie_sriov_pf_init(d, 0x200, "your_virtual_dev", + vf_devid, initial_vfs, total_vfs, + fun_offset, stride); + + /* Set up individual VF BARs (parameters as for normal BARs) */ + pcie_sriov_pf_init_vf_bar( ... ) + ... + } + + For cleanup, you simply call: + + pcie_sriov_pf_exit(device); + + which will delete all the virtual functions and associated resources. + +2) Similarly in the implementation of the virtual function, you need to + make it a PCI Express device and add a similar set of capabilities + except for the SR/IOV capability. Then you need to set up the VF BARs as + subregions of the PFs SR/IOV VF BARs by calling + pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call: + + pci_your_vf_dev_realize( ... ) + { + ... + int ret = pcie_endpoint_cap_init(d, 0x60); + ... + pcie_ari_init(d, 0x100, 1); + ... + memory_region_init(mr, ... ) + pcie_sriov_vf_register_bar(d, bar_nr, mr); + ... + } + +Testing on Linux guest +== +The easiest is if your device driver supports sysfs based SR/IOV +enabling. 
Support for this was added in kernel v.3.8, so not all drivers +support it yet. + +To enable 4 VFs for a device at 01:00.0: + + modprobe yourdriver + echo 4 > /sys/bus/pci/devices/:01:00.0/sriov_numvfs + +You should now see 4 VFs with lspci. +To turn SR/IOV off again - the standard requires you to turn it off before you can enable +another VF count, and the emulation enforces this: + + echo 0 > /sys/bus/pci/devices/:01:00.0/sriov_numvfs + +Older drivers typically provide a max_vfs module parameter +to enable it at load time: + + modprobe yourdriver max_vfs=4 + +To disable the VFs again then, you simply have to unload the driver: + + rmmod yourdriver -- 2.25.1
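The per-VF BAR address computation described in the document boils down to <VF BAR start> + <VF number> * <BAR size> + <offset>; a tiny illustrative sketch of that arithmetic (the function name is made up and is not part of the QEMU API):

#include <stdint.h>

static uint64_t vf_bar_addr(uint64_t vf_bar_start, uint16_t vf_number,
                            uint64_t bar_size, uint64_t offset)
{
    /* All VFs share the BAR size programmed in the PF's SR/IOV capability. */
    return vf_bar_start + vf_number * bar_size + offset;
}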
[PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update
PCIe devices implementing SR-IOV may need to perform certain actions before the VFs are unrealized or vice versa. Signed-off-by: Lukasz Maniak --- docs/pcie_sriov.txt | 2 +- hw/pci/pcie_sriov.c | 14 +- include/hw/pci/pcie_sriov.h | 8 +++- 3 files changed, 21 insertions(+), 3 deletions(-) diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt index f5e891e1d4..63ca1a7b8e 100644 --- a/docs/pcie_sriov.txt +++ b/docs/pcie_sriov.txt @@ -57,7 +57,7 @@ setting up a BAR for a VF. /* Add and initialize the SR/IOV capability */ pcie_sriov_pf_init(d, 0x200, "your_virtual_dev", vf_devid, initial_vfs, total_vfs, - fun_offset, stride); + fun_offset, stride, pre_vfs_update_cb); /* Set up individual VF BARs (parameters as for normal BARs) */ pcie_sriov_pf_init_vf_bar( ... ) diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c index 501a1ff433..cac2aee061 100644 --- a/hw/pci/pcie_sriov.c +++ b/hw/pci/pcie_sriov.c @@ -30,7 +30,8 @@ static void unregister_vfs(PCIDevice *dev); void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset, const char *vfname, uint16_t vf_dev_id, uint16_t init_vfs, uint16_t total_vfs, -uint16_t vf_offset, uint16_t vf_stride) +uint16_t vf_offset, uint16_t vf_stride, +SriovVfsUpdate pre_vfs_update) { uint8_t *cfg = dev->config + offset; uint8_t *wmask; @@ -41,6 +42,7 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset, dev->exp.sriov_pf.num_vfs = 0; dev->exp.sriov_pf.vfname = g_strdup(vfname); dev->exp.sriov_pf.vf = NULL; +dev->exp.sriov_pf.pre_vfs_update = pre_vfs_update; pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset); pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride); @@ -180,6 +182,11 @@ static void register_vfs(PCIDevice *dev) assert(dev->exp.sriov_pf.vf); trace_sriov_register_vfs(SRIOV_ID(dev), num_vfs); + +if (dev->exp.sriov_pf.pre_vfs_update) { +dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, num_vfs); +} + for (i = 0; i < num_vfs; i++) { dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname, i); if (!dev->exp.sriov_pf.vf[i]) { @@ -198,6 +205,11 @@ static void unregister_vfs(PCIDevice *dev) uint16_t i; trace_sriov_unregister_vfs(SRIOV_ID(dev), num_vfs); + +if (dev->exp.sriov_pf.pre_vfs_update) { +dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, 0); +} + for (i = 0; i < num_vfs; i++) { PCIDevice *vf = dev->exp.sriov_pf.vf[i]; object_property_set_bool(OBJECT(vf), "realized", false, &local_err); diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h index 0974f00054..9ab48b79c0 100644 --- a/include/hw/pci/pcie_sriov.h +++ b/include/hw/pci/pcie_sriov.h @@ -13,11 +13,16 @@ #ifndef QEMU_PCIE_SRIOV_H #define QEMU_PCIE_SRIOV_H +typedef void (*SriovVfsUpdate)(PCIDevice *dev, uint16_t prev_num_vfs, + uint16_t num_vfs); + struct PCIESriovPF { uint16_t num_vfs; /* Number of virtual functions created */ uint8_t vf_bar_type[PCI_NUM_REGIONS]; /* Store type for each VF bar */ const char *vfname; /* Reference to the device type used for the VFs */ PCIDevice **vf; /* Pointer to an array of num_vfs VF devices */ + +SriovVfsUpdate pre_vfs_update; /* Callback preceding VFs count change */ }; struct PCIESriovVF { @@ -28,7 +33,8 @@ struct PCIESriovVF { void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset, const char *vfname, uint16_t vf_dev_id, uint16_t init_vfs, uint16_t total_vfs, -uint16_t vf_offset, uint16_t vf_stride); +uint16_t vf_offset, uint16_t vf_stride, +SriovVfsUpdate pre_vfs_update); void pcie_sriov_pf_exit(PCIDevice *dev); /* Set up a VF bar in the SR/IOV bar area */ -- 2.25.1
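A hypothetical device-side sketch of hooking into the new callback (mydev_pre_vfs_update and the state variable are invented for illustration; the concrete user in this series is the nvme_update_vfs() hook added later):

static uint16_t mydev_active_vfs;    /* hypothetical per-device state */

/* Matches the SriovVfsUpdate signature; called before the VF count changes. */
static void mydev_pre_vfs_update(PCIDevice *dev, uint16_t prev_num_vfs,
                                 uint16_t num_vfs)
{
    if (num_vfs < prev_num_vfs) {
        /* VFs are about to be unrealized: quiesce per-VF state here. */
    }
    mydev_active_vfs = num_vfs;
}

/* ...and in the PF realize path, passed as the new last argument: */
pcie_sriov_pf_init(dev, 0x200, "mydev-vf", vf_devid, initial_vfs, total_vfs,
                   vf_offset, vf_stride, mydev_pre_vfs_update);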
[PATCH 01/15] pcie: Set default and supported MaxReadReq to 512
From: Knut Omang Make the default PCI Express Capability for PCIe devices set MaxReadReq to 512. Typical modern devices that people would want to emulate or simulate want this. The previous value would cause warnings from the root port driver on some kernels. Signed-off-by: Knut Omang --- hw/pci/pcie.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c index 6e95d82903..c1a12f3744 100644 --- a/hw/pci/pcie.c +++ b/hw/pci/pcie.c @@ -62,8 +62,9 @@ pcie_cap_v1_fill(PCIDevice *dev, uint8_t port, uint8_t type, uint8_t version) * Functions conforming to the ECN, PCI Express Base * Specification, Revision 1.1., or subsequent PCI Express Base * Specification revisions. + * + set max payload size to 256, which seems to be a common value */ -pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER); +pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER | (0x1 & PCI_EXP_DEVCAP_PAYLOAD)); pci_set_long(exp_cap + PCI_EXP_LNKCAP, (port << PCI_EXP_LNKCAP_PN_SHIFT) | @@ -179,6 +180,8 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offset, pci_set_long(exp_cap + PCI_EXP_DEVCAP2, PCI_EXP_DEVCAP2_EFF | PCI_EXP_DEVCAP2_EETLPP); +pci_set_word(exp_cap + PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_READRQ_256B); + pci_set_word(dev->wmask + pos + PCI_EXP_DEVCTL2, PCI_EXP_DEVCTL2_EETLPPB); if (dev->cap_present & QEMU_PCIE_EXTCAP_INIT) { -- 2.25.1
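For reference, both Max_Payload_Size and Max_Read_Request_Size are 3-bit PCIe encodings that translate to 128 << value bytes, which is why a DEVCAP payload field of 0x1 means 256 bytes. A trivial standalone sketch of the mapping:

#include <stdio.h>

int main(void)
{
    for (int field = 0; field <= 5; field++) {
        printf("encoding %d -> %4d bytes\n", field, 128 << field);
    }
    return 0;   /* 0 -> 128, 1 -> 256, 2 -> 512, ..., 5 -> 4096 */
}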
[PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
Hi, This series of patches is an attempt to add support for the following sections of NVMe specification revision 1.4: 8.5 Virtualization Enhancements (Optional) 8.5.1 VQ Resource Definition 8.5.2 VI Resource Definition 8.5.3 Secondary Controller States and Resource Configuration 8.5.4 Single Root I/O Virtualization and Sharing (SR-IOV) The NVMe controller's Single Root I/O Virtualization and Sharing implementation is based on patches introducing SR-IOV support for PCI Express proposed by Knut Omang: https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05155.html However, based on what I was able to find historically, Knut's patches have not yet been pulled into QEMU due to no example of a working device up to this point: https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg02722.html In terms of design, the Physical Function controller and the Virtual Function controllers are almost independent, with few exceptions: PF handles flexible resource allocation for all its children (VFs have read-only access to this data), and reset (PF explicitly calls it on VFs). Since the MMIO access is serialized, no extra precautions are required to handle concurrent resets, as well as the secondary controller state access doesn't need to be atomic. A controller with full SR-IOV support must be capable of handling the Namespace Management command. As there is a pending review with this functionality, this patch list is not duplicating efforts. Yet, NS management patches are not required to test the SR-IOV support. We tested the patches on Ubuntu 20.04.3 LTS with kernel 5.4.0. We have hit various issues with NVMe CLI (list and virt-mgmt commands) between releases from version 1.09 to master, thus we chose this golden NVMe CLI hash for testing: a50a0c1. The implementation is not 100% finished and certainly not bug free, since we are already aware of some issues e.g. interaction with namespaces related to AER, or unexpected (?) kernel behavior in more complex reset scenarios. However, our SR-IOV implementation is already able to support typical SR-IOV use cases, so we believe the patches are ready to share with the community. Hope you find some time to review the work we did, and share your thoughts. 
Kind regards, Lukasz Knut Omang (3): pcie: Set default and supported MaxReadReq to 512 pcie: Add support for Single Root I/O Virtualization (SR/IOV) pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt Lukasz Maniak (5): pcie: Add callback preceding SR-IOV VFs update hw/nvme: Add support for SR-IOV hw/nvme: Add support for Primary Controller Capabilities hw/nvme: Add support for Secondary Controller List docs: Add documentation for SR-IOV and Virtualization Enhancements Łukasz Gieryk (7): pcie: Add 1.2 version token for the Power Management Capability hw/nvme: Implement the Function Level Reset hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime hw/nvme: Calculate BAR attributes in a function hw/nvme: Initialize capability structures for primary/secondary controllers pcie: Add helpers to the SR/IOV API hw/nvme: Add support for the Virtualization Management command docs/pcie_sriov.txt | 115 +++ docs/system/devices/nvme.rst | 27 ++ hw/nvme/ctrl.c | 589 --- hw/nvme/ns.c | 2 +- hw/nvme/nvme.h | 47 ++- hw/nvme/subsys.c | 74 - hw/nvme/trace-events | 6 + hw/pci/meson.build | 1 + hw/pci/pci.c | 97 -- hw/pci/pcie.c| 10 +- hw/pci/pcie_sriov.c | 313 +++ hw/pci/trace-events | 5 + include/block/nvme.h | 65 include/hw/pci/pci.h | 12 +- include/hw/pci/pci_ids.h | 1 + include/hw/pci/pci_regs.h| 1 + include/hw/pci/pcie.h| 6 + include/hw/pci/pcie_sriov.h | 81 + include/qemu/typedefs.h | 2 + 19 files changed, 1369 insertions(+), 85 deletions(-) create mode 100644 docs/pcie_sriov.txt create mode 100644 hw/pci/pcie_sriov.c create mode 100644 include/hw/pci/pcie_sriov.h -- 2.25.1
[PATCH 02/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
From: Knut Omang This patch provides the building blocks for creating an SR/IOV PCIe Extended Capability header and register/unregister SR/IOV Virtual Functions. Signed-off-by: Knut Omang --- hw/pci/meson.build | 1 + hw/pci/pci.c| 97 +--- hw/pci/pcie.c | 5 + hw/pci/pcie_sriov.c | 287 hw/pci/trace-events | 5 + include/hw/pci/pci.h| 12 +- include/hw/pci/pcie.h | 6 + include/hw/pci/pcie_sriov.h | 67 + include/qemu/typedefs.h | 2 + 9 files changed, 456 insertions(+), 26 deletions(-) create mode 100644 hw/pci/pcie_sriov.c create mode 100644 include/hw/pci/pcie_sriov.h diff --git a/hw/pci/meson.build b/hw/pci/meson.build index 5c4bbac817..bcc9c75919 100644 --- a/hw/pci/meson.build +++ b/hw/pci/meson.build @@ -5,6 +5,7 @@ pci_ss.add(files( 'pci.c', 'pci_bridge.c', 'pci_host.c', + 'pcie_sriov.c', 'shpc.c', 'slotid_cap.c' )) diff --git a/hw/pci/pci.c b/hw/pci/pci.c index 186758ee11..1ad647f78e 100644 --- a/hw/pci/pci.c +++ b/hw/pci/pci.c @@ -239,6 +239,9 @@ int pci_bar(PCIDevice *d, int reg) { uint8_t type; +/* PCIe virtual functions do not have their own BARs */ +assert(!pci_is_vf(d)); + if (reg != PCI_ROM_SLOT) return PCI_BASE_ADDRESS_0 + reg * 4; @@ -304,10 +307,30 @@ void pci_device_deassert_intx(PCIDevice *dev) } } -static void pci_do_device_reset(PCIDevice *dev) +static void pci_reset_regions(PCIDevice *dev) { int r; +if (pci_is_vf(dev)) { +return; +} + +for (r = 0; r < PCI_NUM_REGIONS; ++r) { +PCIIORegion *region = &dev->io_regions[r]; +if (!region->size) { +continue; +} + +if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) && +region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) { +pci_set_quad(dev->config + pci_bar(dev, r), region->type); +} else { +pci_set_long(dev->config + pci_bar(dev, r), region->type); +} +} +} +static void pci_do_device_reset(PCIDevice *dev) +{ pci_device_deassert_intx(dev); assert(dev->irq_state == 0); @@ -323,19 +346,7 @@ static void pci_do_device_reset(PCIDevice *dev) pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) | pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE)); dev->config[PCI_CACHE_LINE_SIZE] = 0x0; -for (r = 0; r < PCI_NUM_REGIONS; ++r) { -PCIIORegion *region = &dev->io_regions[r]; -if (!region->size) { -continue; -} - -if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) && -region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) { -pci_set_quad(dev->config + pci_bar(dev, r), region->type); -} else { -pci_set_long(dev->config + pci_bar(dev, r), region->type); -} -} +pci_reset_regions(dev); pci_update_mappings(dev); msi_reset(dev); @@ -884,6 +895,15 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp) dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION; } +/* With SR/IOV and ARI, a device at function 0 need not be a multifunction + * device, as it may just be a VF that ended up with function 0 in + * the legacy PCI interpretation. Avoid failing in such cases: + */ +if (pci_is_vf(dev) && +dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) { +return; +} + /* * multifunction bit is interpreted in two ways as follows. * - all functions must set the bit to 1. 
@@ -1083,6 +1103,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, bus->devices[devfn]->name); return NULL; } else if (dev->hotplugged && + !pci_is_vf(pci_dev) && pci_get_function_0(pci_dev)) { error_setg(errp, "PCI: slot %d function 0 already occupied by %s," " new func %s cannot be exposed to guest.", @@ -1191,6 +1212,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num, pcibus_t size = memory_region_size(memory); uint8_t hdr_type; +assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */ assert(region_num >= 0); assert(region_num < PCI_NUM_REGIONS); assert(is_power_of_2(size)); @@ -1294,11 +1316,43 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num) return pci_dev->io_regions[region_num].addr; } -static pcibus_t pci_bar_address(PCIDevice *d, -int reg, uint8_t type, pcibus_t size) +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg, +uint8_t type, pcibus_t size) +{ +pcibus_t new_addr; +if (!pci_is_vf(d)) { +int bar = pci_bar(d, reg); +if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) { +new_addr = pci
[PATCH 05/15] hw/nvme: Add support for SR-IOV
This patch implements initial support for Single Root I/O Virtualization on an NVMe device. Essentially, it allows to define the maximum number of virtual functions supported by the NVMe controller via sriov_max_vfs parameter. Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV capability by a physical controller and ARI capability by both the physical and virtual function devices. NVMe controllers created via virtual functions mirror functionally the physical controller, which may not entirely be the case, thus consideration would be needed on the way to limit the capabilities of the VF. NVMe subsystem is required for the use of SR-IOV. Signed-off-by: Lukasz Maniak --- hw/nvme/ctrl.c | 74 ++-- hw/nvme/nvme.h | 1 + include/hw/pci/pci_ids.h | 1 + 3 files changed, 73 insertions(+), 3 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 6a571d18cf..ad79ff0c00 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -35,6 +35,7 @@ * mdts=,vsl=, \ * zoned.zasl=, \ * zoned.auto_transition=, \ + * sriov_max_vfs= \ * subsys= * -device nvme-ns,drive=,bus=,nsid=,\ * zoned=, \ @@ -106,6 +107,12 @@ * transitioned to zone state closed for resource management purposes. * Defaults to 'on'. * + * - `sriov_max_vfs` + * Indicates the maximum number of PCIe virtual functions supported + * by the controller. The default value is 0. Specifying a non-zero value + * enables reporting of both SR-IOV and ARI capabilities by the NVMe device. + * Virtual function controllers will not report SR-IOV capability. + * * nvme namespace device parameters * * - `shared` @@ -160,6 +167,7 @@ #include "sysemu/block-backend.h" #include "sysemu/hostmem.h" #include "hw/pci/msix.h" +#include "hw/pci/pcie_sriov.h" #include "migration/vmstate.h" #include "nvme.h" @@ -175,6 +183,9 @@ #define NVME_TEMPERATURE_CRITICAL 0x175 #define NVME_NUM_FW_SLOTS 1 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB) +#define NVME_MAX_VFS 127 +#define NVME_VF_OFFSET 0x1 +#define NVME_VF_STRIDE 1 #define NVME_GUEST_ERR(trace, fmt, ...) \ do { \ @@ -5583,6 +5594,10 @@ static void nvme_ctrl_reset(NvmeCtrl *n) g_free(event); } +if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) { +pcie_sriov_pf_disable_vfs(&n->parent_obj); +} + n->aer_queued = 0; n->outstanding_aers = 0; n->qs_created = false; @@ -6264,6 +6279,19 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) error_setg(errp, "vsl must be non-zero"); return; } + +if (params->sriov_max_vfs) { +if (!n->subsys) { +error_setg(errp, "subsystem is required for the use of SR-IOV"); +return; +} + +if (params->sriov_max_vfs > NVME_MAX_VFS) { +error_setg(errp, "sriov_max_vfs must be between 0 and %d", + NVME_MAX_VFS); +return; +} +} } static void nvme_init_state(NvmeCtrl *n) @@ -6321,6 +6349,20 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev) memory_region_set_enabled(&n->pmr.dev->mr, false); } +static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset, +uint64_t bar_size) +{ +uint16_t vf_dev_id = n->params.use_intel_id ? 
+ PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME; + +pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id, + n->params.sriov_max_vfs, n->params.sriov_max_vfs, + NVME_VF_OFFSET, NVME_VF_STRIDE, NULL); + +pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY | + PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size); +} + static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) { uint8_t *pci_conf = pci_dev->config; @@ -6335,7 +6377,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) if (n->params.use_intel_id) { pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL); -pci_config_set_device_id(pci_conf, 0x5845); +pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_INTEL_NVME); } else { pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_REDHAT); pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_REDHAT_NVME); @@ -6343,6 +6385,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS); pcie_endpoint_cap_init(pci_dev, 0x80); +if (n->params.sriov_max_vfs) { +pcie_ari_init(pci_dev, 0x100, 1); +} bar_size = QEMU_ALIGN_UP(n->reg_size, 4 * KiB); msix_table_offset = bar_size; @@ -6361,8 +6406,12 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
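With NVME_VF_OFFSET = 0x1 and NVME_VF_STRIDE = 1, the VFs occupy consecutive function numbers directly after the PF (ARI lets the numbering run past function 7). A standalone sketch of that arithmetic (vf_devfn() is illustrative only; the real computation lives in pcie_sriov.c):

#include <assert.h>

static int vf_devfn(int pf_devfn, int i, int vf_offset, int vf_stride)
{
    /* devfn of the i-th VF, per the SR-IOV First VF Offset / VF Stride fields */
    return pf_devfn + vf_offset + i * vf_stride;
}

int main(void)
{
    assert(vf_devfn(0, 0, 0x1, 1) == 1);      /* first VF */
    assert(vf_devfn(0, 126, 0x1, 1) == 127);  /* last VF with sriov_max_vfs=127 */
    return 0;
}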
[PATCH 08/15] pcie: Add 1.2 version token for the Power Management Capability
From: Łukasz Gieryk Signed-off-by: Łukasz Gieryk --- include/hw/pci/pci_regs.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h index 77ba64b931..a590140962 100644 --- a/include/hw/pci/pci_regs.h +++ b/include/hw/pci/pci_regs.h @@ -4,5 +4,6 @@ #include "standard-headers/linux/pci_regs.h" #define PCI_PM_CAP_VER_1_1 0x0002 /* PCI PM spec ver. 1.1 */ +#define PCI_PM_CAP_VER_1_2 0x0003 /* PCI PM spec ver. 1.2 */ #endif -- 2.25.1
[PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
From: Łukasz Gieryk With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one can configure the maximum number of virtual queues and interrupts assignable to a single virtual device. The primary and secondary controller capability structures are initialized accordingly. Since the number of available queues (interrupts) now varies between VF/PF, BAR size calculation is also adjusted. Signed-off-by: Łukasz Gieryk --- hw/nvme/ctrl.c | 110 +++ hw/nvme/nvme.h | 2 + include/block/nvme.h | 5 ++ 3 files changed, 108 insertions(+), 9 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 425fbf2c73..67c7210d7e 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -36,6 +36,8 @@ * zoned.zasl=, \ * zoned.auto_transition=, \ * sriov_max_vfs= \ + * sriov_max_vi_per_vf= \ + * sriov_max_vq_per_vf= \ * subsys= * -device nvme-ns,drive=,bus=,nsid=,\ * zoned=, \ @@ -113,6 +115,18 @@ * enables reporting of both SR-IOV and ARI capabilities by the NVMe device. * Virtual function controllers will not report SR-IOV capability. * + * - `sriov_max_vi_per_vf` + * Indicates the maximum number of virtual interrupt resources assignable + * to a secondary controller. Must be explicitly set if sriov_max_vfs != 0. + * The parameter affect VFs similarly to how msix_qsize affects PF, i.e., + * determines the number of interrupts available to all queues (admin, io). + * + * - `sriov_max_vq_per_vf` + * Indicates the maximum number of virtual queue resources assignable to + * a secondary controller. Must be explicitly set if sriov_max_vfs != 0. + * The parameter affect VFs similarly to how max_ioqpairs affects PF, + * except the number of flexible queues includes the admin queue. + * * nvme namespace device parameters * * - `shared` @@ -184,6 +198,7 @@ #define NVME_NUM_FW_SLOTS 1 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB) #define NVME_MAX_VFS 127 +#define NVME_VF_RES_GRANULARITY 1 #define NVME_VF_OFFSET 0x1 #define NVME_VF_STRIDE 1 @@ -6254,6 +6269,7 @@ static const MemoryRegionOps nvme_cmb_ops = { static void nvme_check_constraints(NvmeCtrl *n, Error **errp) { NvmeParams *params = &n->params; +int msix_total; if (params->num_queues) { warn_report("num_queues is deprecated; please use max_ioqpairs " @@ -6324,6 +6340,30 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) NVME_MAX_VFS); return; } + +if (params->sriov_max_vi_per_vf < 1 || +(params->sriov_max_vi_per_vf - 1) % NVME_VF_RES_GRANULARITY) { +error_setg(errp, "sriov_max_vi_per_vf must meet:" + " (X - 1) %% %d == 0 and X >= 1", + NVME_VF_RES_GRANULARITY); +return; +} + +if (params->sriov_max_vq_per_vf < 2 || +(params->sriov_max_vq_per_vf - 1) % NVME_VF_RES_GRANULARITY) { +error_setg(errp, "sriov_max_vq_per_vf must meet:" + " (X - 1) %% %d == 0 and X >= 2", + NVME_VF_RES_GRANULARITY); +return; +} + +msix_total = params->msix_qsize + + params->sriov_max_vfs * params->sriov_max_vi_per_vf; +if (msix_total > PCI_MSIX_FLAGS_QSIZE + 1) { +error_setg(errp, "sriov_max_vi_per_vf is too big for max_vfs=%d", + params->sriov_max_vfs); +return; +} } } @@ -6332,13 +6372,35 @@ static void nvme_init_state(NvmeCtrl *n) NvmePriCtrlCap *cap = &n->pri_ctrl_cap; NvmeSecCtrlList *list = &n->sec_ctrl_list; NvmeSecCtrlEntry *sctrl; +uint8_t max_vfs; +uint32_t total_vq, total_vi; int i; -n->max_ioqpairs = n->params.max_ioqpairs; -n->conf_ioqpairs = n->max_ioqpairs; +if (pci_is_vf(&n->parent_obj)) { +sctrl = nvme_sctrl(n); + +max_vfs = 0; + +n->max_ioqpairs = n->params.sriov_max_vq_per_vf - 1; +n->conf_ioqpairs = sctrl->nvq ? 
le16_to_cpu(sctrl->nvq) - 1 : 0; + +n->max_msix_qsize = n->params.sriov_max_vi_per_vf; +n->conf_msix_qsize = sctrl->nvi ? le16_to_cpu(sctrl->nvi) : 1; +} else { +max_vfs = n->params.sriov_max_vfs; + +n->max_ioqpairs = n->params.max_ioqpairs + + max_vfs * n->params.sriov_max_vq_per_vf; +n->conf_ioqpairs = n->max_ioqpairs; + +n->max_msix_qsize = n->params.msix_qsize + +max_vfs * n->params.sriov_max_vi_per_vf; +n->conf_msix_qsize = n->max_msix_qsize; +} + +total_vq = n->params.sriov_max_vq_per_vf * max_vfs; +total_vi = n->params.sriov_max_vi_per_vf * max_vfs; -n->max_msix_qsize = n->params.msix_qsize; -n->conf_msix_qsize = n->max_msix_qsize; n->sq = g_n
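A worked example of the msix_total check above, with hypothetical parameter values (a standalone sketch, not QEMU code):

#include <assert.h>

int main(void)
{
    /* Hypothetical configuration: the PF keeps 65 interrupt vectors and each
     * of the 127 VFs may be assigned up to 4 flexible interrupt resources. */
    int msix_qsize = 65, sriov_max_vfs = 127, sriov_max_vi_per_vf = 4;
    int msix_total = msix_qsize + sriov_max_vfs * sriov_max_vi_per_vf;

    assert(msix_total == 573);
    assert(msix_total <= 0x7ff + 1);   /* PCI_MSIX_FLAGS_QSIZE + 1 = 2048 */
    return 0;
}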
[PATCH 13/15] pcie: Add helpers to the SR/IOV API
From: Łukasz Gieryk Two convenience functions for retrieving: - the total number of VFs, - the PCIDevice object of the N-th VF. Signed-off-by: Łukasz Gieryk --- hw/pci/pcie_sriov.c | 14 ++ include/hw/pci/pcie_sriov.h | 8 2 files changed, 22 insertions(+) diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c index cac2aee061..5a8e92d5ab 100644 --- a/hw/pci/pcie_sriov.c +++ b/hw/pci/pcie_sriov.c @@ -292,8 +292,22 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev) return dev->exp.sriov_vf.vf_number; } +uint16_t pcie_sriov_vf_number_total(PCIDevice *dev) +{ +assert(!pci_is_vf(dev)); +return dev->exp.sriov_pf.num_vfs; +} PCIDevice *pcie_sriov_get_pf(PCIDevice *dev) { return dev->exp.sriov_vf.pf; } + +PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n) +{ +assert(!pci_is_vf(dev)); +if (n < dev->exp.sriov_pf.num_vfs) { +return dev->exp.sriov_pf.vf[n]; +} +return NULL; +} diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h index 9ab48b79c0..d1f39b7223 100644 --- a/include/hw/pci/pcie_sriov.h +++ b/include/hw/pci/pcie_sriov.h @@ -65,9 +65,17 @@ void pcie_sriov_pf_disable_vfs(PCIDevice *dev); /* Get logical VF number of a VF - only valid for VFs */ uint16_t pcie_sriov_vf_number(PCIDevice *dev); +/* Get the total number of VFs - only valid for PF */ +uint16_t pcie_sriov_vf_number_total(PCIDevice *dev); + /* Get the physical function that owns this VF. * Returns NULL if dev is not a virtual function */ PCIDevice *pcie_sriov_get_pf(PCIDevice *dev); +/* Get the n-th VF of this physical function - only valid for PF. + * Returns NULL if index is invalid + */ +PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n); + #endif /* QEMU_PCIE_SRIOV_H */ -- 2.25.1
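A hypothetical usage sketch of the two helpers (for_each_enabled_vf() is made up for illustration; only the two pcie_sriov_* calls are from the patch):

static void for_each_enabled_vf(PCIDevice *pf)
{
    uint16_t i, total = pcie_sriov_vf_number_total(pf);

    for (i = 0; i < total; i++) {
        PCIDevice *vf = pcie_sriov_get_vf_at_index(pf, i);

        if (vf) {
            /* ... act on the VF, e.g. look up its controller state ... */
        }
    }
}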
[PATCH 06/15] hw/nvme: Add support for Primary Controller Capabilities
Implementation of Primary Controller Capabilities data structure (Identify command with CNS value of 14h). Currently, the command returns only ID of a primary controller. Handling of remaining fields are added in subsequent patches implementing virtualization enhancements. Signed-off-by: Lukasz Maniak --- hw/nvme/ctrl.c | 22 +- hw/nvme/nvme.h | 2 ++ hw/nvme/trace-events | 1 + include/block/nvme.h | 23 +++ 4 files changed, 43 insertions(+), 5 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index ad79ff0c00..d2fde3dd07 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -4538,6 +4538,13 @@ static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, NvmeRequest *req, return nvme_c2h(n, (uint8_t *)list, sizeof(list), req); } +static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req) +{ +trace_pci_nvme_identify_pri_ctrl_cap(le16_to_cpu(n->pri_ctrl_cap.cntlid)); + +return nvme_c2h(n, (uint8_t *)&n->pri_ctrl_cap, sizeof(NvmePriCtrlCap), req); +} + static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req, bool active) { @@ -4756,6 +4763,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req) return nvme_identify_ctrl_list(n, req, true); case NVME_ID_CNS_CTRL_LIST: return nvme_identify_ctrl_list(n, req, false); +case NVME_ID_CNS_PRIMARY_CTRL_CAP: +return nvme_identify_pri_ctrl_cap(n, req); case NVME_ID_CNS_CS_NS: return nvme_identify_ns_csi(n, req, true); case NVME_ID_CNS_CS_NS_PRESENT: @@ -6296,6 +6305,8 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) static void nvme_init_state(NvmeCtrl *n) { +NvmePriCtrlCap *cap = &n->pri_ctrl_cap; + /* add one to max_ioqpairs to account for the admin queue pair */ n->reg_size = pow2ceil(sizeof(NvmeBar) + 2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE); @@ -6305,6 +6316,8 @@ static void nvme_init_state(NvmeCtrl *n) n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING; n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL); n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1); + +cap->cntlid = cpu_to_le16(n->cntlid); } static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev) @@ -6604,15 +6617,14 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp) qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, &pci_dev->qdev, n->parent_obj.qdev.id); -nvme_init_state(n); -if (nvme_init_pci(n, pci_dev, errp)) { -return; -} - if (nvme_init_subsys(n, errp)) { error_propagate(errp, local_err); return; } +nvme_init_state(n); +if (nvme_init_pci(n, pci_dev, errp)) { +return; +} nvme_init_ctrl(n, pci_dev); /* setup a namespace if the controller drive property was given */ diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h index 4331f5da1f..479817f66e 100644 --- a/hw/nvme/nvme.h +++ b/hw/nvme/nvme.h @@ -461,6 +461,8 @@ typedef struct NvmeCtrl { }; uint32_tasync_config; } features; + +NvmePriCtrlCap pri_ctrl_cap; } NvmeCtrl; static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid) diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events index ff6cafd520..1014ebceb6 100644 --- a/hw/nvme/trace-events +++ b/hw/nvme/trace-events @@ -52,6 +52,7 @@ pci_nvme_identify_ctrl(void) "identify controller" pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8"" pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32"" pci_nvme_identify_ctrl_list(uint8_t cns, uint16_t cntid) "cns 0x%"PRIx8" cntid %"PRIu16"" +pci_nvme_identify_pri_ctrl_cap(uint16_t cntlid) "identify primary controller capabilities cntlid=%"PRIu16"" pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8"" 
pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32"" pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", csi=0x%"PRIx8"" diff --git a/include/block/nvme.h b/include/block/nvme.h index e3bd47bf76..f69bd1d14f 100644 --- a/include/block/nvme.h +++ b/include/block/nvme.h @@ -1017,6 +1017,7 @@ enum NvmeIdCns { NVME_ID_CNS_NS_PRESENT= 0x11, NVME_ID_CNS_NS_ATTACHED_CTRL_LIST = 0x12, NVME_ID_CNS_CTRL_LIST = 0x13, +NVME_ID_CNS_PRIMARY_CTRL_CAP = 0x14, NVME_ID_CNS_CS_NS_PRESENT_LIST= 0x1a, NVME_ID_CNS_CS_NS_PRESENT = 0x1b, NVME_ID_CNS_IO_COMMAND_SET= 0x1c, @@ -1465,6 +1466,27 @@ typedef enum NvmeZoneState { NVME_ZONE_STATE_OFFLINE = 0x0f, } NvmeZoneState; +typedef struct QEMU_PACKED NvmePriCtrlCap { +uint16_tcntlid; +uint16_tportid; +uint8_t crt; +uint8_t rsvd5[27]; +uint32_tvqfrt; +uint32_tvqrfa; +uint16_tvqrfap; +uint16_tvqprt; +uint16_tvqfrsm; +uint16_tvqgran
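From the host's side, this structure is fetched with an Identify command whose CDW10 selects CNS 14h. A small, hedged sketch of just that dword layout (per the NVMe 1.4 specification, not QEMU code; identify_cdw10() is an invented helper):

#include <stdint.h>

static uint32_t identify_cdw10(uint8_t cns, uint16_t cntid)
{
    /* Identify CDW10: CNTID in bits 31:16, CNS in bits 07:00. */
    return (uint32_t)cntid << 16 | cns;
}

/* identify_cdw10(0x14, 0) requests the Primary Controller Capabilities
 * structure (NVME_ID_CNS_PRIMARY_CTRL_CAP) of the addressed controller. */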
[PATCH 07/15] hw/nvme: Add support for Secondary Controller List
Introduce handling for Secondary Controller List (Identify command with CNS value of 15h). Secondary controller ids are unique in the subsystem, hence they are reserved by it upon initialization of the primary controller to the number of sriov_max_vfs. ID reservation requires the addition of an intermediate controller slot state, so the reserved controller has the address 0x. A secondary controller is in the reserved state when it has no virtual function assigned, but its primary controller is realized. Secondary controller reservations are released to NULL when its primary controller is unregistered. Signed-off-by: Lukasz Maniak --- hw/nvme/ctrl.c | 42 - hw/nvme/ns.c | 2 +- hw/nvme/nvme.h | 16 +- hw/nvme/subsys.c | 74 ++-- hw/nvme/trace-events | 1 + include/block/nvme.h | 20 6 files changed, 143 insertions(+), 12 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index d2fde3dd07..9687a7322c 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -4545,6 +4545,14 @@ static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req) return nvme_c2h(n, (uint8_t *)&n->pri_ctrl_cap, sizeof(NvmePriCtrlCap), req); } +static uint16_t nvme_identify_sec_ctrl_list(NvmeCtrl *n, NvmeRequest *req) +{ +trace_pci_nvme_identify_sec_ctrl_list(le16_to_cpu(n->pri_ctrl_cap.cntlid), + n->sec_ctrl_list.numcntl); + +return nvme_c2h(n, (uint8_t *)&n->sec_ctrl_list, sizeof(NvmeSecCtrlList), req); +} + static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req, bool active) { @@ -4765,6 +4773,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req) return nvme_identify_ctrl_list(n, req, false); case NVME_ID_CNS_PRIMARY_CTRL_CAP: return nvme_identify_pri_ctrl_cap(n, req); +case NVME_ID_CNS_SECONDARY_CTRL_LIST: +return nvme_identify_sec_ctrl_list(n, req); case NVME_ID_CNS_CS_NS: return nvme_identify_ns_csi(n, req, true); case NVME_ID_CNS_CS_NS_PRESENT: @@ -6306,6 +6316,9 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) static void nvme_init_state(NvmeCtrl *n) { NvmePriCtrlCap *cap = &n->pri_ctrl_cap; +NvmeSecCtrlList *list = &n->sec_ctrl_list; +NvmeSecCtrlEntry *sctrl; +int i; /* add one to max_ioqpairs to account for the admin queue pair */ n->reg_size = pow2ceil(sizeof(NvmeBar) + @@ -6317,6 +6330,12 @@ static void nvme_init_state(NvmeCtrl *n) n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL); n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1); +list->numcntl = cpu_to_le16(n->params.sriov_max_vfs); +for (i = 0; i < n->params.sriov_max_vfs; i++) { +sctrl = &list->sec[i]; +sctrl->pcid = cpu_to_le16(n->cntlid); +} + cap->cntlid = cpu_to_le16(n->cntlid); } @@ -6362,6 +6381,27 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev) memory_region_set_enabled(&n->pmr.dev->mr, false); } +static void nvme_update_vfs(PCIDevice *pci_dev, uint16_t prev_num_vfs, +uint16_t num_vfs) +{ +NvmeCtrl *n = NVME(pci_dev); +uint16_t num_active_vfs = MAX(prev_num_vfs, num_vfs); +bool vf_enable = (prev_num_vfs < num_vfs); +uint16_t i; + +/* + * As per SR-IOV design, + * VF count can only go from 0 to a set value and vice versa. 
+ */ +for (i = 0; i < num_active_vfs; i++) { +if (vf_enable) { +n->sec_ctrl_list.sec[i].vfn = cpu_to_le16(i + 1); +} else { +n->sec_ctrl_list.sec[i].vfn = 0; +} +} +} + static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset, uint64_t bar_size) { @@ -6370,7 +6410,7 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset, pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id, n->params.sriov_max_vfs, n->params.sriov_max_vfs, - NVME_VF_OFFSET, NVME_VF_STRIDE, NULL); + NVME_VF_OFFSET, NVME_VF_STRIDE, nvme_update_vfs); pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size); diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c index b7cf1494e7..c70aed8c66 100644 --- a/hw/nvme/ns.c +++ b/hw/nvme/ns.c @@ -517,7 +517,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp) for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) { NvmeCtrl *ctrl = subsys->ctrls[i]; -if (ctrl) { +if (ctrl && ctrl != SUBSYS_SLOT_RSVD) { nvme_attach_ns(ctrl, ns); } } diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h index 479817f66e..fd229f06f0 100644 --- a/hw/nvme/nvme.h +++ b/hw/nvme/nv
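The reservation scheme described in the commit message above can be summarized with a short, illustrative C sketch. Only SUBSYS_SLOT_RSVD and the `ctrl != SUBSYS_SLOT_RSVD` check come from the patch; the structures, names and the sentinel object below are simplified stand-ins, not the QEMU code:

    /* Reserve one subsystem slot per possible VF when the primary controller
     * registers; namespace attachment must skip reserved (not yet live) slots. */
    #include <stdbool.h>

    #define MAX_CTRLS 256

    struct ctrl { int cntlid; };
    struct subsys { struct ctrl *ctrls[MAX_CTRLS]; };

    static struct ctrl rsvd_sentinel;
    #define SUBSYS_SLOT_RSVD (&rsvd_sentinel)   /* reserved, no VF bound yet */

    static void reserve_secondaries(struct subsys *s, int sriov_max_vfs)
    {
        int reserved = 0;
        for (int i = 0; i < MAX_CTRLS && reserved < sriov_max_vfs; i++) {
            if (!s->ctrls[i]) {
                s->ctrls[i] = SUBSYS_SLOT_RSVD;
                reserved++;
            }
        }
    }

    static bool slot_is_live(const struct subsys *s, int i)
    {
        /* mirrors the `ctrl && ctrl != SUBSYS_SLOT_RSVD` test in nvme_ns_realize() */
        return s->ctrls[i] && s->ctrls[i] != SUBSYS_SLOT_RSVD;
    }

When the primary controller is unregistered, the reserved slots go back to NULL, matching the release step mentioned in the commit message.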
[PATCH 11/15] hw/nvme: Calculate BAR attributes in a function
From: Łukasz Gieryk An Nvme device with SR-IOV capability calculates the BAR size differently for PF and VF, so it makes sense to extract the common code to a separate function. Also: it seems the n->reg_size parameter unnecessarily splits the BAR size calculation in two phases; removed to simplify the code. Signed-off-by: Łukasz Gieryk --- hw/nvme/ctrl.c | 52 +- hw/nvme/nvme.h | 1 - 2 files changed, 35 insertions(+), 18 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 5d9166d66f..425fbf2c73 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -6339,10 +6339,6 @@ static void nvme_init_state(NvmeCtrl *n) n->max_msix_qsize = n->params.msix_qsize; n->conf_msix_qsize = n->max_msix_qsize; - -/* add one to max_ioqpairs to account for the admin queue pair */ -n->reg_size = pow2ceil(sizeof(NvmeBar) + - 2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE); n->sq = g_new0(NvmeSQueue *, n->max_ioqpairs + 1); n->cq = g_new0(NvmeCQueue *, n->max_ioqpairs + 1); n->temperature = NVME_TEMPERATURE; @@ -6401,6 +6397,36 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev) memory_region_set_enabled(&n->pmr.dev->mr, false); } +static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs, + unsigned *msix_table_offset, + unsigned *msix_pba_offset) +{ +uint64_t bar_size, msix_table_size, msix_pba_size; + +bar_size = sizeof(NvmeBar); +bar_size += 2 * total_queues * NVME_DB_SIZE; +bar_size = pow2ceil(bar_size); +bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB); + +if (msix_table_offset) { +*msix_table_offset = bar_size; +} + +msix_table_size = PCI_MSIX_ENTRY_SIZE * total_irqs; +bar_size += msix_table_size; +bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB); + +if (msix_pba_offset) { +*msix_pba_offset = bar_size; +} + +msix_pba_size = QEMU_ALIGN_UP(total_irqs, 64) / 8; +bar_size += msix_pba_size; + +bar_size = pow2ceil(bar_size); +return bar_size; +} + static void nvme_update_vfs(PCIDevice *pci_dev, uint16_t prev_num_vfs, uint16_t num_vfs) { @@ -6461,7 +6487,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset) static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) { uint8_t *pci_conf = pci_dev->config; -uint64_t bar_size, msix_table_size, msix_pba_size; +uint64_t bar_size; unsigned msix_table_offset, msix_pba_offset; int ret; @@ -6486,21 +6512,13 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) pcie_ari_init(pci_dev, 0x100, 1); } -bar_size = QEMU_ALIGN_UP(n->reg_size, 4 * KiB); -msix_table_offset = bar_size; -msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize; - -bar_size += msix_table_size; -bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB); -msix_pba_offset = bar_size; -msix_pba_size = QEMU_ALIGN_UP(n->params.msix_qsize, 64) / 8; - -bar_size += msix_pba_size; -bar_size = pow2ceil(bar_size); +/* add one to max_ioqpairs to account for the admin queue pair */ +bar_size = nvme_bar_size(n->max_ioqpairs + 1, n->max_msix_qsize, + &msix_table_offset, &msix_pba_offset); memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size); memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme", - n->reg_size); + msix_table_offset); memory_region_add_subregion(&n->bar0, 0, &n->iomem); if (pci_is_vf(pci_dev)) { diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h index 65383e495c..a8eded4713 100644 --- a/hw/nvme/nvme.h +++ b/hw/nvme/nvme.h @@ -410,7 +410,6 @@ typedef struct NvmeCtrl { uint16_tmax_prp_ents; uint16_tcqe_size; uint16_tsqe_size; -uint32_treg_size; uint32_tmax_q_ents; uint8_t outstanding_aers; uint32_tirq_status; -- 2.25.1
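The layout arithmetic in nvme_bar_size() is easier to follow with concrete numbers. Below is a standalone C sketch of the same calculation; the 4-byte doorbell stride, the 16-byte MSI-X table entry size and the 4 KiB register area are assumptions made for illustration, not values taken from this patch:

    #include <inttypes.h>
    #include <stdio.h>

    #define KiB 1024u

    static uint64_t align_up(uint64_t v, uint64_t a) { return (v + a - 1) & ~(a - 1); }

    static uint64_t pow2ceil(uint64_t v)
    {
        uint64_t r = 1;
        while (r < v) {
            r <<= 1;
        }
        return r;
    }

    int main(void)
    {
        unsigned total_queues = 65;          /* 64 I/O queue pairs + admin */
        unsigned total_irqs = 65;
        uint64_t regs = 4 * KiB;             /* stand-in for sizeof(NvmeBar) */

        uint64_t bar = align_up(pow2ceil(regs + 2 * total_queues * 4), 4 * KiB);
        uint64_t msix_table_off = bar;       /* doorbells rounded up */

        bar = align_up(bar + 16ull * total_irqs, 4 * KiB);
        uint64_t msix_pba_off = bar;         /* MSI-X table rounded up */

        bar = pow2ceil(bar + align_up(total_irqs, 64) / 8);

        printf("table 0x%" PRIx64 ", pba 0x%" PRIx64 ", size 0x%" PRIx64 "\n",
               msix_table_off, msix_pba_off, bar);
        return 0;
    }

With these numbers the MSI-X table lands at 0x2000, the PBA at 0x3000, and the whole BAR rounds up to 16 KiB, which is the kind of result the helper reports back through the offset pointers.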
[PATCH 15/15] docs: Add documentation for SR-IOV and Virtualization Enhancements
Signed-off-by: Lukasz Maniak
---
 docs/system/devices/nvme.rst | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index bff72d1c24..904fd7290c 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -235,3 +235,30 @@ The virtual namespace device supports DIF- and DIX-based protection information
 to ``1`` to transfer protection information as the first eight bytes of
 metadata. Otherwise, the protection information is transferred as the last
 eight bytes.
+
+Virtualization Enhancements and SR-IOV
+--------------------------------------
+
+The ``nvme`` device supports Single Root I/O Virtualization and Sharing
+along with Virtualization Enhancements. The controller has to be linked to
+an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
+
+A number of parameters are present:
+
+``sriov_max_vfs`` (default: ``0``)
+  Indicates the maximum number of PCIe virtual functions supported
+  by the controller. Specifying a non-zero value enables reporting of both
+  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
+  by the NVMe device. Virtual function controllers will not report SR-IOV.
+
+``sriov_max_vi_per_vf``
+  Indicates the maximum number of virtual interrupt resources assignable
+  to a secondary controller. Must be explicitly set if ``sriov_max_vfs`` != 0.
+  The parameter affects VFs similarly to how ``msix_qsize`` affects the PF, i.e.,
+  determines the number of interrupts available to all queues (admin, io).
+
+``sriov_max_vq_per_vf``
+  Indicates the maximum number of virtual queue resources assignable to
+  a secondary controller. Must be explicitly set if ``sriov_max_vfs`` != 0.
+  The parameter affects VFs similarly to how ``max_ioqpairs`` affects the PF,
+  except the number of flexible queues includes the admin queue.
-- 
2.25.1
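For reference, a hypothetical command line exercising these parameters could look like the following (the IDs, serial and resource counts are made up; the per-VF resource parameters come from the other patches in this series):

    -device nvme-subsys,id=subsys0,nqn=subsys0 \
    -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=2,sriov_max_vq_per_vf=4,sriov_max_vi_per_vf=2

A guest kernel can then enable up to two VFs (e.g. via sriov_numvfs in sysfs), each limited to the configured number of flexible queue and interrupt resources.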
[PATCH 14/15] hw/nvme: Add support for the Virtualization Management command
From: Łukasz Gieryk With the new command one can: - assign flexible resources (queues, interrupts) to primary and secondary controllers, - toggle the online/offline state of given controller. Signed-off-by: Łukasz Gieryk --- hw/nvme/ctrl.c | 207 ++- hw/nvme/nvme.h | 16 hw/nvme/trace-events | 3 + include/block/nvme.h | 17 4 files changed, 241 insertions(+), 2 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 67c7210d7e..0c44d9b23a 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -246,6 +246,7 @@ static const uint32_t nvme_cse_acs[256] = { [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP, [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP, [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC, +[NVME_ADM_CMD_VIRT_MNGMT] = NVME_CMD_EFF_CSUPP, [NVME_ADM_CMD_FORMAT_NVM] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, }; @@ -277,6 +278,7 @@ static const uint32_t nvme_cse_iocs_zoned[256] = { }; static void nvme_process_sq(void *opaque); +static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst); static uint16_t nvme_sqid(NvmeRequest *req) { @@ -5506,6 +5508,163 @@ out: return status; } +static void nvme_get_virt_res_num(NvmeCtrl *n, uint8_t rt, int *num_total, + int *num_prim, int *num_sec) +{ +*num_total = le32_to_cpu(rt ? n->pri_ctrl_cap.vifrt : n->pri_ctrl_cap.vqfrt); +*num_prim = le16_to_cpu(rt ? n->pri_ctrl_cap.virfap : n->pri_ctrl_cap.vqrfap); +*num_sec = le16_to_cpu(rt ? n->pri_ctrl_cap.virfa : n->pri_ctrl_cap.vqrfa); +} + +static uint16_t nvme_assign_virt_res_to_prim(NvmeCtrl *n, NvmeRequest *req, + uint16_t cntlid, uint8_t rt, int nr) +{ +int num_total, num_prim, num_sec; + +if (cntlid != n->cntlid) { +return NVME_INVALID_CTRL_ID; +} + +nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec); + +if (nr > num_total) { +return NVME_INVALID_NUM_RESOURCES; +} + +if (nr > num_total - num_sec) { +return NVME_INVALID_RESOURCE_ID; +} + +if (rt) { +n->pri_ctrl_cap.virfap = cpu_to_le16(nr); +} else { +n->pri_ctrl_cap.vqrfap = cpu_to_le16(nr); +} + +req->cqe.result = cpu_to_le32(nr); +return req->status; +} + +static void nvme_update_virt_res(NvmeCtrl *n, NvmeSecCtrlEntry *sctrl, + uint8_t rt, int nr) +{ +int prev_nr, prev_total; + +if (rt) { +prev_nr = le16_to_cpu(sctrl->nvi); +prev_total = le32_to_cpu(n->pri_ctrl_cap.virfa); +sctrl->nvi = cpu_to_le16(nr); +n->pri_ctrl_cap.virfa = cpu_to_le32(prev_total + nr - prev_nr); +} else { +prev_nr = le16_to_cpu(sctrl->nvq); +prev_total = le32_to_cpu(n->pri_ctrl_cap.vqrfa); +sctrl->nvq = cpu_to_le16(nr); +n->pri_ctrl_cap.vqrfa = cpu_to_le32(prev_total + nr - prev_nr); +} +} + +static uint16_t nvme_assign_virt_res_to_sec(NvmeCtrl *n, NvmeRequest *req, +uint16_t cntlid, uint8_t rt, int nr) +{ +int limit = rt ? n->params.sriov_max_vi_per_vf : + n->params.sriov_max_vq_per_vf; +int num_total, num_prim, num_sec, num_free, diff; +NvmeSecCtrlEntry *sctrl; + +sctrl = nvme_sctrl_for_cntlid(n, cntlid); +if (!sctrl) { +return NVME_INVALID_CTRL_ID; +} + +if (sctrl->scs) { +return NVME_INVALID_SEC_CTRL_STATE; +} + +if (nr > limit) { +return NVME_INVALID_NUM_RESOURCES; +} + +nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec); +num_free = num_total - num_prim - num_sec; +diff = nr - le16_to_cpu(rt ? 
sctrl->nvi : sctrl->nvq); + +if (diff > num_free) { +return NVME_INVALID_RESOURCE_ID; +} + +nvme_update_virt_res(n, sctrl, rt, nr); +req->cqe.result = cpu_to_le32(nr); + +return req->status; +} + +static uint16_t nvme_virt_set_state(NvmeCtrl *n, uint16_t cntlid, bool online) +{ +NvmeCtrl *sn = NULL; +NvmeSecCtrlEntry *sctrl; + +sctrl = nvme_sctrl_for_cntlid(n, cntlid); +if (!sctrl) { +return NVME_INVALID_CTRL_ID; +} + +if (sctrl->vfn) { +sn = NVME(pcie_sriov_get_vf_at_index(&n->parent_obj, + le16_to_cpu(sctrl->vfn) - 1)); +} + +if (online) { +if (!NVME_CC_EN(ldl_le_p(&n->bar.cc)) || !sctrl->nvi || +(le16_to_cpu(sctrl->nvq) < 2)) { +return NVME_INVALID_SEC_CTRL_STATE; +} + +if (!sctrl->scs) { +sctrl->scs = 0x1; +if (sn) { +nvme_ctrl_reset(sn, NVME_RESET_CONTROLLER); +} +} +} else { +if (sctrl->scs) { +sctrl->scs = 0x0; +if (sn) { +nvme_ctrl_reset(sn, NVME_RESET_CONTROLLER); +} +} + +
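The resource bookkeeping in nvme_assign_virt_res_to_sec() above boils down to a small free-pool check. A minimal model (illustration only; the field mapping follows the code above, the names are simplified):

    #include <stdbool.h>

    struct flex_pool {
        int total;   /* vqfrt / vifrt:   all flexible resources            */
        int prim;    /* vqrfap / virfap: reserved for the primary          */
        int sec;     /* vqrfa / virfa:   currently assigned to secondaries */
    };

    /* Try to set a secondary controller's allocation to `nr`; `cur` is its
     * current allocation and `limit` the per-VF maximum from the device
     * properties. */
    static bool assign_to_secondary(struct flex_pool *p, int *cur, int nr, int limit)
    {
        int free_res = p->total - p->prim - p->sec;
        int diff = nr - *cur;

        if (nr > limit || diff > free_res) {
            return false;            /* would exceed the per-VF limit or the pool */
        }
        p->sec += diff;              /* mirrors the vqrfa/virfa update */
        *cur = nr;
        return true;
    }

Onlining a secondary controller then only succeeds once it actually has resources assigned, which is what the `!sctrl->nvi || le16_to_cpu(sctrl->nvq) < 2` check above enforces.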
[PATCH 09/15] hw/nvme: Implement the Function Level Reset
From: Łukasz Gieryk

This patch implements the FLR, a feature currently not implemented for
the NVMe device, while listed as mandatory ("shall") in the 1.4 spec.

The implementation reuses FLR-related building blocks defined for the
pci-bridge module, and follows the same logic:
 - the FLR capability is advertised in the PCIe config space,
 - a custom pci_write_config callback detects a write to the trigger
   register and performs the PCI reset,
 - which, eventually, calls the custom dc->reset handler.

Depending on the reset type, parts of the state should (or should not) be
cleared. To distinguish the type of reset, an additional parameter is
passed to the reset function.

This patch also enables advertisement of the Power Management PCI
capability. The main reason behind it is to announce the no_soft_reset=1
bit, to signal SR-IOV support where each VF can be reset individually.

The implementation deliberately ignores writes to the PMCS.PS register,
as even such naïve behavior is enough to correctly handle the D3->D0
transition. It is worth noting that the power state transition back to
D3, with all the corresponding side effects, wasn't and still isn't
handled properly.

Signed-off-by: Łukasz Gieryk
---
 hw/nvme/ctrl.c       | 52
 hw/nvme/nvme.h       |  5 +
 hw/nvme/trace-events |  1 +
 3 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 9687a7322c..b04cf5eae9 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5582,7 +5582,7 @@ static void nvme_process_sq(void *opaque)
     }
 }
 
-static void nvme_ctrl_reset(NvmeCtrl *n)
+static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
 {
     NvmeNamespace *ns;
     int i;
@@ -5614,7 +5614,9 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
     }
 
     if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
-        pcie_sriov_pf_disable_vfs(&n->parent_obj);
+        if (rst != NVME_RESET_CONTROLLER) {
+            pcie_sriov_pf_disable_vfs(&n->parent_obj);
+        }
     }
 
     n->aer_queued = 0;
@@ -5848,7 +5850,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
         }
     } else if (!NVME_CC_EN(data) && NVME_CC_EN(cc)) {
         trace_pci_nvme_mmio_stopped();
-        nvme_ctrl_reset(n);
+        nvme_ctrl_reset(n, NVME_RESET_CONTROLLER);
         cc = 0;
         csts &= ~NVME_CSTS_READY;
     }
@@ -6416,6 +6418,28 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
                               PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
 }
 
+static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
+{
+    Error *err = NULL;
+    int ret;
+
+    ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset,
+                             PCI_PM_SIZEOF, &err);
+    if (err) {
+        error_report_err(err);
+        return ret;
+    }
+
+    pci_set_word(pci_dev->config + offset + PCI_PM_PMC,
+                 PCI_PM_CAP_VER_1_2);
+    pci_set_word(pci_dev->config + offset + PCI_PM_CTRL,
+                 PCI_PM_CTRL_NO_SOFT_RESET);
+    pci_set_word(pci_dev->wmask + offset + PCI_PM_CTRL,
+                 PCI_PM_CTRL_STATE_MASK);
+
+    return 0;
+}
+
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
@@ -6437,7 +6461,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
     }
     pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+    nvme_add_pm_capability(pci_dev, 0x60);
     pcie_endpoint_cap_init(pci_dev, 0x80);
+    pcie_cap_flr_init(pci_dev);
 
     if (n->params.sriov_max_vfs) {
         pcie_ari_init(pci_dev, 0x100, 1);
     }
@@ -6686,7 +6712,7 @@ static void nvme_exit(PCIDevice *pci_dev)
     NvmeNamespace *ns;
     int i;
 
-    nvme_ctrl_reset(n);
+    nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
 
     if (n->subsys) {
         for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
@@ -6785,6 +6811,22 @@ static void
nvme_set_smart_warning(Object *obj, Visitor *v, const char *name, } } +static void nvme_pci_reset(DeviceState *qdev) +{ +PCIDevice *pci_dev = PCI_DEVICE(qdev); +NvmeCtrl *n = NVME(pci_dev); + +trace_pci_nvme_pci_reset(); +nvme_ctrl_reset(n, NVME_RESET_FUNCTION); +} + +static void nvme_pci_write_config(PCIDevice *dev, uint32_t address, + uint32_t val, int len) +{ +pci_default_write_config(dev, address, val, len); +pcie_cap_flr_write_config(dev, address, val, len); +} + static const VMStateDescription nvme_vmstate = { .name = "nvme", .unmigratable = 1, @@ -6796,6 +6838,7 @@ static void nvme_class_init(ObjectClass *oc, void *data) PCIDeviceClass *pc = PCI_DEVICE_CLASS(oc); pc->realize = nvme_realize; +pc->config_write = nvme_pci_write_config; pc->exit = nvme_exit; pc->class_id = PCI_CLASS_STORAGE_EXPRESS; pc->revision = 2; @@ -6804,6
[PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable at runtime
From: Łukasz Gieryk The Nvme device defines two properties: max_ioqpairs, msix_qsize. Having them as constants is problematic for SR-IOV support. The SR-IOV feature introduces virtual resources (queues, interrupts) that can be assigned to PF and its dependent VFs. Each device, following a reset, should work with the configured number of queues. A single constant is no longer sufficient to hold the whole state. This patch tries to solve the problem by introducing additional variables in NvmeCtrl’s state. The variables for, e.g., managing queues are therefore organized as: - n->params.max_ioqpairs – no changes, constant set by the user. - n->max_ioqpairs - (new) value derived from n->params.* in realize(); constant through device’s lifetime. - n->(mutable_state) – (not a part of this patch) user-configurable, specifies number of queues available _after_ reset. - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’ n->params.max_ioqpairs; initialized in realize() and updated during reset() to reflect user’s changes to the mutable state. Since the number of available i/o queues and interrupts can change in runtime, buffers for sq/cqs and the MSIX-related structures are allocated big enough to handle the limits, to completely avoid the complicated reallocation. A helper function (nvme_update_msixcap_ts) updates the corresponding capability register, to signal configuration changes. Signed-off-by: Łukasz Gieryk --- hw/nvme/ctrl.c | 62 +- hw/nvme/nvme.h | 4 2 files changed, 45 insertions(+), 21 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index b04cf5eae9..5d9166d66f 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -416,12 +416,12 @@ static bool nvme_nsid_valid(NvmeCtrl *n, uint32_t nsid) static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid) { -return sqid < n->params.max_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1; +return sqid < n->conf_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1; } static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid) { -return cqid < n->params.max_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1; +return cqid < n->conf_ioqpairs + 1 && n->cq[cqid] != NULL ? 
0 : -1; } static void nvme_inc_cq_tail(NvmeCQueue *cq) @@ -4034,8 +4034,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req) trace_pci_nvme_err_invalid_create_sq_cqid(cqid); return NVME_INVALID_CQID | NVME_DNR; } -if (unlikely(!sqid || sqid > n->params.max_ioqpairs || -n->sq[sqid] != NULL)) { +if (unlikely(!sqid || sqid > n->conf_ioqpairs || n->sq[sqid] != NULL)) { trace_pci_nvme_err_invalid_create_sq_sqid(sqid); return NVME_INVALID_QID | NVME_DNR; } @@ -4382,8 +4381,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req) trace_pci_nvme_create_cq(prp1, cqid, vector, qsize, qflags, NVME_CQ_FLAGS_IEN(qflags) != 0); -if (unlikely(!cqid || cqid > n->params.max_ioqpairs || -n->cq[cqid] != NULL)) { +if (unlikely(!cqid || cqid > n->conf_ioqpairs || n->cq[cqid] != NULL)) { trace_pci_nvme_err_invalid_create_cq_cqid(cqid); return NVME_INVALID_QID | NVME_DNR; } @@ -4399,7 +4397,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req) trace_pci_nvme_err_invalid_create_cq_vector(vector); return NVME_INVALID_IRQ_VECTOR | NVME_DNR; } -if (unlikely(vector >= n->params.msix_qsize)) { +if (unlikely(vector >= n->conf_msix_qsize)) { trace_pci_nvme_err_invalid_create_cq_vector(vector); return NVME_INVALID_IRQ_VECTOR | NVME_DNR; } @@ -4980,13 +4978,12 @@ defaults: break; case NVME_NUMBER_OF_QUEUES: -result = (n->params.max_ioqpairs - 1) | -((n->params.max_ioqpairs - 1) << 16); +result = (n->conf_ioqpairs - 1) | ((n->conf_ioqpairs - 1) << 16); trace_pci_nvme_getfeat_numq(result); break; case NVME_INTERRUPT_VECTOR_CONF: iv = dw11 & 0x; -if (iv >= n->params.max_ioqpairs + 1) { +if (iv >= n->conf_ioqpairs + 1) { return NVME_INVALID_FIELD | NVME_DNR; } @@ -5141,10 +5138,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req) trace_pci_nvme_setfeat_numq((dw11 & 0x) + 1, ((dw11 >> 16) & 0x) + 1, -n->params.max_ioqpairs, -n->params.max_ioqpairs); -req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) | - ((n->params.max_ioqpairs - 1) << 16)); +n->conf_ioqpairs, +n->conf_ioqpairs); +req->cqe.resu
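As a reading aid for the variable split described in the commit message, a schematic sketch (not the QEMU code; the VF-specific derivation belongs to later patches in the series):

    struct ctrl_state {
        unsigned params_max_ioqpairs; /* user property, never changes            */
        unsigned max_ioqpairs;        /* derived once in realize(): the hard cap
                                       * used to size the sq/cq arrays and MSI-X */
        unsigned conf_ioqpairs;       /* what the controller exposes; recomputed
                                       * on reset from the mutable state         */
    };

    static void reset_queues(struct ctrl_state *n, unsigned requested)
    {
        /* clamp the currently configured count to the allocation cap */
        n->conf_ioqpairs = requested < n->max_ioqpairs ? requested
                                                       : n->max_ioqpairs;
    }

The same pattern applies to msix_qsize / max_msix_qsize / conf_msix_qsize for interrupt vectors.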
[PATCH 13/12] block-backend: fix blk_co_flush prototype to mention coroutine_fn
We already have this marker on blk_co_flush() in block/block-backend.c.
Add it to the declaration in the header too.

Signed-off-by: Vladimir Sementsov-Ogievskiy
---
 include/sysemu/block-backend.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 979829b325..f3227098fc 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -183,7 +183,7 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
                           BlockCompletionFunc *cb, void *opaque);
 int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
                                  int64_t bytes);
-int blk_co_flush(BlockBackend *blk);
+int coroutine_fn blk_co_flush(BlockBackend *blk);
 int blk_flush(BlockBackend *blk);
 int blk_commit_all(void);
 void blk_inc_in_flight(BlockBackend *blk);
-- 
2.31.1