Re: [Qemu-devel] [PATCH 4/4] q35: introduce q35-lite

2016-11-14 Thread Chao Peng
On Mon, 2016-11-07 at 18:09 +0100, Paolo Bonzini wrote:
> 
> On 06/11/2016 08:06, Michael S. Tsirkin wrote:
> > 
> > On Sat, Nov 05, 2016 at 03:19:51AM -0400, Chao Peng wrote:
> > > 
> > > > 
> > > > This patch introduces a light weight machine type which shares
> > > > the
> > > > same codebase with existing q35 machine type but with some
> > > > features
> > > > disabled by default.
> > > > 
> > > > Signed-off-by: Chao Peng 
> > I don't find this too useful, but if others do and send acks, I'll
> > merge
> > it, but only if it also has migration disabled.
> > 
> 
> Agreed, it's enough to have patches 1-3.
> 

I'm fine.




Re: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support

2016-11-14 Thread Vincent Palatin
On Sun, Nov 13, 2016 at 4:20 AM,   wrote:
> Hi,
>
> Your series seems to have some coding style problems. See output below for
> more information:
>
> Type: series
> Subject: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support
> Message-id: cover.1478863621.git.vpala...@chromium.org
>
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
>
> BASE=base
> n=1
> total=$(git log --oneline $BASE.. | wc -l)
> failed=0
>
> # Useful git options
> git config --local diff.renamelimit 0
> git config --local diff.renames True
>
> commits="$(git log --format=%H --reverse $BASE..)"
> for c in $commits; do
> echo "Checking PATCH $n/$total: $(git show --no-patch --format=%s $c)..."
> if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; 
> then
> failed=1
> echo
> fi
> n=$((n+1))
> done
>
> exit $failed
> === TEST SCRIPT END ===
>
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> Switched to a new branch 'test'
> b9d801a Plumb the HAXM-based hardware acceleration support
> c855846 hax: simplify init
> 577d188 hax: remove non UG code
> edf12f7 target-i386: Add Intel HAX files
> cfebedf kvm: move cpu synchronization code
>
> === OUTPUT BEGIN ===
> fatal: unrecognized argument: --no-patch
> Checking PATCH 1/5: ...
> fatal: unrecognized argument: --no-patch
> Checking PATCH 2/5: ...
> fatal: unrecognized argument: --no-patch
> Checking PATCH 3/5: ...
> fatal: unrecognized argument: --no-patch
> Checking PATCH 4/5: ...
> fatal: unrecognized argument: --no-patch
> Checking PATCH 5/5: ...
> ERROR: spaces required around that '-' (ctx:WxV)
> #158: FILE: cpus.c:1276:
> +if (SuspendThread(cpu->hThread) == (DWORD) -1) {
> ^
>
> ERROR: spaces required around that '-' (ctx:WxV)
> #176: FILE: cpus.c:1294:
> +if (ResumeThread(cpu->hThread) == (DWORD) -1) {
>^
>

Apparently I missed these two warnings when I did my final checkpatch pass.
At the same time, I have no idea how to solve this one, can anybody
enlighten me?
Having a space on both sides of a unary minus doesn't seem to be what we want,
so I imagine something else is wrong and is confusing checkpatch.pl a bit.


> total: 2 errors, 0 warnings, 349 lines checked
>
> Your patch has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
>
> === OUTPUT END ===
>
> Test command exited with code: 1
>
>
> ---
> Email generated automatically by Patchew [http://patchew.org/].
> Please send your feedback to patchew-de...@freelists.org



Re: [Qemu-devel] [PATCH 3/3] tests: add XSCOM tests for the PowerNV machine

2016-11-14 Thread Cédric Le Goater

> +static uint64_t pnv_xscom_addr(const PnvChip *chip, uint32_t pcba)
> +{
> +uint64_t addr = chip->xscom_base;
> +
> +if (chip->chip_type == PNV_CHIP_POWER9) {
> +return addr | ((uint64_t) pcba << 3);
> +} else {
> +return addr | (((uint64_t) pcba << 4) & ~0xfful) |

This should be '~0xffull' on 32-bit systems. I will send a v2 including
a change to replace hweight_long() with ctpop64() in pnv.c.
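The pitfall in a nutshell (hypothetical standalone values, not from the patch):
on an ILP32 host, 0xfful is a 32-bit unsigned long, so ~0xfful is only
0xffffff00 and the mask silently clears the upper 32 bits of the address:

    uint64_t v = 0x123456789aull;
    uint64_t a = v & ~0xfful;   /* 0x34567800 on a 32-bit host   */
    uint64_t b = v & ~0xffull;  /* 0x1234567800 on every host    */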

Thanks,

C. 

> +(((uint64_t) pcba << 3) & 0x78);
> +}
> +}
> +




Re: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 09:21, Vincent Palatin wrote:
> On Sun, Nov 13, 2016 at 4:20 AM,   wrote:
>> Hi,
>>
>> Your series seems to have some coding style problems. See output below for
>> more information:
>>
>> Type: series
>> Subject: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support
>> Message-id: cover.1478863621.git.vpala...@chromium.org
>>
>> === TEST SCRIPT BEGIN ===
>> #!/bin/bash
>>
>> BASE=base
>> n=1
>> total=$(git log --oneline $BASE.. | wc -l)
>> failed=0
>>
>> # Useful git options
>> git config --local diff.renamelimit 0
>> git config --local diff.renames True
>>
>> commits="$(git log --format=%H --reverse $BASE..)"
>> for c in $commits; do
>> echo "Checking PATCH $n/$total: $(git show --no-patch --format=%s $c)..."
>> if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; 
>> then
>> failed=1
>> echo
>> fi
>> n=$((n+1))
>> done
>>
>> exit $failed
>> === TEST SCRIPT END ===
>>
>> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
>> Switched to a new branch 'test'
>> b9d801a Plumb the HAXM-based hardware acceleration support
>> c855846 hax: simplify init
>> 577d188 hax: remove non UG code
>> edf12f7 target-i386: Add Intel HAX files
>> cfebedf kvm: move cpu synchronization code
>>
>> === OUTPUT BEGIN ===
>> fatal: unrecognized argument: --no-patch
>> Checking PATCH 1/5: ...
>> fatal: unrecognized argument: --no-patch
>> Checking PATCH 2/5: ...
>> fatal: unrecognized argument: --no-patch
>> Checking PATCH 3/5: ...
>> fatal: unrecognized argument: --no-patch
>> Checking PATCH 4/5: ...
>> fatal: unrecognized argument: --no-patch
>> Checking PATCH 5/5: ...
>> ERROR: spaces required around that '-' (ctx:WxV)
>> #158: FILE: cpus.c:1276:
>> +if (SuspendThread(cpu->hThread) == (DWORD) -1) {
>> ^
>>
>> ERROR: spaces required around that '-' (ctx:WxV)
>> #176: FILE: cpus.c:1294:
>> +if (ResumeThread(cpu->hThread) == (DWORD) -1) {
>>^
>>
> 
> Apparently I missed these two warnings when I did my final checkpatch
> pass.
> At the same time, I have no idea how to solve this one, can anybody
> enlighten me?
> Having a space on both sides of a unary minus doesn't seem to be what we
> want, so I imagine something else is wrong and is confusing checkpatch.pl
> a bit.

Yes, I think it's the Win32 "DWORD" type that confuses it.  You can
ignore it.

Paolo



Re: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support

2016-11-14 Thread Stefan Weil

On 11/14/16 09:21, Vincent Palatin wrote:

On Sun, Nov 13, 2016 at 4:20 AM,   wrote:


=== OUTPUT BEGIN ===
fatal: unrecognized argument: --no-patch
Checking PATCH 1/5: ...
fatal: unrecognized argument: --no-patch
Checking PATCH 2/5: ...
fatal: unrecognized argument: --no-patch
Checking PATCH 3/5: ...
fatal: unrecognized argument: --no-patch
Checking PATCH 4/5: ...
fatal: unrecognized argument: --no-patch
Checking PATCH 5/5: ...
ERROR: spaces required around that '-' (ctx:WxV)
#158: FILE: cpus.c:1276:
+if (SuspendThread(cpu->hThread) == (DWORD) -1) {
^

ERROR: spaces required around that '-' (ctx:WxV)
#176: FILE: cpus.c:1294:
+if (ResumeThread(cpu->hThread) == (DWORD) -1) {
   ^



Apparently I missed these two warnings when I did my final checkpatch pass.
At the same time, I have no idea how to solve this one, can anybody
enlighten me?
Having a space on both sides of a unary minus doesn't seem to be what we want,
so I imagine something else is wrong and is confusing checkpatch.pl a bit.



(DWORD)(-1) works for checkpatch.pl.
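For reference, a minimal sketch of the two spellings (hypothetical snippet,
not the actual cpus.c hunk):

    DWORD ret = SuspendThread(cpu->hThread);

    if (ret == (DWORD) -1) {    /* flagged: checkpatch parses '-' as a binary operator */
        /* report error */
    }
    if (ret == (DWORD)(-1)) {   /* accepted by checkpatch; same semantics */
        /* report error */
    }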

I noticed a strange behaviour of checkpatch.pl:

If I run it on all of your five commits, it does not find any error.
If I run it only on the 5th commit, it shows the two errors like above.

Stefan




Re: [Qemu-devel] [PATCH V11 1/1] fsdev: add IO throttle support to fsdev devices

2016-11-14 Thread Pradeep Jagadeesh

On 11/12/2016 3:13 PM, Greg Kurz wrote:

On Fri, 11 Nov 2016 03:54:27 -0500
Pradeep Jagadeesh  wrote:


Uses throttling APIs to limit I/O bandwidth and the number of operations on
devices which use the 9p-local driver.

Signed-off-by: Pradeep Jagadeesh 
Reviewed-by: Alberto Garcia 
---


Hi Pradeep,

I'll have a look next week but I'm not sure this can go to 2.8 since we're
already in soft feature freeze (only bug fixes are accepted).


Hi Greg,

It is OK even if it does not make it into 2.8. I just want to complete this
work from my side.


Thanks,
Pradeep


Cheers.

--
Greg


 fsdev/Makefile.objs |   2 +-
 fsdev/file-op-9p.h  |   3 ++
 fsdev/qemu-fsdev-opts.c |  77 -
 fsdev/qemu-fsdev-throttle.c | 117 
 fsdev/qemu-fsdev-throttle.h |  39 +++
 hw/9pfs/9p-local.c  |   8 +++
 hw/9pfs/9p.c|   5 ++
 hw/9pfs/cofile.c|   2 +
 8 files changed, 251 insertions(+), 2 deletions(-)
 create mode 100644 fsdev/qemu-fsdev-throttle.c
 create mode 100644 fsdev/qemu-fsdev-throttle.h

This adds support for the 9p-local driver.
For now this functionality can only be enabled through QEMU command-line
options; a QMP interface and support for other drivers need further
extensions. To make it simple for other drivers, the throttle code has been
put in separate files.
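For illustration, a hypothetical command line enabling a limit on a 9p share
(the throttling.* option name comes from the diff below; exact syntax untested):

    qemu-system-x86_64 ... \
        -fsdev local,id=fs0,path=/srv/share,security_model=none,throttling.iops-total=100 \
        -device virtio-9p-pci,fsdev=fs0,mount_tag=share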

v1 -> v2:

-Fixed FsContext redeclaration issue
-Removed couple of function declarations from 9p-throttle.h
-Fixed some of the .help messages

v2 -> v3:

-Addressed following comments by Claudio Fontana
 -Removed redundant memset calls in fsdev_throttle_configure_iolimits function
 -Checking throttle structure validity before initializing other structures
  in fsdev_throttle_configure_iolimits

-Addressed following comments by Greg Kurz
 -Moved the code from the 9pfs directory to the fsdev directory, because the
  throttling is for the fsdev devices. Renamed the files and functions from
  9pfs_ to fsdev_
 -Renamed throttling cli options to throttling.*, as in QMP cli options
 -Removed some of the unwanted .h files from qemu-fsdev-throttle.[ch]
 -Using the throttle_enabled() function to set the throttle enabled flag for fsdev.

v3 -> v4:

-Addressed following comments by Alberto Garcia
 -Removed the unwanted locking and other data structures in 
qemu-fsdev-throttle.[ch]

-Addressed following comments by Greg Kurz
 -Removed fsdev_iolimitsenable/disable functions, instead using 
throttle_enabled function

v4 -> V5:
 -Fixed the issue with the larger block size accounting.
 (i.e, when the 9pfs mounted using msize=xxx option)

V5 -> V6:
-Addressed the comments by Alberto Garcia
 -Removed the fsdev_throttle_timer_cb()
 -Simplified the  fsdev_throttle_schedule_next_request() as suggested

V6 -> V7:
-Addressed the comments by Alberto Garcia
 -changed the  fsdev_throttle_schedule_next_request() as suggested

v7 -> v8:
-Addressed comments by Alberto Garcia
 -Fixed some indentation issues and split the configure_io_limit function
 -Inlined throttle_timer_check code

V8 -> v9:
-Addressed the comments by Greg Kurz
 -Inlined the fsdev_throttle_schedule_next_request() code into
  fsdev_co_throttle_request ()

v9 -> v10:
-Addressed the comments by Alberto Garcia
 -fixed indentation and other minor issues

v10 -> v11:
-Addressed the comments by Alberto Garcia
 -renamed the err variable to errp


diff --git a/fsdev/Makefile.objs b/fsdev/Makefile.objs
index 1b120a4..659df6e 100644
--- a/fsdev/Makefile.objs
+++ b/fsdev/Makefile.objs
@@ -5,7 +5,7 @@ common-obj-y = qemu-fsdev.o 9p-marshal.o 9p-iov-marshal.o
 else
 common-obj-y = qemu-fsdev-dummy.o
 endif
-common-obj-y += qemu-fsdev-opts.o
+common-obj-y += qemu-fsdev-opts.o qemu-fsdev-throttle.o

 # Toplevel always builds this; targets without virtio will put it in
 # common-obj-y
diff --git a/fsdev/file-op-9p.h b/fsdev/file-op-9p.h
index 6db9fea..33fe822 100644
--- a/fsdev/file-op-9p.h
+++ b/fsdev/file-op-9p.h
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include "qemu-fsdev-throttle.h"

 #define SM_LOCAL_MODE_BITS0600
 #define SM_LOCAL_DIR_MODE_BITS0700
@@ -74,6 +75,7 @@ typedef struct FsDriverEntry {
 char *path;
 int export_flags;
 FileOperations *ops;
+FsThrottle fst;
 } FsDriverEntry;

 typedef struct FsContext
@@ -83,6 +85,7 @@ typedef struct FsContext
 int export_flags;
 struct xattr_operations **xops;
 struct extended_ops exops;
+FsThrottle *fst;
 /* fs driver specific data */
 void *private;
 } FsContext;
diff --git a/fsdev/qemu-fsdev-opts.c b/fsdev/qemu-fsdev-opts.c
index 1dd8c7a..385423f0 100644
--- a/fsdev/qemu-fsdev-opts.c
+++ b/fsdev/qemu-fsdev-opts.c
@@ -37,8 +37,83 @@ static QemuOptsList qemu_fsdev_opts = {
 }, {
 .name = "sock_fd",
 .type = QEMU_OPT_NUMBER,
+}, {
+.name = "throttling.iops-total",
+.type = QEMU_OPT_NUMBER,
+.help = "limit total I/O operations per second",
+ 

[Qemu-devel] [PATCH v2 0/4] ppc/pnv: XSCOM fixes and unit tests

2016-11-14 Thread Cédric Le Goater
Hello,

Here is a little series adding some fixes for the XSCOM registers of
the POWER9 cores and a unit test. 

Changes since v1 :

 - fixed pnv_xscom_addr() for 32bit host systems
 - replace hweight_long() by ctpop64()

Tested on a 32-bit LE system and on 64-bit LE and BE systems.

Thanks,

C.

Cédric Le Goater (3):
  ppc/pnv: add a 'xscom_core_base' field to PnvChipClass
  ppc/pnv: fix xscom address translation for POWER9
  tests: add XSCOM tests for the PowerNV machine

David Gibson (1):
  ppc/pnv: Fix fatal bug on 32-bit hosts

 hw/ppc/pnv.c   |  10 +++-
 hw/ppc/pnv_xscom.c |   8 +--
 include/hw/ppc/pnv.h   |   1 +
 include/hw/ppc/pnv_xscom.h |   5 +-
 tests/Makefile.include |   1 +
 tests/pnv-xscom-test.c | 140 +
 6 files changed, 156 insertions(+), 9 deletions(-)
 create mode 100644 tests/pnv-xscom-test.c

-- 
2.7.4




[Qemu-devel] [PATCH v2 1/4] ppc/pnv: add a 'xscom_core_base' field to PnvChipClass

2016-11-14 Thread Cédric Le Goater
The XSCOM addresses for the core registers are encoded in a slightly
different way on POWER8 and POWER9.

Signed-off-by: Cédric Le Goater 
---
 hw/ppc/pnv.c   | 8 +++-
 include/hw/ppc/pnv.h   | 1 +
 include/hw/ppc/pnv_xscom.h | 5 ++---
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 6af34241f248..e7779581545d 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -521,6 +521,7 @@ static void pnv_chip_power8e_class_init(ObjectClass *klass, 
void *data)
 k->cores_mask = POWER8E_CORE_MASK;
 k->core_pir = pnv_chip_core_pir_p8;
 k->xscom_base = 0x003fc00ull;
+k->xscom_core_base = 0x1000ull;
 dc->desc = "PowerNV Chip POWER8E";
 }
 
@@ -542,6 +543,7 @@ static void pnv_chip_power8_class_init(ObjectClass *klass, 
void *data)
 k->cores_mask = POWER8_CORE_MASK;
 k->core_pir = pnv_chip_core_pir_p8;
 k->xscom_base = 0x003fc00ull;
+k->xscom_core_base = 0x1000ull;
 dc->desc = "PowerNV Chip POWER8";
 }
 
@@ -563,6 +565,7 @@ static void pnv_chip_power8nvl_class_init(ObjectClass 
*klass, void *data)
 k->cores_mask = POWER8_CORE_MASK;
 k->core_pir = pnv_chip_core_pir_p8;
 k->xscom_base = 0x003fc00ull;
+k->xscom_core_base = 0x1000ull;
 dc->desc = "PowerNV Chip POWER8NVL";
 }
 
@@ -584,6 +587,7 @@ static void pnv_chip_power9_class_init(ObjectClass *klass, 
void *data)
 k->cores_mask = POWER9_CORE_MASK;
 k->core_pir = pnv_chip_core_pir_p9;
 k->xscom_base = 0x00603fcull;
+k->xscom_core_base = 0x0ull;
 dc->desc = "PowerNV Chip POWER9";
 }
 
@@ -691,7 +695,9 @@ static void pnv_chip_realize(DeviceState *dev, Error **errp)
 object_unref(OBJECT(pnv_core));
 
 /* Each core has an XSCOM MMIO region */
-pnv_xscom_add_subregion(chip, PNV_XSCOM_EX_CORE_BASE(core_hwid),
+pnv_xscom_add_subregion(chip,
+PNV_XSCOM_EX_CORE_BASE(pcc->xscom_core_base,
+   core_hwid),
 &PNV_CORE(pnv_core)->xscom_regs);
 i++;
 }
diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
index 7bee658733db..df98a72006e4 100644
--- a/include/hw/ppc/pnv.h
+++ b/include/hw/ppc/pnv.h
@@ -69,6 +69,7 @@ typedef struct PnvChipClass {
 uint64_t cores_mask;
 
 hwaddr   xscom_base;
+hwaddr   xscom_core_base;
 
 uint32_t (*core_pir)(PnvChip *chip, uint32_t core_id);
 } PnvChipClass;
diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
index 41a5127a1907..0faa1847bf13 100644
--- a/include/hw/ppc/pnv_xscom.h
+++ b/include/hw/ppc/pnv_xscom.h
@@ -40,7 +40,7 @@ typedef struct PnvXScomInterfaceClass {
 } PnvXScomInterfaceClass;
 
 /*
- * Layout of the XSCOM PCB addresses of EX core 1
+ * Layout of the XSCOM PCB addresses of EX core 1 (POWER 8)
  *
  *   GPIO0x1100
  *   SCOM0x1101
@@ -54,8 +54,7 @@ typedef struct PnvXScomInterfaceClass {
  *   PCB SLAVE   0x110F
  */
 
-#define PNV_XSCOM_EX_BASE 0x1000
-#define PNV_XSCOM_EX_CORE_BASE(i) (PNV_XSCOM_EX_BASE | (((uint64_t)i) << 24))
+#define PNV_XSCOM_EX_CORE_BASE(base, i) (base | (((uint64_t)i) << 24))
 #define PNV_XSCOM_EX_CORE_SIZE0x10
 
 #define PNV_XSCOM_LPC_BASE0xb0020
-- 
2.7.4




[Qemu-devel] [PATCH v2 3/4] ppc/pnv: Fix fatal bug on 32-bit hosts

2016-11-14 Thread Cédric Le Goater
From: David Gibson 

If the pnv machine type is compiled on a 32-bit host, the unsigned long
(host) type is 32-bit.  This means that the hweight_long() used to
calculate the number of allowed cores only considers the low 32 bits of
the cores_mask variable, and can thus return 0 in some circumstances.

This corrects the bug.
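To illustrate the truncation (hypothetical mask value, not part of the patch):

    uint64_t cores_mask = 0x00ffff0000000000ull;  /* bits only in the high word */
    int bad  = hweight_long(cores_mask);  /* 0 on a 32-bit host: the argument is
                                             truncated to unsigned long first   */
    int good = ctpop64(cores_mask);       /* 16 on every host */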

Signed-off-by: David Gibson 
Suggested-by: Richard Henderson 
[clg: replaced hweight_long() by ctpop64() ]
Signed-off-by: Cédric Le Goater 
---
 hw/ppc/pnv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index e7779581545d..9df7b25315af 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -620,7 +620,7 @@ static void pnv_chip_core_sanitize(PnvChip *chip, Error 
**errp)
 chip->cores_mask &= pcc->cores_mask;
 
 /* now that we have a sane layout, let check the number of cores */
-cores_max = hweight_long(chip->cores_mask);
+cores_max = ctpop64(chip->cores_mask);
 if (chip->nr_cores > cores_max) {
 error_setg(errp, "warning: too many cores for chip ! Limit is %d",
cores_max);
-- 
2.7.4




[Qemu-devel] [PATCH v2 2/4] ppc/pnv: fix xscom address translation for POWER9

2016-11-14 Thread Cédric Le Goater
High addresses can overflow when the uint32_t pcba value is shifted left
by 3 (to form an 8-byte offset), so cast it to 64 bits first.

Signed-off-by: Cédric Le Goater 
---
 hw/ppc/pnv_xscom.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/ppc/pnv_xscom.c b/hw/ppc/pnv_xscom.c
index f46646141a96..8da271872f59 100644
--- a/hw/ppc/pnv_xscom.c
+++ b/hw/ppc/pnv_xscom.c
@@ -124,8 +124,8 @@ static uint64_t xscom_read(void *opaque, hwaddr addr, 
unsigned width)
 goto complete;
 }
 
-val = address_space_ldq(&chip->xscom_as, pcba << 3, MEMTXATTRS_UNSPECIFIED,
-&result);
+val = address_space_ldq(&chip->xscom_as, (uint64_t) pcba << 3,
+MEMTXATTRS_UNSPECIFIED, &result);
 if (result != MEMTX_OK) {
 qemu_log_mask(LOG_GUEST_ERROR, "XSCOM read failed at @0x%"
   HWADDR_PRIx " pcba=0x%08x\n", addr, pcba);
@@ -150,8 +150,8 @@ static void xscom_write(void *opaque, hwaddr addr, uint64_t 
val,
 goto complete;
 }
 
-address_space_stq(&chip->xscom_as, pcba << 3, val, MEMTXATTRS_UNSPECIFIED,
-  &result);
+address_space_stq(&chip->xscom_as, (uint64_t) pcba << 3, val,
+  MEMTXATTRS_UNSPECIFIED, &result);
 if (result != MEMTX_OK) {
 qemu_log_mask(LOG_GUEST_ERROR, "XSCOM write failed at @0x%"
   HWADDR_PRIx " pcba=0x%08x data=0x%" PRIx64 "\n",
-- 
2.7.4




[Qemu-devel] [PATCH v2 4/4] tests: add XSCOM tests for the PowerNV machine

2016-11-14 Thread Cédric Le Goater
Add a couple of tests on the XSCOM bus of the PowerNV machine for the
POWER8 and POWER9 CPUs. The first test reads the CFAM identifier of
the chip. The second test goes further in the XSCOM address space and
reaches the cores to read their DTS registers.

Signed-off-by: Cédric Le Goater 
---

 Changes since v1:

 - fixed pnv_xscom_addr() for 32bit host systems

 tests/Makefile.include |   1 +
 tests/pnv-xscom-test.c | 140 +
 2 files changed, 141 insertions(+)
 create mode 100644 tests/pnv-xscom-test.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index de516341fd44..90c9ad9ac6e1 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -270,6 +270,7 @@ gcov-files-ppc64-y = ppc64-softmmu/hw/ppc/spapr_pci.c
 check-qtest-ppc64-y += tests/endianness-test$(EXESUF)
 check-qtest-ppc64-y += tests/boot-order-test$(EXESUF)
 check-qtest-ppc64-y += tests/prom-env-test$(EXESUF)
+check-qtest-ppc64-y += tests/pnv-xscom-test$(EXESUF)
 check-qtest-ppc64-y += tests/drive_del-test$(EXESUF)
 check-qtest-ppc64-y += tests/postcopy-test$(EXESUF)
 check-qtest-ppc64-y += tests/boot-serial-test$(EXESUF)
diff --git a/tests/pnv-xscom-test.c b/tests/pnv-xscom-test.c
new file mode 100644
index ..f9294e677382
--- /dev/null
+++ b/tests/pnv-xscom-test.c
@@ -0,0 +1,140 @@
+/*
+ * QTest testcase for PowerNV XSCOM bus
+ *
+ * Copyright (c) 2016, IBM Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+
+#include "libqtest.h"
+
+typedef enum PnvChipType {
+PNV_CHIP_POWER8E, /* AKA Murano (default) */
+PNV_CHIP_POWER8,  /* AKA Venice */
+PNV_CHIP_POWER8NVL,   /* AKA Naples */
+PNV_CHIP_POWER9,  /* AKA Nimbus */
+} PnvChipType;
+
+typedef struct PnvChip {
+PnvChipType chip_type;
+const char *cpu_model;
+uint64_txscom_base;
+uint64_txscom_core_base;
+uint64_tcfam_id;
+uint32_tfirst_core;
+} PnvChip;
+
+static const PnvChip pnv_chips[] = {
+{
+.chip_type  = PNV_CHIP_POWER8,
+.cpu_model  = "POWER8",
+.xscom_base = 0x0003fc00ull,
+.xscom_core_base = 0x1000ull,
+.cfam_id= 0x220ea0498000ull,
+.first_core = 0x1,
+}, {
+.chip_type  = PNV_CHIP_POWER8NVL,
+.cpu_model  = "POWER8NVL",
+.xscom_base = 0x0003fc00ull,
+.xscom_core_base = 0x1000ull,
+.cfam_id= 0x120d30498000ull,
+.first_core = 0x1,
+}, {
+.chip_type  = PNV_CHIP_POWER9,
+.cpu_model  = "POWER9",
+.xscom_base = 0x000603fcull,
+.xscom_core_base = 0x0ull,
+.cfam_id= 0x100d10498000ull,
+.first_core = 0x20,
+},
+};
+
+static uint64_t pnv_xscom_addr(const PnvChip *chip, uint32_t pcba)
+{
+uint64_t addr = chip->xscom_base;
+
+if (chip->chip_type == PNV_CHIP_POWER9) {
+ addr |= ((uint64_t) pcba << 3);
+} else {
+ addr |= (((uint64_t) pcba << 4) & ~0xffull) |
+(((uint64_t) pcba << 3) & 0x78);
+}
+return addr;
+}
+
+static uint64_t pnv_xscom_read(const PnvChip *chip, uint32_t pcba)
+{
+return readq(pnv_xscom_addr(chip, pcba));
+}
+
+static void test_xscom_cfam_id(const PnvChip *chip)
+{
+uint64_t f000f = pnv_xscom_read(chip, 0xf000f);
+
+g_assert_cmphex(f000f, ==, chip->cfam_id);
+}
+
+static void test_cfam_id(const void *data)
+{
+char *args;
+const PnvChip *chip = data;
+
+args = g_strdup_printf("-M powernv,accel=tcg -cpu %s", chip->cpu_model);
+
+qtest_start(args);
+test_xscom_cfam_id(chip);
+qtest_quit(global_qtest);
+
+g_free(args);
+}
+
+#define PNV_XSCOM_EX_CORE_BASE(chip, i) \
+((chip)->xscom_core_base | (((uint64_t)i) << 24))
+#define PNV_XSCOM_EX_DTS_RESULT0 0x5
+
+static void test_xscom_core(const PnvChip *chip)
+{
+uint32_t first_core_dts0 =
+PNV_XSCOM_EX_CORE_BASE(chip, chip->first_core) |
+PNV_XSCOM_EX_DTS_RESULT0;
+uint64_t dts0 = pnv_xscom_read(chip, first_core_dts0);
+
+g_assert_cmphex(dts0, ==, 0x26f024f023full);
+}
+
+static void test_core(const void *data)
+{
+char *args;
+const PnvChip *chip = data;
+
+args = g_strdup_printf("-M powernv,accel=tcg -cpu %s", chip->cpu_model);
+
+qtest_start(args);
+test_xscom_core(chip);
+qtest_quit(global_qtest);
+
+g_free(args);
+}
+
+static void add_test(const char *name, void (*test)(const void *data))
+{
+int i;
+
+for (i = 0; i < ARRAY_SIZE(pnv_chips); i++) {
+char *tname = g_strdup_printf("pnv-xscom/%s/%s", name,
+  pnv_chips[i].cpu_model);
+qtest_add_data_func(tname, &pnv_chips[i], test);
+g_free(tname);
+}
+}
+
+int main(int argc, char **argv)
+{
+g_test_init(&argc, &argv, NULL);
+
+add_test("cfam_id", 

Re: [Qemu-devel] [PATCH v2 0/4] ppc/pnv: XSCOM fixes and unit tests

2016-11-14 Thread no-reply
Hi,

Your series seems to have some coding style problems. See output below for
more information:

Type: series
Subject: [Qemu-devel] [PATCH v2 0/4]  ppc/pnv: XSCOM fixes and unit tests
Message-id: 1479114778-3881-1-git-send-email-...@kaod.org

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git show --no-patch --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag] patchew/1479114778-3881-1-git-send-email-...@kaod.org -> 
patchew/1479114778-3881-1-git-send-email-...@kaod.org
Switched to a new branch 'test'
d3d45dc tests: add XSCOM tests for the PowerNV machine
495b638 ppc/pnv: Fix fatal bug on 32-bit hosts
662eee5 ppc/pnv: fix xscom address translation for POWER9
1a6e1d7 ppc/pnv: add a 'xscom_core_base' field to PnvChipClass

=== OUTPUT BEGIN ===
fatal: unrecognized argument: --no-patch
Checking PATCH 1/4: ...
fatal: unrecognized argument: --no-patch
Checking PATCH 2/4: ...
fatal: unrecognized argument: --no-patch
Checking PATCH 3/4: ...
fatal: unrecognized argument: --no-patch
Checking PATCH 4/4: ...
ERROR: suspect code indent for conditional statements (4, 9)
#92: FILE: tests/pnv-xscom-test.c:58:
+if (chip->chip_type == PNV_CHIP_POWER9) {
+ addr |= ((uint64_t) pcba << 3);

total: 1 errors, 0 warnings, 147 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-de...@freelists.org

Re: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support

2016-11-14 Thread Vincent Palatin
On Mon, Nov 14, 2016 at 9:55 AM, Stefan Weil  wrote:
> On 11/14/16 09:21, Vincent Palatin wrote:
>>
>> On Sun, Nov 13, 2016 at 4:20 AM,   wrote:
>>>
>>>
>>> === OUTPUT BEGIN ===
>>> fatal: unrecognized argument: --no-patch
>>> Checking PATCH 1/5: ...
>>> fatal: unrecognized argument: --no-patch
>>> Checking PATCH 2/5: ...
>>> fatal: unrecognized argument: --no-patch
>>> Checking PATCH 3/5: ...
>>> fatal: unrecognized argument: --no-patch
>>> Checking PATCH 4/5: ...
>>> fatal: unrecognized argument: --no-patch
>>> Checking PATCH 5/5: ...
>>> ERROR: spaces required around that '-' (ctx:WxV)
>>> #158: FILE: cpus.c:1276:
>>> +if (SuspendThread(cpu->hThread) == (DWORD) -1) {
>>> ^
>>>
>>> ERROR: spaces required around that '-' (ctx:WxV)
>>> #176: FILE: cpus.c:1294:
>>> +if (ResumeThread(cpu->hThread) == (DWORD) -1) {
>>>^
>>>
>>
>> Apparently I missed these two warnings when I did my final checkpatch
>> pass.
>> At the same time, I have no idea how to solve this one, can anybody
>> enlighten me?
>> Having a space on both sides of a unary minus doesn't seem to be what we
>> want, so I imagine something else is wrong and is confusing checkpatch.pl
>> a bit.
>
>
>
> (DWORD)(-1) works for checkpatch.pl.

Yes, thanks. This should be an acceptable workaround, I will update my
next series.

>
> I noticed a strange behaviour of checkpatch.pl:
>
> If I run it on all of your five commits, it does not find any error.
> If I run it only on the 5th commit, it shows the two errors like above.

Interesting, that's why I missed it in the first place... I was a bit confused.

-- 
Vincent



Re: [Qemu-devel] [PATCH v2 2/5] target-i386: Add Intel HAX files

2016-11-14 Thread Stefan Weil
Am 11.11.2016 um 12:28 schrieb Vincent Palatin:
[...]
> 
> Signed-off-by: Vincent Palatin 
> ---
>  hax-stub.c  |   74 +++
>  include/sysemu/hax.h|   66 ++
>  target-i386/hax-all.c   | 1490 
> +++

Git warns about a whitespace issue:
The empty last line of target-i386/hax-all.c should be removed.

Stefan





Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related

2016-11-14 Thread Dr. David Alan Gilbert
* Russell King - ARM Linux (li...@armlinux.org.uk) wrote:
> On Fri, Nov 11, 2016 at 09:23:43PM +, David Woodhouse wrote:
> > It's also *fairly* unlikely that the kernel in the guest has developed
> > a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> > that qemu isn't properly emulating those bits. But at first glance at
> > the code, it looks like *that's* been there for the last decade too...
> 
> I take issue with that, having looked at the qemu rtl8139 code:
> 
> if ((txdw0 & CP_TX_LGSEN) && ip_protocol == IP_PROTO_TCP)
> {
> int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;
> 
> DPRINTF("+++ C+ mode offloaded task TSO MTU=%d IP data %d 
> "
> "frame data %d specified MSS=%d\n", ETH_MTU,
> ip_data_len, saved_size - ETH_HLEN, large_send_mss);
> 
> That's the only reference to "large_send_mss" there, other than that,
> the MSS value that gets stuck into the field by 8139cp.c is completely
> unused.  Instead, qemu does this:
> 
> eth_payload_data = saved_buffer + ETH_HLEN;
> eth_payload_len  = saved_size   - ETH_HLEN;
> 
> ip = (ip_header*)eth_payload_data;
> 
> hlen = IP_HEADER_LENGTH(ip);
> ip_data_len = be16_to_cpu(ip->ip_len) - hlen;
> 
> tcp_header *p_tcp_hdr = (tcp_header*)(eth_payload_data + 
> hlen);
> int tcp_hlen = TCP_HEADER_DATA_OFFSET(p_tcp_hdr);
> 
> /* ETH_MTU = ip header len + tcp header len + payload */
> int tcp_data_len = ip_data_len - tcp_hlen;
> int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;
> 
> for (tcp_send_offset = 0; tcp_send_offset < tcp_data_len; 
> tcp_send_offset += tcp_chunk_size)
> {
> 
> It uses a fixed value of ETH_MTU to calculate the size of the TCP
> data chunks, and this is not surprisingly the well known:
> 
> #define ETH_MTU 1500
> 
> Qemu seems to be buggy - it ignores the MSS value, and always tries to
> send 1500 byte frames.
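For reference, a rough sketch of what honouring the guest-supplied MSS could
look like (hypothetical, untested):

    int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;
    /* clamp each TCP chunk to the MSS requested by the guest instead of
     * assuming a 1500-byte MTU; keep the old behaviour if the MSS is 0 */
    int tcp_chunk_size = large_send_mss ? large_send_mss
                                        : ETH_MTU - hlen - tcp_hlen;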

cc'ing in Stefan who last touched that code and Jason and Vlad who
know the net code.

Dave

> -- 
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH] tcg/mips: Add support for mips64el backend

2016-11-14 Thread Jin Guojie
Richard,

I have studied your v2 patch:

https://lists.nongnu.org/archive/html/qemu-devel/2016-02/msg02969.html

Though I have not tested it on a Loongson machine, I feel it implements
the MIPS64 ISA very completely, including big/little endian and the N32
and N64 ABIs. The use of #if is cleaner, and many corner cases are well
handled. My patch is only a subset of yours.

I wonder why your patch has not been merged into mainline. If I had seen
it before, I would not have needed to spend time reinventing my own.

Since a TCG target for MIPS64 is of great use to developers, I really
hope this feature can be merged upstream.

Is your v2 patch still under review? Is there a chance it will be merged
in the not too distant future, or does other work need to be done first?

I would like your advice: should I test your v2 patch on Loongson and use
it, or is it worth modifying my patch and resubmitting it according to
your review comments?

Jin Guojie

-- Original --
From: "Richard Henderson"
Sent: Sunday, Nov 13, 2016 3:56 PM
To: "jinguojie"; "qemu-devel"
Cc: "Aurelien Jarno"
Subject: Re: [Qemu-devel] [PATCH] tcg/mips: Add support for mips64el backend



On 11/11/2016 08:20 AM, jinguo...@loongson.cn wrote:
> From: Jin Guojie 
>
> This patch implements TCG mips64r2(little-endian) translation backend.
> Tested on Loongson 3A2000(a MIPS64-compatible CPU) with Fedora Linux 21 Remix.
> linux-0.2.img.bz2 runs well.
> The performance is nearly 10 times higher than tci mode.
>
> https://en.wikipedia.org/wiki/Loongson
> http://www.loongnix.org/index.php/Loongnix
>
> Cc: Aurelien Jarno 
> Signed-off-by: Jin Guojie 

Have you seen

https://lists.nongnu.org/archive/html/qemu-devel/2016-02/msg01910.html

?  I know there are bugs in that patch set, but I would like any mips64 support 
to look like that.  In particular, reduce the use of #if to an absolute minimum.

> +#if UINTPTR_MAX == UINT32_MAX
> +# define TCG_TARGET_REG_BITS 32
> +#elif UINTPTR_MAX == UINT64_MAX
> +# define TCG_TARGET_REG_BITS 64
> +#endif

There are two mips64 ABIs.  You're only allowing 64-bit code to be generated
for n64, and not n32.

> +#undef use_movnz_instructions
> +#undef use_mips32_instructions
> +#undef use_mips32r6_instructions
> +
> +#define use_movnz_instructions  0
> +#define use_mips32_instructions  0
> +#define use_mips32r6_instructions  0

Why?  Certainly we should be able to generate code for mips64r2 and mips64r6.

> +#if TCG_TARGET_REG_BITS == 64
> +static const TCGReg tcg_target_call_oarg_regs[1] = {
> +TCG_REG_V0,
> +};
> +#else
>  static const TCGReg tcg_target_call_oarg_regs[2] = {
>  TCG_REG_V0,
>  TCG_REG_V1
>  };
> +#endif

This change would be incorrect if we ever enhance tcg to handle __int128_t.  In 
the meantime it doesn't matter, and can be left unchanged.

> @@ -459,7 +502,15 @@ static inline void tcg_out_mov(TCGContext *s, TCGType 
> type,
>  {
>  /* Simple reg-reg move, optimising out the 'do nothing' case */
>  if (ret != arg) {
> +#if TCG_TARGET_REG_BITS == 64
> +if (type == TCG_TYPE_I32) {
> +tcg_out_opc_reg(s, OPC_ADDU, ret, arg, TCG_REG_ZERO);
> +} else {
> +tcg_out_opc_reg(s, OPC_DADDU, ret, arg, TCG_REG_ZERO);
> +}
> +#else
>  tcg_out_opc_reg(s, OPC_ADDU, ret, arg, TCG_REG_ZERO);
> +#endif
>  }

This is why a proper mips assembler uses OPC_OR.
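i.e. something along these lines (sketch; assumes OPC_OR is available in the
opcode enum):

    /* OR with $zero copies the full register, so the same move works for
     * both 32-bit and 64-bit values and the ADDU/DADDU #ifdef goes away */
    if (ret != arg) {
        tcg_out_opc_reg(s, OPC_OR, ret, arg, TCG_REG_ZERO);
    }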

>  }
>
> @@ -470,12 +521,21 @@ static inline void tcg_out_movi(TCGContext *s, TCGType 
> type,
>  tcg_out_opc_imm(s, OPC_ADDIU, reg, TCG_REG_ZERO, arg);
>  } else if (arg == (uint16_t)arg) {
>  tcg_out_opc_imm(s, OPC_ORI, reg, TCG_REG_ZERO, arg);
> -} else {
> +} else if (arg == (int32_t)arg) {
>  tcg_out_opc_imm(s, OPC_LUI, reg, TCG_REG_ZERO, arg >> 16);
>  if (arg & 0xffff) {
>  tcg_out_opc_imm(s, OPC_ORI, reg, reg, arg & 0xffff);
>  }
>  }
> +#if TCG_TARGET_REG_BITS == 64
> +/* 64-bit imm */
> +else {
> +tcg_out_opc_imm(s, OPC_LUI, reg, 0, (arg >> 32) & 0xffff);
> +tcg_out_opc_imm(s, OPC_ORI, reg, reg, (arg >> 16) & 0xffff);
> +tcg_out_opc_imm_64(s, OPC_DSLL, reg, reg, 16);
> +tcg_out_opc_imm(s, OPC_ORI, reg, reg, arg & 0xffff);
> +}
> +#endif

This is only a 48-bit immediate.
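A full 64-bit load needs one more shift/or pair, roughly as follows (sketch
reusing the helper names from the patch; a real implementation would still
special-case constants that fit in fewer bits):

    tcg_out_opc_imm(s, OPC_LUI, reg, 0, (arg >> 48) & 0xffff);
    tcg_out_opc_imm(s, OPC_ORI, reg, reg, (arg >> 32) & 0xffff);
    tcg_out_opc_imm_64(s, OPC_DSLL, reg, reg, 16);
    tcg_out_opc_imm(s, OPC_ORI, reg, reg, (arg >> 16) & 0xffff);
    tcg_out_opc_imm_64(s, OPC_DSLL, reg, reg, 16);
    tcg_out_opc_imm(s, OPC_ORI, reg, reg, arg & 0xffff);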

>  }
>
>  static inline void tcg_out_bswap16(TCGContext *s, TCGReg ret, TCGReg arg)
> @@ -566,7 +626,11 @@ static void tcg_out_ldst(TCGContext *s, MIPSInsn opc, 
> TCGReg data,
>  if (ofs != lo) {
>  tcg_out_movi(s, TCG_TYPE_PTR, TCG_TMP0, ofs - lo);
>  if (addr != TCG_REG_ZERO) {
> +#if TCG_TARGET_REG_BITS == 64
> +tcg_out_opc_reg(s, OPC_DADDU, TCG_TMP0, TCG_TMP0, addr);
> +#else
>  tcg_out_opc_reg(s, OPC_ADDU, TCG_TMP0, TCG_TMP0, addr);
> +#endif

See my patchset where I introduce OPC_PADDU to avoid this and other similar 
ifdefs.

> @@ -1163,6 +1276,7 @@ static void tcg_ou

Re: [Qemu-devel] [PATCH v2 2/5] target-i386: Add Intel HAX files

2016-11-14 Thread Vincent Palatin
On Mon, Nov 14, 2016 at 10:29 AM, Stefan Weil  wrote:
> Am 11.11.2016 um 12:28 schrieb Vincent Palatin:
> [...]
>>
>> Signed-off-by: Vincent Palatin 
>> ---
>>  hax-stub.c  |   74 +++
>>  include/sysemu/hax.h|   66 ++
>>  target-i386/hax-all.c   | 1490 
>> +++
>
> Git warns about a whitespace issue:
> The empty last line of target-i386/hax-all.c should be removed.

Done, I will send it with the v3 series.

-- 
Vincent



Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-14 Thread Andrew Jones
On Mon, Nov 14, 2016 at 01:32:56PM +0800, Dave Young wrote:
> On 11/09/16 at 04:38pm, Laszlo Ersek wrote:
> > On 11/09/16 15:47, Daniel P. Berrange wrote:
> > > On Wed, Nov 09, 2016 at 01:20:51PM +0100, Andrew Jones wrote:
> > >> On Wed, Nov 09, 2016 at 11:58:19AM +, Daniel P. Berrange wrote:
> > >>> On Wed, Nov 09, 2016 at 12:48:09PM +0100, Andrew Jones wrote:
> >  On Wed, Nov 09, 2016 at 11:37:35AM +, Daniel P. Berrange wrote:
> > > On Wed, Nov 09, 2016 at 12:26:17PM +0100, Laszlo Ersek wrote:
> > >> On 11/09/16 11:40, Andrew Jones wrote:
> > >>> On Wed, Nov 09, 2016 at 11:01:46AM +0800, Dave Young wrote:
> >  Hi,
> > 
> >  Latest linux kernel enabled kaslr to randomiz phys/virt memory
> >  addresses, we had some effort to support kexec/kdump so that crash
> >  utility can still works in case crashed kernel has kaslr enabled.
> > 
> >  But according to Dave Anderson virsh dump does not work, quoted 
> >  messages
> >  from Dave below:
> > 
> >  """
> >  with virsh dump, there's no way of even knowing that KASLR
> >  has randomized the kernel __START_KERNEL_map region, because there 
> >  is no
> >  virtual address information -- e.g., like "SYMBOL(_stext)" in the 
> >  kdump
> >  vmcoreinfo data to compare against the vmlinux file symbol value.
> >  Unless virsh dump can export some basic virtual memory data, which
> >  they say it can't, I don't see how KASLR can ever be supported.
> >  """
> > 
> >  I assume virsh dump is using qemu guest memory dump facility so it
> >  should be first addressed in qemu. Thus post this query to qemu 
> >  devel
> >  list. If this is not correct please let me know.
> > 
> >  Could you qemu dump people make it work? Or we can not support 
> >  virt dump
> >  as long as KASLR being enabled. Latest Fedora kernel has enabled 
> >  it in x86_64.
> > 
> > >>>
> > >>> When the -kernel command line option is used, then it may be 
> > >>> possible
> > >>> to extract some information that could be used to supplement the 
> > >>> memory
> > >>> dump that dump-guest-memory provides. However, that would be a 
> > >>> specific
> > >>> use. In general, QEMU knows nothing about the guest kernel. It 
> > >>> doesn't
> > >>> know where it is in the disk image, and it doesn't even know if it's
> > >>> Linux.
> > >>>
> > >>> Is there anything a guest userspace application could probe from 
> > >>> e.g.
> > >>> /proc that would work? If so, then the guest agent could gain a new
> > >>> feature providing that.
> > >>
> > >> I fully agree. This is exactly what I suggested too, independently, 
> > >> in
> > >> the downstream thread, before arriving at this upstream thread. Let 
> > >> me
> > >> quote that email:
> > >>
> > >> On 11/09/16 12:09, Laszlo Ersek wrote:
> > >>> [...] the dump-guest-memory QEMU command supports an option called
> > >>> "paging". Here's its documentation, from the "qapi-schema.json" 
> > >>> source
> > >>> file:
> > >>>
> >  # @paging: if true, do paging to get guest's memory mapping. This 
> >  allows
> >  #  using gdb to process the core file.
> >  #
> >  #  IMPORTANT: this option can make QEMU allocate several 
> >  gigabytes
> >  # of RAM. This can happen for a large guest, 
> >  or a
> >  # malicious guest pretending to be large.
> >  #
> >  #  Also, paging=true has the following limitations:
> >  #
> >  # 1. The guest may be in a catastrophic state or can 
> >  have corrupted
> >  #memory, which cannot be trusted
> >  # 2. The guest can be in real-mode even if paging is 
> >  enabled. For
> >  #example, the guest uses ACPI to sleep, and ACPI 
> >  sleep state
> >  #goes in real-mode
> >  # 3. Currently only supported on i386 and x86_64.
> >  #
> > >>>
> > >>> "virsh dump --memory-only" sets paging=false, for obvious reasons.
> > >>>
> > >>> [...] the dump-guest-memory command provides a raw snapshot of the
> > >>> virtual machine's memory (and of the registers of the VCPUs); it is
> > >>> not enlightened about the guest.
> > >>>
> > >>> If the additional information you are looking for can be retrieved
> > >>> within the running Linux guest, using an appropriately privieleged
> > >>> userspace process, then I would recommend considering an extension 
> > >>> to
> > >>> the qemu guest agent. The management layer (libv

Re: [Qemu-devel] Assertion failure on qcow2 disk with cluster_size != 64k

2016-11-14 Thread Kevin Wolf
Am 24.10.2016 um 22:32 hat Eric Blake geschrieben:
> On 10/21/2016 08:14 AM, Ed Swierk wrote:
> > On Thu, Oct 20, 2016 at 6:38 PM, Eric Blake  wrote:
> >> On 10/20/2016 07:24 PM, Ed Swierk wrote:
> >>> Changing max_transfer in the normal write case to
> >>> MIN_NON_ZERO(alignment, MAX_WRITE_ZEROES_BOUNCE_BUFFER) appears to fix
> >>> the problem, but I don't pretend to understand all the subtleties
> >>> here.
> >>
> >> That actually sounds like the right fix.  But since the bug was probably
> >> caused by my code, I'll formalize it into a patch and see if I can
> >> modify the testsuite to give it coverage.
> > 
> > If alignment > MAX_WRITE_ZEROES_BOUNCE_BUFFER (however unlikely) we
> > have the same problem, so maybe this would be better?
> 
> Our qcow2 support is currently limited to a maximum of 2M clusters;
> while MAX_WRITE_ZEROES_BOUNCE_BUFFER is 32k * 512, or 16M.  The
> maximum-size bounce buffer should not be the problem here; but for some
> reason, it looks like alignment is larger than max_transfer which should
> not normally be possible.  I'm still playing with what should be the
> right patch, but hope to have something posted soon.

Are you still playing with it?

Kevin




Re: [Qemu-devel] [kvm-unit-tests PATCH v8 2/3] arm: pmu: Check cycle count increases

2016-11-14 Thread Andrew Jones
On Fri, Nov 11, 2016 at 01:55:49PM -0600, Wei Huang wrote:
> 
> 
> On 11/11/2016 01:43 AM, Andrew Jones wrote:
> > On Tue, Nov 08, 2016 at 12:17:14PM -0600, Wei Huang wrote:
> >> From: Christopher Covington 
> >>
> >> Ensure that reads of the PMCCNTR_EL0 are monotonically increasing,
> >> even for the smallest delta of two subsequent reads.
> >>
> >> Signed-off-by: Christopher Covington 
> >> Signed-off-by: Wei Huang 
> >> ---
> >>  arm/pmu.c | 98 
> >> +++
> >>  1 file changed, 98 insertions(+)
> >>
> >> diff --git a/arm/pmu.c b/arm/pmu.c
> >> index 0b29088..d5e3ac3 100644
> >> --- a/arm/pmu.c
> >> +++ b/arm/pmu.c
> >> @@ -14,6 +14,7 @@
> >>   */
> >>  #include "libcflat.h"
> >>  
> >> +#define PMU_PMCR_E (1 << 0)
> >>  #define PMU_PMCR_N_SHIFT   11
> >>  #define PMU_PMCR_N_MASK0x1f
> >>  #define PMU_PMCR_ID_SHIFT  16
> >> @@ -21,6 +22,10 @@
> >>  #define PMU_PMCR_IMP_SHIFT 24
> >>  #define PMU_PMCR_IMP_MASK  0xff
> >>  
> >> +#define PMU_CYCLE_IDX  31
> >> +
> >> +#define NR_SAMPLES 10
> >> +
> >>  #if defined(__arm__)
> >>  static inline uint32_t pmcr_read(void)
> >>  {
> >> @@ -29,6 +34,47 @@ static inline uint32_t pmcr_read(void)
> >>asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r" (ret));
> >>return ret;
> >>  }
> >> +
> >> +static inline void pmcr_write(uint32_t value)
> >> +{
> >> +  asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r" (value));
> >> +}
> >> +
> >> +static inline void pmselr_write(uint32_t value)
> >> +{
> >> +  asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r" (value));
> >> +}
> >> +
> >> +static inline void pmxevtyper_write(uint32_t value)
> >> +{
> >> +  asm volatile("mcr p15, 0, %0, c9, c13, 1" : : "r" (value));
> >> +}
> >> +
> >> +/*
> >> + * While PMCCNTR can be accessed as a 64 bit coprocessor register, 
> >> returning 64
> >> + * bits doesn't seem worth the trouble when differential usage of the 
> >> result is
> >> + * expected (with differences that can easily fit in 32 bits). So just 
> >> return
> >> + * the lower 32 bits of the cycle count in AArch32.
> > 
> > Like I said in the last review, I'd rather we not do this. We should
> > return the full value and then the test case should confirm the upper
> > 32 bits are zero.
> > 
> 
> Unless I miss something in ARM documentation, ARMv7 PMCCNTR is a 32-bit
> register. We can force it to a more coarse-grained cycle counter with
> PMCR.D bit=1 (see below). But it is still not a 64-bit register. ARMv8
> PMCCNTR_EL0 is a 64-bit register.
> 
> "The PMCR.D bit configures whether PMCCNTR increments once every clock
> cycle, or once every 64 clock cycles. "
> 
> So I think the comment above in the code is an overstatement, which
> should be deleted or moved down to ARMv8 pmccntr_read() below.

OK, please fix as appropriate, but for the v8 64-bit register, please
don't drop the upper bits until after a unit test has a chance to check
them.
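For the AArch64 side that would be something like (sketch):

    static inline uint64_t pmccntr_read(void)
    {
        uint64_t cycles;

        asm volatile("mrs %0, pmccntr_el0" : "=r" (cycles));
        return cycles;
    }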

Thanks,
drew

> 
> >> + */
> >> +static inline uint32_t pmccntr_read(void)
> >> +{
> >> +  uint32_t cycles;
> >> +
> >> +  asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));
> >> +  return cycles;
> >> +}
> >> +
> >> +static inline void pmcntenset_write(uint32_t value)
> >> +{
> >> +  asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r" (value));
> >> +}
> >> +
> >> +/* PMCCFILTR is an obsolete name for PMXEVTYPER31 in ARMv7 */
> >> +static inline void pmccfiltr_write(uint32_t value)
> >> +{
> >> +  pmselr_write(PMU_CYCLE_IDX);
> >> +  pmxevtyper_write(value);
> >> +}
> >>  #elif defined(__aarch64__)
> >>  static inline uint32_t pmcr_read(void)
> >>  {
> >> @@ -37,6 +83,29 @@ static inline uint32_t pmcr_read(void)
> >>asm volatile("mrs %0, pmcr_el0" : "=r" (ret));
> >>return ret;
> >>  }
> >> +
> >> +static inline void pmcr_write(uint32_t value)
> >> +{
> >> +  asm volatile("msr pmcr_el0, %0" : : "r" (value));
> >> +}
> >> +
> >> +static inline uint32_t pmccntr_read(void)
> >> +{
> >> +  uint32_t cycles;
> >> +
> >> +  asm volatile("mrs %0, pmccntr_el0" : "=r" (cycles));
> >> +  return cycles;
> >> +}
> >> +
> >> +static inline void pmcntenset_write(uint32_t value)
> >> +{
> >> +  asm volatile("msr pmcntenset_el0, %0" : : "r" (value));
> >> +}
> >> +
> >> +static inline void pmccfiltr_write(uint32_t value)
> >> +{
> >> +  asm volatile("msr pmccfiltr_el0, %0" : : "r" (value));
> >> +}
> >>  #endif
> >>  
> >>  /*
> >> @@ -63,11 +132,40 @@ static bool check_pmcr(void)
> >>return ((pmcr >> PMU_PMCR_IMP_SHIFT) & PMU_PMCR_IMP_MASK) != 0;
> >>  }
> >>  
> >> +/*
> >> + * Ensure that the cycle counter progresses between back-to-back reads.
> >> + */
> >> +static bool check_cycles_increase(void)
> >> +{
> >> +  pmcr_write(pmcr_read() | PMU_PMCR_E);
> >> +
> >> +  for (int i = 0; i < NR_SAMPLES; i++) {
> >> +  unsigned long a, b;
> >> +
> >> +  a = pmccntr_read();
> >> +  b = pmccntr_read();
> >> +
> >> +  if (a >= b) {
> >> +  printf("Read %ld then %ld.\n",

Re: [Qemu-devel] facing issue with qemu amd-v support

2016-11-14 Thread Anand J
Hi,

Further investigation into the issue revealed that this happens before the
guest OS gets to execute. KVM is not able to emulate a page fault (PF) that
happened during a jump to 0x4000.

Following is the trace of instructions at which the emulation error is
happening.


IN:
 CR3: b859
 CR2: 
0x000eecc4:  push   %ebx
0x000eecc5:  xor%ecx,%ecx
0x000eecc7:  mov(%eax,%ecx,1),%bl
0x000eecca:  cmp(%edx,%ecx,1),%bl
0x000eeccd:  je 0xeecdc

vmexit(004e, , 000f5ea0, 000eecc4)!
CR2: 
vmsave! b8592000
FS:  | 
vmload! b8592000
FS:  | 
vmrun! b8592000
CR3: b86bd000CR2: 

IN:
 CR3: b859
 CR2: 
0x000eeccf:  setl   %al
0x000eecd2:  movzbl %al,%eax
0x000eecd5:  neg%eax
0x000eecd7:  or $0x1,%eax
0x000eecda:  jmp0xeece3


IN:
 CR3: b859
 CR2: 
0x000eece3:  pop%ebx
*0x000eece4:  ret   *

vmexit(004e, , 4000, 4000)!
CR2: 
vmsave! b8592000
FS:  | 


The ret instruction marked above (in asterisks) causes a VMEXIT due to a
page fault. Can anyone please help me figure out possible causes of this
issue?

Thanks,
Anand

On Mon, Oct 24, 2016 at 2:35 PM, Anand J  wrote:

>
> -- Forwarded message --
> From: Anand J 
> Date: Mon, Oct 24, 2016 at 4:28 AM
> Subject: facing issue with qemu amd-v support
> To: qemu-disc...@nongnu.org
>
>
> Hi,
>
> I'm running fedora-24_x86_64 inside qemu in software mode. I configured
> KVM in the guest OS and want to run qemu inside the guest with kvm enabled.
> But qemu is throwing the following error. Does anybody know how to fix this?
>
> KVM internal error. Suberror: 1
> emulation failure
> EAX= EBX=404b ECX= EDX=000f5ea0
> ESI= EDI= EBP= ESP=6fd0
> EIP=4000 EFL=0086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0010   00c09300 DPL=0 DS   [-WA]
> CS =0008   00c09b00 DPL=0 CS32 [-RA]
> SS =0010   00c09300 DPL=0 DS   [-WA]
> DS =0010   00c09300 DPL=0 DS   [-WA]
> FS =0010   00c09300 DPL=0 DS   [-WA]
> GS =0010   00c09300 DPL=0 DS   [-WA]
> LDT=   8200 DPL=0 LDT
> TR =   8b00 DPL=0 TSS32-busy
> GDT= 000f7180 0037
> IDT= 000f71be 
> CR0=0011 CR2= CR3= CR4=
> DR0= DR1= DR2=
> DR3=
> DR6=0ff0 DR7=0400
> EFER=
> Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
>
>
> Thanks,
> Anand
>
>


Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-14 Thread Daniel P. Berrange
On Mon, Nov 14, 2016 at 01:32:56PM +0800, Dave Young wrote:
> On 11/09/16 at 04:38pm, Laszlo Ersek wrote:
> > On 11/09/16 15:47, Daniel P. Berrange wrote:
> > > On Wed, Nov 09, 2016 at 01:20:51PM +0100, Andrew Jones wrote:
> > >> On Wed, Nov 09, 2016 at 11:58:19AM +, Daniel P. Berrange wrote:
> > >>> On Wed, Nov 09, 2016 at 12:48:09PM +0100, Andrew Jones wrote:
> >  On Wed, Nov 09, 2016 at 11:37:35AM +, Daniel P. Berrange wrote:
> > > On Wed, Nov 09, 2016 at 12:26:17PM +0100, Laszlo Ersek wrote:
> > >> On 11/09/16 11:40, Andrew Jones wrote:
> > >>> On Wed, Nov 09, 2016 at 11:01:46AM +0800, Dave Young wrote:
> >  Hi,
> > 
> >  Latest linux kernel enabled kaslr to randomiz phys/virt memory
> >  addresses, we had some effort to support kexec/kdump so that crash
> >  utility can still works in case crashed kernel has kaslr enabled.
> > 
> >  But according to Dave Anderson virsh dump does not work, quoted 
> >  messages
> >  from Dave below:
> > 
> >  """
> >  with virsh dump, there's no way of even knowing that KASLR
> >  has randomized the kernel __START_KERNEL_map region, because there 
> >  is no
> >  virtual address information -- e.g., like "SYMBOL(_stext)" in the 
> >  kdump
> >  vmcoreinfo data to compare against the vmlinux file symbol value.
> >  Unless virsh dump can export some basic virtual memory data, which
> >  they say it can't, I don't see how KASLR can ever be supported.
> >  """
> > 
> >  I assume virsh dump is using qemu guest memory dump facility so it
> >  should be first addressed in qemu. Thus post this query to qemu 
> >  devel
> >  list. If this is not correct please let me know.
> > 
> >  Could you qemu dump people make it work? Or we can not support 
> >  virt dump
> >  as long as KASLR being enabled. Latest Fedora kernel has enabled 
> >  it in x86_64.
> > 
> > >>>
> > >>> When the -kernel command line option is used, then it may be 
> > >>> possible
> > >>> to extract some information that could be used to supplement the 
> > >>> memory
> > >>> dump that dump-guest-memory provides. However, that would be a 
> > >>> specific
> > >>> use. In general, QEMU knows nothing about the guest kernel. It 
> > >>> doesn't
> > >>> know where it is in the disk image, and it doesn't even know if it's
> > >>> Linux.
> > >>>
> > >>> Is there anything a guest userspace application could probe from 
> > >>> e.g.
> > >>> /proc that would work? If so, then the guest agent could gain a new
> > >>> feature providing that.
> > >>
> > >> I fully agree. This is exactly what I suggested too, independently, 
> > >> in
> > >> the downstream thread, before arriving at this upstream thread. Let 
> > >> me
> > >> quote that email:
> > >>
> > >> On 11/09/16 12:09, Laszlo Ersek wrote:
> > >>> [...] the dump-guest-memory QEMU command supports an option called
> > >>> "paging". Here's its documentation, from the "qapi-schema.json" 
> > >>> source
> > >>> file:
> > >>>
> >  # @paging: if true, do paging to get guest's memory mapping. This 
> >  allows
> >  #  using gdb to process the core file.
> >  #
> >  #  IMPORTANT: this option can make QEMU allocate several 
> >  gigabytes
> >  # of RAM. This can happen for a large guest, 
> >  or a
> >  # malicious guest pretending to be large.
> >  #
> >  #  Also, paging=true has the following limitations:
> >  #
> >  # 1. The guest may be in a catastrophic state or can 
> >  have corrupted
> >  #memory, which cannot be trusted
> >  # 2. The guest can be in real-mode even if paging is 
> >  enabled. For
> >  #example, the guest uses ACPI to sleep, and ACPI 
> >  sleep state
> >  #goes in real-mode
> >  # 3. Currently only supported on i386 and x86_64.
> >  #
> > >>>
> > >>> "virsh dump --memory-only" sets paging=false, for obvious reasons.
> > >>>
> > >>> [...] the dump-guest-memory command provides a raw snapshot of the
> > >>> virtual machine's memory (and of the registers of the VCPUs); it is
> > >>> not enlightened about the guest.
> > >>>
> > >>> If the additional information you are looking for can be retrieved
> > >>> within the running Linux guest, using an appropriately privieleged
> > >>> userspace process, then I would recommend considering an extension 
> > >>> to
> > >>> the qemu guest agent. The management layer (libv

Re: [Qemu-devel] [PATCH v2 2/5] target-i386: Add Intel HAX files

2016-11-14 Thread Paolo Bonzini


On 11/11/2016 12:28, Vincent Palatin wrote:
> +
> +memcpy(env->xmm_regs, fpu.mmx_1, sizeof(fpu.mmx_1));
> +memcpy((ZMMReg *) (env->xmm_regs) + 8, fpu.mmx_2, sizeof(fpu.mmx_2));

HAX will only support SSE (128-bit) registers, while env->xmm_regs
supports AVX512 (512-bit) so you have to copy registers one by one.
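i.e. roughly the following (rough sketch, untested; assumes fpu.mmx_1 and
fpu.mmx_2 are arrays of eight 16-byte entries as in the patch, and that only
the low 128 bits of each ZMMReg need to be transferred):

    int i;

    for (i = 0; i < 8; i++) {
        memcpy(&env->xmm_regs[i],     fpu.mmx_1[i], sizeof(fpu.mmx_1[i]));
        memcpy(&env->xmm_regs[i + 8], fpu.mmx_2[i], sizeof(fpu.mmx_2[i]));
    }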

Is there documentation for HAX?  In particular I'm curious as to what
the CPUID information looks like in the guest, and whether there are
ioctls to change it.

> +
> +static int hax_handle_fastmmio(CPUArchState *env, struct hax_fastmmio *hft)
> +{
> +uint64_t buf = 0;
> +/*
> + * With fast MMIO, QEMU need not sync vCPU state with HAXM
> + * driver because it will only invoke MMIO handler
> + * However, some MMIO operations utilize virtual address like qemu_pipe
> + * Thus we need to sync the CR0, CR3 and CR4 so that QEMU
> + * can translate the guest virtual address to guest physical
> + * address
> + */
> +env->cr[0] = hft->_cr0;
> +env->cr[2] = hft->_cr2;
> +env->cr[3] = hft->_cr3;
> +env->cr[4] = hft->_cr4;

These seem to apply only to some parts of the Android emulator that are
not upstream, so you can remove them.

> +buf = hft->value;
> +
> +cpu_physical_memory_rw(hft->gpa, (uint8_t *) &buf, hft->size,
> +   hft->direction);
> +if (hft->direction == 0) {
> +hft->value = buf;
> +}

No need to use "buf", you can use &hft->value directly.

> +return 0;
> +}
> +
> +static int hax_handle_io(CPUArchState *env, uint32_t df, uint16_t port,
> + int direction, int size, int count, void *buffer)
> +{
> +uint8_t *ptr;
> +int i;
> +
> +if (!df) {
> +ptr = (uint8_t *) buffer;
> +} else {
> +ptr = buffer + size * count - size;
> +}
> +for (i = 0; i < count; i++) {
> +if (direction == HAX_EXIT_IO_IN) {
> +switch (size) {
> +case 1:
> +stb_p(ptr, cpu_inb(port));
> +break;
> +case 2:
> +stw_p(ptr, cpu_inw(port));
> +break;
> +case 4:
> +stl_p(ptr, cpu_inl(port));
> +break;
> +}
> +} else {
> +switch (size) {
> +case 1:
> +cpu_outb(port, ldub_p(ptr));
> +break;
> +case 2:
> +cpu_outw(port, lduw_p(ptr));
> +break;
> +case 4:
> +cpu_outl(port, ldl_p(ptr));
> +break;
> +}
> +}

The whole "if" can be replaced by

MemTxAttrs attrs = { 0 };
...

address_space_rw(&address_space_io, port, attrs,
 ptr, size, direction == HAX_EXIT_IO_OUT);

Thanks,

Paolo

> +if (!df) {
> +ptr += size;
> +} else {
> +ptr -= size;
> +}
> +}
> +
> +return 0;
> +}
> +



Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-14 Thread Paolo Bonzini


On 09/11/2016 17:01, Daniel P. Berrange wrote:
> On Wed, Nov 09, 2016 at 04:38:36PM +0100, Laszlo Ersek wrote:
>> On 11/09/16 15:47, Daniel P. Berrange wrote:
> That doesn't help with trying to diagnose a crash during boot up, since
> the guest agent isn't running till fairly late. I'm also concerned that
> the QEMU guest agent is likely to be far from widely deployed in guests,
>>
>> I have no hard data, but from the recent Fedora and RHEL-7 guest
>> installations I've done, it seems like qga is installed automatically.
>> (Not sure if that's because Anaconda realizes it's installing the OS in
>> a VM.) Once I made sure there was an appropriate virtio-serial config in
>> the domain XMLs, I could talk to the agents (mainly for fstrim's sake)
>> immediately.
> 
> I'm thinking about cloud deployment where people rarely use Anaconda
> directly - they'll use a pre-built cloud image, or customize the basic
> cloud image. Neither Fedora or Ubuntu include the qemu guest agent in
> their cloud images AFAICT, so very few OpenStack deployments will have
> QEMU guest agent in at this time.

That's a bug in my opinion.  The guest agent is necessary in order to
avoid losing data in snapshots.  We should not only get distros to embed
the guest agent, but also to include the relevant freeze/thaw hooks.

Paolo

> Of course we could try to get distros to embed qemu guets agent by
> default, but its not clear if we'd succeed, given how aggressive
> they are at stripping stuff out to create the smallest practical
> images.
> 
> Regards,
> Daniel
> 



Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 11:10, Daniel P. Berrange wrote:
> There's already patches posted to create a virtio-pstore device for
> QEMU, which is what led me to suggest this as an option:
> 
>   https://lists.nongnu.org/archive/html/qemu-devel/2016-09/msg00381.html

It's also possible to use UEFI as a pstore backend.

Paolo



Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-14 Thread Daniel P. Berrange
On Mon, Nov 14, 2016 at 11:28:04AM +0100, Paolo Bonzini wrote:
> 
> 
> On 14/11/2016 11:10, Daniel P. Berrange wrote:
> > There's already patches posted to create a virtio-pstore device for
> > QEMU, which is what led me to suggest this as an option:
> > 
> >   https://lists.nongnu.org/archive/html/qemu-devel/2016-09/msg00381.html
> 
> It's also possible to use UEFI as a pstore backend.

Presumably that'll also require some QEMU patches to provide storage
for UEFI's pstore ?

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://entangle-photo.org   -o-http://search.cpan.org/~danberr/ :|



Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-14 Thread Paolo Bonzini


On 09/11/2016 16:28, Dave Anderson wrote:
> I'm not sure whether this "guest userspace agent" is still in play here,
> but if there were such a thing, it could theoretically do the same
> thing that crash currently does when running on a live system.
> 
> Both of those are available or calculatable from the contents of
> a kdump header.  However, on a live system, it's done like this:
> 
> - /proc/kallsyms is queried for the symbol value of "_text", which would
>   be relocated if KASLR is in play.  That value is compared against the
>   "_text" symbol value compiled into the vmlinux file to determine the
>   relocation value generated by CONFIG_RANDOMIZE_BASE.
> 
> [...] in order to read kernel symbols from the 
> statically-mapped kernel region based at __START_KERNEL_map, it 
> translates a (possibly relocated) kernel virtual address into a
> physical address like this:
> 
>   physical-address = virtual-address - __START_KERNEL_map + phys_base
> 
> But it's a chicken-and-egg deal, because the contents of the "phys_base"
> symbol are needed to calculate the physical address, but it can't
> read the "phys_base" symbol contents without first knowing its contents.
> 
> So on a live system, the "phys_base" is calculated by reading
> the "Kernel Code:" value from /proc/iomem, and then doing this:
> 
>   phys_base = [Kernel Code: value] - ["_text" symbol value] - 
> __START_KERNEL_map
   
^^^ (i.e. the ["_text" symbol value] - __START_KERNEL_map part)

Should there be parentheses around this?  The physical-address formula
above is equivalent to

phys_base = physical-address - (virtual-address - __START_KERNEL_map)
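Written out as toy helpers, to make the relationship concrete (the names are
mine; __START_KERNEL_map is the fixed x86_64 value):

    #include <stdint.h>

    #define START_KERNEL_MAP 0xffffffff80000000ULL /* __START_KERNEL_map */

    /* physical-address = virtual-address - __START_KERNEL_map + phys_base */
    static uint64_t kvaddr_to_paddr(uint64_t vaddr, uint64_t phys_base)
    {
        return vaddr - START_KERNEL_MAP + phys_base;
    }

    /* phys_base = physical-address - (virtual-address - __START_KERNEL_map) */
    static uint64_t calc_phys_base(uint64_t text_phys, uint64_t text_virt)
    {
        return text_phys - (text_virt - START_KERNEL_MAP);
    }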

> 
> So theoretically, the guest agent could read /proc/iomem and /proc/kallsyms
> for the information required.  (I think...)

Then yes, the guest-agent could add a command get-kernel-text-start with an 
output like:

{ 'virtual': 0x8600, 'physical': 0xb600 }

and libvirt can expose it to crash.  In this case, phys_base would be 0xb000
if I did the math right, and the relocation value is obtained by comparing the
"virtual" address with the vmlinux "_text".

IIRC the guest agent runs as root, so reading /proc/iomem is not a problem.

Paolo



Re: [Qemu-devel] facing issue with qemu amd-v support

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 11:08, Anand J wrote:
> Hi,
> 
> Further investigation of the issue revealed that this happens before the
> guest OS gets to execute. KVM is not able to emulate a #PF that happened
> during a jump to 0x4000.
> 
> Following is the trace of instructions at which the emulation error is
> happening.
> 
> 
> IN:
>  CR3: b859
>  CR2: 
> 0x000eeccf:  setl   %al
> 0x000eecd2:  movzbl %al,%eax
> 0x000eecd5:  neg%eax
> 0x000eecd7:  or $0x1,%eax
> 0x000eecda:  jmp0xeece3
> 
> 
> IN:
>  CR3: b859
>  CR2: 
> 0x000eece3:  pop%ebx
> *0x000eece4:  ret   *
> 
> vmexit(004e, , 4000, 4000)!
> CR2: 
> vmsave! b8592000
> FS:  | 
> 
> 
> The bolded ret instruction causes a VMEXIT due to a page fault. Can anyone
> please help me figure out possible causes for this issue?

Sorry, I cannot really guess what's going on.  All I can suggest is to
look at KVM's trace when running normally and when running under QEMU, and find the difference.

Paolo

> Thanks,
> Anand
> 
> On Mon, Oct 24, 2016 at 2:35 PM, Anand J  wrote:
> 
>>
>> -- Forwarded message --
>> From: Anand J 
>> Date: Mon, Oct 24, 2016 at 4:28 AM
>> Subject: facing issue with qemu amd-v support
>> To: qemu-disc...@nongnu.org
>>
>>
>> Hi,
>>
>> I'm running fedora-24_x86_64 inside qemu in software mode. I configured
>> KVM in the guest OS and want to run qemu inside the guest with kvm enabled.
>> But qemu is throwing the following error. Does anybody know how to fix this?
>>
>> KVM internal error. Suberror: 1
>> emulation failure
>> EAX= EBX=404b ECX= EDX=000f5ea0
>> ESI= EDI= EBP= ESP=6fd0
>> EIP=4000 EFL=0086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> ES =0010   00c09300 DPL=0 DS   [-WA]
>> CS =0008   00c09b00 DPL=0 CS32 [-RA]
>> SS =0010   00c09300 DPL=0 DS   [-WA]
>> DS =0010   00c09300 DPL=0 DS   [-WA]
>> FS =0010   00c09300 DPL=0 DS   [-WA]
>> GS =0010   00c09300 DPL=0 DS   [-WA]
>> LDT=   8200 DPL=0 LDT
>> TR =   8b00 DPL=0 TSS32-busy
>> GDT= 000f7180 0037
>> IDT= 000f71be 
>> CR0=0011 CR2= CR3= CR4=
>> DR0= DR1= DR2=
>> DR3=
>> DR6=0ff0 DR7=0400
>> EFER=
>> Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00
>>
>>
>> Thanks,
>> Anand
>>
>>
> 



Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-14 Thread Laszlo Ersek
On 11/14/16 11:33, Daniel P. Berrange wrote:
> On Mon, Nov 14, 2016 at 11:28:04AM +0100, Paolo Bonzini wrote:
>>
>>
>> On 14/11/2016 11:10, Daniel P. Berrange wrote:
>>> There's already patches posted to create a virtio-pstore device for
>>> QEMU, which is what led me to suggest this as an option:
>>>
>>>   https://lists.nongnu.org/archive/html/qemu-devel/2016-09/msg00381.html
>>
>> It's also possible to use UEFI as a pstore backend.
> 
> Presumably that'll also require some QEMU patches to provide storage
> for UEFI's pstore ?

Using UEFI non-volatile variables as a pstore backend is a guest kernel
feature, and it already works transparently with OVMF utilizing QEMU's
pflash device. If memory serves, the data to be written are broken into
1KB chunks, and saved as separate UEFI variables under a dedicated
namespace GUID.

https://bugzilla.redhat.com/show_bug.cgi?id=828497

(Private BZ -- I apologize to the non-RedHatter subscribers that read this.)

(Also, not everyone has been enthusiastic about this feature:
.)

Anyway, when I say "it works", I mean it works for the direct purpose of
storing data (like saving dmesg at panic), and for retrieving data, from
within the guest. (At a subsequent guest boot, possibly.) This is the
scope of pstore in general, AIUI (see "Documentation/ABI/testing/pstore").

However, host-side insight into the OVMF/edk2 varstore format remains
something we don't, and shouldn't, implement. In this regard, the UEFI
variables that happen to contain pstore data are no different from other
kinds of UEFI variables; they are equally opaque from the host side.
(Unless we want to implement and maintain a large utility that reflects
and tracks the multi-layer variable driver stack in edk2. "Unless" is
rhetorical, we don't want that.)

If host-side access is needed to the guest's phys-base / virt-base, then
my first preference would be the guest agent (interrogated at guest
startup), and my second preference would be virtio-pstore. I reckon
virtio-pstore will take a new guest driver, and I suppose the host-side
on-disk format is being designed for easy parsing.

Thanks
Laszlo



Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 11:33, Daniel P. Berrange wrote:
> On Mon, Nov 14, 2016 at 11:28:04AM +0100, Paolo Bonzini wrote:
>>
>>
>> On 14/11/2016 11:10, Daniel P. Berrange wrote:
>>> There's already patches posted to create a virtio-pstore device for
>>> QEMU, which is what led me to suggest this as an option:
>>>
>>>   https://lists.nongnu.org/archive/html/qemu-devel/2016-09/msg00381.html
>>
>> It's also possible to use UEFI as a pstore backend.
> 
> Presumably that'll also require some QEMU patches to provide storage
> for UEFI's pstore ?

That's just the UEFI variable store.  But for some reason Fedora doesn't
set CONFIG_EFI_VARS, so the next possibility is to use ACPI ERST.  This
would not require any change to guests, unlike virtio-pstore.

Paolo



Re: [Qemu-devel] [PATCH v2 5/5] Plumb the HAXM-based hardware acceleration support

2016-11-14 Thread Paolo Bonzini


On 11/11/2016 12:28, Vincent Palatin wrote:
> +/*
> + * In Hax, the qemu allocate the virtual address, and HAX kernel
> + * populate the memory with physical memory. Currently we have no
> + * paging, so user should make sure enough free memory in advance
> + */
> +if (hax_enabled()) {
> +int ret;
> +ret = hax_populate_ram((uint64_t)(uintptr_t)new_block->host,
> +   new_block->max_length);
> +if (ret < 0) {
> +error_setg(errp, "Hax failed to populate ram");
> +return;
> +}
> +}
> +
>  if (!new_block->host) {
>  error_setg_errno(errp, errno,
>   "cannot set up guest memory '%s'",

The hax_enabled() check should be after the "if (!new_block->host)" block.
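In other words, roughly (only a sketch of the ordering; the error path is the
existing one and is elided here):

    if (!new_block->host) {
        error_setg_errno(errp, errno,
                         "cannot set up guest memory '%s'", ...);
        return;
    }

    if (hax_enabled()) {
        int ret = hax_populate_ram((uint64_t)(uintptr_t)new_block->host,
                                   new_block->max_length);
        if (ret < 0) {
            error_setg(errp, "Hax failed to populate ram");
            return;
        }
    }

That way a failed host allocation is reported before HAX is asked to populate
the block.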

Paolo



Re: [Qemu-devel] [PATCH v2 2/5] target-i386: Add Intel HAX files

2016-11-14 Thread Paolo Bonzini


On 11/11/2016 12:28, Vincent Palatin wrote:
> +
> +memcpy(env->xmm_regs, fpu.mmx_1, sizeof(fpu.mmx_1));
> +memcpy((ZMMReg *) (env->xmm_regs) + 8, fpu.mmx_2, sizeof(fpu.mmx_2));

HAX will only support SSE (128-bit) registers, while env->xmm_regs
supports AVX512 (512-bit) so you have to copy registers one by one.
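Something along these lines, for instance (only a sketch; the element size of
the fpu.mmx_1/fpu.mmx_2 arrays is assumed from the quoted memcpy calls):

    int i;

    for (i = 0; i < 8; i++) {
        memcpy(&env->xmm_regs[i], fpu.mmx_1[i], sizeof(fpu.mmx_1[i]));
        memcpy(&env->xmm_regs[i + 8], fpu.mmx_2[i], sizeof(fpu.mmx_2[i]));
    }

so that only the low 128 bits of each ZMMReg are written.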

Is there documentation for HAX?  In particular I'm curious as to what
the CPUID information looks like in the guest, and whether there are
ioctls to change it.  In particular I would expect XSAVE to be disabled.

Paolo



Re: [Qemu-devel] [PATCH v2 2/5] target-i386: Add Intel HAX files

2016-11-14 Thread Vincent Palatin
On Mon, Nov 14, 2016 at 11:15 AM, Paolo Bonzini  wrote:
>
>
> On 11/11/2016 12:28, Vincent Palatin wrote:
>> +
>> +memcpy(env->xmm_regs, fpu.mmx_1, sizeof(fpu.mmx_1));
>> +memcpy((ZMMReg *) (env->xmm_regs) + 8, fpu.mmx_2, sizeof(fpu.mmx_2));
>
> HAX will only support SSE (128-bit) registers, while env->xmm_regs
> supports AVX512 (512-bit) so you have to copy registers one by one.


Good point,
I will fix this


>
> Is there documentation for HAX?

No developer doc that I know of;
both the Intel website and the download packages contain only installation
documentation as far as I can tell.
I will ask Intel when I have the chance.

>  In particular I'm curious as to what
> the CPUID information looks like in the guest, and whether there are
> ioctls to change it.

No idea for the interface, but I have put an example below if you are
interested.

> In particular I would expect XSAVE to be disabled.

For EAX=1  I'm seeing ECX = 00d82201 => [26] = 0 && [27] = 0.
We should be fine for XSAVE.


On the Intel Core i5-6200U CPU I was running my tests on, I have
dumped the CPUID inside the emulator with HAX and on the Windows host:

== emulation with HAX ==
 eax ineax  ebx  ecx  edx
 0004 756e6547 6c65746e 49656e69
0001 000106f1 00010400 00d82201 1f88fbff
0002 03020101   0c040844
0003    
0004    
0005 0040 0040 0003 11142120
0006 27f7 0002 0009 
0007  029c67af  
0008    
0009    
000a 07300404   0603
000b 0001 0002 0100 0001
000c    
000d 001f 0440 0440 
000e    
000f    
0010    
0011    
0012    
0013    
0014 0001 000f 0007 
0015 0002 00c8  
0016 0960 0af0 0064 
8000 8008   
8001    2800
8002 74726956 206c6175 20555043 
8003    
8004    
8005    
8006   04008040 
8007    
8008 3027   

== Windows host  ==
 eax ineax  ebx  ecx  edx
 0016 756e6547 6c65746e 49656e69
0001 000406e3 00100800 7ffafbbf bfebfbff
0002 76036301 00f0b5ff  00c3
0003    
0004    
0005 0040 0040 0003 11142120
0006 27f7 0002 0009 
0007    
0008    
0009    
000a 07300404   0603
000b   00c3 
000c    
000d    
000e    
000f    
0010    
0011    
0012    
0013    
0014    
0015 0002 00c8  
0016 0960 0af0 0064 
8000 8008   
8001   0121 2c10
8002 65746e49 2952286c 726f4320 4d542865
8003 35692029 3032362d 43205530 40205550
8004 332e3220 7a484730  
8005    
8006   01006040 
8007    0100
8008 3027   

>
>> +
>> +static int hax_handle_fastmmio(CPUArchState *env, struct hax_fastmmio *hft)
>> +{
>> +uint64_t buf = 0;
>> +/*
>> + * With fast MMIO, QEMU need not sync vCPU state with HAXM
>> + * driver because it will only invoke MMIO handler
>> + * However, some MMIO operations utilize virtual address like qemu_pipe
>> + * Thus we need to sync the CR0, CR3 and CR4 so that QEMU
>> + * can translate the guest virtual address to guest physical
>> + * address
>> + */
>> +env->cr[0] = hft->_cr0;
>> +env->cr[2] = hft->_cr2;
>> +env->cr[3] = hft->_cr3;
>> +env->cr[4] = hft->_cr4;
>
> These seem to apply only to some parts of the Android emulator that are
> not upstream, so you can remove them.

Ok, removed.
Re-tested; my own image still works ...


>
>> +buf =

Re: [Qemu-devel] [PATCH v2 5/5] Plumb the HAXM-based hardware acceleration support

2016-11-14 Thread Vincent Palatin
On Mon, Nov 14, 2016 at 12:56 PM, Paolo Bonzini  wrote:
>
>
> On 11/11/2016 12:28, Vincent Palatin wrote:
>> +/*
>> + * In Hax, the qemu allocate the virtual address, and HAX kernel
>> + * populate the memory with physical memory. Currently we have 
>> no
>> + * paging, so user should make sure enough free memory in 
>> advance
>> + */
>> +if (hax_enabled()) {
>> +int ret;
>> +ret = hax_populate_ram((uint64_t)(uintptr_t)new_block->host,
>> +   new_block->max_length);
>> +if (ret < 0) {
>> +error_setg(errp, "Hax failed to populate ram");
>> +return;
>> +}
>> +}
>> +
>>  if (!new_block->host) {
>>  error_setg_errno(errp, errno,
>>   "cannot set up guest memory '%s'",
>
> The hax_enabled() check should be after the "if (!new_block->host)" block.

Indeed, fixed in v3 series.

Thanks!

-- 
Vincent



Re: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support

2016-11-14 Thread Stefan Weil
Am 11.11.2016 um 12:28 schrieb Vincent Palatin:
> I took a stab at trying to rebase/upstream the support for Intel HAXM.
> (Hardware Accelerated Execution Manager).
> Intel HAX is kernel-based hardware acceleration module for Windows and MacOSX.
> 
> I have based my work on the last version of the source code I found:
> the emu-2.2-release branch in the external/qemu-android repository as used by
> the Android emulator.
> In patch 2/5, I have forward-ported the core HAX code mostly unmodified from
> there, I just did some minor touch up to make it build and run properly,
> and fixed the style issues to go through checkpatch.pl.
> I have not included the Darwin support.
> It might contain some outdated constructs and probably requires more
> attention (thus the 'RFC' for this patchset).
> 
> In patch 3/5, I'm removing a good chunk of the support for CPUs without UG 
> mode
> as advised by Paolo to simplify the initial version.
> 
> In patch 5/5, I have put the plumbing into the QEMU code base, I did some 
> clean
> up there and it is reasonably intrusive: i.e.
>  Makefile.target   |  1 +
>  configure | 18 ++
>  cpus.c| 87 
> ++-
>  exec.c| 16 +
>  hw/intc/apic_common.c |  3 +-
>  include/qom/cpu.h |  5 +++
>  include/sysemu/hw_accel.h |  9 +
>  qemu-options.hx   | 11 ++
>  target-i386/Makefile.objs |  7 
>  vl.c  | 15 ++--
>  10 files changed, 167 insertions(+), 5 deletions(-)
> 
> The qemu_cpu_kick_thread mess in cpus.c is probably still not perfect though.
> 
> The patch 1/5 just extracts from KVM specific header the cpu_synchronize_
> functions that HAX is also using.
> 
> I have tested the end result on a Windows 10 Pro machine (with UG support)
> with the Intel HAXM module 6.0.4 and a large ChromiumOS x86_64 image to
> exercise various code paths. It looks stable.
> I also did a quick regression testing of the integration by running a Linux
> build with KVM enabled.


A full build for Windows needs the patch below to
fix missing declarations, otherwise it fails with
compiler warnings and linker errors.

Stefan


From 91481639a1005ed3278eb55c77c99bb1bcc135ce Mon Sep 17 00:00:00 2001
From: Stefan Weil 
Date: Mon, 14 Nov 2016 13:09:53 +0100
Subject: [PATCH] Fix include statements for HAXM support

We need sysemu/hw_accel.h. As sysemu/hw_accel.h already includes
sysemu/kvm.h, that one can be removed.

Signed-off-by: Stefan Weil 
---
 hw/ppc/pnv_xscom.c  | 2 +-
 hw/ppc/ppce500_spin.c   | 4 ++--
 hw/ppc/spapr.c  | 2 +-
 hw/ppc/spapr_hcall.c| 2 +-
 hw/s390x/s390-pci-inst.c| 1 +
 target-ppc/mmu-hash64.c | 2 +-
 target-ppc/translate_init.c | 2 +-
 target-s390x/gdbstub.c  | 1 +
 8 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/pnv_xscom.c b/hw/ppc/pnv_xscom.c
index 5aaa264..abcb85d 100644
--- a/hw/ppc/pnv_xscom.c
+++ b/hw/ppc/pnv_xscom.c
@@ -20,7 +20,7 @@
 #include "qapi/error.h"
 #include "hw/hw.h"
 #include "qemu/log.h"
-#include "sysemu/kvm.h"
+#include "sysemu/hw_accel.h"
 #include "target-ppc/cpu.h"
 #include "hw/sysbus.h"

diff --git a/hw/ppc/ppce500_spin.c b/hw/ppc/ppce500_spin.c
index cf958a9..eb219ab 100644
--- a/hw/ppc/ppce500_spin.c
+++ b/hw/ppc/ppce500_spin.c
@@ -29,9 +29,9 @@

 #include "qemu/osdep.h"
 #include "hw/hw.h"
-#include "sysemu/sysemu.h"
 #include "hw/sysbus.h"
-#include "sysemu/kvm.h"
+#include "sysemu/hw_accel.h"
+#include "sysemu/sysemu.h"
 #include "e500.h"

 #define MAX_CPUS 32
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 0cbab24..174f4d3 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -36,7 +36,7 @@
 #include "sysemu/device_tree.h"
 #include "sysemu/block-backend.h"
 #include "sysemu/cpus.h"
-#include "sysemu/kvm.h"
+#include "sysemu/hw_accel.h"
 #include "kvm_ppc.h"
 #include "migration/migration.h"
 #include "mmu-hash64.h"
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 9a9bedf..b2a8e48 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1,5 +1,6 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "sysemu/hw_accel.h"
 #include "sysemu/sysemu.h"
 #include "qemu/log.h"
 #include "cpu.h"
@@ -9,7 +10,6 @@
 #include "mmu-hash64.h"
 #include "cpu-models.h"
 #include "trace.h"
-#include "sysemu/kvm.h"
 #include "kvm_ppc.h"
 #include "hw/ppc/spapr_ovec.h"

diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 0864d9b..4d0775c 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -18,6 +18,7 @@
 #include "s390-pci-bus.h"
 #include "exec/memory-internal.h"
 #include "qemu/error-report.h"
+#include "sysemu/hw_accel.h"

 /* #define DEBUG_S390PCI_INST */
 #ifdef DEBUG_S390PCI_INST
diff --git a/target-ppc/mmu-hash64.c b/target-ppc/mmu-hash64.c
index fdb7a78..0efc8c6 100644
--- a/target-ppc/mmu-hash64.c
+++ b/target-ppc/mmu-hash64.c
@@ -23,7 +23,7 @@
 #include "exec/exec-

Re: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support

2016-11-14 Thread Vincent Palatin
On Mon, Nov 14, 2016 at 1:21 PM, Stefan Weil  wrote:
> Am 11.11.2016 um 12:28 schrieb Vincent Palatin:
>> I took a stab at trying to rebase/upstream the support for Intel HAXM.
>> (Hardware Accelerated Execution Manager).
>> Intel HAX is kernel-based hardware acceleration module for Windows and 
>> MacOSX.
>>
>> I have based my work on the last version of the source code I found:
>> the emu-2.2-release branch in the external/qemu-android repository as used by
>> the Android emulator.
>> In patch 2/5, I have forward-ported the core HAX code mostly unmodified from
>> there, I just did some minor touch up to make it build and run properly,
>> and fixed the style issues to go through checkpatch.pl.
>> I have not included the Darwin support.
>> It might contain some outdated constructs and probably requires more
>> attention (thus the 'RFC' for this patchset).
>>
>> In patch 3/5, I'm removing a good chunk of the support for CPUs without UG 
>> mode
>> as advised by Paolo to simplify the initial version.
>>
>> In patch 5/5, I have put the plumbing into the QEMU code base, I did some 
>> clean
>> up there and it is reasonably intrusive: i.e.
>>  Makefile.target   |  1 +
>>  configure | 18 ++
>>  cpus.c| 87 
>> ++-
>>  exec.c| 16 +
>>  hw/intc/apic_common.c |  3 +-
>>  include/qom/cpu.h |  5 +++
>>  include/sysemu/hw_accel.h |  9 +
>>  qemu-options.hx   | 11 ++
>>  target-i386/Makefile.objs |  7 
>>  vl.c  | 15 ++--
>>  10 files changed, 167 insertions(+), 5 deletions(-)
>>
>> The qemu_cpu_kick_thread mess in cpus.c is probably still not perfect though.
>>
>> The patch 1/5 just extracts from KVM specific header the cpu_synchronize_
>> functions that HAX is also using.
>>
>> I have tested the end result on a Windows 10 Pro machine (with UG support)
>> with the Intel HAXM module 6.0.4 and a large ChromiumOS x86_64 image to
>> exercise various code paths. It looks stable.
>> I also did a quick regression testing of the integration by running a Linux
>> build with KVM enabled.
>
>
> A full build for Windows needs the patch below to
> fix missing declarations, otherwise it fails with
> compiler warnings and linker errors.

Thanks for filling the gaps. That's very helpful !
Do you mind if I merge it with your SoB into Patch 1/5 where it belongs ?
or do you prefer keeping it as a separate patch ?


> From 91481639a1005ed3278eb55c77c99bb1bcc135ce Mon Sep 17 00:00:00 2001
> From: Stefan Weil 
> Date: Mon, 14 Nov 2016 13:09:53 +0100
> Subject: [PATCH] Fix include statements for HAXM support
>
> We need sysemu/hw_accel.h. As sysemu/hw_accel.h already includes
> sysemu/kvm.h, that one can be removed.
>
> Signed-off-by: Stefan Weil 
> ---
>  hw/ppc/pnv_xscom.c  | 2 +-
>  hw/ppc/ppce500_spin.c   | 4 ++--
>  hw/ppc/spapr.c  | 2 +-
>  hw/ppc/spapr_hcall.c| 2 +-
>  hw/s390x/s390-pci-inst.c| 1 +
>  target-ppc/mmu-hash64.c | 2 +-
>  target-ppc/translate_init.c | 2 +-
>  target-s390x/gdbstub.c  | 1 +
>  8 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/hw/ppc/pnv_xscom.c b/hw/ppc/pnv_xscom.c
> index 5aaa264..abcb85d 100644
> --- a/hw/ppc/pnv_xscom.c
> +++ b/hw/ppc/pnv_xscom.c
> @@ -20,7 +20,7 @@
>  #include "qapi/error.h"
>  #include "hw/hw.h"
>  #include "qemu/log.h"
> -#include "sysemu/kvm.h"
> +#include "sysemu/hw_accel.h"
>  #include "target-ppc/cpu.h"
>  #include "hw/sysbus.h"
>
> diff --git a/hw/ppc/ppce500_spin.c b/hw/ppc/ppce500_spin.c
> index cf958a9..eb219ab 100644
> --- a/hw/ppc/ppce500_spin.c
> +++ b/hw/ppc/ppce500_spin.c
> @@ -29,9 +29,9 @@
>
>  #include "qemu/osdep.h"
>  #include "hw/hw.h"
> -#include "sysemu/sysemu.h"
>  #include "hw/sysbus.h"
> -#include "sysemu/kvm.h"
> +#include "sysemu/hw_accel.h"
> +#include "sysemu/sysemu.h"
>  #include "e500.h"
>
>  #define MAX_CPUS 32
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 0cbab24..174f4d3 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -36,7 +36,7 @@
>  #include "sysemu/device_tree.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/cpus.h"
> -#include "sysemu/kvm.h"
> +#include "sysemu/hw_accel.h"
>  #include "kvm_ppc.h"
>  #include "migration/migration.h"
>  #include "mmu-hash64.h"
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 9a9bedf..b2a8e48 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1,5 +1,6 @@
>  #include "qemu/osdep.h"
>  #include "qapi/error.h"
> +#include "sysemu/hw_accel.h"
>  #include "sysemu/sysemu.h"
>  #include "qemu/log.h"
>  #include "cpu.h"
> @@ -9,7 +10,6 @@
>  #include "mmu-hash64.h"
>  #include "cpu-models.h"
>  #include "trace.h"
> -#include "sysemu/kvm.h"
>  #include "kvm_ppc.h"
>  #include "hw/ppc/spapr_ovec.h"
>
> diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
> index 0864d9b..4d0775c 100644
> --- a/hw/s390x/s390-pci-in

[Qemu-devel] [PATCH 1/3] 9pfs: add cleanup operation in FileOperations

2016-11-14 Thread Li Qiang
From: Li Qiang 

Currently, the backend of VirtFS doesn't have a cleanup
function. This will lead to resource leaks if the backend
driver allocates resources. This patch addresses this issue.

Signed-off-by: Li Qiang 
---
 fsdev/file-op-9p.h | 1 +
 hw/9pfs/9p.c   | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/fsdev/file-op-9p.h b/fsdev/file-op-9p.h
index 6db9fea..a56dc84 100644
--- a/fsdev/file-op-9p.h
+++ b/fsdev/file-op-9p.h
@@ -100,6 +100,7 @@ struct FileOperations
 {
 int (*parse_opts)(QemuOpts *, struct FsDriverEntry *);
 int (*init)(struct FsContext *);
+void (*cleanup)(struct FsContext *);
 int (*lstat)(FsContext *, V9fsPath *, struct stat *);
 ssize_t (*readlink)(FsContext *, V9fsPath *, char *, size_t);
 int (*chmod)(FsContext *, V9fsPath *, FsCred *);
diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index aea7e9d..166b5a7 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -3532,6 +3532,9 @@ void v9fs_device_unrealize_common(V9fsState *s, Error 
**errp)
 {
 g_free(s->ctx.fs_root);
 g_free(s->tag);
+if (s->ops->cleanup) {
+s->ops->cleanup(&s->ctx);
+}
 }
 
 typedef struct VirtfsCoResetData {
-- 
1.8.3.1




[Qemu-devel] [PATCH 2/3] 9pfs: add cleanup operation for handle backend driver

2016-11-14 Thread Li Qiang
From: Li Qiang 

In the init operation of the handle backend driver, it allocates a
handle_data struct and opens a mount file. We should free these
resources when the 9pfs device is unrealized. This is what this
patch does.

Signed-off-by: Li Qiang 
---
 hw/9pfs/9p-handle.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/9pfs/9p-handle.c b/hw/9pfs/9p-handle.c
index 3d77594..9b50f40 100644
--- a/hw/9pfs/9p-handle.c
+++ b/hw/9pfs/9p-handle.c
@@ -649,6 +649,13 @@ out:
 return ret;
 }
 
+static void handle_cleanup(FsContext *ctx)
+{
+struct handle_data *data = ctx->private;
+close(data->mountfd);
+g_free(data);
+}
+
 static int handle_parse_opts(QemuOpts *opts, struct FsDriverEntry *fse)
 {
 const char *sec_model = qemu_opt_get(opts, "security_model");
@@ -671,6 +678,7 @@ static int handle_parse_opts(QemuOpts *opts, struct 
FsDriverEntry *fse)
 FileOperations handle_ops = {
 .parse_opts   = handle_parse_opts,
 .init = handle_init,
+.cleanup  = handle_cleanup,
 .lstat= handle_lstat,
 .readlink = handle_readlink,
 .close= handle_close,
-- 
1.8.3.1




[Qemu-devel] [PATCH 0/3] add cleanup operation in handle/proxy backend

2016-11-14 Thread Li Qiang
From: Li Qiang 

Currently, the backend of VirtFS doesn't have a cleanup
function. This will leak some resources in the handle and proxy
backend drivers. This patchset addresses this issue.

Li Qiang (3):
  9pfs: add cleanup operation in FileOperations
  9pfs: add cleanup operation for handle backend driver
  9pfs: add cleanup operation for proxy backend driver

 fsdev/file-op-9p.h  |  1 +
 hw/9pfs/9p-handle.c |  8 
 hw/9pfs/9p-proxy.c  | 10 ++
 hw/9pfs/9p.c|  3 +++
 4 files changed, 22 insertions(+)

-- 
1.8.3.1




Re: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support

2016-11-14 Thread Stefan Weil
Am 14.11.2016 um 13:33 schrieb Vincent Palatin:
> On Mon, Nov 14, 2016 at 1:21 PM, Stefan Weil  wrote:
>>
>> A full build for Windows needs the patch below to
>> fix missing declarations, otherwise it fails with
>> compiler warnings and linker errors.
> Thanks for filling the gaps. That's very helpful !
> Do you mind if I merge it with your SoB into Patch 1/5 where it belongs ?
> or do you prefer keeping it as a separate patch ?

Just merge it where it belongs (I did not check).
You may add my SoB as well for that part.

Cheers,
Stefan



Re: [Qemu-devel] [QEMU PATCH v2] kvmclock: advance clock by time window between vm_stop and pre_save

2016-11-14 Thread Marcelo Tosatti
On Thu, Nov 10, 2016 at 06:57:21PM +0100, Paolo Bonzini wrote:
> 
> 
> On 10/11/2016 12:48, Marcelo Tosatti wrote:
> > Destination has to run the following logic:
> > 
> > If (source has KVM_CAP_ADVANCE_CLOCK)
> > use KVM_GET_CLOCK value
> > Else
> >read pvclock from guest
> > 
> > To support migration from older QEMU versions which do not have
> > KVM_CAP_ADVANCE_CLOCK (or new QEMU versions running on old
> > hosts without KVM_CAP_ADVANCE_CLOCK).
> > 
> > I don't see any clean way to give that information, except changing
> > the migration format to pass "host: kvm_cap_advance_clock=true/false"
> > information.
> 
> If you make it only affect new machine types, you could transmit a dummy
> clock value such as -1 if the source does not have KVM_CLOCK_TSC_STABLE.
> 
> Paolo

Prefer a new subsection (which is fine since migration is broken
anyway), because otherwise you have to deal with restoring
s->clock from -1 to what was read at KVM_GET_CLOCK (in case
migration fails).




Re: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support

2016-11-14 Thread Stefan Weil
Am 11.11.2016 um 12:28 schrieb Vincent Palatin:
[...]
> I have tested the end result on a Windows 10 Pro machine (with UG support)
> with the Intel HAXM module 6.0.4 and a large ChromiumOS x86_64 image to
> exercise various code paths. It looks stable.
> I also did a quick regression testing of the integration by running a Linux
> build with KVM enabled.

My test on Windows 7 with HAXM 6.0.4 fails:

$ test/qemu-system-x86_64.exe --enable-hax
HAX is working and emulator runs in fast virt mode.
Unknown hax vcpu return 1

This application has requested the Runtime to terminate it in an unusual
way.
Please contact the application's support team for more information.

$ test/qemu-system-i386.exe --enable-hax
HAX is working and emulator runs in fast virt mode.
Unknown hax vcpu return 1

This application has requested the Runtime to terminate it in an unusual
way.
Please contact the application's support team for more information.

I tested a debug build (configure --enable-debug && make) based on the
latest QEMU from git, this patch series, and my include fixes.

Stefan




[Qemu-devel] [PATCH 3/3] 9pfs: add cleanup operation for proxy backend driver

2016-11-14 Thread Li Qiang
From: Li Qiang 

In the init operation of the proxy backend driver, it allocates a
V9fsProxy struct and some other resources. We should free these
resources when the 9pfs device is unrealized. This is what this
patch does.

Signed-off-by: Li Qiang 
---
 hw/9pfs/9p-proxy.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/hw/9pfs/9p-proxy.c b/hw/9pfs/9p-proxy.c
index f2417b7..4b22f57 100644
--- a/hw/9pfs/9p-proxy.c
+++ b/hw/9pfs/9p-proxy.c
@@ -1168,9 +1168,19 @@ static int proxy_init(FsContext *ctx)
 return 0;
 }
 
+static void proxy_cleanup(FsContext *ctx)
+{
+V9fsProxy *proxy = ctx->private;
+close(proxy->sockfd);
+g_free(proxy->in_iovec.iov_base);
+g_free(proxy->out_iovec.iov_base);
+g_free(proxy);
+}
+
 FileOperations proxy_ops = {
 .parse_opts   = proxy_parse_opts,
 .init = proxy_init,
+.cleanup  = proxy_cleanup,
 .lstat= proxy_lstat,
 .readlink = proxy_readlink,
 .close= proxy_close,
-- 
1.8.3.1




[Qemu-devel] [qemu patch 0/2] improve kvmclock difference on migration

2016-11-14 Thread Marcelo Tosatti
See patches for details.





[Qemu-devel] [PATCH] hw/misc/ivshmem:fix misconfig of not_legacy_32bit

2016-11-14 Thread Zhuangyanying
From: ZhuangYanying 

After "ivshmem: Split ivshmem-plain, ivshmem-doorbell off ivshmem", 
ivshmem_64bit renamed to not_legacy_32bit, and changed the implementation of 
this property.
Then use64 = not_legacy_32bit = 1, then PCI attribute configuration ~ 
PCI_BASE_ADDRESS_MEM_TYPE_64 (default for ivshmem), the actual use is the 
legacy model, can not support greater than or equal 1G mapping, which is the 
opposite of configuration requirements.

Signed-off-by: ann.zhuangyany...@huawei.com
---
Recently I tested ivshmem and found that the use64 (that is, not_legacy_32bit)
implementation is odd, or even inverted.
Previously, with use64 = ivshmem_64bit = 1, attr |= PCI_BASE_ADDRESS_MEM_TYPE_64
was applied, so ivshmem could present a shared memory region of 1G or more to
the virtual machine through BAR2.
But after "ivshmem: Split ivshmem-plain, ivshmem-doorbell off ivshmem",
PCI_BASE_ADDRESS_MEM_TYPE_64 is only configured when not_legacy_32bit = 0, that
is, for the legacy model.
---
 hw/misc/ivshmem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 230e51b..b71acf6 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -858,7 +858,7 @@ static void ivshmem_common_realize(PCIDevice *dev, Error 
**errp)
 pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
  &s->ivshmem_mmio);
 
-if (!s->not_legacy_32bit) {
+if (s->not_legacy_32bit) {
 attr |= PCI_BASE_ADDRESS_MEM_TYPE_64;
 }
 
@@ -1033,6 +1033,7 @@ static const VMStateDescription ivshmem_plain_vmsd = {
 
 static Property ivshmem_plain_properties[] = {
 DEFINE_PROP_ON_OFF_AUTO("master", IVShmemState, master, ON_OFF_AUTO_OFF),
+DEFINE_PROP_UINT32("use64", IVShmemState, not_legacy_32bit, 1),
 DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1107,6 +1108,7 @@ static Property ivshmem_doorbell_properties[] = {
 DEFINE_PROP_BIT("ioeventfd", IVShmemState, features, IVSHMEM_IOEVENTFD,
 true),
 DEFINE_PROP_ON_OFF_AUTO("master", IVShmemState, master, ON_OFF_AUTO_OFF),
+DEFINE_PROP_UINT32("use64", IVShmemState, not_legacy_32bit, 1),
 DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
1.8.3.1





[Qemu-devel] [qemu patch 1/2] kvm: sync linux headers

2016-11-14 Thread Marcelo Tosatti
Import KVM_CLOCK_TSC_STABLE.

Signed-off-by: Marcelo Tosatti 

diff --git a/include/standard-headers/linux/input.h 
b/include/standard-headers/linux/input.h
index 7361a16..b472b85 100644
--- a/include/standard-headers/linux/input.h
+++ b/include/standard-headers/linux/input.h
@@ -245,6 +245,7 @@ struct input_mask {
 #define BUS_SPI0x1C
 #define BUS_RMI0x1D
 #define BUS_CEC0x1E
+#define BUS_INTEL_ISHTP0x1F
 
 /*
  * MT_TOOL types
diff --git a/include/standard-headers/linux/pci_regs.h 
b/include/standard-headers/linux/pci_regs.h
index 4040951..e5a2e68 100644
--- a/include/standard-headers/linux/pci_regs.h
+++ b/include/standard-headers/linux/pci_regs.h
@@ -612,6 +612,8 @@
  */
 #define PCI_EXP_DEVCAP236  /* Device Capabilities 2 */
 #define  PCI_EXP_DEVCAP2_ARI   0x0020 /* Alternative Routing-ID */
+#define  PCI_EXP_DEVCAP2_ATOMIC_ROUTE  0x0040 /* Atomic Op routing */
+#define PCI_EXP_DEVCAP2_ATOMIC_COMP64  0x0100 /* Atomic 64-bit compare */
 #define  PCI_EXP_DEVCAP2_LTR   0x0800 /* Latency tolerance 
reporting */
 #define  PCI_EXP_DEVCAP2_OBFF_MASK 0x000c /* OBFF support mechanism */
 #define  PCI_EXP_DEVCAP2_OBFF_MSG  0x0004 /* New message signaling */
@@ -619,6 +621,7 @@
 #define PCI_EXP_DEVCTL240  /* Device Control 2 */
 #define  PCI_EXP_DEVCTL2_COMP_TIMEOUT  0x000f  /* Completion Timeout Value */
 #define  PCI_EXP_DEVCTL2_ARI   0x0020  /* Alternative Routing-ID */
+#define PCI_EXP_DEVCTL2_ATOMIC_REQ 0x0040  /* Set Atomic requests */
 #define  PCI_EXP_DEVCTL2_IDO_REQ_EN0x0100  /* Allow IDO for requests */
 #define  PCI_EXP_DEVCTL2_IDO_CMP_EN0x0200  /* Allow IDO for completions */
 #define  PCI_EXP_DEVCTL2_LTR_EN0x0400  /* Enable LTR mechanism 
*/
@@ -671,7 +674,8 @@
 #define PCI_EXT_CAP_ID_PMUX0x1A/* Protocol Multiplexing */
 #define PCI_EXT_CAP_ID_PASID   0x1B/* Process Address Space ID */
 #define PCI_EXT_CAP_ID_DPC 0x1D/* Downstream Port Containment */
-#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_DPC
+#define PCI_EXT_CAP_ID_PTM 0x1F/* Precision Time Measurement */
+#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PTM
 
 #define PCI_EXT_CAP_DSN_SIZEOF 12
 #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
@@ -964,4 +968,13 @@
 
 #define PCI_EXP_DPC_SOURCE_ID  10  /* DPC Source Identifier */
 
+/* Precision Time Measurement */
+#define PCI_PTM_CAP0x04/* PTM Capability */
+#define  PCI_PTM_CAP_REQ   0x0001  /* Requester capable */
+#define  PCI_PTM_CAP_ROOT  0x0004  /* Root capable */
+#define  PCI_PTM_GRANULARITY_MASK  0xFF00  /* Clock granularity */
+#define PCI_PTM_CTRL   0x08/* PTM Control */
+#define  PCI_PTM_CTRL_ENABLE   0x0001  /* PTM enable */
+#define  PCI_PTM_CTRL_ROOT 0x0002  /* Root select */
+
 #endif /* LINUX_PCI_REGS_H */
diff --git a/linux-headers/asm-arm/kvm.h b/linux-headers/asm-arm/kvm.h
index 541268c..2fb7859 100644
--- a/linux-headers/asm-arm/kvm.h
+++ b/linux-headers/asm-arm/kvm.h
@@ -84,6 +84,13 @@ struct kvm_regs {
 #define KVM_VGIC_V2_DIST_SIZE  0x1000
 #define KVM_VGIC_V2_CPU_SIZE   0x2000
 
+/* Supported VGICv3 address types  */
+#define KVM_VGIC_V3_ADDR_TYPE_DIST 2
+#define KVM_VGIC_V3_ADDR_TYPE_REDIST   3
+
+#define KVM_VGIC_V3_DIST_SIZE  SZ_64K
+#define KVM_VGIC_V3_REDIST_SIZE(2 * SZ_64K)
+
 #define KVM_ARM_VCPU_POWER_OFF 0 /* CPU is started in OFF state */
 #define KVM_ARM_VCPU_PSCI_0_2  1 /* CPU uses PSCI v0.2 */
 
diff --git a/linux-headers/asm-x86/unistd_32.h 
b/linux-headers/asm-x86/unistd_32.h
index abeaf40..d45ea28 100644
--- a/linux-headers/asm-x86/unistd_32.h
+++ b/linux-headers/asm-x86/unistd_32.h
@@ -377,5 +377,8 @@
 #define __NR_copy_file_range 377
 #define __NR_preadv2 378
 #define __NR_pwritev2 379
+#define __NR_pkey_mprotect 380
+#define __NR_pkey_alloc 381
+#define __NR_pkey_free 382
 
 #endif /* _ASM_X86_UNISTD_32_H */
diff --git a/linux-headers/asm-x86/unistd_64.h 
b/linux-headers/asm-x86/unistd_64.h
index 73c3d1f..e22db91 100644
--- a/linux-headers/asm-x86/unistd_64.h
+++ b/linux-headers/asm-x86/unistd_64.h
@@ -330,5 +330,8 @@
 #define __NR_copy_file_range 326
 #define __NR_preadv2 327
 #define __NR_pwritev2 328
+#define __NR_pkey_mprotect 329
+#define __NR_pkey_alloc 330
+#define __NR_pkey_free 331
 
 #endif /* _ASM_X86_UNISTD_64_H */
diff --git a/linux-headers/asm-x86/unistd_x32.h 
b/linux-headers/asm-x86/unistd_x32.h
index e5aea76..84e58b2 100644
--- a/linux-headers/asm-x86/unistd_x32.h
+++ b/linux-headers/asm-x86/unistd_x32.h
@@ -283,6 +283,9 @@
 #define __NR_membarrier (__X32_SYSCALL_BIT + 324)
 #define __NR_mlock2 (__X32_SYSCALL_BIT + 325)
 #define __NR_copy_file_range (__X32_SYSCALL_BIT + 326)
+#define __NR_pkey_mprotect (__X32_SYS

[Qemu-devel] virtIO question

2016-11-14 Thread zhun...@gmail.com
I have a question about qemu: is this a bug in qemu version 1.2?
In qemu version 1.2, the avail event is set by this code:
 if (vq->vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
vring_avail_event(vq, vring_avail_idx(vq));
}
and in version 2.7 the code is:
 if (virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
vring_set_avail_event(vq, vq->last_avail_idx);
}

A big difference here is the value used: vring_avail_idx(vq) is the latest
value of VRingAvail.idx, while vq->last_avail_idx is not. I think the two
different values really do behave differently, and I think the latter is
right, is it?
Thanks a lot!!



zhun...@gmail.com


[Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Marcelo Tosatti
Check for KVM_CAP_ADJUST_CLOCK capability KVM_CLOCK_TSC_STABLE, which
indicates that KVM_GET_CLOCK returns a value as seen by the guest at
that moment.

For new machine types, use this value rather than reading 
from guest memory.

This reduces kvmclock difference on migration from 5s to 0.1s
(when max_downtime == 5s).

Signed-off-by: Marcelo Tosatti 

---
 hw/i386/kvm/clock.c|   88 +++--
 include/hw/i386/pc.h   |5 ++
 target-i386/kvm.c  |7 +++
 target-i386/kvm_i386.h |1 
 4 files changed, 98 insertions(+), 3 deletions(-)

Index: qemu-mig-advance-clock/hw/i386/kvm/clock.c
===
--- qemu-mig-advance-clock.orig/hw/i386/kvm/clock.c 2016-11-14 
09:07:55.519856329 -0200
+++ qemu-mig-advance-clock/hw/i386/kvm/clock.c  2016-11-14 10:19:45.723254737 
-0200
@@ -36,6 +36,12 @@
 
 uint64_t clock;
 bool clock_valid;
+
+/* whether machine supports reliable KVM_GET_CLOCK */
+bool mach_use_reliable_get_clock;
+
+/* whether source host supported reliable KVM_GET_CLOCK */
+bool src_use_reliable_get_clock;
 } KVMClockState;
 
 struct pvclock_vcpu_time_info {
@@ -91,15 +97,37 @@
 
 if (running) {
 struct kvm_clock_data data = {};
-uint64_t time_at_migration = kvmclock_current_nsec(s);
+uint64_t time_at_migration = 0;
 
-s->clock_valid = false;
+/* local (running VM) restore */
+if (s->clock_valid) {
+/*
+ * if host does not support reliable KVM_GET_CLOCK,
+ * read kvmclock value from memory
+ */
+if (!kvm_has_adjust_clock_stable()) {
+time_at_migration = kvmclock_current_nsec(s);
+}
+/* migration/savevm/init restore */
+} else {
+/*
+ * use s->clock in case machine uses reliable
+ * get clock and host where vm was executing
+ * supported reliable get clock
+ */
+if (!s->mach_use_reliable_get_clock ||
+!s->src_use_reliable_get_clock) {
+time_at_migration = kvmclock_current_nsec(s);
+}
+}
 
-/* We can't rely on the migrated clock value, just discard it */
+/* We can't rely on the saved clock value, just discard it */
 if (time_at_migration) {
 s->clock = time_at_migration;
 }
 
+s->clock_valid = false;
+
 data.clock = s->clock;
 ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
 if (ret < 0) {
@@ -152,22 +180,76 @@
 qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
 }
 
+static bool kvmclock_src_use_reliable_get_clock(void *opaque)
+{
+KVMClockState *s = opaque;
+
+/*
+ * On machine types that support reliable KVM_GET_CLOCK,
+ * if host kernel does provide reliable KVM_GET_CLOCK,
+ * set src_use_reliable_get_clock=true so that destination
+ * avoids reading kvmclock from memory.
+ */
+if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) {
+s->src_use_reliable_get_clock = true;
+}
+
+return s->src_use_reliable_get_clock;
+}
+
+static const VMStateDescription kvmclock_reliable_get_clock = {
+.name = "kvmclock/src_use_reliable_get_clock",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = kvmclock_src_use_reliable_get_clock,
+.fields = (VMStateField[]) {
+VMSTATE_BOOL(src_use_reliable_get_clock, KVMClockState),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static void kvmclock_pre_save(void *opaque)
+{
+KVMClockState *s = opaque;
+struct kvm_clock_data data;
+int ret;
+
+ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
+if (ret < 0) {
+fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
+abort();
+}
+s->clock = data.clock;
+}
+
 static const VMStateDescription kvmclock_vmsd = {
 .name = "kvmclock",
 .version_id = 1,
 .minimum_version_id = 1,
+.pre_save = kvmclock_pre_save,
 .fields = (VMStateField[]) {
 VMSTATE_UINT64(clock, KVMClockState),
 VMSTATE_END_OF_LIST()
+},
+.subsections = (const VMStateDescription * []) {
+&kvmclock_reliable_get_clock,
+NULL
 }
 };
 
+static Property kvmclock_properties[] = {
+DEFINE_PROP_BOOL("mach_use_reliable_get_clock", KVMClockState,
+  mach_use_reliable_get_clock, true),
+DEFINE_PROP_END_OF_LIST(),
+};
+
 static void kvmclock_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
 
 dc->realize = kvmclock_realize;
 dc->vmsd = &kvmclock_vmsd;
+dc->props = kvmclock_properties;
 }
 
 static const TypeInfo kvmclock_info = {
Index: qemu-mig-advance-clock/include/hw/i386/pc.h
===
--- qemu-mig-advance-clock.orig/include/hw/i38

[Qemu-devel] qemu 2.6.2: e1000 network card hanging after live migration

2016-11-14 Thread Peter Lieven

Hi,

I have observed a virtual server with an e1000 network card losing network
connectivity after live migration (not reproducible so far). In the log of the
vServer I found several messages like:

kernel: [13833656.832108] e1000 :00:05.0 eth2: Reset adapter

When I restarted the vServer to try to restore connectivity I triggered the 
following assertion:

qemu-2.6.2: hw/pci/pci.c:280: pcibus_reset: Assertion `bus->irq_count[i] == 0' 
failed.

Is this something that anyone has observed before?

Thanks,
Peter



Re: [Qemu-devel] [PATCH v2 0/5] [RFC] Add HAX support

2016-11-14 Thread Vincent Palatin
On Mon, Nov 14, 2016 at 1:36 PM, Stefan Weil  wrote:
> Am 11.11.2016 um 12:28 schrieb Vincent Palatin:
> [...]
>> I have tested the end result on a Windows 10 Pro machine (with UG support)
>> with the Intel HAXM module 6.0.4 and a large ChromiumOS x86_64 image to
>> exercise various code paths. It looks stable.
>> I also did a quick regression testing of the integration by running a Linux
>> build with KVM enabled.
>
> My test on Windows 7 with HAXM 6.0.4 fails:
>
> $ test/qemu-system-x86_64.exe --enable-hax
> HAX is working and emulator runs in fast virt mode.
> Unknown hax vcpu return 1

Sorry about this.
I did notice that just running the default SeaBIOS/iPXE was triggering
an issue and I forgot to debug it (as I'm mostly running Chromium OS
images).
I will have a look and try to sort this out.

>
> This application has requested the Runtime to terminate it in an unusual
> way.
> Please contact the application's support team for more information.
>
> $ test/qemu-system-i386.exe --enable-hax
> HAX is working and emulator runs in fast virt mode.
> Unknown hax vcpu return 1
>
> This application has requested the Runtime to terminate it in an unusual
> way.
> Please contact the application's support team for more information.
>
> I tested debug code (configure --enable-debug && make) based on
> latest QEMU from git, this patch series and my include fixes.
>
> Stefan
>



Re: [Qemu-devel] [PATCH 07/13] virtio-scsi: always use dataplane path if ioeventfd is active

2016-11-14 Thread Paolo Bonzini


On 11/11/2016 22:03, Alex Williamson wrote:
> On Fri, 11 Nov 2016 21:24:33 +0100
> Paolo Bonzini  wrote:
>> If you can post a backtrace of all threads at the time of the hang, from
>> origin/master (so without vhost, and not at ad07cd6) that could help.
> 
> Yes, it occurs with all of the vfio devices removed using VNC/Cirrus.

I cannot reproduce it anyway. :(

As you said on IRC it's a pretty standard "event loop doing nothing"
backtrace, so it seems that an eventfd write was lost.

Since I was lucky with the vhost patch, perhaps this can help:

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index f2ea29d..22d6cd5 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -202,13 +202,15 @@ void virtio_scsi_dataplane_stop(VirtIODevice *vdev)
 
 aio_context_acquire(s->ctx);
 virtio_scsi_clear_aio(s);
-aio_context_release(s->ctx);
-
-blk_drain_all(); /* ensure there are no in-flight requests */
 
 for (i = 0; i < vs->conf.num_queues + 2; i++) {
+VirtQueue *vq = virtio_get_queue(vdev, i);
 virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
+
virtio_queue_host_notifier_aio_read(virtio_queue_get_guest_notifier(vq));
 }
+aio_context_release(s->ctx);
+
+blk_drain_all(); /* ensure there are no in-flight requests */
 
 /* Clean up guest notifier (irq) */
 k->set_guest_notifiers(qbus->parent, vs->conf.num_queues + 2, false);
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 89b0b80..9c894d7 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2018,7 +2018,7 @@ EventNotifier *virtio_queue_get_guest_notifier(VirtQueue 
*vq)
 return &vq->guest_notifier;
 }
 
-static void virtio_queue_host_notifier_aio_read(EventNotifier *n)
+void virtio_queue_host_notifier_aio_read(EventNotifier *n)
 {
 VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
 if (event_notifier_test_and_clear(n)) {
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 35ede30..d3dfc69 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -274,6 +274,7 @@ int virtio_device_grab_ioeventfd(VirtIODevice *vdev);
 void virtio_device_release_ioeventfd(VirtIODevice *vdev);
 bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev);
 EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
+void virtio_queue_host_notifier_aio_read(EventNotifier *n);
 void virtio_queue_host_notifier_read(EventNotifier *n);
 void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
 void (*fn)(VirtIODevice *,

And if it doesn't work, here is some printf debugging.  It's pretty verbose but
the interesting part starts pretty much where you issue the virsh shutdown or
system_powerdown command:

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index f2ea29d..ec0f750 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -108,11 +108,13 @@ static void virtio_scsi_clear_aio(VirtIOSCSI *s)
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
 int i;
 
+printf("before clear\n");
 virtio_queue_aio_set_host_notifier_handler(vs->ctrl_vq, s->ctx, NULL);
 virtio_queue_aio_set_host_notifier_handler(vs->event_vq, s->ctx, NULL);
 for (i = 0; i < vs->conf.num_queues; i++) {
 virtio_queue_aio_set_host_notifier_handler(vs->cmd_vqs[i], s->ctx, 
NULL);
 }
+printf("after clear\n");
 }
 
 /* Context: QEMU global mutex held */
@@ -202,15 +204,20 @@ void virtio_scsi_dataplane_stop(VirtIODevice *vdev)
 
 aio_context_acquire(s->ctx);
 virtio_scsi_clear_aio(s);
-aio_context_release(s->ctx);
-
-blk_drain_all(); /* ensure there are no in-flight requests */
 
 for (i = 0; i < vs->conf.num_queues + 2; i++) {
+VirtQueue *vq = virtio_get_queue(vdev, i);
 virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
+
virtio_queue_host_notifier_aio_read(virtio_queue_get_guest_notifier(vq));
 }
+aio_context_release(s->ctx);
+
+printf("before drain\n");
+blk_drain_all(); /* ensure there are no in-flight requests */
+printf("after drain\n");
 
 /* Clean up guest notifier (irq) */
+printf("end of virtio_scsi_dataplane_stop\n");
 k->set_guest_notifiers(qbus->parent, vs->conf.num_queues + 2, false);
 s->dataplane_stopping = false;
 s->dataplane_started = false;
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 3e5ae6a..e8b83d4 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -75,6 +75,7 @@ static void virtio_scsi_complete_req(VirtIOSCSIReq *req)
 }
 
 if (req->sreq) {
+printf("finish %x\n", req->sreq->tag);
 req->sreq->hba_private = NULL;
 scsi_req_unref(req->sreq);
 }
@@ -549,6 +549,7 @@ static int virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI 
*s, VirtIOSCSIReq *req)
 return -ENOENT;
 }
 

Re: [Qemu-devel] Crashing in tcp_close

2016-11-14 Thread Stefan Hajnoczi
On Sun, Nov 13, 2016 at 11:55:16AM +, Brian Candler wrote:
> On 12/11/2016 10:44, Samuel Thibault wrote:
> > Oops, sorry, my patch was completely bogus, here is a proper one.
> 
> Excellent.
> 
> I've run the original build process 18 times (each run takes about 25
> minutes) without valgrind, and it hasn't crashed once. So this looks good.
> Thank you!

Excellent work guys!  Glad that the issue was solved.

Thanks for sticking around to debug this issue, Brian!

Stefan




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Fam Zheng
On Fri, 11/11 13:59, Karl Rister wrote:
> 
> Stefan
> 
> I ran some quick tests with your patches and got some pretty good gains,
> but also some seemingly odd behavior.
> 
> These results are for a 5 minute test doing sequential 4KB requests from
> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
> performed directly against the virtio-blk device (no filesystem) which
> is backed by a 400GB NVme card.
> 
> QEMU_AIO_POLL_MAX_NS      IOPs
>        unset            31,383
>            1            46,860
>            2            46,440
>            4            35,246
>            8            34,973
>           16            46,794
>           32            46,729
>           64            35,520
>          128            45,902

For sequential read with ioq=1, each request takes over 20,000ns at 45,000 IOPs.
Isn't a poll time of 128ns a mismatching order of magnitude? Have you tried
larger values? Not criticizing, just trying to understand how it works.

Also, do you happen to have numbers for unpatched QEMU (just to confirm that
"unset" case doesn't cause regression) and baremetal for comparison?

Fam



Re: [Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 13:36, Marcelo Tosatti wrote:
> +/* local (running VM) restore */
> +if (s->clock_valid) {
> +/*
> + * if host does not support reliable KVM_GET_CLOCK,
> + * read kvmclock value from memory
> + */
> +if (!kvm_has_adjust_clock_stable()) {
> +time_at_migration = kvmclock_current_nsec(s);

Just assign to s->clock here...


> +}
> +/* migration/savevm/init restore */
> +} else {
> +/*
> + * use s->clock in case machine uses reliable
> + * get clock and host where vm was executing
> + * supported reliable get clock
> + */
> +if (!s->mach_use_reliable_get_clock ||
> +!s->src_use_reliable_get_clock) {
> +time_at_migration = kvmclock_current_nsec(s);

... and here, so that time_at_migration is not needed anymore.

Also here it's enough to look at s->src_use_reliable_get_clock, because
if s->mach_use_reliable_get_clock is false,
s->src_use_reliable_get_clock will be false as well.

> +}
> +}
>  
> -/* We can't rely on the migrated clock value, just discard it */
> +/* We can't rely on the saved clock value, just discard it */
>  if (time_at_migration) {
>  s->clock = time_at_migration;

[...]

> 
> +static bool kvmclock_src_use_reliable_get_clock(void *opaque)
> +{
> +KVMClockState *s = opaque;
> +
> +/*
> + * On machine types that support reliable KVM_GET_CLOCK,
> + * if host kernel does provide reliable KVM_GET_CLOCK,
> + * set src_use_reliable_get_clock=true so that destination
> + * avoids reading kvmclock from memory.
> + */
> +if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) {
> +s->src_use_reliable_get_clock = true;
> +}
> +
> +return s->src_use_reliable_get_clock;
> +}

Here you can just return s->mach_use_reliable_get_clock.  To set
s->src_use_reliable_get_clock, after issuing KVM_GET_CLOCK you can look
at the KVM_CLOCK_TSC_STABLE bit in the kvm_clock struct's flags.

You don't actually need kvm_has_adjust_clock_stable(), but please place
the KVM_GET_CLOCK code for kvmclock_pre_save and
kvmclock_vm_state_change in a common function.
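For example (hypothetical helper name, just a sketch built from the code
already in the patch):

    static uint64_t kvmclock_get_clock(bool *tsc_stable)
    {
        struct kvm_clock_data data;
        int ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);

        if (ret < 0) {
            fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
            abort();
        }
        if (tsc_stable) {
            /* KVM_GET_CLOCK fills data.flags; check the TSC_STABLE bit */
            *tsc_stable = !!(data.flags & KVM_CLOCK_TSC_STABLE);
        }
        return data.clock;
    }

Both kvmclock_pre_save and kvmclock_vm_state_change could then call it.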

Also, just another small nit: please make your scripts use the "-p"
option on diff. :)

Thanks,

Paolo



Re: [Qemu-devel] [PATCH v2] qapi-schema: clarify 'colo' state for MigrationStatus

2016-11-14 Thread Stefan Hajnoczi
On Mon, Nov 14, 2016 at 10:36:45AM +0800, Hailiang Zhang wrote:
> ping ?
> 
> Anyone pick this up?

The original patch that added these lines went through Amit Shah and
David Gilbert.  I have CCed them.

Stefan




Re: [Qemu-devel] virtIO question

2016-11-14 Thread Stefan Hajnoczi
On Sat, Nov 12, 2016 at 04:43:21PM +0800, zhun...@gmail.com wrote:
> Thanks, the expression is not the key problem, I just wrote it wrong. The key
> problem is that, from what I get from the code, every time the driver adds a
> sg it will call virtqueue_kick. For example, in the network driver's
> start_xmit function, xmit_skb generates a sg list, adds it to the queue, and
> then virtqueue_kick is called. Why is it handled like this? Can you explain
> it to me? Thank you very much!!!

I can see there *are* cases in Linux 4.9-rc1 virtio_net.c:start_xmit()
where virtqueue_kick() is skipped so that multiple tx packets can be
added to the virtqueue in a single kick.

static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
{
...
bool kick = !skb->xmit_more;

...
if (kick || netif_xmit_stopped(txq))
virtqueue_kick(sq->vq);
...
}

Also keep in mind that the virtio driver APIs in Linux are used by
multiple device drivers (virtio_net.ko, virtio_blk.ko, etc).
virtio_net.ko does not use all features offered by the API.

Stefan




Re: [Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Marcelo Tosatti
On Mon, Nov 14, 2016 at 02:54:38PM +0100, Paolo Bonzini wrote:
> 
> 
> On 14/11/2016 13:36, Marcelo Tosatti wrote:
> > +/* local (running VM) restore */
> > +if (s->clock_valid) {
> > +/*
> > + * if host does not support reliable KVM_GET_CLOCK,
> > + * read kvmclock value from memory
> > + */
> > +if (!kvm_has_adjust_clock_stable()) {
> > +time_at_migration = kvmclock_current_nsec(s);
> 
> Just assign to s->clock here...

If kvmclock is not enabled, you want to use s->clock,
rather than 0.

> > +}
> > +/* migration/savevm/init restore */
> > +} else {
> > +/*
> > + * use s->clock in case machine uses reliable
> > + * get clock and host where vm was executing
> > + * supported reliable get clock
> > + */
> > +if (!s->mach_use_reliable_get_clock ||
> > +!s->src_use_reliable_get_clock) {
> > +time_at_migration = kvmclock_current_nsec(s);
> 
> ... and here, so that time_at_migration is not needed anymore.

Same as above.

> Also here it's enough to look at s->src_user_reliable_get_clock, because
> if s->mach_use_reliable_get_clock is false,
> s->src_use_reliable_get_clock will be false as well.

Yes, but I like the code annotation.

> > +}
> > +}
> >  
> > -/* We can't rely on the migrated clock value, just discard it */
> > +/* We can't rely on the saved clock value, just discard it */
> >  if (time_at_migration) {
> >  s->clock = time_at_migration;
> 
> [...]
> 
> > 
> > +static bool kvmclock_src_use_reliable_get_clock(void *opaque)
> > +{
> > +KVMClockState *s = opaque;
> > +
> > +/*
> > + * On machine types that support reliable KVM_GET_CLOCK,
> > + * if host kernel does provide reliable KVM_GET_CLOCK,
> > + * set src_use_reliable_get_clock=true so that destination
> > + * avoids reading kvmclock from memory.
> > + */
> > +if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) {
> > +s->src_use_reliable_get_clock = true;
> > +}
> > +
> > +return s->src_use_reliable_get_clock;
> > +}
> 
> Here you can just return s->mach_use_reliable_get_clock. 

mach_use_reliable_get_clock can be true but host might not support it.

>  To set
> s->src_use_reliable_get_clock, after issuing KVM_GET_CLOCK you can look
> at the KVM_CLOCK_TSC_STABLE bit in the kvm_clock struct's flags.

KVM_CLOCK_TSC_STABLE bit in the kvmclock structure != 
KVM_GET_CLOCK returns reliable value, right?


> You don't actually need kvm_has_adjust_clock_stable(), but please place
> the KVM_GET_CLOCK code for kvmclock_pre_save and
> kvmclock_vm_state_change in a common function.

Sure.

> Also, just another small nit: please make your scripts use the "-p"
> option on diff. :)

Sure.

> 
> Thanks,
> 
> Paolo



Re: [Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Juan Quintela
Marcelo Tosatti  wrote:
> Check for KVM_CAP_ADJUST_CLOCK capability KVM_CLOCK_TSC_STABLE, which
> indicates that KVM_GET_CLOCK returns a value as seen by the guest at
> that moment.
>
> For new machine types, use this value rather than reading 
> from guest memory.
>
> This reduces kvmclock difference on migration from 5s to 0.1s
> (when max_downtime == 5s).
>
> Signed-off-by: Marcelo Tosatti 

Acked-by: Juan Quintela 

But, if you have to respin it 


> +/* whether machine supports reliable KVM_GET_CLOCK */
> +bool mach_use_reliable_get_clock;
> +
> +/* whether source host supported reliable KVM_GET_CLOCK */
> +bool src_use_reliable_get_clock;

These two names are really long, but I don't have better suggestions :-(
>  if (running) {
>  struct kvm_clock_data data = {};
> -uint64_t time_at_migration = kvmclock_current_nsec(s);
> +uint64_t time_at_migration = 0;

This was not really "time_at_migration" before either, and now it looks
even more out of place. (As the name was already off, this is only a
suggestion.)

>  
> -s->clock_valid = false;
> +/* local (running VM) restore */
> +if (s->clock_valid) {
> +/*
> + * if host does not support reliable KVM_GET_CLOCK,
> + * read kvmclock value from memory
> + */
> +if (!kvm_has_adjust_clock_stable()) {
> +time_at_migration = kvmclock_current_nsec(s);
> +}
> +/* migration/savevm/init restore */
> +} else {
> +/*
> + * use s->clock in case machine uses reliable
> + * get clock and host where vm was executing
> + * supported reliable get clock
> + */

This comment is just weird.  Simplifying
/* If A and B do C */
if (!A || !B) {
   then D();
}

Doing the opposite comment?

Migration code looks right.

Once said that, I continue hating clocks.

Later, Juan.



Re: [Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Juan Quintela
Paolo Bonzini  wrote:
> On 14/11/2016 13:36, Marcelo Tosatti wrote:
>> +/* local (running VM) restore */
>> +if (s->clock_valid) {
>> +/*
>> + * if host does not support reliable KVM_GET_CLOCK,
>> + * read kvmclock value from memory
>> + */
>> +if (!kvm_has_adjust_clock_stable()) {
>> +time_at_migration = kvmclock_current_nsec(s);
>
> Just assign to s->clock here...

Agreed, I just wanted to make so many changes.


> Also, just another small nit: please make your scripts use the "-p"
> option on diff. :)

YES

I was searching for which functions the code belonged to :p

Thanks, Juan.



Re: [Qemu-devel] [kvm-unit-tests PATCH v5 07/11] arm/arm64: gicv2: add an IPI test

2016-11-14 Thread Andrew Jones
On Fri, Nov 11, 2016 at 02:13:31PM +0100, Andrew Jones wrote:
> On Fri, Nov 11, 2016 at 11:13:46AM +, Andre Przywara wrote:
> > Hi,
> > 
> > more a comment loosely related to this patch ...
> > 
> > > diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> > > index 3f6fa45c587e..68bf5cd6008f 100644
> > > --- a/arm/unittests.cfg
> > > +++ b/arm/unittests.cfg
> > > @@ -54,3 +54,10 @@ file = selftest.flat
> > >  smp = $MAX_SMP
> > >  extra_params = -append 'smp'
> > >  groups = selftest
> > > +
> > > +# Test GIC emulation
> > > +[gicv2-ipi]
> > > +file = gic.flat
> > > +smp = $((($MAX_SMP < 8)?$MAX_SMP:8))
> > 
> > So here we always go with the maximum number of VCPUs in the guest.
> > However (as you also noted in your cover-letter) running with a
> > different number of CPUs might be interesting, for instance with less
> > than 8 CPUs on a GICv2 (the ITARGETSR register must be masked) or in
> > general with an odd number (both literally and in the broader sense). I
> > have a test case with passes with 8 VCPUs but fails with less.
> > 
> > Is there any good way to run some tests multiple times with different
> > numbers of VCPUS?
> > Shall we add some "set" functionality to the smp parameter, so that we
> > can specify a list of desired test points?
> >
> 
> We can just add multiple entries, e.g.
> 
> [gicv2-ipi]
> file = gic.flat
> smp = $((($MAX_SMP < 8)?$MAX_SMP:8))
> [gicv2-ipi-3]
> file = gic.flat
> smp = $((($MAX_SMP > 3)?3:$MAX_SMP))
> 
> or whatever. But we need to always consider MAX_SMP, since some
> machines may have fewer than 8.
>

Hmm, thinking about this some more, the unit test needs to know how
many processors the test wants, in order to ensure it's testing
correctly. We should provide the number to both -smp and -append,
like we do for selftest-setup.

So, we can have one test that doesn't care, just uses MAX_SMP or 8,
like this patch introduces, but then for each test that does care we
need, e.g.

 smp = 3
 extra_params = '... smp=3 ...'

Then the unit test will start exactly 2 secondaries (or abort if
they're not available)

Anyway, I don't think this is something we should extend the framework
for, but rather address it with unittests.cfg and unit test input
validation.
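
E.g. the unit test could validate its input up front, something along these
lines (a sketch; nr_cpus comes from the usual setup code):

static void require_cpus(int wanted)
{
    /* abort unless the harness passed the cpus this test was written for */
    if (nr_cpus < wanted)
        report_abort("test needs %d cpus, only %d available", wanted, nr_cpus);
}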

Thanks,
drew



Re: [Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 15:00, Marcelo Tosatti wrote:
> On Mon, Nov 14, 2016 at 02:54:38PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 14/11/2016 13:36, Marcelo Tosatti wrote:
>>> +/* local (running VM) restore */
>>> +if (s->clock_valid) {
>>> +/*
>>> + * if host does not support reliable KVM_GET_CLOCK,
>>> + * read kvmclock value from memory
>>> + */
>>> +if (!kvm_has_adjust_clock_stable()) {
>>> +time_at_migration = kvmclock_current_nsec(s);
>>
>> Just assign to s->clock here...
> 
> If kvmclock is not enabled, you want to use s->clock,
> rather than 0.
> 
>>> +}
>>> +/* migration/savevm/init restore */
>>> +} else {
>>> +/*
>>> + * use s->clock in case machine uses reliable
>>> + * get clock and host where vm was executing
>>> + * supported reliable get clock
>>> + */
>>> +if (!s->mach_use_reliable_get_clock ||
>>> +!s->src_use_reliable_get_clock) {
>>> +time_at_migration = kvmclock_current_nsec(s);
>>
>> ... and here, so that time_at_migration is not needed anymore.
> 
> Same as above.

You're right.

>> Also here it's enough to look at s->src_user_reliable_get_clock, because
>> if s->mach_use_reliable_get_clock is false,
>> s->src_use_reliable_get_clock will be false as well.
> 
> Yes, but i like the code annotation.

Ah, I think we're looking at it differently.

I'm thinking "mach_use_reliable_get_clock is just for migration,
src_use_reliable_get_clock is the state".  Perhaps you're thinking of
enabling/disabling the whole new code for old machines?  What is the
advantage?

>>> +}
>>> +}
>>>  
>>> -/* We can't rely on the migrated clock value, just discard it */
>>> +/* We can't rely on the saved clock value, just discard it */
>>>  if (time_at_migration) {
>>>  s->clock = time_at_migration;
>>
>> [...]
>>
>>>
>>> +static bool kvmclock_src_use_reliable_get_clock(void *opaque)
>>> +{
>>> +KVMClockState *s = opaque;
>>> +
>>> +/*
>>> + * On machine types that support reliable KVM_GET_CLOCK,
>>> + * if host kernel does provide reliable KVM_GET_CLOCK,
>>> + * set src_use_reliable_get_clock=true so that destination
>>> + * avoids reading kvmclock from memory.
>>> + */
>>> +if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) {
>>> +s->src_use_reliable_get_clock = true;
>>> +}
>>> +
>>> +return s->src_use_reliable_get_clock;
>>> +}
>>
>> Here you can just return s->mach_use_reliable_get_clock. 
> 
> mach_use_reliable_get_clock can be true but host might not support it.

Yes, but the "needed" function is only required to avoid breaking
pc-i440fx-2.7 and earlier.  If you return true here, you can still
migrate a "false" value for src_use_reliable_get_clock.

>>  To set
>> s->src_use_reliable_get_clock, after issuing KVM_GET_CLOCK you can look
>> at the KVM_CLOCK_TSC_STABLE bit in the kvm_clock struct's flags.
> 
> KVM_CLOCK_TSC_STABLE bit in the kvmclock structure != 
> KVM_GET_CLOCK returns reliable value, right?

It is the same as "is using masterclock", which is actually a stricter
condition than the KVM_CHECK_EXTENSION return value.  The right check to
use is whether masterclock is in use, and then the idea is to treat
clock,src_use_reliable_get_clock as one tuple that is updated atomically.

Paolo



[Qemu-devel] [PATCH RFC v2 2/2] ARM: KVM: Enable in-kernel timers with user space gic

2016-11-14 Thread Alexander Graf
When running with KVM enabled, you can choose between emulating the
gic in kernel or user space. If the kernel supports in-kernel virtualization
of the interrupt controller, it will default to that. If not, it will
default to user space emulation.

Unfortunately when running in user mode gic emulation, we miss out on
timer events which are only available from kernel space. This patch leverages
the new kernel/user space pending line synchronization for those timer events.

Signed-off-by: Alexander Graf 

---

rfc1 -> rfc2:

  - use local variable for ARM_CPU
  - remove bear trap
  - move timer warning to gic device
---
 hw/intc/arm_gic.c|  7 +++
 include/sysemu/kvm.h | 11 +++
 kvm-all.c|  5 +
 kvm-stub.c   |  5 +
 target-arm/cpu.h |  3 +++
 target-arm/kvm.c | 20 
 6 files changed, 51 insertions(+)

diff --git a/hw/intc/arm_gic.c b/hw/intc/arm_gic.c
index 521aac3..1f3aacf 100644
--- a/hw/intc/arm_gic.c
+++ b/hw/intc/arm_gic.c
@@ -25,6 +25,7 @@
 #include "qom/cpu.h"
 #include "qemu/log.h"
 #include "trace.h"
+#include "sysemu/kvm.h"
 
 //#define DEBUG_GIC
 
@@ -1428,6 +1429,12 @@ static void arm_gic_realize(DeviceState *dev, Error 
**errp)
 return;
 }
 
+if (kvm_enabled() && !kvm_arm_supports_timer()) {
+error_setg(errp, "KVM with user space irqchip only works when the "
+ "host kernel supports KVM_CAP_ARM_TIMER");
+return;
+}
+
 /* This creates distributor and main CPU interface (s->cpuiomem[0]) */
 gic_init_irqs_and_mmio(s, gic_set_irq, gic_ops);
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index df67cc0..9715fee 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -227,6 +227,17 @@ int kvm_init_vcpu(CPUState *cpu);
 int kvm_cpu_exec(CPUState *cpu);
 int kvm_destroy_vcpu(CPUState *cpu);
 
+/**
+ * kvm_arm_supports_timer
+ *
+ * Not all KVM implementations support notifications for the CP15 timers to
+ * user space. This function indicates whether the current KVM implementation
+ * does support them.
+ *
+ * Returns: true if KVM supports using ARM core timers from user space
+ */
+bool kvm_arm_supports_timer(void);
+
 #ifdef NEED_CPU_H
 #include "cpu.h"
 
diff --git a/kvm-all.c b/kvm-all.c
index 330219e..8d4696c 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -2194,6 +2194,11 @@ int kvm_has_intx_set_mask(void)
 return kvm_state->intx_set_mask;
 }
 
+bool kvm_arm_supports_timer(void)
+{
+return kvm_check_extension(kvm_state, KVM_CAP_ARM_TIMER);
+}
+
 #ifdef KVM_CAP_SET_GUEST_DEBUG
 struct kvm_sw_breakpoint *kvm_find_sw_breakpoint(CPUState *cpu,
  target_ulong pc)
diff --git a/kvm-stub.c b/kvm-stub.c
index b1b6b96..a4d408b 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -157,4 +157,9 @@ bool kvm_has_free_slot(MachineState *ms)
 {
 return false;
 }
+
+bool kvm_arm_supports_timer(void)
+{
+return false;
+}
 #endif
diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index ca5c849..2c379a3 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -659,6 +659,9 @@ struct ARMCPU {
 
 ARMELChangeHook *el_change_hook;
 void *el_change_hook_opaque;
+
+/* Used to synchronize KVM and QEMU timer levels */
+uint8_t timer_irq_level;
 };
 
 static inline ARMCPU *arm_env_get_cpu(CPUARMState *env)
diff --git a/target-arm/kvm.c b/target-arm/kvm.c
index c00b94e..c5f0d37 100644
--- a/target-arm/kvm.c
+++ b/target-arm/kvm.c
@@ -527,6 +527,26 @@ void kvm_arch_pre_run(CPUState *cs, struct kvm_run *run)
 
 MemTxAttrs kvm_arch_post_run(CPUState *cs, struct kvm_run *run)
 {
+ARMCPU *cpu;
+bool vtimer_high;
+
+if (kvm_irqchip_in_kernel()) {
+/*
+ * We only need to sync timer states with user-space interrupt
+ * controllers, so return early and save cycles if we don't.
+ */
+return MEMTXATTRS_UNSPECIFIED;
+}
+
+cpu = ARM_CPU(cs);
+
+/* Synchronize our internal vtimer irq line with the kvm one */
+if (run->s.regs.timer_irq_level != cpu->timer_irq_level) {
+vtimer_high = run->s.regs.timer_irq_level & KVM_ARM_TIMER_VTIMER;
+qemu_set_irq(cpu->gt_timer_outputs[GTIMER_VIRT], vtimer_high ? 1 : 0);
+cpu->timer_irq_level = run->s.regs.timer_irq_level;
+}
+
 return MEMTXATTRS_UNSPECIFIED;
 }
 
-- 
1.8.5.6




[Qemu-devel] [PATCH RFC v2 0/2] Enable cp15 timers with user space gic & kvm

2016-11-14 Thread Alexander Graf
When running with KVM enabled, you can choose between emulating the
gic in kernel or user space. If the kernel supports in-kernel virtualization
of the interrupt controller, it will default to that. If not, it will
default to user space emulation.

Unfortunately when running in user mode gic emulation, we miss out on
timer events which are only available from kernel space.

This patch set leverages the new kernel/user space pending line synchronization
for those timer events.

rfc1 -> rfc2:

  - use local variable for ARM_CPU
  - remove bear trap
  - move timer warning to gic device

Alexander Graf (2):
  linux-headers: update
  ARM: KVM: Enable in-kernel timers with user space gic

 hw/intc/arm_gic.c |  7 +++
 include/sysemu/kvm.h  | 11 +++
 kvm-all.c |  5 +
 kvm-stub.c|  5 +
 linux-headers/asm-arm/kvm.h   |  1 +
 linux-headers/asm-arm64/kvm.h |  1 +
 linux-headers/linux/kvm.h |  6 ++
 target-arm/cpu.h  |  3 +++
 target-arm/kvm.c  | 20 
 9 files changed, 59 insertions(+)

-- 
1.8.5.6




[Qemu-devel] [PATCH RFC v2 1/2] linux-headers: update

2016-11-14 Thread Alexander Graf
This patch updates the Linux headers to include the in-progress user
space ARM timer patches. It is explicitly RFC, as the patches are not
merged yet.
---
 linux-headers/asm-arm/kvm.h   | 1 +
 linux-headers/asm-arm64/kvm.h | 1 +
 linux-headers/linux/kvm.h | 6 ++
 3 files changed, 8 insertions(+)

diff --git a/linux-headers/asm-arm/kvm.h b/linux-headers/asm-arm/kvm.h
index 541268c..5d58ec2 100644
--- a/linux-headers/asm-arm/kvm.h
+++ b/linux-headers/asm-arm/kvm.h
@@ -105,6 +105,7 @@ struct kvm_debug_exit_arch {
 };
 
 struct kvm_sync_regs {
+   __u8 timer_irq_level;
 };
 
 struct kvm_arch_memory_slot {
diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
index fd5a276..0e1cbd1 100644
--- a/linux-headers/asm-arm64/kvm.h
+++ b/linux-headers/asm-arm64/kvm.h
@@ -143,6 +143,7 @@ struct kvm_debug_exit_arch {
 #define KVM_GUESTDBG_USE_HW(1 << 17)
 
 struct kvm_sync_regs {
+   __u8 timer_irq_level;
 };
 
 struct kvm_arch_memory_slot {
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 4806e06..737113c 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -870,6 +870,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_ARM_TIMER 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1327,4 +1328,9 @@ struct kvm_assigned_msix_entry {
 #define KVM_X2APIC_API_USE_32BIT_IDS(1ULL << 0)
 #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
 
+/* Available with KVM_CAP_ARM_TIMER */
+
+/* Bits for run->arm_timer.timesource */
+#define KVM_ARM_TIMER_VTIMER   (1 << 0)
+
 #endif /* __LINUX_KVM_H */
-- 
1.8.5.6




Re: [Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Marcelo Tosatti
On Mon, Nov 14, 2016 at 03:22:02PM +0100, Paolo Bonzini wrote:
> 
> 
> On 14/11/2016 15:00, Marcelo Tosatti wrote:
> > On Mon, Nov 14, 2016 at 02:54:38PM +0100, Paolo Bonzini wrote:
> >>
> >>
> >> On 14/11/2016 13:36, Marcelo Tosatti wrote:
> >>> +/* local (running VM) restore */
> >>> +if (s->clock_valid) {
> >>> +/*
> >>> + * if host does not support reliable KVM_GET_CLOCK,
> >>> + * read kvmclock value from memory
> >>> + */
> >>> +if (!kvm_has_adjust_clock_stable()) {
> >>> +time_at_migration = kvmclock_current_nsec(s);
> >>
> >> Just assign to s->clock here...
> > 
> > If kvmclock is not enabled, you want to use s->clock,
> > rather than 0.
> > 
> >>> +}
> >>> +/* migration/savevm/init restore */
> >>> +} else {
> >>> +/*
> >>> + * use s->clock in case machine uses reliable
> >>> + * get clock and host where vm was executing
> >>> + * supported reliable get clock
> >>> + */
> >>> +if (!s->mach_use_reliable_get_clock ||
> >>> +!s->src_use_reliable_get_clock) {
> >>> +time_at_migration = kvmclock_current_nsec(s);
> >>
> >> ... and here, so that time_at_migration is not needed anymore.
> > 
> > Same as above.
> 
> You're right.
> 
> >> Also here it's enough to look at s->src_user_reliable_get_clock, because
> >> if s->mach_use_reliable_get_clock is false,
> >> s->src_use_reliable_get_clock will be false as well.
> > 
> > Yes, but i like the code annotation.
> 
> Ah, I think we're looking at it differently.

Well, i didnt want to mix the meaning of the variables:

+/* whether machine supports reliable KVM_GET_CLOCK */
+bool mach_use_reliable_get_clock;
+
+/* whether source host supported reliable KVM_GET_CLOCK */
+bool src_use_reliable_get_clock;

See the comments on top (otherwise, when you later look at the variable,
you have to think: well, it has one name, but it is also disabled by that
other path, so it means more than its name suggests, etc.).

> I'm thinking "mach_use_reliable_get_clock is just for migration,

That's whether the machine supports it. New machines have it enabled,
older ones don't.
> src_use_reliable_get_clock is the state". 

Thats whether the migration source supported it.

> Perhaps you're thinking of
> enabling/disabling the whole new code for old machines? 

source      destination   behaviour
supports    supports      on migration use s->clock,
                          on stop/cont as well

supports    ~supports     on migration use s->clock,
                          on stop/cont read from guest mem

~supports   supports      on migration read from guest,
                          on stop/cont use
                          kvm_get_clock/kvm_set_clock

~supports   ~supports     always read from guest memory


Thats what should happen (and thats what the patch should implement).


"support" means host supports your new KVM_GET_CLOCK/KVM_SET_CLOCK.

>  What is the
> advantage?

Well, it's necessary to use the correct value; otherwise
you can see time going backwards in the guest.

> 
> >>> +}
> >>> +}
> >>>  
> >>> -/* We can't rely on the migrated clock value, just discard it */
> >>> +/* We can't rely on the saved clock value, just discard it */
> >>>  if (time_at_migration) {
> >>>  s->clock = time_at_migration;
> >>
> >> [...]
> >>
> >>>
> >>> +static bool kvmclock_src_use_reliable_get_clock(void *opaque)
> >>> +{
> >>> +KVMClockState *s = opaque;
> >>> +
> >>> +/*
> >>> + * On machine types that support reliable KVM_GET_CLOCK,
> >>> + * if host kernel does provide reliable KVM_GET_CLOCK,
> >>> + * set src_use_reliable_get_clock=true so that destination
> >>> + * avoids reading kvmclock from memory.
> >>> + */
> >>> +if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) 
> >>> {
> >>> +s->src_use_reliable_get_clock = true;
> >>> +}
> >>> +
> >>> +return s->src_use_reliable_get_clock;
> >>> +}
> >>
> >> Here you can just return s->mach_use_reliable_get_clock. 
> > 
> > mach_use_reliable_get_clock can be true but host might not support it.
> 
> Yes, but the "needed" function is only required to avoid breaking
> pc-i440fx-2.7 and earlier. 

"needed" is required so that the migration between:

SRC DESTBEHAVIOUR
~supportsupportson migration read from guest,
on stop/cont use
kvm_get_clock/kvm_set_clock

Destination does not use KVM_GET_CLOCK value (which is
broken and should not be used).

> If you return true here, you can still
> migrate a "false" value for src_use_reliable_get_clock.

But the source only uses a reliable KVM_GET_CLOCK if

Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Karl Rister
On 11/14/2016 07:53 AM, Fam Zheng wrote:
> On Fri, 11/11 13:59, Karl Rister wrote:
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVme card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>    unset               31,383
>>    1                   46,860
>>    2                   46,440
>>    4                   35,246
>>    8                   34,973
>>    16                  46,794
>>    32                  46,729
>>    64                  35,520
>>    128                 45,902
> 
> For sequential read with ioq=1, each request takes >20us at ~45,000 IOPs.
> Isn't a poll time of 128ns a mismatching order of magnitude? Have you tried
> larger values? Not criticizing, just trying to understand how it works.

Not yet, I was just trying to get something out as quick as I could
(while juggling this with some other stuff...).  Frankly I was a bit
surprised that the low values made such an impact and then got
distracted by the behaviors of 4, 8, and 64.

> 
> Also, do you happen to have numbers for unpatched QEMU (just to confirm that
> "unset" case doesn't cause regression) and baremetal for comparison?

I didn't run this exact test on the same qemu.git master changeset
unpatched.  I did however previously try it against the v2.7.0 tag and
got somewhere around 27.5K IOPs.  My original intention was to apply the
patches to v2.7.0 but it wouldn't build.

We have done a lot of testing and tracing on the qemu-rhev package and
27K IOPs is about what we see there (with tracing disabled).

Given the patch discussions I saw I was mainly trying to get a sniff
test out and then do a more complete workup with whatever updates are made.

I should probably note that there are a lot of pinning optimizations
made here to assist in our tracing efforts which also result in improved
performance.  Ultimately, in a proper evaluation of these patches most
of that will be removed so the behavior may change somewhat.

> 
> Fam
> 


-- 
Karl Rister 



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Christian Borntraeger
On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote:
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us.  This is more than the "overhead"
> of QEMU itself (e.g. block layer).
> 
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for 
> virtio-blk
> virtqueue guest->host kick and Linux AIO completion.
> 
> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of nanoseconds 
> to
> poll before entering the usual blocking poll(2) syscall.  Try setting this
> variable to the time from old request completion to new virtqueue kick.
> 
> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get 
> any
> polling!
> 
> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
> values.  If you don't find a good value we should double-check the tracing 
> data
> to see if this experimental code can be improved.
> 
> Stefan Hajnoczi (3):
>   aio-posix: add aio_set_poll_handler()
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
> 
>  aio-posix.c | 133 
> 
>  block/linux-aio.c   |  17 +++
>  hw/virtio/virtio.c  |  19 
>  include/block/aio.h |  16 +++
>  4 files changed, 185 insertions(+)

Hmm, I see all affected threads using more CPU power, but the performance 
numbers are
somewhat inconclusive on s390. I have no proper test setup (only a shared 
LPAR), but
all numbers are in the same ballpark of 3-5Gbyte/sec for 5 disks for 4k random 
reads
with iodepth=8.

What I find interesting is that the guest still does a huge amount of exits for 
the
guest->host notifications. I think if we could combine this with some 
notification
suppression, then things could be even more interesting.
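
E.g. the polling side could switch guest notifications off while it is
polling anyway, roughly like this (sketch only; within_poll_window() and
handle_requests() are made-up helpers):

static void poll_virtqueue(VirtQueue *vq)
{
    /* don't ask the guest to kick us while we poll the ring ourselves */
    virtio_queue_set_notification(vq, 0);

    while (within_poll_window()) {
        if (!virtio_queue_empty(vq)) {
            handle_requests(vq);
        }
    }

    virtio_queue_set_notification(vq, 1);
}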

Christian




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Christian Borntraeger
On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote:
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us.  This is more than the "overhead"
> of QEMU itself (e.g. block layer).
> 
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for 
> virtio-blk
> virtqueue guest->host kick and Linux AIO completion.
> 
> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of nanoseconds 
> to
> poll before entering the usual blocking poll(2) syscall.  Try setting this
> variable to the time from old request completion to new virtqueue kick.
> 
> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get 
> any
> polling!
> 
> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
> values.  If you don't find a good value we should double-check the tracing 
> data
> to see if this experimental code can be improved.
> 
> Stefan Hajnoczi (3):
>   aio-posix: add aio_set_poll_handler()
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
> 
>  aio-posix.c | 133 
> 
>  block/linux-aio.c   |  17 +++
>  hw/virtio/virtio.c  |  19 
>  include/block/aio.h |  16 +++
>  4 files changed, 185 insertions(+)
> 

Another observation: With more iothreads than host CPUs the performance drops 
significantly.





Re: [Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 15:50, Marcelo Tosatti wrote:
> Well, i didnt want to mix the meaning of the variables:
> 
> +/* whether machine supports reliable KVM_GET_CLOCK */
> +bool mach_use_reliable_get_clock;
> +
> +/* whether source host supported reliable KVM_GET_CLOCK */
> +bool src_use_reliable_get_clock;
> 
> See the comments on top (later if you look at the variable, 
> then have to think: well it has one name, but its disabled 
> by that other path as well, so its more than its 
> name,etc...).
> 
>> I'm thinking "mach_use_reliable_get_clock is just for migration,
> 
> That's whether the machine supports it. New machines have it enabled,
> older ones don't.

Yes.

>> src_use_reliable_get_clock is the state". 
> 
> Thats whether the migration source supported it.

But it's not used only for migration.  It's used on every vmstate change
(running->stop and stop->running), isn't it?  I think that, apart from
the migration case, it's better to use s->clock if kvmclock is stable,
even on older machine types.

> +static bool kvmclock_src_use_reliable_get_clock(void *opaque)
> +{
> +KVMClockState *s = opaque;
> +
> +/*
> + * On machine types that support reliable KVM_GET_CLOCK,
> + * if host kernel does provide reliable KVM_GET_CLOCK,
> + * set src_use_reliable_get_clock=true so that destination
> + * avoids reading kvmclock from memory.
> + */
> +if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) 
> {
> +s->src_use_reliable_get_clock = true;
> +}
> +
> +return s->src_use_reliable_get_clock;
> +}

 Here you can just return s->mach_use_reliable_get_clock. 
>>>
>>> mach_use_reliable_get_clock can be true but host might not support it.
>>
>> Yes, but the "needed" function is only required to avoid breaking
>> pc-i440fx-2.7 and earlier. 
> 
> "needed" is required so that the migration between:
> 
> SRC         DEST        BEHAVIOUR
> ~supports   supports    on migration read from guest,
>                         on stop/cont use
>                         kvm_get_clock/kvm_set_clock
> 
> Destination does not use KVM_GET_CLOCK value (which is
> broken and should not be used).

If needed returns false, the destination will see
src_use_reliable_get_clock = false anyway.

If needed returns true, the destination can still see
src_use_reliable_get_clock = false if that's the value.

needed  src_use_reliable_get_clock  effect
false   false                       use kvmclock_current_nsec
false   true                        use kvmclock_current_nsec
true    false                       use kvmclock_current_nsec
true    true                        use s->clock


So the idea is:

- set s->clock and s->reliable_get_clock on every KVM_GET_CLOCK

- on migration source, use subsections and
s->mach_use_reliable_get_clock to avoid breaking migration on pre-2.8
machine types

- on migration destination, use .pre_load so that s->reliable_get_clock
is initialized to false on older machine types

>> If you return true here, you can still
>> migrate a "false" value for src_use_reliable_get_clock.
> 
> But the source only uses a reliable KVM_GET_CLOCK if 
> both conditions are true.
> 
> And the subsection is only needed if the source
> uses a reliable KVM_GET_CLOCK.

Yes, but the point of the subsection is just to avoid breaking migration
format.
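
For reference, the subsection plumbing could look roughly like this (only a
sketch, using the field names from Marcelo's patch):

static const VMStateDescription kvmclock_reliable_get_clock = {
    .name = "kvmclock/src_use_reliable_get_clock",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = kvmclock_src_use_reliable_get_clock,
    .fields = (VMStateField[]) {
        VMSTATE_BOOL(src_use_reliable_get_clock, KVMClockState),
        VMSTATE_END_OF_LIST()
    }
};

/* older machine types never send the subsection, so default to "unreliable" */
static int kvmclock_pre_load(void *opaque)
{
    KVMClockState *s = opaque;

    s->src_use_reliable_get_clock = false;
    return 0;
}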

>> It is the same as "is using masterclock", which is actually a stricter
>> condition than the KVM_CHECK_EXTENSION return value.  The right check to
>> use is whether masterclock is in use, 
> 
> Actually it's "has a reliable KVM_GET_CLOCK" (which returns
> get_kernel_clock() + (rdtsc() - tsc_timestamp)),
> 
> "broken KVM_GET_CLOCK" = get_kernel_clock()

Yes.  And these end up being the same.

Paolo



Re: [Qemu-devel] [PATCH v7 RFC] block/vxhs: Initial commit to add Veritas HyperScale VxHS block device support

2016-11-14 Thread Stefan Hajnoczi
On Wed, Sep 28, 2016 at 10:45 PM, Stefan Hajnoczi  wrote:
> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>
> Review of .bdrv_open() and .bdrv_aio_writev() code paths.
>
> The big issues I see in this driver and libqnio:
>
> 1. Showstoppers like broken .bdrv_open() and leaking memory on every
>reply message.
> 2. Insecure due to missing input validation (network packets and
>configuration) and incorrect string handling.
> 3. Not fully asynchronous so QEMU and the guest may hang.
>
> Please think about the whole codebase and not just the lines I've
> pointed out in this review when fixing these sorts of issues.  There may
> be similar instances of these bugs elsewhere and it's important that
> they are fixed so that this can be merged.

Ping?

You didn't respond to the comments I raised.  The libqnio buffer
overflows and other issues from this email are still present.

I put a lot of time into reviewing this patch series and libqnio.  If
you want to get reviews please address feedback before sending a new
patch series.

>
>> +/*
>> + * Structure per vDisk maintained for state
>> + */
>> +typedef struct BDRVVXHSState {
>> +int fds[2];
>> +int64_t vdisk_size;
>> +int64_t vdisk_blocks;
>> +int64_t vdisk_flags;
>> +int vdisk_aio_count;
>> +int event_reader_pos;
>> +VXHSAIOCB   *qnio_event_acb;
>> +void*qnio_ctx;
>> +QemuSpinvdisk_lock; /* Lock to protect BDRVVXHSState */
>> +QemuSpinvdisk_acb_lock;  /* Protects ACB */
>
> These comments are insufficient for documenting locking.  Not all fields
> are actually protected by these locks.  Please order fields according to
> lock coverage:
>
> typedef struct VXHSAIOCB {
> ...
>
> /* Protected by BDRVVXHSState->vdisk_acb_lock */
> int segments;
> ...
> };
>
> typedef struct BDRVVXHSState {
> ...
>
> /* Protected by vdisk_lock */
> QemuSpinvdisk_lock;
> int vdisk_aio_count;
> QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq;
> ...
> }
>
>> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx)
>> +{
>> +/*
>> + * Close vDisk device
>> + */
>> +if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) {
>> +iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd);
>
> libqnio comment:
> Why does iio_devclose() take an unused cfd argument?  Perhaps it can be
> dropped.
>
>> +s->vdisk_hostinfo[idx].vdisk_rfd = -1;
>> +}
>> +
>> +/*
>> + * Close QNIO channel against cached channel-fd
>> + */
>> +if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) {
>> +iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd);
>
> libqnio comment:
> Why does iio_devclose() take an int32_t cfd argument but iio_close()
> takes a uint32_t cfd argument?
>
>> +s->vdisk_hostinfo[idx].qnio_cfd = -1;
>> +}
>> +}
>> +
>> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr,
>> +  int *rfd, const char *file_name)
>> +{
>> +/*
>> + * Open qnio channel to storage agent if not opened before.
>> + */
>> +if (*cfd < 0) {
>> +*cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0);
>
> libqnio comments:
>
> 1.
> There is a buffer overflow in qnio_create_channel().  strncpy() is used
> incorrectly so long hostname or port (both can be 99 characters long)
> will overflow channel->name[] (64 characters) or channel->port[] (8
> characters).
>
> strncpy(channel->name, hostname, strlen(hostname) + 1);
> strncpy(channel->port, port, strlen(port) + 1);
>
> The third argument must be the size of the *destination* buffer, not the
> source buffer.  Also note that strncpy() doesn't NUL-terminate the
> destination string so you must do that manually to ensure there is a NUL
> byte at the end of the buffer.
>
> 2.
> channel is leaked in the "Failed to open single connection" error case
> in qnio_create_channel().
>
> 3.
> If host is longer the 63 characters then the ioapi_ctx->channels and
> qnio_ctx->channels maps will use different keys due to string truncation
> in qnio_create_channel().  This means "Channel already exists" in
> qnio_create_channel() and possibly other things will not work as
> expected.
>
>> +if (*cfd < 0) {
>> +trace_vxhs_qnio_iio_open(of_vsa_addr);
>> +return -ENODEV;
>> +}
>> +}
>> +
>> +/*
>> + * Open vdisk device
>> + */
>> +*rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0);
>
> libqnio comment:
> Buffer overflow in iio_devopen() since chandev[128] is not large enough
> to hold channel[100] + " " + devpath[arbitrary length] chars:
>
> sprintf(chandev, "%s %s", channel, devpath);
>
>> +
>> +if (*rfd < 0) {
>> +if (*cfd >= 0) {
>
> This check is always true.  Otherwise

Re: [Qemu-devel] [PATCH v6 1/2] block/vxhs.c: Add support for a new block device type called "vxhs"

2016-11-14 Thread Stefan Hajnoczi
On Mon, Nov 07, 2016 at 04:59:44PM -0800, Ashish Mittal wrote:
> Source code for the qnio library that this code loads can be downloaded from:
> https://github.com/MittalAshish/libqnio.git
> 
> Sample command line using the JSON syntax:
> ./qemu-system-x86_64 -name instance-0008 -S -vnc 0.0.0.0:0 -k en-us
> -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
> -msg timestamp=on
> 'json:{"driver":"vxhs","vdisk-id":"c3e9095a-a5ee-4dce-afeb-2a59fb387410",
> "server":{"host":"172.172.17.4","port":""}}'
> 
> Sample command line using the URI syntax:
> qemu-img convert -f raw -O raw -n
> /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad
> vxhs://192.168.0.1:/c6718f6b-0401-441d-a8c3-1f0064d75ee0
> 
> Signed-off-by: Ashish Mittal 
> ---
> v6 changelog:
> (1) Added qemu-iotests for VxHS as a new patch in the series.
> (2) Replaced release version from 2.8 to 2.9 in block-core.json.
> 
> v5 changelog:
> (1) Incorporated v4 review comments.
> 
> v4 changelog:
> (1) Incorporated v3 review comments on QAPI changes.
> (2) Added refcounting for device open/close.
> Free library resources on last device close.
> 
> v3 changelog:
> (1) Added QAPI schema for the VxHS driver.
> 
> v2 changelog:
> (1) Changes done in response to v1 comments.
> 
>  block/Makefile.objs  |   2 +
>  block/trace-events   |  21 ++
>  block/vxhs.c | 689 
> +++
>  configure|  41 +++
>  qapi/block-core.json |  21 +-
>  5 files changed, 772 insertions(+), 2 deletions(-)
>  create mode 100644 block/vxhs.c
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 67a036a..58313a2 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -18,6 +18,7 @@ block-obj-$(CONFIG_LIBNFS) += nfs.o
>  block-obj-$(CONFIG_CURL) += curl.o
>  block-obj-$(CONFIG_RBD) += rbd.o
>  block-obj-$(CONFIG_GLUSTERFS) += gluster.o
> +block-obj-$(CONFIG_VXHS) += vxhs.o
>  block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
>  block-obj-$(CONFIG_LIBSSH2) += ssh.o
>  block-obj-y += accounting.o dirty-bitmap.o
> @@ -38,6 +39,7 @@ rbd.o-cflags   := $(RBD_CFLAGS)
>  rbd.o-libs := $(RBD_LIBS)
>  gluster.o-cflags   := $(GLUSTERFS_CFLAGS)
>  gluster.o-libs := $(GLUSTERFS_LIBS)
> +vxhs.o-libs:= $(VXHS_LIBS)
>  ssh.o-cflags   := $(LIBSSH2_CFLAGS)
>  ssh.o-libs := $(LIBSSH2_LIBS)
>  archipelago.o-libs := $(ARCHIPELAGO_LIBS)
> diff --git a/block/trace-events b/block/trace-events
> index 882c903..efdd5ef 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -112,3 +112,24 @@ qed_aio_write_data(void *s, void *acb, int ret, uint64_t 
> offset, size_t len) "s
>  qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, 
> uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
>  qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, 
> uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
>  qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) 
> "s %p acb %p ret %d offset %"PRIu64" len %zu"
> +
> +# block/vxhs.c
> +vxhs_iio_callback(int error, int reason) "ctx is NULL: error %d, reason %d"
> +vxhs_setup_qnio(void *s) "Context to HyperScale IO manager = %p"
> +vxhs_iio_callback_chnfail(int err, int error) "QNIO channel failed, no i/o 
> %d, %d"
> +vxhs_iio_callback_unknwn(int opcode, int err) "unexpected opcode %d, errno 
> %d"
> +vxhs_open_fail(int ret) "Could not open the device. Error = %d"
> +vxhs_open_epipe(int ret) "Could not create a pipe for device. Bailing out. 
> Error=%d"
> +vxhs_aio_rw_invalid(int req) "Invalid I/O request iodir %d"
> +vxhs_aio_rw_ioerr(char *guid, int iodir, uint64_t size, uint64_t off, void 
> *acb, int ret, int err) "IO ERROR (vDisk %s) FOR : Read/Write = %d size = %lu 
> offset = %lu ACB = %p. Error = %d, errno = %d"
> +vxhs_get_vdisk_stat_err(char *guid, int ret, int err) "vDisk (%s) stat ioctl 
> failed, ret = %d, errno = %d"
> +vxhs_get_vdisk_stat(char *vdisk_guid, uint64_t vdisk_size) "vDisk %s stat 
> ioctl returned size %lu"
> +vxhs_qnio_iio_open(const char *ip) "Failed to connect to storage agent on 
> host-ip %s"
> +vxhs_qnio_iio_devopen(const char *fname) "Failed to open vdisk device: %s"
> +vxhs_complete_aio(void *acb, uint64_t ret) "aio failed acb %p ret %ld"
> +vxhs_parse_uri_filename(const char *filename) "URI passed via 
> bdrv_parse_filename %s"
> +vxhs_qemu_init_vdisk(const char *vdisk_id) "vdisk-id from json %s"
> +vxhs_parse_uri_hostinfo(int num, char *host, int port) "Host %d: IP %s, Port 
> %d"
> +vxhs_qemu_init(char *of_vsa_addr, int port) "Adding host %s:%d to 
> BDRVVXHSState"
> +vxhs_qemu_init_filename(const char *filename) "Filename passed as %s"
> +vxhs_close(char *vdisk_guid) "Closing vdisk %s"
> diff --git a/block/vxhs.c b/block/vxhs.c
> new file mode 100644
> index 000..8913e8f
> --- /dev/null
> +++ b/block/vxhs.c
> @@ -0,0 +1,689 @@
> +/*
> + * QEMU Block driver for

Re: [Qemu-devel] [kvm-unit-tests PATCH v8 2/3] arm: pmu: Check cycle count increases

2016-11-14 Thread Christopher Covington
Hi Drew, Wei,

On 11/14/2016 05:05 AM, Andrew Jones wrote:
> On Fri, Nov 11, 2016 at 01:55:49PM -0600, Wei Huang wrote:
>>
>>
>> On 11/11/2016 01:43 AM, Andrew Jones wrote:
>>> On Tue, Nov 08, 2016 at 12:17:14PM -0600, Wei Huang wrote:
 From: Christopher Covington 

 Ensure that reads of the PMCCNTR_EL0 are monotonically increasing,
 even for the smallest delta of two subsequent reads.

 Signed-off-by: Christopher Covington 
 Signed-off-by: Wei Huang 
 ---
  arm/pmu.c | 98 
 +++
  1 file changed, 98 insertions(+)

 diff --git a/arm/pmu.c b/arm/pmu.c
 index 0b29088..d5e3ac3 100644
 --- a/arm/pmu.c
 +++ b/arm/pmu.c
 @@ -14,6 +14,7 @@
   */
  #include "libcflat.h"
  
 +#define PMU_PMCR_E (1 << 0)
  #define PMU_PMCR_N_SHIFT   11
  #define PMU_PMCR_N_MASK0x1f
  #define PMU_PMCR_ID_SHIFT  16
 @@ -21,6 +22,10 @@
  #define PMU_PMCR_IMP_SHIFT 24
  #define PMU_PMCR_IMP_MASK  0xff
  
 +#define PMU_CYCLE_IDX  31
 +
 +#define NR_SAMPLES 10
 +
  #if defined(__arm__)
  static inline uint32_t pmcr_read(void)
  {
 @@ -29,6 +34,47 @@ static inline uint32_t pmcr_read(void)
asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r" (ret));
return ret;
  }
 +
 +static inline void pmcr_write(uint32_t value)
 +{
 +  asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r" (value));
 +}
 +
 +static inline void pmselr_write(uint32_t value)
 +{
 +  asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r" (value));
 +}
 +
 +static inline void pmxevtyper_write(uint32_t value)
 +{
 +  asm volatile("mcr p15, 0, %0, c9, c13, 1" : : "r" (value));
 +}
 +
 +/*
 + * While PMCCNTR can be accessed as a 64 bit coprocessor register, 
 returning 64
 + * bits doesn't seem worth the trouble when differential usage of the 
 result is
 + * expected (with differences that can easily fit in 32 bits). So just 
 return
 + * the lower 32 bits of the cycle count in AArch32.
>>>
>>> Like I said in the last review, I'd rather we not do this. We should
>>> return the full value and then the test case should confirm the upper
>>> 32 bits are zero.
>>
>> Unless I miss something in ARM documentation, ARMv7 PMCCNTR is a 32-bit
>> register. We can force it to a more coarse-grained cycle counter with
>> PMCR.D bit=1 (see below). But it is still not a 64-bit register.

AArch32 System Register Descriptions
Performance Monitors registers
PMCCNTR, Performance Monitors Cycle Count Register

To access the PMCCNTR when accessing as a 32-bit register:
MRC p15,0,<Rt>,c9,c13,0 ; Read PMCCNTR[31:0] into Rt
MCR p15,0,<Rt>,c9,c13,0 ; Write Rt to PMCCNTR[31:0]. PMCCNTR[63:32] are unchanged

To access the PMCCNTR when accessing as a 64-bit register:
MRRC p15,0,<Rt>,<Rt2>,c9 ; Read PMCCNTR[31:0] into Rt and PMCCNTR[63:32] into Rt2
MCRR p15,0,<Rt>,<Rt2>,c9 ; Write Rt to PMCCNTR[31:0] and Rt2 to PMCCNTR[63:32]
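
In kvm-unit-tests terms, the 64-bit AArch32 read would then be something
like this (untested sketch):

static inline uint64_t pmccntr_read(void)
{
    uint32_t lo, hi;

    /* MRRC: PMCCNTR[31:0] into lo, PMCCNTR[63:32] into hi */
    asm volatile("mrrc p15, 0, %0, %1, c9" : "=r" (lo), "=r" (hi));
    return ((uint64_t)hi << 32) | lo;
}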

Regards,
Cov

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code
Aurora Forum, a Linux Foundation Collaborative Project.



Re: [Qemu-devel] [kvm-unit-tests PATCH v5 09/11] arm/arm64: add initial gicv3 support

2016-11-14 Thread Andrew Jones
On Fri, Nov 11, 2016 at 04:31:36PM +, Andre Przywara wrote:
> Hi,
> 
> On 10/11/16 17:21, Andrew Jones wrote:
> > Signed-off-by: Andrew Jones 
> > 
> > ---
> > v5: use modern register names [Andre]
> > v4:
> >  - only take defines from kernel we need now [Andre]
> >  - simplify enable by not caring if we reinit the distributor [drew]
> > v2:
> >  - configure irqs as NS GRP1
> > ---
> >  lib/arm/asm/arch_gicv3.h   | 42 +
> >  lib/arm/asm/gic-v3.h   | 94 
> > ++
> >  lib/arm/asm/gic.h  |  6 ++-
> >  lib/arm/gic.c  | 65 
> >  lib/arm64/asm/arch_gicv3.h | 44 ++
> >  lib/arm64/asm/gic-v3.h |  1 +
> >  lib/arm64/asm/sysreg.h | 44 ++
> >  7 files changed, 294 insertions(+), 2 deletions(-)
> >  create mode 100644 lib/arm/asm/arch_gicv3.h
> >  create mode 100644 lib/arm/asm/gic-v3.h
> >  create mode 100644 lib/arm64/asm/arch_gicv3.h
> >  create mode 100644 lib/arm64/asm/gic-v3.h
> >  create mode 100644 lib/arm64/asm/sysreg.h
> > 
> > diff --git a/lib/arm/asm/arch_gicv3.h b/lib/arm/asm/arch_gicv3.h
> > new file mode 100644
> > index ..81a1e5f6c29c
> > --- /dev/null
> > +++ b/lib/arm/asm/arch_gicv3.h
> > @@ -0,0 +1,42 @@
> > +/*
> > + * All ripped off from arch/arm/include/asm/arch_gicv3.h
> > + *
> > + * Copyright (C) 2016, Red Hat Inc, Andrew Jones 
> > + *
> > + * This work is licensed under the terms of the GNU LGPL, version 2.
> > + */
> > +#ifndef _ASMARM_ARCH_GICV3_H_
> > +#define _ASMARM_ARCH_GICV3_H_
> > +
> > +#ifndef __ASSEMBLY__
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define __stringify xstr
> > +
> > +#define __ACCESS_CP15(CRn, Op1, CRm, Op2)  p15, Op1, %0, CRn, CRm, Op2
> > +
> > +#define ICC_PMR__ACCESS_CP15(c4, 0, c6, 0)
> > +#define ICC_IGRPEN1__ACCESS_CP15(c12, 0, c12, 7)
> > +
> > +static inline void gicv3_write_pmr(u32 val)
> > +{
> > +   asm volatile("mcr " __stringify(ICC_PMR) : : "r" (val));
> > +}
> > +
> > +static inline void gicv3_write_grpen1(u32 val)
> > +{
> > +   asm volatile("mcr " __stringify(ICC_IGRPEN1) : : "r" (val));
> > +   isb();
> > +}
> > +
> > +static inline u64 gicv3_read_typer(const volatile void __iomem *addr)
> 
> It may be worth to add that this is for GICR_TYPER (or GITS_TYPER),
> because GICD_TYPER is 32-bit only.
> Or to make the naming generic (because the code actually is), along the
> lines of read_64bit_reg or the like?

Hmm, the fact that these two consecutive mmio addresses allow me to
read and combine them into one address isn't a general property, but
rather one of this particular register. So I think we want typer in
the name. I'm not sure how to improve on the name, since it's useful
for both GICR_ and GITS_. I'll just add a comment above it.
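
Something like (wording to be bikeshedded):

/*
 * GICR_TYPER and GITS_TYPER are 64-bit registers that may be read as two
 * consecutive 32-bit accesses; this helper is specific to those registers,
 * it is not a generic 64-bit MMIO accessor.
 */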

> 
> > +{
> > +   u64 val = readl(addr);
> > +   val |= (u64)readl(addr + 4) << 32;
> > +   return val;
> > +}
> > +
> > +#endif /* !__ASSEMBLY__ */
> > +#endif /* _ASMARM_ARCH_GICV3_H_ */
> > diff --git a/lib/arm/asm/gic-v3.h b/lib/arm/asm/gic-v3.h
> > new file mode 100644
> > index ..e0f303d82508
> > --- /dev/null
> > +++ b/lib/arm/asm/gic-v3.h
> > @@ -0,0 +1,94 @@
> > +/*
> > + * All GIC* defines are lifted from include/linux/irqchip/arm-gic-v3.h
> > + *
> > + * Copyright (C) 2016, Red Hat Inc, Andrew Jones 
> > + *
> > + * This work is licensed under the terms of the GNU LGPL, version 2.
> > + */
> > +#ifndef _ASMARM_GIC_V3_H_
> > +#define _ASMARM_GIC_V3_H_
> > +
> > +#ifndef _ASMARM_GIC_H_
> > +#error Do not directly include . Include 
> > +#endif
> > +
> > +#define GICD_CTLR_RWP  (1U << 31)
> > +#define GICD_CTLR_ARE_NS   (1U << 4)
> > +#define GICD_CTLR_ENABLE_G1A   (1U << 1)
> > +#define GICD_CTLR_ENABLE_G1(1U << 0)
> 
> +1 to Alex for adding a comment noting the non-secure view here.

Will do.

> 
> > +
> > +/* Re-Distributor registers, offsets from RD_base */
> > +#define GICR_TYPER 0x0008
> > +
> > +#define GICR_TYPER_LAST(1U << 4)
> > +
> > +/* Re-Distributor registers, offsets from SGI_base */
> > +#define GICR_IGROUPR0  GICD_IGROUPR
> > +#define GICR_ISENABLER0GICD_ISENABLER
> > +#define GICR_IPRIORITYR0   GICD_IPRIORITYR
> > +
> > +#include 
> > +
> > +#ifndef __ASSEMBLY__
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +struct gicv3_data {
> > +   void *dist_base;
> > +   void *redist_base[NR_CPUS];
> > +   unsigned int irq_nr;
> > +};
> > +extern struct gicv3_data gicv3_data;
> > +
> > +#define gicv3_dist_base()  (gicv3_data.dist_base)
> > +#define gicv3_redist_base()
> > (gicv3_data.redist_base[smp_processor_id()])
> > +#define gicv3_sgi_base()   
> > (gicv3_data.redist_base[smp_processor_id()] + SZ_64K)
> > +
> > +extern int gicv3_init(void);
> > +ext

Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Stefan Hajnoczi
On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
> > Recent performance investigation work done by Karl Rister shows that the
> > guest->host notification takes around 20 us.  This is more than the 
> > "overhead"
> > of QEMU itself (e.g. block layer).
> > 
> > One way to avoid the costly exit is to use polling instead of notification.
> > The main drawback of polling is that it consumes CPU resources.  In order to
> > benefit performance the host must have extra CPU cycles available on 
> > physical
> > CPUs that aren't used by the guest.
> > 
> > This is an experimental AioContext polling implementation.  It adds a 
> > polling
> > callback into the event loop.  Polling functions are implemented for 
> > virtio-blk
> > virtqueue guest->host kick and Linux AIO completion.
> > 
> > The QEMU_AIO_POLL_MAX_NS environment variable sets the number of 
> > nanoseconds to
> > poll before entering the usual blocking poll(2) syscall.  Try setting this
> > variable to the time from old request completion to new virtqueue kick.
> > 
> > By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get 
> > any
> > polling!
> > 
> > Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
> > values.  If you don't find a good value we should double-check the tracing 
> > data
> > to see if this experimental code can be improved.
> 
> Stefan
> 
> I ran some quick tests with your patches and got some pretty good gains,
> but also some seemingly odd behavior.
>
> These results are for a 5 minute test doing sequential 4KB requests from
> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
> performed directly against the virtio-blk device (no filesystem) which
> is backed by a 400GB NVme card.
> 
> QEMU_AIO_POLL_MAX_NS    IOPs
>    unset               31,383
>    1                   46,860
>    2                   46,440
>    4                   35,246
>    8                   34,973
>    16                  46,794
>    32                  46,729
>    64                  35,520
>    128                 45,902

The environment variable is in nanoseconds.  The range of values you
tried are very small (all <1 usec).  It would be interesting to try
larger values in the ballpark of the latencies you have traced.  For
example 2000, 4000, 8000, 16000, and 32000 ns.

Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
much CPU overhead.

> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
> is what I got:
> 
> Iteration    QEMU_AIO_POLL_MAX_NS=2   QEMU_AIO_POLL_MAX_NS=4
> 1            46,972                   35,434
> 2            46,939                   35,719
> 3            47,005                   35,584
> 4            47,016                   35,615
> 5            47,267                   35,474
> 
> So the results seem consistent.

That is interesting.  I don't have an explanation for the consistent
difference between 2 and 4 ns polling time.  The time difference is so
small yet the IOPS difference is clear.

Comparing traces could shed light on the cause for this difference.

> I saw some discussion on the patches made which make me think you'll be
> making some changes, is that right?  If so, I may wait for the updates
> and then we can run the much more exhaustive set of workloads
> (sequential read and write, random read and write) at various block
> sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and 32)
> that we were doing when we started looking at this.

I'll send an updated version of the patches.

Stefan




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 16:26, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>    unset               31,383
>>    1                   46,860
>>    2                   46,440
>>    4                   35,246
>>    8                   34,973
>>    16                  46,794
>>    32                  46,729
>>    64                  35,520
>>    128                 45,902
> 
> The environment variable is in nanoseconds.  The range of values you
> tried are very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.
> 
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.

That basically means "avoid a syscall if you already know there's
something to do", so in retrospect it's not that surprising.  Still
interesting though, and it means that the feature is useful even if you
don't have CPU to waste.
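
Conceptually the loop is just this (a rough sketch, not the actual patch;
run_poll_handlers_once() stands in for whatever the series ends up calling
it):

/* poll for up to max_ns before falling back to the blocking poll(2) */
static bool try_poll(AioContext *ctx, int64_t max_ns)
{
    int64_t end = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + max_ns;

    do {
        if (run_poll_handlers_once(ctx)) {
            return true;    /* progress was made without a syscall */
        }
    } while (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) < end);

    return false;
}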

Paolo





Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Karl Rister
On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows that the
>>> guest->host notification takes around 20 us.  This is more than the 
>>> "overhead"
>>> of QEMU itself (e.g. block layer).
>>>
>>> One way to avoid the costly exit is to use polling instead of notification.
>>> The main drawback of polling is that it consumes CPU resources.  In order to
>>> benefit performance the host must have extra CPU cycles available on 
>>> physical
>>> CPUs that aren't used by the guest.
>>>
>>> This is an experimental AioContext polling implementation.  It adds a 
>>> polling
>>> callback into the event loop.  Polling functions are implemented for 
>>> virtio-blk
>>> virtqueue guest->host kick and Linux AIO completion.
>>>
>>> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of 
>>> nanoseconds to
>>> poll before entering the usual blocking poll(2) syscall.  Try setting this
>>> variable to the time from old request completion to new virtqueue kick.
>>>
>>> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get 
>>> any
>>> polling!
>>>
>>> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
>>> values.  If you don't find a good value we should double-check the tracing 
>>> data
>>> to see if this experimental code can be improved.
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVme card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>    unset               31,383
>>    1                   46,860
>>    2                   46,440
>>    4                   35,246
>>    8                   34,973
>>    16                  46,794
>>    32                  46,729
>>    64                  35,520
>>    128                 45,902
> 
> The environment variable is in nanoseconds.  The range of values you
> tried are very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.

Agreed.  As I alluded to in another post, I decided to start at 1 and
double the values until I saw a difference with the expectation that it
would have to get quite large before that happened.  The results went in
a different direction, and then I got distracted by the variation at
certain points.  I figured that by itself the fact that noticeable
improvements were possible with such low values was interesting.

I will definitely continue the progression and capture some larger values.

> 
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
> 
>> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
>> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
>> is what I got:
>>
>> Iteration    QEMU_AIO_POLL_MAX_NS=2   QEMU_AIO_POLL_MAX_NS=4
>> 1            46,972                   35,434
>> 2            46,939                   35,719
>> 3            47,005                   35,584
>> 4            47,016                   35,615
>> 5            47,267                   35,474
>>
>> So the results seem consistent.
> 
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is so
> small yet the IOPS difference is clear.
> 
> Comparing traces could shed light on the cause for this difference.
> 
>> I saw some discussion on the patches made which make me think you'll be
>> making some changes, is that right?  If so, I may wait for the updates
>> and then we can run the much more exhaustive set of workloads
>> (sequential read and write, random read and write) at various block
>> sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and 32)
>> that we were doing when we started looking at this.
> 
> I'll send an updated version of the patches.
> 
> Stefan
> 


-- 
Karl Rister 



Re: [Qemu-devel] [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP

2016-11-14 Thread Alex Williamson
On Mon, 14 Nov 2016 13:22:08 +0530
Kirti Wankhede  wrote:

> On 11/9/2016 2:58 AM, Alex Williamson wrote:
> > On Wed, 9 Nov 2016 01:29:19 +0530
> > Kirti Wankhede  wrote:
> >   
> >> On 11/8/2016 11:16 PM, Alex Williamson wrote:  
> >>> On Tue, 8 Nov 2016 21:56:29 +0530
> >>> Kirti Wankhede  wrote:
> >>> 
>  On 11/8/2016 5:15 AM, Alex Williamson wrote:
> > On Sat, 5 Nov 2016 02:40:45 +0530
> > Kirti Wankhede  wrote:
> >   
>  ...
> >>  
> >> +int vfio_register_notifier(struct device *dev, struct notifier_block 
> >> *nb)  
> >
> > Is the expectation here that this is a generic notifier for all
> > vfio->mdev signaling?  That should probably be made clear in the mdev
> > API to avoid vendor drivers assuming their notifier callback only
> > occurs for unmaps, even if that's currently the case.
> >   
> 
>  Ok. Adding comment about notifier callback in mdev_device which is part
>  of next patch.
> 
>  ...
> 
> >>mutex_lock(&iommu->lock);
> >>  
> >> -  if (!iommu->external_domain) {
> >> +  /* Fail if notifier list is empty */
> >> +  if ((!iommu->external_domain) || (!iommu->notifier.head)) {
> >>ret = -EINVAL;
> >>goto pin_done;
> >>}
> >> @@ -867,6 +870,11 @@ unlock:
> >>/* Report how much was unmapped */
> >>unmap->size = unmapped;
> >>  
> >> +  if (unmapped && iommu->external_domain)
> >> +  blocking_notifier_call_chain(&iommu->notifier,
> >> +   
> >> VFIO_IOMMU_NOTIFY_DMA_UNMAP,
> >> +   unmap);  
> >
> > This is after the fact, there's already a gap here where pages are
> > unpinned and the mdev device is still running.  
> 
>  Oh, there is a bug here: unpin_pages() now takes user_pfn as an argument and
>  finds the vfio_dma. If it's not found, it doesn't unpin pages. We have to call
>  this notifier before vfio_remove_dma(). But if we call this before
>  vfio_remove_dma() there will be deadlock since iommu->lock is already
>  held here and vfio_iommu_type1_unpin_pages() will also try to hold
>  iommu->lock.
>  If we want to call blocking_notifier_call_chain() before
>  vfio_remove_dma(), sequence should be:
> 
>  unmapped += dma->size;
>  mutex_unlock(&iommu->lock);
>  if (iommu->external_domain)) {
>   struct vfio_iommu_type1_dma_unmap nb_unmap;
> 
>   nb_unmap.iova = dma->iova;
>   nb_unmap.size = dma->size;
>   blocking_notifier_call_chain(&iommu->notifier,
>    VFIO_IOMMU_NOTIFY_DMA_UNMAP,
>    &nb_unmap);
>  }
>  mutex_lock(&iommu->lock);
>  vfio_remove_dma(iommu, dma);
> >>>
> >>> It seems like it would be worthwhile to have the rb-tree rooted in the
> >>> vfio-dma, then we only need to call the notifier if there are pages
> >>> pinned within that vfio-dma (ie. the rb-tree is not empty).  We can
> >>> then release the lock call the notifier, re-acquire the lock, and
> >>> BUG_ON if the rb-tree still is not empty.  We might get duplicate pfns
> >>> between separate vfio_dma structs, but as I mentioned in other replies,
> >>> that seems like an exception that we don't need to optimize for.
> >>> 
> >>
> >> If we don't optimize for the case where iovas from different vfio_dma are
> >> mapped to the same pfn, and we don't consider this case for page
> >> accounting, then:  
> > 
> > Just to clarify, the current code (not handling mdevs) will pin and do
> > page accounting per iova, regardless of whether the iova translates to a
> > unique pfn.  As long as we do no worse than that, I'm ok.
> >   
> >> - have rb tree of pinned iova, where key would be iova, in each vfio_dma
> >> structure.
> >> - iova tracking structure would have iova and ref_count only.
> >> - page accounting would only count number of iova's in rb_tree, case
> >> where different iova could map to same pfn would not be considered in
> >> this implementation for now.
> >> - vfio_unpin_pages() would have user_pfn and pfn as input, we would
> >> validate that iova exist in rb tree and trust vendor driver that
> >> corresponding pfn is correct, there is no validation of pfn. If want
> >> validate pfn, call GUP, verify pfn and call put_pfn().
> >> - In .release() or .detach_group() path, if there are entries in this rb
> >> tree, call GUP again using that iova, get pfn and then call
> >> put_pfn(pfn) for ref_count+1 times. This is because we are not keeping
> >> pfn in our tracking logic.  
> > 
> > Wait a sec, if we detach a group from the container and it's not the
> > last group in the container (which would trigger a release), we can't
> > assume anything about which vfio_dma entries were associated with that
> > device.  The vend

Re: [Qemu-devel] [PULL 0/1] UUID test fix patch for 2.8

2016-11-14 Thread Stefan Hajnoczi
On Fri, Nov 11, 2016 at 09:01:41PM +0800, Fam Zheng wrote:
> The following changes since commit 6bbcb76301a72dc80c8d29af13d40bb9a759c9c6:
> 
>   MAINTAINERS: Remove obsolete stable branches (2016-11-10 15:29:59 +)
> 
> are available in the git repository at:
> 
>   g...@github.com:famz/qemu tags/for-upstream
> 
> for you to fetch changes up to d9c05e507f7a6647cd7b106c8784f1f15a0e4f5c:
> 
>   test-uuid: fix leak (2016-11-11 20:53:23 +0800)
> 
> 
> 
> This is the memory leak fix for uuid test.
> 
> 
> 
> Marc-André Lureau (1):
>   test-uuid: fix leak
> 
>  tests/test-uuid.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> -- 
> 2.7.4
> 
> 

Thanks, applied to my staging tree:
https://github.com/stefanha/qemu/commits/staging

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PATCH for-2.8 0/2] pc: remove redundant fw_cfg file "etc/boot-cpus"

2016-11-14 Thread Stefan Hajnoczi
On Fri, Nov 11, 2016 at 04:21:10PM +0100, Igor Mammedov wrote:
> 
> Commit 080ac219cc7d9c55adf925c3545b7450055ad625
>pc: Add 'etc/boot-cpus' fw_cfg file for machine with more than 255 CPUs
> 
> added "etc/boot-cpus" fw_cfg file durung 2.8 merge window, however
> QEMU alredy had similar legacy FW_CFG_NB_CPUS fw_cfg entry that
> should do practically the same. Considering FW_CFG_NB_CPUS's been
> around for a long time and is used by external projects (firmwares)
> we can't replace it with 'etc/boot-cpus' fw_cfg file.
> 
> Drop redundant 'etc/boot-cpus' fw_cfg file  and reuse FW_CFG_NB_CPUS
> instead.
> 
> So here goes QEMU part of fixup
> 
> CC: Eduardo Habkost 
> CC: m...@redhat.com
> CC: Stefan Hajnoczi 
> CC: "Kevin O'Connor" 
> CC: Gerd Hoffmann 
> CC: Laszlo Ersek 
> 
> Igor Mammedov (2):
>   fw_cfg: move FW_CFG_NB_CPUS out of fw_cfg_init1()
>   pc: drop "etc/boot-cpus" fw_cfg file and use FW_CFG_NB_CPUS instead
> 
>  include/hw/i386/pc.h  |  4 ++--
>  hw/arm/virt.c |  4 +++-
>  hw/i386/pc.c  | 20 
>  hw/nvram/fw_cfg.c |  1 -
>  hw/ppc/mac_newworld.c |  1 +
>  hw/ppc/mac_oldworld.c |  1 +
>  hw/sparc/sun4m.c  |  1 +
>  hw/sparc64/sun4u.c|  1 +
>  8 files changed, 17 insertions(+), 16 deletions(-)
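
(As a rough illustration of what the board-side code ends up doing -- a
sketch only, assuming the usual FWCfgState pointer is available during
board init; fw_cfg_add_i16() and FW_CFG_NB_CPUS are the existing QEMU
interfaces referred to above:)

/* Advertise the boot CPU count via the legacy 16-bit key instead of a
 * separate "etc/boot-cpus" fw_cfg file. */
static void board_report_boot_cpus(FWCfgState *fw_cfg, uint16_t boot_cpus)
{
    fw_cfg_add_i16(fw_cfg, FW_CFG_NB_CPUS, boot_cpus);
}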

Thanks, this should go through Eduardo's tree.

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Marcelo Tosatti
On Mon, Nov 14, 2016 at 04:00:38PM +0100, Paolo Bonzini wrote:
> 
> 
> On 14/11/2016 15:50, Marcelo Tosatti wrote:
> > > Well, I didn't want to mix the meaning of the variables:
> > 
> > +/* whether machine supports reliable KVM_GET_CLOCK */
> > +bool mach_use_reliable_get_clock;
> > +
> > +/* whether source host supported reliable KVM_GET_CLOCK */
> > +bool src_use_reliable_get_clock;
> > 
> > > See the comments on top (later, when you look at the variable, you
> > > have to think: well, it has one name, but it's also disabled by that
> > > other path, so it means more than its name suggests, etc.).
> > 
> >> I'm thinking "mach_use_reliable_get_clock is just for migration,
> > 
> > That's whether the machine supports it. New machines have it enabled,
> > older ones don't.
> 
> Yes.
> 
> >> src_use_reliable_get_clock is the state". 
> > 
> > That's whether the migration source supported it.
> 
> But it's not used only for migration.  It's used on every vmstate change
> (running->stop and stop->running), isn't it?

No. It's used only if s->clock_valid == false, which means either:

migration/savevm/init.

Now for savevm there is a bug: 

> I think that, apart from
> the migration case, it's better to use s->clock if kvmclock is stable,
> even on older machine types.

Yes, it's already doing that:

/* local (running VM) restore */
if (s->clock_valid) {
/*
 * if host does not support reliable KVM_GET_CLOCK,
 * read kvmclock value from memory
 */
if (!kvm_has_adjust_clock_stable()) {
time_at_migration = kvmclock_current_nsec(s);
}

It only uses kvmclock_current_nsec() if kvm_has_adjust_clock_stable is 
false.

> > +static bool kvmclock_src_use_reliable_get_clock(void *opaque)
> > +{
> > +KVMClockState *s = opaque;
> > +
> > +/*
> > + * On machine types that support reliable KVM_GET_CLOCK,
> > + * if host kernel does provide reliable KVM_GET_CLOCK,
> > + * set src_use_reliable_get_clock=true so that destination
> > + * avoids reading kvmclock from memory.
> > + */
> > +if (s->mach_use_reliable_get_clock && 
> > kvm_has_adjust_clock_stable()) {
> > +s->src_use_reliable_get_clock = true;
> > +}
> > +
> > +return s->src_use_reliable_get_clock;
> > +}
> 
>  Here you can just return s->mach_use_reliable_get_clock. 
> >>>
> >>> mach_use_reliable_get_clock can be true but host might not support it.
> >>
> >> Yes, but the "needed" function is only required to avoid breaking
> >> pc-i440fx-2.7 and earlier. 
> > 
> > "needed" is required so that the migration between:
> > 
> > SRC DESTBEHAVIOUR
> > ~supportsupportson migration read from guest,
> > on stop/cont use
> > kvm_get_clock/kvm_set_clock
> > 
> > Destination does not use KVM_GET_CLOCK value (which is
> > broken and should not be used).
> 
> If needed returns false, the destination will see
> src_use_reliable_get_clock = false anyway.

Causing it to read the clock from memory, which is what
should happen.

> If needed returns true, the destination can still see
> src_use_reliable_get_clock = false if that's the value.

You are asking me to change from:

+static bool kvmclock_src_use_reliable_get_clock(void *opaque)
+{
+KVMClockState *s = opaque;
+
+/*
+ * On machine types that support reliable KVM_GET_CLOCK,
+ * if host kernel does provide reliable KVM_GET_CLOCK,
+ * set src_use_reliable_get_clock=true so that destination
+ * avoids reading kvmclock from memory.
+ */
+if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) {
+s->src_use_reliable_get_clock = true;
+}
+
+return s->src_use_reliable_get_clock;
+}

to

static bool kvmclock_src_use_reliable_get_clock(void *opaque)
{
KVMClockState *s = opaque;

/*
 * On machine types that support reliable KVM_GET_CLOCK,
 * if host kernel does provide reliable KVM_GET_CLOCK,
 * set src_use_reliable_get_clock=true so that destination
 * avoids reading kvmclock from memory.
 */
if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable())
{
s->src_use_reliable_get_clock = true;
}

return s->mach_use_reliable_get_clock;
}


Ah, OK, done.
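
(For reference, a .needed hook like this is normally wired into a vmstate
subsection, roughly as below; the subsection name and field list here are
assumptions for illustration, not the actual patch:)

static const VMStateDescription kvmclock_reliable_get_clock = {
    .name = "kvmclock/src_use_reliable_get_clock",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = kvmclock_src_use_reliable_get_clock,
    .fields = (VMStateField[]) {
        VMSTATE_BOOL(src_use_reliable_get_clock, KVMClockState),
        VMSTATE_END_OF_LIST()
    }
};

/* ...listed in vmstate_kvmclock's .subsections array, terminated by NULL. */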


> needed    src_use_reliable_get_clock    effect
> false     false                         use kvmclock_current_nsec
> false     true                          use kvmclock_current_nsec
> true      false                         use kvmclock_current_nsec
> true      true                          use s->clock
> 
> 
> So the idea is:
> 
> - set s->clock and s->reliable_get_clock on every KVM_GET_CLOCK

Already do that (apart from s->clock_valid which is necessary).

> - on migration source, use subsections and
> s->mach_use_reliable_get_clock to

[Qemu-devel] [PATCH v12 05/22] vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops

2016-11-14 Thread Kirti Wankhede
Added APIs for pinning and unpinning a set of pages. These call back into
the backend IOMMU module to actually pin and unpin pages.
Added two new callback functions to struct vfio_iommu_driver_ops. A backend
IOMMU module that supports pinning and unpinning pages for mdev devices
should provide these functions.

Renamed static functions in vfio_iommu_type1.c to resolve naming conflicts.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: Ia7417723aaae86bec2959ad9ae6c2915ddd340e0
---
 drivers/vfio/vfio.c | 103 
 drivers/vfio/vfio_iommu_type1.c |  20 
 include/linux/vfio.h|  12 -
 3 files changed, 124 insertions(+), 11 deletions(-)
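
A minimal (hypothetical) vendor-driver call sequence for the two new exports
could look like the sketch below; the helper names, GFN array and error
handling are placeholders, not part of this patch:

#include <linux/device.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

/* Sketch only: pin guest pages, program the device with the returned host
 * PFNs, then unpin them when done. */
static int example_map_guest_pages(struct device *dev,
				   unsigned long *gfns, int count,
				   unsigned long *host_pfns)
{
	int pinned;

	/* count must stay below PAGE_SIZE per call, as documented below */
	pinned = vfio_pin_pages(dev, gfns, count,
				IOMMU_READ | IOMMU_WRITE, host_pfns);
	if (pinned < 0)
		return pinned;

	/* ... program the device with host_pfns[0..pinned-1] ... */
	return pinned;
}

static void example_unmap_guest_pages(struct device *dev,
				      unsigned long *gfns, int count)
{
	vfio_unpin_pages(dev, gfns, count);
}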

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 2e83bdf007fe..7dcfbca2016a 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1799,6 +1799,109 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, 
size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for local
+ * domain only.
+ * @dev [in] : device
+ * @user_pfn [in]: array of user/guest PFNs to be pinned. Number of 
user/guest
+ *   PFNs should not be greater than PAGE_SIZE.
+ * @npage [in] :count of elements in array.  This count should not be greater
+ * than PAGE_SIZE.
+ * @prot [in] : protection flags
+ * @phys_pfn[out] : array of host PFNs
+ * Return error or number of pages pinned.
+ */
+int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, int npage,
+  int prot, unsigned long *phys_pfn)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   int ret;
+
+   if (!dev || !user_pfn || !phys_pfn || !npage)
+   return -EINVAL;
+
+   if (npage >= PAGE_SIZE)
+   return -E2BIG;
+
+   group = vfio_group_get_from_dev(dev);
+   if (IS_ERR(group))
+   return PTR_ERR(group);
+
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   goto err_pin_pages;
+
+   container = group->container;
+   down_read(&container->group_lock);
+
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->pin_pages))
+   ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
+npage, prot, phys_pfn);
+   else
+   ret = -ENOTTY;
+
+   up_read(&container->group_lock);
+   vfio_group_try_dissolve_container(group);
+
+err_pin_pages:
+   vfio_group_put(group);
+   return ret;
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+/*
+ * Unpin set of host PFNs for local domain only.
+ * @dev [in] : device
+ * @user_pfn [in]: array of user/guest PFNs to be unpinned. Number of 
user/guest
+ *   PFNs should not be greater than PAGE_SIZE.
+ * @npage [in] :count of elements in array.  This count should not be greater
+ * than PAGE_SIZE.
+ * Return error or number of pages unpinned.
+ */
+int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, int npage)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   int ret;
+
+   if (!dev || !user_pfn || !npage)
+   return -EINVAL;
+
+   if (npage >= PAGE_SIZE)
+   return -E2BIG;
+
+   group = vfio_group_get_from_dev(dev);
+   if (IS_ERR(group))
+   return PTR_ERR(group);
+
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   goto err_unpin_pages;
+
+   container = group->container;
+   down_read(&container->group_lock);
+
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->unpin_pages))
+   ret = driver->ops->unpin_pages(container->iommu_data, user_pfn,
+  npage);
+   else
+   ret = -ENOTTY;
+
+   up_read(&container->group_lock);
+   vfio_group_try_dissolve_container(group);
+
+err_unpin_pages:
+   vfio_group_put(group);
+   return ret;
+}
+EXPORT_SYMBOL(vfio_unpin_pages);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ba19424e4a1..9f3d58d3dfaf 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -259,8 +259,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, 
unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-  int prot, unsigned long *pfn_base)
+static long vfio_pin_pages_remote(unsigned long vaddr, long npage,
+ int prot, unsigned long *pfn_base)
 {
unsigned long limit = r

[Qemu-devel] [PATCH v12 00/22] Add Mediated device support

2016-11-14 Thread Kirti Wankhede
This series adds Mediated device support to the Linux host kernel. The
purpose of this series is to provide a common interface for mediated device
management that can be used by different devices. The series introduces an
Mdev core module that creates and manages mediated devices, a VFIO based
driver for the mediated devices created by the mdev core module, and an
update to the VFIO type1 IOMMU module to support pinning & unpinning for
mediated devices.

What changed v11-> v12?
vfio_iommu_type1:
- Pinned pages are tracked in vfio_dma structure for that range, dma->iova
  to dma->iova + dma->size.
- Key for pfn_list rb-tree is iova, instead of pfn.
- Removing address space structure. vfio_dma keeps task structure, which
  is used to get mm structure (using get_task_mm(task) and mmput(mm)) for
  pin/unpin and page accounting.
- vfio_unpin_pages() has array of user_pfns as input argument, instead of
  array of pfns.
- On vfio_pin_pages(), pinning happens once. On later call to
  vfio_pin_pages() with same user_pfn, if iova is found in pfn_list, only
  ref_count is incremented.
- In vfio_unpin_pages(), ref_count is decremented and page is unpinned
  when ref_count is 0.
- For the vfio_pin_pages() and vfio_unpin_pages() input array, the number of
  elements in the array should be less than PAGE_SIZE. If a vendor driver
  wants to use them for more pages, the array should be split into chunks of
  PAGE_SIZE.
- Updating page accounting logic with above changes.
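
(A rough sketch of the per-mapping tracking described above -- the field and
type names here are illustrative, not copied from the patches:)

struct vfio_pfn {
	struct rb_node		node;		/* pfn_list is keyed by iova */
	dma_addr_t		iova;
	unsigned long		pfn;		/* host pfn returned by vfio_pin_pages() */
	atomic_t		ref_count;	/* unpinned when this drops to 0 */
};

struct vfio_dma {
	struct rb_node		node;
	dma_addr_t		iova;		/* mapping covers [iova, iova + size) */
	size_t			size;
	struct task_struct	*task;		/* for get_task_mm()/mmput() accounting */
	struct rb_root		pfn_list;	/* pinned pages within this mapping */
};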

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Linux VM hot plug and unplug vGPU device while GPU pass through device
  exist
- Linux VM hot plug and unplug GPU pass through device while vGPU device
  exist

Tested with Linux 4.9-rc5

Kirti Wankhede (22):
  vfio: Mediated device Core driver
  vfio: VFIO based driver for Mediated devices
  vfio: Rearrange functions to get vfio_group from dev
  vfio: Common function to increment container_users
  vfio iommu: Added pin and unpin callback functions to
vfio_iommu_driver_ops
  vfio iommu type1: Update arguments of vfio_lock_acct
  vfio iommu type1: Update argument of vaddr_get_pfn()
  vfio iommu type1: Add find_iommu_group() function
  vfio iommu type1: Add task structure to vfio_dma
  vfio iommu type1: Add support for mediated devices
  vfio iommu: Add blocking notifier to notify DMA_UNMAP
  vfio: Add notifier callback to parent's ops structure of mdev
  vfio: Introduce common function to add capabilities
  vfio_pci: Update vfio_pci to use vfio_info_add_capability()
  vfio: Introduce vfio_set_irqs_validate_and_prepare()
  vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare()
  vfio_platform: Updated to use vfio_set_irqs_validate_and_prepare()
  vfio: Define device_api strings
  docs: Add Documentation for Mediated devices
  docs: Sysfs ABI for mediated device framework
  docs: Sample driver to demonstrate how to use Mediated device
framework.
  MAINTAINERS: Add entry VFIO based Mediated device drivers

 Documentation/ABI/testing/sysfs-bus-vfio-mdev |  111 ++
 Documentation/vfio-mediated-device.txt|  399 +++
 MAINTAINERS   |9 +
 drivers/vfio/Kconfig  |1 +
 drivers/vfio/Makefile |1 +
 drivers/vfio/mdev/Kconfig |   17 +
 drivers/vfio/mdev/Makefile|5 +
 drivers/vfio/mdev/mdev_core.c |  388 +++
 drivers/vfio/mdev/mdev_driver.c   |  120 ++
 drivers/vfio/mdev/mdev_private.h  |   41 +
 drivers/vfio/mdev/mdev_sysfs.c|  286 +
 drivers/vfio/mdev/vfio_mdev.c |  167 +++
 drivers/vfio/pci/vfio_pci.c   |   83 +-
 drivers/vfio/platform/vfio_platform_common.c  |   31 +-
 drivers/vfio/vfio.c   |  341 +-
 drivers/vfio/vfio_iommu_type1.c   |  831 +++---
 include/linux/mdev.h  |  177 +++
 include/linux/vfio.h  |   30 +-
 include/uapi/linux/vfio.h |   10 +
 samples/vfio-mdev/Makefile|   13 +
 samples/vfio-mdev/mtty.c  | 1503 +
 21 files changed, 4328 insertions(+), 236 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-vfio-mdev
 create mode 100644 Documentation/vfio-mediated-device.txt
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c
 create mode 100644 include/linux/mdev.h
 create mode 100644 samples/vfio-mdev/Makefile
 create mode 100644 samples/vfio-mdev/mtty.c

-- 
2.7.0




[Qemu-devel] [PATCH v12 06/22] vfio iommu type1: Update arguments of vfio_lock_acct

2016-11-14 Thread Kirti Wankhede
Added a task structure as an input argument to the vfio_lock_acct()
function.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I5d3673cc9d3786bb436b395d5f74537f1a36da80
---
 drivers/vfio/vfio_iommu_type1.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9f3d58d3dfaf..34d17e51dc97 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -150,17 +150,22 @@ static void vfio_lock_acct_bg(struct work_struct *work)
kfree(vwork);
 }
 
-static void vfio_lock_acct(long npage)
+static void vfio_lock_acct(struct task_struct *task, long npage)
 {
struct vwork *vwork;
struct mm_struct *mm;
 
-   if (!current->mm || !npage)
+   if (!npage)
+   return;
+
+   mm = get_task_mm(task);
+   if (!mm)
return; /* process exited or nothing to do */
 
-   if (down_write_trylock(¤t->mm->mmap_sem)) {
-   current->mm->locked_vm += npage;
-   up_write(¤t->mm->mmap_sem);
+   if (down_write_trylock(&mm->mmap_sem)) {
+   mm->locked_vm += npage;
+   up_write(&mm->mmap_sem);
+   mmput(mm);
return;
}
 
@@ -170,11 +175,8 @@ static void vfio_lock_acct(long npage)
 * wouldn't need this silliness
 */
vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
-   if (!vwork)
-   return;
-   mm = get_task_mm(current);
-   if (!mm) {
-   kfree(vwork);
+   if (!vwork) {
+   mmput(mm);
return;
}
INIT_WORK(&vwork->work, vfio_lock_acct_bg);
@@ -285,7 +287,7 @@ static long vfio_pin_pages_remote(unsigned long vaddr, long 
npage,
 
if (unlikely(disable_hugepages)) {
if (!rsvd)
-   vfio_lock_acct(1);
+   vfio_lock_acct(current, 1);
return 1;
}
 
@@ -313,7 +315,7 @@ static long vfio_pin_pages_remote(unsigned long vaddr, long 
npage,
}
 
if (!rsvd)
-   vfio_lock_acct(i);
+   vfio_lock_acct(current, i);
 
return i;
 }
@@ -328,7 +330,7 @@ static long vfio_unpin_pages_remote(unsigned long pfn, long 
npage,
unlocked += put_pfn(pfn++, prot);
 
if (do_accounting)
-   vfio_lock_acct(-unlocked);
+   vfio_lock_acct(current, -unlocked);
 
return unlocked;
 }
@@ -390,7 +392,7 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, 
struct vfio_dma *dma)
cond_resched();
}
 
-   vfio_lock_acct(-unlocked);
+   vfio_lock_acct(current, -unlocked);
 }
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
-- 
2.7.0




Re: [Qemu-devel] [qemu patch 2/2] kvmclock: reduce kvmclock difference on migration

2016-11-14 Thread Marcelo Tosatti
On Mon, Nov 14, 2016 at 03:09:46PM +0100, Juan Quintela wrote:
> Marcelo Tosatti  wrote:
> > Check for KVM_CAP_ADJUST_CLOCK capability KVM_CLOCK_TSC_STABLE, which
> > indicates that KVM_GET_CLOCK returns a value as seen by the guest at
> > that moment.
> >
> > For new machine types, use this value rather than reading 
> > from guest memory.
> >
> > This reduces kvmclock difference on migration from 5s to 0.1s
> > (when max_downtime == 5s).
> >
> > Signed-off-by: Marcelo Tosatti 
> 
> Acked-by: Juan Quintela 
> 
> But, if you have to respin it 
> 
> 
> > +/* whether machine supports reliable KVM_GET_CLOCK */
> > +bool mach_use_reliable_get_clock;
> > +
> > +/* whether source host supported reliable KVM_GET_CLOCK */
> > +bool src_use_reliable_get_clock;
> 
> These two names are really long, but I don't have better suggestions :-()
> >  if (running) {
> >  struct kvm_clock_data data = {};
> > -uint64_t time_at_migration = kvmclock_current_nsec(s);
> > +uint64_t time_at_migration = 0;
> 
> This was never really "time_at_migration"; it wasn't before either, but
> now it looks really weird. (As it was already faulty, this is why it is
> only a suggestion.)
> 
> >  
> > -s->clock_valid = false;
> > +/* local (running VM) restore */
> > +if (s->clock_valid) {
> > +/*
> > + * if host does not support reliable KVM_GET_CLOCK,
> > + * read kvmclock value from memory
> > + */
> > +if (!kvm_has_adjust_clock_stable()) {
> > +time_at_migration = kvmclock_current_nsec(s);
> > +}
> > +/* migration/savevm/init restore */
> > +} else {
> > +/*
> > + * use s->clock in case machine uses reliable
> > + * get clock and host where vm was executing
> > + * supported reliable get clock
> > + */
> 
> This comment is just weird.  Simplifying
> /* If A and B, do C */
> if (!A || !B) {
>     then D();
> }
> 
> Doing the opposite comment?
> 
> Migration code looks right.

Fixed those.

> That said, I continue hating clocks.

Me too, especially the biological one.




[Qemu-devel] [PATCH v12 19/22] docs: Add Documentation for Mediated devices

2016-11-14 Thread Kirti Wankhede
Add the file Documentation/vfio-mediated-device.txt, which includes details
of the mediated device framework.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
---
 Documentation/vfio-mediated-device.txt | 298 +
 drivers/vfio/mdev/Kconfig  |   1 +
 2 files changed, 299 insertions(+)
 create mode 100644 Documentation/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mediated-device.txt 
b/Documentation/vfio-mediated-device.txt
new file mode 100644
index ..fe8bd2e7b26a
--- /dev/null
+++ b/Documentation/vfio-mediated-device.txt
@@ -0,0 +1,298 @@
+/*
+ * VFIO Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia 
+ * Kirti Wankhede 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+Virtual Function I/O (VFIO) Mediated devices[1]
+===
+
+The number of use cases for virtualizing DMA devices that do not have built-in
+SR_IOV capability is increasing. Previously, to virtualize such devices,
+developers had to create their own management interfaces and APIs, and then
+integrate them with user space software. To simplify integration with user 
space
+software, we have identified common requirements and a unified management
+interface for such devices.
+
+The VFIO driver framework provides unified APIs for direct device access. It is
+an IOMMU/device-agnostic framework for exposing direct device access to user
+space in a secure, IOMMU-protected environment. This framework is used for
+multiple devices, such as GPUs, network adapters, and compute accelerators. 
With
+direct device access, virtual machines or user space applications have direct
+access to the physical device. This framework is reused for mediated devices.
+
+The mediated core driver provides a common interface for mediated device
+management that can be used by drivers of different devices. This module
+provides a generic interface to perform these operations:
+
+* Create and destroy a mediated device
+* Add a mediated device to and remove it from a mediated bus driver
+* Add a mediated device to and remove it from an IOMMU group
+
+The mediated core driver also provides an interface to register a bus driver.
+For example, the mediated VFIO mdev driver is designed for mediated devices and
+supports VFIO APIs. The mediated bus driver adds a mediated device to and
+removes it from a VFIO group.
+
+The following high-level block diagram shows the main components and interfaces
+in the VFIO mediated driver framework. The diagram shows NVIDIA, Intel, and IBM
+devices as examples, as these devices are the first devices to use this module.
+
+ +---+
+ |   |
+ | +---+ |  mdev_register_driver() +--+
+ | |   | +<+  |
+ | |  mdev | | |  |
+ | |  bus  | +>+ vfio_mdev.ko |<-> VFIO user
+ | |  driver   | | probe()/remove()|  |APIs
+ | |   | | +--+
+ | +---+ |
+ |   |
+ |  MDEV CORE|
+ |   MODULE  |
+ |   mdev.ko |
+ | +---+ |  mdev_register_device() +--+
+ | |   | +<+  |
+ | |   | | |  nvidia.ko   |<-> physical
+ | |   | +>+  |device
+ | |   | |callbacks+--+
+ | | Physical  | |
+ | |  device   | |  mdev_register_device() +--+
+ | | interface | |<+  |
+ | |   | | |  i915.ko |<-> physical
+ | |   | +>+  |device
+ | |   | |callbacks+--+
+ | |   | |
+ | |   | |  mdev_register_device() +--+
+ | |   | +<+  |
+ | |   | | | ccw_device.ko|<-> physical
+ | |   | +>+  |device
+ | |   | |callbacks+--+
+ | +---+ |
+ +---+
+
+
+Registration Interfaces
+===
+
+The mediated core driver provides the following types of registration
+interfaces:
+
+* Registration interface for a mediated bus driver
+* Physical device driver interface
+
+Registration Interface for a Mediated Bus Driver
+
+
+The registration interface for a medi

[Qemu-devel] [PATCH v12 04/22] vfio: Common function to increment container_users

2016-11-14 Thread Kirti Wankhede
This change rearranges functions to introduce a common function that
increments container_users.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Reviewed-by: Jike Song 

Change-Id: I8bdeb352bc8439b107ffd519480fd4dc238677f2
---
 drivers/vfio/vfio.c | 34 +-
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 23bc86c1d05d..2e83bdf007fe 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1385,6 +1385,23 @@ static bool vfio_group_viable(struct vfio_group *group)
 group, vfio_dev_viable) == 0);
 }
 
+static int vfio_group_add_container_user(struct vfio_group *group)
+{
+   if (!atomic_inc_not_zero(&group->container_users))
+   return -EINVAL;
+
+   if (group->noiommu) {
+   atomic_dec(&group->container_users);
+   return -EPERM;
+   }
+   if (!group->container->iommu_driver || !vfio_group_viable(group)) {
+   atomic_dec(&group->container_users);
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static const struct file_operations vfio_device_fops;
 
 static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
@@ -1694,23 +1711,14 @@ static const struct file_operations vfio_device_fops = {
 struct vfio_group *vfio_group_get_external_user(struct file *filep)
 {
struct vfio_group *group = filep->private_data;
+   int ret;
 
if (filep->f_op != &vfio_group_fops)
return ERR_PTR(-EINVAL);
 
-   if (!atomic_inc_not_zero(&group->container_users))
-   return ERR_PTR(-EINVAL);
-
-   if (group->noiommu) {
-   atomic_dec(&group->container_users);
-   return ERR_PTR(-EPERM);
-   }
-
-   if (!group->container->iommu_driver ||
-   !vfio_group_viable(group)) {
-   atomic_dec(&group->container_users);
-   return ERR_PTR(-EINVAL);
-   }
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   return ERR_PTR(ret);
 
vfio_group_get(group);
 
-- 
2.7.0




[Qemu-devel] [PATCH v12 07/22] vfio iommu type1: Update argument of vaddr_get_pfn()

2016-11-14 Thread Kirti Wankhede
Update vaddr_get_pfn() to take struct mm_struct *mm as an input
argument.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I885fd4cd4a9f66f4ee2c1caf58267464ec239f52
---
 drivers/vfio/vfio_iommu_type1.c | 32 
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 34d17e51dc97..52af5fc01d91 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -230,20 +230,36 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+int prot, unsigned long *pfn)
 {
struct page *page[1];
struct vm_area_struct *vma;
-   int ret = -EFAULT;
+   int ret;
+
+   if (mm == current->mm) {
+   ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
+ page);
+   } else {
+   unsigned int flags = 0;
+
+   if (prot & IOMMU_WRITE)
+   flags |= FOLL_WRITE;
+
+   down_read(&mm->mmap_sem);
+   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
+   NULL);
+   up_read(&mm->mmap_sem);
+   }
 
-   if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+   if (ret == 1) {
*pfn = page_to_pfn(page[0]);
return 0;
}
 
-   down_read(¤t->mm->mmap_sem);
+   down_read(&mm->mmap_sem);
 
-   vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+   vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
if (vma && vma->vm_flags & VM_PFNMAP) {
*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -251,7 +267,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, 
unsigned long *pfn)
ret = 0;
}
 
-   up_read(¤t->mm->mmap_sem);
+   up_read(&mm->mmap_sem);
 
return ret;
 }
@@ -272,7 +288,7 @@ static long vfio_pin_pages_remote(unsigned long vaddr, long 
npage,
if (!current->mm)
return -ENODEV;
 
-   ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+   ret = vaddr_get_pfn(current->mm, vaddr, prot, pfn_base);
if (ret)
return ret;
 
@@ -295,7 +311,7 @@ static long vfio_pin_pages_remote(unsigned long vaddr, long 
npage,
for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
unsigned long pfn = 0;
 
-   ret = vaddr_get_pfn(vaddr, prot, &pfn);
+   ret = vaddr_get_pfn(current->mm, vaddr, prot, &pfn);
if (ret)
break;
 
-- 
2.7.0




[Qemu-devel] [PATCH v12 01/22] vfio: Mediated device Core driver

2016-11-14 Thread Kirti Wankhede
Design for Mediated Device Driver:
The main purpose of this driver is to provide a common interface for
mediated device management that can be used by different drivers of
different devices.

This module provides a generic interface to create the device, add it to the
mediated bus, add the device to the IOMMU group and then add it to the VFIO
group.

Below is the high-level block diagram, with Nvidia, Intel and IBM devices as
examples, since these are the devices which are going to actively use this
module as of now.

 +---+
 |   |
 | +---+ |  mdev_register_driver() +--+
 | |   | +<+ __init() |
 | |  mdev | | |  |
 | |  bus  | +>+  |<-> VFIO user
 | |  driver   | | probe()/remove()| vfio_mdev.ko |APIs
 | |   | | |  |
 | +---+ | +--+
 |   |
 |  MDEV CORE|
 |   MODULE  |
 |   mdev.ko |
 | +---+ |  mdev_register_device() +--+
 | |   | +<+  |
 | |   | | |  nvidia.ko   |<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--+
 | | interface | |<+  |
 | |   | | |  i915.ko |<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | |   | |
 | |   | |  mdev_register_device() +--+
 | |   | +<+  |
 | |   | | | ccw_device.ko|<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | +---+ |
 +---+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove:called when device removed
  * @driver:device driver structure
  *
  **/
struct mdev_driver {
 const char *name;
 int  (*probe)  (struct device *dev);
 void (*remove) (struct device *dev);
 struct device_driver  driver;
};

A mediated bus driver for mdev devices should use these interfaces to
register with and unregister from the core driver, respectively:

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

The mediated bus driver is responsible for adding/removing mediated devices
to/from the VFIO group when devices are bound to and unbound from the driver.
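
As an illustration, a minimal mediated bus driver built on the interface
above could look roughly like this (the probe/remove bodies are
placeholders):

#include <linux/module.h>
#include <linux/device.h>
#include <linux/mdev.h>

static int example_mdev_probe(struct device *dev)
{
	/* e.g. add the mediated device to a VFIO group */
	return 0;
}

static void example_mdev_remove(struct device *dev)
{
	/* undo whatever probe() set up */
}

static struct mdev_driver example_mdev_driver = {
	.name   = "example_mdev",
	.probe  = example_mdev_probe,
	.remove = example_mdev_remove,
};

static int __init example_init(void)
{
	return mdev_register_driver(&example_mdev_driver, THIS_MODULE);
}

static void __exit example_exit(void)
{
	mdev_unregister_driver(&example_mdev_driver);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL v2");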

2. Physical device driver interface
This interface provides the vendor driver with a set of APIs to manage
physical device related work in its driver. The APIs are:

* dev_attr_groups: attributes of the parent device.
* mdev_attr_groups: attributes of the mediated device.
* supported_type_groups: attributes to define supported type. This is
 mandatory field.
* create: to allocate basic resources in vendor driver for a mediated
 device. This is mandatory to be provided by vendor driver.
* remove: to free resources in vendor driver when mediated device is
 destroyed. This is mandatory to be provided by vendor driver.
* open: open callback of mediated device
* release: release callback of mediated device
* read : read emulation callback.
* write: write emulation callback.
* ioctl: ioctl callback.
* mmap: mmap emulation callback.

Drivers should use these interfaces to register and unregister the device
with the mdev core driver, respectively:

extern int  mdev_register_device(struct device *dev,
 const struct parent_ops *ops);
extern void mdev_unregister_device(struct device *dev);

There are no locks to serialize the above callbacks in the mdev driver and
the vfio_mdev driver. If required, a vendor driver can have locks to
serialize the above APIs in its driver.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Reviewed-by: Jike Song 

Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig |   1 +
 drivers/vfio/Makefile|   1 +
 drivers/vfio/mdev/Kconfig|   9 +
 drivers/vfio/mdev/Makefile   |   4 +
 drivers/vfio/mdev/mdev_core.c| 374 +++
 drivers/vfio/mdev/mdev_driver.c  | 120 +
 drivers/vfio/mdev/mdev_private.h |  41 +
 drivers/vfio/mdev/mdev_sysfs.c   | 286 ++
 include/linux/mdev.h | 168 ++
 9 files changed, 1004 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig

[Qemu-devel] [PATCH v12 20/22] docs: Sysfs ABI for mediated device framework

2016-11-14 Thread Kirti Wankhede
Added details of the sysfs ABI for the mediated device framework.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: Icb0fd4ed58a2fa793fbcb1c3d5009a4403c1f3ac
---
 Documentation/ABI/testing/sysfs-bus-vfio-mdev | 111 ++
 1 file changed, 111 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-vfio-mdev

diff --git a/Documentation/ABI/testing/sysfs-bus-vfio-mdev 
b/Documentation/ABI/testing/sysfs-bus-vfio-mdev
new file mode 100644
index ..452dbe39270e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-vfio-mdev
@@ -0,0 +1,111 @@
+What:   /sys/...//mdev_supported_types/
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+This directory contains list of directories of currently
+   supported mediated device types and their details for
+   . Supported type attributes are defined by the
+   vendor driver who registers with Mediated device framework.
+   Each supported type is a directory whose name is created
+   by adding the device driver string as a prefix to the
+   string provided by the vendor driver.
+
+What:   /sys/...//mdev_supported_types//
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+This directory gives details of supported type, like name,
+   description, available_instances, device_api etc.
+   'device_api' and 'available_instances' are mandatory
+   attributes to be provided by vendor driver. 'name',
+   'description' and other vendor driver specific attributes
+   are optional.
+
+What:   /sys/.../mdev_supported_types//create
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+   Writing UUID to this file will create mediated device of
+   type  for parent device . This is a
+   write-only file.
+   For example:
+   # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
+  /sys/devices/foo/mdev_supported_types/foo-1/create
+
+What:   /sys/.../mdev_supported_types//devices/
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+   This directory contains symbolic links pointing to mdev
+   devices sysfs entries which are created of this .
+
+What:   /sys/.../mdev_supported_types//available_instances
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+   Reading this attribute will show the number of mediated
+   devices of type  that can be created. This is a
+   readonly file.
+Users:
+   Userspace applications interested in creating mediated
+   device of that type. Userspace application should check
+   the number of available instances could be created before
+   creating mediated device of this type.
+
+What:   /sys/.../mdev_supported_types//device_api
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+   Reading this attribute will show VFIO device API supported
+   by this type. For example, "vfio-pci" for a PCI device,
+   "vfio-platform" for platform device.
+
+What:   /sys/.../mdev_supported_types//name
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+   Reading this attribute will show human readable name of the
+   mediated device that will get created of type .
+   This is optional attribute. For example: "Grid M60-0Q"
+Users:
+   Userspace applications interested in knowing the name of
+   a particular  that can help in understanding the
+   type of mediated device.
+
+What:   /sys/.../mdev_supported_types//description
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+   Reading this attribute will show description of the type of
+   mediated device that will get created of type .
+   This is optional attribute. For example:
+   "2 heads, 512M FB, 2560x1600 maximum resolution"
+Users:
+   Userspace applications interested in knowing the details of
+   a particular  that can help in understanding the
+   features provided by that type of mediated device.
+
+What:   /sys/...///
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+   This directory represents device directory of mediated
+   device. It contains all the attributes related to mediated
+   device.
+
+What:   /sys/...///mdev_type
+Date:   October 2016
+Contact:Kirti Wankhede 
+Description:
+   This is symbolic link pointing to supported type, 
+  

[Qemu-devel] [PATCH v12 08/22] vfio iommu type1: Add find_iommu_group() function

2016-11-14 Thread Kirti Wankhede
Add find_iommu_group()

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Reviewed-by: Jike Song 
Reviewed-by: Dong Jia Shi 

Change-Id: I9d372f1ebe9eb01a5a21374b8a2b03f7df73601f
---
 drivers/vfio/vfio_iommu_type1.c | 57 -
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 52af5fc01d91..ffe2026f1341 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -752,11 +752,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain 
*domain)
__free_pages(pages, order);
 }
 
+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+  struct iommu_group *iommu_group)
+{
+   struct vfio_group *g;
+
+   list_for_each_entry(g, &domain->group_list, next) {
+   if (g->iommu_group == iommu_group)
+   return g;
+   }
+
+   return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 struct iommu_group *iommu_group)
 {
struct vfio_iommu *iommu = iommu_data;
-   struct vfio_group *group, *g;
+   struct vfio_group *group;
struct vfio_domain *domain, *d;
struct bus_type *bus = NULL;
int ret;
@@ -764,10 +777,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
mutex_lock(&iommu->lock);
 
list_for_each_entry(d, &iommu->domain_list, next) {
-   list_for_each_entry(g, &d->group_list, next) {
-   if (g->iommu_group != iommu_group)
-   continue;
-
+   if (find_iommu_group(d, iommu_group)) {
mutex_unlock(&iommu->lock);
return -EINVAL;
}
@@ -887,27 +897,26 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
mutex_lock(&iommu->lock);
 
list_for_each_entry(domain, &iommu->domain_list, next) {
-   list_for_each_entry(group, &domain->group_list, next) {
-   if (group->iommu_group != iommu_group)
-   continue;
+   group = find_iommu_group(domain, iommu_group);
+   if (!group)
+   continue;
 
-   iommu_detach_group(domain->domain, iommu_group);
-   list_del(&group->next);
-   kfree(group);
-   /*
-* Group ownership provides privilege, if the group
-* list is empty, the domain goes away.  If it's the
-* last domain, then all the mappings go away too.
-*/
-   if (list_empty(&domain->group_list)) {
-   if (list_is_singular(&iommu->domain_list))
-   vfio_iommu_unmap_unpin_all(iommu);
-   iommu_domain_free(domain->domain);
-   list_del(&domain->next);
-   kfree(domain);
-   }
-   goto done;
+   iommu_detach_group(domain->domain, iommu_group);
+   list_del(&group->next);
+   kfree(group);
+   /*
+* Group ownership provides privilege, if the group
+* list is empty, the domain goes away.  If it's the
+* last domain, then all the mappings go away too.
+*/
+   if (list_empty(&domain->group_list)) {
+   if (list_is_singular(&iommu->domain_list))
+   vfio_iommu_unmap_unpin_all(iommu);
+   iommu_domain_free(domain->domain);
+   list_del(&domain->next);
+   kfree(domain);
}
+   goto done;
}
 
 done:
-- 
2.7.0




[Qemu-devel] [PATCH v12 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP

2016-11-14 Thread Kirti Wankhede
Added a blocking notifier to the IOMMU TYPE1 driver to notify vendor drivers
about DMA_UNMAP.
Exported two APIs, vfio_register_notifier() and vfio_unregister_notifier().
A notifier should be registered if an external user wants to use the
vfio_pin_pages()/vfio_unpin_pages() APIs to pin/unpin pages.
The vendor driver should use the VFIO_IOMMU_NOTIFY_DMA_UNMAP action to
invalidate mappings.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I5910d0024d6be87f3e8d3e0ca0eaeaaa0b17f271
---
 drivers/vfio/vfio.c | 73 +
 drivers/vfio/vfio_iommu_type1.c | 60 ++---
 include/linux/vfio.h| 11 +++
 3 files changed, 132 insertions(+), 12 deletions(-)
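
For context, a vendor driver is expected to consume this interface roughly
as sketched below; the callback body and helper names are hypothetical:

#include <linux/notifier.h>
#include <linux/vfio.h>

static int example_iommu_notifier(struct notifier_block *nb,
				  unsigned long action, void *data)
{
	if (action == VFIO_IOMMU_NOTIFY_DMA_UNMAP) {
		struct vfio_iommu_type1_dma_unmap *unmap = data;

		/* Invalidate device mappings in [unmap->iova,
		 * unmap->iova + unmap->size) and vfio_unpin_pages()
		 * whatever the driver had pinned in that range. */
	}
	return NOTIFY_OK;
}

static struct notifier_block example_nb = {
	.notifier_call = example_iommu_notifier,
};

static int example_open(struct device *dev)
{
	/* must be registered before vfio_pin_pages() will succeed */
	return vfio_register_notifier(dev, &example_nb);
}

static void example_release(struct device *dev)
{
	vfio_unregister_notifier(dev, &example_nb);
}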

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 7dcfbca2016a..388a3cbddbd9 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1902,6 +1902,79 @@ err_unpin_pages:
 }
 EXPORT_SYMBOL(vfio_unpin_pages);
 
+int vfio_register_notifier(struct device *dev, struct notifier_block *nb)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   ssize_t ret;
+
+   if (!dev || !nb)
+   return -EINVAL;
+
+   group = vfio_group_get_from_dev(dev);
+   if (IS_ERR(group))
+   return PTR_ERR(group);
+
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   goto err_register_nb;
+
+   container = group->container;
+   down_read(&container->group_lock);
+
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->register_notifier))
+   ret = driver->ops->register_notifier(container->iommu_data, nb);
+   else
+   ret = -ENOTTY;
+
+   up_read(&container->group_lock);
+   vfio_group_try_dissolve_container(group);
+
+err_register_nb:
+   vfio_group_put(group);
+   return ret;
+}
+EXPORT_SYMBOL(vfio_register_notifier);
+
+int vfio_unregister_notifier(struct device *dev, struct notifier_block *nb)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   ssize_t ret;
+
+   if (!dev || !nb)
+   return -EINVAL;
+
+   group = vfio_group_get_from_dev(dev);
+   if (IS_ERR(group))
+   return PTR_ERR(group);
+
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   goto err_unregister_nb;
+
+   container = group->container;
+   down_read(&container->group_lock);
+
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->unregister_notifier))
+   ret = driver->ops->unregister_notifier(container->iommu_data,
+  nb);
+   else
+   ret = -ENOTTY;
+
+   up_read(&container->group_lock);
+   vfio_group_try_dissolve_container(group);
+
+err_unregister_nb:
+   vfio_group_put(group);
+   return ret;
+}
+EXPORT_SYMBOL(vfio_unregister_notifier);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2697d874dd35..7e034e7c5ea5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson "
@@ -60,6 +61,7 @@ struct vfio_iommu {
struct vfio_domain  *external_domain; /* domain for external user */
struct mutexlock;
struct rb_root  dma_list;
+   struct blocking_notifier_head notifier;
boolv2;
boolnesting;
 };
@@ -568,7 +570,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
mutex_lock(&iommu->lock);
 
-   if (!iommu->external_domain) {
+   /* Fail if notifier list is empty */
+   if ((!iommu->external_domain) || (!iommu->notifier.head)) {
ret = -EINVAL;
goto pin_done;
}
@@ -854,15 +857,29 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
if (dma->task->mm != current->mm)
break;
unmapped += dma->size;
+
+   mutex_unlock(&iommu->lock);
+   if (iommu->external_domain && !RB_EMPTY_ROOT(&dma->pfn_list)) {
+   struct vfio_iommu_type1_dma_unmap nb_unmap;
+
+   nb_unmap.iova = dma->iova;
+   nb_unmap.size = dma->size;
+   blocking_notifier_call_chain(&iommu->notifier,
+   VFIO_IOMMU_NOTIFY_DMA_UNMAP,
+   &nb_unmap);
+
+   if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list)))
+   goto unmap_exit;
+   }
+   mutex_lock(&iommu->lock);
  

[Qemu-devel] [PATCH v12 02/22] vfio: VFIO based driver for Mediated devices

2016-11-14 Thread Kirti Wankhede
The vfio_mdev driver registers with the mdev core driver.
The mdev core driver creates a mediated device and calls the probe routine
of the vfio_mdev driver for each device.
The probe routine of the vfio_mdev driver adds the mediated device to the
VFIO core module.

This driver forms a shim layer that passes VFIO device operations through
to the vendor driver for mediated devices.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Reviewed-by: Jike Song 

Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
---
 drivers/vfio/mdev/Kconfig |   7 ++
 drivers/vfio/mdev/Makefile|   1 +
 drivers/vfio/mdev/mdev_core.c |  16 -
 drivers/vfio/mdev/vfio_mdev.c | 148 ++
 4 files changed, 171 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 258481d65ebd..1aa0391d74f2 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -7,3 +7,10 @@ config VFIO_MDEV
  Provides a framework to virtualize devices.
 
  If you don't know what do here, say N.
+
+config VFIO_MDEV_DEVICE
+   tristate "VFIO support for Mediated devices"
+   depends on VFIO && VFIO_MDEV
+   default n
+   help
+ VFIO based driver for mediated devices.
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 31bc04801d94..fa2d5ea466ee 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,3 +2,4 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_VFIO_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 613e8a8a3b2a..1e0714ebc56a 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -354,7 +354,21 @@ int mdev_device_remove(struct device *dev, bool 
force_remove)
 
 static int __init mdev_init(void)
 {
-   return mdev_bus_register();
+   int ret;
+
+   ret = mdev_bus_register();
+   if (ret) {
+   pr_err("Failed to register mdev bus\n");
+   return ret;
+   }
+
+   /*
+* Attempt to load known vfio_mdev.  This gives us a working environment
+* without the user needing to explicitly load vfio_mdev driver.
+*/
+   request_module_nowait("vfio_mdev");
+
+   return ret;
 }
 
 static void __exit mdev_exit(void)
diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
new file mode 100644
index ..ffc36758cb84
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mdev.c
@@ -0,0 +1,148 @@
+/*
+ * VFIO based driver for Mediated device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia 
+ * Kirti Wankhede 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC "VFIO based driver for Mediated device"
+
+static int vfio_mdev_open(void *device_data)
+{
+   struct mdev_device *mdev = device_data;
+   struct parent_device *parent = mdev->parent;
+   int ret;
+
+   if (unlikely(!parent->ops->open))
+   return -EINVAL;
+
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   ret = parent->ops->open(mdev);
+   if (ret)
+   module_put(THIS_MODULE);
+
+   return ret;
+}
+
+static void vfio_mdev_release(void *device_data)
+{
+   struct mdev_device *mdev = device_data;
+   struct parent_device *parent = mdev->parent;
+
+   if (likely(parent->ops->release))
+   parent->ops->release(mdev);
+
+   module_put(THIS_MODULE);
+}
+
+static long vfio_mdev_unlocked_ioctl(void *device_data,
+unsigned int cmd, unsigned long arg)
+{
+   struct mdev_device *mdev = device_data;
+   struct parent_device *parent = mdev->parent;
+
+   if (unlikely(!parent->ops->ioctl))
+   return -EINVAL;
+
+   return parent->ops->ioctl(mdev, cmd, arg);
+}
+
+static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+   struct mdev_device *mdev = device_data;
+   struct parent_device *parent = mdev->parent;
+
+   if (unlikely(!parent->ops->read))
+   return -EINVAL;
+
+   return parent->ops->read(mdev, buf, count, ppos);
+}
+
+static ssize_t vfio_mdev_write(void *device_data, const char __user *buf,
+  size_t count, loff_t *ppos)
+{
+   struct mdev_device *mdev = device_data;
+   struct parent_device *parent = mdev->parent;
+
+   if (unlikely(!parent->ops->write))
+ 

[Qemu-devel] [PATCH v12 22/22] MAINTAINERS: Add entry VFIO based Mediated device drivers

2016-11-14 Thread Kirti Wankhede
Adding myself as a maintainer of the mediated device framework,
a sub-module of VFIO.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I58f6717783e0d4008ca31f4a5c4494696bae8571
---
 MAINTAINERS | 9 +
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 411e3b87b8c2..0cff155c1315 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12754,6 +12754,15 @@ F: drivers/vfio/
 F: include/linux/vfio.h
 F: include/uapi/linux/vfio.h
 
+VFIO MEDIATED DEVICE DRIVERS
+M: Kirti Wankhede 
+L: k...@vger.kernel.org
+S: Maintained
+F: Documentation/vfio-mediated-device.txt
+F: drivers/vfio/mdev/
+F: include/linux/mdev.h
+F: samples/vfio-mdev/
+
 VFIO PLATFORM DRIVER
 M: Baptiste Reynal 
 L: k...@vger.kernel.org
-- 
2.7.0




[Qemu-devel] [PATCH v12 18/22] vfio: Define device_api strings

2016-11-14 Thread Kirti Wankhede
Defined device API strings. A vendor driver using the mediated device
framework should use the corresponding string for the device_api attribute.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I42d29f475f02a7132ce13297fbf2b48f1da10995
---
 include/uapi/linux/vfio.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 255a2113f53c..519eff362c1c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -203,6 +203,16 @@ struct vfio_device_info {
 };
 #define VFIO_DEVICE_GET_INFO   _IO(VFIO_TYPE, VFIO_BASE + 7)
 
+/*
+ * Vendor driver using Mediated device framework should provide device_api
+ * attribute in supported type attribute groups. Device API string should be 
one
+ * of the following corresponding to device flags in vfio_device_info 
structure.
+ */
+
+#define VFIO_DEVICE_API_PCI_STRING "vfio-pci"
+#define VFIO_DEVICE_API_PLATFORM_STRING"vfio-platform"
+#define VFIO_DEVICE_API_AMBA_STRING"vfio-amba"
+
 /**
  * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
  *struct vfio_region_info)
-- 
2.7.0




[Qemu-devel] [PATCH v12 03/22] vfio: Rearrange functions to get vfio_group from dev

2016-11-14 Thread Kirti Wankhede
This patch rearranges functions to get vfio_group from device

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Reviewed-by: Jike Song 

Change-Id: I1f93262bdbab75094bc24b087b29da35ba70c4c6
---
 drivers/vfio/vfio.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index d1d70e0b011b..23bc86c1d05d 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -480,6 +480,21 @@ static struct vfio_group *vfio_group_get_from_minor(int 
minor)
return group;
 }
 
+static struct vfio_group *vfio_group_get_from_dev(struct device *dev)
+{
+   struct iommu_group *iommu_group;
+   struct vfio_group *group;
+
+   iommu_group = iommu_group_get(dev);
+   if (!iommu_group)
+   return NULL;
+
+   group = vfio_group_get_from_iommu(iommu_group);
+   iommu_group_put(iommu_group);
+
+   return group;
+}
+
 /**
  * Device objects - create, release, get, put, search
  */
@@ -811,16 +826,10 @@ EXPORT_SYMBOL_GPL(vfio_add_group_dev);
  */
 struct vfio_device *vfio_device_get_from_dev(struct device *dev)
 {
-   struct iommu_group *iommu_group;
struct vfio_group *group;
struct vfio_device *device;
 
-   iommu_group = iommu_group_get(dev);
-   if (!iommu_group)
-   return NULL;
-
-   group = vfio_group_get_from_iommu(iommu_group);
-   iommu_group_put(iommu_group);
+   group = vfio_group_get_from_dev(dev);
if (!group)
return NULL;
 
-- 
2.7.0
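
For context, here is a sketch of a caller of the exported lookup that now sits
on top of the new helper; the pairing with vfio_device_put() follows the usual
VFIO reference-counting convention and is assumed here, and example_lookup()
is purely illustrative:

/*
 * Sketch only: a caller of the existing exported API, which after this patch
 * goes through vfio_group_get_from_dev() internally.
 */
#include <linux/device.h>
#include <linux/errno.h>
#include <linux/vfio.h>

static int example_lookup(struct device *dev)
{
        struct vfio_device *device;

        device = vfio_device_get_from_dev(dev);
        if (!device)
                return -ENODEV;

        /* ... operate on the vfio_device ... */

        vfio_device_put(device);        /* drop the reference taken above */
        return 0;
}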




[Qemu-devel] [PATCH v12 21/22] docs: Sample driver to demonstrate how to use Mediated device framework.

2016-11-14 Thread Kirti Wankhede
The sample driver creates an mdev device that simulates a serial port over a
PCI card.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I857f8f12f8b275f2498dfe8c628a5cdc7193b1b2
---
 Documentation/vfio-mediated-device.txt |  103 ++-
 samples/vfio-mdev/Makefile |   13 +
 samples/vfio-mdev/mtty.c   | 1503 
 3 files changed, 1618 insertions(+), 1 deletion(-)
 create mode 100644 samples/vfio-mdev/Makefile
 create mode 100644 samples/vfio-mdev/mtty.c

diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
index fe8bd2e7b26a..0d2e402af7bb 100644
--- a/Documentation/vfio-mediated-device.txt
+++ b/Documentation/vfio-mediated-device.txt
@@ -289,8 +289,109 @@ these callbacks are supported in the TYPE1 IOMMU module. To enable them for
 other IOMMU backend modules, such as PPC64 sPAPR module, they need to provide
 these two callback functions.
 
+Using the Sample Code
+=====================
+
+mtty.c in samples/vfio-mdev/ directory is a sample driver program to
+demonstrate how to use the mediated device framework.
+
+The sample driver creates an mdev device that simulates a serial port over a PCI
+card.
+
+1. Build and load the mtty.ko module.
+
+   This step creates a dummy device, /sys/devices/virtual/mtty/mtty/
+
+   Files in this device directory in sysfs are similar to the following:
+
+   # tree /sys/devices/virtual/mtty/mtty/
+  /sys/devices/virtual/mtty/mtty/
+  |-- mdev_supported_types
+  |   |-- mtty-1
+  |   |   |-- available_instances
+  |   |   |-- create
+  |   |   |-- device_api
+  |   |   |-- devices
+  |   |   `-- name
+  |   `-- mtty-2
+  |   |-- available_instances
+  |   |-- create
+  |   |-- device_api
+  |   |-- devices
+  |   `-- name
+  |-- mtty_dev
+  |   `-- sample_mtty_dev
+  |-- power
+  |   |-- autosuspend_delay_ms
+  |   |-- control
+  |   |-- runtime_active_time
+  |   |-- runtime_status
+  |   `-- runtime_suspended_time
+  |-- subsystem -> ../../../../class/mtty
+  `-- uevent
+
+2. Create a mediated device by using the dummy device that you created in the
+   previous step.
+
+   # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
+  /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create
+
+3. Add parameters to qemu-kvm.
+
+   -device vfio-pci,\
+sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
+
+4. Boot the VM.
+
+   In the Linux guest VM, with no hardware on the host, the device appears
+   as  follows:
+
+   # lspci -s 00:05.0 -xxvv
+   00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
+   Subsystem: Device 4348:3253
+   Physical Slot: 5
+   Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
+   Stepping- SERR- FastB2B- DisINTx-
+   Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
+   SERR-  Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
+   :00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
+   :00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
+
+
+5. In the Linux guest VM, check the serial ports.
+
+   # setserial -g /dev/ttyS*
+   /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
+   /dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10
+   /dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10
+
+6. Using minicom or any terminal emulation program, open port /dev/ttyS1 or
+   /dev/ttyS2 with hardware flow control disabled.
+
+7. Type data on the minicom terminal or send data to the terminal emulation
+   program and read the data.
+
+   Data is looped back from the host's mtty driver.
+
+8. Destroy the mediated device that you created.
+
+   # echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove
+
 References
-----------
+==========
 
 [1] See Documentation/vfio.txt for more information on VFIO.
 [2] struct mdev_driver in include/linux/mdev.h
diff --git a/samples/vfio-mdev/Makefile b/samples/vfio-mdev/Makefile
new file mode 100644
index ..a932edbe38eb
--- /dev/null
+++ b/samples/vfio-mdev/Makefile
@@ -0,0 +1,13 @@
+#
+# Makefile for mtty.c file
+#
+KERNEL_DIR:=/lib/modules/$(shell uname -r)/build
+
+obj-m:=mtty.o
+
+modules clean modules_install:
+   $(MAKE) -C $(KERNEL_DIR) SUBDIRS=$(PWD) $@
+
+default: modules
+
+module: modules
diff --git a/samples/vfio-mdev/mtty.c b/samples/vfio-mdev/mtty.c
new file mode 100644
index ..6b633a4ea333
--- /dev/null
+++ b/samples/vfio-mdev/mtty.c
@@ -0,0 +1,1503 @@
+/*
+ * Mediated virtual PCI serial host device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia 
+ * Kirti Wankhede 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Sample d
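
Steps 5-7 above exercise the emulated ports from inside the guest. A minimal
user-space sketch of that loopback test (illustrative only, not part of the
sample code; the port name and message are placeholders) might look like:

/*
 * Open the mtty-backed port found in step 5, disable hardware flow control
 * as in step 6, write a string and read it back (the host mtty driver loops
 * written data back).
 */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
        struct termios tio;
        char buf[64];
        ssize_t n;
        int fd = open("/dev/ttyS1", O_RDWR | O_NOCTTY);

        if (fd < 0)
                return 1;

        tcgetattr(fd, &tio);
        cfmakeraw(&tio);
        tio.c_cflag &= ~CRTSCTS;                /* hardware flow control off */
        tcsetattr(fd, TCSANOW, &tio);

        write(fd, "hello\n", 6);                /* send data ...             */
        n = read(fd, buf, sizeof(buf) - 1);     /* ... and read it back      */
        if (n > 0) {
                buf[n] = '\0';
                printf("looped back: %s", buf);
        }

        close(fd);
        return 0;
}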

[Qemu-devel] [PATCH v12 16/22] vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare()

2016-11-14 Thread Kirti Wankhede
Updated vfio_pci.c file to use vfio_set_irqs_validate_and_prepare()

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I9f3daba89d8dba5cb5b01a8cff420412f30686c7
---
 drivers/vfio/pci/vfio_pci.c | 34 +++---
 1 file changed, 7 insertions(+), 27 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 03b5434f4d5b..dcd7c2a99618 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -818,45 +818,25 @@ static long vfio_pci_ioctl(void *device_data,
 
} else if (cmd == VFIO_DEVICE_SET_IRQS) {
struct vfio_irq_set hdr;
-   size_t size;
u8 *data = NULL;
int max, ret = 0;
+   size_t data_size = 0;
 
minsz = offsetofend(struct vfio_irq_set, count);
 
if (copy_from_user(&hdr, (void __user *)arg, minsz))
return -EFAULT;
 
-   if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
-   hdr.count >= (U32_MAX - hdr.start) ||
-   hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
- VFIO_IRQ_SET_ACTION_TYPE_MASK))
-   return -EINVAL;
-
max = vfio_pci_get_irq_count(vdev, hdr.index);
-   if (hdr.start >= max || hdr.start + hdr.count > max)
-   return -EINVAL;
 
-   switch (hdr.flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
-   case VFIO_IRQ_SET_DATA_NONE:
-   size = 0;
-   break;
-   case VFIO_IRQ_SET_DATA_BOOL:
-   size = sizeof(uint8_t);
-   break;
-   case VFIO_IRQ_SET_DATA_EVENTFD:
-   size = sizeof(int32_t);
-   break;
-   default:
-   return -EINVAL;
-   }
-
-   if (size) {
-   if (hdr.argsz - minsz < hdr.count * size)
-   return -EINVAL;
+   ret = vfio_set_irqs_validate_and_prepare(&hdr, max,
+VFIO_PCI_NUM_IRQS, &data_size);
+   if (ret)
+   return ret;
 
+   if (data_size) {
data = memdup_user((void __user *)(arg + minsz),
-  hdr.count * size);
+   data_size);
if (IS_ERR(data))
return PTR_ERR(data);
}
-- 
2.7.0
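
vfio_set_irqs_validate_and_prepare() itself is added by an earlier patch in
this series and is not shown here. As a rough reconstruction of what it has to
do, mirroring the checks deleted from vfio_pci_ioctl() above (a sketch under
that assumption, not the actual implementation):

/*
 * Sketch reconstructed from the checks removed above; the real helper is
 * introduced earlier in this series and may differ in detail.
 */
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/stddef.h>
#include <linux/types.h>
#include <linux/vfio.h>

int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
                                       int max_irq_type, size_t *data_size)
{
        unsigned long minsz = offsetofend(struct vfio_irq_set, count);
        size_t size;

        if (hdr->argsz < minsz || hdr->index >= max_irq_type ||
            hdr->count >= (U32_MAX - hdr->start) ||
            hdr->flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
                           VFIO_IRQ_SET_ACTION_TYPE_MASK))
                return -EINVAL;

        *data_size = 0;

        if (hdr->start >= num_irqs || hdr->start + hdr->count > num_irqs)
                return -EINVAL;

        switch (hdr->flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
        case VFIO_IRQ_SET_DATA_NONE:
                size = 0;
                break;
        case VFIO_IRQ_SET_DATA_BOOL:
                size = sizeof(uint8_t);
                break;
        case VFIO_IRQ_SET_DATA_EVENTFD:
                size = sizeof(int32_t);
                break;
        default:
                return -EINVAL;
        }

        if (size) {
                if (hdr->argsz - minsz < hdr->count * size)
                        return -EINVAL;
                *data_size = hdr->count * size;
        }

        return 0;
}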




[Qemu-devel] [PATCH v12 09/22] vfio iommu type1: Add task structure to vfio_dma

2016-11-14 Thread Kirti Wankhede
Add a task structure to the vfio_dma structure.
During DMA_UNMAP, only the task that mapped the range, or another task
sharing the same address space, is allowed to unmap it; otherwise the unmap
fails.
QEMU maps a few iova ranges initially, then forks threads, and a child thread
calls DMA_UNMAP on a previously mapped iova. Since the child shares the same
address space, the DMA_UNMAP succeeds.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I7600f1bea6b384fd589fa72421ccf031bcfd9ac5
---
 drivers/vfio/vfio_iommu_type1.c | 137 +---
 1 file changed, 86 insertions(+), 51 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ffe2026f1341..50aca95cf61e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson "
@@ -75,6 +76,7 @@ struct vfio_dma {
unsigned long   vaddr;  /* Process virtual addr */
size_t  size;   /* Map size (bytes) */
int prot;   /* IOMMU_READ/WRITE */
+   struct task_struct  *task;
 };
 
 struct vfio_group {
@@ -277,41 +279,47 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages_remote(unsigned long vaddr, long npage,
- int prot, unsigned long *pfn_base)
+static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
+ long npage, int prot, unsigned long *pfn_base)
 {
-   unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-   bool lock_cap = capable(CAP_IPC_LOCK);
+   unsigned long limit;
+   bool lock_cap = ns_capable(task_active_pid_ns(dma->task)->user_ns,
+  CAP_IPC_LOCK);
+   struct mm_struct *mm;
long ret, i;
bool rsvd;
 
-   if (!current->mm)
+   mm = get_task_mm(dma->task);
+   if (!mm)
return -ENODEV;
 
-   ret = vaddr_get_pfn(current->mm, vaddr, prot, pfn_base);
+   ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
if (ret)
-   return ret;
+   goto pin_pg_remote_exit;
 
rsvd = is_invalid_reserved_pfn(*pfn_base);
+   limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
-   if (!rsvd && !lock_cap && current->mm->locked_vm + 1 > limit) {
+   if (!rsvd && !lock_cap && mm->locked_vm + 1 > limit) {
put_pfn(*pfn_base, prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
limit << PAGE_SHIFT);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto pin_pg_remote_exit;
}
 
if (unlikely(disable_hugepages)) {
if (!rsvd)
-   vfio_lock_acct(current, 1);
-   return 1;
+   vfio_lock_acct(dma->task, 1);
+   ret = 1;
+   goto pin_pg_remote_exit;
}
 
/* Lock all the consecutive pages from pfn_base */
for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
unsigned long pfn = 0;
 
-   ret = vaddr_get_pfn(current->mm, vaddr, prot, &pfn);
+   ret = vaddr_get_pfn(mm, vaddr, prot, &pfn);
if (ret)
break;
 
@@ -321,8 +329,7 @@ static long vfio_pin_pages_remote(unsigned long vaddr, long npage,
break;
}
 
-   if (!rsvd && !lock_cap &&
-   current->mm->locked_vm + i + 1 > limit) {
+   if (!rsvd && !lock_cap && mm->locked_vm + i + 1 > limit) {
put_pfn(pfn, prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
__func__, limit << PAGE_SHIFT);
@@ -331,13 +338,16 @@ static long vfio_pin_pages_remote(unsigned long vaddr, long npage,
}
 
if (!rsvd)
-   vfio_lock_acct(current, i);
+   vfio_lock_acct(dma->task, i);
+   ret = i;
 
-   return i;
+pin_pg_remote_exit:
+   mmput(mm);
+   return ret;
 }
 
-static long vfio_unpin_pages_remote(unsigned long pfn, long npage,
-   int prot, bool do_accounting)
+static long vfio_unpin_pages_remote(struct vfio_dma *dma, unsigned long pfn,
+   long npage, int prot, bool do_accounting)
 {
unsigned long unlocked = 0;
long i;
@@ -346,7 +356,7 @@ static long vfio_unpin_pages_remote(unsigned long pfn, long npage,
unlocked += put_pfn(pfn++, prot);
 
if (do_accounting)
-   vfio_lock_acct(current, -un
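
To make the commit message's scenario concrete, here is a user-space sketch
(illustrative only; container setup is omitted, and the fd, addresses, and
sizes are placeholders) of a mapping made by one thread being unmapped by a
later-created thread that shares the same address space:

/*
 * The mapping thread and the later-created thread share one mm, so the
 * second thread is permitted to unmap the range under this patch's rules.
 */
#include <linux/vfio.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/ioctl.h>

static int container;   /* assumed: an open VFIO_TYPE1_IOMMU container fd */

static void *unmap_thread(void *arg)
{
        struct vfio_iommu_type1_dma_unmap unmap = {
                .argsz = sizeof(unmap),
                .iova  = 0x100000,
                .size  = 0x100000,
        };

        /* Same address space as the mapping thread, so this is allowed. */
        ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
        return NULL;
}

static void map_then_unmap_from_thread(void *buf)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uintptr_t)buf,
                .iova  = 0x100000,
                .size  = 0x100000,
        };
        pthread_t tid;

        ioctl(container, VFIO_IOMMU_MAP_DMA, &map);     /* mapped by this task */

        pthread_create(&tid, NULL, unmap_thread, NULL); /* shares the same mm  */
        pthread_join(tid, NULL);
}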

Re: [Qemu-devel] [PATCH v6 1/2] block/vxhs.c: Add support for a new block device type called "vxhs"

2016-11-14 Thread Fam Zheng
On Mon, 11/14 15:07, Stefan Hajnoczi wrote:
> On Mon, Nov 07, 2016 at 04:59:44PM -0800, Ashish Mittal wrote:
> > diff --git a/block/vxhs.c b/block/vxhs.c
> > new file mode 100644
> > index 000..8913e8f
> > --- /dev/null
> > +++ b/block/vxhs.c
> > @@ -0,0 +1,689 @@
> > +/*
> > + * QEMU Block driver for Veritas HyperScale (VxHS)
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "block/block_int.h"
> > +#include 
> 
> Please move system headers (<>) above user headers ("").  This way you
> can be sure the system header isn't affected by any macros defined by
> user headers.

Yes, but still after "qemu/osdep.h", which prepares necessary bits for any other
headers.

Fam
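
Put together, the suggested ordering for the new driver would look roughly
like this (the library header name is a placeholder, since only the QEMU
headers are known from the hunk above):

/*
 * Suggested include ordering per the review comments above.  The library
 * header name is a placeholder, not the actual vxhs dependency.
 */
#include "qemu/osdep.h"             /* always first: prepares everything else */

#include <some_vendor_library.h>    /* system/library headers (<...>) next    */

#include "block/block_int.h"        /* remaining QEMU ("...") headers last    */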



[Qemu-devel] [PATCH v12 13/22] vfio: Introduce common function to add capabilities

2016-11-14 Thread Kirti Wankhede
A vendor driver using the mediated device framework should use
vfio_info_add_capability() to add capabilities.
Introduce this function to reduce code duplication in vendor drivers.

vfio_info_cap_shift() manipulates a data buffer to add an offset to each
element in a chain. This data buffer is documented in a uapi header.
Change the vfio_info_cap_shift() export so the symbol is available to all
drivers.

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I6fca329fa2291f37a2c859d0bc97574d9e2ce1a6
---
 drivers/vfio/vfio.c  | 60 +++-
 include/linux/vfio.h |  3 +++
 2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 388a3cbddbd9..3648fd74aa30 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1797,8 +1797,66 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
for (tmp = caps->buf; tmp->next; tmp = (void *)tmp + tmp->next - offset)
tmp->next += offset;
 }
-EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
+EXPORT_SYMBOL(vfio_info_cap_shift);
 
+static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+   struct vfio_info_cap_header *header;
+   struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
+   size_t size;
+
+   size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
+   header = vfio_info_cap_add(caps, size,
+  VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
+   if (IS_ERR(header))
+   return PTR_ERR(header);
+
+   sparse_cap = container_of(header,
+   struct vfio_region_info_cap_sparse_mmap, header);
+   sparse_cap->nr_areas = sparse->nr_areas;
+   memcpy(sparse_cap->areas, sparse->areas,
+  sparse->nr_areas * sizeof(*sparse->areas));
+   return 0;
+}
+
+static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+   struct vfio_info_cap_header *header;
+   struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
+
+   header = vfio_info_cap_add(caps, sizeof(*cap),
+  VFIO_REGION_INFO_CAP_TYPE, 1);
+   if (IS_ERR(header))
+   return PTR_ERR(header);
+
+   type_cap = container_of(header, struct vfio_region_info_cap_type,
+   header);
+   type_cap->type = cap->type;
+   type_cap->subtype = cap->subtype;
+   return 0;
+}
+
+int vfio_info_add_capability(struct vfio_info_cap *caps, int cap_type_id,
+void *cap_type)
+{
+   int ret = -EINVAL;
+
+   if (!cap_type)
+   return 0;
+
+   switch (cap_type_id) {
+   case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
+   ret = sparse_mmap_cap(caps, cap_type);
+   break;
+
+   case VFIO_REGION_INFO_CAP_TYPE:
+   ret = region_type_cap(caps, cap_type);
+   break;
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL(vfio_info_add_capability);
 
 /*
  * Pin a set of guest PFNs and return their associated host PFNs for local
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 123e089388e9..b1aaf69cb470 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -111,6 +111,9 @@ extern struct vfio_info_cap_header *vfio_info_cap_add(
struct vfio_info_cap *caps, size_t size, u16 id, u16 version);
 extern void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset);
 
+extern int vfio_info_add_capability(struct vfio_info_cap *caps,
+   int cap_type_id, void *cap_type);
+
 struct pci_dev;
 #ifdef CONFIG_EEH
 extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
-- 
2.7.0
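
As a usage sketch (hypothetical, not taken from this series), a vendor driver
answering VFIO_DEVICE_GET_REGION_INFO could describe one mmap'able area and
hand it to the new helper; the surrounding caps-buffer handling (such as
vfio_info_cap_shift() and the copy to user space) follows the existing
vfio_pci pattern and is not shown:

/*
 * Hypothetical caller: advertise a single mmap'able area of a region via the
 * new helper.  The helper copies this description into the caps buffer.
 */
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/vfio.h>

static int example_add_sparse_cap(struct vfio_info_cap *caps,
                                  u64 mmap_offset, u64 mmap_size)
{
        struct vfio_region_info_cap_sparse_mmap *sparse;
        size_t size = sizeof(*sparse) + sizeof(*sparse->areas);
        int ret;

        sparse = kzalloc(size, GFP_KERNEL);
        if (!sparse)
                return -ENOMEM;

        sparse->nr_areas = 1;
        sparse->areas[0].offset = mmap_offset;
        sparse->areas[0].size = mmap_size;

        ret = vfio_info_add_capability(caps, VFIO_REGION_INFO_CAP_SPARSE_MMAP,
                                       sparse);
        kfree(sparse);      /* the helper made its own copy */
        return ret;
}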



