date:20170418

Re: [Qemu-devel] [[RFC]PATCH:hw/sd:sd_init()] hw/sd : modified the sd_init() function

2017-04-18 Thread Tejaswini Poluri

Hii,

A gentle reminder on the same!

Regards,
Tejaswini

On Sun, Apr 9, 2017 at 5:44 AM, Tejaswini Poluri  wrote:

> Hii Paolo,
> Waiting for your comments on this patch
>
> Regards,
> Tejaswini
> On 28 Mar 2017 11:01 a.m., "Tejaswini Poluri" 
> wrote:
>
>>
>>
>> On Mon, Mar 27, 2017 at 4:43 PM, Stefan Hajnoczi 
>> wrote:
>>
>>> On Mon, Mar 27, 2017 at 04:01:02PM +0530, Tejaswini wrote:
>>> > From: Tejaswini Poluri 
>>>
>>> Please shorten the subject line: "[PATCH] hw/sd: simplify sd_init()
>>> prototype"
>>>
>>> > @@ -573,16 +573,19 @@ SDState *sd_init(BlockBackend *blk, bool is_spi)
>>> >  qdev_prop_set_drive(dev, "drive", blk, &err);
>>> >  if (err) {
>>> >  error_report("sd_init failed: %s", error_get_pretty(err));
>>> > -return NULL;
>>> > +return -1;
>>> >  }
>>> >  qdev_prop_set_bit(dev, "spi", is_spi);
>>> >  object_property_set_bool(obj, true, "realized", &err);
>>> >  if (err) {
>>> >  error_report("sd_init failed: %s", error_get_pretty(err));
>>> > -return NULL;
>>> > +return -1;
>>> >  }
>>> > -
>>> > -return SD_CARD(dev);
>>> > +sd_state = SD_CARD(dev);
>>>
>>> The caller will not see the new value of sd_state.  In C arguments are
>>> passed by value.  That means they are local variables inside the
>>> function and do not affect the caller.
>>>
>>> The caller will call sd_init() along with passing SDState pointer as an
>> argument like below
>> if (sd_init(s->card, blk, false) < 0) .
>> And the sd_init() function is modified to
>> int sd_init(SDState *sd_state, BlockBackend *blk, bool is_spi)
>> so that the caller gets the new value of SDstate updated.
>> Looking forward for the comments of Paolo Bonzini to understand what more
>> needs to be done as part of the task.
>>
>> I have CCed Paolo Bonzini, who posted this task.  Maybe he can explain
>>> what he meant by "Include SDState by value instead of allocating it in
>>> sd_init (hw/sd/)".
>>>
>>> > +if (!sd_state) {
>>> > + return -1;
>>> > + }
>>>
>>> QEMU use 4 space indentation.  Please do not use tabs.
>>>
>>
>>
>>
>> --
>> Regards,
>> Tejaswini
>>
>


-- 
Regards,
Tejaswini

Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough (PT)

2017-04-18 Thread Peter Xu

On Tue, Apr 18, 2017 at 06:02:44AM +, Liu, Yi L wrote:
> > > Hi Peter,
> > >
> > > Skip address space switching is a good idea to support Passthru mode.
> > > However, without the address space, the vfio notifier would not be
> > > registered, thus vIOMMU emulator has no way to connect to host. It is
> > > no harm if there is only map/unmap notifier. But if we have more
> > > notifiers other than map/unmap, it may be a problem.
> > >
> > > I think we need to reconsider it here.
> > 
> > For now I think as switching is good to us in general. Could I know more 
> > context
> > about this? Would it be okay to work on top of this in the future?
> > 
> 
> Let me explain. For vSVM enabling, it needs to add new notifier to bind 
> gPASID table
> to host. If an assigned device uses SVM in guest, as designed guest IOMMU 
> driver would
> set "pt" for it. This is to make sure the host second-level page table stores 
> GPA->HPA
> mapping. So that pIOMMU can do nested translation to achieve gVA->GPA 
> GPA->HPA 
> mapping.

That should mean that in the guest it will only enable first level
translation, while in the host it'll be nested (first + second)? Then
you should be trapping the guest extended context entry invalidation,
then pass the PASID table pointer downward to the host IOMMU, am I
correct?

> 
> So the situation is if we want to keep GPA->HPA mapping, we should skip 
> address space
> switch, but the vfio listener would not know vIOMMU is added, so no vfio 
> notifier would
> be registered. But if we do the switch, the GPA->HPA mapping is unmapped. And 
> need
> another way to build the GPA->HPA mapping.

If my understanding above is correct, current IOMMU notifier may not
satisfy your need. If you see the current notify interface, it's
delivering IOMMUTLBEntry. I suspect it only suites for IOTLB
notifications, but not if you want to pass some pointer (one GPA)
downward somewhere.

> 
> I think we may have two choice here. Pls let me know your ideas.
> 
> 1. skip the switch for "pt" mode, find other way to register vfio notifiers. 
> not sure if this
> is workable. Just a quick thought.
> 
> 2. do the switch, and rebuild GPA->HPA mapping if device use "pt" mode. For 
> this option,
> I also have two way to build the GPA->HPA mapping.
> a) walk all the memory region sections in address_space_memory, and build the 
> mapping.
> address_space_memory.dispatch->map.sections, sections stores all the mapped 
> sections.
> 
> b) let guest iommu driver mimics a 1:1 mapping for the devices use "pt" mode. 
> in this way,
> the GPA->HPA mapping could be setup by walking the guest SL page table. This 
> is consistent
> with your vIOVA replay solution.

I'll prefer (1). Reason explained above.

> 
> Also I'm curious if Tianyu's fault report framework needs to register new 
> notifiers.

For Tianyu's case, I feel like we need other kind of notifier as well,
but not the current IOTLB one. And, of course in this case it'll be in
an reverted direction, which is from device to vIOMMU.

(I am thinking whether it's a good time now to let any PCI device able
 to fetch its direct owner IOMMU object, then it can register anything
 it want onto that object maybe?)

Thanks,

-- 
Peter Xu

Re: [Qemu-devel] [PATCH v4 0/4] X86/HMP: Expose x86 model specific registers via human monitor

2017-04-18 Thread Julian Kirsch

Hi Eduardo,

don't worry, it's not urgent anyway. I'll resend the patches right away. (No
idea why they don't show up in the archives, though.)

Best,
Julian

On 18.04.2017 01:26, Eduardo Habkost wrote:
> 
> Hi Julian,
> 
> Sorry for taking so long to reply.
> 
> I can't find the original series on either qemu-devel archives,
> or on my own mail archive.
> 
> Searching for the Message-Id you were replying to, I can't find
> any matches:
> https://www.mail-archive.com/search?l=mid&q=20170329183017.14026-1-git%40kirschju.re
> 
> Can you resend?
> 
>>
>> On 29.03.2017 20:30, Julian Kirsch wrote:
>>> Provide read/write access to x86 model specific registers (MSRs) by means of
>>> two new HMP commands "msr_get" and "msr_set". The rationale behind this
>>> is to improve introspection capabilities for system virtualization mode.
>>> For instance, many modern x86-64 operating systems maintain access to 
>>> internal
>>> data structures via the MSR_GSBASE/MSR_KERNELGSBASE MSRs. Giving
>>> introspection utilities (such as a remotely attached gdb via "monitor 
>>> msr_get")
>>> a way of accessing these registers improves analysis results drastically.
>>>
>>> This iteration addresses Eduardo's comments of splitting the patch up into
>>> movement, reordering and addition of new MSRs.
>>>
>>> Changes v3 -> v4:
>>> * Split up x86-related parts of the patch into three distinct patches 
>>> performing
>>>   movement, reordering and addition of new MSRs.
>>>
>>> Changes v2 -> v3:
>>> * Rename HMP commands to "msr_get" and "msr_set"
>>>
>>> Changes v1 -> v2:
>>> * Rename HMP commands to "msr-get" and "msr-set"
>>> * HMP commands Operate on the current default CPU only
>>>   (removes need for cpu_index argument)
>>> * Remove QMP command alltogether
>>> * Implement HMP command in target/i386/monitor.c
>>> * Add #ifdef TARGET_I386 around msr-get/msr-set in hmp-commands.hx
>>>
>>> Julian Kirsch (4):
>>>   X86: Move rdmsr/wrmsr functionality to standalone functions
>>>   X86: Reorder MSRs in rdmsr/wrmsr to follow the order used by KVM
>>>   X86: Add MSRs supported by KVM to rdmsr/wrmsr
>>>   HMP: Introduce msr_get and msr_set HMP commands
>>>
>>>  hmp-commands.hx  |  38 
>>>  include/monitor/hmp-target.h |   2 +
>>>  target/i386/cpu.h|   3 +
>>>  target/i386/helper.c | 524 
>>> +++
>>>  target/i386/misc_helper.c| 297 +---
>>>  target/i386/monitor.c|  76 +++
>>>  6 files changed, 654 insertions(+), 286 deletions(-)
>>>
>>
>>
>

Re: [Qemu-devel] [PATCH 1/7] target/openrisc: Fixes for memory debugging

2017-04-18 Thread Richard Henderson


On 04/16/2017 04:23 PM, Stafford Horne wrote:

When debugging in gdb you might want to inspect instructions in mapped
pages or in exception vectors like 0x800 etc.  This was previously not
possible in qemu since the *get_phys_page_debug() routine only looked
into the data tlb.

Change to fall back to look into instruction tlb and plain physical
pages.

Signed-off-by: Stafford Horne 


Oh the horrors of a software managed TLB.

You might do well to architecturally define an SPR that holds the page table 
base, even if for real hardware that's only used by the software refill to load 
up the address.


That would give qemu the option of performing a real page table walk.  This 
would fix this debug hook properly (so that you can examine pages that aren't 
in the TLB at all).  It would also optionally allow QEMU to skip the software 
refill, which *significantly* speeds up emulation.


That said,

Reviewed-by: Richard Henderson 


r~

Re: [Qemu-devel] [PATCH 2/7] target/openrisc: add shutdown logic

2017-04-18 Thread Richard Henderson


On 04/16/2017 04:23 PM, Stafford Horne wrote:

In openrisc simulators we use hooks like 'l.nop 1' to cause the
simulator to exit.  Implement that for qemu too.

Reported-by: Waldemar Brodkorb 
Signed-off-by: Stafford Horne 


As I said the first time this was posted: This is horrible.

If you want to do something like this, it needs to be buried under a special 
run mode like -semihosting.



 case 0x01:/* l.nop */
 LOG_DIS("l.nop %d\n", I16);
+{
+TCGv_i32 arg = tcg_const_i32(I16);
+gen_helper_nop(arg);
+}


You also really really must special-case l.nop 0 so that it doesn't generate a 
function call.  Just think of all the extra calls you're adding for every delay 
slot that couldn't be filled.



r~

[Qemu-devel] [PATCH for-2.10] iotests: 109: Filter out "len" of failed jobs

2017-04-18 Thread Fam Zheng

Mirror calculates job len from current I/O progress:

s->common.len = s->common.offset +
(cnt + s->sectors_in_flight) * BDRV_SECTOR_SIZE;

The final "len" of a failed mirror job in iotests 109 depends on the
subtle timing of the completion of read and write issued in the first
mirror iteration.  The second iteration may or may not have run when the
I/O error happens, resulting in an undeterministic output of the
BLOCK_JOB_COMPLETED event text.

Similar to what was done in a752e4786, filter out the field to make the
test robust.

Signed-off-by: Fam Zheng 
---
 tests/qemu-iotests/109   |  3 ++-
 tests/qemu-iotests/109.out   | 12 ++--
 tests/qemu-iotests/common.filter |  6 ++
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/tests/qemu-iotests/109 b/tests/qemu-iotests/109
index 927151a..6d61cf1 100755
--- a/tests/qemu-iotests/109
+++ b/tests/qemu-iotests/109
@@ -80,7 +80,8 @@ for fmt in qcow qcow2 qed vdi vmdk vpc; do
 
 # This first test should fail: The image format was probed, we may not
 # write an image header at the start of the image
-run_qemu "$TEST_IMG" "$TEST_IMG.src" "" "BLOCK_JOB_ERROR"
+run_qemu "$TEST_IMG" "$TEST_IMG.src" "" "BLOCK_JOB_ERROR" | \
+_filter_block_job_len
 $QEMU_IO -c 'read -P 0 0 64k' "$TEST_IMG" | _filter_qemu_io
 
 
diff --git a/tests/qemu-iotests/109.out b/tests/qemu-iotests/109.out
index 55fe536..6454b7e 100644
--- a/tests/qemu-iotests/109.out
+++ b/tests/qemu-iotests/109.out
@@ -10,7 +10,7 @@ Automatically detecting the format is dangerous for raw 
images, write operations
 Specify the 'raw' format explicitly to remove the restrictions.
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_ERROR", "data": {"device": "src", "operation": "write", "action": 
"report"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 65536, "offset": 0, 
"speed": 0, "type": "mirror", "error": "Operation not permitted"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": LEN, "offset": 0, 
"speed": 0, "type": "mirror", "error": "Operation not permitted"}}
 {"return": []}
 read 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
@@ -31,7 +31,7 @@ Automatically detecting the format is dangerous for raw 
images, write operations
 Specify the 'raw' format explicitly to remove the restrictions.
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_ERROR", "data": {"device": "src", "operation": "write", "action": 
"report"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 197120, "offset": 512, 
"speed": 0, "type": "mirror", "error": "Operation not permitted"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": LEN, "offset": 512, 
"speed": 0, "type": "mirror", "error": "Operation not permitted"}}
 {"return": []}
 read 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
@@ -52,7 +52,7 @@ Automatically detecting the format is dangerous for raw 
images, write operations
 Specify the 'raw' format explicitly to remove the restrictions.
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_ERROR", "data": {"device": "src", "operation": "write", "action": 
"report"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 327680, "offset": 
262144, "speed": 0, "type": "mirror", "error": "Operation not permitted"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": LEN, "offset": 262144, 
"speed": 0, "type": "mirror", "error": "Operation not permitted"}}
 {"return": []}
 read 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
@@ -73,7 +73,7 @@ Automatically detecting the format is dangerous for raw 
images, write operations
 Specify the 'raw' format explicitly to remove the restrictions.
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_ERROR", "data": {"device": "src", "operation": "write", "action": 
"report"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 65536, "offset": 0, 
"speed": 0, "type": "mirror", "error": "Operation not permitted"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": LEN, "offset": 0, 
"speed": 0, "type": "mirror", "error": "Ope

Re: [Qemu-devel] [PATCH 3/7] target/openrisc: add numcores and coreid support

2017-04-18 Thread Richard Henderson


On 04/16/2017 04:23 PM, Stafford Horne wrote:

These are used to identify the processor in SMP system.  Their
definition has been defined in verilog cores but it not yet part of the
spec but it will be soon.

The proposal for this is available:
  https://openrisc.io/proposals/core-identifier-and-number-of-cores

Signed-off-by: Stafford Horne 


Reviewed-by: Richard Henderson 


r~

Re: [Qemu-devel] [PATCH 4/7] target/openrisc: implement shadow registers

2017-04-18 Thread Richard Henderson


On 04/16/2017 04:23 PM, Stafford Horne wrote:

Shadow registers are part of the openrisc spec along with sr[cid], as
part of the fast context switching feature.  When exceptions occur,
instead of having to save registers to the stack if enabled the CID will
increment and a new set of registers will be available.

This patch only implements shadow registers which can be used as extra
scratch registers via the mfspr and mtspr if required.  This is
implemented in a way where it would be easy to add on the fast context
switching, currently cid is hardcoded to 0.


I'm not especially keen on this half-conversion.
If CID cannot be changed, then


-target_ulong gpr[32]; /* General registers */
+target_ulong shadow_gpr[16][32]; /* Shadow registers */
+target_ulong * gpr;   /* General registers (backed by shadow) */


this pointer should not be necessary.  Just use a union, or even just

target_ulong gpr[32];
target_ulong shadow_gpr[15][32];

for now.

Alternately, add accessor functions that take the whole ENV (which would be 
able to read CID, when needed).  C.f.


uint64_t cpu_alpha_load_gr(CPUAlphaState *env, unsigned reg);
void cpu_alpha_store_gr(CPUAlphaState *env, unsigned reg, uint64_t val);

If/when CID can be changed, then we can talk about various ways
that this can be modeled within TCG.


r~

Re: [Qemu-devel] [PATCH v2] char: Fix removing wrong GSource that be found by fd_in_tag

2017-04-18 Thread Marc-André Lureau

Hi

- Original Message -
> We use fd_in_tag to find a GSource, fd_in_tag is return value of
> g_source_attach(GSource *source, GMainContext *context), the return
> value is unique only in the same context, so we may get the same
> values with different 'context' parameters.
> 
> It is no problem to find the right fd_in_tag by using
>  g_main_context_find_source_by_id(GMainContext *context, guint source_id)
> while there is only one default main context.
> 
> But colo-compare tries to create/use its own context, and if we pass wrong
> 'context' parameter with right fd_in_tag, we will find a wrong GSource
> to handle. We tried to fix the related codes in commit b43dec, but it didn't
> fix the bug completely, because we still have some codes didn't pass *right*
> context parameter for remove_fd_in_watch().
> 
> Let's fix it by record the GSource directly instead of fd_in_tag.

You could simply name it "gsource" or "source" instead of "chr_gsource" to 
avoid redundancy in chr->chr_gsource.

> 
> Signed-off-by: zhanghailiang 

looks good to me:
Reviewed-by: Marc-André Lureau 


> ---
> v2:
> - Fix minor comments from Marc-André Lureau
> ---
>  chardev/char-fd.c |  8 
>  chardev/char-io.c | 23 ---
>  chardev/char-io.h |  4 ++--
>  chardev/char-pty.c|  6 +++---
>  chardev/char-socket.c |  8 
>  chardev/char-udp.c|  8 
>  chardev/char.c|  2 +-
>  include/sysemu/char.h |  2 +-
>  8 files changed, 27 insertions(+), 34 deletions(-)
> 
> diff --git a/chardev/char-fd.c b/chardev/char-fd.c
> index 548dd4c..7f0169d 100644
> --- a/chardev/char-fd.c
> +++ b/chardev/char-fd.c
> @@ -58,7 +58,7 @@ static gboolean fd_chr_read(QIOChannel *chan, GIOCondition
> cond, void *opaque)
>  ret = qio_channel_read(
>  chan, (gchar *)buf, len, NULL);
>  if (ret == 0) {
> -remove_fd_in_watch(chr, NULL);
> +remove_fd_in_watch(chr);
>  qemu_chr_be_event(chr, CHR_EVENT_CLOSED);
>  return FALSE;
>  }
> @@ -89,9 +89,9 @@ static void fd_chr_update_read_handler(Chardev *chr,
>  {
>  FDChardev *s = FD_CHARDEV(chr);
>  
> -remove_fd_in_watch(chr, NULL);
> +remove_fd_in_watch(chr);
>  if (s->ioc_in) {
> -chr->fd_in_tag = io_add_watch_poll(chr, s->ioc_in,
> +chr->chr_gsource = io_add_watch_poll(chr, s->ioc_in,
> fd_chr_read_poll,
> fd_chr_read, chr,
> context);
> @@ -103,7 +103,7 @@ static void char_fd_finalize(Object *obj)
>  Chardev *chr = CHARDEV(obj);
>  FDChardev *s = FD_CHARDEV(obj);
>  
> -remove_fd_in_watch(chr, NULL);
> +remove_fd_in_watch(chr);
>  if (s->ioc_in) {
>  object_unref(OBJECT(s->ioc_in));
>  }
> diff --git a/chardev/char-io.c b/chardev/char-io.c
> index b4bb094..d781ad6 100644
> --- a/chardev/char-io.c
> +++ b/chardev/char-io.c
> @@ -98,7 +98,7 @@ static GSourceFuncs io_watch_poll_funcs = {
>  .finalize = io_watch_poll_finalize,
>  };
>  
> -guint io_add_watch_poll(Chardev *chr,
> +GSource *io_add_watch_poll(Chardev *chr,
>  QIOChannel *ioc,
>  IOCanReadHandler *fd_can_read,
>  QIOChannelFunc fd_read,
> @@ -106,7 +106,6 @@ guint io_add_watch_poll(Chardev *chr,
>  GMainContext *context)
>  {
>  IOWatchPoll *iwp;
> -int tag;
>  char *name;
>  
>  iwp = (IOWatchPoll *) g_source_new(&io_watch_poll_funcs,
> @@ -122,21 +121,15 @@ guint io_add_watch_poll(Chardev *chr,
>  g_source_set_name((GSource *)iwp, name);
>  g_free(name);
>  
> -tag = g_source_attach(&iwp->parent, context);
> +g_source_attach(&iwp->parent, context);
>  g_source_unref(&iwp->parent);
> -return tag;
> +return (GSource *)iwp;
>  }
>  
> -static void io_remove_watch_poll(guint tag, GMainContext *context)
> +static void io_remove_watch_poll(GSource *source)
>  {
> -GSource *source;
>  IOWatchPoll *iwp;
>  
> -g_return_if_fail(tag > 0);
> -
> -source = g_main_context_find_source_by_id(context, tag);
> -g_return_if_fail(source != NULL);
> -
>  iwp = io_watch_poll_from_source(source);
>  if (iwp->src) {
>  g_source_destroy(iwp->src);
> @@ -146,11 +139,11 @@ static void io_remove_watch_poll(guint tag,
> GMainContext *context)
>  g_source_destroy(&iwp->parent);
>  }
>  
> -void remove_fd_in_watch(Chardev *chr, GMainContext *context)
> +void remove_fd_in_watch(Chardev *chr)
>  {
> -if (chr->fd_in_tag) {
> -io_remove_watch_poll(chr->fd_in_tag, context);
> -chr->fd_in_tag = 0;
> +if (chr->chr_gsource) {
> +io_remove_watch_poll(chr->chr_gsource);
> +chr->chr_gsource = NULL;
>  }
>  }
>  
> diff --git a/chardev/char-io.h b/chardev/char-io.h
> index 842be56..55973a7 100644
> --- a/chardev/char-io.h
> +++ b/chardev/char-io.h
> @@ -29,14

Re: [Qemu-devel] [PATCH 7/7] target/openrisc: Implement full vmstate serialization

2017-04-18 Thread Richard Henderson


On 04/16/2017 04:23 PM, Stafford Horne wrote:

Previously serialization did not persist the tlb, timer, pic and other
key state items.  This meant snapshotting and restoring a running os
would crash. After adding these I am able to take snapshots of a
running linux os and restore at a later time.

I am currently not trying to maintain capatibility with older versions
as I do not believe this really worked before or anyone used it.


That's fine, but you still have to bump the version numbers.


r~

Re: [Qemu-devel] [PATCH for-2.9-rc5 v2] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Paolo Bonzini

On 17/04/2017 05:33, Fam Zheng wrote:
> BDRV_POLL_WHILE in both IOThread and main loop has aio_context_acquire(ctx)
> around it; in the branch where main loop calls aio_poll(ctx, false),
> there is also no aio_context_release(ctx). So I thinki it is protected by the
> AioContext lock, and is safe.
> 
> I tested your suggested "aio_poll(qemu_get_aio_context(), false)", but it
> doesn't work.

Right, that's because "aio_poll(qemu_get_aio_context(), false)" relies
on bdrv_wakeup on the I/O thread side (which in turn is triggered by
bdrv_inc_in_flight/bdrv_dec_in_flight), but the bottom half does not
have bdrv_inc_in_flight/bdrv_dec_in_flight around it.

Paolo

Re: [Qemu-devel] [PATCH for-2.9-rc5 v2] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Paolo Bonzini



On 17/04/2017 10:27, Fam Zheng wrote:
> At this point it's even unclear to me what should be the plan for 2.9.  v1 IMO
> was the least intrusive, but didn't cover bdrv_drain_all_begin. v2 has this
> controversial "aio_poll(ctx_, false)",

v1 has it too:

-bdrv_drain_recurse(bs);
+while (true) {
+if (!bdrv_drain_recurse(bs) &&
+!aio_poll(bdrv_get_aio_context(bs), false)) {
+break;
+}
+}

I don't have any particular preference.  Both patches are self contained
and easy to revert when the underlying root cause is fixed.

Thanks,

Paolo

> however its alternative,
> "aio_poll(qemu_get_aio_context(), false)", "introduces" another crash that is
> not seen otherwise.
> 
> What should we do now?

Re: [Qemu-devel] [PATCH for-2.9-rc5 v2] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Fam Zheng

On Tue, 04/18 10:18, Paolo Bonzini wrote:
> 
> 
> On 17/04/2017 10:27, Fam Zheng wrote:
> > At this point it's even unclear to me what should be the plan for 2.9.  v1 
> > IMO
> > was the least intrusive, but didn't cover bdrv_drain_all_begin. v2 has this
> > controversial "aio_poll(ctx_, false)",
> 
> v1 has it too:
> 
> -bdrv_drain_recurse(bs);
> +while (true) {
> +if (!bdrv_drain_recurse(bs) &&
> +!aio_poll(bdrv_get_aio_context(bs), false)) {
> +break;
> +}
> +}

Yes you are right.

On the other hand, the fact that in v2 I had to add bdrv_ref/bdrv_unref around
the recursive bdrv_drain_recurse() call makes me worry a little - I assume the
same problem exists in v1 and is just latent. So maybe merging v2 is better.

> 
> I don't have any particular preference.  Both patches are self contained
> and easy to revert when the underlying root cause is fixed.
> 

Fam

[Qemu-devel] [PATCH 1/2] net/socket: convert non-blocking connect to use QIOChannel

2017-04-18 Thread Mao Zhongyi

The non-blocking connect mechanism doesn't work well in
net connection, it still blocks on DNS lookups. So it is
obsolete. Supersede it with QIOchannel.

Signed-off-by: Mao Zhongyi 
---
The test steps like this:

$ qemu-system-x86_64 -net nic -net socket,listen=:1234 ~/img/test.img
$ qemu-system-x86_64 -net nic -net socket,connect=127.0.0.1:1234 
~/img/test.img

No exception.
 
 net/socket.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index fe3547b..1886f98 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -33,6 +33,7 @@
 #include "qemu/sockets.h"
 #include "qemu/iov.h"
 #include "qemu/main-loop.h"
+#include "io/channel-socket.h"
 
 typedef struct NetSocketState {
 NetClientState nc;
@@ -525,16 +526,21 @@ typedef struct {
 char *name;
 } socket_connect_data;
 
-static void socket_connect_data_free(socket_connect_data *c)
+static void socket_connect_data_free(void *opaque)
 {
+socket_connect_data *c = opaque;
+if (!c) {
+return;
+}
 qapi_free_SocketAddress(c->saddr);
 g_free(c->model);
 g_free(c->name);
 g_free(c);
 }
 
-static void net_socket_connected(int fd, Error *err, void *opaque)
+static void net_socket_connected(QIOTask *task, void *opaque)
 {
+QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(qio_task_get_source(task));
 socket_connect_data *c = opaque;
 NetSocketState *s;
 char *addr_str = NULL;
@@ -543,13 +549,13 @@ static void net_socket_connected(int fd, Error *err, void 
*opaque)
 addr_str = socket_address_to_string(c->saddr, &local_error);
 if (addr_str == NULL) {
 error_report_err(local_error);
-closesocket(fd);
+closesocket(sioc->fd);
 goto end;
 }
 
-s = net_socket_fd_init(c->peer, c->model, c->name, fd, true);
+s = net_socket_fd_init(c->peer, c->model, c->name, sioc->fd, true);
 if (!s) {
-closesocket(fd);
+closesocket(sioc->fd);
 goto end;
 }
 
@@ -567,7 +573,7 @@ static int net_socket_connect_init(NetClientState *peer,
const char *host_str)
 {
 socket_connect_data *c = g_new0(socket_connect_data, 1);
-int fd = -1;
+QIOChannelSocket *sioc;
 Error *local_error = NULL;
 
 c->peer = peer;
@@ -578,10 +584,12 @@ static int net_socket_connect_init(NetClientState *peer,
 goto err;
 }
 
-fd = socket_connect(c->saddr, &local_error, net_socket_connected, c);
-if (fd < 0) {
-goto err;
-}
+sioc = qio_channel_socket_new();
+qio_channel_socket_connect_async(sioc,
+ c->saddr,
+ net_socket_connected,
+ c,
+ NULL);
 
 return 0;
 
-- 
2.9.3

[Qemu-devel] [PATCH 2/2] net/socket: Improve -net socket error reporting

2017-04-18 Thread Mao Zhongyi

When -net socket fails, it first reports a specific error, then
a generic one, like this:

$ qemu-system-x86_64 -net socket,
qemu-system-x86_64: -net socket: exactly one of fd=, listen=, connect=, 
mcast= or udp= is required
qemu-system-x86_64: -net socket: Device 'socket' could not be initialized

Convert net_init_socket() to Error to get rid of the unwanted second
error message.

Signed-off-by: Mao Zhongyi 
---
 net/socket.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 1886f98..85ad9cd 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -691,7 +691,6 @@ static int net_socket_udp_init(NetClientState *peer,
 int net_init_socket(const Netdev *netdev, const char *name,
 NetClientState *peer, Error **errp)
 {
-/* FIXME error_setg(errp, ...) on failure */
 Error *err = NULL;
 const NetdevSocketOptions *sock;
 
@@ -700,13 +699,13 @@ int net_init_socket(const Netdev *netdev, const char 
*name,
 
 if (sock->has_fd + sock->has_listen + sock->has_connect + sock->has_mcast +
 sock->has_udp != 1) {
-error_report("exactly one of fd=, listen=, connect=, mcast= or udp="
+error_setg(errp, "exactly one of fd=, listen=, connect=, mcast= or 
udp="
  " is required");
 return -1;
 }
 
 if (sock->has_localaddr && !sock->has_mcast && !sock->has_udp) {
-error_report("localaddr= is only valid with mcast= or udp=");
+error_setg(errp, "localaddr= is only valid with mcast= or udp=");
 return -1;
 }
 
@@ -752,7 +751,7 @@ int net_init_socket(const Netdev *netdev, const char *name,
 
 assert(sock->has_udp);
 if (!sock->has_localaddr) {
-error_report("localaddr= is mandatory with udp=");
+error_setg(errp, "localaddr= is mandatory with udp=");
 return -1;
 }
 if (net_socket_udp_init(peer, "socket", name, sock->udp, sock->localaddr) 
==
-- 
2.9.3

Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough (PT)

2017-04-18 Thread Liu, Yi L

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Tuesday, April 18, 2017 3:27 PM
> To: Liu, Yi L 
> Cc: qemu-devel@nongnu.org; Lan, Tianyu ; Michael S .
> Tsirkin ; Jason Wang ; Marcel
> Apfelbaum ; David Gibson 
> Subject: Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough (PT)
> 
> On Tue, Apr 18, 2017 at 06:02:44AM +, Liu, Yi L wrote:
> > > > Hi Peter,
> > > >
> > > > Skip address space switching is a good idea to support Passthru mode.
> > > > However, without the address space, the vfio notifier would not be
> > > > registered, thus vIOMMU emulator has no way to connect to host. It
> > > > is no harm if there is only map/unmap notifier. But if we have
> > > > more notifiers other than map/unmap, it may be a problem.
> > > >
> > > > I think we need to reconsider it here.
> > >
> > > For now I think as switching is good to us in general. Could I know
> > > more context about this? Would it be okay to work on top of this in the 
> > > future?
> > >
> >
> > Let me explain. For vSVM enabling, it needs to add new notifier to
> > bind gPASID table to host. If an assigned device uses SVM in guest, as
> > designed guest IOMMU driver would set "pt" for it. This is to make
> > sure the host second-level page table stores GPA->HPA mapping. So that
> > pIOMMU can do nested translation to achieve gVA->GPA GPA->HPA mapping.
> 
> That should mean that in the guest it will only enable first level 
> translation, while in
> the host it'll be nested (first + second)? Then you should be trapping the 
> guest
> extended context entry invalidation, then pass the PASID table pointer 
> downward to
> the host IOMMU, am I correct?

exactly.

> 
> >
> > So the situation is if we want to keep GPA->HPA mapping, we should
> > skip address space switch, but the vfio listener would not know vIOMMU
> > is added, so no vfio notifier would be registered. But if we do the
> > switch, the GPA->HPA mapping is unmapped. And need another way to build the
> GPA->HPA mapping.
> 
> If my understanding above is correct, current IOMMU notifier may not satisfy 
> your
> need. If you see the current notify interface, it's delivering IOMMUTLBEntry. 
> I
> suspect it only suites for IOTLB notifications, but not if you want to pass 
> some
> pointer (one GPA) downward somewhere.

Yeah, you got it more than absolutely. One of my patch is to modify the para to 
be
void * and let each notifiers to translate separately. I think it should be a 
reasonable
change.

Supposedly, I would send RFC for vSVM soon. We may talk more it at that thread.

> >
> > I think we may have two choice here. Pls let me know your ideas.
> >
> > 1. skip the switch for "pt" mode, find other way to register vfio
> > notifiers. not sure if this is workable. Just a quick thought.
> >
> > 2. do the switch, and rebuild GPA->HPA mapping if device use "pt"
> > mode. For this option, I also have two way to build the GPA->HPA mapping.
> > a) walk all the memory region sections in address_space_memory, and build 
> > the
> mapping.
> > address_space_memory.dispatch->map.sections, sections stores all the mapped
> sections.
> >
> > b) let guest iommu driver mimics a 1:1 mapping for the devices use
> > "pt" mode. in this way, the GPA->HPA mapping could be setup by walking
> > the guest SL page table. This is consistent with your vIOVA replay solution.
> 
> I'll prefer (1). Reason explained above.
> 
> >
> > Also I'm curious if Tianyu's fault report framework needs to register new 
> > notifiers.
> 
> For Tianyu's case, I feel like we need other kind of notifier as well, but 
> not the
> current IOTLB one. And, of course in this case it'll be in an reverted 
> direction, which
> is from device to vIOMMU.
> 
> (I am thinking whether it's a good time now to let any PCI device able  to 
> fetch its
> direct owner IOMMU object, then it can register anything  it want onto that 
> object
> maybe?)
> 
Hmmm, let's see Tianyu's comment as he's driving fault report. Let's keep in 
touch on
this Passthru-Mode enabling.

Thanks,
Yi L

Re: [Qemu-devel] [PATCH 03/10] vga: add vga_scanline_invalidated helper

2017-04-18 Thread Gerd Hoffmann

> > +static bool vga_scanline_invalidated(VGACommonState *s, int y)
> > +{
> > +if (y >= VGA_MAX_HEIGHT) {
> > +return false;
> > +}
> > +return s->invalidated_y_table[y >> 5] |= 1 << (y & 0x1f);
> > +}

> The return stmt doesn't match what you're replacing.  Are you really 
> intending 
> to modify invalidated_y_table here?

No.  Really stupid cut&paste bug.  Fixed.

thanks,
  Gerd

Re: [Qemu-devel] [Qemu-block] [PATCH RFC 04/16] block: Propagate BLK_PERM_AIO_CONTEXT_CHANGE down the graph

2017-04-18 Thread Fam Zheng

On Mon, 04/10 09:57, Stefan Hajnoczi wrote:
> On Tue, Mar 21, 2017 at 11:16:23AM +0800, Fam Zheng wrote:
> > @@ -1713,21 +1714,22 @@ void bdrv_format_default_perms(BlockDriverState 
> > *bs, BdrvChild *c,
> >  perm |= BLK_PERM_CONSISTENT_READ;
> >  shared &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
> >  } else {
> > -/* We want consistent read from backing files if the parent needs 
> > it.
> > +/* We want consistent read and aio context change from backing 
> > files if
> > + * the parent needs it.
> >   * No other operations are performed on backing files. */
> > -perm &= BLK_PERM_CONSISTENT_READ;
> > +perm &= BLK_PERM_CONSISTENT_READ | BLK_PERM_AIO_CONTEXT_CHANGE;
> >  
> > -/* If the parent can deal with changing data, we're okay with a
> > +/* If the parent can deal with changing aio context, we're okay 
> > too;
> > + * If the parent can deal with changing data, we're okay with a
> >   * writable and resizable backing file. */
> >  /* TODO Require !(perm & BLK_PERM_CONSISTENT_READ), too? */
> > +shared &= BLK_PERM_AIO_CONTEXT_CHANGE | BLK_PERM_WRITE;
> >  if (shared & BLK_PERM_WRITE) {
> > -shared = BLK_PERM_WRITE | BLK_PERM_RESIZE;
> > -} else {
> > -shared = 0;
> > +shared |= BLK_PERM_WRITE | BLK_PERM_RESIZE;
> 
> We already have BLK_PERM_WRITE so we're just adding BLK_PERM_RESIZE.
> The following is clearer:
> 
>   shared |= BLK_PERM_RESIZE;
> 
> >  }
> >  
> >  shared |= BLK_PERM_CONSISTENT_READ | BLK_PERM_GRAPH_MOD |
> > -  BLK_PERM_WRITE_UNCHANGED;
> > +  BLK_PERM_WRITE_UNCHANGED | BLK_PERM_AIO_CONTEXT_CHANGE;
> 
> Why was shared &= BLK_PERM_AIO_CONTEXT_CHANGE necessary above if we
> unconditionally OR it here?

It's redundant. Will fix both.

Fam

Re: [Qemu-devel] [PATCH 18/19] monitor: move hmp_savevm() to monitor.c

2017-04-18 Thread Paolo Bonzini



On 17/04/2017 22:00, Juan Quintela wrote:
> hmp_loadvm is already there, so be consistent.

Can you move both to hmp.c instead?

Paolo

> Signed-off-by: Juan Quintela 
> ---
>  include/sysemu/sysemu.h | 1 -
>  migration/savevm.c  | 5 -
>  monitor.c   | 5 +
>  3 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 5f2f21d..146a0dc 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -75,7 +75,6 @@ void qemu_remove_exit_notifier(Notifier *notify);
>  void qemu_add_machine_init_done_notifier(Notifier *notify);
>  void qemu_remove_machine_init_done_notifier(Notifier *notify);
>  
> -void hmp_savevm(Monitor *mon, const QDict *qdict);
>  int save_vmstate(Monitor *mon, const char *name);
>  int load_vmstate(const char *name);
>  void hmp_delvm(Monitor *mon, const QDict *qdict);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index f628d01..cbd7e0d 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2156,11 +2156,6 @@ int save_vmstate(Monitor *mon, const char *name)
>  return ret;
>  }
>  
> -void hmp_savevm(Monitor *mon, const QDict *qdict)
> -{
> -save_vmstate(mon, qdict_get_try_str(qdict, "name"));
> -}
> -
>  void qmp_xen_save_devices_state(const char *filename, Error **errp)
>  {
>  QEMUFile *f;
> diff --git a/monitor.c b/monitor.c
> index ceb0489..2fca4fb 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -1854,6 +1854,11 @@ static void hmp_loadvm(Monitor *mon, const QDict 
> *qdict)
>  }
>  }
>  
> +static void hmp_savevm(Monitor *mon, const QDict *qdict)
> +{
> +save_vmstate(mon, qdict_get_try_str(qdict, "name"));
> +}
> +
>  int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
>  {
>  mon_fd_t *monfd;
>

Re: [Qemu-devel] [PATCH 19/19] monitor: remove monitor parameter from save_vmstate

2017-04-18 Thread Paolo Bonzini



On 17/04/2017 22:00, Juan Quintela wrote:
> load_vmstate() already use error_report, so be consistent.

Better: make both return Error* via an Error** parameter, and add

hmp_handle_error(mon, &err);

to hmp_savevm and error_report_err(err) on the loading side.

Paolo

> Signed-off-by: Juan Quintela 
> ---
>  include/sysemu/sysemu.h  |  2 +-
>  migration/savevm.c   | 16 
>  monitor.c|  2 +-
>  replay/replay-snapshot.c |  2 +-
>  4 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 146a0dc..d2582fa 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -75,7 +75,7 @@ void qemu_remove_exit_notifier(Notifier *notify);
>  void qemu_add_machine_init_done_notifier(Notifier *notify);
>  void qemu_remove_machine_init_done_notifier(Notifier *notify);
>  
> -int save_vmstate(Monitor *mon, const char *name);
> +int save_vmstate(const char *name);
>  int load_vmstate(const char *name);
>  void hmp_delvm(Monitor *mon, const QDict *qdict);
>  void hmp_info_snapshots(Monitor *mon, const QDict *qdict);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index cbd7e0d..36a6002 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2055,7 +2055,7 @@ int qemu_loadvm_state(QEMUFile *f)
>  return ret;
>  }
>  
> -int save_vmstate(Monitor *mon, const char *name)
> +int save_vmstate(const char *name)
>  {
>  BlockDriverState *bs, *bs1;
>  QEMUSnapshotInfo sn1, *sn = &sn1, old_sn1, *old_sn = &old_sn1;
> @@ -2069,8 +2069,8 @@ int save_vmstate(Monitor *mon, const char *name)
>  AioContext *aio_context;
>  
>  if (!bdrv_all_can_snapshot(&bs)) {
> -monitor_printf(mon, "Device '%s' is writable but does not "
> -   "support snapshots.\n", bdrv_get_device_name(bs));
> +error_report("Device '%s' is writable but does not support 
> snapshots.",
> + bdrv_get_device_name(bs));
>  return ret;
>  }
>  
> @@ -2087,7 +2087,7 @@ int save_vmstate(Monitor *mon, const char *name)
>  
>  bs = bdrv_all_find_vmstate_bs();
>  if (bs == NULL) {
> -monitor_printf(mon, "No block device can accept snapshots\n");
> +error_report("No block device can accept snapshots");
>  return ret;
>  }
>  aio_context = bdrv_get_aio_context(bs);
> @@ -2096,7 +2096,7 @@ int save_vmstate(Monitor *mon, const char *name)
>  
>  ret = global_state_store();
>  if (ret) {
> -monitor_printf(mon, "Error saving global state\n");
> +error_report("Error saving global state");
>  return ret;
>  }
>  vm_stop(RUN_STATE_SAVE_VM);
> @@ -2128,7 +2128,7 @@ int save_vmstate(Monitor *mon, const char *name)
>  /* save the VM state */
>  f = qemu_fopen_bdrv(bs, 1);
>  if (!f) {
> -monitor_printf(mon, "Could not open VM state file\n");
> +error_report("Could not open VM state file");
>  goto the_end;
>  }
>  ret = qemu_savevm_state(f, &local_err);
> @@ -2141,8 +2141,8 @@ int save_vmstate(Monitor *mon, const char *name)
>  
>  ret = bdrv_all_create_snapshot(sn, bs, vm_state_size, &bs);
>  if (ret < 0) {
> -monitor_printf(mon, "Error while creating snapshot on '%s'\n",
> -   bdrv_get_device_name(bs));
> +error_report("Error while creating snapshot on '%s'",
> + bdrv_get_device_name(bs));
>  goto the_end;
>  }
>  
> diff --git a/monitor.c b/monitor.c
> index 2fca4fb..9e79a97 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -1856,7 +1856,7 @@ static void hmp_loadvm(Monitor *mon, const QDict *qdict)
>  
>  static void hmp_savevm(Monitor *mon, const QDict *qdict)
>  {
> -save_vmstate(mon, qdict_get_try_str(qdict, "name"));
> +save_vmstate(qdict_get_try_str(qdict, "name"));
>  }
>  
>  int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
> diff --git a/replay/replay-snapshot.c b/replay/replay-snapshot.c
> index 65e2d37..8cced46 100644
> --- a/replay/replay-snapshot.c
> +++ b/replay/replay-snapshot.c
> @@ -64,7 +64,7 @@ void replay_vmstate_init(void)
>  {
>  if (replay_snapshot) {
>  if (replay_mode == REPLAY_MODE_RECORD) {
> -if (save_vmstate(cur_mon, replay_snapshot) != 0) {
> +if (save_vmstate(replay_snapshot) != 0) {
>  error_report("Could not create snapshot for icount record");
>  exit(1);
>  }
>

Re: [Qemu-devel] [PATCH 05/19] migration: Export exec.c functions in its own file

2017-04-18 Thread Paolo Bonzini



On 17/04/2017 22:00, Juan Quintela wrote:
> Signed-off-by: Juan Quintela 
> ---
>  include/migration/exec.h  | 20 

Good cleanup, but please move it to migration/exec.h and include it with
"exec.h".  This is the standard for files that are not needed outside
that particular directory.

Paolo

>  include/migration/migration.h |  4 
>  migration/exec.c  |  2 +-
>  migration/migration.c |  1 +
>  4 files changed, 22 insertions(+), 5 deletions(-)
>  create mode 100644 include/migration/exec.h
> 
> diff --git a/include/migration/exec.h b/include/migration/exec.h
> new file mode 100644
> index 000..c97a7e9
> --- /dev/null
> +++ b/include/migration/exec.h
> @@ -0,0 +1,20 @@
> +/*
> + * QEMU live migration exec functions
> + *
> + * Copyright IBM, Corp. 2008
> + *
> + * Authors:
> + *  Anthony Liguori   
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef QEMU_MIGRATION_EXEC_H
> +#define QEMU_MIGRATION_EXEC_H
> +void exec_start_incoming_migration(const char *host_port, Error **errp);
> +
> +void exec_start_outgoing_migration(MigrationState *s, const char *host_port,
> +   Error **errp);
> +#endif
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 3f54a7c..eb0150b 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -194,10 +194,6 @@ void migration_tls_channel_connect(MigrationState *s,
>  
>  uint64_t migrate_max_downtime(void);
>  
> -void exec_start_incoming_migration(const char *host_port, Error **errp);
> -
> -void exec_start_outgoing_migration(MigrationState *s, const char *host_port, 
> Error **errp);
> -
>  void tcp_start_incoming_migration(const char *host_port, Error **errp);
>  
>  void tcp_start_outgoing_migration(MigrationState *s, const char *host_port, 
> Error **errp);
> diff --git a/migration/exec.c b/migration/exec.c
> index 2779c9c..1a2d5c2 100644
> --- a/migration/exec.c
> +++ b/migration/exec.c
> @@ -21,7 +21,7 @@
>  #include "qapi/error.h"
>  #include "qemu-common.h"
>  #include "migration/channel.h"
> -#include "migration/migration.h"
> +#include "migration/exec.h"
>  #include "io/channel-command.h"
>  #include "trace.h"
>  
> diff --git a/migration/migration.c b/migration/migration.c
> index 17d22eb..076c42a 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -21,6 +21,7 @@
>  #include "migration/machine.h"
>  #include "migration/state.h"
>  #include "migration/init.h"
> +#include "migration/exec.h"
>  #include "migration/migration.h"
>  #include "migration/qemu-file.h"
>  #include "sysemu/sysemu.h"
>

Re: [Qemu-devel] [PATCH 19/19] monitor: remove monitor parameter from save_vmstate

2017-04-18 Thread Paolo Bonzini



On 18/04/2017 11:44, Paolo Bonzini wrote:
> 
> 
> On 17/04/2017 22:00, Juan Quintela wrote:
>> load_vmstate() already use error_report, so be consistent.
> 
> Better: make both return Error* via an Error** parameter, and add
> 
> hmp_handle_error(mon, &err);
> 
> to hmp_savevm and error_report_err(err) on the loading side.

Not really, loadvm is also a monitor command (I was confusing it with
-incoming).  So it can use hmp_handle_error too.

Paolo

> Paolo
> 
>> Signed-off-by: Juan Quintela 
>> ---
>>  include/sysemu/sysemu.h  |  2 +-
>>  migration/savevm.c   | 16 
>>  monitor.c|  2 +-
>>  replay/replay-snapshot.c |  2 +-
>>  4 files changed, 11 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
>> index 146a0dc..d2582fa 100644
>> --- a/include/sysemu/sysemu.h
>> +++ b/include/sysemu/sysemu.h
>> @@ -75,7 +75,7 @@ void qemu_remove_exit_notifier(Notifier *notify);
>>  void qemu_add_machine_init_done_notifier(Notifier *notify);
>>  void qemu_remove_machine_init_done_notifier(Notifier *notify);
>>  
>> -int save_vmstate(Monitor *mon, const char *name);
>> +int save_vmstate(const char *name);
>>  int load_vmstate(const char *name);
>>  void hmp_delvm(Monitor *mon, const QDict *qdict);
>>  void hmp_info_snapshots(Monitor *mon, const QDict *qdict);
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index cbd7e0d..36a6002 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2055,7 +2055,7 @@ int qemu_loadvm_state(QEMUFile *f)
>>  return ret;
>>  }
>>  
>> -int save_vmstate(Monitor *mon, const char *name)
>> +int save_vmstate(const char *name)
>>  {
>>  BlockDriverState *bs, *bs1;
>>  QEMUSnapshotInfo sn1, *sn = &sn1, old_sn1, *old_sn = &old_sn1;
>> @@ -2069,8 +2069,8 @@ int save_vmstate(Monitor *mon, const char *name)
>>  AioContext *aio_context;
>>  
>>  if (!bdrv_all_can_snapshot(&bs)) {
>> -monitor_printf(mon, "Device '%s' is writable but does not "
>> -   "support snapshots.\n", bdrv_get_device_name(bs));
>> +error_report("Device '%s' is writable but does not support 
>> snapshots.",
>> + bdrv_get_device_name(bs));
>>  return ret;
>>  }
>>  
>> @@ -2087,7 +2087,7 @@ int save_vmstate(Monitor *mon, const char *name)
>>  
>>  bs = bdrv_all_find_vmstate_bs();
>>  if (bs == NULL) {
>> -monitor_printf(mon, "No block device can accept snapshots\n");
>> +error_report("No block device can accept snapshots");
>>  return ret;
>>  }
>>  aio_context = bdrv_get_aio_context(bs);
>> @@ -2096,7 +2096,7 @@ int save_vmstate(Monitor *mon, const char *name)
>>  
>>  ret = global_state_store();
>>  if (ret) {
>> -monitor_printf(mon, "Error saving global state\n");
>> +error_report("Error saving global state");
>>  return ret;
>>  }
>>  vm_stop(RUN_STATE_SAVE_VM);
>> @@ -2128,7 +2128,7 @@ int save_vmstate(Monitor *mon, const char *name)
>>  /* save the VM state */
>>  f = qemu_fopen_bdrv(bs, 1);
>>  if (!f) {
>> -monitor_printf(mon, "Could not open VM state file\n");
>> +error_report("Could not open VM state file");
>>  goto the_end;
>>  }
>>  ret = qemu_savevm_state(f, &local_err);
>> @@ -2141,8 +2141,8 @@ int save_vmstate(Monitor *mon, const char *name)
>>  
>>  ret = bdrv_all_create_snapshot(sn, bs, vm_state_size, &bs);
>>  if (ret < 0) {
>> -monitor_printf(mon, "Error while creating snapshot on '%s'\n",
>> -   bdrv_get_device_name(bs));
>> +error_report("Error while creating snapshot on '%s'",
>> + bdrv_get_device_name(bs));
>>  goto the_end;
>>  }
>>  
>> diff --git a/monitor.c b/monitor.c
>> index 2fca4fb..9e79a97 100644
>> --- a/monitor.c
>> +++ b/monitor.c
>> @@ -1856,7 +1856,7 @@ static void hmp_loadvm(Monitor *mon, const QDict 
>> *qdict)
>>  
>>  static void hmp_savevm(Monitor *mon, const QDict *qdict)
>>  {
>> -save_vmstate(mon, qdict_get_try_str(qdict, "name"));
>> +save_vmstate(qdict_get_try_str(qdict, "name"));
>>  }
>>  
>>  int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
>> diff --git a/replay/replay-snapshot.c b/replay/replay-snapshot.c
>> index 65e2d37..8cced46 100644
>> --- a/replay/replay-snapshot.c
>> +++ b/replay/replay-snapshot.c
>> @@ -64,7 +64,7 @@ void replay_vmstate_init(void)
>>  {
>>  if (replay_snapshot) {
>>  if (replay_mode == REPLAY_MODE_RECORD) {
>> -if (save_vmstate(cur_mon, replay_snapshot) != 0) {
>> +if (save_vmstate(replay_snapshot) != 0) {
>>  error_report("Could not create snapshot for icount record");
>>  exit(1);
>>  }
>>
>

Re: [Qemu-devel] define constant in .risu file

2017-04-18 Thread Peter Maydell

On 18 April 2017 at 05:20, G 3  wrote:
> Is there a way to define a constant in a .risu file? Something like this:
>
> my $upper_imm_limit = 500;

No - the risu file is just a bunch of patterns, with a
few fragments of executable code mixed in.

Currently constants defined in the perl source will
be visible to the perl fragments in the risu file,
but that's pretty ugly and one day I'll make it
a requirement that the perl explicitly exports the
functions/constants the risu fragments should see.

thanks
-- PMM

Re: [Qemu-devel] [PATCH v3 00/21] chardev clean-ups & tests (after 2.9)

2017-04-18 Thread Marc-André Lureau

Hi

On Mon, Apr 10, 2017 at 5:58 PM Marc-André Lureau <
marcandre.lur...@gmail.com> wrote:

> On Thu, Mar 16, 2017 at 10:23 AM Marc-André Lureau <
> marcandre.lur...@redhat.com> wrote:
>
>> Hi,
>>
>> The following series contains various patches:
>> - replace "chardevs" list for a /chardevs container object
>> - add a few read-only socket properties mainly useful for testing
>> - some chardev related clean-ups
>> - add various chardev tests
>>
>> This series is part of a larger refactoring series that I try to keep
>> up to date here: https://github.com/elmarco/qemu/commits/chrfe
>>
>>
> ping (most of the patches are still unreviewed)
>

Patches missing reviews: 5-9, 12, 16-21

thanks

v3:
>> - open code object_new_with_props() as suggest by Paolo
>> - added some r-b tags
>> - rebased
>>
>> v2:
>> - replaced root container unref with a TODO
>> - call object_unparent() directly instead of qemu_chr_delete()
>> - remove bad qcow2 NULL check removal
>> - rebased
>>
>> Marc-André Lureau (21):
>>   char: remove qemu_chr_be_generic_open
>>   mux: simplfy muxes_realize_done
>>   xen: use a better chardev type check
>>   container: don't leak container reference
>>   char: add a /chardevs container
>>   vl: add todo note about root container cleanup
>>   char: use /chardevs container instead of chardevs list
>>   char: remove qemu_chardev_add
>>   char: remove chardevs list
>>   char: useless NULL check
>>   char-socket: introduce update_disconnected_filename()
>>   char-socket: update local address after listen
>>   char-socket: add 'addr' property
>>   char-socket: add 'connected' property
>>   char-udp: flush as much buffer as possible
>>   tests: add alias check in /char/ringbuf
>>   tests: add /char/pipe test
>>   tests: add /char/file test
>>   tests: add /char/socket test
>>   tests: add /char/udp test
>>   tests: add /char/console test
>>
>>  chardev/char-mux.h  |   2 +-
>>  include/sysemu/char.h   |  10 --
>>  chardev/char-mux.c  |  11 +-
>>  chardev/char-pty.c  |   2 +-
>>  chardev/char-socket.c   |  46 +-
>>  chardev/char-udp.c  |  26 ++--
>>  chardev/char.c  | 148 --
>>  gdbstub.c   |   4 +-
>>  hw/usb/ccid-card-passthru.c |   2 +-
>>  hw/usb/redirect.c   |   2 +-
>>  net/vhost-user.c|   2 +-
>>  qom/container.c |   1 +
>>  tests/test-char.c   | 366
>> +++-
>>  tests/vhost-user-test.c |   2 +-
>>  ui/console.c|   2 +-
>>  ui/gtk.c|   2 +-
>>  vl.c|   1 +
>>  xen-common.c|   2 +-
>>  18 files changed, 504 insertions(+), 127 deletions(-)
>>
>> --
>> 2.12.0.191.gc5d8de91d
>>
>>
>> --
> Marc-André Lureau
>
-- 
Marc-André Lureau

Re: [Qemu-devel] What's the next QEMU version after 2.9 ? (or: when is a good point in time to get rid of old interfaces)

2017-04-18 Thread Stefan Hajnoczi

On Tue, Apr 11, 2017 at 02:53:00PM +0200, Markus Armbruster wrote:
> Stefan Hajnoczi  writes:
> 
> > On Mon, Mar 27, 2017 at 01:49:44PM +0100, Peter Maydell wrote:
> >> On 27 March 2017 at 13:01, Stefan Hajnoczi  wrote:
> >> > It would be nice to get rid of the legacy -net option in 3.0.0.  I have
> >> > added it and included pointers to loose ends.  I think this is doable
> >> > but will require some time to achieve.
> >> 
> >> What's the syntax for using -netdev with embedded network
> >> devices that you can't create with -device ?
> >> (We should document this at
> >> http://wiki.qemu-project.org/Documentation/Networking)
> >
> > I mentioned that on the wiki.  This isn't an option we can drop easily.
> > Work is necessary to make -netdev a complete replacement or to remove
> > the legacy stuff from -net.
> 
> Just like -device is a general way to plug in devices, replacing
> multiple special ways (-net, -drive, -usb, ...), we could use a general
> way to configure onboard devices.

I looked at the -device implementation to see if the bus= parameter
could be used to specify onboard device addresses, but I think you may
be right that we need a separate command-line argument for onboard
devices.


signature.asc
Description: PGP signature

Re: [Qemu-devel] [PATCH] timer.h: Provide monotonic time for ARM guests

2017-04-18 Thread Paolo Bonzini



On 17/04/2017 20:55, Pranith Kumar wrote:
>>> +/* ARM does not have a user-space readble cycle counter available.
>>> + * This is a compromise to get monotonically increasing time.
>>> + */
>>> +static inline int64_t cpu_get_host_ticks(void)
>>> +{
>>> +return get_clock();
>>> +}
>> This doesn't look like it should be ARM-specific. Is it
>> better than the current default implementation? If so,
>> why not make this the default implementation?
>
> I think we can do that...

Yes, it is always better for emulation accuracy.  It may be much slower,
depending on your OS (especially if get_clock requires a
user->kernel->user transition), but the current code is quite broken.

Paolo

>>> +
>>>  #else
>>>  /* The host CPU doesn't have an easily accessible cycle counter.
>>> Just return a monotonically increasing value.  This will be
>> The comment here says that our default is already a monotonically
>> increasing implementation -- is it wrong, or is there some other
>> advantage of your version?
> Comment #6 in the bug report by Laszlo Ersek explains why.
> Incrementing by 1 for approximately 55 ms is what is causing the
> divide by zero in grub.

Re: [Qemu-devel] [PATCH v9 1/9] memory: add section range info for IOMMU notifier

2017-04-18 Thread Peter Xu

On Tue, Apr 11, 2017 at 11:56:54AM +1000, David Gibson wrote:
> On Mon, Apr 10, 2017 at 03:09:50PM +0800, Peter Xu wrote:
> > On Mon, Apr 10, 2017 at 02:39:22PM +1000, David Gibson wrote:
> > > On Fri, Apr 07, 2017 at 06:59:07PM +0800, Peter Xu wrote:
> > > > In this patch, IOMMUNotifier.{start|end} are introduced to store section
> > > > information for a specific notifier. When notification occurs, we not
> > > > only check the notification type (MAP|UNMAP), but also check whether the
> > > > notified iova range overlaps with the range of specific IOMMU notifier,
> > > > and skip those notifiers if not in the listened range.
> > > > 
> > > > When removing an region, we need to make sure we removed the correct
> > > > VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> > > > 
> > > > This patch is solving the problem that vfio-pci devices receive
> > > > duplicated UNMAP notification on x86 platform when vIOMMU is there. The
> > > > issue is that x86 IOMMU has a (0, 2^64-1) IOMMU region, which is
> > > > splitted by the (0xfee0, 0xfeef) IRQ region. AFAIK
> > > > this (splitted IOMMU region) is only happening on x86.
> > > > 
> > > > This patch also helps vhost to leverage the new interface as well, so
> > > > that vhost won't get duplicated cache flushes. In that sense, it's an
> > > > slight performance improvement.
> > > > 
> > > > Suggested-by: David Gibson 
> > > > Reviewed-by: Eric Auger 
> > > > Reviewed-by: Michael S. Tsirkin 
> > > > Acked-by: Alex Williamson 
> > > > Signed-off-by: Peter Xu 
> > > > ---
> > > >  hw/vfio/common.c  | 12 +---
> > > >  hw/virtio/vhost.c | 10 --
> > > >  include/exec/memory.h | 19 ++-
> > > >  memory.c  |  9 +
> > > >  4 files changed, 44 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > index f3ba9b9..6b33b9f 100644
> > > > --- a/hw/vfio/common.c
> > > > +++ b/hw/vfio/common.c
> > > > @@ -478,8 +478,13 @@ static void 
> > > > vfio_listener_region_add(MemoryListener *listener,
> > > >  giommu->iommu_offset = section->offset_within_address_space -
> > > > section->offset_within_region;
> > > >  giommu->container = container;
> > > > -giommu->n.notify = vfio_iommu_map_notify;
> > > > -giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> > > > +llend = 
> > > > int128_add(int128_make64(section->offset_within_region),
> > > > +   section->size);
> > > > +llend = int128_sub(llend, int128_one());
> > > > +iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> > > > +IOMMU_NOTIFIER_ALL,
> > > > +section->offset_within_region,
> > > > +int128_get64(llend));
> > > 
> > > Seems to me it would make sense to put the fiddling around to convert
> > > the MemoryRegionSection into the necessary values would make sense to
> > > go inside iommu_notifier_init().
> > 
> > But will we always get one MemoryRegionSection if we are not in any of
> > the region_{add|del} callback? E.g., what if we want to init an IOMMU
> > notifier that covers just the whole IOMMU region range?
> 
> I suppose so.  It's just the only likely users of the interface I can
> see will be always doing this from region_{add,del}.
> 
> > Considering above, I would still slightly prefer current interface.
> 
> Ok, well my opinion on the matter isn't terribly strong.

Hi, David,

Sorry to respond late (so that context might be lost). Just want to
make sure that you are okay with current patch and interface, right?

I think Marcel is going to merge it if np, and I would like to have
your confirmation on this before the merge. Thanks!

-- 
Peter Xu

Re: [Qemu-devel] QEMU build breakage on ARM against Xen 4.9 caused by libxendevicemodel

2017-04-18 Thread Paul Durrant

> -Original Message-
> From: Stefano Stabellini [mailto:sstabell...@kernel.org]
> Sent: 15 April 2017 01:40
> To: Stefano Stabellini 
> Cc: Paul Durrant ; qemu-devel@nongnu.org;
> Anthony Perard ; Wei Liu
> ; jgr...@suse.com; julien.gr...@arm.com; xen-
> de...@lists.xenproject.org
> Subject: Re: QEMU build breakage on ARM against Xen 4.9 caused by
> libxendevicemodel
> 
> On Fri, 14 Apr 2017, Stefano Stabellini wrote:
> > Hi Paul,
> >
> > The following commit in my qemu "next" branch breaks the build on arm
> > and arm64:
> >
> > commit 670271647ad15e9d937ced7a72c892349c709216
> > Author: Paul Durrant 
> > Date:   Tue Mar 7 10:55:34 2017 +
> >
> > xen: use libxendevicemodel when available
> >
> > See the appended build log. Sorry for not realizing it sooner.
> 
> As I imagined, this bug is easy to solve. It is reproducible on x86 too,
> if you pass -DXC_WANT_COMPAT_DEVICEMODEL_API=1 to configure and
> forcefully disable Xen 4.9 detection in the configure script.
> 
> If QEMU detects xen_ctrl_version = 480, the build will fail against Xen
> 4.9, even when -DXC_WANT_COMPAT_DEVICEMODEL_API=1 is specified.
> 
> The appended patch solves the problem. However, Xen 4.9 detection and
> compilation remain broken.

Ok, that fix looks fine to me.

  Paul

> 
> ---
> 
> diff --git a/include/hw/xen/xen_common.h
> b/include/hw/xen/xen_common.h
> index 274accc..b1f5f53 100644
> --- a/include/hw/xen/xen_common.h
> +++ b/include/hw/xen/xen_common.h
> @@ -9,7 +9,6 @@
>  #undef XC_WANT_COMPAT_EVTCHN_API
>  #undef XC_WANT_COMPAT_GNTTAB_API
>  #undef XC_WANT_COMPAT_MAP_FOREIGN_API
> -#undef XC_WANT_COMPAT_DEVICEMODEL_API
> 
>  #include 
>  #include 
> @@ -154,6 +153,7 @@ static inline int xendevicemodel_set_mem_type(
> 
>  #else /* CONFIG_XEN_CTRL_INTERFACE_VERSION >= 490 */
> 
> +#undef XC_WANT_COMPAT_DEVICEMODEL_API
>  #include 
> 
>  #endif

Re: [Qemu-devel] [PATCH 18/19] monitor: move hmp_savevm() to monitor.c

2017-04-18 Thread Juan Quintela

Paolo Bonzini  wrote:
> On 17/04/2017 22:00, Juan Quintela wrote:
>> hmp_loadvm is already there, so be consistent.
>
> Can you move both to hmp.c instead?

Discussed on irc with David.  On my tree is already on hmp.c.

Thanks, Juan.

Re: [Qemu-devel] [PATCH 19/19] monitor: remove monitor parameter from save_vmstate

2017-04-18 Thread Juan Quintela

Paolo Bonzini  wrote:
> On 18/04/2017 11:44, Paolo Bonzini wrote:
>> 
>> 
>> On 17/04/2017 22:00, Juan Quintela wrote:
>>> load_vmstate() already use error_report, so be consistent.
>> 
>> Better: make both return Error* via an Error** parameter, and add
>> 
>> hmp_handle_error(mon, &err);
>> 
>> to hmp_savevm and error_report_err(err) on the loading side.
>
> Not really, loadvm is also a monitor command (I was confusing it with
> -incoming).  So it can use hmp_handle_error too.

ok.  I did't kenw about that one.

Thanks.

Re: [Qemu-devel] [PATCH 05/19] migration: Export exec.c functions in its own file

2017-04-18 Thread Juan Quintela

Paolo Bonzini  wrote:
> On 17/04/2017 22:00, Juan Quintela wrote:
>> Signed-off-by: Juan Quintela 
>> ---
>>  include/migration/exec.h  | 20 
>
> Good cleanup, but please move it to migration/exec.h and include it with
> "exec.h".  This is the standard for files that are not needed outside
> that particular directory.

Aha, thanks.  I was thinking about creating something "internal", but
that makes sense.

Later, Juan.

Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure

2017-04-18 Thread Stefan Hajnoczi

On Tue, Apr 11, 2017 at 02:34:26PM +0800, Haozhong Zhang wrote:
> On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > > This patch series constructs the flush hint address structures for
> > > nvdimm devices in QEMU.
> > > 
> > > It's of course not for 2.9. I send it out early in order to get
> > > comments on one point I'm uncertain (see the detailed explanation
> > > below). Thanks for any comments in advance!
> > > Background
> > > ---
> > 
> > Extra background:
> > 
> > Flush Hint Addresses are necessary because:
> > 
> > 1. Some hardware configurations may require them.  In other words, a
> >cache flush instruction is not enough to persist data.
> > 
> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >metadata changes).
> > 
> > Without Flush Hint Addresses only some NVDIMM configurations actually
> > guarantee data persistence.
> > 
> > > Flush hint address structure is a substructure of NFIT and specifies
> > > one or more addresses, namely Flush Hint Addresses. Software can write
> > > to any one of these flush hint addresses to cause any preceding writes
> > > to the NVDIMM region to be flushed out of the intervening platform
> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> > 
> > Do you have performance data?  I'm concerned that Flush Hint Address
> > hardware interface is not virtualization-friendly.
> 
> Some performance data below.
> 
> Host HW config:
>   CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz x 2 sockets w/ HT enabled
>   MEM: 64 GB
> 
>   As I don't have NVDIMM hardware, so I use files in ext4 fs on a
>   normal SATA SSD as the back storage of vNVDIMM.
> 
> 
> Host SW config:
>   Kernel: 4.10.1
>   QEMU: commit ea2afcf with this patch series applied.
> 
> 
> Guest config:
>   For flush hint enabled case, the following QEMU options are used
> -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
> -m 4G,slots=4,maxmem=128G \
> -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
> -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
> -hda GUEST_DISK_IMG -serial pty
> 
>   For flush hint disabled case, the following QEMU options are used
> -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
> -m 4G,slots=4,maxmem=128G \
> -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
> -device nvdimm,id=nv1,memdev=mem1 \
> -hda GUEST_DISK_IMG -serial pty
> 
>   nvm-img used above is created in ext4 fs on the host SSD by
> dd if=/dev/zero of=nvm-img bs=1G count=8
> 
>   Guest kernel: 4.11.0-rc4
> 
> 
> Benchmark in guest:
>   mkfs.ext4 /dev/pmem0
>   mount -o dax /dev/pmem0 /mnt
>   dd if=/dev/zero of=/mnt/data bs=1G count=7 # warm up EPT mapping
>   rm /mnt/data   #
>   dd if=/dev/zero of=/mnt/data bs=1G count=7
> 
>   and record the write speed reported by the last 'dd' command.
> 
> 
> Result:
>   - Flush hint disabled
> Vary from 161 MB/s to 708 MB/s, depending on how many fs/device
> flush operations are performed on the host side during the guest
> 'dd'.
> 
>   - Flush hint enabled
>   
> Vary from 164 MB/s to 546 MB/s, depending on how long fsync() in
> QEMU takes. Usually, there is at least one fsync() during one 'dd'
> command that takes several seconds (the worst one takes 39 s).
> 
> To be worse, during those long host-side fsync() operations, guest
> kernel complained stalls.

I'm surprised that maximum throughput was 708 MB/s.  The guest is
DAX-aware and the write(2) syscall is a memcpy.  I expected higher
numbers without flush hints.

Also strange that throughput varied so greatly.  A benchmark that varies
4x is not useful since it's hard to tell if anything <4x indicates a
significant performance difference.  In other words, the noise is huge!

What results do you get on the host?

Dan: Any comments on this benchmark and is there a recommended way to
benchmark NVDIMM?

> Some thoughts:
> 
> - If the non-NVDIMM hardware is used as the back store of vNVDIMM,
>   QEMU may perform the host-side flush operations asynchronously with
>   VM, which will not block VM too long but sacrifice the durability
>   guarantee.
> 
> - If physical NVDIMM is used as the back store and ADR is supported on
>   the host, QEMU can rely on ADR to guarantee the data durability and
>   will not need to emulate flush hint for guest.
> 
> - If physical NVDIMM is used as the back store and ADR is not
>   supported on the host, QEMU will still need to emulate flush hint
>   for guest and need to use a fast approach other than fsync() to
>   trigger writes to host flush hint.
> 
>   Could kernel expose an interface to allow the userland (i.e. QEMU in
>   this case) to directly write to the flush hint of a NVDIMM region?
> 
> 
> Haozhong
> 
> > 
> > In Linux drivers/nvdimm/region

Re: [Qemu-devel] DMG chunk size independence

2017-04-18 Thread Ashijeet Acharya

On Tue, Apr 18, 2017 at 01:59 John Snow  wrote:

>
>
> On 04/15/2017 04:38 AM, Ashijeet Acharya wrote:
> > Hi,
> >
> > Some of you are already aware but for the benefit of the open list,
> > this mail is regarding the task mentioned
> > Here -> http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence
> >
>
> OK, so the idea here is that we should be able to read portions of
> chunks instead of buffering entire chunks, because chunks can be quite
> large and an unverified DMG file should not be able to cause QEMU to
> allocate large portions of memory.
>
> Currently, QEMU has a maximum chunk size and it will not open DMG files
> that have chunks that exceed that size, correct?
>

Yes, it has an upper limit 64MiB at the moment and refuses to cater
anything beyond that.


> > I had a chat with Fam regarding this and he suggested a solution where
> > we fix the output buffer size to a max of say "64K" and keep inflating
> > until we reach the end of the input stream. We extract the required
> > data when we enter the desired range and discard the rest. Fam however
> > termed this as only a  "quick fix".
> >
>
> So it looks like your problem now is how to allow reads to subsets while
> tolerating zipped chunks, right?


Yes

>
>
> We can't predict where the data we want is going to appear mid-stream,
> but I'm not that familiar with the DMG format, so what does the data
> look like and how do we seek to it in general?


If I understood correctly what you meant;
The data is divided into three types
a) Uncompressed
b) zlib compressed
c) bz2 compressed

All these chunks appear in random order depending on the file.

ATM we are decompressing the whole chunk in a buffer and start reading
sector by sector until we have what we need or we run out of output in that
chunk.

If you meant something else there, let me know.


>
> We've got the mish blocks stored inside of the ResouceFork (right?), and


I haven't understood yet what a ResourceFork is but its safe to say from
what I know that mish blocks do appear inside resource forks and contain
all the required info about the chunks.

>
> each mish block contains one-or-more chunk records. So given any offset
> into the virtual file, we at least know which chunk it belongs to, but
> thanks to zlib, we can't just read the bits we care about.
>
> (Correct so far?)


Absolutely


>
> > The ideal fix would obviously be if we can somehow predict the exact
> > location inside the compressed stream relative to the desired offset
> > in the output decompressed stream, such as a specific sector in a
> > chunk. Unfortunately this is not possible without doing a first pass
> > over the decompressed stream as answered on the zlib FAQ page
> > Here -> http://zlib.net/zlib_faq.html#faq28
> >
>
> Yeah, I think you need to start reading the data from the beginning of
> each chunk -- but it depends on the zlib data. It COULD be broken up
> into different pieces, but there's no way to know without scanning it in
> advance.


Hmm, that's the real issue I am facing. MAYBE break it like

a) inflate till the required starting offset in one go
b) save the access point and discard the undesired data
c) proceed by inflating one sector at a time and stop if we hit chunk's end
or request's end


>
> (Unrelated:
>
> Do we have a zlib format driver?
>
> It might be cute to break up such DMG files and offload zlib
> optimization to another driver, like this:
>
> [dmg]-->[zlib]-->[raw]
>
> And we could pretend that each zlib chunk in this file is virtually its
> own zlib "file" and access it with modified offsets as appropriate.
>
> Any optimizations we make could just apply to this driver.
>
> [anyway...])


Are you thinking about implementing zlib just like we have bz2 implemented
currently?


>
>
> Pre-scanning for these sync points is probably a waste of time as
> there's no way to know (*I THINK*) how big each sync-block would be
> decompressed, so there's still no way this helps you seek within a
> compressed block...
>

I think we can predict that actually, because we know the number of sectors
present in that chunk and each sector's size too. So...


> > AFAICT after reading the zran.c example in zlib, the above mentioned
> > ideal fix would ultimately lead us to decompress the whole chunk in
> > steps at least once to maintain an access point lookup table. This
> > solution is better if we get several random access requests over
> > different read requests, otherwise it ends up being equal to the fix
> > suggested by Fam plus some extra effort needed in building and
> > maintaining access points.
> >
>
> Yeah, probably not worth it overall... I have to imagine that most uses
> of DMG files are for iso-like cases for installers where accesses are
> going to be either sequential (or mostly sequential) and most data will
> not be read twice.


Exactly, if we are sure that there will be no requests to read the same
data twice, its completely a wasted effort. But I am not aware of the use

[Qemu-devel] [PATCH 1/1] qemu-img: wait for convert coroutines to complete

2017-04-18 Thread Denis V. Lunev

From: Anton Nefedov 

We should wait for other coroutines on error path, i.e. one of coroutines
terminates with i/o error, before cleaning the common structures. In the
other case we would crash in a lot of different places. This behaviour
was introduced by commit 2d9187bc65.

Signed-off-by: Anton Nefedov 
Signed-off-by: Denis V. Lunev 
CC: Peter Lieven 
CC: Kevin Wolf 
CC: Max Reitz 
---
 qemu-img.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index b220cf7..3b04c5f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1899,7 +1899,7 @@ static int convert_do_copy(ImgConvertState *s)
 qemu_coroutine_enter(s->co[i]);
 }
 
-while (s->ret == -EINPROGRESS) {
+while (s->running_coroutines) {
 main_loop_wait(false);
 }
 
-- 
2.7.4

Re: [Qemu-devel] DMG chunk size independence

2017-04-18 Thread Kevin Wolf

Am 15.04.2017 um 10:38 hat Ashijeet Acharya geschrieben:
> Hi,
> 
> Some of you are already aware but for the benefit of the open list,
> this mail is regarding the task mentioned
> Here -> http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence
> 
> I had a chat with Fam regarding this and he suggested a solution where
> we fix the output buffer size to a max of say "64K" and keep inflating
> until we reach the end of the input stream. We extract the required
> data when we enter the desired range and discard the rest. Fam however
> termed this as only a  "quick fix".

You can cache the current position for a very easy optimisation of
sequential reads.

> The ideal fix would obviously be if we can somehow predict the exact
> location inside the compressed stream relative to the desired offset
> in the output decompressed stream, such as a specific sector in a
> chunk. Unfortunately this is not possible without doing a first pass
> over the decompressed stream as answered on the zlib FAQ page
> Here -> http://zlib.net/zlib_faq.html#faq28
> 
> AFAICT after reading the zran.c example in zlib, the above mentioned
> ideal fix would ultimately lead us to decompress the whole chunk in
> steps at least once to maintain an access point lookup table. This
> solution is better if we get several random access requests over
> different read requests, otherwise it ends up being equal to the fix
> suggested by Fam plus some extra effort needed in building and
> maintaining access points.

I'm not sure if it's worth the additional effort.

If we take a step back, what the dmg driver is used for in practice is
converting images into a different format, that is strictly sequential
I/O. The important thing is that images with large chunk sizes can be
read at all, performance (with sequential I/O) is secondary, and
performance with random I/O is almost irrelevant, as far as I am
concerned.

Kevin

Re: [Qemu-devel] [PATCH v2 0/7] VT-d: PT (passthrough) mode support and misc fixes

2017-04-18 Thread Paolo Bonzini



On 17/04/2017 13:32, Peter Xu wrote:
> v2:
> - resending v1 since my patch script had a regression, which failed to
>   send the patches to the list... trying again. sorry for the noise!
> 
> This series is for 2.10, and based on following series:
> 
> [PATCH v8 0/9] VT-d: vfio enablement and misc enhances
> 
> This series add support for per-device passthrough mode for VT-d
> emulation, along with some tweaks on existing codes.
> 
> Patches 1-2 are memory related cleanups.

These two:

Acked-by: Paolo Bonzini

[Qemu-devel] [PATCH for-2.9-rc5 v3] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Fam Zheng

During block job completion, nothing is preventing
block_job_defer_to_main_loop_bh from being called in a nested
aio_poll(), which is a trouble, such as in this code path:

qmp_block_commit
  commit_active_start
bdrv_reopen
  bdrv_reopen_multiple
bdrv_reopen_prepare
  bdrv_flush
aio_poll
  aio_bh_poll
aio_bh_call
  block_job_defer_to_main_loop_bh
stream_complete
  bdrv_reopen

block_job_defer_to_main_loop_bh is the last step of the stream job,
which should have been "paused" by the bdrv_drained_begin/end in
bdrv_reopen_multiple, but it is not done because it's in the form of a
main loop BH.

Similar to why block jobs should be paused between drained_begin and
drained_end, BHs they schedule must be excluded as well.  To achieve
this, this patch forces draining the BH in BDRV_POLL_WHILE.

Also because the BH in question can do bdrv_unref and child replacing,
protect @bs carefully to avoid use-after-free.

As a side effect this fixes a hang in block_job_detach_aio_context
during system_reset when a block job is ready:

#0  0x55aa79f3 in bdrv_drain_recurse
#1  0x55aa825d in bdrv_drained_begin
#2  0x55aa8449 in bdrv_drain
#3  0x55a9c356 in blk_drain
#4  0x55aa3cfd in mirror_drain
#5  0x55a66e11 in block_job_detach_aio_context
#6  0x55a62f4d in bdrv_detach_aio_context
#7  0x55a63116 in bdrv_set_aio_context
#8  0x55a9d326 in blk_set_aio_context
#9  0x557e38da in virtio_blk_data_plane_stop
#10 0x559f9d5f in virtio_bus_stop_ioeventfd
#11 0x559fa49b in virtio_bus_stop_ioeventfd
#12 0x559f6a18 in virtio_pci_stop_ioeventfd
#13 0x559f6a18 in virtio_pci_reset
#14 0x559139a9 in qdev_reset_one
#15 0x55916738 in qbus_walk_children
#16 0x55913318 in qdev_walk_children
#17 0x55916738 in qbus_walk_children
#18 0x559168ca in qemu_devices_reset
#19 0x5581fcbb in pc_machine_reset
#20 0x558a4d96 in qemu_system_reset
#21 0x5577157a in main_loop_should_exit
#22 0x5577157a in main_loop
#23 0x5577157a in main

The rationale is that the loop in block_job_detach_aio_context cannot
make any progress in pausing/completing the job, because bs->in_flight
is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
BH. With this patch, it does.

Reported-by: Jeff Cody 
Signed-off-by: Fam Zheng 

---

v3: Report all aio_poll() progress as waited_. [Kevin]

v2: Do the BH poll in BDRV_POLL_WHILE to cover bdrv_drain_all_begin as
well. [Kevin]
---
 block/io.c| 10 +++---
 include/block/block.h | 22 ++
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/block/io.c b/block/io.c
index 8706bfa..a472157 100644
--- a/block/io.c
+++ b/block/io.c
@@ -158,7 +158,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
 
 static bool bdrv_drain_recurse(BlockDriverState *bs)
 {
-BdrvChild *child;
+BdrvChild *child, *tmp;
 bool waited;
 
 waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
@@ -167,8 +167,12 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
 bs->drv->bdrv_drain(bs);
 }
 
-QLIST_FOREACH(child, &bs->children, next) {
-waited |= bdrv_drain_recurse(child->bs);
+QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
+BlockDriverState *bs = child->bs;
+assert(bs->refcnt > 0);
+bdrv_ref(bs);
+waited |= bdrv_drain_recurse(bs);
+bdrv_unref(bs);
 }
 
 return waited;
diff --git a/include/block/block.h b/include/block/block.h
index 97d4330..5ddc0cf 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -381,12 +381,13 @@ void bdrv_drain_all(void);
 
 #define BDRV_POLL_WHILE(bs, cond) ({   \
 bool waited_ = false;  \
+bool busy_ = true; \
 BlockDriverState *bs_ = (bs);  \
 AioContext *ctx_ = bdrv_get_aio_context(bs_);  \
 if (aio_context_in_iothread(ctx_)) {   \
-while ((cond)) {   \
-aio_poll(ctx_, true);  \
-waited_ = true;\
+while ((cond) || busy_) {  \
+busy_ = aio_poll(ctx_, (cond));\
+waited_ |= !!(cond) | busy_;   \
 }  \
 } else {   \
 assert(qemu_get_current_aio_context() ==   \
@@ -398,11 +399,16 @@ void bdrv_drain_all(void);
  */

Re: [Qemu-devel] [RFC 4/7] pci: Change pci_register_bus() 'parent' parameter to PCIHostState

2017-04-18 Thread Cornelia Huck

On Mon, 17 Apr 2017 18:59:13 -0300
Eduardo Habkost  wrote:

> pci_register_bus() already requires the 'parent' argument to be a
> PCI_HOST_BRIDGE object. Change the parameter type to reflect that.
> 
> Cc: Richard Henderson 
> Cc: Aurelien Jarno 
> Cc: Yongbok Kim 
> Cc: Alexander Graf 
> Cc: Scott Wood 
> Cc: Paul Burton 
> Cc: "Michael S. Tsirkin" 
> Cc: Marcel Apfelbaum 
> Cc: David Gibson 
> Cc: Cornelia Huck 
> Cc: Christian Borntraeger 
> Cc: qemu-...@nongnu.org
> Signed-off-by: Eduardo Habkost 
> ---
>  include/hw/pci/pci.h  | 2 +-
>  hw/alpha/typhoon.c| 2 +-
>  hw/mips/gt64xxx_pci.c | 2 +-
>  hw/pci-host/apb.c | 2 +-
>  hw/pci-host/bonito.c  | 2 +-
>  hw/pci-host/gpex.c| 2 +-
>  hw/pci-host/grackle.c | 2 +-
>  hw/pci-host/ppce500.c | 2 +-
>  hw/pci-host/uninorth.c| 4 ++--
>  hw/pci-host/xilinx-pcie.c | 2 +-
>  hw/pci/pci.c  | 4 ++--
>  hw/ppc/ppc4xx_pci.c   | 2 +-
>  hw/ppc/spapr_pci.c| 2 +-
>  hw/s390x/s390-pci-bus.c   | 2 +-
>  hw/sh4/sh_pci.c   | 2 +-
>  15 files changed, 17 insertions(+), 17 deletions(-)

s390 parts:

Acked-by: Cornelia Huck

Re: [Qemu-devel] [RFC 5/7] pci: Set phb->bus inside pci_register_bus()

2017-04-18 Thread Cornelia Huck

On Mon, 17 Apr 2017 18:59:14 -0300
Eduardo Habkost  wrote:

> Every single caller of of pci_register_bus() saves the return value in
> phb->bus. Do that inside pci_register_bus() to avoid code duplication
> and make it harder to break.
> 
> Most (but not all) conversions done using the following Coccinelle script:
> 
>   @@
>   identifier b;
>   expression phb;
>   @@
>   -b = pci_register_bus(phb, ARGS);
>   +phb->bus = pci_register_bus(phb, ARGS);
>...
>   -phb->bus = b;
> 
>   @@
>   expression phb;
>   expression list ARGS;
>   @@
>   -phb->bus = pci_register_bus(phb, ARGS);
>   +pci_register_bus(phb, ARGS);
> 
> Cc: Richard Henderson 
> Cc: Aurelien Jarno 
> Cc: Yongbok Kim 
> Cc: Alexander Graf 
> Cc: Scott Wood 
> Cc: Paul Burton 
> Cc: "Michael S. Tsirkin" 
> Cc: Marcel Apfelbaum 
> Cc: David Gibson 
> Cc: Cornelia Huck 
> Cc: Christian Borntraeger 
> Cc: qemu-...@nongnu.org
> Signed-off-by: Eduardo Habkost 
> ---
>  include/hw/pci/pci.h  | 12 ++--
>  hw/alpha/typhoon.c| 10 +-
>  hw/mips/gt64xxx_pci.c |  9 +++--
>  hw/pci-host/apb.c |  7 ++-
>  hw/pci-host/bonito.c  |  7 +++
>  hw/pci-host/gpex.c|  5 ++---
>  hw/pci-host/grackle.c |  9 ++---
>  hw/pci-host/ppce500.c |  8 
>  hw/pci-host/uninorth.c| 18 ++
>  hw/pci-host/xilinx-pcie.c |  6 +++---
>  hw/pci/pci.c  | 14 +++---
>  hw/ppc/ppc4xx_pci.c   |  8 
>  hw/ppc/spapr_pci.c| 10 +-
>  hw/s390x/s390-pci-bus.c   | 10 +-
>  hw/sh4/sh_pci.c   |  9 +++--
>  15 files changed, 60 insertions(+), 82 deletions(-)

s390 parts:

Acked-by: Cornelia Huck

Re: [Qemu-devel] qemu memory manage question

2017-04-18 Thread jack.chen

Hello, now I am confused about some structures in qemu related to
memory management,they are
MemoryRegion、AddressSpace、FlatView、FlatRange、MemoryRegionSection、RAMList、RAMBlock、KVMSlot、kvm_userspace_memory_region,who
can tell me the concrete connection among these structures.
thanks a lot!

2017-04-18 10:25 GMT+08:00 jack.chen :
> Thanks very much!!
>
> 2017-04-17 19:19 GMT+08:00 李强 :
>>
>>
>>> -Original Message-
>>> From: Qemu-devel
>>> [mailto:qemu-devel-bounces+liqiang6-s=360...@nongnu.org] On Behalf Of
>>> jack.chen
>>> Sent: Monday, April 17, 2017 6:56 PM
>>> To: Peter Xu
>>> Cc: qemu
>>> Subject: Re: [Qemu-devel] qemu memory manage question
>>>
>>> Thanks,from the path you have list to me,it can be  well explained,but
>>> according to the source code,in the end of kvm_init,kvm_memory_listener and
>>> kvm_io_listener were registered by memory_listener_register(),and  in the
>>> end of
>>> memory_listener_register(),listener_add_address_space() was called for each
>>> address_space,so the listener->region_add was executed then.I do not know
>>> what mistake I have made,can you explain it to me ?? thank you very much!
>>>
>>
>> They are callbacks.
>> Every change of memory topology will call these listeners, add 
>> subregion(Peter's example),
>> modify the property of memory, create address space for example.
>>
>> Thanks.
>>
>> --
>> Li Qiang /the Gear Team, Qihoo 360 Inc
>>
>>
>>> 2017-04-17 18:26 GMT+08:00 Peter Xu :
>>> > On Mon, Apr 17, 2017 at 06:09:11PM +0800, jack.chen wrote:
>>> >> hello,I have some questions about memory allocation in qemu for
>>> >> virtual machine.I found when configure_accelerator function was
>>> >> called ,memory slots  were registered to KVM,but at that time
>>> >> address_space have not been initialized and ram have not been
>>> >> allocated,it is really confused me,Thanks a lot!!
>>> >
>>> > Here's how I understand it...
>>> >
>>> > configure_accelerator() does not register memory slots in KVM.
>>> > Instead, it registers memory listeners. See
>>> > kvm_memory_listener_register(), especially:
>>> >
>>> > kml->listener.region_add = kvm_region_add;
>>> >
>>> > That's the hook function to be called when there are new memory region
>>> > added to the system.
>>> >
>>> > Further, when RAM is initialzed, it'll modify the address space layout
>>> > of system_memory, and the registered listener of KVM (kvm_region_add)
>>> > will be invoked, it'll further sync with kvm. It should be in the
>>> > following path if you break at kvm_region_add in gdb:
>>> >
>>> > #0  0x557ba13a in kvm_region_add (listener=0x568330c0,
>>> > section=0x7fffd310) at /root/git/qemu/kvm-all.c:859
>>> > #1  0x557c1910 in address_space_update_topology_pass
>>> > (as=0x5629e240 ,
>>> old_view=0x567a7090,
>>> > new_view=0x568d3460, adding=true) at /root/git/qemu/memory.c:871
>>> > #2  0x557c19f3 in address_space_update_topology
>>> > (as=0x5629e240 ) at
>>> > /root/git/qemu/memory.c:886
>>> > #3  0x557c1b41 in memory_region_transaction_commit () at
>>> > /root/git/qemu/memory.c:922
>>> > #4  0x557c4bfd in memory_region_update_container_subregions
>>> > (subregion=0x568d2fc0) at /root/git/qemu/memory.c:2075
>>> > #5  0x557c4c64 in memory_region_add_subregion_common
>>> > (mr=0x567a5830, offset=0, subregion=0x568d2fc0) at
>>> > /root/git/qemu/memory.c:2085
>>> > #6  0x557c4ca0 in memory_region_add_subregion
>>> > (mr=0x567a5830, offset=0, subregion=0x568d2fc0) at
>>> > /root/git/qemu/memory.c:2093
>>> > #7  0x5583fd68 in pc_memory_init (pcms=0x567a4100,
>>> > system_memory=0x567a5830, rom_memory=0x568d21a0,
>>> > ram_memory=0x7fffd550) at /root/git/qemu/hw/i386/pc.c:1383
>>> > #8  0x55847363 in pc_q35_init (machine=0x567a4100) at
>>> > /root/git/qemu/hw/i386/pc_q35.c:147
>>> > #9  0x55847cac in pc_init_v2_9 (machine=0x567a4100) at
>>> > /root/git/qemu/hw/i386/pc_q35.c:310
>>> > #10 0x558f7cf8 in main (argc=11, argv=0x7fffda78,
>>> > envp=0x7fffdad8) at /root/git/qemu/vl.c:4557
>>> >
>>> > Hope this helps. Thanks.
>>> >
>>> > --
>>> > Peter Xu
>>

Re: [Qemu-devel] [PATCH for-2.9-rc5 v3] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Kevin Wolf

Am 18.04.2017 um 12:39 hat Fam Zheng geschrieben:
> During block job completion, nothing is preventing
> block_job_defer_to_main_loop_bh from being called in a nested
> aio_poll(), which is a trouble, such as in this code path:
> 
> qmp_block_commit
>   commit_active_start
> bdrv_reopen
>   bdrv_reopen_multiple
> bdrv_reopen_prepare
>   bdrv_flush
> aio_poll
>   aio_bh_poll
> aio_bh_call
>   block_job_defer_to_main_loop_bh
> stream_complete
>   bdrv_reopen
> 
> block_job_defer_to_main_loop_bh is the last step of the stream job,
> which should have been "paused" by the bdrv_drained_begin/end in
> bdrv_reopen_multiple, but it is not done because it's in the form of a
> main loop BH.
> 
> Similar to why block jobs should be paused between drained_begin and
> drained_end, BHs they schedule must be excluded as well.  To achieve
> this, this patch forces draining the BH in BDRV_POLL_WHILE.
> 
> Also because the BH in question can do bdrv_unref and child replacing,
> protect @bs carefully to avoid use-after-free.
> 
> As a side effect this fixes a hang in block_job_detach_aio_context
> during system_reset when a block job is ready:
> 
> #0  0x55aa79f3 in bdrv_drain_recurse
> #1  0x55aa825d in bdrv_drained_begin
> #2  0x55aa8449 in bdrv_drain
> #3  0x55a9c356 in blk_drain
> #4  0x55aa3cfd in mirror_drain
> #5  0x55a66e11 in block_job_detach_aio_context
> #6  0x55a62f4d in bdrv_detach_aio_context
> #7  0x55a63116 in bdrv_set_aio_context
> #8  0x55a9d326 in blk_set_aio_context
> #9  0x557e38da in virtio_blk_data_plane_stop
> #10 0x559f9d5f in virtio_bus_stop_ioeventfd
> #11 0x559fa49b in virtio_bus_stop_ioeventfd
> #12 0x559f6a18 in virtio_pci_stop_ioeventfd
> #13 0x559f6a18 in virtio_pci_reset
> #14 0x559139a9 in qdev_reset_one
> #15 0x55916738 in qbus_walk_children
> #16 0x55913318 in qdev_walk_children
> #17 0x55916738 in qbus_walk_children
> #18 0x559168ca in qemu_devices_reset
> #19 0x5581fcbb in pc_machine_reset
> #20 0x558a4d96 in qemu_system_reset
> #21 0x5577157a in main_loop_should_exit
> #22 0x5577157a in main_loop
> #23 0x5577157a in main
> 
> The rationale is that the loop in block_job_detach_aio_context cannot
> make any progress in pausing/completing the job, because bs->in_flight
> is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
> BH. With this patch, it does.
> 
> Reported-by: Jeff Cody 
> Signed-off-by: Fam Zheng 

Reviewed-by: Kevin Wolf

Re: [Qemu-devel] [PATCH 0/3 for 2.10] migration/i386 cleanup

2017-04-18 Thread Dr. David Alan Gilbert

* Eduardo Habkost (ehabk...@redhat.com) wrote:
> On Wed, Apr 05, 2017 at 08:00:21PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" 
> > 
> > Hi,
> >   This removes some qemu_get_ and qemu_put_ use from i386/machine.c
> > and cleans out some very old code.
> >   It breaks migration compatibility from prior to 0.12
> > for i386 and prior to ~0.14 for non-softfloat - it doesn't alter the
> > machine type, it just removes some old stream features.
> > 
> > (Has anyone got a good test of FP migration to make sure
> > I've not broken the FP/mmx/etc cases?)
> 
> It would be nice if we could allow kvm-unit-tests test cases
> trigger a migration operation at a specific instruction. I assume
> we don't have an existing mechanism that could be used for that?

Not that I know of, do we have any type of breakpoint type thing
we could use?

Dave

> -- 
> Eduardo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH 1/3] migration/i386: Remove old non-softfloat 64bit FP support

2017-04-18 Thread Dr. David Alan Gilbert

* Eduardo Habkost (ehabk...@redhat.com) wrote:
> On Wed, Apr 05, 2017 at 08:00:22PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" 
> > 
> > Long long ago, we used to support storing the x86 FP registers in
> > a 64bit format.
> > 
> > Then c31da136a0bf8caad70c348f5ffc283206e9c7fc in v0.14-rc0 removed
> > the last support for writing that in the migration format.
> > Even before that, it was only used if you had softfloat disabled
> >  (i.e. !USE_X86LDOUBLE) so in practice use of it in even earlier
> > qemu is unlikely for most users.
> > 
> > Kill it off, it's complicated, and possibly broken.
> > 
> > Signed-off-by: Dr. David Alan Gilbert 
> 
> I love the sound of obsolete code being deleted...
> 
> Acked-by: Eduardo Habkost 
> 
> But:
> 
> [...]
> > @@ -356,6 +256,10 @@ static int cpu_post_load(void *opaque, int version_id)
> >  return -EINVAL;
> >  }
> >  
> > +if (env->fpregs_format_vmstate) {
> > +error_report("Unsupported old non-softfloat CPU state");
> > +return -EINVAL;
> > +}
> 
> Is this equivalent to using VMSTATE_UINT16_EQUAL, but with a
> better and more verbose error message?

I think it probably is, yes.

Dave

> >  /*
> >   * Real mode guest segments register DPL should be zero.
> >   * Older KVM version were setting it wrongly.
> > @@ -943,7 +847,8 @@ VMStateDescription vmstate_x86_cpu = {
> >  VMSTATE_UINT16(env.fpus_vmstate, X86CPU),
> >  VMSTATE_UINT16(env.fptag_vmstate, X86CPU),
> >  VMSTATE_UINT16(env.fpregs_format_vmstate, X86CPU),
> > -VMSTATE_FP_REGS(env.fpregs, X86CPU, 8),
> > +
> > +VMSTATE_STRUCT_ARRAY(env.fpregs, X86CPU, 8, 0, vmstate_fpreg, 
> > FPReg),
> >  
> >  VMSTATE_SEGMENT_ARRAY(env.segs, X86CPU, 6),
> >  VMSTATE_SEGMENT(env.ldt, X86CPU),
> > -- 
> > 2.9.3
> > 
> 
> -- 
> Eduardo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH for-2.9-rc5 v3] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Jeff Cody

On Tue, Apr 18, 2017 at 06:39:48PM +0800, Fam Zheng wrote:
> During block job completion, nothing is preventing
> block_job_defer_to_main_loop_bh from being called in a nested
> aio_poll(), which is a trouble, such as in this code path:
> 
> qmp_block_commit
>   commit_active_start
> bdrv_reopen
>   bdrv_reopen_multiple
> bdrv_reopen_prepare
>   bdrv_flush
> aio_poll
>   aio_bh_poll
> aio_bh_call
>   block_job_defer_to_main_loop_bh
> stream_complete
>   bdrv_reopen
> 
> block_job_defer_to_main_loop_bh is the last step of the stream job,
> which should have been "paused" by the bdrv_drained_begin/end in
> bdrv_reopen_multiple, but it is not done because it's in the form of a
> main loop BH.
> 
> Similar to why block jobs should be paused between drained_begin and
> drained_end, BHs they schedule must be excluded as well.  To achieve
> this, this patch forces draining the BH in BDRV_POLL_WHILE.
> 
> Also because the BH in question can do bdrv_unref and child replacing,
> protect @bs carefully to avoid use-after-free.
> 
> As a side effect this fixes a hang in block_job_detach_aio_context
> during system_reset when a block job is ready:
> 
> #0  0x55aa79f3 in bdrv_drain_recurse
> #1  0x55aa825d in bdrv_drained_begin
> #2  0x55aa8449 in bdrv_drain
> #3  0x55a9c356 in blk_drain
> #4  0x55aa3cfd in mirror_drain
> #5  0x55a66e11 in block_job_detach_aio_context
> #6  0x55a62f4d in bdrv_detach_aio_context
> #7  0x55a63116 in bdrv_set_aio_context
> #8  0x55a9d326 in blk_set_aio_context
> #9  0x557e38da in virtio_blk_data_plane_stop
> #10 0x559f9d5f in virtio_bus_stop_ioeventfd
> #11 0x559fa49b in virtio_bus_stop_ioeventfd
> #12 0x559f6a18 in virtio_pci_stop_ioeventfd
> #13 0x559f6a18 in virtio_pci_reset
> #14 0x559139a9 in qdev_reset_one
> #15 0x55916738 in qbus_walk_children
> #16 0x55913318 in qdev_walk_children
> #17 0x55916738 in qbus_walk_children
> #18 0x559168ca in qemu_devices_reset
> #19 0x5581fcbb in pc_machine_reset
> #20 0x558a4d96 in qemu_system_reset
> #21 0x5577157a in main_loop_should_exit
> #22 0x5577157a in main_loop
> #23 0x5577157a in main
> 
> The rationale is that the loop in block_job_detach_aio_context cannot
> make any progress in pausing/completing the job, because bs->in_flight
> is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
> BH. With this patch, it does.
> 
> Reported-by: Jeff Cody 
> Signed-off-by: Fam Zheng 
>

Verified this fixes the deadlock condition from my report:

Tested-by: Jeff Cody 

> ---
> 
> v3: Report all aio_poll() progress as waited_. [Kevin]
> 
> v2: Do the BH poll in BDRV_POLL_WHILE to cover bdrv_drain_all_begin as
> well. [Kevin]
> ---
>  block/io.c| 10 +++---
>  include/block/block.h | 22 ++
>  2 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 8706bfa..a472157 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -158,7 +158,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
>  
>  static bool bdrv_drain_recurse(BlockDriverState *bs)
>  {
> -BdrvChild *child;
> +BdrvChild *child, *tmp;
>  bool waited;
>  
>  waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
> @@ -167,8 +167,12 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
>  bs->drv->bdrv_drain(bs);
>  }
>  
> -QLIST_FOREACH(child, &bs->children, next) {
> -waited |= bdrv_drain_recurse(child->bs);
> +QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
> +BlockDriverState *bs = child->bs;
> +assert(bs->refcnt > 0);
> +bdrv_ref(bs);
> +waited |= bdrv_drain_recurse(bs);
> +bdrv_unref(bs);
>  }
>  
>  return waited;
> diff --git a/include/block/block.h b/include/block/block.h
> index 97d4330..5ddc0cf 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -381,12 +381,13 @@ void bdrv_drain_all(void);
>  
>  #define BDRV_POLL_WHILE(bs, cond) ({   \
>  bool waited_ = false;  \
> +bool busy_ = true; \
>  BlockDriverState *bs_ = (bs);  \
>  AioContext *ctx_ = bdrv_get_aio_context(bs_);  \
>  if (aio_context_in_iothread(ctx_)) {   \
> -while ((cond)) {   \
> -aio_poll(ctx_, true);  \
> -waited_ = true;\
> +while ((cond) || busy_) {  \
> +

Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation

2017-04-18 Thread Kevin Wolf

Am 14.04.2017 um 06:17 hat Denis V. Lunev geschrieben:
> [skipped...]
> 
> > Hi Denis,
> >
> > I've read this entire thread now and I really like Berto's summary which
> > I think is one of the best recaps of existing qcow2 problems and this
> > discussion so far.
> >
> > I understand your opinion that we should focus on compatible changes
> > before incompatible ones, and I also understand that you are very
> > concerned about physical fragmentation for reducing long-term IO.
> >
> > What I don't understand is why you think that subclusters will increase
> > fragmentation. If we admit that fragmentation is a problem now, surely
> > increasing cluster sizes to 1 or 2 MB will only help to *reduce*
> > physical fragmentation, right?
> >
> > Subclusters as far as I am understanding them will not actually allow
> > subclusters to be located at virtually disparate locations, we will
> > continue to allocate clusters as normal -- we'll just keep track of
> > which portions of the cluster we've written to to help us optimize COW*.
> >
> > So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
> > we write just the first subcluster, we'll have a map like:
> >
> > X---
> >
> > Whatever actually happens to exist in this space, whether it be a hole
> > we punched via fallocate or literal zeroes, this space is known to the
> > filesystem to be contiguous.
> >
> > If we write to the last subcluster, we'll get:
> >
> > X--X
> >
> > And again, maybe the dashes are a fallocate hole, maybe they're zeroes.
> > but the last subcluster is located virtually exactly 15 subclusters
> > behind the first, they're not physically contiguous. We've saved the
> > space between them. Future out-of-order writes won't contribute to any
> > fragmentation, at least at this level.
> >
> > You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
> > the subclusters right, we'll have *zero*, won't we?
> >
> > As far as I can tell, this lets us do a lot of good things all at once:
> >
> > (1) We get some COW optimizations (for reasons Berto and Kevin have
> > detailed)
> Yes. We are fine with COW. Let us assume that we will have issued read
> entire cluster command after the COW, in the situation
> 
> X--X
> 
> with a backing store. This is possible even with 1-2 Mb cluster size.
> I have seen 4-5 Mb requests from the guest in the real life. In this
> case we will have 3 IOP:
> read left X area, read backing, read right X.
> This is the real drawback of the approach, if sub-cluster size is really
> small enough, which should be the case for optimal COW. Thus we
> will have random IO in the host instead of sequential one in guest.
> Thus we have optimized COW at the cost of long term reads. This
> is what I am worrying about as we can have a lot of such reads before
> any further data change.

So just to avoid misunderstandings about what you're comparing here:
You get these 3 iops for 2 MB clusters with 64k subclusters, whereas you
would get only a single operation for 2 MB clusters without subclusters.
Today's 64k clusters without subclusters behave no better than the
2M/64k version, but that's not what you're comparing.

Yes, you are correct about this observation. But it is a tradeoff that
you're intentionally making when using backing files. In the extreme,
there is an alternative that performs so much better: Instead of using a
backing file, use 'qemu-img convert' to copy (and defragment) the whole
image upfront. No COW whatsoever, no fragmentation, fast reads. The
downside is that it takes a while to copy the whole image upfront, and
it also costs quite a bit of disk space.

So once we acknowledge that we're dealing with a tradeoff here, it
becomes obvious that neither the extreme setup for performance (copy the
whole image upfront) nor the extreme setup for sparseness (COW on a
sector level) are the right default for the average case, nor is
optimising one-sidedly a good idea. It is good if we can provide
solutions for extreme cases, but by default we need to cater for the
average case, which cares both about reasonable performance and disk
usage.

Kevin

Re: [Qemu-devel] [PATCH 1/1] qemu-img: wait for convert coroutines to complete

2017-04-18 Thread Peter Lieven


Am 18.04.2017 um 12:27 schrieb Denis V. Lunev:

From: Anton Nefedov 

We should wait for other coroutines on error path, i.e. one of coroutines
terminates with i/o error, before cleaning the common structures. In the
other case we would crash in a lot of different places. This behaviour
was introduced by commit 2d9187bc65.

Signed-off-by: Anton Nefedov 
Signed-off-by: Denis V. Lunev 
CC: Peter Lieven 
CC: Kevin Wolf 
CC: Max Reitz 
---
  qemu-img.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index b220cf7..3b04c5f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1899,7 +1899,7 @@ static int convert_do_copy(ImgConvertState *s)
  qemu_coroutine_enter(s->co[i]);
  }
  
-while (s->ret == -EINPROGRESS) {

+while (s->running_coroutines) {
  main_loop_wait(false);
  }
  


Reviewed-by: Peter Lieven 


Thanks for catching this.


Peter

Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation

2017-04-18 Thread Alberto Garcia

On Thu 13 Apr 2017 05:17:21 PM CEST, Denis V. Lunev wrote:
> On 04/13/2017 06:04 PM, Alberto Garcia wrote:
>> On Thu 13 Apr 2017 03:30:43 PM CEST, Denis V. Lunev wrote:
>>> Yes, block size should be increased. I perfectly in agreement with
>>> your.  But I think that we could do that by plain increase of the
>>> cluster size without any further dances. Sub-clusters as sub-clusters
>>> will help if we are able to avoid COW. With COW I do not see much
>>> difference.
>> I'm trying to summarize your position, tell me if I got everything
>> correctly:
>>
>> 1. We should try to reduce data fragmentation on the qcow2 file,
>>because it will have a long term effect on the I/O performance (as
>>opposed to an effect on the initial operations on the empty image).
> yes
>
>> 2. The way to do that is to increase the cluster size (to 1MB or
>>more).
> yes
>
>> 3. Benefit: increasing the cluster size also decreases the amount of
>>metadata (L2 and refcount).
> yes
>
>> 4. Problem: L2 tables become too big and fill up the cache more
>>easily. To solve this the cache code should do partial reads
>>instead of complete L2 clusters.
> yes. We can read full cluster as originally if L2 cache is empty.
>
>> 5. Problem: larger cluster sizes also mean more data to copy when
>>there's a COW. To solve this the COW code should be modified so it
>>goes from 5 OPs (read head, write head, read tail, write tail,
>>write data) to 2 OPs (read cluster, write modified cluster).
> yes, with small tweak if head and tail are in different clusters. In
> this case we
> will end up with 3 OPs.
>
>> 6. Having subclusters adds incompatible changes to the file format,
>>and they offer no benefit after allocation.
> yes
>
>> 7. Subclusters are only really useful if they match the guest fs block
>>size (because you would avoid doing COW on allocation). Otherwise
>>the only thing that you get is a faster COW (because you move less
>>data), but the improvement is not dramatic and it's better if we do
>>what's proposed in point 5.
> yes
>
>> 8. Even if the subcluster size matches the guest block size, you'll
>>get very fast initial allocation but also more chances to end up
>>with a very fragmented qcow2 image, which is worse in the long run.
> yes
>
>> 9. Problem: larger clusters make a less efficient use of disk space,
>>but that's a drawback you're fine with considering all of the
>>above.
> yes
>
>> Is that a fair summary of what you're trying to say? Anything else
>> missing?
> yes.
>
> 5a. Problem: initial cluster allocation without COW. Could be made
>   cluster-size agnostic with the help of fallocate() call. Big
> clusters are even
>   better as the amount of such allocations is reduced.
>
> Thank you very much for this cool summary! I am too tongue-tied.

Hi Denis,

I don't have the have data to verify all your claims here, but in
general what you say makes sense.

Although I'm not sure if I agree with everything (especially on whether
any of this applies to SSD drives at all) it seems that we all agree
that the COW algorithm can be improved, so perhaps I should start by
taking a look at that.

Regards,

Berto

Re: [Qemu-devel] [RFC 2/7] pci: Change pci_bus_init() 'parent' parameter to PCIHostState

2017-04-18 Thread Eduardo Habkost

On Tue, Apr 18, 2017 at 01:51:02PM +1000, David Gibson wrote:
> On Mon, Apr 17, 2017 at 06:59:11PM -0300, Eduardo Habkost wrote:
> > pci_bus_init() already requires 'parent' to be a PCI_HOST_BRIDGE object,
> > so change the parameter type to reflect that.
> > 
> > Cc: "Michael S. Tsirkin" 
> > Cc: Marcel Apfelbaum 
> > Signed-off-by: Eduardo Habkost 
> 
> Reviewed-by: David Gibson 
> 
> I had to do some looking to convince myself this wouldn't break P2P
> bridges.
> 
> I wonder if we should rename pci_bus_init() / pci_bus_new() to make it
> clearer that they're about creating a whole new PCI domain, and not
> for a new bus within an existing domain.

Good point. The current naming is confusing. I only knew the
change was safe because pci_bus_new() would crash if 'parent'
wasn't a PCI_HOST_BRIDGE. But I took some time to find the other
places where PCI buses could be created, and I'm not sure yet if
I found all of them.

Is there any other code that creates TYPE_PCI_BUS objects, except
for the ones touched by this series and pci_bridge_initfn()?

> 
> > ---
> >  hw/pci/pci.c | 10 ++
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 25118fb91d..d9535c0bdc 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -362,7 +362,7 @@ const char *pci_root_bus_path(PCIDevice *dev)
> >  return rootbus->qbus.name;
> >  }
> >  
> > -static void pci_bus_init(PCIBus *bus, DeviceState *parent,
> > +static void pci_bus_init(PCIBus *bus, PCIHostState *phb,
> >   MemoryRegion *address_space_mem,
> >   MemoryRegion *address_space_io,
> >   uint8_t devfn_min)
> > @@ -375,7 +375,7 @@ static void pci_bus_init(PCIBus *bus, DeviceState 
> > *parent,
> >  /* host bridge */
> >  QLIST_INIT(&bus->child);
> >  
> > -pci_host_bus_register(PCI_HOST_BRIDGE(host));
> > +pci_host_bus_register(phb);
> >  }
> >  
> >  bool pci_bus_is_express(PCIBus *bus)
> > @@ -394,8 +394,9 @@ void pci_bus_new_inplace(PCIBus *bus, size_t bus_size, 
> > DeviceState *parent,
> >   MemoryRegion *address_space_io,
> >   uint8_t devfn_min, const char *typename)
> >  {
> > +PCIHostState *phb = PCI_HOST_BRIDGE(parent);
> >  qbus_create_inplace(bus, bus_size, typename, parent, name);
> > -pci_bus_init(bus, parent, address_space_mem, address_space_io, 
> > devfn_min);
> > +pci_bus_init(bus, phb, address_space_mem, address_space_io, devfn_min);
> >  }
> >  
> >  PCIBus *pci_bus_new(DeviceState *parent, const char *name,
> > @@ -403,10 +404,11 @@ PCIBus *pci_bus_new(DeviceState *parent, const char 
> > *name,
> >  MemoryRegion *address_space_io,
> >  uint8_t devfn_min, const char *typename)
> >  {
> > +PCIHostState *phb = PCI_HOST_BRIDGE(parent);
> >  PCIBus *bus;
> >  
> >  bus = PCI_BUS(qbus_create(typename, parent, name));
> > -pci_bus_init(bus, parent, address_space_mem, address_space_io, 
> > devfn_min);
> > +pci_bus_init(bus, phb, address_space_mem, address_space_io, devfn_min);
> >  return bus;
> >  }
> >  
> 
> -- 
> David Gibson  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au| minimalist, thank you.  NOT _the_ 
> _other_
>   | _way_ _around_!
> http://www.ozlabs.org/~dgibson



-- 
Eduardo

Re: [Qemu-devel] What's the next QEMU version after 2.9 ? (or: when is a good point in time to get rid of old interfaces)

2017-04-18 Thread Gerd Hoffmann

  Hi,

> > Just like -device is a general way to plug in devices, replacing
> > multiple special ways (-net, -drive, -usb, ...), we could use a general
> > way to configure onboard devices.
> 
> I looked at the -device implementation to see if the bus= parameter
> could be used to specify onboard device addresses, but I think you may
> be right that we need a separate command-line argument for onboard
> devices.

I think so.

There is -global, which is actually used by libvirt to configure
built-in floppy devices.  But as the name suggests it sets properties
globally, i.e. for all instances.  Which works in this specific use
case, as there can be only one floppy controller per machine, but I
don't think this is something we want build on.

There is -set, but that works only for devices created via -device,
because it operates on QemuOpts, and we don't have QemuOpts for built-in
devices.

We probably want something like
  -qom-set-property {objpath|alias}.prop=value

cheers,
  Gerd

Re: [Qemu-devel] [PATCH 20/31] parallels: Switch to .bdrv_co_block_status()

2017-04-18 Thread Denis V. Lunev

On 04/18/2017 04:33 AM, Eric Blake wrote:
> We are gradually moving away from sector-based interfaces, towards
> byte-based.  Update the parallels driver accordingly.  Note that
> the internal function block_status() is still sector-based, because
> it is still in use by other sector-based functions; but that's okay
> because request_alignment is 512 as a result of those functions.
>
> Signed-off-by: Eric Blake 
Reviewed-by: Denis V. Lunev 


> ---
>  block/parallels.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/block/parallels.c b/block/parallels.c
> index 8be46a7..2bc1918 100644
> --- a/block/parallels.c
> +++ b/block/parallels.c
> @@ -274,22 +274,25 @@ static coroutine_fn int 
> parallels_co_flush_to_os(BlockDriverState *bs)
>  }
>
>
> -static int64_t coroutine_fn parallels_co_get_block_status(BlockDriverState 
> *bs,
> -int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState 
> **file)
> +static int64_t coroutine_fn parallels_co_block_status(BlockDriverState *bs,
> +int64_t offset, int64_t bytes, int64_t *pnum, BlockDriverState 
> **file)
>  {
>  BDRVParallelsState *s = bs->opaque;
> -int64_t offset;
> +int count;
>
> +assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
>  qemu_co_mutex_lock(&s->lock);
> -offset = block_status(s, sector_num, nb_sectors, pnum);
> +offset = block_status(s, offset >> BDRV_SECTOR_BITS,
> +  bytes >> BDRV_SECTOR_BITS, &count);
>  qemu_co_mutex_unlock(&s->lock);
>
>  if (offset < 0) {
>  return 0;
>  }
>
> +*pnum = count * BDRV_SECTOR_SIZE;
>  *file = bs->file->bs;
> -return (offset << BDRV_SECTOR_BITS) |
> +return (offset & BDRV_BLOCK_OFFSET_MASK) |
>  BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>  }
>
> @@ -775,7 +778,7 @@ static BlockDriver bdrv_parallels = {
>  .bdrv_open   = parallels_open,
>  .bdrv_close  = parallels_close,
>  .bdrv_child_perm  = bdrv_format_default_perms,
> -.bdrv_co_get_block_status = parallels_co_get_block_status,
> +.bdrv_co_block_status = parallels_co_block_status,
>  .bdrv_has_zero_init   = bdrv_has_zero_init_1,
>  .bdrv_co_flush_to_os  = parallels_co_flush_to_os,
>  .bdrv_co_readv  = parallels_co_readv,

Re: [Qemu-devel] [PATCH 3/3] qemu-iotests: Test postcopy migration

2017-04-18 Thread Dr. David Alan Gilbert

* Kevin Wolf (kw...@redhat.com) wrote:
> Signed-off-by: Kevin Wolf 
> ---
>  tests/qemu-iotests/181 | 117 
> +
>  tests/qemu-iotests/181.out |  38 +++
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 156 insertions(+)
>  create mode 100755 tests/qemu-iotests/181
>  create mode 100644 tests/qemu-iotests/181.out
> 
> diff --git a/tests/qemu-iotests/181 b/tests/qemu-iotests/181
> new file mode 100755
> index 000..563aee7
> --- /dev/null
> +++ b/tests/qemu-iotests/181
> @@ -0,0 +1,117 @@
> +#!/bin/bash
> +#
> +# Test postcopy live migration with shared storage
> +#
> +# Copyright (C) 2017 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +#
> +
> +# creator
> +owner=kw...@redhat.com
> +
> +seq=`basename $0`
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +status=1 # failure is the default!
> +
> +MIG_SOCKET="${TEST_DIR}/migrate"
> +
> +_cleanup()
> +{
> +rm -f "${MIG_SOCKET}"
> + _cleanup_test_img
> +_cleanup_qemu
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common.rc
> +. ./common.filter
> +. ./common.qemu
> +
> +_supported_fmt generic
> +_supported_proto generic
> +_supported_os Linux
> +
> +size=64M
> +_make_test_img $size
> +
> +echo
> +echo === Starting VMs ===
> +echo
> +
> +qemu_comm_method="monitor"
> +
> +_launch_qemu -drive file="${TEST_IMG}",cache=${CACHEMODE},id=disk
> +src=$QEMU_HANDLE
> +
> +_launch_qemu -drive file="${TEST_IMG}",cache=${CACHEMODE},id=disk \
> + -incoming "unix:${MIG_SOCKET}"
> +dest=$QEMU_HANDLE
> +
> +echo
> +echo === Write something on the source ===
> +echo
> +
> +silent=
> +_send_qemu_cmd $src 'qemu-io disk "write -P 0x55 0 64k"' "(qemu)"
> +_send_qemu_cmd $src "" "ops/sec"
> +_send_qemu_cmd $src 'qemu-io disk "read -P 0x55 0 64k"' "(qemu)"
> +_send_qemu_cmd $src "" "ops/sec"
> +
> +echo
> +echo === Do postcopy migration to destination ===
> +echo
> +
> +# Slow down migration so much that it definitely won't finish before we can
> +# switch to postcopy
> +silent=yes
> +_send_qemu_cmd $src 'migrate_set_speed 4k' "(qemu)"
> +_send_qemu_cmd $src 'migrate_set_capability postcopy-ram on' "(qemu)"

You should probably also set this on the destination as well;  we
do it in pretty much all the places where we use postcopy, but I think
it'll work without it.

However,

Reviewed-by: Dr. David Alan Gilbert 

> +_send_qemu_cmd $src "migrate -d unix:${MIG_SOCKET}" "(qemu)"
> +_send_qemu_cmd $src 'migrate_start_postcopy' "(qemu)"
> +
> +QEMU_COMM_TIMEOUT=1 qemu_cmd_repeat=10 silent=yes \
> +_send_qemu_cmd $src "info migrate" "completed\|failed"
> +silent=yes _send_qemu_cmd $src "" "(qemu)"
> +
> +echo
> +echo === Do some I/O on the destination ===
> +echo
> +
> +# It is important that we use the BlockBackend of the guest device here 
> instead
> +# of the node name, which would create a new BlockBackend and not test 
> whether
> +# the guest has the necessary permissions to access the image now
> +silent=
> +_send_qemu_cmd $dest 'qemu-io disk "read -P 0x55 0 64k"' "(qemu)"
> +_send_qemu_cmd $dest "" "ops/sec"
> +_send_qemu_cmd $dest 'qemu-io disk "write -P 0x66 1M 64k"' "(qemu)"
> +_send_qemu_cmd $dest "" "ops/sec"
> +
> +echo
> +echo === Shut down and check image ===
> +echo
> +
> +_send_qemu_cmd $src 'quit' ""
> +_send_qemu_cmd $dest 'quit' ""
> +wait=1 _cleanup_qemu
> +
> +_check_test_img
> +
> +# success, all done
> +echo "*** done"
> +rm -f $seq.full
> +status=0
> diff --git a/tests/qemu-iotests/181.out b/tests/qemu-iotests/181.out
> new file mode 100644
> index 000..6534ba2
> --- /dev/null
> +++ b/tests/qemu-iotests/181.out
> @@ -0,0 +1,38 @@
> +QA output created by 181
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
> +
> +=== Starting VMs ===
> +
> +
> +=== Write something on the source ===
> +
> +QEMU X.Y.Z monitor - type 'help' for more information
> +(qemu) qemu-io disk "write -P 0x55 0 64k"
> +wrote 65536/65536 bytes at offset 0
> +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +(qemu) 
> +(qemu) qemu-io disk "read -P 0x55 0 64k"
> +read 65536/65536 bytes at offset 0
> +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +
> +=== Do postcopy migration to destination ===
> +
> +
> +=== Do some I/O on the destination ===
> +
> +QEMU X.Y.Z mon

Re: [Qemu-devel] migrate -b problems

2017-04-18 Thread Juan Quintela

Kevin Wolf  wrote:
> Hi all,

Hi

> after getting assertion failure reports for block migration in the last
> minute, we just hacked around it by commenting out op blocker assertions
> for the 2.9 release, but now we need to see how to fix things properly.
> Luckily, get_maintainer.pl doesn't report me, but only you. :-)

migration/block.c  is a classic.  Anyone that touches it last is the
maintainer.


> The main problem I see with the block migration code (on the
> destination) is that it abuses the BlockBackend that belongs to the
> guest device to make its own writes to the image file. If the guest
> isn't allowed to write to the image (which it now isn't during incoming
> migration since it would conflict with the newer style of block
> migration using an NBD server), writing to this BlockBackend doesn't
> work any more.
>
> So what should really happen is that incoming block migration creates
> its own BlockBackend for writing to the image. Now we don't want to do
> this anew for every incoming block, but ideally we'd just create all
> necessary BlockBackends upfront and then keep using them throughout the
> whole migration. Is there a way to get some setup/teardown callbacks
> at the start/end of the migration that could initialise and free such
> global data?

Two answers:

- Easy one (for me).  look at how spice/qxl use migration notifiers (no,
  it is not pretty, but its what is already done).

- More difficult
  I am trying to get an easier to use way migration notifiers.  What we
  need is to be able to schedule:
 - when we start a migration
* this thing that you need
 - when we finish a migration (either complete, error or cancel)
* basically to do a free
* or spice to handle the screen seamlesly to the new client
 - when we are about to start last stage of migration
* so devices can write things to memory
  Look at new arm GIC (or PIC) or whatever on the list

And probably something more that I haven't yet think about.

> The other problem with block migration is that is uses a BlockBackend
> name to identify which device is migrated. However, there can be images
> that are not attached to any BlockBackend, or if it is, the BlockBackend
> might be anonymous, so this doesn't work. I suppose changing the field
> to "device name if available, node-name otherwise" would solve this.

That is above my knowledge.

Later, Juan.

Re: [Qemu-devel] [PATCH 0/3 for 2.10] migration/i386 cleanup

2017-04-18 Thread Paolo Bonzini



On 18/04/2017 13:00, Dr. David Alan Gilbert wrote:
>> It would be nice if we could allow kvm-unit-tests test cases
>> trigger a migration operation at a specific instruction. I assume
>> we don't have an existing mechanism that could be used for that?
> Not that I know of, do we have any type of breakpoint type thing
> we could use?

What about the gdbstub support?

Paolo

Re: [Qemu-devel] [PATCH for-2.9-rc5 v3] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Paolo Bonzini



On 18/04/2017 12:39, Fam Zheng wrote:
> +QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
> +BlockDriverState *bs = child->bs;
> +assert(bs->refcnt > 0);
> +bdrv_ref(bs);
> +waited |= bdrv_drain_recurse(bs);
> +bdrv_unref(bs);
>  }

I think this accesses global state that is not protected by the
AioContext lock?

Paolo

Re: [Qemu-devel] [PATCH 2/2] target/openrisc: Implement EPH bit

2017-04-18 Thread Stafford Horne

On Tue, Apr 18, 2017 at 04:15:51PM +1000, Tim 'mithro' Ansell wrote:
> Exception Prefix High (EPH) control bit of the Supervision Register
> (SR).
> 
> The significant bits (31-12) of the vector offset address for each
> exception depend on the setting of the Supervision Register (SR)'s EPH
> bit and the Exception Vector Base Address Register (EVBAR).
> 
> If SR[EPH] is set, the vector offset is logically ORed with the offset
> 0xF000.
> 
> This means if EPH is;
>  * 0 - Exceptions vectors start at EVBAR
>  * 1 - Exception vectors start at EVBAR | 0xF000
> 
> Signed-off-by: Tim 'mithro' Ansell 

Acked-by: Stafford Horne 

Thanks, I will queue this with my other changes unless anyone has any
objections.

> ---
>  target/openrisc/interrupt.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/target/openrisc/interrupt.c b/target/openrisc/interrupt.c
> index 78f0ba9421..2c91fab380 100644
> --- a/target/openrisc/interrupt.c
> +++ b/target/openrisc/interrupt.c
> @@ -69,6 +69,9 @@ void openrisc_cpu_do_interrupt(CPUState *cs)
>  if (env->cpucfgr & CPUCFGR_EVBARP) {
>  vect_pc |= env->evbar;
>  }
> +if (env->sr & SR_EPH) {
> +vect_pc |= 0xf000;
> +}
>  env->pc = vect_pc;
>  } else {
>  cpu_abort(cs, "Unhandled exception 0x%x\n", cs->exception_index);
> -- 
> 2.12.1
>

Re: [Qemu-devel] [PATCH 1/2] target/openrisc: Implement EVBAR register

2017-04-18 Thread Stafford Horne

On Tue, Apr 18, 2017 at 04:15:50PM +1000, Tim 'mithro' Ansell wrote:
> Exception Vector Base Address Register (EVBAR) - This optional register
> can be used to apply an offset to the exception vector addresses.
> 
> The significant bits (31-12) of the vector offset address for each
> exception depend on the setting of the Supervision Register (SR)'s EPH
> bit and the Exception Vector Base Address Register (EVBAR).
> 
> Its presence is indicated by the EVBARP bit in the CPU Configuration
> Register (CPUCFGR).
> 
> Signed-off-by: Tim 'mithro' Ansell 
> ---
>  target/openrisc/cpu.c| 2 ++
>  target/openrisc/cpu.h| 7 +++
>  target/openrisc/interrupt.c  | 6 +-
>  target/openrisc/sys_helper.c | 7 +++
>  4 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/target/openrisc/cpu.c b/target/openrisc/cpu.c
> index 7fd2b9a216..1524ed981a 100644
> --- a/target/openrisc/cpu.c
> +++ b/target/openrisc/cpu.c
> @@ -134,6 +134,7 @@ static void or1200_initfn(Object *obj)
>  
>  set_feature(cpu, OPENRISC_FEATURE_OB32S);
>  set_feature(cpu, OPENRISC_FEATURE_OF32S);
> +set_feature(cpu, OPENRISC_FEATURE_EVBAR);
>  }
>  
>  static void openrisc_any_initfn(Object *obj)
> @@ -141,6 +142,7 @@ static void openrisc_any_initfn(Object *obj)
>  OpenRISCCPU *cpu = OPENRISC_CPU(obj);
>  
>  set_feature(cpu, OPENRISC_FEATURE_OB32S);
> +set_feature(cpu, OPENRISC_FEATURE_EVBAR);
>  }
>  
>  typedef struct OpenRISCCPUInfo {
> diff --git a/target/openrisc/cpu.h b/target/openrisc/cpu.h
> index 418a0e6960..1958b72718 100644
> --- a/target/openrisc/cpu.h
> +++ b/target/openrisc/cpu.h
> @@ -111,6 +111,11 @@ enum {
>  CPUCFGR_OF32S = (1 << 7),
>  CPUCFGR_OF64S = (1 << 8),
>  CPUCFGR_OV64S = (1 << 9),
> +/* CPUCFGR_ND = (1 << 10), */
> +/* CPUCFGR_AVRP = (1 << 11), */
> +CPUCFGR_EVBARP = (1 << 12),
> +/* CPUCFGR_ISRP = (1 << 13), */
> +/* CPUCFGR_AECSRP = (1 << 14), */
>  };
>  
>  /* DMMU configure register */
> @@ -200,6 +205,7 @@ enum {
>  OPENRISC_FEATURE_OF32S = (1 << 7),
>  OPENRISC_FEATURE_OF64S = (1 << 8),
>  OPENRISC_FEATURE_OV64S = (1 << 9),
> +OPENRISC_FEATURE_EVBAR = (1 << 12),

Does anyone know why we have to add both Features and CPUCFG bits?

It seems like duplication to me.

>  };
>  
>  /* Tick Timer Mode Register */
> @@ -289,6 +295,7 @@ typedef struct CPUOpenRISCState {
>  uint32_t dmmucfgr;/* DMMU configure register */
>  uint32_t immucfgr;/* IMMU configure register */
>  uint32_t esr; /* Exception supervisor register */
> +uint32_t evbar;   /* Exception vector base address register */

If we add something to CPUOpenRISCState, we ideally need to also add it to
machine.c vmstate definition.  However, I currently have a patch out to
rework the vmstate serialization.  I can add this to that patch.

>  uint32_t fpcsr;   /* Float register */
>  float_status fp_status;
>  
> diff --git a/target/openrisc/interrupt.c b/target/openrisc/interrupt.c
> index a2eec6fb32..78f0ba9421 100644
> --- a/target/openrisc/interrupt.c
> +++ b/target/openrisc/interrupt.c
> @@ -65,7 +65,11 @@ void openrisc_cpu_do_interrupt(CPUState *cs)
>  env->lock_addr = -1;
>  
>  if (cs->exception_index > 0 && cs->exception_index < EXCP_NR) {
> -env->pc = (cs->exception_index << 8);
> +hwaddr vect_pc = cs->exception_index << 8;
> +if (env->cpucfgr & CPUCFGR_EVBARP) {
> +vect_pc |= env->evbar;
> +}
> +env->pc = vect_pc;
>  } else {
>  cpu_abort(cs, "Unhandled exception 0x%x\n", cs->exception_index);
>  }
> diff --git a/target/openrisc/sys_helper.c b/target/openrisc/sys_helper.c
> index 60c3193656..6ba816249b 100644
> --- a/target/openrisc/sys_helper.c
> +++ b/target/openrisc/sys_helper.c
> @@ -39,6 +39,10 @@ void HELPER(mtspr)(CPUOpenRISCState *env,
>  env->vr = rb;
>  break;
>  
> +case TO_SPR(0, 11): /* EVBAR */
> +env->evbar = rb;
> +break;
> +
>  case TO_SPR(0, 16): /* NPC */
>  cpu_restore_state(cs, GETPC());
>  /* ??? Mirror or1ksim in not trashing delayed branch state
> @@ -206,6 +210,9 @@ target_ulong HELPER(mfspr)(CPUOpenRISCState *env,
>  case TO_SPR(0, 4): /* IMMUCFGR */
>  return env->immucfgr;
>  
> +case TO_SPR(0, 11): /* EVBAR */
> +return env->evbar;
> +
>  case TO_SPR(0, 16): /* NPC (equals PC) */
>  cpu_restore_state(cs, GETPC());
>  return env->pc;

Reviewed-by: Stafford Horne 

I will pull into my branch and send PR as is unless someone has more to
say.

-Stafford

> -- 
> 2.12.1
>

Re: [Qemu-devel] [RFC 1/7] pci: Change pci_host_bus_register() parameter to PCIHostState

2017-04-18 Thread Marcel Apfelbaum


On 04/18/2017 12:59 AM, Eduardo Habkost wrote:

The function requires a PCI_HOST_BRIDGE object, so change the parameter
type to reflect that.

Cc: "Michael S. Tsirkin" 
Cc: Marcel Apfelbaum 
Signed-off-by: Eduardo Habkost 
---
 hw/pci/pci.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 259483b1c0..25118fb91d 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -312,11 +312,9 @@ static void pcibus_reset(BusState *qbus)
 }
 }

-static void pci_host_bus_register(DeviceState *host)
+static void pci_host_bus_register(PCIHostState *phb)
 {
-PCIHostState *host_bridge = PCI_HOST_BRIDGE(host);
-
-QLIST_INSERT_HEAD(&pci_host_bridges, host_bridge, next);
+QLIST_INSERT_HEAD(&pci_host_bridges, phb, next);
 }

 PCIBus *pci_find_primary_bus(void)
@@ -377,7 +375,7 @@ static void pci_bus_init(PCIBus *bus, DeviceState *parent,
 /* host bridge */
 QLIST_INIT(&bus->child);

-pci_host_bus_register(parent);
+pci_host_bus_register(PCI_HOST_BRIDGE(host));
 }

 bool pci_bus_is_express(PCIBus *bus)




Reviewed-by: Marcel Apfelbaum 

Thanks,
Marcel

Re: [Qemu-devel] [ANNOUNCE] QEMU 2.9.0-rc4 is now available

2017-04-18 Thread Peter Maydell

On 11 April 2017 at 20:28, Michael Roth  wrote:
> Hello,
>
> On behalf of the QEMU Team, I'd like to announce the availability of the
> fifth release candidate for the QEMU 2.9 release.  This release is meant
> for testing purposes and should not be used in a production environment.
>
>   http://download.qemu-project.org/qemu-2.9.0-rc4.tar.xz
>   http://download.qemu-project.org/qemu-2.9.0-rc4.tar.xz.sig
>
> A note from the maintainer:
>
>   This is intended to be the final release candidate for v2.9.0, and
>   unless any showstopper bugs are reported we plan to cut the final
>   release next week on the 18th April.

Regrettably there were a couple of final bugs we needed
to squash, and so the revised plan is to roll an rc5 today
and then have final release on Thursday or Friday.

thanks
-- PMM

Re: [Qemu-devel] [PATCH RFC 1/3] block: Introduce BDRV_O_UNSAFE_READ

2017-04-18 Thread Eric Blake

On 03/13/2017 09:39 PM, Fam Zheng wrote:
> This flag clears out the "consistent read" permission that blk_new_open
> requests.
> 
> Signed-off-by: Fam Zheng 
> ---
>  block/block-backend.c | 2 +-
>  include/block/block.h | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 5742c09..99428ee 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -197,7 +197,7 @@ BlockBackend *blk_new_open(const char *filename, const 
> char *reference,
>   * caller of blk_new_open() doesn't make use of the permissions, but they
>   * shouldn't hurt either. We can still share everything here because the
>   * guest devices will add their own blockers if they can't share. */
> -perm = BLK_PERM_CONSISTENT_READ;
> +perm = flags & BDRV_O_UNSAFE_READ ? 0 : BLK_PERM_CONSISTENT_READ;

A bit awkward that we have to add an option that requests negation of a
permission under the hood, but looks like it is the least-invasive way
to get what we want.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH RFC 2/3] qemu-img: Add --unsafe-read option to subcommands

2017-04-18 Thread Eric Blake

On 03/13/2017 09:39 PM, Fam Zheng wrote:
> Signed-off-by: Fam Zheng 
> ---
>  qemu-img.c | 148 
> +++--
>  1 file changed, 114 insertions(+), 34 deletions(-)
> 

> @@ -2711,9 +2751,10 @@ static int img_map(int argc, char **argv)
>  {"output", required_argument, 0, OPTION_OUTPUT},
>  {"object", required_argument, 0, OPTION_OBJECT},
>  {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
> +{"unsafe-read", no_argument, 0, 'U'},
>  {0, 0, 0, 0}
>  };
> -c = getopt_long(argc, argv, "f:h",
> +c = getopt_long(argc, argv, "f:hU",

Yay for 'U' being an unambiguous mnemonic that we could use on all the
commands.  You got lucky (it gets ever harder to add new options, as our
letters get thinner ;)

> @@ -2970,6 +3020,7 @@ static int img_rebase(int argc, char **argv)
>  int c, flags, src_flags, ret;
>  bool writethrough, src_writethrough;
>  int unsafe = 0;
> +int unsafe_read = 0;

bool, please.  (Just because existing code is lousy with using an int
for a bool doesn't mean we should perpetuate it)


> @@ -3033,6 +3085,9 @@ static int img_rebase(int argc, char **argv)
>  case OPTION_IMAGE_OPTS:
>  image_opts = true;
>  break;
> +case 'U':
> +unsafe_read = 1;

s/1/true/

Good patch, but missing documentation updates to --help/man page.  So,
as encouragement to add that when you drop the RFC, I'll withhold my R-b
until then. ;)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [RFC 2/7] pci: Change pci_bus_init() 'parent' parameter to PCIHostState

2017-04-18 Thread Marcel Apfelbaum


On 04/18/2017 02:48 PM, Eduardo Habkost wrote:

On Tue, Apr 18, 2017 at 01:51:02PM +1000, David Gibson wrote:

On Mon, Apr 17, 2017 at 06:59:11PM -0300, Eduardo Habkost wrote:

pci_bus_init() already requires 'parent' to be a PCI_HOST_BRIDGE object,
so change the parameter type to reflect that.

Cc: "Michael S. Tsirkin" 
Cc: Marcel Apfelbaum 
Signed-off-by: Eduardo Habkost 


Reviewed-by: David Gibson 

I had to do some looking to convince myself this wouldn't break P2P
bridges.

I wonder if we should rename pci_bus_init() / pci_bus_new() to make it
clearer that they're about creating a whole new PCI domain, and not
for a new bus within an existing domain.


Good point. The current naming is confusing. I only knew the
change was safe because pci_bus_new() would crash if 'parent'
wasn't a PCI_HOST_BRIDGE. But I took some time to find the other
places where PCI buses could be created, and I'm not sure yet if
I found all of them.

Is there any other code that creates TYPE_PCI_BUS objects, except
for the ones touched by this series and pci_bridge_initfn()?



pxb and pxb-pcie devices have a built-in PCI/PCIe bus,
but I saw them handled by another patch.

Reviewed-by: Marcel Apfelbaum 


Thanks,
Marcel




---
 hw/pci/pci.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 25118fb91d..d9535c0bdc 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -362,7 +362,7 @@ const char *pci_root_bus_path(PCIDevice *dev)
 return rootbus->qbus.name;
 }

-static void pci_bus_init(PCIBus *bus, DeviceState *parent,
+static void pci_bus_init(PCIBus *bus, PCIHostState *phb,
  MemoryRegion *address_space_mem,
  MemoryRegion *address_space_io,
  uint8_t devfn_min)
@@ -375,7 +375,7 @@ static void pci_bus_init(PCIBus *bus, DeviceState *parent,
 /* host bridge */
 QLIST_INIT(&bus->child);

-pci_host_bus_register(PCI_HOST_BRIDGE(host));
+pci_host_bus_register(phb);
 }

 bool pci_bus_is_express(PCIBus *bus)
@@ -394,8 +394,9 @@ void pci_bus_new_inplace(PCIBus *bus, size_t bus_size, 
DeviceState *parent,
  MemoryRegion *address_space_io,
  uint8_t devfn_min, const char *typename)
 {
+PCIHostState *phb = PCI_HOST_BRIDGE(parent);
 qbus_create_inplace(bus, bus_size, typename, parent, name);
-pci_bus_init(bus, parent, address_space_mem, address_space_io, devfn_min);
+pci_bus_init(bus, phb, address_space_mem, address_space_io, devfn_min);
 }

 PCIBus *pci_bus_new(DeviceState *parent, const char *name,
@@ -403,10 +404,11 @@ PCIBus *pci_bus_new(DeviceState *parent, const char *name,
 MemoryRegion *address_space_io,
 uint8_t devfn_min, const char *typename)
 {
+PCIHostState *phb = PCI_HOST_BRIDGE(parent);
 PCIBus *bus;

 bus = PCI_BUS(qbus_create(typename, parent, name));
-pci_bus_init(bus, parent, address_space_mem, address_space_io, devfn_min);
+pci_bus_init(bus, phb, address_space_mem, address_space_io, devfn_min);
 return bus;
 }



--
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

Re: [Qemu-devel] [PATCH v2 3/4] versatile: remove cannot_destroy_with_object_finalize_yet

2017-04-18 Thread Markus Armbruster

Laurent Vivier  writes:

> cannot_destroy_with_object_finalize_yet was added by 4c315c2
> ("qdev: Protect device-list-properties against broken devices")
> because "realview_pci" and "versatile_pci" were hanging
> during "device-list-properties" cleanup (an infinite loop in
> bus_unparent()).
>
> We have this problem because the child is not removed from
> the list of the PCI bus child because it has no defined parent:

s/child/children/

> qdev_set_parent_bus() set the device parent_bus pointer to bus, and
> adds the device in the bus children list, but doesn't update the
> device parent pointer.
>
> To fix the problem, move all the involved parts to the realize function.
>
> Signed-off-by: Laurent Vivier 

The devices aren't unpluggable.  There is no unrealize().  How work
needs to be spread between init and realize isn't obvious to me in
general.  I can see this work done both in init and in realize.  I
figure the move is okay.

Reviewed-by: Markus Armbruster

Re: [Qemu-devel] [PATCH v2 4/4] qdev: remove cannot_destroy_with_object_finalize_yet

2017-04-18 Thread Markus Armbruster

Laurent Vivier  writes:

> As all users have been removed, we can remove
> cannot_destroy_with_object_finalize_yet field
> from the DeviceClass structure.
>
> Signed-off-by: Laurent Vivier 

Huzzah!

Reviewed-by: Markus Armbruster

Re: [Qemu-devel] [PATCH RFC 3/3] qemu-io: Add --unsafe-read option

2017-04-18 Thread Eric Blake

On 03/13/2017 09:39 PM, Fam Zheng wrote:
> Signed-off-by: Fam Zheng 
> ---
>  qemu-io.c | 28 +---
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> @@ -108,6 +112,7 @@ static void open_help(void)
>  " -r, -- open file read-only\n"
>  " -s, -- use snapshot file\n"
>  " -n, -- disable host cache, short for -t none\n"
> +" -U, -- accept unsafe opening of the image even if it's in use by another 
> process\n"

Long line (exceeds 80 columns of output); can we shorten it?  Maybe

 -U, allow unsafe reads when another process is writing

You did get 'qemu-io -c "help open"' adjusted, but I don't see where you
adjusted 'qemu-io --help'.  Since both places gained a -U, you still
need more documentation. But at least there is no man page to worry
about here.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [RFC 3/7] pci: Change pci_bus_new*() parameter to PCIHostState

2017-04-18 Thread Marcel Apfelbaum


On 04/18/2017 12:59 AM, Eduardo Habkost wrote:

The pci_bus_new*() functions already require the 'parent' argument to be
a PCI_HOST_BRIDGE object. Change the parameter type to reflect that.

Cc: "Michael S. Tsirkin" 
Cc: Marcel Apfelbaum 
Cc: "Hervé Poussineau" 
Cc: Peter Maydell 
Cc: qemu-...@nongnu.org
Cc: qemu-...@nongnu.org
Signed-off-by: Eduardo Habkost 
---
 include/hw/pci/pci.h|  5 +++--
 hw/pci-bridge/pci_expander_bridge.c | 15 ---
 hw/pci-host/piix.c  |  2 +-
 hw/pci-host/prep.c  |  2 +-
 hw/pci-host/q35.c   |  2 +-
 hw/pci-host/versatile.c |  2 +-
 hw/pci/pci.c| 13 ++---
 7 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index a37a2d5cb6..2242aa25eb 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -393,12 +393,13 @@ typedef PCIINTxRoute (*pci_route_irq_fn)(void *opaque, 
int pin);

 bool pci_bus_is_express(PCIBus *bus);
 bool pci_bus_is_root(PCIBus *bus);
-void pci_bus_new_inplace(PCIBus *bus, size_t bus_size, DeviceState *parent,
+void pci_bus_new_inplace(PCIBus *bus, size_t bus_size,
+ PCIHostState *phb,
  const char *name,
  MemoryRegion *address_space_mem,
  MemoryRegion *address_space_io,
  uint8_t devfn_min, const char *typename);
-PCIBus *pci_bus_new(DeviceState *parent, const char *name,
+PCIBus *pci_bus_new(PCIHostState *phb, const char *name,
 MemoryRegion *address_space_mem,
 MemoryRegion *address_space_io,
 uint8_t devfn_min, const char *typename);
diff --git a/hw/pci-bridge/pci_expander_bridge.c 
b/hw/pci-bridge/pci_expander_bridge.c
index 6ac187fa32..39d29d2230 100644
--- a/hw/pci-bridge/pci_expander_bridge.c
+++ b/hw/pci-bridge/pci_expander_bridge.c
@@ -213,7 +213,8 @@ static gint pxb_compare(gconstpointer a, gconstpointer b)
 static void pxb_dev_realize_common(PCIDevice *dev, bool pcie, Error **errp)
 {
 PXBDev *pxb = convert_to_pxb(dev);
-DeviceState *ds, *bds = NULL;
+DeviceState *bds = NULL;
+PCIHostState *phb;
 PCIBus *bus;
 const char *dev_name = NULL;
 Error *local_err = NULL;
@@ -228,11 +229,11 @@ static void pxb_dev_realize_common(PCIDevice *dev, bool 
pcie, Error **errp)
 dev_name = dev->qdev.id;
 }

-ds = qdev_create(NULL, TYPE_PXB_HOST);
+phb = PCI_HOST_BRIDGE(qdev_create(NULL, TYPE_PXB_HOST));
 if (pcie) {
-bus = pci_bus_new(ds, dev_name, NULL, NULL, 0, TYPE_PXB_PCIE_BUS);
+bus = pci_bus_new(phb, dev_name, NULL, NULL, 0, TYPE_PXB_PCIE_BUS);
 } else {
-bus = pci_bus_new(ds, "pxb-internal", NULL, NULL, 0, TYPE_PXB_BUS);
+bus = pci_bus_new(phb, "pxb-internal", NULL, NULL, 0, TYPE_PXB_BUS);
 bds = qdev_create(BUS(bus), "pci-bridge");
 bds->id = dev_name;
 qdev_prop_set_uint8(bds, PCI_BRIDGE_DEV_PROP_CHASSIS_NR, pxb->bus_nr);
@@ -244,7 +245,7 @@ static void pxb_dev_realize_common(PCIDevice *dev, bool 
pcie, Error **errp)
 bus->address_space_io = dev->bus->address_space_io;
 bus->map_irq = pxb_map_irq_fn;

-PCI_HOST_BRIDGE(ds)->bus = bus;
+phb->bus = bus;

 pxb_register_bus(dev, bus, &local_err);
 if (local_err) {
@@ -252,7 +253,7 @@ static void pxb_dev_realize_common(PCIDevice *dev, bool 
pcie, Error **errp)
 goto err_register_bus;
 }

-qdev_init_nofail(ds);
+qdev_init_nofail(DEVICE(phb));
 if (bds) {
 qdev_init_nofail(bds);
 }
@@ -267,7 +268,7 @@ static void pxb_dev_realize_common(PCIDevice *dev, bool 
pcie, Error **errp)
 err_register_bus:
 object_unref(OBJECT(bds));
 object_unparent(OBJECT(bus));
-object_unref(OBJECT(ds));
+object_unref(OBJECT(phb));
 }

 static void pxb_dev_realize(PCIDevice *dev, Error **errp)
diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c
index f9218aa952..91fec05b38 100644
--- a/hw/pci-host/piix.c
+++ b/hw/pci-host/piix.c
@@ -340,7 +340,7 @@ PCIBus *i440fx_init(const char *host_type, const char 
*pci_type,

 dev = qdev_create(NULL, host_type);
 s = PCI_HOST_BRIDGE(dev);
-b = pci_bus_new(dev, NULL, pci_address_space,
+b = pci_bus_new(s, NULL, pci_address_space,
 address_space_io, 0, TYPE_PCI_BUS);
 s->bus = b;
 object_property_add_child(qdev_get_machine(), "i440fx", OBJECT(dev), NULL);
diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c
index 260a119a9e..2e2cd267f4 100644
--- a/hw/pci-host/prep.c
+++ b/hw/pci-host/prep.c
@@ -269,7 +269,7 @@ static void raven_pcihost_initfn(Object *obj)
 memory_region_add_subregion_overlap(address_space_mem, 0x8000,
 &s->pci_io_non_contiguous, 1);
 memory_region_add_subregion(address_space_mem, 0xc000, &s->pci_memory);
-pci_bus_new_inplace(&s->pci_bus, sizeo

Re: [Qemu-devel] [RFC 3/7] pci: Change pci_bus_new*() parameter to PCIHostState

2017-04-18 Thread Eduardo Habkost

On Tue, Apr 18, 2017 at 04:27:23PM +0300, Marcel Apfelbaum wrote:
[...]
> > @@ -431,7 +430,7 @@ PCIBus *pci_register_bus(DeviceState *parent, const 
> > char *name,
> >  {
> >  PCIBus *bus;
> > 
> > -bus = pci_bus_new(parent, name, address_space_mem,
> > +bus = pci_bus_new(PCI_HOST_BRIDGE(parent), name, address_space_mem,
> 
> If you request the parent to be PCI_HOST_BRIDGE, why don't you change also 
> the signature of
> pci_register_bus ?

I do it on patch 4/7.

-- 
Eduardo

Re: [Qemu-devel] [PATCH] char: Fix removing wrong GSource that be found by fd_in_tag

2017-04-18 Thread Eric Blake

On 04/14/2017 05:10 AM, Marc-André Lureau wrote:
> Hi
> 
> - Original Message -
>> We use fd_in_tag to find a GSource, fd_in_tag is return value of
>> g_source_attach(GSource *source, GMainContext *context), the return
>> value is unique only in the same context, so we may get the same
>> values with different 'context' parameters.
>>
>> It is no problem to find the right fd_in_tag by using
>>  g_main_context_find_source_by_id(GMainContext *context, guint source_id)
>> while there is only one default main context.
>>
>> But colo-compare tries to create/use its own context, and if we pass wrong
>> 'context' parameter with right fd_in_tag, we will find a wrong GSource
>> to handle. We tied to fix the related codes in commit b43dec, but it didn't
> 
> tied->tried
> 
> Please use a bit longer commit sha1, or full sha1, it will likely conflict 
> otherwise in the future.

6 chars is indeed short, 7 is git's default as usually long enough,
although I've encountered collisions that require 8 chars.  [And google
has proved that you can have a collision across the entire hash,
although that is harder to generate.] I generally use 8 or so when
writing commit messages.  Fortunately, even if a collision is introduces
later, someone that is motivated enough can still resolve the collision
by filtering out any collisions that resolve to non-commits, and among
the remaining colliding SHA1 focus on the one that has a commit date
which predates the message with the reference.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [RFC 3/7] pci: Change pci_bus_new*() parameter to PCIHostState

2017-04-18 Thread Marcel Apfelbaum


On 04/18/2017 04:30 PM, Eduardo Habkost wrote:

On Tue, Apr 18, 2017 at 04:27:23PM +0300, Marcel Apfelbaum wrote:
[...]

@@ -431,7 +430,7 @@ PCIBus *pci_register_bus(DeviceState *parent, const char 
*name,
 {
 PCIBus *bus;

-bus = pci_bus_new(parent, name, address_space_mem,
+bus = pci_bus_new(PCI_HOST_BRIDGE(parent), name, address_space_mem,


If you request the parent to be PCI_HOST_BRIDGE, why don't you change also the 
signature of
pci_register_bus ?


I do it on patch 4/7.



Yes, I saw it :). at least it means I am following the patches...

Thanks,
Marcel

Re: [Qemu-devel] [RFC 4/7] pci: Change pci_register_bus() 'parent' parameter to PCIHostState

2017-04-18 Thread Marcel Apfelbaum


On 04/18/2017 12:59 AM, Eduardo Habkost wrote:

pci_register_bus() already requires the 'parent' argument to be a
PCI_HOST_BRIDGE object. Change the parameter type to reflect that.

Cc: Richard Henderson 
Cc: Aurelien Jarno 
Cc: Yongbok Kim 
Cc: Alexander Graf 
Cc: Scott Wood 
Cc: Paul Burton 
Cc: "Michael S. Tsirkin" 
Cc: Marcel Apfelbaum 
Cc: David Gibson 
Cc: Cornelia Huck 
Cc: Christian Borntraeger 
Cc: qemu-...@nongnu.org
Signed-off-by: Eduardo Habkost 
---
 include/hw/pci/pci.h  | 2 +-
 hw/alpha/typhoon.c| 2 +-
 hw/mips/gt64xxx_pci.c | 2 +-
 hw/pci-host/apb.c | 2 +-
 hw/pci-host/bonito.c  | 2 +-
 hw/pci-host/gpex.c| 2 +-
 hw/pci-host/grackle.c | 2 +-
 hw/pci-host/ppce500.c | 2 +-
 hw/pci-host/uninorth.c| 4 ++--
 hw/pci-host/xilinx-pcie.c | 2 +-
 hw/pci/pci.c  | 4 ++--
 hw/ppc/ppc4xx_pci.c   | 2 +-
 hw/ppc/spapr_pci.c| 2 +-
 hw/s390x/s390-pci-bus.c   | 2 +-
 hw/sh4/sh_pci.c   | 2 +-
 15 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 2242aa25eb..56387ccb0c 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -408,7 +408,7 @@ void pci_bus_irqs(PCIBus *bus, pci_set_irq_fn set_irq, 
pci_map_irq_fn map_irq,
 int pci_bus_get_irq_level(PCIBus *bus, int irq_num);
 /* 0 <= pin <= 3 0 = INTA, 1 = INTB, 2 = INTC, 3 = INTD */
 int pci_swizzle_map_irq_fn(PCIDevice *pci_dev, int pin);
-PCIBus *pci_register_bus(DeviceState *parent, const char *name,
+PCIBus *pci_register_bus(PCIHostState *phb, const char *name,
  pci_set_irq_fn set_irq, pci_map_irq_fn map_irq,
  void *irq_opaque,
  MemoryRegion *address_space_mem,
diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
index f50f5cf186..ac0633a55e 100644
--- a/hw/alpha/typhoon.c
+++ b/hw/alpha/typhoon.c
@@ -883,7 +883,7 @@ PCIBus *typhoon_init(ram_addr_t ram_size, ISABus **isa_bus,
 memory_region_add_subregion(addr_space, 0x801fc00ULL,
 &s->pchip.reg_io);

-b = pci_register_bus(dev, "pci",
+b = pci_register_bus(phb, "pci",
  typhoon_set_irq, sys_map_irq, s,
  &s->pchip.reg_mem, &s->pchip.reg_io,
  0, 64, TYPE_PCI_BUS);
diff --git a/hw/mips/gt64xxx_pci.c b/hw/mips/gt64xxx_pci.c
index 4811843ab6..bd131bcdc6 100644
--- a/hw/mips/gt64xxx_pci.c
+++ b/hw/mips/gt64xxx_pci.c
@@ -1171,7 +1171,7 @@ PCIBus *gt64120_register(qemu_irq *pic)
 phb = PCI_HOST_BRIDGE(dev);
 memory_region_init(&d->pci0_mem, OBJECT(dev), "pci0-mem", UINT32_MAX);
 address_space_init(&d->pci0_mem_as, &d->pci0_mem, "pci0-mem");
-phb->bus = pci_register_bus(dev, "pci",
+phb->bus = pci_register_bus(phb, "pci",
 gt64120_pci_set_irq, gt64120_pci_map_irq,
 pic,
 &d->pci0_mem,
diff --git a/hw/pci-host/apb.c b/hw/pci-host/apb.c
index 653e711121..1156a54224 100644
--- a/hw/pci-host/apb.c
+++ b/hw/pci-host/apb.c
@@ -671,7 +671,7 @@ PCIBus *pci_apb_init(hwaddr special_base,
 dev = qdev_create(NULL, TYPE_APB);
 d = APB_DEVICE(dev);
 phb = PCI_HOST_BRIDGE(dev);
-phb->bus = pci_register_bus(DEVICE(phb), "pci",
+phb->bus = pci_register_bus(phb, "pci",
 pci_apb_set_irq, pci_pbm_map_irq, d,
 &d->pci_mmio,
 get_system_io(),
diff --git a/hw/pci-host/bonito.c b/hw/pci-host/bonito.c
index 1999ece590..27842edc04 100644
--- a/hw/pci-host/bonito.c
+++ b/hw/pci-host/bonito.c
@@ -714,7 +714,7 @@ static int bonito_pcihost_initfn(SysBusDevice *dev)
 {
 PCIHostState *phb = PCI_HOST_BRIDGE(dev);

-phb->bus = pci_register_bus(DEVICE(dev), "pci",
+phb->bus = pci_register_bus(phb, "pci",
 pci_bonito_set_irq, pci_bonito_map_irq, dev,
 get_system_memory(), get_system_io(),
 0x28, 32, TYPE_PCI_BUS);
diff --git a/hw/pci-host/gpex.c b/hw/pci-host/gpex.c
index 66055ee5cc..042d127271 100644
--- a/hw/pci-host/gpex.c
+++ b/hw/pci-host/gpex.c
@@ -62,7 +62,7 @@ static void gpex_host_realize(DeviceState *dev, Error **errp)
 sysbus_init_irq(sbd, &s->irq[i]);
 }

-pci->bus = pci_register_bus(dev, "pcie.0", gpex_set_irq,
+pci->bus = pci_register_bus(pci, "pcie.0", gpex_set_irq,
 pci_swizzle_map_irq_fn, s, &s->io_mmio,
 &s->io_ioport, 0, 4, TYPE_PCIE_BUS);

diff --git a/hw/pci-host/grackle.c b/hw/pci-host/grackle.c
index 2c8acdaaca..a56c063be9 100644
--- a/hw/pci-host/grackle.c
+++ b/hw/pci-host/grackle.c
@@ -82,7 +82,7 @@ PCIBus *pci_grackle_init(uint32_t base, qemu_irq *pic,
 memory_region_add_subregion(address_space_mem, 0x8000ULL,
 &d->pci_hole);

-

Re: [Qemu-devel] [RFC 5/7] pci: Set phb->bus inside pci_register_bus()

2017-04-18 Thread Marcel Apfelbaum


On 04/18/2017 12:59 AM, Eduardo Habkost wrote:

Every single caller of of pci_register_bus() saves the return value in
phb->bus. Do that inside pci_register_bus() to avoid code duplication
and make it harder to break.



I personally find that more difficult to follow, maybe the function
name is misleading.
Maybe we can find a better function name or explain the side effect
in a comment?

Thanks,
Marcel


Most (but not all) conversions done using the following Coccinelle script:

  @@
  identifier b;
  expression phb;
  @@
  -b = pci_register_bus(phb, ARGS);
  +phb->bus = pci_register_bus(phb, ARGS);
   ...
  -phb->bus = b;

  @@
  expression phb;
  expression list ARGS;
  @@
  -phb->bus = pci_register_bus(phb, ARGS);
  +pci_register_bus(phb, ARGS);

Cc: Richard Henderson 
Cc: Aurelien Jarno 
Cc: Yongbok Kim 
Cc: Alexander Graf 
Cc: Scott Wood 
Cc: Paul Burton 
Cc: "Michael S. Tsirkin" 
Cc: Marcel Apfelbaum 
Cc: David Gibson 
Cc: Cornelia Huck 
Cc: Christian Borntraeger 
Cc: qemu-...@nongnu.org
Signed-off-by: Eduardo Habkost 
---
 include/hw/pci/pci.h  | 12 ++--
 hw/alpha/typhoon.c| 10 +-
 hw/mips/gt64xxx_pci.c |  9 +++--
 hw/pci-host/apb.c |  7 ++-
 hw/pci-host/bonito.c  |  7 +++
 hw/pci-host/gpex.c|  5 ++---
 hw/pci-host/grackle.c |  9 ++---
 hw/pci-host/ppce500.c |  8 
 hw/pci-host/uninorth.c| 18 ++
 hw/pci-host/xilinx-pcie.c |  6 +++---
 hw/pci/pci.c  | 14 +++---
 hw/ppc/ppc4xx_pci.c   |  8 
 hw/ppc/spapr_pci.c| 10 +-
 hw/s390x/s390-pci-bus.c   | 10 +-
 hw/sh4/sh_pci.c   |  9 +++--
 15 files changed, 60 insertions(+), 82 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 56387ccb0c..3b1e2c408a 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -408,12 +408,12 @@ void pci_bus_irqs(PCIBus *bus, pci_set_irq_fn set_irq, 
pci_map_irq_fn map_irq,
 int pci_bus_get_irq_level(PCIBus *bus, int irq_num);
 /* 0 <= pin <= 3 0 = INTA, 1 = INTB, 2 = INTC, 3 = INTD */
 int pci_swizzle_map_irq_fn(PCIDevice *pci_dev, int pin);
-PCIBus *pci_register_bus(PCIHostState *phb, const char *name,
- pci_set_irq_fn set_irq, pci_map_irq_fn map_irq,
- void *irq_opaque,
- MemoryRegion *address_space_mem,
- MemoryRegion *address_space_io,
- uint8_t devfn_min, int nirq, const char *typename);
+void pci_register_bus(PCIHostState *phb, const char *name,
+  pci_set_irq_fn set_irq, pci_map_irq_fn map_irq,
+  void *irq_opaque,
+  MemoryRegion *address_space_mem,
+  MemoryRegion *address_space_io,
+  uint8_t devfn_min, int nirq, const char *typename);
 void pci_bus_set_route_irq_fn(PCIBus *, pci_route_irq_fn);
 PCIINTxRoute pci_device_route_intx_to_irq(PCIDevice *dev, int pin);
 bool pci_intx_route_changed(PCIINTxRoute *old, PCIINTxRoute *new);
diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
index ac0633a55e..5926686d79 100644
--- a/hw/alpha/typhoon.c
+++ b/hw/alpha/typhoon.c
@@ -883,11 +883,11 @@ PCIBus *typhoon_init(ram_addr_t ram_size, ISABus 
**isa_bus,
 memory_region_add_subregion(addr_space, 0x801fc00ULL,
 &s->pchip.reg_io);

-b = pci_register_bus(phb, "pci",
- typhoon_set_irq, sys_map_irq, s,
- &s->pchip.reg_mem, &s->pchip.reg_io,
- 0, 64, TYPE_PCI_BUS);
-phb->bus = b;
+pci_register_bus(phb, "pci",
+ typhoon_set_irq, sys_map_irq, s,
+ &s->pchip.reg_mem, &s->pchip.reg_io,
+ 0, 64, TYPE_PCI_BUS);
+b = phb->bus;
 qdev_init_nofail(dev);

 /* Host memory as seen from the PCI side, via the IOMMU.  */
diff --git a/hw/mips/gt64xxx_pci.c b/hw/mips/gt64xxx_pci.c
index bd131bcdc6..69963453f0 100644
--- a/hw/mips/gt64xxx_pci.c
+++ b/hw/mips/gt64xxx_pci.c
@@ -1171,12 +1171,9 @@ PCIBus *gt64120_register(qemu_irq *pic)
 phb = PCI_HOST_BRIDGE(dev);
 memory_region_init(&d->pci0_mem, OBJECT(dev), "pci0-mem", UINT32_MAX);
 address_space_init(&d->pci0_mem_as, &d->pci0_mem, "pci0-mem");
-phb->bus = pci_register_bus(phb, "pci",
-gt64120_pci_set_irq, gt64120_pci_map_irq,
-pic,
-&d->pci0_mem,
-get_system_io(),
-PCI_DEVFN(18, 0), 4, TYPE_PCI_BUS);
+pci_register_bus(phb, "pci", gt64120_pci_set_irq, gt64120_pci_map_irq,
+ pic, &d->pci0_mem, get_system_io(), PCI_DEVFN(18, 0), 4,
+ TYPE_PCI_BUS);
 qdev_init_nofail(dev);
 memory_region_init_io(&d->ISD_mem, OBJECT(dev), &isd_mem_ops, d, 
"isd-mem", 0x1000);

diff --git

Re: [Qemu-devel] [PATCH for-2.9-rc5 v3] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Fam Zheng

On Tue, 04/18 14:36, Paolo Bonzini wrote:
> 
> 
> On 18/04/2017 12:39, Fam Zheng wrote:
> > +QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
> > +BlockDriverState *bs = child->bs;
> > +assert(bs->refcnt > 0);
> > +bdrv_ref(bs);
> > +waited |= bdrv_drain_recurse(bs);
> > +bdrv_unref(bs);
> >  }
> 
> I think this accesses global state that is not protected by the
> AioContext lock?

Good catch! If called from IOThread, this bdrv_unref is simply wrong, although
in practice it cannot delete bs because of the reference held by the owning
device.

It may be better to wrap the bdrv_ref/bdrv_unref calls with

 if (qemu_get_current_aio_context() == qemu_get_aio_context())

because only the main loop needs it.

Will make this hunk a separate patch in v4.

Fam

Re: [Qemu-devel] [RFC 5/7] pci: Set phb->bus inside pci_register_bus()

2017-04-18 Thread Eduardo Habkost

On Tue, Apr 18, 2017 at 04:43:48PM +0300, Marcel Apfelbaum wrote:
> On 04/18/2017 12:59 AM, Eduardo Habkost wrote:
> > Every single caller of of pci_register_bus() saves the return value in
> > phb->bus. Do that inside pci_register_bus() to avoid code duplication
> > and make it harder to break.
> > 
> 
> I personally find that more difficult to follow, maybe the function
> name is misleading.
> Maybe we can find a better function name or explain the side effect
> in a comment?

I agree we need better function names (for both
pci_register_bus() and pci_bus_new*()). But I am not experienced
enough with the PCI code to find a good name. Any suggestions?

-- 
Eduardo

[Qemu-devel] [PATCH v5 0/9] qemu-img: add measure sub-command

2017-04-18 Thread Stefan Hajnoczi

v5:
 * Use UINT64_MAX instead of ~0ULL [Berto]
 * Document qemu-img measure ofmt, fmt, output_fmt, and snapshot_param
   [Berto]

v4:
 * Make qcow2 refcount calculation conservative [Maor]
 * Include actual qemu-img convert image size in test cases

v3:
 * Drop RFC, this is ready to go for QEMU 2.10
 * Use "required size" instead of "required bytes" in qemu-img output for
   consistency [Nir]
 * Clarify BlockMeasureInfo semantics [Max]
 * Clarify bdrv_measure() opts argument and error handling [Nir]
 * Handle -o backing_file= for qcow2 [Max]
 * Handle snapshot options in qemu-img measure
 * Probe input image for allocated data clusters for qcow2.  Didn't centralize
   this because there are format-specific aspects such as the cluster_size.  It
   may make sense to centralize it later (with a bit more complexity) if
   support is added to more formats.
 * Add qemu-img(1) man page section for 'measure' sub-command [Max]
 * Extend test case to cover additional scenarios [Nir]

RFCv2:
 * Publishing RFC again to discuss the new user-visible interfaces.  Code has
   changed quite a bit, I have not kept any Reviewed-by tags.
 * Rename qemu-img sub-command "measure" and API bdrv_measure() [Nir]
 * Report both "required bytes" and "fully allocated bytes" to handle the empty
   image file and prealloc use cases [Nir and Dan]
 * Use bdrv_getlength() instead of bdrv_nb_sectors() [Berto]
 * Rename "err" label "out" in qemu-img-cmds.c [Nir]
 * Add basic qcow2 support, doesn't support qemu-img convert from existing 
files yet

RFCv1:
 * Publishing patch series with just raw support, no qcow2 yet.  Please review
   the command-line interface and let me know if you are happy with this
   approach.

Users and management tools sometimes need to know the size required for a new
disk image so that an LVM volume, SAN LUN, etc can be allocated ahead of time.
Image formats like qcow2 have non-trivial metadata that makes it hard to
estimate the exact size without knowledge of file format internals.

This patch series introduces a new qemu-img sub-command that calculates the
required size for both image creation and conversion scenarios.

The conversion scenario is:

  $ qemu-img measure -f raw -O qcow2 input.img
  required size: 1327680
  fully allocated size: 1074069504

Here an existing image file is taken and the output includes the space required
for data from the input image file.

The creation scenario is:

  $ qemu-img measure -O qcow2 --size 5G
  required size: 327680
  fully allocated size: 1074069504

Stefan Hajnoczi (9):
  block: add bdrv_measure() API
  raw-format: add bdrv_measure() support
  qcow2: extract preallocation calculation function
  qcow2: make refcount size calculation conservative
  qcow2: extract image creation option parsing
  qcow2: add bdrv_measure() support
  qemu-img: add measure subcommand
  qemu-iotests: support per-format golden output files
  iotests: add test 178 for qemu-img measure

 qapi/block-core.json |  25 +++
 include/block/block.h|   4 +
 include/block/block_int.h|   2 +
 block.c  |  35 
 block/qcow2.c| 367 +--
 block/raw-format.c   |  22 +++
 qemu-img.c   | 227 
 qemu-img-cmds.hx |   6 +
 qemu-img.texi|  30 
 tests/qemu-iotests/178   | 168 ++
 tests/qemu-iotests/178.out.qcow2 | 278 +
 tests/qemu-iotests/178.out.raw   | 146 
 tests/qemu-iotests/check |   5 +
 tests/qemu-iotests/group |   1 +
 14 files changed, 1222 insertions(+), 94 deletions(-)
 create mode 100755 tests/qemu-iotests/178
 create mode 100644 tests/qemu-iotests/178.out.qcow2
 create mode 100644 tests/qemu-iotests/178.out.raw

-- 
2.9.3

[Qemu-devel] [PATCH v5 6/9] qcow2: add bdrv_measure() support

2017-04-18 Thread Stefan Hajnoczi

Use qcow2_calc_prealloc_size() to get the required file size.

Signed-off-by: Stefan Hajnoczi 
---
 block/qcow2.c | 122 ++
 1 file changed, 122 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0181d72..a2c7f97 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2942,6 +2942,127 @@ static coroutine_fn int 
qcow2_co_flush_to_os(BlockDriverState *bs)
 return 0;
 }
 
+static void qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
+  BlockMeasureInfo *info, Error **errp)
+{
+Error *local_err = NULL;
+uint64_t required = 0; /* bytes that contribute to required size */
+uint64_t virtual_size; /* disk size as seen by guest */
+uint64_t refcount_bits;
+size_t cluster_size;
+int version;
+char *optstr;
+PreallocMode prealloc;
+bool has_backing_file;
+
+/* Parse image creation options */
+cluster_size = qcow2_opt_get_cluster_size_del(opts, &local_err);
+if (local_err) {
+goto err;
+}
+
+version = qcow2_opt_get_version_del(opts, &local_err);
+if (local_err) {
+goto err;
+}
+
+refcount_bits = qcow2_opt_get_refcount_bits_del(opts, version, &local_err);
+if (local_err) {
+goto err;
+}
+
+optstr = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
+prealloc = qapi_enum_parse(PreallocMode_lookup, optstr,
+   PREALLOC_MODE__MAX, PREALLOC_MODE_OFF,
+   &local_err);
+g_free(optstr);
+if (local_err) {
+goto err;
+}
+
+optstr = qemu_opt_get_del(opts, BLOCK_OPT_BACKING_FILE);
+has_backing_file = !!optstr;
+g_free(optstr);
+
+virtual_size = align_offset(qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0),
+cluster_size);
+
+/* Account for input image */
+if (in_bs) {
+int64_t ssize = bdrv_getlength(in_bs);
+if (ssize < 0) {
+error_setg_errno(errp, -ssize, "Unable to get image virtual_size");
+return;
+}
+
+virtual_size = align_offset(ssize, cluster_size);
+
+if (has_backing_file) {
+/* We don't how much of the backing chain is shared by the input
+ * image and the new image file.  In the worst case the new image's
+ * backing file has nothing in common with the input image.  Be
+ * conservative and assume all clusters need to be written.
+ */
+required = virtual_size;
+} else {
+int cluster_sectors = cluster_size / BDRV_SECTOR_SIZE;
+int64_t sector_num;
+int pnum = 0;
+
+for (sector_num = 0;
+ sector_num < ssize / BDRV_SECTOR_SIZE;
+ sector_num += pnum) {
+int nb_sectors = MAX(ssize / BDRV_SECTOR_SIZE - sector_num,
+ INT_MAX);
+BlockDriverState *file;
+int64_t ret;
+
+ret = bdrv_get_block_status_above(in_bs, NULL,
+  sector_num, nb_sectors,
+  &pnum, &file);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "Unable to get block status");
+return;
+}
+
+if (ret & BDRV_BLOCK_ZERO) {
+/* Skip zero regions (safe with no backing file) */
+} else if ((ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) ==
+   (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) {
+/* Extend pnum to end of cluster for next iteration */
+pnum = ROUND_UP(sector_num + pnum, cluster_sectors) -
+   sector_num;
+
+/* Count clusters we've seen */
+required += (sector_num % cluster_sectors + pnum) *
+BDRV_SECTOR_SIZE;
+}
+}
+}
+}
+
+/* Take into account preallocation.  Nothing special is needed for
+ * PREALLOC_MODE_METADATA since metadata is always counted.
+ */
+if (prealloc == PREALLOC_MODE_FULL || prealloc == PREALLOC_MODE_FALLOC) {
+required = virtual_size;
+}
+
+info->fully_allocated =
+qcow2_calc_prealloc_size(virtual_size, cluster_size,
+ ctz32(refcount_bits));
+
+/* Remove data clusters that are not required.  This overestimates the
+ * required size because metadata needed for the fully allocated file is
+ * still counted.
+ */
+info->required = info->fully_allocated - virtual_size + required;
+return;
+
+err:
+error_propagate(errp, local_err);
+}
+
 static int qcow2_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVQcow2State *s = bs->opaque;
@@ -3489,6 +3610,7 @@ BlockDriver bdrv_qcow2 = {
 .bdrv_s

[Qemu-devel] [PATCH v5 1/9] block: add bdrv_measure() API

2017-04-18 Thread Stefan Hajnoczi

bdrv_measure() provides a conservative maximum for the size of a new
image.  This information is handy if storage needs to be allocated (e.g.
a SAN or an LVM volume) ahead of time.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Alberto Garcia 
---
 qapi/block-core.json  | 25 +
 include/block/block.h |  4 
 include/block/block_int.h |  2 ++
 block.c   | 35 +++
 4 files changed, 66 insertions(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 033457c..1ea5536 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -463,6 +463,31 @@
'*dirty-bitmaps': ['BlockDirtyInfo'] } }
 
 ##
+# @BlockMeasureInfo:
+#
+# Image size calculation information.  This structure describes the size
+# requirements for creating a new image file.
+#
+# The size requirements depend on the new image file format.  File size always
+# equals virtual disk size for the 'raw' format.  Compact formats such as
+# 'qcow2' represent unallocated and zero regions efficiently so file size may
+# be smaller than virtual disk size.
+#
+# The values are upper bounds that are guaranteed to fit the new image file.
+# Subsequent modification, such as internal snapshot or bitmap creation, may
+# require additional space and is not covered here.
+#
+# @required: Size required for a new image file, in bytes.
+#
+# @fully-allocated: Image file size, in bytes, once data has been written
+#   to all sectors.
+#
+# Since: 2.10
+##
+{ 'struct': 'BlockMeasureInfo',
+  'data': {'required': 'int', 'fully-allocated': 'int'} }
+
+##
 # @query-block:
 #
 # Get a list of BlockInfo for all virtual block devices.
diff --git a/include/block/block.h b/include/block/block.h
index 97d4330..4d4b88d 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -298,6 +298,10 @@ int bdrv_truncate(BdrvChild *child, int64_t offset);
 int64_t bdrv_nb_sectors(BlockDriverState *bs);
 int64_t bdrv_getlength(BlockDriverState *bs);
 int64_t bdrv_get_allocated_file_size(BlockDriverState *bs);
+void bdrv_measure(BlockDriver *drv, QemuOpts *opts,
+  BlockDriverState *in_bs,
+  BlockMeasureInfo *info,
+  Error **errp);
 void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr);
 void bdrv_refresh_limits(BlockDriverState *bs, Error **errp);
 int bdrv_commit(BlockDriverState *bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 59400bd..5099a58 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -201,6 +201,8 @@ struct BlockDriver {
 int64_t (*bdrv_getlength)(BlockDriverState *bs);
 bool has_variable_length;
 int64_t (*bdrv_get_allocated_file_size)(BlockDriverState *bs);
+void (*bdrv_measure)(QemuOpts *opts, BlockDriverState *in_bs,
+ BlockMeasureInfo *info, Error **errp);
 
 int coroutine_fn (*bdrv_co_pwritev_compressed)(BlockDriverState *bs,
 uint64_t offset, uint64_t bytes, QEMUIOVector *qiov);
diff --git a/block.c b/block.c
index 1fbbb8d..1b14214 100644
--- a/block.c
+++ b/block.c
@@ -3316,6 +3316,41 @@ int64_t bdrv_get_allocated_file_size(BlockDriverState 
*bs)
 return -ENOTSUP;
 }
 
+/*
+ * bdrv_measure:
+ * @drv: Format driver
+ * @opts: Creation options for new image
+ * @in_bs: Existing image containing data for new image (may be NULL)
+ * @info: Result object
+ * @errp: Error object
+ *
+ * Calculate file size required to create a new image.
+ *
+ * If @in_bs is given then space for allocated clusters and zero clusters
+ * from that image are included in the calculation.  If @opts contains a
+ * backing file that is shared by @in_bs then backing clusters are omitted
+ * from the calculation.
+ *
+ * If @in_bs is NULL then the calculation includes no allocated clusters
+ * unless a preallocation option is given in @opts.
+ *
+ * Note that @in_bs may use a different BlockDriver from @drv.
+ *
+ * If an error occurs the @errp pointer is set.
+ */
+void bdrv_measure(BlockDriver *drv, QemuOpts *opts,
+  BlockDriverState *in_bs, BlockMeasureInfo *info,
+  Error **errp)
+{
+if (!drv->bdrv_measure) {
+error_setg(errp, "Block driver '%s' does not support size measurement",
+   drv->format_name);
+return;
+}
+
+drv->bdrv_measure(opts, in_bs, info, errp);
+}
+
 /**
  * Return number of sectors on success, -errno on error.
  */
-- 
2.9.3

[Qemu-devel] [PATCH v5 7/9] qemu-img: add measure subcommand

2017-04-18 Thread Stefan Hajnoczi

The measure subcommand calculates the size required by a new image file.
This can be used by users or management tools that need to allocate
space on an LVM volume, SAN LUN, etc before creating or converting an
image file.

Suggested-by: Maor Lipchuk 
Signed-off-by: Stefan Hajnoczi 
---
v5:
 * Use UINT64_MAX instead of ~0ULL [Berto]
 * Document qemu-img measure ofmt, fmt, output_fmt, and snapshot_param
   [Berto]
---
 qemu-img.c   | 227 +++
 qemu-img-cmds.hx |   6 ++
 qemu-img.texi|  30 
 3 files changed, 263 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index b220cf7..cb1a81f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -59,6 +59,7 @@ enum {
 OPTION_PATTERN = 260,
 OPTION_FLUSH_INTERVAL = 261,
 OPTION_NO_DRAIN = 262,
+OPTION_SIZE = 263,
 };
 
 typedef enum OutputFormat {
@@ -4359,6 +4360,232 @@ out:
 return 0;
 }
 
+static void dump_json_block_measure_info(BlockMeasureInfo *info)
+{
+QString *str;
+QObject *obj;
+Visitor *v = qobject_output_visitor_new(&obj);
+
+visit_type_BlockMeasureInfo(v, NULL, &info, &error_abort);
+visit_complete(v, &obj);
+str = qobject_to_json_pretty(obj);
+assert(str != NULL);
+printf("%s\n", qstring_get_str(str));
+qobject_decref(obj);
+visit_free(v);
+QDECREF(str);
+}
+
+static int img_measure(int argc, char **argv)
+{
+static const struct option long_options[] = {
+{"help", no_argument, 0, 'h'},
+{"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
+{"object", required_argument, 0, OPTION_OBJECT},
+{"output", required_argument, 0, OPTION_OUTPUT},
+{"size", required_argument, 0, OPTION_SIZE},
+{0, 0, 0, 0}
+};
+OutputFormat output_format = OFORMAT_HUMAN;
+BlockBackend *in_blk = NULL;
+BlockDriver *drv;
+const char *filename = NULL;
+const char *fmt = NULL;
+const char *out_fmt = "raw";
+char *options = NULL;
+char *snapshot_name = NULL;
+QemuOpts *opts = NULL;
+QemuOpts *object_opts = NULL;
+QemuOpts *sn_opts = NULL;
+QemuOptsList *create_opts = NULL;
+bool image_opts = false;
+uint64_t img_size = UINT64_MAX;
+BlockMeasureInfo info;
+Error *local_err = NULL;
+int ret = 1;
+int c;
+
+while ((c = getopt_long(argc, argv, "hf:O:o:l:",
+long_options, NULL)) != -1) {
+switch (c) {
+case '?':
+case 'h':
+help();
+break;
+case 'f':
+fmt = optarg;
+break;
+case 'O':
+out_fmt = optarg;
+break;
+case 'o':
+if (!is_valid_option_list(optarg)) {
+error_report("Invalid option list: %s", optarg);
+goto out;
+}
+if (!options) {
+options = g_strdup(optarg);
+} else {
+char *old_options = options;
+options = g_strdup_printf("%s,%s", options, optarg);
+g_free(old_options);
+}
+break;
+case 'l':
+if (strstart(optarg, SNAPSHOT_OPT_BASE, NULL)) {
+sn_opts = qemu_opts_parse_noisily(&internal_snapshot_opts,
+  optarg, false);
+if (!sn_opts) {
+error_report("Failed in parsing snapshot param '%s'",
+ optarg);
+goto out;
+}
+} else {
+snapshot_name = optarg;
+}
+break;
+case OPTION_OBJECT:
+object_opts = qemu_opts_parse_noisily(&qemu_object_opts,
+  optarg, true);
+if (!object_opts) {
+goto out;
+}
+break;
+case OPTION_IMAGE_OPTS:
+image_opts = true;
+break;
+case OPTION_OUTPUT:
+if (!strcmp(optarg, "json")) {
+output_format = OFORMAT_JSON;
+} else if (!strcmp(optarg, "human")) {
+output_format = OFORMAT_HUMAN;
+} else {
+error_report("--output must be used with human or json "
+ "as argument.");
+goto out;
+}
+break;
+case OPTION_SIZE:
+{
+int64_t sval;
+
+sval = cvtnum(optarg);
+if (sval < 0) {
+if (sval == -ERANGE) {
+error_report("Image size must be less than 8 EiB!");
+} else {
+error_report("Invalid image size specified! You may use "
+ "k, M, G, T, P or E suffixes for ");
+error_report("kilobytes, megabytes, gigabytes, terabytes, "
+ "petabytes and exabytes.");
+}
+

[Qemu-devel] [PATCH v5 2/9] raw-format: add bdrv_measure() support

2017-04-18 Thread Stefan Hajnoczi

Maximum size calculation is trivial for the raw format: it's just the
requested image size (because there is no metadata).

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Alberto Garcia 
---
 block/raw-format.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index 86fbc65..e66ba00 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -312,6 +312,27 @@ static int64_t raw_getlength(BlockDriverState *bs)
 return s->size;
 }
 
+static void raw_measure(QemuOpts *opts, BlockDriverState *in_bs,
+BlockMeasureInfo *info,
+Error **errp)
+{
+if (in_bs) {
+int64_t ssize = bdrv_getlength(in_bs);
+if (ssize < 0) {
+error_setg_errno(errp, -ssize, "Unable to get image size");
+return;
+}
+info->required = ssize;
+} else {
+info->required =
+ROUND_UP(qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0),
+ BDRV_SECTOR_SIZE);
+}
+
+/* Unallocated sectors count towards the file size in raw images */
+info->fully_allocated = info->required;
+}
+
 static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 return bdrv_get_info(bs->file->bs, bdi);
@@ -477,6 +498,7 @@ BlockDriver bdrv_raw = {
 .bdrv_truncate= &raw_truncate,
 .bdrv_getlength   = &raw_getlength,
 .has_variable_length  = true,
+.bdrv_measure = &raw_measure,
 .bdrv_get_info= &raw_get_info,
 .bdrv_refresh_limits  = &raw_refresh_limits,
 .bdrv_probe_blocksizes = &raw_probe_blocksizes,
-- 
2.9.3

[Qemu-devel] [PATCH v5 4/9] qcow2: make refcount size calculation conservative

2017-04-18 Thread Stefan Hajnoczi

The refcount metadata size calculation is inaccurate and can produce
numbers that are too small.  This is bad because we should calculate a
conservative number - one that is guaranteed to be large enough.

This patch switches the approach to a fixed point calculation because
the existing equation is hard to solve when inaccuracies are taken care
of.

Signed-off-by: Stefan Hajnoczi 
---
 block/qcow2.c | 82 ++-
 1 file changed, 42 insertions(+), 40 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 7c702f4..ffc529e 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2095,6 +2095,43 @@ static int preallocate(BlockDriverState *bs)
 return 0;
 }
 
+/* qcow2_refcount_metadata_size:
+ * @clusters: number of clusters to refcount (including data and L1/L2 tables)
+ * @cluster_size: size of a cluster, in bytes
+ * @refcount_order: refcount bits power-of-2 exponent
+ *
+ * Returns: Number of bytes required for refcount blocks and table metadata.
+ */
+static int64_t qcow2_refcount_metadata_size(int64_t clusters,
+size_t cluster_size,
+int refcount_order)
+{
+/*
+ * Every host cluster is reference-counted, including metadata (even
+ * refcount metadata is recursively included).
+ *
+ * An accurate formula for the size of refcount metadata size is difficult
+ * to derive.  An easier method of calculation is finding the fixed point
+ * where no further refcount blocks or table clusters are required to
+ * reference count every cluster.
+ */
+int64_t blocks_per_table_cluster = cluster_size / sizeof(uint64_t);
+int64_t refcounts_per_block = cluster_size * 8 / (1 << refcount_order);
+int64_t table = 0;  /* number of refcount table clusters */
+int64_t blocks = 0; /* number of refcount block clusters */
+int64_t last;
+int64_t n = 0;
+
+do {
+last = n;
+blocks = DIV_ROUND_UP(clusters + table + blocks, refcounts_per_block);
+table = DIV_ROUND_UP(blocks, blocks_per_table_cluster);
+n = clusters + blocks + table;
+} while (n != last);
+
+return (blocks + table) * cluster_size;
+}
+
 /**
  * qcow2_calc_prealloc_size:
  * @total_size: virtual disk size in bytes
@@ -2108,22 +2145,9 @@ static int64_t qcow2_calc_prealloc_size(int64_t 
total_size,
 size_t cluster_size,
 int refcount_order)
 {
-/* Note: The following calculation does not need to be exact; if it is a
- * bit off, either some bytes will be "leaked" (which is fine) or we
- * will need to increase the file size by some bytes (which is fine,
- * too, as long as the bulk is allocated here). Therefore, using
- * floating point arithmetic is fine. */
 int64_t meta_size = 0;
-uint64_t nreftablee, nrefblocke, nl1e, nl2e;
+uint64_t nl1e, nl2e;
 int64_t aligned_total_size = align_offset(total_size, cluster_size);
-int cluster_bits = ctz32(cluster_size);
-int refblock_bits, refblock_size;
-/* refcount entry size in bytes */
-double rces = (1 << refcount_order) / 8.;
-
-/* see qcow2_open() */
-refblock_bits = cluster_bits - (refcount_order - 3);
-refblock_size = 1 << refblock_bits;
 
 /* header: 1 cluster */
 meta_size += cluster_size;
@@ -2138,32 +2162,10 @@ static int64_t qcow2_calc_prealloc_size(int64_t 
total_size,
 nl1e = align_offset(nl1e, cluster_size / sizeof(uint64_t));
 meta_size += nl1e * sizeof(uint64_t);
 
-/* total size of refcount blocks
- *
- * note: every host cluster is reference-counted, including metadata
- * (even refcount blocks are recursively included).
- * Let:
- *   a = total_size (this is the guest disk size)
- *   m = meta size not including refcount blocks and refcount tables
- *   c = cluster size
- *   y1 = number of refcount blocks entries
- *   y2 = meta size including everything
- *   rces = refcount entry size in bytes
- * then,
- *   y1 = (y2 + a)/c
- *   y2 = y1 * rces + y1 * rces * sizeof(u64) / c + m
- * we can get y1:
- *   y1 = (a + m) / (c - rces - rces * sizeof(u64) / c)
- */
-nrefblocke = (aligned_total_size + meta_size + cluster_size)
-/ (cluster_size - rces - rces * sizeof(uint64_t)
-/ cluster_size);
-meta_size += DIV_ROUND_UP(nrefblocke, refblock_size) * cluster_size;
-
-/* total size of refcount tables */
-nreftablee = nrefblocke / refblock_size;
-nreftablee = align_offset(nreftablee, cluster_size / sizeof(uint64_t));
-meta_size += nreftablee * sizeof(uint64_t);
+/* total size of refcount table and blocks */
+meta_size += qcow2_refcount_metadata_size(
+(meta_size + aligned_total_size) / cluster_size,
+cluster_size, refcount_order);
 
 return meta_size + aligned_total_si

[Qemu-devel] [PATCH v5 8/9] qemu-iotests: support per-format golden output files

2017-04-18 Thread Stefan Hajnoczi

Some tests produce format-dependent output.  Either the difference is
filtered out and ignored, or the test case is format-specific so we
don't need to worry about per-format output differences.

There is a third case: the test script is the same for all image formats
and the format-dependent output is relevant.  An ugly workaround is to
copy-paste the test into multiple per-format test cases.  This
duplicates code and is not maintainable.

This patch allows test cases to add per-format golden output files so a
single test case can work correctly when format-dependent output must be
checked:

  123.out.qcow2
  123.out.raw
  123.out.vmdk
  ...

This naming scheme is not composable with 123.out.nocache or 123.pc.out,
two other scenarios where output files are split.  I don't think it
matters since few test cases need these features.

Signed-off-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/check | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
index 4b1c674..29553cf 100755
--- a/tests/qemu-iotests/check
+++ b/tests/qemu-iotests/check
@@ -338,6 +338,11 @@ do
 reference="$reference_machine"
 fi
 
+reference_format="$source_iotests/$seq.out.$IMGFMT"
+if [ -f "$reference_format" ]; then
+reference="$reference_format"
+fi
+
 if [ "$CACHEMODE" = "none" ]; then
 [ -f "$source_iotests/$seq.out.nocache" ] && 
reference="$source_iotests/$seq.out.nocache"
 fi
-- 
2.9.3

[Qemu-devel] [PATCH v5 3/9] qcow2: extract preallocation calculation function

2017-04-18 Thread Stefan Hajnoczi

Calculating the preallocated image size will be needed to implement
.bdrv_measure().  Extract the code out into a separate function.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Alberto Garcia 
---
 block/qcow2.c | 134 +-
 1 file changed, 76 insertions(+), 58 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 6a92d2e..7c702f4 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2095,6 +2095,79 @@ static int preallocate(BlockDriverState *bs)
 return 0;
 }
 
+/**
+ * qcow2_calc_prealloc_size:
+ * @total_size: virtual disk size in bytes
+ * @cluster_size: cluster size in bytes
+ * @refcount_order: refcount bits power-of-2 exponent
+ *
+ * Returns: Total number of bytes required for the fully allocated image
+ * (including metadata).
+ */
+static int64_t qcow2_calc_prealloc_size(int64_t total_size,
+size_t cluster_size,
+int refcount_order)
+{
+/* Note: The following calculation does not need to be exact; if it is a
+ * bit off, either some bytes will be "leaked" (which is fine) or we
+ * will need to increase the file size by some bytes (which is fine,
+ * too, as long as the bulk is allocated here). Therefore, using
+ * floating point arithmetic is fine. */
+int64_t meta_size = 0;
+uint64_t nreftablee, nrefblocke, nl1e, nl2e;
+int64_t aligned_total_size = align_offset(total_size, cluster_size);
+int cluster_bits = ctz32(cluster_size);
+int refblock_bits, refblock_size;
+/* refcount entry size in bytes */
+double rces = (1 << refcount_order) / 8.;
+
+/* see qcow2_open() */
+refblock_bits = cluster_bits - (refcount_order - 3);
+refblock_size = 1 << refblock_bits;
+
+/* header: 1 cluster */
+meta_size += cluster_size;
+
+/* total size of L2 tables */
+nl2e = aligned_total_size / cluster_size;
+nl2e = align_offset(nl2e, cluster_size / sizeof(uint64_t));
+meta_size += nl2e * sizeof(uint64_t);
+
+/* total size of L1 tables */
+nl1e = nl2e * sizeof(uint64_t) / cluster_size;
+nl1e = align_offset(nl1e, cluster_size / sizeof(uint64_t));
+meta_size += nl1e * sizeof(uint64_t);
+
+/* total size of refcount blocks
+ *
+ * note: every host cluster is reference-counted, including metadata
+ * (even refcount blocks are recursively included).
+ * Let:
+ *   a = total_size (this is the guest disk size)
+ *   m = meta size not including refcount blocks and refcount tables
+ *   c = cluster size
+ *   y1 = number of refcount blocks entries
+ *   y2 = meta size including everything
+ *   rces = refcount entry size in bytes
+ * then,
+ *   y1 = (y2 + a)/c
+ *   y2 = y1 * rces + y1 * rces * sizeof(u64) / c + m
+ * we can get y1:
+ *   y1 = (a + m) / (c - rces - rces * sizeof(u64) / c)
+ */
+nrefblocke = (aligned_total_size + meta_size + cluster_size)
+/ (cluster_size - rces - rces * sizeof(uint64_t)
+/ cluster_size);
+meta_size += DIV_ROUND_UP(nrefblocke, refblock_size) * cluster_size;
+
+/* total size of refcount tables */
+nreftablee = nrefblocke / refblock_size;
+nreftablee = align_offset(nreftablee, cluster_size / sizeof(uint64_t));
+meta_size += nreftablee * sizeof(uint64_t);
+
+return meta_size + aligned_total_size;
+}
+
 static int qcow2_create2(const char *filename, int64_t total_size,
  const char *backing_file, const char *backing_format,
  int flags, size_t cluster_size, PreallocMode prealloc,
@@ -2133,64 +2206,9 @@ static int qcow2_create2(const char *filename, int64_t 
total_size,
 int ret;
 
 if (prealloc == PREALLOC_MODE_FULL || prealloc == PREALLOC_MODE_FALLOC) {
-/* Note: The following calculation does not need to be exact; if it is 
a
- * bit off, either some bytes will be "leaked" (which is fine) or we
- * will need to increase the file size by some bytes (which is fine,
- * too, as long as the bulk is allocated here). Therefore, using
- * floating point arithmetic is fine. */
-int64_t meta_size = 0;
-uint64_t nreftablee, nrefblocke, nl1e, nl2e;
-int64_t aligned_total_size = align_offset(total_size, cluster_size);
-int refblock_bits, refblock_size;
-/* refcount entry size in bytes */
-double rces = (1 << refcount_order) / 8.;
-
-/* see qcow2_open() */
-refblock_bits = cluster_bits - (refcount_order - 3);
-refblock_size = 1 << refblock_bits;
-
-/* header: 1 cluster */
-meta_size += cluster_size;
-
-/* total size of L2 tables */
-nl2e = aligned_total_size / cluster_size;
-nl2e = align_offset(nl2e, cluster_size / sizeof(uint64_t));
-meta_size += nl2e * sizeof(uint64_t);
-
-/* total size of L1 tables */
-nl1e = nl2e * sizeof(

[Qemu-devel] [PATCH v5 5/9] qcow2: extract image creation option parsing

2017-04-18 Thread Stefan Hajnoczi

The image creation options parsed by qcow2_create() are also needed to
implement .bdrv_measure().  Extract the parsing code, including input
validation.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Alberto Garcia 
---
 block/qcow2.c | 109 +++---
 1 file changed, 73 insertions(+), 36 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ffc529e..0181d72 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2170,24 +2170,73 @@ static int64_t qcow2_calc_prealloc_size(int64_t 
total_size,
 return meta_size + aligned_total_size;
 }
 
-static int qcow2_create2(const char *filename, int64_t total_size,
- const char *backing_file, const char *backing_format,
- int flags, size_t cluster_size, PreallocMode prealloc,
- QemuOpts *opts, int version, int refcount_order,
- Error **errp)
+static size_t qcow2_opt_get_cluster_size_del(QemuOpts *opts, Error **errp)
 {
+size_t cluster_size;
 int cluster_bits;
-QDict *options;
 
-/* Calculate cluster_bits */
+cluster_size = qemu_opt_get_size_del(opts, BLOCK_OPT_CLUSTER_SIZE,
+ DEFAULT_CLUSTER_SIZE);
 cluster_bits = ctz32(cluster_size);
 if (cluster_bits < MIN_CLUSTER_BITS || cluster_bits > MAX_CLUSTER_BITS ||
 (1 << cluster_bits) != cluster_size)
 {
 error_setg(errp, "Cluster size must be a power of two between %d and "
"%dk", 1 << MIN_CLUSTER_BITS, 1 << (MAX_CLUSTER_BITS - 10));
-return -EINVAL;
+return 0;
 }
+return cluster_size;
+}
+
+static int qcow2_opt_get_version_del(QemuOpts *opts, Error **errp)
+{
+char *buf;
+int ret;
+
+buf = qemu_opt_get_del(opts, BLOCK_OPT_COMPAT_LEVEL);
+if (!buf) {
+ret = 3; /* default */
+} else if (!strcmp(buf, "0.10")) {
+ret = 2;
+} else if (!strcmp(buf, "1.1")) {
+ret = 3;
+} else {
+error_setg(errp, "Invalid compatibility level: '%s'", buf);
+ret = -EINVAL;
+}
+g_free(buf);
+return ret;
+}
+
+static uint64_t qcow2_opt_get_refcount_bits_del(QemuOpts *opts, int version,
+Error **errp)
+{
+uint64_t refcount_bits;
+
+refcount_bits = qemu_opt_get_number_del(opts, BLOCK_OPT_REFCOUNT_BITS, 16);
+if (refcount_bits > 64 || !is_power_of_2(refcount_bits)) {
+error_setg(errp, "Refcount width must be a power of two and may not "
+   "exceed 64 bits");
+return 0;
+}
+
+if (version < 3 && refcount_bits != 16) {
+error_setg(errp, "Different refcount widths than 16 bits require "
+   "compatibility level 1.1 or above (use compat=1.1 or "
+   "greater)");
+return 0;
+}
+
+return refcount_bits;
+}
+
+static int qcow2_create2(const char *filename, int64_t total_size,
+ const char *backing_file, const char *backing_format,
+ int flags, size_t cluster_size, PreallocMode prealloc,
+ QemuOpts *opts, int version, int refcount_order,
+ Error **errp)
+{
+QDict *options;
 
 /*
  * Open the image file and write a minimal qcow2 header.
@@ -2237,7 +2286,7 @@ static int qcow2_create2(const char *filename, int64_t 
total_size,
 *header = (QCowHeader) {
 .magic  = cpu_to_be32(QCOW_MAGIC),
 .version= cpu_to_be32(version),
-.cluster_bits   = cpu_to_be32(cluster_bits),
+.cluster_bits   = cpu_to_be32(ctz32(cluster_size)),
 .size   = cpu_to_be64(0),
 .l1_table_offset= cpu_to_be64(0),
 .l1_size= cpu_to_be32(0),
@@ -2373,8 +2422,8 @@ static int qcow2_create(const char *filename, QemuOpts 
*opts, Error **errp)
 int flags = 0;
 size_t cluster_size = DEFAULT_CLUSTER_SIZE;
 PreallocMode prealloc;
-int version = 3;
-uint64_t refcount_bits = 16;
+int version;
+uint64_t refcount_bits;
 int refcount_order;
 Error *local_err = NULL;
 int ret;
@@ -2387,8 +2436,12 @@ static int qcow2_create(const char *filename, QemuOpts 
*opts, Error **errp)
 if (qemu_opt_get_bool_del(opts, BLOCK_OPT_ENCRYPT, false)) {
 flags |= BLOCK_FLAG_ENCRYPT;
 }
-cluster_size = qemu_opt_get_size_del(opts, BLOCK_OPT_CLUSTER_SIZE,
- DEFAULT_CLUSTER_SIZE);
+cluster_size = qcow2_opt_get_cluster_size_del(opts, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+ret = -EINVAL;
+goto finish;
+}
 buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
 prealloc = qapi_enum_parse(PreallocMode_lookup, buf,
PREALLOC_MODE__MAX, PREALLOC_MODE_OFF,
@@ -2

Re: [Qemu-devel] [PATCH v9 1/9] memory: add section range info for IOMMU notifier

2017-04-18 Thread David Gibson

On Tue, Apr 18, 2017 at 05:56:37PM +0800, Peter Xu wrote:
> On Tue, Apr 11, 2017 at 11:56:54AM +1000, David Gibson wrote:
> > On Mon, Apr 10, 2017 at 03:09:50PM +0800, Peter Xu wrote:
> > > On Mon, Apr 10, 2017 at 02:39:22PM +1000, David Gibson wrote:
> > > > On Fri, Apr 07, 2017 at 06:59:07PM +0800, Peter Xu wrote:
> > > > > In this patch, IOMMUNotifier.{start|end} are introduced to store 
> > > > > section
> > > > > information for a specific notifier. When notification occurs, we not
> > > > > only check the notification type (MAP|UNMAP), but also check whether 
> > > > > the
> > > > > notified iova range overlaps with the range of specific IOMMU 
> > > > > notifier,
> > > > > and skip those notifiers if not in the listened range.
> > > > > 
> > > > > When removing an region, we need to make sure we removed the correct
> > > > > VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> > > > > 
> > > > > This patch is solving the problem that vfio-pci devices receive
> > > > > duplicated UNMAP notification on x86 platform when vIOMMU is there. 
> > > > > The
> > > > > issue is that x86 IOMMU has a (0, 2^64-1) IOMMU region, which is
> > > > > splitted by the (0xfee0, 0xfeef) IRQ region. AFAIK
> > > > > this (splitted IOMMU region) is only happening on x86.
> > > > > 
> > > > > This patch also helps vhost to leverage the new interface as well, so
> > > > > that vhost won't get duplicated cache flushes. In that sense, it's an
> > > > > slight performance improvement.
> > > > > 
> > > > > Suggested-by: David Gibson 
> > > > > Reviewed-by: Eric Auger 
> > > > > Reviewed-by: Michael S. Tsirkin 
> > > > > Acked-by: Alex Williamson 
> > > > > Signed-off-by: Peter Xu 
> > > > > ---
> > > > >  hw/vfio/common.c  | 12 +---
> > > > >  hw/virtio/vhost.c | 10 --
> > > > >  include/exec/memory.h | 19 ++-
> > > > >  memory.c  |  9 +
> > > > >  4 files changed, 44 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > > index f3ba9b9..6b33b9f 100644
> > > > > --- a/hw/vfio/common.c
> > > > > +++ b/hw/vfio/common.c
> > > > > @@ -478,8 +478,13 @@ static void 
> > > > > vfio_listener_region_add(MemoryListener *listener,
> > > > >  giommu->iommu_offset = section->offset_within_address_space -
> > > > > section->offset_within_region;
> > > > >  giommu->container = container;
> > > > > -giommu->n.notify = vfio_iommu_map_notify;
> > > > > -giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> > > > > +llend = 
> > > > > int128_add(int128_make64(section->offset_within_region),
> > > > > +   section->size);
> > > > > +llend = int128_sub(llend, int128_one());
> > > > > +iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> > > > > +IOMMU_NOTIFIER_ALL,
> > > > > +section->offset_within_region,
> > > > > +int128_get64(llend));
> > > > 
> > > > Seems to me it would make sense to put the fiddling around to convert
> > > > the MemoryRegionSection into the necessary values would make sense to
> > > > go inside iommu_notifier_init().
> > > 
> > > But will we always get one MemoryRegionSection if we are not in any of
> > > the region_{add|del} callback? E.g., what if we want to init an IOMMU
> > > notifier that covers just the whole IOMMU region range?
> > 
> > I suppose so.  It's just the only likely users of the interface I can
> > see will be always doing this from region_{add,del}.
> > 
> > > Considering above, I would still slightly prefer current interface.
> > 
> > Ok, well my opinion on the matter isn't terribly strong.
> 
> Hi, David,
> 
> Sorry to respond late (so that context might be lost). Just want to
> make sure that you are okay with current patch and interface, right?
> 
> I think Marcel is going to merge it if np, and I would like to have
> your confirmation on this before the merge. Thanks!

Yes, that's fine.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature

[Qemu-devel] [PATCH v5 9/9] iotests: add test 178 for qemu-img measure

2017-04-18 Thread Stefan Hajnoczi

Signed-off-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/178   | 168 +++
 tests/qemu-iotests/178.out.qcow2 | 278 +++
 tests/qemu-iotests/178.out.raw   | 146 
 tests/qemu-iotests/group |   1 +
 4 files changed, 593 insertions(+)
 create mode 100755 tests/qemu-iotests/178
 create mode 100644 tests/qemu-iotests/178.out.qcow2
 create mode 100644 tests/qemu-iotests/178.out.raw

diff --git a/tests/qemu-iotests/178 b/tests/qemu-iotests/178
new file mode 100755
index 000..a835a97
--- /dev/null
+++ b/tests/qemu-iotests/178
@@ -0,0 +1,168 @@
+#!/bin/bash
+#
+# qemu-img measure sub-command tests
+#
+# Copyright (C) 2017 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+# creator
+owner=stefa...@redhat.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1# failure is the default!
+
+_cleanup()
+{
+_cleanup_test_img
+rm -f "$TEST_IMG.converted"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.pattern
+
+_supported_fmt raw qcow2
+_supported_proto file
+_supported_os Linux
+
+echo "== Input validation =="
+echo
+
+_make_test_img 1G
+
+$QEMU_IMG measure # missing arguments
+$QEMU_IMG measure --size 2G "$TEST_IMG" # only one allowed
+$QEMU_IMG measure "$TEST_IMG" a # only one filename allowed
+$QEMU_IMG measure --object secret,id=sec0,data=MTIzNDU2,format=base64 # 
missing filename
+$QEMU_IMG measure --image-opts # missing filename
+$QEMU_IMG measure -f qcow2 # missing filename
+$QEMU_IMG measure -l snap1 # missing filename
+$QEMU_IMG measure -o , # invalid option list
+$QEMU_IMG measure -l snapshot.foo # invalid snapshot option
+$QEMU_IMG measure --output foo # invalid output format
+$QEMU_IMG measure --size -1 # invalid image size
+$QEMU_IMG measure -O foo "$TEST_IMG" # unknown image file format
+
+make_test_img_with_fmt() {
+# Shadow global variables within this function
+local IMGFMT="$1" IMGOPTS=""
+_make_test_img "$2"
+}
+
+qemu_io_with_fmt() {
+# Shadow global variables within this function
+local QEMU_IO_OPTIONS=$(echo "$QEMU_IO_OPTIONS" | sed "s/-f $IMGFMT/-f 
$1/")
+shift
+$QEMU_IO "$@"
+}
+
+# The proof is in the pudding: converted image size cannot be larger than the
+# required size.
+#
+# Note: if a change to the image format code causes the file size to change,
+# then this test fails!  This is good because it's a reminder to check that the
+# required size is still at least as big as the actual converted file size.
+convert_and_show_size() {
+local fmt="$1"
+shift
+$QEMU_IMG convert -f "$fmt" -O "$IMGFMT" "$TEST_IMG" "$@" 
"$TEST_IMG.converted"
+stat -c "converted image file size in bytes: %s" "$TEST_IMG.converted"
+}
+
+for ofmt in human json; do
+echo
+echo "== Size calculation for a new file ($ofmt) =="
+echo
+
+# Try a few interesting sizes
+$QEMU_IMG measure --output=$ofmt -O "$IMGFMT" --size 0
+$QEMU_IMG measure --output=$ofmt -O "$IMGFMT" --size 2G
+$QEMU_IMG measure --output=$ofmt -O "$IMGFMT" --size 64G
+$QEMU_IMG measure --output=$ofmt -O "$IMGFMT" --size 256G
+$QEMU_IMG measure --output=$ofmt -O "$IMGFMT" --size 1T
+
+# Always test the raw input files but also IMGFMT
+for fmt in $(echo -e "raw\n$IMGFMT\n" | sort -u); do
+echo
+echo "== Empty $fmt input image ($ofmt) =="
+echo
+make_test_img_with_fmt "$fmt" 0
+$QEMU_IMG measure --output=$ofmt -f "$fmt" -O "$IMGFMT" "$TEST_IMG"
+echo
+convert_and_show_size "$fmt"
+
+echo
+echo "== $fmt input image with data ($ofmt) =="
+echo
+make_test_img_with_fmt "$fmt" 1G
+$QEMU_IMG measure --output=$ofmt -f "$fmt" -O "$IMGFMT" "$TEST_IMG"
+qemu_io_with_fmt "$fmt" -c "write 512 512" "$TEST_IMG" | 
_filter_qemu_io
+qemu_io_with_fmt "$fmt" -c "write 64K 64K" "$TEST_IMG" | 
_filter_qemu_io
+if [ "$fmt" = "qcow2" ]; then
+$QEMU_IMG snapshot -c snapshot1 "$TEST_IMG"
+fi
+qemu_io_with_fmt "$fmt" -c "write 128M 63K" "$TEST_IMG" | 
_filter_qemu_io
+$QEMU_IMG measure --output=$ofmt -f "$fmt" -O "$IMGFMT" "$TEST_IMG"
+echo
+convert_and_show_size "$fmt"
+
+if [ "$fmt" = "qcow2"

Re: [Qemu-devel] [PATCH 1/7] target/openrisc: Fixes for memory debugging

2017-04-18 Thread Stafford Horne

On Tue, Apr 18, 2017 at 12:47:30AM -0700, Richard Henderson wrote:
> On 04/16/2017 04:23 PM, Stafford Horne wrote:
> > When debugging in gdb you might want to inspect instructions in mapped
> > pages or in exception vectors like 0x800 etc.  This was previously not
> > possible in qemu since the *get_phys_page_debug() routine only looked
> > into the data tlb.
> > 
> > Change to fall back to look into instruction tlb and plain physical
> > pages.
> > 
> > Signed-off-by: Stafford Horne 
> 
> Oh the horrors of a software managed TLB.
> 
> You might do well to architecturally define an SPR that holds the page table
> base, even if for real hardware that's only used by the software refill to
> load up the address.
> 
> That would give qemu the option of performing a real page table walk.  This
> would fix this debug hook properly (so that you can examine pages that
> aren't in the TLB at all).  It would also optionally allow QEMU to skip the
> software refill, which *significantly* speeds up emulation.

Understood, I guess we would also need a way to represent which paging
model we are using (1 level, 2 level etc)?

> That said,
> 
> Reviewed-by: Richard Henderson 

Thanks for the review.

-Stafford

> 
> r~

Re: [Qemu-devel] [PATCH 2/7] target/openrisc: add shutdown logic

2017-04-18 Thread Stafford Horne

On Tue, Apr 18, 2017 at 12:52:52AM -0700, Richard Henderson wrote:
> On 04/16/2017 04:23 PM, Stafford Horne wrote:
> > In openrisc simulators we use hooks like 'l.nop 1' to cause the
> > simulator to exit.  Implement that for qemu too.
> > 
> > Reported-by: Waldemar Brodkorb 
> > Signed-off-by: Stafford Horne 
> 
> As I said the first time this was posted: This is horrible.
> 
> If you want to do something like this, it needs to be buried under a special
> run mode like -semihosting.

Understood,  I will revise this.  I didnt know this was posted before.

> >  case 0x01:/* l.nop */
> >  LOG_DIS("l.nop %d\n", I16);
> > +{
> > +TCGv_i32 arg = tcg_const_i32(I16);
> > +gen_helper_nop(arg);
> > +}
> 
> You also really really must special-case l.nop 0 so that it doesn't generate
> a function call.  Just think of all the extra calls you're adding for every
> delay slot that couldn't be filled.

Yeah, that makes sense.  Ill add that for l.nop 0.

-Stafford

> 
> r~

Re: [Qemu-devel] [PATCH 4/7] target/openrisc: implement shadow registers

2017-04-18 Thread Stafford Horne

On Tue, Apr 18, 2017 at 01:11:53AM -0700, Richard Henderson wrote:
> On 04/16/2017 04:23 PM, Stafford Horne wrote:
> > Shadow registers are part of the openrisc spec along with sr[cid], as
> > part of the fast context switching feature.  When exceptions occur,
> > instead of having to save registers to the stack if enabled the CID will
> > increment and a new set of registers will be available.
> > 
> > This patch only implements shadow registers which can be used as extra
> > scratch registers via the mfspr and mtspr if required.  This is
> > implemented in a way where it would be easy to add on the fast context
> > switching, currently cid is hardcoded to 0.
> 
> I'm not especially keen on this half-conversion.
> If CID cannot be changed, then
> 
> > -target_ulong gpr[32]; /* General registers */
> > +target_ulong shadow_gpr[16][32]; /* Shadow registers */
> > +target_ulong * gpr;   /* General registers (backed by shadow) */
> 
> this pointer should not be necessary.  Just use a union, or even just
> 
> target_ulong gpr[32];
> target_ulong shadow_gpr[15][32];
> 
> for now.
> 
> Alternately, add accessor functions that take the whole ENV (which would be
> able to read CID, when needed).  C.f.
> 
> uint64_t cpu_alpha_load_gr(CPUAlphaState *env, unsigned reg);
> void cpu_alpha_store_gr(CPUAlphaState *env, unsigned reg, uint64_t val);

Actually, I started off writing it this way, but there were a lot of places
that used env->gpr.  I had some regex's to replace everything, but in the
end I thought using the gpr pointer was just easier.

Using the union would work as well, but it wouldnt help to allow switching
of cid.  My idea with the pointer was that if cid was updates I could
update the pointer at that time.

Ill rework this to use the accessor functions.

-Stafford

> If/when CID can be changed, then we can talk about various ways
> that this can be modeled within TCG.
> 
> 
> r~

Re: [Qemu-devel] [PATCH 7/7] target/openrisc: Implement full vmstate serialization

2017-04-18 Thread Stafford Horne

On Tue, Apr 18, 2017 at 01:14:19AM -0700, Richard Henderson wrote:
> On 04/16/2017 04:23 PM, Stafford Horne wrote:
> > Previously serialization did not persist the tlb, timer, pic and other
> > key state items.  This meant snapshotting and restoring a running os
> > would crash. After adding these I am able to take snapshots of a
> > running linux os and restore at a later time.
> > 
> > I am currently not trying to maintain capatibility with older versions
> > as I do not believe this really worked before or anyone used it.
> 
> That's fine, but you still have to bump the version numbers.

Actually, I bumped the version in the previous patch.  Ill make that more
clear an do that in this patch instead.

-Stafford

> r~
>

[Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Fam Zheng

v4: Split patch, and fix the unsafe bdrv_unref. [Paolo]

Fam Zheng (2):
  block: Walk bs->children carefully in bdrv_drain_recurse
  block: Drain BH in bdrv_drained_begin

 block/io.c| 23 ---
 include/block/block.h | 22 ++
 2 files changed, 34 insertions(+), 11 deletions(-)

-- 
2.9.3

Re: [Qemu-devel] guest host time synchronization

2017-04-18 Thread Stefan Hajnoczi

On Tue, Apr 11, 2017 at 09:01:40AM +0200, Jiahuan Zhang wrote:
> I made a test to check the time difference of QEMU emulated ARM-based Linux
> guest and the Windows host. The result is, the guest time is around 0.2
> second faster than the host. I used the "-rtc base" option.
> 
> So can I say that, the time precision of QEMU emulation is in seconds? Is
> it possible to make it more precise? because I need to test the latency of
> my guest/host communication with high time resolution, at least in
> miliseconds.
> 
> Please let me know if you have any method  for the communication
> performance testing. Thank you very much.

This can depend on the exact board (machine type) being emulated because
the emulated clock and timer hardware may be different.

Please share your QEMU command-line so we can see the exact ARM guest
hardware configuration.

Stefan


signature.asc
Description: PGP signature

[Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse

2017-04-18 Thread Fam Zheng

The recursive bdrv_drain_recurse may run a block job completion BH that
drops nodes. The coming changes will make that more likely and use-after-free
would happen without this patch

Stash the bs pointer and use bdrv_ref/bdrv_unref in addition to
QLIST_FOREACH_SAFE to prevent such a case from happening.

Since bdrv_unref accesses global state that is not protected by the AioContext
lock, we cannot use bdrv_ref/bdrv_unref unconditionally.  Fortunately the
protection is not needed in IOThread because only main loop can modify a graph
with the AioContext lock held.

Signed-off-by: Fam Zheng 
---
 block/io.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/block/io.c b/block/io.c
index 8706bfa..a0df8c4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -158,7 +158,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
 
 static bool bdrv_drain_recurse(BlockDriverState *bs)
 {
-BdrvChild *child;
+BdrvChild *child, *tmp;
 bool waited;
 
 waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
@@ -167,8 +167,25 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
 bs->drv->bdrv_drain(bs);
 }
 
-QLIST_FOREACH(child, &bs->children, next) {
-waited |= bdrv_drain_recurse(child->bs);
+QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
+BlockDriverState *bs = child->bs;
+bool in_main_loop =
+qemu_get_current_aio_context() == qemu_get_aio_context();
+assert(bs->refcnt > 0);
+if (in_main_loop) {
+/* In case the resursive bdrv_drain_recurse processes a
+ * block_job_defer_to_main_loop BH and modifies the graph,
+ * let's hold a reference to bs until we are done.
+ *
+ * IOThread doesn't have such a BH, and it is not safe to call
+ * bdrv_unref without BQL, so skip doing it there.
+ **/
+bdrv_ref(bs);
+}
+waited |= bdrv_drain_recurse(bs);
+if (in_main_loop) {
+bdrv_unref(bs);
+}
 }
 
 return waited;
-- 
2.9.3

Re: [Qemu-devel] [RFC for-2.10 2/3] pci: Allow host bridges to override PCI/PCIe hybrid device behaviour

2017-04-18 Thread Eduardo Habkost

On Tue, Apr 18, 2017 at 12:21:51PM +1000, David Gibson wrote:
> On Mon, Apr 17, 2017 at 03:30:46PM -0300, Eduardo Habkost wrote:
> > On Tue, Mar 28, 2017 at 01:16:50PM +1100, David Gibson wrote:
> > > Currently PCI/PCIe hybrid devices - that is, devices which can appear as
> > > either plain PCI or PCIe depending on where they're attached - will only
> > > appear in PCIe mode if they're attached to a PCIe bus via a root port or
> > > downstream port.
> > > 
> > > This is correct for "standard" PCIe setups, but there are some platforms
> > > which need different behaviour (notably "pseries" whose paravirtualized
> > > PCI host bridges have some idiosyncracies).
> > > 
> > > This patch allows the host bridge to override the normal behaviour.
> > > 
> > > Signed-off-by: David Gibson 
> > > ---
> > >  hw/pci/pci.c  | 11 +--
> > >  include/hw/pci/pci_host.h |  1 +
> > >  2 files changed, 10 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > index 779787b..ac68065 100644
> > > --- a/hw/pci/pci.c
> > > +++ b/hw/pci/pci.c
> > > @@ -392,9 +392,16 @@ bool pci_bus_is_root(PCIBus *bus)
> > >  
> > >  bool pci_allow_hybrid_pcie(PCIDevice *pci_dev)
> > 
> > Why are we asking the device about allowing hybrid pcie, and not
> > the bus itself? Shouldn't we be able to query this before the
> > device is plugged, and before pci_dev->bus is even set?
> > 
> > In other words, why pci_allow_hyberid_pcie() and
> > pci_device_get_bus() get a PCIDevice* argument instead of a
> > PCIBus* argument?
> 
> Ah good point.  I made it a PCIDevice simply for convenience, but
> you're right we should be able to query before the device is plugged.

Understood.

> 
> > 
> > >  {
> > > -PCIBus *bus = pci_dev->bus;
> > > +PCIHostState *host_bridge = 
> > > PCI_HOST_BRIDGE(pci_device_root_bus(pci_dev)->qbus.parent);
> > 
> > There's something I don't understand completely: what exactly
> > guarantees that pci_device_root_bus(...)->qbus.parent is always
> > going to be a PCI_HOST_BRIDGE?
> 
> Well, by definition whatever is above the root bus isn't PCI, which
> pretty much means it has to be a PCI host bridge.  A machine could
> break this assumption, but I think that would be a bug.  We use this
> same pattern to find a PCI device's (or bus's) host bridge in other
> places - there doesn't appear to be another way.

After taking a better look at the way PCI buses are created, I
agree it's OK. Maybe a
  PCIHostState *pci_host_bridge(PCIBus *bus)
helper would be useful?

Anyway, both pci_bus_root_bus() and pci_host_bridge() could be
implemented as follow-ups if you prefer, so:

Reviewed-by: Eduardo Habkost 

But I still have another suggestion:

> 
> > > +PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_GET_CLASS(host_bridge);
> > > +
> > > +if (hc->allow_hybrid_pcie) {
> > > +return hc->allow_hybrid_pcie(host_bridge, pci_dev);
> > > +} else {
> > > +PCIBus *bus = pci_dev->bus;
> > >  
> > > -return pci_bus_is_express(bus) && !pci_bus_is_root(bus);
> > > +return pci_bus_is_express(bus) && !pci_bus_is_root(bus);

Maybe it's a matter of personal taste, but instead of requiring
an explicit check for NULL hc->allow_hybrid_pcie, why not set a
default hc->allow_hybrid_pcie() implementation on
TYPE_PCI_HOST_BRIDGE that returns
(pci_bus_is_express(bus) && !pci_bus_is_root(bus))?

-- 
Eduardo

[Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Fam Zheng

During block job completion, nothing is preventing
block_job_defer_to_main_loop_bh from being called in a nested
aio_poll(), which is a trouble, such as in this code path:

qmp_block_commit
  commit_active_start
bdrv_reopen
  bdrv_reopen_multiple
bdrv_reopen_prepare
  bdrv_flush
aio_poll
  aio_bh_poll
aio_bh_call
  block_job_defer_to_main_loop_bh
stream_complete
  bdrv_reopen

block_job_defer_to_main_loop_bh is the last step of the stream job,
which should have been "paused" by the bdrv_drained_begin/end in
bdrv_reopen_multiple, but it is not done because it's in the form of a
main loop BH.

Similar to why block jobs should be paused between drained_begin and
drained_end, BHs they schedule must be excluded as well.  To achieve
this, this patch forces draining the BH in BDRV_POLL_WHILE.

As a side effect this fixes a hang in block_job_detach_aio_context
during system_reset when a block job is ready:

#0  0x55aa79f3 in bdrv_drain_recurse
#1  0x55aa825d in bdrv_drained_begin
#2  0x55aa8449 in bdrv_drain
#3  0x55a9c356 in blk_drain
#4  0x55aa3cfd in mirror_drain
#5  0x55a66e11 in block_job_detach_aio_context
#6  0x55a62f4d in bdrv_detach_aio_context
#7  0x55a63116 in bdrv_set_aio_context
#8  0x55a9d326 in blk_set_aio_context
#9  0x557e38da in virtio_blk_data_plane_stop
#10 0x559f9d5f in virtio_bus_stop_ioeventfd
#11 0x559fa49b in virtio_bus_stop_ioeventfd
#12 0x559f6a18 in virtio_pci_stop_ioeventfd
#13 0x559f6a18 in virtio_pci_reset
#14 0x559139a9 in qdev_reset_one
#15 0x55916738 in qbus_walk_children
#16 0x55913318 in qdev_walk_children
#17 0x55916738 in qbus_walk_children
#18 0x559168ca in qemu_devices_reset
#19 0x5581fcbb in pc_machine_reset
#20 0x558a4d96 in qemu_system_reset
#21 0x5577157a in main_loop_should_exit
#22 0x5577157a in main_loop
#23 0x5577157a in main

The rationale is that the loop in block_job_detach_aio_context cannot
make any progress in pausing/completing the job, because bs->in_flight
is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
BH. With this patch, it does.

Reported-by: Jeff Cody 
Signed-off-by: Fam Zheng 
---
 include/block/block.h | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 97d4330..5ddc0cf 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -381,12 +381,13 @@ void bdrv_drain_all(void);
 
 #define BDRV_POLL_WHILE(bs, cond) ({   \
 bool waited_ = false;  \
+bool busy_ = true; \
 BlockDriverState *bs_ = (bs);  \
 AioContext *ctx_ = bdrv_get_aio_context(bs_);  \
 if (aio_context_in_iothread(ctx_)) {   \
-while ((cond)) {   \
-aio_poll(ctx_, true);  \
-waited_ = true;\
+while ((cond) || busy_) {  \
+busy_ = aio_poll(ctx_, (cond));\
+waited_ |= !!(cond) | busy_;   \
 }  \
 } else {   \
 assert(qemu_get_current_aio_context() ==   \
@@ -398,11 +399,16 @@ void bdrv_drain_all(void);
  */\
 assert(!bs_->wakeup);  \
 bs_->wakeup = true;\
-while ((cond)) {   \
-aio_context_release(ctx_); \
-aio_poll(qemu_get_aio_context(), true);\
-aio_context_acquire(ctx_); \
-waited_ = true;\
+while (busy_) {\
+if ((cond)) {  \
+waited_ = busy_ = true;\
+aio_context_release(ctx_); \
+aio_poll(qemu_get_aio_context(), true);\
+aio_context_acquire(ctx_); \
+} else {   \
+busy_ = aio_poll(ctx_, false); \
+waited_ |= busy_;  \
+}  \
 }

Re: [Qemu-devel] [PATCH 03/14] cg3: remove unused width and height variables

2017-04-18 Thread Mark Cave-Ayland

On 15/04/17 11:54, Richard Henderson wrote:

> On 04/05/2017 01:35 AM, Mark Cave-Ayland wrote:
>> These aren't required since we can use the display width and height
>> directly.
>>
>> Signed-off-by: Mark Cave-Ayland 
>> ---
>>  hw/display/cg3.c |   15 ++-
>>  1 file changed, 6 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/display/cg3.c b/hw/display/cg3.c
>> index b42f60e..178a6dd 100644
>> --- a/hw/display/cg3.c
>> +++ b/hw/display/cg3.c
>> @@ -93,14 +93,11 @@ static void cg3_update_display(void *opaque)
>>  uint32_t *data;
>>  uint32_t dval;
>>  int x, y, y_start;
>> -unsigned int width, height;
>>  ram_addr_t page, page_min, page_max;
>>
>>  if (surface_bits_per_pixel(surface) != 32) {
>>  return;
>>  }
>> -width = s->width;
>> -height = s->height;
>>
>>  y_start = -1;
>>  page_min = -1;
>> @@ -110,11 +107,11 @@ static void cg3_update_display(void *opaque)
>>  data = (uint32_t *)surface_data(surface);
>>
>>  memory_region_sync_dirty_bitmap(&s->vram_mem);
>> -for (y = 0; y < height; y++) {
>> +for (y = 0; y < s->height; y++) {
> 
> I suspect the generated code is much worse, since the compiler can no
> longer tell that the loop bounds are constant.
> 
> You probably do want to keep width and height in local variables across
> this function.

Can you explain a bit more about this? My thoughts were that with
optimisation enabled the compiler would assume s->width is constant
throughout the loop by default (or at least in OpenBIOS I've had to
explicitly mark variables as volatile to ensure the compiler refetches
the original value instead of reusing a previous copy from the
stack/registers)?


ATB,

Mark.

Re: [Qemu-devel] [Qemu-block] migrate -b problems

2017-04-18 Thread Stefan Hajnoczi

On Wed, Apr 12, 2017 at 05:51:23PM +0800, 858585 jemmy wrote:
> it this bug?
> https://bugs.launchpad.net/qemu/+bug/1681688

Yes.  This discussion is about the long-term fix instead of a short-term
hack for QEMU 2.9.

Stefan


signature.asc
Description: PGP signature

Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Paolo Bonzini



On 18/04/2017 16:30, Fam Zheng wrote:
> v4: Split patch, and fix the unsafe bdrv_unref. [Paolo]
> 
> Fam Zheng (2):
>   block: Walk bs->children carefully in bdrv_drain_recurse
>   block: Drain BH in bdrv_drained_begin
> 
>  block/io.c| 23 ---
>  include/block/block.h | 22 ++
>  2 files changed, 34 insertions(+), 11 deletions(-)
> 

Thanks, this looks good.

Paolo

Re: [Qemu-devel] [PATCH 05/14] tcx: alter tcx_set_dirty() to accept address and length parameters

2017-04-18 Thread Mark Cave-Ayland

On 15/04/17 15:27, Philippe Mathieu-Daudé wrote:

Hi Philippe,

> Hi Mark,
> 
> On 04/05/2017 05:35 AM, Mark Cave-Ayland wrote:
>> Signed-off-by: Mark Cave-Ayland 
>> ---
>>  hw/display/tcx.c |   12 ++--
>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/display/tcx.c b/hw/display/tcx.c
>> index 8e26aae..d24466f 100644
>> --- a/hw/display/tcx.c
>> +++ b/hw/display/tcx.c
>> @@ -93,9 +93,9 @@ typedef struct TCXState {
>>  uint16_t cursy;
>>  } TCXState;
>>
>> -static void tcx_set_dirty(TCXState *s)
>> +static void tcx_set_dirty(TCXState *s, ram_addr_t addr, int len)
> 
> len is uint64_t

Yes, this was deliberate in order to match the existing tcx_* functions
in the same file which tend to use int for len/width parameters. Note
that TCX is fixed 1024x768 resolution giving a maximum len of ~800K so
there is no risk of overflow here.

> with this change:
> Reviewed-by: Philippe Mathieu-Daudé 
> 
> This function can also be inlined.

My current understanding is that these days compilers tend to handle
inlining themselves and inline is either a no-op or a very weak hint?

>>  {
>> -memory_region_set_dirty(&s->vram_mem, 0, MAXX * MAXY);
>> +memory_region_set_dirty(&s->vram_mem, addr, len);
>>  }
>>
>>  static inline int tcx24_check_dirty(TCXState *s, ram_addr_t page,
>> @@ -156,7 +156,7 @@ static void update_palette_entries(TCXState *s,
>> int start, int end)
>>  break;
>>  }
>>  }
>> -tcx_set_dirty(s);
>> +tcx_set_dirty(s, 0, memory_region_size(&s->vram_mem));
>>  }
>>
>>  static void tcx_draw_line32(TCXState *s1, uint8_t *d,
>> @@ -526,7 +526,7 @@ static void tcx_invalidate_display(void *opaque)
>>  {
>>  TCXState *s = opaque;
>>
>> -tcx_set_dirty(s);
>> +tcx_set_dirty(s, 0, memory_region_size(&s->vram_mem));
>>  qemu_console_resize(s->con, s->width, s->height);
>>  }
>>
>> @@ -534,7 +534,7 @@ static void tcx24_invalidate_display(void *opaque)
>>  {
>>  TCXState *s = opaque;
>>
>> -tcx_set_dirty(s);
>> +tcx_set_dirty(s, 0, memory_region_size(&s->vram_mem));
>>  qemu_console_resize(s->con, s->width, s->height);
>>  }
>>
>> @@ -543,7 +543,7 @@ static int vmstate_tcx_post_load(void *opaque, int
>> version_id)
>>  TCXState *s = opaque;
>>
>>  update_palette_entries(s, 0, 256);
>> -tcx_set_dirty(s);
>> +tcx_set_dirty(s, 0, memory_region_size(&s->vram_mem));
>>  return 0;
>>  }

ATB,

Mark.

Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse

2017-04-18 Thread Kevin Wolf

Am 18.04.2017 um 16:30 hat Fam Zheng geschrieben:
> The recursive bdrv_drain_recurse may run a block job completion BH that
> drops nodes. The coming changes will make that more likely and use-after-free
> would happen without this patch
> 
> Stash the bs pointer and use bdrv_ref/bdrv_unref in addition to
> QLIST_FOREACH_SAFE to prevent such a case from happening.
> 
> Since bdrv_unref accesses global state that is not protected by the AioContext
> lock, we cannot use bdrv_ref/bdrv_unref unconditionally.  Fortunately the
> protection is not needed in IOThread because only main loop can modify a graph
> with the AioContext lock held.
> 
> Signed-off-by: Fam Zheng 
> ---
>  block/io.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 8706bfa..a0df8c4 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -158,7 +158,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
>  
>  static bool bdrv_drain_recurse(BlockDriverState *bs)
>  {
> -BdrvChild *child;
> +BdrvChild *child, *tmp;
>  bool waited;
>  
>  waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
> @@ -167,8 +167,25 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
>  bs->drv->bdrv_drain(bs);
>  }
>  
> -QLIST_FOREACH(child, &bs->children, next) {
> -waited |= bdrv_drain_recurse(child->bs);
> +QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
> +BlockDriverState *bs = child->bs;
> +bool in_main_loop =
> +qemu_get_current_aio_context() == qemu_get_aio_context();
> +assert(bs->refcnt > 0);
> +if (in_main_loop) {
> +/* In case the resursive bdrv_drain_recurse processes a

s/resursive/recursive/

> + * block_job_defer_to_main_loop BH and modifies the graph,
> + * let's hold a reference to bs until we are done.
> + *
> + * IOThread doesn't have such a BH, and it is not safe to call
> + * bdrv_unref without BQL, so skip doing it there.
> + **/

And **/ is unusual, too.

> +bdrv_ref(bs);
> +}
> +waited |= bdrv_drain_recurse(bs);
> +if (in_main_loop) {
> +bdrv_unref(bs);
> +}
>  }

Other than this, the series looks good to me.

Kevin

Re: [Qemu-devel] [Qemu-block] migrate -b problems

2017-04-18 Thread Stefan Hajnoczi

On Wed, Apr 12, 2017 at 11:18:19AM +0200, Kevin Wolf wrote:
> after getting assertion failure reports for block migration in the last
> minute, we just hacked around it by commenting out op blocker assertions
> for the 2.9 release, but now we need to see how to fix things properly.
> Luckily, get_maintainer.pl doesn't report me, but only you. :-)
> 
> The main problem I see with the block migration code (on the
> destination) is that it abuses the BlockBackend that belongs to the
> guest device to make its own writes to the image file. If the guest
> isn't allowed to write to the image (which it now isn't during incoming
> migration since it would conflict with the newer style of block
> migration using an NBD server), writing to this BlockBackend doesn't
> work any more.
> 
> So what should really happen is that incoming block migration creates
> its own BlockBackend for writing to the image. Now we don't want to do
> this anew for every incoming block, but ideally we'd just create all
> necessary BlockBackends upfront and then keep using them throughout the
> whole migration. Is there a way to get some setup/teardown callbacks
> at the start/end of the migration that could initialise and free such
> global data?

It can be done in the beginning of block_load() similar to
block_mig_state.bmds_list, which is created in init_blk_migration() at
save time.

We can also move the if (blk != blk_prev) blk_invalidate_cache() code
out of the load loop.  It should be done once when setting up
BlockBackends.

> The other problem with block migration is that is uses a BlockBackend
> name to identify which device is migrated. However, there can be images
> that are not attached to any BlockBackend, or if it is, the BlockBackend
> might be anonymous, so this doesn't work. I suppose changing the field
> to "device name if available, node-name otherwise" would solve this.

Yes, that sounds good and is backwards compatible.


signature.asc
Description: PGP signature

Re: [Qemu-devel] help for code

2017-04-18 Thread Stefan Hajnoczi

On Wed, Apr 12, 2017 at 04:36:58PM +0430, ali saeedi wrote:
> I am new in qemu vm live migration. why we have this condition
> "current_time >= initial_time + BUFFER_DELAY" in qemu?

BUFFER_DELAY is used for rate-limiting.  The maximum network throughput
can be set by the user so that QEMU doesn't hog the network link.

Rate-limit calculations are done every BUFFER_DELAY time interval.

Stefan

signature.asc
Description: PGP signature

Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin

2017-04-18 Thread Jeff Cody

On Tue, Apr 18, 2017 at 10:30:42PM +0800, Fam Zheng wrote:
> v4: Split patch, and fix the unsafe bdrv_unref. [Paolo]
> 
> Fam Zheng (2):
>   block: Walk bs->children carefully in bdrv_drain_recurse
>   block: Drain BH in bdrv_drained_begin
> 
>  block/io.c| 23 ---
>  include/block/block.h | 22 ++
>  2 files changed, 34 insertions(+), 11 deletions(-)
> 
> -- 
> 2.9.3
>

Reviewed-by: Jeff Cody 

(also Tested-by)

1 2 3 >

1 - 100 of 200 matches

Mail list logo