from:"xie"

[Qemu-devel] [Bug 587344] Re: gfxmenu from GRUB or GRUB4DOS hang qemu.

2013-03-28 Thread xie

I think the problem is relate to the cpu arch. I have the same problem
in my machine of which the cpu is Nehalem

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/587344

Title:
  gfxmenu from GRUB or GRUB4DOS hang qemu.

Status in QEMU:
  Fix Released

Bug description:
  When you run the graphical menu GRUB (gfxmenu, any theme) execution stops.
  Test image (iso, as an attachment): 
  http://www.multiupload.com/AAI7MGMOLE

  Image make options: 
  mkisofs -R -b grldr -no-emul-boot -boot-load-seg 0x1000 -o bootable.iso 
image_root/

  With options and GFX menu:
  qemu -cdrom bootable.iso
  execution stops after rendering background.

  With options and GFX menu::
  qemu -cdrom bootable.iso --no-kvm
  the menu is drawn, and the execution stops.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/587344/+subscriptions

[Qemu-devel] [Bug 587344] Re: gfxmenu from GRUB or GRUB4DOS hang qemu.

2013-03-28 Thread xie

In addition, CPU is Intel E5520

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/587344

Title:
  gfxmenu from GRUB or GRUB4DOS hang qemu.

Status in QEMU:
  Fix Released

Bug description:
  When you run the graphical menu GRUB (gfxmenu, any theme) execution stops.
  Test image (iso, as an attachment): 
  http://www.multiupload.com/AAI7MGMOLE

  Image make options: 
  mkisofs -R -b grldr -no-emul-boot -boot-load-seg 0x1000 -o bootable.iso 
image_root/

  With options and GFX menu:
  qemu -cdrom bootable.iso
  execution stops after rendering background.

  With options and GFX menu::
  qemu -cdrom bootable.iso --no-kvm
  the menu is drawn, and the execution stops.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/587344/+subscriptions

[Qemu-devel] Would virtio support 64 bit address for vring virtqueue?

2013-08-27 Thread Xie, Huawei

I am developing virtio user space poll mode network driver. We allocate vring 
physical memory from huge page. On VMs with less than 4GB  memory, it works 
well.  But on VMs with like 8GB memory, huge page are all allocated from high 
end memory.
So would virtio support 64bit address for vring virtqueue?

Thanks

Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue?

2013-08-28 Thread Xie, Huawei

Hi Stefan:
I think you mention the descriptor address? I mean the vring PFN register.
/* A 32-bit r/w PFN for the currently selected queue */
#define VIRTIO_PCI_QUEUE_PFN8

And the linux driver sample code:
iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);

-Original Message-
From: Stefan Hajnoczi [mailto:stefa...@gmail.com] 
Sent: Wednesday, August 28, 2013 4:07 PM
To: Xie, Huawei
Cc: qemu-devel@nongnu.org; ru...@rustcorp.com.au; Stefan Hajnoczi
Subject: Re: [Qemu-devel] Would virtio support 64 bit address for vring 
virtqueue?

On Wed, Aug 28, 2013 at 03:18:39AM +, Xie, Huawei wrote:
> I am developing virtio user space poll mode network driver. We allocate vring 
> physical memory from huge page. On VMs with less than 4GB  memory, it works 
> well.  But on VMs with like 8GB memory, huge page are all allocated from high 
> end memory.
> So would virtio support 64bit address for vring virtqueue?

The vring takes guest physical addresses and the C type is __u64 (see 
/usr/include/linux/virtio_ring.h).  64-bit addresses are fine.

Stefan

Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue?

2013-08-28 Thread Xie, Huawei

If this is the case, one possible fix would be:
Write two continuous 32bit DWORD to combine a 64bit address
Use the upper 12 bits of PFN val to indicate if it is combined write
In this way, we wouldn't break other virtio driver, register layout and only 
need a few lines of modification.

-Original Message-
From: Xie, Huawei 
Sent: Wednesday, August 28, 2013 5:23 PM
To: 'Stefan Hajnoczi'
Cc: qemu-devel@nongnu.org; ru...@rustcorp.com.au; Stefan Hajnoczi
Subject: RE: [Qemu-devel] Would virtio support 64 bit address for vring 
virtqueue?

Hi Stefan:
I think you mention the descriptor address? I mean the vring PFN register.
/* A 32-bit r/w PFN for the currently selected queue */
#define VIRTIO_PCI_QUEUE_PFN8

And the linux driver sample code:
iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);

-Original Message-
From: Stefan Hajnoczi [mailto:stefa...@gmail.com] 
Sent: Wednesday, August 28, 2013 4:07 PM
To: Xie, Huawei
Cc: qemu-devel@nongnu.org; ru...@rustcorp.com.au; Stefan Hajnoczi
Subject: Re: [Qemu-devel] Would virtio support 64 bit address for vring 
virtqueue?

On Wed, Aug 28, 2013 at 03:18:39AM +, Xie, Huawei wrote:
> I am developing virtio user space poll mode network driver. We allocate vring 
> physical memory from huge page. On VMs with less than 4GB  memory, it works 
> well.  But on VMs with like 8GB memory, huge page are all allocated from high 
> end memory.
> So would virtio support 64bit address for vring virtqueue?

The vring takes guest physical addresses and the C type is __u64 (see 
/usr/include/linux/virtio_ring.h).  64-bit addresses are fine.

Stefan

Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue?

2013-08-28 Thread Xie, Huawei

I know it is PFN, that is why I want to use the high 12bit for combined write 
in another reply, to fully address 64 bit address.
Is the 16TB address space what it is originally designed for? If it is, that 
will basically solve the problem. I will try later.
Thanks for reminder

-Original Message-
From: Laszlo Ersek [mailto:ler...@redhat.com] 
Sent: Wednesday, August 28, 2013 7:46 PM
To: Xie, Huawei
Cc: Stefan Hajnoczi; ru...@rustcorp.com.au; qemu-devel@nongnu.org; Stefan 
Hajnoczi
Subject: Re: [Qemu-devel] Would virtio support 64 bit address for vring 
virtqueue?

On 08/28/13 11:22, Xie, Huawei wrote:
> Hi Stefan:
> I think you mention the descriptor address? I mean the vring PFN register.
> /* A 32-bit r/w PFN for the currently selected queue */
> #define VIRTIO_PCI_QUEUE_PFN  8
> 
> And the linux driver sample code:
> iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> 
> -Original Message-
> From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
> Sent: Wednesday, August 28, 2013 4:07 PM
> To: Xie, Huawei
> Cc: qemu-devel@nongnu.org; ru...@rustcorp.com.au; Stefan Hajnoczi
> Subject: Re: [Qemu-devel] Would virtio support 64 bit address for vring 
> virtqueue?
> 
> On Wed, Aug 28, 2013 at 03:18:39AM +, Xie, Huawei wrote:
>> I am developing virtio user space poll mode network driver. We allocate 
>> vring physical memory from huge page. On VMs with less than 4GB  memory, it 
>> works well.  But on VMs with like 8GB memory, huge page are all allocated 
>> from high end memory.
>> So would virtio support 64bit address for vring virtqueue?
> 
> The vring takes guest physical addresses and the C type is __u64 (see 
> /usr/include/linux/virtio_ring.h).  64-bit addresses are fine.

If you mean Queue Address field in the Virtio Header (section 2.2.2 in the 
virtio-0.9.5 specification), then please see 2.3 Virtqueue Configuration, step 
3. You have to divide the guest-phys address by 4096 and store the quotient. 
The Queue Address field takes a page frame number, not a page frame address.

This allows it to address up to 2^32 * 4096 == 16T bytes of RAM.

Laszlo

[Qemu-devel] QEMU+SystemC

2013-12-11 Thread Sean Xie

Hi all,

Recently I am struggling in examing the effect of host kernel I/O stacks on
high-speed SSD, especially the new PCIe SSD, without the availability of
the necessary hardware. After several reading online I found out
QEMU+SystemC model environment might be able to to solve my problem.

Sorry I am new to QEMU. My main objective is to collect guest OS statistics
based on the attached external SSD model. So simple SystemC stub will be
the best for my case. Can anybody give me any suggestion on which QEMU
variant with SystemC fit me the best? I will sincerely appreciate that if
you could forward me the correct URL for download also.

Thank you for your attention,

Sean Xie

Re: [Qemu-devel] Disabling IRQ error

2013-10-22 Thread Xie Xianshan


Dear Max,
Sorry for the late reply.
Thanks for your advice about lowering the irqs after raising them, 
and that fixed my problem.

Thanks again for your kindness.

Thanks,
Simen

于 2013/09/11 17:29, Max Filippov 写道:

On Wed, Sep 11, 2013 at 12:12 PM, Xie Xianshan  wrote:

   I want to add a new device "fpga" for e500, and trigger an interrupt IRQ3
while the register BB_INTR_REG which belongs to device "fpga" is wrote by
the device driver of "fpga".
   For e500, IRQ3 is an external interrupt irq.
   According the debug log, the disabling error is encoutered during writing
BB_INTR_REG register.
   - write BB_INTR_REG register
   - qemu_irq_raise() is called.
   - after serval minutes,
 the error message about disabling irq is displayed.
   - continue the next execution without error(with poll?)


So your device raises IRQ, but it doesn't lower it. Real devices
usually don't do that, they either generate a short pulse on the
IRQ line (in case of edge-triggered IRQ) or raise IRQ line on
some event and then lower it on a command from its driver
(level-triggered IRQ).

You can do the following to make your device behave that way:
- make your fpga device capable of lowering its IRQ, e.g. by adding
   another register:


static void fpga_write(FPGAState *s, unsigned int offset, uint32_t value,
unsigned size) {
 switch(offset) {
 case BB_INTR_REG:
 qemu_irq_raise(s->irq);
 break;

case BB_INTC_REG:
qemu_irq_lower(s->irq);
break;

 }
}


- provide an interrupt service routine in the linux driver for your fpga
   device that would check whether the interrupt was caused by its
   device, and if so lower the device's IRQ.

Thanks.
-- Max

Re: [Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint

2016-01-27 Thread Changlong Xie


On 01/28/2016 12:05 AM, Stefan Hajnoczi wrote:

On Wed, Jan 13, 2016 at 05:18:27PM +0800, Changlong Xie wrote:

diff --git a/blockjob.c b/blockjob.c
index 80adb9d..0c8edfe 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
  QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
  block_job_txn_ref(txn);
  }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+if (!job->driver->do_checkpoint) {
+error_setg(errp, "The job %s doesn't support block checkpoint",
+   BlockJobType_lookup[job->driver->job_type]);
+return;
+}
+
+job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index d84ccd8..abdba7c 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -70,6 +70,9 @@ typedef struct BlockJobDriver {
   * never both.
   */
  void (*abort)(BlockJob *job);
+
+/** Optional callback for job types that support checkpoint. */
+void (*do_checkpoint)(BlockJob *job, Error **errp);


The COLO/replication-specific callbacks have been moved out of
BlockDriver into their own replication struct.  Similar reasoning
applies to BlockJobDriver:

The do_checkpoint() callback is only implemented by one type of job and
its purpose is related to COLO rather than jobs.  This is a strong
indication that this shouldn't be part of the generic BlockJobDriver
struct.

Please drop changes to the generic blockjob interface.  Instead, make
backup_do_checkpoint() public and add assert(job->driver->type ==
BLOCK_JOB_TYPE_BACKUP) into the function.

Then the replication filter can call backup_do_checkpoint() directly.



Will fix it in next version.

Thanks
-Xie


Stefan

Re: [Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS

2016-01-27 Thread Changlong Xie


On 01/27/2016 10:04 PM, Stefan Hajnoczi wrote:

On Wed, Jan 13, 2016 at 05:18:28PM +0800, Changlong Xie wrote:

From: Wen Congyang 

When opening BDS, we need to create backup jobs for
image-fleecing.

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Jeff Cody 
---
  block/Makefile.objs | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..fa05f37 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
  block-obj-$(CONFIG_LIBSSH2) += ssh.o
  block-obj-y += accounting.o
  block-obj-y += write-threshold.o
+block-obj-y += backup.o

  common-obj-y += stream.o
  common-obj-y += commit.o
-common-obj-y += backup.o

  iscsi.o-cflags := $(LIBISCSI_CFLAGS)
  iscsi.o-libs   := $(LIBISCSI_LIBS)


The commit message and description seem outdated.  I guess the purpose
of this patch is to link the backup block job into all programs that use
the block layer because you want to add a dependency on the it from core
code.



Will update it in next version.

Thanks
-Xie

Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication

2016-01-28 Thread Changlong Xie


On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:

On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:

On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:

On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:

+static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
+{
+Error *local_err = NULL;
+int ret;
+
+if (!s->secondary_disk->job) {
+error_setg(errp, "Backup job is cancelled unexpectedly");
+return;
+}
+
+block_job_do_checkpoint(s->secondary_disk->job, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
+ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);


What happens to in-flight requests to the active and hidden disks?


we MUST call do_checkpoint() when the vm is stopped.


Please document the environment under which the block replication
callback functions run.


OK



I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
time if the disk is slow/failing.  bdrv_drain_all() blocks until all
in-flight I/O requests have completed.  What does the Primary do if the
Secondary becomes unresponsive?


Actually, we knew this problem. But currently, there seems no better way 
to resolve it. If you have any ideas?





+switch (s->mode) {
+case REPLICATION_MODE_PRIMARY:
+break;
+case REPLICATION_MODE_SECONDARY:
+s->active_disk = bs->file->bs;
+if (!bs->file->bs->backing) {
+error_setg(errp, "Active disk doesn't have backing file");
+return;
+}
+
+s->hidden_disk = s->active_disk->backing->bs;
+if (!s->hidden_disk->backing) {
+error_setg(errp, "Hidden disk doesn't have backing file");
+return;
+}
+
+s->secondary_disk = s->hidden_disk->backing->bs;
+if (!s->secondary_disk->blk) {
+error_setg(errp, "The secondary disk doesn't have block backend");
+return;
+}


Kevin: Is code allowed to stash away BlockDriverState pointers for
convenience or should it keep the BdrvChild pointers instead?  In order
for replication to work as expected, the graph shouldn't change but for
consistency maybe BdrvChild is best.


I asked Kevin about this on IRC and he agreed that BdrvChild should be
used instead of holding on to BlockDriverState * pointers.  Although
these pointers will not change during replication (if the op blockers
are set up correctly), it's more consistent and certainly safer to go
through BdrvChild.


Ok




+/* start backup job now */
+error_setg(&s->blocker,
+   "block device is in use by internal backup job");
+bdrv_op_block_all(s->top_bs, s->blocker);
+bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+bdrv_ref(s->hidden_disk);


Why is the explicit reference to hidden_disk (but not secondary_disk or
active_disk) is necessary?


IIRC, we should reference the backup target before calling backup_start(),
and we will reference the backup source in backup_start().


I'm not sure why this is necessary since they are part of the backing
chain.



Just as Wen said, we should reference the backup target before calling 
backup_start() to protect it from destroying, if backup job is stopped 
unexpectedly.



If it is necessary, please add a comment so it's clear why the reference
is being taken.



Ok


Stefan

Re: [Qemu-devel] [PATCH v14 5/8] docs: block replication's description

2016-02-02 Thread Changlong Xie


On 02/02/2016 11:20 PM, Eric Blake wrote:

On 01/13/2016 02:18 AM, Changlong Xie wrote:

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
  docs/block-replication.txt | 229 +
  1 file changed, 229 insertions(+)
  create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..d1a231e
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,229 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.


Do you want to claim 2016 for any of this?


Will update in next version.




+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is


s/of Primary/of the Primary/


OK




+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of


s/at the time of/during a vmstate/
s/operations of/operations of the/


OK




+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==



+== Architecture ==



+
+6) The drive-backup job(sync=none) is run to allow hidden-disk to buffer


Space before ( in English description.


OK




+any state that would otherwise be lost by the speculative write-through
+of the NBD server into the secondary disk. So before block replication,
+the primary disk and secondary disk should contain the same data.
+
+== Failure Handling ==



+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
+ children.0.file.filename=1.raw,\
+ children.0.driver=raw
+
+  Run qmp command in primary qemu:
+{ 'execute': 'human-monitor-command',
+  'arguments': {
+  'command-line': 'drive_add buddy 
driver=replication,mode=primary,file.driver=nbd,file.host=,file.port=,file.export=colo1,node-name=nbd_client1,if=none'


Eww. We shouldn't ever have to pack a command line as . single QMP
string that needs reparsing.  Instead, you should pass the information
as a nested QMP dictionary, something like:

'arguments': {
   'remote-command': { 'command': 'drive_add', 'name': 'buddy',
   'driver': 'replication', 'mode': 'primary',
   'file': { 'driver': 'nbd', 'host': '',


Hi Eric

What is 'remote-command' here? I just tried below commands, but got some 
errors.


{'execute': 'human-monitor-command',
'arguments': {
   'command-line': {
'command': 'drive_add',
'name': 'buddy',
'driver': 'replication',
'mode': 'primary',
'if': 'none',
'node-name': 'primary_nbd_node',
'file': {
'driver': 'nbd',
'host': '192.168.3.2',
'port': '8889',
'export': 'colo-disk',
}

}
 }
}
{"error": {"class": "GenericError", "desc": "Invalid JSON syntax"}}   


'blockdev-add' doesn't support 'nbd'. So we use 'drive_add' here, and 
it's a hmp command. If i'm right, there seems just one way to execute 
hmp commands in QMP:



EQMP

{
.name   = "human-monitor-command",
.args_type  = "command-line:s,cpu-index:i?",
.mhandler.cmd_new = qmp_marshal_human_monitor_command,
},

SQMP
human-monitor-command
-

Execute a Human Monitor

Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints

2016-02-03 Thread Changlong Xie


On 02/01/2016 09:18 AM, Wen Congyang wrote:

On 01/29/2016 06:47 PM, Dr. David Alan Gilbert wrote:

* Wen Congyang (we...@cn.fujitsu.com) wrote:

On 01/29/2016 06:07 PM, Dr. David Alan Gilbert wrote:

* Wen Congyang (we...@cn.fujitsu.com) wrote:

On 01/27/2016 07:03 PM, Dr. David Alan Gilbert wrote:

Hi,
   I've got a block error if I kill the secondary.

Start both primary & secondary
kill -9 secondary qemu
x_colo_lost_heartbeat on primary

The guest sees a block error and the ext4 root switches to read-only.

I gdb'd the primary with a breakpoint on quorum_report_bad; see
backtrace below.
(This is based on colo-v2.4-periodic-mode of the framework
code with the block and network proxy merged in; so it could be my
merging but I don't think so ?)


(gdb) where
#0  quorum_report_bad (node_name=0x7f2946a0892c "node0", ret=-5, 
acb=0x7f2946cb3910, acb=0x7f2946cb3910)
 at /root/colo/jan-2016/qemu/block/quorum.c:222
#1  0x7f2943b23058 in quorum_aio_cb (opaque=, ret=)
 at /root/colo/jan-2016/qemu/block/quorum.c:315
#2  0x7f2943b311be in bdrv_co_complete (acb=0x7f2946cb3f60) at 
/root/colo/jan-2016/qemu/block/io.c:2122
#3  0x7f2943ae777d in aio_bh_call (bh=) at 
/root/colo/jan-2016/qemu/async.c:64
#4  aio_bh_poll (ctx=ctx@entry=0x7f2945b771d0) at 
/root/colo/jan-2016/qemu/async.c:92
#5  0x7f2943af5090 in aio_dispatch (ctx=0x7f2945b771d0) at 
/root/colo/jan-2016/qemu/aio-posix.c:305
#6  0x7f2943ae756e in aio_ctx_dispatch (source=, 
callback=,
 user_data=) at /root/colo/jan-2016/qemu/async.c:231
#7  0x7f293b84a79a in g_main_context_dispatch () from 
/lib64/libglib-2.0.so.0
#8  0x7f2943af3a00 in glib_pollfds_poll () at 
/root/colo/jan-2016/qemu/main-loop.c:211
#9  os_host_main_loop_wait (timeout=) at 
/root/colo/jan-2016/qemu/main-loop.c:256
#10 main_loop_wait (nonblocking=) at 
/root/colo/jan-2016/qemu/main-loop.c:504
#11 0x7f29438529ee in main_loop () at /root/colo/jan-2016/qemu/vl.c:1945
#12 main (argc=, argv=, envp=) at 
/root/colo/jan-2016/qemu/vl.c:4707

(gdb) p s->num_children
$1 = 2
(gdb) p acb->success_count
$2 = 0
(gdb) p acb->is_read
$5 = false


Sorry for the late reply.


No problem.


What it the value of acb->count?


(gdb) p acb->count
$1 = 1


Note, the count is 1, not 2. Writing to children.0 is in flight. If writing to 
children.0 successes,
the guest doesn't know this error.

If secondary host is down, you should remove quorum's children.1. Otherwise, 
you will get
I/O error event.


Is that safe?  If the secondary fails, do you always have time to issue the 
command to
remove the children.1  before the guest sees the error?


We will write to two children, and expect that writing to children.0 will 
success. If so,
the guest doesn't know this error. You just get the I/O error event.


I think children.0 is the disk, and that should be OK - so only the 
children.1/replication should
be failing - so in that case why do I see the error?


I don't know, and I will check the codes.


The 'node0' in the backtrace above is the name of the replication, so it does 
look like the error
is coming from the replication.


No, the backtrace is just report an I/O error events to the management 
application.




Anyway, I tried removing children.1 but it segfaults now, I guess the 
replication is unhappy:

(qemu) x_block_change colo-disk0 -d children.1
(qemu) x_colo_lost_heartbeat


Hmm, you should not remove the child before failover. I will check it how to 
avoid it in the codes.


  But you said 'If secondary host is down, you should remove quorum's 
children.1' - is that not
what you meant?


Yes, you should excute 'x_colo_lost_heartbeat' fist, and then excute 
'x_block_change ... -d ...'.


Hi david

It seems we missed 'drive_del' command, and will document it in next 
version. Here is the right commands order:


{ "execute": "x-colo-lost-heartbeat" }
{ 'execute': 'x-blockdev-change', 'arguments': {'parent': 'colo-disk', 
'child': 'children.1'}}
{ 'execute': 'human-monitor-command', 'arguments': {'command-line': 
'drive_del x'}}


Thanks
-Xie



12973 Segmentation fault  (core dumped) 
./try/x86_64-softmmu/qemu-system-x86_64 -enable-kvm $console_param -S -boot c 
-m 4080 -smp 4 -machine pc-i440fx-2.5,accel=kvm -name debug-threads=on -trace 
events=trace-file -device virtio-rng-pci $block_param $net_param

#0  0x7f0a398a864c in bdrv_stop_replication (bs=0x7f0a3b0a8430, 
failover=true, errp=0x7fff6a5c3420)
 at /root/colo/jan-2016/qemu/block.c:4426

(gdb) p drv
$1 = (BlockDriver *) 0x5d2a

   it looks like the whole of bs is bogus.

#1  0x7f0a398d87f6 in quorum_stop_replication (bs=, 
failover=,
 errp=) at /root/colo/jan-2016/qemu/block/quorum.c:1213

(gdb) p s->r

Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints

2016-02-04 Thread Changlong Xie


On 02/04/2016 05:07 PM, Dr. David Alan Gilbert wrote:

* Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote:

On 02/01/2016 09:18 AM, Wen Congyang wrote:

On 01/29/2016 06:47 PM, Dr. David Alan Gilbert wrote:

* Wen Congyang (we...@cn.fujitsu.com) wrote:

On 01/29/2016 06:07 PM, Dr. David Alan Gilbert wrote:

* Wen Congyang (we...@cn.fujitsu.com) wrote:

On 01/27/2016 07:03 PM, Dr. David Alan Gilbert wrote:

Hi,
   I've got a block error if I kill the secondary.

Start both primary & secondary
kill -9 secondary qemu
x_colo_lost_heartbeat on primary

The guest sees a block error and the ext4 root switches to read-only.

I gdb'd the primary with a breakpoint on quorum_report_bad; see
backtrace below.
(This is based on colo-v2.4-periodic-mode of the framework
code with the block and network proxy merged in; so it could be my
merging but I don't think so ?)


(gdb) where
#0  quorum_report_bad (node_name=0x7f2946a0892c "node0", ret=-5, 
acb=0x7f2946cb3910, acb=0x7f2946cb3910)
 at /root/colo/jan-2016/qemu/block/quorum.c:222
#1  0x7f2943b23058 in quorum_aio_cb (opaque=, ret=)
 at /root/colo/jan-2016/qemu/block/quorum.c:315
#2  0x7f2943b311be in bdrv_co_complete (acb=0x7f2946cb3f60) at 
/root/colo/jan-2016/qemu/block/io.c:2122
#3  0x7f2943ae777d in aio_bh_call (bh=) at 
/root/colo/jan-2016/qemu/async.c:64
#4  aio_bh_poll (ctx=ctx@entry=0x7f2945b771d0) at 
/root/colo/jan-2016/qemu/async.c:92
#5  0x7f2943af5090 in aio_dispatch (ctx=0x7f2945b771d0) at 
/root/colo/jan-2016/qemu/aio-posix.c:305
#6  0x7f2943ae756e in aio_ctx_dispatch (source=, 
callback=,
 user_data=) at /root/colo/jan-2016/qemu/async.c:231
#7  0x7f293b84a79a in g_main_context_dispatch () from 
/lib64/libglib-2.0.so.0
#8  0x7f2943af3a00 in glib_pollfds_poll () at 
/root/colo/jan-2016/qemu/main-loop.c:211
#9  os_host_main_loop_wait (timeout=) at 
/root/colo/jan-2016/qemu/main-loop.c:256
#10 main_loop_wait (nonblocking=) at 
/root/colo/jan-2016/qemu/main-loop.c:504
#11 0x7f29438529ee in main_loop () at /root/colo/jan-2016/qemu/vl.c:1945
#12 main (argc=, argv=, envp=) at 
/root/colo/jan-2016/qemu/vl.c:4707

(gdb) p s->num_children
$1 = 2
(gdb) p acb->success_count
$2 = 0
(gdb) p acb->is_read
$5 = false


Sorry for the late reply.


No problem.


What it the value of acb->count?


(gdb) p acb->count
$1 = 1


Note, the count is 1, not 2. Writing to children.0 is in flight. If writing to 
children.0 successes,
the guest doesn't know this error.

If secondary host is down, you should remove quorum's children.1. Otherwise, 
you will get
I/O error event.


Is that safe?  If the secondary fails, do you always have time to issue the 
command to
remove the children.1  before the guest sees the error?


We will write to two children, and expect that writing to children.0 will 
success. If so,
the guest doesn't know this error. You just get the I/O error event.


I think children.0 is the disk, and that should be OK - so only the 
children.1/replication should
be failing - so in that case why do I see the error?


I don't know, and I will check the codes.


The 'node0' in the backtrace above is the name of the replication, so it does 
look like the error
is coming from the replication.


No, the backtrace is just report an I/O error events to the management 
application.




Anyway, I tried removing children.1 but it segfaults now, I guess the 
replication is unhappy:

(qemu) x_block_change colo-disk0 -d children.1
(qemu) x_colo_lost_heartbeat


Hmm, you should not remove the child before failover. I will check it how to 
avoid it in the codes.


  But you said 'If secondary host is down, you should remove quorum's 
children.1' - is that not
what you meant?


Yes, you should excute 'x_colo_lost_heartbeat' fist, and then excute 
'x_block_change ... -d ...'.


Hi david


Hi Xie,
   Thanks for the response.


It seems we missed 'drive_del' command, and will document it in next
version. Here is the right commands order:

{ "execute": "x-colo-lost-heartbeat" }
{ 'execute': 'x-blockdev-change', 'arguments': {'parent': 'colo-disk',
'child': 'children.1'}}
{ 'execute': 'human-monitor-command', 'arguments': {'command-line':
'drive_del x'}}


OK,  however, you should fix the seg fault if you don't issue the drive_del;
qemu should never crash.
(Also I still get the IO error in the guest if I do the x-colo-lost-heartbeat).



Here is a quick fix, i just tested for several times. It work well to me.

bugfix

Signed-off-by: Changlong Xie 

diff --git a/block/quorum.c b/block/quorum.c
index e5a7e4f..f4f1d28 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -458,6 +458,11 @@ static QuorumVoteVersion 
*quorum_get_vote_winner(QuorumVotes *votes)

 if (c

[Qemu-devel] [PATCH] quorum: fix segfault when read fails in fifo mode

2016-02-04 Thread Changlong Xie

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index a5ae4b8..0965277 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -295,6 +295,9 @@ static void quorum_aio_cb(void *opaque, int ret)
 quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
 }
 acb->vote_ret = ret;
+if (ret < 0) {
+acb->child_iter--;
+}
 quorum_aio_finalize(acb);
 return;
 }
-- 
1.9.3

Re: [Qemu-devel] [PATCH] quorum: fix segfault when read fails in fifo mode

2016-02-04 Thread Changlong Xie


On 02/04/2016 08:24 PM, Kevin Wolf wrote:

Am 04.02.2016 um 11:19 hat Changlong Xie geschrieben:

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
  block/quorum.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index a5ae4b8..0965277 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -295,6 +295,9 @@ static void quorum_aio_cb(void *opaque, int ret)
  quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
  }
  acb->vote_ret = ret;
+if (ret < 0) {
+acb->child_iter--;
+}
  quorum_aio_finalize(acb);
  return;
  }


This looks semantically correct to me (but I'd like to have an Ack from
Berto), but I would fix it above: We shouldn't do ++acb->child_iter in
the first place if the new value is >= s->num_children. So instead of
decrementing after the fact, just move the increment to inside the then
branch above.


Yes, will fix in next version.

Thanks
-Xie


Kevin


.

[Qemu-devel] [PATCH V2] quorum: fix segfault when read fails in fifo mode

2016-02-04 Thread Changlong Xie

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index a5ae4b8..11cc60b 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -286,7 +286,8 @@ static void quorum_aio_cb(void *opaque, int ret)
 
 if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
 /* We try to read next child in FIFO order if we fail to read */
-if (ret < 0 && ++acb->child_iter < s->num_children) {
+if (ret < 0 && (acb->child_iter + 1) < s->num_children) {
+acb->child_iter++;
 read_fifo_child(acb);
 return;
 }
-- 
1.9.3

[Qemu-devel] [PATCH v15 4/9] Link backup into block core

2016-02-04 Thread Changlong Xie

From: Wen Congyang 

Some programs that add a dependency on it will use
the block layer directly.

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Jeff Cody 
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..fa05f37 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags := $(LIBISCSI_CFLAGS)
 iscsi.o-libs   := $(LIBISCSI_LIBS)
-- 
1.9.3

[Qemu-devel] [PATCH v15 6/9] auto complete active commit

2016-02-04 Thread Changlong Xie

From: Wen Congyang 

Auto complete mirror job in background to prevent from
blocking synchronously

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/mirror.c| 13 +
 blockdev.c|  2 +-
 include/block/block_int.h |  3 ++-
 qemu-img.c|  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 2c0edfa..79ec170 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -718,7 +718,8 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
  BlockCompletionFunc *cb,
  void *opaque, Error **errp,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base)
+ bool is_none_mode, BlockDriverState *base,
+ bool auto_complete)
 {
 MirrorBlockJob *s;
 BlockDriverState *replaced_bs;
@@ -774,6 +775,9 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
 s->granularity = granularity;
 s->buf_size = ROUND_UP(buf_size, granularity);
 s->unmap = unmap;
+if (auto_complete) {
+s->should_complete = true;
+}
 
 s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
 if (!s->dirty_bitmap) {
@@ -815,14 +819,15 @@ void mirror_start(BlockDriverState *bs, BlockDriverState 
*target,
 mirror_start_job(bs, target, replaces,
  speed, granularity, buf_size,
  on_source_error, on_target_error, unmap, cb, opaque, errp,
- &mirror_job_driver, is_none_mode, base);
+ &mirror_job_driver, is_none_mode, base, false);
 }
 
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp)
+ void *opaque, Error **errp,
+ bool auto_complete)
 {
 int64_t length, base_length;
 int orig_base_flags;
@@ -863,7 +868,7 @@ void commit_active_start(BlockDriverState *bs, 
BlockDriverState *base,
 bdrv_ref(base);
 mirror_start_job(bs, base, NULL, speed, 0, 0,
  on_error, on_error, false, cb, opaque, &local_err,
- &commit_active_job_driver, false, base);
+ &commit_active_job_driver, false, base, auto_complete);
 if (local_err) {
 error_propagate(errp, local_err);
 goto error_restore_flags;
diff --git a/blockdev.c b/blockdev.c
index 7c1d6da..fd7dce2 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3079,7 +3079,7 @@ void qmp_block_commit(const char *device,
 goto out;
 }
 commit_active_start(bs, base_bs, speed, on_error, block_job_cb,
-bs, &local_err);
+bs, &local_err, false);
 } else {
 commit_start(bs, base_bs, top_bs, speed, on_error, block_job_cb, bs,
  has_backing_file ? backing_file : NULL, &local_err);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index e67b6fb..9eb1db6 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -638,13 +638,14 @@ void commit_start(BlockDriverState *bs, BlockDriverState 
*base,
  * @cb: Completion function for the job.
  * @opaque: Opaque pointer value passed to @cb.
  * @errp: Error object.
+ * @auto_complete: Auto complete the job.
  *
  */
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp);
+ void *opaque, Error **errp, bool auto_complete);
 /*
  * mirror_start:
  * @bs: Block device to operate on.
diff --git a/qemu-img.c b/qemu-img.c
index f121980..986cc6d 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -756,7 +756,7 @@ static int img_commit(int argc, char **argv)
 };
 
 commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
-common_block_job_cb, &cbi, &local_err);
+common_block_job_cb, &cbi, &local_err, false);
 if (local_err) {
 goto done;
 }
-- 
1.9.3

[Qemu-devel] [PATCH v15 2/9] Store parent BDS in BdrvChild

2016-02-04 Thread Changlong Xie

From: Wen Congyang 

We need to access the parent BDS to get the root BDS.

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block.c   | 1 +
 include/block/block_int.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block.c b/block.c
index 70ab625..c18b462 100644
--- a/block.c
+++ b/block.c
@@ -1208,6 +1208,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 BdrvChild *child = g_new(BdrvChild, 1);
 *child = (BdrvChild) {
 .bs = child_bs,
+.parent = parent_bs,
 .name   = g_strdup(child_name),
 .role   = child_role,
 };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 89ec138..9204d1e 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -363,6 +363,7 @@ extern const BdrvChildRole child_format;
 
 struct BdrvChild {
 BlockDriverState *bs;
+BlockDriverState *parent;
 char *name;
 const BdrvChildRole *role;
 QLIST_ENTRY(BdrvChild) next;
-- 
1.9.3

[Qemu-devel] [PATCH v15 8/9] Implement new driver for block replication

2016-02-04 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 block/Makefile.objs |   1 +
 block/replication.c | 594 
 2 files changed, 595 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fa05f37..94c1d03 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 000..4631667
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,594 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "block/nbd.h"
+#include "block/blockjob.h"
+#include "block/block_int.h"
+#include "replication.h"
+
+typedef struct BDRVReplicationState {
+ReplicationMode mode;
+int replication_state;
+BdrvChild *active_disk;
+BdrvChild *hidden_disk;
+BdrvChild *secondary_disk;
+BlockDriverState *top_bs;
+ReplicationState *rs;
+Error *blocker;
+int orig_hidden_flags;
+int orig_secondary_flags;
+int error;
+} BDRVReplicationState;
+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_FAILOVER, /* failover is running in background */
+BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed */
+BLOCK_REPLICATION_DONE, /* block replication is done(after 
failover) */
+};
+
+static void replication_start(ReplicationState *rs, ReplicationMode mode,
+  Error **errp);
+static void replication_do_checkpoint(ReplicationState *rs, Error **errp);
+static void replication_get_error(ReplicationState *rs, Error **errp);
+static void replication_stop(ReplicationState *rs, bool failover,
+ Error **errp);
+
+#define REPLICATION_MODE"mode"
+static QemuOptsList replication_runtime_opts = {
+.name = "replication",
+.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+.desc = {
+{
+.name = REPLICATION_MODE,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
+static ReplicationOps replication_ops = {
+.start = replication_start,
+.checkpoint = replication_do_checkpoint,
+.get_error = replication_get_error,
+.stop = replication_stop,
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+int flags, Error **errp)
+{
+int ret;
+BDRVReplicationState *s = bs->opaque;
+Error *local_err = NULL;
+QemuOpts *opts = NULL;
+const char *mode;
+
+ret = -EINVAL;
+opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
+qemu_opts_absorb_qdict(opts, options, &local_err);
+if (local_err) {
+goto fail;
+}
+
+mode = qemu_opt_get(opts, REPLICATION_MODE);
+if (!mode) {
+error_setg(&local_err, "Missing the option mode");
+goto fail;
+}
+
+if (!strcmp(mode, "primary")) {
+s->mode = REPLICATION_MODE_PRIMARY;
+} else if (!strcmp(mode, "secondary")) {
+s->mode = REPLICATION_MODE_SECONDARY;
+} else {
+error_setg(&local_err,
+   "The option mode's value should be primary or secondary");
+goto fail;
+}
+
+s->rs = replication_new(bs, &replication_ops);
+
+ret = 0;
+
+fail:
+qemu_opts_del(opts);
+error_propagate(errp, local_err);
+
+return ret;
+}
+
+static void replication_close(BlockDriverState *bs)
+{
+BDRVReplicationState *s = bs->opaque;
+
+if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
+replication_stop(s->rs, false, NULL);
+replication_remove(s->rs);
+}
+}
+
+static int64_t replication_getlength(BlockDriverState *bs)
+{
+return bdrv_getlength(bs->file->bs);
+}
+
+static int replication_get_io_status(BDRVReplicationState *s)
+{
+switch (s->replication_state) {
+case BLOCK_REPLICATION_NONE:
+return -EIO;
+case BLOCK_REPLICATION_RUNNING:
+return 0;
+case BLOCK_REPLICATION_FAILOVER:
+return s->mode == REP

[Qemu-devel] [PATCH v15 1/9] unblock backup operations in backing file

2016-02-04 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/block.c b/block.c
index a285de5..70ab625 100644
--- a/block.c
+++ b/block.c
@@ -1279,6 +1279,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 /* Otherwise we won't be able to commit due to check in bdrv_commit */
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
 bs->backing_blocker);
+/*
+ * We do backup in 3 ways:
+ * 1. drive backup
+ *The target bs is new opened, and the source is top BDS
+ * 2. blockdev backup
+ *Both the source and the target are top BDSes.
+ * 3. internal backup(used for block replication)
+ *Both the source and the target are backing file
+ *
+ * In case 1, and 2, the backing file is neither the source nor
+ * the target.
+ * In case 3, we will block the top BDS, so there is only one block
+ * job for the top BDS and its backing chain.
+ */
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+bs->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+bs->backing_blocker);
 out:
 bdrv_refresh_limits(bs, NULL);
 }
-- 
1.9.3

[Qemu-devel] [PATCH v15 0/9] Block replication for continuous checkpoints

2016-02-04 Thread Changlong Xie

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html

Patch status:
1. Acked patches: none 
2. Reviewed patches: patch 4
3. Updated patches: patch 3, 4, 5, 7, 8
Note: patch 7 is a new patch that introduces genernal start/stop/checkpoint/
get_error
APIs.

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v15

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v14

TODO:
1. Continuous block replication. It will be started after basic functions
   are accepted.

Changs Log:
V15:
1. Rebase to the newest codes
2. Fix typos and coding style addresed Eric's comments
3. Address Stefan's comments
   1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver
   2) Update the message and description for [PATCH 4/9]
   3) Make replication_(start/stop/do_checkpoint)_all as global interfaces
   4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks
   5) Use BdrvChild instead of holding on to BlockDriverState * pointers
4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771  
5. Introduce replication_get_error_all to check replication status
6. Remove useless discard interface
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
   structure, also remove old block layer APIs
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c 
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
   opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
   when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
   its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
   reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
   if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Changlong Xie (1):
  Introduce new APIs to do replication operation

Wen Congyang (8):
  unblock backup operations in backing file
  Store parent BDS in BdrvChild
  Backup: clear all bitmap when doing block checkpoint
  Link backup into block core
  docs: block replication's description
  auto complete active commit
  Implement new driver for block replication
  support replication driver in blockdev-add

 Makefile.objs  |   1 +
 block.c|  19 ++
 block/Makefile.objs|   3 +-
 block/backup.c |  15 ++
 block/mirror.c |  13 +-
 block/replication.c| 594 +
 blockdev.c |   2 +-
 docs/block-replication.txt | 238 ++
 include/block/block_int.h  |   6 +-
 qapi/block-core.json   |  33 ++-
 qemu-img.c |   2 +-
 replication.c  |  94 +++
 replication.h  |  53 
 13 files changed, 1063 insertions(+), 10 deletions(-)
 create mode 100644 block/replication.c
 create mode 100644 docs/block-replication.txt
 create mode 100644 replication.c
 create mode 100644 replication.h

-- 
1.9.3

[Qemu-devel] [PATCH v15 9/9] support replication driver in blockdev-add

2016-02-04 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Eric Blake 
---
 qapi/block-core.json | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 12362b8..af723d5 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -247,6 +247,7 @@
 #   2.2: 'archipelago' added, 'cow' dropped
 #   2.3: 'host_floppy' deprecated
 #   2.5: 'host_floppy' dropped
+#   2.6: 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1567,6 +1568,7 @@
 # Drivers that are supported in block device operations.
 #
 # @host_device, @host_cdrom: Since 2.1
+# @replication: Since 2.6
 #
 # Since: 2.0
 ##
@@ -1574,8 +1576,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
 'http', 'https', 'null-aio', 'null-co', 'parallels',
-'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-'vmdk', 'vpc', 'vvfat' ] }
+'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication',
+'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsBase
@@ -2015,6 +2017,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.6
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
@@ -2051,6 +2066,7 @@
   'quorum': 'BlockdevOptionsQuorum',
   'raw':'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
   'tftp':   'BlockdevOptionsFile',
-- 
1.9.3

[Qemu-devel] [PATCH v15 3/9] Backup: clear all bitmap when doing block checkpoint

2016-02-04 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 block/backup.c| 15 +++
 include/block/block_int.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index 00cafdb..841ec27 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -251,6 +251,21 @@ static void backup_abort(BlockJob *job)
 }
 }
 
+void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+error_setg(errp, "The backup job only supports block checkpoint in"
+   " sync=none mode");
+return;
+}
+
+hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9204d1e..e67b6fb 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -700,6 +700,8 @@ void backup_start(BlockDriverState *bs, BlockDriverState 
*target,
   BlockCompletionFunc *cb, void *opaque,
   BlockJobTxn *txn, Error **errp);
 
+void backup_do_checkpoint(BlockJob *job, Error **errp);
+
 void blk_set_bs(BlockBackend *blk, BlockDriverState *bs);
 
 void blk_dev_change_media_cb(BlockBackend *blk, bool load);
-- 
1.9.3

[Qemu-devel] [PATCH v15 7/9] Introduce new APIs to do replication operation

2016-02-04 Thread Changlong Xie

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 Makefile.objs|  1 +
 qapi/block-core.json | 13 
 replication.c| 94 
 replication.h| 53 +
 4 files changed, 161 insertions(+)
 create mode 100644 replication.c
 create mode 100644 replication.h

diff --git a/Makefile.objs b/Makefile.objs
index 06b95c7..a8c74b7 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o
 block-obj-$(CONFIG_WIN32) += aio-win32.o
 block-obj-y += block/
 block-obj-y += qemu-io-cmds.o
+block-obj-y += replication.o
 
 block-obj-m = block/
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7e9e8fe..12362b8 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2002,6 +2002,19 @@
 '*read-pattern': 'QuorumReadPattern' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.6
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
diff --git a/replication.c b/replication.c
new file mode 100644
index 000..e8ac2f0
--- /dev/null
+++ b/replication.c
@@ -0,0 +1,94 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "replication.h"
+
+static QLIST_HEAD(, ReplicationState) replication_states;
+
+ReplicationState *replication_new(void *opaque, ReplicationOps *ops)
+{
+ReplicationState *rs;
+
+rs = g_new0(ReplicationState, 1);
+rs->opaque = opaque;
+rs->ops = ops;
+QLIST_INSERT_HEAD(&replication_states, rs, node);
+
+return rs;
+}
+
+void replication_remove(ReplicationState *rs)
+{
+QLIST_REMOVE(rs, node);
+g_free(rs);
+}
+
+/*
+ * The caller of the function *MUST* make sure vm stopped
+ */
+void replication_start_all(ReplicationMode mode, Error **errp)
+{
+ReplicationState *rs, *next;
+
+QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
+if (rs->ops && rs->ops->start) {
+rs->ops->start(rs, mode, errp);
+}
+if (*errp != NULL) {
+return;
+}
+}
+}
+
+void replication_do_checkpoint_all(Error **errp)
+{
+ReplicationState *rs, *next;
+
+QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
+if (rs->ops && rs->ops->checkpoint) {
+rs->ops->checkpoint(rs, errp);
+}
+if (*errp != NULL) {
+return;
+}
+}
+}
+
+void replication_get_error_all(Error **errp)
+{
+ReplicationState *rs, *next;
+
+QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
+if (rs->ops && rs->ops->get_error) {
+rs->ops->get_error(rs, errp);
+}
+if (*errp != NULL) {
+return;
+}
+}
+}
+
+void replication_stop_all(bool failover, Error **errp)
+{
+ReplicationState *rs, *next;
+
+QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
+if (rs->ops && rs->ops->stop) {
+rs->ops->stop(rs, failover, errp);
+}
+if (*errp != NULL) {
+return;
+}
+}
+}
diff --git a/replication.h b/replication.h
new file mode 100644
index 000..faea649
--- /dev/null
+++ b/replication.h
@@ -0,0 +1,53 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef REPLICATION_H
+#define REPLICATION_H
+
+#include "sysemu/sysemu.h"
+
+typedef struct ReplicationOps ReplicationOps;
+typedef struct ReplicationState ReplicationState;
+typedef void (*Start)(ReplicationState *rs, ReplicationMode mode, Error 
**errp);
+typedef void (*Stop)(ReplicationState *rs, bool failover, Error **errp);
+typedef void (*Checkpoint)(ReplicationState *rs, Error **errp);
+typedef void (*GetError)(ReplicationState *rs, Error **errp);
+
+struct ReplicationState {
+void *opaque;
+ReplicationOps *ops;
+QLIST_ENTRY(ReplicationState) node;
+};
+
+struct ReplicationOps{
+

[Qemu-devel] [PATCH v15 5/9] docs: block replication's description

2016-02-04 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 docs/block-replication.txt | 238 +
 1 file changed, 238 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..d8e20c7
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,238 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2016
+Copyright (c) 2016 Intel Corporation
+Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of the Primary and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort during a vmstate checkpoint, the disk modification operations of
+the Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  |  Copy and Forward| |
+  |-(1)--+   | Disk Buffer |
+  |  |   | |
+  | (3)  \-/
+  | speculative  ^
+  |write through(2)
+  |  |   |
+  V  V   |
+   +--+   ++
+   | Primary Disk |   | Secondary Disk |
+   +--+   ++
+
+1) Primary write requests will be copied and forwarded to Secondary
+   QEMU.
+2) Before Primary write requests are written to Secondary disk, the
+   original sector content will be read from Secondary disk and
+   buffered in the Disk buffer, but it will not overwrite the existing
+   sector content (it could be from either "Secondary Write Requests" or
+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Secondary disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+ virtio-blk   ||
+ ^||.--
+ |||| Secondary
+1 Quorum  ||'--
+ /  \ ||
+/\||
+   Primary2 filter
+ disk ^
 virtio-blk
+  |
  ^
+3 NBD  --->  3 NBD 
  |
+client|| server
  2 filter
+  ||^  
  ^
+. |||  
  |
+Primary | ||  Secondary disk <- hidden-disk 5 
<- active-disk 4
+' |||  backing^   backing
+  ||| |
+  ||| |
+  ||'-'
+

Re: [Qemu-devel] [PATCH v15 7/9] Introduce new APIs to do replication operation

2016-02-14 Thread Changlong Xie


On 02/15/2016 08:57 AM, Hailiang Zhang wrote:

On 2016/2/5 12:18, Changlong Xie wrote:

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
  Makefile.objs|  1 +
  qapi/block-core.json | 13 
  replication.c| 94

  replication.h| 53 +
  4 files changed, 161 insertions(+)
  create mode 100644 replication.c
  create mode 100644 replication.h

diff --git a/Makefile.objs b/Makefile.objs
index 06b95c7..a8c74b7 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o
  block-obj-$(CONFIG_WIN32) += aio-win32.o
  block-obj-y += block/
  block-obj-y += qemu-io-cmds.o
+block-obj-y += replication.o

  block-obj-m = block/

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7e9e8fe..12362b8 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2002,6 +2002,19 @@
  '*read-pattern': 'QuorumReadPattern' } }

  ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.6
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
  # @BlockdevOptions
  #
  # Options for creating a block device.
diff --git a/replication.c b/replication.c
new file mode 100644
index 000..e8ac2f0
--- /dev/null
+++ b/replication.c
@@ -0,0 +1,94 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "replication.h"
+
+static QLIST_HEAD(, ReplicationState) replication_states;
+
+ReplicationState *replication_new(void *opaque, ReplicationOps *ops)
+{
+ReplicationState *rs;
+
+rs = g_new0(ReplicationState, 1);
+rs->opaque = opaque;
+rs->ops = ops;
+QLIST_INSERT_HEAD(&replication_states, rs, node);
+
+return rs;
+}
+
+void replication_remove(ReplicationState *rs)
+{
+QLIST_REMOVE(rs, node);
+g_free(rs);
+}
+
+/*
+ * The caller of the function *MUST* make sure vm stopped
+ */
+void replication_start_all(ReplicationMode mode, Error **errp)
+{


Is this public API is only used for block ?
If yes, I'd like it with a 'block_' prefix.


+ReplicationState *rs, *next;
+
+QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
+if (rs->ops && rs->ops->start) {
+rs->ops->start(rs, mode, errp);
+}
+if (*errp != NULL) {


This is incorrect, you miss checking if errp is NULL,
if errp is NULL, there will be an error that accessing memory at address
0x0.
Same with other places in this patch.


You are right, will fix in next version.

Thanks
-Xie




+return;
+}
+}
+}
+
+void replication_do_checkpoint_all(Error **errp)
+{
+ReplicationState *rs, *next;
+
+QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
+if (rs->ops && rs->ops->checkpoint) {
+rs->ops->checkpoint(rs, errp);
+}
+if (*errp != NULL) {
+return;



+}
+}
+}
+
+void replication_get_error_all(Error **errp)
+{
+ReplicationState *rs, *next;
+
+QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
+if (rs->ops && rs->ops->get_error) {
+rs->ops->get_error(rs, errp);
+}
+if (*errp != NULL) {



+return;
+}
+}
+}
+
+void replication_stop_all(bool failover, Error **errp)
+{
+ReplicationState *rs, *next;
+
+QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
+if (rs->ops && rs->ops->stop) {
+rs->ops->stop(rs, failover, errp);
+}
+if (*errp != NULL) {
+return;
+}
+}
+}
diff --git a/replication.h b/replication.h
new file mode 100644
index 000..faea649
--- /dev/null
+++ b/replication.h
@@ -0,0 +1,53 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef REPLICATION_H
+#define REPLICATION_H
+
+#include "sysemu/sysemu.h"
+
+typedef struct ReplicationOps ReplicationOps;
+typedef struct ReplicationState ReplicationState;
+

Re: [Qemu-devel] [Qemu-block] [PATCH v9 3/3] qmp: add monitor command to add/remove a child

2016-02-14 Thread Changlong Xie


On 02/11/2016 02:02 AM, Max Reitz wrote:

On 25.12.2015 10:22, Changlong Xie wrote:

From: Wen Congyang 

The new QMP command name is x-blockdev-change. It's just for adding/removing
quorum's child now, and doesn't support all kinds of children, all kinds of
operations, nor all block drivers. So it is experimental now.

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
  blockdev.c   | 54 
  qapi/block-core.json | 23 ++
  qmp-commands.hx  | 47 +
  3 files changed, 124 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 64dbfeb..4e62fdf 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3836,6 +3836,60 @@ out:
  aio_context_release(aio_context);
  }

+static BlockDriverState *bdrv_find_child(BlockDriverState *parent_bs,
+ const char *child_name)
+{
+BdrvChild *child;
+
+QLIST_FOREACH(child, &parent_bs->children, next) {
+if (strcmp(child->name, child_name) == 0) {
+return child->bs;
+}
+}
+
+return NULL;
+}
+
+void qmp_x_blockdev_change(const char *parent, bool has_child,
+   const char *child, bool has_node,
+   const char *node, Error **errp)
+{
+BlockDriverState *parent_bs, *child_bs = NULL, *new_bs = NULL;
+
+parent_bs = bdrv_lookup_bs(parent, parent, errp);
+if (!parent_bs) {
+return;
+}
+
+if (has_child == has_node) {
+if (has_child) {
+error_setg(errp, "The paramter child and node is conflict");


"The parameters child and node are in conflict"

Or, more naturally:

"child and node may not be specified at the same time"



OK


+} else {
+error_setg(errp, "Either child or node should be specified");


s/should/must/


Ok




+}
+return;
+}
+
+if (has_child) {
+child_bs = bdrv_find_child(parent_bs, child);
+if (!child_bs) {
+error_setg(errp, "Node '%s' doesn't have child %s",


s/doesn't/does not/

(This is a personal opinion, but a pretty strong one.)

Also, if you put quotes around the node name, maybe you should do the
same around the child name.


OK




+   parent, child);
+return;
+}
+bdrv_del_child(parent_bs, child_bs, errp);
+}
+
+if (has_node) {
+new_bs = bdrv_find_node(node);
+if (!new_bs) {
+error_setg(errp, "Node '%s' not found", node);
+return;
+}
+bdrv_add_child(parent_bs, new_bs, errp);
+}
+}
+
  BlockJobInfoList *qmp_query_block_jobs(Error **errp)
  {
  BlockJobInfoList *head = NULL, **p_next = &head;
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1a5d9ce..fe63c6d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2408,3 +2408,26 @@
  ##
  { 'command': 'block-set-write-threshold',
'data': { 'node-name': 'str', 'write-threshold': 'uint64' } }
+
+##
+# @x-blockdev-change
+#
+# Dynamically reconfigure the block driver state graph. It can be used
+# to add, remove, insert or replace a block driver state. Currently only


I'd prefer "graph node" over BDS in this line.


ok




+# the Quorum driver implements this feature to add or remove its child.
+# This is useful to fix a broken quorum child.
+#


I'd like a list here what this command does depending on the parameters
given. For instance:

If @(new-)node is specified, it will be inserted under @parent. @child
may not be specified in this case.

If both @parent and @child are specified but @(new-)node is not, @child
will be detached from @parent.



ok


+# @parent: the id or name of the node that will be changed.


I don't know. The parent will actually not be changed, it's an edge that
will be changed; and the parent is the parent node of that edge. Just put

@parent: the id or name of the parent node

here.



Yes, you are right here.


+#
+# @child: #optional the name of the child that will be deleted.


For now. But maybe that will change in the future. Generally, it is just
the child node of the edge that will be changed. So just putting

@child: #optional the name of a child under the given parent node

(Let's hope this is clear enough that people know that this is not a
node name.)



ok


+#
+# @node: #optional the name of the node will be added.


Maybe this should be named new-node instead.

Also: "...of the node that will be added."


Yes, it's more clear.




+#
+# Note: this command is experimental, and its API is not stable.
+#
+# Since: 2.6
+##
+{ 'command': 'x-blockde

Re: [Qemu-devel] [PATCH V2] quorum: fix segfault when read fails in fifo mode

2016-02-14 Thread Changlong Xie


On 02/05/2016 11:46 PM, Eric Blake wrote:

On 02/04/2016 07:25 PM, Changlong Xie wrote:

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
  block/quorum.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index a5ae4b8..11cc60b 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -286,7 +286,8 @@ static void quorum_aio_cb(void *opaque, int ret)

  if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
  /* We try to read next child in FIFO order if we fail to read */
-if (ret < 0 && ++acb->child_iter < s->num_children) {
+if (ret < 0 && (acb->child_iter + 1) < s->num_children) {


Could shorten this as
   if (ret < 0 && acb->child_iter <= s->num_children) {
but I'm not sure it's worth the mental gymnastics.


Hi Eric

Just the same to me. Since it has been applied to block branch. Let's 
keep the original one.


Thanks
-Xie

Re: [Qemu-devel] [PATCH v15 0/9] Block replication for continuous checkpoints

2016-02-14 Thread Changlong Xie


Hi all

It seems no any review during the long Spring festival. Ping :)

Thanks
-Xie

On 02/05/2016 12:17 PM, Changlong Xie wrote:

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html

Patch status:
1. Acked patches: none
2. Reviewed patches: patch 4
3. Updated patches: patch 3, 4, 5, 7, 8
Note: patch 7 is a new patch that introduces genernal start/stop/checkpoint/
get_error
APIs.

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v15

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v14

TODO:
1. Continuous block replication. It will be started after basic functions
are accepted.

Changs Log:
V15:
1. Rebase to the newest codes
2. Fix typos and coding style addresed Eric's comments
3. Address Stefan's comments
1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver
2) Update the message and description for [PATCH 4/9]
3) Make replication_(start/stop/do_checkpoint)_all as global interfaces
4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks
5) Use BdrvChild instead of holding on to BlockDriverState * pointers
4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771
5. Introduce replication_get_error_all to check replication status
6. Remove useless discard interface
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
structure, also remove old block layer APIs
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Changlong Xie (1):
   Introduce new APIs to do replication operation

Wen Congyang (8):
   unblock backup operations in backing file
   Store parent BDS in BdrvChild
   Backup: clear all bitmap when doing block checkpoint
   Link backup into block core
   docs: block replication's description
   auto complete active commit
   Implement new driver for block replication
   support replication driver in blockdev-add

  Makefile.objs  |   1 +
  block.c|  19 ++
  block/Makefile.objs|   3 +-
  block/backup.c |  15 ++
  block/mirror.c |  13 +-
  block/replication.c| 594 +
  blockdev.c |   2 +-
  docs/block-replication.txt | 238 ++
  include/block/block_int.h  |   6 +-
  qapi/block-core.json   |  33 ++-
  qemu-img.c |   2 +-
  replication.c  |  94 +++
  replication.h  |  53 
  13 files changed, 1063 insertions(+), 10 deletions(-)
  create mode 100644 block/replication.c
  create mode 100644 docs/block-replication.txt
  create mode 100644 replication.c
  create mode 100644 replication.h

[Qemu-devel] [PATCH 1/1] quorum: change vote rules for 64 bits hash

2016-02-15 Thread Changlong Xie

Before:
1) vote_count > max: winner = candidate, update max
2) vote_count <= max: do nothing
Current:
1) vote_count > max: winner = candidate, update max
2) vote_count = max: compare the value of winner with
candidate, if candidate->value.l == 0, winner = candidate,
else do nothing
3) vote_count < max: do nothing

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index a5ae4b8..e431ff4 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -446,7 +446,7 @@ static int quorum_compute_hash(QuorumAIOCB *acb, int i, 
QuorumVoteValue *hash)
 return 0;
 }
 
-static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
+static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes, bool 
vote_error)
 {
 int max = 0;
 QuorumVoteVersion *candidate, *winner = NULL;
@@ -455,6 +455,12 @@ static QuorumVoteVersion 
*quorum_get_vote_winner(QuorumVotes *votes)
 if (candidate->vote_count > max) {
 max = candidate->vote_count;
 winner = candidate;
+continue;
+}
+/* For 64 bit hash */
+if (vote_error == true && candidate->vote_count == max
+&& candidate->value.l == 0) {
+winner = candidate;
 }
 }
 
@@ -544,7 +550,7 @@ static int quorum_vote_error(QuorumAIOCB *acb)
 }
 
 if (error) {
-winner = quorum_get_vote_winner(&error_votes);
+winner = quorum_get_vote_winner(&error_votes, false);
 ret = winner->value.l;
 }
 
@@ -609,7 +615,7 @@ static bool quorum_vote(QuorumAIOCB *acb)
 }
 
 /* vote to select the most represented version */
-winner = quorum_get_vote_winner(&acb->votes);
+winner = quorum_get_vote_winner(&acb->votes, false);
 
 /* if the winner count is smaller than threshold the read fails */
 if (winner->vote_count < s->threshold) {
@@ -769,7 +775,7 @@ static coroutine_fn int quorum_co_flush(BlockDriverState 
*bs)
 quorum_count_vote(&error_votes, &result_value, i);
 }
 
-winner = quorum_get_vote_winner(&error_votes);
+winner = quorum_get_vote_winner(&error_votes, true);
 result = winner->value.l;
 
 quorum_free_vote_list(&error_votes);
-- 
1.9.3

[Qemu-devel] [PATCH 0/1] change quorum vote rules for 64-bits hash

2016-02-15 Thread Changlong Xie

If quorum has two children(A, B). A do flush sucessfully, but 
B flush failed. We MUST choice A as winner, otherwise we will 
get following errors:

{"timestamp": {"seconds": 1455641588, "microseconds": 415937}, 
"event": "BLOCK_IO_ERROR", "data": {"device": "colo-disk", 
"nospace": false, "reason": "Bad file descriptor", "operation": 
"write", "action": "report"}}

And the filesystem of guest became read-only with following errors:

[xxx] end_request: I/O error, dev vda, sector 11159960
[xxx] Aborting journal on device vda3-8
[xxx] EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort 
journal
[xxx] EXT4-fs (vda3): Remounting filesystem read-only

[Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.html 

Changlong Xie (1):
  quorum: change vote rules for 64 bits hash

 block/quorum.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

-- 
1.9.3

Re: [Qemu-devel] [PATCH 1/1] quorum: change vote rules for 64 bits hash

2016-02-15 Thread Changlong Xie


On 02/16/2016 01:10 AM, Eric Blake wrote:

On 02/15/2016 02:52 AM, Changlong Xie wrote:

Before:
1) vote_count > max: winner = candidate, update max
2) vote_count <= max: do nothing
Current:
1) vote_count > max: winner = candidate, update max
2) vote_count = max: compare the value of winner with
candidate, if candidate->value.l == 0, winner = candidate,
else do nothing
3) vote_count < max: do nothing


This says what you did, but not why.  Can you demonstrate a scenario
that is broken without the patch, as part of your commit message?


Surely, will do in next version.





Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
  block/quorum.c | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index a5ae4b8..e431ff4 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -446,7 +446,7 @@ static int quorum_compute_hash(QuorumAIOCB *acb, int i, 
QuorumVoteValue *hash)
  return 0;
  }

-static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
+static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes, bool 
vote_error)


Long line. Please wrap things to stay within 80 columns.



Ok


  {
  int max = 0;
  QuorumVoteVersion *candidate, *winner = NULL;
@@ -455,6 +455,12 @@ static QuorumVoteVersion 
*quorum_get_vote_winner(QuorumVotes *votes)
  if (candidate->vote_count > max) {
  max = candidate->vote_count;
  winner = candidate;
+continue;
+}
+/* For 64 bit hash */
+if (vote_error == true && candidate->vote_count == max


s/ == true// (no need to do a redundant comparison of a bool against a
bool).


Ok

Thanks
-Xie

[Qemu-devel] [PATCH v2 0/1] change quorum vote rules for 64-bits hash

2016-02-15 Thread Changlong Xie

If quorum has two children(A, B). A do flush sucessfully, but 
B flush failed. We MUST choice A as winner, otherwise we will 
get following errors:

{"timestamp": {"seconds": 1455641588, "microseconds": 415937}, 
"event": "BLOCK_IO_ERROR", "data": {"device": "colo-disk", 
"nospace": false, "reason": "Bad file descriptor", "operation": 
"write", "action": "report"}}

And the filesystem of guest became read-only with following errors:

[xxx] end_request: I/O error, dev vda, sector 11159960
[xxx] Aborting journal on device vda3-8
[xxx] EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort 
journal
[xxx] EXT4-fs (vda3): Remounting filesystem read-only

[Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.html 

Changlong Xie (1):
  quorum: Change vote rules for 64 bits hash

 block/quorum.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

-- 
1.9.3

[Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash

2016-02-15 Thread Changlong Xie

If quorum has two children(A, B). A do flush sucessfully, but B flush failed.
We MUST choice A as winner rather than just pick anyone of them. Otherwise
the filesystem of guest will become read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 11cc60b..f094208 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -447,7 +447,8 @@ static int quorum_compute_hash(QuorumAIOCB *acb, int i, 
QuorumVoteValue *hash)
 return 0;
 }
 
-static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
+static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes,
+ bool vote_error)
 {
 int max = 0;
 QuorumVoteVersion *candidate, *winner = NULL;
@@ -456,6 +457,12 @@ static QuorumVoteVersion 
*quorum_get_vote_winner(QuorumVotes *votes)
 if (candidate->vote_count > max) {
 max = candidate->vote_count;
 winner = candidate;
+continue;
+}
+/* For 64 bit hash */
+if (vote_error && candidate->vote_count == max
+&& candidate->value.l == 0) {
+winner = candidate;
 }
 }
 
@@ -545,7 +552,7 @@ static int quorum_vote_error(QuorumAIOCB *acb)
 }
 
 if (error) {
-winner = quorum_get_vote_winner(&error_votes);
+winner = quorum_get_vote_winner(&error_votes, false);
 ret = winner->value.l;
 }
 
@@ -610,7 +617,7 @@ static bool quorum_vote(QuorumAIOCB *acb)
 }
 
 /* vote to select the most represented version */
-winner = quorum_get_vote_winner(&acb->votes);
+winner = quorum_get_vote_winner(&acb->votes, false);
 
 /* if the winner count is smaller than threshold the read fails */
 if (winner->vote_count < s->threshold) {
@@ -770,7 +777,7 @@ static coroutine_fn int quorum_co_flush(BlockDriverState 
*bs)
 quorum_count_vote(&error_votes, &result_value, i);
 }
 
-winner = quorum_get_vote_winner(&error_votes);
+winner = quorum_get_vote_winner(&error_votes, true);
 result = winner->value.l;
 
 quorum_free_vote_list(&error_votes);
-- 
1.9.3

Re: [Qemu-devel] [PATCH v9 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-02-15 Thread Changlong Xie


On 02/09/2016 01:06 AM, Alberto Garcia wrote:

On Fri 22 Jan 2016 09:02:10 PM CET, "Dr. David Alan Gilbert" 
 wrote:


In general, what do you do to make sure that the data in a new
Quorum child is consistent with that of the rest of the array?


Quorum can have more than one child when it starts. But we don't
do the similar check. So I don't think we should do such check
here.


Yes, but when you start a VM you can verify in advance that all
members of the Quorum have the same data. If you do that on a
running VM how can you know if the new disk is consistent with the
others?


User error if it is not.  Just the same as it is user error if you
request a shallow drive-mirror but the destination is not the same
contents as the backing file.  I don't think qemu has to protect us
from user error in this case.


But the backing file is read-only so the user can guarantee that the
destination has the same data before the shallow mirror. How do you
do that in this case?


I think in the colo case they're relying on doing a block migrate to
synchronise the remote disk prior to switching into colo mode.


Yes but this is a general API that can be used independently from
COLO. I'd say if we want to allow that we should at least place a big
warning in the documentation.



Ok, that's fair enough. Will add in next version.

Thanks
-Xie

Berto


.

[Qemu-devel] [PATCH v10 0/3] qapi: child add/delete support

2016-02-16 Thread Changlong Xie

ChangLog:
v9~v10:
1. Rebase to the newest codes
2. Address comments from Berto and Max, update the documents in block-core.json 
   and qmp-commands.hx 
3. Remove redundant codes in quorum_add_child() and quorum_del_child()
v8:
1. Rebase to the newest codes
2. Address the comments from Eric Blake
v7:
1. Remove the qmp command x-blockdev-change's parameter operation according
   to Kevin's comments.
2. Remove the hmp command.
v6:
1. Use a single qmp command x-blockdev-change to replace x-blockdev-child-add
   and x-blockdev-child-delete
v5:
1. Address Eric Blake's comments
v4:
1. drop nbd driver's implementation. We can use human-monitor-command
   to do it.
2. Rename the command name.
v3:
1. Don't open BDS in bdrv_add_child(). Use the existing BDS which is
   created by the QMP command blockdev-add.
2. The driver NBD can support filename, path, host:port now.
v2:
1. Use bdrv_get_device_or_node_name() instead of new function
   bdrv_get_id_or_node_name()
2. Update the error message
3. Update the documents in block-core.json

Wen Congyang (3):
  Add new block driver interface to add/delete a BDS's child
  quorum: implement bdrv_add_child() and bdrv_del_child()
  qmp: add monitor command to add/remove a child

 block.c   |  58 --
 block/quorum.c| 122 +-
 blockdev.c|  54 
 include/block/block.h |   9 
 include/block/block_int.h |   5 ++
 qapi/block-core.json  |  32 
 qmp-commands.hx   |  50 +++
 7 files changed, 324 insertions(+), 6 deletions(-)

-- 
1.9.3

[Qemu-devel] [PATCH v10 1/3] Add new block driver interface to add/delete a BDS's child

2016-02-16 Thread Changlong Xie

From: Wen Congyang 

In some cases, we want to take a quorum child offline, and take
another child online.

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Eric Blake 
Reviewed-by: Alberto Garcia 
---
 block.c   | 50 +++
 include/block/block.h |  5 +
 include/block/block_int.h |  5 +
 3 files changed, 60 insertions(+)

diff --git a/block.c b/block.c
index efc3c43..08aa979 100644
--- a/block.c
+++ b/block.c
@@ -4399,3 +4399,53 @@ void bdrv_refresh_filename(BlockDriverState *bs)
 QDECREF(json);
 }
 }
+
+/*
+ * Hot add/remove a BDS's child. So the user can take a child offline when
+ * it is broken and take a new child online
+ */
+void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
+Error **errp)
+{
+
+if (!parent_bs->drv || !parent_bs->drv->bdrv_add_child) {
+error_setg(errp, "The node %s doesn't support adding a child",
+   bdrv_get_device_or_node_name(parent_bs));
+return;
+}
+
+if (!QLIST_EMPTY(&child_bs->parents)) {
+error_setg(errp, "The node %s already has a parent",
+   child_bs->node_name);
+return;
+}
+
+parent_bs->drv->bdrv_add_child(parent_bs, child_bs, errp);
+}
+
+void bdrv_del_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
+Error **errp)
+{
+BdrvChild *child;
+
+if (!parent_bs->drv || !parent_bs->drv->bdrv_del_child) {
+error_setg(errp, "The node %s doesn't support removing a child",
+   bdrv_get_device_or_node_name(parent_bs));
+return;
+}
+
+QLIST_FOREACH(child, &parent_bs->children, next) {
+if (child->bs == child_bs) {
+break;
+}
+}
+
+if (!child) {
+error_setg(errp, "The node %s is not a child of %s",
+   bdrv_get_device_or_node_name(child_bs),
+   bdrv_get_device_or_node_name(parent_bs));
+return;
+}
+
+parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp);
+}
diff --git a/include/block/block.h b/include/block/block.h
index 1c4f4d8..ecde190 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -582,4 +582,9 @@ void bdrv_drained_begin(BlockDriverState *bs);
  */
 void bdrv_drained_end(BlockDriverState *bs);
 
+void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child,
+Error **errp);
+void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child,
+Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9ef823a..89ec138 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -305,6 +305,11 @@ struct BlockDriver {
  */
 void (*bdrv_drain)(BlockDriverState *bs);
 
+void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
+   Error **errp);
+void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child,
+   Error **errp);
+
 QLIST_ENTRY(BlockDriver) list;
 };
 
-- 
1.9.3

[Qemu-devel] [PATCH v10 3/3] qmp: add monitor command to add/remove a child

2016-02-16 Thread Changlong Xie

From: Wen Congyang 

The new QMP command name is x-blockdev-change. It's just for adding/removing
quorum's child now, and doesn't support all kinds of children, all kinds of
operations, nor all block drivers. So it is experimental now.

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 blockdev.c   | 54 
 qapi/block-core.json | 32 +++
 qmp-commands.hx  | 50 
 3 files changed, 136 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 1f73478..ca040b0 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3983,6 +3983,60 @@ out:
 aio_context_release(aio_context);
 }
 
+static BlockDriverState *bdrv_find_child(BlockDriverState *parent_bs,
+ const char *child_name)
+{
+BdrvChild *child;
+
+QLIST_FOREACH(child, &parent_bs->children, next) {
+if (strcmp(child->name, child_name) == 0) {
+return child->bs;
+}
+}
+
+return NULL;
+}
+
+void qmp_x_blockdev_change(const char *parent, bool has_child,
+   const char *child, bool has_node,
+   const char *node, Error **errp)
+{
+BlockDriverState *parent_bs, *child_bs = NULL, *new_bs = NULL;
+
+parent_bs = bdrv_lookup_bs(parent, parent, errp);
+if (!parent_bs) {
+return;
+}
+
+if (has_child == has_node) {
+if (has_child) {
+error_setg(errp, "The parameters child and node are in conflict");
+} else {
+error_setg(errp, "Either child or node must be specified");
+}
+return;
+}
+
+if (has_child) {
+child_bs = bdrv_find_child(parent_bs, child);
+if (!child_bs) {
+error_setg(errp, "Node '%s' does not have child '%s'",
+   parent, child);
+return;
+}
+bdrv_del_child(parent_bs, child_bs, errp);
+}
+
+if (has_node) {
+new_bs = bdrv_find_node(node);
+if (!new_bs) {
+error_setg(errp, "Node '%s' not found", node);
+return;
+}
+bdrv_add_child(parent_bs, new_bs, errp);
+}
+}
+
 BlockJobInfoList *qmp_query_block_jobs(Error **errp)
 {
 BlockJobInfoList *head = NULL, **p_next = &head;
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 33012b8..92eb7fe 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2482,3 +2482,35 @@
 ##
 { 'command': 'block-set-write-threshold',
   'data': { 'node-name': 'str', 'write-threshold': 'uint64' } }
+
+##
+# @x-blockdev-change
+#
+# Dynamically reconfigure the block driver state graph. It can be used
+# to add, remove, insert or replace a graph node. Currently only the
+# Quorum driver implements this feature to add or remove its child. This
+# is useful to fix a broken quorum child.
+#
+# If @node is specified, it will be inserted under @parent. @child
+# may not be specified in this case. If both @parent and @child are
+# specified but @node is not, @child will be detached from @parent.
+#
+# @parent: the id or name of the parent node.
+#
+# @child: #optional the name of a child under the given parent node.
+#
+# @node: #optional the name of the node that will be added.
+#
+# Note: this command is experimental, and its API is not stable. It
+# does not support all kinds of operations, all kinds of children, nor
+# all block drivers.
+#
+# Warning: The data in a new quorum child MUST be consistent with that of
+# the rest of the array.
+#
+# Since: 2.6
+##
+{ 'command': 'x-blockdev-change',
+  'data' : { 'parent': 'str',
+ '*child': 'str',
+ '*node': 'str' } }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 020e5ee..1c9a06f 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -4364,6 +4364,56 @@ Example:
 EQMP
 
 {
+.name   = "x-blockdev-change",
+.args_type  = "parent:B,child:B?,node:B?",
+.mhandler.cmd_new = qmp_marshal_x_blockdev_change,
+},
+
+SQMP
+x-blockdev-change
+-
+
+Dynamically reconfigure the block driver state graph. It can be used
+to add, remove, insert or replace a graph node. Currently only the
+Quorum driver implements this feature to add or remove its child. This
+is useful to fix a broken quorum child.
+
+Arguments:
+- "parent": the id or name of the parent node (json-string)
+- "child": the name of a child under the given parent node (json-string, 
optional)
+- "node": the name of the node that will be added (json-string, optional)
+
+Note: thi

[Qemu-devel] [PATCH v10 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-02-16 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 block.c   |   8 ++--
 block/quorum.c| 122 +-
 include/block/block.h |   4 ++
 3 files changed, 128 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index 08aa979..c3c9dc0 100644
--- a/block.c
+++ b/block.c
@@ -1198,10 +1198,10 @@ static int bdrv_fill_options(QDict **options, const 
char *filename,
 return 0;
 }
 
-static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
-BlockDriverState *child_bs,
-const char *child_name,
-const BdrvChildRole *child_role)
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+ BlockDriverState *child_bs,
+ const char *child_name,
+ const BdrvChildRole *child_role)
 {
 BdrvChild *child = g_new(BdrvChild, 1);
 *child = (BdrvChild) {
diff --git a/block/quorum.c b/block/quorum.c
index a5ae4b8..e5a7e4f 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #include "qapi/qmp/qstring.h"
 #include "qapi-event.h"
 #include "crypto/hash.h"
+#include "qemu/bitmap.h"
 
 #define HASH_LENGTH 32
 
@@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
 bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
 * block if Quorum is reached.
 */
+unsigned long *index_bitmap;
+int bsize;
 
 QuorumReadPattern read_pattern;
 } BDRVQuorumState;
@@ -876,9 +879,9 @@ static int quorum_open(BlockDriverState *bs, QDict 
*options, int flags,
 ret = -EINVAL;
 goto exit;
 }
-if (s->num_children < 2) {
+if (s->num_children < 1) {
 error_setg(&local_err,
-   "Number of provided children must be greater than 1");
+   "Number of provided children must be 1 or more");
 ret = -EINVAL;
 goto exit;
 }
@@ -927,6 +930,7 @@ static int quorum_open(BlockDriverState *bs, QDict 
*options, int flags,
 /* allocate the children array */
 s->children = g_new0(BdrvChild *, s->num_children);
 opened = g_new0(bool, s->num_children);
+s->index_bitmap = bitmap_new(s->num_children);
 
 for (i = 0; i < s->num_children; i++) {
 char indexstr[32];
@@ -942,6 +946,8 @@ static int quorum_open(BlockDriverState *bs, QDict 
*options, int flags,
 
 opened[i] = true;
 }
+bitmap_set(s->index_bitmap, 0, s->num_children);
+s->bsize = s->num_children;
 
 g_free(opened);
 goto exit;
@@ -998,6 +1004,115 @@ static void quorum_attach_aio_context(BlockDriverState 
*bs,
 }
 }
 
+static int get_new_child_index(BDRVQuorumState *s)
+{
+int index;
+
+index = find_next_zero_bit(s->index_bitmap, s->bsize, 0);
+if (index < s->bsize) {
+return index;
+}
+
+if ((s->bsize % BITS_PER_LONG) == 0) {
+s->index_bitmap = bitmap_zero_extend(s->index_bitmap, s->bsize,
+ s->bsize + 1);
+}
+
+return s->bsize++;
+}
+
+static void remove_child_index(BDRVQuorumState *s, int index)
+{
+int last_index;
+long new_len;
+
+assert(index < s->bsize);
+
+clear_bit(index, s->index_bitmap);
+if (index < s->bsize - 1) {
+/*
+ * The last bit is always set, and we don't clear
+ * the last bit.
+ */
+return;
+}
+
+last_index = find_last_bit(s->index_bitmap, s->bsize);
+s->bsize = last_index + 1;
+if (BITS_TO_LONGS(last_index + 1) == BITS_TO_LONGS(s->bsize)) {
+return;
+}
+
+new_len = BITS_TO_LONGS(last_index + 1) * sizeof(unsigned long);
+s->index_bitmap = g_realloc(s->index_bitmap, new_len);
+}
+
+static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
+ Error **errp)
+{
+BDRVQuorumState *s = bs->opaque;
+BdrvChild *child;
+char indexstr[32];
+int index, ret;
+
+index = get_new_child_index(s);
+ret = snprintf(indexstr, 32, "children.%d", index);
+if (ret < 0 || ret >= 32) {
+error_setg(errp, "cannot generate child name");
+return;
+}
+
+bdrv_drain(bs);
+
+assert(s->num_children <= INT_MAX / sizeof(BdrvChild *));
+if (s->num_children == INT_MAX / sizeof(BdrvChild *)) {
+error_setg(errp, "Too many children");
+return;
+}
+s->children = g_renew(BdrvChild *, s->children, s->num_children + 1);
+
+bdrv_ref(child_bs);
+child = bdrv_attach_child(bs, child_bs, in

[Qemu-devel] [RFC PATCH 1/3] linux-headers: Update VFIO headers from linux-next tag ToBeFilled

2015-12-11 Thread Yongji Xie

Syncup VFIO related linux headers from linux-next tree using
scripts/update-linux-headers.sh.

Integrate two VFIO-PCI ioctl flags:
- VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED
- VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP

Signed-off-by: Yongji Xie 
---
 linux-headers/linux/vfio.h |4 
 1 file changed, 4 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index aa276bc..3b79d63 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -164,6 +164,10 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_PCI  (1 << 1)/* vfio-pci device */
 #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)/* vfio-platform device */
 #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)   /* vfio-amba device */
+/* Platform support all PCI MMIO BARs to be page aligned */
+#define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED  (1 << 4)
+/* platform support mmapping PCI MSI-X vector table */
+#define VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP  (1 << 5)
__u32   num_regions;/* Max region index + 1 */
__u32   num_irqs;   /* Max IRQ index + 1 */
 };
-- 
1.7.9.5

[Qemu-devel] [RFC PATCH 0/3] vfio/pci: Add support for mmapping sub-page MMIO BARs and MSI-X table

2015-12-11 Thread Yongji Xie

This patch set adds support for two VFIO-PCI ioctl flags:
VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED and VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP.
Note that the kernel bits of this two flags are not in upstream and
posted as .

VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED indicates that platform support
all PCI MMIO BARs to be page aligned which means any BARs' mmio
page would not be shared with other BARs. So it's allowed to mmap
sub-page(size < PAGE_SIZE) MMIO BARs in QEMU.

VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP indicates that it's safe to mmap MSI-X
table in QEMU.

Besides supporting for these two mmapping cases, we also fix some issues
which would block KVM to create memory slot for these mmapped sub-page
MMIO BARs and mmapped MSI-X table.

Posted kernel patches link:
- https://www.mail-archive.com/kvm@vger.kernel.org/msg124031.html

Yongji Xie (3):
  linux-headers: Update VFIO headers from linux-next tag ToBeFilled
  vfio/pci: Add support for mmapped sub-page MMIO BARs
  vfio/pci: Add support for mmapping MSI-X table

 hw/vfio/pci.c  |   65 
 linux-headers/linux/vfio.h |4 +++
 2 files changed, 64 insertions(+), 5 deletions(-)

-- 
1.7.9.5

[Qemu-devel] [RFC PATCH 2/3] vfio/pci: Add support for mmapped sub-page MMIO BARs

2015-12-11 Thread Yongji Xie

The VFIO-PCI ioctl flag VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED indicates platform
support all PCI MMIO BARs to be page aligned and sub-page(size < PAGE_SIZE)
MMIO BARs can be mmapped.

But this has an issue for these mmapped sub-page MMIO BARs - KVM would not
allow to create memory slot for them via ioctl KVM_SET_USER_MEMORY_REGION
because these BARs' size are still less than PAGE_SIZE.

So this patch expands these sub-page MMIO BARs to page size in
vfio_pci_write_config(), so that the BARs could be passed to KVM
ioctl KVM_SET_USER_MEMORY_REGION with a valid size. And we also set the
priority of these BARs' memory regions to zero in case of overlap with
page unaligned BARs when guest doesn't support all PCI MMIO BARs to be
page aligned.

Signed-off-by: Yongji Xie 
---
 hw/vfio/pci.c |   46 ++
 1 file changed, 46 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1fb868c..a7c6171 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1028,6 +1028,44 @@ static const MemoryRegionOps vfio_vga_ops = {
 };
 
 /*
+ * Expand sub-page(size < PAGE_SIZE) MMIO BARs to page size if host support
+ * all MMIO BARs to be page aligned. And we should set the priority of these
+ * BARs' memory regions to zero in case of overlap with page unaligned
+ * BARs when guest doesn't support all MMIO BARs to be page aligned.
+ */
+static void vfio_expand_bar_to_page_size(PCIDevice *pdev)
+{
+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+MemoryRegion *mmap_mr;
+MemoryRegion *mr;
+PCIIORegion *r;
+pcibus_t bar_addr;
+int i;
+
+memory_region_transaction_begin();
+for (i = 0; i < PCI_NUM_REGIONS; i++) {
+r = &pdev->io_regions[i];
+if (r->size > 0 && r->size < qemu_real_host_page_size) {
+bar_addr = r->addr;
+if (bar_addr != PCI_BAR_UNMAPPED &&
+!(bar_addr & ~qemu_real_host_page_mask)) {
+mr = &vdev->bars[i].region.mem;
+mmap_mr = &vdev->bars[i].region.mmap_mem;
+if (memory_region_is_mapped(mr) && vdev->bars[i].region.mmap &&
+memory_region_size(mr) < qemu_real_host_page_size) {
+memory_region_del_subregion(r->address_space, mr);
+memory_region_set_size(mr, qemu_real_host_page_size);
+memory_region_set_size(mmap_mr, qemu_real_host_page_size);
+memory_region_add_subregion_overlap(r->address_space,
+bar_addr, mr, 0);
+}
+}
+}
+}
+memory_region_transaction_commit();
+}
+
+/*
  * PCI config space
  */
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
@@ -1112,6 +1150,14 @@ void vfio_pci_write_config(PCIDevice *pdev,
 } else if (was_enabled && !is_enabled) {
 vfio_msix_disable(vdev);
 }
+} else if (vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED &&
+(ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) ||
+ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
+ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
+range_covers_byte(addr, len, PCI_COMMAND))) {
+pci_default_write_config(pdev, addr, val, len);
+
+vfio_expand_bar_to_page_size(pdev);
 } else {
 /* Write everything to QEMU to keep emulated bits correct */
 pci_default_write_config(pdev, addr, val, len);
-- 
1.7.9.5

[Qemu-devel] [RFC PATCH 3/3] vfio/pci: Add support for mmapping MSI-X table

2015-12-11 Thread Yongji Xie

The VFIO-PCI ioctl VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP flag indicate that
platform support mmapping MSI-X table.

With this flag set, we skip some MSI-X table check so that QEMU can
mmap MSI-X table successfully. And we also raise the priority of mmap
memory region in case of overlap with MSI-X/PBA memory region which
is added in msix_init().

Signed-off-by: Yongji Xie 
---
 hw/vfio/pci.c |   19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a7c6171..bec301c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1339,7 +1339,8 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, 
bool enabled)
 }
 
 memory_region_set_enabled(&bar->region.mmap_mem, enabled);
-if (vdev->msix && vdev->msix->table_bar == i) {
+if (vdev->msix && vdev->msix->table_bar == i &&
+!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) {
 memory_region_set_enabled(&vdev->msix->mmap_mem, enabled);
 }
 }
@@ -1357,7 +1358,8 @@ static void vfio_unregister_bar(VFIOPCIDevice *vdev, int 
nr)
 
 memory_region_del_subregion(&bar->region.mem, &bar->region.mmap_mem);
 
-if (vdev->msix && vdev->msix->table_bar == nr) {
+if (vdev->msix && vdev->msix->table_bar == nr &&
+!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) {
 memory_region_del_subregion(&bar->region.mem, &vdev->msix->mmap_mem);
 }
 }
@@ -1374,7 +1376,8 @@ static void vfio_unmap_bar(VFIOPCIDevice *vdev, int nr)
 
 munmap(bar->region.mmap, memory_region_size(&bar->region.mmap_mem));
 
-if (vdev->msix && vdev->msix->table_bar == nr) {
+if (vdev->msix && vdev->msix->table_bar == nr &&
+!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) {
 munmap(vdev->msix->mmap, memory_region_size(&vdev->msix->mmap_mem));
 }
 }
@@ -1420,7 +1423,8 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
  * We can't mmap areas overlapping the MSIX vector table, so we
  * potentially insert a direct-mapped subregion before and after it.
  */
-if (vdev->msix && vdev->msix->table_bar == nr) {
+if (vdev->msix && vdev->msix->table_bar == nr &&
+!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) {
 size = vdev->msix->table_offset & qemu_real_host_page_mask;
 }
 
@@ -1429,9 +1433,14 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
   &bar->region.mmap_mem, &bar->region.mmap,
   size, 0, name)) {
 error_report("%s unsupported. Performance may be slow", name);
+} else if (vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP) {
+memory_region_del_subregion(&bar->region.mem, &bar->region.mmap_mem);
+memory_region_add_subregion_overlap(&bar->region.mem, 0,
+  &bar->region.mmap_mem, 1);
 }
 
-if (vdev->msix && vdev->msix->table_bar == nr) {
+if (vdev->msix && vdev->msix->table_bar == nr &&
+!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) {
 uint64_t start;
 
 start = REAL_HOST_PAGE_ALIGN((uint64_t)vdev->msix->table_offset +
-- 
1.7.9.5

Re: [Qemu-devel] [PATCH COLO-Frame v12 18/38] COLO: Flush PVM's cached RAM into SVM's memory

2015-12-15 Thread Changlong Xie


On 12/15/2015 04:22 PM, zhanghailiang wrote:

During the time of VM's running, PVM may dirty some pages, we will transfer
PVM's dirty pages to SVM and store them into SVM's RAM cache at next checkpoint
time. So, the content of SVM's RAM cache will always be some with PVM's memory

"some" => "same"

Thanks
-Xie

after checkpoint.

Instead of flushing all content of PVM's RAM cache into SVM's MEMORY,
we do this in a more efficient way:
Only flush any page that dirtied by PVM since last checkpoint.
In this way, we can ensure SVM's memory same with PVM's.

Besides, we must ensure flush RAM cache before load device state.

Signed-off-by: zhanghailiang 
Signed-off-by: Li Zhijian 
Signed-off-by: Gonglei 
Reviewed-by: Dr. David Alan Gilbert 
---
v12:
- Add a trace point in the end of colo_flush_ram_cache() (Dave's suggestion)
- Add Reviewed-by tag
v11:
- Move the place of 'need_flush' (Dave's suggestion)
- Remove unused 'DPRINTF("Flush ram_cache\n")'
v10:
- trace the number of dirty pages that be received.

Signed-off-by: zhanghailiang 
---
  include/migration/migration.h |  1 +
  migration/colo.c  |  2 --
  migration/ram.c   | 38 ++
  trace-events  |  2 ++
  4 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e41372d..221176b 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -336,4 +336,5 @@ PostcopyState postcopy_state_set(PostcopyState new_state);
  /* ram cache */
  int colo_init_ram_cache(void);
  void colo_release_ram_cache(void);
+void colo_flush_ram_cache(void);
  #endif
diff --git a/migration/colo.c b/migration/colo.c
index a4d49ff..e40cdb9 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -401,8 +401,6 @@ void *colo_process_incoming_thread(void *opaque)
  }
  qemu_mutex_unlock_iothread();

-/* TODO: flush vm state */
-
  ret = colo_put_cmd(mis->to_src_file, COLO_COMMAND_VMSTATE_LOADED);
  if (ret < 0) {
  goto out;
diff --git a/migration/ram.c b/migration/ram.c
index 3d5947b..8ff7f7c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2458,6 +2458,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
   * be atomic
   */
  bool postcopy_running = postcopy_state_get() >= 
POSTCOPY_INCOMING_LISTENING;
+bool need_flush = false;

  seq_iter++;

@@ -2492,6 +2493,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
  /* After going into COLO, we should load the Page into colo_cache 
*/
  if (ram_cache_enable) {
  host = colo_cache_from_block_offset(block, addr);
+need_flush = true;
  } else {
  host = host_from_ram_block_offset(block, addr);
  }
@@ -2585,6 +2587,10 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
  }

  rcu_read_unlock();
+
+if (!ret  && ram_cache_enable && need_flush) {
+colo_flush_ram_cache();
+}
  DPRINTF("Completed load of VM with exit code %d seq iteration "
  "%" PRIu64 "\n", ret, seq_iter);
  return ret;
@@ -2657,6 +2663,38 @@ void colo_release_ram_cache(void)
  rcu_read_unlock();
  }

+/*
+ * Flush content of RAM cache into SVM's memory.
+ * Only flush the pages that be dirtied by PVM or SVM or both.
+ */
+void colo_flush_ram_cache(void)
+{
+RAMBlock *block = NULL;
+void *dst_host;
+void *src_host;
+ram_addr_t offset = 0;
+
+trace_colo_flush_ram_cache_begin(migration_dirty_pages);
+rcu_read_lock();
+block = QLIST_FIRST_RCU(&ram_list.blocks);
+while (block) {
+ram_addr_t ram_addr_abs;
+offset = migration_bitmap_find_dirty(block, offset, &ram_addr_abs);
+migration_bitmap_clear_dirty(ram_addr_abs);
+if (offset >= block->used_length) {
+offset = 0;
+block = QLIST_NEXT_RCU(block, next);
+} else {
+dst_host = block->host + offset;
+src_host = block->colo_cache + offset;
+memcpy(dst_host, src_host, TARGET_PAGE_SIZE);
+}
+}
+rcu_read_unlock();
+trace_colo_flush_ram_cache_end();
+assert(migration_dirty_pages == 0);
+}
+
  static SaveVMHandlers savevm_ram_handlers = {
  .save_live_setup = ram_save_setup,
  .save_live_iterate = ram_save_iterate,
diff --git a/trace-events b/trace-events
index 39fdd8d..7f76029 100644
--- a/trace-events
+++ b/trace-events
@@ -1264,6 +1264,8 @@ migration_throttle(void) ""
  ram_load_postcopy_loop(uint64_t addr, int flags) "@%" PRIx64 " %x"
  ram_postcopy_send_discard_bitmap(void) ""
  ram_save_queue_pages(const char *rb

[Qemu-devel] [PATCH v14 1/8] unblock backup operations in backing file

2016-01-13 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/block.c b/block.c
index e90b6cf..12b5231 100644
--- a/block.c
+++ b/block.c
@@ -1275,6 +1275,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 /* Otherwise we won't be able to commit due to check in bdrv_commit */
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
 bs->backing_blocker);
+/*
+ * We do backup in 3 ways:
+ * 1. drive backup
+ *The target bs is new opened, and the source is top BDS
+ * 2. blockdev backup
+ *Both the source and the target are top BDSes.
+ * 3. internal backup(used for block replication)
+ *Both the source and the target are backing file
+ *
+ * In case 1, and 2, the backing file is neither the source nor
+ * the target.
+ * In case 3, we will block the top BDS, so there is only one block
+ * job for the top BDS and its backing chain.
+ */
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+bs->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+bs->backing_blocker);
 out:
 bdrv_refresh_limits(bs, NULL);
 }
-- 
1.9.3

[Qemu-devel] [PATCH v14 6/8] auto complete active commit

2016-01-13 Thread Changlong Xie

From: Wen Congyang 

Auto complete mirror job in background to prevent from
blocking synchronously

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/mirror.c| 13 +
 blockdev.c|  2 +-
 include/block/block_int.h |  3 ++-
 qemu-img.c|  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index f201f2b..b8e51f8 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -716,7 +716,8 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
  BlockCompletionFunc *cb,
  void *opaque, Error **errp,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base)
+ bool is_none_mode, BlockDriverState *base,
+ bool auto_complete)
 {
 MirrorBlockJob *s;
 BlockDriverState *replaced_bs;
@@ -772,6 +773,9 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
 s->granularity = granularity;
 s->buf_size = ROUND_UP(buf_size, granularity);
 s->unmap = unmap;
+if (auto_complete) {
+s->should_complete = true;
+}
 
 s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
 if (!s->dirty_bitmap) {
@@ -813,14 +817,15 @@ void mirror_start(BlockDriverState *bs, BlockDriverState 
*target,
 mirror_start_job(bs, target, replaces,
  speed, granularity, buf_size,
  on_source_error, on_target_error, unmap, cb, opaque, errp,
- &mirror_job_driver, is_none_mode, base);
+ &mirror_job_driver, is_none_mode, base, false);
 }
 
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp)
+ void *opaque, Error **errp,
+ bool auto_complete)
 {
 int64_t length, base_length;
 int orig_base_flags;
@@ -861,7 +866,7 @@ void commit_active_start(BlockDriverState *bs, 
BlockDriverState *base,
 bdrv_ref(base);
 mirror_start_job(bs, base, NULL, speed, 0, 0,
  on_error, on_error, false, cb, opaque, &local_err,
- &commit_active_job_driver, false, base);
+ &commit_active_job_driver, false, base, auto_complete);
 if (local_err) {
 error_propagate(errp, local_err);
 goto error_restore_flags;
diff --git a/blockdev.c b/blockdev.c
index 574a365..603e858 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3038,7 +3038,7 @@ void qmp_block_commit(const char *device,
 goto out;
 }
 commit_active_start(bs, base_bs, speed, on_error, block_job_cb,
-bs, &local_err);
+bs, &local_err, false);
 } else {
 commit_start(bs, base_bs, top_bs, speed, on_error, block_job_cb, bs,
  has_backing_file ? backing_file : NULL, &local_err);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 19c02b6..1b2d51a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -634,13 +634,14 @@ void commit_start(BlockDriverState *bs, BlockDriverState 
*base,
  * @cb: Completion function for the job.
  * @opaque: Opaque pointer value passed to @cb.
  * @errp: Error object.
+ * @auto_complete: Auto complete the job.
  *
  */
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp);
+ void *opaque, Error **errp, bool auto_complete);
 /*
  * mirror_start:
  * @bs: Block device to operate on.
diff --git a/qemu-img.c b/qemu-img.c
index 3d48b4f..4b8b092 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -759,7 +759,7 @@ static int img_commit(int argc, char **argv)
 };
 
 commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
-common_block_job_cb, &cbi, &local_err);
+common_block_job_cb, &cbi, &local_err, false);
 if (local_err) {
 goto done;
 }
-- 
1.9.3

[Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints

2016-01-13 Thread Changlong Xie

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v14

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v13

TODO:
1. Continuous block replication. It will be started after basic functions
   are accepted.

Changs Log:
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
   structure, also remove old block layer APIs
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c 
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
   opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
   when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
   its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
   reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
   if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Wen Congyang (8):
  unblock backup operations in backing file
  Store parent BDS in BdrvChild
  Backup: clear all bitmap when doing block checkpoint
  Allow creating backup jobs when opening BDS
  docs: block replication's description
  auto complete active commit
  Implement new driver for block replication
  support replication driver in blockdev-add

 block.c  |  19 ++
 block/Makefile.objs  |   3 +-
 block/backup.c   |  14 +
 block/mirror.c   |  13 +-
 block/replication-comm.c |  66 +
 block/replication.c  | 590 +++
 blockdev.c   |   2 +-
 blockjob.c   |  11 +
 docs/block-replication.txt   | 229 +++
 include/block/block_int.h|   4 +-
 include/block/blockjob.h |  12 +
 include/block/replication-comm.h |  50 
 qapi/block-core.json |  33 ++-
 qemu-img.c   |   2 +-
 14 files changed, 1038 insertions(+), 10 deletions(-)
 create mode 100644 block/replication-comm.c
 create mode 100644 block/replication.c
 create mode 100644 docs/block-replication.txt
 create mode 100644 include/block/replication-comm.h

-- 
1.9.3

[Qemu-devel] [PATCH v14 8/8] support replication driver in blockdev-add

2016-01-13 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Eric Blake 
---
 qapi/block-core.json | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index d73f46b..00350ed 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -220,6 +220,7 @@
 #   2.2: 'archipelago' added, 'cow' dropped
 #   2.3: 'host_floppy' deprecated
 #   2.5: 'host_floppy' dropped
+#   2.6: 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1540,6 +1541,7 @@
 # Drivers that are supported in block device operations.
 #
 # @host_device, @host_cdrom: Since 2.1
+# @replication: Since 2.6
 #
 # Since: 2.0
 ##
@@ -1547,8 +1549,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
 'http', 'https', 'null-aio', 'null-co', 'parallels',
-'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-'vmdk', 'vpc', 'vvfat' ] }
+'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication',
+'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsBase
@@ -1988,6 +1990,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.6
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
@@ -2024,6 +2039,7 @@
   'quorum': 'BlockdevOptionsQuorum',
   'raw':'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
   'tftp':   'BlockdevOptionsFile',
-- 
1.9.3

[Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS

2016-01-13 Thread Changlong Xie

From: Wen Congyang 

When opening BDS, we need to create backup jobs for
image-fleecing.

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Jeff Cody 
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..fa05f37 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags := $(LIBISCSI_CFLAGS)
 iscsi.o-libs   := $(LIBISCSI_LIBS)
-- 
1.9.3

[Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint

2016-01-13 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Jeff Cody 
---
 block/backup.c   | 14 ++
 blockjob.c   | 11 +++
 include/block/blockjob.h | 12 
 3 files changed, 37 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index 705bb77..0a27d01 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -253,11 +253,25 @@ static void backup_abort(BlockJob *job)
 }
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+error_setg(errp, "The backup job only supports block checkpoint in"
+   " sync=none mode");
+return;
+}
+
+hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
 .set_speed  = backup_set_speed,
 .iostatus_reset = backup_iostatus_reset,
+.do_checkpoint  = backup_do_checkpoint,
 .commit = backup_commit,
 .abort  = backup_abort,
 };
diff --git a/blockjob.c b/blockjob.c
index 80adb9d..0c8edfe 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
 QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
 block_job_txn_ref(txn);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+if (!job->driver->do_checkpoint) {
+error_setg(errp, "The job %s doesn't support block checkpoint",
+   BlockJobType_lookup[job->driver->job_type]);
+return;
+}
+
+job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index d84ccd8..abdba7c 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -70,6 +70,9 @@ typedef struct BlockJobDriver {
  * never both.
  */
 void (*abort)(BlockJob *job);
+
+/** Optional callback for job types that support checkpoint. */
+void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -443,4 +446,13 @@ void block_job_txn_unref(BlockJobTxn *txn);
  */
 void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
1.9.3

[Qemu-devel] [PATCH v14 2/8] Store parent BDS in BdrvChild

2016-01-13 Thread Changlong Xie

From: Wen Congyang 

We need to access the parent BDS to get the root BDS.

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block.c   | 1 +
 include/block/block_int.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block.c b/block.c
index 12b5231..33d3427 100644
--- a/block.c
+++ b/block.c
@@ -1204,6 +1204,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 BdrvChild *child = g_new(BdrvChild, 1);
 *child = (BdrvChild) {
 .bs = child_bs,
+.parent = parent_bs,
 .name   = g_strdup(child_name),
 .role   = child_role,
 };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index ebe8b1e..19c02b6 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -361,6 +361,7 @@ extern const BdrvChildRole child_format;
 
 struct BdrvChild {
 BlockDriverState *bs;
+BlockDriverState *parent;
 char *name;
 const BdrvChildRole *role;
 QLIST_ENTRY(BdrvChild) next;
-- 
1.9.3

[Qemu-devel] [PATCH v14 5/8] docs: block replication's description

2016-01-13 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 docs/block-replication.txt | 229 +
 1 file changed, 229 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..d1a231e
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,229 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  |  Copy and Forward| |
+  |-(1)--+   | Disk Buffer |
+  |  |   | |
+  | (3)  \-/
+  | speculative  ^
+  |write through(2)
+  |  |   |
+  V  V   |
+   +--+   ++
+   | Primary Disk |   | Secondary Disk |
+   +--+   ++
+
+1) Primary write requests will be copied and forwarded to Secondary
+   QEMU.
+2) Before Primary write requests are written to Secondary disk, the
+   original sector content will be read from Secondary disk and
+   buffered in the Disk buffer, but it will not overwrite the existing
+   sector content (it could be from either "Secondary Write Requests" or
+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Secondary disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+ virtio-blk   ||
+ ^||.--
+ |||| Secondary
+1 Quorum  ||'--
+ /  \ ||
+/\||
+   Primary2 filter
+ disk ^
 virtio-blk
+  |
  ^
+3 NBD  --->  3 NBD 
  |
+client|| server
  2 filter
+  ||^  
  ^
+. |||  
  |
+Primary | ||  Secondary disk <- hidden-disk 5 
<- active-disk 4
+' |||  backing^   backing
+  ||| |
+  ||| |
+  ||'-'
+

[Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication

2016-01-13 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 block/Makefile.objs  |   1 +
 block/replication-comm.c |  66 +
 block/replication.c  | 590 +++
 include/block/replication-comm.h |  50 
 qapi/block-core.json |  13 +
 5 files changed, 720 insertions(+)
 create mode 100644 block/replication-comm.c
 create mode 100644 block/replication.c
 create mode 100644 include/block/replication-comm.h

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fa05f37..7037662 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication-comm.o replication.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/replication-comm.c b/block/replication-comm.c
new file mode 100644
index 000..8af748b
--- /dev/null
+++ b/block/replication-comm.c
@@ -0,0 +1,66 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "block/replication-comm.h"
+
+static QLIST_HEAD(, BlockReplicationState) block_replication_states;
+
+BlockReplicationState *block_replication_new(BlockDriverState *bs,
+ BlockReplicationOps *ops)
+{
+BlockReplicationState *brs;
+
+brs = g_new0(BlockReplicationState, 1);
+brs->bs = bs;
+brs->ops = ops;
+QLIST_INSERT_HEAD(&block_replication_states, brs, node);
+
+return brs;
+}
+
+void block_replication_remove(BlockReplicationState *brs)
+{
+QLIST_REMOVE(brs, node);
+g_free(brs);
+}
+
+void block_replication_start_all(ReplicationMode mode, Error **errp)
+{
+BlockReplicationState *brs, *next;
+QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
+if (brs->ops && brs->ops->start) {
+brs->ops->start(brs, mode, errp);
+}
+}
+}
+
+void block_replication_do_checkpoint_all(Error **errp)
+{
+BlockReplicationState *brs, *next;
+QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
+if (brs->ops && brs->ops->checkpoint) {
+brs->ops->checkpoint(brs, errp);
+}
+}
+}
+
+void block_replication_stop_all(bool failover, Error **errp)
+{
+BlockReplicationState *brs, *next;
+QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
+if (brs->ops && brs->ops->stop) {
+brs->ops->stop(brs, failover, errp);
+}
+}
+}
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 000..29c677a
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,590 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "block/blockjob.h"
+#include "block/nbd.h"
+#include "block/replication-comm.h"
+
+typedef struct BDRVReplicationState {
+ReplicationMode mode;
+int replication_state;
+BlockDriverState *active_disk;
+BlockDriverState *hidden_disk;
+BlockDriverState *secondary_disk;
+BlockDriverState *top_bs;
+BlockReplicationState *brs;
+Error *blocker;
+int orig_hidden_flags;
+int orig_secondary_flags;
+int error;
+} BDRVReplicationState;
+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_FAILOVER, /* failover is running in background */
+BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/
+BLOCK_REPLICATION_DONE, /* block replication is done(after 
failover) */
+};
+
+static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
+  Error **errp);
+static void replication_do_checkpoint(BlockReplicationState *brs, Error 
**errp);
+static void replication_stop(BlockReplicationState *brs, bool failover,
+ Error **errp);
+
+#define REPLICATION_MODE"mode"
+static QemuOptsList replication_runtime_opts = {
+.name = "replication",
+.head = QTAILQ_HEAD_INITIALIZER(replication

Re: [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints

2016-01-13 Thread Changlong Xie


It seems i missed someone in CC list, add them.

Thanks
-Xie

On 01/13/2016 05:18 PM, Changlong Xie wrote:

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v14

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v13

TODO:
1. Continuous block replication. It will be started after basic functions
are accepted.

Changs Log:
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
structure, also remove old block layer APIs
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Wen Congyang (8):
   unblock backup operations in backing file
   Store parent BDS in BdrvChild
   Backup: clear all bitmap when doing block checkpoint
   Allow creating backup jobs when opening BDS
   docs: block replication's description
   auto complete active commit
   Implement new driver for block replication
   support replication driver in blockdev-add

  block.c  |  19 ++
  block/Makefile.objs  |   3 +-
  block/backup.c   |  14 +
  block/mirror.c   |  13 +-
  block/replication-comm.c |  66 +
  block/replication.c  | 590 +++
  blockdev.c   |   2 +-
  blockjob.c   |  11 +
  docs/block-replication.txt   | 229 +++
  include/block/block_int.h|   4 +-
  include/block/blockjob.h |  12 +
  include/block/replication-comm.h |  50 
  qapi/block-core.json |  33 ++-
  qemu-img.c   |   2 +-
  14 files changed, 1038 insertions(+), 10 deletions(-)
  create mode 100644 block/replication-comm.c
  create mode 100644 block/replication.c
  create mode 100644 docs/block-replication.txt
  create mode 100644 include/block/replication-comm.h

Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication

2016-01-20 Thread Changlong Xie


On 01/20/2016 08:04 AM, Eric Blake wrote:

On 01/13/2016 02:18 AM, Changlong Xie wrote:

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
  block/Makefile.objs  |   1 +
  block/replication-comm.c |  66 +
  block/replication.c  | 590 +++
  include/block/replication-comm.h |  50 
  qapi/block-core.json |  13 +
  5 files changed, 720 insertions(+)
  create mode 100644 block/replication-comm.c
  create mode 100644 block/replication.c
  create mode 100644 include/block/replication-comm.h



Just a high-level overview, mainly on the user-visible interface and
things that caught my eye.


Hi eric, thanks for your patience.




+++ b/block/replication-comm.c
@@ -0,0 +1,66 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED


Do you want to claim 2016 in any of this?




Will correct all Copyright issues in next version.


+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_FAILOVER, /* failover is running in background */
+BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/


Space before */


Will fix this space issue




+BLOCK_REPLICATION_DONE, /* block replication is done(after 
failover) */
+};


Should this be an enum type in a .json file?



Since this is just an internal enum that only used in 
block/replication.c, i think we don't need put it in *.json file.





+
+static int replication_open(BlockDriverState *bs, QDict *options,
+int flags, Error **errp)
+{



+
+fail:
+qemu_opts_del(opts);
+/* propagate error */
+if (local_err) {
+error_propagate(errp, local_err);
+}


It's safe to call error_propagate() unconditionally (you could drop the
'if (local_err)').


You're right, i'll fix all relevant issue in my patchset.





+static coroutine_fn int replication_co_discard(BlockDriverState *bs,
+   int64_t sector_num,
+   int nb_sectors)
+{
+BDRVReplicationState *s = bs->opaque;
+int ret;
+
+ret = replication_get_io_status(s);
+if (ret < 0) {
+return ret;
+}
+
+if (ret == 1) {
+/* It is secondary qemu and failover are running*/


Space before */


Will fix




+static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
+{
+Error *local_err = NULL;
+int ret;
+
+if (!s->secondary_disk->job) {
+error_setg(errp, "Backup job is cancelled unexpectedly");


Maybe s/is/was/



will fix


+static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
+  Error **errp)
+{
+BlockDriverState *bs = brs->bs;
+BDRVReplicationState *s = brs->bs->opaque;
+int64_t active_length, hidden_length, disk_length;
+AioContext *aio_context;
+Error *local_err = NULL;
+
+if (s->replication_state != BLOCK_REPLICATION_NONE) {
+error_setg(errp, "Block replication is running or done");
+return;
+}
+
+if (s->mode != mode) {
+error_setg(errp, "The parameter mode's value is invalid, needs %d,"
+   " but receives %d", s->mode, mode);


s/receives/got/



will fix




+static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
+{
+BDRVReplicationState *s = brs->bs->opaque;
+
+if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
+error_setg(errp, "Block replication is not running");
+return;
+}
+
+if (s->error) {
+error_setg(errp, "I/O error occurs");


s/occurs/occurred/



Will fix


+++ b/qapi/block-core.json
@@ -1975,6 +1975,19 @@
  '*read-pattern': 'QuorumReadPattern' } }

  ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.6
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+


Interface looks okay.



Thanks
-Xie

[Qemu-devel] [PATCH v9 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 block.c   |   8 ++--
 block/quorum.c| 122 +-
 include/block/block.h |   4 ++
 3 files changed, 128 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index a347008..b9e99da 100644
--- a/block.c
+++ b/block.c
@@ -1196,10 +1196,10 @@ static int bdrv_fill_options(QDict **options, const 
char *filename,
 return 0;
 }
 
-static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
-BlockDriverState *child_bs,
-const char *child_name,
-const BdrvChildRole *child_role)
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+ BlockDriverState *child_bs,
+ const char *child_name,
+ const BdrvChildRole *child_role)
 {
 BdrvChild *child = g_new(BdrvChild, 1);
 *child = (BdrvChild) {
diff --git a/block/quorum.c b/block/quorum.c
index 6793f12..e73418c 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -23,6 +23,7 @@
 #include "qapi/qmp/qstring.h"
 #include "qapi-event.h"
 #include "crypto/hash.h"
+#include "qemu/bitmap.h"
 
 #define HASH_LENGTH 32
 
@@ -80,6 +81,8 @@ typedef struct BDRVQuorumState {
 bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
 * block if Quorum is reached.
 */
+unsigned long *index_bitmap;
+int bsize;
 
 QuorumReadPattern read_pattern;
 } BDRVQuorumState;
@@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict 
*options, int flags,
 ret = -EINVAL;
 goto exit;
 }
-if (s->num_children < 2) {
+if (s->num_children < 1) {
 error_setg(&local_err,
-   "Number of provided children must be greater than 1");
+   "Number of provided children must be 1 or more");
 ret = -EINVAL;
 goto exit;
 }
@@ -926,6 +929,7 @@ static int quorum_open(BlockDriverState *bs, QDict 
*options, int flags,
 /* allocate the children array */
 s->children = g_new0(BdrvChild *, s->num_children);
 opened = g_new0(bool, s->num_children);
+s->index_bitmap = bitmap_new(s->num_children);
 
 for (i = 0; i < s->num_children; i++) {
 char indexstr[32];
@@ -941,6 +945,8 @@ static int quorum_open(BlockDriverState *bs, QDict 
*options, int flags,
 
 opened[i] = true;
 }
+bitmap_set(s->index_bitmap, 0, s->num_children);
+s->bsize = s->num_children;
 
 g_free(opened);
 goto exit;
@@ -997,6 +1003,115 @@ static void quorum_attach_aio_context(BlockDriverState 
*bs,
 }
 }
 
+static int get_new_child_index(BDRVQuorumState *s)
+{
+int index;
+
+index = find_next_zero_bit(s->index_bitmap, s->bsize, 0);
+if (index < s->bsize) {
+return index;
+}
+
+if ((s->bsize % BITS_PER_LONG) == 0) {
+s->index_bitmap = bitmap_zero_extend(s->index_bitmap, s->bsize,
+ s->bsize + 1);
+}
+
+return s->bsize++;
+}
+
+static void remove_child_index(BDRVQuorumState *s, int index)
+{
+int last_index;
+long new_len;
+
+assert(index < s->bsize);
+
+clear_bit(index, s->index_bitmap);
+if (index < s->bsize - 1) {
+/*
+ * The last bit is always set, and we don't clear
+ * the last bit.
+ */
+return;
+}
+
+last_index = find_last_bit(s->index_bitmap, s->bsize);
+s->bsize = last_index + 1;
+if (BITS_TO_LONGS(last_index + 1) == BITS_TO_LONGS(s->bsize)) {
+return;
+}
+
+new_len = BITS_TO_LONGS(last_index + 1) * sizeof(unsigned long);
+s->index_bitmap = g_realloc(s->index_bitmap, new_len);
+}
+
+static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
+ Error **errp)
+{
+BDRVQuorumState *s = bs->opaque;
+BdrvChild *child;
+char indexstr[32];
+int index, ret;
+
+index = get_new_child_index(s);
+ret = snprintf(indexstr, 32, "children.%d", index);
+if (ret < 0 || ret >= 32) {
+error_setg(errp, "cannot generate child name");
+return;
+}
+
+bdrv_drain(bs);
+
+assert(s->num_children <= INT_MAX / sizeof(BdrvChild *));
+if (s->num_children == INT_MAX / sizeof(BdrvChild *)) {
+error_setg(errp, "Too many children");
+return;
+}
+s->children = g_renew(BdrvChild *, s->children, s->num_children + 1);
+
+bdrv_ref(child_bs);
+child = bdrv_attach_child(bs, child_bs, in

[Qemu-devel] [PATCH v9 0/3] qapi: child add/delete support

2015-12-25 Thread Changlong Xie

If quorum's child is broken, we can use mirror job to replace it.
But sometimes, the user only need to remove the broken child, and
add it later when the problem is fixed.

ChangLog:
v9:
1. Rebase to the newest codes
2. Remove redundant codes in quorum_add_child() and quorum_del_child()
3. Fix typos and in qmp-commands.hx 
v8:
1. Rebase to the newest codes
2. Address the comments from Eric Blake
v7:
1. Remove the qmp command x-blockdev-change's parameter operation according
   to Kevin's comments.
2. Remove the hmp command.
v6:
1. Use a single qmp command x-blockdev-change to replace x-blockdev-child-add
   and x-blockdev-child-delete
v5:
1. Address Eric Blake's comments
v4:
1. drop nbd driver's implementation. We can use human-monitor-command
   to do it.
2. Rename the command name.
v3:
1. Don't open BDS in bdrv_add_child(). Use the existing BDS which is
   created by the QMP command blockdev-add.
2. The driver NBD can support filename, path, host:port now.
v2:
1. Use bdrv_get_device_or_node_name() instead of new function
   bdrv_get_id_or_node_name()
2. Update the error message
3. Update the documents in block-core.json

Wen Congyang (3):
  Add new block driver interface to add/delete a BDS's child
  quorum: implement bdrv_add_child() and bdrv_del_child()
  qmp: add monitor command to add/remove a child

 block.c   |  58 --
 block/quorum.c| 122 +-
 blockdev.c|  54 
 include/block/block.h |   9 
 include/block/block_int.h |   5 ++
 qapi/block-core.json  |  23 +
 qmp-commands.hx   |  47 ++
 7 files changed, 312 insertions(+), 6 deletions(-)

-- 
1.9.3

[Qemu-devel] [PATCH v9 1/3] Add new block driver interface to add/delete a BDS's child

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

In some cases, we want to take a quorum child offline, and take
another child online.

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Eric Blake 
Reviewed-by: Alberto Garcia 
---
 block.c   | 50 +++
 include/block/block.h |  5 +
 include/block/block_int.h |  5 +
 3 files changed, 60 insertions(+)

diff --git a/block.c b/block.c
index 411edbf..a347008 100644
--- a/block.c
+++ b/block.c
@@ -4320,3 +4320,53 @@ void bdrv_refresh_filename(BlockDriverState *bs)
 QDECREF(json);
 }
 }
+
+/*
+ * Hot add/remove a BDS's child. So the user can take a child offline when
+ * it is broken and take a new child online
+ */
+void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
+Error **errp)
+{
+
+if (!parent_bs->drv || !parent_bs->drv->bdrv_add_child) {
+error_setg(errp, "The node %s doesn't support adding a child",
+   bdrv_get_device_or_node_name(parent_bs));
+return;
+}
+
+if (!QLIST_EMPTY(&child_bs->parents)) {
+error_setg(errp, "The node %s already has a parent",
+   child_bs->node_name);
+return;
+}
+
+parent_bs->drv->bdrv_add_child(parent_bs, child_bs, errp);
+}
+
+void bdrv_del_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
+Error **errp)
+{
+BdrvChild *child;
+
+if (!parent_bs->drv || !parent_bs->drv->bdrv_del_child) {
+error_setg(errp, "The node %s doesn't support removing a child",
+   bdrv_get_device_or_node_name(parent_bs));
+return;
+}
+
+QLIST_FOREACH(child, &parent_bs->children, next) {
+if (child->bs == child_bs) {
+break;
+}
+}
+
+if (!child) {
+error_setg(errp, "The node %s is not a child of %s",
+   bdrv_get_device_or_node_name(child_bs),
+   bdrv_get_device_or_node_name(parent_bs));
+return;
+}
+
+parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp);
+}
diff --git a/include/block/block.h b/include/block/block.h
index db8e096..863a7c8 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -578,4 +578,9 @@ void bdrv_drained_begin(BlockDriverState *bs);
  */
 void bdrv_drained_end(BlockDriverState *bs);
 
+void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child,
+Error **errp);
+void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child,
+Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 256609d..ebe8b1e 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -303,6 +303,11 @@ struct BlockDriver {
  */
 void (*bdrv_drain)(BlockDriverState *bs);
 
+void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
+   Error **errp);
+void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child,
+   Error **errp);
+
 QLIST_ENTRY(BlockDriver) list;
 };
 
-- 
1.9.3

[Qemu-devel] [PATCH v9 3/3] qmp: add monitor command to add/remove a child

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

The new QMP command name is x-blockdev-change. It's just for adding/removing
quorum's child now, and doesn't support all kinds of children, all kinds of
operations, nor all block drivers. So it is experimental now.

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 blockdev.c   | 54 
 qapi/block-core.json | 23 ++
 qmp-commands.hx  | 47 +
 3 files changed, 124 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 64dbfeb..4e62fdf 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3836,6 +3836,60 @@ out:
 aio_context_release(aio_context);
 }
 
+static BlockDriverState *bdrv_find_child(BlockDriverState *parent_bs,
+ const char *child_name)
+{
+BdrvChild *child;
+
+QLIST_FOREACH(child, &parent_bs->children, next) {
+if (strcmp(child->name, child_name) == 0) {
+return child->bs;
+}
+}
+
+return NULL;
+}
+
+void qmp_x_blockdev_change(const char *parent, bool has_child,
+   const char *child, bool has_node,
+   const char *node, Error **errp)
+{
+BlockDriverState *parent_bs, *child_bs = NULL, *new_bs = NULL;
+
+parent_bs = bdrv_lookup_bs(parent, parent, errp);
+if (!parent_bs) {
+return;
+}
+
+if (has_child == has_node) {
+if (has_child) {
+error_setg(errp, "The paramter child and node is conflict");
+} else {
+error_setg(errp, "Either child or node should be specified");
+}
+return;
+}
+
+if (has_child) {
+child_bs = bdrv_find_child(parent_bs, child);
+if (!child_bs) {
+error_setg(errp, "Node '%s' doesn't have child %s",
+   parent, child);
+return;
+}
+bdrv_del_child(parent_bs, child_bs, errp);
+}
+
+if (has_node) {
+new_bs = bdrv_find_node(node);
+if (!new_bs) {
+error_setg(errp, "Node '%s' not found", node);
+return;
+}
+bdrv_add_child(parent_bs, new_bs, errp);
+}
+}
+
 BlockJobInfoList *qmp_query_block_jobs(Error **errp)
 {
 BlockJobInfoList *head = NULL, **p_next = &head;
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1a5d9ce..fe63c6d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2408,3 +2408,26 @@
 ##
 { 'command': 'block-set-write-threshold',
   'data': { 'node-name': 'str', 'write-threshold': 'uint64' } }
+
+##
+# @x-blockdev-change
+#
+# Dynamically reconfigure the block driver state graph. It can be used
+# to add, remove, insert or replace a block driver state. Currently only
+# the Quorum driver implements this feature to add or remove its child.
+# This is useful to fix a broken quorum child.
+#
+# @parent: the id or name of the node that will be changed.
+#
+# @child: #optional the name of the child that will be deleted.
+#
+# @node: #optional the name of the node will be added.
+#
+# Note: this command is experimental, and its API is not stable.
+#
+# Since: 2.6
+##
+{ 'command': 'x-blockdev-change',
+  'data' : { 'parent': 'str',
+ '*child': 'str',
+ '*node': 'str' } }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 7b235ee..efee0ca 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -4293,6 +4293,53 @@ Example:
 EQMP
 
 {
+.name   = "x-blockdev-change",
+.args_type  = "parent:B,child:B?,node:B?",
+.mhandler.cmd_new = qmp_marshal_x_blockdev_change,
+},
+
+SQMP
+x-blockdev-change
+-
+
+Dynamically reconfigure the block driver state graph. It can be used to
+add, remove, insert, or replace a block driver state. Currently only
+the Quorum driver implements this feature to add and remove its child.
+This is useful to fix a broken quorum child.
+
+Arguments:
+- "parent": the id or node name of which node will be changed (json-string)
+- "child": the child name which will be deleted (json-string, optional)
+- "node": the new node-name which will be added (json-string, optional)
+
+Note: this command is experimental, and not a stable API. It doesn't
+support all kinds of operations, all kinds of children, nor all block
+drivers.
+
+Example:
+
+Add a new node to a quorum
+-> { "execute": "blockdev-add",
+"arguments": { "options": { "driver": "raw",
+"node-name": "new_node",
+

[Qemu-devel] [PATCH v13 02/10] Store parent BDS in BdrvChild

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

We need to access the parent BDS to get the root BDS.

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block.c   | 1 +
 include/block/block_int.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block.c b/block.c
index 1589c0d..c9c913e 100644
--- a/block.c
+++ b/block.c
@@ -1204,6 +1204,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 BdrvChild *child = g_new(BdrvChild, 1);
 *child = (BdrvChild) {
 .bs = child_bs,
+.parent = parent_bs,
 .name   = g_strdup(child_name),
 .role   = child_role,
 };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index ebe8b1e..19c02b6 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -361,6 +361,7 @@ extern const BdrvChildRole child_format;
 
 struct BdrvChild {
 BlockDriverState *bs;
+BlockDriverState *parent;
 char *name;
 const BdrvChildRole *role;
 QLIST_ENTRY(BdrvChild) next;
-- 
1.9.3

[Qemu-devel] [PATCH v13 03/10] Backup: clear all bitmap when doing block checkpoint

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Jeff Cody 
---
 block/backup.c   | 14 ++
 blockjob.c   | 11 +++
 include/block/blockjob.h | 12 
 3 files changed, 37 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index 705bb77..0a27d01 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -253,11 +253,25 @@ static void backup_abort(BlockJob *job)
 }
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+error_setg(errp, "The backup job only supports block checkpoint in"
+   " sync=none mode");
+return;
+}
+
+hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
 .set_speed  = backup_set_speed,
 .iostatus_reset = backup_iostatus_reset,
+.do_checkpoint  = backup_do_checkpoint,
 .commit = backup_commit,
 .abort  = backup_abort,
 };
diff --git a/blockjob.c b/blockjob.c
index 80adb9d..0c8edfe 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
 QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
 block_job_txn_ref(txn);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+if (!job->driver->do_checkpoint) {
+error_setg(errp, "The job %s doesn't support block checkpoint",
+   BlockJobType_lookup[job->driver->job_type]);
+return;
+}
+
+job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index d84ccd8..abdba7c 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -70,6 +70,9 @@ typedef struct BlockJobDriver {
  * never both.
  */
 void (*abort)(BlockJob *job);
+
+/** Optional callback for job types that support checkpoint. */
+void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -443,4 +446,13 @@ void block_job_txn_unref(BlockJobTxn *txn);
  */
 void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
1.9.3

[Qemu-devel] [PATCH v13 01/10] unblock backup operations in backing file

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/block.c b/block.c
index b9e99da..1589c0d 100644
--- a/block.c
+++ b/block.c
@@ -1275,6 +1275,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 /* Otherwise we won't be able to commit due to check in bdrv_commit */
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
 bs->backing_blocker);
+/*
+ * We do backup in 3 ways:
+ * 1. drive backup
+ *The target bs is new opened, and the source is top BDS
+ * 2. blockdev backup
+ *Both the source and the target are top BDSes.
+ * 3. internal backup(used for block replication)
+ *Both the source and the target are backing file
+ *
+ * In case 1, and 2, the backing file is neither the source nor
+ * the target.
+ * In case 3, we will block the top BDS, so there is only one block
+ * job for the top BDS and its backing chain.
+ */
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+bs->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+bs->backing_blocker);
 out:
 bdrv_refresh_limits(bs, NULL);
 }
-- 
1.9.3

[Qemu-devel] [PATCH v13 04/10] Allow creating backup jobs when opening BDS

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

When opening BDS, we need to create backup jobs for
image-fleecing.

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Jeff Cody 
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..fa05f37 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags := $(LIBISCSI_CFLAGS)
 iscsi.o-libs   := $(LIBISCSI_LIBS)
-- 
1.9.3

[Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints

2015-12-25 Thread Changlong Xie

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v13

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v12

TODO:
1. Continuous block replication. It will be started after basic functions
   are accepted.

Changs Log:
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c 
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
   opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
   when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
   its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
   reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
   if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Wen Congyang (10):
  unblock backup operations in backing file
  Store parent BDS in BdrvChild
  Backup: clear all bitmap when doing block checkpoint
  Allow creating backup jobs when opening BDS
  docs: block replication's description
  Add new block driver interfaces to control block replication
  quorum: implement block driver interfaces for block replication
  Implement new driver for block replication
  support replication driver in blockdev-add
  Add a new API to start/stop replication, do checkpoint to all BDSes

 block.c| 145 
 block/Makefile.objs|   3 +-
 block/backup.c |  14 ++
 block/quorum.c |  78 +++
 block/replication.c| 545 +
 blockjob.c |  11 +
 docs/block-replication.txt | 227 +++
 include/block/block.h  |   9 +
 include/block/block_int.h  |  15 ++
 include/block/blockjob.h   |  12 +
 qapi/block-core.json   |  33 ++-
 11 files changed, 1089 insertions(+), 3 deletions(-)
 create mode 100644 block/replication.c
 create mode 100644 docs/block-replication.txt

-- 
1.9.3

[Qemu-devel] [PATCH v13 09/10] support replication driver in blockdev-add

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Eric Blake 
---
 qapi/block-core.json | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 610da92..7354c6a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -220,6 +220,7 @@
 #   2.2: 'archipelago' added, 'cow' dropped
 #   2.3: 'host_floppy' deprecated
 #   2.5: 'host_floppy' dropped
+#   2.6: 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1492,6 +1493,7 @@
 # Drivers that are supported in block device operations.
 #
 # @host_device, @host_cdrom: Since 2.1
+# @replication: Since 2.6
 #
 # Since: 2.0
 ##
@@ -1499,8 +1501,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
 'http', 'https', 'null-aio', 'null-co', 'parallels',
-'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-'vmdk', 'vpc', 'vvfat' ] }
+'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication',
+'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsBase
@@ -1940,6 +1942,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.6
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
@@ -1976,6 +1991,7 @@
   'quorum': 'BlockdevOptionsQuorum',
   'raw':'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
   'tftp':   'BlockdevOptionsFile',
-- 
1.9.3

[Qemu-devel] [PATCH v13 07/10] quorum: implement block driver interfaces for block replication

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Reviewed-by: Alberto Garcia 
---
 block/quorum.c | 78 ++
 1 file changed, 78 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index e73418c..aa8c4dd 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -85,6 +85,8 @@ typedef struct BDRVQuorumState {
 int bsize;
 
 QuorumReadPattern read_pattern;
+
+int replication_index; /* store which child supports block replication */
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -949,6 +951,7 @@ static int quorum_open(BlockDriverState *bs, QDict 
*options, int flags,
 s->bsize = s->num_children;
 
 g_free(opened);
+s->replication_index = -1;
 goto exit;
 
 close_exit:
@@ -1146,6 +1149,77 @@ static void quorum_refresh_filename(BlockDriverState 
*bs, QDict *options)
 bs->full_open_options = opts;
 }
 
+static void quorum_start_replication(BlockDriverState *bs, ReplicationMode 
mode,
+ Error **errp)
+{
+BDRVQuorumState *s = bs->opaque;
+int count = 0, i, index;
+Error *local_err = NULL;
+
+/*
+ * TODO: support REPLICATION_MODE_SECONDARY if we allow secondary
+ * QEMU becoming primary QEMU.
+ */
+if (mode != REPLICATION_MODE_PRIMARY) {
+error_setg(errp, "The replication mode for quorum should be 
'primary'");
+return;
+}
+
+if (s->read_pattern != QUORUM_READ_PATTERN_FIFO) {
+error_setg(errp, "Block replication needs read pattern 'fifo'");
+return;
+}
+
+for (i = 0; i < s->num_children; i++) {
+bdrv_start_replication(s->children[i]->bs, mode, &local_err);
+if (local_err) {
+error_free(local_err);
+local_err = NULL;
+} else {
+count++;
+index = i;
+}
+}
+
+if (count == 0) {
+error_setg(errp, "No child supports block replication");
+} else if (count > 1) {
+for (i = 0; i < s->num_children; i++) {
+bdrv_stop_replication(s->children[i]->bs, false, NULL);
+}
+error_setg(errp, "Too many children support block replication");
+} else {
+s->replication_index = index;
+}
+}
+
+static void quorum_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+BDRVQuorumState *s = bs->opaque;
+
+if (s->replication_index < 0) {
+error_setg(errp, "Block replication is not running");
+return;
+}
+
+bdrv_do_checkpoint(s->children[s->replication_index]->bs, errp);
+}
+
+static void quorum_stop_replication(BlockDriverState *bs, bool failover,
+Error **errp)
+{
+BDRVQuorumState *s = bs->opaque;
+
+if (s->replication_index < 0) {
+error_setg(errp, "Block replication is not running");
+return;
+}
+
+bdrv_stop_replication(s->children[s->replication_index]->bs, failover,
+  errp);
+s->replication_index = -1;
+}
+
 static BlockDriver bdrv_quorum = {
 .format_name= "quorum",
 .protocol_name  = "quorum",
@@ -1172,6 +1246,10 @@ static BlockDriver bdrv_quorum = {
 
 .is_filter  = true,
 .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
+
+.bdrv_start_replication = quorum_start_replication,
+.bdrv_do_checkpoint = quorum_do_checkpoint,
+.bdrv_stop_replication  = quorum_stop_replication,
 };
 
 static void bdrv_quorum_init(void)
-- 
1.9.3

[Qemu-devel] [PATCH v13 06/10] Add new block driver interfaces to control block replication

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
Cc: Luiz Capitulino 
Cc: Michael Roth 
Reviewed-by: Paolo Bonzini 
---
 block.c   | 43 +++
 include/block/block.h |  5 +
 include/block/block_int.h | 14 ++
 qapi/block-core.json  | 13 +
 4 files changed, 75 insertions(+)

diff --git a/block.c b/block.c
index c9c913e..275d8b4 100644
--- a/block.c
+++ b/block.c
@@ -4389,3 +4389,46 @@ void bdrv_del_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
 
 parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp);
 }
+
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+Error **errp)
+{
+BlockDriver *drv = bs->drv;
+
+if (drv && drv->bdrv_start_replication) {
+drv->bdrv_start_replication(bs, mode, errp);
+} else if (bs->file) {
+bdrv_start_replication(bs->file->bs, mode, errp);
+} else {
+error_setg(errp, "The BDS %s doesn't support starting block"
+   " replication", bs->filename);
+}
+}
+
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+BlockDriver *drv = bs->drv;
+
+if (drv && drv->bdrv_do_checkpoint) {
+drv->bdrv_do_checkpoint(bs, errp);
+} else if (bs->file) {
+bdrv_do_checkpoint(bs->file->bs, errp);
+} else {
+error_setg(errp, "The BDS %s doesn't support block checkpoint",
+   bs->filename);
+}
+}
+
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp)
+{
+BlockDriver *drv = bs->drv;
+
+if (drv && drv->bdrv_stop_replication) {
+drv->bdrv_stop_replication(bs, failover, errp);
+} else if (bs->file) {
+bdrv_stop_replication(bs->file->bs, failover, errp);
+} else {
+error_setg(errp, "The BDS %s doesn't support stopping block"
+   " replication", bs->filename);
+}
+}
diff --git a/include/block/block.h b/include/block/block.h
index 6c7e54b..5d47cef 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -587,4 +587,9 @@ void bdrv_add_child(BlockDriverState *parent, 
BlockDriverState *child,
 void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child,
 Error **errp);
 
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+Error **errp);
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 19c02b6..e31f9db 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -308,6 +308,20 @@ struct BlockDriver {
 void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child,
Error **errp);
 
+void (*bdrv_start_replication)(BlockDriverState *bs, ReplicationMode mode,
+   Error **errp);
+/* Drop Disk buffer when doing checkpoint. */
+void (*bdrv_do_checkpoint)(BlockDriverState *bs, Error **errp);
+/*
+ * After failover, we should flush Disk buffer into secondary disk
+ * and stop block replication.
+ *
+ * If the guest is shutdown, we should drop Disk buffer and stop
+ * block representation.
+ */
+void (*bdrv_stop_replication)(BlockDriverState *bs, bool failover,
+  Error **errp);
+
 QLIST_ENTRY(BlockDriver) list;
 };
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index fe63c6d..610da92 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1927,6 +1927,19 @@
 '*read-pattern': 'QuorumReadPattern' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.6
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
-- 
1.9.3

[Qemu-devel] [PATCH v13 05/10] docs: block replication's description

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 docs/block-replication.txt | 227 +
 1 file changed, 227 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..73abb65
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,227 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  |  Copy and Forward| |
+  |-(1)--+   | Disk Buffer |
+  |  |   | |
+  | (3)  \-/
+  | speculative  ^
+  |write through(2)
+  |  |   |
+  V  V   |
+   +--+   ++
+   | Primary Disk |   | Secondary Disk |
+   +--+   ++
+
+1) Primary write requests will be copied and forwarded to Secondary
+   QEMU.
+2) Before Primary write requests are written to Secondary disk, the
+   original sector content will be read from Secondary disk and
+   buffered in the Disk buffer, but it will not overwrite the existing
+   sector content (it could be from either "Secondary Write Requests" or
+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Secondary disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+ virtio-blk   ||
+ ^||.--
+ |||| Secondary
+1 Quorum  ||'--
+ /  \ ||
+/\||
+   Primary2 filter
+ disk ^
 virtio-blk
+  |
  ^
+3 NBD  --->  3 NBD 
  |
+client|| server
  2 filter
+  ||^  
  ^
+. |||  
  |
+Primary | ||  Secondary disk <- hidden-disk 5 
<- active-disk 4
+' |||  backing^   backing
+  ||| |
+  ||| |
+  ||'-'
+

[Qemu-devel] [PATCH v13 08/10] Implement new driver for block replication

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 block/Makefile.objs |   1 +
 block/replication.c | 545 
 2 files changed, 546 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fa05f37..94c1d03 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 000..6a061c9
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,545 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "block/block_int.h"
+#include "block/blockjob.h"
+#include "block/nbd.h"
+
+typedef struct BDRVReplicationState {
+ReplicationMode mode;
+int replication_state;
+BlockDriverState *active_disk;
+BlockDriverState *hidden_disk;
+BlockDriverState *secondary_disk;
+BlockDriverState *top_bs;
+Error *blocker;
+int orig_hidden_flags;
+int orig_secondary_flags;
+int error;
+} BDRVReplicationState;
+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_DONE, /* block replication is done(failover) */
+};
+
+static void replication_stop(BlockDriverState *bs, bool failover, Error 
**errp);
+
+#define REPLICATION_MODE"mode"
+static QemuOptsList replication_runtime_opts = {
+.name = "replication",
+.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+.desc = {
+{
+.name = REPLICATION_MODE,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+int flags, Error **errp)
+{
+int ret;
+BDRVReplicationState *s = bs->opaque;
+Error *local_err = NULL;
+QemuOpts *opts = NULL;
+const char *mode;
+
+ret = -EINVAL;
+opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
+qemu_opts_absorb_qdict(opts, options, &local_err);
+if (local_err) {
+goto fail;
+}
+
+mode = qemu_opt_get(opts, REPLICATION_MODE);
+if (!mode) {
+error_setg(&local_err, "Missing the option mode");
+goto fail;
+}
+
+if (!strcmp(mode, "primary")) {
+s->mode = REPLICATION_MODE_PRIMARY;
+} else if (!strcmp(mode, "secondary")) {
+s->mode = REPLICATION_MODE_SECONDARY;
+} else {
+error_setg(&local_err,
+   "The option mode's value should be primary or secondary");
+goto fail;
+}
+
+ret = 0;
+
+fail:
+qemu_opts_del(opts);
+/* propagate error */
+if (local_err) {
+error_propagate(errp, local_err);
+}
+return ret;
+}
+
+static void replication_close(BlockDriverState *bs)
+{
+BDRVReplicationState *s = bs->opaque;
+
+if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
+replication_stop(bs, false, NULL);
+}
+}
+
+static int64_t replication_getlength(BlockDriverState *bs)
+{
+return bdrv_getlength(bs->file->bs);
+}
+
+static int replication_get_io_status(BDRVReplicationState *s)
+{
+switch (s->replication_state) {
+case BLOCK_REPLICATION_NONE:
+return -EIO;
+case BLOCK_REPLICATION_RUNNING:
+return 0;
+case BLOCK_REPLICATION_DONE:
+return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
+default:
+abort();
+}
+}
+
+static int replication_return_value(BDRVReplicationState *s, int ret)
+{
+if (s->mode == REPLICATION_MODE_SECONDARY) {
+return ret;
+}
+
+if (ret < 0) {
+s->error = ret;
+ret = 0;
+}
+
+return ret;
+}
+
+static coroutine_fn int replication_co_readv(BlockDriverState *bs,
+ int64_t sector_num,
+ int remaining_sectors,
+ QEMUIOVector *qiov)
+{
+BDRVReplicationState *s = bs->opaque;
+int ret;
+
+if (s->mode == REPLICATION_MODE_PRIMARY) {
+/* We only use it to forward primary write requests */
+

[Qemu-devel] [PATCH v13 10/10] Add a new API to start/stop replication, do checkpoint to all BDSes

2015-12-25 Thread Changlong Xie

From: Wen Congyang 

Signed-off-by: Wen Congyang 
Signed-off-by: zhanghailiang 
Signed-off-by: Gonglei 
Signed-off-by: Changlong Xie 
---
 block.c   | 83 +++
 include/block/block.h |  4 +++
 2 files changed, 87 insertions(+)

diff --git a/block.c b/block.c
index 275d8b4..634cc97 100644
--- a/block.c
+++ b/block.c
@@ -4432,3 +4432,86 @@ void bdrv_stop_replication(BlockDriverState *bs, bool 
failover, Error **errp)
" replication", bs->filename);
 }
 }
+
+void bdrv_start_replication_all(ReplicationMode mode, Error **errp)
+{
+BlockDriverState *bs = NULL, *temp = NULL;
+Error *local_err = NULL;
+
+while ((bs = bdrv_next(bs))) {
+if (!QLIST_EMPTY(&bs->parents)) {
+/* It is not top BDS */
+continue;
+}
+
+if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+continue;
+}
+
+bdrv_start_replication(bs, mode, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+goto fail;
+}
+}
+
+return;
+
+fail:
+while ((temp = bdrv_next(temp)) && bs != temp) {
+bdrv_stop_replication(temp, false, NULL);
+}
+}
+
+void bdrv_do_checkpoint_all(Error **errp)
+{
+BlockDriverState *bs = NULL;
+Error *local_err = NULL;
+
+while ((bs = bdrv_next(bs))) {
+if (!QLIST_EMPTY(&bs->parents)) {
+/* It is not top BDS */
+continue;
+}
+
+if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+continue;
+}
+
+bdrv_do_checkpoint(bs, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void bdrv_stop_replication_all(bool failover, Error **errp)
+{
+BlockDriverState *bs = NULL;
+Error *local_err = NULL;
+
+while ((bs = bdrv_next(bs))) {
+if (!QLIST_EMPTY(&bs->parents)) {
+/* It is not top BDS */
+continue;
+}
+
+if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+continue;
+}
+
+bdrv_stop_replication(bs, failover, &local_err);
+if (!errp) {
+/*
+ * The caller doesn't care the result, they just
+ * want to stop all block's replication.
+ */
+continue;
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
diff --git a/include/block/block.h b/include/block/block.h
index 5d47cef..9c4de14 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -592,4 +592,8 @@ void bdrv_start_replication(BlockDriverState *bs, 
ReplicationMode mode,
 void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
 void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp);
 
+void bdrv_start_replication_all(ReplicationMode mode, Error **errp);
+void bdrv_do_checkpoint_all(Error **errp);
+void bdrv_stop_replication_all(bool failover, Error **errp);
+
 #endif
-- 
1.9.3

Re: [Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash

2016-02-21 Thread Changlong Xie


On 02/20/2016 10:28 PM, Max Reitz wrote:

On 19.02.2016 12:24, Alberto Garcia wrote:

On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang  wrote:


If quorum has two children(A, B). A do flush sucessfully, but B
flush failed.  We MUST choice A as winner rather than just pick
anyone of them. Otherwise the filesystem of guest will become
read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only


Hi Xie,

Let's see if I'm getting this right:

- When Quorum flushes to disk, there's a vote among the return values of
   the flush operations of its members, and the one that wins is the one
   that Quorum returns.

- If there's a tie then Quorum choses the first result from the list of
   winners.

- With your patch you want to give priority to the vote with result == 0
   if there's any, so Quorum would return 0 (and succeed).

This seems to me like an ad-hoc fix for a particular use case. What
if you have 3 members and two of them fail with the same error code?
Would you still return 0 or the error code from the other two?


For example:
children.0 returns 0
children.1 returns -EIO
children.2 returns -EPIPE

In this case, quorum returns -EPIPE now(without this patch).

For example:
children.0 returns -EPIPE
children.1 returns -EIO
children.2 returns 0
In this case, quorum returns 0 now.


My question is: what's the rationale for returning 0 in case a) but not
in case b)?

   a)
 children.0 returns -EPIPE
 children.1 returns -EIO
 children.2 returns 0

   b)
 children.0 returns -EIO
 children.1 returns -EIO
 children.2 returns 0

In both cases you have one successful flush and two errors. You want to
return always 0 in case a) and always -EIO in case b). But the only
difference is that in case b) the errors happen to be the same, so why
does that matter?

That said, I'm not very convinced of the current logics of the Quorum
flush code either, so it's not even a problem with your patch... it
seems to me that the code should follow the same logics as in the
read/write case: if the number of correct flushes >= threshold then
return 0, else select the most common error code.


I'm not convinced of the logic either, which is why I waited for you to
respond to this patch. :-)

Intuitively, I'd expect Quorum to return an error if flushing failed for
any of the children, because, well, flushing failed. I somehow feel like
flushing is different from a read or write operation and therefore
ignoring the threshold would be fine here. However, maybe my intuition
is just off.

Anyway, regardless of that, if we do take the threshold into account, we
should not use the exact error value for voting but just whether an
error occurred or not. If all but one children fail to flush (all for
different reasons), I find it totally wrong to return success. We should
then just return -EIO or something.


Hi Berto & Max

Thanks for your comments, i'd like to have a summary here. For flush cases:

1) if flush successfully(result >= 0), result = 0; else if result < 0, 
result = -EIO. then invoke quorum_count_vote

2) if correct flushes >= threshold, mark correct flushes as winner directly.

Will fix in next version.

Thanks
-Xie

Max

[Qemu-devel] [PATCH v3 1/1] quorum: modify vote rules for flush operation

2016-02-22 Thread Changlong Xie

Keep flush interface the same logic as quorum read/write, Otherwise in
following scenario, we'll encounter unexpected errors.

Quorum has two children(A, B). A do flush sucessfully, but B flush failed.
This cause the filesystem of guest become read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index f78d4cb..62d3ccb 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -763,19 +763,26 @@ static coroutine_fn int quorum_co_flush(BlockDriverState 
*bs)
 QuorumVoteValue result_value;
 int i;
 int result = 0;
+int success_count = 0;
 
 QLIST_INIT(&error_votes.vote_list);
 error_votes.compare = quorum_64bits_compare;
 
 for (i = 0; i < s->num_children; i++) {
 result = bdrv_co_flush(s->children[i]->bs);
-result_value.l = result;
-quorum_count_vote(&error_votes, &result_value, i);
+if (result) {
+result_value.l = result;
+quorum_count_vote(&error_votes, &result_value, i);
+} else
+success_count++;
 }
 
-winner = quorum_get_vote_winner(&error_votes);
-result = winner->value.l;
-
+if (success_count >= s->threshold)
+result = 0;
+else {
+winner = quorum_get_vote_winner(&error_votes);
+result = winner->value.l;
+}
 quorum_free_vote_list(&error_votes);
 
 return result;
-- 
1.9.3

[Qemu-devel] [PATCH v3 0/1] modify vote rules for flush operation

2016-02-22 Thread Changlong Xie

ChangLog:
v3:
*Note* that, the codes logic is different from what we talked in v2. I just
keep flush interface the same logic as quorum read/write, and think it's
reasoned.

[Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.htm

Changlong Xie (1):
  quorum: modify vote rules for flush operation

 block/quorum.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

-- 
1.9.3

Re: [Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash

2016-02-22 Thread Changlong Xie


On 02/22/2016 05:02 PM, Dr. David Alan Gilbert wrote:

* Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote:

On 02/20/2016 10:28 PM, Max Reitz wrote:

On 19.02.2016 12:24, Alberto Garcia wrote:

On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang  wrote:


If quorum has two children(A, B). A do flush sucessfully, but B
flush failed.  We MUST choice A as winner rather than just pick
anyone of them. Otherwise the filesystem of guest will become
read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only


Hi Xie,

Let's see if I'm getting this right:

- When Quorum flushes to disk, there's a vote among the return values of
   the flush operations of its members, and the one that wins is the one
   that Quorum returns.

- If there's a tie then Quorum choses the first result from the list of
   winners.

- With your patch you want to give priority to the vote with result == 0
   if there's any, so Quorum would return 0 (and succeed).

This seems to me like an ad-hoc fix for a particular use case. What
if you have 3 members and two of them fail with the same error code?
Would you still return 0 or the error code from the other two?


For example:
children.0 returns 0
children.1 returns -EIO
children.2 returns -EPIPE

In this case, quorum returns -EPIPE now(without this patch).

For example:
children.0 returns -EPIPE
children.1 returns -EIO
children.2 returns 0
In this case, quorum returns 0 now.


My question is: what's the rationale for returning 0 in case a) but not
in case b)?

   a)
 children.0 returns -EPIPE
 children.1 returns -EIO
 children.2 returns 0

   b)
 children.0 returns -EIO
 children.1 returns -EIO
 children.2 returns 0

In both cases you have one successful flush and two errors. You want to
return always 0 in case a) and always -EIO in case b). But the only
difference is that in case b) the errors happen to be the same, so why
does that matter?

That said, I'm not very convinced of the current logics of the Quorum
flush code either, so it's not even a problem with your patch... it
seems to me that the code should follow the same logics as in the
read/write case: if the number of correct flushes >= threshold then
return 0, else select the most common error code.


I'm not convinced of the logic either, which is why I waited for you to
respond to this patch. :-)

Intuitively, I'd expect Quorum to return an error if flushing failed for
any of the children, because, well, flushing failed. I somehow feel like
flushing is different from a read or write operation and therefore
ignoring the threshold would be fine here. However, maybe my intuition
is just off.

Anyway, regardless of that, if we do take the threshold into account, we
should not use the exact error value for voting but just whether an
error occurred or not. If all but one children fail to flush (all for
different reasons), I find it totally wrong to return success. We should
then just return -EIO or something.


Hi Berto & Max

Thanks for your comments, i'd like to have a summary here. For flush cases:

1) if flush successfully(result >= 0), result = 0; else if result < 0,
result = -EIO. then invoke quorum_count_vote
2) if correct flushes >= threshold, mark correct flushes as winner directly.


I find it difficult to understand how this corresponds to the behaviour needed
in COLO, where we have the NBD and the real storage on the primary; in that
case the failure of the real storage should give an error to the guest, but the
failure of the NBD shouldn't produce a guest visible failure.


Hi Dave

"in that case the failure of the real storage should give an error to 
the guest, but the failure of the NBD shouldn't produce a guest visible 
failure."

This is just what i think :), but there is a restricted condition
.
1) If the guest *Must* fetch the return code for flush operation? This 
is prerequisite.
2) If no. Since Colo and Quorum are independent of each other, so quorum 
don't know if we are in colo mode. I think the only way to implement 
your idea is:"pass some parameters such as flush= on/off to each quorum 
child".


BTW, i've just sent out V3, and thought it make scense.

Thanks
-Xie

Dave


Will fix in next version.

Thanks
-Xie

Max





--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


.

Re: [Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash

2016-02-22 Thread Changlong Xie


On 02/22/2016 06:34 PM, Kevin Wolf wrote:

Am 22.02.2016 um 10:02 hat Dr. David Alan Gilbert geschrieben:

* Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote:

On 02/20/2016 10:28 PM, Max Reitz wrote:

On 19.02.2016 12:24, Alberto Garcia wrote:

On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang  wrote:


If quorum has two children(A, B). A do flush sucessfully, but B
flush failed.  We MUST choice A as winner rather than just pick
anyone of them. Otherwise the filesystem of guest will become
read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only


Hi Xie,

Let's see if I'm getting this right:

- When Quorum flushes to disk, there's a vote among the return values of
   the flush operations of its members, and the one that wins is the one
   that Quorum returns.

- If there's a tie then Quorum choses the first result from the list of
   winners.

- With your patch you want to give priority to the vote with result == 0
   if there's any, so Quorum would return 0 (and succeed).

This seems to me like an ad-hoc fix for a particular use case. What
if you have 3 members and two of them fail with the same error code?
Would you still return 0 or the error code from the other two?


For example:
children.0 returns 0
children.1 returns -EIO
children.2 returns -EPIPE

In this case, quorum returns -EPIPE now(without this patch).

For example:
children.0 returns -EPIPE
children.1 returns -EIO
children.2 returns 0
In this case, quorum returns 0 now.


My question is: what's the rationale for returning 0 in case a) but not
in case b)?

   a)
 children.0 returns -EPIPE
 children.1 returns -EIO
 children.2 returns 0

   b)
 children.0 returns -EIO
 children.1 returns -EIO
 children.2 returns 0

In both cases you have one successful flush and two errors. You want to
return always 0 in case a) and always -EIO in case b). But the only
difference is that in case b) the errors happen to be the same, so why
does that matter?

That said, I'm not very convinced of the current logics of the Quorum
flush code either, so it's not even a problem with your patch... it
seems to me that the code should follow the same logics as in the
read/write case: if the number of correct flushes >= threshold then
return 0, else select the most common error code.


I'm not convinced of the logic either, which is why I waited for you to
respond to this patch. :-)

Intuitively, I'd expect Quorum to return an error if flushing failed for
any of the children, because, well, flushing failed. I somehow feel like
flushing is different from a read or write operation and therefore
ignoring the threshold would be fine here. However, maybe my intuition
is just off.

Anyway, regardless of that, if we do take the threshold into account, we
should not use the exact error value for voting but just whether an
error occurred or not. If all but one children fail to flush (all for
different reasons), I find it totally wrong to return success. We should
then just return -EIO or something.


Hi Berto & Max

Thanks for your comments, i'd like to have a summary here. For flush cases:

1) if flush successfully(result >= 0), result = 0; else if result < 0,
result = -EIO. then invoke quorum_count_vote
2) if correct flushes >= threshold, mark correct flushes as winner directly.


Please try to return one of the real error codes instead of -EIO.
Essentially we should use the same logic as for writes (like Berto
suggested above).


Yes, i correct myself. It seems everyone reaches an agreement in this 
thread, and will use the same logic as writes in next version.


Thanks
-Xie



I find it difficult to understand how this corresponds to the behaviour needed
in COLO, where we have the NBD and the real storage on the primary; in that
case the failure of the real storage should give an error to the guest, but the
failure of the NBD shouldn't produce a guest visible failure.


That's probably because you're abusing quorum as an active mirroring
filter, which it really isn't. This is okay for now so you have
something to work with, but I expect that eventually you'll need a
different driver. (Well, or maybe I'm mistaken and you actually do need
to read back data from NBD to compare it to the real storage - do you?)

Anyway, I'm curious how you would handle a failed write/flush request to
the NBD target. Simply ignoring it doesn't feel right; in case of a
failover, wouldn't you switch to a potentially corrupted image then?

Kevin


.

Re: [Qemu-devel] [PATCH v3 1/1] quorum: modify vote rules for flush operation

2016-02-22 Thread Changlong Xie


On 02/22/2016 10:33 PM, Alberto Garcia wrote:

On Mon 22 Feb 2016 10:50:37 AM CET, Changlong Xie wrote:

-winner = quorum_get_vote_winner(&error_votes);
-result = winner->value.l;
-
+if (success_count >= s->threshold)
+result = 0;
+else {
+winner = quorum_get_vote_winner(&error_votes);
+result = winner->value.l;
+}


Please use braces in both branches of the if. scripts/checkpatch.pl
should report that.


Surely.



Other than that I think the patch is correct, but I still wonder if we
should emit QUORUM_REPORT_BAD (or a new event) for the operations that


To me QUORUM_REPORT_BAD is not a good choice.


fail. Or why wouldn't the user want to be notified of a flush failure?



I'll introduce a new event in next series.

Thanks
-Xie


Berto


.

[Qemu-devel] [PATCH v4 0/2] modify vote rules for flush operation

2016-02-23 Thread Changlong Xie

ChangLog:
v4:
1. Introduce QUORUM_FLUSH_ERROR event to notify flush failure.
v3:
1. *Note* that, the codes logic is different from what we talked in v2.
I just keep flush interface the same logic as quorum read/write, and think
it's reasoned.

[Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.htm

Changlong Xie (2):
  qmp event: Add QUORUM_FLUSH_ERROR
  quorum: modify vote rules for flush operation

 block/quorum.c  | 24 +++-
 docs/qmp-events.txt | 18 ++
 qapi/event.json | 16 
 3 files changed, 53 insertions(+), 5 deletions(-)

-- 
1.9.3

[Qemu-devel] [PATCH v4 1/2] qmp event: Add QUORUM_FLUSH_ERROR

2016-02-23 Thread Changlong Xie

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c  |  5 +
 docs/qmp-events.txt | 18 ++
 qapi/event.json | 16 
 3 files changed, 39 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index f78d4cb..d3c3958 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -235,6 +235,11 @@ static void quorum_report_failure(QuorumAIOCB *acb)
acb->nb_sectors, &error_abort);
 }
 
+static void quorum_flush_error(char *node_name, const char *msg)
+{
+qapi_event_send_quorum_flush_error(node_name, msg, &error_abort);
+}
+
 static int quorum_vote_error(QuorumAIOCB *acb);
 
 static bool quorum_has_too_much_io_failed(QuorumAIOCB *acb)
diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
index b6e8937..d777873 100644
--- a/docs/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -340,6 +340,24 @@ Example:
 
 Note: this event is rate-limited.
 
+QUORUM_FLUSH_ERROR
+-
+
+Emitted to report flush error message of the Quorum block driver
+
+Data:
+
+- "node-name": The graph node name of the block driver state.
+- "error": This field contains a human-readable error message.  There 
are
+   no semantics other than that the block layer reported an 
error
+   and clients should not try to interpret the error string.
+
+Example:
+
+{ "event": "QUORUM_FLUSH_ERROR",
+ "data": { "node-name": "1.raw", "error": "xx" },
+ "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
+
 RESET
 -
 
diff --git a/qapi/event.json b/qapi/event.json
index cfcc887..5b16706 100644
--- a/qapi/event.json
+++ b/qapi/event.json
@@ -358,6 +358,22 @@
 'sector-num': 'int', 'sectors-count': 'int' } }
 
 ##
+# @QUORUM_FLUSH_ERROR
+#
+# Emitted to report flush error message of the Quorum block driver
+#
+# @node-name: the graph node name of the block driver state
+#
+# @error: This field contains a human-readable error message. There are no 
semantics
+# other than that the block layer reported an error and clients should 
not
+# try to interpret the error string.
+#
+# Since: 2.5
+##
+{ 'event': 'QUORUM_FLUSH_ERROR',
+  'data': { 'node-name': 'str', 'error': 'str'} }
+
+##
 # @VSERPORT_CHANGE
 #
 # Emitted when the guest opens or closes a virtio-serial port.
-- 
1.9.3

[Qemu-devel] [PATCH v4 2/2] quorum: modify vote rules for flush operation

2016-02-23 Thread Changlong Xie

Keep flush interface the same logic as quorum read/write, Otherwise in
following scenario, we'll encounter unexpected errors.

Quorum has two children(A, B). A do flush sucessfully, but B flush failed.
This cause the filesystem of guest become read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d3c3958..6048208 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -768,19 +768,28 @@ static coroutine_fn int quorum_co_flush(BlockDriverState 
*bs)
 QuorumVoteValue result_value;
 int i;
 int result = 0;
+int success_count = 0;
 
 QLIST_INIT(&error_votes.vote_list);
 error_votes.compare = quorum_64bits_compare;
 
 for (i = 0; i < s->num_children; i++) {
 result = bdrv_co_flush(s->children[i]->bs);
-result_value.l = result;
-quorum_count_vote(&error_votes, &result_value, i);
+if (result) {
+quorum_flush_error(s->children[i]->bs->node_name, "Flush failed");
+result_value.l = result;
+quorum_count_vote(&error_votes, &result_value, i);
+} else {
+success_count++;
+}
 }
 
-winner = quorum_get_vote_winner(&error_votes);
-result = winner->value.l;
-
+if (success_count >= s->threshold) {
+result = 0;
+} else {
+winner = quorum_get_vote_winner(&error_votes);
+result = winner->value.l;
+}
 quorum_free_vote_list(&error_votes);
 
 return result;
-- 
1.9.3

Re: [Qemu-devel] [PATCH v4 1/2] qmp event: Add QUORUM_FLUSH_ERROR

2016-02-23 Thread Changlong Xie


cc: Markus Armbruster 

On 02/23/2016 05:01 PM, Changlong Xie wrote:

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
  block/quorum.c  |  5 +
  docs/qmp-events.txt | 18 ++
  qapi/event.json | 16 
  3 files changed, 39 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index f78d4cb..d3c3958 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -235,6 +235,11 @@ static void quorum_report_failure(QuorumAIOCB *acb)
 acb->nb_sectors, &error_abort);
  }

+static void quorum_flush_error(char *node_name, const char *msg)
+{
+qapi_event_send_quorum_flush_error(node_name, msg, &error_abort);
+}
+
  static int quorum_vote_error(QuorumAIOCB *acb);

  static bool quorum_has_too_much_io_failed(QuorumAIOCB *acb)
diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
index b6e8937..d777873 100644
--- a/docs/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -340,6 +340,24 @@ Example:

  Note: this event is rate-limited.

+QUORUM_FLUSH_ERROR
+-
+
+Emitted to report flush error message of the Quorum block driver
+
+Data:
+
+- "node-name": The graph node name of the block driver state.
+- "error": This field contains a human-readable error message.  There 
are
+   no semantics other than that the block layer reported an 
error
+   and clients should not try to interpret the error string.
+
+Example:
+
+{ "event": "QUORUM_FLUSH_ERROR",
+ "data": { "node-name": "1.raw", "error": "xx" },
+ "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
+
  RESET
  -

diff --git a/qapi/event.json b/qapi/event.json
index cfcc887..5b16706 100644
--- a/qapi/event.json
+++ b/qapi/event.json
@@ -358,6 +358,22 @@
  'sector-num': 'int', 'sectors-count': 'int' } }

  ##
+# @QUORUM_FLUSH_ERROR
+#
+# Emitted to report flush error message of the Quorum block driver
+#
+# @node-name: the graph node name of the block driver state
+#
+# @error: This field contains a human-readable error message. There are no 
semantics
+# other than that the block layer reported an error and clients should 
not
+# try to interpret the error string.
+#
+# Since: 2.5
+##
+{ 'event': 'QUORUM_FLUSH_ERROR',
+  'data': { 'node-name': 'str', 'error': 'str'} }
+
+##
  # @VSERPORT_CHANGE
  #
  # Emitted when the guest opens or closes a virtio-serial port.

Re: [Qemu-devel] [PATCH v15 0/9] Block replication for continuous checkpoints

2016-02-23 Thread Changlong Xie


Ping again ...

Thanks
-Xie

On 02/05/2016 12:17 PM, Changlong Xie wrote:

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html

Patch status:
1. Acked patches: none
2. Reviewed patches: patch 4
3. Updated patches: patch 3, 4, 5, 7, 8
Note: patch 7 is a new patch that introduces genernal start/stop/checkpoint/
get_error
APIs.

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v15

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v14

TODO:
1. Continuous block replication. It will be started after basic functions
are accepted.

Changs Log:
V15:
1. Rebase to the newest codes
2. Fix typos and coding style addresed Eric's comments
3. Address Stefan's comments
1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver
2) Update the message and description for [PATCH 4/9]
3) Make replication_(start/stop/do_checkpoint)_all as global interfaces
4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks
5) Use BdrvChild instead of holding on to BlockDriverState * pointers
4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771
5. Introduce replication_get_error_all to check replication status
6. Remove useless discard interface
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
structure, also remove old block layer APIs
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Changlong Xie (1):
   Introduce new APIs to do replication operation

Wen Congyang (8):
   unblock backup operations in backing file
   Store parent BDS in BdrvChild
   Backup: clear all bitmap when doing block checkpoint
   Link backup into block core
   docs: block replication's description
   auto complete active commit
   Implement new driver for block replication
   support replication driver in blockdev-add

  Makefile.objs  |   1 +
  block.c|  19 ++
  block/Makefile.objs|   3 +-
  block/backup.c |  15 ++
  block/mirror.c |  13 +-
  block/replication.c| 594 +
  blockdev.c |   2 +-
  docs/block-replication.txt | 238 ++
  include/block/block_int.h  |   6 +-
  qapi/block-core.json   |  33 ++-
  qemu-img.c |   2 +-
  replication.c  |  94 +++
  replication.h  |  53 
  13 files changed, 1063 insertions(+), 10 deletions(-)
  create mode 100644 block/replication.c
  create mode 100644 docs/block-replication.txt
  create mode 100644 replication.c
  create mode 100644 replication.h

Re: [Qemu-devel] [PATCH v4 1/2] qmp event: Add QUORUM_FLUSH_ERROR

2016-02-23 Thread Changlong Xie


On 02/23/2016 10:05 PM, Alberto Garcia wrote:

On Tue 23 Feb 2016 02:45:49 PM CET, Eric Blake wrote:


Commit message should say why we need a third event, rather than
reusing either of the other two (my guess: because you don't have a
location, and don't want to modify the existing two to report a
location - but why not just use 'sector-num':0,
'sectors-count': to report the entire file as the
location?)


I would also be fine with that solution.


I would also be fine if we added an optional enum member to the
existing event that said which operation failed ('read', 'write',
'flush') - adding optional output members is safe, while converting
existing mandatory output members to optional may confuse existing
clients.




Hi Berto & Eric

Thanks for all your comments. Surely, this is the best option to me too
:-), will fix it in next version.

Thanks
-Xie


That might actually be the best option :-)

Berto


.

[Qemu-devel] [PATCH v5 0/3] modify vote rules for flush operation

2016-02-24 Thread Changlong Xie

ChangLog:
v5:
1. Fix invalid node name in docs/qmp-events.txt
2. Address comments from Eric, drop QUORUM_FLUSH_ERROR. Instead of reworking
QUORUM_REPORT_BAD to make it compatible with quorum read/write/flush operations
v4:
1. Introduce QUORUM_FLUSH_ERROR event to notify flush failure.
v3:
1. *Note* that, the codes logic is different from what we talked in v2.
I just keep flush interface the same logic as quorum read/write, and think
it's reasoned.

[Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.htm

Changlong Xie (3):
  docs: fix invalid node name in qmp event
  qmp event: Refactor QUORUM_REPORT_BAD
  quorum: modify vote rules for flush operation

 block/quorum.c  | 48 ++--
 docs/qmp-events.txt | 10 +-
 qapi/block.json | 16 
 qapi/event.json |  4 +++-
 4 files changed, 66 insertions(+), 12 deletions(-)

-- 
1.9.3

[Qemu-devel] [PATCH v5 3/3] quorum: modify vote rules for flush operation

2016-02-24 Thread Changlong Xie

Keep flush interface the same logic as quorum read/write, Otherwise in
following scenario, we'll encounter unexpected errors.

Quorum has two children(A, B). A do flush sucessfully, but B flush failed.
This cause the filesystem of guest become read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only

Cc: Dr. David Alan Gilbert 
Cc: Wen Congyang 
Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 03d68c1..0c7e4ad 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -778,19 +778,29 @@ static coroutine_fn int quorum_co_flush(BlockDriverState 
*bs)
 QuorumVoteValue result_value;
 int i;
 int result = 0;
+int success_count = 0;
 
 QLIST_INIT(&error_votes.vote_list);
 error_votes.compare = quorum_64bits_compare;
 
 for (i = 0; i < s->num_children; i++) {
 result = bdrv_co_flush(s->children[i]->bs);
-result_value.l = result;
-quorum_count_vote(&error_votes, &result_value, i);
+if (result) {
+quorum_report_bad(QUORUM_OP_TYPE_FLUSH, NULL,
+  s->children[i]->bs->node_name, result);
+result_value.l = result;
+quorum_count_vote(&error_votes, &result_value, i);
+} else {
+success_count++;
+}
 }
 
-winner = quorum_get_vote_winner(&error_votes);
-result = winner->value.l;
-
+if (success_count >= s->threshold) {
+result = 0;
+} else {
+winner = quorum_get_vote_winner(&error_votes);
+result = winner->value.l;
+}
 quorum_free_vote_list(&error_votes);
 
 return result;
-- 
1.9.3

[Qemu-devel] [PATCH v5 2/3] qmp event: Refactor QUORUM_REPORT_BAD

2016-02-24 Thread Changlong Xie

Introduce QuorumOpType, and make QUORUM_REPORT_BAD compatible
with it.

Cc: Dr. David Alan Gilbert 
Cc: Wen Congyang 
Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c  | 28 +++-
 docs/qmp-events.txt |  8 
 qapi/block.json | 16 
 qapi/event.json |  4 +++-
 4 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 11cc60b..03d68c1 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -215,14 +215,29 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
 return acb;
 }
 
-static void quorum_report_bad(QuorumAIOCB *acb, char *node_name, int ret)
+static void quorum_report_bad(QuorumOpType type, QuorumAIOCB *acb,
+  char *node_name, int ret)
 {
 const char *msg = NULL;
 if (ret < 0) {
 msg = strerror(-ret);
 }
-qapi_event_send_quorum_report_bad(!!msg, msg, node_name,
-  acb->sector_num, acb->nb_sectors, 
&error_abort);
+
+switch (type) {
+case QUORUM_OP_TYPE_READ:
+case QUORUM_OP_TYPE_WRITE:
+qapi_event_send_quorum_report_bad(false, 0, !!msg, msg, node_name,
+  acb->sector_num, acb->nb_sectors,
+  &error_abort);
+break;
+case QUORUM_OP_TYPE_FLUSH:
+qapi_event_send_quorum_report_bad(true, type, !!msg, msg, node_name,
+  0, 0, &error_abort);
+break;
+default:
+/* should never happen */
+abort();
+}
 }
 
 static void quorum_report_failure(QuorumAIOCB *acb)
@@ -282,6 +297,7 @@ static void quorum_aio_cb(void *opaque, int ret)
 QuorumChildRequest *sacb = opaque;
 QuorumAIOCB *acb = sacb->parent;
 BDRVQuorumState *s = acb->common.bs->opaque;
+QuorumOpType type;
 bool rewrite = false;
 
 if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
@@ -300,12 +316,13 @@ static void quorum_aio_cb(void *opaque, int ret)
 return;
 }
 
+type = acb->is_read ? QUORUM_OP_TYPE_READ : QUORUM_OP_TYPE_WRITE;
 sacb->ret = ret;
 acb->count++;
 if (ret == 0) {
 acb->success_count++;
 } else {
-quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
+quorum_report_bad(type, acb, sacb->aiocb->bs->node_name, ret);
 }
 assert(acb->count <= s->num_children);
 assert(acb->success_count <= s->num_children);
@@ -338,7 +355,8 @@ static void quorum_report_bad_versions(BDRVQuorumState *s,
 continue;
 }
 QLIST_FOREACH(item, &version->items, next) {
-quorum_report_bad(acb, s->children[item->index]->bs->node_name, 0);
+quorum_report_bad(QUORUM_OP_TYPE_READ, acb,
+  s->children[item->index]->bs->node_name, 0);
 }
 }
 }
diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
index f96e120..0a082db 100644
--- a/docs/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -307,6 +307,7 @@ Emitted to report a corruption of a Quorum file.
 
 Data:
 
+- "type":  Quorum operation type (json-string, optional)
 - "error": Error message (json-string, optional)
Only present on failure.  This field contains a 
human-readable
error message.  There are no semantics other than that the
@@ -318,10 +319,17 @@ Data:
 
 Example:
 
+Read/Write operation:
 { "event": "QUORUM_REPORT_BAD",
  "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 
},
  "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
 
+Flush operation:
+{ "event": "QUORUM_REPORT_BAD",
+ "data": { "node-name": "node0", "sectors-count": 0, "sector-num": 0,
+ "type": "flush", "error": "Broken pipe" },
+ "timestamp": { "seconds": 1456406829, "microseconds": 291763 } }
+
 Note: this event is rate-limited.
 
 RESET
diff --git a/qapi/block.json b/qapi/block.json
index 58e6b30..937337d 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -196,3 +196,19 @@
 ##
 { 'event': 'DEVICE_TRAY_MOVED',
   'data': { 'device': 'str', 'tray-open': 'bool' } }
+
+##
+# @QuorumOpType
+#
+# An enumeration of the quorum operation types
+#
+# @read: read operation
+#
+# @write: write operation
+#
+# @flush: flush operation
+#
+# Since: 2.6
+##
+{ 'enum': 'QuorumOpType',
+  'data': [ 'read', 'write', 'flush' ] }
di

[Qemu-devel] [PATCH v5 1/3] docs: fix invalid node name in qmp event

2016-02-24 Thread Changlong Xie

Cc: Dr. David Alan Gilbert 
Cc: Wen Congyang 
Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 docs/qmp-events.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
index 52eb7e2..f96e120 100644
--- a/docs/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -319,7 +319,7 @@ Data:
 Example:
 
 { "event": "QUORUM_REPORT_BAD",
- "data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 
},
+ "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 
},
  "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
 
 Note: this event is rate-limited.
-- 
1.9.3

Re: [Qemu-devel] [PATCH v5 2/3] qmp event: Refactor QUORUM_REPORT_BAD

2016-02-24 Thread Changlong Xie


On 02/25/2016 09:12 AM, Eric Blake wrote:

On 02/24/2016 05:50 PM, Wen Congyang wrote:


+- "type":  Quorum operation type (json-string, optional)


I don't think 'type' needs to be optional, after all.  Just always
output it.


If we output read/write type, old libvirt will ignore the read/write error 
events?


The QMP protocol already documents that ALL clients MUST ignore
unexpected output fields.  Any client that is unprepared for new fields
appearing in the dictionary is buggy.  Libvirt will be just fine if you
output a new "type":"read".


Ok, i'll make "type" mandatory.

Thanks
-Xie

Re: [Qemu-devel] [PATCH v5 2/3] qmp event: Refactor QUORUM_REPORT_BAD

2016-02-24 Thread Changlong Xie


On 02/24/2016 08:35 PM, Alberto Garcia wrote:

On Wed 24 Feb 2016 11:11:54 AM CET, Changlong Xie wrote:

-static void quorum_report_bad(QuorumAIOCB *acb, char *node_name, int ret)
+static void quorum_report_bad(QuorumOpType type, QuorumAIOCB *acb,
+  char *node_name, int ret)
  {
  const char *msg = NULL;
  if (ret < 0) {
  msg = strerror(-ret);
  }
-qapi_event_send_quorum_report_bad(!!msg, msg, node_name,
-  acb->sector_num, acb->nb_sectors, 
&error_abort);
+
+switch (type) {
+case QUORUM_OP_TYPE_READ:
+case QUORUM_OP_TYPE_WRITE:
+qapi_event_send_quorum_report_bad(false, 0, !!msg, msg, node_name,
+  acb->sector_num, acb->nb_sectors,
+  &error_abort);
+break;
+case QUORUM_OP_TYPE_FLUSH:
+qapi_event_send_quorum_report_bad(true, type, !!msg, msg, node_name,
+  0, 0, &error_abort);
+break;


A few comments:

   - Why don't you set the 'type' field in read and write operations? You
 defined all three values but you are only using 'flush' here.

   - For the case of flush you could set sectors-count to the total size
 of the BlockDriverState as Eric suggested (bdrv_nb_sectors(bs)).
 Setting both to 0 could confuse clients that are not ready to deal
 with flush failures.

   - Since the QuorumAIOCB parameter is not used in the flush case, you
 could replace it from the function prototype and use sector_num and
 nb_sectors instead. That way you can also omit the switch.



Surely, all your suggestions are helpful. Also i will add "Reviewed-by" 
in patch 1/3, 3/3 in next version.


Thanks
-Xie


Berto


.

[Qemu-devel] [PATCH v6 1/3] docs: fix invalid node name in qmp event

2016-02-24 Thread Changlong Xie

Cc: Dr. David Alan Gilbert 
Cc: Wen Congyang 
Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
Reviewed-by: Alberto Garcia 
---
 docs/qmp-events.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
index b6e8937..afb5e20 100644
--- a/docs/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -335,7 +335,7 @@ Data:
 Example:
 
 { "event": "QUORUM_REPORT_BAD",
- "data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 
},
+ "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 
},
  "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
 
 Note: this event is rate-limited.
-- 
1.9.3

[Qemu-devel] [PATCH v6 0/3] modify vote rules for flush operation

2016-02-24 Thread Changlong Xie

ChangLog:
v6:
1. Make "type" mandatory for [PATCH 2/3] 
v5:
1. Fix invalid node name in docs/qmp-events.txt
2. Address comments from Eric, drop QUORUM_FLUSH_ERROR. Instead of reworking
QUORUM_REPORT_BAD to make it compatible with quorum read/write/flush operations
v4:
1. Introduce QUORUM_FLUSH_ERROR event to notify flush failure.
v3:
1. *Note* that, the codes logic is different from what we talked in v2.
I just keep flush interface the same logic as quorum read/write, and think
it's reasoned.

[Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.htm

Changlong Xie (3):
  docs: fix invalid node name in qmp event
  qmp event: Refactor QUORUM_REPORT_BAD
  quorum: modify vote rules for flush operation

 block/quorum.c  | 38 --
 docs/qmp-events.txt | 11 ++-
 qapi/block.json | 16 
 qapi/event.json |  4 +++-
 4 files changed, 57 insertions(+), 12 deletions(-)

-- 
1.9.3

[Qemu-devel] [PATCH v6 3/3] quorum: modify vote rules for flush operation

2016-02-24 Thread Changlong Xie

Keep flush interface the same logic as quorum read/write, Otherwise in
following scenario, we'll encounter unexpected errors.

Quorum has two children(A, B). A do flush sucessfully, but B flush failed.
This cause the filesystem of guest become read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only

Cc: Dr. David Alan Gilbert 
Cc: Wen Congyang 
Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
Reviewed-by: Alberto Garcia 
---
 block/quorum.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 974fff9..ed906d0 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -770,19 +770,30 @@ static coroutine_fn int quorum_co_flush(BlockDriverState 
*bs)
 QuorumVoteValue result_value;
 int i;
 int result = 0;
+int success_count = 0;
 
 QLIST_INIT(&error_votes.vote_list);
 error_votes.compare = quorum_64bits_compare;
 
 for (i = 0; i < s->num_children; i++) {
 result = bdrv_co_flush(s->children[i]->bs);
-result_value.l = result;
-quorum_count_vote(&error_votes, &result_value, i);
+if (result) {
+quorum_report_bad(QUORUM_OP_TYPE_FLUSH, 0,
+  bdrv_nb_sectors(s->children[i]->bs),
+  s->children[i]->bs->node_name, result);
+result_value.l = result;
+quorum_count_vote(&error_votes, &result_value, i);
+} else {
+success_count++;
+}
 }
 
-winner = quorum_get_vote_winner(&error_votes);
-result = winner->value.l;
-
+if (success_count >= s->threshold) {
+result = 0;
+} else {
+winner = quorum_get_vote_winner(&error_votes);
+result = winner->value.l;
+}
 quorum_free_vote_list(&error_votes);
 
 return result;
-- 
1.9.3

[Qemu-devel] [PATCH v6 2/3] qmp event: Refactor QUORUM_REPORT_BAD

2016-02-24 Thread Changlong Xie

Introduce QuorumOpType, and make QUORUM_REPORT_BAD compatible
with it.

Cc: Dr. David Alan Gilbert 
Cc: Wen Congyang 
Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c  | 17 -
 docs/qmp-events.txt | 11 ++-
 qapi/block.json | 16 
 qapi/event.json |  4 +++-
 4 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index f78d4cb..974fff9 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -218,14 +218,16 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
 return acb;
 }
 
-static void quorum_report_bad(QuorumAIOCB *acb, char *node_name, int ret)
+static void quorum_report_bad(QuorumOpType type, uint64_t sector_num,
+  int nb_sectors, char *node_name, int ret)
 {
 const char *msg = NULL;
 if (ret < 0) {
 msg = strerror(-ret);
 }
-qapi_event_send_quorum_report_bad(!!msg, msg, node_name,
-  acb->sector_num, acb->nb_sectors, 
&error_abort);
+
+qapi_event_send_quorum_report_bad(type, !!msg, msg, node_name,
+  sector_num, nb_sectors, &error_abort);
 }
 
 static void quorum_report_failure(QuorumAIOCB *acb)
@@ -285,6 +287,7 @@ static void quorum_aio_cb(void *opaque, int ret)
 QuorumChildRequest *sacb = opaque;
 QuorumAIOCB *acb = sacb->parent;
 BDRVQuorumState *s = acb->common.bs->opaque;
+QuorumOpType type;
 bool rewrite = false;
 
 if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
@@ -303,12 +306,14 @@ static void quorum_aio_cb(void *opaque, int ret)
 return;
 }
 
+type = acb->is_read ? QUORUM_OP_TYPE_READ : QUORUM_OP_TYPE_WRITE;
 sacb->ret = ret;
 acb->count++;
 if (ret == 0) {
 acb->success_count++;
 } else {
-quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
+quorum_report_bad(type, acb->sector_num, acb->nb_sectors,
+  sacb->aiocb->bs->node_name, ret);
 }
 assert(acb->count <= s->num_children);
 assert(acb->success_count <= s->num_children);
@@ -341,7 +346,9 @@ static void quorum_report_bad_versions(BDRVQuorumState *s,
 continue;
 }
 QLIST_FOREACH(item, &version->items, next) {
-quorum_report_bad(acb, s->children[item->index]->bs->node_name, 0);
+quorum_report_bad(QUORUM_OP_TYPE_READ, acb->sector_num,
+  acb->nb_sectors,
+  s->children[item->index]->bs->node_name, 0);
 }
 }
 }
diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
index afb5e20..96baacb 100644
--- a/docs/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -323,6 +323,7 @@ Emitted to report a corruption of a Quorum file.
 
 Data:
 
+- "type":  Quorum operation type
 - "error": Error message (json-string, optional)
Only present on failure.  This field contains a 
human-readable
error message.  There are no semantics other than that the
@@ -334,10 +335,18 @@ Data:
 
 Example:
 
+Read/Write operation:
 { "event": "QUORUM_REPORT_BAD",
- "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 
},
+ "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5,
+ "type": "read" },
  "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
 
+Flush operation:
+{ "event": "QUORUM_REPORT_BAD",
+ "data": { "node-name": "node0", "sector-num": 0, "sectors-count": 2097120,
+ "type": "flush", "error": "Broken pipe" },
+ "timestamp": { "seconds": 1456406829, "microseconds": 291763 } }
+
 Note: this event is rate-limited.
 
 RESET
diff --git a/qapi/block.json b/qapi/block.json
index 58e6b30..937337d 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -196,3 +196,19 @@
 ##
 { 'event': 'DEVICE_TRAY_MOVED',
   'data': { 'device': 'str', 'tray-open': 'bool' } }
+
+##
+# @QuorumOpType
+#
+# An enumeration of the quorum operation types
+#
+# @read: read operation
+#
+# @write: write operation
+#
+# @flush: flush operation
+#
+# Since: 2.6
+##
+{ 'enum': 'QuorumOpType',
+  'data': [ 'read', 'write', 'flush' ] }
diff --git a/qapi/event.json b/qapi/event.json
index cfcc887..a2b8db6 100644
--- a/qapi/event.json
+++ b/qapi/event.json
@@ -340,6 +340,8 @@
 #
 # Emitted to report a corruption of a Quorum file
 #
+# @type: quorum operation type (Since 2.6)
+#
 # @error: #optional, error message. Only present on failure. This field
 # contains a human-readable error message. There are no semantics other
 # than that the block layer reported an error and clients should not
@@ -354,7 +356,7 @@
 # Since: 2.0
 ##
 { 'event': 'QUORUM_REPORT_BAD',
-  'data': { '*error': 'str', 'node-name': 'str',
+  'data': { 'type': 'QuorumOpType', '*error': 'str', 'node-name': 'str',
 'sector-num': 'int', 'sectors-count': 'int' } }
 
 ##
-- 
1.9.3

Re: [Qemu-devel] [PATCH v6 2/3] qmp event: Refactor QUORUM_REPORT_BAD

2016-02-25 Thread Changlong Xie


On 02/25/2016 06:57 PM, Alberto Garcia wrote:

On Thu 25 Feb 2016 06:33:08 AM CET, Changlong Xie  
wrote:

+Read/Write operation:
  { "event": "QUORUM_REPORT_BAD",
- "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 
},
+ "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5,
+ "type": "read" },
   "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }


Since you introduced the 'type' field and this is now an example of a
read error, you can change the description to say simply "Read
operation:". In my opinion there's no need to add yet another example
for a write operation, I think it's clear enough.


Ok




+Flush operation:
+{ "event": "QUORUM_REPORT_BAD",
+ "data": { "node-name": "node0", "sector-num": 0, "sectors-count": 2097120,
+ "type": "flush", "error": "Broken pipe" },
+ "timestamp": { "seconds": 1456406829, "microseconds": 291763 } }


Here (and in the previous case) please indent "type" so it goes under


Surely.


"node-name":

{ "event": "QUORUM_REPORT_BAD",
  "data": { "node-name": "node0", "sector-num": 0, "sectors-count": 2097120,
"type": "flush", "error": "Broken pipe" },
  "timestamp": { "seconds": 1456406829, "microseconds": 291763 } }

Otherwise I think the patch looks perfect now. Thanks!



Thanks for your review. Will send another series.

Thanks
-Xie


Berto


.

[Qemu-devel] [PATCH v7 1/3] docs: fix invalid node name in qmp event

2016-02-25 Thread Changlong Xie

Cc: Dr. David Alan Gilbert 
Cc: Wen Congyang 
Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
Reviewed-by: Alberto Garcia 
---
 docs/qmp-events.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
index 4e3eb9e..d6b9aea 100644
--- a/docs/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -337,7 +337,7 @@ Data:
 Example:
 
 { "event": "QUORUM_REPORT_BAD",
- "data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 
},
+ "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 
},
  "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
 
 Note: this event is rate-limited.
-- 
1.9.3

[Qemu-devel] [PATCH v7 3/3] quorum: modify vote rules for flush operation

2016-02-25 Thread Changlong Xie

Keep flush interface the same logic as quorum read/write, Otherwise in
following scenario, we'll encounter unexpected errors.

Quorum has two children(A, B). A do flush sucessfully, but B flush failed.
This cause the filesystem of guest become read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only

Cc: Dr. David Alan Gilbert 
Cc: Wen Congyang 
Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
Reviewed-by: Alberto Garcia 
---
 block/quorum.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 8f83574..b16171b 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -767,19 +767,30 @@ static coroutine_fn int quorum_co_flush(BlockDriverState 
*bs)
 QuorumVoteValue result_value;
 int i;
 int result = 0;
+int success_count = 0;
 
 QLIST_INIT(&error_votes.vote_list);
 error_votes.compare = quorum_64bits_compare;
 
 for (i = 0; i < s->num_children; i++) {
 result = bdrv_co_flush(s->children[i]->bs);
-result_value.l = result;
-quorum_count_vote(&error_votes, &result_value, i);
+if (result) {
+quorum_report_bad(QUORUM_OP_TYPE_FLUSH, 0,
+  bdrv_nb_sectors(s->children[i]->bs),
+  s->children[i]->bs->node_name, result);
+result_value.l = result;
+quorum_count_vote(&error_votes, &result_value, i);
+} else {
+success_count++;
+}
 }
 
-winner = quorum_get_vote_winner(&error_votes);
-result = winner->value.l;
-
+if (success_count >= s->threshold) {
+result = 0;
+} else {
+winner = quorum_get_vote_winner(&error_votes);
+result = winner->value.l;
+}
 quorum_free_vote_list(&error_votes);
 
 return result;
-- 
1.9.3

[Qemu-devel] [PATCH v7 0/3] modify vote rules for flush operation

2016-02-25 Thread Changlong Xie

ChangLog:
v7:
1. Fix coding style in doc/qmp-event.txt
v6:
1. Make "type" mandatory for [PATCH 2/3] 
v5:
1. Fix invalid node name in docs/qmp-events.txt
2. Address comments from Eric, drop QUORUM_FLUSH_ERROR. Instead of reworking
QUORUM_REPORT_BAD to make it compatible with quorum read/write/flush operations
v4:
1. Introduce QUORUM_FLUSH_ERROR event to notify flush failure.
v3:
1. *Note* that, the codes logic is different from what we talked in v2.
I just keep flush interface the same logic as quorum read/write, and think
it's reasoned.

[Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.html

Changlong Xie (3):
  docs: fix invalid node name in qmp event
  qmp event: Refactor QUORUM_REPORT_BAD
  quorum: modify vote rules for flush operation

 block/quorum.c  | 38 --
 docs/qmp-events.txt | 11 ++-
 qapi/block.json | 16 
 qapi/event.json |  4 +++-
 4 files changed, 57 insertions(+), 12 deletions(-)

-- 
1.9.3

[Qemu-devel] [PATCH v7 2/3] qmp event: Refactor QUORUM_REPORT_BAD

2016-02-25 Thread Changlong Xie

Introduce QuorumOpType, and make QUORUM_REPORT_BAD compatible
with it.

Cc: Dr. David Alan Gilbert 
Cc: Wen Congyang 
Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
 block/quorum.c  | 17 -
 docs/qmp-events.txt | 11 ++-
 qapi/block.json | 16 
 qapi/event.json |  4 +++-
 4 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 11cc60b..8f83574 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -215,14 +215,16 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
 return acb;
 }
 
-static void quorum_report_bad(QuorumAIOCB *acb, char *node_name, int ret)
+static void quorum_report_bad(QuorumOpType type, uint64_t sector_num,
+  int nb_sectors, char *node_name, int ret)
 {
 const char *msg = NULL;
 if (ret < 0) {
 msg = strerror(-ret);
 }
-qapi_event_send_quorum_report_bad(!!msg, msg, node_name,
-  acb->sector_num, acb->nb_sectors, 
&error_abort);
+
+qapi_event_send_quorum_report_bad(type, !!msg, msg, node_name,
+  sector_num, nb_sectors, &error_abort);
 }
 
 static void quorum_report_failure(QuorumAIOCB *acb)
@@ -282,6 +284,7 @@ static void quorum_aio_cb(void *opaque, int ret)
 QuorumChildRequest *sacb = opaque;
 QuorumAIOCB *acb = sacb->parent;
 BDRVQuorumState *s = acb->common.bs->opaque;
+QuorumOpType type;
 bool rewrite = false;
 
 if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
@@ -300,12 +303,14 @@ static void quorum_aio_cb(void *opaque, int ret)
 return;
 }
 
+type = acb->is_read ? QUORUM_OP_TYPE_READ : QUORUM_OP_TYPE_WRITE;
 sacb->ret = ret;
 acb->count++;
 if (ret == 0) {
 acb->success_count++;
 } else {
-quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
+quorum_report_bad(type, acb->sector_num, acb->nb_sectors,
+  sacb->aiocb->bs->node_name, ret);
 }
 assert(acb->count <= s->num_children);
 assert(acb->success_count <= s->num_children);
@@ -338,7 +343,9 @@ static void quorum_report_bad_versions(BDRVQuorumState *s,
 continue;
 }
 QLIST_FOREACH(item, &version->items, next) {
-quorum_report_bad(acb, s->children[item->index]->bs->node_name, 0);
+quorum_report_bad(QUORUM_OP_TYPE_READ, acb->sector_num,
+  acb->nb_sectors,
+  s->children[item->index]->bs->node_name, 0);
 }
 }
 }
diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
index d6b9aea..fa7574d 100644
--- a/docs/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -325,6 +325,7 @@ Emitted to report a corruption of a Quorum file.
 
 Data:
 
+- "type":  Quorum operation type
 - "error": Error message (json-string, optional)
Only present on failure.  This field contains a 
human-readable
error message.  There are no semantics other than that the
@@ -336,10 +337,18 @@ Data:
 
 Example:
 
+Read operation:
 { "event": "QUORUM_REPORT_BAD",
- "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 
},
+ "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5,
+   "type": "read" },
  "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
 
+Flush operation:
+{ "event": "QUORUM_REPORT_BAD",
+ "data": { "node-name": "node0", "sector-num": 0, "sectors-count": 2097120,
+   "type": "flush", "error": "Broken pipe" },
+ "timestamp": { "seconds": 1456406829, "microseconds": 291763 } }
+
 Note: this event is rate-limited.
 
 RESET
diff --git a/qapi/block.json b/qapi/block.json
index 58e6b30..937337d 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -196,3 +196,19 @@
 ##
 { 'event': 'DEVICE_TRAY_MOVED',
   'data': { 'device': 'str', 'tray-open': 'bool' } }
+
+##
+# @QuorumOpType
+#
+# An enumeration of the quorum operation types
+#
+# @read: read operation
+#
+# @write: write operation
+#
+# @flush: flush operation
+#
+# Since: 2.6
+##
+{ 'enum': 'QuorumOpType',
+  'data': [ 'read', 'write', 'flush' ] }
diff --git a/qapi/event.json b/qapi/event.json
index 1a45a6c..8642052 100644
--- a/qapi/event.json
+++ b/qapi/event.json
@@ -325,6 +325,

[Qemu-devel] Disabling IRQ error

2013-09-10 Thread Xie Xianshan


hi everyone,

I`m getting the nobody cared disabling IRQ error, when i raised external 
interrupts IRQ3 to the Openpic in QEMU.
(Actually, any external interrupts irq i raised can reproduce this 
error, but internal interrupts work fine)


And this IRQ3 is sharing irq with usb card.

I have tried to resolve this issue as follows, but nothing changed.
 1)tried to boot with "irqpoll" option.
 2)tried to stop raising the irq for usb card.
 3)tried to boot with "kernel_irqchip" option

Related log from dmesg:
[ 2079.800787] irq 19: nobody cared (try booting with the "irqpoll" option)
[ 2079.800891] Call Trace:
[ 2079.801303] [d7ff3f40] [c0007780] show_stack+0x7c/0x1a0 (unreliable)
[ 2079.801398] [d7ff3f80] [c007ad48] __report_bad_irq+0x5c/0xe0
[ 2079.801439] [d7ff3fa0] [c007af58] note_interrupt+0x18c/0x240
[ 2079.801466] [d7ff3fd0] [c007be04] handle_fasteoi_irq+0xf8/0x158
[ 2079.801492] [d7ff3ff0] [c000d9d4] call_handle_irq+0x18/0x28
[ 2079.801582] [d7ff5ec0] [c0004df0] do_IRQ+0xf0/0x16c
[ 2079.801609] [d7ff5ee0] [c000e9bc] ret_from_except+0x0/0x18
[ 2079.801652] --- Exception: 501 at __do_softirq+0x94/0x18c
[ 2079.801664] LR = __do_softirq+0x54/0x18c
[ 2079.801698] [d7ff5ff0] [c000d9ac] call_do_softirq+0x14/0x24
[ 2079.801725] [d788fc50] [c0004bc0] do_softirq+0x74/0xa0
[ 2079.801751] [d788fc70] [c004350c] irq_exit+0x3c/0x8c
[ 2079.801775] [d788fc80] [c0004e3c] do_IRQ+0x13c/0x16c
[ 2079.801800] [d788fca0] [c000e9bc] ret_from_except+0x0/0x18
[ 2079.802566] --- Exception: 501 at bbc_dma_exec+0xa08/0xef4 [bbc_driver]
[ 2079.802580] LR = bbc_dma_exec+0x970/0xef4 [bbc_driver]
[ 2079.802620] [d788fdf0] [d94d4f28] 
bbc_ioctl_dma_read_write+0x2bc/0x464 [bbc_driver]

[ 2079.802661] [d788fe70] [d94cae84] bbc_ioctl+0x2a0/0x36c [bbc_driver]
[ 2079.802754] [d788feb0] [c00d1f04] do_vfs_ioctl+0x6b8/0x760
[ 2079.802782] [d788ff10] [c00d2014] sys_ioctl+0x68/0xa8
[ 2079.802807] [d788ff40] [c000e368] ret_from_syscall+0x0/0x3c
[ 2079.802891] --- Exception: c01 at 0xfaffda8
[ 2079.802901] LR = 0xfb8ec20
[ 2079.802936] handlers:
[ 2079.803034] [] (usb_hcd_irq+0x0/0xac)
[ 2079.803120] [] (bbc_interrupt_handler+0x0/0x65c [bbc_driver])
[ 2079.803181] Disabling IRQ #19

And interrupt information from /proc/interrupts and stat:
# cat /proc/interrupts
   CPU0
 18:   6923OpenPIC Levelohci_hcd:usb2, 
ohci_hcd:usb3, drvbbc_ldc

 19: 11OpenPIC Levelehci_hcd:usb1, drvbbc
 20:  0OpenPIC Edge internal_error
 21:  0OpenPIC LevelNMI
 22:  0OpenPIC LevelSP_I2C_Handler
 42:   1148OpenPIC Levelserial
 43:174OpenPIC Leveli2c-mpc, i2c-mpc
 59:  0OpenPIC Levelfsl_espi
 72:  51888OpenPIC Levelmmc0
LOC: 143330   Local timer interrupts
SPU:  0   Spurious interrupts
CNT:  0   Performance monitoring interrupts
MCE:  0   Machine check exceptions

#cat /proc/stat
cpu  4929 0 5784 353692 736 0 460 0 0 0
cpu0 4929 0 5784 353692 736 0 460 0 0 0
intr 394342 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6929 11 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1574 174 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 53202 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

ctxt 357415
btime 1378108122
processes 1648
procs_running 1
procs_blocked 0
softirq 329977 0 154040 0 0 1145 0 17497 0 337 156958


I`ve no idea what the problem is and how to fix it.
Do you have experience to this?

Thanks in advance for any advice.


Simen

1 2 3 4 5 6 7 >

1 - 100 of 639 matches

Mail list logo