[Qemu-devel] [Bug 587344] Re: gfxmenu from GRUB or GRUB4DOS hang qemu.
I think the problem is relate to the cpu arch. I have the same problem in my machine of which the cpu is Nehalem -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/587344 Title: gfxmenu from GRUB or GRUB4DOS hang qemu. Status in QEMU: Fix Released Bug description: When you run the graphical menu GRUB (gfxmenu, any theme) execution stops. Test image (iso, as an attachment): http://www.multiupload.com/AAI7MGMOLE Image make options: mkisofs -R -b grldr -no-emul-boot -boot-load-seg 0x1000 -o bootable.iso image_root/ With options and GFX menu: qemu -cdrom bootable.iso execution stops after rendering background. With options and GFX menu:: qemu -cdrom bootable.iso --no-kvm the menu is drawn, and the execution stops. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/587344/+subscriptions
[Qemu-devel] [Bug 587344] Re: gfxmenu from GRUB or GRUB4DOS hang qemu.
In addition, CPU is Intel E5520 -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/587344 Title: gfxmenu from GRUB or GRUB4DOS hang qemu. Status in QEMU: Fix Released Bug description: When you run the graphical menu GRUB (gfxmenu, any theme) execution stops. Test image (iso, as an attachment): http://www.multiupload.com/AAI7MGMOLE Image make options: mkisofs -R -b grldr -no-emul-boot -boot-load-seg 0x1000 -o bootable.iso image_root/ With options and GFX menu: qemu -cdrom bootable.iso execution stops after rendering background. With options and GFX menu:: qemu -cdrom bootable.iso --no-kvm the menu is drawn, and the execution stops. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/587344/+subscriptions
[Qemu-devel] Would virtio support 64 bit address for vring virtqueue?
I am developing virtio user space poll mode network driver. We allocate vring physical memory from huge page. On VMs with less than 4GB memory, it works well. But on VMs with like 8GB memory, huge page are all allocated from high end memory. So would virtio support 64bit address for vring virtqueue? Thanks
Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue?
Hi Stefan: I think you mention the descriptor address? I mean the vring PFN register. /* A 32-bit r/w PFN for the currently selected queue */ #define VIRTIO_PCI_QUEUE_PFN8 And the linux driver sample code: iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); -Original Message- From: Stefan Hajnoczi [mailto:stefa...@gmail.com] Sent: Wednesday, August 28, 2013 4:07 PM To: Xie, Huawei Cc: qemu-devel@nongnu.org; ru...@rustcorp.com.au; Stefan Hajnoczi Subject: Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue? On Wed, Aug 28, 2013 at 03:18:39AM +, Xie, Huawei wrote: > I am developing virtio user space poll mode network driver. We allocate vring > physical memory from huge page. On VMs with less than 4GB memory, it works > well. But on VMs with like 8GB memory, huge page are all allocated from high > end memory. > So would virtio support 64bit address for vring virtqueue? The vring takes guest physical addresses and the C type is __u64 (see /usr/include/linux/virtio_ring.h). 64-bit addresses are fine. Stefan
Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue?
If this is the case, one possible fix would be: Write two continuous 32bit DWORD to combine a 64bit address Use the upper 12 bits of PFN val to indicate if it is combined write In this way, we wouldn't break other virtio driver, register layout and only need a few lines of modification. -Original Message- From: Xie, Huawei Sent: Wednesday, August 28, 2013 5:23 PM To: 'Stefan Hajnoczi' Cc: qemu-devel@nongnu.org; ru...@rustcorp.com.au; Stefan Hajnoczi Subject: RE: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue? Hi Stefan: I think you mention the descriptor address? I mean the vring PFN register. /* A 32-bit r/w PFN for the currently selected queue */ #define VIRTIO_PCI_QUEUE_PFN8 And the linux driver sample code: iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); -Original Message- From: Stefan Hajnoczi [mailto:stefa...@gmail.com] Sent: Wednesday, August 28, 2013 4:07 PM To: Xie, Huawei Cc: qemu-devel@nongnu.org; ru...@rustcorp.com.au; Stefan Hajnoczi Subject: Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue? On Wed, Aug 28, 2013 at 03:18:39AM +, Xie, Huawei wrote: > I am developing virtio user space poll mode network driver. We allocate vring > physical memory from huge page. On VMs with less than 4GB memory, it works > well. But on VMs with like 8GB memory, huge page are all allocated from high > end memory. > So would virtio support 64bit address for vring virtqueue? The vring takes guest physical addresses and the C type is __u64 (see /usr/include/linux/virtio_ring.h). 64-bit addresses are fine. Stefan
Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue?
I know it is PFN, that is why I want to use the high 12bit for combined write in another reply, to fully address 64 bit address. Is the 16TB address space what it is originally designed for? If it is, that will basically solve the problem. I will try later. Thanks for reminder -Original Message- From: Laszlo Ersek [mailto:ler...@redhat.com] Sent: Wednesday, August 28, 2013 7:46 PM To: Xie, Huawei Cc: Stefan Hajnoczi; ru...@rustcorp.com.au; qemu-devel@nongnu.org; Stefan Hajnoczi Subject: Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue? On 08/28/13 11:22, Xie, Huawei wrote: > Hi Stefan: > I think you mention the descriptor address? I mean the vring PFN register. > /* A 32-bit r/w PFN for the currently selected queue */ > #define VIRTIO_PCI_QUEUE_PFN 8 > > And the linux driver sample code: > iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, > vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); > > -Original Message- > From: Stefan Hajnoczi [mailto:stefa...@gmail.com] > Sent: Wednesday, August 28, 2013 4:07 PM > To: Xie, Huawei > Cc: qemu-devel@nongnu.org; ru...@rustcorp.com.au; Stefan Hajnoczi > Subject: Re: [Qemu-devel] Would virtio support 64 bit address for vring > virtqueue? > > On Wed, Aug 28, 2013 at 03:18:39AM +, Xie, Huawei wrote: >> I am developing virtio user space poll mode network driver. We allocate >> vring physical memory from huge page. On VMs with less than 4GB memory, it >> works well. But on VMs with like 8GB memory, huge page are all allocated >> from high end memory. >> So would virtio support 64bit address for vring virtqueue? > > The vring takes guest physical addresses and the C type is __u64 (see > /usr/include/linux/virtio_ring.h). 64-bit addresses are fine. If you mean Queue Address field in the Virtio Header (section 2.2.2 in the virtio-0.9.5 specification), then please see 2.3 Virtqueue Configuration, step 3. You have to divide the guest-phys address by 4096 and store the quotient. The Queue Address field takes a page frame number, not a page frame address. This allows it to address up to 2^32 * 4096 == 16T bytes of RAM. Laszlo
[Qemu-devel] QEMU+SystemC
Hi all, Recently I am struggling in examing the effect of host kernel I/O stacks on high-speed SSD, especially the new PCIe SSD, without the availability of the necessary hardware. After several reading online I found out QEMU+SystemC model environment might be able to to solve my problem. Sorry I am new to QEMU. My main objective is to collect guest OS statistics based on the attached external SSD model. So simple SystemC stub will be the best for my case. Can anybody give me any suggestion on which QEMU variant with SystemC fit me the best? I will sincerely appreciate that if you could forward me the correct URL for download also. Thank you for your attention, Sean Xie
Re: [Qemu-devel] Disabling IRQ error
Dear Max, Sorry for the late reply. Thanks for your advice about lowering the irqs after raising them, and that fixed my problem. Thanks again for your kindness. Thanks, Simen δΊ 2013/09/11 17:29, Max Filippov ει: On Wed, Sep 11, 2013 at 12:12 PM, Xie Xianshan wrote: I want to add a new device "fpga" for e500, and trigger an interrupt IRQ3 while the register BB_INTR_REG which belongs to device "fpga" is wrote by the device driver of "fpga". For e500, IRQ3 is an external interrupt irq. According the debug log, the disabling error is encoutered during writing BB_INTR_REG register. - write BB_INTR_REG register - qemu_irq_raise() is called. - after serval minutes, the error message about disabling irq is displayed. - continue the next execution without error(with poll?) So your device raises IRQ, but it doesn't lower it. Real devices usually don't do that, they either generate a short pulse on the IRQ line (in case of edge-triggered IRQ) or raise IRQ line on some event and then lower it on a command from its driver (level-triggered IRQ). You can do the following to make your device behave that way: - make your fpga device capable of lowering its IRQ, e.g. by adding another register: static void fpga_write(FPGAState *s, unsigned int offset, uint32_t value, unsigned size) { switch(offset) { case BB_INTR_REG: qemu_irq_raise(s->irq); break; case BB_INTC_REG: qemu_irq_lower(s->irq); break; } } - provide an interrupt service routine in the linux driver for your fpga device that would check whether the interrupt was caused by its device, and if so lower the device's IRQ. Thanks. -- Max
Re: [Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint
On 01/28/2016 12:05 AM, Stefan Hajnoczi wrote: On Wed, Jan 13, 2016 at 05:18:27PM +0800, Changlong Xie wrote: diff --git a/blockjob.c b/blockjob.c index 80adb9d..0c8edfe 100644 --- a/blockjob.c +++ b/blockjob.c @@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job) QLIST_INSERT_HEAD(&txn->jobs, job, txn_list); block_job_txn_ref(txn); } + +void block_job_do_checkpoint(BlockJob *job, Error **errp) +{ +if (!job->driver->do_checkpoint) { +error_setg(errp, "The job %s doesn't support block checkpoint", + BlockJobType_lookup[job->driver->job_type]); +return; +} + +job->driver->do_checkpoint(job, errp); +} diff --git a/include/block/blockjob.h b/include/block/blockjob.h index d84ccd8..abdba7c 100644 --- a/include/block/blockjob.h +++ b/include/block/blockjob.h @@ -70,6 +70,9 @@ typedef struct BlockJobDriver { * never both. */ void (*abort)(BlockJob *job); + +/** Optional callback for job types that support checkpoint. */ +void (*do_checkpoint)(BlockJob *job, Error **errp); The COLO/replication-specific callbacks have been moved out of BlockDriver into their own replication struct. Similar reasoning applies to BlockJobDriver: The do_checkpoint() callback is only implemented by one type of job and its purpose is related to COLO rather than jobs. This is a strong indication that this shouldn't be part of the generic BlockJobDriver struct. Please drop changes to the generic blockjob interface. Instead, make backup_do_checkpoint() public and add assert(job->driver->type == BLOCK_JOB_TYPE_BACKUP) into the function. Then the replication filter can call backup_do_checkpoint() directly. Will fix it in next version. Thanks -Xie Stefan
Re: [Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS
On 01/27/2016 10:04 PM, Stefan Hajnoczi wrote: On Wed, Jan 13, 2016 at 05:18:28PM +0800, Changlong Xie wrote: From: Wen Congyang When opening BDS, we need to create backup jobs for image-fleecing. Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Stefan Hajnoczi Reviewed-by: Jeff Cody --- block/Makefile.objs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/Makefile.objs b/block/Makefile.objs index 58ef2ef..fa05f37 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o block-obj-y += write-threshold.o +block-obj-y += backup.o common-obj-y += stream.o common-obj-y += commit.o -common-obj-y += backup.o iscsi.o-cflags := $(LIBISCSI_CFLAGS) iscsi.o-libs := $(LIBISCSI_LIBS) The commit message and description seem outdated. I guess the purpose of this patch is to link the backup block job into all programs that use the block layer because you want to add a dependency on the it from core code. Will update it in next version. Thanks -Xie
Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote: On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote: On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote: On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote: +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp) +{ +Error *local_err = NULL; +int ret; + +if (!s->secondary_disk->job) { +error_setg(errp, "Backup job is cancelled unexpectedly"); +return; +} + +block_job_do_checkpoint(s->secondary_disk->job, &local_err); +if (local_err) { +error_propagate(errp, local_err); +return; +} + +ret = s->active_disk->drv->bdrv_make_empty(s->active_disk); What happens to in-flight requests to the active and hidden disks? we MUST call do_checkpoint() when the vm is stopped. Please document the environment under which the block replication callback functions run. OK I'm concerned that the bdrv_drain_all() in vm_stop() can take a long time if the disk is slow/failing. bdrv_drain_all() blocks until all in-flight I/O requests have completed. What does the Primary do if the Secondary becomes unresponsive? Actually, we knew this problem. But currently, there seems no better way to resolve it. If you have any ideas? +switch (s->mode) { +case REPLICATION_MODE_PRIMARY: +break; +case REPLICATION_MODE_SECONDARY: +s->active_disk = bs->file->bs; +if (!bs->file->bs->backing) { +error_setg(errp, "Active disk doesn't have backing file"); +return; +} + +s->hidden_disk = s->active_disk->backing->bs; +if (!s->hidden_disk->backing) { +error_setg(errp, "Hidden disk doesn't have backing file"); +return; +} + +s->secondary_disk = s->hidden_disk->backing->bs; +if (!s->secondary_disk->blk) { +error_setg(errp, "The secondary disk doesn't have block backend"); +return; +} Kevin: Is code allowed to stash away BlockDriverState pointers for convenience or should it keep the BdrvChild pointers instead? In order for replication to work as expected, the graph shouldn't change but for consistency maybe BdrvChild is best. I asked Kevin about this on IRC and he agreed that BdrvChild should be used instead of holding on to BlockDriverState * pointers. Although these pointers will not change during replication (if the op blockers are set up correctly), it's more consistent and certainly safer to go through BdrvChild. Ok +/* start backup job now */ +error_setg(&s->blocker, + "block device is in use by internal backup job"); +bdrv_op_block_all(s->top_bs, s->blocker); +bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker); +bdrv_ref(s->hidden_disk); Why is the explicit reference to hidden_disk (but not secondary_disk or active_disk) is necessary? IIRC, we should reference the backup target before calling backup_start(), and we will reference the backup source in backup_start(). I'm not sure why this is necessary since they are part of the backing chain. Just as Wen said, we should reference the backup target before calling backup_start() to protect it from destroying, if backup job is stopped unexpectedly. If it is necessary, please add a comment so it's clear why the reference is being taken. Ok Stefan
Re: [Qemu-devel] [PATCH v14 5/8] docs: block replication's description
On 02/02/2016 11:20 PM, Eric Blake wrote: On 01/13/2016 02:18 AM, Changlong Xie wrote: From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- docs/block-replication.txt | 229 + 1 file changed, 229 insertions(+) create mode 100644 docs/block-replication.txt diff --git a/docs/block-replication.txt b/docs/block-replication.txt new file mode 100644 index 000..d1a231e --- /dev/null +++ b/docs/block-replication.txt @@ -0,0 +1,229 @@ +Block replication + +Copyright Fujitsu, Corp. 2015 +Copyright (c) 2015 Intel Corporation +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD. Do you want to claim 2016 for any of this? Will update in next version. + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. + +Block replication is used for continuous checkpoints. It is designed +for COLO (COarse-grain LOck-stepping) where the Secondary VM is running. +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario, +where the Secondary VM is not running. + +This document gives an overview of block replication's design. + +== Background == +High availability solutions such as micro checkpoint and COLO will do +consecutive checkpoints. The VM state of Primary VM and Secondary VM is s/of Primary/of the Primary/ OK +identical right after a VM checkpoint, but becomes different as the VM +executes till the next checkpoint. To support disk contents checkpoint, +the modified disk contents in the Secondary VM must be buffered, and are +only dropped at next checkpoint time. To reduce the network transportation +effort at the time of checkpoint, the disk modification operations of s/at the time of/during a vmstate/ s/operations of/operations of the/ OK +Primary disk are asynchronously forwarded to the Secondary node. + +== Workflow == +== Architecture == + +6) The drive-backup job(sync=none) is run to allow hidden-disk to buffer Space before ( in English description. OK +any state that would otherwise be lost by the speculative write-through +of the NBD server into the secondary disk. So before block replication, +the primary disk and secondary disk should contain the same data. + +== Failure Handling == +== Usage == +Primary: + -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\ + children.0.file.filename=1.raw,\ + children.0.driver=raw + + Run qmp command in primary qemu: +{ 'execute': 'human-monitor-command', + 'arguments': { + 'command-line': 'drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=,file.port=,file.export=colo1,node-name=nbd_client1,if=none' Eww. We shouldn't ever have to pack a command line as . single QMP string that needs reparsing. Instead, you should pass the information as a nested QMP dictionary, something like: 'arguments': { 'remote-command': { 'command': 'drive_add', 'name': 'buddy', 'driver': 'replication', 'mode': 'primary', 'file': { 'driver': 'nbd', 'host': '', Hi Eric What is 'remote-command' here? I just tried below commands, but got some errors. {'execute': 'human-monitor-command', 'arguments': { 'command-line': { 'command': 'drive_add', 'name': 'buddy', 'driver': 'replication', 'mode': 'primary', 'if': 'none', 'node-name': 'primary_nbd_node', 'file': { 'driver': 'nbd', 'host': '192.168.3.2', 'port': '8889', 'export': 'colo-disk', } } } } {"error": {"class": "GenericError", "desc": "Invalid JSON syntax"}} 'blockdev-add' doesn't support 'nbd'. So we use 'drive_add' here, and it's a hmp command. If i'm right, there seems just one way to execute hmp commands in QMP: EQMP { .name = "human-monitor-command", .args_type = "command-line:s,cpu-index:i?", .mhandler.cmd_new = qmp_marshal_human_monitor_command, }, SQMP human-monitor-command - Execute a Human Monitor
Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints
On 02/01/2016 09:18 AM, Wen Congyang wrote: On 01/29/2016 06:47 PM, Dr. David Alan Gilbert wrote: * Wen Congyang (we...@cn.fujitsu.com) wrote: On 01/29/2016 06:07 PM, Dr. David Alan Gilbert wrote: * Wen Congyang (we...@cn.fujitsu.com) wrote: On 01/27/2016 07:03 PM, Dr. David Alan Gilbert wrote: Hi, I've got a block error if I kill the secondary. Start both primary & secondary kill -9 secondary qemu x_colo_lost_heartbeat on primary The guest sees a block error and the ext4 root switches to read-only. I gdb'd the primary with a breakpoint on quorum_report_bad; see backtrace below. (This is based on colo-v2.4-periodic-mode of the framework code with the block and network proxy merged in; so it could be my merging but I don't think so ?) (gdb) where #0 quorum_report_bad (node_name=0x7f2946a0892c "node0", ret=-5, acb=0x7f2946cb3910, acb=0x7f2946cb3910) at /root/colo/jan-2016/qemu/block/quorum.c:222 #1 0x7f2943b23058 in quorum_aio_cb (opaque=, ret=) at /root/colo/jan-2016/qemu/block/quorum.c:315 #2 0x7f2943b311be in bdrv_co_complete (acb=0x7f2946cb3f60) at /root/colo/jan-2016/qemu/block/io.c:2122 #3 0x7f2943ae777d in aio_bh_call (bh=) at /root/colo/jan-2016/qemu/async.c:64 #4 aio_bh_poll (ctx=ctx@entry=0x7f2945b771d0) at /root/colo/jan-2016/qemu/async.c:92 #5 0x7f2943af5090 in aio_dispatch (ctx=0x7f2945b771d0) at /root/colo/jan-2016/qemu/aio-posix.c:305 #6 0x7f2943ae756e in aio_ctx_dispatch (source=, callback=, user_data=) at /root/colo/jan-2016/qemu/async.c:231 #7 0x7f293b84a79a in g_main_context_dispatch () from /lib64/libglib-2.0.so.0 #8 0x7f2943af3a00 in glib_pollfds_poll () at /root/colo/jan-2016/qemu/main-loop.c:211 #9 os_host_main_loop_wait (timeout=) at /root/colo/jan-2016/qemu/main-loop.c:256 #10 main_loop_wait (nonblocking=) at /root/colo/jan-2016/qemu/main-loop.c:504 #11 0x7f29438529ee in main_loop () at /root/colo/jan-2016/qemu/vl.c:1945 #12 main (argc=, argv=, envp=) at /root/colo/jan-2016/qemu/vl.c:4707 (gdb) p s->num_children $1 = 2 (gdb) p acb->success_count $2 = 0 (gdb) p acb->is_read $5 = false Sorry for the late reply. No problem. What it the value of acb->count? (gdb) p acb->count $1 = 1 Note, the count is 1, not 2. Writing to children.0 is in flight. If writing to children.0 successes, the guest doesn't know this error. If secondary host is down, you should remove quorum's children.1. Otherwise, you will get I/O error event. Is that safe? If the secondary fails, do you always have time to issue the command to remove the children.1 before the guest sees the error? We will write to two children, and expect that writing to children.0 will success. If so, the guest doesn't know this error. You just get the I/O error event. I think children.0 is the disk, and that should be OK - so only the children.1/replication should be failing - so in that case why do I see the error? I don't know, and I will check the codes. The 'node0' in the backtrace above is the name of the replication, so it does look like the error is coming from the replication. No, the backtrace is just report an I/O error events to the management application. Anyway, I tried removing children.1 but it segfaults now, I guess the replication is unhappy: (qemu) x_block_change colo-disk0 -d children.1 (qemu) x_colo_lost_heartbeat Hmm, you should not remove the child before failover. I will check it how to avoid it in the codes. But you said 'If secondary host is down, you should remove quorum's children.1' - is that not what you meant? Yes, you should excute 'x_colo_lost_heartbeat' fist, and then excute 'x_block_change ... -d ...'. Hi david It seems we missed 'drive_del' command, and will document it in next version. Here is the right commands order: { "execute": "x-colo-lost-heartbeat" } { 'execute': 'x-blockdev-change', 'arguments': {'parent': 'colo-disk', 'child': 'children.1'}} { 'execute': 'human-monitor-command', 'arguments': {'command-line': 'drive_del x'}} Thanks -Xie 12973 Segmentation fault (core dumped) ./try/x86_64-softmmu/qemu-system-x86_64 -enable-kvm $console_param -S -boot c -m 4080 -smp 4 -machine pc-i440fx-2.5,accel=kvm -name debug-threads=on -trace events=trace-file -device virtio-rng-pci $block_param $net_param #0 0x7f0a398a864c in bdrv_stop_replication (bs=0x7f0a3b0a8430, failover=true, errp=0x7fff6a5c3420) at /root/colo/jan-2016/qemu/block.c:4426 (gdb) p drv $1 = (BlockDriver *) 0x5d2a it looks like the whole of bs is bogus. #1 0x7f0a398d87f6 in quorum_stop_replication (bs=, failover=, errp=) at /root/colo/jan-2016/qemu/block/quorum.c:1213 (gdb) p s->r
Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints
On 02/04/2016 05:07 PM, Dr. David Alan Gilbert wrote: * Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote: On 02/01/2016 09:18 AM, Wen Congyang wrote: On 01/29/2016 06:47 PM, Dr. David Alan Gilbert wrote: * Wen Congyang (we...@cn.fujitsu.com) wrote: On 01/29/2016 06:07 PM, Dr. David Alan Gilbert wrote: * Wen Congyang (we...@cn.fujitsu.com) wrote: On 01/27/2016 07:03 PM, Dr. David Alan Gilbert wrote: Hi, I've got a block error if I kill the secondary. Start both primary & secondary kill -9 secondary qemu x_colo_lost_heartbeat on primary The guest sees a block error and the ext4 root switches to read-only. I gdb'd the primary with a breakpoint on quorum_report_bad; see backtrace below. (This is based on colo-v2.4-periodic-mode of the framework code with the block and network proxy merged in; so it could be my merging but I don't think so ?) (gdb) where #0 quorum_report_bad (node_name=0x7f2946a0892c "node0", ret=-5, acb=0x7f2946cb3910, acb=0x7f2946cb3910) at /root/colo/jan-2016/qemu/block/quorum.c:222 #1 0x7f2943b23058 in quorum_aio_cb (opaque=, ret=) at /root/colo/jan-2016/qemu/block/quorum.c:315 #2 0x7f2943b311be in bdrv_co_complete (acb=0x7f2946cb3f60) at /root/colo/jan-2016/qemu/block/io.c:2122 #3 0x7f2943ae777d in aio_bh_call (bh=) at /root/colo/jan-2016/qemu/async.c:64 #4 aio_bh_poll (ctx=ctx@entry=0x7f2945b771d0) at /root/colo/jan-2016/qemu/async.c:92 #5 0x7f2943af5090 in aio_dispatch (ctx=0x7f2945b771d0) at /root/colo/jan-2016/qemu/aio-posix.c:305 #6 0x7f2943ae756e in aio_ctx_dispatch (source=, callback=, user_data=) at /root/colo/jan-2016/qemu/async.c:231 #7 0x7f293b84a79a in g_main_context_dispatch () from /lib64/libglib-2.0.so.0 #8 0x7f2943af3a00 in glib_pollfds_poll () at /root/colo/jan-2016/qemu/main-loop.c:211 #9 os_host_main_loop_wait (timeout=) at /root/colo/jan-2016/qemu/main-loop.c:256 #10 main_loop_wait (nonblocking=) at /root/colo/jan-2016/qemu/main-loop.c:504 #11 0x7f29438529ee in main_loop () at /root/colo/jan-2016/qemu/vl.c:1945 #12 main (argc=, argv=, envp=) at /root/colo/jan-2016/qemu/vl.c:4707 (gdb) p s->num_children $1 = 2 (gdb) p acb->success_count $2 = 0 (gdb) p acb->is_read $5 = false Sorry for the late reply. No problem. What it the value of acb->count? (gdb) p acb->count $1 = 1 Note, the count is 1, not 2. Writing to children.0 is in flight. If writing to children.0 successes, the guest doesn't know this error. If secondary host is down, you should remove quorum's children.1. Otherwise, you will get I/O error event. Is that safe? If the secondary fails, do you always have time to issue the command to remove the children.1 before the guest sees the error? We will write to two children, and expect that writing to children.0 will success. If so, the guest doesn't know this error. You just get the I/O error event. I think children.0 is the disk, and that should be OK - so only the children.1/replication should be failing - so in that case why do I see the error? I don't know, and I will check the codes. The 'node0' in the backtrace above is the name of the replication, so it does look like the error is coming from the replication. No, the backtrace is just report an I/O error events to the management application. Anyway, I tried removing children.1 but it segfaults now, I guess the replication is unhappy: (qemu) x_block_change colo-disk0 -d children.1 (qemu) x_colo_lost_heartbeat Hmm, you should not remove the child before failover. I will check it how to avoid it in the codes. But you said 'If secondary host is down, you should remove quorum's children.1' - is that not what you meant? Yes, you should excute 'x_colo_lost_heartbeat' fist, and then excute 'x_block_change ... -d ...'. Hi david Hi Xie, Thanks for the response. It seems we missed 'drive_del' command, and will document it in next version. Here is the right commands order: { "execute": "x-colo-lost-heartbeat" } { 'execute': 'x-blockdev-change', 'arguments': {'parent': 'colo-disk', 'child': 'children.1'}} { 'execute': 'human-monitor-command', 'arguments': {'command-line': 'drive_del x'}} OK, however, you should fix the seg fault if you don't issue the drive_del; qemu should never crash. (Also I still get the IO error in the guest if I do the x-colo-lost-heartbeat). Here is a quick fix, i just tested for several times. It work well to me. bugfix Signed-off-by: Changlong Xie diff --git a/block/quorum.c b/block/quorum.c index e5a7e4f..f4f1d28 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -458,6 +458,11 @@ static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes) if (c
[Qemu-devel] [PATCH] quorum: fix segfault when read fails in fifo mode
Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index a5ae4b8..0965277 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -295,6 +295,9 @@ static void quorum_aio_cb(void *opaque, int ret) quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov); } acb->vote_ret = ret; +if (ret < 0) { +acb->child_iter--; +} quorum_aio_finalize(acb); return; } -- 1.9.3
Re: [Qemu-devel] [PATCH] quorum: fix segfault when read fails in fifo mode
On 02/04/2016 08:24 PM, Kevin Wolf wrote: Am 04.02.2016 um 11:19 hat Changlong Xie geschrieben: Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index a5ae4b8..0965277 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -295,6 +295,9 @@ static void quorum_aio_cb(void *opaque, int ret) quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov); } acb->vote_ret = ret; +if (ret < 0) { +acb->child_iter--; +} quorum_aio_finalize(acb); return; } This looks semantically correct to me (but I'd like to have an Ack from Berto), but I would fix it above: We shouldn't do ++acb->child_iter in the first place if the new value is >= s->num_children. So instead of decrementing after the fact, just move the increment to inside the then branch above. Yes, will fix in next version. Thanks -Xie Kevin .
[Qemu-devel] [PATCH V2] quorum: fix segfault when read fails in fifo mode
Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/quorum.c b/block/quorum.c index a5ae4b8..11cc60b 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -286,7 +286,8 @@ static void quorum_aio_cb(void *opaque, int ret) if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) { /* We try to read next child in FIFO order if we fail to read */ -if (ret < 0 && ++acb->child_iter < s->num_children) { +if (ret < 0 && (acb->child_iter + 1) < s->num_children) { +acb->child_iter++; read_fifo_child(acb); return; } -- 1.9.3
[Qemu-devel] [PATCH v15 4/9] Link backup into block core
From: Wen Congyang Some programs that add a dependency on it will use the block layer directly. Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Stefan Hajnoczi Reviewed-by: Jeff Cody --- block/Makefile.objs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/Makefile.objs b/block/Makefile.objs index 58ef2ef..fa05f37 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o block-obj-y += write-threshold.o +block-obj-y += backup.o common-obj-y += stream.o common-obj-y += commit.o -common-obj-y += backup.o iscsi.o-cflags := $(LIBISCSI_CFLAGS) iscsi.o-libs := $(LIBISCSI_LIBS) -- 1.9.3
[Qemu-devel] [PATCH v15 6/9] auto complete active commit
From: Wen Congyang Auto complete mirror job in background to prevent from blocking synchronously Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/mirror.c| 13 + blockdev.c| 2 +- include/block/block_int.h | 3 ++- qemu-img.c| 2 +- 4 files changed, 13 insertions(+), 7 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 2c0edfa..79ec170 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -718,7 +718,8 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target, BlockCompletionFunc *cb, void *opaque, Error **errp, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base) + bool is_none_mode, BlockDriverState *base, + bool auto_complete) { MirrorBlockJob *s; BlockDriverState *replaced_bs; @@ -774,6 +775,9 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target, s->granularity = granularity; s->buf_size = ROUND_UP(buf_size, granularity); s->unmap = unmap; +if (auto_complete) { +s->should_complete = true; +} s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp); if (!s->dirty_bitmap) { @@ -815,14 +819,15 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, mirror_start_job(bs, target, replaces, speed, granularity, buf_size, on_source_error, on_target_error, unmap, cb, opaque, errp, - &mirror_job_driver, is_none_mode, base); + &mirror_job_driver, is_none_mode, base, false); } void commit_active_start(BlockDriverState *bs, BlockDriverState *base, int64_t speed, BlockdevOnError on_error, BlockCompletionFunc *cb, - void *opaque, Error **errp) + void *opaque, Error **errp, + bool auto_complete) { int64_t length, base_length; int orig_base_flags; @@ -863,7 +868,7 @@ void commit_active_start(BlockDriverState *bs, BlockDriverState *base, bdrv_ref(base); mirror_start_job(bs, base, NULL, speed, 0, 0, on_error, on_error, false, cb, opaque, &local_err, - &commit_active_job_driver, false, base); + &commit_active_job_driver, false, base, auto_complete); if (local_err) { error_propagate(errp, local_err); goto error_restore_flags; diff --git a/blockdev.c b/blockdev.c index 7c1d6da..fd7dce2 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3079,7 +3079,7 @@ void qmp_block_commit(const char *device, goto out; } commit_active_start(bs, base_bs, speed, on_error, block_job_cb, -bs, &local_err); +bs, &local_err, false); } else { commit_start(bs, base_bs, top_bs, speed, on_error, block_job_cb, bs, has_backing_file ? backing_file : NULL, &local_err); diff --git a/include/block/block_int.h b/include/block/block_int.h index e67b6fb..9eb1db6 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -638,13 +638,14 @@ void commit_start(BlockDriverState *bs, BlockDriverState *base, * @cb: Completion function for the job. * @opaque: Opaque pointer value passed to @cb. * @errp: Error object. + * @auto_complete: Auto complete the job. * */ void commit_active_start(BlockDriverState *bs, BlockDriverState *base, int64_t speed, BlockdevOnError on_error, BlockCompletionFunc *cb, - void *opaque, Error **errp); + void *opaque, Error **errp, bool auto_complete); /* * mirror_start: * @bs: Block device to operate on. diff --git a/qemu-img.c b/qemu-img.c index f121980..986cc6d 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -756,7 +756,7 @@ static int img_commit(int argc, char **argv) }; commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT, -common_block_job_cb, &cbi, &local_err); +common_block_job_cb, &cbi, &local_err, false); if (local_err) { goto done; } -- 1.9.3
[Qemu-devel] [PATCH v15 2/9] Store parent BDS in BdrvChild
From: Wen Congyang We need to access the parent BDS to get the root BDS. Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block.c | 1 + include/block/block_int.h | 1 + 2 files changed, 2 insertions(+) diff --git a/block.c b/block.c index 70ab625..c18b462 100644 --- a/block.c +++ b/block.c @@ -1208,6 +1208,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, BdrvChild *child = g_new(BdrvChild, 1); *child = (BdrvChild) { .bs = child_bs, +.parent = parent_bs, .name = g_strdup(child_name), .role = child_role, }; diff --git a/include/block/block_int.h b/include/block/block_int.h index 89ec138..9204d1e 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -363,6 +363,7 @@ extern const BdrvChildRole child_format; struct BdrvChild { BlockDriverState *bs; +BlockDriverState *parent; char *name; const BdrvChildRole *role; QLIST_ENTRY(BdrvChild) next; -- 1.9.3
[Qemu-devel] [PATCH v15 8/9] Implement new driver for block replication
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- block/Makefile.objs | 1 + block/replication.c | 594 2 files changed, 595 insertions(+) create mode 100644 block/replication.c diff --git a/block/Makefile.objs b/block/Makefile.objs index fa05f37..94c1d03 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o block-obj-y += write-threshold.o block-obj-y += backup.o +block-obj-y += replication.o common-obj-y += stream.o common-obj-y += commit.o diff --git a/block/replication.c b/block/replication.c new file mode 100644 index 000..4631667 --- /dev/null +++ b/block/replication.c @@ -0,0 +1,594 @@ +/* + * Replication Block filter + * + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD. + * Copyright (c) 2016 Intel Corporation + * Copyright (c) 2016 FUJITSU LIMITED + * + * Author: + * Wen Congyang + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include "qemu-common.h" +#include "block/nbd.h" +#include "block/blockjob.h" +#include "block/block_int.h" +#include "replication.h" + +typedef struct BDRVReplicationState { +ReplicationMode mode; +int replication_state; +BdrvChild *active_disk; +BdrvChild *hidden_disk; +BdrvChild *secondary_disk; +BlockDriverState *top_bs; +ReplicationState *rs; +Error *blocker; +int orig_hidden_flags; +int orig_secondary_flags; +int error; +} BDRVReplicationState; + +enum { +BLOCK_REPLICATION_NONE, /* block replication is not started */ +BLOCK_REPLICATION_RUNNING, /* block replication is running */ +BLOCK_REPLICATION_FAILOVER, /* failover is running in background */ +BLOCK_REPLICATION_FAILOVER_FAILED, /* failover failed */ +BLOCK_REPLICATION_DONE, /* block replication is done(after failover) */ +}; + +static void replication_start(ReplicationState *rs, ReplicationMode mode, + Error **errp); +static void replication_do_checkpoint(ReplicationState *rs, Error **errp); +static void replication_get_error(ReplicationState *rs, Error **errp); +static void replication_stop(ReplicationState *rs, bool failover, + Error **errp); + +#define REPLICATION_MODE"mode" +static QemuOptsList replication_runtime_opts = { +.name = "replication", +.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head), +.desc = { +{ +.name = REPLICATION_MODE, +.type = QEMU_OPT_STRING, +}, +{ /* end of list */ } +}, +}; + +static ReplicationOps replication_ops = { +.start = replication_start, +.checkpoint = replication_do_checkpoint, +.get_error = replication_get_error, +.stop = replication_stop, +}; + +static int replication_open(BlockDriverState *bs, QDict *options, +int flags, Error **errp) +{ +int ret; +BDRVReplicationState *s = bs->opaque; +Error *local_err = NULL; +QemuOpts *opts = NULL; +const char *mode; + +ret = -EINVAL; +opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort); +qemu_opts_absorb_qdict(opts, options, &local_err); +if (local_err) { +goto fail; +} + +mode = qemu_opt_get(opts, REPLICATION_MODE); +if (!mode) { +error_setg(&local_err, "Missing the option mode"); +goto fail; +} + +if (!strcmp(mode, "primary")) { +s->mode = REPLICATION_MODE_PRIMARY; +} else if (!strcmp(mode, "secondary")) { +s->mode = REPLICATION_MODE_SECONDARY; +} else { +error_setg(&local_err, + "The option mode's value should be primary or secondary"); +goto fail; +} + +s->rs = replication_new(bs, &replication_ops); + +ret = 0; + +fail: +qemu_opts_del(opts); +error_propagate(errp, local_err); + +return ret; +} + +static void replication_close(BlockDriverState *bs) +{ +BDRVReplicationState *s = bs->opaque; + +if (s->replication_state == BLOCK_REPLICATION_RUNNING) { +replication_stop(s->rs, false, NULL); +replication_remove(s->rs); +} +} + +static int64_t replication_getlength(BlockDriverState *bs) +{ +return bdrv_getlength(bs->file->bs); +} + +static int replication_get_io_status(BDRVReplicationState *s) +{ +switch (s->replication_state) { +case BLOCK_REPLICATION_NONE: +return -EIO; +case BLOCK_REPLICATION_RUNNING: +return 0; +case BLOCK_REPLICATION_FAILOVER: +return s->mode == REP
[Qemu-devel] [PATCH v15 1/9] unblock backup operations in backing file
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/block.c b/block.c index a285de5..70ab625 100644 --- a/block.c +++ b/block.c @@ -1279,6 +1279,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd) /* Otherwise we won't be able to commit due to check in bdrv_commit */ bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET, bs->backing_blocker); +/* + * We do backup in 3 ways: + * 1. drive backup + *The target bs is new opened, and the source is top BDS + * 2. blockdev backup + *Both the source and the target are top BDSes. + * 3. internal backup(used for block replication) + *Both the source and the target are backing file + * + * In case 1, and 2, the backing file is neither the source nor + * the target. + * In case 3, we will block the top BDS, so there is only one block + * job for the top BDS and its backing chain. + */ +bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE, +bs->backing_blocker); +bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET, +bs->backing_blocker); out: bdrv_refresh_limits(bs, NULL); } -- 1.9.3
[Qemu-devel] [PATCH v15 0/9] Block replication for continuous checkpoints
Block replication is a very important feature which is used for continuous checkpoints(for example: COLO). You can get the detailed information about block replication from here: http://wiki.qemu.org/Features/BlockReplication Usage: Please refer to docs/block-replication.txt This patch series is based on the following patch series: 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html Patch status: 1. Acked patches: none 2. Reviewed patches: patch 4 3. Updated patches: patch 3, 4, 5, 7, 8 Note: patch 7 is a new patch that introduces genernal start/stop/checkpoint/ get_error APIs. You can get the patch here: https://github.com/Pating/qemu/tree/changlox/block-replication-v15 You can get the patch with framework here: https://github.com/Pating/qemu/tree/changlox/colo_framework_v14 TODO: 1. Continuous block replication. It will be started after basic functions are accepted. Changs Log: V15: 1. Rebase to the newest codes 2. Fix typos and coding style addresed Eric's comments 3. Address Stefan's comments 1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver 2) Update the message and description for [PATCH 4/9] 3) Make replication_(start/stop/do_checkpoint)_all as global interfaces 4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks 5) Use BdrvChild instead of holding on to BlockDriverState * pointers 4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771 5. Introduce replication_get_error_all to check replication status 6. Remove useless discard interface V14: 1. Implement auto complete active commit 2. Implement active commit block job for replication.c 3. Address the comments from Stefan, add replication-specific API and data structure, also remove old block layer APIs V13: 1. Rebase to the newest codes 2. Remove redundant marcos and semicolon in replication.c 3. Fix typos in block-replication.txt V12: 1. Rebase to the newest codes 2. Use backing reference to replcace 'allow-write-backing-file' V11: 1. Reopen the backing file when starting blcok replication if it is not opened in R/W mode 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET when opening backing file 3. Block the top BDS so there is only one block job for the top BDS and its backing chain. V10: 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing reference. 2. Address the comments from Eric Blake V9: 1. Update the error messages 2. Rebase to the newest qemu 3. Split child add/delete support. These patches are sent in another patchset. V8: 1. Address Alberto Garcia's comments V7: 1. Implement adding/removing quorum child. Remove the option non-connect. 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion V6: 1. Rebase to the newest qemu. V5: 1. Address the comments from Gong Lei 2. Speed the failover up. The secondary vm can take over very quickly even if there are too many I/O requests. V4: 1. Introduce a new driver replication to avoid touch nbd and qcow2. V3: 1: use error_setg() instead of error_set() 2. Add a new block job API 3. Active disk, hidden disk and nbd target uses the same AioContext 4. Add a testcase to test new hbitmap API V2: 1. Redesign the secondary qemu(use image-fleecing) 2. Use Error objects to return error message 3. Address the comments from Max Reitz and Eric Blake Changlong Xie (1): Introduce new APIs to do replication operation Wen Congyang (8): unblock backup operations in backing file Store parent BDS in BdrvChild Backup: clear all bitmap when doing block checkpoint Link backup into block core docs: block replication's description auto complete active commit Implement new driver for block replication support replication driver in blockdev-add Makefile.objs | 1 + block.c| 19 ++ block/Makefile.objs| 3 +- block/backup.c | 15 ++ block/mirror.c | 13 +- block/replication.c| 594 + blockdev.c | 2 +- docs/block-replication.txt | 238 ++ include/block/block_int.h | 6 +- qapi/block-core.json | 33 ++- qemu-img.c | 2 +- replication.c | 94 +++ replication.h | 53 13 files changed, 1063 insertions(+), 10 deletions(-) create mode 100644 block/replication.c create mode 100644 docs/block-replication.txt create mode 100644 replication.c create mode 100644 replication.h -- 1.9.3
[Qemu-devel] [PATCH v15 9/9] support replication driver in blockdev-add
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Eric Blake --- qapi/block-core.json | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 12362b8..af723d5 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -247,6 +247,7 @@ # 2.2: 'archipelago' added, 'cow' dropped # 2.3: 'host_floppy' deprecated # 2.5: 'host_floppy' dropped +# 2.6: 'replication' added # # @backing_file: #optional the name of the backing file (for copy-on-write) # @@ -1567,6 +1568,7 @@ # Drivers that are supported in block device operations. # # @host_device, @host_cdrom: Since 2.1 +# @replication: Since 2.6 # # Since: 2.0 ## @@ -1574,8 +1576,8 @@ 'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop', 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device', 'http', 'https', 'null-aio', 'null-co', 'parallels', -'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx', -'vmdk', 'vpc', 'vvfat' ] } +'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication', +'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } ## # @BlockdevOptionsBase @@ -2015,6 +2017,19 @@ { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] } ## +# @BlockdevOptionsReplication +# +# Driver specific block device options for replication +# +# @mode: the replication mode +# +# Since: 2.6 +## +{ 'struct': 'BlockdevOptionsReplication', + 'base': 'BlockdevOptionsGenericFormat', + 'data': { 'mode': 'ReplicationMode' } } + +## # @BlockdevOptions # # Options for creating a block device. @@ -2051,6 +2066,7 @@ 'quorum': 'BlockdevOptionsQuorum', 'raw':'BlockdevOptionsGenericFormat', # TODO rbd: Wait for structured options + 'replication':'BlockdevOptionsReplication', # TODO sheepdog: Wait for structured options # TODO ssh: Should take InetSocketAddress for 'host'? 'tftp': 'BlockdevOptionsFile', -- 1.9.3
[Qemu-devel] [PATCH v15 3/9] Backup: clear all bitmap when doing block checkpoint
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- block/backup.c| 15 +++ include/block/block_int.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/block/backup.c b/block/backup.c index 00cafdb..841ec27 100644 --- a/block/backup.c +++ b/block/backup.c @@ -251,6 +251,21 @@ static void backup_abort(BlockJob *job) } } +void backup_do_checkpoint(BlockJob *job, Error **errp) +{ +BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common); + +assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP); + +if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +error_setg(errp, "The backup job only supports block checkpoint in" + " sync=none mode"); +return; +} + +hbitmap_reset_all(backup_job->bitmap); +} + static const BlockJobDriver backup_job_driver = { .instance_size = sizeof(BackupBlockJob), .job_type = BLOCK_JOB_TYPE_BACKUP, diff --git a/include/block/block_int.h b/include/block/block_int.h index 9204d1e..e67b6fb 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -700,6 +700,8 @@ void backup_start(BlockDriverState *bs, BlockDriverState *target, BlockCompletionFunc *cb, void *opaque, BlockJobTxn *txn, Error **errp); +void backup_do_checkpoint(BlockJob *job, Error **errp); + void blk_set_bs(BlockBackend *blk, BlockDriverState *bs); void blk_dev_change_media_cb(BlockBackend *blk, bool load); -- 1.9.3
[Qemu-devel] [PATCH v15 7/9] Introduce new APIs to do replication operation
Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- Makefile.objs| 1 + qapi/block-core.json | 13 replication.c| 94 replication.h| 53 + 4 files changed, 161 insertions(+) create mode 100644 replication.c create mode 100644 replication.h diff --git a/Makefile.objs b/Makefile.objs index 06b95c7..a8c74b7 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o block-obj-$(CONFIG_WIN32) += aio-win32.o block-obj-y += block/ block-obj-y += qemu-io-cmds.o +block-obj-y += replication.o block-obj-m = block/ diff --git a/qapi/block-core.json b/qapi/block-core.json index 7e9e8fe..12362b8 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2002,6 +2002,19 @@ '*read-pattern': 'QuorumReadPattern' } } ## +# @ReplicationMode +# +# An enumeration of replication modes. +# +# @primary: Primary mode, the vm's state will be sent to secondary QEMU. +# +# @secondary: Secondary mode, receive the vm's state from primary QEMU. +# +# Since: 2.6 +## +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] } + +## # @BlockdevOptions # # Options for creating a block device. diff --git a/replication.c b/replication.c new file mode 100644 index 000..e8ac2f0 --- /dev/null +++ b/replication.c @@ -0,0 +1,94 @@ +/* + * Replication filter + * + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD. + * Copyright (c) 2016 Intel Corporation + * Copyright (c) 2016 FUJITSU LIMITED + * + * Author: + * Wen Congyang + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include "replication.h" + +static QLIST_HEAD(, ReplicationState) replication_states; + +ReplicationState *replication_new(void *opaque, ReplicationOps *ops) +{ +ReplicationState *rs; + +rs = g_new0(ReplicationState, 1); +rs->opaque = opaque; +rs->ops = ops; +QLIST_INSERT_HEAD(&replication_states, rs, node); + +return rs; +} + +void replication_remove(ReplicationState *rs) +{ +QLIST_REMOVE(rs, node); +g_free(rs); +} + +/* + * The caller of the function *MUST* make sure vm stopped + */ +void replication_start_all(ReplicationMode mode, Error **errp) +{ +ReplicationState *rs, *next; + +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) { +if (rs->ops && rs->ops->start) { +rs->ops->start(rs, mode, errp); +} +if (*errp != NULL) { +return; +} +} +} + +void replication_do_checkpoint_all(Error **errp) +{ +ReplicationState *rs, *next; + +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) { +if (rs->ops && rs->ops->checkpoint) { +rs->ops->checkpoint(rs, errp); +} +if (*errp != NULL) { +return; +} +} +} + +void replication_get_error_all(Error **errp) +{ +ReplicationState *rs, *next; + +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) { +if (rs->ops && rs->ops->get_error) { +rs->ops->get_error(rs, errp); +} +if (*errp != NULL) { +return; +} +} +} + +void replication_stop_all(bool failover, Error **errp) +{ +ReplicationState *rs, *next; + +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) { +if (rs->ops && rs->ops->stop) { +rs->ops->stop(rs, failover, errp); +} +if (*errp != NULL) { +return; +} +} +} diff --git a/replication.h b/replication.h new file mode 100644 index 000..faea649 --- /dev/null +++ b/replication.h @@ -0,0 +1,53 @@ +/* + * Replication filter + * + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD. + * Copyright (c) 2016 Intel Corporation + * Copyright (c) 2016 FUJITSU LIMITED + * + * Author: + * Wen Congyang + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#ifndef REPLICATION_H +#define REPLICATION_H + +#include "sysemu/sysemu.h" + +typedef struct ReplicationOps ReplicationOps; +typedef struct ReplicationState ReplicationState; +typedef void (*Start)(ReplicationState *rs, ReplicationMode mode, Error **errp); +typedef void (*Stop)(ReplicationState *rs, bool failover, Error **errp); +typedef void (*Checkpoint)(ReplicationState *rs, Error **errp); +typedef void (*GetError)(ReplicationState *rs, Error **errp); + +struct ReplicationState { +void *opaque; +ReplicationOps *ops; +QLIST_ENTRY(ReplicationState) node; +}; + +struct ReplicationOps{ +
[Qemu-devel] [PATCH v15 5/9] docs: block replication's description
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- docs/block-replication.txt | 238 + 1 file changed, 238 insertions(+) create mode 100644 docs/block-replication.txt diff --git a/docs/block-replication.txt b/docs/block-replication.txt new file mode 100644 index 000..d8e20c7 --- /dev/null +++ b/docs/block-replication.txt @@ -0,0 +1,238 @@ +Block replication + +Copyright Fujitsu, Corp. 2016 +Copyright (c) 2016 Intel Corporation +Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD. + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. + +Block replication is used for continuous checkpoints. It is designed +for COLO (COarse-grain LOck-stepping) where the Secondary VM is running. +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario, +where the Secondary VM is not running. + +This document gives an overview of block replication's design. + +== Background == +High availability solutions such as micro checkpoint and COLO will do +consecutive checkpoints. The VM state of the Primary and Secondary VM is +identical right after a VM checkpoint, but becomes different as the VM +executes till the next checkpoint. To support disk contents checkpoint, +the modified disk contents in the Secondary VM must be buffered, and are +only dropped at next checkpoint time. To reduce the network transportation +effort during a vmstate checkpoint, the disk modification operations of +the Primary disk are asynchronously forwarded to the Secondary node. + +== Workflow == +The following is the image of block replication workflow: + ++--+++ +|Primary Write Requests||Secondary Write Requests| ++--+++ + | | + | (4) + | V + | /-\ + | Copy and Forward| | + |-(1)--+ | Disk Buffer | + | | | | + | (3) \-/ + | speculative ^ + |write through(2) + | | | + V V | + +--+ ++ + | Primary Disk | | Secondary Disk | + +--+ ++ + +1) Primary write requests will be copied and forwarded to Secondary + QEMU. +2) Before Primary write requests are written to Secondary disk, the + original sector content will be read from Secondary disk and + buffered in the Disk buffer, but it will not overwrite the existing + sector content (it could be from either "Secondary Write Requests" or + previous COW of "Primary Write Requests") in the Disk buffer. +3) Primary write requests will be written to Secondary disk. +4) Secondary write requests will be buffered in the Disk buffer and it + will overwrite the existing sector content in the buffer. + +== Architecture == +We are going to implement block replication from many basic +blocks that are already in QEMU. + + virtio-blk || + ^||.-- + |||| Secondary +1 Quorum ||'-- + / \ || +/\|| + Primary2 filter + disk ^ virtio-blk + | ^ +3 NBD ---> 3 NBD | +client|| server 2 filter + ||^ ^ +. ||| | +Primary | || Secondary disk <- hidden-disk 5 <- active-disk 4 +' ||| backing^ backing + ||| | + ||| | + ||'-' +
Re: [Qemu-devel] [PATCH v15 7/9] Introduce new APIs to do replication operation
On 02/15/2016 08:57 AM, Hailiang Zhang wrote: On 2016/2/5 12:18, Changlong Xie wrote: Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- Makefile.objs| 1 + qapi/block-core.json | 13 replication.c| 94 replication.h| 53 + 4 files changed, 161 insertions(+) create mode 100644 replication.c create mode 100644 replication.h diff --git a/Makefile.objs b/Makefile.objs index 06b95c7..a8c74b7 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o block-obj-$(CONFIG_WIN32) += aio-win32.o block-obj-y += block/ block-obj-y += qemu-io-cmds.o +block-obj-y += replication.o block-obj-m = block/ diff --git a/qapi/block-core.json b/qapi/block-core.json index 7e9e8fe..12362b8 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2002,6 +2002,19 @@ '*read-pattern': 'QuorumReadPattern' } } ## +# @ReplicationMode +# +# An enumeration of replication modes. +# +# @primary: Primary mode, the vm's state will be sent to secondary QEMU. +# +# @secondary: Secondary mode, receive the vm's state from primary QEMU. +# +# Since: 2.6 +## +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] } + +## # @BlockdevOptions # # Options for creating a block device. diff --git a/replication.c b/replication.c new file mode 100644 index 000..e8ac2f0 --- /dev/null +++ b/replication.c @@ -0,0 +1,94 @@ +/* + * Replication filter + * + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD. + * Copyright (c) 2016 Intel Corporation + * Copyright (c) 2016 FUJITSU LIMITED + * + * Author: + * Wen Congyang + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include "replication.h" + +static QLIST_HEAD(, ReplicationState) replication_states; + +ReplicationState *replication_new(void *opaque, ReplicationOps *ops) +{ +ReplicationState *rs; + +rs = g_new0(ReplicationState, 1); +rs->opaque = opaque; +rs->ops = ops; +QLIST_INSERT_HEAD(&replication_states, rs, node); + +return rs; +} + +void replication_remove(ReplicationState *rs) +{ +QLIST_REMOVE(rs, node); +g_free(rs); +} + +/* + * The caller of the function *MUST* make sure vm stopped + */ +void replication_start_all(ReplicationMode mode, Error **errp) +{ Is this public API is only used for block ? If yes, I'd like it with a 'block_' prefix. +ReplicationState *rs, *next; + +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) { +if (rs->ops && rs->ops->start) { +rs->ops->start(rs, mode, errp); +} +if (*errp != NULL) { This is incorrect, you miss checking if errp is NULL, if errp is NULL, there will be an error that accessing memory at address 0x0. Same with other places in this patch. You are right, will fix in next version. Thanks -Xie +return; +} +} +} + +void replication_do_checkpoint_all(Error **errp) +{ +ReplicationState *rs, *next; + +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) { +if (rs->ops && rs->ops->checkpoint) { +rs->ops->checkpoint(rs, errp); +} +if (*errp != NULL) { +return; +} +} +} + +void replication_get_error_all(Error **errp) +{ +ReplicationState *rs, *next; + +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) { +if (rs->ops && rs->ops->get_error) { +rs->ops->get_error(rs, errp); +} +if (*errp != NULL) { +return; +} +} +} + +void replication_stop_all(bool failover, Error **errp) +{ +ReplicationState *rs, *next; + +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) { +if (rs->ops && rs->ops->stop) { +rs->ops->stop(rs, failover, errp); +} +if (*errp != NULL) { +return; +} +} +} diff --git a/replication.h b/replication.h new file mode 100644 index 000..faea649 --- /dev/null +++ b/replication.h @@ -0,0 +1,53 @@ +/* + * Replication filter + * + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD. + * Copyright (c) 2016 Intel Corporation + * Copyright (c) 2016 FUJITSU LIMITED + * + * Author: + * Wen Congyang + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#ifndef REPLICATION_H +#define REPLICATION_H + +#include "sysemu/sysemu.h" + +typedef struct ReplicationOps ReplicationOps; +typedef struct ReplicationState ReplicationState; +
Re: [Qemu-devel] [Qemu-block] [PATCH v9 3/3] qmp: add monitor command to add/remove a child
On 02/11/2016 02:02 AM, Max Reitz wrote: On 25.12.2015 10:22, Changlong Xie wrote: From: Wen Congyang The new QMP command name is x-blockdev-change. It's just for adding/removing quorum's child now, and doesn't support all kinds of children, all kinds of operations, nor all block drivers. So it is experimental now. Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- blockdev.c | 54 qapi/block-core.json | 23 ++ qmp-commands.hx | 47 + 3 files changed, 124 insertions(+) diff --git a/blockdev.c b/blockdev.c index 64dbfeb..4e62fdf 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3836,6 +3836,60 @@ out: aio_context_release(aio_context); } +static BlockDriverState *bdrv_find_child(BlockDriverState *parent_bs, + const char *child_name) +{ +BdrvChild *child; + +QLIST_FOREACH(child, &parent_bs->children, next) { +if (strcmp(child->name, child_name) == 0) { +return child->bs; +} +} + +return NULL; +} + +void qmp_x_blockdev_change(const char *parent, bool has_child, + const char *child, bool has_node, + const char *node, Error **errp) +{ +BlockDriverState *parent_bs, *child_bs = NULL, *new_bs = NULL; + +parent_bs = bdrv_lookup_bs(parent, parent, errp); +if (!parent_bs) { +return; +} + +if (has_child == has_node) { +if (has_child) { +error_setg(errp, "The paramter child and node is conflict"); "The parameters child and node are in conflict" Or, more naturally: "child and node may not be specified at the same time" OK +} else { +error_setg(errp, "Either child or node should be specified"); s/should/must/ Ok +} +return; +} + +if (has_child) { +child_bs = bdrv_find_child(parent_bs, child); +if (!child_bs) { +error_setg(errp, "Node '%s' doesn't have child %s", s/doesn't/does not/ (This is a personal opinion, but a pretty strong one.) Also, if you put quotes around the node name, maybe you should do the same around the child name. OK + parent, child); +return; +} +bdrv_del_child(parent_bs, child_bs, errp); +} + +if (has_node) { +new_bs = bdrv_find_node(node); +if (!new_bs) { +error_setg(errp, "Node '%s' not found", node); +return; +} +bdrv_add_child(parent_bs, new_bs, errp); +} +} + BlockJobInfoList *qmp_query_block_jobs(Error **errp) { BlockJobInfoList *head = NULL, **p_next = &head; diff --git a/qapi/block-core.json b/qapi/block-core.json index 1a5d9ce..fe63c6d 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2408,3 +2408,26 @@ ## { 'command': 'block-set-write-threshold', 'data': { 'node-name': 'str', 'write-threshold': 'uint64' } } + +## +# @x-blockdev-change +# +# Dynamically reconfigure the block driver state graph. It can be used +# to add, remove, insert or replace a block driver state. Currently only I'd prefer "graph node" over BDS in this line. ok +# the Quorum driver implements this feature to add or remove its child. +# This is useful to fix a broken quorum child. +# I'd like a list here what this command does depending on the parameters given. For instance: If @(new-)node is specified, it will be inserted under @parent. @child may not be specified in this case. If both @parent and @child are specified but @(new-)node is not, @child will be detached from @parent. ok +# @parent: the id or name of the node that will be changed. I don't know. The parent will actually not be changed, it's an edge that will be changed; and the parent is the parent node of that edge. Just put @parent: the id or name of the parent node here. Yes, you are right here. +# +# @child: #optional the name of the child that will be deleted. For now. But maybe that will change in the future. Generally, it is just the child node of the edge that will be changed. So just putting @child: #optional the name of a child under the given parent node (Let's hope this is clear enough that people know that this is not a node name.) ok +# +# @node: #optional the name of the node will be added. Maybe this should be named new-node instead. Also: "...of the node that will be added." Yes, it's more clear. +# +# Note: this command is experimental, and its API is not stable. +# +# Since: 2.6 +## +{ 'command': 'x-blockde
Re: [Qemu-devel] [PATCH V2] quorum: fix segfault when read fails in fifo mode
On 02/05/2016 11:46 PM, Eric Blake wrote: On 02/04/2016 07:25 PM, Changlong Xie wrote: Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/quorum.c b/block/quorum.c index a5ae4b8..11cc60b 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -286,7 +286,8 @@ static void quorum_aio_cb(void *opaque, int ret) if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) { /* We try to read next child in FIFO order if we fail to read */ -if (ret < 0 && ++acb->child_iter < s->num_children) { +if (ret < 0 && (acb->child_iter + 1) < s->num_children) { Could shorten this as if (ret < 0 && acb->child_iter <= s->num_children) { but I'm not sure it's worth the mental gymnastics. Hi Eric Just the same to me. Since it has been applied to block branch. Let's keep the original one. Thanks -Xie
Re: [Qemu-devel] [PATCH v15 0/9] Block replication for continuous checkpoints
Hi all It seems no any review during the long Spring festival. Ping :) Thanks -Xie On 02/05/2016 12:17 PM, Changlong Xie wrote: Block replication is a very important feature which is used for continuous checkpoints(for example: COLO). You can get the detailed information about block replication from here: http://wiki.qemu.org/Features/BlockReplication Usage: Please refer to docs/block-replication.txt This patch series is based on the following patch series: 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html Patch status: 1. Acked patches: none 2. Reviewed patches: patch 4 3. Updated patches: patch 3, 4, 5, 7, 8 Note: patch 7 is a new patch that introduces genernal start/stop/checkpoint/ get_error APIs. You can get the patch here: https://github.com/Pating/qemu/tree/changlox/block-replication-v15 You can get the patch with framework here: https://github.com/Pating/qemu/tree/changlox/colo_framework_v14 TODO: 1. Continuous block replication. It will be started after basic functions are accepted. Changs Log: V15: 1. Rebase to the newest codes 2. Fix typos and coding style addresed Eric's comments 3. Address Stefan's comments 1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver 2) Update the message and description for [PATCH 4/9] 3) Make replication_(start/stop/do_checkpoint)_all as global interfaces 4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks 5) Use BdrvChild instead of holding on to BlockDriverState * pointers 4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771 5. Introduce replication_get_error_all to check replication status 6. Remove useless discard interface V14: 1. Implement auto complete active commit 2. Implement active commit block job for replication.c 3. Address the comments from Stefan, add replication-specific API and data structure, also remove old block layer APIs V13: 1. Rebase to the newest codes 2. Remove redundant marcos and semicolon in replication.c 3. Fix typos in block-replication.txt V12: 1. Rebase to the newest codes 2. Use backing reference to replcace 'allow-write-backing-file' V11: 1. Reopen the backing file when starting blcok replication if it is not opened in R/W mode 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET when opening backing file 3. Block the top BDS so there is only one block job for the top BDS and its backing chain. V10: 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing reference. 2. Address the comments from Eric Blake V9: 1. Update the error messages 2. Rebase to the newest qemu 3. Split child add/delete support. These patches are sent in another patchset. V8: 1. Address Alberto Garcia's comments V7: 1. Implement adding/removing quorum child. Remove the option non-connect. 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion V6: 1. Rebase to the newest qemu. V5: 1. Address the comments from Gong Lei 2. Speed the failover up. The secondary vm can take over very quickly even if there are too many I/O requests. V4: 1. Introduce a new driver replication to avoid touch nbd and qcow2. V3: 1: use error_setg() instead of error_set() 2. Add a new block job API 3. Active disk, hidden disk and nbd target uses the same AioContext 4. Add a testcase to test new hbitmap API V2: 1. Redesign the secondary qemu(use image-fleecing) 2. Use Error objects to return error message 3. Address the comments from Max Reitz and Eric Blake Changlong Xie (1): Introduce new APIs to do replication operation Wen Congyang (8): unblock backup operations in backing file Store parent BDS in BdrvChild Backup: clear all bitmap when doing block checkpoint Link backup into block core docs: block replication's description auto complete active commit Implement new driver for block replication support replication driver in blockdev-add Makefile.objs | 1 + block.c| 19 ++ block/Makefile.objs| 3 +- block/backup.c | 15 ++ block/mirror.c | 13 +- block/replication.c| 594 + blockdev.c | 2 +- docs/block-replication.txt | 238 ++ include/block/block_int.h | 6 +- qapi/block-core.json | 33 ++- qemu-img.c | 2 +- replication.c | 94 +++ replication.h | 53 13 files changed, 1063 insertions(+), 10 deletions(-) create mode 100644 block/replication.c create mode 100644 docs/block-replication.txt create mode 100644 replication.c create mode 100644 replication.h
[Qemu-devel] [PATCH 1/1] quorum: change vote rules for 64 bits hash
Before: 1) vote_count > max: winner = candidate, update max 2) vote_count <= max: do nothing Current: 1) vote_count > max: winner = candidate, update max 2) vote_count = max: compare the value of winner with candidate, if candidate->value.l == 0, winner = candidate, else do nothing 3) vote_count < max: do nothing Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index a5ae4b8..e431ff4 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -446,7 +446,7 @@ static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash) return 0; } -static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes) +static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes, bool vote_error) { int max = 0; QuorumVoteVersion *candidate, *winner = NULL; @@ -455,6 +455,12 @@ static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes) if (candidate->vote_count > max) { max = candidate->vote_count; winner = candidate; +continue; +} +/* For 64 bit hash */ +if (vote_error == true && candidate->vote_count == max +&& candidate->value.l == 0) { +winner = candidate; } } @@ -544,7 +550,7 @@ static int quorum_vote_error(QuorumAIOCB *acb) } if (error) { -winner = quorum_get_vote_winner(&error_votes); +winner = quorum_get_vote_winner(&error_votes, false); ret = winner->value.l; } @@ -609,7 +615,7 @@ static bool quorum_vote(QuorumAIOCB *acb) } /* vote to select the most represented version */ -winner = quorum_get_vote_winner(&acb->votes); +winner = quorum_get_vote_winner(&acb->votes, false); /* if the winner count is smaller than threshold the read fails */ if (winner->vote_count < s->threshold) { @@ -769,7 +775,7 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) quorum_count_vote(&error_votes, &result_value, i); } -winner = quorum_get_vote_winner(&error_votes); +winner = quorum_get_vote_winner(&error_votes, true); result = winner->value.l; quorum_free_vote_list(&error_votes); -- 1.9.3
[Qemu-devel] [PATCH 0/1] change quorum vote rules for 64-bits hash
If quorum has two children(A, B). A do flush sucessfully, but B flush failed. We MUST choice A as winner, otherwise we will get following errors: {"timestamp": {"seconds": 1455641588, "microseconds": 415937}, "event": "BLOCK_IO_ERROR", "data": {"device": "colo-disk", "nospace": false, "reason": "Bad file descriptor", "operation": "write", "action": "report"}} And the filesystem of guest became read-only with following errors: [xxx] end_request: I/O error, dev vda, sector 11159960 [xxx] Aborting journal on device vda3-8 [xxx] EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal [xxx] EXT4-fs (vda3): Remounting filesystem read-only [Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.html Changlong Xie (1): quorum: change vote rules for 64 bits hash block/quorum.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) -- 1.9.3
Re: [Qemu-devel] [PATCH 1/1] quorum: change vote rules for 64 bits hash
On 02/16/2016 01:10 AM, Eric Blake wrote: On 02/15/2016 02:52 AM, Changlong Xie wrote: Before: 1) vote_count > max: winner = candidate, update max 2) vote_count <= max: do nothing Current: 1) vote_count > max: winner = candidate, update max 2) vote_count = max: compare the value of winner with candidate, if candidate->value.l == 0, winner = candidate, else do nothing 3) vote_count < max: do nothing This says what you did, but not why. Can you demonstrate a scenario that is broken without the patch, as part of your commit message? Surely, will do in next version. Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index a5ae4b8..e431ff4 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -446,7 +446,7 @@ static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash) return 0; } -static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes) +static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes, bool vote_error) Long line. Please wrap things to stay within 80 columns. Ok { int max = 0; QuorumVoteVersion *candidate, *winner = NULL; @@ -455,6 +455,12 @@ static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes) if (candidate->vote_count > max) { max = candidate->vote_count; winner = candidate; +continue; +} +/* For 64 bit hash */ +if (vote_error == true && candidate->vote_count == max s/ == true// (no need to do a redundant comparison of a bool against a bool). Ok Thanks -Xie
[Qemu-devel] [PATCH v2 0/1] change quorum vote rules for 64-bits hash
If quorum has two children(A, B). A do flush sucessfully, but B flush failed. We MUST choice A as winner, otherwise we will get following errors: {"timestamp": {"seconds": 1455641588, "microseconds": 415937}, "event": "BLOCK_IO_ERROR", "data": {"device": "colo-disk", "nospace": false, "reason": "Bad file descriptor", "operation": "write", "action": "report"}} And the filesystem of guest became read-only with following errors: [xxx] end_request: I/O error, dev vda, sector 11159960 [xxx] Aborting journal on device vda3-8 [xxx] EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal [xxx] EXT4-fs (vda3): Remounting filesystem read-only [Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.html Changlong Xie (1): quorum: Change vote rules for 64 bits hash block/quorum.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash
If quorum has two children(A, B). A do flush sucessfully, but B flush failed. We MUST choice A as winner rather than just pick anyone of them. Otherwise the filesystem of guest will become read-only with following errors: end_request: I/O error, dev vda, sector 11159960 Aborting journal on device vda3-8 EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal EXT4-fs (vda3): Remounting filesystem read-only Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 11cc60b..f094208 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -447,7 +447,8 @@ static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash) return 0; } -static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes) +static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes, + bool vote_error) { int max = 0; QuorumVoteVersion *candidate, *winner = NULL; @@ -456,6 +457,12 @@ static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes) if (candidate->vote_count > max) { max = candidate->vote_count; winner = candidate; +continue; +} +/* For 64 bit hash */ +if (vote_error && candidate->vote_count == max +&& candidate->value.l == 0) { +winner = candidate; } } @@ -545,7 +552,7 @@ static int quorum_vote_error(QuorumAIOCB *acb) } if (error) { -winner = quorum_get_vote_winner(&error_votes); +winner = quorum_get_vote_winner(&error_votes, false); ret = winner->value.l; } @@ -610,7 +617,7 @@ static bool quorum_vote(QuorumAIOCB *acb) } /* vote to select the most represented version */ -winner = quorum_get_vote_winner(&acb->votes); +winner = quorum_get_vote_winner(&acb->votes, false); /* if the winner count is smaller than threshold the read fails */ if (winner->vote_count < s->threshold) { @@ -770,7 +777,7 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) quorum_count_vote(&error_votes, &result_value, i); } -winner = quorum_get_vote_winner(&error_votes); +winner = quorum_get_vote_winner(&error_votes, true); result = winner->value.l; quorum_free_vote_list(&error_votes); -- 1.9.3
Re: [Qemu-devel] [PATCH v9 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
On 02/09/2016 01:06 AM, Alberto Garcia wrote: On Fri 22 Jan 2016 09:02:10 PM CET, "Dr. David Alan Gilbert" wrote: In general, what do you do to make sure that the data in a new Quorum child is consistent with that of the rest of the array? Quorum can have more than one child when it starts. But we don't do the similar check. So I don't think we should do such check here. Yes, but when you start a VM you can verify in advance that all members of the Quorum have the same data. If you do that on a running VM how can you know if the new disk is consistent with the others? User error if it is not. Just the same as it is user error if you request a shallow drive-mirror but the destination is not the same contents as the backing file. I don't think qemu has to protect us from user error in this case. But the backing file is read-only so the user can guarantee that the destination has the same data before the shallow mirror. How do you do that in this case? I think in the colo case they're relying on doing a block migrate to synchronise the remote disk prior to switching into colo mode. Yes but this is a general API that can be used independently from COLO. I'd say if we want to allow that we should at least place a big warning in the documentation. Ok, that's fair enough. Will add in next version. Thanks -Xie Berto .
[Qemu-devel] [PATCH v10 0/3] qapi: child add/delete support
ChangLog: v9~v10: 1. Rebase to the newest codes 2. Address comments from Berto and Max, update the documents in block-core.json and qmp-commands.hx 3. Remove redundant codes in quorum_add_child() and quorum_del_child() v8: 1. Rebase to the newest codes 2. Address the comments from Eric Blake v7: 1. Remove the qmp command x-blockdev-change's parameter operation according to Kevin's comments. 2. Remove the hmp command. v6: 1. Use a single qmp command x-blockdev-change to replace x-blockdev-child-add and x-blockdev-child-delete v5: 1. Address Eric Blake's comments v4: 1. drop nbd driver's implementation. We can use human-monitor-command to do it. 2. Rename the command name. v3: 1. Don't open BDS in bdrv_add_child(). Use the existing BDS which is created by the QMP command blockdev-add. 2. The driver NBD can support filename, path, host:port now. v2: 1. Use bdrv_get_device_or_node_name() instead of new function bdrv_get_id_or_node_name() 2. Update the error message 3. Update the documents in block-core.json Wen Congyang (3): Add new block driver interface to add/delete a BDS's child quorum: implement bdrv_add_child() and bdrv_del_child() qmp: add monitor command to add/remove a child block.c | 58 -- block/quorum.c| 122 +- blockdev.c| 54 include/block/block.h | 9 include/block/block_int.h | 5 ++ qapi/block-core.json | 32 qmp-commands.hx | 50 +++ 7 files changed, 324 insertions(+), 6 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH v10 1/3] Add new block driver interface to add/delete a BDS's child
From: Wen Congyang In some cases, we want to take a quorum child offline, and take another child online. Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Eric Blake Reviewed-by: Alberto Garcia --- block.c | 50 +++ include/block/block.h | 5 + include/block/block_int.h | 5 + 3 files changed, 60 insertions(+) diff --git a/block.c b/block.c index efc3c43..08aa979 100644 --- a/block.c +++ b/block.c @@ -4399,3 +4399,53 @@ void bdrv_refresh_filename(BlockDriverState *bs) QDECREF(json); } } + +/* + * Hot add/remove a BDS's child. So the user can take a child offline when + * it is broken and take a new child online + */ +void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs, +Error **errp) +{ + +if (!parent_bs->drv || !parent_bs->drv->bdrv_add_child) { +error_setg(errp, "The node %s doesn't support adding a child", + bdrv_get_device_or_node_name(parent_bs)); +return; +} + +if (!QLIST_EMPTY(&child_bs->parents)) { +error_setg(errp, "The node %s already has a parent", + child_bs->node_name); +return; +} + +parent_bs->drv->bdrv_add_child(parent_bs, child_bs, errp); +} + +void bdrv_del_child(BlockDriverState *parent_bs, BlockDriverState *child_bs, +Error **errp) +{ +BdrvChild *child; + +if (!parent_bs->drv || !parent_bs->drv->bdrv_del_child) { +error_setg(errp, "The node %s doesn't support removing a child", + bdrv_get_device_or_node_name(parent_bs)); +return; +} + +QLIST_FOREACH(child, &parent_bs->children, next) { +if (child->bs == child_bs) { +break; +} +} + +if (!child) { +error_setg(errp, "The node %s is not a child of %s", + bdrv_get_device_or_node_name(child_bs), + bdrv_get_device_or_node_name(parent_bs)); +return; +} + +parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp); +} diff --git a/include/block/block.h b/include/block/block.h index 1c4f4d8..ecde190 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -582,4 +582,9 @@ void bdrv_drained_begin(BlockDriverState *bs); */ void bdrv_drained_end(BlockDriverState *bs); +void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child, +Error **errp); +void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child, +Error **errp); + #endif diff --git a/include/block/block_int.h b/include/block/block_int.h index 9ef823a..89ec138 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -305,6 +305,11 @@ struct BlockDriver { */ void (*bdrv_drain)(BlockDriverState *bs); +void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child, + Error **errp); +void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child, + Error **errp); + QLIST_ENTRY(BlockDriver) list; }; -- 1.9.3
[Qemu-devel] [PATCH v10 3/3] qmp: add monitor command to add/remove a child
From: Wen Congyang The new QMP command name is x-blockdev-change. It's just for adding/removing quorum's child now, and doesn't support all kinds of children, all kinds of operations, nor all block drivers. So it is experimental now. Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- blockdev.c | 54 qapi/block-core.json | 32 +++ qmp-commands.hx | 50 3 files changed, 136 insertions(+) diff --git a/blockdev.c b/blockdev.c index 1f73478..ca040b0 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3983,6 +3983,60 @@ out: aio_context_release(aio_context); } +static BlockDriverState *bdrv_find_child(BlockDriverState *parent_bs, + const char *child_name) +{ +BdrvChild *child; + +QLIST_FOREACH(child, &parent_bs->children, next) { +if (strcmp(child->name, child_name) == 0) { +return child->bs; +} +} + +return NULL; +} + +void qmp_x_blockdev_change(const char *parent, bool has_child, + const char *child, bool has_node, + const char *node, Error **errp) +{ +BlockDriverState *parent_bs, *child_bs = NULL, *new_bs = NULL; + +parent_bs = bdrv_lookup_bs(parent, parent, errp); +if (!parent_bs) { +return; +} + +if (has_child == has_node) { +if (has_child) { +error_setg(errp, "The parameters child and node are in conflict"); +} else { +error_setg(errp, "Either child or node must be specified"); +} +return; +} + +if (has_child) { +child_bs = bdrv_find_child(parent_bs, child); +if (!child_bs) { +error_setg(errp, "Node '%s' does not have child '%s'", + parent, child); +return; +} +bdrv_del_child(parent_bs, child_bs, errp); +} + +if (has_node) { +new_bs = bdrv_find_node(node); +if (!new_bs) { +error_setg(errp, "Node '%s' not found", node); +return; +} +bdrv_add_child(parent_bs, new_bs, errp); +} +} + BlockJobInfoList *qmp_query_block_jobs(Error **errp) { BlockJobInfoList *head = NULL, **p_next = &head; diff --git a/qapi/block-core.json b/qapi/block-core.json index 33012b8..92eb7fe 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2482,3 +2482,35 @@ ## { 'command': 'block-set-write-threshold', 'data': { 'node-name': 'str', 'write-threshold': 'uint64' } } + +## +# @x-blockdev-change +# +# Dynamically reconfigure the block driver state graph. It can be used +# to add, remove, insert or replace a graph node. Currently only the +# Quorum driver implements this feature to add or remove its child. This +# is useful to fix a broken quorum child. +# +# If @node is specified, it will be inserted under @parent. @child +# may not be specified in this case. If both @parent and @child are +# specified but @node is not, @child will be detached from @parent. +# +# @parent: the id or name of the parent node. +# +# @child: #optional the name of a child under the given parent node. +# +# @node: #optional the name of the node that will be added. +# +# Note: this command is experimental, and its API is not stable. It +# does not support all kinds of operations, all kinds of children, nor +# all block drivers. +# +# Warning: The data in a new quorum child MUST be consistent with that of +# the rest of the array. +# +# Since: 2.6 +## +{ 'command': 'x-blockdev-change', + 'data' : { 'parent': 'str', + '*child': 'str', + '*node': 'str' } } diff --git a/qmp-commands.hx b/qmp-commands.hx index 020e5ee..1c9a06f 100644 --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -4364,6 +4364,56 @@ Example: EQMP { +.name = "x-blockdev-change", +.args_type = "parent:B,child:B?,node:B?", +.mhandler.cmd_new = qmp_marshal_x_blockdev_change, +}, + +SQMP +x-blockdev-change +- + +Dynamically reconfigure the block driver state graph. It can be used +to add, remove, insert or replace a graph node. Currently only the +Quorum driver implements this feature to add or remove its child. This +is useful to fix a broken quorum child. + +Arguments: +- "parent": the id or name of the parent node (json-string) +- "child": the name of a child under the given parent node (json-string, optional) +- "node": the name of the node that will be added (json-string, optional) + +Note: thi
[Qemu-devel] [PATCH v10 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- block.c | 8 ++-- block/quorum.c| 122 +- include/block/block.h | 4 ++ 3 files changed, 128 insertions(+), 6 deletions(-) diff --git a/block.c b/block.c index 08aa979..c3c9dc0 100644 --- a/block.c +++ b/block.c @@ -1198,10 +1198,10 @@ static int bdrv_fill_options(QDict **options, const char *filename, return 0; } -static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, -BlockDriverState *child_bs, -const char *child_name, -const BdrvChildRole *child_role) +BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, + BlockDriverState *child_bs, + const char *child_name, + const BdrvChildRole *child_role) { BdrvChild *child = g_new(BdrvChild, 1); *child = (BdrvChild) { diff --git a/block/quorum.c b/block/quorum.c index a5ae4b8..e5a7e4f 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #include "qapi/qmp/qstring.h" #include "qapi-event.h" #include "crypto/hash.h" +#include "qemu/bitmap.h" #define HASH_LENGTH 32 @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. */ +unsigned long *index_bitmap; +int bsize; QuorumReadPattern read_pattern; } BDRVQuorumState; @@ -876,9 +879,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, ret = -EINVAL; goto exit; } -if (s->num_children < 2) { +if (s->num_children < 1) { error_setg(&local_err, - "Number of provided children must be greater than 1"); + "Number of provided children must be 1 or more"); ret = -EINVAL; goto exit; } @@ -927,6 +930,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, /* allocate the children array */ s->children = g_new0(BdrvChild *, s->num_children); opened = g_new0(bool, s->num_children); +s->index_bitmap = bitmap_new(s->num_children); for (i = 0; i < s->num_children; i++) { char indexstr[32]; @@ -942,6 +946,8 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, opened[i] = true; } +bitmap_set(s->index_bitmap, 0, s->num_children); +s->bsize = s->num_children; g_free(opened); goto exit; @@ -998,6 +1004,115 @@ static void quorum_attach_aio_context(BlockDriverState *bs, } } +static int get_new_child_index(BDRVQuorumState *s) +{ +int index; + +index = find_next_zero_bit(s->index_bitmap, s->bsize, 0); +if (index < s->bsize) { +return index; +} + +if ((s->bsize % BITS_PER_LONG) == 0) { +s->index_bitmap = bitmap_zero_extend(s->index_bitmap, s->bsize, + s->bsize + 1); +} + +return s->bsize++; +} + +static void remove_child_index(BDRVQuorumState *s, int index) +{ +int last_index; +long new_len; + +assert(index < s->bsize); + +clear_bit(index, s->index_bitmap); +if (index < s->bsize - 1) { +/* + * The last bit is always set, and we don't clear + * the last bit. + */ +return; +} + +last_index = find_last_bit(s->index_bitmap, s->bsize); +s->bsize = last_index + 1; +if (BITS_TO_LONGS(last_index + 1) == BITS_TO_LONGS(s->bsize)) { +return; +} + +new_len = BITS_TO_LONGS(last_index + 1) * sizeof(unsigned long); +s->index_bitmap = g_realloc(s->index_bitmap, new_len); +} + +static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs, + Error **errp) +{ +BDRVQuorumState *s = bs->opaque; +BdrvChild *child; +char indexstr[32]; +int index, ret; + +index = get_new_child_index(s); +ret = snprintf(indexstr, 32, "children.%d", index); +if (ret < 0 || ret >= 32) { +error_setg(errp, "cannot generate child name"); +return; +} + +bdrv_drain(bs); + +assert(s->num_children <= INT_MAX / sizeof(BdrvChild *)); +if (s->num_children == INT_MAX / sizeof(BdrvChild *)) { +error_setg(errp, "Too many children"); +return; +} +s->children = g_renew(BdrvChild *, s->children, s->num_children + 1); + +bdrv_ref(child_bs); +child = bdrv_attach_child(bs, child_bs, in
[Qemu-devel] [RFC PATCH 1/3] linux-headers: Update VFIO headers from linux-next tag ToBeFilled
Syncup VFIO related linux headers from linux-next tree using scripts/update-linux-headers.sh. Integrate two VFIO-PCI ioctl flags: - VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED - VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP Signed-off-by: Yongji Xie --- linux-headers/linux/vfio.h |4 1 file changed, 4 insertions(+) diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index aa276bc..3b79d63 100644 --- a/linux-headers/linux/vfio.h +++ b/linux-headers/linux/vfio.h @@ -164,6 +164,10 @@ struct vfio_device_info { #define VFIO_DEVICE_FLAGS_PCI (1 << 1)/* vfio-pci device */ #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)/* vfio-platform device */ #define VFIO_DEVICE_FLAGS_AMBA (1 << 3) /* vfio-amba device */ +/* Platform support all PCI MMIO BARs to be page aligned */ +#define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED (1 << 4) +/* platform support mmapping PCI MSI-X vector table */ +#define VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP (1 << 5) __u32 num_regions;/* Max region index + 1 */ __u32 num_irqs; /* Max IRQ index + 1 */ }; -- 1.7.9.5
[Qemu-devel] [RFC PATCH 0/3] vfio/pci: Add support for mmapping sub-page MMIO BARs and MSI-X table
This patch set adds support for two VFIO-PCI ioctl flags: VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED and VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP. Note that the kernel bits of this two flags are not in upstream and posted as . VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED indicates that platform support all PCI MMIO BARs to be page aligned which means any BARs' mmio page would not be shared with other BARs. So it's allowed to mmap sub-page(size < PAGE_SIZE) MMIO BARs in QEMU. VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP indicates that it's safe to mmap MSI-X table in QEMU. Besides supporting for these two mmapping cases, we also fix some issues which would block KVM to create memory slot for these mmapped sub-page MMIO BARs and mmapped MSI-X table. Posted kernel patches link: - https://www.mail-archive.com/kvm@vger.kernel.org/msg124031.html Yongji Xie (3): linux-headers: Update VFIO headers from linux-next tag ToBeFilled vfio/pci: Add support for mmapped sub-page MMIO BARs vfio/pci: Add support for mmapping MSI-X table hw/vfio/pci.c | 65 linux-headers/linux/vfio.h |4 +++ 2 files changed, 64 insertions(+), 5 deletions(-) -- 1.7.9.5
[Qemu-devel] [RFC PATCH 2/3] vfio/pci: Add support for mmapped sub-page MMIO BARs
The VFIO-PCI ioctl flag VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED indicates platform support all PCI MMIO BARs to be page aligned and sub-page(size < PAGE_SIZE) MMIO BARs can be mmapped. But this has an issue for these mmapped sub-page MMIO BARs - KVM would not allow to create memory slot for them via ioctl KVM_SET_USER_MEMORY_REGION because these BARs' size are still less than PAGE_SIZE. So this patch expands these sub-page MMIO BARs to page size in vfio_pci_write_config(), so that the BARs could be passed to KVM ioctl KVM_SET_USER_MEMORY_REGION with a valid size. And we also set the priority of these BARs' memory regions to zero in case of overlap with page unaligned BARs when guest doesn't support all PCI MMIO BARs to be page aligned. Signed-off-by: Yongji Xie --- hw/vfio/pci.c | 46 ++ 1 file changed, 46 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 1fb868c..a7c6171 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -1028,6 +1028,44 @@ static const MemoryRegionOps vfio_vga_ops = { }; /* + * Expand sub-page(size < PAGE_SIZE) MMIO BARs to page size if host support + * all MMIO BARs to be page aligned. And we should set the priority of these + * BARs' memory regions to zero in case of overlap with page unaligned + * BARs when guest doesn't support all MMIO BARs to be page aligned. + */ +static void vfio_expand_bar_to_page_size(PCIDevice *pdev) +{ +VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev); +MemoryRegion *mmap_mr; +MemoryRegion *mr; +PCIIORegion *r; +pcibus_t bar_addr; +int i; + +memory_region_transaction_begin(); +for (i = 0; i < PCI_NUM_REGIONS; i++) { +r = &pdev->io_regions[i]; +if (r->size > 0 && r->size < qemu_real_host_page_size) { +bar_addr = r->addr; +if (bar_addr != PCI_BAR_UNMAPPED && +!(bar_addr & ~qemu_real_host_page_mask)) { +mr = &vdev->bars[i].region.mem; +mmap_mr = &vdev->bars[i].region.mmap_mem; +if (memory_region_is_mapped(mr) && vdev->bars[i].region.mmap && +memory_region_size(mr) < qemu_real_host_page_size) { +memory_region_del_subregion(r->address_space, mr); +memory_region_set_size(mr, qemu_real_host_page_size); +memory_region_set_size(mmap_mr, qemu_real_host_page_size); +memory_region_add_subregion_overlap(r->address_space, +bar_addr, mr, 0); +} +} +} +} +memory_region_transaction_commit(); +} + +/* * PCI config space */ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len) @@ -1112,6 +1150,14 @@ void vfio_pci_write_config(PCIDevice *pdev, } else if (was_enabled && !is_enabled) { vfio_msix_disable(vdev); } +} else if (vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED && +(ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) || +ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) || +ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) || +range_covers_byte(addr, len, PCI_COMMAND))) { +pci_default_write_config(pdev, addr, val, len); + +vfio_expand_bar_to_page_size(pdev); } else { /* Write everything to QEMU to keep emulated bits correct */ pci_default_write_config(pdev, addr, val, len); -- 1.7.9.5
[Qemu-devel] [RFC PATCH 3/3] vfio/pci: Add support for mmapping MSI-X table
The VFIO-PCI ioctl VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP flag indicate that platform support mmapping MSI-X table. With this flag set, we skip some MSI-X table check so that QEMU can mmap MSI-X table successfully. And we also raise the priority of mmap memory region in case of overlap with MSI-X/PBA memory region which is added in msix_init(). Signed-off-by: Yongji Xie --- hw/vfio/pci.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index a7c6171..bec301c 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -1339,7 +1339,8 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled) } memory_region_set_enabled(&bar->region.mmap_mem, enabled); -if (vdev->msix && vdev->msix->table_bar == i) { +if (vdev->msix && vdev->msix->table_bar == i && +!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) { memory_region_set_enabled(&vdev->msix->mmap_mem, enabled); } } @@ -1357,7 +1358,8 @@ static void vfio_unregister_bar(VFIOPCIDevice *vdev, int nr) memory_region_del_subregion(&bar->region.mem, &bar->region.mmap_mem); -if (vdev->msix && vdev->msix->table_bar == nr) { +if (vdev->msix && vdev->msix->table_bar == nr && +!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) { memory_region_del_subregion(&bar->region.mem, &vdev->msix->mmap_mem); } } @@ -1374,7 +1376,8 @@ static void vfio_unmap_bar(VFIOPCIDevice *vdev, int nr) munmap(bar->region.mmap, memory_region_size(&bar->region.mmap_mem)); -if (vdev->msix && vdev->msix->table_bar == nr) { +if (vdev->msix && vdev->msix->table_bar == nr && +!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) { munmap(vdev->msix->mmap, memory_region_size(&vdev->msix->mmap_mem)); } } @@ -1420,7 +1423,8 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr) * We can't mmap areas overlapping the MSIX vector table, so we * potentially insert a direct-mapped subregion before and after it. */ -if (vdev->msix && vdev->msix->table_bar == nr) { +if (vdev->msix && vdev->msix->table_bar == nr && +!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) { size = vdev->msix->table_offset & qemu_real_host_page_mask; } @@ -1429,9 +1433,14 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr) &bar->region.mmap_mem, &bar->region.mmap, size, 0, name)) { error_report("%s unsupported. Performance may be slow", name); +} else if (vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP) { +memory_region_del_subregion(&bar->region.mem, &bar->region.mmap_mem); +memory_region_add_subregion_overlap(&bar->region.mem, 0, + &bar->region.mmap_mem, 1); } -if (vdev->msix && vdev->msix->table_bar == nr) { +if (vdev->msix && vdev->msix->table_bar == nr && +!(vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP)) { uint64_t start; start = REAL_HOST_PAGE_ALIGN((uint64_t)vdev->msix->table_offset + -- 1.7.9.5
Re: [Qemu-devel] [PATCH COLO-Frame v12 18/38] COLO: Flush PVM's cached RAM into SVM's memory
On 12/15/2015 04:22 PM, zhanghailiang wrote: During the time of VM's running, PVM may dirty some pages, we will transfer PVM's dirty pages to SVM and store them into SVM's RAM cache at next checkpoint time. So, the content of SVM's RAM cache will always be some with PVM's memory "some" => "same" Thanks -Xie after checkpoint. Instead of flushing all content of PVM's RAM cache into SVM's MEMORY, we do this in a more efficient way: Only flush any page that dirtied by PVM since last checkpoint. In this way, we can ensure SVM's memory same with PVM's. Besides, we must ensure flush RAM cache before load device state. Signed-off-by: zhanghailiang Signed-off-by: Li Zhijian Signed-off-by: Gonglei Reviewed-by: Dr. David Alan Gilbert --- v12: - Add a trace point in the end of colo_flush_ram_cache() (Dave's suggestion) - Add Reviewed-by tag v11: - Move the place of 'need_flush' (Dave's suggestion) - Remove unused 'DPRINTF("Flush ram_cache\n")' v10: - trace the number of dirty pages that be received. Signed-off-by: zhanghailiang --- include/migration/migration.h | 1 + migration/colo.c | 2 -- migration/ram.c | 38 ++ trace-events | 2 ++ 4 files changed, 41 insertions(+), 2 deletions(-) diff --git a/include/migration/migration.h b/include/migration/migration.h index e41372d..221176b 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -336,4 +336,5 @@ PostcopyState postcopy_state_set(PostcopyState new_state); /* ram cache */ int colo_init_ram_cache(void); void colo_release_ram_cache(void); +void colo_flush_ram_cache(void); #endif diff --git a/migration/colo.c b/migration/colo.c index a4d49ff..e40cdb9 100644 --- a/migration/colo.c +++ b/migration/colo.c @@ -401,8 +401,6 @@ void *colo_process_incoming_thread(void *opaque) } qemu_mutex_unlock_iothread(); -/* TODO: flush vm state */ - ret = colo_put_cmd(mis->to_src_file, COLO_COMMAND_VMSTATE_LOADED); if (ret < 0) { goto out; diff --git a/migration/ram.c b/migration/ram.c index 3d5947b..8ff7f7c 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -2458,6 +2458,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id) * be atomic */ bool postcopy_running = postcopy_state_get() >= POSTCOPY_INCOMING_LISTENING; +bool need_flush = false; seq_iter++; @@ -2492,6 +2493,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id) /* After going into COLO, we should load the Page into colo_cache */ if (ram_cache_enable) { host = colo_cache_from_block_offset(block, addr); +need_flush = true; } else { host = host_from_ram_block_offset(block, addr); } @@ -2585,6 +2587,10 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id) } rcu_read_unlock(); + +if (!ret && ram_cache_enable && need_flush) { +colo_flush_ram_cache(); +} DPRINTF("Completed load of VM with exit code %d seq iteration " "%" PRIu64 "\n", ret, seq_iter); return ret; @@ -2657,6 +2663,38 @@ void colo_release_ram_cache(void) rcu_read_unlock(); } +/* + * Flush content of RAM cache into SVM's memory. + * Only flush the pages that be dirtied by PVM or SVM or both. + */ +void colo_flush_ram_cache(void) +{ +RAMBlock *block = NULL; +void *dst_host; +void *src_host; +ram_addr_t offset = 0; + +trace_colo_flush_ram_cache_begin(migration_dirty_pages); +rcu_read_lock(); +block = QLIST_FIRST_RCU(&ram_list.blocks); +while (block) { +ram_addr_t ram_addr_abs; +offset = migration_bitmap_find_dirty(block, offset, &ram_addr_abs); +migration_bitmap_clear_dirty(ram_addr_abs); +if (offset >= block->used_length) { +offset = 0; +block = QLIST_NEXT_RCU(block, next); +} else { +dst_host = block->host + offset; +src_host = block->colo_cache + offset; +memcpy(dst_host, src_host, TARGET_PAGE_SIZE); +} +} +rcu_read_unlock(); +trace_colo_flush_ram_cache_end(); +assert(migration_dirty_pages == 0); +} + static SaveVMHandlers savevm_ram_handlers = { .save_live_setup = ram_save_setup, .save_live_iterate = ram_save_iterate, diff --git a/trace-events b/trace-events index 39fdd8d..7f76029 100644 --- a/trace-events +++ b/trace-events @@ -1264,6 +1264,8 @@ migration_throttle(void) "" ram_load_postcopy_loop(uint64_t addr, int flags) "@%" PRIx64 " %x" ram_postcopy_send_discard_bitmap(void) "" ram_save_queue_pages(const char *rb
[Qemu-devel] [PATCH v14 1/8] unblock backup operations in backing file
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/block.c b/block.c index e90b6cf..12b5231 100644 --- a/block.c +++ b/block.c @@ -1275,6 +1275,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd) /* Otherwise we won't be able to commit due to check in bdrv_commit */ bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET, bs->backing_blocker); +/* + * We do backup in 3 ways: + * 1. drive backup + *The target bs is new opened, and the source is top BDS + * 2. blockdev backup + *Both the source and the target are top BDSes. + * 3. internal backup(used for block replication) + *Both the source and the target are backing file + * + * In case 1, and 2, the backing file is neither the source nor + * the target. + * In case 3, we will block the top BDS, so there is only one block + * job for the top BDS and its backing chain. + */ +bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE, +bs->backing_blocker); +bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET, +bs->backing_blocker); out: bdrv_refresh_limits(bs, NULL); } -- 1.9.3
[Qemu-devel] [PATCH v14 6/8] auto complete active commit
From: Wen Congyang Auto complete mirror job in background to prevent from blocking synchronously Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/mirror.c| 13 + blockdev.c| 2 +- include/block/block_int.h | 3 ++- qemu-img.c| 2 +- 4 files changed, 13 insertions(+), 7 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index f201f2b..b8e51f8 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -716,7 +716,8 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target, BlockCompletionFunc *cb, void *opaque, Error **errp, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base) + bool is_none_mode, BlockDriverState *base, + bool auto_complete) { MirrorBlockJob *s; BlockDriverState *replaced_bs; @@ -772,6 +773,9 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target, s->granularity = granularity; s->buf_size = ROUND_UP(buf_size, granularity); s->unmap = unmap; +if (auto_complete) { +s->should_complete = true; +} s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp); if (!s->dirty_bitmap) { @@ -813,14 +817,15 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, mirror_start_job(bs, target, replaces, speed, granularity, buf_size, on_source_error, on_target_error, unmap, cb, opaque, errp, - &mirror_job_driver, is_none_mode, base); + &mirror_job_driver, is_none_mode, base, false); } void commit_active_start(BlockDriverState *bs, BlockDriverState *base, int64_t speed, BlockdevOnError on_error, BlockCompletionFunc *cb, - void *opaque, Error **errp) + void *opaque, Error **errp, + bool auto_complete) { int64_t length, base_length; int orig_base_flags; @@ -861,7 +866,7 @@ void commit_active_start(BlockDriverState *bs, BlockDriverState *base, bdrv_ref(base); mirror_start_job(bs, base, NULL, speed, 0, 0, on_error, on_error, false, cb, opaque, &local_err, - &commit_active_job_driver, false, base); + &commit_active_job_driver, false, base, auto_complete); if (local_err) { error_propagate(errp, local_err); goto error_restore_flags; diff --git a/blockdev.c b/blockdev.c index 574a365..603e858 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3038,7 +3038,7 @@ void qmp_block_commit(const char *device, goto out; } commit_active_start(bs, base_bs, speed, on_error, block_job_cb, -bs, &local_err); +bs, &local_err, false); } else { commit_start(bs, base_bs, top_bs, speed, on_error, block_job_cb, bs, has_backing_file ? backing_file : NULL, &local_err); diff --git a/include/block/block_int.h b/include/block/block_int.h index 19c02b6..1b2d51a 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -634,13 +634,14 @@ void commit_start(BlockDriverState *bs, BlockDriverState *base, * @cb: Completion function for the job. * @opaque: Opaque pointer value passed to @cb. * @errp: Error object. + * @auto_complete: Auto complete the job. * */ void commit_active_start(BlockDriverState *bs, BlockDriverState *base, int64_t speed, BlockdevOnError on_error, BlockCompletionFunc *cb, - void *opaque, Error **errp); + void *opaque, Error **errp, bool auto_complete); /* * mirror_start: * @bs: Block device to operate on. diff --git a/qemu-img.c b/qemu-img.c index 3d48b4f..4b8b092 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -759,7 +759,7 @@ static int img_commit(int argc, char **argv) }; commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT, -common_block_job_cb, &cbi, &local_err); +common_block_job_cb, &cbi, &local_err, false); if (local_err) { goto done; } -- 1.9.3
[Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints
Block replication is a very important feature which is used for continuous checkpoints(for example: COLO). You can get the detailed information about block replication from here: http://wiki.qemu.org/Features/BlockReplication Usage: Please refer to docs/block-replication.txt This patch series is based on the following patch series: 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html You can get the patch here: https://github.com/Pating/qemu/tree/changlox/block-replication-v14 You can get the patch with framework here: https://github.com/Pating/qemu/tree/changlox/colo_framework_v13 TODO: 1. Continuous block replication. It will be started after basic functions are accepted. Changs Log: V14: 1. Implement auto complete active commit 2. Implement active commit block job for replication.c 3. Address the comments from Stefan, add replication-specific API and data structure, also remove old block layer APIs V13: 1. Rebase to the newest codes 2. Remove redundant marcos and semicolon in replication.c 3. Fix typos in block-replication.txt V12: 1. Rebase to the newest codes 2. Use backing reference to replcace 'allow-write-backing-file' V11: 1. Reopen the backing file when starting blcok replication if it is not opened in R/W mode 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET when opening backing file 3. Block the top BDS so there is only one block job for the top BDS and its backing chain. V10: 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing reference. 2. Address the comments from Eric Blake V9: 1. Update the error messages 2. Rebase to the newest qemu 3. Split child add/delete support. These patches are sent in another patchset. V8: 1. Address Alberto Garcia's comments V7: 1. Implement adding/removing quorum child. Remove the option non-connect. 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion V6: 1. Rebase to the newest qemu. V5: 1. Address the comments from Gong Lei 2. Speed the failover up. The secondary vm can take over very quickly even if there are too many I/O requests. V4: 1. Introduce a new driver replication to avoid touch nbd and qcow2. V3: 1: use error_setg() instead of error_set() 2. Add a new block job API 3. Active disk, hidden disk and nbd target uses the same AioContext 4. Add a testcase to test new hbitmap API V2: 1. Redesign the secondary qemu(use image-fleecing) 2. Use Error objects to return error message 3. Address the comments from Max Reitz and Eric Blake Wen Congyang (8): unblock backup operations in backing file Store parent BDS in BdrvChild Backup: clear all bitmap when doing block checkpoint Allow creating backup jobs when opening BDS docs: block replication's description auto complete active commit Implement new driver for block replication support replication driver in blockdev-add block.c | 19 ++ block/Makefile.objs | 3 +- block/backup.c | 14 + block/mirror.c | 13 +- block/replication-comm.c | 66 + block/replication.c | 590 +++ blockdev.c | 2 +- blockjob.c | 11 + docs/block-replication.txt | 229 +++ include/block/block_int.h| 4 +- include/block/blockjob.h | 12 + include/block/replication-comm.h | 50 qapi/block-core.json | 33 ++- qemu-img.c | 2 +- 14 files changed, 1038 insertions(+), 10 deletions(-) create mode 100644 block/replication-comm.c create mode 100644 block/replication.c create mode 100644 docs/block-replication.txt create mode 100644 include/block/replication-comm.h -- 1.9.3
[Qemu-devel] [PATCH v14 8/8] support replication driver in blockdev-add
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Eric Blake --- qapi/block-core.json | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index d73f46b..00350ed 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -220,6 +220,7 @@ # 2.2: 'archipelago' added, 'cow' dropped # 2.3: 'host_floppy' deprecated # 2.5: 'host_floppy' dropped +# 2.6: 'replication' added # # @backing_file: #optional the name of the backing file (for copy-on-write) # @@ -1540,6 +1541,7 @@ # Drivers that are supported in block device operations. # # @host_device, @host_cdrom: Since 2.1 +# @replication: Since 2.6 # # Since: 2.0 ## @@ -1547,8 +1549,8 @@ 'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop', 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device', 'http', 'https', 'null-aio', 'null-co', 'parallels', -'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx', -'vmdk', 'vpc', 'vvfat' ] } +'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication', +'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } ## # @BlockdevOptionsBase @@ -1988,6 +1990,19 @@ { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] } ## +# @BlockdevOptionsReplication +# +# Driver specific block device options for replication +# +# @mode: the replication mode +# +# Since: 2.6 +## +{ 'struct': 'BlockdevOptionsReplication', + 'base': 'BlockdevOptionsGenericFormat', + 'data': { 'mode': 'ReplicationMode' } } + +## # @BlockdevOptions # # Options for creating a block device. @@ -2024,6 +2039,7 @@ 'quorum': 'BlockdevOptionsQuorum', 'raw':'BlockdevOptionsGenericFormat', # TODO rbd: Wait for structured options + 'replication':'BlockdevOptionsReplication', # TODO sheepdog: Wait for structured options # TODO ssh: Should take InetSocketAddress for 'host'? 'tftp': 'BlockdevOptionsFile', -- 1.9.3
[Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS
From: Wen Congyang When opening BDS, we need to create backup jobs for image-fleecing. Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Stefan Hajnoczi Reviewed-by: Jeff Cody --- block/Makefile.objs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/Makefile.objs b/block/Makefile.objs index 58ef2ef..fa05f37 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o block-obj-y += write-threshold.o +block-obj-y += backup.o common-obj-y += stream.o common-obj-y += commit.o -common-obj-y += backup.o iscsi.o-cflags := $(LIBISCSI_CFLAGS) iscsi.o-libs := $(LIBISCSI_LIBS) -- 1.9.3
[Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Jeff Cody --- block/backup.c | 14 ++ blockjob.c | 11 +++ include/block/blockjob.h | 12 3 files changed, 37 insertions(+) diff --git a/block/backup.c b/block/backup.c index 705bb77..0a27d01 100644 --- a/block/backup.c +++ b/block/backup.c @@ -253,11 +253,25 @@ static void backup_abort(BlockJob *job) } } +static void backup_do_checkpoint(BlockJob *job, Error **errp) +{ +BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common); + +if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +error_setg(errp, "The backup job only supports block checkpoint in" + " sync=none mode"); +return; +} + +hbitmap_reset_all(backup_job->bitmap); +} + static const BlockJobDriver backup_job_driver = { .instance_size = sizeof(BackupBlockJob), .job_type = BLOCK_JOB_TYPE_BACKUP, .set_speed = backup_set_speed, .iostatus_reset = backup_iostatus_reset, +.do_checkpoint = backup_do_checkpoint, .commit = backup_commit, .abort = backup_abort, }; diff --git a/blockjob.c b/blockjob.c index 80adb9d..0c8edfe 100644 --- a/blockjob.c +++ b/blockjob.c @@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job) QLIST_INSERT_HEAD(&txn->jobs, job, txn_list); block_job_txn_ref(txn); } + +void block_job_do_checkpoint(BlockJob *job, Error **errp) +{ +if (!job->driver->do_checkpoint) { +error_setg(errp, "The job %s doesn't support block checkpoint", + BlockJobType_lookup[job->driver->job_type]); +return; +} + +job->driver->do_checkpoint(job, errp); +} diff --git a/include/block/blockjob.h b/include/block/blockjob.h index d84ccd8..abdba7c 100644 --- a/include/block/blockjob.h +++ b/include/block/blockjob.h @@ -70,6 +70,9 @@ typedef struct BlockJobDriver { * never both. */ void (*abort)(BlockJob *job); + +/** Optional callback for job types that support checkpoint. */ +void (*do_checkpoint)(BlockJob *job, Error **errp); } BlockJobDriver; /** @@ -443,4 +446,13 @@ void block_job_txn_unref(BlockJobTxn *txn); */ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job); +/** + * block_job_do_checkpoint: + * @job: The job. + * @errp: Error object. + * + * Do block checkpoint on the specified job. + */ +void block_job_do_checkpoint(BlockJob *job, Error **errp); + #endif -- 1.9.3
[Qemu-devel] [PATCH v14 2/8] Store parent BDS in BdrvChild
From: Wen Congyang We need to access the parent BDS to get the root BDS. Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block.c | 1 + include/block/block_int.h | 1 + 2 files changed, 2 insertions(+) diff --git a/block.c b/block.c index 12b5231..33d3427 100644 --- a/block.c +++ b/block.c @@ -1204,6 +1204,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, BdrvChild *child = g_new(BdrvChild, 1); *child = (BdrvChild) { .bs = child_bs, +.parent = parent_bs, .name = g_strdup(child_name), .role = child_role, }; diff --git a/include/block/block_int.h b/include/block/block_int.h index ebe8b1e..19c02b6 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -361,6 +361,7 @@ extern const BdrvChildRole child_format; struct BdrvChild { BlockDriverState *bs; +BlockDriverState *parent; char *name; const BdrvChildRole *role; QLIST_ENTRY(BdrvChild) next; -- 1.9.3
[Qemu-devel] [PATCH v14 5/8] docs: block replication's description
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- docs/block-replication.txt | 229 + 1 file changed, 229 insertions(+) create mode 100644 docs/block-replication.txt diff --git a/docs/block-replication.txt b/docs/block-replication.txt new file mode 100644 index 000..d1a231e --- /dev/null +++ b/docs/block-replication.txt @@ -0,0 +1,229 @@ +Block replication + +Copyright Fujitsu, Corp. 2015 +Copyright (c) 2015 Intel Corporation +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD. + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. + +Block replication is used for continuous checkpoints. It is designed +for COLO (COarse-grain LOck-stepping) where the Secondary VM is running. +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario, +where the Secondary VM is not running. + +This document gives an overview of block replication's design. + +== Background == +High availability solutions such as micro checkpoint and COLO will do +consecutive checkpoints. The VM state of Primary VM and Secondary VM is +identical right after a VM checkpoint, but becomes different as the VM +executes till the next checkpoint. To support disk contents checkpoint, +the modified disk contents in the Secondary VM must be buffered, and are +only dropped at next checkpoint time. To reduce the network transportation +effort at the time of checkpoint, the disk modification operations of +Primary disk are asynchronously forwarded to the Secondary node. + +== Workflow == +The following is the image of block replication workflow: + ++--+++ +|Primary Write Requests||Secondary Write Requests| ++--+++ + | | + | (4) + | V + | /-\ + | Copy and Forward| | + |-(1)--+ | Disk Buffer | + | | | | + | (3) \-/ + | speculative ^ + |write through(2) + | | | + V V | + +--+ ++ + | Primary Disk | | Secondary Disk | + +--+ ++ + +1) Primary write requests will be copied and forwarded to Secondary + QEMU. +2) Before Primary write requests are written to Secondary disk, the + original sector content will be read from Secondary disk and + buffered in the Disk buffer, but it will not overwrite the existing + sector content (it could be from either "Secondary Write Requests" or + previous COW of "Primary Write Requests") in the Disk buffer. +3) Primary write requests will be written to Secondary disk. +4) Secondary write requests will be buffered in the Disk buffer and it + will overwrite the existing sector content in the buffer. + +== Architecture == +We are going to implement block replication from many basic +blocks that are already in QEMU. + + virtio-blk || + ^||.-- + |||| Secondary +1 Quorum ||'-- + / \ || +/\|| + Primary2 filter + disk ^ virtio-blk + | ^ +3 NBD ---> 3 NBD | +client|| server 2 filter + ||^ ^ +. ||| | +Primary | || Secondary disk <- hidden-disk 5 <- active-disk 4 +' ||| backing^ backing + ||| | + ||| | + ||'-' +
[Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- block/Makefile.objs | 1 + block/replication-comm.c | 66 + block/replication.c | 590 +++ include/block/replication-comm.h | 50 qapi/block-core.json | 13 + 5 files changed, 720 insertions(+) create mode 100644 block/replication-comm.c create mode 100644 block/replication.c create mode 100644 include/block/replication-comm.h diff --git a/block/Makefile.objs b/block/Makefile.objs index fa05f37..7037662 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o block-obj-y += write-threshold.o block-obj-y += backup.o +block-obj-y += replication-comm.o replication.o common-obj-y += stream.o common-obj-y += commit.o diff --git a/block/replication-comm.c b/block/replication-comm.c new file mode 100644 index 000..8af748b --- /dev/null +++ b/block/replication-comm.c @@ -0,0 +1,66 @@ +/* + * Replication Block filter + * + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD. + * Copyright (c) 2015 Intel Corporation + * Copyright (c) 2015 FUJITSU LIMITED + * + * Author: + * Wen Congyang + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include "block/replication-comm.h" + +static QLIST_HEAD(, BlockReplicationState) block_replication_states; + +BlockReplicationState *block_replication_new(BlockDriverState *bs, + BlockReplicationOps *ops) +{ +BlockReplicationState *brs; + +brs = g_new0(BlockReplicationState, 1); +brs->bs = bs; +brs->ops = ops; +QLIST_INSERT_HEAD(&block_replication_states, brs, node); + +return brs; +} + +void block_replication_remove(BlockReplicationState *brs) +{ +QLIST_REMOVE(brs, node); +g_free(brs); +} + +void block_replication_start_all(ReplicationMode mode, Error **errp) +{ +BlockReplicationState *brs, *next; +QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) { +if (brs->ops && brs->ops->start) { +brs->ops->start(brs, mode, errp); +} +} +} + +void block_replication_do_checkpoint_all(Error **errp) +{ +BlockReplicationState *brs, *next; +QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) { +if (brs->ops && brs->ops->checkpoint) { +brs->ops->checkpoint(brs, errp); +} +} +} + +void block_replication_stop_all(bool failover, Error **errp) +{ +BlockReplicationState *brs, *next; +QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) { +if (brs->ops && brs->ops->stop) { +brs->ops->stop(brs, failover, errp); +} +} +} diff --git a/block/replication.c b/block/replication.c new file mode 100644 index 000..29c677a --- /dev/null +++ b/block/replication.c @@ -0,0 +1,590 @@ +/* + * Replication Block filter + * + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD. + * Copyright (c) 2015 Intel Corporation + * Copyright (c) 2015 FUJITSU LIMITED + * + * Author: + * Wen Congyang + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include "qemu-common.h" +#include "block/blockjob.h" +#include "block/nbd.h" +#include "block/replication-comm.h" + +typedef struct BDRVReplicationState { +ReplicationMode mode; +int replication_state; +BlockDriverState *active_disk; +BlockDriverState *hidden_disk; +BlockDriverState *secondary_disk; +BlockDriverState *top_bs; +BlockReplicationState *brs; +Error *blocker; +int orig_hidden_flags; +int orig_secondary_flags; +int error; +} BDRVReplicationState; + +enum { +BLOCK_REPLICATION_NONE, /* block replication is not started */ +BLOCK_REPLICATION_RUNNING, /* block replication is running */ +BLOCK_REPLICATION_FAILOVER, /* failover is running in background */ +BLOCK_REPLICATION_FAILOVER_FAILED, /* failover failed*/ +BLOCK_REPLICATION_DONE, /* block replication is done(after failover) */ +}; + +static void replication_start(BlockReplicationState *brs, ReplicationMode mode, + Error **errp); +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp); +static void replication_stop(BlockReplicationState *brs, bool failover, + Error **errp); + +#define REPLICATION_MODE"mode" +static QemuOptsList replication_runtime_opts = { +.name = "replication", +.head = QTAILQ_HEAD_INITIALIZER(replication
Re: [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints
It seems i missed someone in CC list, add them. Thanks -Xie On 01/13/2016 05:18 PM, Changlong Xie wrote: Block replication is a very important feature which is used for continuous checkpoints(for example: COLO). You can get the detailed information about block replication from here: http://wiki.qemu.org/Features/BlockReplication Usage: Please refer to docs/block-replication.txt This patch series is based on the following patch series: 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html You can get the patch here: https://github.com/Pating/qemu/tree/changlox/block-replication-v14 You can get the patch with framework here: https://github.com/Pating/qemu/tree/changlox/colo_framework_v13 TODO: 1. Continuous block replication. It will be started after basic functions are accepted. Changs Log: V14: 1. Implement auto complete active commit 2. Implement active commit block job for replication.c 3. Address the comments from Stefan, add replication-specific API and data structure, also remove old block layer APIs V13: 1. Rebase to the newest codes 2. Remove redundant marcos and semicolon in replication.c 3. Fix typos in block-replication.txt V12: 1. Rebase to the newest codes 2. Use backing reference to replcace 'allow-write-backing-file' V11: 1. Reopen the backing file when starting blcok replication if it is not opened in R/W mode 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET when opening backing file 3. Block the top BDS so there is only one block job for the top BDS and its backing chain. V10: 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing reference. 2. Address the comments from Eric Blake V9: 1. Update the error messages 2. Rebase to the newest qemu 3. Split child add/delete support. These patches are sent in another patchset. V8: 1. Address Alberto Garcia's comments V7: 1. Implement adding/removing quorum child. Remove the option non-connect. 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion V6: 1. Rebase to the newest qemu. V5: 1. Address the comments from Gong Lei 2. Speed the failover up. The secondary vm can take over very quickly even if there are too many I/O requests. V4: 1. Introduce a new driver replication to avoid touch nbd and qcow2. V3: 1: use error_setg() instead of error_set() 2. Add a new block job API 3. Active disk, hidden disk and nbd target uses the same AioContext 4. Add a testcase to test new hbitmap API V2: 1. Redesign the secondary qemu(use image-fleecing) 2. Use Error objects to return error message 3. Address the comments from Max Reitz and Eric Blake Wen Congyang (8): unblock backup operations in backing file Store parent BDS in BdrvChild Backup: clear all bitmap when doing block checkpoint Allow creating backup jobs when opening BDS docs: block replication's description auto complete active commit Implement new driver for block replication support replication driver in blockdev-add block.c | 19 ++ block/Makefile.objs | 3 +- block/backup.c | 14 + block/mirror.c | 13 +- block/replication-comm.c | 66 + block/replication.c | 590 +++ blockdev.c | 2 +- blockjob.c | 11 + docs/block-replication.txt | 229 +++ include/block/block_int.h| 4 +- include/block/blockjob.h | 12 + include/block/replication-comm.h | 50 qapi/block-core.json | 33 ++- qemu-img.c | 2 +- 14 files changed, 1038 insertions(+), 10 deletions(-) create mode 100644 block/replication-comm.c create mode 100644 block/replication.c create mode 100644 docs/block-replication.txt create mode 100644 include/block/replication-comm.h
Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
On 01/20/2016 08:04 AM, Eric Blake wrote: On 01/13/2016 02:18 AM, Changlong Xie wrote: From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- block/Makefile.objs | 1 + block/replication-comm.c | 66 + block/replication.c | 590 +++ include/block/replication-comm.h | 50 qapi/block-core.json | 13 + 5 files changed, 720 insertions(+) create mode 100644 block/replication-comm.c create mode 100644 block/replication.c create mode 100644 include/block/replication-comm.h Just a high-level overview, mainly on the user-visible interface and things that caught my eye. Hi eric, thanks for your patience. +++ b/block/replication-comm.c @@ -0,0 +1,66 @@ +/* + * Replication Block filter + * + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD. + * Copyright (c) 2015 Intel Corporation + * Copyright (c) 2015 FUJITSU LIMITED Do you want to claim 2016 in any of this? Will correct all Copyright issues in next version. + +enum { +BLOCK_REPLICATION_NONE, /* block replication is not started */ +BLOCK_REPLICATION_RUNNING, /* block replication is running */ +BLOCK_REPLICATION_FAILOVER, /* failover is running in background */ +BLOCK_REPLICATION_FAILOVER_FAILED, /* failover failed*/ Space before */ Will fix this space issue +BLOCK_REPLICATION_DONE, /* block replication is done(after failover) */ +}; Should this be an enum type in a .json file? Since this is just an internal enum that only used in block/replication.c, i think we don't need put it in *.json file. + +static int replication_open(BlockDriverState *bs, QDict *options, +int flags, Error **errp) +{ + +fail: +qemu_opts_del(opts); +/* propagate error */ +if (local_err) { +error_propagate(errp, local_err); +} It's safe to call error_propagate() unconditionally (you could drop the 'if (local_err)'). You're right, i'll fix all relevant issue in my patchset. +static coroutine_fn int replication_co_discard(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors) +{ +BDRVReplicationState *s = bs->opaque; +int ret; + +ret = replication_get_io_status(s); +if (ret < 0) { +return ret; +} + +if (ret == 1) { +/* It is secondary qemu and failover are running*/ Space before */ Will fix +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp) +{ +Error *local_err = NULL; +int ret; + +if (!s->secondary_disk->job) { +error_setg(errp, "Backup job is cancelled unexpectedly"); Maybe s/is/was/ will fix +static void replication_start(BlockReplicationState *brs, ReplicationMode mode, + Error **errp) +{ +BlockDriverState *bs = brs->bs; +BDRVReplicationState *s = brs->bs->opaque; +int64_t active_length, hidden_length, disk_length; +AioContext *aio_context; +Error *local_err = NULL; + +if (s->replication_state != BLOCK_REPLICATION_NONE) { +error_setg(errp, "Block replication is running or done"); +return; +} + +if (s->mode != mode) { +error_setg(errp, "The parameter mode's value is invalid, needs %d," + " but receives %d", s->mode, mode); s/receives/got/ will fix +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp) +{ +BDRVReplicationState *s = brs->bs->opaque; + +if (s->replication_state != BLOCK_REPLICATION_RUNNING) { +error_setg(errp, "Block replication is not running"); +return; +} + +if (s->error) { +error_setg(errp, "I/O error occurs"); s/occurs/occurred/ Will fix +++ b/qapi/block-core.json @@ -1975,6 +1975,19 @@ '*read-pattern': 'QuorumReadPattern' } } ## +# @ReplicationMode +# +# An enumeration of replication modes. +# +# @primary: Primary mode, the vm's state will be sent to secondary QEMU. +# +# @secondary: Secondary mode, receive the vm's state from primary QEMU. +# +# Since: 2.6 +## +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] } + Interface looks okay. Thanks -Xie
[Qemu-devel] [PATCH v9 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- block.c | 8 ++-- block/quorum.c| 122 +- include/block/block.h | 4 ++ 3 files changed, 128 insertions(+), 6 deletions(-) diff --git a/block.c b/block.c index a347008..b9e99da 100644 --- a/block.c +++ b/block.c @@ -1196,10 +1196,10 @@ static int bdrv_fill_options(QDict **options, const char *filename, return 0; } -static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, -BlockDriverState *child_bs, -const char *child_name, -const BdrvChildRole *child_role) +BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, + BlockDriverState *child_bs, + const char *child_name, + const BdrvChildRole *child_role) { BdrvChild *child = g_new(BdrvChild, 1); *child = (BdrvChild) { diff --git a/block/quorum.c b/block/quorum.c index 6793f12..e73418c 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -23,6 +23,7 @@ #include "qapi/qmp/qstring.h" #include "qapi-event.h" #include "crypto/hash.h" +#include "qemu/bitmap.h" #define HASH_LENGTH 32 @@ -80,6 +81,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. */ +unsigned long *index_bitmap; +int bsize; QuorumReadPattern read_pattern; } BDRVQuorumState; @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, ret = -EINVAL; goto exit; } -if (s->num_children < 2) { +if (s->num_children < 1) { error_setg(&local_err, - "Number of provided children must be greater than 1"); + "Number of provided children must be 1 or more"); ret = -EINVAL; goto exit; } @@ -926,6 +929,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, /* allocate the children array */ s->children = g_new0(BdrvChild *, s->num_children); opened = g_new0(bool, s->num_children); +s->index_bitmap = bitmap_new(s->num_children); for (i = 0; i < s->num_children; i++) { char indexstr[32]; @@ -941,6 +945,8 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, opened[i] = true; } +bitmap_set(s->index_bitmap, 0, s->num_children); +s->bsize = s->num_children; g_free(opened); goto exit; @@ -997,6 +1003,115 @@ static void quorum_attach_aio_context(BlockDriverState *bs, } } +static int get_new_child_index(BDRVQuorumState *s) +{ +int index; + +index = find_next_zero_bit(s->index_bitmap, s->bsize, 0); +if (index < s->bsize) { +return index; +} + +if ((s->bsize % BITS_PER_LONG) == 0) { +s->index_bitmap = bitmap_zero_extend(s->index_bitmap, s->bsize, + s->bsize + 1); +} + +return s->bsize++; +} + +static void remove_child_index(BDRVQuorumState *s, int index) +{ +int last_index; +long new_len; + +assert(index < s->bsize); + +clear_bit(index, s->index_bitmap); +if (index < s->bsize - 1) { +/* + * The last bit is always set, and we don't clear + * the last bit. + */ +return; +} + +last_index = find_last_bit(s->index_bitmap, s->bsize); +s->bsize = last_index + 1; +if (BITS_TO_LONGS(last_index + 1) == BITS_TO_LONGS(s->bsize)) { +return; +} + +new_len = BITS_TO_LONGS(last_index + 1) * sizeof(unsigned long); +s->index_bitmap = g_realloc(s->index_bitmap, new_len); +} + +static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs, + Error **errp) +{ +BDRVQuorumState *s = bs->opaque; +BdrvChild *child; +char indexstr[32]; +int index, ret; + +index = get_new_child_index(s); +ret = snprintf(indexstr, 32, "children.%d", index); +if (ret < 0 || ret >= 32) { +error_setg(errp, "cannot generate child name"); +return; +} + +bdrv_drain(bs); + +assert(s->num_children <= INT_MAX / sizeof(BdrvChild *)); +if (s->num_children == INT_MAX / sizeof(BdrvChild *)) { +error_setg(errp, "Too many children"); +return; +} +s->children = g_renew(BdrvChild *, s->children, s->num_children + 1); + +bdrv_ref(child_bs); +child = bdrv_attach_child(bs, child_bs, in
[Qemu-devel] [PATCH v9 0/3] qapi: child add/delete support
If quorum's child is broken, we can use mirror job to replace it. But sometimes, the user only need to remove the broken child, and add it later when the problem is fixed. ChangLog: v9: 1. Rebase to the newest codes 2. Remove redundant codes in quorum_add_child() and quorum_del_child() 3. Fix typos and in qmp-commands.hx v8: 1. Rebase to the newest codes 2. Address the comments from Eric Blake v7: 1. Remove the qmp command x-blockdev-change's parameter operation according to Kevin's comments. 2. Remove the hmp command. v6: 1. Use a single qmp command x-blockdev-change to replace x-blockdev-child-add and x-blockdev-child-delete v5: 1. Address Eric Blake's comments v4: 1. drop nbd driver's implementation. We can use human-monitor-command to do it. 2. Rename the command name. v3: 1. Don't open BDS in bdrv_add_child(). Use the existing BDS which is created by the QMP command blockdev-add. 2. The driver NBD can support filename, path, host:port now. v2: 1. Use bdrv_get_device_or_node_name() instead of new function bdrv_get_id_or_node_name() 2. Update the error message 3. Update the documents in block-core.json Wen Congyang (3): Add new block driver interface to add/delete a BDS's child quorum: implement bdrv_add_child() and bdrv_del_child() qmp: add monitor command to add/remove a child block.c | 58 -- block/quorum.c| 122 +- blockdev.c| 54 include/block/block.h | 9 include/block/block_int.h | 5 ++ qapi/block-core.json | 23 + qmp-commands.hx | 47 ++ 7 files changed, 312 insertions(+), 6 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH v9 1/3] Add new block driver interface to add/delete a BDS's child
From: Wen Congyang In some cases, we want to take a quorum child offline, and take another child online. Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Eric Blake Reviewed-by: Alberto Garcia --- block.c | 50 +++ include/block/block.h | 5 + include/block/block_int.h | 5 + 3 files changed, 60 insertions(+) diff --git a/block.c b/block.c index 411edbf..a347008 100644 --- a/block.c +++ b/block.c @@ -4320,3 +4320,53 @@ void bdrv_refresh_filename(BlockDriverState *bs) QDECREF(json); } } + +/* + * Hot add/remove a BDS's child. So the user can take a child offline when + * it is broken and take a new child online + */ +void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs, +Error **errp) +{ + +if (!parent_bs->drv || !parent_bs->drv->bdrv_add_child) { +error_setg(errp, "The node %s doesn't support adding a child", + bdrv_get_device_or_node_name(parent_bs)); +return; +} + +if (!QLIST_EMPTY(&child_bs->parents)) { +error_setg(errp, "The node %s already has a parent", + child_bs->node_name); +return; +} + +parent_bs->drv->bdrv_add_child(parent_bs, child_bs, errp); +} + +void bdrv_del_child(BlockDriverState *parent_bs, BlockDriverState *child_bs, +Error **errp) +{ +BdrvChild *child; + +if (!parent_bs->drv || !parent_bs->drv->bdrv_del_child) { +error_setg(errp, "The node %s doesn't support removing a child", + bdrv_get_device_or_node_name(parent_bs)); +return; +} + +QLIST_FOREACH(child, &parent_bs->children, next) { +if (child->bs == child_bs) { +break; +} +} + +if (!child) { +error_setg(errp, "The node %s is not a child of %s", + bdrv_get_device_or_node_name(child_bs), + bdrv_get_device_or_node_name(parent_bs)); +return; +} + +parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp); +} diff --git a/include/block/block.h b/include/block/block.h index db8e096..863a7c8 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -578,4 +578,9 @@ void bdrv_drained_begin(BlockDriverState *bs); */ void bdrv_drained_end(BlockDriverState *bs); +void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child, +Error **errp); +void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child, +Error **errp); + #endif diff --git a/include/block/block_int.h b/include/block/block_int.h index 256609d..ebe8b1e 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -303,6 +303,11 @@ struct BlockDriver { */ void (*bdrv_drain)(BlockDriverState *bs); +void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child, + Error **errp); +void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child, + Error **errp); + QLIST_ENTRY(BlockDriver) list; }; -- 1.9.3
[Qemu-devel] [PATCH v9 3/3] qmp: add monitor command to add/remove a child
From: Wen Congyang The new QMP command name is x-blockdev-change. It's just for adding/removing quorum's child now, and doesn't support all kinds of children, all kinds of operations, nor all block drivers. So it is experimental now. Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- blockdev.c | 54 qapi/block-core.json | 23 ++ qmp-commands.hx | 47 + 3 files changed, 124 insertions(+) diff --git a/blockdev.c b/blockdev.c index 64dbfeb..4e62fdf 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3836,6 +3836,60 @@ out: aio_context_release(aio_context); } +static BlockDriverState *bdrv_find_child(BlockDriverState *parent_bs, + const char *child_name) +{ +BdrvChild *child; + +QLIST_FOREACH(child, &parent_bs->children, next) { +if (strcmp(child->name, child_name) == 0) { +return child->bs; +} +} + +return NULL; +} + +void qmp_x_blockdev_change(const char *parent, bool has_child, + const char *child, bool has_node, + const char *node, Error **errp) +{ +BlockDriverState *parent_bs, *child_bs = NULL, *new_bs = NULL; + +parent_bs = bdrv_lookup_bs(parent, parent, errp); +if (!parent_bs) { +return; +} + +if (has_child == has_node) { +if (has_child) { +error_setg(errp, "The paramter child and node is conflict"); +} else { +error_setg(errp, "Either child or node should be specified"); +} +return; +} + +if (has_child) { +child_bs = bdrv_find_child(parent_bs, child); +if (!child_bs) { +error_setg(errp, "Node '%s' doesn't have child %s", + parent, child); +return; +} +bdrv_del_child(parent_bs, child_bs, errp); +} + +if (has_node) { +new_bs = bdrv_find_node(node); +if (!new_bs) { +error_setg(errp, "Node '%s' not found", node); +return; +} +bdrv_add_child(parent_bs, new_bs, errp); +} +} + BlockJobInfoList *qmp_query_block_jobs(Error **errp) { BlockJobInfoList *head = NULL, **p_next = &head; diff --git a/qapi/block-core.json b/qapi/block-core.json index 1a5d9ce..fe63c6d 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2408,3 +2408,26 @@ ## { 'command': 'block-set-write-threshold', 'data': { 'node-name': 'str', 'write-threshold': 'uint64' } } + +## +# @x-blockdev-change +# +# Dynamically reconfigure the block driver state graph. It can be used +# to add, remove, insert or replace a block driver state. Currently only +# the Quorum driver implements this feature to add or remove its child. +# This is useful to fix a broken quorum child. +# +# @parent: the id or name of the node that will be changed. +# +# @child: #optional the name of the child that will be deleted. +# +# @node: #optional the name of the node will be added. +# +# Note: this command is experimental, and its API is not stable. +# +# Since: 2.6 +## +{ 'command': 'x-blockdev-change', + 'data' : { 'parent': 'str', + '*child': 'str', + '*node': 'str' } } diff --git a/qmp-commands.hx b/qmp-commands.hx index 7b235ee..efee0ca 100644 --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -4293,6 +4293,53 @@ Example: EQMP { +.name = "x-blockdev-change", +.args_type = "parent:B,child:B?,node:B?", +.mhandler.cmd_new = qmp_marshal_x_blockdev_change, +}, + +SQMP +x-blockdev-change +- + +Dynamically reconfigure the block driver state graph. It can be used to +add, remove, insert, or replace a block driver state. Currently only +the Quorum driver implements this feature to add and remove its child. +This is useful to fix a broken quorum child. + +Arguments: +- "parent": the id or node name of which node will be changed (json-string) +- "child": the child name which will be deleted (json-string, optional) +- "node": the new node-name which will be added (json-string, optional) + +Note: this command is experimental, and not a stable API. It doesn't +support all kinds of operations, all kinds of children, nor all block +drivers. + +Example: + +Add a new node to a quorum +-> { "execute": "blockdev-add", +"arguments": { "options": { "driver": "raw", +"node-name": "new_node", +
[Qemu-devel] [PATCH v13 02/10] Store parent BDS in BdrvChild
From: Wen Congyang We need to access the parent BDS to get the root BDS. Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block.c | 1 + include/block/block_int.h | 1 + 2 files changed, 2 insertions(+) diff --git a/block.c b/block.c index 1589c0d..c9c913e 100644 --- a/block.c +++ b/block.c @@ -1204,6 +1204,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, BdrvChild *child = g_new(BdrvChild, 1); *child = (BdrvChild) { .bs = child_bs, +.parent = parent_bs, .name = g_strdup(child_name), .role = child_role, }; diff --git a/include/block/block_int.h b/include/block/block_int.h index ebe8b1e..19c02b6 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -361,6 +361,7 @@ extern const BdrvChildRole child_format; struct BdrvChild { BlockDriverState *bs; +BlockDriverState *parent; char *name; const BdrvChildRole *role; QLIST_ENTRY(BdrvChild) next; -- 1.9.3
[Qemu-devel] [PATCH v13 03/10] Backup: clear all bitmap when doing block checkpoint
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Jeff Cody --- block/backup.c | 14 ++ blockjob.c | 11 +++ include/block/blockjob.h | 12 3 files changed, 37 insertions(+) diff --git a/block/backup.c b/block/backup.c index 705bb77..0a27d01 100644 --- a/block/backup.c +++ b/block/backup.c @@ -253,11 +253,25 @@ static void backup_abort(BlockJob *job) } } +static void backup_do_checkpoint(BlockJob *job, Error **errp) +{ +BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common); + +if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +error_setg(errp, "The backup job only supports block checkpoint in" + " sync=none mode"); +return; +} + +hbitmap_reset_all(backup_job->bitmap); +} + static const BlockJobDriver backup_job_driver = { .instance_size = sizeof(BackupBlockJob), .job_type = BLOCK_JOB_TYPE_BACKUP, .set_speed = backup_set_speed, .iostatus_reset = backup_iostatus_reset, +.do_checkpoint = backup_do_checkpoint, .commit = backup_commit, .abort = backup_abort, }; diff --git a/blockjob.c b/blockjob.c index 80adb9d..0c8edfe 100644 --- a/blockjob.c +++ b/blockjob.c @@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job) QLIST_INSERT_HEAD(&txn->jobs, job, txn_list); block_job_txn_ref(txn); } + +void block_job_do_checkpoint(BlockJob *job, Error **errp) +{ +if (!job->driver->do_checkpoint) { +error_setg(errp, "The job %s doesn't support block checkpoint", + BlockJobType_lookup[job->driver->job_type]); +return; +} + +job->driver->do_checkpoint(job, errp); +} diff --git a/include/block/blockjob.h b/include/block/blockjob.h index d84ccd8..abdba7c 100644 --- a/include/block/blockjob.h +++ b/include/block/blockjob.h @@ -70,6 +70,9 @@ typedef struct BlockJobDriver { * never both. */ void (*abort)(BlockJob *job); + +/** Optional callback for job types that support checkpoint. */ +void (*do_checkpoint)(BlockJob *job, Error **errp); } BlockJobDriver; /** @@ -443,4 +446,13 @@ void block_job_txn_unref(BlockJobTxn *txn); */ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job); +/** + * block_job_do_checkpoint: + * @job: The job. + * @errp: Error object. + * + * Do block checkpoint on the specified job. + */ +void block_job_do_checkpoint(BlockJob *job, Error **errp); + #endif -- 1.9.3
[Qemu-devel] [PATCH v13 01/10] unblock backup operations in backing file
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/block.c b/block.c index b9e99da..1589c0d 100644 --- a/block.c +++ b/block.c @@ -1275,6 +1275,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd) /* Otherwise we won't be able to commit due to check in bdrv_commit */ bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET, bs->backing_blocker); +/* + * We do backup in 3 ways: + * 1. drive backup + *The target bs is new opened, and the source is top BDS + * 2. blockdev backup + *Both the source and the target are top BDSes. + * 3. internal backup(used for block replication) + *Both the source and the target are backing file + * + * In case 1, and 2, the backing file is neither the source nor + * the target. + * In case 3, we will block the top BDS, so there is only one block + * job for the top BDS and its backing chain. + */ +bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE, +bs->backing_blocker); +bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET, +bs->backing_blocker); out: bdrv_refresh_limits(bs, NULL); } -- 1.9.3
[Qemu-devel] [PATCH v13 04/10] Allow creating backup jobs when opening BDS
From: Wen Congyang When opening BDS, we need to create backup jobs for image-fleecing. Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Stefan Hajnoczi Reviewed-by: Jeff Cody --- block/Makefile.objs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/Makefile.objs b/block/Makefile.objs index 58ef2ef..fa05f37 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o block-obj-y += write-threshold.o +block-obj-y += backup.o common-obj-y += stream.o common-obj-y += commit.o -common-obj-y += backup.o iscsi.o-cflags := $(LIBISCSI_CFLAGS) iscsi.o-libs := $(LIBISCSI_LIBS) -- 1.9.3
[Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints
Block replication is a very important feature which is used for continuous checkpoints(for example: COLO). You can get the detailed information about block replication from here: http://wiki.qemu.org/Features/BlockReplication Usage: Please refer to docs/block-replication.txt This patch series is based on the following patch series: 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html You can get the patch here: https://github.com/Pating/qemu/tree/changlox/block-replication-v13 You can get the patch with framework here: https://github.com/Pating/qemu/tree/changlox/colo_framework_v12 TODO: 1. Continuous block replication. It will be started after basic functions are accepted. Changs Log: V13: 1. Rebase to the newest codes 2. Remove redundant marcos and semicolon in replication.c 3. Fix typos in block-replication.txt V12: 1. Rebase to the newest codes 2. Use backing reference to replcace 'allow-write-backing-file' V11: 1. Reopen the backing file when starting blcok replication if it is not opened in R/W mode 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET when opening backing file 3. Block the top BDS so there is only one block job for the top BDS and its backing chain. V10: 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing reference. 2. Address the comments from Eric Blake V9: 1. Update the error messages 2. Rebase to the newest qemu 3. Split child add/delete support. These patches are sent in another patchset. V8: 1. Address Alberto Garcia's comments V7: 1. Implement adding/removing quorum child. Remove the option non-connect. 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion V6: 1. Rebase to the newest qemu. V5: 1. Address the comments from Gong Lei 2. Speed the failover up. The secondary vm can take over very quickly even if there are too many I/O requests. V4: 1. Introduce a new driver replication to avoid touch nbd and qcow2. V3: 1: use error_setg() instead of error_set() 2. Add a new block job API 3. Active disk, hidden disk and nbd target uses the same AioContext 4. Add a testcase to test new hbitmap API V2: 1. Redesign the secondary qemu(use image-fleecing) 2. Use Error objects to return error message 3. Address the comments from Max Reitz and Eric Blake Wen Congyang (10): unblock backup operations in backing file Store parent BDS in BdrvChild Backup: clear all bitmap when doing block checkpoint Allow creating backup jobs when opening BDS docs: block replication's description Add new block driver interfaces to control block replication quorum: implement block driver interfaces for block replication Implement new driver for block replication support replication driver in blockdev-add Add a new API to start/stop replication, do checkpoint to all BDSes block.c| 145 block/Makefile.objs| 3 +- block/backup.c | 14 ++ block/quorum.c | 78 +++ block/replication.c| 545 + blockjob.c | 11 + docs/block-replication.txt | 227 +++ include/block/block.h | 9 + include/block/block_int.h | 15 ++ include/block/blockjob.h | 12 + qapi/block-core.json | 33 ++- 11 files changed, 1089 insertions(+), 3 deletions(-) create mode 100644 block/replication.c create mode 100644 docs/block-replication.txt -- 1.9.3
[Qemu-devel] [PATCH v13 09/10] support replication driver in blockdev-add
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Eric Blake --- qapi/block-core.json | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 610da92..7354c6a 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -220,6 +220,7 @@ # 2.2: 'archipelago' added, 'cow' dropped # 2.3: 'host_floppy' deprecated # 2.5: 'host_floppy' dropped +# 2.6: 'replication' added # # @backing_file: #optional the name of the backing file (for copy-on-write) # @@ -1492,6 +1493,7 @@ # Drivers that are supported in block device operations. # # @host_device, @host_cdrom: Since 2.1 +# @replication: Since 2.6 # # Since: 2.0 ## @@ -1499,8 +1501,8 @@ 'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop', 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device', 'http', 'https', 'null-aio', 'null-co', 'parallels', -'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx', -'vmdk', 'vpc', 'vvfat' ] } +'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication', +'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } ## # @BlockdevOptionsBase @@ -1940,6 +1942,19 @@ { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] } ## +# @BlockdevOptionsReplication +# +# Driver specific block device options for replication +# +# @mode: the replication mode +# +# Since: 2.6 +## +{ 'struct': 'BlockdevOptionsReplication', + 'base': 'BlockdevOptionsGenericFormat', + 'data': { 'mode': 'ReplicationMode' } } + +## # @BlockdevOptions # # Options for creating a block device. @@ -1976,6 +1991,7 @@ 'quorum': 'BlockdevOptionsQuorum', 'raw':'BlockdevOptionsGenericFormat', # TODO rbd: Wait for structured options + 'replication':'BlockdevOptionsReplication', # TODO sheepdog: Wait for structured options # TODO ssh: Should take InetSocketAddress for 'host'? 'tftp': 'BlockdevOptionsFile', -- 1.9.3
[Qemu-devel] [PATCH v13 07/10] quorum: implement block driver interfaces for block replication
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Reviewed-by: Alberto Garcia --- block/quorum.c | 78 ++ 1 file changed, 78 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index e73418c..aa8c4dd 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -85,6 +85,8 @@ typedef struct BDRVQuorumState { int bsize; QuorumReadPattern read_pattern; + +int replication_index; /* store which child supports block replication */ } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -949,6 +951,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, s->bsize = s->num_children; g_free(opened); +s->replication_index = -1; goto exit; close_exit: @@ -1146,6 +1149,77 @@ static void quorum_refresh_filename(BlockDriverState *bs, QDict *options) bs->full_open_options = opts; } +static void quorum_start_replication(BlockDriverState *bs, ReplicationMode mode, + Error **errp) +{ +BDRVQuorumState *s = bs->opaque; +int count = 0, i, index; +Error *local_err = NULL; + +/* + * TODO: support REPLICATION_MODE_SECONDARY if we allow secondary + * QEMU becoming primary QEMU. + */ +if (mode != REPLICATION_MODE_PRIMARY) { +error_setg(errp, "The replication mode for quorum should be 'primary'"); +return; +} + +if (s->read_pattern != QUORUM_READ_PATTERN_FIFO) { +error_setg(errp, "Block replication needs read pattern 'fifo'"); +return; +} + +for (i = 0; i < s->num_children; i++) { +bdrv_start_replication(s->children[i]->bs, mode, &local_err); +if (local_err) { +error_free(local_err); +local_err = NULL; +} else { +count++; +index = i; +} +} + +if (count == 0) { +error_setg(errp, "No child supports block replication"); +} else if (count > 1) { +for (i = 0; i < s->num_children; i++) { +bdrv_stop_replication(s->children[i]->bs, false, NULL); +} +error_setg(errp, "Too many children support block replication"); +} else { +s->replication_index = index; +} +} + +static void quorum_do_checkpoint(BlockDriverState *bs, Error **errp) +{ +BDRVQuorumState *s = bs->opaque; + +if (s->replication_index < 0) { +error_setg(errp, "Block replication is not running"); +return; +} + +bdrv_do_checkpoint(s->children[s->replication_index]->bs, errp); +} + +static void quorum_stop_replication(BlockDriverState *bs, bool failover, +Error **errp) +{ +BDRVQuorumState *s = bs->opaque; + +if (s->replication_index < 0) { +error_setg(errp, "Block replication is not running"); +return; +} + +bdrv_stop_replication(s->children[s->replication_index]->bs, failover, + errp); +s->replication_index = -1; +} + static BlockDriver bdrv_quorum = { .format_name= "quorum", .protocol_name = "quorum", @@ -1172,6 +1246,10 @@ static BlockDriver bdrv_quorum = { .is_filter = true, .bdrv_recurse_is_first_non_filter = quorum_recurse_is_first_non_filter, + +.bdrv_start_replication = quorum_start_replication, +.bdrv_do_checkpoint = quorum_do_checkpoint, +.bdrv_stop_replication = quorum_stop_replication, }; static void bdrv_quorum_init(void) -- 1.9.3
[Qemu-devel] [PATCH v13 06/10] Add new block driver interfaces to control block replication
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie Cc: Luiz Capitulino Cc: Michael Roth Reviewed-by: Paolo Bonzini --- block.c | 43 +++ include/block/block.h | 5 + include/block/block_int.h | 14 ++ qapi/block-core.json | 13 + 4 files changed, 75 insertions(+) diff --git a/block.c b/block.c index c9c913e..275d8b4 100644 --- a/block.c +++ b/block.c @@ -4389,3 +4389,46 @@ void bdrv_del_child(BlockDriverState *parent_bs, BlockDriverState *child_bs, parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp); } + +void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode, +Error **errp) +{ +BlockDriver *drv = bs->drv; + +if (drv && drv->bdrv_start_replication) { +drv->bdrv_start_replication(bs, mode, errp); +} else if (bs->file) { +bdrv_start_replication(bs->file->bs, mode, errp); +} else { +error_setg(errp, "The BDS %s doesn't support starting block" + " replication", bs->filename); +} +} + +void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp) +{ +BlockDriver *drv = bs->drv; + +if (drv && drv->bdrv_do_checkpoint) { +drv->bdrv_do_checkpoint(bs, errp); +} else if (bs->file) { +bdrv_do_checkpoint(bs->file->bs, errp); +} else { +error_setg(errp, "The BDS %s doesn't support block checkpoint", + bs->filename); +} +} + +void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp) +{ +BlockDriver *drv = bs->drv; + +if (drv && drv->bdrv_stop_replication) { +drv->bdrv_stop_replication(bs, failover, errp); +} else if (bs->file) { +bdrv_stop_replication(bs->file->bs, failover, errp); +} else { +error_setg(errp, "The BDS %s doesn't support stopping block" + " replication", bs->filename); +} +} diff --git a/include/block/block.h b/include/block/block.h index 6c7e54b..5d47cef 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -587,4 +587,9 @@ void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child, void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child, Error **errp); +void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode, +Error **errp); +void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp); +void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp); + #endif diff --git a/include/block/block_int.h b/include/block/block_int.h index 19c02b6..e31f9db 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -308,6 +308,20 @@ struct BlockDriver { void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child, Error **errp); +void (*bdrv_start_replication)(BlockDriverState *bs, ReplicationMode mode, + Error **errp); +/* Drop Disk buffer when doing checkpoint. */ +void (*bdrv_do_checkpoint)(BlockDriverState *bs, Error **errp); +/* + * After failover, we should flush Disk buffer into secondary disk + * and stop block replication. + * + * If the guest is shutdown, we should drop Disk buffer and stop + * block representation. + */ +void (*bdrv_stop_replication)(BlockDriverState *bs, bool failover, + Error **errp); + QLIST_ENTRY(BlockDriver) list; }; diff --git a/qapi/block-core.json b/qapi/block-core.json index fe63c6d..610da92 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1927,6 +1927,19 @@ '*read-pattern': 'QuorumReadPattern' } } ## +# @ReplicationMode +# +# An enumeration of replication modes. +# +# @primary: Primary mode, the vm's state will be sent to secondary QEMU. +# +# @secondary: Secondary mode, receive the vm's state from primary QEMU. +# +# Since: 2.6 +## +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] } + +## # @BlockdevOptions # # Options for creating a block device. -- 1.9.3
[Qemu-devel] [PATCH v13 05/10] docs: block replication's description
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- docs/block-replication.txt | 227 + 1 file changed, 227 insertions(+) create mode 100644 docs/block-replication.txt diff --git a/docs/block-replication.txt b/docs/block-replication.txt new file mode 100644 index 000..73abb65 --- /dev/null +++ b/docs/block-replication.txt @@ -0,0 +1,227 @@ +Block replication + +Copyright Fujitsu, Corp. 2015 +Copyright (c) 2015 Intel Corporation +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD. + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. + +Block replication is used for continuous checkpoints. It is designed +for COLO (COarse-grain LOck-stepping) where the Secondary VM is running. +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario, +where the Secondary VM is not running. + +This document gives an overview of block replication's design. + +== Background == +High availability solutions such as micro checkpoint and COLO will do +consecutive checkpoints. The VM state of Primary VM and Secondary VM is +identical right after a VM checkpoint, but becomes different as the VM +executes till the next checkpoint. To support disk contents checkpoint, +the modified disk contents in the Secondary VM must be buffered, and are +only dropped at next checkpoint time. To reduce the network transportation +effort at the time of checkpoint, the disk modification operations of +Primary disk are asynchronously forwarded to the Secondary node. + +== Workflow == +The following is the image of block replication workflow: + ++--+++ +|Primary Write Requests||Secondary Write Requests| ++--+++ + | | + | (4) + | V + | /-\ + | Copy and Forward| | + |-(1)--+ | Disk Buffer | + | | | | + | (3) \-/ + | speculative ^ + |write through(2) + | | | + V V | + +--+ ++ + | Primary Disk | | Secondary Disk | + +--+ ++ + +1) Primary write requests will be copied and forwarded to Secondary + QEMU. +2) Before Primary write requests are written to Secondary disk, the + original sector content will be read from Secondary disk and + buffered in the Disk buffer, but it will not overwrite the existing + sector content (it could be from either "Secondary Write Requests" or + previous COW of "Primary Write Requests") in the Disk buffer. +3) Primary write requests will be written to Secondary disk. +4) Secondary write requests will be buffered in the Disk buffer and it + will overwrite the existing sector content in the buffer. + +== Architecture == +We are going to implement block replication from many basic +blocks that are already in QEMU. + + virtio-blk || + ^||.-- + |||| Secondary +1 Quorum ||'-- + / \ || +/\|| + Primary2 filter + disk ^ virtio-blk + | ^ +3 NBD ---> 3 NBD | +client|| server 2 filter + ||^ ^ +. ||| | +Primary | || Secondary disk <- hidden-disk 5 <- active-disk 4 +' ||| backing^ backing + ||| | + ||| | + ||'-' +
[Qemu-devel] [PATCH v13 08/10] Implement new driver for block replication
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- block/Makefile.objs | 1 + block/replication.c | 545 2 files changed, 546 insertions(+) create mode 100644 block/replication.c diff --git a/block/Makefile.objs b/block/Makefile.objs index fa05f37..94c1d03 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o block-obj-y += write-threshold.o block-obj-y += backup.o +block-obj-y += replication.o common-obj-y += stream.o common-obj-y += commit.o diff --git a/block/replication.c b/block/replication.c new file mode 100644 index 000..6a061c9 --- /dev/null +++ b/block/replication.c @@ -0,0 +1,545 @@ +/* + * Replication Block filter + * + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD. + * Copyright (c) 2015 Intel Corporation + * Copyright (c) 2015 FUJITSU LIMITED + * + * Author: + * Wen Congyang + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include "qemu-common.h" +#include "block/block_int.h" +#include "block/blockjob.h" +#include "block/nbd.h" + +typedef struct BDRVReplicationState { +ReplicationMode mode; +int replication_state; +BlockDriverState *active_disk; +BlockDriverState *hidden_disk; +BlockDriverState *secondary_disk; +BlockDriverState *top_bs; +Error *blocker; +int orig_hidden_flags; +int orig_secondary_flags; +int error; +} BDRVReplicationState; + +enum { +BLOCK_REPLICATION_NONE, /* block replication is not started */ +BLOCK_REPLICATION_RUNNING, /* block replication is running */ +BLOCK_REPLICATION_DONE, /* block replication is done(failover) */ +}; + +static void replication_stop(BlockDriverState *bs, bool failover, Error **errp); + +#define REPLICATION_MODE"mode" +static QemuOptsList replication_runtime_opts = { +.name = "replication", +.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head), +.desc = { +{ +.name = REPLICATION_MODE, +.type = QEMU_OPT_STRING, +}, +{ /* end of list */ } +}, +}; + +static int replication_open(BlockDriverState *bs, QDict *options, +int flags, Error **errp) +{ +int ret; +BDRVReplicationState *s = bs->opaque; +Error *local_err = NULL; +QemuOpts *opts = NULL; +const char *mode; + +ret = -EINVAL; +opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort); +qemu_opts_absorb_qdict(opts, options, &local_err); +if (local_err) { +goto fail; +} + +mode = qemu_opt_get(opts, REPLICATION_MODE); +if (!mode) { +error_setg(&local_err, "Missing the option mode"); +goto fail; +} + +if (!strcmp(mode, "primary")) { +s->mode = REPLICATION_MODE_PRIMARY; +} else if (!strcmp(mode, "secondary")) { +s->mode = REPLICATION_MODE_SECONDARY; +} else { +error_setg(&local_err, + "The option mode's value should be primary or secondary"); +goto fail; +} + +ret = 0; + +fail: +qemu_opts_del(opts); +/* propagate error */ +if (local_err) { +error_propagate(errp, local_err); +} +return ret; +} + +static void replication_close(BlockDriverState *bs) +{ +BDRVReplicationState *s = bs->opaque; + +if (s->replication_state == BLOCK_REPLICATION_RUNNING) { +replication_stop(bs, false, NULL); +} +} + +static int64_t replication_getlength(BlockDriverState *bs) +{ +return bdrv_getlength(bs->file->bs); +} + +static int replication_get_io_status(BDRVReplicationState *s) +{ +switch (s->replication_state) { +case BLOCK_REPLICATION_NONE: +return -EIO; +case BLOCK_REPLICATION_RUNNING: +return 0; +case BLOCK_REPLICATION_DONE: +return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1; +default: +abort(); +} +} + +static int replication_return_value(BDRVReplicationState *s, int ret) +{ +if (s->mode == REPLICATION_MODE_SECONDARY) { +return ret; +} + +if (ret < 0) { +s->error = ret; +ret = 0; +} + +return ret; +} + +static coroutine_fn int replication_co_readv(BlockDriverState *bs, + int64_t sector_num, + int remaining_sectors, + QEMUIOVector *qiov) +{ +BDRVReplicationState *s = bs->opaque; +int ret; + +if (s->mode == REPLICATION_MODE_PRIMARY) { +/* We only use it to forward primary write requests */ +
[Qemu-devel] [PATCH v13 10/10] Add a new API to start/stop replication, do checkpoint to all BDSes
From: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: zhanghailiang Signed-off-by: Gonglei Signed-off-by: Changlong Xie --- block.c | 83 +++ include/block/block.h | 4 +++ 2 files changed, 87 insertions(+) diff --git a/block.c b/block.c index 275d8b4..634cc97 100644 --- a/block.c +++ b/block.c @@ -4432,3 +4432,86 @@ void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp) " replication", bs->filename); } } + +void bdrv_start_replication_all(ReplicationMode mode, Error **errp) +{ +BlockDriverState *bs = NULL, *temp = NULL; +Error *local_err = NULL; + +while ((bs = bdrv_next(bs))) { +if (!QLIST_EMPTY(&bs->parents)) { +/* It is not top BDS */ +continue; +} + +if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) { +continue; +} + +bdrv_start_replication(bs, mode, &local_err); +if (local_err) { +error_propagate(errp, local_err); +goto fail; +} +} + +return; + +fail: +while ((temp = bdrv_next(temp)) && bs != temp) { +bdrv_stop_replication(temp, false, NULL); +} +} + +void bdrv_do_checkpoint_all(Error **errp) +{ +BlockDriverState *bs = NULL; +Error *local_err = NULL; + +while ((bs = bdrv_next(bs))) { +if (!QLIST_EMPTY(&bs->parents)) { +/* It is not top BDS */ +continue; +} + +if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) { +continue; +} + +bdrv_do_checkpoint(bs, &local_err); +if (local_err) { +error_propagate(errp, local_err); +return; +} +} +} + +void bdrv_stop_replication_all(bool failover, Error **errp) +{ +BlockDriverState *bs = NULL; +Error *local_err = NULL; + +while ((bs = bdrv_next(bs))) { +if (!QLIST_EMPTY(&bs->parents)) { +/* It is not top BDS */ +continue; +} + +if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) { +continue; +} + +bdrv_stop_replication(bs, failover, &local_err); +if (!errp) { +/* + * The caller doesn't care the result, they just + * want to stop all block's replication. + */ +continue; +} +if (local_err) { +error_propagate(errp, local_err); +return; +} +} +} diff --git a/include/block/block.h b/include/block/block.h index 5d47cef..9c4de14 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -592,4 +592,8 @@ void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode, void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp); void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp); +void bdrv_start_replication_all(ReplicationMode mode, Error **errp); +void bdrv_do_checkpoint_all(Error **errp); +void bdrv_stop_replication_all(bool failover, Error **errp); + #endif -- 1.9.3
Re: [Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash
On 02/20/2016 10:28 PM, Max Reitz wrote: On 19.02.2016 12:24, Alberto Garcia wrote: On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang wrote: If quorum has two children(A, B). A do flush sucessfully, but B flush failed. We MUST choice A as winner rather than just pick anyone of them. Otherwise the filesystem of guest will become read-only with following errors: end_request: I/O error, dev vda, sector 11159960 Aborting journal on device vda3-8 EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal EXT4-fs (vda3): Remounting filesystem read-only Hi Xie, Let's see if I'm getting this right: - When Quorum flushes to disk, there's a vote among the return values of the flush operations of its members, and the one that wins is the one that Quorum returns. - If there's a tie then Quorum choses the first result from the list of winners. - With your patch you want to give priority to the vote with result == 0 if there's any, so Quorum would return 0 (and succeed). This seems to me like an ad-hoc fix for a particular use case. What if you have 3 members and two of them fail with the same error code? Would you still return 0 or the error code from the other two? For example: children.0 returns 0 children.1 returns -EIO children.2 returns -EPIPE In this case, quorum returns -EPIPE now(without this patch). For example: children.0 returns -EPIPE children.1 returns -EIO children.2 returns 0 In this case, quorum returns 0 now. My question is: what's the rationale for returning 0 in case a) but not in case b)? a) children.0 returns -EPIPE children.1 returns -EIO children.2 returns 0 b) children.0 returns -EIO children.1 returns -EIO children.2 returns 0 In both cases you have one successful flush and two errors. You want to return always 0 in case a) and always -EIO in case b). But the only difference is that in case b) the errors happen to be the same, so why does that matter? That said, I'm not very convinced of the current logics of the Quorum flush code either, so it's not even a problem with your patch... it seems to me that the code should follow the same logics as in the read/write case: if the number of correct flushes >= threshold then return 0, else select the most common error code. I'm not convinced of the logic either, which is why I waited for you to respond to this patch. :-) Intuitively, I'd expect Quorum to return an error if flushing failed for any of the children, because, well, flushing failed. I somehow feel like flushing is different from a read or write operation and therefore ignoring the threshold would be fine here. However, maybe my intuition is just off. Anyway, regardless of that, if we do take the threshold into account, we should not use the exact error value for voting but just whether an error occurred or not. If all but one children fail to flush (all for different reasons), I find it totally wrong to return success. We should then just return -EIO or something. Hi Berto & Max Thanks for your comments, i'd like to have a summary here. For flush cases: 1) if flush successfully(result >= 0), result = 0; else if result < 0, result = -EIO. then invoke quorum_count_vote 2) if correct flushes >= threshold, mark correct flushes as winner directly. Will fix in next version. Thanks -Xie Max
[Qemu-devel] [PATCH v3 1/1] quorum: modify vote rules for flush operation
Keep flush interface the same logic as quorum read/write, Otherwise in following scenario, we'll encounter unexpected errors. Quorum has two children(A, B). A do flush sucessfully, but B flush failed. This cause the filesystem of guest become read-only with following errors: end_request: I/O error, dev vda, sector 11159960 Aborting journal on device vda3-8 EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal EXT4-fs (vda3): Remounting filesystem read-only Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index f78d4cb..62d3ccb 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -763,19 +763,26 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) QuorumVoteValue result_value; int i; int result = 0; +int success_count = 0; QLIST_INIT(&error_votes.vote_list); error_votes.compare = quorum_64bits_compare; for (i = 0; i < s->num_children; i++) { result = bdrv_co_flush(s->children[i]->bs); -result_value.l = result; -quorum_count_vote(&error_votes, &result_value, i); +if (result) { +result_value.l = result; +quorum_count_vote(&error_votes, &result_value, i); +} else +success_count++; } -winner = quorum_get_vote_winner(&error_votes); -result = winner->value.l; - +if (success_count >= s->threshold) +result = 0; +else { +winner = quorum_get_vote_winner(&error_votes); +result = winner->value.l; +} quorum_free_vote_list(&error_votes); return result; -- 1.9.3
[Qemu-devel] [PATCH v3 0/1] modify vote rules for flush operation
ChangLog: v3: *Note* that, the codes logic is different from what we talked in v2. I just keep flush interface the same logic as quorum read/write, and think it's reasoned. [Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.htm Changlong Xie (1): quorum: modify vote rules for flush operation block/quorum.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) -- 1.9.3
Re: [Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash
On 02/22/2016 05:02 PM, Dr. David Alan Gilbert wrote: * Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote: On 02/20/2016 10:28 PM, Max Reitz wrote: On 19.02.2016 12:24, Alberto Garcia wrote: On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang wrote: If quorum has two children(A, B). A do flush sucessfully, but B flush failed. We MUST choice A as winner rather than just pick anyone of them. Otherwise the filesystem of guest will become read-only with following errors: end_request: I/O error, dev vda, sector 11159960 Aborting journal on device vda3-8 EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal EXT4-fs (vda3): Remounting filesystem read-only Hi Xie, Let's see if I'm getting this right: - When Quorum flushes to disk, there's a vote among the return values of the flush operations of its members, and the one that wins is the one that Quorum returns. - If there's a tie then Quorum choses the first result from the list of winners. - With your patch you want to give priority to the vote with result == 0 if there's any, so Quorum would return 0 (and succeed). This seems to me like an ad-hoc fix for a particular use case. What if you have 3 members and two of them fail with the same error code? Would you still return 0 or the error code from the other two? For example: children.0 returns 0 children.1 returns -EIO children.2 returns -EPIPE In this case, quorum returns -EPIPE now(without this patch). For example: children.0 returns -EPIPE children.1 returns -EIO children.2 returns 0 In this case, quorum returns 0 now. My question is: what's the rationale for returning 0 in case a) but not in case b)? a) children.0 returns -EPIPE children.1 returns -EIO children.2 returns 0 b) children.0 returns -EIO children.1 returns -EIO children.2 returns 0 In both cases you have one successful flush and two errors. You want to return always 0 in case a) and always -EIO in case b). But the only difference is that in case b) the errors happen to be the same, so why does that matter? That said, I'm not very convinced of the current logics of the Quorum flush code either, so it's not even a problem with your patch... it seems to me that the code should follow the same logics as in the read/write case: if the number of correct flushes >= threshold then return 0, else select the most common error code. I'm not convinced of the logic either, which is why I waited for you to respond to this patch. :-) Intuitively, I'd expect Quorum to return an error if flushing failed for any of the children, because, well, flushing failed. I somehow feel like flushing is different from a read or write operation and therefore ignoring the threshold would be fine here. However, maybe my intuition is just off. Anyway, regardless of that, if we do take the threshold into account, we should not use the exact error value for voting but just whether an error occurred or not. If all but one children fail to flush (all for different reasons), I find it totally wrong to return success. We should then just return -EIO or something. Hi Berto & Max Thanks for your comments, i'd like to have a summary here. For flush cases: 1) if flush successfully(result >= 0), result = 0; else if result < 0, result = -EIO. then invoke quorum_count_vote 2) if correct flushes >= threshold, mark correct flushes as winner directly. I find it difficult to understand how this corresponds to the behaviour needed in COLO, where we have the NBD and the real storage on the primary; in that case the failure of the real storage should give an error to the guest, but the failure of the NBD shouldn't produce a guest visible failure. Hi Dave "in that case the failure of the real storage should give an error to the guest, but the failure of the NBD shouldn't produce a guest visible failure." This is just what i think :), but there is a restricted condition . 1) If the guest *Must* fetch the return code for flush operation? This is prerequisite. 2) If no. Since Colo and Quorum are independent of each other, so quorum don't know if we are in colo mode. I think the only way to implement your idea is:"pass some parameters such as flush= on/off to each quorum child". BTW, i've just sent out V3, and thought it make scense. Thanks -Xie Dave Will fix in next version. Thanks -Xie Max -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK .
Re: [Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash
On 02/22/2016 06:34 PM, Kevin Wolf wrote: Am 22.02.2016 um 10:02 hat Dr. David Alan Gilbert geschrieben: * Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote: On 02/20/2016 10:28 PM, Max Reitz wrote: On 19.02.2016 12:24, Alberto Garcia wrote: On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang wrote: If quorum has two children(A, B). A do flush sucessfully, but B flush failed. We MUST choice A as winner rather than just pick anyone of them. Otherwise the filesystem of guest will become read-only with following errors: end_request: I/O error, dev vda, sector 11159960 Aborting journal on device vda3-8 EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal EXT4-fs (vda3): Remounting filesystem read-only Hi Xie, Let's see if I'm getting this right: - When Quorum flushes to disk, there's a vote among the return values of the flush operations of its members, and the one that wins is the one that Quorum returns. - If there's a tie then Quorum choses the first result from the list of winners. - With your patch you want to give priority to the vote with result == 0 if there's any, so Quorum would return 0 (and succeed). This seems to me like an ad-hoc fix for a particular use case. What if you have 3 members and two of them fail with the same error code? Would you still return 0 or the error code from the other two? For example: children.0 returns 0 children.1 returns -EIO children.2 returns -EPIPE In this case, quorum returns -EPIPE now(without this patch). For example: children.0 returns -EPIPE children.1 returns -EIO children.2 returns 0 In this case, quorum returns 0 now. My question is: what's the rationale for returning 0 in case a) but not in case b)? a) children.0 returns -EPIPE children.1 returns -EIO children.2 returns 0 b) children.0 returns -EIO children.1 returns -EIO children.2 returns 0 In both cases you have one successful flush and two errors. You want to return always 0 in case a) and always -EIO in case b). But the only difference is that in case b) the errors happen to be the same, so why does that matter? That said, I'm not very convinced of the current logics of the Quorum flush code either, so it's not even a problem with your patch... it seems to me that the code should follow the same logics as in the read/write case: if the number of correct flushes >= threshold then return 0, else select the most common error code. I'm not convinced of the logic either, which is why I waited for you to respond to this patch. :-) Intuitively, I'd expect Quorum to return an error if flushing failed for any of the children, because, well, flushing failed. I somehow feel like flushing is different from a read or write operation and therefore ignoring the threshold would be fine here. However, maybe my intuition is just off. Anyway, regardless of that, if we do take the threshold into account, we should not use the exact error value for voting but just whether an error occurred or not. If all but one children fail to flush (all for different reasons), I find it totally wrong to return success. We should then just return -EIO or something. Hi Berto & Max Thanks for your comments, i'd like to have a summary here. For flush cases: 1) if flush successfully(result >= 0), result = 0; else if result < 0, result = -EIO. then invoke quorum_count_vote 2) if correct flushes >= threshold, mark correct flushes as winner directly. Please try to return one of the real error codes instead of -EIO. Essentially we should use the same logic as for writes (like Berto suggested above). Yes, i correct myself. It seems everyone reaches an agreement in this thread, and will use the same logic as writes in next version. Thanks -Xie I find it difficult to understand how this corresponds to the behaviour needed in COLO, where we have the NBD and the real storage on the primary; in that case the failure of the real storage should give an error to the guest, but the failure of the NBD shouldn't produce a guest visible failure. That's probably because you're abusing quorum as an active mirroring filter, which it really isn't. This is okay for now so you have something to work with, but I expect that eventually you'll need a different driver. (Well, or maybe I'm mistaken and you actually do need to read back data from NBD to compare it to the real storage - do you?) Anyway, I'm curious how you would handle a failed write/flush request to the NBD target. Simply ignoring it doesn't feel right; in case of a failover, wouldn't you switch to a potentially corrupted image then? Kevin .
Re: [Qemu-devel] [PATCH v3 1/1] quorum: modify vote rules for flush operation
On 02/22/2016 10:33 PM, Alberto Garcia wrote: On Mon 22 Feb 2016 10:50:37 AM CET, Changlong Xie wrote: -winner = quorum_get_vote_winner(&error_votes); -result = winner->value.l; - +if (success_count >= s->threshold) +result = 0; +else { +winner = quorum_get_vote_winner(&error_votes); +result = winner->value.l; +} Please use braces in both branches of the if. scripts/checkpatch.pl should report that. Surely. Other than that I think the patch is correct, but I still wonder if we should emit QUORUM_REPORT_BAD (or a new event) for the operations that To me QUORUM_REPORT_BAD is not a good choice. fail. Or why wouldn't the user want to be notified of a flush failure? I'll introduce a new event in next series. Thanks -Xie Berto .
[Qemu-devel] [PATCH v4 0/2] modify vote rules for flush operation
ChangLog: v4: 1. Introduce QUORUM_FLUSH_ERROR event to notify flush failure. v3: 1. *Note* that, the codes logic is different from what we talked in v2. I just keep flush interface the same logic as quorum read/write, and think it's reasoned. [Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.htm Changlong Xie (2): qmp event: Add QUORUM_FLUSH_ERROR quorum: modify vote rules for flush operation block/quorum.c | 24 +++- docs/qmp-events.txt | 18 ++ qapi/event.json | 16 3 files changed, 53 insertions(+), 5 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH v4 1/2] qmp event: Add QUORUM_FLUSH_ERROR
Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 5 + docs/qmp-events.txt | 18 ++ qapi/event.json | 16 3 files changed, 39 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index f78d4cb..d3c3958 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -235,6 +235,11 @@ static void quorum_report_failure(QuorumAIOCB *acb) acb->nb_sectors, &error_abort); } +static void quorum_flush_error(char *node_name, const char *msg) +{ +qapi_event_send_quorum_flush_error(node_name, msg, &error_abort); +} + static int quorum_vote_error(QuorumAIOCB *acb); static bool quorum_has_too_much_io_failed(QuorumAIOCB *acb) diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt index b6e8937..d777873 100644 --- a/docs/qmp-events.txt +++ b/docs/qmp-events.txt @@ -340,6 +340,24 @@ Example: Note: this event is rate-limited. +QUORUM_FLUSH_ERROR +- + +Emitted to report flush error message of the Quorum block driver + +Data: + +- "node-name": The graph node name of the block driver state. +- "error": This field contains a human-readable error message. There are + no semantics other than that the block layer reported an error + and clients should not try to interpret the error string. + +Example: + +{ "event": "QUORUM_FLUSH_ERROR", + "data": { "node-name": "1.raw", "error": "xx" }, + "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } + RESET - diff --git a/qapi/event.json b/qapi/event.json index cfcc887..5b16706 100644 --- a/qapi/event.json +++ b/qapi/event.json @@ -358,6 +358,22 @@ 'sector-num': 'int', 'sectors-count': 'int' } } ## +# @QUORUM_FLUSH_ERROR +# +# Emitted to report flush error message of the Quorum block driver +# +# @node-name: the graph node name of the block driver state +# +# @error: This field contains a human-readable error message. There are no semantics +# other than that the block layer reported an error and clients should not +# try to interpret the error string. +# +# Since: 2.5 +## +{ 'event': 'QUORUM_FLUSH_ERROR', + 'data': { 'node-name': 'str', 'error': 'str'} } + +## # @VSERPORT_CHANGE # # Emitted when the guest opens or closes a virtio-serial port. -- 1.9.3
[Qemu-devel] [PATCH v4 2/2] quorum: modify vote rules for flush operation
Keep flush interface the same logic as quorum read/write, Otherwise in following scenario, we'll encounter unexpected errors. Quorum has two children(A, B). A do flush sucessfully, but B flush failed. This cause the filesystem of guest become read-only with following errors: end_request: I/O error, dev vda, sector 11159960 Aborting journal on device vda3-8 EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal EXT4-fs (vda3): Remounting filesystem read-only Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d3c3958..6048208 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -768,19 +768,28 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) QuorumVoteValue result_value; int i; int result = 0; +int success_count = 0; QLIST_INIT(&error_votes.vote_list); error_votes.compare = quorum_64bits_compare; for (i = 0; i < s->num_children; i++) { result = bdrv_co_flush(s->children[i]->bs); -result_value.l = result; -quorum_count_vote(&error_votes, &result_value, i); +if (result) { +quorum_flush_error(s->children[i]->bs->node_name, "Flush failed"); +result_value.l = result; +quorum_count_vote(&error_votes, &result_value, i); +} else { +success_count++; +} } -winner = quorum_get_vote_winner(&error_votes); -result = winner->value.l; - +if (success_count >= s->threshold) { +result = 0; +} else { +winner = quorum_get_vote_winner(&error_votes); +result = winner->value.l; +} quorum_free_vote_list(&error_votes); return result; -- 1.9.3
Re: [Qemu-devel] [PATCH v4 1/2] qmp event: Add QUORUM_FLUSH_ERROR
cc: Markus Armbruster On 02/23/2016 05:01 PM, Changlong Xie wrote: Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 5 + docs/qmp-events.txt | 18 ++ qapi/event.json | 16 3 files changed, 39 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index f78d4cb..d3c3958 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -235,6 +235,11 @@ static void quorum_report_failure(QuorumAIOCB *acb) acb->nb_sectors, &error_abort); } +static void quorum_flush_error(char *node_name, const char *msg) +{ +qapi_event_send_quorum_flush_error(node_name, msg, &error_abort); +} + static int quorum_vote_error(QuorumAIOCB *acb); static bool quorum_has_too_much_io_failed(QuorumAIOCB *acb) diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt index b6e8937..d777873 100644 --- a/docs/qmp-events.txt +++ b/docs/qmp-events.txt @@ -340,6 +340,24 @@ Example: Note: this event is rate-limited. +QUORUM_FLUSH_ERROR +- + +Emitted to report flush error message of the Quorum block driver + +Data: + +- "node-name": The graph node name of the block driver state. +- "error": This field contains a human-readable error message. There are + no semantics other than that the block layer reported an error + and clients should not try to interpret the error string. + +Example: + +{ "event": "QUORUM_FLUSH_ERROR", + "data": { "node-name": "1.raw", "error": "xx" }, + "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } + RESET - diff --git a/qapi/event.json b/qapi/event.json index cfcc887..5b16706 100644 --- a/qapi/event.json +++ b/qapi/event.json @@ -358,6 +358,22 @@ 'sector-num': 'int', 'sectors-count': 'int' } } ## +# @QUORUM_FLUSH_ERROR +# +# Emitted to report flush error message of the Quorum block driver +# +# @node-name: the graph node name of the block driver state +# +# @error: This field contains a human-readable error message. There are no semantics +# other than that the block layer reported an error and clients should not +# try to interpret the error string. +# +# Since: 2.5 +## +{ 'event': 'QUORUM_FLUSH_ERROR', + 'data': { 'node-name': 'str', 'error': 'str'} } + +## # @VSERPORT_CHANGE # # Emitted when the guest opens or closes a virtio-serial port.
Re: [Qemu-devel] [PATCH v15 0/9] Block replication for continuous checkpoints
Ping again ... Thanks -Xie On 02/05/2016 12:17 PM, Changlong Xie wrote: Block replication is a very important feature which is used for continuous checkpoints(for example: COLO). You can get the detailed information about block replication from here: http://wiki.qemu.org/Features/BlockReplication Usage: Please refer to docs/block-replication.txt This patch series is based on the following patch series: 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html Patch status: 1. Acked patches: none 2. Reviewed patches: patch 4 3. Updated patches: patch 3, 4, 5, 7, 8 Note: patch 7 is a new patch that introduces genernal start/stop/checkpoint/ get_error APIs. You can get the patch here: https://github.com/Pating/qemu/tree/changlox/block-replication-v15 You can get the patch with framework here: https://github.com/Pating/qemu/tree/changlox/colo_framework_v14 TODO: 1. Continuous block replication. It will be started after basic functions are accepted. Changs Log: V15: 1. Rebase to the newest codes 2. Fix typos and coding style addresed Eric's comments 3. Address Stefan's comments 1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver 2) Update the message and description for [PATCH 4/9] 3) Make replication_(start/stop/do_checkpoint)_all as global interfaces 4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks 5) Use BdrvChild instead of holding on to BlockDriverState * pointers 4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771 5. Introduce replication_get_error_all to check replication status 6. Remove useless discard interface V14: 1. Implement auto complete active commit 2. Implement active commit block job for replication.c 3. Address the comments from Stefan, add replication-specific API and data structure, also remove old block layer APIs V13: 1. Rebase to the newest codes 2. Remove redundant marcos and semicolon in replication.c 3. Fix typos in block-replication.txt V12: 1. Rebase to the newest codes 2. Use backing reference to replcace 'allow-write-backing-file' V11: 1. Reopen the backing file when starting blcok replication if it is not opened in R/W mode 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET when opening backing file 3. Block the top BDS so there is only one block job for the top BDS and its backing chain. V10: 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing reference. 2. Address the comments from Eric Blake V9: 1. Update the error messages 2. Rebase to the newest qemu 3. Split child add/delete support. These patches are sent in another patchset. V8: 1. Address Alberto Garcia's comments V7: 1. Implement adding/removing quorum child. Remove the option non-connect. 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion V6: 1. Rebase to the newest qemu. V5: 1. Address the comments from Gong Lei 2. Speed the failover up. The secondary vm can take over very quickly even if there are too many I/O requests. V4: 1. Introduce a new driver replication to avoid touch nbd and qcow2. V3: 1: use error_setg() instead of error_set() 2. Add a new block job API 3. Active disk, hidden disk and nbd target uses the same AioContext 4. Add a testcase to test new hbitmap API V2: 1. Redesign the secondary qemu(use image-fleecing) 2. Use Error objects to return error message 3. Address the comments from Max Reitz and Eric Blake Changlong Xie (1): Introduce new APIs to do replication operation Wen Congyang (8): unblock backup operations in backing file Store parent BDS in BdrvChild Backup: clear all bitmap when doing block checkpoint Link backup into block core docs: block replication's description auto complete active commit Implement new driver for block replication support replication driver in blockdev-add Makefile.objs | 1 + block.c| 19 ++ block/Makefile.objs| 3 +- block/backup.c | 15 ++ block/mirror.c | 13 +- block/replication.c| 594 + blockdev.c | 2 +- docs/block-replication.txt | 238 ++ include/block/block_int.h | 6 +- qapi/block-core.json | 33 ++- qemu-img.c | 2 +- replication.c | 94 +++ replication.h | 53 13 files changed, 1063 insertions(+), 10 deletions(-) create mode 100644 block/replication.c create mode 100644 docs/block-replication.txt create mode 100644 replication.c create mode 100644 replication.h
Re: [Qemu-devel] [PATCH v4 1/2] qmp event: Add QUORUM_FLUSH_ERROR
On 02/23/2016 10:05 PM, Alberto Garcia wrote: On Tue 23 Feb 2016 02:45:49 PM CET, Eric Blake wrote: Commit message should say why we need a third event, rather than reusing either of the other two (my guess: because you don't have a location, and don't want to modify the existing two to report a location - but why not just use 'sector-num':0, 'sectors-count': to report the entire file as the location?) I would also be fine with that solution. I would also be fine if we added an optional enum member to the existing event that said which operation failed ('read', 'write', 'flush') - adding optional output members is safe, while converting existing mandatory output members to optional may confuse existing clients. Hi Berto & Eric Thanks for all your comments. Surely, this is the best option to me too :-), will fix it in next version. Thanks -Xie That might actually be the best option :-) Berto .
[Qemu-devel] [PATCH v5 0/3] modify vote rules for flush operation
ChangLog: v5: 1. Fix invalid node name in docs/qmp-events.txt 2. Address comments from Eric, drop QUORUM_FLUSH_ERROR. Instead of reworking QUORUM_REPORT_BAD to make it compatible with quorum read/write/flush operations v4: 1. Introduce QUORUM_FLUSH_ERROR event to notify flush failure. v3: 1. *Note* that, the codes logic is different from what we talked in v2. I just keep flush interface the same logic as quorum read/write, and think it's reasoned. [Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.htm Changlong Xie (3): docs: fix invalid node name in qmp event qmp event: Refactor QUORUM_REPORT_BAD quorum: modify vote rules for flush operation block/quorum.c | 48 ++-- docs/qmp-events.txt | 10 +- qapi/block.json | 16 qapi/event.json | 4 +++- 4 files changed, 66 insertions(+), 12 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH v5 3/3] quorum: modify vote rules for flush operation
Keep flush interface the same logic as quorum read/write, Otherwise in following scenario, we'll encounter unexpected errors. Quorum has two children(A, B). A do flush sucessfully, but B flush failed. This cause the filesystem of guest become read-only with following errors: end_request: I/O error, dev vda, sector 11159960 Aborting journal on device vda3-8 EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal EXT4-fs (vda3): Remounting filesystem read-only Cc: Dr. David Alan Gilbert Cc: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 03d68c1..0c7e4ad 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -778,19 +778,29 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) QuorumVoteValue result_value; int i; int result = 0; +int success_count = 0; QLIST_INIT(&error_votes.vote_list); error_votes.compare = quorum_64bits_compare; for (i = 0; i < s->num_children; i++) { result = bdrv_co_flush(s->children[i]->bs); -result_value.l = result; -quorum_count_vote(&error_votes, &result_value, i); +if (result) { +quorum_report_bad(QUORUM_OP_TYPE_FLUSH, NULL, + s->children[i]->bs->node_name, result); +result_value.l = result; +quorum_count_vote(&error_votes, &result_value, i); +} else { +success_count++; +} } -winner = quorum_get_vote_winner(&error_votes); -result = winner->value.l; - +if (success_count >= s->threshold) { +result = 0; +} else { +winner = quorum_get_vote_winner(&error_votes); +result = winner->value.l; +} quorum_free_vote_list(&error_votes); return result; -- 1.9.3
[Qemu-devel] [PATCH v5 2/3] qmp event: Refactor QUORUM_REPORT_BAD
Introduce QuorumOpType, and make QUORUM_REPORT_BAD compatible with it. Cc: Dr. David Alan Gilbert Cc: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 28 +++- docs/qmp-events.txt | 8 qapi/block.json | 16 qapi/event.json | 4 +++- 4 files changed, 50 insertions(+), 6 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 11cc60b..03d68c1 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -215,14 +215,29 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, return acb; } -static void quorum_report_bad(QuorumAIOCB *acb, char *node_name, int ret) +static void quorum_report_bad(QuorumOpType type, QuorumAIOCB *acb, + char *node_name, int ret) { const char *msg = NULL; if (ret < 0) { msg = strerror(-ret); } -qapi_event_send_quorum_report_bad(!!msg, msg, node_name, - acb->sector_num, acb->nb_sectors, &error_abort); + +switch (type) { +case QUORUM_OP_TYPE_READ: +case QUORUM_OP_TYPE_WRITE: +qapi_event_send_quorum_report_bad(false, 0, !!msg, msg, node_name, + acb->sector_num, acb->nb_sectors, + &error_abort); +break; +case QUORUM_OP_TYPE_FLUSH: +qapi_event_send_quorum_report_bad(true, type, !!msg, msg, node_name, + 0, 0, &error_abort); +break; +default: +/* should never happen */ +abort(); +} } static void quorum_report_failure(QuorumAIOCB *acb) @@ -282,6 +297,7 @@ static void quorum_aio_cb(void *opaque, int ret) QuorumChildRequest *sacb = opaque; QuorumAIOCB *acb = sacb->parent; BDRVQuorumState *s = acb->common.bs->opaque; +QuorumOpType type; bool rewrite = false; if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) { @@ -300,12 +316,13 @@ static void quorum_aio_cb(void *opaque, int ret) return; } +type = acb->is_read ? QUORUM_OP_TYPE_READ : QUORUM_OP_TYPE_WRITE; sacb->ret = ret; acb->count++; if (ret == 0) { acb->success_count++; } else { -quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret); +quorum_report_bad(type, acb, sacb->aiocb->bs->node_name, ret); } assert(acb->count <= s->num_children); assert(acb->success_count <= s->num_children); @@ -338,7 +355,8 @@ static void quorum_report_bad_versions(BDRVQuorumState *s, continue; } QLIST_FOREACH(item, &version->items, next) { -quorum_report_bad(acb, s->children[item->index]->bs->node_name, 0); +quorum_report_bad(QUORUM_OP_TYPE_READ, acb, + s->children[item->index]->bs->node_name, 0); } } } diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt index f96e120..0a082db 100644 --- a/docs/qmp-events.txt +++ b/docs/qmp-events.txt @@ -307,6 +307,7 @@ Emitted to report a corruption of a Quorum file. Data: +- "type": Quorum operation type (json-string, optional) - "error": Error message (json-string, optional) Only present on failure. This field contains a human-readable error message. There are no semantics other than that the @@ -318,10 +319,17 @@ Data: Example: +Read/Write operation: { "event": "QUORUM_REPORT_BAD", "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 }, "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } +Flush operation: +{ "event": "QUORUM_REPORT_BAD", + "data": { "node-name": "node0", "sectors-count": 0, "sector-num": 0, + "type": "flush", "error": "Broken pipe" }, + "timestamp": { "seconds": 1456406829, "microseconds": 291763 } } + Note: this event is rate-limited. RESET diff --git a/qapi/block.json b/qapi/block.json index 58e6b30..937337d 100644 --- a/qapi/block.json +++ b/qapi/block.json @@ -196,3 +196,19 @@ ## { 'event': 'DEVICE_TRAY_MOVED', 'data': { 'device': 'str', 'tray-open': 'bool' } } + +## +# @QuorumOpType +# +# An enumeration of the quorum operation types +# +# @read: read operation +# +# @write: write operation +# +# @flush: flush operation +# +# Since: 2.6 +## +{ 'enum': 'QuorumOpType', + 'data': [ 'read', 'write', 'flush' ] } di
[Qemu-devel] [PATCH v5 1/3] docs: fix invalid node name in qmp event
Cc: Dr. David Alan Gilbert Cc: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- docs/qmp-events.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt index 52eb7e2..f96e120 100644 --- a/docs/qmp-events.txt +++ b/docs/qmp-events.txt @@ -319,7 +319,7 @@ Data: Example: { "event": "QUORUM_REPORT_BAD", - "data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 }, + "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 }, "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } Note: this event is rate-limited. -- 1.9.3
Re: [Qemu-devel] [PATCH v5 2/3] qmp event: Refactor QUORUM_REPORT_BAD
On 02/25/2016 09:12 AM, Eric Blake wrote: On 02/24/2016 05:50 PM, Wen Congyang wrote: +- "type": Quorum operation type (json-string, optional) I don't think 'type' needs to be optional, after all. Just always output it. If we output read/write type, old libvirt will ignore the read/write error events? The QMP protocol already documents that ALL clients MUST ignore unexpected output fields. Any client that is unprepared for new fields appearing in the dictionary is buggy. Libvirt will be just fine if you output a new "type":"read". Ok, i'll make "type" mandatory. Thanks -Xie
Re: [Qemu-devel] [PATCH v5 2/3] qmp event: Refactor QUORUM_REPORT_BAD
On 02/24/2016 08:35 PM, Alberto Garcia wrote: On Wed 24 Feb 2016 11:11:54 AM CET, Changlong Xie wrote: -static void quorum_report_bad(QuorumAIOCB *acb, char *node_name, int ret) +static void quorum_report_bad(QuorumOpType type, QuorumAIOCB *acb, + char *node_name, int ret) { const char *msg = NULL; if (ret < 0) { msg = strerror(-ret); } -qapi_event_send_quorum_report_bad(!!msg, msg, node_name, - acb->sector_num, acb->nb_sectors, &error_abort); + +switch (type) { +case QUORUM_OP_TYPE_READ: +case QUORUM_OP_TYPE_WRITE: +qapi_event_send_quorum_report_bad(false, 0, !!msg, msg, node_name, + acb->sector_num, acb->nb_sectors, + &error_abort); +break; +case QUORUM_OP_TYPE_FLUSH: +qapi_event_send_quorum_report_bad(true, type, !!msg, msg, node_name, + 0, 0, &error_abort); +break; A few comments: - Why don't you set the 'type' field in read and write operations? You defined all three values but you are only using 'flush' here. - For the case of flush you could set sectors-count to the total size of the BlockDriverState as Eric suggested (bdrv_nb_sectors(bs)). Setting both to 0 could confuse clients that are not ready to deal with flush failures. - Since the QuorumAIOCB parameter is not used in the flush case, you could replace it from the function prototype and use sector_num and nb_sectors instead. That way you can also omit the switch. Surely, all your suggestions are helpful. Also i will add "Reviewed-by" in patch 1/3, 3/3 in next version. Thanks -Xie Berto .
[Qemu-devel] [PATCH v6 1/3] docs: fix invalid node name in qmp event
Cc: Dr. David Alan Gilbert Cc: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie Reviewed-by: Alberto Garcia --- docs/qmp-events.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt index b6e8937..afb5e20 100644 --- a/docs/qmp-events.txt +++ b/docs/qmp-events.txt @@ -335,7 +335,7 @@ Data: Example: { "event": "QUORUM_REPORT_BAD", - "data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 }, + "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 }, "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } Note: this event is rate-limited. -- 1.9.3
[Qemu-devel] [PATCH v6 0/3] modify vote rules for flush operation
ChangLog: v6: 1. Make "type" mandatory for [PATCH 2/3] v5: 1. Fix invalid node name in docs/qmp-events.txt 2. Address comments from Eric, drop QUORUM_FLUSH_ERROR. Instead of reworking QUORUM_REPORT_BAD to make it compatible with quorum read/write/flush operations v4: 1. Introduce QUORUM_FLUSH_ERROR event to notify flush failure. v3: 1. *Note* that, the codes logic is different from what we talked in v2. I just keep flush interface the same logic as quorum read/write, and think it's reasoned. [Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.htm Changlong Xie (3): docs: fix invalid node name in qmp event qmp event: Refactor QUORUM_REPORT_BAD quorum: modify vote rules for flush operation block/quorum.c | 38 -- docs/qmp-events.txt | 11 ++- qapi/block.json | 16 qapi/event.json | 4 +++- 4 files changed, 57 insertions(+), 12 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH v6 3/3] quorum: modify vote rules for flush operation
Keep flush interface the same logic as quorum read/write, Otherwise in following scenario, we'll encounter unexpected errors. Quorum has two children(A, B). A do flush sucessfully, but B flush failed. This cause the filesystem of guest become read-only with following errors: end_request: I/O error, dev vda, sector 11159960 Aborting journal on device vda3-8 EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal EXT4-fs (vda3): Remounting filesystem read-only Cc: Dr. David Alan Gilbert Cc: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie Reviewed-by: Alberto Garcia --- block/quorum.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 974fff9..ed906d0 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -770,19 +770,30 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) QuorumVoteValue result_value; int i; int result = 0; +int success_count = 0; QLIST_INIT(&error_votes.vote_list); error_votes.compare = quorum_64bits_compare; for (i = 0; i < s->num_children; i++) { result = bdrv_co_flush(s->children[i]->bs); -result_value.l = result; -quorum_count_vote(&error_votes, &result_value, i); +if (result) { +quorum_report_bad(QUORUM_OP_TYPE_FLUSH, 0, + bdrv_nb_sectors(s->children[i]->bs), + s->children[i]->bs->node_name, result); +result_value.l = result; +quorum_count_vote(&error_votes, &result_value, i); +} else { +success_count++; +} } -winner = quorum_get_vote_winner(&error_votes); -result = winner->value.l; - +if (success_count >= s->threshold) { +result = 0; +} else { +winner = quorum_get_vote_winner(&error_votes); +result = winner->value.l; +} quorum_free_vote_list(&error_votes); return result; -- 1.9.3
[Qemu-devel] [PATCH v6 2/3] qmp event: Refactor QUORUM_REPORT_BAD
Introduce QuorumOpType, and make QUORUM_REPORT_BAD compatible with it. Cc: Dr. David Alan Gilbert Cc: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 17 - docs/qmp-events.txt | 11 ++- qapi/block.json | 16 qapi/event.json | 4 +++- 4 files changed, 41 insertions(+), 7 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index f78d4cb..974fff9 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -218,14 +218,16 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, return acb; } -static void quorum_report_bad(QuorumAIOCB *acb, char *node_name, int ret) +static void quorum_report_bad(QuorumOpType type, uint64_t sector_num, + int nb_sectors, char *node_name, int ret) { const char *msg = NULL; if (ret < 0) { msg = strerror(-ret); } -qapi_event_send_quorum_report_bad(!!msg, msg, node_name, - acb->sector_num, acb->nb_sectors, &error_abort); + +qapi_event_send_quorum_report_bad(type, !!msg, msg, node_name, + sector_num, nb_sectors, &error_abort); } static void quorum_report_failure(QuorumAIOCB *acb) @@ -285,6 +287,7 @@ static void quorum_aio_cb(void *opaque, int ret) QuorumChildRequest *sacb = opaque; QuorumAIOCB *acb = sacb->parent; BDRVQuorumState *s = acb->common.bs->opaque; +QuorumOpType type; bool rewrite = false; if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) { @@ -303,12 +306,14 @@ static void quorum_aio_cb(void *opaque, int ret) return; } +type = acb->is_read ? QUORUM_OP_TYPE_READ : QUORUM_OP_TYPE_WRITE; sacb->ret = ret; acb->count++; if (ret == 0) { acb->success_count++; } else { -quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret); +quorum_report_bad(type, acb->sector_num, acb->nb_sectors, + sacb->aiocb->bs->node_name, ret); } assert(acb->count <= s->num_children); assert(acb->success_count <= s->num_children); @@ -341,7 +346,9 @@ static void quorum_report_bad_versions(BDRVQuorumState *s, continue; } QLIST_FOREACH(item, &version->items, next) { -quorum_report_bad(acb, s->children[item->index]->bs->node_name, 0); +quorum_report_bad(QUORUM_OP_TYPE_READ, acb->sector_num, + acb->nb_sectors, + s->children[item->index]->bs->node_name, 0); } } } diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt index afb5e20..96baacb 100644 --- a/docs/qmp-events.txt +++ b/docs/qmp-events.txt @@ -323,6 +323,7 @@ Emitted to report a corruption of a Quorum file. Data: +- "type": Quorum operation type - "error": Error message (json-string, optional) Only present on failure. This field contains a human-readable error message. There are no semantics other than that the @@ -334,10 +335,18 @@ Data: Example: +Read/Write operation: { "event": "QUORUM_REPORT_BAD", - "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 }, + "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5, + "type": "read" }, "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } +Flush operation: +{ "event": "QUORUM_REPORT_BAD", + "data": { "node-name": "node0", "sector-num": 0, "sectors-count": 2097120, + "type": "flush", "error": "Broken pipe" }, + "timestamp": { "seconds": 1456406829, "microseconds": 291763 } } + Note: this event is rate-limited. RESET diff --git a/qapi/block.json b/qapi/block.json index 58e6b30..937337d 100644 --- a/qapi/block.json +++ b/qapi/block.json @@ -196,3 +196,19 @@ ## { 'event': 'DEVICE_TRAY_MOVED', 'data': { 'device': 'str', 'tray-open': 'bool' } } + +## +# @QuorumOpType +# +# An enumeration of the quorum operation types +# +# @read: read operation +# +# @write: write operation +# +# @flush: flush operation +# +# Since: 2.6 +## +{ 'enum': 'QuorumOpType', + 'data': [ 'read', 'write', 'flush' ] } diff --git a/qapi/event.json b/qapi/event.json index cfcc887..a2b8db6 100644 --- a/qapi/event.json +++ b/qapi/event.json @@ -340,6 +340,8 @@ # # Emitted to report a corruption of a Quorum file # +# @type: quorum operation type (Since 2.6) +# # @error: #optional, error message. Only present on failure. This field # contains a human-readable error message. There are no semantics other # than that the block layer reported an error and clients should not @@ -354,7 +356,7 @@ # Since: 2.0 ## { 'event': 'QUORUM_REPORT_BAD', - 'data': { '*error': 'str', 'node-name': 'str', + 'data': { 'type': 'QuorumOpType', '*error': 'str', 'node-name': 'str', 'sector-num': 'int', 'sectors-count': 'int' } } ## -- 1.9.3
Re: [Qemu-devel] [PATCH v6 2/3] qmp event: Refactor QUORUM_REPORT_BAD
On 02/25/2016 06:57 PM, Alberto Garcia wrote: On Thu 25 Feb 2016 06:33:08 AM CET, Changlong Xie wrote: +Read/Write operation: { "event": "QUORUM_REPORT_BAD", - "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 }, + "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5, + "type": "read" }, "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } Since you introduced the 'type' field and this is now an example of a read error, you can change the description to say simply "Read operation:". In my opinion there's no need to add yet another example for a write operation, I think it's clear enough. Ok +Flush operation: +{ "event": "QUORUM_REPORT_BAD", + "data": { "node-name": "node0", "sector-num": 0, "sectors-count": 2097120, + "type": "flush", "error": "Broken pipe" }, + "timestamp": { "seconds": 1456406829, "microseconds": 291763 } } Here (and in the previous case) please indent "type" so it goes under Surely. "node-name": { "event": "QUORUM_REPORT_BAD", "data": { "node-name": "node0", "sector-num": 0, "sectors-count": 2097120, "type": "flush", "error": "Broken pipe" }, "timestamp": { "seconds": 1456406829, "microseconds": 291763 } } Otherwise I think the patch looks perfect now. Thanks! Thanks for your review. Will send another series. Thanks -Xie Berto .
[Qemu-devel] [PATCH v7 1/3] docs: fix invalid node name in qmp event
Cc: Dr. David Alan Gilbert Cc: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie Reviewed-by: Alberto Garcia --- docs/qmp-events.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt index 4e3eb9e..d6b9aea 100644 --- a/docs/qmp-events.txt +++ b/docs/qmp-events.txt @@ -337,7 +337,7 @@ Data: Example: { "event": "QUORUM_REPORT_BAD", - "data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 }, + "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 }, "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } Note: this event is rate-limited. -- 1.9.3
[Qemu-devel] [PATCH v7 3/3] quorum: modify vote rules for flush operation
Keep flush interface the same logic as quorum read/write, Otherwise in following scenario, we'll encounter unexpected errors. Quorum has two children(A, B). A do flush sucessfully, but B flush failed. This cause the filesystem of guest become read-only with following errors: end_request: I/O error, dev vda, sector 11159960 Aborting journal on device vda3-8 EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal EXT4-fs (vda3): Remounting filesystem read-only Cc: Dr. David Alan Gilbert Cc: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie Reviewed-by: Alberto Garcia --- block/quorum.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 8f83574..b16171b 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -767,19 +767,30 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) QuorumVoteValue result_value; int i; int result = 0; +int success_count = 0; QLIST_INIT(&error_votes.vote_list); error_votes.compare = quorum_64bits_compare; for (i = 0; i < s->num_children; i++) { result = bdrv_co_flush(s->children[i]->bs); -result_value.l = result; -quorum_count_vote(&error_votes, &result_value, i); +if (result) { +quorum_report_bad(QUORUM_OP_TYPE_FLUSH, 0, + bdrv_nb_sectors(s->children[i]->bs), + s->children[i]->bs->node_name, result); +result_value.l = result; +quorum_count_vote(&error_votes, &result_value, i); +} else { +success_count++; +} } -winner = quorum_get_vote_winner(&error_votes); -result = winner->value.l; - +if (success_count >= s->threshold) { +result = 0; +} else { +winner = quorum_get_vote_winner(&error_votes); +result = winner->value.l; +} quorum_free_vote_list(&error_votes); return result; -- 1.9.3
[Qemu-devel] [PATCH v7 0/3] modify vote rules for flush operation
ChangLog: v7: 1. Fix coding style in doc/qmp-event.txt v6: 1. Make "type" mandatory for [PATCH 2/3] v5: 1. Fix invalid node name in docs/qmp-events.txt 2. Address comments from Eric, drop QUORUM_FLUSH_ERROR. Instead of reworking QUORUM_REPORT_BAD to make it compatible with quorum read/write/flush operations v4: 1. Introduce QUORUM_FLUSH_ERROR event to notify flush failure. v3: 1. *Note* that, the codes logic is different from what we talked in v2. I just keep flush interface the same logic as quorum read/write, and think it's reasoned. [Ref] http://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg05342.html Changlong Xie (3): docs: fix invalid node name in qmp event qmp event: Refactor QUORUM_REPORT_BAD quorum: modify vote rules for flush operation block/quorum.c | 38 -- docs/qmp-events.txt | 11 ++- qapi/block.json | 16 qapi/event.json | 4 +++- 4 files changed, 57 insertions(+), 12 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH v7 2/3] qmp event: Refactor QUORUM_REPORT_BAD
Introduce QuorumOpType, and make QUORUM_REPORT_BAD compatible with it. Cc: Dr. David Alan Gilbert Cc: Wen Congyang Signed-off-by: Wen Congyang Signed-off-by: Changlong Xie --- block/quorum.c | 17 - docs/qmp-events.txt | 11 ++- qapi/block.json | 16 qapi/event.json | 4 +++- 4 files changed, 41 insertions(+), 7 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 11cc60b..8f83574 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -215,14 +215,16 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, return acb; } -static void quorum_report_bad(QuorumAIOCB *acb, char *node_name, int ret) +static void quorum_report_bad(QuorumOpType type, uint64_t sector_num, + int nb_sectors, char *node_name, int ret) { const char *msg = NULL; if (ret < 0) { msg = strerror(-ret); } -qapi_event_send_quorum_report_bad(!!msg, msg, node_name, - acb->sector_num, acb->nb_sectors, &error_abort); + +qapi_event_send_quorum_report_bad(type, !!msg, msg, node_name, + sector_num, nb_sectors, &error_abort); } static void quorum_report_failure(QuorumAIOCB *acb) @@ -282,6 +284,7 @@ static void quorum_aio_cb(void *opaque, int ret) QuorumChildRequest *sacb = opaque; QuorumAIOCB *acb = sacb->parent; BDRVQuorumState *s = acb->common.bs->opaque; +QuorumOpType type; bool rewrite = false; if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) { @@ -300,12 +303,14 @@ static void quorum_aio_cb(void *opaque, int ret) return; } +type = acb->is_read ? QUORUM_OP_TYPE_READ : QUORUM_OP_TYPE_WRITE; sacb->ret = ret; acb->count++; if (ret == 0) { acb->success_count++; } else { -quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret); +quorum_report_bad(type, acb->sector_num, acb->nb_sectors, + sacb->aiocb->bs->node_name, ret); } assert(acb->count <= s->num_children); assert(acb->success_count <= s->num_children); @@ -338,7 +343,9 @@ static void quorum_report_bad_versions(BDRVQuorumState *s, continue; } QLIST_FOREACH(item, &version->items, next) { -quorum_report_bad(acb, s->children[item->index]->bs->node_name, 0); +quorum_report_bad(QUORUM_OP_TYPE_READ, acb->sector_num, + acb->nb_sectors, + s->children[item->index]->bs->node_name, 0); } } } diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt index d6b9aea..fa7574d 100644 --- a/docs/qmp-events.txt +++ b/docs/qmp-events.txt @@ -325,6 +325,7 @@ Emitted to report a corruption of a Quorum file. Data: +- "type": Quorum operation type - "error": Error message (json-string, optional) Only present on failure. This field contains a human-readable error message. There are no semantics other than that the @@ -336,10 +337,18 @@ Data: Example: +Read operation: { "event": "QUORUM_REPORT_BAD", - "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5 }, + "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5, + "type": "read" }, "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } +Flush operation: +{ "event": "QUORUM_REPORT_BAD", + "data": { "node-name": "node0", "sector-num": 0, "sectors-count": 2097120, + "type": "flush", "error": "Broken pipe" }, + "timestamp": { "seconds": 1456406829, "microseconds": 291763 } } + Note: this event is rate-limited. RESET diff --git a/qapi/block.json b/qapi/block.json index 58e6b30..937337d 100644 --- a/qapi/block.json +++ b/qapi/block.json @@ -196,3 +196,19 @@ ## { 'event': 'DEVICE_TRAY_MOVED', 'data': { 'device': 'str', 'tray-open': 'bool' } } + +## +# @QuorumOpType +# +# An enumeration of the quorum operation types +# +# @read: read operation +# +# @write: write operation +# +# @flush: flush operation +# +# Since: 2.6 +## +{ 'enum': 'QuorumOpType', + 'data': [ 'read', 'write', 'flush' ] } diff --git a/qapi/event.json b/qapi/event.json index 1a45a6c..8642052 100644 --- a/qapi/event.json +++ b/qapi/event.json @@ -325,6 +325,
[Qemu-devel] Disabling IRQ error
hi everyone, I`m getting the nobody cared disabling IRQ error, when i raised external interrupts IRQ3 to the Openpic in QEMU. (Actually, any external interrupts irq i raised can reproduce this error, but internal interrupts work fine) And this IRQ3 is sharing irq with usb card. I have tried to resolve this issue as follows, but nothing changed. 1)tried to boot with "irqpoll" option. 2)tried to stop raising the irq for usb card. 3)tried to boot with "kernel_irqchip" option Related log from dmesg: [ 2079.800787] irq 19: nobody cared (try booting with the "irqpoll" option) [ 2079.800891] Call Trace: [ 2079.801303] [d7ff3f40] [c0007780] show_stack+0x7c/0x1a0 (unreliable) [ 2079.801398] [d7ff3f80] [c007ad48] __report_bad_irq+0x5c/0xe0 [ 2079.801439] [d7ff3fa0] [c007af58] note_interrupt+0x18c/0x240 [ 2079.801466] [d7ff3fd0] [c007be04] handle_fasteoi_irq+0xf8/0x158 [ 2079.801492] [d7ff3ff0] [c000d9d4] call_handle_irq+0x18/0x28 [ 2079.801582] [d7ff5ec0] [c0004df0] do_IRQ+0xf0/0x16c [ 2079.801609] [d7ff5ee0] [c000e9bc] ret_from_except+0x0/0x18 [ 2079.801652] --- Exception: 501 at __do_softirq+0x94/0x18c [ 2079.801664] LR = __do_softirq+0x54/0x18c [ 2079.801698] [d7ff5ff0] [c000d9ac] call_do_softirq+0x14/0x24 [ 2079.801725] [d788fc50] [c0004bc0] do_softirq+0x74/0xa0 [ 2079.801751] [d788fc70] [c004350c] irq_exit+0x3c/0x8c [ 2079.801775] [d788fc80] [c0004e3c] do_IRQ+0x13c/0x16c [ 2079.801800] [d788fca0] [c000e9bc] ret_from_except+0x0/0x18 [ 2079.802566] --- Exception: 501 at bbc_dma_exec+0xa08/0xef4 [bbc_driver] [ 2079.802580] LR = bbc_dma_exec+0x970/0xef4 [bbc_driver] [ 2079.802620] [d788fdf0] [d94d4f28] bbc_ioctl_dma_read_write+0x2bc/0x464 [bbc_driver] [ 2079.802661] [d788fe70] [d94cae84] bbc_ioctl+0x2a0/0x36c [bbc_driver] [ 2079.802754] [d788feb0] [c00d1f04] do_vfs_ioctl+0x6b8/0x760 [ 2079.802782] [d788ff10] [c00d2014] sys_ioctl+0x68/0xa8 [ 2079.802807] [d788ff40] [c000e368] ret_from_syscall+0x0/0x3c [ 2079.802891] --- Exception: c01 at 0xfaffda8 [ 2079.802901] LR = 0xfb8ec20 [ 2079.802936] handlers: [ 2079.803034] [] (usb_hcd_irq+0x0/0xac) [ 2079.803120] [] (bbc_interrupt_handler+0x0/0x65c [bbc_driver]) [ 2079.803181] Disabling IRQ #19 And interrupt information from /proc/interrupts and stat: # cat /proc/interrupts CPU0 18: 6923OpenPIC Levelohci_hcd:usb2, ohci_hcd:usb3, drvbbc_ldc 19: 11OpenPIC Levelehci_hcd:usb1, drvbbc 20: 0OpenPIC Edge internal_error 21: 0OpenPIC LevelNMI 22: 0OpenPIC LevelSP_I2C_Handler 42: 1148OpenPIC Levelserial 43:174OpenPIC Leveli2c-mpc, i2c-mpc 59: 0OpenPIC Levelfsl_espi 72: 51888OpenPIC Levelmmc0 LOC: 143330 Local timer interrupts SPU: 0 Spurious interrupts CNT: 0 Performance monitoring interrupts MCE: 0 Machine check exceptions #cat /proc/stat cpu 4929 0 5784 353692 736 0 460 0 0 0 cpu0 4929 0 5784 353692 736 0 460 0 0 0 intr 394342 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6929 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1574 174 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 53202 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ctxt 357415 btime 1378108122 processes 1648 procs_running 1 procs_blocked 0 softirq 329977 0 154040 0 0 1145 0 17497 0 337 156958 I`ve no idea what the problem is and how to fix it. Do you have experience to this? Thanks in advance for any advice. Simen