On 03/05/2016 01:44 AM, Ian Jackson wrote: > Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"): >> From: Wen Congyang <we...@cn.fujitsu.com> >> >> Usage: disk = >> ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...'] >> For QEMU block replication details: >> http://wiki.qemu.org/Features/BlockReplication > > So now I am slightly confused by the design, I think. > > When you replicate a VM with COLO using xl, its memory state is > transferred over ssh. But its disk replication is done unencrypted > and unauthenticated ?
Yes, it is a problem. I will think how to improve it. > > And the disk replication is, out of band, and needs to be configured > separately ? This is rather awkward, although maybe not a > showstopper. (Maybe we can have a plan to fix it in the future...) colo-host,colo-port should be the global configuration. And colo-export, active-disk,hidden-disk must be configured separately, because each disk should have a different configuration. > > And, how does the disk replication, which doesn't depend on there > being xl running, relate to the vm state replication, which does ? I > think at the very least I'd like to see some information about the > principles of operation - either explained, or referred to, in the > user manual. OK. The disk replication doesn't depend on xl. We only can operate it via qemu monitor command: 1. stop the vm 2. do the checkpoint 3. start the vm 1/3 is suspend/resume the guest. We only need to do 2 when both vm are in the consistent state. > > Is it possible to use COLO with an existing full-service disk > replication service such as DRBD ? DRBD doesn's support the case like COLO. Because both primary guest and secondary guest need to write to the disk. > >> +(a) An example for COLO replication's configuration: disk >> =['...,colo,colo-host >> +=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...'] >> + >> +=item B<colo-host> :Secondary host's ip address. >> + >> +=item B<colo-port> :Secondary host's port, we will run a nbd server on >> +secondary host, and the nbd server will listen this port. >> + >> +=item B<colo-export> :Nbd server's disk export name of secondary host. >> + >> +=item B<active-disk> :Secondary's guest write will be buffered in this >> disk, >> +and it's used by secondary. >> + >> +=item B<hidden-disk> :Primary's modified contents will be buffered in >> this >> +disk, and it's used by secondary. > > What would a typical configuration look like ? I don't understand the > relationship between active-disk and hidden-disk, etc. QEMU has a feature: backing file For example: A's backing file is B 1. If we read from A, but the sector is not allocated in A. We wil return a zero sector to the guest. If A has a backing file, we will read the sector from B instead of returning a zero sector. 2. The backing file doesn't affect the write operation. QEMU has another feature: backup block job Backup job has two file: one is source and another is the target. It has some running mode. For block replication, we use the mode "sync=none". In this mode, we will read the data from the source disk before we modify it, and write it to the target disk. We keep a bitmap to remeber which sector is backuped from the source disk to the target disk. If the target disk is an empty disk, and empty disk's backing file is the source disk, we can read from the target disk to get the source disk's originnal data. How does block replication work: A. primary qemu: 1. use the block driver quorum: it will read from all children and write to all children. child 0: real disk child 1: nbd client reading from child 1 will fail, but we use the fifo mode. In this mode, we read from child 0 will success and we don't read from child 0 write to child 1: because child 1 is nbd client, it will forward the write request to nbd server B. secondary qemu: We have 3 disks: active disk(called it A), hidden disk(called it H), and secondary disk (real disk, called it S). A's backing file is H, and H's backing file is S. We also start a backup job: the source disk is S, and the target disk is H. we run nbd server in secondary qemu. And the nbd server will write to S. Before resuming both primary vm and secondary vm: the state is: 1. primary disk and secondary disk are in the consistent state(contain the same data) 2. active disk and hidden disk are the empty disk When the guest is running: 1. NBD server receives the primary write operation and writes the data to S 2. Before we write data to S, the backup job will read the original data and backup it to H 3. The secondary vm will write data to A. 4. If secondary vm will read data from A: I. If the sector is allocated in A, read it from A. II. Otherwise, the secondary vm doesn't modify this sector after the latest is resumed. III. In this case, we read it from H. We can read S's original data from H(See the explanation In backup job). If we have more than 1 real disk, we can use exportname to tag each disk. Each pair of primary disk and secondary disk should have the same export name. > >> +colo-host >> +--------- >> + >> +Description: Secondary host's address >> +Mandatory: Yes when COLO enabled > > Is it permitted to specify a host DNS name ? IIRC, I think it is OK. > >> + if (libxl_defbool_val(disk->colo_enable)) { >> + tmp = xs_read(ctx->xsh, XBT_NULL, >> + GCSPRINTF("%s/colo-port", be_path), &len); >> + if (!tmp) { >> + LOG(ERROR, "Missing xenstore node %s/colo-port", be_path); >> + goto cleanup; >> + } >> + disk->colo_port = tmp; > > This is quite repetitive code. Yes. Will introduce a new function to avoid it in the next version. > > >> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c >> index 3610a39..bff08b0 100644 >> --- a/tools/libxl/libxl_create.c >> +++ b/tools/libxl/libxl_create.c >> @@ -1800,12 +1800,29 @@ static void domain_create_cb(libxl__egc *egc, > ... >> @@ -256,6 +280,36 @@ static int disk_try_backend(disk_try_backend_args *a, >> LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...", >> a->disk->vdev, libxl_disk_backend_to_string(backend)); >> return 0; >> + >> + bad_colo: >> + LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo", >> + a->disk->vdev, libxl_disk_backend_to_string(backend)); >> + return 0; > > This is correct here, I think. > >> + bad_colo_host: >> + LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-host=... for colo", >> + a->disk->vdev, libxl_disk_backend_to_string(backend)); >> + return 0; > > I think these options should be checked later. disk_try_backend isn't > the place for general parameter checking; it is searching for which > backend to try. Hmm, do you mean we check it when we need to use COLO? > >> int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) { >> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c >> index 4aca38e..ba17251 100644 >> --- a/tools/libxl/libxl_dm.c >> +++ b/tools/libxl/libxl_dm.c >> @@ -751,6 +751,139 @@ static int libxl__dm_runas_helper(libxl__gc *gc, const >> char *username) > ... >> +static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char >> *pdev_path, >> + int unit, const char *format, >> + const libxl_device_disk *disk, >> + int colo_mode) >> +{ >> + char *drive = NULL; >> + const char *exportname = disk->colo_export; >> + const char *active_disk = disk->active_disk; >> + const char *hidden_disk = disk->hidden_disk; >> + >> + switch (colo_mode) { >> + case LIBXL__COLO_NONE: >> + drive = libxl__sprintf >> + (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback", >> + pdev_path, unit, format); > > I think this would be a lot clearer if the refactoring was split into > a seperate patch. OK. > >> if (strncmp(disks[i].vdev, "sd", 2) == 0) { >> - drive = libxl__sprintf >> - (gc, >> "file=%s,if=scsi,bus=0,unit=%d,format=%s,readonly=%s,cache=writeback", >> - pdev_path, disk, format, disks[i].readwrite ? >> "off" : "on"); >> + if (colo_mode == LIBXL__COLO_SECONDARY) { >> + /* >> + * -drive if=none,driver=format,file=pdev_path,\ >> + * id=exportname >> + */ > > I think this comment adds nothing to the code and could be profitably > omitted. OK. > >> + drive = libxl__sprintf >> + (gc, "if=none,driver=%s,file=%s,id=%s", >> + format, pdev_path, disks[i].colo_export); > > I don't understand how this works here. COLO_SECONDARY seems to > suppress the rest of the disk specification. COLO_SECONDAY will use two disks: one is for S, and one is for A. H is sepecified in A. This line is for S, and the codes for A is in the function qemu_disk_scsi_drive_string(). > > Also, this same logic seems to appear many times. Maybe it could be > centralised. Perhaps I would be able to advise more clearly if I > understood how this was put together. > >> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl >> index 9b0a537..a2078d1 100644 >> --- a/tools/libxl/libxl_types.idl >> +++ b/tools/libxl/libxl_types.idl >> @@ -575,6 +575,13 @@ libxl_device_disk = Struct("device_disk", [ >> ("is_cdrom", integer), >> ("direct_io_safe", bool), >> ("discard_enable", libxl_defbool), >> + ("colo_enable", libxl_defbool), >> + ("colo_restore_enable", libxl_defbool), >> + ("colo_host", string), >> + ("colo_port", string), >> + ("colo_export", string), >> + ("active_disk", string), >> + ("hidden_disk", string) > > In general, many these should probably not be strings. Certainly the > port should be an integer. I don't quite understand the semantics of > the others. Yes, the port should be an integer. Will fix it in the next version. Thanks Wen Congyang > > Ian. > > > . > _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel