Existing information is preserved in a restructured way and usefully supplemented.
* restructure and revise the introduction
* add subchapter "Considerations"
* remove the subchapter "Schedule Format" with its one line of content
  and link where appropriate directly to the copy under
  "25. Appendix D: Calendar Events". The help button for adding/editing
  a job now links to the subchapter "Managing Jobs".
* provide details on job removal and how to enforce it if necessary
* add more helpful CLI examples and improve/fix existing ones
* restructure and revise the subchapter "Error Handling"

Signed-off-by: Alexander Zeidler <a.zeid...@proxmox.com>
---
v3:
* adapt the introduction and section "Risk of Data Loss" to provide
  information about using a shared storage together with storage
  replication
* update the CLI example `pvesr update` (*:00 replaces incorrect */00)
* implement most suggestions from Daniel Kral
** update commit message
** reword first paragraph of introduction
** rename subchapter "Recommendations" to "Considerations"
** write "every 15 minutes" to be consistent with UI (and additionally
   mention that examples from the drop-down list can be modified)
** reword the description of the bandwidth limit

v2: https://lore.proxmox.com/pve-devel/20241218161948.3-1-a.zeid...@proxmox.com/
* no changes, only add missing pve-manager patch

 pvecm.adoc |   2 +
 pvesr.adoc | 411 +++++++++++++++++++++++++++++++++++++----------------
 2 files changed, 287 insertions(+), 126 deletions(-)

diff --git a/pvecm.adoc b/pvecm.adoc
index 15dda4e..4028e92 100644
--- a/pvecm.adoc
+++ b/pvecm.adoc
@@ -486,6 +486,7 @@ authentication. You should fix this by removing the respective keys from the
 '/etc/pve/priv/authorized_keys' file.
 
+[[pvecm_quorum]]
 Quorum
 ------
 
@@ -963,6 +964,7 @@ case $- in
 esac
 ----
 
+[[pvecm_external_vote]]
 Corosync External Vote Support
 ------------------------------
 
diff --git a/pvesr.adoc b/pvesr.adoc
index 9ad02f5..034e4c2 100644
--- a/pvesr.adoc
+++ b/pvesr.adoc
@@ -24,48 +24,68 @@ Storage Replication
 :pve-toplevel:
 endif::manvolnum[]
 
-The `pvesr` command-line tool manages the {PVE} storage replication
-framework. Storage replication brings redundancy for guests using
-local storage and reduces migration time.
-
-It replicates guest volumes to another node so that all data is available
-without using shared storage. Replication uses snapshots to minimize traffic
-sent over the network. Therefore, new data is sent only incrementally after
-the initial full sync. In the case of a node failure, your guest data is
-still available on the replicated node.
-
-The replication is done automatically in configurable intervals.
-The minimum replication interval is one minute, and the maximal interval
-once a week. The format used to specify those intervals is a subset of
-`systemd` calendar events, see
-xref:pvesr_schedule_time_format[Schedule Format] section:
-
-It is possible to replicate a guest to multiple target nodes,
-but not twice to the same target node.
-
-Each replications bandwidth can be limited, to avoid overloading a storage
-or server.
-
-Only changes since the last replication (so-called `deltas`) need to be
-transferred if the guest is migrated to a node to which it already is
-replicated. This reduces the time needed significantly. The replication
-direction automatically switches if you migrate a guest to the replication
-target node.
-
-For example: VM100 is currently on `nodeA` and gets replicated to `nodeB`.
-You migrate it to `nodeB`, so now it gets automatically replicated back from
-`nodeB` to `nodeA`.
-
-If you migrate to a node where the guest is not replicated, the whole disk
-data must send over. After the migration, the replication job continues to
-replicate this guest to the configured nodes.
+Replication can be configured for a guest which has volumes placed on
+a local storage. Those volumes are then replicated to other cluster
+nodes to enable a significantly faster guest migration to them.
+Any additional volumes on a shared storage are not replicated, since
+the shared storage is expected to also be available on the migration
+target node. Replication is particularly interesting for small
+clusters if no shared storage is available.
+
+In the event of a node or local storage failure, the volume data as of
+the latest completed replication run is still available on the
+replication target nodes.
 
 [IMPORTANT]
 ====
-High-Availability is allowed in combination with storage replication, but there
-may be some data loss between the last synced time and the time a node failed.
+While a replication-enabled guest can be configured for
+xref:chapter_ha_manager[high availability], or
+xref:pvesr_node_failed[manually moved] while its origin node is not
+available, read about the involved
+xref:pvesr_risk_of_data_loss[risk of data loss] and how to avoid it.
 ====
 
+.Replication requires …
+
+* at least one other cluster node as a replication target
+* a common local storage entry in the datacenter that is functional
+on both nodes
+* that the local storage type is
+xref:pvesr_supported_storage[supported by replication]
+* that the guest has volumes stored on that local storage
+
+.Replication …
+
+* allows a fast migration to nodes where the guest is being replicated
+* provides guest volume redundancy in a cluster where using a shared
+storage type is not an option
+* is configured as a job for a guest, with multiple jobs enabling
+multiple replication targets
+* jobs run one after the other at their configured interval (shortest
+is every minute)
+* uses snapshots to regularly transmit only changed volume data
+(so-called deltas)
+* network bandwidth can be limited per job, smoothing the storage and
+network utilization
+* targets stay basically the same when migrating the guest to another
+node
+* direction of a job reverses when moving the guest to its configured
+replication target
+
+.Example:
+
+A guest runs on node `A` and has replication jobs to node `B` and `C`,
+both with a set interval of every five minutes (`*/5`). Now we migrate
+the guest from `A` to `B`, which also automatically updates the
+replication targets for this guest to be `A` and `C`. The migration
+completes quickly, as only the volume data changed since the last
+replication run needs to be transmitted.
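+
+On the CLI, such a migration could for example be started as follows
+(assuming the guest is a VM and using the example ID `100`; a running
+VM additionally needs the `--online` flag):
+
+----
+# qm migrate 100 B
+----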
+
+In the event that node `B` or its local storage fails, the guest can
+be restarted on `A` or `C`, with the risk of some data loss as
+described in this chapter.
+
+[[pvesr_supported_storage]]
 Supported Storage Types
 -----------------------
 
@@ -76,147 +96,286 @@ Supported Storage Types
 |ZFS (local) |zfspool |yes |yes
 |=============================================
 
-[[pvesr_schedule_time_format]]
-Schedule Format
----------------
-Replication uses xref:chapter_calendar_events[calendar events] for
-configuring the schedule.
-
-Error Handling
+[[pvesr_considerations]]
+Considerations
 --------------
 
-If a replication job encounters problems, it is placed in an error state.
-In this state, the configured replication intervals get suspended
-temporarily. The failed replication is repeatedly tried again in a
-30 minute interval.
-Once this succeeds, the original schedule gets activated again.
+[[pvesr_risk_of_data_loss]]
+Risk of Data Loss
+~~~~~~~~~~~~~~~~~
+
+If a node should suddenly become unavailable for a longer period of
+time, it may become necessary to run a guest on a replication target
+node instead. The guest will then use the latest replicated volume
+data available on the chosen target node. That volume state will then
+also be replicated to other nodes with the next replication runs,
+since the replication directions are automatically updated for related
+jobs. This also means that the newer volume state on the failed node
+will be removed once that node becomes available again. Volumes on a
+shared storage are not affected by this, since they are not
+replicated.
+
+A more resilient solution may be to use a shared
+xref:chapter_storage[storage type] instead. If that is not an option,
+consider setting the replication job intervals short enough and avoid
+moving replication-configured guests while their origin node is not
+available. Instead of configuring those guests for high availability,
+xref:qm_startup_and_shutdown[start at boot] could be a sufficient
+alternative.
+
+[[pvesr_replication_network]]
+Network for Replication Traffic
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Replication traffic is routed via the
+xref:pvecm_migration_network[migration network]. If it is not set, the
+management network is used by default, which can have a negative
+impact on corosync and therefore on cluster availability. To specify
+the migration network, navigate to
+__Datacenter -> Options -> Migration Settings__, or set it via CLI in
+the xref:datacenter_configuration_file[`datacenter.cfg`].
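+
+Such an entry in the `datacenter.cfg` could look like the following
+(the network address is only an example):
+
+----
+migration: secure,network=10.10.10.0/24
+----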
+
+[[pvesr_cluster_size]]
+Cluster Size
+~~~~~~~~~~~~
+
+With a 2-node cluster in particular, the failure of one node can leave
+the other node without a xref:pvecm_quorum[quorum]. In order to keep
+the cluster functional at all times, it is therefore crucial to
+xref:pvecm_join_node_to_cluster[expand] to a 3-node cluster in advance
+or to configure a xref:pvecm_external_vote[QDevice] for the third
+vote.
+
+[[pvesr_managing_jobs]]
+Managing Jobs
+-------------
 
-Possible issues
-~~~~~~~~~~~~~~~
+[thumbnail="screenshot/gui-qemu-add-replication-job.png"]
 
-Some of the most common issues are in the following list. Depending on your
-setup there may be another cause.
+Replication jobs can easily be created, modified and removed via the
+web interface, or by using the CLI tool `pvesr`.
 
-* Network is not working.
+To manage all replication jobs in one place, go to
+__Datacenter -> Replication__. Additional functionalities are
+available under __Node -> Replication__ and __Guest -> Replication__.
+Go there to view logs, trigger a job to run once right away, or
+benefit from preset fields when configuring a job.
 
-* No free space left on the replication target storage.
+Enabled replication jobs will automatically run at their set interval,
+one after the other. You can change the default interval of every 15
+minutes (`*/15`) by selecting or adapting an example from the
+drop-down list. The shortest interval is every minute (`*/1`). See
+also xref:chapter_calendar_events[schedule format].
 
-* Storage with the same storage ID is not available on the target node.
+If replication jobs result in significant I/O load on the target node,
+the network bandwidth of individual jobs can be limited to keep the
+load at an acceptable level.
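+
+For example, an existing job (here with the assumed ID `100-0`) could
+be limited to 10 MB/s via the CLI:
+
+----
+# pvesr update 100-0 --rate 10
+----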
 
-NOTE: You can always use the replication log to find out what is causing the problem.
+Shortly after job creation, a first snapshot is taken and sent to the
+target node. Subsequent snapshots are taken according to the schedule
+and only contain modified volume data, allowing a significantly
+shorter transfer time.
 
-Migrating a guest in case of Error
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-// FIXME: move this to better fitting chapter (sysadmin ?) and only link to
-// it here
+If you remove a replication job, the snapshots on the target node are
+also deleted by default. The removal takes place at the next possible
+point in time and requires the job to be enabled. If the target node
+is permanently unreachable, the cleanup can be skipped by forcing the
+job deletion via CLI.
 
-In the case of a grave error, a virtual guest may get stuck on a failed
-node. You then need to move it manually to a working node again.
+When not using the web interface, the cluster-wide unique replication
+job ID has to be specified. For example, `100-0`, which is composed of
+the guest ID, a hyphen and an arbitrary job number.
 
-Example
-~~~~~~~
+[[pvesr_cli_examples]]
+CLI Examples
+------------
 
-Let's assume that you have two guests (VM 100 and CT 200) running on node A
-and replicate to node B.
-Node A failed and can not get back online. Now you have to migrate the guest
-to Node B manually.
+Create a replication job for guest `100` and give it the job number
+`0`. Replicate to node `pve2` every five minutes (`*/5`), at a maximum
+network bandwidth of `10` MBps (megabytes per second).
 
-- connect to node B over ssh or open its shell via the web UI
+
+----
+# pvesr create-local-job 100-0 pve2 --schedule "*/5" --rate 10
+----
+
+List replication jobs from all nodes.
 
-- check if that the cluster is quorate
-+
 ----
-# pvecm status
+# pvesr list
 ----
 
-- If you have no quorum, we strongly advise to fix this first and make the
-  node operable again. Only if this is not possible at the moment, you may
-  use the following command to enforce quorum on the current node:
-+
+List the job statuses from all local guests, or only from a specific
+local guest.
+
 ----
-# pvecm expected 1
+# pvesr status [--guest 100]
 ----
 
-WARNING: Avoid changes which affect the cluster if `expected votes` are set
-(for example adding/removing nodes, storages, virtual guests) at all costs.
-Only use it to get vital guests up and running again or to resolve the quorum
-issue itself.
+Read the configuration of job `100-0`.
 
-- move both guest configuration files form the origin node A to node B:
-+
 ----
-# mv /etc/pve/nodes/A/qemu-server/100.conf /etc/pve/nodes/B/qemu-server/100.conf
-# mv /etc/pve/nodes/A/lxc/200.conf /etc/pve/nodes/B/lxc/200.conf
+# pvesr read 100-0
 ----
 
-- Now you can start the guests again:
-+
+Update the configuration of job `100-0`, for example, to change the
+schedule to every full hour (`*:00`, equivalent to `hourly`).
+
 ----
-# qm start 100
-# pct start 200
+# pvesr update 100-0 --schedule "*:00"
 ----
 
-Remember to replace the VMIDs and node names with your respective values.
+To run the job `100-0` once as soon as possible, regardless of the
+configured schedule:
 
-Managing Jobs
--------------
+----
+# pvesr schedule-now 100-0
+----
 
-[thumbnail="screenshot/gui-qemu-add-replication-job.png"]
+Disable (or `enable`) the job `100-0`.
+
+----
+# pvesr disable 100-0
+----
+
+Delete the job `100-0`. If the target node is permanently unreachable,
+`--force` can be used to skip the failing cleanup.
 
-You can use the web GUI to create, modify, and remove replication jobs
-easily. Additionally, the command-line interface (CLI) tool `pvesr` can be
-used to do this.
+----
+# pvesr delete 100-0 [--force]
+----
 
-You can find the replication panel on all levels (datacenter, node, virtual
-guest) in the web GUI. They differ in which jobs get shown:
-all, node- or guest-specific jobs.
+[[pvesr_error_handling]]
+Error Handling
+--------------
 
-When adding a new job, you need to specify the guest if not already selected
-as well as the target node. The replication
-xref:pvesr_schedule_time_format[schedule] can be set if the default of `all
-15 minutes` is not desired. You may impose a rate-limit on a replication
-job. The rate limit can help to keep the load on the storage acceptable.
+[[pvesr_job_failed]]
+Job Failed
+~~~~~~~~~~
 
-A replication job is identified by a cluster-wide unique ID. This ID is
-composed of the VMID in addition to a job number.
-This ID must only be specified manually if the CLI tool is used.
+In the event that a replication job fails, it is temporarily placed in
+an error state and a notification is sent. A retry is scheduled after
+5 minutes, further retries follow after another 10 and 15 minutes, and
+finally every 30 minutes. As soon as the job has run successfully
+again, the error state is cleared and the configured schedule is
+resumed.
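+
+Once the cause has been fixed, there is no need to wait for the next
+automatic retry; as shown in the CLI examples above, the job (the ID
+`100-0` is just an example) can be scheduled to run once right away:
+
+----
+# pvesr schedule-now 100-0
+----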
 
-Network
--------
+.Troubleshooting Job Failures
-Replication traffic will use the same network as the live guest migration. By
-default, this is the management network. To use a different network for the
-migration, configure the `Migration Network` in the web interface under
-`Datacenter -> Options -> Migration Settings` or in the `datacenter.cfg`. See
-xref:pvecm_migration_network[Migration Network] for more details.
+To find out exactly why a job failed, read the log available under
+__Node -> Replication__.
 
-Command-line Interface Examples
--------------------------------
+Common causes are:
 
-Create a replication job which runs every 5 minutes with a limited bandwidth
-of 10 Mbps (megabytes per second) for the guest with ID 100.
+* The network is not working properly.
+* The storage (ID) in use has an availability restriction configured
+which excludes the target node.
+* The storage is not set up correctly on the target node (e.g.
+different pool name).
+* The storage on the target node has no free space left.
+
+[[pvesr_node_failed]]
+Origin Node Failed
+~~~~~~~~~~~~~~~~~~
+// FIXME: move this to better fitting chapter (sysadmin ?) and only link to
+// it here
+In the event that a node running replicated guests fails suddenly and
+for too long, it may become necessary to restart these guests on their
+replication target nodes. If replicated guests are configured for high
+availability (HA), simply wait until these guests are recovered on
+other nodes, keeping in mind the involved
+xref:pvesr_risk_of_data_loss[risk of data loss]. Replicated guests
+which are not configured for HA can be moved manually as explained
+below, with the same risk of data loss.
+
+[[pvesr_find_latest_replicas]]
+.Step 1: Optionally Decide on a Specific Replication Target Node
+
+To minimize the data loss of an important guest, you can optionally
+find the target node on which the most recent successful replication
+took place. If the origin node is healthy enough to access its web
+interface, go to __Node -> Replication__ and see the 'Last Sync'
+column. Alternatively, you can carry out the following steps.
+
+. To list all target nodes of an important guest, for example with the
+ID `1000`, go to the CLI of any node and run:
++
 ----
-# pvesr create-local-job 100-0 pve1 --schedule "*/5" --rate 10
+# pvesr list | grep -e Job -e ^1000
 ----
 
-Disable an active job with ID `100-0`.
+. Open the CLI on all listed target nodes.
+. Adapt the following command with your VMID to find the most recent
+snapshots among your target nodes. If snapshots were taken in the same
+minute, look for the highest number at the end of the name, which is
+the Unix timestamp.
++
 ----
-# pvesr disable 100-0
+# zfs list -t snapshot -o name,creation | grep -e -1000-disk
 ----
 
-Enable a deactivated job with ID `100-0`.
+[[pvesr_verify_cluster_health]]
+.Step 2: Verify Cluster Health
+
+Go to the CLI of any replication target node and run `pvecm status`.
+If the output contains `Quorate: Yes`, then the cluster/corosync is
+healthy enough and you can proceed with
+xref:pvesr_move_a_guest[Step 3: Move a Guest].
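+
+For example, the relevant line of the output can be checked directly:
+
+----
+# pvecm status | grep -i quorate
+----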
+
+WARNING: If the cluster is not quorate and consists of 3 or more
+nodes/votes, we strongly recommend solving the underlying problem
+first, so that at least the majority of nodes/votes are available
+again.
+
+If the cluster is not quorate and consists of only 2 nodes without an
+additional xref:pvecm_external_vote[QDevice], you may want to proceed
+with the following steps to temporarily make the cluster functional
+again.
+
+. Check whether the expected votes are `2`.
++
 ----
-# pvesr enable 100-0
+# pvecm status | grep votes
 ----
 
-Change the schedule interval of the job with ID `100-0` to once per hour.
+. Now you can enforce quorum on the one remaining node by running:
++
+----
+# pvecm expected 1
+----
++
+WARNING: Avoid making changes to the cluster in this state at all
+costs, for example adding or removing nodes, storages or guests. Delay
+such changes until the second node is available again and the expected
+votes have been automatically restored to `2`.
+
+[[pvesr_move_a_guest]]
+.Step 3: Move a Guest
+. Use SSH to connect to any node that is part of the cluster majority.
+Alternatively, go to the web interface and open the shell of such a
+node in a separate window or browser tab.
++
+. The following example commands move a VM with the ID `1000` and a
+container with the ID `2000` from the node named `pve-failed` to a
+still available replication target node named `pve-replicated`.
++
+----
+# cd /etc/pve/nodes/
+# mv pve-failed/qemu-server/1000.conf pve-replicated/qemu-server/
+# mv pve-failed/lxc/2000.conf pve-replicated/lxc/
+----
++
+. Now you can start those guests again:
++
 ----
-# pvesr update 100-0 --schedule '*/00'
+# qm start 1000
+# pct start 2000
 ----
++
+. If it was necessary to enforce the quorum, as described when
+verifying the cluster health, do not forget the warning at the end
+about avoiding changes to the cluster.
 
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]
-- 
2.39.5