Aaron Lauterer <[email protected]> writes:

Some small points below:

> ceph networks (public, cluster) can be changed on the fly in a running
> cluster. But the procedure, especially for the ceph public network, is
> a bit more involved. By documenting it, we will hopefully reduce the
> number of issues our users run into when they attempt a network
> change on their own.
>
> Signed-off-by: Aaron Lauterer <[email protected]>
> ---
> Before I apply this commit I would like to get at least one T-b where you tested
> both scenarios to make sure the instructions are clear to follow and that I
> didn't miss anything.
>
>  pveceph.adoc | 186 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 186 insertions(+)
>
> diff --git a/pveceph.adoc b/pveceph.adoc
> index 63c5ca9..c4a4f91 100644
> --- a/pveceph.adoc
> +++ b/pveceph.adoc
> @@ -1192,6 +1192,192 @@ ceph osd unset noout
>  You can now start up the guests. Highly available guests will change their state
>  to 'started' when they power on.
>  
> +
> +[[pveceph_network_change]]
> +Network Changes
> +~~~~~~~~~~~~~~~
> +
> +It is possible to change the networks used by Ceph in an HCI setup without any
> +downtime if *both the old and new networks can be configured at the same time*.
> +
> +The procedure differs depending on which network you want to change.
> +
> +After the new network has been configured on all hosts, make sure you test it
> +before proceeding with the changes. One way is to ping all hosts on the new
> +network. If you use a large MTU, make sure to also test that it works, for
> +example by sending ping packets that will result in a final packet at the max
> +MTU size.
> +
> +To test an MTU of 9000, you will need the following packet sizes:
> +
> +[horizontal]
> +IPv4:: The overhead of IP and ICMP is '28' bytes; the resulting packet size for
> +the ping then is '8972' bytes.

I would personally mention that this is "generally" the case, as one
could be dealing with bigger headers, e.g. when q-in-q is used.

> +IPv6:: The overhead is '48' bytes and the resulting packet size is
> +'8952' bytes.
> +
> +The resulting ping command will look like this for IPv4:
> +[source,bash]
> +----
> +ping -M do -s 8972 {target IP}
> +----
> +
> +When you are switching between IPv4 and IPv6 networks, you need to make sure
> +that the following options in the `ceph.conf` file are correctly set to `true`
> +or `false`. These config options configure if Ceph services should bind to IPv4
> +or IPv6 addresses.
> +----
> +ms_bind_ipv4 = true
> +ms_bind_ipv6 = false
> +----
> +
> +[[pveceph_network_change_public]]
> +Change the Ceph Public Network
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The Ceph Public network is the main communication channel in a Ceph cluster
> +between the different services and clients (for example, a VM). Changing it to
> +a different network is not as simple as changing the Ceph Cluster network. The
> +main reason is that besides the configuration in the `ceph.conf` file, the Ceph
> +MONs (monitors) have an internal configuration where they keep track of all the
> +other MONs that are part of the cluster, the 'monmap'.
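
Since the monmap is introduced here, it might be nice to also mention how to
inspect it. If I'm not mistaken, the current monmap of a running cluster can
be shown with:

[source,bash]
----
ceph mon dump
----
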
> +
> +Therefore, the procedure to change the Ceph Public network is a bit more
> +involved:
> +
> +1. Change `public_network` in the `ceph.conf` file

This is mentioned in the warning below, but maybe more emphasis could be
placed here on only touching this one value.

Additionally, please use the full path here. There are versions at
/etc/pve and /etc/ceph and this is the first time in this new section
where one needs to modify one of them (even if the full path is mentioned
below in the expanded version).

> +2. Restart non MON services: OSDs, MGRs and MDS on one host
> +3. Wait until Ceph is back to 'Health_OK'

Should be HEALTH_OK instead.
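
Maybe also mention how to check for that, either in the GUI or on the CLI,
for example with:

[source,bash]
----
ceph -s
----
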

> +4. Verify services are using the new network
> +5. Continue restarting services on the next host
> +6. Destroy one MON
> +7. Recreate MON
> +8. Wait until Ceph is back to 'Health_OK'

Should be HEALTH_OK instead.

> +9. Continue destroying and recreating MONs
> +
> +You first need to edit the `/etc/pve/ceph.conf` file. Change the
> +`public_network` line to match the new subnet.
> +
> +----
> +public_network = 10.9.9.30/24
> +----
> +
> +WARNING: Do not change the `mon_host` line or any `[mon.HOSTNAME]` sections.
> +These will be updated automatically when the MONs are destroyed and recreated.
> +
> +NOTE: Don't worry if the host bits (for example, the last octet) are set by
> +default; the netmask in CIDR notation defines the network part.
> +
> +After you have changed the network, you need to restart the non MON services in
> +the cluster for the changes to take effect. Do so one node at a time! To restart all
> +non MON services on one node, you can use the following commands on that node.
> +Ceph has `systemd` targets for each type of service.
> +
> +[source,bash]
> +----
> +systemctl restart ceph-osd.target
> +systemctl restart ceph-mgr.target
> +systemctl restart ceph-mds.target
> +----
> +NOTE: You will only have MDSs (Metadata Servers) if you use CephFS.
> +
> +NOTE: After the first OSD service got restarted, the GUI will complain that
> +the OSD is not reachable anymore. This is not an issue,; VMs can still reach

Is the double punctuation here intentional?

> +them. The reason for the message is that the MGR service cannot reach the OSD
> +anymore. The error will vanish after the MGR services get restarted.
> +
> +WARNING: Do not restart OSDs on multiple hosts at the same time. Chances are
> +that for some PGs (placement groups), 2 out of the (default) 3 replicas will
> +be down. This will result in I/O being halted until the minimum required number
> +(`min_size`) of replicas is available again.
> +
> +To verify that the services are listening on the new network, you can run the
> +following command on each node:
> +
> +[source,bash]
> +----
> +ss -tulpn | grep ceph
> +----
> +
> +NOTE: Since OSDs will also listen on the Ceph Cluster network, expect to see that
> +network too in the output of `ss -tulpn`.
> +
> +Once the Ceph cluster is back in a fully healthy state ('Health_OK'), and the

Same here, HEALTH_OK.

> +services are listening on the new network, continue to restart the services on
> +the next host.
> +
> +The last services that need to be moved to the new network are the Ceph MONs
> +themselves. The easiest way is to destroy and recreate each monitor one by
> +one. This way, any mention of it in the `ceph.conf` and the monitor internal
> +`monmap` is handled automatically.
> +
> +Destroy the first MON and create it again. Wait a few moments before you
> +continue on to the next MON in the cluster, and make sure the cluster reports
> +'Health_OK' before proceeding.
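
It might help to reference the actual commands (or the existing MON section)
here. On the CLI this should boil down to something like the following,
possibly passing `--mon-address` on creation if the new address is not picked
up automatically:

[source,bash]
----
pveceph mon destroy <monid>
pveceph mon create
----
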
> +
> +Once all MONs are recreated, you can verify that any mention of MONs in the
> +`ceph.conf` file references the new network. That means mainly the `mon_host`
> +line and the `[mon.HOSTNAME]` sections.
> +
> +One final `ss -tulpn | grep ceph` should show that the old network is not used
> +by any Ceph service anymore.
> +
> +[[pveceph_network_change_cluster]]
> +Change the Ceph Cluster Network
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The Ceph Cluster network is used for the replication traffic between the OSDs.
> +Therefore, it can be beneficial to place it on its own fast physical network.
> +
> +The overall procedure is:
> +
> +1. Change `cluster_network` in the `ceph.conf` file
> +2. Restart OSDs on one host
> +3. Wait until Ceph is back to 'Health_OK'
> +4. Verify OSDs are using the new network
> +5. Continue restarting OSDs on the next host
> +
> +You first need to edit the `/etc/pve/ceph.conf` file. Change the
> +`cluster_network` line to match the new subnet.
> +
> +----
> +cluster_network = 10.9.9.30/24
> +----
> +
> +NOTE: Don't worry if the host bits (for example, the last octet) are set by
> +default; the netmask in CIDR notation defines the network part.
> +
> +After you have changed the network, you need to restart the OSDs in the cluster
> +for the changes to take effect. Do so one node at a time!
> +To restart all OSDs on one node, you can use the following command on the CLI on
> +that node:
> +
> +[source,bash]
> +----
> +systemctl restart ceph-osd.target
> +----
> +
> +WARNING: Do not restart OSDs on multiple hosts at the same time. Chances are
> +that for some PGs (placement groups), 2 out of the (default) 3 replicas will
> +be down. This will result in I/O being halted until the minimum required number
> +(`min_size`) of replicas is available again.
> +
> +To verify that the OSD services are listening on the new network, you can either
> +check the *OSD Details -> Network* tab in the *Ceph -> OSD* panel or run the
> +following command on the host:
> +[source,bash]
> +----
> +ss -tulpn | grep ceph-osd
> +----
> +
> +NOTE: Since OSDs will also listen on the Ceph Public network, expect to see that
> +network too in the output of `ss -tulpn`.
> +
> +Once the Ceph cluster is back in a fully healthy state ('Health_OK'), and the

Same, should be HEALTH_OK.

> +OSDs are listening on the new network, continue to restart the OSDs on the next
> +host.
> +
> +
>  [[pve_ceph_mon_and_ts]]
>  Ceph Monitoring and Troubleshooting
>  -----------------------------------

-- 
Maximiliano


_______________________________________________
pve-devel mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
