On Mon, Aug 12, 2024 at 11:28 AM Oliver Freyermuth
<freyerm...@physik.uni-bonn.de> wrote:
>
> Am 12.08.24 um 11:09 schrieb Ilya Dryomov:
> > On Mon, Aug 12, 2024 at 10:20 AM Oliver Freyermuth
> > <freyerm...@physik.uni-bonn.de> wrote:
> >>
> >> Dear Cephalopodians,
> >>
> >> for the past few years we have successfully operated a "good old" Mimic cluster holding the primary RBD images, replicated via journaling to a "backup" cluster running Octopus (i.e. one-way replication).
> >> We have now finally gotten around to upgrading the cluster with the primary images to Octopus (and plan to upgrade further in the near future).
> >>
> >> After the upgrade, all MON, MGR, OSD and rbd-mirror daemons are running 15.2.17.
> >>
> >> We run three rbd-mirror daemons which all share the following client with auth in the "backup" cluster, to which they write:
> >>
> >> client.rbd_mirror_backup
> >>         caps: [mon] profile rbd-mirror
> >>         caps: [osd] profile rbd
> >>
> >> and the following shared client with auth in the "primary" cluster, from which they are reading:
> >>
> >> client.rbd_mirror
> >>         caps: [mon] profile rbd
> >>         caps: [osd] profile rbd
> >>
> >> i.e. the same auth as described in the docs[0].
> >>
> >> Checking on the primary cluster, we get:
> >>
> >> # rbd mirror pool status
> >> health: UNKNOWN
> >> daemon health: UNKNOWN
> >> image health: OK
> >> images: 288 total
> >>     288 replaying
> >>
> >> For some reason, some values are "unknown" here. But mirroring seems to work, as checking on the backup cluster reveals, see for example:
> >>
> >> # rbd mirror image status zabbix-test.example.com-disk2
> >> zabbix-test.example.com-disk2:
> >>   global_id:   1bdcb981-c1c5-4172-9583-be6a6cd996ec
> >>   state:       up+replaying
> >>   description: replaying, {"bytes_per_second":8540.27,"entries_behind_primary":0,"entries_per_second":1.8,"non_primary_position":{"entry_tid":869176,"object_number":504,"tag_tid":1},"primary_position":{"entry_tid":11143,"object_number":7,"tag_tid":1}}
> >>   service:     rbd_mirror_backup on rbd-mirror002.example.com
> >>   last_update: 2024-08-12 09:53:17
> >>
> >> However, in some seemingly random cases we see that journals are never advanced on the primary cluster — staying with the example above, on the primary cluster I find the following:
> >>
> >> # rbd journal status --image zabbix-test.physik.uni-bonn.de-disk2
> >> minimum_set: 1
> >> active_set: 126
> >> registered clients:
> >>         [id=, commit_position=[positions=[[object_number=7, tag_tid=1, entry_tid=11143], [object_number=6, tag_tid=1, entry_tid=11142], [object_number=5, tag_tid=1, entry_tid=11141], [object_number=4, tag_tid=1, entry_tid=11140]]], state=connected]
> >>         [id=52b80bb0-a090-4f7d-9950-c8691ed8fee9, commit_position=[positions=[[object_number=505, tag_tid=1, entry_tid=869181], [object_number=504, tag_tid=1, entry_tid=869180], [object_number=507, tag_tid=1, entry_tid=869179], [object_number=506, tag_tid=1, entry_tid=869178]]], state=connected]
> >>
> >> As you can see, the minimum_set was not advanced. And as can be seen in "mirror image status" above, there is the strange result that the non_primary_position appears much more advanced than the primary_position. This seems to happen "at random" for only a few volumes...
> >> There are no other active clients apart from the actual VM (libvirt+qemu).
> >
> > Hi Oliver,
> >
> > Were the VM clients (i.e. librbd on the hypervisor nodes) upgraded as well?
>
> Hi Ilya,
>
> "some of them" — as a matter of fact, we wanted to stress-test VM restarts and live migration first, and in some cases saw VMs stuck for a long time, which is now understandable...
>
> >>
> >> As a quick fix, to purge the journals piling up over and over, the only "solution" we have found is to temporarily disable and then re-enable journaling for the affected VM disks, which can be identified by:
> >>
> >> for A in $(rbd ls); do echo -n "$A: "; rbd --format=json journal status --image $A | jq '.active_set - .minimum_set'; done
> >>
> >> Any idea what is going wrong here?
> >> This did not happen before with the primary cluster running Mimic and the backup cluster running Octopus, and it also did not happen when both were running Mimic.
> >
> > You might be hitting https://tracker.ceph.com/issues/57396.
>
> Indeed, it looks exactly like that, as we do fsfreeze+fstrim every night (before snapshotting) inside all VMs (via qemu-guest-agent).
> Correlating the affected VMs with upgraded hypervisors reveals that only those VMs running on hypervisors with Octopus clients seem affected, and the issue easily explains why we saw problems with VM shutdown / restart or live migration (extreme slowness / VMs almost getting stuck). I can also confirm that these problems seem to vanish when journaling is disabled.
>
> So many thanks, this does indeed explain a lot :-). It also means the bug is still present in Octopus, but fixed in Pacific and later.
>
> We'll likely switch to snapshot-based mirroring in the next weeks (now that we know that this will avoid the problem), then finish the upgrade of all hypervisors to Octopus, and only then tackle Pacific and later.
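For readers hitting the same issue: the per-image workaround described above would presumably look roughly like the sketch below, where POOL/IMAGE is a placeholder for an affected VM disk. Note that disabling the journaling feature also discards the existing journal, and the backup copy of that image will most likely need a full resync afterwards.

    # rough, untested sketch; POOL/IMAGE stands for the affected VM disk
    rbd feature disable POOL/IMAGE journaling    # removes the feature and the piled-up journal
    rbd feature enable POOL/IMAGE journaling     # re-adds the feature with a fresh, empty journal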
Are any of your VM images clones (in the "rbd snap create" + "rbd clone" sense)?  If so, I'd advise against switching to snapshot-based mirroring as there are known issues with sync/replication correctness there.

Thanks,

                Ilya
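For anyone planning the same migration on non-cloned images, the per-image switch could look roughly like the sketch below. POOL/IMAGE and the 2h interval are placeholders, and it is assumed the pool uses (or is switched to) the "image" mirroring mode, since snapshot-based mirroring is configured per image:

    # check whether an image is a clone first (clones show a "parent:" line)
    rbd info POOL/IMAGE | grep parent

    # switch the image from journal-based to snapshot-based mirroring
    rbd mirror image disable POOL/IMAGE
    rbd mirror image enable POOL/IMAGE snapshot
    rbd feature disable POOL/IMAGE journaling    # optional: drop the now-unused journal

    # take mirror snapshots periodically (scheduling was added in Octopus)
    rbd mirror snapshot schedule add --pool POOL --image IMAGE 2h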