On Mon, Aug 12, 2024 at 10:20 AM Oliver Freyermuth
<freyerm...@physik.uni-bonn.de> wrote:
>
> Dear Cephalopodians,
>
> we've successfully operated a "good old" Mimic cluster with primary RBD
> images, replicated via journaling to a "backup cluster" running Octopus,
> for the past few years (i.e. one-way replication).
> We've now finally gotten around to upgrading the cluster with the primary
> images to Octopus (and plan to upgrade further in the near future).
>
> After the upgrade, all MON+MGR+OSD+rbd_mirror daemons are running 15.2.17.
>
> We run three rbd-mirror daemons which all share the following auth client
> in the "backup" cluster, to which they write:
>
> client.rbd_mirror_backup
>     caps: [mon] profile rbd-mirror
>     caps: [osd] profile rbd
>
> and the following shared auth client in the "primary" cluster, from which
> they read:
>
> client.rbd_mirror
>     caps: [mon] profile rbd
>     caps: [osd] profile rbd
>
> i.e. the same auth as described in the docs[0].
>
> Checking on the primary cluster, we get:
>
> # rbd mirror pool status
> health: UNKNOWN
> daemon health: UNKNOWN
> image health: OK
> images: 288 total
>     288 replaying
>
> For some reason, some values are "unknown" here. But mirroring seems to
> work, as checking on the backup cluster reveals; see for example:
>
> # rbd mirror image status zabbix-test.example.com-disk2
> zabbix-test.example.com-disk2:
>   global_id:   1bdcb981-c1c5-4172-9583-be6a6cd996ec
>   state:       up+replaying
>   description: replaying, {"bytes_per_second":8540.27,"entries_behind_primary":0,"entries_per_second":1.8,"non_primary_position":{"entry_tid":869176,"object_number":504,"tag_tid":1},"primary_position":{"entry_tid":11143,"object_number":7,"tag_tid":1}}
>   service:     rbd_mirror_backup on rbd-mirror002.example.com
>   last_update: 2024-08-12 09:53:17
>
> However, in some seemingly random cases we see that journals are never
> advanced on the primary cluster. Staying with the example above, on the
> primary cluster I find the following:
>
> # rbd journal status --image zabbix-test.example.com-disk2
> minimum_set: 1
> active_set: 126
> registered clients:
>   [id=, commit_position=[positions=[[object_number=7, tag_tid=1, entry_tid=11143], [object_number=6, tag_tid=1, entry_tid=11142], [object_number=5, tag_tid=1, entry_tid=11141], [object_number=4, tag_tid=1, entry_tid=11140]]], state=connected]
>   [id=52b80bb0-a090-4f7d-9950-c8691ed8fee9, commit_position=[positions=[[object_number=505, tag_tid=1, entry_tid=869181], [object_number=504, tag_tid=1, entry_tid=869180], [object_number=507, tag_tid=1, entry_tid=869179], [object_number=506, tag_tid=1, entry_tid=869178]]], state=connected]
>
> As you can see, the minimum_set has not advanced. Also, as the "rbd mirror
> image status" output above shows, the non_primary_position oddly appears to
> be much further advanced than the primary_position. This seems to happen
> "at random" for only a few volumes...
> There are no other active clients apart from the actual VM (libvirt+qemu).
Hi Oliver,

Were the VM clients (i.e. librbd on the hypervisor nodes) upgraded as well?

>
> As a quick fix to purge the journals piling up over and over, the only
> "solution" we've found is to temporarily disable and then re-enable
> journaling for the affected VM disks, which can be identified by:
>
> for A in $(rbd ls); do echo -n "$A: "; rbd --format=json journal status --image $A | jq '.active_set - .minimum_set'; done
>
> Any idea what is going wrong here?
> This did not happen before, when the primary cluster was running Mimic and
> the backup cluster Octopus, and it also did not happen when both were
> running Mimic.

You might be hitting https://tracker.ceph.com/issues/57396.

Thanks,

Ilya
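
A minimal sketch of the disable/re-enable workaround quoted above, assuming
journal-based (pool mode) mirroring, a hypothetical pool name "rbd", and an
arbitrary lag threshold of 10 journal object sets; adapt the image selection
to your environment before running anything:

#!/bin/sh
# Sketch only: toggle the journaling feature on images whose journal is no
# longer being trimmed, identified (as in the one-liner above) by the gap
# between active_set and minimum_set.
POOL="rbd"        # assumed pool name
THRESHOLD=10      # arbitrary number of journal object sets considered "stuck"

for IMG in $(rbd ls "$POOL"); do
    LAG=$(rbd --format=json journal status --pool "$POOL" --image "$IMG" \
          | jq '.active_set - .minimum_set')
    if [ "$LAG" -gt "$THRESHOLD" ]; then
        echo "resetting journal of $POOL/$IMG (lag: $LAG object sets)"
        # Disabling journaling removes the piled-up journal; re-enabling it
        # creates a fresh journal, after which rbd-mirror picks the image up again.
        rbd feature disable "$POOL/$IMG" journaling
        rbd feature enable "$POOL/$IMG" journaling
    fi
done

Presumably this comes at the cost of a full re-sync of each affected image on
the backup cluster, which is why it is only a stop-gap.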