On Mon, Aug 12, 2024 at 11:28 AM Oliver Freyermuth
<freyerm...@physik.uni-bonn.de> wrote:
>
> Am 12.08.24 um 11:09 schrieb Ilya Dryomov:
> > On Mon, Aug 12, 2024 at 10:20 AM Oliver Freyermuth
> > <freyerm...@physik.uni-bonn.de> wrote:
> >>
> >> Dear Cephalopodians,
> >>
> >> for the past few years we have successfully operated a "good old" Mimic cluster holding the primary RBD images, replicated via journaling to a "backup" cluster running Octopus (i.e. one-way replication).
> >> We have now finally gotten around to upgrading the cluster with the primary images to Octopus (and plan to upgrade further in the near future).
> >>
> >> After the upgrade, all MON, MGR, OSD and rbd-mirror daemons are running 15.2.17.
> >>
> >> We run three rbd-mirror daemons which all share the following client with auth in the "backup" cluster, to which they write:
> >>
> >> client.rbd_mirror_backup
> >>         caps: [mon] profile rbd-mirror
> >>         caps: [osd] profile rbd
> >>
> >> and the following shared client with auth in the "primary" cluster, from which they are reading:
> >>
> >> client.rbd_mirror
> >>         caps: [mon] profile rbd
> >>         caps: [osd] profile rbd
> >>
> >> i.e. the same auth as described in the docs[0].
> >>
> >> Checking on the primary cluster, we get:
> >>
> >> # rbd mirror pool status
> >> health: UNKNOWN
> >> daemon health: UNKNOWN
> >> image health: OK
> >> images: 288 total
> >>     288 replaying
> >>
> >> For some reason, some values are "unknown" here. But mirroring seems to work, as checking on the backup cluster reveals, see for example:
> >>
> >> # rbd mirror image status zabbix-test.example.com-disk2
> >> zabbix-test.example.com-disk2:
> >>   global_id:   1bdcb981-c1c5-4172-9583-be6a6cd996ec
> >>   state:       up+replaying
> >>   description: replaying, {"bytes_per_second":8540.27,"entries_behind_primary":0,"entries_per_second":1.8,"non_primary_position":{"entry_tid":869176,"object_number":504,"tag_tid":1},"primary_position":{"entry_tid":11143,"object_number":7,"tag_tid":1}}
> >>   service:     rbd_mirror_backup on rbd-mirror002.example.com
> >>   last_update: 2024-08-12 09:53:17
> >>
> >> However, in some seemingly random cases we see that journals are never advanced on the primary cluster — staying with the example above, on the primary cluster I find the following:
> >>
> >> # rbd journal status --image zabbix-test.physik.uni-bonn.de-disk2
> >> minimum_set: 1
> >> active_set: 126
> >> registered clients:
> >>         [id=, commit_position=[positions=[[object_number=7, tag_tid=1, entry_tid=11143], [object_number=6, tag_tid=1, entry_tid=11142], [object_number=5, tag_tid=1, entry_tid=11141], [object_number=4, tag_tid=1, entry_tid=11140]]], state=connected]
> >>         [id=52b80bb0-a090-4f7d-9950-c8691ed8fee9, commit_position=[positions=[[object_number=505, tag_tid=1, entry_tid=869181], [object_number=504, tag_tid=1, entry_tid=869180], [object_number=507, tag_tid=1, entry_tid=869179], [object_number=506, tag_tid=1, entry_tid=869178]]], state=connected]
> >>
> >> As you can see, the minimum_set was not advanced. And as can be seen in "mirror image status" above, there is the strange result that the non_primary_position appears much more advanced than the primary_position. This seems to happen "at random" for only a few volumes...
> >> There are no other active clients apart from the actual VM (libvirt+qemu).
> >
> > Hi Oliver,
> >
> > Were the VM clients (i.e. librbd on the hypervisor nodes) upgraded as well?
>
> Hi Ilya,
>
> "some of them" — as a matter of fact, we wanted to stress-test VM restarts and live migration first, and in some cases saw VMs stuck for a long time, which is now understandable...
>
> >>
> >> As a quick fix, to purge the journals piling up over and over, the only "solution" we have found is to temporarily disable and then re-enable journaling for the affected VM disks, which can be identified by:
> >>
> >> for A in $(rbd ls); do echo -n "$A: "; rbd --format=json journal status --image $A | jq '.active_set - .minimum_set'; done
> >>
> >> Any idea what is going wrong here?
> >> This did not happen before with the primary cluster running Mimic and the backup cluster running Octopus, and it also did not happen when both were running Mimic.
> >
> > You might be hitting https://tracker.ceph.com/issues/57396.
>
> Indeed, it looks exactly like that, as we do fsfreeze+fstrim every night (before snapshotting) inside all VMs (via qemu-guest-agent).
> Correlating the affected VMs with upgraded hypervisors reveals that only those VMs running on hypervisors with Octopus clients seem affected, and the issue easily explains why we saw problems with VM shutdown / restart or live migration (extreme slowness / VMs almost getting stuck). I can also confirm that these problems seem to vanish when journaling is disabled.
>
> So many thanks, this does indeed explain a lot :-). It also means the bug is still present in Octopus, but fixed in Pacific and later.
>
> We'll likely switch to snapshot-based mirroring in the next weeks (now that we know that this will avoid the problem), then finish the upgrade of all hypervisors to Octopus, and only then tackle Pacific and later.
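For readers hitting the same issue: the per-image workaround described above would presumably look roughly like the sketch below, where POOL/IMAGE is a placeholder for an affected VM disk. Note that disabling the journaling feature also discards the existing journal, and the backup copy of that image will most likely need a full resync afterwards.

    # rough, untested sketch; POOL/IMAGE stands for the affected VM disk
    rbd feature disable POOL/IMAGE journaling    # removes the feature and the piled-up journal
    rbd feature enable POOL/IMAGE journaling     # re-adds the feature with a fresh, empty journal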
Are any of your VM images clones (in the "rbd snap create" + "rbd clone" sense)?  If so, I'd advise against switching to snapshot-based mirroring as there are known issues with sync/replication correctness there.

Thanks,

                Ilya
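For anyone planning the same migration on non-cloned images, the per-image switch could look roughly like the sketch below. POOL/IMAGE and the 2h interval are placeholders, and it is assumed the pool uses (or is switched to) the "image" mirroring mode, since snapshot-based mirroring is configured per image:

    # check whether an image is a clone first (clones show a "parent:" line)
    rbd info POOL/IMAGE | grep parent

    # switch the image from journal-based to snapshot-based mirroring
    rbd mirror image disable POOL/IMAGE
    rbd mirror image enable POOL/IMAGE snapshot
    rbd feature disable POOL/IMAGE journaling    # optional: drop the now-unused journal

    # take mirror snapshots periodically (scheduling was added in Octopus)
    rbd mirror snapshot schedule add --pool POOL --image IMAGE 2h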