Dear Cephalopodians,

Inspired by
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032092.html,
I did a check of the object maps of our RBD volumes and snapshots. The
cluster I am talking about runs 13.2.1, with all hosts (OSDs, MONs, RBD
client nodes) still on CentOS 7.5.

Sadly, I found that for at least 50 % of the snapshots (only the snapshots, not 
the volumes themselves), I got something like:
--------------------------------------------------------------------------------------------------
2019-01-09 23:00:06.481 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0000000000000260 marked as 1, but 
should be 3
2019-01-09 23:00:06.563 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0000000000000840 marked as 1, but 
should be 3
--------------------------------------------------------------------------------------------------
2019-01-09 23:00:09.166 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0000000000000480 marked as 1, but 
should be 3
2019-01-09 23:00:09.228 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0000000000000840 marked as 1, but 
should be 3
--------------------------------------------------------------------------------------------------
It usually seems to affect 1-3 entries in the object map of a snapshot. The
object map was *not* marked invalid before I ran the check.
After rebuilding it, the check passes again.
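For what it's worth, my reading of librbd's object-map states is 0 =
NONEXISTENT, 1 = EXISTS, 2 = PENDING and 3 = EXISTS_CLEAN (an assumption on
my part, corrections welcome). A small sketch to translate the log lines
above into something readable:

```python
import re

# Assumed librbd object-map state values (my reading of the source,
# please correct me if these are wrong):
OBJECT_MAP_STATES = {
    0: "NONEXISTENT",
    1: "EXISTS",        # object exists, possibly modified
    2: "PENDING",       # deletion pending
    3: "EXISTS_CLEAN",  # object exists, unchanged since the snapshot
}

LINE_RE = re.compile(
    r"object (?P<obj>rbd_data\.\S+) marked as (?P<got>\d+), "
    r"but should be (?P<want>\d+)"
)

def decode(line):
    """Turn an ObjectMapIterateRequest error line into readable states."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    name = lambda s: OBJECT_MAP_STATES.get(int(s), s)
    return m.group("obj"), name(m.group("got")), name(m.group("want"))

print(decode("object map error: object rbd_data.519c46b8b4567.0000000000000260 "
             "marked as 1, but should be 3"))
```

If those names are right, "marked as 1, but should be 3" would mean the
snapshot's map says "exists and dirty" where "exists and clean" is correct,
which sounds more like over-reporting than missing changes, but I would not
bet on it.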

The cluster has not seen any Ceph update yet (it was installed as 13.2.1; we
plan to upgrade to 13.2.4 soonish).
There have been no major causes for worry so far. We purged a single OSD disk,
balanced PGs with upmap, modified the CRUSH topology slightly, etc.
The cluster was never in a prolonged unhealthy period, nor did we have to
repair any PG.

Is this a known error?
Is it harmful, or is it just something like reference counting being off, with
objects staying in the map even though they did not really change in the
snapshot?

Our use case, in case that helps with understanding or reproducing:
- RBDs are used as disks for qemu/kvm virtual machines. 
- Every night:
  - We run an fstrim in the VM (which propagates to RBD and purges empty 
blocks), fsfreeze it, take a snapshot, thaw it again. 
  - After that, we run two backups, with Benji backup ( https://benji-backup.me/ )
    and Backy2 backup ( http://backy2.com/docs/ ), which both seem to work
    rather well so far.
  - We purge some old snapshots. 
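In pseudo-commands, one night for one VM looks roughly like the following
sketch (domain and image names are made-up placeholders; we go through the
qemu guest agent via libvirt, and I left out all error handling and the
backup runs themselves):

```python
def nightly_snapshot_commands(domain, image, snap):
    """Sketch of the per-VM nightly command sequence.
    domain/image/snap are placeholders, not our real names."""
    return [
        # trim unused blocks inside the guest so empty space is purged from RBD
        ["virsh", "domfstrim", domain],
        # freeze guest filesystems for a consistent snapshot
        ["virsh", "domfsfreeze", domain],
        # take the RBD snapshot while the guest is frozen
        ["rbd", "snap", "create", "{}@{}".format(image, snap)],
        # thaw again as quickly as possible
        ["virsh", "domfsthaw", domain],
    ]

for cmd in nightly_snapshot_commands("vm-0001", "rbd/vm-0001-disk1",
                                     "nightly-2019-01-10"):
    print(" ".join(cmd))
```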

We use the following RBD feature flags:
layering, exclusive-lock, object-map, fast-diff, deep-flatten

Since Benji and Backy2 are optimized for differential RBD backups to 
deduplicated storage, they leverage "rbd diff" (and hence make use of 
fast-diff, I would think). 
If rbd diff produces wrong output due to this issue, it would affect our 
backups (but it would also affect classic backups of snapshots via "rbd 
export"...). 
In case the issue is known or understood, can somebody extrapolate whether this 
means "rbd diff" contains too many blocks or actually misses changed blocks? 
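To make the stakes concrete: as far as I understand, both tools essentially
consume the extent list from "rbd diff --from-snap <prev> <pool>/<image>@<snap>
--format json". The sample below is made up, and I am not sure whether
"exists" comes back as a string or a boolean in all versions, so the sketch
accepts both. A stale "exists" bit in the map would add spurious extents
here; a stale "nonexistent" bit would drop real ones:

```python
import json

# Made-up example of what `rbd diff ... --format json` might return:
sample = json.loads('[{"offset": 0, "length": 4194304, "exists": "true"},'
                    ' {"offset": 8388608, "length": 4194304, "exists": "true"},'
                    ' {"offset": 12582912, "length": 4194304, "exists": "false"}]')

def changed_bytes(extents):
    """Sum the bytes a differential backup would read and transfer."""
    return sum(e["length"] for e in extents
               if e.get("exists") in (True, "true"))

print(changed_bytes(sample))  # 8388608 for the made-up sample above
```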


From now on, we are running daily full object-map checks on all volumes and
backups, and automatically rebuilding any object map which the check found
invalid.
Hopefully, this will allow us to correlate the appearance of these issues with
"something" happening on the cluster.
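Our check-and-rebuild routine boils down to the following sketch (assuming
"rbd object-map check" exits non-zero on a failed check; the logging and the
iteration over all images and snapshots are omitted):

```python
import subprocess

def check_and_rebuild(spec, run=subprocess.run):
    """Check the object map of "pool/image" or "pool/image@snap" and
    rebuild it if the check fails. `run` is injectable for testing."""
    if run(["rbd", "object-map", "check", spec]).returncode != 0:
        run(["rbd", "object-map", "rebuild", spec])
        return "rebuilt"
    return "ok"
```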
I did not detect a clear pattern in the affected snapshots, though; it seemed
rather random...

It might also help in understanding this issue if somebody else using RBD in
a similar manner on Mimic could check their object maps.
Since the issue does not show up until a check is performed, it stayed below
our radar for many months...

Cheers,
        Oliver


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
