On Wed, Jul 2, 2025 at 1:36 PM Gary Molenkamp <molen...@uwo.ca> wrote:
>
> I confirmed and can consistently replicate the failure event that forces
> the object-map rebuild.
>
> If the VM is terminated cleanly, such as a hypervisor reboot, then the
> VMs and their rbd volumes are all well.
> If the hypervisor goes down hard, such as a hard power cycle, then any
> VMs on the hypervisor will have
> sufficient I/O errors to prevent either a boot or the root volume from
> mounting.   This usually manifests as
>    "I/O error, dev sda, sector ......"
>
> The object-map check when this happens appears clean:
> [root@eda84984a767 /]# rbd object-map check proxmox/vm-179-disk-0
> Object Map Check: 100% complete...done.

Hi Gary,

Sorry, I forgot to specify the "--debug-rbd 1" option for the "rbd
object-map check" command in the previous email.

Is proxmox/vm-179-disk-0 a cloned image?

>
> And to confirm, rebuilding the above object-map, then allows the VM to
> function correctly with no apparent
> I/O error reports from the OS' kernel.

Since this appears to be easily reproducible, can you capture the
"rbd info" output and the object listing, run "rbd object-map check
--debug-rbd 1", and extract the object map object for one of the
images at each of these points:

a) after powering on the hypervisor but before starting the VM
b) after the VM is started and I/O errors are observed
c) after running "rbd object-map rebuild"

The object listing can be obtained with:

$ IMAGE_ID=$(rbd info proxmox/vm-179-disk-0 --format json | jq -r '.id')
$ rados -p proxmox ls | grep $IMAGE_ID >objs-a.txt

The object map can be extracted with:

$ rados -p proxmox get rbd_object_map.$IMAGE_ID objmap-a.bin
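
To avoid retyping these for each of a), b) and c), the steps above could
be wrapped in a small helper. This is only a sketch: the function name
collect_stage and the output file names are made up for illustration, and
it assumes rbd, rados and jq are available on the node where it runs.

```shell
# Hypothetical helper (not part of Ceph): collect_stage <pool> <image> <stage>
# Writes info-<stage>.txt, objs-<stage>.txt, check-<stage>.txt and
# objmap-<stage>.bin into the current directory.
collect_stage() {
    cs_pool=$1
    cs_image=$2
    cs_stage=$3

    # Plain "rbd info" output for the report
    rbd info "$cs_pool/$cs_image" > "info-$cs_stage.txt"

    # Internal image id, used in the names of the image's RADOS objects
    cs_id=$(rbd info "$cs_pool/$cs_image" --format json | jq -r '.id')
    if [ -z "$cs_id" ]; then
        echo "could not determine image id for $cs_pool/$cs_image" >&2
        return 1
    fi

    # Object listing restricted to this image
    rados -p "$cs_pool" ls | grep "$cs_id" > "objs-$cs_stage.txt"

    # Object map check with debug output; stderr is captured as well
    rbd object-map check --debug-rbd 1 "$cs_pool/$cs_image" \
        > "check-$cs_stage.txt" 2>&1

    # Raw object map object
    rados -p "$cs_pool" get "rbd_object_map.$cs_id" "objmap-$cs_stage.bin"
}
```

It would be called once per step, e.g. "collect_stage proxmox vm-179-disk-0 a".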

Attach the resulting outputs and files for a, b and c here or file
a tracker ticket, whichever you prefer.

>
> Is there something else I should be checking?   Could it be related to
> the rbd_invalidate_object_map_on_timeout setting on the pool?

Is the rados_osd_op_timeout option set?  If so, what is the value?
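
One way to check a local config file for it is sketched below. The
function name is made up, and /etc/ceph/ceph.conf is only an assumed
default; pass whichever file the QEMU process actually reads. The pattern
allows for the option being written with spaces, underscores or dashes.
It does not cover values set elsewhere, such as the centralized config
database.

```shell
# Hypothetical helper: print any rados_osd_op_timeout line in a ceph.conf.
# Prints nothing and returns non-zero if the option is not present.
find_osd_op_timeout() {
    fot_conf=${1:-/etc/ceph/ceph.conf}   # assumed default path
    grep -Ei '^[[:space:]]*rados[ _-]osd[ _-]op[ _-]timeout[[:space:]]*=' "$fot_conf"
}
```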

To get the most out of the reproduction attempt, it would be great to
enable verbose logging before step b) and disable it immediately after.
I'm not sure how it's set up in Proxmox, but you would need to do the
equivalent of adding

debug ms = 1
debug rbd = 20
log file = <some path>
log to file = true

to the ceph.conf file that is picked up by the QEMU process.
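
As a rough sketch, the fragment could be generated with the log path
filled in. The function name is made up, and placing the options under
a [client] section is my assumption about where they would go for a
librbd client; adjust to however Proxmox lays out its config.

```shell
# Hypothetical helper: print the debug fragment for a given log path.
# The [client] section header is an assumption, not taken from Proxmox docs.
print_debug_fragment() {
    pdf_log=$1
    cat <<EOF
[client]
debug ms = 1
debug rbd = 20
log file = $pdf_log
log to file = true
EOF
}
```

The output would be reviewed and merged by hand, then removed again once
step b) has been captured.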

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io