Hi Chris and Wissem,

finally found the time: https://tracker.ceph.com/issues/50638

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Chris Dunlop <ch...@onthe.net.au>
Sent: 16 March 2021 03:56:50
To: Frank Schilder
Cc: ceph-users@ceph.io; Wissem MIMOUNA
Subject: Re: [ceph-users] OSD id 241 != my id 248: conversion from "ceph-disk" 
to "ceph-volume simple" destroys OSDs

Hi Frank,

I suggest you file the ticket as you have the full story and the
use case to go with it.

I'm just an interested bystander; I happened to know a little about
this area because of a filestore to bluestore migration I'd done recently.

Cheers,

Chris

On Fri, Mar 12, 2021 at 12:48:56PM +0000, Frank Schilder wrote:
> Hi Chris,
>
> thanks for looking at this issue in more detail.
>
> I have two communications going on about this issue and I'm afraid you didn't 
> get all the information. There seem to be at least two occurrences of the same 
> bug. Yes, I'm pretty sure data.path should also be a stable device path instead 
> of /dev/sdq1. But this is only the second occurrence of the bug; the other one 
> affects block.path, which is not visible in the communication I sent to you but 
> has more dramatic consequences.
>
> Please find the full story below. Unless you can do it, I will file a ticket. 
> To me this looks like a general pattern of accidentally using unstable device 
> paths that should be tracked down everywhere. If you can fix the code, you 
> might want to add a comment to it to make sure the same mistake is not 
> repeated.
>
> Problems:
>
> - ceph-volume simple scan|activate use unstable device paths like "/dev/sd??" 
> instead of stable device paths like "/dev/disk/by-partuuid/UUID", which leads 
> to OSD boot failures when devices are renamed by the kernel at reboot (see the 
> sketch after this list)
>
> - ceph-volume simple activate modifies (!!!) OSD meta-data, replacing a stable 
> device path with an unstable one, which not only leads to boot failures but 
> also makes it impossible to move an OSD to a different host, because 
> ceph-volume simple scan will now produce a corrupted json config file
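>
> (For illustration only, not ceph-volume code: a minimal sketch of how a kernel 
> device name can be mapped to its stable by-partuuid path, which is what the 
> tooling should be recording everywhere.)
>
>     import os
>
>     def stable_path(device):
>         """Return the /dev/disk/by-partuuid/ symlink that resolves to 'device',
>         or None if the partition has no by-partuuid entry."""
>         by_partuuid = '/dev/disk/by-partuuid'
>         for entry in os.listdir(by_partuuid):
>             link = os.path.join(by_partuuid, entry)
>             if os.path.realpath(link) == os.path.realpath(device):
>                 return link
>         return None
>
>     # stable_path('/dev/sdq1')
>     # -> '/dev/disk/by-partuuid/9b88d6ec-87a4-4640-b80e-81d3d56fac15'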
>
> Setup and observation:
>
> I observed this in a situation where all disks were renamed after a reboot. 
> I have a workflow that deploys containers per physical disk slot and performs 
> a full OSD discovery at every container start to accommodate exchanging OSDs. 
> The basic sequence executed every time is:
>
> ceph-volume simple scan
> ceph-volume simple activate
>
> Unfortunately, this sequence is not idempotent, because ceph-volume simple 
> activate modifies (!!!) the symbolic link "block" on the OSD data partition 
> to point to an unstable device path. For example (note the first occurrence 
> of the unstable device path /dev/sdq1 in data.path):
>
> # mount /dev/sdq1 mnt
> # ls -l mnt
> [...]
> lrwxrwxrwx. 1 root root  58 Mar 11 16:17 block -> 
> /dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4
> [...]
> # umount mnt
> # ceph-volume simple scan --stdout /dev/sdq1
> Running command: /usr/sbin/cryptsetup status /dev/sdq1
> Running command: /usr/bin/mount -v /dev/sdq1 /tmp/tmpmfitNx
> stdout: mount: /dev/sdq1 mounted on /tmp/tmpmfitNx.
> Running command: /usr/bin/umount -v /tmp/tmpmfitNx
> stderr: umount: /tmp/tmpmfitNx (/dev/sdq1) unmounted
> {
>    "active": "ok",
>    "block": {
>        "path": "/dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
>        "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"
>    },
>    "block_uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
>    "bluefs": 1,
>    "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
>    "cluster_name": "ceph",
>    "data": {
>        "path": "/dev/sdq1",
>        "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"
>    },
>    "fsid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15",
>    "keyring": "AQBP4opcBeCYOxAA4sOpTthNE6T28WUf4Bgm3w==",
>    "kv_backend": "rocksdb",
>    "magic": "ceph osd volume v026",
>    "mkfs_done": "yes",
>    "none": "",
>    "ready": "ready",
>    "require_osd_release": "",
>    "type": "bluestore",
>    "whoami": 59
> }
> # ceph-volume simple activate --file 
> "/etc/ceph/osd/59-9b88d6ec-87a4-4640-b80e-81d3d56fac15.json" --no-systemd
> Running command: /usr/bin/mount -v /dev/sdq1 /var/lib/ceph/osd/ceph-59
> stdout: mount: /dev/sdq1 mounted on /var/lib/ceph/osd/ceph-59.
> Running command: /usr/bin/ln -snf /dev/sdq2 /var/lib/ceph/osd/ceph-59/block   
>     <<<--- Oh no !!!
> Running command: /usr/bin/chown -R ceph:ceph /dev/sdq2
> --> Skipping enabling of `simple` systemd unit
> --> Skipping masking of ceph-disk systemd units
> --> Skipping enabling and starting OSD simple systemd unit because 
> --no-systemd was used
> --> Successfully activated OSD 59 with FSID 
> 9b88d6ec-87a4-4640-b80e-81d3d56fac15
>
> # !!! Note the command "/usr/bin/ln -snf /dev/sdq2 
> /var/lib/ceph/osd/ceph-59/block" in the output,
> # which corrupts the OSD's meta-data!
>
> # ls -l /var/lib/ceph/osd/ceph-59
> [...]
> lrwxrwxrwx. 1 root root   9 Mar 12 13:06 block -> /dev/sdq2
> [...]
>
> # This OSD now holds corrupted meta-data in the form of a symbolic link with 
> # an unstable device path as its link target. Subsequent discoveries now 
> # produce corrupt .json config files, and moving this disk to another host 
> # has turned into a real pain:
>
> # umount /var/lib/ceph/osd/ceph-59
> # ceph-volume simple scan --stdout /dev/sdq1
> Running command: /usr/sbin/cryptsetup status /dev/sdq1
> Running command: /usr/bin/mount -v /dev/sdq1 /tmp/tmpABkQsj
> stdout: mount: /dev/sdq1 mounted on /tmp/tmpABkQsj.
> Running command: /usr/bin/umount -v /tmp/tmpABkQsj
> stderr: umount: /tmp/tmpABkQsj (/dev/sdq1) unmounted
> {
>    "active": "ok",
>    "block": {
>        "path": "/dev/sdq2",
>        "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"
>    },
>    "block_uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
>    "bluefs": 1,
>    "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
>    "cluster_name": "ceph",
>    "data": {
>        "path": "/dev/sdq1",
>        "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"
>    },
>    "fsid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15",
>    "keyring": "AQBP4opcBeCYOxAA4sOpTthNE6T28WUf4Bgm3w==",
>    "kv_backend": "rocksdb",
>    "magic": "ceph osd volume v026",
>    "mkfs_done": "yes",
>    "none": "",
>    "ready": "ready",
>    "require_osd_release": "",
>    "type": "bluestore",
>    "whoami": 59
> }
>
> In this example the disk names didn't change, which implies that this 
> OSD will still start as long as the disk is named /dev/sdq. However, if the 
> disk names do change, ceph-volume simple scan unfortunately follows the broken 
> symlink instead of using block_uuid for discovery, which leads to a 
> completely corrupted .json file similar to this one:
>
> # ceph-volume simple scan --stdout /dev/sdb1
> Running command: /usr/sbin/cryptsetup status /dev/sdb1
> {
>    "active": "ok",
>    "block": {
>        "path": "/dev/sda2",
>        "uuid": "b5ac1462-510a-4483-8f42-604e6adc5c9d"
>    },
>    "block_uuid": "1d9d89a2-18c7-4610-9dcd-167d44ce1879",
>    "bluefs": 1,
>    "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
>    "cluster_name": "ceph",
>    "data": {
>        "path": "/dev/sdb1",
>        "uuid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb"
>    },
>    "fsid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb",
>    "keyring": "AQAZJ6ddedALDxAAJI7NLJ2CRFoQWK5STRpHuw==",
>    "kv_backend": "rocksdb",
>    "magic": "ceph osd volume v026",
>    "mkfs_done": "yes",
>    "none": "",
>    "ready": "ready",
>    "require_osd_release": "",
>    "type": "bluestore",
>    "whoami": 241
> }
>
> Notice that now block_uuid and block.uuid do not match any more. This 
> corruption requires manual repair and I had to do this for an entire cluster.
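>
> (Again just a sketch, not part of ceph-volume: a check like the following can 
> be used to spot such corrupted config files before they bite.)
>
>     import json
>
>     def check_simple_json(json_file):
>         """Return a list of problems found in a 'ceph-volume simple' config:
>         an unstable block.path and/or a block.uuid that disagrees with
>         block_uuid."""
>         with open(json_file) as f:
>             meta = json.load(f)
>         problems = []
>         if not meta['block']['path'].startswith('/dev/disk/by-partuuid/'):
>             problems.append('unstable block.path: ' + meta['block']['path'])
>         if meta['block']['uuid'] != meta['block_uuid']:
>             problems.append('block.uuid does not match block_uuid')
>         return problems
>
>     # check_simple_json('/etc/ceph/osd/59-9b88d6ec-87a4-4640-b80e-81d3d56fac15.json')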
>
> Resolution:
>
> I ended up with all OSDs that I had converted from "ceph-disk" to "ceph-volume 
> simple" failing to boot after a server reboot shifted the device names and 
> invalidated all symbolic links to the block devices. Fortunately, the OSDs 
> recognised that the block device partition belonged to another OSD ID and 
> exited with an error; otherwise I would probably have lost data. To fix this, 
> I needed to write a script that resets the link target of the symlink "block" 
> to the correct by-partuuid path (a sketch follows below).
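>
> (A minimal sketch of what that repair script could look like, assuming the OSD 
> data partition is already mounted and carries the usual ceph-disk 'block_uuid' 
> file; the real script obviously needs more error handling.)
>
>     import os
>     import subprocess
>
>     def fix_block_symlink(osd_dir):
>         """Re-point the 'block' symlink in a mounted OSD data directory at the
>         stable by-partuuid path recorded in the 'block_uuid' file."""
>         with open(os.path.join(osd_dir, 'block_uuid')) as f:
>             block_uuid = f.read().strip()
>         target = '/dev/disk/by-partuuid/' + block_uuid
>         if not os.path.exists(target):
>             raise RuntimeError('no partition with part-uuid ' + block_uuid)
>         subprocess.check_call(
>             ['ln', '-snf', target, os.path.join(osd_dir, 'block')])
>
>     # fix_block_symlink('/var/lib/ceph/osd/ceph-59')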
>
> Using unstable device paths is one thing that can happen by accident. 
> However, what I really do not understand is why "ceph-volume simple 
> activate" *modifies* meta-data that should be considered read-only. I found 
> the following in src/ceph-volume/ceph_volume/devices/simple/activate.py:200-203:
>
>            # always re-do the symlink regardless if it exists, so that the journal
>            # device path that may have changed can be mapped correctly every time
>            destination = os.path.join(osd_dir, name)
>            process.run(['ln', '-snf', device, destination])
>
> Maybe the intention is correct, I don't know. However, the execution is not. 
> At this point, a dictionary of UUIDs should be used with explicit link 
> targets of the form "/dev/disk/by-partuuid/" + uuid instead of "device", to 
> make absolutely sure nothing gets mangled here. I think a correct version of 
> the code in src/ceph-volume/ceph_volume/devices/simple/activate.py:190-206 
> would look something like this:
>
>        uuid_map = {
>            'journal': osd_metadata.get('journal', {}).get('uuid'),
>            'block': osd_metadata.get('block', {}).get('uuid'),
>            'block.db': osd_metadata.get('block.db', {}).get('uuid'),
>            'block.wal': osd_metadata.get('block.wal', {}).get('uuid')
>        }
>
>        for name, uuid in uuid_map.items():
>            if not uuid:
>                continue
>            # always re-do the symlink regardless if it exists, so that the journal
>            # device path that may have changed can be mapped correctly every time
>            destination = os.path.join(osd_dir, name)
>            process.run(['ln', '-snf', '/dev/disk/by-partuuid/' + uuid, destination])
>
>            # make sure that the journal has proper permissions
>            system.chown(self.get_device(uuid))
>
> This is very explicit about using stable device paths. Needless to say, 
> other occurrences, such as the one in 
> src/ceph-volume/ceph_volume/devices/simple/scan.py:89-90, should be addressed 
> as well, for example:
>
>        device_metadata['uuid'] = device_uuid
>        device_metadata['path'] = device
>
> could be corrected in a similar way:
>
>        device_metadata['uuid'] = device_uuid
>        device_metadata['path'] = '/dev/disk/by-partuuid/'+device_uuid
>
> There are probably more locations that deserve a close look.
>
> Hope that explains the calamities I found myself in.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
