Hi Chris and Wissem,

I finally found the time: https://tracker.ceph.com/issues/50638
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Chris Dunlop <ch...@onthe.net.au>
Sent: 16 March 2021 03:56:50
To: Frank Schilder
Cc: ceph-users@ceph.io; Wissem MIMOUNA
Subject: Re: [ceph-users] OSD id 241 != my id 248: conversion from "ceph-disk" to "ceph-volume simple" destroys OSDs

Hi Frank,

I suggest you file the ticket, as you have the full story and the use case to go with it. I'm just an interested bystander; I happened to know a little about this area because of a filestore-to-bluestore migration I'd done recently.

Cheers,

Chris

On Fri, Mar 12, 2021 at 12:48:56PM +0000, Frank Schilder wrote:
> Hi Chris,
>
> thanks for looking at this issue in more detail.
>
> I have two communications on this issue and I'm afraid you didn't get all
> the information. There seem to be at least two occurrences of the same bug.
> Yes, I'm pretty sure data.path should also be a stable device path instead
> of /dev/sdq1. But this is the second occurrence of this bug; the other one
> is for block.path, which is not visible in the communication I sent you but
> has more dramatic consequences.
>
> Please find below the full story. Unless you can do it, I will file a ticket.
> To me this looks like a general pattern of using unstable device paths by
> accident that should be tracked down everywhere. If you can fix the code, you
> might want to add a comment to it to make sure the same mistake is not
> repeated.
>
> Problems:
>
> - ceph-volume simple scan|activate use unstable device paths like "/dev/sd??"
>   instead of stable device paths like "/dev/disk/by-partuuid/UUID", which
>   leads to OSD boot failures when devices are renamed at reboot by the kernel
>
> - ceph-volume simple activate modifies (!!!) OSD meta-data from a stable
>   device path to an unstable device path, which not only leads to boot
>   failures but also makes it impossible to move an OSD to a different host,
>   because ceph-volume simple scan will now produce a corrupted json config
>   file
>
> Setup and observation:
>
> I observed this in a situation where, after a reboot, all disks were
> renamed. I have a work-flow that deploys containers per physical disk slot
> and performs a full OSD discovery at every container start to accommodate
> exchanging OSDs. The basic sequence executed every time is:
>
>     ceph-volume simple scan
>     ceph-volume simple activate
>
> Unfortunately, this sequence is not idempotent, because ceph-volume simple
> activate modifies (!!!) the symbolic link "block" on the OSD data partition
> to point to an unstable device path, for example (note the first occurrence
> of the unstable device path /dev/sdq1 in data.path):
>
> # mount /dev/sdq1 mnt
> # ls -l mnt
> [...]
> lrwxrwxrwx. 1 root root 58 Mar 11 16:17 block -> /dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4
> [...]
> # umount mnt
> # ceph-volume simple scan --stdout /dev/sdq1
> Running command: /usr/sbin/cryptsetup status /dev/sdq1
> Running command: /usr/bin/mount -v /dev/sdq1 /tmp/tmpmfitNx
>  stdout: mount: /dev/sdq1 mounted on /tmp/tmpmfitNx.
> Running command: /usr/bin/umount -v /tmp/tmpmfitNx
>  stderr: umount: /tmp/tmpmfitNx (/dev/sdq1) unmounted
> {
>     "active": "ok",
>     "block": {
>         "path": "/dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
>         "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"
>     },
>     "block_uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
>     "bluefs": 1,
>     "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
>     "cluster_name": "ceph",
>     "data": {
>         "path": "/dev/sdq1",
>         "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"
>     },
>     "fsid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15",
>     "keyring": "AQBP4opcBeCYOxAA4sOpTthNE6T28WUf4Bgm3w==",
>     "kv_backend": "rocksdb",
>     "magic": "ceph osd volume v026",
>     "mkfs_done": "yes",
>     "none": "",
>     "ready": "ready",
>     "require_osd_release": "",
>     "type": "bluestore",
>     "whoami": 59
> }
> # ceph-volume simple activate --file "/etc/ceph/osd/59-9b88d6ec-87a4-4640-b80e-81d3d56fac15.json" --no-systemd
> Running command: /usr/bin/mount -v /dev/sdq1 /var/lib/ceph/osd/ceph-59
>  stdout: mount: /dev/sdq1 mounted on /var/lib/ceph/osd/ceph-59.
> Running command: /usr/bin/ln -snf /dev/sdq2 /var/lib/ceph/osd/ceph-59/block    <<<--- Oh no !!!
> Running command: /usr/bin/chown -R ceph:ceph /dev/sdq2
> --> Skipping enabling of `simple` systemd unit
> --> Skipping masking of ceph-disk systemd units
> --> Skipping enabling and starting OSD simple systemd unit because --no-systemd was used
> --> Successfully activated OSD 59 with FSID 9b88d6ec-87a4-4640-b80e-81d3d56fac15
>
> # !!! Note the command "/usr/bin/ln -snf /dev/sdq2 /var/lib/ceph/osd/ceph-59/block" in the output,
> # which corrupts the OSD's meta-data!
>
> # ls -l /var/lib/ceph/osd/ceph-59
> [...]
> lrwxrwxrwx. 1 root root 9 Mar 12 13:06 block -> /dev/sdq2
> [...]
>
> # This OSD now holds corrupted meta-data in the form of a symbolic link with an unstable
> # device path as its link target. Subsequent discoveries now produce corrupt .json config
> # files, and moving this disk to another host has turned into a real pain:
>
> # umount /var/lib/ceph/osd/ceph-59
> # ceph-volume simple scan --stdout /dev/sdq1
> Running command: /usr/sbin/cryptsetup status /dev/sdq1
> Running command: /usr/bin/mount -v /dev/sdq1 /tmp/tmpABkQsj
>  stdout: mount: /dev/sdq1 mounted on /tmp/tmpABkQsj.
> Running command: /usr/bin/umount -v /tmp/tmpABkQsj
>  stderr: umount: /tmp/tmpABkQsj (/dev/sdq1) unmounted
> {
>     "active": "ok",
>     "block": {
>         "path": "/dev/sdq2",
>         "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"
>     },
>     "block_uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
>     "bluefs": 1,
>     "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
>     "cluster_name": "ceph",
>     "data": {
>         "path": "/dev/sdq1",
>         "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"
>     },
>     "fsid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15",
>     "keyring": "AQBP4opcBeCYOxAA4sOpTthNE6T28WUf4Bgm3w==",
>     "kv_backend": "rocksdb",
>     "magic": "ceph osd volume v026",
>     "mkfs_done": "yes",
>     "none": "",
>     "ready": "ready",
>     "require_osd_release": "",
>     "type": "bluestore",
>     "whoami": 59
> }
>
> In this example, the disk names didn't change, which implies that this
> OSD will still start as long as the disk is named /dev/sdq. However, if the
> disk names change, ceph-volume simple scan unfortunately follows the broken
> symlink instead of using block_uuid for discovery, which leads to a
> completely corrupted .json file similar to this one:
>
> # ceph-volume simple scan --stdout /dev/sdb1
> Running command: /usr/sbin/cryptsetup status /dev/sdb1
> {
>     "active": "ok",
>     "block": {
>         "path": "/dev/sda2",
>         "uuid": "b5ac1462-510a-4483-8f42-604e6adc5c9d"
>     },
>     "block_uuid": "1d9d89a2-18c7-4610-9dcd-167d44ce1879",
>     "bluefs": 1,
>     "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
>     "cluster_name": "ceph",
>     "data": {
>         "path": "/dev/sdb1",
>         "uuid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb"
>     },
>     "fsid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb",
>     "keyring": "AQAZJ6ddedALDxAAJI7NLJ2CRFoQWK5STRpHuw==",
>     "kv_backend": "rocksdb",
>     "magic": "ceph osd volume v026",
>     "mkfs_done": "yes",
>     "none": "",
>     "ready": "ready",
>     "require_osd_release": "",
>     "type": "bluestore",
>     "whoami": 241
> }
>
> Notice that now block_uuid and block.uuid no longer match. This corruption
> requires manual repair, and I had to do this for an entire cluster.
>
> Resolution:
>
> I ended up with all OSDs I had converted from "ceph-disk" to "ceph-volume
> simple" failing to boot after a server reboot that shifted the device names
> and invalidated all symbolic links to the block devices. Fortunately, the
> OSDs recognised that the block device partition belonged to another OSD ID
> and exited with an error, otherwise I would probably have lost data. To fix
> this, I needed to write a script that resets the link target of the symlink
> "block" to the correct part_uuid path.
>
> Using unstable device paths is one thing that can happen by accident.
> However, what I really do not understand is why "ceph-volume simple
> activate" *modifies* meta-data that should be considered read-only.
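[The repair Frank describes could look roughly like the following. This is a hypothetical sketch, not the script he actually used; it assumes the OSD data partition still carries a correct "block_uuid" file, as written by ceph-disk, and that the data partition is mounted at the given directory.]

```python
#!/usr/bin/env python3
# Hypothetical repair sketch: re-point the "block" symlink of a ceph-disk
# style OSD data directory back at its stable /dev/disk/by-partuuid path,
# using the partition UUID stored in the "block_uuid" file next to it.
import os
import sys


def repair_block_symlink(osd_dir):
    """Reset <osd_dir>/block to /dev/disk/by-partuuid/<block_uuid>.

    Returns True if the link was rewritten, False if it was already correct.
    """
    with open(os.path.join(osd_dir, 'block_uuid')) as f:
        uuid = f.read().strip()
    stable = '/dev/disk/by-partuuid/' + uuid
    link = os.path.join(osd_dir, 'block')
    if os.path.islink(link) and os.readlink(link) == stable:
        return False  # already points at the stable path, nothing to do
    # equivalent of: ln -snf <stable> <osd_dir>/block
    tmp = link + '.tmp'
    os.symlink(stable, tmp)
    os.replace(tmp, link)  # atomic swap, works even if "block" exists
    return True


if __name__ == '__main__':
    changed = repair_block_symlink(sys.argv[1])
    print('relinked' if changed else 'already stable')
```

Running it a second time is a no-op, so it is safe to include in an idempotent discovery work-flow.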
> I found this here in the code,
> src/ceph-volume/ceph_volume/devices/simple/activate.py:200-203:
>
>     # always re-do the symlink regardless if it exists, so that the journal
>     # device path that may have changed can be mapped correctly every time
>     destination = os.path.join(osd_dir, name)
>     process.run(['ln', '-snf', device, destination])
>
> Maybe the intention is correct, I don't know. However, the execution is not.
> At this point, a dictionary of UUIDs should be used with explicit link
> targets of the form "/dev/disk/by-partuuid/" + uuid instead of "device" to
> make absolutely sure nothing gets rigged here. I think a corrected version
> of the code in src/ceph-volume/ceph_volume/devices/simple/activate.py:190-206
> would look something like this:
>
>     uuid_map = {
>         'journal': osd_metadata.get('journal', {}).get('uuid'),
>         'block': osd_metadata.get('block', {}).get('uuid'),
>         'block.db': osd_metadata.get('block.db', {}).get('uuid'),
>         'block.wal': osd_metadata.get('block.wal', {}).get('uuid')
>     }
>
>     for name, uuid in uuid_map.items():
>         if not uuid:
>             continue
>         # always re-do the symlink regardless if it exists, so that the journal
>         # device path that may have changed can be mapped correctly every time
>         destination = os.path.join(osd_dir, name)
>         process.run(['ln', '-snf', '/dev/disk/by-partuuid/' + uuid, destination])
>
>         # make sure that the journal has proper permissions
>         system.chown(self.get_device(uuid))
>
> This is very explicit about using stable device paths. Needless to say,
> other occurrences such as
> src/ceph-volume/ceph_volume/devices/simple/scan.py:89-90 should be addressed
> as well; for example:
>
>     device_metadata['uuid'] = device_uuid
>     device_metadata['path'] = device
>
> could be corrected in a similar way:
>
>     device_metadata['uuid'] = device_uuid
>     device_metadata['path'] = '/dev/disk/by-partuuid/' + device_uuid
>
> There are probably more locations that deserve a close look.
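[Configs already damaged this way can be spotted mechanically. The following is a hypothetical sketch, not part of the original mail; it assumes the json layout shown in the scan output above, and flags both unstable kernel device names and a block_uuid/block.uuid mismatch.]

```python
# Hypothetical detection sketch: flag /etc/ceph/osd/*.json files whose device
# paths are unstable kernel names, or whose top-level block_uuid no longer
# matches the uuid recorded under the "block" key.
import glob
import json
import re

# kernel names like /dev/sdq1 or /dev/sda2; by-partuuid paths do not match
UNSTABLE = re.compile(r'^/dev/sd[a-z]+\d*$')


def check_osd_json(metadata):
    """Return a list of human-readable problems found in one OSD json dict."""
    problems = []
    for key in ('data', 'block', 'block.db', 'block.wal', 'journal'):
        path = metadata.get(key, {}).get('path', '')
        if UNSTABLE.match(path):
            problems.append('%s.path is an unstable device name: %s' % (key, path))
    block_uuid = metadata.get('block_uuid')
    inner_uuid = metadata.get('block', {}).get('uuid')
    if block_uuid and inner_uuid and block_uuid != inner_uuid:
        problems.append('block_uuid %s != block.uuid %s' % (block_uuid, inner_uuid))
    return problems


if __name__ == '__main__':
    for fname in sorted(glob.glob('/etc/ceph/osd/*.json')):
        with open(fname) as f:
            for p in check_osd_json(json.load(f)):
                print('%s: %s' % (fname, p))
```

On the corrupted OSD 241 json above, this reports three problems: the unstable data.path and block.path, plus the uuid mismatch.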
>
> Hope that explains the calamities I found myself in.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io