[ceph-users] Re: ceph-volume claiming wrong device
As I said, I would recommend to really wipe the OSDs clean (ceph-volume lvm zap --destroy /dev/sdX), maybe reboot (on VMs it was sometimes necessary during my tests if I had too many failed attempts). And then also make sure you don't have any leftovers in the filesystem (under /var/lib/ceph) just to make sure you have a clean start. Zitat von Oleksiy Stashok : Hey Eugen, valid points, I first tried to provision OSDs via ceph-ansible (later excluded), which does run the batch command with all 4 disk devices, but it often failed with the same issue I mentioned earlier, something like: ``` bluefs _replay 0x0: stop: uuid e2f72ec9-2747-82d7-c7f8-41b7b6d41e1b != super.uuid 0110ddb3-d4bf-4c1e-be11-654598c71db0 ``` that's why I abandoned that idea and tried to provision OSDs manually one by one. As I mentioned I used ceph-ansible, not cephadm for legacy reasons, but I suspect the problem I'm seeing is related to ceph-volume, so I suspect cephadm won't change it. I did more investigation in 1-by-1 OSD creation flow and it seems like the fact that `ceph-volume lvm list` shows me 2 devices belonging to the same OSD can be explained by the following flow: 1. ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdd 2. trying to create osd.2 3. fails with uuid != super.uuid issue 4. ceph-volume lvm list returns /dev/sdd belong to osd.2 (even though it failed) 5. ceph-volume lvm create --bluestore --dmcrypt --data /dev/sde 6. trying to create osd.2 (*again*) 7. succeeds 8. ceph-volume lvm list returns both /dev/sdd and /dev/sde belonging to osd.2 osd.2 is reported to be up and running. Any idea why this is happening? Thank you! Oleksiy On Thu, Oct 27, 2022 at 12:11 AM Eugen Block wrote: Hi, first of all, if you really need to issue ceph-volume manually, there's a batch command: cephadm ceph-volume lvm batch /dev/sdb /dev/sdc /dev/sdd /dev/sde Second, are you using cephadm? Maybe your manual intervention conflicts with the automatic osd setup (all available devices). You could look into /var/log/ceph/cephadm.log on each node and see if cephadm already tried to setup the OSDs for you. What does 'ceph orch ls' show? Did you end up having online OSDs or did it fail? In that case I would purge all OSDs from the crushmap, then wipe all devices (ceph-volume lvm zap --destroy /dev/sdX) and either let cephadm create the OSDs for you or you disable that (unmanaged=true) and run the manual steps again (although it's not really necessary). Regards, Eugen Zitat von Oleksiy Stashok : > Hey guys, > > I ran into a weird issue, hope you can explain what I'm observing. I'm > testing* Ceph 16.2.10* on *Ubuntu 20.04* in *Google Cloud VMs*, I created 3 > instances and attached 4 persistent SSD disks to each instance. I can see > these disks attached as `/dev/sdb, /dev/sdc, /dev/sdd, /dev/sde` devices. > > As a next step I used ceph-ansible to bootstrap the ceph cluster on 3 > instances, however I intentionally skipped OSD setup. So I ended up with a > Ceph cluster w/o any OSD. 
> > I ssh'ed into each VM and ran: > > ``` > sudo -s > for dev in sdb sdc sdd sde; do > /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore > --dmcrypt --data "/dev/$dev" > done > ``` > > The operation above randomly fails on random instances/devices with > something like: > ``` > bluefs _replay 0x0: stop: uuid e2f72ec9-2747-82d7-c7f8-41b7b6d41e1b != > super.uuid 0110ddb3-d4bf-4c1e-be11-654598c71db0 > ``` > > The interesting thing is that when I do > ``` > /usr/sbin/ceph-volume lvm ls > ``` > > I can see that the device for which OSD creation failed actually belongs to > a different OSD that was previously created for a different device. For > example the failure I mentioned above happened on the `/dev/sde` device, so > when I list lvms I see this: > ``` > == osd.2 === > > [block] > /dev/ceph-103a4373-dbe0-43d6-a9e0-34db4e1b257c/osd-block-9af542ba-fd65-4355-ad17-7293856acaeb > > block device > > /dev/ceph-103a4373-dbe0-43d6-a9e0-34db4e1b257c/osd-block-9af542ba-fd65-4355-ad17-7293856acaeb > block uuid FfFnLt-h33F-F73V-tY45-VuZM-scj7-C3dg1K > cephx lockbox secret AQAlelljqNPoMhAA59JwN3wGt0d6Si+nsnxsRQ== > cluster fsid 348fff8e-e850-4774-9694-05d5414b1c53 > cluster name ceph > crush device class > encrypted 1 > osd fsid 9af542ba-fd65-4355-ad17-7293856acaeb > osd id 2 > osdspec affinity > type block > vdo 0 > devices /dev/sdd > > [block] > /dev/ceph-df14969f-2dfb-45f1-a579-a8e23ec12e33/osd-block-4686f6fc-8dc1-48fd-a2d9-70a281c8ee64 > > block device > > /dev/ceph-df14969f-2dfb-45f1-a579-a8e23ec12e33/osd-block-4686f6fc-8dc1-48fd-a2d9-70a281c8ee64 > block uuid GEajK3-Tsyf-XZS9-E5ik-M1BB-VIpb-q7D1ET > cephx lockbox secret AQAwell
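For anyone following the same recovery path, here is a minimal sketch of the clean-slate procedure Eugen describes above. The OSD id and device names are taken from this thread and are only illustrative; adjust both to your environment, and treat the /var/lib/ceph step as inspect-before-delete.

```
# Remove the half-created OSD from the cluster map (id 2 is from this thread).
ceph osd purge 2 --yes-i-really-mean-it

# Wipe every device that was touched by the failed attempts.
for dev in sdb sdc sdd sde; do
    /usr/sbin/ceph-volume lvm zap --destroy "/dev/$dev"
done

# Check for leftover tmpfs mounts/directories before removing anything.
ls /var/lib/ceph/osd/
# umount /var/lib/ceph/osd/ceph-2 && rm -rf /var/lib/ceph/osd/ceph-2

# Optionally reboot to clear stale device-mapper/dm-crypt state, as noted above.
```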
[ceph-users] OSDs are not utilized evenly
Hi I observed on my Ceph cluster running latest Pacific that same size OSDs are utilized differently even if balancer is running and reports status as perfectly balanced. { "active": true, "last_optimize_duration": "0:00:00.622467", "last_optimize_started": "Tue Nov 1 12:49:36 2022", "mode": "upmap", "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect", "plans": [] } balancer settings for upmap are: mgr advanced mgr/balancer/mode upmap mgr advanced mgr/balancer/upmap_max_deviation 1 mgr advanced mgr/balancer/upmap_max_optimizations 20 It's obvious that utilization is not same (difference is about 1TB) from command `ceph osd df`. Following is just a partial output: ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS 0 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 3.0 MiB 37 GiB 3.6 TiB 78.09 1.05 196 up 124 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 32 GiB 4.7 TiB 71.20 0.96 195 up 157 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.3 MiB 35 GiB 3.7 TiB 77.67 1.05 195 up 1 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.0 MiB 35 GiB 3.7 TiB 77.69 1.05 195 up 243 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 GiB 4.7 TiB 71.16 0.96 195 up 244 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 GiB 4.7 TiB 71.19 0.96 195 up 245 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 32 GiB 4.7 TiB 71.55 0.96 196 up 246 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 GiB 4.7 TiB 71.17 0.96 195 up 249 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 30 GiB 4.7 TiB 71.18 0.96 195 up 500 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 30 GiB 4.7 TiB 71.19 0.96 195 up 501 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 GiB 4.7 TiB 71.57 0.96 196 up 502 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 GiB 4.7 TiB 71.18 0.96 195 up 532 hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 GiB 4.7 TiB 71.16 0.96 195 up 549 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 576 KiB 36 GiB 3.7 TiB 77.70 1.05 195 up 550 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 3.8 MiB 36 GiB 3.7 TiB 77.67 1.05 195 up 551 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.4 MiB 35 GiB 3.7 TiB 77.68 1.05 195 up 552 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.5 MiB 35 GiB 3.7 TiB 77.69 1.05 195 up 553 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.1 MiB 37 GiB 3.6 TiB 77.71 1.05 195 up 554 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 967 KiB 36 GiB 3.6 TiB 77.71 1.05 195 up 555 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 1.3 MiB 36 GiB 3.6 TiB 78.08 1.05 196 up 556 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 4.7 MiB 36 GiB 3.6 TiB 78.10 1.05 196 up 557 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.4 MiB 36 GiB 3.7 TiB 77.69 1.05 195 up 558 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 4.5 MiB 36 GiB 3.6 TiB 77.72 1.05 195 up 559 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 1.5 MiB 35 GiB 3.6 TiB 78.09 1.05 196 up 560 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.2 MiB 35 GiB 3.7 TiB 77.69 1.05 195 up 561 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.8 MiB 35 GiB 3.7 TiB 77.69 1.05 195 up 562 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 1.0 MiB 36 GiB 3.7 TiB 77.68 1.05 195 up 563 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.6 MiB 36 GiB 3.7 TiB 77.68 1.05 195 up 564 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.1 MiB 36 GiB 3.6 TiB 78.09 1.05 196 up 567 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 4.8 MiB 36 GiB 3.6 TiB 78.11 1.05 196 up 568 hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.2 MiB 35 GiB 3.7 TiB 77.68 1.05 195 up All OSDs are used by the same pool (EC) I have the same issue on another Ceph cluster with the same setup where I was able to make OSDs utilization same by changing reweight from 1.0 to lower on OSDs 
with higher utilization and I got a lot of free space: before changing reweight:

--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    3.1 PiB  510 TiB  2.6 PiB  2.6 PiB   83.77
ssd    2.6 TiB  2.6 TiB  46 GiB   46 GiB    1.70
TOTAL  3.1 PiB  513 TiB  2.6 Pi
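As a rough sketch, the balancer settings quoted above map onto the following commands (values taken from the config dump in this post; this only shows how they are applied, not a recommendation to change them):

```
ceph balancer mode upmap
ceph config set mgr mgr/balancer/upmap_max_deviation 1
ceph config set mgr mgr/balancer/upmap_max_optimizations 20
ceph balancer on
ceph balancer status
```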
[ceph-users] Re: OSDs are not utilized evenly
If the GB per pg is high, the balancer module won't be able to help. Your pg count per osd also looks low (30's), so increasing pgs per pool would help with both problems. You can use the pg calculator to determine which pools need what On Tue, Nov 1, 2022, 08:46 Denis Polom wrote: > Hi > > I observed on my Ceph cluster running latest Pacific that same size OSDs > are utilized differently even if balancer is running and reports status > as perfectly balanced. > > { > "active": true, > "last_optimize_duration": "0:00:00.622467", > "last_optimize_started": "Tue Nov 1 12:49:36 2022", > "mode": "upmap", > "optimize_result": "Unable to find further optimization, or pool(s) > pg_num is decreasing, or distribution is already perfect", > "plans": [] > } > > balancer settings for upmap are: > >mgr advanced > mgr/balancer/mode upmap >mgr advanced mgr/balancer/upmap_max_deviation1 >mgr advanced mgr/balancer/upmap_max_optimizations > 20 > > It's obvious that utilization is not same (difference is about 1TB) from > command `ceph osd df`. Following is just a partial output: > > ID CLASS WEIGHTREWEIGHT SIZE RAW USE DATA OMAP > META AVAIL%USE VAR PGS STATUS >0hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 3.0 MiB > 37 GiB 3.6 TiB 78.09 1.05 196 up > 124hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 32 > GiB 4.7 TiB 71.20 0.96 195 up > 157hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.3 MiB 35 > GiB 3.7 TiB 77.67 1.05 195 up >1hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.0 MiB > 35 GiB 3.7 TiB 77.69 1.05 195 up > 243hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 > GiB 4.7 TiB 71.16 0.96 195 up > 244hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 > GiB 4.7 TiB 71.19 0.96 195 up > 245hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 32 > GiB 4.7 TiB 71.55 0.96 196 up > 246hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 > GiB 4.7 TiB 71.17 0.96 195 up > 249hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 30 > GiB 4.7 TiB 71.18 0.96 195 up > 500hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 30 > GiB 4.7 TiB 71.19 0.96 195 up > 501hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 > GiB 4.7 TiB 71.57 0.96 196 up > 502hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 > GiB 4.7 TiB 71.18 0.96 195 up > 532hdd 18.00020 1.0 16 TiB 12 TiB 12 TiB 0 B 31 > GiB 4.7 TiB 71.16 0.96 195 up > 549hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 576 KiB 36 > GiB 3.7 TiB 77.70 1.05 195 up > 550hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 3.8 MiB 36 > GiB 3.7 TiB 77.67 1.05 195 up > 551hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.4 MiB 35 > GiB 3.7 TiB 77.68 1.05 195 up > 552hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.5 MiB 35 > GiB 3.7 TiB 77.69 1.05 195 up > 553hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.1 MiB 37 > GiB 3.6 TiB 77.71 1.05 195 up > 554hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 967 KiB 36 > GiB 3.6 TiB 77.71 1.05 195 up > 555hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 1.3 MiB 36 > GiB 3.6 TiB 78.08 1.05 196 up > 556hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 4.7 MiB 36 > GiB 3.6 TiB 78.10 1.05 196 up > 557hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.4 MiB 36 > GiB 3.7 TiB 77.69 1.05 195 up > 558hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 4.5 MiB 36 > GiB 3.6 TiB 77.72 1.05 195 up > 559hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 1.5 MiB 35 > GiB 3.6 TiB 78.09 1.05 196 up > 560hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.2 MiB 35 > GiB 3.7 TiB 77.69 1.05 195 up > 561hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.8 MiB 35 > GiB 3.7 TiB 77.69 1.05 195 up > 562hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 1.0 MiB 36 > GiB 3.7 TiB 77.68 1.05 195 up > 563hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 2.6 MiB 36 > GiB 3.7 TiB 77.68 1.05 195 up > 564hdd 18.00020 1.0 16 TiB 13 
TiB 13 TiB 5.1 MiB 36 > GiB 3.6 TiB 78.09 1.05 196 up > 567hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 4.8 MiB 36 > GiB 3.6 TiB 78.11 1.05 196 up > 568hdd 18.00020 1.0 16 TiB 13 TiB 13 TiB 5.2 MiB 35 > GiB 3.7 TiB 77.68 1.05 195 up > > All OSDs are used by the same pool (EC) > > I have the same issue on another Ceph cluster w
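A short sketch of the suggestion above: check the current per-pool pg_num, then raise it on the EC data pool. The pool name and target value below are placeholders only; use the PG calculator (or the autoscaler's suggestion) to pick the real target.

```
ceph osd pool ls detail                        # current pg_num / pgp_num per pool
ceph osd pool autoscale-status                 # suggested pg_num, if the pg_autoscaler module is enabled
ceph osd pool set <ec-data-pool> pg_num 2048   # placeholder pool name and value
```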
[ceph-users] No active PG; No disk activity
Good morning everyone! Today there was an atypical situation in our Cluster where the three machines came to shut down. On powering up the cluster went up and formed quorum with no problems, but the PGs are all in Working, I don't see any disk activity on the machines. No PG is active. [ceph: root@dcs1 /]# ceph osd tree ID CLASS WEIGHTTYPE NAME STATUS REWEIGHT PRI-AFF -1 98.24359 root default -3 32.74786 host dcs1 0hdd 2.72899 osd.0 up 1.0 1.0 1hdd 2.72899 osd.1 up 1.0 1.0 2hdd 2.72899 osd.2 up 1.0 1.0 3hdd 2.72899 osd.3 up 1.0 1.0 4hdd 2.72899 osd.4 up 1.0 1.0 5hdd 2.72899 osd.5 up 1.0 1.0 6hdd 2.72899 osd.6 up 1.0 1.0 7hdd 2.72899 osd.7 up 1.0 1.0 8hdd 2.72899 osd.8 up 1.0 1.0 9hdd 2.72899 osd.9 up 1.0 1.0 10hdd 2.72899 osd.10 up 1.0 1.0 11hdd 2.72899 osd.11 up 1.0 1.0 -5 32.74786 host dcs2 12hdd 2.72899 osd.12 up 1.0 1.0 13hdd 2.72899 osd.13 up 1.0 1.0 14hdd 2.72899 osd.14 up 1.0 1.0 15hdd 2.72899 osd.15 up 1.0 1.0 16hdd 2.72899 osd.16 up 1.0 1.0 17hdd 2.72899 osd.17 up 1.0 1.0 18hdd 2.72899 osd.18 up 1.0 1.0 19hdd 2.72899 osd.19 up 1.0 1.0 20hdd 2.72899 osd.20 up 1.0 1.0 21hdd 2.72899 osd.21 up 1.0 1.0 22hdd 2.72899 osd.22 up 1.0 1.0 23hdd 2.72899 osd.23 up 1.0 1.0 -7 32.74786 host dcs3 24hdd 2.72899 osd.24 up 1.0 1.0 25hdd 2.72899 osd.25 up 1.0 1.0 26hdd 2.72899 osd.26 up 1.0 1.0 27hdd 2.72899 osd.27 up 1.0 1.0 28hdd 2.72899 osd.28 up 1.0 1.0 29hdd 2.72899 osd.29 up 1.0 1.0 30hdd 2.72899 osd.30 up 1.0 1.0 31hdd 2.72899 osd.31 up 1.0 1.0 32hdd 2.72899 osd.32 up 1.0 1.0 33hdd 2.72899 osd.33 up 1.0 1.0 34hdd 2.72899 osd.34 up 1.0 1.0 35hdd 2.72899 osd.35 up 1.0 1.0 [ceph: root@dcs1 /]# ceph -s cluster: id: 58bbb950-538b-11ed-b237-2c59e53b80cc health: HEALTH_WARN 4 filesystems are degraded 4 MDSs report slow metadata IOs Reduced data availability: 1153 pgs inactive, 1101 pgs peering 26 slow ops, oldest one blocked for 563 sec, daemons [osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]... have slow ops. services: mon: 3 daemons, quorum dcs1.evocorp,dcs2,dcs3 (age 7m) mgr: dcs1.evocorp.kyqfcd(active, since 15m), standbys: dcs2.rirtyl mds: 4/4 daemons up, 4 standby osd: 36 osds: 36 up (since 6m), 36 in (since 47m); 65 remapped pgs data: volumes: 0/4 healthy, 4 recovering pools: 10 pools, 1153 pgs objects: 254.72k objects, 994 GiB usage: 2.8 TiB used, 95 TiB / 98 TiB avail pgs: 100.000% pgs not active 1036 peering 65 remapped+peering 52 activating [ceph: root@dcs1 /]# ceph health detail HEALTH_WARN 4 filesystems are degraded; 4 MDSs report slow metadata IOs; Reduced data availability: 1153 pgs inactive, 1101 pgs peering; 26 slow ops, oldest one blocked for 673 sec, daemons [osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]... have slow ops. 
[WRN] FS_DEGRADED: 4 filesystems are degraded fs dc_ovirt is degraded fs dc_iso is degraded fs dc_sas is degraded fs pool_tester is degraded [WRN] MDS_SLOW_METADATA_IO: 4 MDSs report slow metadata IOs mds.dc_sas.dcs1.wbyuik(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 1063 secs mds.dc_ovirt.dcs1.lpcazs(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 1058 secs mds.pool_tester.dcs1.ixkkfs(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 1058 secs mds.dc_iso.dcs1.jxqqjd(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 1058 secs [WRN] PG_AVAILABILITY: Reduced data availability: 1153 pgs inactive, 1101 pgs peering pg 6.c3 is stuck inactive for 50m, current state peering, last acting [30,15,11] pg 6.c4 is stuck peering for 10h, current state peering, last acting [12,0,26
[ceph-users] cephadm trouble with OSD db- and wal-device placement (quincy)
Hej, we are using ceph version 17.2.0 on Ubuntu 22.04.1 LTS. We've got several servers with the same setup and are facing a problem with OSD deployment and db-/wal-device placement. Each server consists of ten rotational disks (10TB each) and two NVME devices (3TB each). We would like to deploy each rotational disk with a db- and wal-device. We want to place the db and wal devices of an OSD together on the same NVME, to cut the failure of the OSDs in half if one NVME fails. We tried several osd service type specifications to achieve our deployment goal. Our best approach is:

service_type: osd
service_id: osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme
service_name: osd.osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme
placement:
  host_pattern: '*'
unmanaged: true
spec:
  data_devices:
    model: MG[redacted]
    rotational: 1
  db_devices:
    limit: 1
    model: MZ[redacted]
    rotational: 0
  filter_logic: OR
  objectstore: bluestore
  wal_devices:
    limit: 1
    model: MZ[redacted]
    rotational: 0

This service spec deploys ten OSDs with all db-devices on one NVME and all wal-devices on the second NVME. If we omit "limit: 1", cephadm deploys ten OSDs with db-devices equally distributed on both NVMEs and no wal-devices at all --- although half of the NVMEs' capacity remains unused. What's the best way to do it? Does that even make sense? Thank you very much and with kind regards Uli ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm trouble with OSD db- and wal-device placement (quincy)
I haven't done it, but had to read through the documentation a couple months ago and what I gathered was: 1. if you have a db device specified but no wal device, it will put the wal on the same volume as the db. 2. the recommendation seems to be to not have a separate volume for db and wal if on the same physical device? So, that should allow you to have the failure mode you want I think? Can anyone else confirm this or knows that it is incorrect? Thanks, Kevin From: Ulrich Pralle Sent: Tuesday, November 1, 2022 7:25 AM To: ceph-users@ceph.io Subject: [ceph-users] cephadm trouble with OSD db- and wal-device placement (quincy) Check twice before you click! This email originated from outside PNNL. Hej, we are using ceph version 17.2.0 on Ubuntu 22.04.1 LTS. We've got several servers with the same setup and are facing a problem with OSD deployment and db-/wal-device placement. Each server consists of ten rotational disks (10TB each) and two NVME devices (3TB each). We would like to deploy each rotational disk with a db- and wal-device. We want to place the db and wal devices of an osd together on the same NVME, to cut the failure of the OSDs in half if one NVME fails. We tried several osd service type specifications to achieve our deployment goal. Our best approach is: service_type: osd service_id: osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme service_name: osd.osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme placement: host_pattern: '*' unmanaged: true spec: data_devices: model: MG[redacted] rotational: 1 db_devices: limit: 1 model: MZ[redacted] rotational: 0 filter_logic: OR objectstore: bluestore wal_devices: limit: 1 model: MZ[redacted] rotational: 0 This service spec deploys ten OSDs with all db-devices on one NVME and all wal-devices on the second NVME. If we omit "limit: 1", cephadm deploys ten OSDs with db-devices equally distributed on both NVMEs and no wal-devices at all --- although half of the NVMEs capacity remains unused. What's the best way to do it. Does that even make sense? Thank you very much and with kind regards Uli ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Is it a bug that OSD crashed when it's full?
The actual question is that, is crash expected when OSD is full? My focus is more on how to prevent this from happening. My expectation is that OSD rejects write request when it's full, but not crash. Otherwise, no point to have ratio threshold. Please let me know if this is the design or a bug. Thanks! Tony From: Tony Liu Sent: October 31, 2022 05:46 PM To: ceph-users@ceph.io; d...@ceph.io Subject: [ceph-users] Is it a bug that OSD crashed when it's full? Hi, Based on doc, Ceph prevents you from writing to a full OSD so that you don’t lose data. In my case, with v16.2.10, OSD crashed when it's full. Is this expected or some bug? I'd expect write failure instead of OSD crash. It keeps crashing when tried to bring it up. Is there any way to bring it back? -7> 2022-10-31T22:52:57.426+ 7fe37fd94200 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1667256777427646, "job": 1, "event": "recovery_started", "log_files": [23300]} -6> 2022-10-31T22:52:57.426+ 7fe37fd94200 4 rocksdb: [db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2 -5> 2022-10-31T22:52:57.529+ 7fe37fd94200 3 rocksdb: [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_version>=5. -4> 2022-10-31T22:52:57.592+ 7fe37fd94200 1 bluefs _allocate unable to allocate 0x9 on bdev 1, allocator name block, allocator type hybrid, capacity 0x6fc840, block size 0x1000, free 0x57acbc000, fragmentation 0.359784, allocated 0x0 -3> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _allocate allocation failed, needed 0x8064a -2> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x8064a -1> 2022-10-31T22:52:57.604+ 7fe37fd94200 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fe37fd94200 time 2022-10-31T22:52:57.593873+ /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc") ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable) 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string, std::allocator > const&)+0xe5) [0x55858d7e2e7c] 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x55858dee8cc1] 3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55858dee8fa0] 4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock&)+0x32) [0x55858defa0b2] 5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55858df129eb] 6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55858e3ae55f] 7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x58a) [0x55858e4c02aa] 8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) [0x55858e4c1700] 9: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x55858e5dce86] 10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x26c) [0x55858e5dd7cc] 11: 
(rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x3c) [0x55858e5ddecc] 12: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x55858e5ddf5d] 13: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2b8) [0x55858e5e13c8] 14: (rocksdb::BuildTable(std::__cxx11::basic_string, std::allocator > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase*, std::vector >, std::allocator > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector >, std::allocator > > > const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&, std::vector >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x55858e58be45] 15: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) [0x5
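For context on the ratio thresholds mentioned above, this is roughly how the cluster-wide full safeguards are inspected and set (the values shown are the usual defaults and purely illustrative; as the replies later in this thread explain, BlueFS free-space fragmentation can still bite before these ratios are reached):

```
ceph osd dump | grep ratio            # full_ratio, backfillfull_ratio, nearfull_ratio
ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.90
ceph osd set-full-ratio 0.95
ceph osd df tree                      # per-OSD %USE to spot OSDs approaching the limits
```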
[ceph-users] Re: ceph-volume claiming wrong device
It looks like I hit some flavour of https://tracker.ceph.com/issues/51034. Since when I set `bluefs_buffered_io=false` the issue (that I could reproduce pretty consistently) disappeared. Oleksiy On Tue, Nov 1, 2022 at 3:02 AM Eugen Block wrote: > As I said, I would recommend to really wipe the OSDs clean > (ceph-volume lvm zap --destroy /dev/sdX), maybe reboot (on VMs it was > sometimes necessary during my tests if I had too many failed > attempts). And then also make sure you don't have any leftovers in the > filesystem (under /var/lib/ceph) just to make sure you have a clean > start. > > Zitat von Oleksiy Stashok : > > > Hey Eugen, > > > > valid points, I first tried to provision OSDs via ceph-ansible (later > > excluded), which does run the batch command with all 4 disk devices, but > it > > often failed with the same issue I mentioned earlier, something like: > > ``` > > bluefs _replay 0x0: stop: uuid e2f72ec9-2747-82d7-c7f8-41b7b6d41e1b != > > super.uuid 0110ddb3-d4bf-4c1e-be11-654598c71db0 > > ``` > > that's why I abandoned that idea and tried to provision OSDs manually one > > by one. > > As I mentioned I used ceph-ansible, not cephadm for legacy reasons, but I > > suspect the problem I'm seeing is related to ceph-volume, so I suspect > > cephadm won't change it. > > > > I did more investigation in 1-by-1 OSD creation flow and it seems like > the > > fact that `ceph-volume lvm list` shows me 2 devices belonging to the same > > OSD can be explained by the following flow: > > > > 1. ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdd > > 2. trying to create osd.2 > > 3. fails with uuid != super.uuid issue > > 4. ceph-volume lvm list returns /dev/sdd belong to osd.2 (even though it > > failed) > > 5. ceph-volume lvm create --bluestore --dmcrypt --data /dev/sde > > 6. trying to create osd.2 (*again*) > > 7. succeeds > > 8. ceph-volume lvm list returns both /dev/sdd and /dev/sde belonging to > > osd.2 > > > > osd.2 is reported to be up and running. > > > > Any idea why this is happening? > > Thank you! > > Oleksiy > > > > On Thu, Oct 27, 2022 at 12:11 AM Eugen Block wrote: > > > >> Hi, > >> > >> first of all, if you really need to issue ceph-volume manually, > >> there's a batch command: > >> > >> cephadm ceph-volume lvm batch /dev/sdb /dev/sdc /dev/sdd /dev/sde > >> > >> Second, are you using cephadm? Maybe your manual intervention > >> conflicts with the automatic osd setup (all available devices). You > >> could look into /var/log/ceph/cephadm.log on each node and see if > >> cephadm already tried to setup the OSDs for you. What does 'ceph orch > >> ls' show? > >> Did you end up having online OSDs or did it fail? In that case I would > >> purge all OSDs from the crushmap, then wipe all devices (ceph-volume > >> lvm zap --destroy /dev/sdX) and either let cephadm create the OSDs for > >> you or you disable that (unmanaged=true) and run the manual steps > >> again (although it's not really necessary). > >> > >> Regards, > >> Eugen > >> > >> Zitat von Oleksiy Stashok : > >> > >> > Hey guys, > >> > > >> > I ran into a weird issue, hope you can explain what I'm observing. I'm > >> > testing* Ceph 16.2.10* on *Ubuntu 20.04* in *Google Cloud VMs*, I > >> created 3 > >> > instances and attached 4 persistent SSD disks to each instance. I can > see > >> > these disks attached as `/dev/sdb, /dev/sdc, /dev/sdd, /dev/sde` > devices. > >> > > >> > As a next step I used ceph-ansible to bootstrap the ceph cluster on 3 > >> > instances, however I intentionally skipped OSD setup. 
So I ended up > with > >> a > >> > Ceph cluster w/o any OSD. > >> > > >> > I ssh'ed into each VM and ran: > >> > > >> > ``` > >> > sudo -s > >> > for dev in sdb sdc sdd sde; do > >> > /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore > >> > --dmcrypt --data "/dev/$dev" > >> > done > >> > ``` > >> > > >> > The operation above randomly fails on random instances/devices with > >> > something like: > >> > ``` > >> > bluefs _replay 0x0: stop: uuid e2f72ec9-2747-82d7-c7f8-41b7b6d41e1b != > >> > super.uuid 0110ddb3-d4bf-4c1e-be11-654598c71db0 > >> > ``` > >> > > >> > The interesting this is that when I do > >> > ``` > >> > /usr/sbin/ceph-volume lvm ls > >> > ``` > >> > > >> > I can see that the device for which OSD creation failed actually > belongs > >> to > >> > a different OSD that was previously created for a different device. > For > >> > example the failure I mentioned above happened on the `/dev/sde` > device, > >> so > >> > when I list lvms I see this: > >> > ``` > >> > == osd.2 === > >> > > >> > [block] > >> > > >> > /dev/ceph-103a4373-dbe0-43d6-a9e0-34db4e1b257c/osd-block-9af542ba-fd65-4355-ad17-7293856acaeb > >> > > >> > block device > >> > > >> > > >> > /dev/ceph-103a4373-dbe0-43d6-a9e0-34db4e1b257c/osd-block-9af542ba-fd65-4355-ad17-7293856acaeb > >> > block uuidFfFnLt-h33F-F73V-tY45-VuZM-scj7-C3dg1K > >> > cephx lockbox secret > AQAlelljqNPoMhAA5
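For reference, a minimal sketch of how the workaround Oleksiy mentions at the top of this message can be applied (he does not say which method he used, so both common options are shown):

```
# Cluster-wide via the monitors' config database:
ceph config set osd bluefs_buffered_io false

# ...or per host in ceph.conf before (re)creating the OSDs:
# [osd]
# bluefs_buffered_io = false
```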
[ceph-users] Re: cephadm trouble with OSD db- and wal-device placement (quincy)
That is correct, just omit the wal_devices and they will be placed on the db_devices automatically. Zitat von "Fox, Kevin M" : I haven't done it, but had to read through the documentation a couple months ago and what I gathered was: 1. if you have a db device specified but no wal device, it will put the wal on the same volume as the db. 2. the recommendation seems to be to not have a separate volume for db and wal if on the same physical device? So, that should allow you to have the failure mode you want I think? Can anyone else confirm this or knows that it is incorrect? Thanks, Kevin From: Ulrich Pralle Sent: Tuesday, November 1, 2022 7:25 AM To: ceph-users@ceph.io Subject: [ceph-users] cephadm trouble with OSD db- and wal-device placement (quincy) Check twice before you click! This email originated from outside PNNL. Hej, we are using ceph version 17.2.0 on Ubuntu 22.04.1 LTS. We've got several servers with the same setup and are facing a problem with OSD deployment and db-/wal-device placement. Each server consists of ten rotational disks (10TB each) and two NVME devices (3TB each). We would like to deploy each rotational disk with a db- and wal-device. We want to place the db and wal devices of an osd together on the same NVME, to cut the failure of the OSDs in half if one NVME fails. We tried several osd service type specifications to achieve our deployment goal. Our best approach is: service_type: osd service_id: osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme service_name: osd.osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme placement: host_pattern: '*' unmanaged: true spec: data_devices: model: MG[redacted] rotational: 1 db_devices: limit: 1 model: MZ[redacted] rotational: 0 filter_logic: OR objectstore: bluestore wal_devices: limit: 1 model: MZ[redacted] rotational: 0 This service spec deploys ten OSDs with all db-devices on one NVME and all wal-devices on the second NVME. If we omit "limit: 1", cephadm deploys ten OSDs with db-devices equally distributed on both NVMEs and no wal-devices at all --- although half of the NVMEs capacity remains unused. What's the best way to do it. Does that even make sense? Thank you very much and with kind regards Uli ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
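A sketch of what the trimmed spec could look like, following Eugen's advice to drop the wal_devices section so the WAL lands on each OSD's DB volume. The service_id and device models are placeholders carried over from the original post; review the dry-run output before applying anything.

```
cat > osd_spec.yaml <<'EOF'
service_type: osd
service_id: osd_spec_10x10tb-dsk_db_on_2x3tb-nvme
placement:
  host_pattern: '*'
spec:
  objectstore: bluestore
  data_devices:
    model: MG[redacted]
    rotational: 1
  db_devices:
    model: MZ[redacted]
    rotational: 0
EOF

ceph orch apply -i osd_spec.yaml --dry-run   # inspect the planned OSD layout first
```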
[ceph-users] ceph status usage doesn't match bucket totals
I have a ceph cluster that shows a different space utilization in its status screen than in its bucket stats. When I copy the contents of this cluster to a different ceph cluster, the bucket stats totals are as expected and match that cluster's status totals. Output of "ceph -s" on the first ceph cluster:

  cluster:
    id:     9de06f94-a841-4e17-ac7c-9a0d8d8791b8
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum idb-ceph4,idb-ceph5,idb-ceph6
    mgr: idb-ceph4(active), standbys: idb-ceph5, idb-ceph6
    osd: 172 osds: 168 up, 168 in
    rgw: 5 daemons active

  data:
    pools:   16 pools, 9430 pgs
    objects: 272.4 M objects, 360 TiB
    usage:   1.1 PiB used, 333 TiB / 1.4 PiB avail
    pgs:     9410 active+clean
             11   active+clean+scrubbing
             9    active+clean+scrubbing+deep

  io:
    client: 151 MiB/s rd, 838 op/s rd, 0 op/s wr

taking the sum of all the sizes of the buckets gives: 526,773,819,875,727 bytes or 526.7 TB. How can the status usage of objects be 333 TiB and the bucket totals be 479 TiB? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
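One way to cross-check the two numbers, assuming the bucket total above was summed from radosgw-admin bucket stats (jq and the usual rgw.main usage category are assumed here; which size field you sum — size, size_actual or size_utilized — can itself account for part of the gap):

```
# Sum of bucket sizes as RGW reports them, in bytes.
radosgw-admin bucket stats | jq '[.[].usage["rgw.main"].size_actual // 0] | add'

# Per-pool STORED vs USED (raw, i.e. after replication/EC overhead).
ceph df detail
```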
[ceph-users] Re: No active PG; No disk activity
I managed to solve this problem. To document the resolution: The firewall was blocking communication. After disabling everything related to it and restarting the machine everything went back to normal. Em ter., 1 de nov. de 2022 às 10:46, Murilo Morais escreveu: > Good morning everyone! > > Today there was an atypical situation in our Cluster where the three > machines came to shut down. > > On powering up the cluster went up and formed quorum with no problems, but > the PGs are all in Working, I don't see any disk activity on the machines. > No PG is active. > > > > > [ceph: root@dcs1 /]# ceph osd tree > ID CLASS WEIGHTTYPE NAME STATUS REWEIGHT PRI-AFF > -1 98.24359 root default > -3 32.74786 host dcs1 > 0hdd 2.72899 osd.0 up 1.0 1.0 > 1hdd 2.72899 osd.1 up 1.0 1.0 > 2hdd 2.72899 osd.2 up 1.0 1.0 > 3hdd 2.72899 osd.3 up 1.0 1.0 > 4hdd 2.72899 osd.4 up 1.0 1.0 > 5hdd 2.72899 osd.5 up 1.0 1.0 > 6hdd 2.72899 osd.6 up 1.0 1.0 > 7hdd 2.72899 osd.7 up 1.0 1.0 > 8hdd 2.72899 osd.8 up 1.0 1.0 > 9hdd 2.72899 osd.9 up 1.0 1.0 > 10hdd 2.72899 osd.10 up 1.0 1.0 > 11hdd 2.72899 osd.11 up 1.0 1.0 > -5 32.74786 host dcs2 > 12hdd 2.72899 osd.12 up 1.0 1.0 > 13hdd 2.72899 osd.13 up 1.0 1.0 > 14hdd 2.72899 osd.14 up 1.0 1.0 > 15hdd 2.72899 osd.15 up 1.0 1.0 > 16hdd 2.72899 osd.16 up 1.0 1.0 > 17hdd 2.72899 osd.17 up 1.0 1.0 > 18hdd 2.72899 osd.18 up 1.0 1.0 > 19hdd 2.72899 osd.19 up 1.0 1.0 > 20hdd 2.72899 osd.20 up 1.0 1.0 > 21hdd 2.72899 osd.21 up 1.0 1.0 > 22hdd 2.72899 osd.22 up 1.0 1.0 > 23hdd 2.72899 osd.23 up 1.0 1.0 > -7 32.74786 host dcs3 > 24hdd 2.72899 osd.24 up 1.0 1.0 > 25hdd 2.72899 osd.25 up 1.0 1.0 > 26hdd 2.72899 osd.26 up 1.0 1.0 > 27hdd 2.72899 osd.27 up 1.0 1.0 > 28hdd 2.72899 osd.28 up 1.0 1.0 > 29hdd 2.72899 osd.29 up 1.0 1.0 > 30hdd 2.72899 osd.30 up 1.0 1.0 > 31hdd 2.72899 osd.31 up 1.0 1.0 > 32hdd 2.72899 osd.32 up 1.0 1.0 > 33hdd 2.72899 osd.33 up 1.0 1.0 > 34hdd 2.72899 osd.34 up 1.0 1.0 > 35hdd 2.72899 osd.35 up 1.0 1.0 > > > > > [ceph: root@dcs1 /]# ceph -s > cluster: > id: 58bbb950-538b-11ed-b237-2c59e53b80cc > health: HEALTH_WARN > 4 filesystems are degraded > 4 MDSs report slow metadata IOs > Reduced data availability: 1153 pgs inactive, 1101 pgs peering > 26 slow ops, oldest one blocked for 563 sec, daemons > [osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]... > have slow ops. > > services: > mon: 3 daemons, quorum dcs1.evocorp,dcs2,dcs3 (age 7m) > mgr: dcs1.evocorp.kyqfcd(active, since 15m), standbys: dcs2.rirtyl > mds: 4/4 daemons up, 4 standby > osd: 36 osds: 36 up (since 6m), 36 in (since 47m); 65 remapped pgs > > data: > volumes: 0/4 healthy, 4 recovering > pools: 10 pools, 1153 pgs > objects: 254.72k objects, 994 GiB > usage: 2.8 TiB used, 95 TiB / 98 TiB avail > pgs: 100.000% pgs not active > 1036 peering > 65 remapped+peering > 52 activating > > > > > [ceph: root@dcs1 /]# ceph health detail > HEALTH_WARN 4 filesystems are degraded; 4 MDSs report slow metadata IOs; > Reduced data availability: 1153 pgs inactive, 1101 pgs peering; 26 slow > ops, oldest one blocked for 673 sec, daemons > [osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]... > have slow ops. 
> [WRN] FS_DEGRADED: 4 filesystems are degraded > fs dc_ovirt is degraded > fs dc_iso is degraded > fs dc_sas is degraded > fs pool_tester is degraded > [WRN] MDS_SLOW_METADATA_IO: 4 MDSs report slow metadata IOs > mds.dc_sas.dcs1.wbyuik(mds.0): 4 slow metadata IOs are blocked > 30 > secs, oldest blocked for 1063 secs > mds.dc_ovirt.dcs1.lpcazs(mds.0): 4 slow metadata IOs are blocked > 30 > secs, oldest blocked for 1058 secs > mds.po
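Rather than disabling the firewall entirely, the usual fix is to open the Ceph ports; a hedged sketch for firewalld, assuming the stock ceph/ceph-mon service definitions are shipped by the distribution:

```
firewall-cmd --permanent --add-service=ceph-mon   # mon: 3300/tcp and 6789/tcp
firewall-cmd --permanent --add-service=ceph       # osd/mds/mgr: 6800-7300/tcp
firewall-cmd --reload
```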
[ceph-users] Re: Is it a bug that OSD crashed when it's full?
Hi Tony, first of all let me share my understanding of the issue you're facing. This reminds me of an upstream ticket, and I presume my root cause analysis from there (https://tracker.ceph.com/issues/57672#note-9) is applicable in your case as well. So generally speaking your OSD isn't 100% full - from the log output one can see that 0x57acbc000 of 0x6fc840 bytes are free. But there are not enough contiguous 64K chunks for BlueFS to proceed operating. As a result the OSD managed to escape all the *full* safeguards and reached the state where it crashed - these safety measures just weren't designed to take that additional free-space fragmentation factor into account... Similarly, the lack of available 64K chunks prevents the OSD from starting up - it needs to write out some more data to BlueFS during startup recovery. I'm currently working on enabling BlueFS functioning with the default main device allocation unit (=4K), which will hopefully fix the above issue. Meanwhile you might want to work around the current OSD's state by setting bluefs_shared_alloc_size to 32K - this might have some operational and performance effects, but highly likely the OSD will be able to start up afterwards. Please do not use 4K for now - it's known for causing more problems in some circumstances. And I'd highly recommend redeploying the OSD ASAP once you have drained all the data off it - I presume that's the reason why you want to bring it up instead of letting the cluster recover using the regular means applied on OSD loss. An alternative approach would be to add a standalone DB volume and migrate BlueFS there - ceph-volume should be able to do that even in the current OSD state. Expanding the main volume (if backed by LVM and extra spare space is available) is apparently a valid option too. Thanks, Igor On 11/1/2022 8:09 PM, Tony Liu wrote: The actual question is that, is crash expected when OSD is full? My focus is more on how to prevent this from happening. My expectation is that OSD rejects write request when it's full, but not crash. Otherwise, no point to have ratio threshold. Please let me know if this is the design or a bug. Thanks! Tony From: Tony Liu Sent: October 31, 2022 05:46 PM To: ceph-users@ceph.io; d...@ceph.io Subject: [ceph-users] Is it a bug that OSD crashed when it's full? Hi, Based on doc, Ceph prevents you from writing to a full OSD so that you don’t lose data. In my case, with v16.2.10, OSD crashed when it's full. Is this expected or some bug? I'd expect write failure instead of OSD crash. It keeps crashing when tried to bring it up. Is there any way to bring it back? -7> 2022-10-31T22:52:57.426+ 7fe37fd94200 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1667256777427646, "job": 1, "event": "recovery_started", "log_files": [23300]} -6> 2022-10-31T22:52:57.426+ 7fe37fd94200 4 rocksdb: [db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2 -5> 2022-10-31T22:52:57.529+ 7fe37fd94200 3 rocksdb: [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_version>=5.
-4> 2022-10-31T22:52:57.592+ 7fe37fd94200 1 bluefs _allocate unable to allocate 0x9 on bdev 1, allocator name block, allocator type hybrid, capacity , block size 0x1000, free 0x57acbc000, fragmentation 0.359784, allocated 0x0 -3> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _allocate allocation failed, needed 0x8064a -2> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x8064a -1> 2022-10-31T22:52:57.604+ 7fe37fd94200 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fe37fd94200 time 2022-10-31T22:52:57.593873+ /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc") ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable) 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string, std::allocator > const&)+0xe5) [0x55858d7e2e7c] 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x55858dee8cc1] 3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55858dee8fa0] 4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock&)+0x32) [0x55858defa0b2] 5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55858df129eb] 6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions cons
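A minimal sketch of the workaround Igor describes, with placeholder OSD id, fsid and LV names; this is not a substitute for redeploying the OSD afterwards:

```
# 1) Let BlueFS allocate in 32K units so the stuck OSD can start
#    (the option is read at OSD startup; it can also go into ceph.conf).
ceph config set osd.5 bluefs_shared_alloc_size 32768

# 2) Once it is up: either drain and redeploy the OSD, or move BlueFS to
#    its own DB volume, e.g. (placeholders for fsid and target LV):
# ceph-volume lvm new-db --osd-id 5 --osd-fsid <osd-fsid> --target vg_db/lv_osd5_db
```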
[ceph-users] Re: Is it a bug that OSD crashed when it's full?
If its the same issue, I'd check the fragmentation score on the entire cluster asap. You may have other osds close to the limit and its harder to fix when all your osds cross the line at once. If you drain this one, it may push the other ones into the red zone if your too close, making the problem much worse. Our cluster has been stable after splitting all the db's to their own volumes. Really looking forward to the 4k fix. :) But the workaround seems solid. Thanks, Kevin From: Igor Fedotov Sent: Tuesday, November 1, 2022 4:34 PM To: Tony Liu; ceph-users@ceph.io; d...@ceph.io Subject: [ceph-users] Re: Is it a bug that OSD crashed when it's full? Check twice before you click! This email originated from outside PNNL. Hi Tony, first of all let me share my understanding of the issue you're facing. This recalls me an upstream ticket and I presume my root cause analysis from there (https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftracker.ceph.com%2Fissues%2F57672%23note-9&data=05%7C01%7Ckevin.fox%40pnnl.gov%7C2e3ff73019a7475ade5e08dabc627f6d%7Cd6faa5f90ae240338c0130048a38deeb%7C0%7C0%7C638029428500885214%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=%2F8iqR9bo0Yg4WZDA8TqI4d8HywpEoChEUCoiXLEE9TM%3D&reserved=0) is applicable in your case as well. So generally speaking your OSD isn't 100% full - from the log output one can see that 0x57acbc000 of 0x6fc840 bytes are free. But there are not enough contiguous 64K chunks for BlueFS to proceed operating.. As a result OSD managed to escape any *full* sentries and reached the state when it's crashed - these safety means just weren't designed to take that additional free space fragmentation factor into account... Similarly the lack of available 64K chunks prevents OSD from starting up - it needs to write out some more data to BlueFS during startup recovery. I'm currently working on enabling BlueFS functioning with default main device allocation unit (=4K) which will hopefully fix the above issue. Meanwhile you might want to workaround the current OSD's state by setting bluefs_shared_allocat_size to 32K - this might have some operational and performance effects but highly likely OSD should be able to startup afterwards. Please do not use 4K for now - it's known for causing more problems in some circumstances. And I'd highly recommend to redeploy the OSD ASAP as you drained all the data off it - I presume that's the reason why you want to bring it up instead of letting the cluster to recover using regular means applied on OSD loss. Alternative approach would be to add standalone DB volume and migrate BlueFS there - ceph-volume should be able to do that even in the current OSD state. Expanding main volume (if backed by LVM and extra spare space is available) is apparently a valid option too Thanks, Igor On 11/1/2022 8:09 PM, Tony Liu wrote: > The actual question is that, is crash expected when OSD is full? > My focus is more on how to prevent this from happening. > My expectation is that OSD rejects write request when it's full, but not > crash. > Otherwise, no point to have ratio threshold. > Please let me know if this is the design or a bug. > > Thanks! > Tony > > From: Tony Liu > Sent: October 31, 2022 05:46 PM > To: ceph-users@ceph.io; d...@ceph.io > Subject: [ceph-users] Is it a bug that OSD crashed when it's full? > > Hi, > > Based on doc, Ceph prevents you from writing to a full OSD so that you don’t > lose data. > In my case, with v16.2.10, OSD crashed when it's full. 
Is this expected or > some bug? > I'd expect write failure instead of OSD crash. It keeps crashing when tried > to bring it up. > Is there any way to bring it back? > > -7> 2022-10-31T22:52:57.426+ 7fe37fd94200 4 rocksdb: EVENT_LOG_v1 > {"time_micros": 1667256777427646, "job": 1, "event": "recovery_started", > "log_files": [23300]} > -6> 2022-10-31T22:52:57.426+ 7fe37fd94200 4 rocksdb: > [db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2 > -5> 2022-10-31T22:52:57.529+ 7fe37fd94200 3 rocksdb: > [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high > (20) bits/key. Dramatic filter space and/or accuracy improvement is available > with format_version>=5. > -4> 2022-10-31T22:52:57.592+ 7fe37fd94200 1 bluefs _allocate unable > to allocate 0x9 on bdev 1, allocator name block, allocator type hybrid, > capacity, block size 0x1000, free 0x57acbc000, fragmentation 0.359784, > allocated 0x0 > -3> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _allocate > allocation failed, needed 0x8064a > -2> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _flush_range > allocated: 0x0 offset: 0x0 length: 0x8064a > -1> 2022-10-31T22:52:57.604+ 7fe37fd94200 -1 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_6
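A short sketch of checking the fragmentation score Kevin refers to; run it against each OSD in turn (the id below is a placeholder). The score goes from 0 (no fragmentation) towards 1 (heavily fragmented).

```
# On the host running the OSD, via its admin socket:
ceph daemon osd.5 bluestore allocator score block

# Recent releases may also accept the same command through "ceph tell":
# ceph tell osd.5 bluestore allocator score block
```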
[ceph-users] Re: Is it a bug that OSD crashed when it's full?
Thank you Igor! Tony From: Igor Fedotov Sent: November 1, 2022 04:34 PM To: Tony Liu; ceph-users@ceph.io; d...@ceph.io Subject: Re: [ceph-users] Re: Is it a bug that OSD crashed when it's full? Hi Tony, first of all let me share my understanding of the issue you're facing. This recalls me an upstream ticket and I presume my root cause analysis from there (https://tracker.ceph.com/issues/57672#note-9) is applicable in your case as well. So generally speaking your OSD isn't 100% full - from the log output one can see that 0x57acbc000 of 0x6fc840 bytes are free. But there are not enough contiguous 64K chunks for BlueFS to proceed operating.. As a result OSD managed to escape any *full* sentries and reached the state when it's crashed - these safety means just weren't designed to take that additional free space fragmentation factor into account... Similarly the lack of available 64K chunks prevents OSD from starting up - it needs to write out some more data to BlueFS during startup recovery. I'm currently working on enabling BlueFS functioning with default main device allocation unit (=4K) which will hopefully fix the above issue. Meanwhile you might want to workaround the current OSD's state by setting bluefs_shared_allocat_size to 32K - this might have some operational and performance effects but highly likely OSD should be able to startup afterwards. Please do not use 4K for now - it's known for causing more problems in some circumstances. And I'd highly recommend to redeploy the OSD ASAP as you drained all the data off it - I presume that's the reason why you want to bring it up instead of letting the cluster to recover using regular means applied on OSD loss. Alternative approach would be to add standalone DB volume and migrate BlueFS there - ceph-volume should be able to do that even in the current OSD state. Expanding main volume (if backed by LVM and extra spare space is available) is apparently a valid option too Thanks, Igor On 11/1/2022 8:09 PM, Tony Liu wrote: > The actual question is that, is crash expected when OSD is full? > My focus is more on how to prevent this from happening. > My expectation is that OSD rejects write request when it's full, but not > crash. > Otherwise, no point to have ratio threshold. > Please let me know if this is the design or a bug. > > Thanks! > Tony > > From: Tony Liu > Sent: October 31, 2022 05:46 PM > To: ceph-users@ceph.io; d...@ceph.io > Subject: [ceph-users] Is it a bug that OSD crashed when it's full? > > Hi, > > Based on doc, Ceph prevents you from writing to a full OSD so that you don’t > lose data. > In my case, with v16.2.10, OSD crashed when it's full. Is this expected or > some bug? > I'd expect write failure instead of OSD crash. It keeps crashing when tried > to bring it up. > Is there any way to bring it back? > > -7> 2022-10-31T22:52:57.426+ 7fe37fd94200 4 rocksdb: EVENT_LOG_v1 > {"time_micros": 1667256777427646, "job": 1, "event": "recovery_started", > "log_files": [23300]} > -6> 2022-10-31T22:52:57.426+ 7fe37fd94200 4 rocksdb: > [db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2 > -5> 2022-10-31T22:52:57.529+ 7fe37fd94200 3 rocksdb: > [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high > (20) bits/key. Dramatic filter space and/or accuracy improvement is available > with format_version>=5. 
> -4> 2022-10-31T22:52:57.592+ 7fe37fd94200 1 bluefs _allocate unable > to allocate 0x9 on bdev 1, allocator name block, allocator type hybrid, > capacity, block size 0x1000, free 0x57acbc000, fragmentation 0.359784, > allocated 0x0 > -3> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _allocate > allocation failed, needed 0x8064a > -2> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _flush_range > allocated: 0x0 offset: 0x0 length: 0x8064a > -1> 2022-10-31T22:52:57.604+ 7fe37fd94200 -1 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: > In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, > uint64_t)' thread 7fe37fd94200 time 2022-10-31T22:52:57.593873+ > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: > 2768: ceph_abort_msg("bluefs enospc") > > ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific > (stable) > 1: (ceph::__ceph_abort(char const*, int, char const*, > std::__cxx11::basic_string, std::allocator > > const&)+0xe5) [0x55858d7e2e7c] > 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned > long)+0x1131) [0x55858dee8cc1] > 3: (BlueFS::_flush(BlueFS