[ceph-users] Re: alerts in dashboard
Hi Ben,

It looks like you forgot to attach the screenshots.

Regards,
Nizam

On Wed, Jun 21, 2023, 12:23 Ben wrote:
> Hi,
>
> I got many critical alerts in ceph dashboard. Meanwhile the cluster shows
> health ok status.
>
> See attached screenshot for detail. My questions are, are they real alerts?
> How to get rid of them?
>
> Thanks
> Ben
[ceph-users] Re: OSDs cannot join cluster anymore
Hi,

can you share more details about what exactly you did? How did you remove the
nodes? Hopefully, you waited for the draining to finish? But if the remaining
OSDs wait for removed OSDs it sounds like the draining was not finished.

Zitat von Malte Stroem :

Hello,

we removed some nodes from our cluster. This worked without problems.

Now, lots of OSDs do not want to join the cluster anymore if we reboot one of
the still available nodes.

It always runs into timeouts:

--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300

MONs and MGRs are running fine. Network is working, netcat to the MONs' ports
are open. Setting a higher debug level has no effect even if we add it to the
ceph.conf file.

The PGs are pretty unhappy, e. g.:

7.143  87771  0  0  0  0  314744902235  0  0  10081  10081  down
2023-06-20T09:16:03.546158+  961275'1395646  961300:9605547
[209,NONE,NONE]  209  [209,NONE,NONE]  209  961231'1395512
2023-06-19T23:46:40.101791+  961231'1395512  2023-06-19T23:46:40.101791+

PG query wants us to set an OSD lost, however I do not want to do this.

OSDs are blocked by OSDs from the removed nodes:

ceph osd blocked-by
osd  num_blocked
152  38
244  41
144  54
...

We added the removed hosts again and tried to start the OSDs on this node and
they also ran into the timeout mentioned above.

This is a containerized cluster running version 16.2.10. Replication is 3,
some pools use an erasure coded profile.

Best regards,
Malte
[ceph-users] Re: OSDs cannot join cluster anymore
Hello Eugen, thank you. Yesterday I thought: Well, Eugen can help! Yes, we drained the nodes. It needed two weeks to finish the process, and yes, I think this is the root cause. So we still have the nodes but when I try to restart one of those OSDs it still cannot join: Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-66/block Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-19 Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-66 Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate successful for osd ID: 66 Jun 21 09:51:04 ceph-node bash[2323668]: debug 2023-06-21T07:51:04.176+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Jun 21 09:56:04 ceph-node bash[2323668]: debug 2023-06-21T07:56:04.179+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Jun 21 10:01:04 ceph-node bash[2323668]: debug 2023-06-21T08:01:04.177+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Jun 21 10:06:04 ceph-node bash[2323668]: debug 2023-06-21T08:06:04.179+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Jun 21 10:11:04 ceph-node bash[2323668]: debug 2023-06-21T08:11:04.174+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Same messages on all OSDs. We still have some nodes running and did not restart those OSDs. Best, Malte Am 21.06.23 um 09:50 schrieb Eugen Block: Hi, can you share more details what exactly you did? How did you remove the nodes? Hopefully, you waited for the draining to finish? But if the remaining OSDs wait for removed OSDs it sounds like the draining was not finished. Zitat von Malte Stroem : Hello, we removed some nodes from our cluster. This worked without problems. Now, lots of OSDs do not want to join the cluster anymore if we reboot one of the still available nodes. It always runs into timeouts: --> ceph-volume lvm activate successful for osd ID: XX monclient(hunting): authenticate timed out after 300 MONs and MGRs are running fine. Network is working, netcat to the MONs' ports are open. Setting a higher debug level has no effect even if we add it to the ceph.conf file. The PGs are pretty unhappy, e. g.: 7.143 87771 0 0 0 0 314744902235 0 0 10081 10081 down 2023-06-20T09:16:03.546158+ 961275'1395646 961300:9605547 [209,NONE,NONE] 209 [209,NONE,NONE] 209 961231'1395512 2023-06-19T23:46:40.101791+ 961231'1395512 2023-06-19T23:46:40.101791+ PG query wants us to set an OSD lost however I do not want to do this. OSDs are blocked by OSDs from the removed nodes: ceph osd blocked-by osd num_blocked 152 38 244 41 144 54 ... We added the removed hosts again and tried to start the OSDs on this node and they also failed into the timeout mentioned above. This is a containerized cluster running version 16.2.10. Replication is 3, some pools use an erasure coded profile. Best regards, Malte ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
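[Editor's note] One way to narrow down where the authentication hangs is to test the mon ports and the OSD's own key directly from the affected host. This is only a minimal sketch, assuming a cephadm-style containerized layout as described in the thread; the mon address and cluster fsid are placeholders:

# are both messenger ports of each mon reachable from this host?
nc -zv <mon-ip> 3300
nc -zv <mon-ip> 6789

# can the OSD's own key still authenticate against the mons?
# (a permission error here is fine - it proves authentication works,
#  while a hang reproduces the timeout shown above)
ceph -s --name osd.66 --keyring /var/lib/ceph/<fsid>/osd.66/keyring

# does the key registered in the cluster match the one on disk?
ceph auth get osd.66
cat /var/lib/ceph/<fsid>/osd.66/keyring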
[ceph-users] Re: OSDs cannot join cluster anymore
Hi, Yes, we drained the nodes. It needed two weeks to finish the process, and yes, I think this is the root cause. So we still have the nodes but when I try to restart one of those OSDs it still cannot join: if the nodes were drained successfully (can you confirm that all PGs were active+clean after draining before you removed the nodes?) then the disks on the removed nodes wouldn't have any data to bring back. The question would be, why do the remaining OSDs still reference removed OSDs. Or am I misunderstanding something? I think it would help to know the whole story, can you provide more details? Also some more general cluster info would be helpful: $ ceph -s $ ceph osd tree $ ceph health detail Zitat von Malte Stroem : Hello Eugen, thank you. Yesterday I thought: Well, Eugen can help! Yes, we drained the nodes. It needed two weeks to finish the process, and yes, I think this is the root cause. So we still have the nodes but when I try to restart one of those OSDs it still cannot join: Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-66/block Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-19 Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-66 Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate successful for osd ID: 66 Jun 21 09:51:04 ceph-node bash[2323668]: debug 2023-06-21T07:51:04.176+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Jun 21 09:56:04 ceph-node bash[2323668]: debug 2023-06-21T07:56:04.179+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Jun 21 10:01:04 ceph-node bash[2323668]: debug 2023-06-21T08:01:04.177+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Jun 21 10:06:04 ceph-node bash[2323668]: debug 2023-06-21T08:06:04.179+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Jun 21 10:11:04 ceph-node bash[2323668]: debug 2023-06-21T08:11:04.174+ 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300 Same messages on all OSDs. We still have some nodes running and did not restart those OSDs. Best, Malte Am 21.06.23 um 09:50 schrieb Eugen Block: Hi, can you share more details what exactly you did? How did you remove the nodes? Hopefully, you waited for the draining to finish? But if the remaining OSDs wait for removed OSDs it sounds like the draining was not finished. Zitat von Malte Stroem : Hello, we removed some nodes from our cluster. This worked without problems. Now, lots of OSDs do not want to join the cluster anymore if we reboot one of the still available nodes. It always runs into timeouts: --> ceph-volume lvm activate successful for osd ID: XX monclient(hunting): authenticate timed out after 300 MONs and MGRs are running fine. Network is working, netcat to the MONs' ports are open. Setting a higher debug level has no effect even if we add it to the ceph.conf file. The PGs are pretty unhappy, e. g.: 7.143 87771 0 0 0 0 314744902235 0 0 10081 10081 down 2023-06-20T09:16:03.546158+ 961275'1395646 961300:9605547 [209,NONE,NONE] 209 [209,NONE,NONE] 209 961231'1395512 2023-06-19T23:46:40.101791+ 961231'1395512 2023-06-19T23:46:40.101791+ PG query wants us to set an OSD lost however I do not want to do this. OSDs are blocked by OSDs from the removed nodes: ceph osd blocked-by osd num_blocked 152 38 244 41 144 54 ... 
We added the removed hosts again and tried to start the OSDs on this node and they also failed into the timeout mentioned above. This is a containerized cluster running version 16.2.10. Replication is 3, some pools use an erasure coded profile. Best regards, Malte ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs
Hi Igor,

thank you for your answer!

>first of all Quincy does have a fix for the issue, see
>https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
>https://tracker.ceph.com/issues/58588)

Thank you, I somehow missed that release, good to know!

>SSD or HDD? Standalone or shared DB volume? I presume the latter... What
>is disk size and current utilization?
>
>Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
>possible

We use 4 TB NVMe SSDs, shared DB yes, and mainly Micron with some Dell and
Samsung in this cluster:

Micron_7400_MTFDKCB3T8TDZ_214733D291B1 cloud5-1561:nvme5n1 osd.5

All disks are at ~ 88% utilization. I noticed that around 92% our disks tend
to run into this bug.

Here are some bluefs-bdev-sizes from different OSDs on different hosts in
this cluster:

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36/
inferring bluefs devices from bluestore path
1 : device size 0x37e3ec0 : using 0x2e1b390(2.9 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-24/
inferring bluefs devices from bluestore path
1 : device size 0x37e3ec0 : using 0x2d4e318d000(2.8 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5/
inferring bluefs devices from bluestore path
1 : device size 0x37e3ec0 : using 0x2f2da93d000(2.9 TiB)

>Generally, given my assumption that DB volume is currently collocated
>and you still want to stay on Pacific, you might want to consider
>redeploying OSDs with a standalone DB volume setup.
>
>Just create large enough (2x of the current DB size seems to be pretty
>conservative estimation for that volume's size) additional LV on top of
>the same physical disk. And put DB there...
>
>Separating DB from main disk would result in much less fragmentation at
>DB volume and hence work around the problem. The cost would be having
>some extra spare space at DB volume unavailable for user data .

I guess that makes sense, so the suggestion would be to deploy the OSD and DB
on the same NVMe but with different logical volumes, or to update to Quincy.

Thank you!
Carsten

Von: Igor Fedotov
Datum: Dienstag, 20. Juni 2023 um 12:48
An: Carsten Grommel , ceph-users@ceph.io
Betreff: Re: [ceph-users] Ceph Pacific bluefs enospc bug with newly created OSDs

Hi Carsten,

first of all Quincy does have a fix for the issue, see
https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
https://tracker.ceph.com/issues/58588)

Could you please share a bit more info on OSD disk layout?

SSD or HDD? Standalone or shared DB volume? I presume the latter... What
is disk size and current utilization?

Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
possible

Generally, given my assumption that DB volume is currently collocated
and you still want to stay on Pacific, you might want to consider
redeploying OSDs with a standalone DB volume setup.

Just create large enough (2x of the current DB size seems to be pretty
conservative estimation for that volume's size) additional LV on top of
the same physical disk. And put DB there...

Separating DB from main disk would result in much less fragmentation at
DB volume and hence work around the problem. The cost would be having
some extra spare space at DB volume unavailable for user data .

Hope this helps,
Igor

On 20/06/2023 10:29, Carsten Grommel wrote:
> Hi all,
>
> we are experiencing the “bluefs enospc bug” again after redeploying all OSDs
> of our Pacific Cluster.
> I know that our cluster is a bit too utilized at the moment with 87.26 % raw > usage but still this should not happen afaik. > We never hat this problem with previous ceph versions and right now I am kind > of out of ideas at how to tackle these crashes. > Compacting the database did not help in the past either. > Redeploy seems to no help in the long run as well. For documentation I used > these commands to redeploy the osds: > > systemctl stop ceph-osd@${OSDNUM} > ceph osd destroy --yes-i-really-mean-it ${OSDNUM} > blkdiscard ${DEVICE} > sgdisk -Z ${DEVICE} > dmsetup remove ${DMDEVICE} > ceph-volume lvm create --osd-id ${OSDNUM} --data ${DEVICE} > > Any ideas or possible solutions on this? I am not yet ready to upgrade our > clusters to quincy, also I do presume that this bug is still present in > quincy as well? > > Follow our cluster information: > > Crash Info: > ceph crash info > 2023-06-19T21:23:51.285180Z_ac4105d7-cb09-45c8-a6e3-8a6bb6727b25 > { > "assert_condition": "abort", > "assert_file": "/build/ceph/src/os/bluestore/BlueFS.cc", > "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, > uint64_t)", > "assert_line": 2810, > "assert_msg": "/build/ceph/src/os/bluestore/BlueFS.cc: In function 'int > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread > 7fd561810100 time > 2023-06-19T23:23:51.261617+0200\n/build/ceph/src/os/bluestore/BlueFS.cc: > 2810: ceph_abort_msg(\"bluefs enospc\")\n", >
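[Editor's note] To illustrate the standalone-DB redeploy Igor suggests, one possible shape is sketched below. This is only an example: the volume group, LV names and the 300G size are invented and would have to be adapted to the actual drive layout and current DB usage (roughly 2x the DB size, per the advice above):

# assumes the NVMe is already an LVM PV in a volume group named vg-osd5 (hypothetical)
lvcreate -n db-osd5   -L 300G vg-osd5       # standalone DB volume
lvcreate -n data-osd5 -l 100%FREE vg-osd5   # remaining space for the data volume

ceph-volume lvm create --osd-id 5 \
    --data vg-osd5/data-osd5 \
    --block.db vg-osd5/db-osd5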
[ceph-users] Re: radosgw new zonegroup hammers master with metadata sync
I've update the dc3 site from octopus to pacific and the problem is still there. I find it very weird that in only happens from one single zonegroup to the master and not from the other two. Am Mi., 21. Juni 2023 um 01:59 Uhr schrieb Boris Behrens : > I recreated the site and the problem still persists. > > I've upped the logging and saw this for a lot of buckets (i've stopped the > debug log after some seconds). > 2023-06-20T23:32:29.365+ 7fcaab7fe700 20 get_system_obj_state: > rctx=0x7fcaab7f9320 obj=dc3.rgw.meta:root:s3bucket-fra2 > state=0x7fcba05ac0a0 s->prefetch_data=0 > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache get: > name=dc3.rgw.meta+root+s3bucket-fra2 : miss > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache put: > name=dc3.rgw.meta+root+s3bucket-fra2 info.flags=0x6 > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 adding > dc3.rgw.meta+root+s3bucket-fra2 to cache LRU end > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache get: > name=dc3.rgw.meta+root+s3bucket-fra2 : type miss (requested=0x1, cached=0x6) > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache put: > name=dc3.rgw.meta+root+s3bucket-fra2 info.flags=0x1 > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 moving > dc3.rgw.meta+root+s3bucket-fra2 to cache LRU end > 2023-06-20T23:32:29.365+ 7fcaab7fe700 20 get_system_obj_state: > rctx=0x7fcaab7f9320 > obj=dc3.rgw.meta:root:.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29 > state=0x7fcba43ce0a0 s->prefetch_data=0 > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache get: > name=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29 > : miss > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache put: > name=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29 > info.flags=0x16 > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 adding > dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29 > to cache LRU end > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache get: > name=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29 > : type miss (requested=0x13, cached=0x16) > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache put: > name=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29 > info.flags=0x13 > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 moving > dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29 > to cache LRU end > 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 chain_cache_entry: > cache_locator=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29 > > Am Di., 20. Juni 2023 um 19:29 Uhr schrieb Boris : > >> Hi Casey, >> already did restart all RGW instances. Only helped for 2 minutes. We now >> stopped the new site. >> >> I will remove and recreate it later. >> As twi other sites don't have the problem I currently think I made a >> mistake in the process. >> >> Mit freundlichen Grüßen >> - Boris Behrens >> >> > Am 20.06.2023 um 18:30 schrieb Casey Bodley : >> > >> > hi Boris, >> > >> > we've been investigating reports of excessive polling from metadata >> > sync. i just opened https://tracker.ceph.com/issues/61743 to track >> > this. 
restarting the secondary zone radosgws should help as a >> > temporary workaround >> > >> >> On Tue, Jun 20, 2023 at 5:57 AM Boris Behrens wrote: >> >> >> >> Hi, >> >> yesterday I added a new zonegroup and it looks like it seems to cycle >> over >> >> the same requests over and over again. >> >> >> >> In the log of the main zone I see these requests: >> >> 2023-06-20T09:48:37.979+ 7f8941fb3700 1 beast: 0x7f8a602f3700: >> >> fd00:2380:0:24::136 - - [2023-06-20T09:48:37.979941+] "GET >> >> >> /admin/log?type=metadata&id=62&period=e8fc96f1-ae86-4dc1-b432-470b0772fded&max-entries=100&&rgwx-zonegroup=b39392eb-75f8-47f0-b4f3-7d3882930b26 >> >> HTTP/1.1" 200 44 - - - >> >> >> >> Only thing that changes is the &id. >> >> >> >> We have two other zonegroups that are configured identical (ceph.conf >> and >> >> period) and these don;t seem to spam the main rgw. >> >> >> >> root@host:~# radosgw-admin sync status >> >> realm 5d6f2ea4-b84a-459b-bce2-bccac338b3ef (main) >> >> zonegroup b39392eb-75f8-47f0-b4f3-7d3882930b26 (dc3) >> >> zone 96f5eca9-425b-4194-a152-86e310e91ddb (dc3) >> >> metadata sync syncing >> >>full sync: 0/64 shards >> >>incremental sync: 64/64 shards >> >>metadata is caught up with master >> >> >> >> root@host:~# radosgw-admin period get >> >> { >> >>"id": "e8fc96f1-ae86-4dc1-b432-470b0772fded", >> >>"epoch": 92, >> >>"predecessor_uuid": "5349ac85-3d6d-4088-993f-7a1d4be3835a", >> >>"sync_status": [ >> >>"", >> >> ... >> >>"" >> >>], >> >>"period_map": { >> >>"id": "e8fc96f1-ae86-4dc1-b4
[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)
On 20/06/2023 01:16, Work Ceph wrote:
I see, thanks for the feedback guys!

It is interesting that Ceph Manager does not allow us to export iSCSI blocks
without selecting 2 or more iSCSI portals. Therefore, we will always use at
least two, and as a consequence that feature is not going to be supported.
Can I export an RBD image via iSCSI gateway using only one portal via gwcli?

@Maged Mokhtar, I am not sure I follow. Do you guys have an iSCSI
implementation that we can use to somehow replace the default iSCSI server in
the default Ceph iSCSI Gateway? I didn't quite understand what the PetaSAN
project is, and if it is an open-source solution where we can somehow just
pick/select/use one of its modules (e.g. just the iSCSI implementation) that
you guys have.

For sure PetaSAN is open source, you should see this from the home page :)

We use Consul
https://www.consul.io/use-cases/multi-platform-service-mesh
to scale out the service/protocol layers above Ceph in a scale-out
active/active fashion. Most of our target use cases are non-Linux, such as
VMware and Windows, and we provide easy-to-use deployment and management.

For iSCSI, we use the kernel/LIO rbd backstore originally developed by SUSE
Enterprise Storage. We have made some changes to send persistent reservations
using Ceph watch/notify, and we also added changes to coordinate pre-snapshot
quiescing/flushing across different gateways. We ported the rbd backstore to
the 5.14 kernel.

You should be able to use the iSCSI gateway by itself on existing non-PetaSAN
clusters, but it is not a setup we support. You would use the LIO targetcli
to script the setup. There are some things to take care of, such as setting
the disk serial/WWN to be the same across the different gateways serving the
same image, and setting up multiple TPGs (target portal groups) for an image
but only enabling the TPG for the local node. This setup will use multipath
(MPIO) to provide HA.

Again, it is not a setup we support; you could try it yourself in a test
environment. You can also set up a test PetaSAN cluster and examine the LIO
configuration using targetcli. You can send me an email if you need any
clarifications.

Cheers /Maged
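[Editor's note] For anyone who wants to experiment with such a hand-rolled LIO export outside PetaSAN (explicitly unsupported, as Maged notes), a very rough single-portal sketch using the generic block backstore over a kernel-mapped RBD image could look like the following. IQNs, names and the IP are placeholders; the PetaSAN rbd backstore, the per-gateway TPG handling and the shared-WWN setup described above are not covered here:

# map the image with krbd and expose it through LIO (illustrative only)
rbd map rbd/test-image

targetcli /backstores/block create name=test-image dev=/dev/rbd0
targetcli /iscsi create iqn.2023-06.com.example:test-target
targetcli /iscsi/iqn.2023-06.com.example:test-target/tpg1/luns create /backstores/block/test-image
targetcli /iscsi/iqn.2023-06.com.example:test-target/tpg1/acls create iqn.1998-01.com.vmware:esx-host1
targetcli /iscsi/iqn.2023-06.com.example:test-target/tpg1/portals create 192.168.1.10 3260
targetcli saveconfig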
[ceph-users] Re: How does a "ceph orch restart SERVICE" affect availability?
Hi, Will that try to be smart and just restart a few at a time to keep things up and available. Or will it just trigger a restart everywhere simultaneously. basically, that's what happens for example during an upgrade if services are restarted. It's designed to be a rolling upgrade procedure so restarting all daemons of a specific service at the same time would cause an interruption. So the daemons are scheduled to restart and the mgr decides when it's safe to restart the next (this is a test cluster started on Nautilus, but it's on Quincy now): nautilus:~ # ceph orch restart osd.osd-hdd-ssd Scheduled to restart osd.5 on host 'nautilus' Scheduled to restart osd.0 on host 'nautilus' Scheduled to restart osd.2 on host 'nautilus' Scheduled to restart osd.1 on host 'nautilus2' Scheduled to restart osd.4 on host 'nautilus2' Scheduled to restart osd.7 on host 'nautilus2' Scheduled to restart osd.3 on host 'nautilus3' Scheduled to restart osd.8 on host 'nautilus3' Scheduled to restart osd.6 on host 'nautilus3' When it comes to OSDs it's possible (or even likely) that multiple OSDs are restarted at the same time, depending on the pools (and their replication size) they are part of. But ceph tries to avoid "inactive PGs" which is critical, of course. An edge case would be a pool with size 1 where restarting an OSD would cause an inactive PG until the OSD is up again. But since size 1 would be a bad idea anyway (except for testing purposes) you'd have to live with that. If you have the option I'd recommend to create a test cluster and play around with these things to get a better understanding, especially when it comes to upgrade tests etc. I guess in my current scenario, restarting one host at the time makes most sense, with a systemctl restart ceph-{fsid}.target and then checking that "ceph -s" says OK before proceeding to the next Yes, if your crush-failure-domain is host that should be safe, too. Regards, Eugen Zitat von Mikael Öhman : The documentation very briefly explains a few core commands for restarting things; https://docs.ceph.com/en/quincy/cephadm/operations/#starting-and-stopping-daemons but I feel I'm lacking quite some details of what is safe to do. I have a system in production, clusters connected via CephFS and some shared block devices. We would like to restart some things due to some new network configurations. Going daemon by daemon would take forever, so I'm curious as to what happens if one tries the command; ceph orch restart osd Will that try to be smart and just restart a few at a time to keep things up and available. Or will it just trigger a restart everywhere simultaneously. I guess in my current scenario, restarting one host at the time makes most sense, with a systemctl restart ceph-{fsid}.target and then checking that "ceph -s" says OK before proceeding to the next host, but I'm still curious as to what the "ceph orch restart xxx" command would do (but not enough to try it out in production) Best regards, Mikael Chalmers University of Technology ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
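[Editor's note] To make the host-by-host variant concrete, a rough sketch follows. Host names and the cluster fsid are placeholders, and waiting for HEALTH_OK is a conservative choice; you might instead accept warning states you already know about:

for host in node1 node2 node3; do
    ssh "$host" "systemctl restart ceph-<fsid>.target"
    # wait until the cluster settles before touching the next host
    until ceph health | grep -q HEALTH_OK; do
        sleep 30
    done
done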
[ceph-users] Re: OSDs cannot join cluster anymore
Hello Eugen, recovery and rebalancing was finished however now all PGs show missing OSDs. Everything looks like the PGs are missing OSDs although it finished correctly. As if we shut down the servers immediately. But we removed the nodes the way it is described in the documentation. We just added new disks and they join the cluster immediately. So the old OSDs removed from the cluster are available, I restored OSD.40 but it does not want to join the cluster. Following are the outputs of the mentioned commands: ceph -s cluster: id: X health: HEALTH_WARN 1 failed cephadm daemon(s) 1 filesystem is degraded 1 MDSs report slow metadata IOs 19 osds down 4 hosts (50 osds) down Reduced data availability: 1220 pgs inactive Degraded data redundancy: 132 pgs undersized services: mon: 3 daemons, quorum cephx02,cephx04,cephx06 (age 4m) mgr: cephx02.xx(active, since 92s), standbys: cephx04.yy, cephx06.zz mds: 2/2 daemons up, 2 standby osd: 130 osds: 78 up (since 13m), 97 in (since 35m); 171 remapped pgs rgw: 1 daemon active (1 hosts, 1 zones) data: volumes: 1/2 healthy, 1 recovering pools: 12 pools, 1345 pgs objects: 11.02k objects, 1.9 GiB usage: 145 TiB used, 669 TiB / 814 TiB avail pgs: 86.617% pgs unknown 4.089% pgs not active 39053/33069 objects misplaced (118.095%) 1165 unknown 77 active+undersized+remapped 55 undersized+remapped+peered 38 active+clean+remapped 10 active+clean ceph osd tree ID CLASS WEIGHT TYPE NAMESTATUS REWEIGHT PRI-AFF -214.36646 root ssds -610.87329 host cephx01-ssd 186ssd 0.87329 osd.186down 1.0 1.0 -760.87329 host cephx02-ssd 263ssd 0.87329 osd.263 up 1.0 1.0 -850.87329 host cephx04-ssd 237ssd 0.87329 osd.237 up 1.0 1.0 -880.87329 host cephx06-ssd 236ssd 0.87329 osd.236 up 1.0 1.0 -940.87329 host cephx08-ssd 262ssd 0.87329 osd.262down 1.0 1.0 -1 1347.07397 root default -62 261.93823 host cephx01 139hdd10.91409 osd.139down 0 1.0 140hdd10.91409 osd.140down 0 1.0 142hdd10.91409 osd.142down 0 1.0 144hdd10.91409 osd.144down 0 1.0 146hdd10.91409 osd.146down 0 1.0 148hdd10.91409 osd.148down 0 1.0 150hdd10.91409 osd.150down 0 1.0 152hdd10.91409 osd.152down 0 1.0 154hdd10.91409 osd.154down 1.0 1.0 156hdd10.91409 osd.156down 1.0 1.0 158hdd10.91409 osd.158down 1.0 1.0 160hdd10.91409 osd.160down 1.0 1.0 162hdd10.91409 osd.162down 1.0 1.0 164hdd10.91409 osd.164down 1.0 1.0 166hdd10.91409 osd.166down 1.0 1.0 168hdd10.91409 osd.168down 1.0 1.0 170hdd10.91409 osd.170down 1.0 1.0 172hdd10.91409 osd.172down 1.0 1.0 174hdd10.91409 osd.174down 1.0 1.0 176hdd10.91409 osd.176down 1.0 1.0 178hdd10.91409 osd.178down 1.0 1.0 180hdd10.91409 osd.180down 1.0 1.0 182hdd10.91409 osd.182down 1.0 1.0 184hdd10.91409 osd.184down 1.0 1.0 -67 261.93823 host cephx02 138hdd10.91409 osd.138 up 1.0 1.0 141hdd10.91409 osd.141 up 1.0 1.0 143hdd10.91409 osd.143 up 1.0 1.0 145hdd10.91409 osd.145 up 1.0 1.0 147hdd10.91409 osd.147 up 1.0 1.0 149hdd10.91409 osd.149 up 1.0 1.0 151hdd10.91409 osd.151 up 1.0 1.0 153hdd10.91409 osd.153 up 1.0 1.0 155hdd10.91409 osd.155 up 1.0 1.0 157
[ceph-users] Re: alerts in dashboard
Hi Ben,

also, if some alerts are noisy, we have an option in the dashboard to silence
those alerts. Also, can you provide the list of critical alerts that you see?

On Wed, 21 Jun 2023 at 12:48, Nizamudeen A wrote:
> Hi Ben,
>
> It looks like you forgot to attach the screenshots.
>
> Regards,
> Nizam
>
> On Wed, Jun 21, 2023, 12:23 Ben wrote:
>
> > Hi,
> >
> > I got many critical alerts in ceph dashboard. Meanwhile the cluster shows
> > health ok status.
> >
> > See attached screenshot for detail. My questions are, are they real alerts?
> > How to get rid of them?
> >
> > Thanks
> > Ben
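[Editor's note] For reference, silences can be created from the Alerts page in the dashboard or directly against the Alertmanager instance the dashboard points at. The example below is only illustrative: the alert name is one of the default Ceph alerts used as a stand-in, and the Alertmanager URL is a placeholder:

# find the Alertmanager endpoint the dashboard is configured to use
ceph dashboard get-alertmanager-api-host

# silence a specific (example) alert for 24 hours
amtool --alertmanager.url=http://alertmanager.example:9093 \
    silence add alertname="CephNodeNetworkPacketDrops" \
    --duration 24h --comment "known noisy alert" --author "ops"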
[ceph-users] Re: Recover OSDs from folder /var/lib/ceph/uuid/removed
Yes, I was missing the create step:

ceph osd create uuid id

This works!

Best,
Malte

Am 20.06.23 um 18:42 schrieb Malte Stroem:

Well, things I would do:

- add the keyring to ceph auth

ceph auth add osd.XX osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/uuid/osd.XX/keyring

- add OSD to crush

ceph osd crush set osd.XX 1.0 root=default ...

- create systemd service

systemctl enable ceph-u...@osd.xx.service

Is there something I am missing?

Best,
Malte

Am 20.06.23 um 18:04 schrieb Malte Stroem:

Hello,

is it possible to recover an OSD if it was removed?

The systemd service was removed but the block device is still listed under
lsblk and the config files are still available under /var/lib/ceph/uuid/removed

It is a containerized cluster. So I think we need to add the cephx entries,
use ceph-volume, crush, and so on.

Best regards,
Malte
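[Editor's note] Putting the whole restore sequence from this thread together, it looks roughly like the sketch below. The OSD id, OSD uuid, cluster fsid and hostname are placeholders, the caps are the ones used in this thread, and on a containerized cluster the activate step may need to go through cephadm (e.g. cephadm ceph-volume) rather than a host-installed ceph-volume:

# recreate the OSD entry in the osdmap with its original uuid and id
ceph osd create <osd-uuid> <osd-id>

# re-import the key with the caps used above
ceph auth add osd.<osd-id> osd 'allow *' mon 'allow rwx' \
    -i /var/lib/ceph/<fsid>/osd.<osd-id>/keyring

# put the OSD back into the CRUSH map under its host
ceph osd crush set osd.<osd-id> 1.0 root=default host=<hostname>

# activate the existing LVM volume and start the daemon again
ceph-volume lvm activate <osd-id> <osd-uuid>
systemctl enable --now ceph-<fsid>@osd.<osd-id>.service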
[ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs
Hi Carsten, please also note a workaround to bring the osds back for e.g. data recovery - set bluefs_shared_alloc_size to 32768. This will hopefully allow OSD to startup and pull data out of it. But I wouldn't discourage you from using such OSDs long term as fragmentation might evolve and this workaround will become ineffective as well. Please do not apply this change to healthy OSDs as it's irreversible. BTW, having two namespace at NVMe drive is a good alternative to Logical Volumes if for some reasons one needs two "physical" disks for OSD setup... Thanks, Igor On 21/06/2023 11:41, Carsten Grommel wrote: Hi Igor, thank you for your ansere! >first of all Quincy does have a fix for the issue, see >https://tracker.ceph.com/issues/53466 (and its Quincy counterpart >https://tracker.ceph.com/issues/58588) Thank you I somehow missed that release, good to know! >SSD or HDD? Standalone or shared DB volume? I presume the latter... What >is disk size and current utilization? > >Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if >possible We use 4 TB NVMe SSDs, shared db yes and mainly Micron with some Dell and Samsung in this cluster: Micron_7400_MTFDKCB3T8TDZ_214733D291B1 cloud5-1561:nvme5n1 osd.5 All Disks are at ~ 88% utilization. I noticed that around 92% our disks tend to run into this bug. Here are some bluefs-bdev-sizes from different OSDs on different hosts in this cluster: ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36/ inferring bluefs devices from bluestore path 1 : device size 0x37e3ec0 : using 0x2e1b390(2.9 TiB) ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-24/ inferring bluefs devices from bluestore path 1 : device size 0x37e3ec0 : using 0x2d4e318d000(2.8 TiB) ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5/ inferring bluefs devices from bluestore path 1 : device size 0x37e3ec0 : using 0x2f2da93d000(2.9 TiB) >Generally, given my assumption that DB volume is currently collocated >and you still want to stay on Pacific, you might want to consider >redeploying OSDs with a standalone DB volume setup. > >Just create large enough (2x of the current DB size seems to be pretty >conservative estimation for that volume's size) additional LV on top of >the same physical disk. And put DB there... > >Separating DB from main disk would result in much less fragmentation at >DB volume and hence work around the problem. The cost would be having >some extra spare space at DB volume unavailable for user data . I guess that makes, so the suggestion would be to deploy the osd and db on the same NVMe but with different logical volumes or updating to quincy. Thank you! Carsten *Von: *Igor Fedotov *Datum: *Dienstag, 20. Juni 2023 um 12:48 *An: *Carsten Grommel , ceph-users@ceph.io *Betreff: *Re: [ceph-users] Ceph Pacific bluefs enospc bug with newly created OSDs Hi Carsten, first of all Quincy does have a fix for the issue, see https://tracker.ceph.com/issues/53466 (and its Quincy counterpart https://tracker.ceph.com/issues/58588) Could you please share a bit more info on OSD disk layout? SSD or HDD? Standalone or shared DB volume? I presume the latter... What is disk size and current utilization? Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if possible Generally, given my assumption that DB volume is currently collocated and you still want to stay on Pacific, you might want to consider redeploying OSDs with a standalone DB volume setup. 
Just create large enough (2x of the current DB size seems to be pretty conservative estimation for that volume's size) additional LV on top of the same physical disk. And put DB there... Separating DB from main disk would result in much less fragmentation at DB volume and hence work around the problem. The cost would be having some extra spare space at DB volume unavailable for user data . Hope this helps, Igor On 20/06/2023 10:29, Carsten Grommel wrote: > Hi all, > > we are experiencing the “bluefs enospc bug” again after redeploying all OSDs of our Pacific Cluster. > I know that our cluster is a bit too utilized at the moment with 87.26 % raw usage but still this should not happen afaik. > We never hat this problem with previous ceph versions and right now I am kind of out of ideas at how to tackle these crashes. > Compacting the database did not help in the past either. > Redeploy seems to no help in the long run as well. For documentation I used these commands to redeploy the osds: > > systemctl stop ceph-osd@${OSDNUM} > ceph osd destroy --yes-i-really-mean-it ${OSDNUM} > blkdiscard ${DEVICE} > sgdisk -Z ${DEVICE} > dmsetup remove ${DMDEVICE} > ceph-volume lvm create --osd-id ${OSDNUM} --data ${DEVICE} > > Any ideas or possible solutions on this? I am not yet ready to upgrade our clusters to quincy, also I do presume that this bug is still present in quincy as well? > > Follow our clus
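[Editor's note] As a concrete example of the recovery workaround Igor describes, and only for an OSD that already fails with the enospc assert (as noted above, the change is irreversible and must not be applied to healthy OSDs), something like this could be used; the OSD id and the fsid in the unit name are placeholders:

# use the smaller 32k bluefs allocation unit for the affected OSD only
ceph config set osd.5 bluefs_shared_alloc_size 32768
# (alternatively, set it in the [osd.5] section of that host's ceph.conf)

# bring the OSD back up, then move the data off it
systemctl restart ceph-<fsid>@osd.5.service
ceph osd out 5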
[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake
Aaaand another dead end: there is too much meta-data involved (bucket and object ACLs, lifecycle, policy, …) that won’t be possible to perfectly migrate. Also, lifecycles _might_ be affected if mtimes change. So, I’m going to try and go back to a single-cluster multi-zone setup. For that I’m going to change all buckets with explicit placements to remove the explicit placement markers (those were created from old versions of Ceph and weren’t intentional by us, they perfectly reflect the default placement configuration). Here’s the patch I’m going to try on top of our Nautilus branch now: https://github.com/flyingcircusio/ceph/commit/b3a317987e50f089efc4e9694cf6e3d5d9c23bd5 All our buckets with explicit placements conform perfectly to the default placement, so this seems safe. Otherwise Zone migration was perfect until I noticed the objects with explicit placements in our staging and production clusters. (The dev cluster seems to have been purged intermediately, so this wasn’t noticed). I’m actually wondering whether explicit placements are really a sensible thing to have, even in multi-cluster multi-zone setups. AFAICT due to realms you might end up with different zonegroups referring to the same pools and this should only run through proper abstractions … o_O Cheers, Christian > On 14. Jun 2023, at 17:42, Christian Theune wrote: > > Hi, > > further note to self and for posterity … ;) > > This turned out to be a no-go as well, because you can’t silently switch the > pools to a different storage class: the objects will be found, but the index > still refers to the old storage class and lifecycle migrations won’t work. > > I’ve brainstormed for further options and it appears that the last resort is > to use placement targets and copy the buckets explicitly - twice, because on > Nautilus I don’t have renames available, yet. :( > > This will require temporary downtimes prohibiting users to access their > bucket. Fortunately we only have a few very large buckets (200T+) that will > take a while to copy. We can pre-sync them of course, so the downtime will > only be during the second copy. > > Christian > >> On 13. Jun 2023, at 14:52, Christian Theune wrote: >> >> Following up to myself and for posterity: >> >> I’m going to try to perform a switch here using (temporary) storage classes >> and renaming of the pools to ensure that I can quickly change the STANDARD >> class to a better EC pool and have new objects located there. After that >> we’ll add (temporary) lifecycle rules to all buckets to ensure their objects >> will be migrated to the STANDARD class. >> >> Once that is finished we should be able to delete the old pool and the >> temporary storage class. >> >> First tests appear successfull, but I’m a bit struggling to get the bucket >> rules working (apparently 0 days isn’t a real rule … and the debug interval >> setting causes high frequent LC runs but doesn’t seem move objects just yet. >> I’ll play around with that setting a bit more, though, I think I might have >> tripped something that only wants to process objects every so often and on >> an interval of 10 a day is still 2.4 hours … >> >> Cheers, >> Christian >> >>> On 9. Jun 2023, at 11:16, Christian Theune wrote: >>> >>> Hi, >>> >>> we are running a cluster that has been alive for a long time and we tread >>> carefully regarding updates. We are still a bit lagging and our cluster >>> (that started around Firefly) is currently at Nautilus. 
We’re updating and >>> we know we’re still behind, but we do keep running into challenges along >>> the way that typically are still unfixed on main and - as I started with - >>> have to tread carefully. >>> >>> Nevertheless, mistakes happen, and we found ourselves in this situation: we >>> converted our RGW data pool from replicated (n=3) to erasure coded (k=10, >>> m=3, with 17 hosts) but when doing the EC profile selection we missed that >>> our hosts are not evenly balanced (this is a growing cluster and some >>> machines have around 20TiB capacity for the RGW data pool, wheres newer >>> machines have around 160TiB and we rather should have gone with k=4, m=3. >>> In any case, having 13 chunks causes too many hosts to participate in each >>> object. Going for k+m=7 will allow distribution to be more effective as we >>> have 7 hosts that have the 160TiB sizing. >>> >>> Our original migration used the “cache tiering” approach, but that only >>> works once when moving from replicated to EC and can not be used for >>> further migrations. >>> >>> The amount of data is at 215TiB somewhat significant, so using an approach >>> that scales when copying data[1] to avoid ending up with months of >>> migration. >>> >>> I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on a >>> rados/pool level) and I guess we can only fix this on an application level >>> using multi-zone replication. >>> >>> I have the setup nailed in gen
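[Editor's note] For readers following along, the "temporary storage class plus lifecycle rule" attempt from the quoted mail translates to roughly the following. This is only a sketch of the idea (which, as described above, ultimately ran into index/storage-class issues): pool, class and bucket names are invented, zonegroup/zone changes need a period commit, and note the observation above that a 0-day transition is not accepted, hence Days=1:

# register a temporary storage class backed by the new EC pool
radosgw-admin zonegroup placement add --rgw-zonegroup default \
    --placement-id default-placement --storage-class TEMP_EC
radosgw-admin zone placement add --rgw-zone default \
    --placement-id default-placement --storage-class TEMP_EC \
    --data-pool default.rgw.buckets.data-ec-new
radosgw-admin period update --commit

# per-bucket lifecycle rule that transitions objects to the new class
aws s3api put-bucket-lifecycle-configuration --bucket big-bucket \
    --lifecycle-configuration '{"Rules":[{"ID":"migrate","Status":"Enabled",
      "Filter":{"Prefix":""},
      "Transitions":[{"Days":1,"StorageClass":"TEMP_EC"}]}]}'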
[ceph-users] Re: OSDs cannot join cluster anymore
I still can’t really grasp what might have happened here. But could you please clarify which of the down OSDs (or Hosts) are supposed to be down and which you’re trying to bring back online? Obviously osd.40 is one of your attempts. But what about the hosts cephx01 and cephx08? Are those the ones refusing to start their OSDs? And the remaining up OSDs you haven’t touched yet, correct? And regarding debug logs, you should set it with ceph config set because the local ceph.conf won’t have an effect. It could help to have the startup debug logs from one of the OSDs. Zitat von Malte Stroem : Hello Eugen, recovery and rebalancing was finished however now all PGs show missing OSDs. Everything looks like the PGs are missing OSDs although it finished correctly. As if we shut down the servers immediately. But we removed the nodes the way it is described in the documentation. We just added new disks and they join the cluster immediately. So the old OSDs removed from the cluster are available, I restored OSD.40 but it does not want to join the cluster. Following are the outputs of the mentioned commands: ceph -s cluster: id: X health: HEALTH_WARN 1 failed cephadm daemon(s) 1 filesystem is degraded 1 MDSs report slow metadata IOs 19 osds down 4 hosts (50 osds) down Reduced data availability: 1220 pgs inactive Degraded data redundancy: 132 pgs undersized services: mon: 3 daemons, quorum cephx02,cephx04,cephx06 (age 4m) mgr: cephx02.xx(active, since 92s), standbys: cephx04.yy, cephx06.zz mds: 2/2 daemons up, 2 standby osd: 130 osds: 78 up (since 13m), 97 in (since 35m); 171 remapped pgs rgw: 1 daemon active (1 hosts, 1 zones) data: volumes: 1/2 healthy, 1 recovering pools: 12 pools, 1345 pgs objects: 11.02k objects, 1.9 GiB usage: 145 TiB used, 669 TiB / 814 TiB avail pgs: 86.617% pgs unknown 4.089% pgs not active 39053/33069 objects misplaced (118.095%) 1165 unknown 77 active+undersized+remapped 55 undersized+remapped+peered 38 active+clean+remapped 10 active+clean ceph osd tree ID CLASS WEIGHT TYPE NAMESTATUS REWEIGHT PRI-AFF -214.36646 root ssds -610.87329 host cephx01-ssd 186ssd 0.87329 osd.186down 1.0 1.0 -760.87329 host cephx02-ssd 263ssd 0.87329 osd.263 up 1.0 1.0 -850.87329 host cephx04-ssd 237ssd 0.87329 osd.237 up 1.0 1.0 -880.87329 host cephx06-ssd 236ssd 0.87329 osd.236 up 1.0 1.0 -940.87329 host cephx08-ssd 262ssd 0.87329 osd.262down 1.0 1.0 -1 1347.07397 root default -62 261.93823 host cephx01 139hdd10.91409 osd.139down 0 1.0 140hdd10.91409 osd.140down 0 1.0 142hdd10.91409 osd.142down 0 1.0 144hdd10.91409 osd.144down 0 1.0 146hdd10.91409 osd.146down 0 1.0 148hdd10.91409 osd.148down 0 1.0 150hdd10.91409 osd.150down 0 1.0 152hdd10.91409 osd.152down 0 1.0 154hdd10.91409 osd.154down 1.0 1.0 156hdd10.91409 osd.156down 1.0 1.0 158hdd10.91409 osd.158down 1.0 1.0 160hdd10.91409 osd.160down 1.0 1.0 162hdd10.91409 osd.162down 1.0 1.0 164hdd10.91409 osd.164down 1.0 1.0 166hdd10.91409 osd.166down 1.0 1.0 168hdd10.91409 osd.168down 1.0 1.0 170hdd10.91409 osd.170down 1.0 1.0 172hdd10.91409 osd.172down 1.0 1.0 174hdd10.91409 osd.174down 1.0 1.0 176hdd10.91409 osd.176down 1.0 1.0 178hdd10.91409 osd.178down 1.0 1.0 180hdd10.91409 osd.180down 1.0 1.0 182hdd10.91409 osd.182down 1.0 1.0 184hdd10.91409 osd.184down 1.0 1.0 -67 261.93823 host cephx02 138hdd10.91409 osd.138 up 1.0 1.
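[Editor's note] To produce the startup logs Eugen is asking for, something along these lines should work; this is a sketch using osd.40 from the thread, the debug levels are examples, and the fsid in the unit name is a placeholder:

# raise debug levels centrally - as noted above, the local ceph.conf is not enough
ceph config set osd.40 debug_osd 10
ceph config set osd.40 debug_monc 20
ceph config set osd.40 debug_ms 1

# restart the daemon and capture its startup output
systemctl restart ceph-<fsid>@osd.40.service
cephadm logs --name osd.40 > osd.40-startup.log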
[ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs
Does quincy automatically switch existing things to 4k or do you need to do a new ost to get the 4k size? Thanks, Kevin From: Igor Fedotov Sent: Wednesday, June 21, 2023 5:56 AM To: Carsten Grommel; ceph-users@ceph.io Subject: [ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs Check twice before you click! This email originated from outside PNNL. Hi Carsten, please also note a workaround to bring the osds back for e.g. data recovery - set bluefs_shared_alloc_size to 32768. This will hopefully allow OSD to startup and pull data out of it. But I wouldn't discourage you from using such OSDs long term as fragmentation might evolve and this workaround will become ineffective as well. Please do not apply this change to healthy OSDs as it's irreversible. BTW, having two namespace at NVMe drive is a good alternative to Logical Volumes if for some reasons one needs two "physical" disks for OSD setup... Thanks, Igor On 21/06/2023 11:41, Carsten Grommel wrote: > > Hi Igor, > > thank you for your ansere! > > >first of all Quincy does have a fix for the issue, see > >https://tracker.ceph.com/issues/53466 (and its Quincy counterpart > >https://tracker.ceph.com/issues/58588) > > Thank you I somehow missed that release, good to know! > > >SSD or HDD? Standalone or shared DB volume? I presume the latter... What > >is disk size and current utilization? > > > >Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if > >possible > > We use 4 TB NVMe SSDs, shared db yes and mainly Micron with some Dell > and Samsung in this cluster: > > Micron_7400_MTFDKCB3T8TDZ_214733D291B1 cloud5-1561:nvme5n1 osd.5 > > All Disks are at ~ 88% utilization. I noticed that around 92% our > disks tend to run into this bug. > > Here are some bluefs-bdev-sizes from different OSDs on different hosts > in this cluster: > > ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36/ > > inferring bluefs devices from bluestore path > > 1 : device size 0x37e3ec0 : using 0x2e1b390(2.9 TiB) > > ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-24/ > > inferring bluefs devices from bluestore path > > 1 : device size 0x37e3ec0 : using 0x2d4e318d000(2.8 TiB) > > ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5/ > > inferring bluefs devices from bluestore path > > 1 : device size 0x37e3ec0 : using 0x2f2da93d000(2.9 TiB) > > >Generally, given my assumption that DB volume is currently collocated > >and you still want to stay on Pacific, you might want to consider > >redeploying OSDs with a standalone DB volume setup. > > > >Just create large enough (2x of the current DB size seems to be pretty > >conservative estimation for that volume's size) additional LV on top of > >the same physical disk. And put DB there... > > > >Separating DB from main disk would result in much less fragmentation at > >DB volume and hence work around the problem. The cost would be having > >some extra spare space at DB volume unavailable for user data . > > I guess that makes, so the suggestion would be to deploy the osd and > db on the same NVMe > > but with different logical volumes or updating to quincy. > > Thank you! > > Carsten > > *Von: *Igor Fedotov > *Datum: *Dienstag, 20. 
Juni 2023 um 12:48 > *An: *Carsten Grommel , ceph-users@ceph.io > > *Betreff: *Re: [ceph-users] Ceph Pacific bluefs enospc bug with newly > created OSDs > > Hi Carsten, > > first of all Quincy does have a fix for the issue, see > https://tracker.ceph.com/issues/53466 (and its Quincy counterpart > https://tracker.ceph.com/issues/58588) > > Could you please share a bit more info on OSD disk layout? > > SSD or HDD? Standalone or shared DB volume? I presume the latter... What > is disk size and current utilization? > > Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if > possible > > > Generally, given my assumption that DB volume is currently collocated > and you still want to stay on Pacific, you might want to consider > redeploying OSDs with a standalone DB volume setup. > > Just create large enough (2x of the current DB size seems to be pretty > conservative estimation for that volume's size) additional LV on top of > the same physical disk. And put DB there... > > Separating DB from main disk would result in much less fragmentation at > DB volume and hence work around the problem. The cost would be having > some extra spare space at DB volume unavailable for user data . > > > Hope this helps, > > Igor > > > On 20/06/2023 10:29, Carsten Grommel wrote: > > Hi all, > > > > we are experiencing the “bluefs enospc bug” again after redeploying > all OSDs of our Pacific Cluster. > > I know that our cluster is a bit too utilized at the moment with > 87.26 % raw usage but still this should not happen afaik. > > We never hat this problem with previous ceph versions and right now > I am kind of out of ideas at how to tackle these crashes. > >
[ceph-users] How to repair pg in failed_repair state?
A lot of pg in inconsistent state occurred. Most of them were repaired with ceph pg repair all, but in the case of 3 pg as shown below, it does not proceed further with failed_repair status. [root@cephvm1 ~]# ceph health detail HEALTH_ERR 30 scrub errors; Too many repaired reads on 7 OSDs; Possible data damage: 3 pgs inconsistent OSD_SCRUB_ERRORS 30 scrub errors OSD_TOO_MANY_REPAIRS Too many repaired reads on 7 OSDs osd.29 had 315 reads repaired osd.23 had 530 reads repaired osd.18 had 69 reads repaired osd.2 had 267 reads repaired osd.0 had 179 reads repaired osd.12 had 513 reads repaired osd.13 had 404 reads repaired PG_DAMAGED Possible data damage: 3 pgs inconsistent pg 2.2f is active+clean+inconsistent+failed_repair, acting [29,13,18] pg 2.46 is active+clean+inconsistent+failed_repair, acting [12,0,29] pg 2.5c is active+clean+inconsistent+failed_repair, acting [12,23,0] The query result of pg 2.2f is as follows, and the problem seems to be that the three peer versions are different. [root@cephvm1 ~]# ceph pg 2.2f query { "state": "active+clean+inconsistent+failed_repair", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 426, "up": [ 29, 13, 18 ], "acting": [ 29, 13, 18 ], "acting_recovery_backfill": [ "13", "18", "29" ], "info": { "pgid": "2.2f", "last_update": "426'128436680", "last_complete": "426'128436680", "log_tail": "390'128433627", "last_user_version": 128436529, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 111, "epoch_pool_created": 67, "last_epoch_started": 426, "last_interval_started": 425, "last_epoch_clean": 426, "last_interval_clean": 425, "last_epoch_split": 111, "last_epoch_marked_full": 0, "same_up_since": 425, "same_interval_since": 425, "same_primary_since": 425, "last_scrub": "426'128436680", "last_scrub_stamp": "2023-06-21 15:57:53.645395", "last_deep_scrub": "426'128436680", "last_deep_scrub_stamp": "2023-06-21 15:57:53.645395", "last_clean_scrub_stamp": "2023-03-28 09:11:29.298557" }, "stats": { "version": "426'128436680", "reported_seq": "128628939", "reported_epoch": "426", "state": "active+clean+inconsistent+failed_repair", "last_fresh": "2023-06-21 15:57:53.645450", "last_change": "2023-06-21 15:57:53.645450", "last_active": "2023-06-21 15:57:53.645450", "last_peered": "2023-06-21 15:57:53.645450", "last_clean": "2023-06-21 15:57:53.645450", "last_became_active": "2023-06-21 14:03:02.233710", "last_became_peered": "2023-06-21 14:03:02.233710", "last_unstale": "2023-06-21 15:57:53.645450", "last_undegraded": "2023-06-21 15:57:53.645450", "last_fullsized": "2023-06-21 15:57:53.645450", "mapping_epoch": 425, "log_start": "390'128433627", "ondisk_log_start": "390'128433627", "created": 111, "last_epoch_clean": 426, "parent": "0.0", "parent_split_bits": 7, "last_scrub": "426'128436680", "last_scrub_stamp": "2023-06-21 15:57:53.645395", "last_deep_scrub": "426'128436680", "last_deep_scrub_stamp": "2023-06-21 15:57:53.645395", "last_clean_scrub_stamp": "2023-03-28 09:11:29.298557", "log_size": 3053, "ondisk_log_size": 3053, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "manifest_stats_invalid": false, "snaptrimq_len": 0, "stat_sum": { "num_bytes": 10888387166, "num_objects": 2610, "num_object_clones": 0, "num_object_copies": 7830, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, 
"num_objects_dirty": 2610, "num_whiteouts": 0, "num_read": 191976, "num_read_kb": 10314827, "num_write": 128429383, "num_write_kb": 741542291, "num_scrub_errors": 3, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 3, "num_objects_recovered":
[ceph-users] Re: OSDs cannot join cluster anymore
On 6/21/23 11:20, Malte Stroem wrote: Hello Eugen, recovery and rebalancing was finished however now all PGs show missing OSDs. Everything looks like the PGs are missing OSDs although it finished correctly. As if we shut down the servers immediately. But we removed the nodes the way it is described in the documentation. We just added new disks and they join the cluster immediately. So the old OSDs removed from the cluster are available, I restored OSD.40 but it does not want to join the cluster. Are the osd.$id keys still there of the removed OSDs (check with ceph auth list)? Otherwise you might need to import the keyring into the cluster (/var/lib/ceph/osd/ceph-$id/keyring) and provide it proper CAPS. Gr. Stefan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
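[Editor's note] In practice that check/re-import could look like this, using osd.40 from the thread; the caps shown are the standard OSD profile and may need to be adjusted to the cluster's setup:

# does the key still exist in the cluster?
ceph auth get osd.40

# if not, re-import it from the on-disk keyring with the usual OSD caps
ceph auth add osd.40 \
    mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' \
    -i /var/lib/ceph/osd/ceph-40/keyring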