[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-21 Thread Janek Bevendorff
Hi, I took a snapshot of MDS.0's logs. We have five active MDS in total, each one reporting laggy OSDs/clients, but I cannot find anything related to that in the log snippet. Anyhow, I uploaded the log for your reference with ceph-post-file ID 79b5138b-61d7-4ba7-b0a9-c6f02f47b881. This is wh
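For reference, a minimal sketch of how such a log is typically handed over with ceph-post-file (path and description below are illustrative; only the returned ID is shared on the list):
  # ceph-post-file -d 'MDS reporting laggy OSDs/clients, ceph-users thread' /var/log/ceph/ceph-mds.mds0.log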

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Igor Fedotov
Hi! Can you share OSD logs demonstrating such a restart? Thanks, Igor On 20/09/2023 20:16, sbeng...@gmail.com wrote: Since upgrading to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures, making the cluster unusable. Has anyone else seen this behavior? Upgrade path:
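If this is a Rook deployment (as later replies in this thread suggest), the log of the container instance that the probe killed can usually be pulled with --previous; the namespace, deployment and container names below are assumptions:
  # kubectl -n rook-ceph logs deploy/rook-ceph-osd-0 -c osd --previous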

[ceph-users] After power outage, osd do not restart

2023-09-21 Thread Patrick Begou
Hi, After a power outage on my test ceph cluster, 2 OSDs fail to restart. The log file shows: 8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'. Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c. Sep 21 11:55:12 mostha1 systemd[
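A sketch of how the failed unit can be inspected on that node (the unit name is the one appearing later in this thread; adjust the time window as needed):
  # systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
  # journalctl -u ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service --since '2023-09-21 11:50'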

[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Igor Fedotov
Hi Patrick, please share the OSD restart log to investigate that. Thanks, Igor On 21/09/2023 13:41, Patrick Begou wrote: Hi, After a power outage on my test ceph cluster, 2 OSDs fail to restart. The log file shows: 8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'. Sep 21 11:55:02

[ceph-users] OSD not starting after being mounted with ceph-objectstore-tool --op fuse

2023-09-21 Thread Budai Laszlo
Hello, I have a problem with an OSD not starting after being mounted offline using the ceph-objectstore-tool --op fuse command. The cephadm orch ps now shows me the osd in error state: osd.0   storage1   error 2m ago   5h    -    4096M  If I'm checking
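For context, a sketch of how such an offline fuse mount is usually done, and how the daemon can be restarted afterwards; the fsid, data path and mountpoint are assumptions for a cephadm deployment:
  # systemctl stop ceph-<fsid>@osd.0.service
  # ceph-objectstore-tool --data-path /var/lib/ceph/<fsid>/osd.0 --op fuse --mountpoint /mnt/osd.0
  ... inspect objects under /mnt/osd.0, then unmount ...
  # ceph orch daemon restart osd.0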

[ceph-users] backfill_wait preventing deep scrubs

2023-09-21 Thread Frank Schilder
Hi all, I replaced a disk in our octopus cluster and it is rebuilding. I noticed that since the replacement there is no scrubbing going on. Apparently, an OSD having a PG in backfill_wait state seems to block deep scrubbing of all other PGs on that OSD as well - at least this is how it looks. Som
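A sketch of what can be checked to confirm the observation (the config option below defaults to false and is one plausible explanation, not a confirmed diagnosis):
  # ceph pg dump pgs | grep -c backfill_wait
  # ceph config get osd osd_scrub_during_recovery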

[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Patrick Begou
Hi Igor, the ceph-osd.2.log remains empty on the node where this osd is located. This is what I get when manually restarting the osd. [root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service Job for ceph-250f9864-0142-11ee-8

[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Igor Fedotov
Maybe execute systemctl reset-failed <...> or even restart the node? On 21/09/2023 14:26, Patrick Begou wrote: Hi Igor, the ceph-osd.2.log remains empty on the node where this osd is located. This is what I get when manually restarting the osd. [root@mostha1 250f9864-0142-11ee-8e5f-00266cf
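For reference, the suggested reset would look roughly like this (unit name taken from the earlier message in this thread):
  # systemctl reset-failed ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
  # systemctl start ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service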

[ceph-users] ceph orch osd data_allocate_fraction does not work

2023-09-21 Thread Boris Behrens
I have a use case where I want to use only a small portion of the disk for the OSD, and the documentation states that I can use data_allocate_fraction [1]. But cephadm cannot use this and throws this error: /usr/bin/podman: stderr ceph-volume lvm batch: error: unrecognized arguments: --data-alloc
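For context, a sketch of the kind of OSD service spec this refers to (placing data_allocate_fraction at the top level of the spec is my assumption from the drive group documentation; as described in the reply below, ceph-volume on Pacific rejects the resulting flag):
  # cat small-osd.yaml
  service_type: osd
  service_id: small-osd
  placement:
    host_pattern: '*'
  data_devices:
    all: true
  data_allocate_fraction: 0.1
  # ceph orch apply -i small-osd.yaml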

[ceph-users] Re: ceph orch osd data_allocate_fraction does not work

2023-09-21 Thread Adam King
Looks like the orchestration-side support for this got brought into pacific with the rest of the drive group stuff, but the actual underlying feature in ceph-volume (from https://github.com/ceph/ceph/pull/40659) never got a pacific backport. I've opened the backport now https://github.com/ceph/

[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Patrick Begou
Hi Igor, a "systemctl reset-failed" doesn't restart the osd. I reboot the node and now it show some error on the HDD: [  107.716769] ata3.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x0 [  107.716782] ata3.00: irq_stat 0x4008 [  107.716787] ata3.00: failed command: READ FPDMA QU

[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Eneko Lacunza
Hi Patrick, It seems your disk or controller is damaged. Are other disks connected to the same controller working ok? If so, I'd say the disk is dead. Cheers On 21/9/23 at 16:17, Patrick Begou wrote: Hi Igor, a "systemctl reset-failed" doesn't restart the osd. I rebooted the node and now

[ceph-users] Recently started OSD crashes (or messages thereof)

2023-09-21 Thread Luke Hall
Hi, Since the recent update to 16.2.14-1~bpo11+1 on Debian Bullseye, I've started seeing OSD crashes being registered almost daily across all six physical machines (6xOSD disks per machine). There's a --block-db for each osd on an LV from an NVMe. If anyone has any idea what might be causing t

[ceph-users] Re: backfill_wait preventing deep scrubs

2023-09-21 Thread Mykola Golub
On Thu, Sep 21, 2023 at 1:57 PM Frank Schilder wrote: > > Hi all, > > I replaced a disk in our octopus cluster and it is rebuilding. I noticed that > since the replacement there is no scrubbing going on. Apparently, an OSD > having a PG in backfill_wait state seems to block deep scrubbing of all ot

[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Patrick Begou
Hi Eneko, I have not worked on the ceph cluster since my last email (I was doing some user support) and now osd.2 is back in the cluster:  -7 0.68217  host mostha1   2    hdd  0.22739  osd.2   up   1.0  1.0   5    hdd  0.45479  osd.5   up   1.

[ceph-users] Re: backfill_wait preventing deep scrubs

2023-09-21 Thread Frank Schilder
Thanks! Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mykola Golub Sent: Thursday, September 21, 2023 4:53 PM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] backfill_wait preventing deep scr

[ceph-users] Re: Recently started OSD crashes (or messages thereof)

2023-09-21 Thread Igor Fedotov
Hi Luke, this is highly likely caused by the issue covered at https://tracker.ceph.com/issues/53906 Unfortunately it looks like we missed the proper backport in Pacific. You can apparently work around the issue by setting the 'bluestore_volume_selection_policy' config parameter to rocksdb_original. T
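The suggested workaround would be applied roughly like this (the option is read at OSD startup, so restarting the OSDs afterwards is my assumption; the restart command shown is for a package-based deployment):
  # ceph config set osd bluestore_volume_selection_policy rocksdb_original
  # systemctl restart ceph-osd.target    (on each OSD host, one host at a time)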

[ceph-users] Re: millions of hex 80 0_0000 omap keys in single index shard for single bucket

2023-09-21 Thread Christopher Durham
Hi Casey, This is indeed a multisite setup. The other side shows that for # radosgw-admin sync status the oldest incremental change not applied is about a minute old, and that is consistent over a number of minutes, always with the oldest incremental change a minute or two old. However: # radosgw-

[ceph-users] Re: millions of hex 80 0_0000 omap keys in single index shard for single bucket

2023-09-21 Thread Casey Bodley
On Thu, Sep 21, 2023 at 12:21 PM Christopher Durham wrote: > > > Hi Casey, > > This is indeed a multisite setup. The other side shows that for > > # radosgw-admin sync status > > the oldest incremental change not applied is about a minute old, and that is > consistent over a number of minutes, al

[ceph-users] Re: millions of hex 80 0_0000 omap keys in single index shard for single bucket

2023-09-21 Thread Christopher Durham
Casey, What I will probably do is: 1. stop usage of that bucket 2. wait a few minutes to allow anything to replicate, and verify object count, etc. 3. bilog trim After #3 I will see if any of the '/' objects still exist. Hopefully that will help. I now know what to look for to see if I can narro
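A sketch of step 3 (the bucket name is a placeholder; trimming the bilog is irreversible, so verifying sync status first, as in step 2, matters):
  # radosgw-admin sync status
  # radosgw-admin bilog trim --bucket=<bucket-name>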

[ceph-users] RGW External IAM Authorization

2023-09-21 Thread Seena Fallah
Hi Community, I recently proposed a new authorization mechanism for RGW that can let the RGW daemon ask an external service to authorize a request based on AWS S3 IAM tags (that means the external service would receive the same env as an IAM policy doc would have to evaluate the policy). You can f

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Travis Nielsen
If there is nothing obvious in the OSD logs such as failing to start, and if the OSDs appear to be running until the liveness probe restarts them, you could disable or change the timeouts on the liveness probe. See https://rook.io/docs/rook/latest/CRDs/Cluster/ceph-cluster-crd/#health-settings . B
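For reference, a sketch of disabling the OSD liveness probe on a Rook cluster via a patch (namespace, cluster name and the exact field path are assumptions based on the linked docs):
  # kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
      -p '{"spec":{"healthCheck":{"livenessProbe":{"osd":{"disabled":true}}}}}'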

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Sudhin Bengeri
Igor, Travis, Thanks for your attention to this issue. We extended the timeout for the liveness probe yesterday, and also extended the time after which a down OSD deployment is deleted by the operator. Once all the OSD deployments were recreated by the operator, we observed two OSD restarts - whi

[ceph-users] Re: Join us for the User + Dev Relaunch, happening this Thursday!

2023-09-21 Thread Laura Flores
Hi Ceph users and developers, Big thanks to Cory Snyder and Jonas Sterr for sharing your insights with an audience of 50+ users and developers! Cory shared some valuable troubleshooting tools and tricks that would be helpful for anyone interested in gathering good debugging info. See his presenta

[ceph-users] Querying the most recent snapshot

2023-09-21 Thread Dominique Ramaekers
Hi, A question to avoid using a too elaborate method for finding the most recent snapshot of an RBD image. So, what would be the preferred way to find the latest snapshot of this image? root@hvs001:/# rbd snap ls libvirt-pool/CmsrvDOM2-MULTIMEDIA SNAPID NAME SIZE PROTECTED TIMESTAMP 223
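One lightweight approach (a sketch; it relies on snapshot IDs increasing monotonically and on jq being available):
  # rbd snap ls libvirt-pool/CmsrvDOM2-MULTIMEDIA --format json | jq -r 'max_by(.id).name'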