Hi,
I took a snapshot of MDS.0's logs. We have five active MDS in total,
each one reporting laggy OSDs/clients, but I cannot find anything
related to that in the log snippet. Anyhow, I uploaded the log for your
reference with ceph-post-file ID 79b5138b-61d7-4ba7-b0a9-c6f02f47b881.
This is wh
Hi!
Can you share OSD logs demonstrating such a restart?
Thanks,
Igor
On 20/09/2023 20:16, sbeng...@gmail.com wrote:
Since upgrading to 18.2.0, OSDs are restarting very frequently due to
liveness probe failures, making the cluster unusable. Has anyone else seen this
behavior?
Upgrade path:
Hi,
After a power outage on my test Ceph cluster, 2 OSDs fail to restart.
The log file shows:
8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[
Hi Patrick,
please share the OSD restart log to investigate that.
Thanks,
Igor
On 21/09/2023 13:41, Patrick Begou wrote:
Hi,
After a power outage on my test Ceph cluster, 2 OSDs fail to restart.
The log file shows:
8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02
Hello,
I have a problem with an OSD not starting after being mounted offline using the
ceph-objectstore-tool --op fuse command.
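For reference, a typical offline fuse mount of that kind looks roughly like the
following (the data path and mountpoint here are assumed, not taken from this
report):
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op fuse --mountpoint /mnt/osd.0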
cephadm orch ps now shows the OSD in an error state:
osd.0 storage1 error 2m ago 5h - 4096M
If I'm checking
Hi all,
I replaced a disk in our octopus cluster and it is rebuilding. I noticed that
since the replacement there is no scrubbing going on. Apparently, an OSD having
a PG in backfill_wait state seems to block deep scrubbing of all other PGs on
that OSD as well - at least that is how it looks.
Som
Hi Igor,
the ceph-osd.2.log remains empty on the node where this OSD is located.
This is what I get when manually restarting the OSD.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8
Maybe execute systemctl reset-failed <...> or even restart the node?
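For example, with the unit name from the log excerpt above:
# systemctl reset-failed ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service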
On 21/09/2023 14:26, Patrick Begou wrote:
Hi Igor,
the ceph-osd.2.log remains empty on the node where this OSD is
located. This is what I get when manually restarting the OSD.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf
I have a use case where I want to only use a small portion of the disk for
the OSD and the documentation states that I can use
data_allocation_fraction [1].
But cephadm cannot use this and throws this error:
/usr/bin/podman: stderr ceph-volume lvm batch: error: unrecognized
arguments: --data-alloc
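For context, the OSD service spec being attempted would look roughly like this
(the service id, placement and fraction value are made-up illustrations), applied
with something like:
# cat <<EOF | ceph orch apply -i -
service_type: osd
service_id: small_osds
placement:
  host_pattern: '*'
spec:
  data_devices:
    all: true
  data_allocation_fraction: 0.2
EOF
As the follow-up below notes, this currently fails on Pacific because the
ceph-volume side of the feature was never backported.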
Looks like the orchestration-side support for this got brought into Pacific
with the rest of the drive group stuff, but the actual underlying feature
in ceph-volume (from https://github.com/ceph/ceph/pull/40659) never got a
Pacific backport. I've opened the backport now
https://github.com/ceph/ceph/
Hi Igor,
a "systemctl reset-failed" doesn't restart the osd.
I rebooted the node and now it shows some errors on the HDD:
[ 107.716769] ata3.00: exception Emask 0x0 SAct 0x80 SErr 0x0
action 0x0
[ 107.716782] ata3.00: irq_stat 0x4008
[ 107.716787] ata3.00: failed command: READ FPDMA QU
Hi Patrick,
It seems your disk or controller is damaged. Are the other disks connected
to the same controller working OK? If so, I'd say the disk is dead.
Cheers
On 21/9/23 at 16:17, Patrick Begou wrote:
Hi Igor,
a "systemctl reset-failed" doesn't restart the osd.
I rebooted the node and now
Hi,
Since the recent update to 16.2.14-1~bpo11+1 on Debian Bullseye I've
started seeing OSD crashes being registered almost daily across all six
physical machines (6xOSD disks per machine). There's a --block-db for
each OSD on an LV from an NVMe.
If anyone has any idea what might be causing t
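One way to list and inspect the crashes that have been registered is the crash
module, e.g.:
# ceph crash ls-new
# ceph crash info <crash-id>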
On Thu, Sep 21, 2023 at 1:57 PM Frank Schilder wrote:
>
> Hi all,
>
> I replaced a disk in our octopus cluster and it is rebuilding. I noticed that
> since the replacement there is no scrubbing going on. Apparently, an OSD
> having a PG in backfill_wait state seems to block deep scrubbing all ot
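One setting that may be worth checking here (an assumption on my part, not
necessarily what this thread concluded) is osd_scrub_during_recovery, which by
default prevents scrubs on OSDs that are busy with recovery/backfill:
# ceph config get osd osd_scrub_during_recovery
# ceph config set osd osd_scrub_during_recovery true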
Hi Eneko,
I have not worked on the Ceph cluster since my last email (I was doing some user
support), and now osd.2 is back in the cluster:
-7 0.68217 host mostha1
2 hdd 0.22739 osd.2 up 1.0 1.0
5 hdd 0.45479 osd.5 up 1.
Thanks!
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Mykola Golub
Sent: Thursday, September 21, 2023 4:53 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] backfill_wait preventing deep scr
Hi Luke,
highly likely this is caused by the issue covered in
https://tracker.ceph.com/issues/53906
Unfortunately it looks like we missed the proper backport in Pacific.
You can apparently work around the issue by setting
'bluestore_volume_selection_policy' config parameter to rocksdb_original.
T
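For example, to apply it to all OSDs (they will likely need a restart to pick up
the new policy):
# ceph config set osd bluestore_volume_selection_policy rocksdb_original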
Hi Casey,
This is indeed a multisite setup. The other side shows that for
# radosgw-admin sync status
the oldest incremental change not applied is about a minute old, and that is
consistent over a number of minutes: the oldest incremental change is always a
minute or two old.
However:
# radosgw-
On Thu, Sep 21, 2023 at 12:21 PM Christopher Durham wrote:
>
>
> Hi Casey,
>
> This is indeed a multisite setup. The other side shows that for
>
> # radosgw-admin sync status
>
> the oldest incremental change not applied is about a minute old, and that is
> consistent over a number of minutes, al
Casey,
What I will probably do is:
1. stop usage of that bucket
2. wait a few minutes to allow anything to replicate, and verify object count, etc.
3. bilog trim (sketched below)
After #3 I will see if any of the '/' objects still exist.
Hopefully that will help. I now know what to look for to see if I can narro
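For reference, the trim in step 3 would be roughly (the bucket name is a
placeholder):
# radosgw-admin bilog trim --bucket=<bucket>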
Hi Community,
I recently proposed a new authorization mechanism for RGW that can let the
RGW daemon ask an external service to authorize a request based on AWS S3
IAM tags (meaning the external service would receive the same environment that
an IAM policy document would have available to evaluate the policy).
You can f
If there is nothing obvious in the OSD logs such as failing to start, and
if the OSDs appear to be running until the liveness probe restarts them,
you could disable or change the timeouts on the liveness probe. See
https://rook.io/docs/rook/latest/CRDs/Cluster/ceph-cluster-crd/#health-settings
.
B
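A minimal sketch of disabling the OSD liveness probe via the CephCluster CR
(namespace and cluster name assumed to be the Rook defaults):
# kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
    -p '{"spec":{"healthCheck":{"livenessProbe":{"osd":{"disabled":true}}}}}'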
Igor, Travis,
Thanks for your attention to this issue.
We extended the timeout for the liveness probe yesterday, and also extended
the time after which a down OSD deployment is deleted by the operator. Once
all the OSD deployments were recreated by the operator, we observed two OSD
restarts - whi
Hi Ceph users and developers,
Big thanks to Cory Snyder and Jonas Sterr for sharing your insights with an
audience of 50+ users and developers!
Cory shared some valuable troubleshooting tools and tricks that would be
helpful for anyone interested in gathering good debugging info.
See his presenta
Hi,
A question, to avoid using a too elaborate method for finding the most recent
snapshot of an RBD image.
So, what would be the preferred way to find the latest snapshot of this image?
root@hvs001:/# rbd snap ls libvirt-pool/CmsrvDOM2-MULTIMEDIA
SNAPID NAME SIZE PROTECTED TIMESTAMP
223
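One lightweight approach (assuming jq is available and relying on snapshot IDs
increasing over time) is to take the JSON output and pick the entry with the
highest id, or alternatively the newest timestamp:
# rbd snap ls libvirt-pool/CmsrvDOM2-MULTIMEDIA --format json | jq -r 'max_by(.id).name'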