Here's the reason they exit:
7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.00 seconds, shutting down
If an osd flaps (marked down, then up) 6 times in 10 minutes, it
exits. (This is a safety measure).
It's normally caused by a network
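If you need more headroom while you chase the underlying network issue, the
thresholds are tunable. A minimal sketch, assuming the current defaults of 5
flaps per 600 seconds (the values below are only an illustration):

  ceph config set osd osd_max_markdown_count 10
  ceph config set osd osd_max_markdown_period 600

Raising the count only hides the flapping, so treat it as a stopgap while
you find the real cause.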
Hi!
I run a cephadm-based 16.2.x cluster in production. It's been mostly fine,
but not without quirks. Hope this helps.
/Z
On Tue, Mar 8, 2022 at 6:17 AM norman.kern wrote:
> Dear Ceph folks,
>
> Is anyone using cephadm in production (version: Pacific)? I found several bugs
> on it and
> I really do
Yes, this is something we know about, and we disabled it because we ran into
the problem that PGs became unavailable when two or more OSDs went offline.
I am searching for the reason WHY this happens.
Currently we have set the service file to Restart=always and removed
StartLimitBurst from the service
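For reference, a systemd drop-in along these lines achieves that; the unit
name and path are illustrative and will differ on cephadm-managed hosts:

  # /etc/systemd/system/ceph-osd@.service.d/override.conf
  [Unit]
  StartLimitBurst=0

  [Service]
  Restart=always

followed by "systemctl daemon-reload". Keep in mind that with the rate limit
disabled, a genuinely broken OSD will keep restarting forever, so watch the
logs.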
Hi,
We also had this kind of problem after upgrading to Octopus. Maybe you
can play with the heartbeat grace time (
https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/
) to tell OSDs to wait a little longer before declaring another OSD down!
We also try to fix the problem
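For example, raising the grace period from its default of 20 seconds could
look like this (the value is only an illustration; the option is read by
both the monitors and the OSDs, hence both commands):

  ceph config set osd osd_heartbeat_grace 30
  ceph config set mon osd_heartbeat_grace 30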
Hi,
I just upgraded a small test cluster on Raspberry Pis from Pacific 16.2.6 to
16.2.7.
The upgrade went without major problems.
But now the Ceph Dashboard doesn't work anymore in Safari.
It complains about main..js "Line 3 invalid regular expression: invalid
group specifier name".
It works with
Proxmox = 6.4-8
CEPH = 15.2.15
Nodes = 3
Network = 2x100G / node
Disk = nvme Samsung PM-1733 MZWLJ3T8HBLS 4TB
nvme Samsung PM-1733 MZWLJ1T9HBJR 2TB
CPU = EPYC 7252
CEPH pools = 2 separate pools for each disk type and each disk split into 2
OSDs
Replica = 3
VM don't do many
Replying to myself :)
It seems to be this function:
replaceBraces(e) {
==>   return e.replace(/(?<=\d)\s*-\s*(?=\d)/g, "..").
        replace(/\(/g, "{").
        replace(/\)/g, "}").
The lookbehind assertion (?<=\d) is most likely what Safari is choking on:
its regex engine does not support lookbehind, and "invalid group specifier
name" is the error it raises for it.
We have an old Ceph cluster which is running fine without any problems with
cephadm and Pacific (16.2.7) on Ubuntu (it was originally deployed without
cephadm).
Now I am trying to set up one more cluster on CentOS Stream 8 with
cephadm, and the containers are killed or stopped for no apparent reason.
On Tue, Mar 8
>
> VMs don't do many writes, and I migrated the main testing VMs to the 2TB
> pool, which in turn fragments faster.
>
>
> Did a lot of tests and recreated pools and OSDs in many ways, but in a
> matter of days every time each OSD gets severely fragmented and loses up
> to 80% of write performance (tes
>
> We have an old ceph cluster, which is running fine without any problems
> with cephadm and pacific (16.2.7) on Ubuntu (which was deployed without
> using cephadm).
>
> Now, I am trying to setup one more cluster on CentOS Stream 8 with
> cephadm, containers are killed or stopped for no reaso
>
> Can't imagine there is no reason. Anyway I think there is a general
> misconception that using containers would make it easier for users.
>
> ceph = learn linux sysadmin + learn ceph
> cephadm = learn linux sysadmin + learn ceph + learn containers
>
Oh forgot ;)
croit ceph = learn not
Hi Francois,
thanks for the reminder. We offline-compacted all of the OSDs when we
reinstalled the hosts with the new OS.
But actually reinstalling them was never on my list.
I could try that, and in the same go I can remove all the cache SSDs (when
one SSD shares the cache for 10 OSDs this is a ho
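For anyone wondering, offline compaction of a BlueStore OSD is typically
done roughly like this (OSD id and path are placeholders; on cephadm hosts
the unit names differ and the tool has to be run inside a cephadm shell):

  systemctl stop ceph-osd@<id>
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
  systemctl start ceph-osd@<id>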
Hi,
The last 2 OSDs I recreated were on December 30 and February 8.
I totally agree that SSD caches are a terrible SPOF. I think that's an
option if you use 1 SSD/NVMe for 1 or 2 OSDs, but the cost is then very
high. Using 1 SSD for 10 OSDs increases the risk for almost no gain
because the SSD is
> Where is the rados bench before and after your problem?
Rados bench before deleting OSDs and recreating them + syncing, with
fragmentation 0.89:
> T1 = wr,4M:      Total time run 60.0405
> T2 = ro,seq,4M:  Total time run 250.486
> T3 = ro,rand,4M: Total time run 600.463
> Total writes made
> T1 = wr,4M
> Total time run       60.0405
> Total writes made    9997
> Write size           4194304
> Object size          4194304
> Bandwidth (MB/sec)   666.017
> Stddev Bandwidth     24.1108
> Max
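For anyone who wants to reproduce this comparison, the three runs above
roughly correspond to the following rados bench invocations (pool name is a
placeholder; --no-cleanup keeps the written objects around so the read tests
have something to read):

  rados bench -p <pool> 60 write --no-cleanup
  rados bench -p <pool> 60 seq
  rados bench -p <pool> 60 rand
  rados -p <pool> cleanup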
We use it without major issues, at this point. There are still flaws, but
there are flaws in almost any deployment and management system, and this is
not unique to cephadm. I agree with the general sentiment that you need to
have some knowledge about containers, however. I don't think that's
necess
Hi,
This was already fixed in master/quincy, but the Pacific backport was never
completed. I just did that: https://github.com/ceph/ceph/pull/45301 (it
should be there for 16.2.8).
Kind Regards,
Ernesto
On Tue, Mar 8, 2022 at 3:55 PM Jozef Rebjak wrote:
Thanks Eugen,
Yeah, unfortunately the OSDs have been replaced with new OSDs. Currently the
cluster is rebalancing. I was thinking that I would try the
'osd_find_best_info_ignore_history_les' trick after the cluster has calmed
down and there is no extra traffic on the OSDs.
Thing is ..
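For context, that trick usually means temporarily setting the flag on the
affected OSD(s), letting the PG peer, and then removing it again. It can
discard writes, so it should only ever be a last resort. A sketch (the OSD
id is a placeholder, and the OSD may need a restart for the setting to take
effect):

  ceph config set osd.<id> osd_find_best_info_ignore_history_les true
  # wait for the PG to peer and go active, then remove the override
  ceph config rm osd.<id> osd_find_best_info_ignore_history_les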
Unexpectedly, everything disappeared and the cluster health went back to
its previous state!
I think I’ll never have a definitive answer ^^
I’ve been able to find a really nice way to get the rbd stats/iotop into
our Prometheus using the mgr plugin too, and it's awesome as we can now
better chase
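In case it helps others, enabling per-image RBD metrics in the prometheus
mgr module looks roughly like this (pool names are placeholders; check the
mgr/prometheus documentation for your release):

  ceph mgr module enable prometheus
  ceph config set mgr mgr/prometheus/rbd_stats_pools "<pool1>,<pool2>"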
It has taken me too long to reply to you. I just wanted to say thanks - this
was very helpful and answered my question. Thanks for taking the time to
provide this information.
--
Mark Selby
Sr Linux Administrator, The Voleon Group
mse...@voleon.com
I am not sure that what I would like to do is even possible. I was hoping there
is someone out there who could chime in on this.
We use Ceph RBD and Ceph FS somewhat extensively and are starting on our RGW
journey.
We have a couple of different groups that would like to be their own tenan
We are starting to test out Ceph RGW and have run into a small issue with the
aws-cli that amazon publishes. We have a set of developers who use the aws-cli
heavily and it seems that this tool does not work with Ceph RGW tenancy.
Given user = test01$test01 with bucket buck01
Given user = tes
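For reference, RGW multi-tenancy exposes a tenanted user's bucket on the S3
API as "tenant:bucket", so the kind of call in question looks roughly like
this (endpoint and credentials are placeholders); the colon in the bucket
name is presumably what the aws-cli objects to:

  aws --endpoint-url https://rgw.example.com s3 ls s3://test01:buck01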
Hi Mark,
On Wed, Mar 9, 2022 at 6:57 AM Mark Selby wrote:
> I am not sure that what I would like to do is even possible. I was hoping
> there is someone out there who could chime in on this.
>
> We use Ceph RBD and Ceph FS somewhat extensively and are starting on our
> RGW journey.
>
> W
Alternatively, if you want to restrict access to S3 resources for different
groups of users, you can do so by creating a role in a tenant, creating the
S3 resources and attaching tags to them, and then using ABAC/tags to allow a
user to access a particular resource (bucket/object). Details can
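A very rough sketch of the first step, with an illustrative role name,
tenant and trust-policy file (none of these are taken from the thread):

  radosgw-admin role create --role-name=S3Access --tenant=test01 \
    --assume-role-policy-doc="$(cat trust-policy.json)"

where trust-policy.json is a standard STS trust policy listing who may
assume the role; the tag-based conditions then go into the permission
policy attached to the role.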
Ok, some progress…
I'm describing what I did here, hopefully it will help someone who ended up
in the same predicament.
I used "ceph-objectstore-tool … --op mark-complete" to mark the incomplete
PGs as complete on the primary OSD, and then brought the OSD up. The
incomplete PG now has a state
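For anyone in the same spot, the full invocation is along these lines (OSD
id, PG id and data path are placeholders; the OSD must be stopped first, and
mark-complete can throw away data, so treat it as a last resort):

  systemctl stop ceph-osd@<id>
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --pgid <pg_id> --op mark-complete
  systemctl start ceph-osd@<id>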
Just to report back the root cause of the above mentioned failures in "
ceph-osd -i ${osd_id} --mkfs -k /var/lib/ceph/osd/ceph-${osd_id}/keyring"
It turns out the culprit was using Samsung SM883 SSD disks as DB/WAL
partitions. Replacing SM883 with Intel S4510/4520 SSDs solved the issues.
It loo