Hi
Looking at this error in v15.2.13:
"
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed:
Module 'devicehealth' has failed:
"
It used to work. Since the module is always on I can't restart
it, and I've found no clue as to why it failed. I've tried rebooting all
hosts, to no avail.
On Mon, 14 June 2021 at 22:48, Matt Larson wrote:
>
> Looking at the documentation (
> https://docs.ceph.com/en/latest/cephadm/upgrade/) - I have a question about
> whether you need to upgrade sequentially through each minor version, 15.2.1 ->
> 15.2.3 -> ... -> 15.2.XX?
>
> Can you safely upgrade by
Hi Torkil,
you should see more information in the MGR log file.
Might be an idea to restart the MGR to get some recent logs.
On 15.06.21 at 09:41, Torkil Svensgaard wrote:
Hi
Looking at this error in v15.2.13:
"
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed:
Module 'devicehea
Hi
Thanks, I guess this might have something to do with it:
"
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: debug
2021-06-15T09:44:22.507+ 7f704e4b3700 -1 mgr notify devicehealth.notify:
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: debug
2021-06-15T09:44:22.507+ 7f704e4b3700 -1 mgr notify Traceba
Dears,
We have a Ceph cluster with 4096 PGs, of which more than 100 are not active+clean.
On top of the Ceph cluster we have a CephFS with 3 active MDS servers.
It seems that we can’t get all the files out of it because of the affected PGs.
The object store has more than 400 million objects.
W
Filed https://tracker.ceph.com/issues/51223
k
> On 9 Jun 2021, at 13:20, Igor Fedotov wrote:
>
> Should we file another ticket for that?
Hi everyone,
Here's today's schedule for Ceph Month:
9:00 ET / 15:00 CEST Dashboard Update [Ernesto]
9:30 ET / 15:30 CEST [lightning] RBD latency with QD=1 bs=4k [Wido den Hollander]
9:40 ET / 15:40 CEST [lightning] From Open Source to Open Ended in
Ceph with Lua [Yuval Lifshitz]
Full schedule:
Hi,
I'm building a lab with virtual machines.
I built a setup with only 2 nodes, 2 OSDs per node, and I have a host that
mounts CephFS (mount.cephfs).
Each of the 2 Ceph nodes runs mon + mgr + mds services and has the cephadm command.
If I stop a node, all commands hang.
Can't use the dashboard, can't use ceph -s or any other ceph command.
On 15.06.21 15:16, nORKy wrote:
> Why is there no failover ??
Because one MON out of two is not a majority, so no quorum can be formed.
Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
A
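(As a rough illustration of the quorum math: with N MONs a quorum needs floor(N/2)+1 members, so 2 of 2 and 2 of 3, which is why an odd count of at least 3 is recommended. Assuming a cephadm-managed lab like the one described above, a hedged sketch of adding a third MON would be, host name being a placeholder:
  ceph orch apply mon 3
  ceph orch daemon add mon ceph-node-03
  ceph quorum_status -f json-pretty    # verify the quorum afterwards
)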
It's easy. The problem is that the OSDs are still marked up because there are not
enough down reporters (mon_osd_min_down_reporters), and because of this the MDS is getting stuck.
The solution is "mon_osd_min_down_reporters = 1".
Due to the "two node" cluster and "replicated 2" with "chooseleaf host",
the reporter count should be set to 1.
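(A minimal sketch of applying that setting through the centralized config, assuming a release that supports "ceph config"; alternatively it can go into the [mon] section of ceph.conf. The value 1 is only reasonable for such a tiny two-node setup:
  ceph config set mon mon_osd_min_down_reporters 1
  ceph config get mon mon_osd_min_down_reporters
)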
Dear All,
I have deployed the latest Ceph Pacific release in my lab and started to check
out the new "stable" NFS Ganesha features. First of all I'm a bit confused about
which method to actually use to deploy the NFS cluster:
cephadm or ceph nfs cluster create?
I used "nfs cluster create" for
Hi,
That's right!
We're currently evaluating a similar setup with two identical HW nodes
(on two different sites), with OSD, MON and MDS each, and both nodes
have CephFS mounted.
The goal is to build a minimal self-contained shared filesystem that
remains online during planned updates and c
You have incomplete PGs, which means you have inactive data, because the data
isn't there.
This will typically only happen when you have multiple concurrent disk
failures, or something like that, so I think there is some missing info.
>1 osds exist in the crush map but not in the osdmap
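(To gather that missing info, a quick hedged checklist of read-only commands, with <pgid> a placeholder for one of the incomplete PGs:
  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg <pgid> query
  ceph osd tree
)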
Hi,
yeah. I wasn't aware that I had set "osd op complaint time" to 5
seconds. AFAIK the default is 32, so I get slow ops already after 5
seconds instead of 32. That's why I think no one has noticed
this before.
My application which uses the cluster will throw timeouts after 6
seconds, tha
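(A quick hedged way to double-check the current value and the compiled-in default, assuming a release with the centralized config database:
  ceph config get osd osd_op_complaint_time
  ceph config help osd_op_complaint_time    # shows the built-in default
)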
Hello,
I have a ceph cluster with 5 nodes (1 hdd each node). I want to add 5 more
drives (hdd) to expand my cluster. What is the best strategy for this?
I will add one drive to each node, but is it a good strategy to add one drive,
wait for the data to rebalance to the new OSD, and then add the next one? Or maybe..
Hi,
Thank you guys. I deployed a third monitor and failover works. Thank you.
On Tue, 15 June 2021 at 16:15, Christoph Brüning <
christoph.bruen...@uni-wuerzburg.de> wrote:
> Hi,
>
> That's right!
>
> We're currently evaluating a similar setup with two identical HW nodes
> (on two different si
Hi,
as far as I understand it,
you get no real benefit from doing them one by one, as each OSD you add can
cause a lot of data to be moved to a different OSD, even though you just
rebalanced it.
The algorithm determining the placement of PGs does not take the
current/historic placement into account.
Personally, when adding drives like this, I set noin (ceph osd set noin), and
norebalance (ceph osd set norebalance). Like your situation, we run smaller
clusters; our largest cluster only has 18 OSDs.
That keeps the cluster from starting data moves until all new drives are in
place. Don't fo
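(A minimal sketch of that flow, assuming the flags are cleared again once every new disk is in place; the OSD ids are placeholders:
  ceph osd set noin
  ceph osd set norebalance
  # ... create the new OSDs on each node; with noin set they stay "out" ...
  ceph osd in <new-osd-ids>
  ceph osd unset noin
  ceph osd unset norebalance
  ceph -s    # then watch one big rebalance instead of several small ones
)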
Hello all,
after upgrading CentOS clients to version 8.4 (kernel
4.18.0-305.3.1.el8) the CephFS mount fails. Message: *mount error 110 =
Connection timed out*
...unfortunately the kernel log was flooded with zeros... :-(
The monitor connection seems to be OK, but libceph said:
kernel: libceph:
Looks like this: https://tracker.ceph.com/issues/51112
On Tue, Jun 15, 2021 at 5:48 PM Ackermann, Christoph
wrote:
>
> Hello all,
>
> after upgrading Centos clients to version 8.4 CephFS ( Kernel
> 4.18.0-305.3.1.el8 ) mount did fail. Message: *mount error 110 =
> Connection timed out*
> ..unf
Dear Cephers,
I have encountered the following networking issue several times, and I wonder
whether there is a solution for network HA.
We build Ceph using L2 multi-chassis link aggregation groups (MC-LAG) to
provide switch redundancy. On each host, we use 802.3ad (LACP)
mode for NIC redundancy.
Note: I am not entirely sure here, and would love other input from the ML about
this, so take this with a grain of salt.
You don't show any unfound objects, which I think is excellent news as far as
data loss goes.
>>96 active+clean+scrubbing+deep+repair
The deep scrub + repair seems au
Hi,
On 15.06.21 16:15, Christoph Brüning wrote:
Hi,
That's right!
We're currently evaluating a similar setup with two identical HW nodes
(on two different sites), with OSD, MON and MDS each, and both nodes
have CephFS mounted.
The goal is to build a minimal self-contained shared filesystem
My big worry is when a single link under a bond breaks in such a way
that the whole bond stops working.
How can I make it fail over in such cases?
best regards,
samuel
huxia...@horebdata.cn
From: Anthony D'Atri
Date: 2021-06-15 18:22
To: huxia...@horebdata.cn
Subject: Re: [c
Do you observe the same behaviour when you pull a cable?
Maybe a flapping port might cause this kind of behaviour; other than
that you shouldn't see any network disconnects.
Are you sure about the LACP configuration? What is the output of 'cat
/proc/net/bonding/bond0'?
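(When reading that file, the fields that usually give it away are the bond and per-slave "MII Status", the per-slave "Link Failure Count", and, for 802.3ad, the "Aggregator ID" and "Partner Mac Address" of each slave. A rough way to pull just those lines:
  egrep -i 'Bonding Mode|MII Status|Link Failure Count|Aggregator ID|Partner Mac' /proc/net/bonding/bond0
A slave stuck in its own aggregator or a climbing failure count would match the flapping theory.)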
On Tue, Jun 15, 2021 at 7:19 PM hu
This also sounds like a possible GlusterFS use case.
Regards,
-Jamie
On Tue, Jun 15, 2021 at 12:30 PM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:
> Hi,
>
> On 15.06.21 16:15, Christoph Brüning wrote:
> > Hi,
> >
> > That's right!
> >
> > We're currently evaluating a
When I pull out the cable, the bond works properly.
Does that mean the port is somehow flapping? Ping still works, but the
iperf test yields very low results.
huxia...@horebdata.cn
From: Serkan Çoban
Date: 2021-06-15 18:47
To: huxia...@horebdata.cn
CC: ceph-users
Subject: R
With an unstable link/port you could see the issues you describe. Ping doesn’t
have the packet rate for you to necessarily have a packet in transit at exactly
the same time as the port fails temporarily. Iperf on the other hand could
certainly show the issue, higher packet rate and more likely
> On Jun 15, 2021, at 10:26 AM, Andrew Walker-Brown
> wrote:
>
> With an unstable link/port you could see the issues you describe. Ping
> doesn’t have the packet rate for you to necessarily have a packet in transit
> at exactly the same time as the port fails temporarily. Iperf on the othe
Hi Ilya,
We're now hitting this on CentOS 8.4.
The "setmaxosd" workaround fixed access to one of our clusters, but
isn't working for another, where we have gaps in the osd ids, e.g.
# ceph osd getmaxosd
max_osd = 553 in epoch 691642
# ceph osd tree | sort -n -k1 | tail
541 ssd 0.87299
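(For reference, a hedged sketch of the workaround being discussed, on the assumption that max_osd has to be larger than the highest OSD id present, not the OSD count:
  ceph osd ls | sort -n | tail -1      # highest OSD id in use
  ceph osd setmaxosd <highest_id+1>    # placeholder; use the value printed above plus one
  ceph osd getmaxosd                   # confirm
)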
I'm trying to update a ceph octopus install, to add an iscsi gateway, using
ceph-ansible, and gwcli won't run for me.
The ansible run went well.. but when I try to actually use gwcli, I get
(blahblah)
ImportError: No module named rados
which isn't too surprising, since "python-rados" is not installed.
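(Assuming a CentOS/RHEL 8 host with the Ceph repos enabled, the missing binding would typically come from the python3 packages, roughly:
  dnf install python3-rados python3-rbd
while older python2-based gwcli builds wanted python-rados instead; which one applies depends on how ceph-ansible installed ceph-iscsi.)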
Replying to own mail...
On Tue, Jun 15, 2021 at 7:54 PM Dan van der Ster wrote:
>
> Hi Ilya,
>
> We're now hitting this on CentOS 8.4.
>
> The "setmaxosd" workaround fixed access to one of our clusters, but
> isn't working for another, where we have gaps in the osd ids, e.g.
>
> # ceph osd getmax
Dan,
sorry, we have no gaps in osd numbering:
isceph@ceph-deploy:~$ sudo ceph osd ls |wc -l; sudo ceph osd tree | sort -n
-k1 |tail
76
[..]
73   ssd   0.28600   osd.73   up   1.0   1.0
74   ssd   0.27689   osd.74   up   1.0   1.0
Hi Christoph,
What about the max osd? If "ceph osd getmaxosd" is not 76 on this
cluster, then set it: `ceph osd setmaxosd 76`.
-- dan
On Tue, Jun 15, 2021 at 8:54 PM Ackermann, Christoph
wrote:
>
> Dan,
>
> sorry, we have no gaps in osd numbering:
> isceph@ceph-deploy:~$ sudo ceph osd ls |wc -l
Hi Dan,
Thanks for the hint, I'll try this tomorrow on a test bed first. This
evening I had to fix some Bareos client systems to get a quiet sleep. ;-)
Will give you feedback asap.
Best regards,
Christoph
On Tue, 15 June 2021 at 21:03, Dan van der Ster <
d...@vanderster.com> wrote:
>
Hi Reed,
Thank you for getting back to us.
We had indeed several disk failures at the same time.
Regarding the OSD map, we have an OSD that failed and that we needed to remove, but
we didn't update the crush map.
The question here: is it safe to update the OSD crush map without affecting the
data availability?
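(If it helps, a hedged sketch of cleaning up a dead OSD, with <id> a placeholder. Removing an empty, failed OSD from CRUSH only changes the map, but any remapping it triggers is worth watching with ceph -s:
  ceph osd crush remove osd.<id>
  ceph auth del osd.<id>
  ceph osd rm <id>
or, on recent releases, the single command:
  ceph osd purge <id> --yes-i-really-mean-it
)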
I run 2x 10G on my hosts, and I would like the bond to tolerate one link being down.
From what you suggest, I will check link monitoring, to make sure a failing
link is removed from the bond automatically, without having to manually
pull out the cable.
thanks and best regards,
samuel
huxia.
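(A hedged example of what "link monitoring" usually means here, i.e. the kernel bonding options; the values are illustrative only:
  mode=802.3ad miimon=100 lacp_rate=fast xmit_hash_policy=layer3+4
MII monitoring every 100 ms takes a slave whose carrier drops out of the aggregate automatically; a port that stays "up" while misbehaving is harder to catch and usually needs the faster LACP timeouts or switch-side checks.)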
We also run with Dell VLT switches (40 GbE);
everything is active/active, so multiple paths, as Andrew describes in
his config.
Our config allows us to:
bring down one of the switches for upgrades,
bring down an iSCSI gateway for patching,
all the while at least one path is up and servicing.
Thanks
Hello Ceph-users,
I've upgraded my Ubuntu server from 18.04.5 LTS to Ubuntu 20.04.2 LTS via
'do-release-upgrade',
during that process ceph packages were upgraded from Luminous to Octopus and
now the ceph-mon daemon (I have only one) won't start; the log error is:
"2021-06-15T20:23:41.843+ 7fbb55e9b54
Hello
How can I use ceph orch apply to deploy single-site RGW
daemons with a custom frontend configuration?
Basically, I have three servers in a DNS round-robin, each
running a 15.2.12 rgw daemon with this configuration:
rgw_frontends = civetweb num_threads=5000 port=443s
ssl_certificate=/etc/
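(A heavily hedged sketch: on Octopus the orchestrator call takes a realm and zone plus a placement, and the frontend line itself can be pushed through the config database. The names and paths below are placeholders, and the civetweb "443s" syntax would need translating into beast options if you move off civetweb:
  ceph orch apply rgw myrealm myzone --placement="3 host1 host2 host3"
  ceph config set <rgw-config-section> rgw_frontends "beast ssl_port=443 ssl_certificate=/etc/pki/rgw.pem"
where <rgw-config-section> is whatever config entity actually matches your rgw daemons, per-daemon if need be.)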
Good day.
I'm writing some code for parsing output data for monitoring purposes.
The data is that of "ceph status -f json", "ceph df -f json", "ceph osd
perf -f json" and "ceph osd pool stats -f json".
I also need to support all major Ceph releases, from Jewel through
Pacific.
What I've st
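(Not sure how far you've got, but as a tiny hedged sketch of the kind of release difference to expect: the overall health key moved around Luminous, so a jq fallback like the following covers both layouts, assuming jq is available:
  ceph status -f json | jq -r '.health.status // .health.overall_status'
Jewel-era clusters report health.overall_status, newer ones health.status; other outputs changed across releases too, so pinning the schema per release is probably unavoidable.)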
Thanks for the replies folks.
This one was resolved. I wish I could tell you what I changed to fix
it, but there were several undocumented changes to the deployment script
I'm using while I was distracted by something else... Tearing down and
redeploying today does not seem to be suffering from
Hey folks,
I'm working through some basic ops drills, and noticed what I think is an
inconsistency in the Cephadm Docs. Some Googling appears to show this is a
known thing, but I didn't find a clear direction on cooking up a solution
yet.
On a cluster with 5 mons, 2 were abruptly removed when th