it a second time!
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________
From: Marc Roos
Sent: 06 May 2020 19:19
To: ag; brad.swanson; dan; Frank Schilder
Cc: ceph-users
Subject: RE: [ceph-users] Re: Ceph meltdown, need help
This was a long one. Hope you made it all the way down.
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
____________
From: Marc Roos
Sent: 05 May 2020 23:32
To: ag; brad.swanson; dan; Frank Schilder
Cc: ceph-users
Subject: RE: [ceph-users] Re: Ceph meltdown, need help
Made it all the way down ;) Thank you very much for the detailed info.
-----Original Message-----
Cc: ceph-users
Subject: Re: [ceph-users] Re: Ceph meltdown, need help
Dear Marc,
This e-mail is in two parts. The first part is about the problem of a single
client being able to crash a ceph cluster itself.
And I am a little worried when I read that a job can bring down your
cluster. Is this possible with any cluster?
-----Original Message-----
Cc: ceph-users
Subject: [ceph-users] Re: Ceph meltdown, need help
Dear all,
the command
ceph config set mon.ceph-01 mon_osd_report_timeout 3600
saved the day. Within a few seconds, the cluster became:
==
[root@gnosis ~]# ceph status
  cluster:
    id:
    health: HEALTH_WARN
            2 slow ops, oldest one blocked for 10884
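For anyone reading along later: once all OSDs are sending beacons again, the
override can be checked and dropped again; a minimal sketch, assuming the
centralized config database used above (the built-in default is 900 seconds):
ceph config get mon.ceph-01 mon_osd_report_timeout   # should show the 3600 set above
ceph config rm mon.ceph-01 mon_osd_report_timeout    # revert to the 900 s default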
Hi Dan,
looking at an older thread, I found that "OSDs do not send beacons if they are
not active". Is there any way to activate an OSD manually? Or check which ones
are inactive?
Also, I looked at this here:
[root@gnosis ~]# ceph mon feature ls
all features
supported: [kraken,luminous
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Alex Gorbachev
Sent: 05 May 2020 17:31:17
To: Frank Schilder
Cc: Dan van der Ster; ceph-users
Subject: Re: [ceph-users] Re: Ceph meltdown, need help
On Tue, May 5, 2020 at 11:27 AM Frank Schilder wrote:
Thanks! Here it is:
[root@gnosis ~]# ceph osd dump | grep require
require_min_compat_client jewel
require_osd_release mimic
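As an aside, if a client/OSD feature mismatch were the suspicion, the feature
bits of everything currently connected can also be listed; just a pointer,
this assumes Luminous or newer:
ceph features   # shows mon/osd/client feature releases and connection counts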
It looks like we had an extremely aggressive job running on our cluster,
completely flooding everything with small I/O. I think the cluster built up a
huge backlog and is/
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Alex Gorbachev
Sent: 05 May 2020 17:19:26
To: Frank Schilder
Cc: Dan van der Ster; ceph-users
Subject: Re: [ceph-users] Re: Ceph meltdown, need help
Hi Frank,
On Tue, May 5, 2020 at 10:43 AM Frank Schilder wrote:
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Frank Schilder
Sent: 05 May 2020 16:41:59
To: Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: Ceph meltdown, need help
Dear Dan,
thank you for your fast response. Please find the log of the first OSD that
went down and the ceph.log with these links:
https://files.dtu.dk/u/tF1zv5zdc6mmXXO_/ceph.log?l
https://files.dtu.dk/u/hPb5qax2-b6W9vmp/ceph-osd.2.log?l
I can collect more osd logs if this helps.
Best regards
Ditto, I had a bad optic on a 48x10 switch. The only way I detected it was my
Prometheus TCP retransmit count. Looking back over the previous 4 weeks, I
could see it increment in small bursts, but Ceph was able to handle it; then
it went crazy and a bunch of OSDs just dropped out.
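For reference, the counters behind that kind of alert can also be checked
directly on a host; a rough sketch (the second line assumes node_exporter on
its default port 9100 and the standard netstat metric name):
netstat -s | grep -i retrans                                  # kernel TCP retransmission counters
curl -s http://localhost:9100/metrics | grep Tcp_RetransSegs  # the same counter as scraped by Prometheus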
__
ceph osd tree down # shows the down osds
ceph osd tree out # shows the out osds
there is no "active/inactive" state on an osd.
You can force an individual osd to do a soft restart with "ceph osd
down <id>" -- this will cause it to restart and recontact mons and
osd peers. If that doesn't work, restart the daemon itself.
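Put together, the sequence for one stuck OSD might look like the following
(osd.40 is only an example id taken from the mon log quoted in this thread,
and the systemd unit name assumes a standard package install):
ceph osd tree down             # list the OSDs currently marked down
ceph osd down 40               # mark osd.40 down so the daemon re-asserts itself to the mons and its peers
systemctl restart ceph-osd@40  # run on osd.40's host if the soft restart does not help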
OK those requires look correct.
While the pgs are inactive there will be no client IO, so there's
nothing to pause at this point. In general, I would evict those
misbehaving clients with ceph tell mds.* client evict id=<client id>
For now, keep nodown and noout, let all the PGs get active again. You
might n
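For completeness, a sketch of the flag handling and the eviction mentioned
above (the client id is a placeholder; session ids can be listed per MDS
first):
ceph osd set nodown                           # stop OSDs from being marked down while recovering
ceph osd set noout                            # stop down OSDs from being marked out
ceph tell mds.* client ls                     # list client sessions to find the offending id
ceph tell mds.* client evict id=<client id>   # evict that client
ceph osd unset nodown                         # clear the flags again once all PGs are active
ceph osd unset noout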
ange.
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> From: Alex Gorbachev
> Sent: 05 May 2020 17:19:26
> To: Frank Schilder
> Cc: Dan van der Ster; ceph-users
> Subject: Re: [ceph-users] Re: Ceph meltdown, need help
Hi,
The osds are getting marked down due to this:
2020-05-05 15:18:42.893964 mon.ceph-01 mon.0 192.168.32.65:6789/0
292689 : cluster [INF] osd.40 marked down after no beacon for
903.781033 seconds
2020-05-05 15:18:42.894009 mon.ceph-01 mon.0 192.168.32.65:6789/0
292690 : cluster [INF] osd.60 mark
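That ~900-second figure comes from the beacon settings; to cross-check them
on the daemons themselves one can query the admin sockets (this must be run
on the host of the respective daemon, and the names are just the ones from
the log above):
ceph daemon mon.ceph-01 config get mon_osd_report_timeout   # default 900 seconds
ceph daemon osd.40 config get osd_beacon_report_interval    # default 300 seconds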
Check network connectivity on all configured networks between all hosts;
OSDs running but being marked as down is usually a network problem.
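A quick sanity check along those lines, for example (the host names and the
9000-byte MTU behind the 8972-byte payload are assumptions to adapt):
for h in ceph-01 ceph-02 ceph-03; do ping -c 3 -M do -s 8972 $h; done   # full-size, no-fragment pings on each configured network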
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Te
Hi Frank,
On Tue, May 5, 2020 at 10:43 AM Frank Schilder wrote:
> Dear Dan,
>
> thank you for your fast response. Please find the log of the first OSD
> that went down and the ceph.log with these links:
>
> https://files.dtu.dk/u/tF1zv5zdc6mmXXO_/ceph.log?l
> https://files.dtu.dk/u/hPb5qax2-b6W9vmp/ceph-osd.2.log?l
Hi Frank,
Could you share any ceph-osd logs and also the ceph.log from a mon to
see why the cluster thinks all those osds are down?
Simply marking them up isn't going to help, I'm afraid.
Cheers, Dan
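If it helps, the relevant lines are usually easy to pull out of the cluster
log on a mon host; a rough example, assuming the default log location:
grep -E 'marked down|no beacon' /var/log/ceph/ceph.log | tail -n 50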
On Tue, May 5, 2020 at 4:12 PM Frank Schilder wrote:
>
> Hi all,
>
> a lot of OSDs crashed in