[ceph-users] Re: Ceph meltdown, need help

2020-05-14 Thread Frank Schilder
it a second time! Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________ From: Marc Roos Sent: 06 May 2020 19:19 To: ag; brad.swanson; dan; Frank Schilder Cc: ceph-users Subject: RE: [ceph-users] Re: Ceph meltdown, nee

[ceph-users] Re: Ceph meltdown, need help

2020-05-06 Thread Frank Schilder
This was a long one. Hope you made it all the way down. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____________ From: Marc Roos Sent: 05 May 2020 23:32 To: ag; brad.swanson; dan; Frank Schilder Cc: ceph-users Subject: RE: [ceph-user

[ceph-users] Re: Ceph meltdown, need help

2020-05-06 Thread Marc Roos
Made it all the way down ;) Thank you very much for the detailed info. -Original Message- Cc: ceph-users Subject: Re: [ceph-users] Re: Ceph meltdown, need help Dear Marc, This e-mail is two-part. First part is about the problem of a single client being able to crash a ceph cluster

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Marc Roos
itself. And I am a little worried when I read that a job can bring down your cluster. Is this possible with any cluster? -Original Message- Cc: ceph-users Subject: [ceph-users] Re: Ceph meltdown, need help Dear all, the command ceph config set mon.ceph-01 mon_osd_report_timeout 3600 sav

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Frank Schilder
Dear all, the command ceph config set mon.ceph-01 mon_osd_report_timeout 3600 saved the day. Within a few seconds, the cluster became:
==
[root@gnosis ~]# ceph status
  cluster:
    id:
    health: HEALTH_WARN
            2 slow ops, oldest one blocked for 10884
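A minimal sketch of that override and an assumed later cleanup with the Mimic-era config interface (the 3600 value and the mon.ceph-01 target are from the thread; the get/rm steps are not stated there and are only an assumption about how one would verify and revert it):

[root@gnosis ~]# ceph config set mon.ceph-01 mon_osd_report_timeout 3600   # give OSDs up to an hour between beacons before the mon marks them down
[root@gnosis ~]# ceph config get mon.ceph-01 mon_osd_report_timeout        # confirm the override is in place
[root@gnosis ~]# ceph config rm mon.ceph-01 mon_osd_report_timeout         # drop the override once the cluster has settled (back to the 900 s default)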

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Frank Schilder
Hi Dan, looking at an older thread, I found that "OSDs do not send beacons if they are not active". Is there any way to activate an OSD manually? Or check which ones are inactive? Also, I looked at this here:
[root@gnosis ~]# ceph mon feature ls
all features
        supported: [kraken,luminous

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Frank Schilder
. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Alex Gorbachev Sent: 05 May 2020 17:31:17 To: Frank Schilder Cc: Dan van der Ster; ceph-users Subject: Re: [ceph-users] Re: Ceph meltdown, need help On Tue, May 5, 2020 at 11:27 AM Frank Schilder

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Frank Schilder
Thanks! Here it is:
[root@gnosis ~]# ceph osd dump | grep require
require_min_compat_client jewel
require_osd_release mimic
It looks like we had an extremely aggressive job running on our cluster, completely flooding everything with small I/O. I think the cluster built up a huge backlog and is/

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Frank Schilder
= Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Alex Gorbachev Sent: 05 May 2020 17:19:26 To: Frank Schilder Cc: Dan van der Ster; ceph-users Subject: Re: [ceph-users] Re: Ceph meltdown, need help Hi Frank, On Tue, May 5, 2020 at 10:43 AM Frank Sc

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Frank Schilder
= Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 05 May 2020 16:41:59 To: Dan van der Ster Cc: ceph-users Subject: [ceph-users] Re: Ceph meltdown, need help Dear Dan, thank you for your fast response. Please find the

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Frank Schilder
Dear Dan, thank you for your fast response. Please find the log of the first OSD that went down and the ceph.log with these links: https://files.dtu.dk/u/tF1zv5zdc6mmXXO_/ceph.log?l https://files.dtu.dk/u/hPb5qax2-b6W9vmp/ceph-osd.2.log?l I can collect more osd logs if this helps. Best regards

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread brad . swanson
Ditto, I had a bad optic on a 48x10 switch. The only way I detected it was my Prometheus TCP fail/retrans count. Looking back over the previous 4 weeks, I could see it incrementing in small bursts, but Ceph was able to handle it; then it went crazy and a bunch of OSDs just dropped out.
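If the counter behind that is node_exporter's TCP retransmit metric scraped by Prometheus (an assumption; the poster does not name the exporter), a query along these lines would surface such bursts, here issued against the Prometheus HTTP API from a shell (the prometheus host name is a placeholder):

# per-host TCP retransmit rate over 5-minute windows
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_netstat_Tcp_RetransSegs[5m])'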

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Dan van der Ster
ceph osd tree down   # shows the down osds
ceph osd tree out    # shows the out osds
There is no "active/inactive" state on an osd. You can force an individual osd to do a soft restart with "ceph osd down <id>" -- this will cause it to restart and recontact mons and osd peers. If that doesn't work, restar
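A sketch of that sequence against a single OSD, using osd.40 from the log excerpt elsewhere in this thread purely as an example id:

[root@gnosis ~]# ceph osd tree down        # which OSDs do the mons currently consider down?
[root@gnosis ~]# ceph osd down 40          # mark osd.40 down; a live daemon will notice and re-register with the mons
[root@gnosis ~]# ceph osd tree down        # re-check: osd.40 should drop off the list once it has re-asserted itself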

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Dan van der Ster
OK those requires look correct. While the pgs are inactive there will be no client IO, so there's nothing to pause at this point. In general, I would evict those misbehaving clients with ceph tell mds.* client evict id=<client id> For now, keep nodown and noout, let all the PGs get active again. You might n
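A rough sketch of that advice; the session id below is a hypothetical placeholder, and in practice it would come from listing the MDS client sessions first:

[root@gnosis ~]# ceph tell mds.* client ls              # list CephFS client sessions to identify the misbehaving one
[root@gnosis ~]# ceph tell mds.* client evict id=4305   # evict it (4305 is a made-up example id)
# later, once all PGs are active again and the backlog has drained:
[root@gnosis ~]# ceph osd unset nodown
[root@gnosis ~]# ceph osd unset noout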

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Alex Gorbachev
ange. > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Alex Gorbachev > Sent: 05 May 2020 17:19:26 > To: Frank Schilder > Cc: Dan van der Ster; ceph-users > Subject: Re: [ceph-users] Re: Cep

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Dan van der Ster
Hi, The osds are getting marked down due to this:
2020-05-05 15:18:42.893964 mon.ceph-01 mon.0 192.168.32.65:6789/0 292689 : cluster [INF] osd.40 marked down after no beacon for 903.781033 seconds
2020-05-05 15:18:42.894009 mon.ceph-01 mon.0 192.168.32.65:6789/0 292690 : cluster [INF] osd.60 mark
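For reference, the ~900 s in those log lines matches the stock mon_osd_report_timeout; a quick way to check what a cluster is actually using (the defaults quoted here are the upstream ones, not values confirmed in the thread):

[root@gnosis ~]# ceph config get osd osd_beacon_report_interval   # how often a healthy OSD beacons the mons (default 300 s)
[root@gnosis ~]# ceph config get mon mon_osd_report_timeout       # how long a mon waits without a beacon before marking an OSD down (default 900 s)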

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Paul Emmerich
Check network connectivity on all configured networks between all hosts; OSDs running but being marked as down is usually a network problem. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Te
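A crude sanity check along those lines from one node; the host names are placeholders and the cluster-network addresses would need a second pass of their own:

# hypothetical host list; substitute the real OSD hosts (and repeat for the cluster-network IPs)
for h in ceph-01 ceph-02 ceph-03; do
    ping -c 3 -W 2 "$h" >/dev/null && echo "$h ok" || echo "$h UNREACHABLE"
done

A plain ping will not catch every bad link (the bad optic mentioned above only showed up as retransmits), so interface error and drop counters on the switches and hosts are the more reliable complement.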

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Alex Gorbachev
Hi Frank, On Tue, May 5, 2020 at 10:43 AM Frank Schilder wrote: > Dear Dan, > > thank you for your fast response. Please find the log of the first OSD > that went down and the ceph.log with these links: > > https://files.dtu.dk/u/tF1zv5zdc6mmXXO_/ceph.log?l > https://files.dtu.dk/u/hPb5qax2-b6

[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Dan van der Ster
Hi Frank, Could you share any ceph-osd logs and also the ceph.log from a mon to see why the cluster thinks all those osds are down? Simply marking them up isn't going to help, I'm afraid. Cheers, Dan On Tue, May 5, 2020 at 4:12 PM Frank Schilder wrote: > > Hi all, > > a lot of OSDs crashed in