Of course, I forgot to mention that. Thank you for bringing it up! We made sure
balancer and PG autoscaler were turned off for the (only) pool that uses those
PGs shortly after we noticed the cycle of remapping/backfilling:
# ceph balancer status
{
    "active": false,
    "last_optimize_duratio
Our setup is not using SSDs as the Bluestore DB devices. We only have 2 SSDs vs
12 HDDs, which is normally fine for the low workload of the cluster. The SSDs
are serving a pool that is just used by RGW for index and meta.
Since the compaction two weeks ago the OSDs have all been stable. However,
Regarding RocksDB compaction, if you were in a situation where RocksDB
had spilled over to HDDs (if your cluster is using a hybrid setup), the
compaction should have moved the bits back to the fast devices. So it might
have helped in this situation too.
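If you want to check whether any RocksDB data is still sitting on the slow
devices, and compact offline, it should look roughly like this (OSD id, paths
and unit name are placeholders, and the OSD has to be stopped first):

# ceph health detail | grep -i spillover
# ceph daemon osd.3 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'
# systemctl stop ceph-osd@3
# ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 compact
# systemctl start ceph-osd@3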
Regards,
Frédéric.
On 16/12/2020 at 09:57, Fr
Hi Stefan,
This has me thinking that the issue your cluster may be facing is
probably related to bluefs_buffered_io being set to true, as this has been
reported to induce excessive swap usage (with OSDs flapping or OOMing as a
consequence) in some versions, starting from Nautilus I believe.
Can you check th
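Checking and, if needed, flipping it is roughly:

# ceph config get osd bluefs_buffered_io
# ceph config set osd bluefs_buffered_io false

(an OSD restart may be needed for the change to take effect, depending on the
version).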
Hi Frédéric,
Thanks for the additional input. We are currently only running RGW on the
cluster, so no snapshot removal, but there have been plenty of remappings with
the OSDs failing (all of them at first during and after the OOM incident, then
one by one). I haven't had a chance to look into o
and try the offline DB compaction. It
would be amazing if that does the trick.
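If stopping the OSDs one at a time turns out to be too disruptive, there also
seems to be an online variant that can be triggered per OSD, something like:

# ceph tell osd.1 compact

(the OSD id above is just an example).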
Will keep you posted, here.
Thanks,
Stefan
From: Igor Fedotov
Sent: Monday, December 14, 2020 6:39:28 AM
To: Stefan Wild ; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory
Hi Stefan,
given the crash backtrace in your log I presume some data removal is in
progress:
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3:
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
char*)+0xd8
Just a note - all the below is almost completely unrelated to the high RAM
usage. The latter is a different issue, which presumably just triggered the
PG-removal one...
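If the removal load itself is what is hurting the OSDs, one knob that might be
worth a look (untested here) is the per-device delete sleep, which pauses
between PG removal transactions, e.g.:

# ceph config set osd osd_delete_sleep_hdd 1
# ceph config show osd.1 | grep osd_delete_sleep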
On 12/14/2020 2:39 PM, Igor Fedotov wrote:
Hi Stefan,
given the crash backtrace in your log I presume some data removal is
in progr
> Could you please comment on how to safely deal with these bugs, or how to
> avoid them, if indeed they occur?
>
> thanks a lot,
>
> samuel
>
>
>
> huxia...@horebdata.cn
>
> From: Kalle Happonen
> Date: 2020-12-14 08:28
> To: Stefan Wild
> CC: ceph-users
> Subjec
Hi Stefan,
given the crash backtrace in your log I presume some data removal is in
progress:
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3:
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 4:
github.com/ceph/ceph/pull/35584
Cheers,
Kalle
- Original Message -
> From: huxia...@horebdata.cn
> To: "Kalle Happonen" , "Stefan Wild"
>
> Cc: "ceph-users"
> Sent: Monday, 14 December, 2020 10:27:57
> Subject: Re: [ceph-users] Re: O
lot,
samuel
huxia...@horebdata.cn
From: Kalle Happonen
Date: 2020-12-14 08:28
To: Stefan Wild
CC: ceph-users
Subject: [ceph-users] Re: OSD reboot loop after running out of memory
Hi Stefan,
we had been seeing OSDs OOMing on 14.2.13, but on a larger scale. In our case
we hit some bugs with
Kalle
- Original Message -
> From: "Stefan Wild"
> To: "Igor Fedotov" , "ceph-users"
> Sent: Sunday, 13 December, 2020 14:46:44
> Subject: [ceph-users] Re: OSD reboot loop after running out of memory
> Hi Igor,
>
> Full osd logs from s
Hi Igor,
Full osd logs from startup to failed exit:
https://tiltworks.com/osd.1.log
In other news, can I expect osd.10 to go down next?
Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
2020-12-13T12:40:14.823+ 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
reply from 172.18.189.20:68
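To check whether that silent heartbeat peer really is osd.10, the address from
the log can be matched against the OSD's metadata, e.g.:

# ceph osd find 10
# ceph osd metadata 10 | grep addr

(the heartbeat front/back addresses show up in the metadata output).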
Hi Stefan,
could you please share OSD startup log from /var/log/ceph?
Thanks,
Igor
On 12/13/2020 5:44 AM, Stefan Wild wrote:
Just had another look at the logs and this is what I did notice after the
affected OSD starts up.
Loads of entries of this sort:
Dec 12 21:38:40 ceph-tpa-server1 ba
Got a trace of the osd process, shortly after ceph status -w announced boot for
the osd:
strace: Process 784735 attached
futex(0x5587c3e22fc8, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ exited with 1 +++
It was stuck at that one call for several minutes before exiting.
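(For reference, attaching to the running OSD for such a trace is roughly:

# strace -p <osd-pid>

optionally with -f to also follow the worker threads; the futex wait above just
means the attached thread was parked waiting on a lock.)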
From: Stefan Wild
Date: Satu
Just had another look at the logs and this is what I did notice after the
affected OSD starts up.
Loads of entries of this sort:
Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug
2020-12-13T02:38:40.851+ 7fafd32c7700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fafb721f700' had
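When the op worker threads start timing out like that, it can also help to ask
the OSD what those ops are stuck on via its admin socket, e.g.:

# ceph daemon osd.1 dump_ops_in_flight
# ceph daemon osd.1 dump_historic_ops

(osd.1 is just an example id; these need to be run on the host where that OSD
is running).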