Of course, I forgot to mention that. Thank you for bringing it up! We made sure
balancer and PG autoscaler were turned off for the (only) pool that uses those
PGs shortly after we noticed the cycle of remapping/backfilling:
# ceph balancer status
{
    "active": false,
    "last_optimize_duratio
Our setup is not using SSDs as the Bluestore DB devices. We only have 2 SSDs vs
12 HDDs, which is normally fine for the low workload of the cluster. The SSDs
are serving a pool that is just used by RGW for index and meta.
Since the compaction two weeks ago the OSDs have all been stable. However,
Regarding RocksDB compaction, if you were in a situation where RocksDB
had spilled over to HDDs (if your cluster is using a hybrid setup), the
compaction should have moved the bits back to the fast devices. So it might
have helped in this situation too.
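If you want to check whether any RocksDB data is still sitting on the slow
devices, and compact offline, it should look roughly like this (OSD id, paths
and unit name are placeholders, and the OSD has to be stopped first):

# ceph health detail | grep -i spillover
# ceph daemon osd.3 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'
# systemctl stop ceph-osd@3
# ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 compact
# systemctl start ceph-osd@3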
Regards,
Frédéric.
On 16/12/2020 at 09:57, Fr
Hi Stefan,
This has me thinking that the issue your cluster may be facing is
probably related to bluefs_buffered_io being set to true, as this has been
reported to induce excessive swap usage (with OSDs flapping or OOMing as a
consequence) in some versions, starting from Nautilus I believe.
Can you check th
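Checking and, if needed, flipping it is roughly:

# ceph config get osd bluefs_buffered_io
# ceph config set osd bluefs_buffered_io false

(an OSD restart may be needed for the change to take effect, depending on the
version).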
Hi Frédéric,
Thanks for the additional input. We are currently only running RGW on the
cluster, so no snapshot removal, but there have been plenty of remappings with
the OSDs failing (all of them at first during and after the OOM incident, then
one by one). I haven't had a chance to look into o
and try the offline DB compaction. It
would be amazing if that does the trick.
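If stopping the OSDs one at a time turns out to be too disruptive, there also
seems to be an online variant that can be triggered per OSD, something like:

# ceph tell osd.1 compact

(the OSD id above is just an example).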
Will keep you posted, here.
Thanks,
Stefan
From: Igor Fedotov
Sent: Monday, December 14, 2020 6:39:28 AM
To: Stefan Wild ; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory
Hi Stefan,
given the crash backtrace in your log I presume some data removal is in
progress:
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3:
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
char*)+0xd8
Just a note - all the below is almost completely unrelated to the high RAM
usage. The latter is a different issue, which presumably just triggered the
PG-removal one...
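If the removal load itself is what is hurting the OSDs, one knob that might be
worth a look (untested here) is the per-device delete sleep, which pauses
between PG removal transactions, e.g.:

# ceph config set osd osd_delete_sleep_hdd 1
# ceph config show osd.1 | grep osd_delete_sleep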
On 12/14/2020 2:39 PM, Igor Fedotov wrote:
Hi Stefan,
given the crash backtrace in your log I presume some data removal is
in progr
> Could you please comment on how to safely deal with these bugs, or how to
> avoid them, if indeed they occur?
>
> thanks a lot,
>
> samuel
>
>
>
> huxia...@horebdata.cn
>
> From: Kalle Happonen
> Date: 2020-12-14 08:28
> To: Stefan Wild
> CC: ceph-users
> Subjec
Hi Stefan,
given the crash backtrace in your log I presume some data removal is in
progress:
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3:
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 4:
github.com/ceph/ceph/pull/35584
Cheers,
Kalle
- Original Message -
> From: huxia...@horebdata.cn
> To: "Kalle Happonen" , "Stefan Wild"
>
> Cc: "ceph-users"
> Sent: Monday, 14 December, 2020 10:27:57
> Subject: Re: [ceph-users] Re: O
lot,
samuel
huxia...@horebdata.cn
From: Kalle Happonen
Date: 2020-12-14 08:28
To: Stefan Wild
CC: ceph-users
Subject: [ceph-users] Re: OSD reboot loop after running out of memory
Hi Stefan,
we had been seeing OSDs OOMing on 14.2.13, but on a larger scale. In our case
we hit some bugs with
Kalle
- Original Message -
> From: "Stefan Wild"
> To: "Igor Fedotov" , "ceph-users"
> Sent: Sunday, 13 December, 2020 14:46:44
> Subject: [ceph-users] Re: OSD reboot loop after running out of memory
> Hi Igor,
>
> Full osd logs from s
Hi Igor,
Full osd logs from startup to failed exit:
https://tiltworks.com/osd.1.log
In other news, can I expect osd.10 to go down next?
Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
2020-12-13T12:40:14.823+ 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
reply from 172.18.189.20:68
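To check whether that silent heartbeat peer really is osd.10, the address from
the log can be matched against the OSD's metadata, e.g.:

# ceph osd find 10
# ceph osd metadata 10 | grep addr

(the heartbeat front/back addresses show up in the metadata output).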
Hi Stefan,
could you please share OSD startup log from /var/log/ceph?
Thanks,
Igor
On 12/13/2020 5:44 AM, Stefan Wild wrote:
Just had another look at the logs and this is what I did notice after the
affected OSD starts up.
Loads of entries of this sort:
Dec 12 21:38:40 ceph-tpa-server1 ba
Got a trace of the osd process, shortly after ceph status -w announced boot for
the osd:
strace: Process 784735 attached
futex(0x5587c3e22fc8, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ exited with 1 +++
It was stuck at that one call for several minutes before exiting.
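(For reference, attaching to the running OSD for such a trace is roughly:

# strace -p <osd-pid>

optionally with -f to also follow the worker threads; the futex wait above just
means the attached thread was parked waiting on a lock.)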
From: Stefan Wild
Date: Satu
Just had another look at the logs and this is what I did notice after the
affected OSD starts up.
Loads of entries of this sort:
Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug
2020-12-13T02:38:40.851+ 7fafd32c7700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fafb721f700' had
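When the op worker threads start timing out like that, it can also help to ask
the OSD what those ops are stuck on via its admin socket, e.g.:

# ceph daemon osd.1 dump_ops_in_flight
# ceph daemon osd.1 dump_historic_ops

(osd.1 is just an example id; these need to be run on the host where that OSD
is running).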