Hi Dongdong.

> is simple and can be applied cleanly.

I understand this statement from a developer's perspective. Now, try to explain 
to a user with a cephadm-deployed containerized cluster how to build a 
container from source, how to point cephadm at that container, and what to do 
for the next upgrade. I think "simple" depends on context. Applying a patch to 
a production system is currently an expert operation, I'm afraid.
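
For reference, the cephadm side alone might look roughly like the sketch below, 
assuming a patched image has already been built and pushed to a registry the 
hosts can reach (the image names are hypothetical); building that image is the 
hard part, and the next stock upgrade drops the patch again:

    # hypothetical registry/tag for the patched build
    PATCHED_IMG=registry.example.com/ceph/ceph:16.2.11-onode-put-fix
    # move the whole cluster onto the custom image
    ceph orch upgrade start --image $PATCHED_IMG
    # a later upgrade to a stock image loses the patch again
    ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.12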

If you have instructions for building a ceph-container with the patch applied, 
I would be very interested. I was asking for a source container for exactly 
this reason. As far as I can tell from the conversation, this is quite a 
project in itself. The thread was "Re: Building ceph packages in containers? 
[was: Ceph debian/ubuntu packages build]", but I can't find it on the mailing 
list any more. There seems to be an archived version: 
https://www.spinics.net/lists/ceph-users/msg73231.html

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dongdong Tao <[email protected]>
Sent: 11 January 2023 04:30:14
To: Frank Schilder
Cc: Igor Fedotov; [email protected]; [email protected]
Subject: Re: [ceph-users] Re: OSD crash on Onode::put

Hi Frank,

I don't have an operational workaround, but the patch 
https://github.com/ceph/ceph/pull/46911/commits/f43f596aac97200a70db7a70a230eb9343018159
is simple and can be applied cleanly.
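
For what it's worth, a minimal sketch of applying that commit to a source 
checkout (the release tag below is just an example; GitHub serves the commit as 
a mailbox patch when you append .patch to its URL):

    cd ceph
    git checkout v16.2.11   # or whichever release you build from
    curl -L https://github.com/ceph/ceph/commit/f43f596aac97200a70db7a70a230eb9343018159.patch | git am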

Yes, restarting the OSD will clear the pool entries. You can restart it when 
the bluestore_onode item count is very low (e.g. less than 10) if that really 
helps, but I think you'll need to tune and monitor the performance until you 
find a number that suits your cluster best.
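
As a rough sketch, the check on an OSD host could look like the following (the 
OSD id is an example, and the JSON field names of dump_mempools may differ 
between releases):

    OSD=19
    # read the current bluestore_onode item count from the OSD's mempool stats
    ITEMS=$(ceph daemon osd.$OSD dump_mempools | jq '.mempool.by_pool.bluestore_onode.items')
    if [ "$ITEMS" -lt 10 ]; then
        ceph orch daemon restart osd.$OSD   # or: systemctl restart ceph-osd@$OSD
    fi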

But it can't help with the crash, since in general the crash itself is 
basically a restart anyway.

Regards,
Dongdong

On Tue, Jan 10, 2023 at 8:21 PM Serkan Çoban <[email protected]> wrote:
Is slot 19 inside the chassis? Do you check the chassis temperature? I
sometimes see a higher failure rate for HDDs inside the chassis than for
those at the front of it. In our case it was related to the temperature
difference.
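
A quick way to compare drive temperatures across slots, as a sketch (the 
device path is an example; the attribute name varies by vendor):

    smartctl -A /dev/sdX | grep -i temperature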

On Tue, Jan 10, 2023 at 1:28 PM Frank Schilder <[email protected]> wrote:
>
> Following up on my previous post: we have identical OSD hosts. The very 
> strange observation now is that all outlier OSDs are in exactly the same 
> disk slot on these hosts. We have 5 problematic OSDs and they are all in slot 
> 19 on 5 different hosts. This is an extremely strange and unlikely 
> coincidence.
>
> Are there any specific conditions, possibly hardware-related, under which 
> this problem appears or is amplified?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
