----- On 29 Nov 2016 15:20, Nick Fisk n...@fisk.me.uk wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Trygve Vea
>> Sent: 29 November 2016 14:07
>> To: ceph-users <ceph-us...@ceph.com>
>> Subject: [ceph-users] Regarding loss of heartbeats
>> 
>> Since Jewel, we've seen quite a bit of funky behaviour in Ceph.  I've
>> written about it a few times to the mailing list.
>> 
>> Higher CPU utilization after the upgrade / loss of heartbeats.  We've
>> looked at our network setup, and we've optimized away some potential
>> bottlenecks in a few places.
>> 
>> An interesting thing regarding the loss of heartbeats: we have
>> observed OSDs running on the same host losing heartbeats against each
>> other.  I'm not sure why they are connected at all (we have had some
>> remapped/degraded placement groups over the weekend, maybe that's
>> why) - but I have a hard time pointing the finger at our network when
>> the heartbeat is lost between two OSDs on the same server.
>> 
>> 
>> I've been staring myself blind at this problem for a while, and just
>> now noticed a pretty new bug report that I want to believe is related
>> to what I am experiencing: http://tracker.ceph.com/issues/18042
>> 
>> We had one OSD hit its suicide timeout and kill itself last night, and
>> one can see that several of the failed heartbeats are between OSDs on
>> the same node.  (zgrep '10.22.9.21.*10.22.9.21' ceph-osd.2.gz)
>> 
>> http://employee.tv.situla.bitbit.net/ceph-osd.2.gz
>> 
>> 
>> Does anyone have any thoughts about this?  Are we stumbling on a known
>> or an unknown bug in Ceph?
> 
> Hi Trygve,

Hi Nick!

> I was seeing similar things after upgrading to 10.2.3: definitely
> problems where OSDs on the same nodes were marking each other out,
> even though the cluster was fairly idle.  I found that it seemed to be
> caused by kernel 4.7; nodes in the same cluster that were on 4.4 were
> unaffected.  After downgrading all nodes to 4.4, everything has been
> really stable for me.

I'm not sure this applies to our setup.  Our upgrade to Jewel didn't 
include a kernel upgrade as far as I recall (and if it did, it was a 
minor release).
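
For what it's worth, here is how I'd double-check that on CentOS 7.
This is a minimal sketch, assuming the kernels were installed through
yum/rpm so the install history is available:

    # List installed kernels, newest first, with install dates
    rpm -q --last kernel

    # Compare against when the Ceph packages were upgraded to Jewel
    rpm -q --last ceph-osd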

We're running 3.10.0-327.36.3.el7.x86_64, and we follow the latest 
stable kernel provided by CentOS 7.  We've added the latest hpsa module 
as provided by HP to work around a known crash bug in that driver, but 
nothing special other than that.
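
To be complete, the kernel and hpsa driver versions can be collected
from every node like this.  The hostnames below are placeholders for
our OSD nodes, and it assumes passwordless ssh and that the hpsa
module is installed:

    for host in osd-node-01 osd-node-02 osd-node-03; do
        echo "== $host =="
        ssh "$host" 'uname -r; modinfo -F version hpsa'
    done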

The problems started as of Jewel.
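
In case anyone wants to look for the same pattern in their own logs,
this is roughly a generalisation of the zgrep above.  It assumes the
default /var/log/ceph log location and a single address per host
(adjust the hostname -i extraction for multi-homed setups):

    # Count heartbeat failures where both peers live on this host
    HOST_IP=$(hostname -i)
    for log in /var/log/ceph/ceph-osd.*.log*; do
        n=$(zgrep -c "heartbeat_check.*${HOST_IP}.*${HOST_IP}" "$log")
        [ "$n" -gt 0 ] && echo "$log: $n same-host heartbeat failures"
    done

zgrep reads both the current logs and the rotated .gz ones, which is
why I'm using it over plain grep.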


-- 
Trygve
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
