The only other thing I can think of is that a firewall is dropping idle connections, although Ceph should be sending heartbeats far more often than the 5-minute idle timeout that is common on most firewalls. In the logs, is it the monitor marking the OSDs out on its own, or are the OSD peers reporting them failed? That would give you an idea of where to look.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
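A quick way to tell which it is -- rough sketch only, the log paths assume a default Luminous install and "osd.7" is just a placeholder for whichever OSD keeps getting marked down:

# On a monitor host: the cluster log records which daemon reported the failure
$ grep "reported failed" /var/log/ceph/ceph.log | tail -n 20

# On the affected host: the OSD logs "wrongly marked me down" when it thinks
# the failure report was bogus
$ grep -i "marked me down" /var/log/ceph/ceph-osd.7.log

# Heartbeat timing currently in effect for that OSD (run on its host)
$ ceph daemon osd.7 config show | grep -E "osd_heartbeat_(interval|grace)"

If the failure reports keep coming from the same one or two peer OSDs, I'd look at the links between those hosts first. The heartbeat timing knobs you asked about are sketched after the quoted thread below.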
On Sat, Aug 17, 2019 at 10:26 AM Lorenz Kiefner <root+cephus...@deinadmin.de> wrote:
> Hello again,
>
> all links are at least 10/50 mbit upstream/downstream, mostly 40/100 mbit,
> with some VMs at hosting companies running at 1/1 gbit. All my 39 OSDs on
> 17 hosts in 11 locations (5 of them are connected at the moment via consumer
> internet links) are in a nearly full mesh network of wireguard VPN links,
> routed by bird with OSPF. Speed is not great, as you can imagine, but
> sufficient for me.
>
> Some hosts are x86, some are ARMv7 on ODROID HC-1 (a Samsung smartphone
> SoC). Could this mix of architectures be a problem?
>
> My goal is to provide a shared filesystem with my friends and to provide
> backup space on rbd images. This seems possible, but it is really annoying
> when OSDs are randomly marked down.
>
> If there were network issues, I would expect all OSDs on the affected host
> to be marked down, but only one OSD on this host is marked down. If I log
> in on that host and restart the OSD, the same OSD will probably be marked
> down again within 10-30 minutes. And this only happens if there is *no*
> backfill or recovery running. I would expect network issues and packet
> drops to be more likely on a saturated line than on an idling one.
>
> Are there some (more) config keys for OSD ping timeouts in luminous? I
> would be very happy about some more ideas!
>
> Thank you all
>
> Lorenz
>
>
> On 16.08.19 at 17:01, Robert LeBlanc wrote:
>
> Personally, I would not try to create a Ceph cluster across consumer
> internet links; their upload speed is usually so slow and Ceph is so chatty
> that it would make for a horrible experience. If you are looking for a
> backup solution, I would look at some sort of n-way rsync setup, or
> btrfs/zfs volumes that send/receive to each other. I really don't think
> Ceph is a good fit.
>
>
> On Thu, Aug 15, 2019 at 12:37 AM Lorenz Kiefner <
> root+cephus...@deinadmin.de> wrote:
>
>> Oh no, it's not that bad. It's
>>
>> $ ping -s 65000 dest.inati.on
>>
>> on a VPN connection that has an MTU of 1300 via IPv6. So I suspect that I
>> only get an answer when all 51 fragments are fully returned. It's clear
>> that big packets with lots of fragments are more affected by packet loss
>> than 64-byte pings.
>>
>> I just repeated this ping test (at 9 o'clock in the morning) and got
>> hardly any drops (less than 1%), even at the 64k size. So it really
>> depends on the time of day. It seems like some ISPs are dropping packets,
>> especially in the evening...
>>
>> A few minutes ago I restarted all down-marked OSDs, but they are getting
>> marked down again... Ceph seems to be tolerant of packet loss (it surely
>> affects performance, but that is irrelevant for me).
>>
>> Could erasure coded pools pose some problems?
>>
>> Thank you all for every hint!
>>
>> Lorenz
>>
>>
>> On 15.08.19 at 08:51, Janne Johansson wrote:
>>
>> On Wed, 14 Aug 2019 at 17:46, Lorenz Kiefner <
>> root+cephus...@deinadmin.de> wrote:
>>
>>> Is ceph sensitive to packet loss? On some VPN links I have up to 20%
>>> packet loss on 64k packets but less than 3% on 5k packets in the
>>> evenings.
>>>
>>
>> 20% seems crazy high, there must be something really wrong there.
>>
>> At 20%, you would get tons of packet timeouts to wait for on all those
>> lost frames, then resends of (at least!) those 20% extra, which in turn
>> would lead to 20% of those resends getting lost, all while the main
>> streams of data try to move forward whenever some older packet does get
>> through. This is a really bad situation to design for.
>>
>> I think you should look for a link solution that doesn't drop that many
>> packets, instead of changing the software you try to run over that link;
>> everything else will notice the loss too and act badly in some way or
>> another.
>>
>> Heck, 20% is like taking a math schoolbook, removing all instances of
>> "3" and "8", and seeing if kids can learn to count from it. 8-/
>>
>> --
>> May the most significant bit of your life be positive.
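On the question quoted above about extra config keys for OSD ping timeouts in Luminous: the knobs I'd look at are the OSD heartbeat interval/grace and the monitor-side reporter threshold. A rough ceph.conf sketch -- the values are only illustrative and the defaults are from memory, so please verify them against your release:

[global]
# How long peers wait for a heartbeat reply before reporting an OSD as
# failed (default 20 s). Raising it makes the cluster more forgiving on
# lossy or slow VPN links, at the cost of slower failure detection. It goes
# in [global] because, if I remember right, the monitors also consult this
# value when evaluating failure reports.
osd heartbeat grace = 60

[osd]
# How often an OSD pings its peers (default ~6 s)
osd heartbeat interval = 6

[mon]
# How many distinct reporters the monitor wants before marking an OSD down
# (default 2, counted per reporter subtree, usually per host)
mon osd min down reporters = 3

The same can be injected at runtime for a quick test, e.g. "ceph tell osd.* injectargs '--osd_heartbeat_grace 60'", but that is untested on your setup and does not survive a restart.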
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io