The only other thing I can think of is that a firewall is dropping idle connections, although Ceph should be sending heartbeats far more often than the 5-minute idle timeout that is common on most firewalls. In the logs, is it the monitor marking the OSDs out on its own, or are the OSD peers reporting them failed? That would give you an idea of where to look.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
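A quick way to tell which it is -- rough sketch only, the log paths assume a default Luminous install and "osd.7" is just a placeholder for whichever OSD keeps getting marked down:

# On a monitor host: the cluster log records which daemon reported the failure
$ grep "reported failed" /var/log/ceph/ceph.log | tail -n 20

# On the affected host: the OSD logs "wrongly marked me down" when it thinks
# the failure report was bogus
$ grep -i "marked me down" /var/log/ceph/ceph-osd.7.log

# Heartbeat timing currently in effect for that OSD (run on its host)
$ ceph daemon osd.7 config show | grep -E "osd_heartbeat_(interval|grace)"

If the failure reports keep coming from the same one or two peer OSDs, I'd look at the links between those hosts first. The heartbeat timing knobs you asked about are sketched after the quoted thread below.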
On Sat, Aug 17, 2019 at 10:26 AM Lorenz Kiefner <root+cephus...@deinadmin.de> wrote:
> Hello again,
>
> all links are at least 10/50 mbit upstream/downstream, mostly 40/100 mbit,
> with some VMs at hosting companies running at 1/1 gbit. All my 39 OSDs on
> 17 hosts in 11 locations (5 of them are connected at the moment via consumer
> internet links) are in a nearly full mesh network of wireguard VPN links,
> routed by bird with OSPF. Speed is not great, as you can imagine, but
> sufficient for me.
>
> Some hosts are x86, some are ARMv7 on ODROID HC-1 (a Samsung smartphone
> SoC). Could this mix of architectures be a problem?
>
> My goal is to provide a shared filesystem with my friends and to provide
> backup space on rbd images. This seems possible, but it is really annoying
> when OSDs are randomly marked down.
>
> If there were network issues, I would expect all OSDs on the affected host
> to be marked down, but only one OSD on this host is marked down. If I log
> in on that host and restart the OSD, the same OSD will probably be marked
> down again within 10-30 minutes. And this only happens if there is *no*
> backfill or recovery running. I would expect network issues and packet
> drops to be more likely on a saturated line than on an idling one.
>
> Are there some (more) config keys for OSD ping timeouts in luminous? I
> would be very happy about some more ideas!
>
> Thank you all
>
> Lorenz
>
>
> On 16.08.19 at 17:01, Robert LeBlanc wrote:
>
> Personally, I would not try to create a Ceph cluster across consumer
> internet links; their upload speed is usually so slow and Ceph is so chatty
> that it would make for a horrible experience. If you are looking for a
> backup solution, I would look at some sort of n-way rsync setup, or
> btrfs/zfs volumes that send/receive to each other. I really don't think
> Ceph is a good fit.
>
>
> On Thu, Aug 15, 2019 at 12:37 AM Lorenz Kiefner <
> root+cephus...@deinadmin.de> wrote:
>
>> Oh no, it's not that bad. It's
>>
>> $ ping -s 65000 dest.inati.on
>>
>> on a VPN connection that has an MTU of 1300 via IPv6. So I suspect that I
>> only get an answer when all 51 fragments are fully returned. It's clear
>> that big packets with lots of fragments are more affected by packet loss
>> than 64-byte pings.
>>
>> I just repeated this ping test (at 9 o'clock in the morning) and got
>> hardly any drops (less than 1%), even at the 64k size. So it really
>> depends on the time of day. It seems like some ISPs are dropping packets,
>> especially in the evening...
>>
>> A few minutes ago I restarted all down-marked OSDs, but they are getting
>> marked down again... Ceph seems to be tolerant of packet loss (it surely
>> affects performance, but that is irrelevant for me).
>>
>> Could erasure coded pools pose some problems?
>>
>> Thank you all for every hint!
>>
>> Lorenz
>>
>>
>> On 15.08.19 at 08:51, Janne Johansson wrote:
>>
>> On Wed, 14 Aug 2019 at 17:46, Lorenz Kiefner <
>> root+cephus...@deinadmin.de> wrote:
>>
>>> Is ceph sensitive to packet loss? On some VPN links I have up to 20%
>>> packet loss on 64k packets but less than 3% on 5k packets in the
>>> evenings.
>>>
>>
>> 20% seems crazy high, there must be something really wrong there.
>>
>> At 20%, you would get tons of packet timeouts to wait for on all those
>> lost frames, then resends of (at least!) those 20% extra, which in turn
>> would lead to 20% of those resends getting lost, all while the main
>> streams of data try to move forward whenever some older packet does get
>> through. This is a really bad situation to design for.
>>
>> I think you should look for a link solution that doesn't drop that many
>> packets, instead of changing the software you try to run over that link;
>> everything else will notice the loss too and act badly in some way or
>> another.
>>
>> Heck, 20% is like taking a math schoolbook, removing all instances of
>> "3" and "8", and seeing if kids can learn to count from it. 8-/
>>
>> --
>> May the most significant bit of your life be positive.
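On the question quoted above about extra config keys for OSD ping timeouts in Luminous: the knobs I'd look at are the OSD heartbeat interval/grace and the monitor-side reporter threshold. A rough ceph.conf sketch -- the values are only illustrative and the defaults are from memory, so please verify them against your release:

[global]
# How long peers wait for a heartbeat reply before reporting an OSD as
# failed (default 20 s). Raising it makes the cluster more forgiving on
# lossy or slow VPN links, at the cost of slower failure detection. It goes
# in [global] because, if I remember right, the monitors also consult this
# value when evaluating failure reports.
osd heartbeat grace = 60

[osd]
# How often an OSD pings its peers (default ~6 s)
osd heartbeat interval = 6

[mon]
# How many distinct reporters the monitor wants before marking an OSD down
# (default 2, counted per reporter subtree, usually per host)
mon osd min down reporters = 3

The same can be injected at runtime for a quick test, e.g. "ceph tell osd.* injectargs '--osd_heartbeat_grace 60'", but that is untested on your setup and does not survive a restart.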
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io