On Thu, Jan 17, 2019 at 4:42 AM Johan Thomsen <wr...@ownrisk.dk> wrote:
> Thank you for responding!
>
> First thing: I disabled the firewall on all the nodes.
> More specifically, not firewalld but the NixOS firewall, since I run NixOS.
> I can netcat both UDP and TCP traffic on all ports between all nodes
> without problems.
>
> Next, I tried raising the MTU to 9000 on the NICs connected to the
> cluster network, although I don't see why the MTU should affect the
> heartbeat.
> I have two bonded NICs connected to the cluster network (MTU 9000) and
> two separate bonded NICs attached to the public network (MTU 1500).
> I've tested traffic and routing on both pairs of NICs and traffic gets
> through without issues, apparently.

Try 'osd heartbeat min size = 100' in ceph.conf on all OSD nodes and restart;
we have seen this in some network configurations with an MTU size mismatch
between ports. (A minimal sketch of the change is at the end of this mail.)

> None of the above solved the problem :-(
>
> On Thu, 17 Jan 2019 at 12:01, Kevin Olbrich <k...@sv01.de> wrote:
> >
> > Are you sure no service like firewalld is running?
> > Did you check that all machines have the same MTU and that jumbo frames
> > are enabled if needed?
> >
> > I had this problem when I first started with Ceph and forgot to
> > disable firewalld.
> > Replication worked perfectly fine, but the OSD was kicked out every few
> > seconds.
> >
> > Kevin
> >
> > On Thu, 17 Jan 2019 at 11:57, Johan Thomsen <wr...@ownrisk.dk> wrote:
> > >
> > > Hi,
> > >
> > > I have a sad Ceph cluster.
> > > All my OSDs complain about failed replies on heartbeats, like so:
> > >
> > > osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> > > ever on either front or back, first ping sent 2019-01-16
> > > 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
> > >
> > > I've checked the network sanity all I can: all Ceph ports are open
> > > between nodes on both the public network and the cluster network, and
> > > I have no problems sending traffic back and forth between nodes.
> > > I've tried tcpdump'ing, and traffic is passing in both directions
> > > between the nodes, but unfortunately I don't natively speak the Ceph
> > > protocol, so I can't figure out what's going wrong in the heartbeat
> > > conversation.
> > >
> > > Still:
> > >
> > > # ceph health detail
> > >
> > > HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> > > pgs inactive, 1072 pgs peering
> > > OSDMAP_FLAGS nodown,noout flag(s) set
> > > PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
> > >     pg 7.3cd is stuck inactive for 245901.560813, current state
> > > creating+peering, last acting [13,41,1]
> > >     pg 7.3ce is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [1,40,7]
> > >     pg 7.3cf is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [0,42,9]
> > >     pg 7.3d0 is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [20,8,38]
> > >     pg 7.3d1 is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [10,20,42]
> > > (....)
> > >
> > > I've set "noout" and "nodown" to prevent all OSDs from being removed
> > > from the cluster. They are all running and marked as "up".
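
(For reference, a minimal sketch of the flag commands being described above,
assuming the standard ceph CLI on a mon/admin node:)

    ceph osd set noout      # don't mark OSDs "out" while debugging
    ceph osd set nodown     # don't mark OSDs "down" on missed heartbeats
    # and once heartbeats work again:
    # ceph osd unset nodown
    # ceph osd unset noout
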
> > >
> > > # ceph osd tree
> > >
> > > ID  CLASS WEIGHT    TYPE NAME                        STATUS REWEIGHT PRI-AFF
> > >  -1       249.73434 root default
> > > -25       166.48956     datacenter m1
> > > -24        83.24478         pod kube1
> > > -35        41.62239             rack 10
> > > -34        41.62239                 host ceph-sto-p102
> > >  40   hdd   7.27689                     osd.40           up  1.00000 1.00000
> > >  41   hdd   7.27689                     osd.41           up  1.00000 1.00000
> > >  42   hdd   7.27689                     osd.42           up  1.00000 1.00000
> > > (....)
> > >
> > > I'm at a point where I don't know which options to tweak or which
> > > logs to check anymore.
> > >
> > > Any debug hint would be very much appreciated.
> > >
> > > By the way, I have no important data in the cluster (yet), so if the
> > > solution is to drop all OSDs and recreate them, that's OK for now.
> > > But I'd really like to know how the cluster ended up in this state.
> > >
> > > /Johan
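
For reference, a minimal sketch of the ceph.conf change suggested above (the
[osd] section placement and the systemd restart are assumptions; adapt them to
your deployment, e.g. NixOS manages the services differently):

    # /etc/ceph/ceph.conf on every OSD node
    [osd]
    # lower the heartbeat message padding so heartbeats still get through
    # even if an MTU mismatch on the path is dropping larger packets
    osd heartbeat min size = 100

    # then restart the OSDs, for example:
    # systemctl restart ceph-osd.target
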
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com