On Thu, Jan 17, 2019 at 4:42 AM Johan Thomsen <wr...@ownrisk.dk> wrote:

> Thank you for responding!
>
> First thing: I disabled the firewall on all the nodes.
> More specifically not firewalld, but the NixOS firewall, since I run NixOS.
> I can netcat both udp and tcp traffic on all ports between all nodes
> without problems.
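> (i.e. checks along the lines of the following, run from every node against
> both the public and the cluster address of the other nodes, on the default
> osd port range; exact flags depend on the netcat flavour:
>
> # nc -vz  192.168.160.237 6800-7300
> # nc -vzu 192.168.160.237 6800-7300
>
> )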
>
> Next, I tried raising the mtu to 9000 on the nics where the cluster
> network is connected - although I don't see why the mtu should affect
> the heartbeat.
> I have two bonded nics connected to the cluster network (mtu 9000) and
> two separate bonded nics hooked up to the public network (mtu 1500).
> I've tested traffic and routing on both pairs of nics and traffic gets
> through without issues, apparently.
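> (For what it's worth, a don't-fragment ping at full payload size is a quick
> end-to-end check of the jumbo path, e.g. on Linux:
>
> # ping -M do -s 8972 <cluster ip of another node>    (8972 + 28 bytes of headers = 9000)
> # ping -M do -s 1472 <public ip of another node>     (1472 + 28 = 1500)
>
> )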
>
Try 'osd heartbeat min size = 100' in ceph.conf on all osd nodes and
restart; we have seen this in some network
configurations with an mtu size mismatch between ports.
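
For reference, something like this in ceph.conf on every osd node (the
injectargs line is just a way to try the value at runtime before a proper
restart, assuming a reasonably recent release):

[osd]
osd heartbeat min size = 100

# ceph tell osd.* injectargs '--osd_heartbeat_min_size 100'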

>
>
> None of the above solved the problem :-(
>
>
> On Thu, Jan 17, 2019 at 12:01 PM Kevin Olbrich <k...@sv01.de> wrote:
> >
> > Are you sure no service like firewalld is running?
> > Did you check that all machines have the same MTU and that jumbo frames
> > are enabled where needed?
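> > (e.g. "systemctl is-active firewalld" / "iptables -S" on every node, and
> > "ip link | grep mtu" to compare the mtu on each interface)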
> >
> > I had this problem when I first started with ceph and forgot to
> > disable firewalld.
> > Replication worked perfectly fine but the OSD was kicked out every few seconds.
> >
> > Kevin
> >
> > On Thu, Jan 17, 2019 at 11:57 AM Johan Thomsen <wr...@ownrisk.dk> wrote:
> > >
> > > Hi,
> > >
> > > I have a sad ceph cluster.
> > > All my osds complain about failed heartbeat replies, like so:
> > >
> > > osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> > > ever on either front or back, first ping sent 2019-01-16
> > > 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
> > >
> > > .. I've checked the network sanity as well as I can: all ceph ports are
> > > open between nodes on both the public network and the cluster network,
> > > and I have no problems sending traffic back and forth between nodes.
> > > I've tried tcpdump'ing and traffic is passing in both directions
> > > between the nodes, but unfortunately I don't natively speak the ceph
> > > protocol, so I can't figure out what's going wrong in the heartbeat
> > > conversation.
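> > > (Roughly like this, on both the public- and the cluster-network
> > > interface, since the heartbeats go over both front and back:
> > >
> > > # tcpdump -ni <interface> 'tcp and portrange 6800-7300'
> > >
> > > )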
> > >
> > > Still:
> > >
> > > # ceph health detail
> > >
> > > HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> > > pgs inactive, 1072 pgs peering
> > > OSDMAP_FLAGS nodown,noout flag(s) set
> > > PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
> > >     pg 7.3cd is stuck inactive for 245901.560813, current state
> > > creating+peering, last acting [13,41,1]
> > >     pg 7.3ce is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [1,40,7]
> > >     pg 7.3cf is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [0,42,9]
> > >     pg 7.3d0 is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [20,8,38]
> > >     pg 7.3d1 is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [10,20,42]
> > >    (....)
> > >
> > >
> > > I've set "noout" and "nodown" to prevent all osds from being marked down
> > > or removed from the cluster. They are all running and marked as "up".
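> > > (i.e.:
> > >
> > > # ceph osd set noout
> > > # ceph osd set nodown
> > >
> > > )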
> > >
> > > # ceph osd tree
> > >
> > >  ID  CLASS WEIGHT    TYPE NAME                          STATUS REWEIGHT PRI-AFF
> > >  -1       249.73434 root default
> > > -25       166.48956     datacenter m1
> > > -24        83.24478         pod kube1
> > > -35        41.62239             rack 10
> > > -34        41.62239                 host ceph-sto-p102
> > >  40   hdd   7.27689                     osd.40             up  1.00000 1.00000
> > >  41   hdd   7.27689                     osd.41             up  1.00000 1.00000
> > >  42   hdd   7.27689                     osd.42             up  1.00000 1.00000
> > >    (....)
> > >
> > > I'm at a point where I don't know which options or which logs to check
> > > anymore.
> > >
> > > Any debug hint would be very much appreciated.
> > >
> > > btw. I have no important data in the cluster (yet), so if the solution
> > > is to drop all osds and recreate them, that's ok for now. But I'd really
> > > like to know how the cluster ended up in this state.
> > >
> > > /Johan
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
