Thanks for the feedback -- I'll post back with more detailed logs if anything looks fishy!
On Tue, Mar 25, 2014 at 1:10 PM, Gregory Farnum <g...@inktank.com> wrote: > Well, you could try running with messenger debugging cranked all the > way up and see if there's something odd happening there (eg, not > handling incoming messages), but based on not having any other reports > of this, I think your networking stack is unhappy in some way. *shrug* > (Higher log levels showing what the individual pipes are doing will > narrow it down on the Ceph side.) > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > > On Tue, Mar 25, 2014 at 10:05 AM, Travis Rhoden <trho...@gmail.com> wrote: > > > > > > > > On Tue, Mar 25, 2014 at 12:53 PM, Gregory Farnum <g...@inktank.com> > wrote: > >> > >> On Tue, Mar 25, 2014 at 9:24 AM, Travis Rhoden <trho...@gmail.com> > wrote: > >> > Okay, last one until I get some guidance. Sorry for the spam, but > >> > wanted to > >> > paint a full picture. Here are debug logs from all three mons, > >> > capturing > >> > what looks like an election sequence to me: > >> > > >> > ceph0: > >> > 2014-03-25 16:17:24.324846 7fa5c53fc700 5 > >> > mon.ceph0@0(electing).elector(35) > >> > start -- can i be leader? > >> > 2014-03-25 16:17:24.324900 7fa5c53fc700 1 > >> > mon.ceph0@0(electing).elector(35) > >> > init, last seen epoch 35 > >> > 2014-03-25 16:17:24.324913 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> > >> > mon.1 > >> > 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 > >> > propose > >> > 35) v4 -- ?+0 0x263d480 > >> > 2014-03-25 16:17:24.324948 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> > >> > mon.2 > >> > 10.10.30.2:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 > >> > propose > >> > 35) v4 -- ?+0 0x263d6c0 > >> > 2014-03-25 16:17:25.353975 7fa5c4bfb700 1 -- 10.10.30.0:6789/0 <== > >> > mon.2 > >> > 10.10.30.2:6789/0 493 ==== > election(b3f38955-4321-4850-9ddb-3b09940dc951 > >> > propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x265fd80 con 0x1df0c60 > >> > 2014-03-25 16:17:25.354042 7fa5c4bfb700 5 > >> > mon.ceph0@0(electing).elector(35) > >> > handle_propose from mon.2 > >> > 2014-03-25 16:17:29.325107 7fa5c53fc700 5 > >> > mon.ceph0@0(electing).elector(35) > >> > election timer expired > >> > > >> > ceph1: > >> > 2014-03-25 16:17:24.325529 7ffe48cc1700 5 > >> > mon.ceph1@1(electing).elector(35) > >> > handle_propose from mon.0 > >> > 2014-03-25 16:17:24.325535 7ffe48cc1700 5 > >> > mon.ceph1@1(electing).elector(35) > >> > defer to 0 > >> > 2014-03-25 16:17:24.325546 7ffe48cc1700 1 -- 10.10.30.1:6789/0 --> > >> > mon.0 > >> > 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 > ack > >> > 35) > >> > v4 -- ?+0 0x1bbfb40 > >> > 2014-03-25 16:17:25.354038 7ffe48cc1700 1 -- 10.10.30.1:6789/0 <== > >> > mon.2 > >> > 10.10.30.2:6789/0 489 ==== > election(b3f38955-4321-4850-9ddb-3b09940dc951 > >> > propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x1bbf6c0 con 0x14d9b00 > >> > 2014-03-25 16:17:25.354102 7ffe48cc1700 5 > >> > mon.ceph1@1(electing).elector(35) > >> > handle_propose from mon.2 > >> > 2014-03-25 16:17:25.354113 7ffe48cc1700 5 > >> > mon.ceph1@1(electing).elector(35) > >> > no, we already acked 0 > >> > > >> > ceph2: > >> > 2014-03-25 16:17:20.353135 7f80d0013700 5 > >> > mon.ceph2@2(electing).elector(35) > >> > election timer expired > >> > 2014-03-25 16:17:20.353154 7f80d0013700 5 > >> > mon.ceph2@2(electing).elector(35) > >> > start -- can i be leader? > >> > 2014-03-25 16:17:20.353225 7f80d0013700 1 > >> > mon.ceph2@2(electing).elector(35) > >> > init, last seen epoch 35 > >> > 2014-03-25 16:17:20.353238 7f80d0013700 1 -- 10.10.30.2:6789/0 --> > >> > mon.0 > >> > 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 > >> > propose > >> > 35) v4 -- ?+0 0x18e7900 > >> > 2014-03-25 16:17:20.353272 7f80d0013700 1 -- 10.10.30.2:6789/0 --> > >> > mon.1 > >> > 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 > >> > propose > >> > 35) v4 -- ?+0 0x18e7d80 > >> > 2014-03-25 16:17:25.353559 7f80d0013700 5 > >> > mon.ceph2@2(electing).elector(35) > >> > election timer expired > >> > 2014-03-25 16:17:25.353578 7f80d0013700 5 > >> > mon.ceph2@2(electing).elector(35) > >> > start -- can i be leader? > >> > 2014-03-25 16:17:25.353647 7f80d0013700 1 > >> > mon.ceph2@2(electing).elector(35) > >> > init, last seen epoch 35 > >> > 2014-03-25 16:17:25.353660 7f80d0013700 1 -- 10.10.30.2:6789/0 --> > >> > mon.0 > >> > 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 > >> > propose > >> > 35) v4 -- ?+0 0x19b7240 > >> > 2014-03-25 16:17:25.353695 7f80d0013700 1 -- 10.10.30.2:6789/0 --> > >> > mon.1 > >> > 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 > >> > propose > >> > 35) v4 -- ?+0 0x19b76c0 > >> > 2014-03-25 16:17:30.354040 7f80d0013700 5 > >> > mon.ceph2@2(electing).elector(35) > >> > election timer expired > >> > > >> > Oddly, it looks to me like mon.2 (ceph2) never handles/receives the > >> > proposal > >> > from mon.0 (ceph0). But I admit I have no clue how monitor election > >> > works. > >> > >> Likewise, mon.1 sends an ack to mon.0 that is never received. I think > >> you've got a busted firewall that's allowing some one-way > >> communication, but not two-way. > > > > > > Hmm, I see that is missing as well. There is no hardware firewall in > play, > > and I've check iptables on all nodes -- no firewall rules are in place at > > this time. Just default ACCEPT rules everywhere. > > > > However, I do see some weirdness here: > > > > ceph0: > > # netstat -atln > > Active Internet connections (servers and established) > > Proto Recv-Q Send-Q Local Address Foreign Address State > > tcp 0 0 10.10.30.0:6789 0.0.0.0:* > LISTEN > > tcp 0 0 0.0.0.0:22 0.0.0.0:* > LISTEN > > tcp 0 0 10.1.30.0:22 10.1.0.1:33129 > > ESTABLISHED > > tcp 0 85128 10.10.30.0:40760 10.10.30.2:6789 > > ESTABLISHED > > tcp 0 0 10.10.30.0:43920 10.10.30.1:6789 > > ESTABLISHED > > tcp6 0 0 :::22 :::* > LISTEN > > > > ceph1: > > # netstat -atln > > Active Internet connections (servers and established) > > Proto Recv-Q Send-Q Local Address Foreign Address State > > tcp 0 0 10.10.30.1:6789 0.0.0.0:* > LISTEN > > tcp 0 0 0.0.0.0:22 0.0.0.0:* > LISTEN > > tcp 0 79643 10.10.30.1:6789 10.10.30.0:43920 > > ESTABLISHED > > tcp 0 0 10.10.30.1:41876 10.10.30.2:6789 > > ESTABLISHED > > tcp 0 0 10.1.30.1:22 10.1.0.1:38382 > > ESTABLISHED > > tcp6 0 0 :::22 :::* > LISTEN > > > > ceph2: > > # netstat -atln > > Active Internet connections (servers and established) > > Proto Recv-Q Send-Q Local Address Foreign Address State > > tcp 0 0 10.10.30.2:6789 0.0.0.0:* > LISTEN > > tcp 0 0 0.0.0.0:22 0.0.0.0:* > LISTEN > > tcp 0 0 10.10.30.2:6789 10.10.30.1:41876 > > ESTABLISHED > > tcp 0 0 10.1.30.2:22 10.1.0.1:58195 > > ESTABLISHED > > tcp 0 0 10.10.30.2:6789 10.10.30.0:40760 > > ESTABLISHED > > tcp6 0 0 :::22 :::* > LISTEN > > > > Running those netstat command multiple times shows the Send-Q backed up > on > > both ceph0 and ceph1. The number stays the same. > > > > So I see that mon.0 is backed up sending to mon.2, and that mon.1 is > backed > > up sending to mon.0. And mon.2 has no backup. > > > > Still baffled. > > > >> -Greg > >> > >> Software Engineer #42 @ http://inktank.com | http://ceph.com > > > > >
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com