Thanks for the feedback -- I'll post back with more detailed logs if
anything looks fishy!


On Tue, Mar 25, 2014 at 1:10 PM, Gregory Farnum <g...@inktank.com> wrote:

> Well, you could try running with messenger debugging cranked all the
> way up and see if there's something odd happening there (eg, not
> handling incoming messages), but based on not having any other reports
> of this, I think your networking stack is unhappy in some way. *shrug*
> (Higher log levels showing what the individual pipes are doing will
> narrow it down on the Ceph side.)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Mar 25, 2014 at 10:05 AM, Travis Rhoden <trho...@gmail.com> wrote:
> >
> >
> >
> > On Tue, Mar 25, 2014 at 12:53 PM, Gregory Farnum <g...@inktank.com>
> wrote:
> >>
> >> On Tue, Mar 25, 2014 at 9:24 AM, Travis Rhoden <trho...@gmail.com>
> wrote:
> >> > Okay, last one until I get some guidance.  Sorry for the spam, but
> >> > wanted to
> >> > paint a full picture.  Here are debug logs from all three mons,
> >> > capturing
> >> > what looks like an election sequence to me:
> >> >
> >> > ceph0:
> >> > 2014-03-25 16:17:24.324846 7fa5c53fc700  5
> >> > mon.ceph0@0(electing).elector(35)
> >> > start -- can i be leader?
> >> > 2014-03-25 16:17:24.324900 7fa5c53fc700  1
> >> > mon.ceph0@0(electing).elector(35)
> >> > init, last seen epoch 35
> >> > 2014-03-25 16:17:24.324913 7fa5c53fc700  1 -- 10.10.30.0:6789/0 -->
> >> > mon.1
> >> > 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> >> > propose
> >> > 35) v4 -- ?+0 0x263d480
> >> > 2014-03-25 16:17:24.324948 7fa5c53fc700  1 -- 10.10.30.0:6789/0 -->
> >> > mon.2
> >> > 10.10.30.2:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> >> > propose
> >> > 35) v4 -- ?+0 0x263d6c0
> >> > 2014-03-25 16:17:25.353975 7fa5c4bfb700  1 -- 10.10.30.0:6789/0 <==
> >> > mon.2
> >> > 10.10.30.2:6789/0 493 ====
> election(b3f38955-4321-4850-9ddb-3b09940dc951
> >> > propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x265fd80 con 0x1df0c60
> >> > 2014-03-25 16:17:25.354042 7fa5c4bfb700  5
> >> > mon.ceph0@0(electing).elector(35)
> >> > handle_propose from mon.2
> >> > 2014-03-25 16:17:29.325107 7fa5c53fc700  5
> >> > mon.ceph0@0(electing).elector(35)
> >> > election timer expired
> >> >
> >> > ceph1:
> >> > 2014-03-25 16:17:24.325529 7ffe48cc1700  5
> >> > mon.ceph1@1(electing).elector(35)
> >> > handle_propose from mon.0
> >> > 2014-03-25 16:17:24.325535 7ffe48cc1700  5
> >> > mon.ceph1@1(electing).elector(35)
> >> > defer to 0
> >> > 2014-03-25 16:17:24.325546 7ffe48cc1700  1 -- 10.10.30.1:6789/0 -->
> >> > mon.0
> >> > 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> ack
> >> > 35)
> >> > v4 -- ?+0 0x1bbfb40
> >> > 2014-03-25 16:17:25.354038 7ffe48cc1700  1 -- 10.10.30.1:6789/0 <==
> >> > mon.2
> >> > 10.10.30.2:6789/0 489 ====
> election(b3f38955-4321-4850-9ddb-3b09940dc951
> >> > propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x1bbf6c0 con 0x14d9b00
> >> > 2014-03-25 16:17:25.354102 7ffe48cc1700  5
> >> > mon.ceph1@1(electing).elector(35)
> >> > handle_propose from mon.2
> >> > 2014-03-25 16:17:25.354113 7ffe48cc1700  5
> >> > mon.ceph1@1(electing).elector(35)
> >> > no, we already acked 0
> >> >
> >> > ceph2:
> >> > 2014-03-25 16:17:20.353135 7f80d0013700  5
> >> > mon.ceph2@2(electing).elector(35)
> >> > election timer expired
> >> > 2014-03-25 16:17:20.353154 7f80d0013700  5
> >> > mon.ceph2@2(electing).elector(35)
> >> > start -- can i be leader?
> >> > 2014-03-25 16:17:20.353225 7f80d0013700  1
> >> > mon.ceph2@2(electing).elector(35)
> >> > init, last seen epoch 35
> >> > 2014-03-25 16:17:20.353238 7f80d0013700  1 -- 10.10.30.2:6789/0 -->
> >> > mon.0
> >> > 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> >> > propose
> >> > 35) v4 -- ?+0 0x18e7900
> >> > 2014-03-25 16:17:20.353272 7f80d0013700  1 -- 10.10.30.2:6789/0 -->
> >> > mon.1
> >> > 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> >> > propose
> >> > 35) v4 -- ?+0 0x18e7d80
> >> > 2014-03-25 16:17:25.353559 7f80d0013700  5
> >> > mon.ceph2@2(electing).elector(35)
> >> > election timer expired
> >> > 2014-03-25 16:17:25.353578 7f80d0013700  5
> >> > mon.ceph2@2(electing).elector(35)
> >> > start -- can i be leader?
> >> > 2014-03-25 16:17:25.353647 7f80d0013700  1
> >> > mon.ceph2@2(electing).elector(35)
> >> > init, last seen epoch 35
> >> > 2014-03-25 16:17:25.353660 7f80d0013700  1 -- 10.10.30.2:6789/0 -->
> >> > mon.0
> >> > 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> >> > propose
> >> > 35) v4 -- ?+0 0x19b7240
> >> > 2014-03-25 16:17:25.353695 7f80d0013700  1 -- 10.10.30.2:6789/0 -->
> >> > mon.1
> >> > 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> >> > propose
> >> > 35) v4 -- ?+0 0x19b76c0
> >> > 2014-03-25 16:17:30.354040 7f80d0013700  5
> >> > mon.ceph2@2(electing).elector(35)
> >> > election timer expired
> >> >
> >> > Oddly, it looks to me like mon.2 (ceph2) never handles/receives the
> >> > proposal
> >> > from mon.0 (ceph0).  But I admit I have no clue how monitor election
> >> > works.
> >>
> >> Likewise, mon.1 sends an ack to mon.0 that is never received. I think
> >> you've got a busted firewall that's allowing some one-way
> >> communication, but not two-way.
> >
> >
> > Hmm, I see that is missing as well.  There is no hardware firewall in
> play,
> > and I've check iptables on all nodes -- no firewall rules are in place at
> > this time.  Just default ACCEPT rules everywhere.
> >
> > However, I do see some weirdness here:
> >
> > ceph0:
> > # netstat -atln
> > Active Internet connections (servers and established)
> > Proto Recv-Q Send-Q Local Address           Foreign Address         State
> > tcp        0      0 10.10.30.0:6789         0.0.0.0:*
> LISTEN
> > tcp        0      0 0.0.0.0:22              0.0.0.0:*
> LISTEN
> > tcp        0      0 10.1.30.0:22            10.1.0.1:33129
> > ESTABLISHED
> > tcp        0  85128 10.10.30.0:40760        10.10.30.2:6789
> > ESTABLISHED
> > tcp        0      0 10.10.30.0:43920        10.10.30.1:6789
> > ESTABLISHED
> > tcp6       0      0 :::22                   :::*
>  LISTEN
> >
> > ceph1:
> > # netstat -atln
> > Active Internet connections (servers and established)
> > Proto Recv-Q Send-Q Local Address           Foreign Address         State
> > tcp        0      0 10.10.30.1:6789         0.0.0.0:*
> LISTEN
> > tcp        0      0 0.0.0.0:22              0.0.0.0:*
> LISTEN
> > tcp        0  79643 10.10.30.1:6789         10.10.30.0:43920
> > ESTABLISHED
> > tcp        0      0 10.10.30.1:41876        10.10.30.2:6789
> > ESTABLISHED
> > tcp        0      0 10.1.30.1:22            10.1.0.1:38382
> > ESTABLISHED
> > tcp6       0      0 :::22                   :::*
>  LISTEN
> >
> > ceph2:
> > # netstat -atln
> > Active Internet connections (servers and established)
> > Proto Recv-Q Send-Q Local Address           Foreign Address         State
> > tcp        0      0 10.10.30.2:6789         0.0.0.0:*
> LISTEN
> > tcp        0      0 0.0.0.0:22              0.0.0.0:*
> LISTEN
> > tcp        0      0 10.10.30.2:6789         10.10.30.1:41876
> > ESTABLISHED
> > tcp        0      0 10.1.30.2:22            10.1.0.1:58195
> > ESTABLISHED
> > tcp        0      0 10.10.30.2:6789         10.10.30.0:40760
> > ESTABLISHED
> > tcp6       0      0 :::22                   :::*
>  LISTEN
> >
> > Running those netstat command multiple times shows the Send-Q backed up
> on
> > both ceph0 and ceph1.  The number stays the same.
> >
> > So I see that mon.0 is backed up sending to mon.2, and that mon.1 is
> backed
> > up sending to mon.0.  And mon.2 has no backup.
> >
> > Still baffled.
> >
> >> -Greg
> >>
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to