On Tue, Mar 25, 2014 at 12:28 PM, Alexandre DERUMIER <aderum...@odiso.com>wrote:

> Hi, can you ping between your hosts ?
>

I can.  And, from my first post, I can do things like "telnet 10.10.30.1
6789" from 10.10.30.0 to see that I can actually reach the monitor sockets.


> (just to be sure, what is your netmask ? (as I see 10.10.30.0 for mon1)
>

Everything is on 10.10.0.0/16

Did you really see something that said 10.10.30.0 for mon1?  Because it
should be .1, not .0...  .0 for mon0, .1 for mon1, etc.

Thanks for the response!


>
>
>
>
> ----- Mail original -----
>
> De: "Travis Rhoden" <trho...@gmail.com>
> À: "ceph-users" <ceph-us...@ceph.com>
> Envoyé: Mardi 25 Mars 2014 17:24:19
> Objet: Re: [ceph-users] Monitors stuck in "electing"
>
>
>
>
>
> Okay, last one until I get some guidance. Sorry for the spam, but wanted
> to paint a full picture. Here are debug logs from all three mons, capturing
> what looks like an election sequence to me:
>
> ceph0:
> 2014-03-25 16:17:24.324846 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:17:24.324900 7fa5c53fc700 1 mon.ceph0@0(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:17:24.324913 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 -- ?+0 0x263d480
> 2014-03-25 16:17:24.324948 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.2
> 10.10.30.2:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 -- ?+0 0x263d6c0
> 2014-03-25 16:17:25.353975 7fa5c4bfb700 1 -- 10.10.30.0:6789/0 <== mon.2
> 10.10.30.2:6789/0 493 ==== election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x265fd80 con 0x1df0c60
> 2014-03-25 16:17:25.354042 7fa5c4bfb700 5 mon.ceph0@0(electing).elector(35)
> handle_propose from mon.2
> 2014-03-25 16:17:29.325107 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35)
> election timer expired
>
> ceph1:
> 2014-03-25 16:17:24.325529 7ffe48cc1700 5 mon.ceph1@1(electing).elector(35)
> handle_propose from mon.0
> 2014-03-25 16:17:24.325535 7ffe48cc1700 5 mon.ceph1@1(electing).elector(35)
> defer to 0
> 2014-03-25 16:17:24.325546 7ffe48cc1700 1 -- 10.10.30.1:6789/0 --> mon.0
> 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 ack
> 35) v4 -- ?+0 0x1bbfb40
> 2014-03-25 16:17:25.354038 7ffe48cc1700 1 -- 10.10.30.1:6789/0 <== mon.2
> 10.10.30.2:6789/0 489 ==== election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x1bbf6c0 con 0x14d9b00
> 2014-03-25 16:17:25.354102 7ffe48cc1700 5 mon.ceph1@1(electing).elector(35)
> handle_propose from mon.2
> 2014-03-25 16:17:25.354113 7ffe48cc1700 5 mon.ceph1@1(electing).elector(35)
> no, we already acked 0
>
> ceph2:
> 2014-03-25 16:17:20.353135 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> election timer expired
> 2014-03-25 16:17:20.353154 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:17:20.353225 7f80d0013700 1 mon.ceph2@2(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:17:20.353238 7f80d0013700 1 -- 10.10.30.2:6789/0 --> mon.0
> 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 -- ?+0 0x18e7900
> 2014-03-25 16:17:20.353272 7f80d0013700 1 -- 10.10.30.2:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 -- ?+0 0x18e7d80
> 2014-03-25 16:17:25.353559 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> election timer expired
> 2014-03-25 16:17:25.353578 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:17:25.353647 7f80d0013700 1 mon.ceph2@2(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:17:25.353660 7f80d0013700 1 -- 10.10.30.2:6789/0 --> mon.0
> 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 -- ?+0 0x19b7240
> 2014-03-25 16:17:25.353695 7f80d0013700 1 -- 10.10.30.2:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 -- ?+0 0x19b76c0
> 2014-03-25 16:17:30.354040 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> election timer expired
>
> Oddly, it looks to me like mon.2 (ceph2) never handles/receives the
> proposal from mon.0 (ceph0). But I admit I have no clue how monitor
> election works.
>
> - Travis
>
>
>
>
> On Tue, Mar 25, 2014 at 12:04 PM, Travis Rhoden < trho...@gmail.com >
> wrote:
>
>
>
>
> I bumped debug mon and debug ms up on one of the monitors (ceph0), and
> this is what I see:
>
> 2014-03-25 16:02:19.273406 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35)
> election timer expired
> 2014-03-25 16:02:19.273447 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:02:19.273528 7fa5c53fc700 1 mon.ceph0@0(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:02:19.273543 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 -- ?+0 0x251bd80
> 2014-03-25 16:02:19.273569 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.2
> 10.10.30.2:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 -- ?+0 0x251bb40
> 2014-03-25 16:02:20.284459 7fa5c4bfb700 1 -- 10.10.30.0:6789/0 <== mon.2
> 10.10.30.2:6789/0 312 ==== election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x2515480 con 0x1df0c60
> 2014-03-25 16:02:20.284524 7fa5c4bfb700 5 mon.ceph0@0(electing).elector(35)
> handle_propose from mon.2
> 2014-03-25 16:02:24.273726 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35)
> election timer expired
>
> That just repeats...
>
>
>
>
>
>
> On Tue, Mar 25, 2014 at 11:48 AM, Travis Rhoden < trho...@gmail.com >
> wrote:
>
> <blockquote>
>
>
> Just to emphasize that I don't think it's clock skew, here is the NTP
> state of all three monitors:
>
> # ansible ceph_mons -m command -a "ntpq -p" -kK
> SSH password:
> sudo password [defaults to SSH password]:
> ceph0 | success | rc=0 >>
> remote refid st t when poll reach delay offset jitter
>
> ==============================================================================
> *controller-10g 198.60.73.8 2 u 43 64 377 0.236 0.057 0.097
>
> ceph1 | success | rc=0 >>
> remote refid st t when poll reach delay offset jitter
>
> ==============================================================================
> *controller-10g 198.60.73.8 2 u 39 64 377 0.273 0.035 0.064
>
> ceph2 | success | rc=0 >>
> remote refid st t when poll reach delay offset jitter
>
> ==============================================================================
> *controller-10g 198.60.73.8 2 u 30 64 377 0.201 -0.063 0.063
>
> I think they are pretty well in synch.
>
> - Travis
>
>
>
>
>
>
> On Tue, Mar 25, 2014 at 11:09 AM, Travis Rhoden < trho...@gmail.com >
> wrote:
>
> <blockquote>
>
>
>
>
>
> Hello,
>
> I just deployed a new Emperor cluster using ceph-deploy 1.4. All went very
> smooth, until I rebooted all the nodes. After reboot, the monitors no
> longer form a quorum.
>
> I followed the troubleshooting steps here:
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/
>
> Specifically, I"m in the stat described in this section:
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#most-common-monitor-issues
>
> The state for all the monitors is "electing". The docs say this is most
> likely clock skew, but I do have all nodes synch'd with NTP. I've confirmed
> this multiple times. I've also confirmed the monitors can reach each other
> (by telneting to IP:PORT, and I can see established connections via
> netstat).
>
> I'm baffled.
>
> here is a sample mon_status output:
>
> root@ceph0:~# ceph daemon mon.ceph0 quorum_status
> { "election_epoch": 31,
> "quorum": [],
> "quorum_names": [],
> "quorum_leader_name": "",
> "monmap": { "epoch": 2,
> "fsid": "XXX", (redacted)
> "modified": "2014-03-24 14:35:22.332646",
> "created": "0.000000",
> "mons": [
> { "rank": 0,
> "name": "ceph0",
> "addr": " 10.10.30.0:6789 \/0"},
> { "rank": 1,
> "name": "ceph1",
> "addr": " 10.10.30.1:6789 \/0"},
> { "rank": 2,
> "name": "ceph2",
> "addr": " 10.10.30.2:6789 \/0"}]}}
>
> They all look identical to that.
>
> Any ideas what I can look at besides NTP? The docs really stress that it
> should be clock skew, so I'll keep looking at that...
>
> - Travis
>
>
>
>
> </blockquote>
>
>
> </blockquote>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to