Cool. So far I have tried:

    start on (local-filesystems and net-device-up IFACE=eth0)

    start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1)
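(For the record, I'm not editing the packaged job files directly; these conditions are going into an Upstart override file so they survive package upgrades. Roughly like this, though the job name below is just a guess on my part, so use whichever ceph job actually carries the "start on" condition on your boxes:

    # /etc/init/ceph-all.override   (job name assumed; adjust to your packaging)
    start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1)

A "start on" stanza in an override replaces the one from the matching .conf, so the original job file stays untouched.)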
About to try:

    start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1 and started network-services)

The "local-filesystems" + network-device condition is billed as an
alternative to runlevel if you need to do something *after* networking...
No luck so far. I'll keep trying things out.

On Mon, Aug 26, 2013 at 2:31 PM, Sage Weil <s...@inktank.com> wrote:
> On Mon, 26 Aug 2013, Travis Rhoden wrote:
> > Hi Sage,
> >
> > Thanks for the response. I noticed that as well, and suspected
> > hostname/DHCP/DNS shenanigans. What's weird is that all nodes are
> > identically configured. I also have monitors running on n0 and n12,
> > and they come up fine, every time.
> >
> > Here's the mon_host line from ceph.conf:
> >
> > mon_initial_members = n0, n12, n24
> > mon_host = 10.0.1.0,10.0.1.12,10.0.1.24
> >
> > Just to test /etc/hosts and name resolution...
> >
> > root@n24:~# getent hosts n24
> > 10.0.1.24       n24
> > root@n24:~# hostname -s
> > n24
> >
> > The only loopback entry in /etc/hosts is "127.0.0.1 localhost", so
> > that should be fine.
> >
> > Upon rebooting this node, I've had the monitor come up okay once,
> > maybe out of 12 tries. So it appears to be some kind of race... No
> > clue what is going on. If I stop and start the monitor (or restart),
> > it doesn't appear to change anything.
> >
> > However, on the topic of races, I'm having one other, more pressing
> > issue. Each OSD host is having its hostname assigned via DHCP. Until
> > that assignment is made (during init), the hostname is "localhost",
> > and then it switches over to "n<x>", for some node number. The issue
> > I am seeing is that there is a race between this hostname assignment
> > and the Ceph Upstart scripts, such that sometimes ceph-osd starts
> > while the hostname is still 'localhost'. This then causes the osd
> > location to change in the crushmap, which is going to be a very bad
> > thing. =) When rebooting all my nodes at once (there are several
> > dozen), about 50% move from being under n<x> to localhost. Restarting
> > all the ceph-osd jobs moves them back (because the hostname is
> > defined by then).
> >
> > I'm wondering what kind of delay, or additional "start on" logic, I
> > can add to the upstart script to work around this.
>
> Hmm, this is beyond my upstart-fu, unfortunately. This has come up
> before, actually. Previously we would wait for any interface to come
> up and then start, but that broke with multi-nic machines, and I ended
> up just making things start in runlevel [2345].
>
> James, do you know what should be done to make the job wait for *all*
> network interfaces to be up? Is that even the right solution here?
>
> sage
>
> > On Fri, Aug 23, 2013 at 4:47 PM, Sage Weil <s...@inktank.com> wrote:
> > Hi Travis,
> >
> > On Fri, 23 Aug 2013, Travis Rhoden wrote:
> > > Hey folks,
> > >
> > > I've just done a brand new install of 0.67.2 on a cluster of
> > > Calxeda nodes.
> > >
> > > I have one particular monitor that never joins the quorum when I
> > > restart the node. Looks to me like it has something to do with the
> > > "create-keys" task, which never seems to finish:
> > >
> > > root      1240     1  4 13:03 ?        00:00:02 /usr/bin/ceph-mon --cluster=ceph -i n24 -f
> > > root      1244     1  0 13:03 ?        00:00:00 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> > >
> > > I don't see that task on my other monitors. Additionally, that task
> > > is periodically querying the monitor status:
> > >
> > > root      1240     1  2 13:03 ?        00:00:02 /usr/bin/ceph-mon --cluster=ceph -i n24 -f
> > > root      1244     1  0 13:03 ?        00:00:00 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> > > root      1982  1244 15 13:04 ?        00:00:00 /usr/bin/python /usr/bin/ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok mon_status
> > >
> > > Checking that status myself, I see:
> > >
> > > # ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok mon_status
> > > { "name": "n24",
> > >   "rank": 2,
> > >   "state": "probing",
> > >   "election_epoch": 0,
> > >   "quorum": [],
> > >   "outside_quorum": [
> > >         "n24"],
> > >   "extra_probe_peers": [],
> > >   "sync_provider": [],
> > >   "monmap": { "epoch": 2,
> > >       "fsid": "f0b0d4ec-1ac3-4b24-9eab-c19760ce4682",
> > >       "modified": "2013-08-23 12:55:34.374650",
> > >       "created": "0.000000",
> > >       "mons": [
> > >             { "rank": 0,
> > >               "name": "n0",
> > >               "addr": "10.0.1.0:6789\/0"},
> > >             { "rank": 1,
> > >               "name": "n12",
> > >               "addr": "10.0.1.12:6789\/0"},
> > >             { "rank": 2,
> > >               "name": "n24",
> > >               "addr": "0.0.0.0:6810\/0"}]}}
> >                         ^^^^^^^^^^^^^^^^^^^^
> >
> > This is the problem. I can't remember exactly what causes this,
> > though. Can you verify that the host in the ceph.conf mon_host line
> > matches the IP that is configured on the machine, and that /etc/hosts
> > on the machine doesn't have a loopback address on it?
> >
> > Thanks!
> > sage
> >
> > > Any ideas what is going on here? I don't see anything useful in
> > > /var/log/ceph/ceph-mon.n24.log
> > >
> > > Thanks,
> > >
> > >  - Travis
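P.S. One more idea I'm kicking around for the OSD/localhost race, in case the "start on" conditions never pan out: make the job wait for DHCP to set the real hostname before ceph-osd registers itself. This is only a sketch I haven't actually run; the 30-second cap is arbitrary, and since a pre-start stanza in an override file would replace the packaged job's pre-start, it probably belongs folded into the existing script instead:

    # untested sketch: hold the job until the hostname is no longer "localhost"
    pre-start script
        n=0
        while [ "$(hostname -s)" = "localhost" ] && [ "$n" -lt 30 ]; do
            sleep 1
            n=$((n + 1))
        done
        # as I understand Upstart, exiting non-zero here aborts the start,
        # which seems better than letting the OSD register under localhost
        [ "$(hostname -s)" != "localhost" ]
    end script

No idea yet whether that's the "right" fix, but it would at least close the window where the crushmap location flips.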