Cool. So far I have tried:

    start on (local-filesystems and net-device-up IFACE=eth0)

    start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1)
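(For the record, I'm not editing the packaged job files directly; these conditions are going into an Upstart override file so they survive package upgrades. Roughly like this, though the job name below is just a guess on my part, so use whichever ceph job actually carries the "start on" condition on your boxes:

    # /etc/init/ceph-all.override   (job name assumed; adjust to your packaging)
    start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1)

A "start on" stanza in an override replaces the one from the matching .conf, so the original job file stays untouched.)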
About to try:

    start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1 and started network-services)

The "local-filesystems" + network-device condition is billed as an
alternative to runlevel if you need to do something *after* networking...
No luck so far. I'll keep trying things out.

On Mon, Aug 26, 2013 at 2:31 PM, Sage Weil <s...@inktank.com> wrote:
> On Mon, 26 Aug 2013, Travis Rhoden wrote:
> > Hi Sage,
> >
> > Thanks for the response. I noticed that as well, and suspected
> > hostname/DHCP/DNS shenanigans. What's weird is that all nodes are
> > identically configured. I also have monitors running on n0 and n12,
> > and they come up fine, every time.
> >
> > Here's the mon_host line from ceph.conf:
> >
> > mon_initial_members = n0, n12, n24
> > mon_host = 10.0.1.0,10.0.1.12,10.0.1.24
> >
> > Just to test /etc/hosts and name resolution...
> >
> > root@n24:~# getent hosts n24
> > 10.0.1.24       n24
> > root@n24:~# hostname -s
> > n24
> >
> > The only loopback entry in /etc/hosts is "127.0.0.1 localhost", so
> > that should be fine.
> >
> > Upon rebooting this node, I've had the monitor come up okay once,
> > maybe out of 12 tries. So it appears to be some kind of race... No
> > clue what is going on. If I stop and start the monitor (or restart),
> > it doesn't appear to change anything.
> >
> > However, on the topic of races, I'm having one other, more pressing
> > issue. Each OSD host is having its hostname assigned via DHCP. Until
> > that assignment is made (during init), the hostname is "localhost",
> > and then it switches over to "n<x>", for some node number. The issue
> > I am seeing is that there is a race between this hostname assignment
> > and the Ceph Upstart scripts, such that sometimes ceph-osd starts
> > while the hostname is still 'localhost'. This then causes the osd
> > location to change in the crushmap, which is going to be a very bad
> > thing. =) When rebooting all my nodes at once (there are several
> > dozen), about 50% move from being under n<x> to localhost. Restarting
> > all the ceph-osd jobs moves them back (because the hostname is
> > defined by then).
> >
> > I'm wondering what kind of delay, or additional "start on" logic, I
> > can add to the upstart script to work around this.
>
> Hmm, this is beyond my upstart-fu, unfortunately. This has come up
> before, actually. Previously we would wait for any interface to come
> up and then start, but that broke with multi-nic machines, and I ended
> up just making things start in runlevel [2345].
>
> James, do you know what should be done to make the job wait for *all*
> network interfaces to be up? Is that even the right solution here?
>
> sage
>
> > On Fri, Aug 23, 2013 at 4:47 PM, Sage Weil <s...@inktank.com> wrote:
> > Hi Travis,
> >
> > On Fri, 23 Aug 2013, Travis Rhoden wrote:
> > > Hey folks,
> > >
> > > I've just done a brand new install of 0.67.2 on a cluster of
> > > Calxeda nodes.
> > >
> > > I have one particular monitor that never joins the quorum when I
> > > restart the node. Looks to me like it has something to do with the
> > > "create-keys" task, which never seems to finish:
> > >
> > > root      1240     1  4 13:03 ?        00:00:02 /usr/bin/ceph-mon --cluster=ceph -i n24 -f
> > > root      1244     1  0 13:03 ?        00:00:00 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> > >
> > > I don't see that task on my other monitors. Additionally, that task
> > > is periodically querying the monitor status:
> > >
> > > root      1240     1  2 13:03 ?        00:00:02 /usr/bin/ceph-mon --cluster=ceph -i n24 -f
> > > root      1244     1  0 13:03 ?        00:00:00 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> > > root      1982  1244 15 13:04 ?        00:00:00 /usr/bin/python /usr/bin/ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok mon_status
> > >
> > > Checking that status myself, I see:
> > >
> > > # ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok mon_status
> > > { "name": "n24",
> > >   "rank": 2,
> > >   "state": "probing",
> > >   "election_epoch": 0,
> > >   "quorum": [],
> > >   "outside_quorum": [
> > >         "n24"],
> > >   "extra_probe_peers": [],
> > >   "sync_provider": [],
> > >   "monmap": { "epoch": 2,
> > >       "fsid": "f0b0d4ec-1ac3-4b24-9eab-c19760ce4682",
> > >       "modified": "2013-08-23 12:55:34.374650",
> > >       "created": "0.000000",
> > >       "mons": [
> > >             { "rank": 0,
> > >               "name": "n0",
> > >               "addr": "10.0.1.0:6789\/0"},
> > >             { "rank": 1,
> > >               "name": "n12",
> > >               "addr": "10.0.1.12:6789\/0"},
> > >             { "rank": 2,
> > >               "name": "n24",
> > >               "addr": "0.0.0.0:6810\/0"}]}}
> >                         ^^^^^^^^^^^^^^^^^^^^
> >
> > This is the problem. I can't remember exactly what causes this,
> > though. Can you verify that the host in the ceph.conf mon_host line
> > matches the IP that is configured on the machine, and that /etc/hosts
> > on the machine doesn't have a loopback address on it?
> >
> > Thanks!
> > sage
> >
> > > Any ideas what is going on here? I don't see anything useful in
> > > /var/log/ceph/ceph-mon.n24.log
> > >
> > > Thanks,
> > >
> > >  - Travis
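P.S. One more idea I'm kicking around for the OSD/localhost race, in case the "start on" conditions never pan out: make the job wait for DHCP to set the real hostname before ceph-osd registers itself. This is only a sketch I haven't actually run; the 30-second cap is arbitrary, and since a pre-start stanza in an override file would replace the packaged job's pre-start, it probably belongs folded into the existing script instead:

    # untested sketch: hold the job until the hostname is no longer "localhost"
    pre-start script
        n=0
        while [ "$(hostname -s)" = "localhost" ] && [ "$n" -lt 30 ]; do
            sleep 1
            n=$((n + 1))
        done
        # as I understand Upstart, exiting non-zero here aborts the start,
        # which seems better than letting the OSD register under localhost
        [ "$(hostname -s)" != "localhost" ]
    end script

No idea yet whether that's the "right" fix, but it would at least close the window where the crushmap location flips.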