FYI - I created AURORA-840 [1] to document common situations like these; hopefully it will be completed in our documentation sprint tomorrow. Dobromir - since setting up a new cluster is fresh for you, you might have some insight we can add to this.
[1] https://issues.apache.org/jira/browse/AURORA-840

-=Bill

On Wed, Oct 15, 2014 at 5:02 PM, Kevin Sweeney <kevi...@apache.org> wrote:
> The scheduler uses that zookeeper path (or whatever url you pass to
> -mesos_master_address) to find the mesos master it should register with.
>
> Is there a mesos master running with a command-line that looks something
> like
>
> exec /usr/local/sbin/mesos-master \
>   --zk=zk://$ZK_HOST:2181/mesos/master \
>   ...?
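A quick way to check the master side of that (a rough sketch, not verified on this cluster; it assumes zkCli.sh is on the PATH, the masters use the /mesos/master path from the command above, and the glog output lands under /var/log/mesos):

    # A mesos-master that has joined the group creates an ephemeral,
    # sequenced child under its --zk path (named something like
    # info_0000000001 on Mesos releases from around this time).
    # No children means no master has registered there, whatever
    # else exists at that path.
    zkCli.sh -server localhost:2181 ls /mesos/master

    # The master's own log should also say it joined ZooKeeper and was
    # elected; the exact wording and log location vary by version/setup.
    grep -i 'leading master' /var/log/mesos/mesos-master.INFO

If that listing comes back empty, the scheduler has nothing to register with, which would explain the timeout below.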
> On Wed, Oct 15, 2014 at 4:53 PM, Dobromir Montauk <dobro...@tellapart.com> wrote:
>
> > Next problem: framework is failing to register.
> >
> > 2014-10-15 22:15:48,024:28619(0x7f852d259700):ZOO_INFO@check_events@1750: session establishment complete on server [127.0.0.1:2181], sessionId=0x349110433970e3b, negotiated timeout=10000
> > I1015 22:15:48.024787 28763 group.cpp:313] Group process (group(3)@10.88.26.227:34348) connected to ZooKeeper
> > I1015 22:15:48.024837 28763 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> > I1015 22:15:48.024868 28763 group.cpp:385] Trying to create path '/mesos/master' in ZooKeeper
> > E1015 22:16:48.013 THREAD132 org.apache.aurora.scheduler.SchedulerLifecycle$6$3.run: Framework has not been registered within the tolerated delay.
> > I1015 22:16:48.014 THREAD132 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
> >
> > My Zookeeper logs have this:
> >
> > 2014-10-15 22:15:47,831 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish new session at /127.0.0.1:41116
> > 2014-10-15 22:15:47,836 - INFO [CommitProcessor:3:ZooKeeperServer@595] - Established session 0x349110433970e3a with negotiated timeout 4000 for client /127.0.0.1:41116
> > 2014-10-15 22:15:48,017 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:41117
> > 2014-10-15 22:15:48,017 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@793] - Connection request from old client /127.0.0.1:41117; will be dropped if server is in r-o mode
> > 2014-10-15 22:15:48,018 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish new session at /127.0.0.1:41117
> > 2014-10-15 22:15:48,024 - INFO [CommitProcessor:3:ZooKeeperServer@595] - Established session 0x349110433970e3b with negotiated timeout 10000 for client /127.0.0.1:41117
> > 2014-10-15 22:16:48,435 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client sessionid 0x349110433970e39, likely client has closed socket
> >   at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> >   at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> >   at java.lang.Thread.run(Thread.java:745)
> > 2014-10-15 22:16:48,436 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /127.0.0.1:41114 which had sessionid 0x349110433970e39
> >
> > I've removed /mesos/master and verified that it gets recreated:
> >
> > [zk: localhost:2181(CONNECTED) 4] rmr /mesos/master
> > [zk: localhost:2181(CONNECTED) 5] ls /mesos/master
> > Node does not exist: /mesos/master
> > [zk: localhost:2181(CONNECTED) 6] get /mesos/master
> > cZxid = 0x10000b717
> > ctime = Wed Oct 15 23:34:17 UTC 2014
> > mZxid = 0x10000b717
> > mtime = Wed Oct 15 23:34:17 UTC 2014
> > pZxid = 0x10000b717
> > cversion = 0
> > dataVersion = 0
> > aclVersion = 0
> > ephemeralOwner = 0x0
> > dataLength = 0
> > numChildren = 0
> >
> > Anyone know what's going wrong?
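For what it's worth, numChildren = 0 in that get output means nothing has joined the group under /mesos/master; the path being recreated after the rmr doesn't prove a master is there, since it appears to be the scheduler's own master detector creating the persistent path when it connects (the "Trying to create path '/mesos/master'" line above). If the masters do register under a ZooKeeper path, the scheduler can be pointed at that same group rather than a host:port list. A sketch only, with a placeholder host and path that would need to match the master's --zk flag:

    # Use the same ZooKeeper group the mesos-master processes join,
    # instead of enumerating master addresses directly.
    -mesos_master_address=zk://localhost:2181/mesos/master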
> > On Wed, Oct 15, 2014 at 2:47 PM, Dobromir Montauk <dobro...@tellapart.com> wrote:
> >
> > > I found instructions that fixed the current issue here:
> > >
> > > http://wilderness.apache.org/channels/?f=aurora/2014-05-27
> > >
> > > Would be nice to have a more friendly error message :)
> > >
> > > On Wed, Oct 15, 2014 at 2:22 PM, Dobromir Montauk <dobro...@tellapart.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I've brought up Aurora on my Mesos master node with the following command:
> > >>
> > >> ubuntu@ec2-54-82-17-37:~/$
> > >>
> > >> GLOG_v=2
> > >> LIBPROCESS_PORT=5050
> > >> LIBPROCESS_IP=127.0.0.1
> > >> AURORA_HOME=/usr/local/aurora-scheduler
> > >> DIST_DIR=/home/ubuntu/aurora-scheduler/dist
> > >> AURORA_HOME=/usr/local/aurora-scheduler
> > >>
> > >> sudo /usr/local/aurora-scheduler/bin/aurora-scheduler \
> > >>   -cluster_name=tellapart \
> > >>   -http_port=8081 \
> > >>   -native_log_quorum_size=1 \
> > >>   -zk_endpoints=localhost:2181 \
> > >>   -mesos_master_address=54.166.50.69:5050,54.160.61.169:5050,localhost:5050 \
> > >>   -serverset_path=/aurora/scheduler \
> > >>   -native_log_zk_group_path=/aurora/replicated-log \
> > >>   -native_log_file_path=$AURORA_HOME/scheduler/db \
> > >>   -backup_dir=$AURORA_HOME/scheduler/backups \
> > >>   -thermos_executor_path=/dev/null \
> > >>   -gc_executor_path=$DIST_DIR/gc_executor.pex \
> > >>   -enable_beta_updater=true \
> > >>   -vlog=INFO \
> > >>   -logtostderr
> > >>
> > >> Attached is the entire log, but basically I'm seeing this:
> > >>
> > >> I1015 21:18:05.315263 27634 group.cpp:313] Group process (group(1)@10.88.26.227:40393) connected to ZooKeeper
> > >> I1015 21:18:05.315322 27634 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> > >> I1015 21:18:05.315348 27634 group.cpp:385] Trying to create path '/aurora/replicated-log' in ZooKeeper
> > >> I1015 21:18:05.316 THREAD1 com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate /aurora/scheduler/singleton_candidate_0000000008 is now leader of group: [singleton_candidate_0000000008]
> > >> I1015 21:18:05.317 THREAD1 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
> > >> I1015 21:18:05.317 THREAD1 org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as leading scheduler!
> > >> I1015 21:18:05.330394 27634 network.hpp:423] ZooKeeper group memberships changed
> > >> I1015 21:18:05.330660 27639 group.cpp:658] Trying to get '/aurora/replicated-log/0000000008' in ZooKeeper
> > >> I1015 21:18:05.331550 27635 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.88.26.227:40393 }
> > >> I1015 21:18:06.027016 27634 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> > >> I1015 21:18:06.027216 27634 recover.cpp:188] Received a recover response from a replica in EMPTY status
> > >> <repeat last 2 messages ad nauseam>
> > >>
> > >> How can I debug what's going on?
> > >>
> > >> Thanks,
> > >> Dobromir
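One more note for anyone who lands on this thread with the earlier symptom (the last two replica/recover lines repeating forever): on a fresh scheduler install with -native_log_quorum_size=1, that loop is often a sign the native replicated log was never initialized. A rough sketch of the one-time initialization, using the paths from the command above; run it as whatever user the scheduler runs as, and the mesos-log binary location depends on how Mesos was installed:

    # One-time initialization of the Mesos replicated log backing
    # -native_log_file_path; the scheduler cannot recover an empty,
    # uninitialized log on its own.
    sudo mesos-log initialize --path=/usr/local/aurora-scheduler/scheduler/db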