FYI - I created AURORA-840 [1] to document common situations like these; hopefully it will be completed in our documentation sprint tomorrow. Dobromir - since setting up a new cluster is fresh for you, you might have some insight we can add to this.
[1] https://issues.apache.org/jira/browse/AURORA-840

-=Bill

On Wed, Oct 15, 2014 at 5:02 PM, Kevin Sweeney <kevi...@apache.org> wrote:
> The scheduler uses that zookeeper path (or whatever url you pass to
> -mesos_master_address) to find the mesos master it should register with.
>
> Is there a mesos master running with a command-line that looks something
> like
>
> exec /usr/local/sbin/mesos-master \
>   --zk=zk://$ZK_HOST:2181/mesos/master \
>   ...?
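A quick way to check the master side of that (a rough sketch, not verified on this cluster; it assumes zkCli.sh is on the PATH, the masters use the /mesos/master path from the command above, and the glog output lands under /var/log/mesos):

    # A mesos-master that has joined the group creates an ephemeral,
    # sequenced child under its --zk path (named something like
    # info_0000000001 on Mesos releases from around this time).
    # No children means no master has registered there, whatever
    # else exists at that path.
    zkCli.sh -server localhost:2181 ls /mesos/master

    # The master's own log should also say it joined ZooKeeper and was
    # elected; the exact wording and log location vary by version/setup.
    grep -i 'leading master' /var/log/mesos/mesos-master.INFO

If that listing comes back empty, the scheduler has nothing to register with, which would explain the timeout below.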
> On Wed, Oct 15, 2014 at 4:53 PM, Dobromir Montauk <dobro...@tellapart.com> wrote:
>
> > Next problem: framework is failing to register.
> >
> > 2014-10-15 22:15:48,024:28619(0x7f852d259700):ZOO_INFO@check_events@1750: session establishment complete on server [127.0.0.1:2181], sessionId=0x349110433970e3b, negotiated timeout=10000
> > I1015 22:15:48.024787 28763 group.cpp:313] Group process (group(3)@10.88.26.227:34348) connected to ZooKeeper
> > I1015 22:15:48.024837 28763 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> > I1015 22:15:48.024868 28763 group.cpp:385] Trying to create path '/mesos/master' in ZooKeeper
> > E1015 22:16:48.013 THREAD132 org.apache.aurora.scheduler.SchedulerLifecycle$6$3.run: Framework has not been registered within the tolerated delay.
> > I1015 22:16:48.014 THREAD132 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
> >
> > My Zookeeper logs have this:
> >
> > 2014-10-15 22:15:47,831 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish new session at /127.0.0.1:41116
> > 2014-10-15 22:15:47,836 - INFO [CommitProcessor:3:ZooKeeperServer@595] - Established session 0x349110433970e3a with negotiated timeout 4000 for client /127.0.0.1:41116
> > 2014-10-15 22:15:48,017 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:41117
> > 2014-10-15 22:15:48,017 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@793] - Connection request from old client /127.0.0.1:41117; will be dropped if server is in r-o mode
> > 2014-10-15 22:15:48,018 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish new session at /127.0.0.1:41117
> > 2014-10-15 22:15:48,024 - INFO [CommitProcessor:3:ZooKeeperServer@595] - Established session 0x349110433970e3b with negotiated timeout 10000 for client /127.0.0.1:41117
> > 2014-10-15 22:16:48,435 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client sessionid 0x349110433970e39, likely client has closed socket
> >   at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> >   at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> >   at java.lang.Thread.run(Thread.java:745)
> > 2014-10-15 22:16:48,436 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /127.0.0.1:41114 which had sessionid 0x349110433970e39
> >
> > I've removed /mesos/master and verified that it gets recreated:
> >
> > [zk: localhost:2181(CONNECTED) 4] rmr /mesos/master
> > [zk: localhost:2181(CONNECTED) 5] ls /mesos/master
> > Node does not exist: /mesos/master
> > [zk: localhost:2181(CONNECTED) 6] get /mesos/master
> > cZxid = 0x10000b717
> > ctime = Wed Oct 15 23:34:17 UTC 2014
> > mZxid = 0x10000b717
> > mtime = Wed Oct 15 23:34:17 UTC 2014
> > pZxid = 0x10000b717
> > cversion = 0
> > dataVersion = 0
> > aclVersion = 0
> > ephemeralOwner = 0x0
> > dataLength = 0
> > numChildren = 0
> >
> > Anyone know what's going wrong?
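For what it's worth, numChildren = 0 in that get output means nothing has joined the group under /mesos/master; the path being recreated after the rmr doesn't prove a master is there, since it appears to be the scheduler's own master detector creating the persistent path when it connects (the "Trying to create path '/mesos/master'" line above). If the masters do register under a ZooKeeper path, the scheduler can be pointed at that same group rather than a host:port list. A sketch only, with a placeholder host and path that would need to match the master's --zk flag:

    # Use the same ZooKeeper group the mesos-master processes join,
    # instead of enumerating master addresses directly.
    -mesos_master_address=zk://localhost:2181/mesos/master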
> > On Wed, Oct 15, 2014 at 2:47 PM, Dobromir Montauk <dobro...@tellapart.com> wrote:
> >
> > > I found instructions that fixed the current issue here:
> > >
> > > http://wilderness.apache.org/channels/?f=aurora/2014-05-27
> > >
> > > Would be nice to have a more friendly error message :)
> > >
> > > On Wed, Oct 15, 2014 at 2:22 PM, Dobromir Montauk <dobro...@tellapart.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I've brought up Aurora on my Mesos master node with the following command:
> > >>
> > >> ubuntu@ec2-54-82-17-37:~/$
> > >>
> > >> GLOG_v=2
> > >> LIBPROCESS_PORT=5050
> > >> LIBPROCESS_IP=127.0.0.1
> > >> AURORA_HOME=/usr/local/aurora-scheduler
> > >> DIST_DIR=/home/ubuntu/aurora-scheduler/dist
> > >> AURORA_HOME=/usr/local/aurora-scheduler
> > >>
> > >> sudo /usr/local/aurora-scheduler/bin/aurora-scheduler \
> > >>   -cluster_name=tellapart \
> > >>   -http_port=8081 \
> > >>   -native_log_quorum_size=1 \
> > >>   -zk_endpoints=localhost:2181 \
> > >>   -mesos_master_address=54.166.50.69:5050,54.160.61.169:5050,localhost:5050 \
> > >>   -serverset_path=/aurora/scheduler \
> > >>   -native_log_zk_group_path=/aurora/replicated-log \
> > >>   -native_log_file_path=$AURORA_HOME/scheduler/db \
> > >>   -backup_dir=$AURORA_HOME/scheduler/backups \
> > >>   -thermos_executor_path=/dev/null \
> > >>   -gc_executor_path=$DIST_DIR/gc_executor.pex \
> > >>   -enable_beta_updater=true \
> > >>   -vlog=INFO \
> > >>   -logtostderr
> > >>
> > >> Attached is the entire log, but basically I'm seeing this:
> > >>
> > >> I1015 21:18:05.315263 27634 group.cpp:313] Group process (group(1)@10.88.26.227:40393) connected to ZooKeeper
> > >> I1015 21:18:05.315322 27634 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> > >> I1015 21:18:05.315348 27634 group.cpp:385] Trying to create path '/aurora/replicated-log' in ZooKeeper
> > >> I1015 21:18:05.316 THREAD1 com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate /aurora/scheduler/singleton_candidate_0000000008 is now leader of group: [singleton_candidate_0000000008]
> > >> I1015 21:18:05.317 THREAD1 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
> > >> I1015 21:18:05.317 THREAD1 org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as leading scheduler!
> > >> I1015 21:18:05.330394 27634 network.hpp:423] ZooKeeper group memberships changed
> > >> I1015 21:18:05.330660 27639 group.cpp:658] Trying to get '/aurora/replicated-log/0000000008' in ZooKeeper
> > >> I1015 21:18:05.331550 27635 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.88.26.227:40393 }
> > >> I1015 21:18:06.027016 27634 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> > >> I1015 21:18:06.027216 27634 recover.cpp:188] Received a recover response from a replica in EMPTY status
> > >> <repeat last 2 messages ad nauseam>
> > >>
> > >> How can I debug what's going on?
> > >>
> > >> Thanks,
> > >> Dobromir
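One more note for anyone who lands on this thread with the earlier symptom (the last two replica/recover lines repeating forever): on a fresh scheduler install with -native_log_quorum_size=1, that loop is often a sign the native replicated log was never initialized. A rough sketch of the one-time initialization, using the paths from the command above; run it as whatever user the scheduler runs as, and the mesos-log binary location depends on how Mesos was installed:

    # One-time initialization of the Mesos replicated log backing
    # -native_log_file_path; the scheduler cannot recover an empty,
    # uninitialized log on its own.
    sudo mesos-log initialize --path=/usr/local/aurora-scheduler/scheduler/db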