Hi,

I am using aurora for running 1300 instances of jenkins CIs in a mesos cluster. 
Aurora is running in a HA mode with a three node cluster. A few days ago aurora 
crashed and the end log can be seen below (Note that I have changed the actual 
zookeeper IP addresses with IP1 and IP2).

Appreciate any help with regards to avoiding this kind of a crash in future.

Aurora version --> 0.6.1-snapshot

Thanks.
Ash.

============entries from the aurora log file 
below================================================
I0305 21:23:25.546087  8811 replica.cpp:661] Replica learned APPEND action at 
position 93907
I0305 21:23:25.546380  8811 replica.cpp:508] Replica received write request for 
position 93908
I0305 21:23:25.651371  8811 leveldb.cpp:343] Persisting action (524328 bytes) 
to leveldb took 104.79926ms
I0305 21:23:25.651468  8811 replica.cpp:676] Persisted action at 93908
I0305 21:23:25.768111  8809 replica.cpp:655] Replica received learned notice 
for position 93908
I0305 21:23:25.809016  8809 leveldb.cpp:343] Persisting action (524330 bytes) 
to leveldb took 40.811957ms
I0305 21:23:25.809109  8809 replica.cpp:676] Persisted action at 93908
I0305 21:23:25.809135  8809 replica.cpp:661] Replica learned APPEND action at 
position 93908
I0305 21:23:25.831048  8812 replica.cpp:508] Replica received write request for 
position 93909
I0305 21:23:25.859412  8812 leveldb.cpp:343] Persisting action (524328 bytes) 
to leveldb took 28.015678ms
I0305 21:23:25.859561  8812 replica.cpp:676] Persisted action at 93909
I0305 21:23:26.522222  8814 replica.cpp:655] Replica received learned notice 
for position 93909
I0305 21:23:26.552645  8814 leveldb.cpp:343] Persisting action (524330 bytes) 
to leveldb took 30.324183ms
I0305 21:23:26.552736  8814 replica.cpp:676] Persisted action at 93909
I0305 21:23:26.552763  8814 replica.cpp:661] Replica learned APPEND action at 
position 93909
I0305 21:23:32.186401  8810 network.hpp:423] ZooKeeper group memberships changed
I0305 21:23:32.186600  8815 group.cpp:658] Trying to get 
'/aurora/replicated-log/0000000012' in ZooKeeper
I0305 21:23:32.197412  8815 group.cpp:658] Trying to get 
'/aurora/replicated-log/0000000016' in ZooKeeper
I0305 21:23:32.208843  8815 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@IP1:8083, log-replica(1)@IP2:8083 }
W0305 21:23:36.191 THREAD128 
com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange:
 server set empty for path /aurora/scheduler
W0305 21:23:36.191 THREAD128 
org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange: No 
schedulers in host set, will not redirect despite not being leader.
I0305 21:23:36.202 THREAD128 
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate 
/aurora/scheduler/singleton_candidate_0000000017 waiting for the next leader 
election, current voting: [singleton_candidate_0000000012, 
singleton_candidate_0000000017]
I0305 21:28:36.605 THREAD128 
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate 
/aurora/scheduler/singleton_candidate_0000000017 is now leader of group: 
[singleton_candidate_0000000017]
I0305 21:28:36.608 THREAD128 
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle 
state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
I0305 21:28:36.608 THREAD128 
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as leading 
scheduler!
I0305 21:28:36.628625  8815 log.cpp:656] Attempting to start the writer
I0305 21:28:36.629897  8809 replica.cpp:474] Replica received implicit promise 
request with proposal 2
I0305 21:28:36.662684  8809 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 32.691373ms
I0305 21:28:36.662773  8809 replica.cpp:342] Persisted promised to 2
I0305 21:28:42.186013  8809 network.hpp:423] ZooKeeper group memberships changed
I0305 21:28:42.186200  8814 group.cpp:658] Trying to get 
'/aurora/replicated-log/0000000016' in ZooKeeper
I0305 21:28:42.197717  8813 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@IP1:8083 }
I0305 21:28:51.631983  8808 log.cpp:656] Attempting to start the writer
I0305 21:29:06.634974  8808 log.cpp:656] Attempting to start the writer
I0305 21:29:21.649663  8808 log.cpp:656] Attempting to start the writer
I0305 21:29:36.662432  8808 log.cpp:656] Attempting to start the writer
I0305 21:29:51.669358  8813 log.cpp:656] Attempting to start the writer
I0305 21:30:06.684087  8809 log.cpp:656] Attempting to start the writer
I0305 21:30:21.696292  8812 log.cpp:656] Attempting to start the writer
I0305 21:30:36.703007  8812 log.cpp:656] Attempting to start the writer
I0305 21:30:51.712620  8812 log.cpp:656] Attempting to start the writer
I0305 21:31:06.727669  8808 log.cpp:656] Attempting to start the writer
I0305 21:31:21.742475  8813 log.cpp:656] Attempting to start the writer
I0305 21:31:36.754021  8810 log.cpp:656] Attempting to start the writer
I0305 21:31:51.766616  8815 log.cpp:656] Attempting to start the writer
I0305 21:32:06.779209  8812 log.cpp:656] Attempting to start the writer
I0305 21:32:21.794214  8809 log.cpp:656] Attempting to start the writer
I0305 21:32:36.803851  8811 log.cpp:656] Attempting to start the writer
I0305 21:32:51.812130  8809 log.cpp:656] Attempting to start the writer
I0305 21:33:06.822957  8814 log.cpp:656] Attempting to start the writer
I0305 21:33:21.835047  8811 log.cpp:656] Attempting to start the writer
I0305 21:33:36.842078  8813 log.cpp:680] Attempting to append 5 bytes to the log
I0305 21:33:36.842 THREAD128 com.twitter.common.application.Lifecycle.shutdown: 
Shutting down application
I0305 21:33:36.843 THREAD128 
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: 
Executing 7 shutdown commands.
I0305 21:33:36.854 THREAD128 
org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
 Shutdown initiated by: Thread: RedirectMonitor STARTING-EventThread (id 128)
java.lang.Thread.getStackTrace(Thread.java:1567)
  
org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:159)
  
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
  com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
  
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:350)
  
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:369)
  
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:315)
  
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.readAll(MesosLog.java:238)
  
org.apache.aurora.scheduler.storage.log.StreamManagerImpl.readFromBeginning(StreamManagerImpl.java:109)
  
org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:505)
  
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.CGLIB$recover$6(<generated>:-1)
  
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca$$FastClassByGuice$$ea0e64a7.invoke(<generated>:-1)
  
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
  
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
  
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
  
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.recover(<generated>:-1)
  
org.apache.aurora.scheduler.storage.log.LogStorage$20.execute(LogStorage.java:485)
  
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
  
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
  
org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:147)
  
org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:144)
  org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:137)
  
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.CGLIB$write$0(<generated>:-1)
  
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73$$FastClassByGuice$$66aea947.invoke(<generated>:-1)
  
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
  
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  
org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
  
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
  
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.write(<generated>:-1)
  org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:144)
  
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.CGLIB$write$0(<generated>:-1)
  
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca$$FastClassByGuice$$fb95f8d6.invoke(<generated>:-1)
  
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
  
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
  
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
  
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.write(<generated>:-1)
  org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:635)
  org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:479)
  
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92)
  
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:257)
  
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:253)
  com.twitter.common.base.Closures$4.execute(Closures.java:120)
  
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:384)
  
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:380)
  com.twitter.common.base.Closures$4.execute(Closures.java:120)
  com.twitter.common.base.Closures$3.execute(Closures.java:98)
  com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
  
org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:426)
  
com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
  
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
  com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
  com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
  com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
  com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
  com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
  com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
  com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
  com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
  com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
  com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
  com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
  com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
  org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
  org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
I0305 21:33:36.854 THREAD128 
com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute: Variable sampler 
shut down
I0305 21:33:36.855 THREAD128 
org.apache.aurora.scheduler.http.JettyServerModule$HttpServerLauncher$2.execute:
 Shutting down embedded http server
I0305 21:33:36.880 THREAD128 
org.eclipse.jetty.server.handler.ContextHandler.doStop: stopped 
o.e.j.s.ServletContextHandler{/,null}
I0305 21:33:36.938 THREAD128 
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle 
state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
W0305 21:33:36.956 THREAD128 
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: 
Shutdown action failed.
com.twitter.common.util.StateMachine$IllegalStateTransitionException: State 
transition from PREPARED to STOPPED is not allowed.
                at 
com.twitter.common.util.StateMachine.transition(StateMachine.java:183)
                at 
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.stop(CallOrderEnforcingStorage.java:109)
                at 
org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:339)
                at 
org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:313)
                at com.twitter.common.base.Closures$4.execute(Closures.java:120)
                at com.twitter.common.base.Closures$3.execute(Closures.java:98)
                at 
com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
                at 
org.apache.aurora.scheduler.SchedulerLifecycle$4.execute(SchedulerLifecycle.java:239)
                at 
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
                at 
com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
                at 
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:350)
                at 
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:369)
                at 
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:315)
                at 
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.readAll(MesosLog.java:238)
                at 
org.apache.aurora.scheduler.storage.log.StreamManagerImpl.readFromBeginning(StreamManagerImpl.java:109)
                at 
org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:505)
                at 
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.CGLIB$recover$6(<generated>)
                at 
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca$$FastClassByGuice$$ea0e64a7.invoke(<generated>)
                at 
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
                at 
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at 
com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
                at 
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at 
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
                at 
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.recover(<generated>)
                at 
org.apache.aurora.scheduler.storage.log.LogStorage$20.execute(LogStorage.java:485)
                at 
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
                at 
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
                at 
org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:147)
                at 
org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:144)
                at 
org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:137)
                at 
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.CGLIB$write$0(<generated>)
                at 
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73$$FastClassByGuice$$66aea947.invoke(<generated>)
                at 
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
                at 
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at 
org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
                at 
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at 
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
                at 
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.write(<generated>)
                at 
org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:144)
                at 
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.CGLIB$write$0(<generated>)
                at 
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca$$FastClassByGuice$$fb95f8d6.invoke(<generated>)
                at 
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
                at 
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at 
com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
                at 
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at 
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
                at 
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.write(<generated>)
                at 
org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:635)
                at 
org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:479)
                at 
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92)
                at 
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:257)
                at 
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:253)
                at com.twitter.common.base.Closures$4.execute(Closures.java:120)
                at 
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:384)
                at 
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:380)
                at com.twitter.common.base.Closures$4.execute(Closures.java:120)
                at com.twitter.common.base.Closures$3.execute(Closures.java:98)
                at 
com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
                at 
org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:426)
                at 
com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
                at 
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
                at 
com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
                at 
com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
                at 
com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
                at 
com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
                at 
com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
                at 
com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
                at 
com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
                at 
com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
                at 
com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
                at 
com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
                at 
com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
                at 
com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)

I0305 21:33:36.960 THREAD1 com.twitter.common.application.AppLauncher.run: 
Application run() exited.

Reply via email to