Hi, I am using aurora for running 1300 instances of jenkins CIs in a mesos cluster. Aurora is running in a HA mode with a three node cluster. A few days ago aurora crashed and the end log can be seen below (Note that I have changed the actual zookeeper IP addresses with IP1 and IP2).
Appreciate any help with regards to avoiding this kind of a crash in future. Aurora version --> 0.6.1-snapshot Thanks. Ash. ============entries from the aurora log file below================================================ I0305 21:23:25.546087 8811 replica.cpp:661] Replica learned APPEND action at position 93907 I0305 21:23:25.546380 8811 replica.cpp:508] Replica received write request for position 93908 I0305 21:23:25.651371 8811 leveldb.cpp:343] Persisting action (524328 bytes) to leveldb took 104.79926ms I0305 21:23:25.651468 8811 replica.cpp:676] Persisted action at 93908 I0305 21:23:25.768111 8809 replica.cpp:655] Replica received learned notice for position 93908 I0305 21:23:25.809016 8809 leveldb.cpp:343] Persisting action (524330 bytes) to leveldb took 40.811957ms I0305 21:23:25.809109 8809 replica.cpp:676] Persisted action at 93908 I0305 21:23:25.809135 8809 replica.cpp:661] Replica learned APPEND action at position 93908 I0305 21:23:25.831048 8812 replica.cpp:508] Replica received write request for position 93909 I0305 21:23:25.859412 8812 leveldb.cpp:343] Persisting action (524328 bytes) to leveldb took 28.015678ms I0305 21:23:25.859561 8812 replica.cpp:676] Persisted action at 93909 I0305 21:23:26.522222 8814 replica.cpp:655] Replica received learned notice for position 93909 I0305 21:23:26.552645 8814 leveldb.cpp:343] Persisting action (524330 bytes) to leveldb took 30.324183ms I0305 21:23:26.552736 8814 replica.cpp:676] Persisted action at 93909 I0305 21:23:26.552763 8814 replica.cpp:661] Replica learned APPEND action at position 93909 I0305 21:23:32.186401 8810 network.hpp:423] ZooKeeper group memberships changed I0305 21:23:32.186600 8815 group.cpp:658] Trying to get '/aurora/replicated-log/0000000012' in ZooKeeper I0305 21:23:32.197412 8815 group.cpp:658] Trying to get '/aurora/replicated-log/0000000016' in ZooKeeper I0305 21:23:32.208843 8815 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@IP1:8083, log-replica(1)@IP2:8083 } W0305 21:23:36.191 THREAD128 com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange: server set empty for path /aurora/scheduler W0305 21:23:36.191 THREAD128 org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange: No schedulers in host set, will not redirect despite not being leader. I0305 21:23:36.202 THREAD128 com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate /aurora/scheduler/singleton_candidate_0000000017 waiting for the next leader election, current voting: [singleton_candidate_0000000012, singleton_candidate_0000000017] I0305 21:28:36.605 THREAD128 com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate /aurora/scheduler/singleton_candidate_0000000017 is now leader of group: [singleton_candidate_0000000017] I0305 21:28:36.608 THREAD128 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION I0305 21:28:36.608 THREAD128 org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as leading scheduler! I0305 21:28:36.628625 8815 log.cpp:656] Attempting to start the writer I0305 21:28:36.629897 8809 replica.cpp:474] Replica received implicit promise request with proposal 2 I0305 21:28:36.662684 8809 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 32.691373ms I0305 21:28:36.662773 8809 replica.cpp:342] Persisted promised to 2 I0305 21:28:42.186013 8809 network.hpp:423] ZooKeeper group memberships changed I0305 21:28:42.186200 8814 group.cpp:658] Trying to get '/aurora/replicated-log/0000000016' in ZooKeeper I0305 21:28:42.197717 8813 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@IP1:8083 } I0305 21:28:51.631983 8808 log.cpp:656] Attempting to start the writer I0305 21:29:06.634974 8808 log.cpp:656] Attempting to start the writer I0305 21:29:21.649663 8808 log.cpp:656] Attempting to start the writer I0305 21:29:36.662432 8808 log.cpp:656] Attempting to start the writer I0305 21:29:51.669358 8813 log.cpp:656] Attempting to start the writer I0305 21:30:06.684087 8809 log.cpp:656] Attempting to start the writer I0305 21:30:21.696292 8812 log.cpp:656] Attempting to start the writer I0305 21:30:36.703007 8812 log.cpp:656] Attempting to start the writer I0305 21:30:51.712620 8812 log.cpp:656] Attempting to start the writer I0305 21:31:06.727669 8808 log.cpp:656] Attempting to start the writer I0305 21:31:21.742475 8813 log.cpp:656] Attempting to start the writer I0305 21:31:36.754021 8810 log.cpp:656] Attempting to start the writer I0305 21:31:51.766616 8815 log.cpp:656] Attempting to start the writer I0305 21:32:06.779209 8812 log.cpp:656] Attempting to start the writer I0305 21:32:21.794214 8809 log.cpp:656] Attempting to start the writer I0305 21:32:36.803851 8811 log.cpp:656] Attempting to start the writer I0305 21:32:51.812130 8809 log.cpp:656] Attempting to start the writer I0305 21:33:06.822957 8814 log.cpp:656] Attempting to start the writer I0305 21:33:21.835047 8811 log.cpp:656] Attempting to start the writer I0305 21:33:36.842078 8813 log.cpp:680] Attempting to append 5 bytes to the log I0305 21:33:36.842 THREAD128 com.twitter.common.application.Lifecycle.shutdown: Shutting down application I0305 21:33:36.843 THREAD128 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: Executing 7 shutdown commands. I0305 21:33:36.854 THREAD128 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute: Shutdown initiated by: Thread: RedirectMonitor STARTING-EventThread (id 128) java.lang.Thread.getStackTrace(Thread.java:1567) org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:159) com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88) com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92) org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:350) org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:369) org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:315) org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.readAll(MesosLog.java:238) org.apache.aurora.scheduler.storage.log.StreamManagerImpl.readFromBeginning(StreamManagerImpl.java:109) org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:505) org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.CGLIB$recover$6(<generated>:-1) org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca$$FastClassByGuice$$ea0e64a7.invoke(<generated>:-1) com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228) com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87) com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52) org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.recover(<generated>:-1) org.apache.aurora.scheduler.storage.log.LogStorage$20.execute(LogStorage.java:485) org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132) org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149) org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:147) org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:144) org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:137) org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.CGLIB$write$0(<generated>:-1) org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73$$FastClassByGuice$$66aea947.invoke(<generated>:-1) com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228) com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101) com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52) org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.write(<generated>:-1) org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:144) org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.CGLIB$write$0(<generated>:-1) org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca$$FastClassByGuice$$fb95f8d6.invoke(<generated>:-1) com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228) com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87) com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52) org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.write(<generated>:-1) org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:635) org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:479) org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92) org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:257) org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:253) com.twitter.common.base.Closures$4.execute(Closures.java:120) org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:384) org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:380) com.twitter.common.base.Closures$4.execute(Closures.java:120) com.twitter.common.base.Closures$3.execute(Closures.java:98) com.twitter.common.util.StateMachine.transition(StateMachine.java:191) org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:426) com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168) com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155) com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665) com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638) com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579) com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600) com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597) com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109) com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107) com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127) com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107) com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622) com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579) com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591) org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531) org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) I0305 21:33:36.854 THREAD128 com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute: Variable sampler shut down I0305 21:33:36.855 THREAD128 org.apache.aurora.scheduler.http.JettyServerModule$HttpServerLauncher$2.execute: Shutting down embedded http server I0305 21:33:36.880 THREAD128 org.eclipse.jetty.server.handler.ContextHandler.doStop: stopped o.e.j.s.ServletContextHandler{/,null} I0305 21:33:36.938 THREAD128 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD W0305 21:33:36.956 THREAD128 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: Shutdown action failed. com.twitter.common.util.StateMachine$IllegalStateTransitionException: State transition from PREPARED to STOPPED is not allowed. at com.twitter.common.util.StateMachine.transition(StateMachine.java:183) at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.stop(CallOrderEnforcingStorage.java:109) at org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:339) at org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:313) at com.twitter.common.base.Closures$4.execute(Closures.java:120) at com.twitter.common.base.Closures$3.execute(Closures.java:98) at com.twitter.common.util.StateMachine.transition(StateMachine.java:191) at org.apache.aurora.scheduler.SchedulerLifecycle$4.execute(SchedulerLifecycle.java:239) at com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88) at com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92) at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:350) at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:369) at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:315) at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.readAll(MesosLog.java:238) at org.apache.aurora.scheduler.storage.log.StreamManagerImpl.readFromBeginning(StreamManagerImpl.java:109) at org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:505) at org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.CGLIB$recover$6(<generated>) at org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca$$FastClassByGuice$$ea0e64a7.invoke(<generated>) at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228) at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87) at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52) at org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.recover(<generated>) at org.apache.aurora.scheduler.storage.log.LogStorage$20.execute(LogStorage.java:485) at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132) at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149) at org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:147) at org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:144) at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:137) at org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.CGLIB$write$0(<generated>) at org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73$$FastClassByGuice$$66aea947.invoke(<generated>) at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228) at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101) at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52) at org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.write(<generated>) at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:144) at org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.CGLIB$write$0(<generated>) at org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca$$FastClassByGuice$$fb95f8d6.invoke(<generated>) at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228) at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87) at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52) at org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.write(<generated>) at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:635) at org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:479) at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92) at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:257) at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:253) at com.twitter.common.base.Closures$4.execute(Closures.java:120) at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:384) at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:380) at com.twitter.common.base.Closures$4.execute(Closures.java:120) at com.twitter.common.base.Closures$3.execute(Closures.java:98) at com.twitter.common.util.StateMachine.transition(StateMachine.java:191) at org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:426) at com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168) at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155) at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665) at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638) at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579) at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600) at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597) at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109) at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107) at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127) at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107) at com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622) at com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579) at com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) I0305 21:33:36.960 THREAD1 com.twitter.common.application.AppLauncher.run: Application run() exited.