We're attempting to get Aurora to handle 11 job requests a second. We realize we're going to be limited by the resource intensive nature of the CLI and will be giving Herc a shot here soon; however we seem to get the scheduler to crash in less than an hour while submitting about 2 jobs/second.
The job we're submitting simply echo's "Hello" ever minute for 5 minutes, then exits. Here's what the logs look like at the time of the crash: I0316 18:22:54.348743 20799 log.cpp:680] Attempting to append 1650 bytes to the log I0316 18:22:54.348808 20799 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 22668 I0316 18:22:54.348989 20799 replica.cpp:508] Replica received write request for position 22668 I0316 18:22:54.352383 20799 leveldb.cpp:343] Persisting action (1671 bytes) to leveldb took 3.370462ms I0316 18:22:54.352414 20799 replica.cpp:676] Persisted action at 22668 I0316 18:22:54.352565 20799 replica.cpp:655] Replica received learned notice for position 22668 I0316 18:22:54.356220 20799 leveldb.cpp:343] Persisting action (1673 bytes) to leveldb took 3.634224ms I0316 18:22:54.356253 20799 replica.cpp:676] Persisted action at 22668 I0316 18:22:54.356266 20799 replica.cpp:661] Replica learned APPEND action at position 22668 I0316 18:22:54.561 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening socket connection to server sandbox01/192.168.4.101:2181 I0316 18:22:54.563 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.primeConnection: Socket connection established to sandbox01/192.168.4.101:2181, initiating session I0316 18:22:54.564 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult: Session establishment complete on server sandbox01/192.168.4.101:2181, sessionid = 0x14c03fdca89bc0f, negotiated timeout = 4000 I0316 18:22:55.374 THREAD143 org.apache.aurora.scheduler.async.OfferManager$OfferManagerImpl.addOffer: Returning offers for 20150304-172015-1711581376-5050-15195-S5 for compaction I0316 18:22:55.374 THREAD143 org.apache.aurora.scheduler.async.OfferManager$OfferManagerImpl.addOffer: Returning offers for 20150304-172015-1711581376-5050-15195-S3 for compaction I0316 18:22:55.394 THREAD137 com.twitter.common.util.StateMachine$Builder$1.execute: 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316 state machine transition PENDING -> ASSIGNED I0316 18:22:55.394 THREAD137 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316 I0316 18:22:55.394 THREAD137 org.apache.aurora.scheduler.state.TaskAssigner$TaskAssignerImpl.assign: Offer on slave {URL} (id 20150304-172015-1711581376-5050-15195-S2) is being assigned task for 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316. I0316 18:22:55.395851 20795 log.cpp:680] Attempting to append 1451 bytes to the log I0316 18:22:55.395921 20795 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 22669 I0316 18:22:55.396108 20795 replica.cpp:508] Replica received write request for position 22669 I0316 18:22:57.621 THREAD1905 org.apache.aurora.scheduler.mesos.MesosScheduler.Impl.logStatusUpdate: Received status update for task 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316 in state TASK_STARTING: Initializing sandbox. I0316 18:22:58.396 THREAD137 com.twitter.common.application.Lifecycle.shutdown: Shutting down application I0316 18:22:58.396 THREAD137 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: Executing 7 shutdown commands. I0316 18:22:58.561 THREAD137 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute: Shutdown initiated by: Thread: TaskScheduler-0 (id 137) java.lang.Thread.getStackTrace(Thread.java:1589) The stack trace doesn't show anything interesting, I can't figure out why it just randomly issues a shutdown. Let me know if there's anything else that would be helpful from the log, we're on an internal network so I can't just copy/paste the logs over.