Same here. A single fork ran for 4 hours and I stopped it after 20K tasks.
I'm running 30 forks now, about an hour in, with no failure so far (though
vagrant I/O has slowed to a crawl).

On Thu, Mar 19, 2015 at 3:25 PM, Bill Farner <wfar...@apache.org> wrote:
> I've been running 10 forks of your script against a scheduler in the
> vagrant environment for a little over an hour, and haven't had any luck
> inducing a crash yet.  Are you able to repro there?
>
> Can you post the full scheduler log file from a repro of this?
>
> -=Bill
>
> On Thu, Mar 19, 2015 at 4:33 AM, Ryan Orr <ryanor...@gmail.com> wrote:
>
>> https://gist.github.com/ryanorr/b3736c03a9919f9b87dc
>>
>> Just a simple bash script. I run 10-15 of them in the background, and it
>> usually crashes in under 30 minutes, as described. Thanks for looking into
>> it.
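>>
>> For illustration, a rough sketch of the kind of loop such a script might
>> contain (the real script is in the gist above and may differ; the cluster
>> name and config path below are placeholders):
>>
>> #!/usr/bin/env bash
>> # Hypothetical load generator: keep creating short-lived jobs via the CLI.
>> # CLUSTER and CONFIG are placeholders; the .aurora config is assumed to
>> # accept the generated job name (e.g. via a --bind'ed variable).
>> CLUSTER=devcluster
>> CONFIG=hello.aurora
>> while true; do
>>   NAME="hello_$(date +%s)_${RANDOM}"
>>   aurora job create "${CLUSTER}/root/prod/${NAME}" "${CONFIG}"
>>   sleep 1   # roughly 1-2 submissions per second per copy of the script
>> done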
>>
>> On Wed, Mar 18, 2015 at 7:19 PM, Bill Farner <wfar...@apache.org> wrote:
>>
>> > I'm slightly doubtful that GC is the issue (based on background from Ryan
>> > in IRC), but I could be wrong.  Trying out a sample script would help us
>> > confirm or rule it out, though.
>> >
>> > -=Bill
>> >
>> > On Wed, Mar 18, 2015 at 11:17 AM, Maxim Khutornenko <ma...@apache.org>
>> > wrote:
>> >
>> > > The time gap of over 2 seconds between the two entries below suggests
>> > > there may have been an intensive GC and/or IO operation that paused
>> > > processing long enough for the ZK session timeout to expire:
>> > >
>> > > I0316 18:22:55.396108 20795 replica.cpp:508] Replica received write request for position 22669
>> > > I0316 18:22:57.621 THREAD1905 org.apache.aurora.scheduler.mesos.MesosScheduler.Impl.logStatusUpdate: Received status update for task
>> > >
>> > > Do you see any "snapshot"- or "backup"-related log entries within the 20
>> > > seconds before the shutdown? If so, those are the known hotspots for
>> > > generating a lot of GC and IO activity that can pause scheduler
>> > > processing for a few seconds, which is long enough for the scheduler to
>> > > lose leadership. The default ZK session timeout is 4 seconds, which may
>> > > be too low to avoid a leader re-election due to an expired session in
>> > > your case. You may want to validate this hypothesis by bumping up the
>> > > default ZK timeout via a startup flag: --zk_session_timeout.
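>> > >
>> > > For example, something along these lines (a rough sketch only; the log
>> > > path, the helper variable, and the 15-second value are illustrative, and
>> > > how the flag reaches the scheduler command line depends on how you
>> > > launch it):
>> > >
>> > > #!/usr/bin/env bash
>> > > # 1. Look for snapshot/backup activity shortly before the 18:22:58
>> > > #    shutdown (log path is a placeholder).
>> > > grep -iE 'snapshot|backup' /var/log/aurora/scheduler.log | grep 'I0316 18:22'
>> > >
>> > > # 2. Raise the session timeout well above the observed 2-3 second pauses
>> > > #    by appending the flag wherever your deployment builds the scheduler
>> > > #    command line (EXTRA_SCHEDULER_ARGS is a made-up name; duration
>> > > #    syntax assumed).
>> > > EXTRA_SCHEDULER_ARGS="${EXTRA_SCHEDULER_ARGS} --zk_session_timeout=15secs"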
>> > >
>> > > On Wed, Mar 18, 2015 at 8:00 AM, Ryan Orr <ryanor...@gmail.com> wrote:
>> > > > We're attempting to get Aurora to handle 11 job requests a second. We
>> > > > realize we're going to be limited by the resource-intensive nature of
>> > > > the CLI and will be giving Herc a shot here soon; however, we seem to
>> > > > get the scheduler to crash in less than an hour while submitting about
>> > > > 2 jobs/second.
>> > > >
>> > > > The job we're submitting simply echoes "Hello" every minute for 5
>> > > > minutes, then exits. Here's what the logs look like at the time of the
>> > > > crash:
>> > > >
>> > > > I0316 18:22:54.348743 20799 log.cpp:680] Attempting to append 1650 bytes to the log
>> > > > I0316 18:22:54.348808 20799 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 22668
>> > > > I0316 18:22:54.348989 20799 replica.cpp:508] Replica received write request for position 22668
>> > > > I0316 18:22:54.352383 20799 leveldb.cpp:343] Persisting action (1671 bytes) to leveldb took 3.370462ms
>> > > > I0316 18:22:54.352414 20799 replica.cpp:676] Persisted action at 22668
>> > > > I0316 18:22:54.352565 20799 replica.cpp:655] Replica received learned notice for position 22668
>> > > > I0316 18:22:54.356220 20799 leveldb.cpp:343] Persisting action (1673 bytes) to leveldb took 3.634224ms
>> > > > I0316 18:22:54.356253 20799 replica.cpp:676] Persisted action at 22668
>> > > > I0316 18:22:54.356266 20799 replica.cpp:661] Replica learned APPEND action at position 22668
>> > > > I0316 18:22:54.561 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening socket connection to server sandbox01/192.168.4.101:2181
>> > > > I0316 18:22:54.563 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.primeConnection: Socket connection established to sandbox01/192.168.4.101:2181, initiating session
>> > > > I0316 18:22:54.564 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult: Session establishment complete on server sandbox01/192.168.4.101:2181, sessionid = 0x14c03fdca89bc0f, negotiated timeout = 4000
>> > > > I0316 18:22:55.374 THREAD143 org.apache.aurora.scheduler.async.OfferManager$OfferManagerImpl.addOffer: Returning offers for 20150304-172015-1711581376-5050-15195-S5 for compaction
>> > > > I0316 18:22:55.374 THREAD143 org.apache.aurora.scheduler.async.OfferManager$OfferManagerImpl.addOffer: Returning offers for 20150304-172015-1711581376-5050-15195-S3 for compaction
>> > > > I0316 18:22:55.394 THREAD137 com.twitter.common.util.StateMachine$Builder$1.execute: 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316 state machine transition PENDING -> ASSIGNED
>> > > > I0316 18:22:55.394 THREAD137 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316
>> > > > I0316 18:22:55.394 THREAD137 org.apache.aurora.scheduler.state.TaskAssigner$TaskAssignerImpl.assign: Offer on slave {URL} (id 20150304-172015-1711581376-5050-15195-S2) is being assigned task for 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316.
>> > > > I0316 18:22:55.395851 20795 log.cpp:680] Attempting to append 1451 bytes to the log
>> > > > I0316 18:22:55.395921 20795 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 22669
>> > > > I0316 18:22:55.396108 20795 replica.cpp:508] Replica received write request for position 22669
>> > > > I0316 18:22:57.621 THREAD1905 org.apache.aurora.scheduler.mesos.MesosScheduler.Impl.logStatusUpdate: Received status update for task 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316 in state TASK_STARTING: Initializing sandbox.
>> > > > I0316 18:22:58.396 THREAD137 com.twitter.common.application.Lifecycle.shutdown: Shutting down application
>> > > > I0316 18:22:58.396 THREAD137 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: Executing 7 shutdown commands.
>> > > > I0316 18:22:58.561 THREAD137 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute: Shutdown initiated by: Thread: TaskScheduler-0 (id 137)
>> > > > java.lang.Thread.getStackTrace(Thread.java:1589)
>> > > >
>> > > > The stack trace doesn't show anything interesting, and I can't figure
>> > > > out why it just randomly issues a shutdown. Let me know if there's
>> > > > anything else from the log that would be helpful; we're on an internal
>> > > > network, so I can't just copy/paste the logs over.
>> > >
>> >
>>
