Summary of IRC Meeting in #aurora at Mon Sep 22 18:02:41 2014:

Attendees: davmclau, wickman, jfarrell, mchucarroll, wfarner, jcohen, Yasumoto, kts, jaybuff, mkhutornenko, zmanji, dlester

- Preface
- scheduler performance issues
- 0.6.0 release
  - Action: all committers to link blockers to release ticket AURORA-711
- job update orchestration in the scheduler

IRC log follows:

## Preface ##
[Mon Sep 22 18:02:58 2014] <kts>: let's get started with a quick roll call
[Mon Sep 22 18:03:05 2014] <Yasumoto>: howdy howdy
[Mon Sep 22 18:03:18 2014] <jfarrell>: here
[Mon Sep 22 18:03:19 2014] <dlester>: present
[Mon Sep 22 18:03:21 2014] <davmclau>: here
[Mon Sep 22 18:03:24 2014] <mchucarroll>: here
[Mon Sep 22 18:03:25 2014] <zmanji>: here
[Mon Sep 22 18:03:27 2014] <jcohen>: here
[Mon Sep 22 18:03:33 2014] <wfarner>: here
[Mon Sep 22 18:03:36 2014] <wickman>: ahoy
[Mon Sep 22 18:03:58 2014] <jaybuff>: howdy
[Mon Sep 22 18:04:15 2014] <mkhutornenko>: morning
[Mon Sep 22 18:04:25 2014] <kts>: morning all

## scheduler performance issues ##
[Mon Sep 22 18:05:11 2014] <kts>: last week we started to see some performance issues around scheduler snapshots in one of our larger production clusters
[Mon Sep 22 18:05:45 2014] <kts>: so you may have seen a higher number of performance-focused reviews going by recently
[Mon Sep 22 18:06:58 2014] <wfarner>: i've started investigating this morning, there may actually be more going on than just snapshots
[Mon Sep 22 18:07:22 2014] <wfarner>: the usual fallout we see is snapshot correlated with timed out tasks (ASSIGNED/KILLING -> LOST)
[Mon Sep 22 18:07:53 2014] <wfarner>: looking into the timeline for one of these, though, there seems to be a stall _before_ the snapshot process begins
[Mon Sep 22 18:08:24 2014] <wfarner>: hopefully more to come on this today
[Mon Sep 22 18:08:47 2014] <wfarner>: just to set some expectations appropriately - this should not impact anything but very large, very heavily-used clusters
[Mon Sep 22 18:09:15 2014] <wfarner>: <eom>
[Mon Sep 22 18:10:15 2014] <kts>: thanks for the update wfarner

## 0.6.0 release ##
[Mon Sep 22 18:11:30 2014] <wfarner>: is there a ticket to track the release yet? there are some feature tickets that i could add as blockers
[Mon Sep 22 18:11:44 2014] <jfarrell>: yes, i created one last week
[Mon Sep 22 18:11:46 2014] <dlester>: https://issues.apache.org/jira/browse/AURORA-711
[Mon Sep 22 18:11:48 2014] <kts>: looking at the action items from last week it looks like everything is pretty much in the same state
[Mon Sep 22 18:11:50 2014] <kts>: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201409.mbox/%3C20140915185248.8B7B9182C9%40urd.zones.apache.org%3E
[Mon Sep 22 18:12:09 2014] <wfarner>: dlester: thanks
[Mon Sep 22 18:12:20 2014] <wfarner>: kts: more or less, though there has been progress on feature work
[Mon Sep 22 18:12:24 2014] <jfarrell>: https://issues.apache.org/jira/browse/AURORA-711
[Mon Sep 22 18:13:16 2014] <kts>: #action all committers to link blockers to release ticket AURORA-711
[Mon Sep 22 18:13:59 2014] <wfarner>: this is also a good time to get deprecation warnings in for things we would like to remove in 0.7.0
[Mon Sep 22 18:14:14 2014] <wfarner>: relevant ticket for that is https://issues.apache.org/jira/browse/AURORA-423
[Mon Sep 22 18:15:47 2014] <kts>: linked
[Mon Sep 22 18:16:03 2014] <kts>: that's all I've got, any other topics?
[Mon Sep 22 18:16:10 2014] <wfarner>: kts: that should not be linked against 0.6.0
[Mon Sep 22 18:16:43 2014] <wfarner>: AURORA-423 will be a blocker to 0.7.0 release
[Mon Sep 22 18:16:50 2014] <kts>: wfarner: we need some way to represent that the list has been finalized though right?
[Mon Sep 22 18:17:34 2014] <wfarner>: maybe 'related'? we definitely shouldn't resolve AURORA-423 for the 0.6.0 release
[Mon Sep 22 18:18:02 2014] <kts>: works for me
[Mon Sep 22 18:20:41 2014] <davmclau>: We got a real life end to end test running for the new scheduler updates.

## job update orchestration in the scheduler ##
[Mon Sep 22 18:21:34 2014] <wfarner>: stage is yours, davmclau
[Mon Sep 22 18:22:58 2014] <davmclau>: The status is that wfarner and mkhutornenko completed the server part with instance events at the end of last week. I updated the UI and we managed to run a complete end to end test by Friday.
[Mon Sep 22 18:23:20 2014] <davmclau>: I think we still have one or two small issues to clean up, but that should be wrapped up this week.
[Mon Sep 22 18:24:16 2014] <davmclau>: (eom)
[Mon Sep 22 18:25:10 2014] <kts>: thanks davmclau
[Mon Sep 22 18:25:27 2014] <kts>: any other topics?
[Mon Sep 22 18:26:18 2014] <jaybuff>: sometime this week or next I am hoping to recruit some people to help write an "Aurora Operational Guide" doc
[Mon Sep 22 18:26:38 2014] <dlester>: jaybuff: sounds great!
[Mon Sep 22 18:26:43 2014] <jaybuff>: i want a mesos one as well
[Mon Sep 22 18:27:07 2014] <wfarner>: jaybuff: count me in
[Mon Sep 22 18:27:12 2014] <Yasumoto>: jaybuff: cool, I'd be stoked to help contribute to both
[Mon Sep 22 18:27:21 2014] <jaybuff>: i will try to write an outline, then maybe we can block off an afternoon and brainstorm
[Mon Sep 22 18:27:36 2014] <jfarrell>: jaybuff: can you start a thread on the dev@ list please, i'm sure a fair amount of people will want to help with that
[Mon Sep 22 18:27:41 2014] <mchucarroll>: i'm also up for helping with that, or pretty much any other documentation.
[Mon Sep 22 18:27:45 2014] <jaybuff>: sounds great
[Mon Sep 22 18:28:10 2014] <Yasumoto>: ah, one last point
[Mon Sep 22 18:28:11 2014] <jaybuff>: we had a pretty disasterous outage last week and it revealed some big holes
[Mon Sep 22 18:28:27 2014] <wickman>: jaybuff: can you discuss in any detail?
[Mon Sep 22 18:29:04 2014] <jaybuff>: sure, after meeting i can go into it. tl;dr there is a bug in the docker containerizer that causes things to explode when you have slaves with 300+ exited docker containers
[Mon Sep 22 18:29:10 2014] <wickman>: ah
[Mon Sep 22 18:29:34 2014] <Yasumoto>: There should be a new pants release today/tomorrow: https://github.com/pantsbuild/pants/issues/597, which will help us get https://issues.apache.org/jira/browse/AURORA-585 cleared up
[Mon Sep 22 18:29:47 2014] <Yasumoto>: I'll send out an email to the dev@ list to make sure no one has concerns
[Mon Sep 22 18:30:13 2014] <Yasumoto>: (it will enforce py27 for the repo, so that may not be 100% desired- tho there is a config option to change that)
[Mon Sep 22 18:31:45 2014] <kts>: sound great
[Mon Sep 22 18:31:52 2014] <kts>: *sounds
[Mon Sep 22 18:32:40 2014] <kts>: anything else?
[Mon Sep 22 18:33:02 2014] <wfarner>: not from me
[Mon Sep 22 18:33:31 2014] <jfarrell>: think we covered the major items
[Mon Sep 22 18:34:23 2014] <kts>: ASFBot: meeting stop

Meeting ended at Mon Sep 22 18:34:23 2014