Summary of IRC Meeting in #aurora at Mon Oct 13 18:02:25 2014:

Attendees: wickman, jcohen, wfarner, Yasumoto, kts, mkhutornenko, davelester, 
zmanji

- Preface
- Aurora doc day
- 0.6.0 release
- Test coverage flakiness
- External update coordination
- Ticket resolution field
- Health check snooze
  - Action: wfarner to report back to email thread with discussion
- Retiring the GC executor
- Security


IRC log follows:

## Preface ##
[Mon Oct 13 18:02:54 2014] <wfarner>: welcome, folks.  let's kick off with a 
roll call
[Mon Oct 13 18:02:55 2014] <wfarner>: here
[Mon Oct 13 18:02:56 2014] <mkhutornenko>: here
[Mon Oct 13 18:03:07 2014] <kts>: here
[Mon Oct 13 18:03:08 2014] <jcohen>: here
[Mon Oct 13 18:03:11 2014] <zmanji>: here
[Mon Oct 13 18:03:35 2014] <davelester>: present
## Aurora doc day ##
[Mon Oct 13 18:05:36 2014] <wfarner>: kts, davelester the floor is yours
[Mon Oct 13 18:05:43 2014] <Yasumoto>: woot
[Mon Oct 13 18:05:43 2014] <kts>: thanks wfarner
[Mon Oct 13 18:06:02 2014] <kts>: we're organizing a day to focus on improving 
aurora's documentation
[Mon Oct 13 18:06:47 2014] <kts>: it's this Thursday, 16 Oct 2014 from 
1000-1700 PDT
[Mon Oct 13 18:07:18 2014] <wickman>: here (womp)
[Mon Oct 13 18:07:32 2014] <kts>: we'll be coordinating in this channel, but if 
you have anything you'd like to see documentation improved for please file a 
ticket now
[Mon Oct 13 18:07:52 2014] <kts>: some great examples of what those tickets 
look like:
[Mon Oct 13 18:07:54 2014] <kts>: AURORA-829
[Mon Oct 13 18:08:12 2014] <kts>: AURORA-828
[Mon Oct 13 18:08:27 2014] <kts>: make sure you mark your ticket with the JIRA 
component "Documentation"
[Mon Oct 13 18:08:35 2014] <kts>: hope to see everyone there
[Mon Oct 13 18:08:50 2014] <davelester>: We currently have 19 unresolved issues 
w/ the Documentation component 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20AURORA%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Documentation%20ORDER%20BY%20priority%20DESC
## 0.6.0 release ##
[Mon Oct 13 18:09:51 2014] <wfarner>: AURORA-711
[Mon Oct 13 18:10:28 2014] <wfarner>: In the course of adding the new client 
syntax and backend to the release goals, we've picked up a number of blocking 
tickets
[Mon Oct 13 18:11:03 2014] <wfarner>: I implore everyone with some spare cycles 
to pick up one of these tickets to help us cross the finish line.
[Mon Oct 13 18:11:49 2014] <wfarner>: Please explicitly take ownership by 
self-assigning what you think you can pick up.
## Test coverage flakiness ##
[Mon Oct 13 18:13:40 2014] <wfarner>: ~2 weeks ago, i added a check to the java 
build to fail the build on different types of missing test coverage
[Mon Oct 13 18:14:16 2014] <wfarner>: There has been some difficult-to-pinpoint 
flakiness with one check in particular - which makes sure that all classes have 
some test coverage
[Mon Oct 13 18:14:32 2014] <wfarner>: I believe this is now fixed, with 
AURORA-822
[Mon Oct 13 18:14:37 2014] <wfarner>: AURORA-822
[Mon Oct 13 18:14:58 2014] <wfarner>: If you see any more issues, please raise 
a ticket, as i now consider the bug squashed.
## External update coordination ##
[Mon Oct 13 18:15:59 2014] <wfarner>: mkhutornenko: anything to follow up from 
the email discussion on this topic?
[Mon Oct 13 18:16:37 2014] <mkhutornenko>: I would really like to hear more 
feedback on that
[Mon Oct 13 18:16:39 2014] <wfarner>: context: 
http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCAOTkfX7x2oipk4ZFysoS0uWZRizOnKJA3y15pvEW5K4YnUHw-A%40mail.gmail.com%3E
[Mon Oct 13 18:17:08 2014] <mkhutornenko>: is it going to add any value, any 
changes we should consider and etc.
[Mon Oct 13 18:17:23 2014] <wfarner>: ok - everyone please read that thread, 
speak now or forever hold your peace
## Ticket resolution field ##
[Mon Oct 13 18:18:51 2014] <wfarner>: There's a gotcha with closing tickets 
right now that i'm working to resolve.
[Mon Oct 13 18:19:04 2014] <wfarner>: This results in tickets being in 
status=Closed with resolution=None.
[Mon Oct 13 18:19:21 2014] <mkhutornenko>: +1 caught me many times before
[Mon Oct 13 18:19:57 2014] <wfarner>: I believe this is an issue with the JIRA 
project configuration.  For the time being, please be careful to avoid clicking 
buttons like 'Close'.
[Mon Oct 13 18:20:05 2014] <wfarner>: Instead prefer buttons that say 'Resolve'.
[Mon Oct 13 18:20:33 2014] <wfarner>: I hope to have this resolved this week, 
but as i do not have JIRA admin access, i cannot guarantee this.
## Health check snooze ##
[Mon Oct 13 18:21:52 2014] <wfarner>: We had a review for a new feature move to 
a dev list discussion last week.  Does anybody believe we did not achieve 
consensus on the approach?
[Mon Oct 13 18:22:00 2014] <wfarner>: https://reviews.apache.org/r/26383/
[Mon Oct 13 18:22:21 2014] <wfarner>: AURORA-795
[Mon Oct 13 18:22:27 2014] <zmanji>: There is a mailing list thread here: 
http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCACGrrVnLWDU=vevaft_qn0il5c8oq7pqae-3ge5nnh6vjg4...@mail.gmail.com%3E
[Mon Oct 13 18:22:57 2014] <zmanji>: I don’t think we have a consensus yet so 
please voice your opinion
[Mon Oct 13 18:23:00 2014] <wickman>: I think the consensus was "touch a snooze 
file, then unlink after mtime + CONSTANT_TIMEOUT"
[Mon Oct 13 18:23:07 2014] <wickman>: is that not correct?
[Mon Oct 13 18:23:14 2014] <wfarner>: wickman: that was my understanding as well
[Mon Oct 13 18:23:42 2014] <wickman>: the other option is "touch a file, and 
the health checker is disabled as long as that file is there."
[Mon Oct 13 18:23:43 2014] <kts>: I still feel that we should avoid being too 
clever in our implementation here
[Mon Oct 13 18:23:44 2014] <jcohen>: yeah, it sounded to me like that’s what 
we were coalescing on.
[Mon Oct 13 18:23:58 2014] <wickman>: the reason that I'm less in favor of that 
approach is that it's not really a snooze -- it's a sleep, and could be prone 
to somebody forgetting to turn it off
[Mon Oct 13 18:24:10 2014] <wickman>: which might be okay -- i think 99 times 
out of 100, people will be snoozing so they can get the state of a wedged task
[Mon Oct 13 18:24:13 2014] <wickman>: at which point they will kill when 
they're done
[Mon Oct 13 18:24:18 2014] <wickman>: so i think there's a reasonable argument 
either way
[Mon Oct 13 18:24:24 2014] <wfarner>: yes, i'm torn
[Mon Oct 13 18:24:36 2014] <kts>: but we don't really know how long a tool will 
take to get information about the wedged state
[Mon Oct 13 18:24:47 2014] <wickman>: kts: yeah, that's why #1 might be more 
appealing
[Mon Oct 13 18:24:58 2014] <jcohen>: kts: in that case they could extend the 
snooze by using touch -m?
[Mon Oct 13 18:25:05 2014] <wickman>: though you could just do (while true; do 
touch .snooze; sleep 60; done;) &
[Mon Oct 13 18:25:46 2014] <jcohen>: I suppose it’s a question of what’s 
more likely (or more concerning): will someone forget to remove a snooze, or 
forget to extend it
[Mon Oct 13 18:26:06 2014] <mkhutornenko>: +1 for not deleting the file. 
Avoiding FS mutation == Less complexity == less things to go wrong
[Mon Oct 13 18:26:09 2014] <wickman>: i think it's important to look at why 
you'd want to snooze in the first place
[Mon Oct 13 18:26:15 2014] <kts>: forget to extend means diagnostic information 
is lost forever
[Mon Oct 13 18:26:18 2014] <wickman>: the only case i can think of is something 
in a super weird state
[Mon Oct 13 18:26:27 2014] <wickman>: and they're almost always going to kill 
those things in the weird state when they're done
[Mon Oct 13 18:26:32 2014] <wickman>: which would point to a permanent snooze
[Mon Oct 13 18:26:38 2014] <wfarner>: that was my feeling as well
[Mon Oct 13 18:28:17 2014] <wfarner>: should we reverse the position on this 
back to no time awareness at all?
[Mon Oct 13 18:28:23 2014] <kts>: +1
[Mon Oct 13 18:28:31 2014] <mkhutornenko>: +1
[Mon Oct 13 18:28:40 2014] <zmanji>: +1
[Mon Oct 13 18:28:46 2014] <wfarner>: +1
[Mon Oct 13 18:29:02 2014] <zmanji>: wfarner: can you update that thread and 
review with this information?
[Mon Oct 13 18:29:10 2014] <jcohen>: wickman is +1 on permanent snooze by proxy
[Mon Oct 13 18:29:11 2014] <wickman>: aye
[Mon Oct 13 18:29:22 2014] <wickman>: permanent snooze #freebandname
[Mon Oct 13 18:29:24 2014] <wfarner>: #action wfarner to report back to email 
thread with discussion
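
The two snooze semantics weighed above can be sketched roughly as follows. This is a minimal Python illustration only, not the code under review in r/26383: the function names, the `sandbox` argument, and the timeout value are hypothetical, and only the `.snooze` file name and `CONSTANT_TIMEOUT` come from the discussion.

```python
import os
import time

SNOOZE_FILE = ".snooze"   # file name used in the discussion above
CONSTANT_TIMEOUT = 600    # hypothetical value, in seconds


def snoozed_permanent(sandbox):
    """Option 2: health checks stay disabled for as long as the file exists."""
    return os.path.exists(os.path.join(sandbox, SNOOZE_FILE))


def snoozed_with_timeout(sandbox, now=None, timeout=CONSTANT_TIMEOUT):
    """Option 1: the snooze expires at mtime + timeout, then the file is unlinked."""
    path = os.path.join(sandbox, SNOOZE_FILE)
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No snooze file present: health checks run normally.
        return False
    if now >= mtime + timeout:
        # Snooze has expired: remove the file (the file system mutation
        # that the permanent variant avoids).
        try:
            os.unlink(path)
        except OSError:
            pass
        return False
    return True
```

With the timed variant, re-touching the file pushes the expiry out again; the permanent variant has no expiry and never modifies the file system, which is the simplicity argument raised in the discussion.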
## Retiring the GC executor ##
[Mon Oct 13 18:29:43 2014] <wfarner>: AURORA-715
[Mon Oct 13 18:30:22 2014] <wfarner>: jcohen you are leading the charge here.  
i believe there may be more tickets to create under that epic
[Mon Oct 13 18:30:27 2014] <wfarner>: do you feel you have a grasp for what is 
involved?
[Mon Oct 13 18:30:39 2014] <jcohen>: wickman and I began discussing this a bit 
today. I still need to do a bit of research before I fully understand 
everything that needs to be done.
[Mon Oct 13 18:30:48 2014] <wfarner>: great
[Mon Oct 13 18:31:08 2014] <wfarner>: can you (very) briefly summarize the 
moving parts for those not in the know?
[Mon Oct 13 18:31:30 2014] <jcohen>: Not yet ;)
[Mon Oct 13 18:31:51 2014] <wfarner>: fair enough, please fill in the epic as 
you uncover more.  we can revisit next week
[Mon Oct 13 18:32:08 2014] <wickman>: the tl;dr here is that only 
thermos_observer and thermos will need a plugin to detect tasks via either the 
ExecutorDetector (path-based code) or new code that talks to the local slave
[Mon Oct 13 18:32:20 2014] <jcohen>: Feel free to correct my possibly naive 
understanding, but the GC executor is currently responsible for reconciling 
task state and cleaning up thermos checkpoints
[Mon Oct 13 18:32:36 2014] <jcohen>: the task state reconciliation will be 
handled by mesos
[Mon Oct 13 18:33:40 2014] <jcohen>: so we’ll need to fix things so 
checkpoints are properly cleaned up w/o the GC executor as well as work out a 
way for the scheduler UI to be properly notified
[Mon Oct 13 18:33:45 2014] <kts>: as will the cleaning of checkpoints as 
they'll be moved into the sandbox
[Mon Oct 13 18:33:45 2014] <jcohen>: (hand wave)
[Mon Oct 13 18:34:04 2014] <kts>: and therefore within the purview of the 
slave's disk gc
[Mon Oct 13 18:34:25 2014] <jcohen>: yes
[Mon Oct 13 18:34:33 2014] <wickman>: yeah, once the checkpoint root is set to 
be within the mesos sandbox, we no longer need to be concerned about clean up 
anymore... just discoverability via the thermos CLI and thermos observer
[Mon Oct 13 18:34:44 2014] <wickman>: the longer term plan for the thermos 
observer is to deprecate it -- so if that's accelerated, that issue is moot
[Mon Oct 13 18:35:22 2014] <wfarner>: thanks for the context
[Mon Oct 13 18:35:30 2014] <wfarner>: That exhausts my topics, any others?
## Security ##
[Mon Oct 13 18:36:30 2014] <kts>: AURORA-720
[Mon Oct 13 18:37:25 2014] <kts>: I've written up a rough outline of proposed 
steps to refactor the scheduler security code to use apache shiro
[Mon Oct 13 18:37:39 2014] <kts>: expressed as issues in that epic
[Mon Oct 13 18:38:21 2014] <kts>: tl;dr we would adopt Shiro and deprecate our 
custom solution
[Mon Oct 13 18:38:49 2014] <kts>: of which there are currently no public 
implementations that do anything
[Mon Oct 13 18:39:07 2014] <mkhutornenko>: kts: add that outline to AURORA-723?
[Mon Oct 13 18:39:14 2014] <mkhutornenko>: AURORA-723
[Mon Oct 13 18:39:22 2014] <wickman>: there are no public applications/products 
that use Shiro?
[Mon Oct 13 18:39:39 2014] <kts>: wickman: no, that use our custom security 
framework in aurora
[Mon Oct 13 18:39:39 2014] <wickman>: out of curiosity, how old is Shiro?
[Mon Oct 13 18:39:43 2014] <wickman>: oh
[Mon Oct 13 18:39:43 2014] <wickman>: ok
[Mon Oct 13 18:40:15 2014] <wfarner>: http://en.wikipedia.org/wiki/Apache_Shiro
[Mon Oct 13 18:40:21 2014] <wfarner>: 1.0 since July 2010
[Mon Oct 13 18:40:29 2014] <wfarner>: latest release Feb 2014
[Mon Oct 13 18:40:42 2014] <kts>: and a fellow ASF project
[Mon Oct 13 18:41:00 2014] <kts>: anyway more details to come
[Mon Oct 13 18:41:13 2014] <wfarner>: thanks, kts
[Mon Oct 13 18:41:17 2014] <wfarner>: Last call for topics
[Mon Oct 13 18:41:52 2014] <kts>: AURORA-801
[Mon Oct 13 18:42:28 2014] <kts>: worth noting - if you've been running off 
master recently you'll want to pick up the patch that fixes that issue
[Mon Oct 13 18:43:18 2014] <kts>: that's all I've got
[Mon Oct 13 18:45:17 2014] <wfarner>: ASFBot702: meeting stop


Meeting ended at Mon Oct 13 18:45:17 2014
