+1, thanks for the braindump, Brian! This sounds great. -=Bill
On Sat, Jan 24, 2015 at 8:43 AM, Joe Smith <yasumo...@gmail.com> wrote:

> Thanks for the write up!
>
> On Jan 22, 2015, at 13:27, Brian Wickman <wick...@apache.org> wrote:
>
> > Thermos is a standalone task execution system that is not coupled to
> > Aurora or Mesos. This is why, by default, Thermos writes outside the
> > sandbox (/var/run/thermos) and has a separate observability system
> > (Thermos observer) and CLI (thermos).
> >
> > Aurora built a Thermos executor as its default executor, but the
> > scheduler is not architecturally tied to Thermos (or vice versa). To
> > make things work smoothly with this decoupling, a Thermos-specific GC
> > executor is also necessary to clean up the state left over by the
> > execution of Thermos tasks and to reconcile potential conflicts
> > between the state of the Mesos master and the Aurora scheduler.
> >
> > Both the GC executor and the Thermos observer violate some of the
> > philosophical axioms of Mesos (e.g. out-of-sandbox access). They also
> > significantly increase the complexity of building, deploying, and
> > maintaining Aurora. I'm proposing removing both of them as required
> > Aurora components.
> >
> > To do this and make Thermos/Aurora/Mesos play together more nicely,
> > several things are necessary.
> >
> > 1) Moving /var/run/thermos for each task into the Mesos sandbox
> >
> > Thermos is a state machine with all state transitions persisted to
> > disk. Right now this goes to /var/run/thermos, but it should instead
> > be persisted some place relative to the Mesos sandbox so that the
> > Mesos slave can garbage collect this state once a Thermos task has
> > completed.
> >
> > This poses a task detection problem -- the Thermos CLI and Thermos
> > observer rely upon the existence of /var/run/thermos to know what
> > tasks are running, so we will need to develop a plugin to detect
> > alternate task roots (see
> > AURORA-1024 <https://issues.apache.org/jira/browse/AURORA-1024>,
> > AURORA-1025 <https://issues.apache.org/jira/browse/AURORA-1025>,
> > AURORA-1026 <https://issues.apache.org/jira/browse/AURORA-1026>,
> > AURORA-1027 <https://issues.apache.org/jira/browse/AURORA-1027>).
> >
> > 2) Making the Thermos executor responsible for the Thermos UI
> >
> > In order to make the Thermos observer an optional component, the
> > Thermos executor will need to assume Thermos observer
> > responsibilities. Since the Mesos slave already provides a webserver
> > to serve executor sandboxes, I am proposing that the Thermos executor
> > generate static HTML content that can be served by the Mesos slave
> > as a UI. This means that the executor can remain lean (no embedded
> > webserver). See
> > AURORA-725 <https://issues.apache.org/jira/browse/AURORA-725> and
> > AURORA-777 <https://issues.apache.org/jira/browse/AURORA-777>.
> >
> > 3) Making the Aurora scheduler responsible for state reconciliation
> >
> > The last component that should be removed is the GC executor. The GC
> > executor performs the important task of state reconciliation, but
> > this is now supported directly by the Mesos master. See
> > AURORA-715 <https://issues.apache.org/jira/browse/AURORA-715> and
> > specifically
> > AURORA-1047 <https://issues.apache.org/jira/browse/AURORA-1047>.
>
> Although the trusty gc_executor has been solid for a long time,
> removing it would definitely simplify things, so +10.
>
> > Lastly, this work should make it much easier to support alternate
> > executor implementations (including the Mesos default executor) from
> > Aurora once a proper Aurora API
> > (AURORA-987 <https://issues.apache.org/jira/browse/AURORA-987>) is
> > available.
> >
> > ~brian
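
To make the task detection problem from item 1 concrete, here is a rough sketch of what detecting alternate task roots might look like once state lives inside each sandbox. Everything specific in it is an assumption for illustration: the `.thermos` subdirectory name, the `checkpoints/<task_id>` layout beneath it, and the `detect_task_roots` helper are hypothetical and are not taken from the thread or from the Thermos codebase. The real plugin interface is whatever comes out of the AURORA tickets referenced above.

```python
import glob
import os
from typing import Iterator, Tuple

# Assumption: each sandbox keeps its Thermos state under this
# (hypothetical) subdirectory instead of the global /var/run/thermos.
CHECKPOINT_SUBDIR = ".thermos"


def detect_task_roots(sandbox_pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield (task_id, root) for every task found in matching sandboxes.

    Assumed layout, not confirmed by the thread:
        <sandbox>/.thermos/checkpoints/<task_id>/
    """
    for sandbox in sorted(glob.glob(sandbox_pattern)):
        root = os.path.join(sandbox, CHECKPOINT_SUBDIR)
        checkpoints = os.path.join(root, "checkpoints")
        # Sandboxes without Thermos state are skipped, e.g. tasks
        # launched by a different executor.
        if not os.path.isdir(checkpoints):
            continue
        for task_id in sorted(os.listdir(checkpoints)):
            yield task_id, root
```

A CLI or observer could then iterate over `detect_task_roots("<slave work dir pattern>")` rather than listing one fixed directory. Because every root sits inside a sandbox, the Mesos slave's own garbage collection would remove the state along with the sandbox.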