I agree with everything here. A big pain point on the Docker integration side was (and still is) the observer, and rolling the observer functionality into the executor would simplify things greatly.
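
To make the static-HTML idea from the proposal quoted below a bit more concrete, here is a rough sketch of what the executor side could look like. Everything in it is an assumption for illustration: the function names, the (name, state) shape of the process list, and the index.html filename are made up and are not part of Thermos, Aurora, or Mesos. The only point is that the executor can periodically rewrite a plain file in its sandbox and let the slave's existing file server do the rest.

```python
import html
import os
import tempfile


def render_task_page(task_id, processes):
    """Return a static HTML page summarizing one task.

    `processes` is a list of (name, state) string pairs. Both this argument
    shape and the page layout are hypothetical, chosen only for the sketch.
    """
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(html.escape(name), html.escape(state))
        for name, state in processes
    )
    return (
        "<!DOCTYPE html>\n"
        "<html><head><title>{title}</title></head><body>\n"
        "<h1>{title}</h1>\n"
        "<table>\n<tr><th>process</th><th>state</th></tr>\n{rows}\n</table>\n"
        "</body></html>\n"
    ).format(title=html.escape(task_id), rows=rows)


def write_task_page(sandbox_dir, task_id, processes, filename="index.html"):
    """Write the page into the sandbox, replacing any previous version.

    The page is written to a temporary file in the same directory and then
    renamed over the target, so a reader never sees a half-written file.
    """
    page = render_task_page(task_id, processes)
    target = os.path.join(sandbox_dir, filename)
    fd, tmp_path = tempfile.mkstemp(dir=sandbox_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fp:
            fp.write(page)
        os.replace(tmp_path, target)
    except Exception:
        # Clean up the temporary file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target
```

The executor would call something like write_task_page(sandbox_dir, task_id, processes) whenever a process changes state. No embedded webserver is needed, which matches the "lean executor" goal, at the cost of the page only being as fresh as the last rewrite.
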

On Sat, Jan 24, 2015 at 12:29 PM, Bill Farner <wfar...@apache.org> wrote:

> +1, thanks for the braindump, Brian! This sounds great.
>
> -=Bill
>
> On Sat, Jan 24, 2015 at 8:43 AM, Joe Smith <yasumo...@gmail.com> wrote:
>
> > Thanks for the write up!
> >
> > > On Jan 22, 2015, at 13:27, Brian Wickman <wick...@apache.org> wrote:
> > >
> > > Thermos is a standalone task execution system that is not coupled to
> > > Aurora or Mesos. This is why by default, Thermos writes out of the
> > > sandbox (/var/run/thermos), has a separate observability system
> > > (Thermos observer), and CLI (thermos.)
> > >
> > > Aurora built a Thermos executor as its default executor, but the
> > > scheduler is not architecturally tied to Thermos (or vice versa.) In
> > > order to make things work smoothly with this decoupling, a
> > > Thermos-specific GC executor is also necessary to clean up the state
> > > left over by the execution of Thermos tasks and reconcile potential
> > > conflicts between the state of the Mesos master and Aurora scheduler.
> > >
> > > Both the GC executor and Thermos observer violate some of the
> > > philosophical axioms of Mesos (e.g. out-of-sandbox access.) They also
> > > significantly increase the complexity of building, deploying and
> > > maintaining Aurora. I'm proposing removing both of them as required
> > > Aurora components.
> > >
> > > In order to do this and make Thermos/Aurora/Mesos play together more
> > > nicely, several things are necessary.
> > >
> > > 1) Moving /var/run/thermos for each task into the Mesos sandbox
> > >
> > > Thermos is a state machine with all state transitions persisted to
> > > disk. Right now this goes to /var/run/thermos, but it should instead be
> > > persisted some place relative to the Mesos sandbox so that the Mesos
> > > slave can garbage collect this state once a Thermos task has completed.
> > >
> > > This poses a task detection problem -- the Thermos CLI and Thermos
> > > observer rely upon the existence of /var/run/thermos to know what tasks
> > > are running, so we will need to develop a plugin to detect alternate
> > > task roots (see
> > > AURORA-1024 <https://issues.apache.org/jira/browse/AURORA-1024>
> > > AURORA-1025 <https://issues.apache.org/jira/browse/AURORA-1025>
> > > AURORA-1026 <https://issues.apache.org/jira/browse/AURORA-1026>
> > > AURORA-1027 <https://issues.apache.org/jira/browse/AURORA-1027>).
> > >
> > > 2) Making the Thermos executor responsible for the Thermos UI
> > >
> > > In order to make the Thermos observer an optional component, the
> > > Thermos executor will need to assume Thermos observer responsibilities.
> > > Since the Mesos slave already provides a webserver to serve executor
> > > sandboxes, I am proposing that the Thermos executor generates static
> > > HTML content that can be served by the Mesos slave as a UI. This means
> > > that the executor can remain lean (no embedded webserver.) See
> > > AURORA-725 <https://issues.apache.org/jira/browse/AURORA-725>
> > > AURORA-777 <https://issues.apache.org/jira/browse/AURORA-777>
> > >
> > > 3) Making the Aurora scheduler responsible for state reconciliation
> > >
> > > The last component that should be removed is the GC executor. The GC
> > > executor performs the important task of state reconciliation, but this
> > > is now supported directly by the Mesos master. See
> > > AURORA-715 <https://issues.apache.org/jira/browse/AURORA-715> and
> > > specifically
> > > AURORA-1047 <https://issues.apache.org/jira/browse/AURORA-1047>.
> >
> > Although the trusty gc_executor has been solid for a long time, removing
> > it would definitely simplify things, so +10.
> >
> > > Lastly, this work should make it much easier to support alternate
> > > executor implementations (including the Mesos default executor) from
> > > Aurora once a proper Aurora API (AURORA-987
> > > <https://issues.apache.org/jira/browse/AURORA-987>) is available.
> > >
> > > ~brian