+1, thanks for the braindump, Brian!  This sounds great.

-=Bill

On Sat, Jan 24, 2015 at 8:43 AM, Joe Smith <yasumo...@gmail.com> wrote:

> Thanks for the write up!
>
> > On Jan 22, 2015, at 13:27, Brian Wickman <wick...@apache.org> wrote:
> >
> > Thermos is a standalone task execution system that is not coupled to
> > Aurora or Mesos.  This is why, by default, Thermos writes outside the
> > sandbox (/var/run/thermos), has a separate observability system (the
> > Thermos observer), and has its own CLI (thermos).
> >
> > Aurora built a Thermos executor as its default executor, but the
> > scheduler is not architecturally tied to Thermos (or vice versa).  In
> > order to make things work smoothly with this decoupling, a
> > Thermos-specific GC executor is also necessary to clean up the state left
> > over by the execution of Thermos tasks and to reconcile potential
> > conflicts between the state of the Mesos master and the Aurora scheduler.
> >
> > Both the GC executor and the Thermos observer violate some of the
> > philosophical axioms of Mesos (e.g. out-of-sandbox access).  They also
> > significantly increase the complexity of building, deploying and
> > maintaining Aurora.  I'm proposing removing both of them as required
> > Aurora components.
> >
> > In order to do this and make Thermos, Aurora and Mesos play together
> > more nicely, several things are necessary.
> >
> > 1) Moving /var/run/thermos for each task into the Mesos sandbox
> >
> > Thermos is a state machine with all state transitions persisted to disk.
> > Right now this goes to /var/run/thermos, but it should instead be
> > persisted some place relative to the Mesos sandbox so that the Mesos
> > slave can garbage collect this state once a Thermos task has completed.
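
A minimal Python sketch of a sandbox-relative, checkpointed state machine, to make the idea concrete. It is illustrative only, not Thermos's actual checkpoint code: it assumes the sandbox path is exported in the MESOS_SANDBOX environment variable, and the `.thermos` subdirectory, the JSON-lines `<task_id>.ckpt` file, and the toy state names are all hypothetical.

```python
import json
import os
import tempfile

LEGACY_ROOT = "/var/run/thermos"  # today's out-of-sandbox location


def checkpoint_root(env=None):
    """Pick a checkpoint root inside the Mesos sandbox when one is known.

    Assumes the sandbox path is exported as MESOS_SANDBOX; the ".thermos"
    subdirectory name is hypothetical. Falls back to the legacy root.
    """
    env = os.environ if env is None else env
    sandbox = env.get("MESOS_SANDBOX")
    if sandbox:
        return os.path.join(sandbox, ".thermos")
    return LEGACY_ROOT


class CheckpointedStateMachine:
    """Toy task state machine that appends every transition to a log file."""

    TRANSITIONS = {
        "PENDING": {"RUNNING", "KILLED"},
        "RUNNING": {"SUCCESS", "FAILED", "KILLED"},
    }  # states with no entry are terminal

    def __init__(self, root, task_id):
        os.makedirs(root, exist_ok=True)
        self.path = os.path.join(root, task_id + ".ckpt")
        self.state = "PENDING"
        self._replay()

    def _replay(self):
        # Rebuild the in-memory state from any transitions already on disk.
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    self.state = json.loads(line)["to"]

    def transition(self, new_state):
        if new_state not in self.TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        # Persist first, then update memory, so a restart can replay it.
        with open(self.path, "a") as f:
            f.write(json.dumps({"from": self.state, "to": new_state}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state = new_state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as sandbox:
        root = checkpoint_root({"MESOS_SANDBOX": sandbox})
        CheckpointedStateMachine(root, "task-0").transition("RUNNING")
        # A fresh instance (e.g. after an executor restart) replays the log.
        print(CheckpointedStateMachine(root, "task-0").state)  # RUNNING
```

Because the root lives under the sandbox in this sketch, the slave's normal sandbox garbage collection would remove the checkpoint along with everything else once the task is done.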
> >
> > This poses a task detection problem: the Thermos CLI and Thermos
> > observer rely upon the existence of /var/run/thermos to know what tasks
> > are running, so we will need to develop a plugin to detect alternate task
> > roots (see
> > AURORA-1024 <https://issues.apache.org/jira/browse/AURORA-1024>,
> > AURORA-1025 <https://issues.apache.org/jira/browse/AURORA-1025>,
> > AURORA-1026 <https://issues.apache.org/jira/browse/AURORA-1026> and
> > AURORA-1027 <https://issues.apache.org/jira/browse/AURORA-1027>).
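
One possible shape for such a plugin, sketched in Python under stated assumptions: `TaskRootDetector`, `FixedRootDetector`, `SandboxGlobDetector`, `detect_all` and the `<sandbox>/.thermos/<task_id>.ckpt` layout are hypothetical names chosen for illustration, not the design tracked in the JIRA tickets.

```python
import glob
import os
import tempfile
from abc import ABC, abstractmethod


class TaskRootDetector(ABC):
    """Hypothetical plugin interface yielding (task_id, checkpoint_root) pairs."""

    @abstractmethod
    def find_tasks(self):
        raise NotImplementedError


class FixedRootDetector(TaskRootDetector):
    """Current behavior: every task is checkpointed under one well-known root."""

    def __init__(self, root="/var/run/thermos"):
        self.root = root

    def find_tasks(self):
        # glob returns [] for a missing root, so absent roots are simply empty.
        for path in glob.glob(os.path.join(self.root, "*.ckpt")):
            yield os.path.basename(path)[: -len(".ckpt")], self.root


class SandboxGlobDetector(TaskRootDetector):
    """Look for per-task roots inside every sandbox matching a glob pattern.

    The pattern is supplied by the operator; real Mesos slave work-directory
    layouts differ across versions, so nothing here assumes a specific one.
    """

    def __init__(self, sandbox_pattern):
        self.sandbox_pattern = sandbox_pattern

    def find_tasks(self):
        for sandbox in glob.glob(self.sandbox_pattern):
            root = os.path.join(sandbox, ".thermos")
            yield from FixedRootDetector(root).find_tasks()


def detect_all(detectors):
    """Merge every detector's results; the first detector to report a task wins."""
    found = {}
    for detector in detectors:
        for task_id, root in detector.find_tasks():
            found.setdefault(task_id, root)
    return found


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        for sandbox, task_id in (("s1", "task-a"), ("s2", "task-b")):
            root = os.path.join(work_dir, sandbox, ".thermos")
            os.makedirs(root)
            open(os.path.join(root, task_id + ".ckpt"), "w").close()
        detectors = [FixedRootDetector(),
                     SandboxGlobDetector(os.path.join(work_dir, "*"))]
        print(sorted(detect_all(detectors)))
```

Keeping the legacy fixed root as just another detector lets the CLI and observer see both old-style and sandbox-relative tasks during a migration.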
> >
> > 2) Making the Thermos executor responsible for the Thermos UI
> >
> > In order to make the Thermos observer an optional component, the Thermos
> > executor will need to assume the Thermos observer's responsibilities.
> > Since the Mesos slave already provides a webserver to serve executor
> > sandboxes, I am proposing that the Thermos executor generate static HTML
> > content that can be served by the Mesos slave as a UI.  This means that
> > the executor can remain lean (no embedded webserver).  See AURORA-725
> > <https://issues.apache.org/jira/browse/AURORA-725> and AURORA-777
> > <https://issues.apache.org/jira/browse/AURORA-777>.
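
A rough Python sketch of the static-UI idea: the executor renders a small HTML status page and atomically replaces it inside the sandbox whenever task state changes, so the slave's existing sandbox file serving could expose it without an embedded webserver. The page layout, the `thermos.html` file name and both function names are assumptions for illustration.

```python
import html
import os
import tempfile


def render_status_page(task_id, processes):
    """Render a self-contained HTML page; processes maps name -> state."""
    rows = "\n".join(
        f"<tr><td>{html.escape(name)}</td><td>{html.escape(state)}</td></tr>"
        for name, state in sorted(processes.items())
    )
    title = html.escape(task_id)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head><body>\n"
        f"<h1>{title}</h1>\n"
        "<table><tr><th>Process</th><th>State</th></tr>\n"
        f"{rows}\n"
        "</table></body></html>\n"
    )


def write_status_page(sandbox, task_id, processes, name="thermos.html"):
    """Write to a temp file, then rename, so readers never see a partial page."""
    final_path = os.path.join(sandbox, name)
    fd, tmp_path = tempfile.mkstemp(dir=sandbox, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(render_status_page(task_id, processes))
    os.replace(tmp_path, final_path)  # atomic on POSIX within one filesystem
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as sandbox:
        path = write_status_page(sandbox, "task-0", {"hello": "RUNNING"})
        print(open(path).read())
```

Regenerating the page on each state transition, rather than on request, is what keeps the executor free of any serving logic.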
> >
> > 3) Making the Aurora scheduler responsible for state reconciliation
> >
> > The last component that should be removed is the GC executor.  The GC
> > executor performs the important task of state reconciliation, but this is
> > now supported directly by the Mesos master.  See AURORA-715
> > <https://issues.apache.org/jira/browse/AURORA-715> and specifically
> > AURORA-1047 <https://issues.apache.org/jira/browse/AURORA-1047>.
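
To illustrate what moving reconciliation into the scheduler involves, here is a minimal Python sketch of the diffing step, assuming the scheduler has already collected the master's reconciliation status updates into a dict. The state names mirror Mesos `TaskState` values; `reconcile`, `Actions` and treating a task the master does not report as `TASK_LOST` are assumptions of this sketch, not Aurora's implementation.

```python
from dataclasses import dataclass, field

# Subset of Mesos TaskState values that mark a task as finished.
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


@dataclass
class Actions:
    update: dict = field(default_factory=dict)  # task_id -> state to record
    kill: set = field(default_factory=set)  # live tasks the scheduler doesn't know


def reconcile(scheduler_view, master_view):
    """Diff the scheduler's stored task states against the master's report.

    scheduler_view: task_id -> state the scheduler currently believes.
    master_view: task_id -> state from the master's reconciliation updates.
    """
    actions = Actions()
    for task_id, believed in scheduler_view.items():
        if believed in TERMINAL:
            continue  # the master may legitimately have forgotten finished tasks
        # Assumption of this sketch: a task the master does not report is lost.
        actual = master_view.get(task_id, "TASK_LOST")
        if actual != believed:
            actions.update[task_id] = actual
    for task_id, state in master_view.items():
        if task_id not in scheduler_view and state not in TERMINAL:
            actions.kill.add(task_id)
    return actions


if __name__ == "__main__":
    result = reconcile(
        {"a": "TASK_RUNNING", "b": "TASK_RUNNING", "c": "TASK_FINISHED"},
        {"a": "TASK_RUNNING", "d": "TASK_RUNNING"},
    )
    print(result.update, result.kill)  # {'b': 'TASK_LOST'} {'d'}
```

This is the bookkeeping the GC executor currently does from the slave side; with master-driven reconciliation it becomes a pure function of two views that the scheduler already has.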
>
> Although the trusty gc_executor has been solid for a long time, removing
> it would definitely simplify things, so +10.
>
>
> >
> > Lastly, this work should make it much easier to support alternate
> > executor implementations (including the Mesos default executor) from
> > Aurora once a proper Aurora API (AURORA-987
> > <https://issues.apache.org/jira/browse/AURORA-987>) is available.
> >
> > ~brian
>
