Thermos is a standalone task execution system that is not coupled to Aurora
or Mesos.  This is why, by default, Thermos writes outside the sandbox
(to /var/run/thermos) and has its own observability system (the Thermos
observer) and CLI (thermos).

Aurora built a Thermos executor as its default executor, but the scheduler
is not architecturally tied to Thermos (or vice versa).  In order to make
things work smoothly with this decoupling, a Thermos-specific GC executor
is also necessary to clean up the state left over by the execution of
Thermos tasks and reconcile potential conflicts between the state of the
Mesos master and Aurora scheduler.

Both the GC executor and Thermos observer violate some of the philosophical
axioms of Mesos (e.g. out-of-sandbox access).  They also significantly
increase the complexity of building, deploying and maintaining Aurora.  I'm
proposing removing both of them as required Aurora components.

In order to do this and make Thermos/Aurora/Mesos play together more
nicely, several things are necessary.

1) Moving /var/run/thermos for each task into the Mesos sandbox

Thermos is a state machine with all state transitions persisted to disk.
Right now this goes to /var/run/thermos, but it should instead be persisted
some place relative to the Mesos sandbox so that the Mesos slave can
garbage collect this state once a Thermos task has completed.

This poses a task detection problem -- the Thermos CLI and Thermos observer
rely upon the existence of /var/run/thermos to know what tasks are running,
so we will need to develop a plugin to detect alternate task roots (see
AURORA-1024 <https://issues.apache.org/jira/browse/AURORA-1024>,
AURORA-1025 <https://issues.apache.org/jira/browse/AURORA-1025>,
AURORA-1026 <https://issues.apache.org/jira/browse/AURORA-1026>, and
AURORA-1027 <https://issues.apache.org/jira/browse/AURORA-1027>).
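
To make the detection problem concrete, here is a rough sketch of what
such a detector might look like.  The function name and the assumption
that every task root contains tasks/active and tasks/finished directories
are illustrative only; this is not the actual plugin interface.

```python
import glob
import os
from typing import Iterator, List, Tuple


def detect_tasks(root_patterns: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (root, state, task_id) for each task under any matching root.

    Hypothetical helper: assumes each root holds tasks/active/<task_id>
    and tasks/finished/<task_id> entries.
    """
    for pattern in root_patterns:
        # A pattern may be a literal path or a glob over many sandboxes.
        for root in sorted(glob.glob(pattern)):
            for state in ("active", "finished"):
                state_dir = os.path.join(root, "tasks", state)
                if not os.path.isdir(state_dir):
                    continue
                for task_id in sorted(os.listdir(state_dir)):
                    yield root, state, task_id
```

The CLI or observer could then be handed both /var/run/thermos and a glob
that matches per-task sandbox directories, so that old and new layouts are
discovered the same way.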

2) Making the Thermos executor responsible for the Thermos UI

In order to make the Thermos observer an optional component, the Thermos
executor will need to assume Thermos observer responsibilities.  Since the
Mesos slave already provides a webserver to serve executor sandboxes, I am
proposing that the Thermos executor generates static HTML content that can
be served by the Mesos slave as a UI.  This means that the executor can
remain lean (no embedded webserver).  See AURORA-725
<https://issues.apache.org/jira/browse/AURORA-725> and AURORA-777
<https://issues.apache.org/jira/browse/AURORA-777>.
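
As a minimal sketch of the static-content idea: the executor could
periodically render its view of the task to an HTML file inside the
sandbox.  Everything below (function names, the process-name-to-state
mapping, the index.html filename) is an assumption for illustration, not
existing executor code.

```python
import html
import os
import tempfile
from typing import Dict


def render_status_page(task_id: str, processes: Dict[str, str]) -> str:
    """Render a static HTML table of process name -> state."""
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(
            html.escape(name), html.escape(state))
        for name, state in sorted(processes.items())
    )
    return (
        "<!DOCTYPE html>\n"
        "<html><head><title>{title}</title></head><body>\n"
        "<h1>{title}</h1>\n"
        "<table>\n<tr><th>Process</th><th>State</th></tr>\n"
        "{rows}\n</table>\n</body></html>\n"
    ).format(title=html.escape(task_id), rows=rows)


def write_status_page(sandbox: str, page: str,
                      filename: str = "index.html") -> str:
    """Write the page into the sandbox via a temp file and rename,
    so a reader never sees a half-written file."""
    fd, tmp_path = tempfile.mkstemp(dir=sandbox, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(page)
    final_path = os.path.join(sandbox, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

Because the output is a plain file in the sandbox, whatever already
serves sandbox contents can serve the UI with no extra process.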

3) Making the Aurora scheduler responsible for state reconciliation

The last component that should be removed is the GC executor.  The GC
executor performs the important task of state reconciliation, but this is
now supported directly by the Mesos master.  See AURORA-715
<https://issues.apache.org/jira/browse/AURORA-715> and specifically
AURORA-1047 <https://issues.apache.org/jira/browse/AURORA-1047>.
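
For illustration only, the comparison at the heart of reconciliation can
be sketched as a diff between the two views of task state.  The function
name, the plain-string states and the "UNKNOWN" placeholder are all
assumptions; this is not the scheduler's or the Mesos master's actual
interface.

```python
from typing import Dict, List, Tuple


def reconcile(scheduler_view: Dict[str, str],
              master_view: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (task_id, scheduler_state, master_state) for every task on
    which the two views disagree.  A task missing from one side is
    reported with the placeholder state "UNKNOWN" on that side."""
    mismatches = []
    for task_id in sorted(set(scheduler_view) | set(master_view)):
        scheduler_state = scheduler_view.get(task_id, "UNKNOWN")
        master_state = master_view.get(task_id, "UNKNOWN")
        if scheduler_state != master_state:
            mismatches.append((task_id, scheduler_state, master_state))
    return mismatches
```

Each mismatch would then be resolved by the scheduler (for example by
updating its stored state or killing a task it does not know about),
which is the work the GC executor does today.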

Lastly, this work should make it much easier to support alternate executor
implementations (including the Mesos default executor) from Aurora once a
proper Aurora API (AURORA-987
<https://issues.apache.org/jira/browse/AURORA-987>) is available.

~brian
