Thermos is a standalone task execution system that is not coupled to Aurora or Mesos. This is why, by default, Thermos writes its state outside of the sandbox (to /var/run/thermos) and has its own observability system (the Thermos observer) and CLI (thermos).
Aurora built a Thermos executor as its default executor, but the scheduler is not architecturally tied to Thermos (or vice versa). To make things work smoothly with this decoupling, a Thermos-specific GC executor is also necessary to clean up the state left over by the execution of Thermos tasks and to reconcile potential conflicts between the state of the Mesos master and the Aurora scheduler.

Both the GC executor and the Thermos observer violate some of the philosophical axioms of Mesos (e.g. out-of-sandbox access). They also significantly increase the complexity of building, deploying and maintaining Aurora. I'm proposing removing both of them as required Aurora components. To do this and make Thermos, Aurora and Mesos play together more nicely, several things are necessary.

1) Moving /var/run/thermos for each task into the Mesos sandbox

Thermos is a state machine with all state transitions persisted to disk. Right now this state goes to /var/run/thermos, but it should instead be persisted somewhere relative to the Mesos sandbox so that the Mesos slave can garbage collect it once a Thermos task has completed. This poses a task detection problem: the Thermos CLI and Thermos observer rely upon the existence of /var/run/thermos to know what tasks are running, so we will need to develop a plugin to detect alternate task roots (see AURORA-1024 <https://issues.apache.org/jira/browse/AURORA-1024>, AURORA-1025 <https://issues.apache.org/jira/browse/AURORA-1025>, AURORA-1026 <https://issues.apache.org/jira/browse/AURORA-1026> and AURORA-1027 <https://issues.apache.org/jira/browse/AURORA-1027>).

2) Making the Thermos executor responsible for the Thermos UI

In order to make the Thermos observer an optional component, the Thermos executor will need to assume Thermos observer responsibilities.
Since the Mesos slave already provides a webserver to serve executor sandboxes, I am proposing that the Thermos executor generate static HTML content that can be served by the Mesos slave as a UI. This means that the executor can remain lean (no embedded webserver). See AURORA-725 <https://issues.apache.org/jira/browse/AURORA-725> and AURORA-777 <https://issues.apache.org/jira/browse/AURORA-777>.

3) Making the Aurora scheduler responsible for state reconciliation

The last component that should be removed is the GC executor. The GC executor performs the important task of state reconciliation, but this is now supported directly by the Mesos master. See AURORA-715 <https://issues.apache.org/jira/browse/AURORA-715> and specifically AURORA-1047 <https://issues.apache.org/jira/browse/AURORA-1047>.

Lastly, this work should make it much easier to support alternate executor implementations (including the Mesos default executor) from Aurora once a proper Aurora API (AURORA-987 <https://issues.apache.org/jira/browse/AURORA-987>) is available.

~brian
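
To illustrate the task detection problem from item 1, here is a minimal Python sketch of how a detector might find per-sandbox task roots alongside the legacy /var/run/thermos root. Everything in it is an assumption made for illustration: the function names, the ".thermos" state directory name, and the approach of walking a sandbox base directory are hypothetical and are not the actual Thermos plugin interface.

```python
import os
from typing import Iterator, List


def iter_task_roots(sandbox_base: str, state_dirname: str = ".thermos") -> Iterator[str]:
    """Yield each directory named state_dirname found under sandbox_base.

    The ".thermos" name is a hypothetical placeholder for wherever the
    executor would persist Thermos state inside a Mesos sandbox.
    """
    for dirpath, dirnames, _ in os.walk(sandbox_base):
        if state_dirname in dirnames:
            yield os.path.join(dirpath, state_dirname)
            # Prune the walk: nothing below a sandbox that already has a
            # state directory needs to be scanned.
            dirnames[:] = []


def detect_task_roots(fixed_root: str, sandbox_base: str) -> List[str]:
    """Return the legacy fixed root (if it exists) followed by sandbox roots."""
    roots = [fixed_root] if os.path.isdir(fixed_root) else []
    roots.extend(sorted(iter_task_roots(sandbox_base)))
    return roots
```

A CLI or observer could call detect_task_roots("/var/run/thermos", sandbox_base) and read task state from every returned root. The sandbox base path depends on the slave configuration and is left as a parameter here.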