Thanks for the write up!

> On Jan 22, 2015, at 13:27, Brian Wickman <wick...@apache.org> wrote:
>
> Thermos is a standalone task execution system that is not coupled to Aurora
> or Mesos. This is why by default, Thermos writes out of the sandbox
> (/var/run/thermos), has a separate observability system (Thermos observer),
> and CLI (thermos.)
>
> Aurora built a Thermos executor as its default executor, but the scheduler
> is not architecturally tied to Thermos (or vice versa.) In order to make
> things work smoothly with this decoupling, a Thermos-specific GC executor
> is also necessary to clean up the state leftover by the execution of
> Thermos tasks and reconcile potential conflicts between the state of the
> Mesos master and Aurora scheduler.
>
> Both the GC executor and Thermos observer violate some of the philosophical
> axioms of Mesos (e.g. out-of-sandbox access.) They also significantly
> increase the complexity of building, deploying and maintaining Aurora. I'm
> proposing removing both of them as required Aurora components.
>
> In order to do this and make Thermos/Aurora/Mesos to play together more
> nicely, several things are necessary.
>
> 1) Moving /var/run/thermos for each task into the Mesos sandbox
>
> Thermos is a state machine with all state transitions persisted to disk.
> Right now this goes to /var/run/thermos, but it should instead be persisted
> some place relative to the Mesos sandbox so that the Mesos slave can
> garbage collect this state once a Thermos task has completed.
>
> This poses a task detection problem -- the Thermos CLI and Thermos observer
> rely upon the existence of /var/run/thermos to know what tasks are running,
> so we will need to develop a plugin to detect alternate task roots (see
> AURORA-1024 <https://issues.apache.org/jira/browse/AURORA-1024> AURORA-1025
> <https://issues.apache.org/jira/browse/AURORA-1025> AURORA-1026
> <https://issues.apache.org/jira/browse/AURORA-1026> AURORA-1027
> <https://issues.apache.org/jira/browse/AURORA-1025>).
>
> 2) Making the Thermos executor responsible for the Thermos UI
>
> In order to make the Thermos observer an optional component, the Thermos
> executor will need to assume Thermos observer responsibilities. Since the
> Mesos slave already provides a webserver to serve executor sandboxes, I am
> proposing that the Thermos executor generates static HTML content that can
> be served by the Mesos slave as a UI. This means that the executor can
> remain lean (no embedded webserver.) See AURORA-725
> <https://issues.apache.org/jira/browse/AURORA-725> AURORA-777
> <https://issues.apache.org/jira/browse/AURORA-777>
>
> 3) Making the Aurora scheduler responsible for state reconciliation
>
> The last component that should be removed is the GC executor. The GC
> executor performs the important task of state reconciliation, but this is
> now supported directly by the Mesos master. See AURORA-715
> <https://issues.apache.org/jira/browse/AURORA-715> and specifically
> AURORA-1047 <https://issues.apache.org/jira/browse/AURORA-1047>.

Although the trusty gc_executor has been solid for a long time, removing it would definitely simplify things, so +10.

> Lastly, this work should make it much easier to support alternate executor
> implementations (including the Mesos default executor) from Aurora once a
> proper Aurora API (AURORA-987
> <https://issues.apache.org/jira/browse/AURORA-987>) is available.
>
> ~brian
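
To make the task detection piece in (1) a bit more concrete, here is a rough Python sketch of what a detector for alternate task roots might look like. It is only an illustration, not the design tracked in the tickets above: the function name `detect_task_roots`, the per-sandbox `.thermos` directory name, and the glob over the slave work directory are all assumptions of mine, and the sandbox layout should be checked against the Mesos version in use.

```python
import glob
import os

# ASSUMPTION: sandboxes live under
#   <work_dir>/slaves/*/frameworks/*/executors/*/runs/*
# Verify this against the Mesos version you actually run.
SANDBOX_GLOB = os.path.join(
    "slaves", "*", "frameworks", "*", "executors", "*", "runs", "*")

# HYPOTHETICAL: name of the per-sandbox directory holding Thermos state.
SANDBOX_STATE_DIR = ".thermos"


def detect_task_roots(mesos_work_dir, legacy_root="/var/run/thermos"):
    """Return existing candidate Thermos task roots as resolved paths.

    The legacy out-of-sandbox root comes first (if it exists), followed by
    any per-sandbox state directories found under the slave work directory.
    """
    pattern = os.path.join(mesos_work_dir, SANDBOX_GLOB, SANDBOX_STATE_DIR)
    candidates = [legacy_root] + sorted(glob.glob(pattern))

    seen = set()
    roots = []
    for path in candidates:
        # Resolve symlinks so the same sandbox reached through two paths
        # (for example a "latest" link next to the run directory) is
        # reported only once.
        real = os.path.realpath(path)
        if os.path.isdir(real) and real not in seen:
            seen.add(real)
            roots.append(real)
    return roots
```

The CLI or observer could then iterate over whatever `detect_task_roots("/path/to/slave/work_dir")` returns instead of assuming a single /var/run/thermos. Keeping the legacy root in the list would let old and new tasks coexist during a migration.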