Over the next quarter, i'd like to embark on some work to improve the storage system in the scheduler. This work can probably be summarized as "stop building a database".
*Background* The current storage system uses a replicated write-ahead log [1] (provided by libmesos, some details in [2]), and a primary in-memory storage system [3]. Most of what i'll discuss relates to MemTaskStore [4], which is by far the largest (in terms of memory) and most complex. The current storage layout is non-relational. If callers want to deal with things in terms of jobs (collection of instances) and instances (currently active tasks in a job), they must do so with an appropriately-scoped task query, and group results to get the desired information. This is particularly problematic in places like the web interface, where we must do an O(n) walk of all tasks [5] to aggregate role and job statistics. Over time, we have implemented indices on the task store [6] to speed up some operations, but a better data layout would yield bigger gains. *Hierarchical storage* [7] It has become clear that we need to rework the storage implementation and APIs to directly support the concepts that we use in practice (Role, Environment, Job, Instance, Task). This would simplify both implementation and consumption of the storage APIs. It would also allow us to consume considerably less memory (by normalizing task configurations - the most memory-large objects in the system). This also leaves us with a natural place to associate auxiliary information with different hierarchy levels (e.g. Jobs). *Complexity* Implementing a storage system is easy to do poorly, difficult to do well. Right now, we're probably somewhere in between. We have bespoke implementations of complex things like transactions [8], read-write locking [9], and weakly-consistent reads [10]. I'd like to lean on other projects who have committed to solving these problems well rather than continue to roll our own. *SQL* If you've read this far, you're probably thinking "use a database, dummy!". Well, that's what i propose we do. This would allow us to offload much of the difficult code that is not our project's core competency. It will allow us to reorganize our storage layout in the future without writing complex [java] code, and avoid homemade implementation of things like data relationships and consistency. The first step i would like to take is replacing the in-memory storage with H2 [11]. I'm leaning towards H2 because we actually used it in aurora in the past with good results, and is in active development. *ORM* I'd like to leverage an ORM layer to handle interaction with the database. My default choice here would usually be hibernate, but AFAICT licensing prevents that. Further surveying has led me to MyBatis [12]. I also considered Apache Cayenne [13], but prefer MyBatis because it is actively releasing code (the latest stable release of Cayenne was 3 years ago) and it does not push us to use a code generator. The latter is a big deal, since it allows us to adapt the ORM layer to the existing objects. This will make it easier to minimize ripple of a new storage implementation to the rest of the code. *Future direction* The commentary above only describes the in-memory storage, but it could be extended to cover the replicated log as well. While there are nice features of the current arrangement (ease of installation, no SPOF), we do take on a significant amount of complexity by using the replicated log. For example, we have to implement snapshotting [14], log replay [15], backups [16], and backup recovery [17]. We currently lack good tooling for offline debugging of log data, or manipulating log contents. These are all things we would get for free with a database server. We also can formulate a much more obvious picture of how to do storage migrations across release versions that is resilient to code refactors. I'm interested in what folks think of the thought process here and the approach proposed. Coming back to the thesis -- i feel like a lot of our complexity is implementation of a database, and i would love to position ourselves to spend more of our effort focusing on building an awesome scheduler. -=Bill [1] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java [2] https://mesos.apache.org/blog/mesos-0-17-0-released-featuring-autorecovery/ [3] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java [4] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java [5] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java#L471 [6] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java#L100-108 [7] https://issues.apache.org/jira/browse/AURORA-106 [8] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java#L374 [9] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/ReadWriteLockManager.java [10] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java#L170-174 [11] http://h2database.com/html/main.html [12] http://blog.mybatis.org/ [13] https://cayenne.apache.org/ [14] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java [15] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L309-323 [16] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/StorageBackup.java#L142-144 [17] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/Recovery.java#L108