Using an in-memory storage system like H2 makes a lot of sense to me. As for the replicated log, I'd love to talk about how we could write a concrete implementation of H2's transaction log using the Mesos replicated log. I think H2 was moving to a different transaction log implementation too (MVStore?) which was still log structured. We've been chatting about adding features into the replicated log like read-streaming which should make standby masters fast to startup on failover.
Ben. On Thu, Apr 17, 2014 at 1:16 PM, Bill Farner <wfar...@apache.org> wrote: > Over the next quarter, i'd like to embark on some work to improve the > storage system in the scheduler. This work can probably be summarized as > "stop building a database". > > *Background* > The current storage system uses a replicated write-ahead log [1] (provided > by libmesos, some details in [2]), and a primary in-memory storage system > [3]. Most of what i'll discuss relates to MemTaskStore [4], which is by > far the largest (in terms of memory) and most complex. > > The current storage layout is non-relational. If callers want to deal with > things in terms of jobs (collection of instances) and instances (currently > active tasks in a job), they must do so with an appropriately-scoped task > query, and group results to get the desired information. This is > particularly problematic in places like the web interface, where we must do > an O(n) walk of all tasks [5] to aggregate role and job statistics. Over > time, we have implemented indices on the task store [6] to speed up some > operations, but a better data layout would yield bigger gains. > > > *Hierarchical storage* [7] > It has become clear that we need to rework the storage implementation and > APIs to directly support the concepts that we use in practice (Role, > Environment, Job, Instance, Task). This would simplify both implementation > and consumption of the storage APIs. It would also allow us to consume > considerably less memory (by normalizing task configurations - the most > memory-large objects in the system). This also leaves us with a natural > place to associate auxiliary information with different hierarchy levels > (e.g. Jobs). > > > *Complexity* > Implementing a storage system is easy to do poorly, difficult to do well. > Right now, we're probably somewhere in between. We have bespoke > implementations of complex things like transactions [8], read-write locking > [9], and weakly-consistent reads [10]. I'd like to lean on other projects > who have committed to solving these problems well rather than continue to > roll our own. > > > *SQL* > If you've read this far, you're probably thinking "use a database, dummy!". > Well, that's what i propose we do. This would allow us to offload much of > the difficult code that is not our project's core competency. It will > allow us to reorganize our storage layout in the future without writing > complex [java] code, and avoid homemade implementation of things like data > relationships and consistency. > > The first step i would like to take is replacing the in-memory storage with > H2 [11]. I'm leaning towards H2 because we actually used it in aurora in > the past with good results, and is in active development. > > > *ORM* > I'd like to leverage an ORM layer to handle interaction with the database. > My default choice here would usually be hibernate, but AFAICT licensing > prevents that. Further surveying has led me to MyBatis [12]. I also > considered Apache Cayenne [13], but prefer MyBatis because it is actively > releasing code (the latest stable release of Cayenne was 3 years ago) and > it does not push us to use a code generator. The latter is a big deal, > since it allows us to adapt the ORM layer to the existing objects. This > will make it easier to minimize ripple of a new storage implementation to > the rest of the code. > > *Future direction* > The commentary above only describes the in-memory storage, but it could be > extended to cover the replicated log as well. While there are nice > features of the current arrangement (ease of installation, no SPOF), we do > take on a significant amount of complexity by using the replicated log. > For example, we have to implement snapshotting [14], log replay [15], > backups [16], and backup recovery [17]. We currently lack good tooling for > offline debugging of log data, or manipulating log contents. These are all > things we would get for free with a database server. We also can formulate > a much more obvious picture of how to do storage migrations across release > versions that is resilient to code refactors. > > > I'm interested in what folks think of the thought process here and the > approach proposed. Coming back to the thesis -- i feel like a lot of our > complexity is implementation of a database, and i would love to position > ourselves to spend more of our effort focusing on building an awesome > scheduler. > > > -=Bill > > [1] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java > [2] > https://mesos.apache.org/blog/mesos-0-17-0-released-featuring-autorecovery/ > [3] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java > [4] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java > [5] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java#L471 > [6] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java#L100-108 > [7] https://issues.apache.org/jira/browse/AURORA-106 > [8] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java#L374 > [9] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/ReadWriteLockManager.java > [10] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java#L170-174 > [11] http://h2database.com/html/main.html > [12] http://blog.mybatis.org/ > [13] https://cayenne.apache.org/ > [14] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java > [15] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L309-323 > [16] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/StorageBackup.java#L142-144 > [17] > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/Recovery.java#L108 >