As John pointed out, there's some more history worth divulging here. The first version of the in-memory storage was based on H2. We essentially used H2 as a key-value store for tasks, with the values being BLOBS containing binary thrift data. This was intended to simplify things in the face of rapidly-changing object models. As storage sizes grew in practice, we encountered performance issues due to the data layout mentioned above (non-relational schema, frequent high-recall queries). The outcome was high CPU consumption as we spent ~all time serializing and deserializing objects to/from the database.
The change from H2 -> maps gave us about 2 years of runway with few other changes. The biggest improvement was that we didn't need to serialize data going into the in-memory store. We never found anything that made H2 a fundamentally bad choice, but it was the way we used it that caused problems. -=Bill On Thu, Apr 17, 2014 at 12:57 PM, John Sirois <jsir...@twitter.com> wrote: > On Thu, Apr 17, 2014 at 12:16 PM, Bill Farner <wfar...@apache.org> wrote: > > > Over the next quarter, i'd like to embark on some work to improve the > > storage system in the scheduler. This work can probably be summarized as > > "stop building a database". > > > > *Background* > > The current storage system uses a replicated write-ahead log [1] > (provided > > by libmesos, some details in [2]), and a primary in-memory storage system > > [3]. Most of what i'll discuss relates to MemTaskStore [4], which is by > > far the largest (in terms of memory) and most complex. > > > > The current storage layout is non-relational. If callers want to deal > with > > things in terms of jobs (collection of instances) and instances > (currently > > active tasks in a job), they must do so with an appropriately-scoped task > > query, and group results to get the desired information. This is > > particularly problematic in places like the web interface, where we must > do > > an O(n) walk of all tasks [5] to aggregate role and job statistics. Over > > time, we have implemented indices on the task store [6] to speed up some > > operations, but a better data layout would yield bigger gains. > > > > > > *Hierarchical storage* [7] > > It has become clear that we need to rework the storage implementation and > > APIs to directly support the concepts that we use in practice (Role, > > Environment, Job, Instance, Task). This would simplify both > implementation > > and consumption of the storage APIs. It would also allow us to consume > > considerably less memory (by normalizing task configurations - the most > > memory-large objects in the system). This also leaves us with a natural > > place to associate auxiliary information with different hierarchy levels > > (e.g. Jobs). > > > > > > *Complexity* > > Implementing a storage system is easy to do poorly, difficult to do well. > > Right now, we're probably somewhere in between. We have bespoke > > implementations of complex things like transactions [8], read-write > locking > > [9], and weakly-consistent reads [10]. I'd like to lean on other > projects > > who have committed to solving these problems well rather than continue to > > roll our own. > > > > > > *SQL* > > If you've read this far, you're probably thinking "use a database, > dummy!". > > Well, that's what i propose we do. This would allow us to offload much > of > > the difficult code that is not our project's core competency. It will > > allow us to reorganize our storage layout in the future without writing > > complex [java] code, and avoid homemade implementation of things like > data > > relationships and consistency. > > > > The first step i would like to take is replacing the in-memory storage > with > > H2 [11]. I'm leaning towards H2 because we actually used it in aurora in > > the past with good results, and is in active development. > > > > Historically storage on the underlying replicated log has gone from H2 -> > mem > and now it'd go back to -> H2. Its seems like it would be good to detail > why > the thrash happened and why it won't again or what has changed to > invalidate the > H2 -> mem decision. > > > > > > *ORM* > > I'd like to leverage an ORM layer to handle interaction with the > database. > > My default choice here would usually be hibernate, but AFAICT licensing > > prevents that. Further surveying has led me to MyBatis [12]. I also > > considered Apache Cayenne [13], but prefer MyBatis because it is actively > > releasing code (the latest stable release of Cayenne was 3 years ago) and > > it does not push us to use a code generator. The latter is a big deal, > > since it allows us to adapt the ORM layer to the existing objects. This > > will make it easier to minimize ripple of a new storage implementation to > > the rest of the code. > > > > *Future direction* > > The commentary above only describes the in-memory storage, but it could > be > > extended to cover the replicated log as well. While there are nice > > features of the current arrangement (ease of installation, no SPOF), we > do > > take on a significant amount of complexity by using the replicated log. > > For example, we have to implement snapshotting [14], log replay [15], > > backups [16], and backup recovery [17]. We currently lack good tooling > for > > offline debugging of log data, or manipulating log contents. These are > all > > things we would get for free with a database server. We also can > formulate > > a much more obvious picture of how to do storage migrations across > release > > versions that is resilient to code refactors. > > > > > > I'm interested in what folks think of the thought process here and the > > approach proposed. Coming back to the thesis -- i feel like a lot of our > > complexity is implementation of a database, and i would love to position > > ourselves to spend more of our effort focusing on building an awesome > > scheduler. > > > > > > -=Bill > > > > [1] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java > > [2] > > > https://mesos.apache.org/blog/mesos-0-17-0-released-featuring-autorecovery/ > > [3] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java > > [4] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java > > [5] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java#L471 > > [6] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java#L100-108 > > [7] https://issues.apache.org/jira/browse/AURORA-106 > > [8] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java#L374 > > [9] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/ReadWriteLockManager.java > > [10] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java#L170-174 > > [11] http://h2database.com/html/main.html > > [12] http://blog.mybatis.org/ > > [13] https://cayenne.apache.org/ > > [14] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java > > [15] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L309-323 > > [16] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/StorageBackup.java#L142-144 > > [17] > > > > > https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/Recovery.java#L108 > > > > > > -- > John Sirois > 303-512-3301 >