Over the next quarter, I'd like to embark on some work to improve the
storage system in the scheduler.  This work can probably be summarized as
"stop building a database".

*Background*
The current storage system uses a replicated write-ahead log [1] (provided
by libmesos, some details in [2]), and a primary in-memory storage system
[3].  Most of what I'll discuss relates to MemTaskStore [4], which is by
far the largest (in terms of memory) and most complex of the in-memory
stores.

The current storage layout is non-relational.  If callers want to deal with
things in terms of jobs (collection of instances) and instances (currently
active tasks in a job), they must do so with an appropriately-scoped task
query, and group results to get the desired information.  This is
particularly problematic in places like the web interface, where we must do
an O(n) walk of all tasks [5] to aggregate role and job statistics.  Over
time, we have implemented indices on the task store [6] to speed up some
operations, but a better data layout would yield bigger gains.
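
To make the contrast concrete, here is a minimal sketch of the two access
patterns.  The Task class, its fields, and the IndexedStore class are
hypothetical stand-ins, not the scheduler's real types; the point is only
that a flat layout forces a full scan, while a layout keyed by role answers
the same question without touching unrelated tasks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; not the real Aurora task objects.
final class TaskStatsSketch {
  static final class Task {
    final String role;
    final String job;
    final int instanceId;

    Task(String role, String job, int instanceId) {
      this.role = role;
      this.job = job;
      this.instanceId = instanceId;
    }
  }

  // Flat layout: every statistics request walks all tasks, O(n).
  static Map<String, Integer> countByRoleWithScan(List<Task> allTasks) {
    Map<String, Integer> counts = new HashMap<>();
    for (Task task : allTasks) {
      counts.merge(task.role, 1, Integer::sum);
    }
    return counts;
  }

  // Keyed layout: a role -> tasks index is maintained on every write,
  // so a per-role lookup never visits tasks belonging to other roles.
  static final class IndexedStore {
    private final Map<String, List<Task>> tasksByRole = new HashMap<>();

    void save(Task task) {
      tasksByRole.computeIfAbsent(task.role, r -> new ArrayList<>()).add(task);
    }

    int countForRole(String role) {
      List<Task> tasks = tasksByRole.get(role);
      return tasks == null ? 0 : tasks.size();
    }
  }

  public static void main(String[] args) {
    List<Task> tasks = List.of(
        new Task("www", "frontend", 0),
        new Task("www", "frontend", 1),
        new Task("ads", "ranker", 0));

    IndexedStore store = new IndexedStore();
    tasks.forEach(store::save);

    System.out.println(countByRoleWithScan(tasks).get("www")); // prints 2
    System.out.println(store.countForRole("www"));             // prints 2
  }
}
```

The index trades extra bookkeeping on writes for cheaper reads, which is
the same trade the existing task store indices make by hand.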


*Hierarchical storage* [7]
It has become clear that we need to rework the storage implementation and
APIs to directly support the concepts that we use in practice (Role,
Environment, Job, Instance, Task).  This would simplify both implementation
and consumption of the storage APIs.  It would also allow us to consume
considerably less memory (by normalizing task configurations, the most
memory-hungry objects in the system).  This also leaves us with a natural
place to associate auxiliary information with different hierarchy levels
(e.g. Jobs).
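
As an illustration of the memory argument, the following sketch stores each
distinct task configuration once and lets every instance of a job point at
the shared copy.  All names here (TaskConfig with two fields, the string job
key, the Store class) are assumptions made up for the example, not a
proposed API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of normalized, job-scoped storage.
final class HierarchicalStoreSketch {
  // Heavily simplified task configuration with value semantics.
  static final class TaskConfig {
    final String command;
    final int ramMb;

    TaskConfig(String command, int ramMb) {
      this.command = command;
      this.ramMb = ramMb;
    }

    @Override public boolean equals(Object o) {
      if (!(o instanceof TaskConfig)) {
        return false;
      }
      TaskConfig other = (TaskConfig) o;
      return command.equals(other.command) && ramMb == other.ramMb;
    }

    @Override public int hashCode() {
      return Objects.hash(command, ramMb);
    }
  }

  static final class Store {
    // One canonical object per distinct configuration.
    private final Map<TaskConfig, TaskConfig> canonicalConfigs = new HashMap<>();
    // Job key (assumed "role/environment/name") -> instance id -> shared config.
    private final Map<String, Map<Integer, TaskConfig>> instancesByJob = new HashMap<>();

    void saveInstance(String jobKey, int instanceId, TaskConfig config) {
      // Reuse the canonical copy if an equal configuration was seen before.
      TaskConfig shared = canonicalConfigs.computeIfAbsent(config, c -> c);
      instancesByJob.computeIfAbsent(jobKey, k -> new HashMap<>()).put(instanceId, shared);
    }

    TaskConfig configOf(String jobKey, int instanceId) {
      Map<Integer, TaskConfig> instances = instancesByJob.get(jobKey);
      return instances == null ? null : instances.get(instanceId);
    }

    int distinctConfigCount() {
      return canonicalConfigs.size();
    }
  }

  public static void main(String[] args) {
    Store store = new Store();
    String job = "www/prod/frontend";
    for (int instance = 0; instance < 3; instance++) {
      // Each call passes a fresh but equal configuration object.
      store.saveInstance(job, instance, new TaskConfig("./server", 512));
    }

    System.out.println(store.distinctConfigCount());                     // prints 1
    System.out.println(store.configOf(job, 0) == store.configOf(job, 2)); // prints true
  }
}
```

In a relational schema the same idea would be a configuration table
referenced by foreign key from the instance rows.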


*Complexity*
Implementing a storage system is easy to do poorly, difficult to do well.
 Right now, we're probably somewhere in between.  We have bespoke
implementations of complex things like transactions [8], read-write locking
[9], and weakly-consistent reads [10].  I'd like to lean on projects that
have committed to solving these problems well rather than continue to
roll our own.


*SQL*
If you've read this far, you're probably thinking "use a database, dummy!".
Well, that's what I propose we do.  This would allow us to offload much of
the difficult code that is not our project's core competency.  It would
allow us to reorganize our storage layout in the future without writing
complex Java code, and to avoid homemade implementations of things like
data relationships and consistency.

The first step I would like to take is replacing the in-memory storage with
H2 [11].  I'm leaning towards H2 because we actually used it in Aurora in
the past with good results, and it is in active development.


*ORM*
I'd like to leverage an ORM layer to handle interaction with the database.
My default choice here would usually be Hibernate, but AFAICT licensing
prevents that.  Further surveying has led me to MyBatis [12].  I also
considered Apache Cayenne [13], but prefer MyBatis because it is actively
releasing code (the latest stable release of Cayenne was 3 years ago) and
it does not push us to use a code generator.  The latter is a big deal,
since it allows us to adapt the ORM layer to the existing objects.  This
will make it easier to keep a new storage implementation from rippling
into the rest of the code.

*Future direction*
The commentary above only describes the in-memory storage, but it could be
extended to cover the replicated log as well.  While there are nice
features of the current arrangement (ease of installation, no SPOF), we do
take on a significant amount of complexity by using the replicated log.
 For example, we have to implement snapshotting [14], log replay [15],
backups [16], and backup recovery [17].  We currently lack good tooling for
offline debugging of log data, or manipulating log contents.  These are all
things we would get for free with a database server.  A database would
also give us a much clearer path for storage migrations across release
versions, one that is resilient to code refactors.
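
For readers less familiar with this machinery, here is a toy sketch of the
snapshot-plus-replay pattern that a log-backed store has to get right.  It
is not how LogStorage is implemented; the Entry and Snapshot classes and the
key/value state are invented for the example, and replication, persistence,
and backups are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a write-ahead log with snapshots.
final class LogReplaySketch {
  // A single key/value write recorded in the log.
  static final class Entry {
    final String key;
    final String value;

    Entry(String key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  // A copy of the full state plus the number of log entries it already reflects.
  static final class Snapshot {
    final Map<String, String> state;
    final int entriesCovered;

    Snapshot(Map<String, String> state, int entriesCovered) {
      this.state = new HashMap<>(state);
      this.entriesCovered = entriesCovered;
    }
  }

  private final List<Entry> log = new ArrayList<>();
  private final Map<String, String> state = new HashMap<>();

  void write(String key, String value) {
    log.add(new Entry(key, value)); // record the mutation first
    state.put(key, value);          // then apply it to in-memory state
  }

  Snapshot snapshot() {
    return new Snapshot(state, log.size());
  }

  List<Entry> log() {
    return log;
  }

  // Recovery: load the snapshot, then replay only the entries written after it.
  static Map<String, String> recover(Snapshot snapshot, List<Entry> log) {
    Map<String, String> recovered = new HashMap<>(snapshot.state);
    for (Entry entry : log.subList(snapshot.entriesCovered, log.size())) {
      recovered.put(entry.key, entry.value);
    }
    return recovered;
  }

  public static void main(String[] args) {
    LogReplaySketch storage = new LogReplaySketch();
    storage.write("a", "1");
    Snapshot snapshot = storage.snapshot();
    storage.write("a", "2");
    storage.write("b", "3");

    Map<String, String> recovered = recover(snapshot, storage.log());
    System.out.println(recovered.get("a")); // prints 2
    System.out.println(recovered.get("b")); // prints 3
  }
}
```

Even this toy version has to keep the snapshot position and the log in
agreement, which hints at why the real thing carries so much code.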


I'm interested in what folks think of the thought process here and the
approach proposed.  Coming back to the thesis -- I feel like a lot of our
complexity comes from implementing a database, and I would love to position
ourselves to spend more of our effort focusing on building an awesome
scheduler.


-=Bill

[1]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java
[2]
https://mesos.apache.org/blog/mesos-0-17-0-released-featuring-autorecovery/
[3]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java
[4]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java
[5]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java#L471
[6]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java#L100-108
[7] https://issues.apache.org/jira/browse/AURORA-106
[8]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java#L374
[9]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/ReadWriteLockManager.java
[10]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java#L170-174
[11] http://h2database.com/html/main.html
[12] http://blog.mybatis.org/
[13] https://cayenne.apache.org/
[14]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java
[15]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L309-323
[16]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/StorageBackup.java#L142-144
[17]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/Recovery.java#L108
