Re: Evolving the aurora scheduler storage

Benjamin Hindman Thu, 17 Apr 2014 13:02:21 -0700

Using an in-memory storage system like H2 makes a lot of sense to me.

As for the replicated log, I'd love to talk about how we could write a
concrete implementation of H2's transaction log using the Mesos replicated
log. I think H2 was moving to a different transaction log implementation
too (MVStore?) which was still log structured. We've been chatting about
adding features into the replicated log like read-streaming which should
make standby masters fast to startup on failover.


Ben.


On Thu, Apr 17, 2014 at 1:16 PM, Bill Farner <wfar...@apache.org> wrote:

> Over the next quarter, i'd like to embark on some work to improve the
> storage system in the scheduler.  This work can probably be summarized as
> "stop building a database".
>
> *Background*
> The current storage system uses a replicated write-ahead log [1] (provided
> by libmesos, some details in [2]), and a primary in-memory storage system
> [3].  Most of what i'll discuss relates to MemTaskStore [4], which is by
> far the largest (in terms of memory) and most complex.
>
> The current storage layout is non-relational.  If callers want to deal with
> things in terms of jobs (collection of instances) and instances (currently
> active tasks in a job), they must do so with an appropriately-scoped task
> query, and group results to get the desired information.  This is
> particularly problematic in places like the web interface, where we must do
> an O(n) walk of all tasks [5] to aggregate role and job statistics.  Over
> time, we have implemented indices on the task store [6] to speed up some
> operations, but a better data layout would yield bigger gains.
>
>
> *Hierarchical storage* [7]
> It has become clear that we need to rework the storage implementation and
> APIs to directly support the concepts that we use in practice (Role,
> Environment, Job, Instance, Task).  This would simplify both implementation
> and consumption of the storage APIs.  It would also allow us to consume
> considerably less memory (by normalizing task configurations - the most
> memory-large objects in the system).  This also leaves us with a natural
> place to associate auxiliary information with different hierarchy levels
> (e.g. Jobs).
>
>
> *Complexity*
> Implementing a storage system is easy to do poorly, difficult to do well.
>  Right now, we're probably somewhere in between.  We have bespoke
> implementations of complex things like transactions [8], read-write locking
> [9], and weakly-consistent reads [10].  I'd like to lean on other projects
> who have committed to solving these problems well rather than continue to
> roll our own.
>
>
> *SQL*
> If you've read this far, you're probably thinking "use a database, dummy!".
>  Well, that's what i propose we do.  This would allow us to offload much of
> the difficult code that is not our project's core competency.  It will
> allow us to reorganize our storage layout in the future without writing
> complex [java] code, and avoid homemade implementation of things like data
> relationships and consistency.
>
> The first step i would like to take is replacing the in-memory storage with
> H2 [11].  I'm leaning towards H2 because we actually used it in aurora in
> the past with good results, and is in active development.
>
>
> *ORM*
> I'd like to leverage an ORM layer to handle interaction with the database.
>  My default choice here would usually be hibernate, but AFAICT licensing
> prevents that.  Further surveying has led me to MyBatis [12].  I also
> considered Apache Cayenne [13], but prefer MyBatis because it is actively
> releasing code (the latest stable release of Cayenne was 3 years ago) and
> it does not push us to use a code generator.  The latter is a big deal,
> since it allows us to adapt the ORM layer to the existing objects.  This
> will make it easier to minimize ripple of a new storage implementation to
> the rest of the code.
>
> *Future direction*
> The commentary above only describes the in-memory storage, but it could be
> extended to cover the replicated log as well.  While there are nice
> features of the current arrangement (ease of installation, no SPOF), we do
> take on a significant amount of complexity by using the replicated log.
>  For example, we have to implement snapshotting [14], log replay [15],
> backups [16], and backup recovery [17].  We currently lack good tooling for
> offline debugging of log data, or manipulating log contents.  These are all
> things we would get for free with a database server.  We also can formulate
> a much more obvious picture of how to do storage migrations across release
> versions that is resilient to code refactors.
>
>
> I'm interested in what folks think of the thought process here and the
> approach proposed.  Coming back to the thesis -- i feel like a lot of our
> complexity is implementation of a database, and i would love to position
> ourselves to spend more of our effort focusing on building an awesome
> scheduler.
>
>
> -=Bill
>
> [1]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java
> [2]
> https://mesos.apache.org/blog/mesos-0-17-0-released-featuring-autorecovery/
> [3]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java
> [4]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java
> [5]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java#L471
> [6]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java#L100-108
> [7] https://issues.apache.org/jira/browse/AURORA-106
> [8]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java#L374
> [9]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/ReadWriteLockManager.java
> [10]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java#L170-174
> [11] http://h2database.com/html/main.html
> [12] http://blog.mybatis.org/
> [13] https://cayenne.apache.org/
> [14]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java
> [15]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L309-323
> [16]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/StorageBackup.java#L142-144
> [17]
>
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/Recovery.java#L108
>

Re: Evolving the aurora scheduler storage

Reply via email to