Re: Evolving the aurora scheduler storage

Bill Farner Thu, 17 Apr 2014 13:54:12 -0700

As John pointed out, there's some more history worth divulging here.

The first version of the in-memory storage was based on H2.  We essentially
used H2 as a key-value store for tasks, with the values being BLOBS
containing binary thrift data.  This was intended to simplify things in the
face of rapidly-changing object models.  As storage sizes grew in practice,
we encountered performance issues due to the data layout  mentioned above
(non-relational schema, frequent high-recall queries).  The outcome was
high CPU consumption as we spent ~all time serializing and deserializing
objects to/from the database.


The change from H2 -> maps gave us about 2 years of runway with few other
changes.  The biggest improvement was that we didn't need to serialize data
going into the in-memory store.  We never found anything that made H2 a
fundamentally bad choice, but it was the way we used it that caused
problems.


-=Bill


On Thu, Apr 17, 2014 at 12:57 PM, John Sirois <jsir...@twitter.com> wrote:

> On Thu, Apr 17, 2014 at 12:16 PM, Bill Farner <wfar...@apache.org> wrote:
>
> > Over the next quarter, i'd like to embark on some work to improve the
> > storage system in the scheduler.  This work can probably be summarized as
> > "stop building a database".
> >
> > *Background*
> > The current storage system uses a replicated write-ahead log [1]
> (provided
> > by libmesos, some details in [2]), and a primary in-memory storage system
> > [3].  Most of what i'll discuss relates to MemTaskStore [4], which is by
> > far the largest (in terms of memory) and most complex.
> >
> > The current storage layout is non-relational.  If callers want to deal
> with
> > things in terms of jobs (collection of instances) and instances
> (currently
> > active tasks in a job), they must do so with an appropriately-scoped task
> > query, and group results to get the desired information.  This is
> > particularly problematic in places like the web interface, where we must
> do
> > an O(n) walk of all tasks [5] to aggregate role and job statistics.  Over
> > time, we have implemented indices on the task store [6] to speed up some
> > operations, but a better data layout would yield bigger gains.
> >
> >
> > *Hierarchical storage* [7]
> > It has become clear that we need to rework the storage implementation and
> > APIs to directly support the concepts that we use in practice (Role,
> > Environment, Job, Instance, Task).  This would simplify both
> implementation
> > and consumption of the storage APIs.  It would also allow us to consume
> > considerably less memory (by normalizing task configurations - the most
> > memory-large objects in the system).  This also leaves us with a natural
> > place to associate auxiliary information with different hierarchy levels
> > (e.g. Jobs).
> >
> >
> > *Complexity*
> > Implementing a storage system is easy to do poorly, difficult to do well.
> >  Right now, we're probably somewhere in between.  We have bespoke
> > implementations of complex things like transactions [8], read-write
> locking
> > [9], and weakly-consistent reads [10].  I'd like to lean on other
> projects
> > who have committed to solving these problems well rather than continue to
> > roll our own.
> >
> >
> > *SQL*
> > If you've read this far, you're probably thinking "use a database,
> dummy!".
> >  Well, that's what i propose we do.  This would allow us to offload much
> of
> > the difficult code that is not our project's core competency.  It will
> > allow us to reorganize our storage layout in the future without writing
> > complex [java] code, and avoid homemade implementation of things like
> data
> > relationships and consistency.
> >
> > The first step i would like to take is replacing the in-memory storage
> with
> > H2 [11].  I'm leaning towards H2 because we actually used it in aurora in
> > the past with good results, and is in active development.
> >
>
> Historically storage on the underlying replicated log has gone from H2 ->
> mem
> and now it'd go back to -> H2.  Its seems like it would be good to detail
> why
> the thrash happened and why it won't again or what has changed to
> invalidate the
> H2 -> mem decision.
>
>
> >
> > *ORM*
> > I'd like to leverage an ORM layer to handle interaction with the
> database.
> >  My default choice here would usually be hibernate, but AFAICT licensing
> > prevents that.  Further surveying has led me to MyBatis [12].  I also
> > considered Apache Cayenne [13], but prefer MyBatis because it is actively
> > releasing code (the latest stable release of Cayenne was 3 years ago) and
> > it does not push us to use a code generator.  The latter is a big deal,
> > since it allows us to adapt the ORM layer to the existing objects.  This
> > will make it easier to minimize ripple of a new storage implementation to
> > the rest of the code.
> >
> > *Future direction*
> > The commentary above only describes the in-memory storage, but it could
> be
> > extended to cover the replicated log as well.  While there are nice
> > features of the current arrangement (ease of installation, no SPOF), we
> do
> > take on a significant amount of complexity by using the replicated log.
> >  For example, we have to implement snapshotting [14], log replay [15],
> > backups [16], and backup recovery [17].  We currently lack good tooling
> for
> > offline debugging of log data, or manipulating log contents.  These are
> all
> > things we would get for free with a database server.  We also can
> formulate
> > a much more obvious picture of how to do storage migrations across
> release
> > versions that is resilient to code refactors.
> >
> >
> > I'm interested in what folks think of the thought process here and the
> > approach proposed.  Coming back to the thesis -- i feel like a lot of our
> > complexity is implementation of a database, and i would love to position
> > ourselves to spend more of our effort focusing on building an awesome
> > scheduler.
> >
> >
> > -=Bill
> >
> > [1]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java
> > [2]
> >
> https://mesos.apache.org/blog/mesos-0-17-0-released-featuring-autorecovery/
> > [3]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java
> > [4]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java
> > [5]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java#L471
> > [6]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java#L100-108
> > [7] https://issues.apache.org/jira/browse/AURORA-106
> > [8]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java#L374
> > [9]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/ReadWriteLockManager.java
> > [10]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/mem/MemStorage.java#L170-174
> > [11] http://h2database.com/html/main.html
> > [12] http://blog.mybatis.org/
> > [13] https://cayenne.apache.org/
> > [14]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java
> > [15]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L309-323
> > [16]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/StorageBackup.java#L142-144
> > [17]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/backup/Recovery.java#L108
> >
>
>
>
> --
> John Sirois
> 303-512-3301
>

Re: Evolving the aurora scheduler storage

Reply via email to