In the spirit of the recent thread about discussing large changes on the
Dev ML, I'd like to talk about CASSANDRA-10993, the first step in the
"thread per core" work.

The goal of 10993 is to transform the read and write paths into an
event-driven model powered by event loops.  This means that each request
can be handled on a single thread (although typically broken up into
multiple steps, depending on I/O and locking) and the old mutation and read
thread pools can be removed.  So far, we've prototyped this with a couple
of approaches:

The first approach models each request as a state machine (or composition
of state machines).  For example, a single write request is encapsulated in
a WriteTask object which moves through a series of states as portions of
the write complete (allocating a commitlog segment, syncing the commitlog,
receiving responses from remote replicas).  These state transitions are
triggered by Events that are emitted by, e.g., the
CommitlogSegmentManager.  The event loop that manages tasks, events,
timeouts, and scheduling is custom and is (currently) closely tied to a
Netty event loop.  Here are a couple of example classes to take a look at:

WriteTask:
https://github.com/thobbs/cassandra/blob/CASSANDRA-10993-WIP/src/java/org/apache/cassandra/poc/WriteTask.java
EventLoop:
https://github.com/thobbs/cassandra/blob/CASSANDRA-10993-WIP/src/java/org/apache/cassandra/poc/EventLoop.java
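
To make the shape of this concrete, here's a heavily simplified, hypothetical
sketch of a state-machine task.  The State/Event names and methods below are
illustrative only, not the actual WriteTask/EventLoop API:

    // Hypothetical, simplified sketch of the state-machine style -- not the
    // real WriteTask/EventLoop classes, just the general shape.
    enum State { START, AWAITING_COMMITLOG_SYNC, AWAITING_REPLICAS, DONE }

    interface Event {}
    class CommitlogSynced implements Event {}
    class ReplicaResponded implements Event {}

    class WriteTaskSketch
    {
        private State state = State.START;
        private int pendingReplicas;

        WriteTaskSketch(int replicaCount)
        {
            this.pendingReplicas = replicaCount;
        }

        // Called by the event loop when the task is first scheduled; kicks off
        // the local write and the messages to remote replicas, then returns
        // immediately without blocking.
        void start()
        {
            // append to a commitlog segment, request a sync, send to replicas...
            state = State.AWAITING_COMMITLOG_SYNC;
        }

        // Called by the event loop for each Event addressed to this task; always
        // on a single thread, each call advancing the task exactly one step.
        void handle(Event event)
        {
            if (state == State.AWAITING_COMMITLOG_SYNC && event instanceof CommitlogSynced)
            {
                state = State.AWAITING_REPLICAS;
            }
            else if (state == State.AWAITING_REPLICAS && event instanceof ReplicaResponded)
            {
                if (--pendingReplicas == 0)
                    state = State.DONE;   // respond to the client
            }
        }
    }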

The second approach utilizes RxJava and the Observable pattern.  Where we
would wait for emitted events in the state machine approach, we instead
depend on an Observable to "push" the data/result we're awaiting.
Scheduling is handled by an Rx scheduler (which is customizable).  The code
changes required for this are, overall, less intrusive.  Here's a quick
example of what this looks like for high-level operations:
https://github.com/thobbs/cassandra/blob/rxjava-rebase/src/java/org/apache/cassandra/service/StorageProxy.java#L1724-L1732
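
For comparison, here's a rough sketch of the same flow in the Observable
style (RxJava 1.x).  The appendToCommitlog(), sendToReplicas(), and reply
methods below are illustrative stand-ins, not real Cassandra methods:

    import java.util.List;
    import rx.Observable;

    // Hypothetical sketch of the Observable style (RxJava 1.x).
    class RxWriteSketch
    {
        Observable<Long> appendToCommitlog(String mutation)  { return Observable.just(42L); }
        Observable<String> sendToReplicas(String mutation)   { return Observable.just("ack"); }

        void write(String mutation, int requiredAcks)
        {
            appendToCommitlog(mutation)
                .flatMap(position -> sendToReplicas(mutation)) // once the local append completes, gather replica acks
                .take(requiredAcks)                            // wait for the consistency level
                .toList()                                      // collect the acks
                .subscribe(
                    acks  -> replyToClient(acks),              // success path
                    error -> replyWithError(error));           // failure path
        }

        void replyToClient(List<String> acks) { /* respond to the client */ }
        void replyWithError(Throwable error)  { /* map to a timeout/unavailable error */ }
    }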

So far we've benchmarked both approaches on in-memory reads to get an idea
of their upper-bound performance.  Throughput appears to be very similar
across the two branches.

There are a few considerations up for debate in choosing between the two
approaches, and I would appreciate input on them.

First, performance.  There are concerns that going with Rx (or something
similar) may limit the peak performance we can eventually attain, in a
couple of ways.  For one, we don't have as much control over the event
loop, scheduling, and chunking of tasks.  With the state machine approach,
we're writing all of this ourselves, so it's entirely under our control.
With Rx, a lot of things are customizable or already have decent tools,
but these may come up short in critical ways.  For another, the overhead
of the Observable machinery may become significant as other bottlenecks
are removed.  Of course, WriteTask et al. have their own overhead, but
once again, we have more control there.
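
As one example of the control Rx does expose, work can be pinned to an
executor we own via Schedulers.from(); whether that, plus the other hooks,
gives us enough control over scheduling and chunking is exactly the open
question.  A minimal sketch, assuming RxJava 1.x, with the executor passed
in as a stand-in for a Netty event loop (Netty's EventLoop implements
java.util.concurrent.Executor):

    import java.util.concurrent.ExecutorService;
    import rx.Observable;
    import rx.Scheduler;
    import rx.schedulers.Schedulers;

    // Sketch: wrap an executor we control in an Rx Scheduler so Observable
    // work stays on our threads rather than Rx's default thread pools.
    class RxSchedulerSketch
    {
        private final Scheduler coreScheduler;

        RxSchedulerSketch(ExecutorService coreExecutor)
        {
            this.coreScheduler = Schedulers.from(coreExecutor);
        }

        // Stand-in for an in-memory read; both the work and the result
        // delivery happen on the scheduler we supplied.
        Observable<String> read(String key)
        {
            return Observable.just("row for " + key)
                             .subscribeOn(coreScheduler)
                             .observeOn(coreScheduler);
        }
    }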

The second consideration is code style and ease of understanding.  I think
both of these approaches have downsides in different areas.  The state
machines are very explicit (an upside), but also very verbose and somewhat
disjointed.  Most of the complex operations in Cassandra can't cleanly be
represented as a single state machine, because they're logically multiple
state machines operating in parallel (e.g. the local write path and the
remote write path in WriteTask).  After working on the prototypes, I've
found the state machines to be harder to logically follow than I had
hoped.  Perhaps we could come up with better abstractions and patterns for
this, but that's the current state of things.  On the Rx side, the downside
is that the behavior is much less explicit.  Additionally, some find it
more difficult to mentally follow the flow of execution.  Based on my past
work with a large Twisted Python codebase, I'll agree that it's tough to
get used to, but not unmanageable with experience and good coding patterns.

A third consideration is code reuse.  A big advantage of Rx is that it
comes with many tools for transforming Observables, handling multiple
Observables, error handling, and tracing.  With the state machine approach,
we would need to write equivalents for these from scratch.  This is a
non-trivial amount of work that might make the project take significantly
longer to complete.  Combining this with the fact that the Rx approach would be
less invasive, it seems like we would have an easier time introducing
incremental changes to the code base rather than having a big-bang commit.
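
To illustrate the kind of tooling we'd otherwise be rewriting: combining the
local and remote legs of a write, applying a timeout, and mapping failures
to a result is a handful of built-in operators in Rx.  A hedged sketch (the
Observables and WriteResult type below are stand-ins, not actual Cassandra
code):

    import java.util.concurrent.TimeUnit;
    import rx.Observable;

    // Sketch of composition and error handling that Rx provides out of the
    // box; the Observables are trivial stand-ins for the local and remote
    // write legs.
    class RxCompositionSketch
    {
        static class WriteResult
        {
            final boolean success;
            WriteResult(boolean success) { this.success = success; }
        }

        Observable<WriteResult> combinedWrite(Observable<Boolean> localWrite,
                                              Observable<Boolean> remoteWrites)
        {
            return Observable.zip(localWrite, remoteWrites,
                                  (local, remote) -> new WriteResult(local && remote))
                             .timeout(2, TimeUnit.SECONDS)               // built-in timeouts
                             .onErrorReturn(e -> new WriteResult(false)); // built-in error mapping
        }
    }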

If I can boil these concerns down to one tradeoff, it's this: do we want to
expend more effort and have more explicit code and complete control, or do
we want to piggyback on the Rx work, give up some control, and (hopefully)
get to the next, deeper optimizations sooner?

Thanks for any input on this topic.


-- 
Tyler Hobbs
DataStax <http://datastax.com/>
