In the spirit of the recent thread about discussing large changes on the Dev ML, I'd like to talk about CASSANDRA-10993, the first step in the "thread per core" work.

The goal of 10993 is to transform the read and write paths into an event-driven model powered by event loops. This means that each request can be handled on a single thread (although typically broken up into multiple steps, depending on I/O and locking), and the old mutation and read thread pools can be removed. So far, we've prototyped this with two approaches.

The first approach models each request as a state machine (or a composition of state machines). For example, a single write request is encapsulated in a WriteTask object which moves through a series of states as portions of the write complete (allocating a commitlog segment, syncing the commitlog, receiving responses from remote replicas). These state transitions are triggered by Events that are emitted by, e.g., the CommitlogSegmentManager. The event loop that manages tasks, events, timeouts, and scheduling is custom and is (currently) closely tied to a Netty event loop. Here are a couple of example classes to take a look at:

WriteTask: https://github.com/thobbs/cassandra/blob/CASSANDRA-10993-WIP/src/java/org/apache/cassandra/poc/WriteTask.java
EventLoop: https://github.com/thobbs/cassandra/blob/CASSANDRA-10993-WIP/src/java/org/apache/cassandra/poc/EventLoop.java
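To give a feel for the shape of this style, here's a heavily stripped-down sketch. The states, event types, and method names are simplified stand-ins for illustration, not the actual classes in the branch:

    // Illustrative only: a minimal state-machine task. Event types and
    // method names here are hypothetical, not the real 10993 API.
    interface Event {}
    final class CommitlogSyncedEvent implements Event {}
    final class ReplicaResponseEvent implements Event {}

    final class WriteTaskSketch
    {
        enum State { AWAITING_COMMITLOG_SYNC, AWAITING_REMOTE_RESPONSES, DONE }

        private State state;
        private int pendingResponses;

        // Kick off the write: append the mutation to the commitlog. The
        // segment manager will later emit a CommitlogSyncedEvent, which
        // the event loop routes back to handleEvent().
        void start()
        {
            state = State.AWAITING_COMMITLOG_SYNC;
        }

        // Invoked by the event loop; each event may advance the task by
        // one state.
        void handleEvent(Event event)
        {
            switch (state)
            {
                case AWAITING_COMMITLOG_SYNC:
                    if (event instanceof CommitlogSyncedEvent)
                    {
                        // apply the mutation locally, send it to the remote
                        // replicas, then wait for their acks
                        pendingResponses = 2;
                        state = State.AWAITING_REMOTE_RESPONSES;
                    }
                    break;

                case AWAITING_REMOTE_RESPONSES:
                    if (event instanceof ReplicaResponseEvent && --pendingResponses == 0)
                    {
                        // enough replicas have responded; reply to the client
                        state = State.DONE;
                    }
                    break;

                default:
                    break;
            }
        }
    }

The thing to notice is that every point where we would otherwise block becomes an explicit state, which is what makes the flow both very explicit and very verbose.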
The second approach uses RxJava and the Observable pattern. Where we would wait for emitted events in the state machine approach, we instead depend on an Observable to "push" the data/result we're awaiting. Scheduling is handled by an Rx scheduler (which is customizable). The code changes required for this are, overall, less intrusive. Here's a quick example of what this looks like for high-level operations:

https://github.com/thobbs/cassandra/blob/rxjava-rebase/src/java/org/apache/cassandra/service/StorageProxy.java#L1724-L1732
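For comparison with the state machine sketch above, here's a rough sketch of the same write in the Rx style, assuming RxJava 1.x. The types and helpers (Mutation, appendToCommitlog(), sendToReplica(), and so on) are hypothetical stand-ins, not the prototype's actual API:

    // Illustrative only; assumes RxJava 1.x with Java 8 lambdas.
    import java.util.concurrent.TimeUnit;
    import rx.Observable;

    final class RxWriteSketch
    {
        static final class Mutation {}
        static final class WriteResult {}

        Observable<WriteResult> write(Mutation mutation)
        {
            // the commitlog sync is "pushed" to us rather than modeled
            // as an explicit state transition
            return appendToCommitlog(mutation)
                .flatMap(synced ->
                    // the local apply and the remote writes run
                    // concurrently; zip() completes when all three have
                    Observable.zip(
                        applyLocally(mutation),
                        sendToReplica(mutation, "replica1"),
                        sendToReplica(mutation, "replica2"),
                        (local, ack1, ack2) -> new WriteResult()))
                // timeouts (and error handling generally) are composable
                // operators that Rx provides out of the box
                .timeout(2, TimeUnit.SECONDS);
        }

        // In real code these would wrap async I/O in Observables; here
        // they emit immediately so the sketch is self-contained.
        Observable<Boolean> appendToCommitlog(Mutation m) { return Observable.just(true); }
        Observable<Boolean> applyLocally(Mutation m) { return Observable.just(true); }
        Observable<Boolean> sendToReplica(Mutation m, String replica) { return Observable.just(true); }
    }

Note that operators like zip() and timeout() ship with Rx itself, which is relevant to the code reuse point below.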
So far we've benchmarked both approaches on in-memory reads to get an idea of the upper-bound performance of each. Throughput appears to be very similar with both branches.

There are a few considerations up for debate as to which approach we should go with, and I would appreciate input on them.

First, performance. There are concerns that going with Rx (or something similar) may limit the peak performance we can eventually attain, in a couple of ways. First, we don't have as much control over the event loop, scheduling, and chunking of tasks. With the state machine approach, we're writing all of this ourselves, so it's totally under our control. With Rx, a lot of things are customizable or already have decent tools, but this may come up short in critical ways. Second, the overhead of the Observable machinery may become significant as other bottlenecks are removed. Of course, WriteTask et al. have their own overhead, but once again, we have more control there.

The second consideration is code style and ease of understanding. I think both of these approaches have downsides in different areas. The state machines are very explicit (an upside), but also very verbose and somewhat disjointed. Most of the complex operations in Cassandra can't cleanly be represented as a single state machine, because they're logically multiple state machines operating in parallel (e.g. the local write path and the remote write path in WriteTask). After working on the prototypes, I've found the state machines harder to follow logically than I had hoped. Perhaps we could come up with better abstractions and patterns for this, but that's the current state of things. On the Rx side, the downside is that the behavior is much less explicit. Additionally, some find it more difficult to mentally follow the flow of execution. Based on my past work with a large Twisted Python codebase, I'll agree that it's tough to get used to, but not unmanageable with experience and good coding patterns.

A third consideration is code reuse. A big advantage of Rx is that it comes with many tools for transforming Observables, handling multiple Observables, error handling, and tracing (some of these operators appear in the Rx sketch above). With the state machine approach, we would need to write equivalents for these from scratch. This is a non-trivial amount of work that might make the project take significantly longer to complete. Combining this with the fact that the Rx approach would be less invasive, it seems like we would have an easier time introducing incremental changes to the code base rather than landing one big-bang commit.

If I can boil these concerns down to one tradeoff, it's this: do we want to expend more effort to get more explicit code and complete control, or do we want to piggyback on the Rx work, give up some control, and (hopefully) get to the next, deeper optimizations sooner?

Thanks for any input on this topic.

--
Tyler Hobbs
DataStax <http://datastax.com/>