On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote:
> I havn't used either Cassandra or hbase, so please don't take any part of > this message as me attempting to state facts about either system. However, > I'm very familiar with data-storage design details, and I've worked > extensively optimizing applications running on MySQL, Oracle, berkeledb > (including distributed txn berkeleydb), and Google Bigtable. > > The recent discussion triggered by Facebook messaging using HBase helped > surface many interesting design differences in the two systems. I'm writing > this message both to summarize what I've read in a few different places > about that topic, and to check my facts. > > As far as I can descern, this is a decent summary of the consistency and > performance differences between hbase and cassandra (N3/R2/W2 or N3/R1/W3) > for an hbase acceptable workload.. (Please correct the fact if they appear > wrong!) > > *1) Cassandra can't replicate the consistency situation of HBase.* Namely, > that when a write requiring a quorum fails it will never appear. Deriving > from this explanation: > > [In Cassandra]Provided at least one node receives the write, it will > eventually be written to all replicas. A failure to meet the requested > ConsistencyLevel is just that; not a failure to write the data itself. Once > the write is received by a node, it will eventually reach all replicas, > there is no roll back. - Nick Telford > [ref<http://www.mail-archive.com/user@cassandra.apache.org/msg07398.html> > ] > > [In Hbase] The DFSClient call returns when all datanodes in the pipeline > have flushed (to the OS buffer) and ack'ed. That code comes from HDFS-200 in > the 0.20-append branch and HDFS-265 for all branches after 0.20, meaning > that it's in 0.21.0 - Jean-Daniel Cryans > [ref<http://www.quora.com/How-does-HBase-write-performance-differ-from-write-performance-in-Cassandra-with-consistency-level-ALL?q=hbase+cassandra> > ] > > in HBase, if a write is accepted by only 1 of 3 HDFS replicas; and the > region master never receives a response from the other two replicas; and it > fails the client write, that write should never appear. Even if the region > master then fails, when a new region master is elected, and it restarts and > recovers, it should read HDFS blocks and accept the consensus 2/3 opinion > that the log does not contain the write -- dropping the write. The write > will never be seen. > Not quite. The replica synchronization code is pretty messy, but basically it will take the longest replica that may have been synced, not a quorum. i.e the guarantee is that "if you successfully sync() data, it will be present after replica synchronization". Unsynced data *may* be present after replica synchronization. But keep in mind that recovery is blocking in most cases - ie if the RS is writing to a pipeline and waiting on acks, and one of the nodes in the pipeline dies, then it will recover the pipeline (without the dead node) and continue syncing to the remaining two nodes. The client is still blocked at this point. If the RS itself dies, then it won't respond to the client at all, and it's anyone's guess whether the write was successful or not. The same is true if the network between client and RS dies. This is unavoidable in any system - a server can always fail *just before* sending the "success" message, and the write is left in "maybe written" state. What will *not* happen, though, is the following case: - Row contains value A - Client writes value B - RS fails - Client reads value A - Client reads again and sees value B Similarly, if client reads value B, it won't revert to value A in any circumstance. > > In Cassandra, if a write (requesting 2 or 3 copies) is accepted by only one > node, that write will fail to the client. Future reads R=1 will see that > write or not depending on whether they contact the one server that accepted > or not, until the data is propagated, at which time they will see the write. > Reads R=2 will not see the write until it is propagated until at least two > servers. There is no mechanism to assure that a write is either accepted by > the requested number of servers or aborted. > > *2) Cassandra has a less efficient memory footprint data pinned in memory > (or cached).* With 3 replicas on Cassandra, each element of data pinned > in-memory is kept in memory on 3 servers, wheras in hbase only region > masters keep the data in memory, so there is only one-copy of each data > element. > > *3) Cassandra (N3/W2/R2) has slower reads of cached or pinned-in-memory > data.* HBase can answer a read-only query that is in memory from the > single region-master, while Cassandra (N3/W2/R2) must read from multiple > servers. (note, N3/W2/R2 still doesn't produce the same consistency > situation as hbase, see #1) > > Yes, probably - except that it seems to me Cassandra should be able to offer lower latency in the face of java GC pauses. If an HBase RS is in a 200ms GC pause, latency for all rows hosted by that server will spike to 200ms. If one of three replicas is in a 200ms GC pause, the other two replicas will still respond quickly so latency should be less spiky in Cassandra. But it's at the cost of more RAM usage as you mentioned above. > *4) Cassandra (N3/W3/R1) takes longer to allow data to become writable > again in the face of a node-failure than HBase/HDFS.* Cassandra must > repair the keyrange to bring N from 2 to 3 to resume allowing writes with > W=3. HDFS can still acheive a 2 node quorum in the face of a node failure. > (note, using N3/W2 requires R2, see #3) (note, this still doesn't produce > the same consistency situation as hbase, see #1.) > > It takes HBase a while to detect the failure and recover the region to a new server - the recovery time depends on the amount of unflushed data in the memstores of the failed server. With default configs, it takes 1 minute for the ZK lease to be lost on a failed server, and then somewhere between 10 seconds and a few minutes to fully reassign the regions (depending on amount of data needing to be replayed from the HLog) > *5) HBase can't match the row-availability situation of Cassandra(N3/W2/R2 > or N3/W3/R1).* In the face of a single machine failure, if it is a region > master, those keys are offline in HBase until a new region master is elected > and brought online. In Cassandra, no single node failure causes the data to > become unavailable. > > True. We're working on the concept of "slave regions" which would handle some kinds of reads and blind puts during failures. > > Is that summary correct? Am I missing any points? Did I get any facts > wrong? > > Note, I'm NOT attempting to advocate the following changes, but simply > understand the design differences.... > > From my uninformed view, it seems that #1 causes the biggest cascade of > differences, affecting both #3 and #4. If Cassandra were allowed to do what > HBase/HDFS does, namely to specify a repair-consistency requirement, then > Cassandra (N3/W2/R2/Repair2) should be the same consistency guarantee as > HBase. Further, if Cassandra were allowed to elect one of the copies of data > as 'master'. then it could require the master participate in all quorum > writes, allowing reads to be consistent when conducted only through the > master. This could be a road to address 2/3/4. To my eyes, this would make > Cassandra capable of operating in a mode which is essentially equivilent to > HBase/HDFS. Do those facts seem correct? --- AGAIN, I'm not advocating > Cassandra make these changes, I'm simply trying to understand the > differences, by considering what changes it would take to make them the > same. > > On the surface it seems reasonable, but these things are always very tricky, and it's the subtleties that will kill you :) -Todd P.S. Very happy to see informed technical discussion of the differences in the two architectures!