On Mon, Nov 22, 2010 at 2:56 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> On Mon, Nov 22, 2010 at 2:52 PM, Todd Lipcon <t...@lipcon.org> wrote:

>> On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote:

>>> I haven't used either Cassandra or HBase, so please don't take any part of this message as me attempting to state facts about either system. However, I'm very familiar with data-storage design details, and I've worked extensively optimizing applications running on MySQL, Oracle, BerkeleyDB (including distributed txn BerkeleyDB), and Google Bigtable.

>>> The recent discussion triggered by Facebook messaging using HBase helped surface many interesting design differences in the two systems. I'm writing this message both to summarize what I've read in a few different places about that topic, and to check my facts.

>>> As far as I can discern, this is a decent summary of the consistency and performance differences between HBase and Cassandra (N3/R2/W2 or N3/R1/W3) for an HBase-acceptable workload. (Please correct the facts if they appear wrong!)

>>> 1) Cassandra can't replicate the consistency situation of HBase, namely that when a write requiring a quorum fails, it will never appear. Deriving from this explanation:

>>> [In Cassandra] Provided at least one node receives the write, it will eventually be written to all replicas. A failure to meet the requested ConsistencyLevel is just that; not a failure to write the data itself. Once the write is received by a node, it will eventually reach all replicas, there is no roll back. - Nick Telford [ref]

>>> [In HBase] The DFSClient call returns when all datanodes in the pipeline have flushed (to the OS buffer) and ack'ed. That code comes from HDFS-200 in the 0.20-append branch and HDFS-265 for all branches after 0.20, meaning that it's in 0.21.0 - Jean-Daniel Cryans [ref]

>>> In HBase, if a write is accepted by only 1 of 3 HDFS replicas, and the region master never receives a response from the other two replicas, and it fails the client write, that write should never appear. Even if the region master then fails, when a new region master is elected, and it restarts and recovers, it should read HDFS blocks and accept the consensus 2/3 opinion that the log does not contain the write -- dropping the write. The write will never be seen.

>> Not quite. The replica synchronization code is pretty messy, but basically it will take the longest replica that may have been synced, not a quorum. I.e., the guarantee is that "if you successfully sync() data, it will be present after replica synchronization". Unsynced data *may* be present after replica synchronization.

>> But keep in mind that recovery is blocking in most cases - i.e., if the RS is writing to a pipeline and waiting on acks, and one of the nodes in the pipeline dies, then it will recover the pipeline (without the dead node) and continue syncing to the remaining two nodes. The client is still blocked at this point.

>> If the RS itself dies, then it won't respond to the client at all, and it's anyone's guess whether the write was successful or not. The same is true if the network between client and RS dies. This is unavoidable in any system - a server can always fail *just before* sending the "success" message, and the write is left in "maybe written" state.
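To make the sync() guarantee above concrete, here is a minimal sketch of the HDFS client calls involved, assuming the 0.20-append-era API where FSDataOutputStream.sync() is the flush-to-pipeline call (renamed hflush() on post-0.20 branches). The file path and record format are made up for illustration; a real HBase RegionServer manages its own HLog rather than writing like this.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSyncSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical WAL-style file, for illustration only.
            FSDataOutputStream out = fs.create(new Path("/tmp/example-wal"));
            out.write("row=42,value=B\n".getBytes("UTF-8"));

            // Blocks until every datanode in the (possibly recovered) pipeline
            // has flushed and acked. If this returns normally, the edit is
            // expected to survive replica synchronization; if it throws, or the
            // process dies before it returns, the edit is in the "maybe written"
            // state described above.
            out.sync();   // hflush() in 0.21+

            out.close();
        }
    }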
>> What will *not* happen, though, is the following case:
>> - Row contains value A
>> - Client writes value B
>> - RS fails
>> - Client reads value A
>> - Client reads again and sees value B

>> Similarly, if the client reads value B, it won't revert to value A in any circumstance.

>>> In Cassandra, if a write (requesting 2 or 3 copies) is accepted by only one node, that write will fail to the client. Future reads with R=1 will see that write or not, depending on whether they contact the one server that accepted it, until the data is propagated, at which time they will see the write. Reads with R=2 will not see the write until it has propagated to at least two servers. There is no mechanism to assure that a write is either accepted by the requested number of servers or aborted.

>>> 2) Cassandra has a less efficient memory footprint for data pinned in memory (or cached). With 3 replicas on Cassandra, each element of data pinned in memory is kept in memory on 3 servers, whereas in HBase only region masters keep the data in memory, so there is only one copy of each data element.

>>> 3) Cassandra (N3/W2/R2) has slower reads of cached or pinned-in-memory data. HBase can answer a read-only query that is in memory from the single region master, while Cassandra (N3/W2/R2) must read from multiple servers. (Note, N3/W2/R2 still doesn't produce the same consistency situation as HBase, see #1.)

>> Yes, probably - except that it seems to me Cassandra should be able to offer lower latency in the face of Java GC pauses. If an HBase RS is in a 200ms GC pause, latency for all rows hosted by that server will spike to 200ms. If one of three replicas is in a 200ms GC pause, the other two replicas will still respond quickly, so latency should be less spiky in Cassandra. But it's at the cost of more RAM usage, as you mentioned above.

>>> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again in the face of a node failure than HBase/HDFS. Cassandra must repair the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS can still achieve a 2-node quorum in the face of a node failure. (Note, using N3/W2 requires R2, see #3.) (Note, this still doesn't produce the same consistency situation as HBase, see #1.)

>> It takes HBase a while to detect the failure and recover the region to a new server - the recovery time depends on the amount of unflushed data in the memstores of the failed server. With default configs, it takes 1 minute for the ZK lease to be lost on a failed server, and then somewhere between 10 seconds and a few minutes to fully reassign the regions (depending on the amount of data needing to be replayed from the HLog).

>>> 5) HBase can't match the row-availability situation of Cassandra (N3/W2/R2 or N3/W3/R1). In the face of a single machine failure, if it is a region master, those keys are offline in HBase until a new region master is elected and brought online. In Cassandra, no single node failure causes the data to become unavailable.

>> True. We're working on the concept of "slave regions" which would handle some kinds of reads and blind puts during failures.

>>> Is that summary correct? Am I missing any points? Did I get any facts wrong?

>>> Note, I'm NOT attempting to advocate the following changes, but simply to understand the design differences....
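Before the "what changes would make them the same" part below, here is a sketch of what point #1 means from a Cassandra client's perspective. The KeyValueClient interface and write() method are hypothetical stand-ins for the real Thrift insert() call so the exact signatures don't distract; the ConsistencyLevel, UnavailableException and TimedOutException names are Cassandra's real Thrift types.

    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.TimedOutException;
    import org.apache.cassandra.thrift.UnavailableException;

    public class QuorumWriteSketch {

        /** Hypothetical thin wrapper; stands in for the real Thrift insert() call. */
        interface KeyValueClient {
            void write(String key, String column, String value, ConsistencyLevel cl)
                    throws UnavailableException, TimedOutException;
        }

        static void attemptQuorumWrite(KeyValueClient client) {
            try {
                client.write("row42", "col", "B", ConsistencyLevel.QUORUM); // W=2 of N=3
            } catch (UnavailableException e) {
                // Fewer than 2 replicas appeared alive; the write was rejected up front.
            } catch (TimedOutException e) {
                // The coordinator gave up waiting for 2 acks, but one replica may
                // already hold "B". There is no rollback: a later read at
                // ConsistencyLevel.ONE may or may not return "B" until the value
                // propagates, and a read at QUORUM won't see it until it reaches
                // at least 2 replicas.
            }
        }
    }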
>>> From my uninformed view, it seems that #1 causes the biggest cascade of differences, affecting both #3 and #4. If Cassandra were allowed to do what HBase/HDFS does, namely to specify a repair-consistency requirement, then Cassandra (N3/W2/R2/Repair2) should provide the same consistency guarantee as HBase. Further, if Cassandra were allowed to elect one of the copies of data as 'master', then it could require the master to participate in all quorum writes, allowing reads to be consistent when conducted only through the master. This could be a road to address 2/3/4. To my eyes, this would make Cassandra capable of operating in a mode which is essentially equivalent to HBase/HDFS. Do those facts seem correct? --- AGAIN, I'm not advocating Cassandra make these changes; I'm simply trying to understand the differences by considering what changes it would take to make them the same.

>> On the surface it seems reasonable, but these things are always very tricky, and it's the subtleties that will kill you :)

>> -Todd

>> P.S. Very happy to see informed technical discussion of the differences in the two architectures!

> On Mon, Nov 22, 2010 at 1:06 PM, David Jeske <dav...@gmail.com> wrote:

>> I already noticed a mistake in my own facts...

>> On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote:

>>> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again in the face of a node failure than HBase/HDFS. Cassandra must repair the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS can still achieve a 2-node quorum in the face of a node failure. (Note, using N3/W2 requires R2, see #3.) (Note, this still doesn't produce the same consistency situation as HBase, see #1.)

>> This should read:

>> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again in the face of a node failure than HBase/HDFS does in the face of an HDFS node failure. Cassandra must repair the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS can still achieve a 2-node quorum in the face of a node failure. (Note, using N3/W2 requires R2, see #3.) (Note, this still doesn't produce the same consistency situation as HBase, see #1.)

> 2) Cassandra has a less efficient memory footprint for data pinned in memory (or cached). With 3 replicas on Cassandra, each element of data pinned in memory is kept in memory on 3 servers, whereas in HBase only region masters keep the data in memory, so there is only one copy of each data element.

> True in the general case, but if you want to optimize for READ.ONE, we can make our caches work like HBase: https://issues.apache.org/jira/browse/CASSANDRA-1314.

> #4 is not a fair comparison. If HBase is happy working with 2/3 replicas, it is not equal to Write.ALL Read.ONE.
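Since much of the thread turns on the usual quorum-intersection arithmetic, a tiny self-contained check may help map the N/R/W shorthands being thrown around. This is just the textbook R + W > N overlap rule, not anything specific to either codebase.

    public class QuorumMath {
        /** A read quorum intersects a write quorum on at least one replica iff R + W > N. */
        static boolean readsOverlapWrites(int n, int w, int r) {
            return r + w > n;
        }

        public static void main(String[] args) {
            int n = 3;
            // The two Cassandra configurations discussed in the thread:
            System.out.println("N3/W2/R2 overlaps: " + readsOverlapWrites(n, 2, 2)); // true
            System.out.println("N3/W3/R1 overlaps: " + readsOverlapWrites(n, 3, 1)); // true
            // Relevant to the fairness point above: W=ALL stops accepting writes as
            // soon as one replica is down, whereas HBase keeps going on 2/3 HDFS
            // replicas, so the two setups aren't directly comparable.
            System.out.println("N3/W2/R1 overlaps: " + readsOverlapWrites(n, 2, 1)); // false
        }
    }

Even with R + W > N, point #1 from the thread still applies: a write that failed to reach its requested W replicas is not rolled back, so quorum overlap alone does not reproduce HBase's behavior.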
Also wanted to comment on your item #3. Cassandra has two types of caches, a key cache and a row cache; users can choose either. HBase only has a block cache.

What of reads that are not in the cache? Cassandra can use memory-mapped I/O for its data and index files. HBase has a very expensive read path for things that are not in cache, and HDFS random-read performance is historically poor.
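For readers unfamiliar with the memory-mapped read path mentioned above, here is a minimal sketch of what mmap-style access looks like from Java. This is generic java.nio usage, not Cassandra's actual SSTable reader; the file name and offset are made up.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MmapReadSketch {
        public static void main(String[] args) throws Exception {
            // Map a data file read-only; the OS page cache then serves repeat
            // reads without an explicit read() syscall per lookup.
            try (RandomAccessFile raf = new RandomAccessFile("/tmp/example-data.db", "r");
                 FileChannel ch = raf.getChannel()) {
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

                long offset = 128;   // hypothetical position from an index lookup
                if (map.limit() > offset) {
                    byte b = map.get((int) offset);   // random read served from the mapping
                    System.out.println("byte at offset " + offset + ": " + b);
                }
            }
        }
    }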