On Mon, Nov 22, 2010 at 2:56 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> On Mon, Nov 22, 2010 at 2:52 PM, Todd Lipcon <t...@lipcon.org> wrote:

>> On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote:

>>> I haven't used either Cassandra or HBase, so please don't take any part of this message as me attempting to state facts about either system. However, I'm very familiar with data-storage design details, and I've worked extensively optimizing applications running on MySQL, Oracle, BerkeleyDB (including distributed txn BerkeleyDB), and Google Bigtable.

>>> The recent discussion triggered by Facebook messaging using HBase helped surface many interesting design differences in the two systems. I'm writing this message both to summarize what I've read in a few different places about that topic, and to check my facts.

>>> As far as I can discern, this is a decent summary of the consistency and performance differences between HBase and Cassandra (N3/R2/W2 or N3/R1/W3) for an HBase-acceptable workload. (Please correct the facts if they appear wrong!)

>>> 1) Cassandra can't replicate the consistency situation of HBase, namely that when a write requiring a quorum fails, it will never appear. Deriving from this explanation:

>>> [In Cassandra] Provided at least one node receives the write, it will eventually be written to all replicas. A failure to meet the requested ConsistencyLevel is just that; not a failure to write the data itself. Once the write is received by a node, it will eventually reach all replicas, there is no roll back. - Nick Telford [ref]

>>> [In HBase] The DFSClient call returns when all datanodes in the pipeline have flushed (to the OS buffer) and ack'ed. That code comes from HDFS-200 in the 0.20-append branch and HDFS-265 for all branches after 0.20, meaning that it's in 0.21.0 - Jean-Daniel Cryans [ref]

>>> In HBase, if a write is accepted by only 1 of 3 HDFS replicas, and the region master never receives a response from the other two replicas, and it fails the client write, that write should never appear. Even if the region master then fails, when a new region master is elected, and it restarts and recovers, it should read HDFS blocks and accept the consensus 2/3 opinion that the log does not contain the write -- dropping the write. The write will never be seen.

>> Not quite. The replica synchronization code is pretty messy, but basically it will take the longest replica that may have been synced, not a quorum. I.e., the guarantee is that "if you successfully sync() data, it will be present after replica synchronization". Unsynced data *may* be present after replica synchronization.

>> But keep in mind that recovery is blocking in most cases - i.e., if the RS is writing to a pipeline and waiting on acks, and one of the nodes in the pipeline dies, then it will recover the pipeline (without the dead node) and continue syncing to the remaining two nodes. The client is still blocked at this point.

>> If the RS itself dies, then it won't respond to the client at all, and it's anyone's guess whether the write was successful or not. The same is true if the network between client and RS dies. This is unavoidable in any system - a server can always fail *just before* sending the "success" message, and the write is left in "maybe written" state.
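To make the sync() guarantee above concrete, here is a minimal sketch of the HDFS client calls involved, assuming the 0.20-append-era API where FSDataOutputStream.sync() is the flush-to-pipeline call (renamed hflush() on post-0.20 branches). The file path and record format are made up for illustration; a real HBase RegionServer manages its own HLog rather than writing like this.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSyncSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical WAL-style file, for illustration only.
            FSDataOutputStream out = fs.create(new Path("/tmp/example-wal"));
            out.write("row=42,value=B\n".getBytes("UTF-8"));

            // Blocks until every datanode in the (possibly recovered) pipeline
            // has flushed and acked. If this returns normally, the edit is
            // expected to survive replica synchronization; if it throws, or the
            // process dies before it returns, the edit is in the "maybe written"
            // state described above.
            out.sync();   // hflush() in 0.21+

            out.close();
        }
    }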
>> What will *not* happen, though, is the following case:
>> - Row contains value A
>> - Client writes value B
>> - RS fails
>> - Client reads value A
>> - Client reads again and sees value B

>> Similarly, if the client reads value B, it won't revert to value A in any circumstance.

>>> In Cassandra, if a write (requesting 2 or 3 copies) is accepted by only one node, that write will fail to the client. Future reads with R=1 will see that write or not, depending on whether they contact the one server that accepted it, until the data is propagated, at which time they will see the write. Reads with R=2 will not see the write until it has propagated to at least two servers. There is no mechanism to assure that a write is either accepted by the requested number of servers or aborted.

>>> 2) Cassandra has a less efficient memory footprint for data pinned in memory (or cached). With 3 replicas on Cassandra, each element of data pinned in memory is kept in memory on 3 servers, whereas in HBase only region masters keep the data in memory, so there is only one copy of each data element.

>>> 3) Cassandra (N3/W2/R2) has slower reads of cached or pinned-in-memory data. HBase can answer a read-only query that is in memory from the single region master, while Cassandra (N3/W2/R2) must read from multiple servers. (Note, N3/W2/R2 still doesn't produce the same consistency situation as HBase, see #1.)

>> Yes, probably - except that it seems to me Cassandra should be able to offer lower latency in the face of Java GC pauses. If an HBase RS is in a 200ms GC pause, latency for all rows hosted by that server will spike to 200ms. If one of three replicas is in a 200ms GC pause, the other two replicas will still respond quickly, so latency should be less spiky in Cassandra. But it's at the cost of more RAM usage, as you mentioned above.

>>> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again in the face of a node failure than HBase/HDFS. Cassandra must repair the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS can still achieve a 2-node quorum in the face of a node failure. (Note, using N3/W2 requires R2, see #3.) (Note, this still doesn't produce the same consistency situation as HBase, see #1.)

>> It takes HBase a while to detect the failure and recover the region to a new server - the recovery time depends on the amount of unflushed data in the memstores of the failed server. With default configs, it takes 1 minute for the ZK lease to be lost on a failed server, and then somewhere between 10 seconds and a few minutes to fully reassign the regions (depending on the amount of data needing to be replayed from the HLog).

>>> 5) HBase can't match the row-availability situation of Cassandra (N3/W2/R2 or N3/W3/R1). In the face of a single machine failure, if it is a region master, those keys are offline in HBase until a new region master is elected and brought online. In Cassandra, no single node failure causes the data to become unavailable.

>> True. We're working on the concept of "slave regions" which would handle some kinds of reads and blind puts during failures.

>>> Is that summary correct? Am I missing any points? Did I get any facts wrong?

>>> Note, I'm NOT attempting to advocate the following changes, but simply to understand the design differences....
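Before the "what changes would make them the same" part below, here is a sketch of what point #1 means from a Cassandra client's perspective. The KeyValueClient interface and write() method are hypothetical stand-ins for the real Thrift insert() call so the exact signatures don't distract; the ConsistencyLevel, UnavailableException and TimedOutException names are Cassandra's real Thrift types.

    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.TimedOutException;
    import org.apache.cassandra.thrift.UnavailableException;

    public class QuorumWriteSketch {

        /** Hypothetical thin wrapper; stands in for the real Thrift insert() call. */
        interface KeyValueClient {
            void write(String key, String column, String value, ConsistencyLevel cl)
                    throws UnavailableException, TimedOutException;
        }

        static void attemptQuorumWrite(KeyValueClient client) {
            try {
                client.write("row42", "col", "B", ConsistencyLevel.QUORUM); // W=2 of N=3
            } catch (UnavailableException e) {
                // Fewer than 2 replicas appeared alive; the write was rejected up front.
            } catch (TimedOutException e) {
                // The coordinator gave up waiting for 2 acks, but one replica may
                // already hold "B". There is no rollback: a later read at
                // ConsistencyLevel.ONE may or may not return "B" until the value
                // propagates, and a read at QUORUM won't see it until it reaches
                // at least 2 replicas.
            }
        }
    }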
>>> From my uninformed view, it seems that #1 causes the biggest cascade of differences, affecting both #3 and #4. If Cassandra were allowed to do what HBase/HDFS does, namely to specify a repair-consistency requirement, then Cassandra (N3/W2/R2/Repair2) should provide the same consistency guarantee as HBase. Further, if Cassandra were allowed to elect one of the copies of data as 'master', then it could require the master to participate in all quorum writes, allowing reads to be consistent when conducted only through the master. This could be a road to address 2/3/4. To my eyes, this would make Cassandra capable of operating in a mode which is essentially equivalent to HBase/HDFS. Do those facts seem correct? --- AGAIN, I'm not advocating Cassandra make these changes; I'm simply trying to understand the differences by considering what changes it would take to make them the same.

>> On the surface it seems reasonable, but these things are always very tricky, and it's the subtleties that will kill you :)

>> -Todd

>> P.S. Very happy to see informed technical discussion of the differences in the two architectures!

> On Mon, Nov 22, 2010 at 1:06 PM, David Jeske <dav...@gmail.com> wrote:

>> I already noticed a mistake in my own facts...

>> On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote:

>>> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again in the face of a node failure than HBase/HDFS. Cassandra must repair the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS can still achieve a 2-node quorum in the face of a node failure. (Note, using N3/W2 requires R2, see #3.) (Note, this still doesn't produce the same consistency situation as HBase, see #1.)

>> This should read:

>> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again in the face of a node failure than HBase/HDFS does in the face of an HDFS node failure. Cassandra must repair the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS can still achieve a 2-node quorum in the face of a node failure. (Note, using N3/W2 requires R2, see #3.) (Note, this still doesn't produce the same consistency situation as HBase, see #1.)

> 2) Cassandra has a less efficient memory footprint for data pinned in memory (or cached). With 3 replicas on Cassandra, each element of data pinned in memory is kept in memory on 3 servers, whereas in HBase only region masters keep the data in memory, so there is only one copy of each data element.

> True in the general case, but if you want to optimize for READ.ONE, we can make our caches work like HBase: https://issues.apache.org/jira/browse/CASSANDRA-1314.

> #4 is not a fair comparison. If HBase is happy working with 2/3 replicas, it is not equal to Write.ALL Read.ONE.
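Since much of the thread turns on the usual quorum-intersection arithmetic, a tiny self-contained check may help map the N/R/W shorthands being thrown around. This is just the textbook R + W > N overlap rule, not anything specific to either codebase.

    public class QuorumMath {
        /** A read quorum intersects a write quorum on at least one replica iff R + W > N. */
        static boolean readsOverlapWrites(int n, int w, int r) {
            return r + w > n;
        }

        public static void main(String[] args) {
            int n = 3;
            // The two Cassandra configurations discussed in the thread:
            System.out.println("N3/W2/R2 overlaps: " + readsOverlapWrites(n, 2, 2)); // true
            System.out.println("N3/W3/R1 overlaps: " + readsOverlapWrites(n, 3, 1)); // true
            // Relevant to the fairness point above: W=ALL stops accepting writes as
            // soon as one replica is down, whereas HBase keeps going on 2/3 HDFS
            // replicas, so the two setups aren't directly comparable.
            System.out.println("N3/W2/R1 overlaps: " + readsOverlapWrites(n, 2, 1)); // false
        }
    }

Even with R + W > N, point #1 from the thread still applies: a write that failed to reach its requested W replicas is not rolled back, so quorum overlap alone does not reproduce HBase's behavior.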
Also wanted to comment on your item #3. Cassandra has two types of caches, a key cache and a row cache; users can choose either. HBase only has a block cache.

What of reads that are not in the cache? Cassandra can use memory-mapped I/O for its data and index files. HBase has a very expensive read path for things that are not in cache, and HDFS random-read performance is historically poor.
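For readers unfamiliar with the memory-mapped read path mentioned above, here is a minimal sketch of what mmap-style access looks like from Java. This is generic java.nio usage, not Cassandra's actual SSTable reader; the file name and offset are made up.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MmapReadSketch {
        public static void main(String[] args) throws Exception {
            // Map a data file read-only; the OS page cache then serves repeat
            // reads without an explicit read() syscall per lookup.
            try (RandomAccessFile raf = new RandomAccessFile("/tmp/example-data.db", "r");
                 FileChannel ch = raf.getChannel()) {
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

                long offset = 128;   // hypothetical position from an index lookup
                if (map.limit() > offset) {
                    byte b = map.get((int) offset);   // random read served from the mapping
                    System.out.println("byte at offset " + offset + ": " + b);
                }
            }
        }
    }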