Re: Hbase vs Cassandra

Ajay Mon, 08 Jun 2015 01:00:56 -0700

Hi All,

Thanks for all the input. I posted the same question on the HBase forum and
got more responses.

Posting the consolidated list here.

Our case is that a central team builds and maintains the platform
(Cassandra as a service). We have a couple of use cases that fit Cassandra,
such as time-series data. But as a platform team, we need to know more about
the features and use cases that fit, or are best handled in, Cassandra. We
also want to understand the use cases where HBase performs better (we might
need to offer it as a service too).

*Cassandra:*

1) From 2013, but still relevant to both:
http://www.pythian.com/blog/watch-hbase-vs-cassandra/

2) Here are some use cases from PlanetCassandra.org of companies who chose
Cassandra over HBase after evaluation, or migrated to Cassandra from HBase.
The eComNext interview cited on the page touches on time-series data:
http://planetcassandra.org/hbase-to-cassandra-migration/

3) From googling, the most commonly cited advantages of Cassandra over HBase
are that it is easy to deploy, maintain, and monitor, and that it has no
single point of failure.

4) From our six months of research and POC experience with Cassandra, CQL
is pretty limited. Though CQL is targeted at real-time reads and writes,
there are cases where we need to pull out data differently and are OK with
a little more latency, but Cassandra doesn't support that. We need MapReduce
or Spark for those. Then the debate starts: why Cassandra, and why not
HBase, if we need Hadoop/Spark for MapReduce anyway?

I expected a few more technical features/use cases that are best handled by
Cassandra (and how they work).

*HBase:*

1) As for #4 above, you might be interested in reading
https://aphyr.com/posts/294-call-me-maybe-cassandra
I'm not sure whether there is a comparable article about HBase (does
anybody know?), but it can give you another perspective on what else to keep
an eye on with these systems.

2) See http://hbase.apache.org/book.html#perf.network.call_me_maybe

3) http://blog.parsely.com/post/1928/cass/
*Anyone have any comments on this?*

4) In reply to the four points in the original question, on Cassandra:
1. No killer features compared to HBase.
2. Terrible!!! Ambari/Cloudera Manager rule. Netflix has its own tool for
Cassandra, but it doesn't support vnodes.
3. Rumor has it that it is fast when it works ;) The reason: it can silently
drop data you try to write.
4. Time series is a nightmare. The easiest approach is to just replicate
data to HDFS, partition it by hour/day, and run
Spark/Scalding/Pig/Hive/Impala.

5) Migrated from Cassandra to HBase.
Reasons:
* Scans are fast with HBase, and it fits the time-series data model better.
Please look at OpenTSDB. Cassandra models time series with large rows.
* Server-side filtering. You can use it to filter some of your time-series
data on the server side.
* HBase has better integration with Hadoop in general. We had to write our
own bulk loader using MapReduce for Cassandra; HBase already has a tool for
that. There is also nice integration with Flume and Kite.
* High availability didn't matter for us; 10 seconds of downtime is fine for
our use cases. HBase has also started to support eventually consistent
reads.

6) * Coprocessor framework (custom code inside the RegionServers and
Master servers), which Cassandra is missing, afaik. Coprocessors have been
widely used by HBase users (Phoenix SQL, for example) since their inception
(in 0.92).
* The HBase security model is more mature and aligns well with Hadoop/HDFS
security. Cassandra provides just basic authentication/authorization/SSL
encryption: no Kerberos, no end-to-end data encryption, and no cell-level
security.

7) Another point to add is the new "HBase read high-availability using
timeline-consistent region replicas" feature from HBase 1.0 onward, which
brings HBase closer to Cassandra in terms of read availability during
node failures. You now have a choice for read availability.
https://issues.apache.org/jira/browse/HBASE-10070

8) HBase can do range scans, and one can attack many problems with range
scans. Cassandra can't do range scans.

9) HBase is a distributed, consistent, sorted key value store. The "sorted"
bit allows for range scans in addition to the point gets that all K/V
stores support. Nothing more, nothing less.
It happens to store its data in HDFS by default, and we provide convenient
input and output formats for MapReduce.
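
To make the "sorted" point in 8) and 9) concrete, here is a minimal
in-memory sketch of a sorted key-value store with point gets and range
scans. It is only an illustration, not HBase's API: the SortedKV class, the
ts_key helper, and the metric-id-plus-timestamp key layout are all
hypothetical names and choices made for this example.

```python
import bisect
import struct


class SortedKV:
    """Toy in-memory stand-in for a sorted key-value store (not HBase itself)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted byte order
        self._rows = {}   # row key -> value

    def put(self, key: bytes, value):
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key: bytes):
        """Point lookup, as any key-value store supports."""
        return self._rows.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Range scan: all rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


def ts_key(metric_id: int, epoch_seconds: int) -> bytes:
    """Hypothetical row key: big-endian metric id, then timestamp.

    Big-endian packing makes byte order match numeric (and time) order.
    """
    return struct.pack(">HI", metric_id, epoch_seconds)


store = SortedKV()
for t in (100, 200, 300, 400):
    store.put(ts_key(1, t), f"cpu@{t}")
store.put(ts_key(2, 250), "mem@250")

# All readings for metric 1 with 200 <= timestamp < 400.
window = store.scan(ts_key(1, 200), ts_key(1, 400))
print([value for _, value in window])
```

Because the keys are kept in sorted order, a time window for one metric is
a single contiguous slice, which is why time-series workloads map naturally
onto range scans.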

*Neutral:*
1)
http://khangaonkar.blogspot.com/2013/09/cassandra-vs-hbase-which-nosql-store-do.html

2) The fundamental differences that come to mind are:
* HBase is always consistent. Machine outages lead to inability to read or
write data on that machine. With Cassandra you can always write.

* Cassandra defaults to a random partitioner, so range scans are not
possible (by default)
* HBase has a range partitioner (if you don't want that, the client has to
prefix the rowkey with part of a hash of the rowkey). The main feature
that sets HBase apart is range scans.

* HBase is much more tightly integrated with Hadoop/MapReduce/HDFS, etc.
You can map reduce directly into HFiles and map those into HBase instantly.

* Cassandra has a dedicated company supporting (and promoting) it.
* Getting started is easier with Cassandra. For HBase you need to run HDFS
and Zookeeper, etc.
* I've heard lots of anecdotes about Cassandra working nicely with small
clusters (< 50 nodes) and quickly degrading above that.
* HBase does not have a query language (but you can use Phoenix for full
SQL support)
* HBase does not have secondary indexes (having an eventually consistent
index, similar to what Cassandra has, is easy in HBase, but making it as
consistent as the rest of HBase is hard)
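
As an illustration of the hash-prefix workaround mentioned in the range
partitioner point, here is a rough sketch of how a client might salt row
keys. The salted_key function, the choice of SHA-256, and the two-byte
prefix length are assumptions made for this example, not something HBase
prescribes.

```python
import hashlib


def salted_key(row_key: bytes, prefix_len: int = 2) -> bytes:
    """Prepend the first bytes of a hash of the row key.

    prefix_len is a hypothetical tuning choice for this sketch.
    """
    digest = hashlib.sha256(row_key).digest()
    return digest[:prefix_len] + row_key


keys = [b"sensor-0001", b"sensor-0002", b"sensor-0003"]

# Sorting by salted key scatters rows that were adjacent in the original
# key order, which spreads writes across the key space.
for salted in sorted(salted_key(k) for k in keys):
    print(salted.hex())
```

Point gets still work, because the client can recompute the prefix from the
row key. The trade-off is that a range scan over the original key order is
no longer a single contiguous scan.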

Thanks
Ajay

>
> On May 29, 2015, at 12:09 PM, Ajay <ajay.ga...@gmail.com> wrote:
>
> Hi,
>
> I need some info on Hbase vs Cassandra as a data store (in general plus
> specific to time series data).
>
> The comparison in the following helps:
> 1: features
> 2: deployment and monitoring
> 3: performance
> 4: anything else
>
> Thanks
> Ajay
>
>
