Hi All,

Thanks for all the input. I posted the same question in the HBase forum and got more responses.
Posting the consolidated list here.

Our case is that a central team builds and maintains the platform (Cassandra as a service). We have a couple of use cases that fit Cassandra, like time-series data. But as a platform team, we need to know more features and use cases that fit or are best handled in Cassandra. We also need to understand the use cases where HBase performs better (we might need to offer it as a service too).

*Cassandra:*

1) From 2013, both can still be relevant:
   http://www.pythian.com/blog/watch-hbase-vs-cassandra/

2) Here are some use cases from PlanetCassandra.org of companies who chose Cassandra over HBase after evaluation, or migrated to Cassandra from HBase. The eComNext interview cited on the page touches on time-series data:
   http://planetcassandra.org/hbase-to-cassandra-migration/

3) From googling, the most commonly cited advantages of Cassandra over HBase are that it is easy to deploy, maintain and monitor, and that it has no single point of failure.

4) From our six months of research and POC experience with Cassandra, CQL is pretty limited. Though CQL is targeted at real-time reads and writes, there are cases where we need to pull out data differently and we are OK with a little more latency. But Cassandra doesn't support that; we need MapReduce or Spark for those. Then the debate starts: why Cassandra, and why not HBase, if we need Hadoop/Spark for MapReduce anyway?

I expected a few more technical features/use cases that are best handled by Cassandra (and how it works).

*HBase:*

1) As for #4, you might be interested in reading https://aphyr.com/posts/294-call-me-maybe-cassandra
   Not sure if there is a comparable article about HBase (anybody know?), but it can give you another perspective on what else to keep an eye on regarding these systems.

2) See http://hbase.apache.org/book.html#perf.network.call_me_maybe

3) http://blog.parsely.com/post/1928/cass/
   *Anyone have any comments on this?*

4) 1. No killer features compared to HBase.
   2. Terrible!!! Ambari/Cloudera Manager rulezzz.
      Netflix has its own tool for Cassandra, but it doesn't support vnodes.
   3. Rumors say it is fast when it works ;) The reason: it can silently drop data you try to write.
   4. Time series is a nightmare. The easiest approach is to just replicate the data to HDFS, partition it by hour/day and run Spark/Scalding/Pig/Hive/Impala.

5) Migrated from Cassandra to HBase. Reasons:
   * Scan is fast with HBase. It fits better with the time-series data model. Please look at OpenTSDB. Cassandra models it with large rows.
   * Server-side filtering. You can use it to filter some of your time-series data on the server side.
   * HBase has better integration with Hadoop in general. We had to write our own bulk loader using MapReduce for Cassandra; HBase already had a tool for that. There is nice integration with Flume and Kite.
   * High availability didn't matter for us. 10 seconds of downtime is fine for our use cases. HBase has started to support eventually consistent reads.

6) Coprocessor framework (custom code inside RegionServers and Master servers), which Cassandra is missing, afaik. Coprocessors have been widely used by HBase users (Phoenix SQL, for example) since their inception (in 0.92).
   * The HBase security model is more mature and aligns well with Hadoop/HDFS security. Cassandra provides just basic authentication/authorization/SSL encryption: no Kerberos, no end-to-end data encryption, no cell-level security.

7) Another point to add is the new "HBase read high-availability using timeline-consistent region replicas" feature from HBase 1.0 onward, which brings HBase closer to Cassandra in terms of read availability during node failures. You have a choice for read availability now.
   https://issues.apache.org/jira/browse/HBASE-10070

8) HBase can do range scans, and one can attack many problems with range scans. Cassandra can't do range scans.

9) HBase is a distributed, consistent, sorted key-value store. The "sorted" bit allows for range scans in addition to the point gets that all K/V stores support. Nothing more, nothing less.
   It happens to store its data in HDFS by default, and we provide convenient input and output formats for MapReduce.

*Neutral:*

1) http://khangaonkar.blogspot.com/2013/09/cassandra-vs-hbase-which-nosql-store-do.html

2) The fundamental differences that come to mind are:
   * HBase is always consistent. Machine outages lead to an inability to read or write data on that machine. With Cassandra you can always write.
   * Cassandra defaults to a random partitioner, so range scans are not possible (by default).
   * HBase has a range partitioner (if you don't want that, the client has to prefix the rowkey with a prefix of a hash of the rowkey). The main feature that sets HBase apart is range scans.
   * HBase is much more tightly integrated with Hadoop/MapReduce/HDFS, etc. You can MapReduce directly into HFiles and map those into HBase instantly.
   * Cassandra has a dedicated company supporting (and promoting) it.
   * Getting started is easier with Cassandra. For HBase you need to run HDFS and ZooKeeper, etc.
   * I've heard lots of anecdotes about Cassandra working nicely with small clusters (< 50 nodes) and quickly degenerating above that.
   * HBase does not have a query language (but you can use Phoenix for full SQL support).
   * HBase does not have secondary indexes (having an eventually consistent index, similar to what Cassandra has, is easy in HBase, but making it as consistent as the rest of HBase is hard).

Thanks
Ajay

> On May 29, 2015, at 12:09 PM, Ajay <ajay.ga...@gmail.com> wrote:
>
> Hi,
>
> I need some info on Hbase vs Cassandra as a data store (in general plus
> specific to time series data).
>
> The comparison in the following helps:
> 1: features
> 2: deployment and monitoring
> 3: performance
> 4: anything else
>
> Thanks
> Ajay
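
A footnote on the range-scan points (HBase items 8 and 9, and the partitioner bullets under Neutral): the toy Python sketch below may help show why a sorted key space makes range scans cheap, and what prefixing the rowkey with a hash costs you. It is only an in-memory illustration, not the HBase or Cassandra API. The names SortedStore, salted and scan_salted, the "sensor#timestamp" key layout, the MD5 hash and the bucket count of 4 are all assumptions made up for the example.

```python
import bisect
import hashlib


class SortedStore:
    """Toy in-memory stand-in for a sorted key-value store (hypothetical, not HBase)."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order
        self._vals = {}

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._vals[key]


def salted(key, buckets=4):
    """Prefix the key with a bucket number derived from a hash of the key."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
    return f"{bucket:02d}|{key}"


def scan_salted(store, start, stop, buckets=4):
    """Range scan over salted keys: one scan per bucket, then merge the results."""
    rows = []
    for b in range(buckets):
        prefix = f"{b:02d}|"
        for key, value in store.scan(prefix + start, prefix + stop):
            rows.append((key[len(prefix):], value))
    return sorted(rows)


# Time-series rows keyed as "<sensor>#<timestamp>" (assumed layout).
plain, hashed = SortedStore(), SortedStore()
for hour in ("09", "10", "11", "12"):
    key = f"sensor1#2015-05-29T{hour}"
    plain.put(key, f"reading-{hour}")
    hashed.put(salted(key), f"reading-{hour}")

start, stop = "sensor1#2015-05-29T10", "sensor1#2015-05-29T12"
print(list(plain.scan(start, stop)))      # one contiguous slice of the key space
print(scan_salted(hashed, start, stop))   # same rows, but one scan per bucket
```

With sorted keys the scan is a single contiguous slice. Once keys are spread by a hash prefix, writes are distributed more evenly, but the same query has to fan out to every bucket and merge the results, which is roughly the trade-off the partitioner bullets describe.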