Hello, Jonathan,

Thank you. I understand the situation now.

If you have a strong requirement that data cannot be unavailable for more than one second, I think Cassandra would be the clear winner here. Is this a requirement just for reads, just for writes, or both?

Perhaps just for reads, but I'm not sure yet. Front-end caching may help; however, additional caching costs more money and makes the system more complex.

The flip side to this is that Cassandra carries the same in-memory data on every replica (because data can be read and written from multiple nodes, it must live on all of those nodes), whereas HBase carries it only once, on one server; replication in HBase happens at the DFS level, not the DB level. So across a cluster with replication factor 3, you effectively have only 1/3 of the total memory available for unique data with Cassandra, if that makes sense.
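The arithmetic above can be sketched as a small back-of-envelope model (my own illustration, not from either project's documentation; the node counts and memory sizes are made-up numbers):

```python
def unique_cache_capacity(nodes, mem_per_node_gb, replication_factor):
    """Cluster memory usable for *unique* cached data.

    With replication factor R, every cached item occupies memory on R
    replicas, so only total_memory / R holds distinct data.
    """
    total = nodes * mem_per_node_gb
    return total / replication_factor

# Cassandra-style: data (and its cache) lives on all R = 3 replicas
print(unique_cache_capacity(12, 32, 3))  # 128.0 GB unique out of 384 GB total

# HBase-style: each region is served (and cached) by one RegionServer;
# replication happens at the HDFS layer, below the cache
print(unique_cache_capacity(12, 32, 1))  # 384.0 GB unique
```

This is why, for the same hardware, HBase can keep roughly three times as much distinct data hot in memory when the DFS replication factor is 3.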

I hadn't noticed this point. It is appealing that HBase could cache more data effectively.

Quite honestly, the requirement that data not be unavailable for more than 1 second likely takes HBase out of the running, because under a hard RegionServer failure you will almost certainly have regions offline for longer than that. We'll continue improving here, and if you are not including the time for fault detection, it is feasible that we could get down into the realm of 1 second, though in that case you'd likely have a period of "eventual consistency" in which you would be able to access a region while the log replay was going on.

Accessing data during log replay sounds interesting as an option when transactional RegionServers are not in use.

Regards,
Maumau
