Pre-splitting or not, that's the question

Marcelo Valle (BLOOMBERG/ LONDON) Tue, 07 Apr 2015 04:00:13 -0700

Hello, 

I am still taking my first steps with HBase; I used Cassandra for a while 
before.


For several years, I believed that storing data in Cassandra ordered across 
nodes was something evil, as its OrderedPartitioner is not supported and not 
recommended in production. 

In the HBase/Hadoop world, this is the default though. When trying to optimize 
for writes, I was told people often use pre-splitting in HBase, sometimes with 
salted keys. This seems to make HBase behave like Cassandra's random 
partitioner, losing data order across nodes (because of the salting) but 
gaining better write throughput.
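
To make concrete what I mean by salting, here is a minimal Python sketch. It 
does not use the HBase client API at all; the names (NUM_BUCKETS, salt_key, 
split_points, scan_ranges) and the "NN|" prefix format are only my own 
assumptions for illustration. It shows how a hash-derived prefix spreads 
writes over pre-split regions, and why a range read then has to fan out to 
every bucket:

```python
import hashlib

# Hypothetical bucket count; in practice it would match the number of
# regions the table is pre-split into.
NUM_BUCKETS = 4


def salt_key(row_key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a bucket derived from a hash of the key itself.

    The prefix is deterministic, so a point read can recompute it.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}|{row_key}"


def split_points(buckets: int = NUM_BUCKETS) -> list:
    """Boundaries between buckets, usable as pre-split keys for the table."""
    return [f"{b:02d}|" for b in range(1, buckets)]


def scan_ranges(start: str, stop: str, buckets: int = NUM_BUCKETS) -> list:
    """A logical range [start, stop) becomes one physical range per bucket.

    This is the cost of salting: global order is lost, so an ordered read
    needs one scan per bucket and a merge of the results on the client.
    """
    return [(f"{b:02d}|{start}", f"{b:02d}|{stop}") for b in range(buckets)]


if __name__ == "__main__":
    for key in ["2015-04-07T04:00:11", "2015-04-07T04:00:12", "2015-04-07T04:00:13"]:
        print(salt_key(key))
    print(split_points())
    print(scan_ranges("2015-04-07", "2015-04-08"))
```

Without the salt, consecutive keys such as timestamps would land in the same 
region; with it, point reads still work, but every range read turns into 
NUM_BUCKETS scans.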

Because of these differences, I started to question what the real advantage 
of having ordered data across nodes is. For most applications, wouldn't 
pre-splitting be better? For a large number of applications, designing data 
without relying on order across nodes seems better, as 1 - it might be possible 
and 2 - when it's not possible you can either use another table as an index or 
index the data in Solr/ES/Lucene and read from there in more complex scenarios. 
Maybe ordering has some advantage in specific cases where you want low latency 
between writing data and reading it, and you read much more than you write, 
maybe...

Since acting as a sorted map was a core design decision of HBase, I think there 
must be reasons behind this decision, and it seems I am not able to figure 
them out... Could you please point them out? 

I am asking this to improve my architectural understanding of HBase, as 
I might be getting the wrong impression that there is no advantage in using 
a post-splitting solution, when maybe it's just a lack of knowledge on my 
part about the technology.

Best regards,
Marcelo.
