Here's my take on this matter:

In the current situation, there isn't any good solution to the data warehousing 
solution you want in large scale. Impala and Drill are both projects that heads 
in this direction, but they still have a way to go and are not production ready 
yet. If you can stay at MySQL for moment, than stay there, or go for Hive but 
prepare a very large cluster of computers to handle the load.

A normal data warehouse as you describe is composed of DIMS (dimensions) and 
FACT tables. Representing this as is in HBase is a mess, since this will 
require you to do joins across the clusters - i.e. RPC calls and lots of them 
between Region Servers - which will slow down your queries to a halt (unless 
you want your user to wait 10-15 minutes).

The more sane approach then is do normalize the data - i.e. have a table 
containing the attributes of all dimensions in the FACT table, as one big fat 
FACT table - and save it to HDFS or HBase. Both have a partition key - your 
primary key to query upon (e.g. timestamp-customerId, timestamp-deviceId). You 
can query the data, after you filter it by the partition key, thus scanning 
only a portion of it, and then on each datanode/RS, filtering by the dimensions 
attributes as required by your query. If your data is distributed evenly across 
your cluster, running this query on multiple nodes at the same time can 
overcome the downside of fully reading the files/rows belonging to the 
partition key. You can add the statistical functions you require, such as 
sum,count, and send the rolled up results thus saving bandwidth.

The problem in current software stacks is that there's none that actually does 
what is stated above. Impala is in the right direction, but its yet to be in 
production state, from what I've read. Drill is just starting. Thus you end 
having to write map reduce jobs, which does the described above solution by 
either employing HIVE to get the HDFS files stored by partition key and 
translating you query into MR job, or using other open source solutions such as 
Cascading to ease the burden of writing your own MR Job code.

So in summary, I would stay at Oracle/MySQL until a descent open source 
answering your need will arrive - which I guess will happen during 2013/2014. 
If you can't - you will be forced to write your own custom solution, tailored 
to your queries, based on MR job. You can take a look at Trecul 
(https://github.com/akamai-tech/trecul) to boost processing speed of your Map 
Reduce job.

Asaf


On 13 ?0?3?0?5?0?4?0?6 2012, at 07:57, bigdata <[email protected]> wrote:

> Dear all,
> We have a traditional star-model data warehouse in RDBMS, now we want to 
> transfer it to HBase. After study HBase, I learn that HBase is normally can 
> be query by rowkey.
> 1.full rowkey (fastest)2.rowkey filter (fast)3.column family/qualifier filter 
> (slow)
> How can I design the HBase tables to implement the warehouse functions, 
> like:1.Query by DimensionA2.Query by DimensionA and DimensionB3.Sum, count, 
> distinct ...
> From my opinion, I should create several HBase tables with all combinations 
> of different dimensions as the rowkey. This solution will lead to huge data 
> duplication. Is there any good suggestions to solve it?
> Thanks a lot!
> 
>                                         

Reply via email to