On Thu, May 10, 2012 at 10:16 AM, Kuldeep Chitrakar
<kuldeep.chitra...@synechron.com> wrote:
> Does that mean all data in one BigTable in de-normalized form? Then what's the
> main benefit of using Hive against HBase, as HBase also recommends a highly
> de-normalized BigTable?
>
> Thanks,
> Kuldeep
>
> -----Original Message-----
> From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
> Sent: 10 May 2012 19:24
> To: user@hive.apache.org
> Subject: Re: Dimensional Data Model on Hive
>
> On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar
> <kuldeep.chitra...@synechron.com> wrote:
>> Hi
>>
>> I have a data warehouse implementation for click stream data analysis on
>> an RDBMS. It's a star schema (dimensions and facts).
>>
>> Now if I want to move to Hive, do I need to create the same data model as
>> dimensions and facts and join them?
>>
>> Or should I create a big de-normalized table which contains all textual
>> attributes from all dimensions? If so, how do we handle SCD 2 type dimensions
>> in Hive?
>>
>> It's a very basic question but I am just confused on this.
>>
>> Thanks,
>>
>> Kuldeep
>
> While Hive is sometimes referred to as a data warehouse, you usually
> want to avoid data warehouse concepts like star schema. There are a
> number of reasons for this:
> 1) No unique constraints
> 2) Limited index capabilities
> 3) Map-side joins are optimal when a single table is small
> 4) Most join types, while they generalize into map reduce, are much different
> than a join in single-node databases
>
> In most situations I advise going the "nosql route" and de-normalizing
> almost everything. Optimize for scanning.

Q: Does that mean all data in one BigTable in de-normalized form?

A: No. I qualified this by saying "most". I am not advocating one large
table; every situation is different. But generally a star schema is going
to be very difficult to implement in Hive and will have fewer benefits
than it would in most RDBMSs.

Q: What is the main benefit of using Hive against HBase?

A: I am not sure what you mean by "against". If you mean why would I
choose one and not the other: HBase is designed for low-latency (< 20 ms)
put, get, and scan operations. Hive is a declarative SQL-like language
that "queries" multi-GB or TB sized files in Hadoop. There is also a
storage handler implementation that allows you to query HBase data from
Hive, if that is what you mean by "against".
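
To make point 3 from the quoted message a bit more concrete (map-side
joins work best when one table is small), here is a rough Python sketch
of the idea. It is only an illustration, not how Hive implements it: the
names map_side_join, dim_users, and clicks are made up for this example,
and the small "dimension" is assumed to fit in memory on every mapper.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical row shapes for illustration only:
#   fact row  = (user_id, url)
#   dimension = {user_id: country}
FactRow = Tuple[int, str]
JoinedRow = Tuple[int, str, str]


def map_side_join(
    fact_rows: Iterable[FactRow], small_dim: Dict[int, str]
) -> Iterator[JoinedRow]:
    """Stream the large side once and probe an in-memory copy of the small side.

    No shuffle or sort is needed because every "mapper" holds the whole
    small table; this only works while that table stays small.
    """
    for user_id, url in fact_rows:
        country = small_dim.get(user_id)
        if country is not None:  # inner-join semantics: drop unmatched facts
            yield (user_id, url, country)


# Small dimension table, assumed to fit in memory.
dim_users = {1: "US", 2: "IN"}

# Large fact data, scanned sequentially (a list here, a file split in practice).
clicks = [(1, "/home"), (2, "/cart"), (3, "/home"), (1, "/checkout")]

joined = list(map_side_join(clicks, dim_users))
```

Once both sides are large, the in-memory lookup no longer fits and the
join has to go through a full shuffle, which is part of why I suggest
de-normalizing and optimizing for scanning.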