To Mohammad's point: you can use HBase for quick scans of the data, Hive for your longer-running jobs, and Impala over the two for quick ad hoc searches.
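As a rough illustration of the "quick scans" case, the date-deviceid row key proposed in the quoted thread can be modelled in plain Python, with a sorted list of keys standing in for HBase's sorted row keys. This is only a sketch under stated assumptions, not HBase client code: `build_tables`, `prefix_scan`, and the three counting functions are hypothetical names, storing the first login date in table B is an assumption (the thread leaves that column open), and "accumulated device login" is read here as the number of distinct devices seen up to and including a given day.

```python
from bisect import bisect_left

# Sample rows from the thread: (login_time, device_id).
LOGINS = [
    ("2012-12-12 12:12:12", "abcdef"),
    ("2012-12-12 19:12:12", "abcdef"),
    ("2012-12-13 10:10:10", "defdaf"),
]


def build_tables(logins):
    """Build in-memory stand-ins for table A and table B."""
    table_a = {}  # row key "date-deviceid" -> list of login times
    table_b = {}  # row key "deviceid" -> first login date (assumed column)
    for login_time, device_id in sorted(logins):
        day = login_time[:10]
        table_a.setdefault(f"{day}-{device_id}", []).append(login_time)
        # setdefault keeps the earliest day because logins are sorted.
        table_b.setdefault(device_id, day)
    return table_a, table_b


def prefix_scan(table, prefix):
    """Return the row keys that start with prefix, in sorted order.

    Sorting on every call is wasteful; it only mimics the fact that
    a store with sorted row keys can seek to the prefix and stop early.
    """
    keys = sorted(table)
    matches = []
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


def devices_per_day(table_a, day):
    """Req #1: one row per (day, device), so count rows under the day prefix."""
    return len(prefix_scan(table_a, day + "-"))


def new_devices(table_b, day):
    """Req #2: devices whose first login fell on this day (full pass over B)."""
    return sum(1 for first_day in table_b.values() if first_day == day)


def accumulated_devices(table_b, day):
    """Req #3 (assumed meaning): devices first seen on or before this day."""
    return sum(1 for first_day in table_b.values() if first_day <= day)


if __name__ == "__main__":
    table_a, table_b = build_tables(LOGINS)
    for day in ("2012-12-12", "2012-12-13"):
        print(day,
              devices_per_day(table_a, day),
              new_devices(table_b, day),
              accumulated_devices(table_b, day))
```

In this sketch only requirement 1 benefits from the row-key prefix; requirements 2 and 3 walk all of table B, which is the kind of longer-running work the thread suggests handing to Hive.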

On Thu, Dec 13, 2012 at 9:44 AM, Mohammad Tariq <[email protected]> wrote:

> I am not saying HBase is not good. My point was to consider Hive as
> well. Think about the approach keeping both tools in mind and decide.
> I just provided an option keeping in mind the available built-in Hive
> features. I would like to add one more point here: you can map your
> HBase tables to Hive.
>
> Regards,
> Mohammad Tariq
>
> On Thu, Dec 13, 2012 at 7:58 PM, bigdata <[email protected]> wrote:
>
> > Hi, Tariq
> > Thanks for your feedback. Actually, we now have two ways to reach
> > the target: by Hive and by HBase. Could you tell me why HBase is not
> > good for my requirements? Or what's the problem in my solution?
> > Thanks.
> >
> > > From: [email protected]
> > > Date: Thu, 13 Dec 2012 15:43:25 +0530
> > > Subject: Re: How to design a data warehouse in HBase?
> > > To: [email protected]
> > >
> > > Both have got different purposes. Normally people say that Hive is
> > > slow; that's just because it uses MapReduce under the hood. And I'm
> > > sure that if the data stored in HBase is very huge, nobody would
> > > write sequential programs for Get or Scan. Instead they will write
> > > MapReduce jobs or do something similar.
> > >
> > > My point is that nothing can be 100% real time. Is that what you
> > > want? If that is the case I would never suggest Hadoop in the first
> > > place, as it's a batch processing system and cannot be used like an
> > > OLTP system, unless you have thought of some additional stuff.
> > > Since you are talking about a warehouse, I am assuming you are
> > > going to store and process gigantic amounts of data. That's the
> > > only reason I had suggested Hive.
> > >
> > > The whole point is that not everything is a solution for
> > > everything. One size doesn't fit all. First, we need to analyze our
> > > particular use case. The person who says Hive is slow might be
> > > correct, but only for his scenario.
> > >
> > > HTH
> > >
> > > Regards,
> > > Mohammad Tariq
> > >
> > > On Thu, Dec 13, 2012 at 3:17 PM, bigdata <[email protected]> wrote:
> > >
> > > > Hi,
> > > > I've heard that Hive's performance is too low: it accesses HDFS
> > > > files and scans all the data to find one record. Is it true? And
> > > > HBase is much faster than it.
> > > >
> > > > > From: [email protected]
> > > > > Date: Thu, 13 Dec 2012 15:12:25 +0530
> > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > To: [email protected]
> > > > >
> > > > > Hi there,
> > > > >
> > > > > If you are really planning for a warehousing solution then I
> > > > > would suggest you have a look at Apache Hive. It provides you
> > > > > warehousing capabilities on top of a Hadoop cluster. Along with
> > > > > that it also provides an SQLish interface to the data stored in
> > > > > your warehouse, which would be very helpful to you, in case you
> > > > > are coming from an SQL background.
> > > > >
> > > > > HTH
> > > > >
> > > > > Regards,
> > > > > Mohammad Tariq
> > > > >
> > > > > On Thu, Dec 13, 2012 at 2:43 PM, bigdata <[email protected]> wrote:
> > > > >
> > > > > > Thanks. I think a real example is better for me to understand
> > > > > > your suggestions.
> > > > > > Now I have a relational table:
> > > > > >
> > > > > > ID  LoginTime            DeviceID
> > > > > > 1   2012-12-12 12:12:12  abcdef
> > > > > > 2   2012-12-12 19:12:12  abcdef
> > > > > > 3   2012-12-13 10:10:10  defdaf
> > > > > >
> > > > > > There are several requirements about this table:
> > > > > > 1. How many devices log in each day?
> > > > > > 2. For one day, how many new devices log in? (never logged in
> > > > > >    before)
> > > > > > 3. For one day, how many accumulated devices log in?
> > > > > > How can I design HBase tables to calculate these data? Now my
> > > > > > solution is:
> > > > > >
> > > > > > table A:
> > > > > >   rowkey: date-deviceid
> > > > > >   column family: login
> > > > > >   column qualifier: 2012-12-12 12:12:12/2012-12-12 19:12:12....
> > > > > >
> > > > > > table B:
> > > > > >   rowkey: deviceid
> > > > > >   column family: null or anyvalue
> > > > > >
> > > > > > For req#1, I can scan table A and use prefixfilter(rowkey) to
> > > > > > check one special date, and get the record count.
> > > > > > For req#2, I get table B with each deviceid, and count the
> > > > > > result.
> > > > > > For req#3, count table A with prefixfilter like req#1.
> > > > > > Is it OK? Or are there other, better solutions?
> > > > > > Thanks!!
> > > > > >
> > > > > > > CC: [email protected]
> > > > > > > From: [email protected]
> > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > Date: Thu, 13 Dec 2012 08:43:31 +0000
> > > > > > > To: [email protected]
> > > > > > >
> > > > > > > You need to spend a bit of time on Schema design.
> > > > > > > You need to flatten your Schema...
> > > > > > > Implement some secondary indexing to improve join
> > > > > > > performance...
> > > > > > >
> > > > > > > Depends on what you want to do... There are other options
> > > > > > > too...
> > > > > > >
> > > > > > > Sent from a remote device. Please excuse any typos...
> > > > > > >
> > > > > > > Mike Segel
> > > > > > >
> > > > > > > On Dec 13, 2012, at 7:09 AM, lars hofhansl <[email protected]> wrote:
> > > > > > >
> > > > > > > > For OLAP type queries you will generally be better off
> > > > > > > > with a truly column oriented database.
> > > > > > > > You can probably shoehorn HBase into this, but it wasn't
> > > > > > > > really designed with raw scan performance along single
> > > > > > > > columns in mind.
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: bigdata <[email protected]>
> > > > > > > > To: "[email protected]" <[email protected]>
> > > > > > > > Sent: Wednesday, December 12, 2012 9:57 PM
> > > > > > > > Subject: How to design a data warehouse in HBase?
> > > > > > > >
> > > > > > > > Dear all,
> > > > > > > > We have a traditional star-model data warehouse in an
> > > > > > > > RDBMS, and now we want to transfer it to HBase. After
> > > > > > > > studying HBase, I learned that HBase can normally be
> > > > > > > > queried by rowkey:
> > > > > > > > 1. full rowkey (fastest)
> > > > > > > > 2. rowkey filter (fast)
> > > > > > > > 3. column family/qualifier filter (slow)
> > > > > > > > How can I design the HBase tables to implement the
> > > > > > > > warehouse functions, like:
> > > > > > > > 1. Query by DimensionA
> > > > > > > > 2. Query by DimensionA and DimensionB
> > > > > > > > 3. Sum, count, distinct ...
> > > > > > > > In my opinion, I should create several HBase tables with
> > > > > > > > all combinations of different dimensions as the rowkey.
> > > > > > > > This solution will lead to huge data duplication. Are
> > > > > > > > there any good suggestions to solve it?
> > > > > > > > Thanks a lot!

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
