Re: How to design a data warehouse in HBase?

Kevin O'dell Thu, 13 Dec 2012 07:30:47 -0800

Mohammad,

  I am not sure you are thinking about Impala correctly.  It still reads its
data from HDFS, so data that grows over time is fine.  You do not need to
tune specially for CPU, storage, or network.  Typically with Impala you
will be bound by disk I/O, since it relies on data locality.  You can also
use Snappy, GZip, or BZip2 compression to reduce the amount of data you
store.  You will not need to upgrade your hardware frequently.


On Thu, Dec 13, 2012 at 10:06 AM, Mohammad Tariq <[email protected]> wrote:

> Oh yes..Impala..good point by Kevin.
>
> Kevin : Would it be appropriate to say that I should go for Impala if my
> data is not going to increase dramatically over time, or if I have to work
> on only a subset of my big data? Since Impala uses MPP, it may
> require specialized hardware tuned for CPU, storage and network performance
> for better results, which could become a problem if I have to upgrade the
> hardware frequently because of the growing data.
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Thu, Dec 13, 2012 at 8:17 PM, Kevin O'dell <[email protected]> wrote:
>
> > To Mohammad's point: you can use HBase for quick scans of the data,
> > Hive for your longer-running jobs, and Impala on top of the two for
> > quick ad hoc searches.
> >
> > On Thu, Dec 13, 2012 at 9:44 AM, Mohammad Tariq <[email protected]>
> > wrote:
> >
> > > I am not saying HBase is not good. My point was to consider Hive as
> > > well. Think about the approach with both tools in mind and then decide.
> > > I just provided an option, keeping in mind the built-in Hive features.
> > > I would like to add one more point here: you can map your HBase tables
> > > to Hive.
> > >
> > > Regards,
> > >     Mohammad Tariq
> > >
> > >
> > >
> > > On Thu, Dec 13, 2012 at 7:58 PM, bigdata <[email protected]> wrote:
> > >
> > > > Hi, Tariq
> > > > Thanks for your feedback. Actually, we now have two ways to reach the
> > > > target: Hive and HBase. Could you tell me why HBase is not a good fit
> > > > for my requirements? Or what is the problem with my solution?
> > > > Thanks.
> > > >
> > > > > From: [email protected]
> > > > > Date: Thu, 13 Dec 2012 15:43:25 +0530
> > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > To: [email protected]
> > > > >
> > > > > Both have got different purposes. Normally people say that Hive is
> > > > > slow, but that's just because it uses MapReduce under the hood. And
> > > > > I'm sure that if the data stored in HBase is very large, nobody would
> > > > > write sequential programs for Get or Scan. Instead they would write
> > > > > MapReduce jobs or do something similar.
> > > > >
> > > > > My point is that nothing can be 100% real time. Is that what you
> > > > > want? If that is the case, I would never suggest Hadoop in the first
> > > > > place, as it's a batch processing system and cannot be used like an
> > > > > OLTP system unless you have thought of some additional stuff. Since
> > > > > you are talking about a warehouse, I am assuming you are going to
> > > > > store and process gigantic amounts of data. That's the only reason I
> > > > > suggested Hive.
> > > > >
> > > > > The whole point is that not everything is a solution for everything.
> > > > > One size doesn't fit all. First, we need to analyze our particular
> > > > > use case. The person who says Hive is slow might be correct, but only
> > > > > for his scenario.
> > > > >
> > > > > HTH
> > > > >
> > > > > Regards,
> > > > >     Mohammad Tariq
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Dec 13, 2012 at 3:17 PM, bigdata <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > > I have heard that Hive's performance is too low: it accesses HDFS
> > > > > > files and scans all the data to find one record. Is that true? And
> > > > > > is HBase much faster than Hive?
> > > > > >
> > > > > >
> > > > > > > From: [email protected]
> > > > > > > Date: Thu, 13 Dec 2012 15:12:25 +0530
> > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > To: [email protected]
> > > > > > >
> > > > > > > Hi there,
> > > > > > >
> > > > > > >    If you are really planning a warehousing solution, then I
> > > > > > > would suggest you have a look at Apache Hive. It provides
> > > > > > > warehousing capabilities on top of a Hadoop cluster. It also
> > > > > > > provides an SQL-like interface to the data stored in your
> > > > > > > warehouse, which would be very helpful if you are coming from an
> > > > > > > SQL background.
> > > > > > >
> > > > > > > HTH
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >     Mohammad Tariq
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Dec 13, 2012 at 2:43 PM, bigdata <[email protected]> wrote:
> > > > > > >
> > > > > > > > Thanks. I think a real example would help me understand your
> > > > > > > > suggestions.
> > > > > > > >
> > > > > > > > Now I have a relational table:
> > > > > > > >
> > > > > > > > ID   LoginTime             DeviceID
> > > > > > > > 1    2012-12-12 12:12:12   abcdef
> > > > > > > > 2    2012-12-12 19:12:12   abcdef
> > > > > > > > 3    2012-12-13 10:10:10   defdaf
> > > > > > > >
> > > > > > > > There are several requirements for this table:
> > > > > > > > 1. How many devices log in each day?
> > > > > > > > 2. For one day, how many new devices log in (devices that
> > > > > > > >    never logged in before)?
> > > > > > > > 3. For one day, how many accumulated devices have logged in?
> > > > > > > >
> > > > > > > > How can I design HBase tables to calculate these figures? My
> > > > > > > > current solution is:
> > > > > > > >
> > > > > > > > table A:
> > > > > > > >   rowkey: date-deviceid
> > > > > > > >   column family: login
> > > > > > > >   column qualifier: 2012-12-12 12:12:12/2012-12-12 19:12:12....
> > > > > > > >
> > > > > > > > table B:
> > > > > > > >   rowkey: deviceid
> > > > > > > >   column family: null or any value
> > > > > > > >
> > > > > > > > For req#1, I can scan table A with a prefixfilter(rowkey) for
> > > > > > > > one specific date and count the records.
> > > > > > > > For req#2, I do a get on table B for each deviceid and count
> > > > > > > > the results.
> > > > > > > > For req#3, I count table A with a prefixfilter, as in req#1.
> > > > > > > > Is this OK? Or are there better solutions?
> > > > > > > > Thanks!!
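
One rough way to sanity-check the table A / table B layout described in the
message above is to simulate it in memory. The Python sketch below is not
HBase client code: plain dicts stand in for the two tables, and prefix_scan
imitates a rowkey prefix scan over sorted keys. Storing the first login date
as the value in table B is an assumption added here (the message leaves that
column open), req#3 is read as "distinct devices seen up to and including
that day", and every function name is made up for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

# Sample rows from the relational table in the message: (ID, LoginTime, DeviceID).
LOGINS = [
    (1, "2012-12-12 12:12:12", "abcdef"),
    (2, "2012-12-12 19:12:12", "abcdef"),
    (3, "2012-12-13 10:10:10", "defdaf"),
]


def build_tables(logins):
    """Build in-memory stand-ins for table A and table B."""
    table_a = defaultdict(list)  # rowkey "date-deviceid" -> login times (qualifiers)
    table_b = {}                 # rowkey deviceid -> first login date (assumed value)
    # Process logins in time order so table B keeps the earliest date per device.
    for _id, login_time, device in sorted(logins, key=lambda row: row[1]):
        date = login_time[:10]
        table_a[f"{date}-{device}"].append(login_time)
        table_b.setdefault(device, date)
    return dict(table_a), table_b


def prefix_scan(table, prefix):
    """Return the rowkeys that start with prefix, walking keys in sorted order."""
    keys = sorted(table)
    start = bisect_left(keys, prefix)
    matched = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        matched.append(key)
    return matched


def devices_on(table_a, date):
    """req#1: one row per (date, device), so count the rows under the date prefix."""
    return len(prefix_scan(table_a, date + "-"))


def new_devices_on(table_b, date):
    """req#2: devices whose first recorded login date is this date."""
    return sum(1 for first_seen in table_b.values() if first_seen == date)


def accumulated_devices(table_b, date):
    """req#3: devices first seen on or before this date."""
    return sum(1 for first_seen in table_b.values() if first_seen <= date)
```

With the three sample rows, devices_on(table_a, "2012-12-12") gives 1,
new_devices_on(table_b, "2012-12-13") gives 1, and
accumulated_devices(table_b, "2012-12-13") gives 2. Note that a prefix scan
of table A for a single day only sees that day's devices, which is why this
sketch answers req#3 from table B instead.
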
> > > > > > > >
> > > > > > > > > CC: [email protected]
> > > > > > > > > From: [email protected]
> > > > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > > > Date: Thu, 13 Dec 2012 08:43:31 +0000
> > > > > > > > > To: [email protected]
> > > > > > > > >
> > > > > > > > > You need to spend a bit of time on schema design.
> > > > > > > > > You need to flatten your schema...
> > > > > > > > > Implement some secondary indexing to improve join
> > > > > > > > > performance...
> > > > > > > > >
> > > > > > > > > Depends on what you want to do... There are other options
> > > > > > > > > too...
> > > > > > > > >
> > > > > > > > > Sent from a remote device. Please excuse any typos...
> > > > > > > > >
> > > > > > > > > Mike Segel
> > > > > > > > >
> > > > > > > > > On Dec 13, 2012, at 7:09 AM, lars hofhansl <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > For OLAP-type queries you will generally be better off
> > > > > > > > > > with a truly column-oriented database.
> > > > > > > > > > You can probably shoehorn HBase into this, but it wasn't
> > > > > > > > > > really designed with raw scan performance along single
> > > > > > > > > > columns in mind.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ________________________________
> > > > > > > > > > From: bigdata <[email protected]>
> > > > > > > > > > To: "[email protected]" <[email protected]>
> > > > > > > > > > Sent: Wednesday, December 12, 2012 9:57 PM
> > > > > > > > > > Subject: How to design a data warehouse in HBase?
> > > > > > > > > >
> > > > > > > > > > Dear all,
> > > > > > > > > > We have a traditional star-model data warehouse in an
> > > > > > > > > > RDBMS, and now we want to move it to HBase. After
> > > > > > > > > > studying HBase, I learned that HBase is normally queried
> > > > > > > > > > by rowkey:
> > > > > > > > > > 1. full rowkey (fastest)
> > > > > > > > > > 2. rowkey filter (fast)
> > > > > > > > > > 3. column family/qualifier filter (slow)
> > > > > > > > > > How can I design the HBase tables to implement the
> > > > > > > > > > warehouse functions, like:
> > > > > > > > > > 1. Query by DimensionA
> > > > > > > > > > 2. Query by DimensionA and DimensionB
> > > > > > > > > > 3. Sum, count, distinct ...
> > > > > > > > > > In my opinion, I should create several HBase tables with
> > > > > > > > > > all combinations of different dimensions as the rowkey.
> > > > > > > > > > This solution will lead to huge data duplication. Are
> > > > > > > > > > there any good suggestions to solve it?
> > > > > > > > > > Thanks a lot!
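
The duplication concern in the question above can be made concrete with a
small sketch. The Python below is an in-memory illustration, not HBase code:
a sorted list of (rowkey, fact) pairs stands in for a table, and the
separator, sample facts, and function names are all hypothetical. It shows
that a composite rowkey of DimensionA then DimensionB supports prefix scans
by DimensionA and by DimensionA plus DimensionB, while a query by DimensionB
alone needs a second copy of the data keyed in the other order.

```python
from bisect import bisect_left

SEP = "|"  # hypothetical separator between rowkey parts

# Hypothetical fact rows: (dimension_a, dimension_b, measure).
FACTS = [
    ("east", "phone", 3),
    ("east", "tablet", 5),
    ("west", "phone", 7),
]


def build_table(facts, order):
    """Key each fact by the dimensions at the given column indexes, in that order.

    A zero-padded sequence number is appended so rowkeys stay unique.
    """
    rows = {}
    for seq, fact in enumerate(facts):
        parts = [fact[i] for i in order] + [f"{seq:08d}"]
        rows[SEP.join(parts)] = fact
    return sorted(rows.items())


def scan_prefix(table, prefix):
    """Yield facts whose rowkey starts with prefix, in rowkey order."""
    keys = [key for key, _ in table]
    i = bisect_left(keys, prefix)
    while i < len(table) and table[i][0].startswith(prefix):
        yield table[i][1]
        i += 1


def aggregate(rows):
    """Compute count, sum of the measure, and distinct DimensionB values."""
    rows = list(rows)
    return {
        "count": len(rows),
        "sum": sum(row[2] for row in rows),
        "distinct_b": len({row[1] for row in rows}),
    }


by_a_then_b = build_table(FACTS, (0, 1))  # serves "by A" and "by A and B"
by_b_then_a = build_table(FACTS, (1, 0))  # duplicate copy, serves "by B"
```

For example, aggregate(scan_prefix(by_a_then_b, "east|")) returns a count of
2 and a sum of 8, while scan_prefix(by_a_then_b, "phone|") finds nothing
because "phone" is not a leading key part in that table; the same query
against by_b_then_a returns two rows. Each extra leading dimension therefore
costs one more copy of the facts, which is the trade-off the question
describes.
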
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Kevin O'Dell
> > Customer Operations Engineer, Cloudera
> >
>



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera
