Re: Hive for large statistics tables?

Benjamin Fonze Tue, 27 Sep 2011 10:13:29 -0700

Hi guys,

Thank you for your time and your feedback.


That's not the answer I wanted to see, but it does look like Hadoop/Hive
might not be the best choice for the specific application I'm trying to
develop.

I'll keep investigating in any case and will probably run a test with Hive
just to see how it performs.

Thanks again,
Benja.


On Tue, Sep 27, 2011 at 4:49 PM, Mark Grover <mgro...@oanda.com> wrote:

> Forgot to add the link to the video:
> http://vimeo.com/8689411
>
> Hi Benjamin,
> Wojciech raised some good points but I believe that Hive/Hadoop can still
> be useful in your case.
>
> The MySQL solution that you presently have is not scalable. Hive is not a
> substitute for MySQL; it runs on Hadoop, which is a distributed batch
> processing system. It will allow you to crunch *a lot* of data, more than a
> stand-alone MySQL server would be able to deal with.
>
> Many people (including myself) use Hive/Hadoop in conjunction with a
> relational DB. They do much of the number crunching via Hive/Hadoop and then
> write the aggregates to a (fast-access) relational DB to provide quick
> access to those results. However, as Wojciech pointed out, ad-hoc queries on
> Hive will, in general, take longer than similar queries in MySQL. Hive was
> designed to deal with large amounts of data, so that's just an overhead we
> have to live with.
>
> I'd suggest doing some background research on how much data you have and
> whether Hive/Hadoop really makes sense. Here is a good video from Alex
> Loddengaard to get you started. A good slide (at 15:00) compares Hadoop with
> an RDBMS. Later in the same video (at 37:30) there is an example of a typical
> workflow with Hive and a relational DB.
>
> Check it out and good luck!
>
> Mark
>
> ----- Original Message -----
> From: "Wojciech Langiewicz" <wlangiew...@gmail.com>
> To: user@hive.apache.org
> Sent: Tuesday, September 27, 2011 9:33:53 AM
> Subject: Re: Hive for large statistics tables?
>
> Hello,
> I'm using Hive to query data like yours. In my case I have about 300-500 GB
> of data per day, so it is much larger. We use Flume to load data into
> Hive - data is rolled every day (this can be changed).
>
> Hive queries, ad-hoc or scheduled, usually take at least 10-20s
> (possibly hours), so Hive won't speed up your processing. Hive shows its
> power when you have more data than several GB per month.
>
> I think that in your case Hive is not a good solution; you'll be better
> off using more powerful MySQL servers.
>
> On 27.09.2011 11:14, Benjamin Fonze wrote:
> > Dear All,
> >
> > I'm new to this list, and I hope I'm sending this to the right place.
> >
> > I'm currently using MySQL to store a large amount of visitor statistics
> > (visits, clicks, etc.).
> >
> > Basically, each visit is logged in a text file, and every 15 minutes a job
> > consolidates it into MySQL, into tables that look like this:
> >
> > COUNTRY | DATE | USER_AGENT | REFERRER | SEARCH | ... | NUM_HITS
> >
> > This generates millions of rows a month, and several GB of data. When
> > querying these tables, it typically takes a few seconds. (Yes, there
> > are indexes, etc.)
> >
> > I was thinking of moving all that data to a NoSQL DB like Hive, but I want to
> > make sure it is suited to my purpose. Can you confirm that Hive is a good
> > fit for such statistical data? More importantly, can you confirm that ad-hoc
> > queries on that data will be much faster than in MySQL?
> >
> > Thanks in advance!
> >
> > Benjamin.
> >
>
>
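
The workflow Mark describes (crunch the raw data in batch, then write only the aggregates to a fast relational DB for querying) can be sketched roughly as follows. This is only an illustration under assumptions: plain Python stands in for the Hive/Hadoop batch step, an in-memory SQLite database stands in for MySQL, and the table name `visit_stats`, the function names, and the reduced set of columns are hypothetical, loosely modelled on the COUNTRY | DATE | ... | NUM_HITS layout from the original question.

```python
import sqlite3
from collections import Counter


def aggregate_visits(visits):
    """Batch step (stand-in for a Hive/Hadoop job): count hits per dimension tuple."""
    return Counter(
        (v["country"], v["date"], v["user_agent"], v["referrer"]) for v in visits
    )


def store_aggregates(conn, counts):
    """Write the pre-computed aggregates to the relational DB (SQLite stands in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visit_stats ("
        "country TEXT, date TEXT, user_agent TEXT, referrer TEXT, num_hits INTEGER)"
    )
    # Each batch appends its own rows; queries sum NUM_HITS across batches.
    conn.executemany(
        "INSERT INTO visit_stats VALUES (?, ?, ?, ?, ?)",
        [key + (num_hits,) for key, num_hits in counts.items()],
    )
    conn.commit()


def hits_by_country(conn):
    """Quick-access query that only touches the small aggregate table."""
    cursor = conn.execute(
        "SELECT country, SUM(num_hits) FROM visit_stats "
        "GROUP BY country ORDER BY country"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"country": "US", "date": "2011-09-27", "user_agent": "Firefox", "referrer": "a.example"},
        {"country": "US", "date": "2011-09-27", "user_agent": "Firefox", "referrer": "a.example"},
        {"country": "BE", "date": "2011-09-27", "user_agent": "Chrome", "referrer": "b.example"},
    ]
    conn = sqlite3.connect(":memory:")
    store_aggregates(conn, aggregate_visits(sample))
    print(hits_by_country(conn))
```

The point of the split is that the expensive pass over the raw visit records runs as a scheduled batch, while interactive queries read only the much smaller aggregate table.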
