Hi guys,

Thank you for your time and your feedback.
That's not the answer I wanted to see, but it does look like Hadoop/Hive
might not be the best choice for the specific application I'm trying to
develop. I'll keep investigating in any case and will probably run a test
with Hive just to see how it performs.

Thanks again,
Benja.

On Tue, Sep 27, 2011 at 4:49 PM, Mark Grover <mgro...@oanda.com> wrote:

> Forgot to add the link to the video:
> http://vimeo.com/8689411
>
> Hi Benjamin,
> Wojciech raised some good points, but I believe that Hive/Hadoop can still
> be useful in your case.
>
> The MySQL solution that you presently have is not scalable. Hive is not a
> substitute for MySQL; it runs on Hadoop, which is a distributed batch
> processing system. It will allow you to crunch *a lot* of data, amounts
> large enough that a stand-alone MySQL server wouldn't be able to deal
> with them.
>
> Many people (including myself) use Hive/Hadoop in conjunction with a
> relational DB. They do much of the number crunching via Hive/Hadoop and
> then write the aggregates to a (fast-access) relational DB to provide
> quick access to those results. However, as Wojciech pointed out, ad-hoc
> queries on Hive would, in general, take longer than similar queries in
> MySQL. Hive was designed to deal with large amounts of data, so that's
> just an overhead we have to live with.
>
> I'd suggest doing some background research on how much data you have and
> whether Hive/Hadoop really makes sense. Here is a good video from Alex
> Loddengaard to get you started. A good slide (at 15:00) compares Hadoop
> with an RDBMS. Later on (at 37:30), in the same video, there is an example
> of a typical workflow with Hive and a relational DB.
>
> Check it out and good luck!
>
> Mark
>
> ----- Original Message -----
> From: "Wojciech Langiewicz" <wlangiew...@gmail.com>
> To: user@hive.apache.org
> Sent: Tuesday, September 27, 2011 9:33:53 AM
> Subject: Re: Hive for large statistics tables?
>
> Hello,
> I'm using Hive to query data like yours. In my case I have about 300 -
> 500GB of data per day, so it is much larger. We use Flume to load data
> into Hive - data is rolled every day (this can be changed).
>
> Hive queries, ad-hoc or scheduled, usually take at least 10-20s or more
> (possibly hours), so it won't speed up your processing. Hive shows its
> power when you reach more data than several GB per month.
>
> I think that in your case Hive is not a good solution; you'll be better
> off using more powerful MySQL servers.
>
> On 27.09.2011 11:14, Benjamin Fonze wrote:
> > Dear All,
> >
> > I'm new to this list, and I hope I'm sending this to the right place.
> >
> > I'm currently using MySQL to store a large amount of visitor
> > statistics. (Visits, clicks, etc.)
> >
> > Basically, each visit is logged in a text file, and every 15 minutes a
> > job consolidates it into MySQL, into tables that look like this:
> >
> > COUNTRY | DATE | USER_AGENT | REFERRER | SEARCH | ... | NUM_HITS
> >
> > This generates millions of rows a month, and several GB of data. Then,
> > when querying these tables, it would typically take a few seconds.
> > (Yes, there are indexes, etc.)
> >
> > I was thinking of moving all that data to a noSQL DB like Hive, but I
> > want to make sure it is adapted to my purpose. Can you confirm that
> > Hive is a good fit for such statistical data? More importantly, can you
> > confirm that ad-hoc queries on that data will be much faster than
> > MySQL?
> >
> > Thanks in advance!
> >
> > Benjamin.