Hello,
I'm using Hive to query data similar to yours. In my case I have about 300-500 GB of data per day, so considerably more. We use Flume to load the data into Hive; the data is rolled once a day (this interval can be changed).
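For context, that daily roll typically shows up on the Hive side as one new partition per day. Below is a minimal sketch of registering such a partition, assuming a date-partitioned table and the standard hive -e CLI; the table name, HDFS path and partition column are my assumptions for illustration, not a description of our actual pipeline:

#!/usr/bin/env python
# Sketch: register one day's rolled data as a Hive partition.
# Table name, HDFS path and partition column (dt) are illustrative assumptions.
import subprocess
from datetime import date

def add_daily_partition(day=None, table="visit_logs", base_path="/flume/visits"):
    d = (day or date.today()).isoformat()
    ddl = (
        "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (dt='%s') LOCATION '%s/%s'"
        % (table, d, base_path, d)
    )
    # The Hive CLI accepts an inline statement via -e.
    subprocess.check_call(["hive", "-e", ddl])

if __name__ == "__main__":
    add_daily_partition()

Once the data is partitioned by day like this, queries that filter on the partition column only scan the matching directories.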

Hive queries, whether ad-hoc or scheduled, usually take at least 10-20 seconds and can run for hours, so it won't speed up your processing. Hive shows its power once you have much more than several GB of data per month.

I think that in your case Hive is not a good solution; you'll be better off with more powerful MySQL servers.

On 27.09.2011 11:14, Benjamin Fonze wrote:
Dear All,

I'm new to this list, and I hope I'm sending this to the right place.

I'm currently using MySQL to store a large amount of visitor statistics
(visits, clicks, etc.).

Basically, each visit is logged to a text file, and every 15 minutes a job
consolidates it into MySQL, into tables that look like this:

COUNTRY | DATE | USER_AGENT | REFERRER | SEARCH | ... | NUM_HITS
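For concreteness, a simplified sketch of such a consolidation job (the log format, table name and upsert strategy shown here are illustrative rather than the real ones):

#!/usr/bin/env python
# Simplified sketch of the 15-minute consolidation step.
# Assumed input: one tab-separated visit per line:
#   country, date, user_agent, referrer, search
from collections import defaultdict

def consolidate(log_path):
    """Aggregate raw visit lines into per-dimension hit counts."""
    counts = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            country, day, user_agent, referrer, search = fields[:5]
            counts[(country, day, user_agent, referrer, search)] += 1
    return counts

def upsert_statements(counts, table="visit_stats"):
    """Emit MySQL upserts; NUM_HITS accumulates across runs.
    Values are not escaped here - a real job would use parameterized queries."""
    for (country, day, ua, ref, search), hits in counts.items():
        yield (
            "INSERT INTO %s (country, date, user_agent, referrer, search, num_hits) "
            "VALUES ('%s', '%s', '%s', '%s', '%s', %d) "
            "ON DUPLICATE KEY UPDATE num_hits = num_hits + %d;"
            % (table, country, day, ua, ref, search, hits, hits)
        )

if __name__ == "__main__":
    for stmt in upsert_statements(consolidate("visits.log")):
        print(stmt)  # in practice executed through a MySQL driver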

This generates millions of rows a month and several GB of data. Querying
these tables typically takes a few seconds. (Yes, there are indexes, etc.)

I was thinking of moving all that data to a NoSQL database like Hive, but I want to
make sure it is suited to my purpose. Can you confirm that Hive is a good
fit for such statistical data? More importantly, can you confirm that ad-hoc
queries on that data will be much faster than in MySQL?

Thanks in advance!

Benjamin.

