Hello,
I'm using Hive to query data like yours. In my case I have about 300-500 GB
of data per day, so it is much larger. We use Flume to load data into
Hive - data is rolled every day (this can be changed).
Hive queries - ad-hoc or scheduled - usually take at least 10-20 seconds,
and possibly hours; it won't speed up your processing. Hive shows its
power when you have far more data than several GB per month.
I think that in your case Hive is not a good solution; you'd be better
off using more powerful MySQL servers.
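To give a concrete idea of the overhead, here is a sketch of how such a table and an ad-hoc query might look in Hive (table name, columns, and path are made up for illustration):

```sql
-- Hypothetical stats table mirroring the MySQL layout described below.
CREATE EXTERNAL TABLE visit_stats (
  country    STRING,
  user_agent STRING,
  referrer   STRING,
  search     STRING,
  num_hits   BIGINT
)
PARTITIONED BY (dt STRING)   -- one partition per day, as rolled by Flume
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/flume/visit_stats';

-- A typical ad-hoc aggregation. Even on a few GB this launches a
-- MapReduce job, so the startup cost alone is tens of seconds.
SELECT country, SUM(num_hits) AS hits
FROM visit_stats
WHERE dt BETWEEN '2011-09-01' AND '2011-09-27'
GROUP BY country
ORDER BY hits DESC
LIMIT 20;
```

That fixed job-launch cost is why Hive only pays off once a single MySQL server can no longer hold or scan the data.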
On 27.09.2011 11:14, Benjamin Fonze wrote:
Dear All,
I'm new to this list, and I hope I'm sending this to the right place.
I'm currently using MySQL to store a large amount of visitor statistics.
(Visits, clicks, etc....)
Basically, each visit is logged to a text file, and every 15 minutes a job
consolidates it into MySQL, into tables that look like this:
COUNTRY | DATE | USER_AGENT | REFERRER | SEARCH | ... | NUM_HITS
This generates millions of rows a month, and several GB of data. Then,
querying these tables typically takes a few seconds. (Yes, there
are indexes, etc.)
I was thinking of moving all that data to a NoSQL DB like Hive, but I want to
make sure it is suited to my purpose. Can you confirm that Hive is a good
fit for such statistical data? More importantly, can you confirm that ad-hoc
queries on that data will be much faster than in MySQL?
Thanks in advance!
Benjamin.