Hi,

You could choose to have the second table (for user IDs) partitioned by date as well.
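A minimal sketch of what that second table's definition could look like in HiveQL (the table and column names below are placeholders for illustration, not from this thread):

-- Hypothetical second copy of the click data, partitioned by user ID and date.
-- Directories are laid out as userid=<id>/dt=<day>/ under the table root.
CREATE TABLE clicks_by_user (
  click_time STRING,
  ip         STRING
)
PARTITIONED BY (userid STRING, dt STRING);

Each distinct (user ID, date) pair then becomes its own directory under the table's root.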
On disk, the partition layout would look like:

table_root/userid=ab/date=2010-12-31/

That way you can split your data set by both a user ID and a date. You can use dynamic partitions to transform the existing date-partitioned table into the userid/date-partitioned table (a sketch of this follows below the quoted message). This table could help you execute fast queries against both a particular user and a particular date (or date range). Hive does partition pruning based on the WHERE clause.

Thanks,
Vaibhav

-----Original Message-----
From: Mark Grover [mailto:mgro...@oanda.com]
Sent: Friday, September 02, 2011 10:19 AM
To: user@hive.apache.org
Cc: Baiju Devani; Bob Tiernay
Subject: Best practices for storing data on Hive

Hello folks,

I am fairly new to Hive and am wondering if you could share some of the best practices for storing and querying data with Hive. Here is an example of the problem I am trying to solve.

The traffic to our website is logged in files that contain information about clicks from various users. Simplified, the log file looks like:

t_1, ip_1, userid_1
t_2, ip_2, userid_2
t_3, ip_3, userid_3
...

where t_i represents the time of the click, ip_i represents the IP address the click originated from, and userid_i represents the user ID of the user.

Since the clicks are logged on an ongoing basis, partitioning our Hive table by day seemed like the obvious choice. Every night we upload the data from the previous day into a new partition.

However, we also want the capability to find all log lines corresponding to a particular user. With our present partitioning scheme, all day partitions are searched for that user ID, and this takes a long time. I am looking for ideas/suggestions/thoughts/comments on how to reduce this time.

As a solution, I am thinking that perhaps we could have two independent tables: one that stores data partitioned by day, and the other partitioned by user ID. With the second table partitioned by user ID, I will have to find some way of maintaining the partitions, since Hive doesn't support appending to existing files. Also, this seems suboptimal, since we are doubling the amount of data that we store.

What do you folks think of this idea? Do you have any other suggestions on how we can approach this problem? What have other people in similar situations done? Please share.

Thank you in advance!
Mark
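To make the suggestion above concrete, here is a rough HiveQL sketch of the kind of dynamic-partition load and partition-pruned lookup Vaibhav describes. The table names (clicks_by_date for the existing date-partitioned table, clicks_by_user for the new one), column names, and dates are assumptions for illustration, not from the original thread:

-- Hypothetical existing table: one partition per day, as described in the question.
CREATE TABLE clicks_by_date (
  click_time STRING,
  ip         STRING,
  userid     STRING
)
PARTITIONED BY (dt STRING);

-- Enable dynamic partitioning so the userid/dt values are taken from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Copy one day of clicks into the userid/date-partitioned table.
-- The dynamic partition columns (userid, dt) must come last in the SELECT,
-- in the same order as in the PARTITION clause.
INSERT OVERWRITE TABLE clicks_by_user PARTITION (userid, dt)
SELECT click_time, ip, userid, dt
FROM clicks_by_date
WHERE dt = '2010-12-31';

-- Partition pruning from the WHERE clause: only the directories under
-- userid=ab/ for the matching date range are scanned, not every day.
SELECT click_time, ip
FROM clicks_by_user
WHERE userid = 'ab'
  AND dt >= '2010-12-01' AND dt <= '2010-12-31';

One thing to watch with this layout is that the number of partitions grows with the number of distinct user IDs, so limits such as hive.exec.max.dynamic.partitions (and the per-node variant) may need to be raised for the load to succeed.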