Re: Best practices for storing data on Hive

Mark Grover Tue, 06 Sep 2011 12:39:47 -0700

Thanks for the response, wd.

I would REALLY APPRECIATE if other people can share their views as well.

Here are the possible solutions that I have thought about to the problem(see original email for description of problem):

1) Multiple partitions: We would partition the table by day and userId.However, given the amount of users that visit our website (hundreds ofthousands of unique users every day), this would lead to a large numberof partitions (and rather small file sizes, ranging from a couple ofbytes to a couple of KB). From the documentation I've read online, itseems that Hive/Hadoop weren't designed to deal with such small filesizes and such a situation should be avoided if possible.We had a scenario previously where we were partitioning by day and hourand because of the sheer number of partitions queries like "select *from <table> LIMIT 1;" were taking very long and even failed because of"Java out of Heap space" errors. My guess is that the master node wasmunching through all these partitions and couldn't deal with the largenumber of partitions.

2) Use of data locality: We could keep the data partitioned by day andbucketed by userId. Within each bucket sort the data by the (userId,time). This way we could keep the data related to each userId togetherwithin a daily partition and if Hive could be made aware of this sortingorder and could make use of this order to improve search/query times,that would alleviate the problem quite a bit. The big question here is:Does Hive leverage sorting order of data within a partition bucket whenrunning (most/all?) queries, where possible?

3) Using an index: As wd mentioned, Hive 0.7 introduces the notion on anindex. If I do index on userId, given that we can hundreds of thousandsof unique users per day, would indexing prove to be a good move? Arethere people who are using it for similar purposes or on a similar scale?

4) Using 2 "orthogonal tables": As mentioned in my original email (seebelow), we could have 2 independent tables, one which stores datapartitioned by day and other partitioned by userId. For maintainingpartitions in userId partitioned table, I am planning to do the following:In the nightly job, if userId=X visited the website previous day, wecreate a partition for userId=X if it doesn't already exist. Once thepartition is created, all clicks for that user Id on the day for inquestion are put in a single file and dropped in the userId=X folder onHDFS. This method could be used to simulate an "append" to the Hivetable. The file would only be a few bytes to a few KB and the format ofthe table would be sequence file.

What are your thoughts about the above 4 methods? Any particular likesor dislikes? Any comments, suggestions would be helpful.


Thank you again in advance!

Mark

On 11-09-04 04:01 AM, wd wrote:

Hive support more than one partitions, have your tried? Maybe you can
create to partitions named as date and user.

Hive 0.7 also support index, maybe you can have a try.

On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover<[email protected]>  wrote:

Hello folks,
I am fairly new to Hive and am wondering if you could share some of the best 
practices for storing/querying data with Hive.

Here is an example of the problem I am trying to solve.

The traffic to our website is logged in files that contain information about 
clicks from various users.
Simplified, the log file looks like:
t_1, ip_1, userid_1
t_2, ip_2, userid_2
t_3, ip_3, userid_3
...

where t_i represents time of the click, ip_i represents ip address where the 
click originated from, and userid_i represents the user ID of the user.

Since the clicks are logged on an ongoing basis, partitioning our Hive table by 
day seemed like the obvious choice. Every night we upload the data from the 
previous day into a new partition.

However, we would also want the capability to find all log lines corresponding 
to a particular user. With our present partitioning scheme, all day partitions 
are searched for that user ID but this takes a long time. I am looking for 
ideas/suggestions/thoughts/comments on how to reduce this time.

As a solution, I am thinking that perhaps we could have 2 independent tables, 
one which stores data partitioned by day and the other partitioned by userId. 
With the second table partitioned by userId, I will have to find some way of 
maintaining the partitions since Hive doesn't support appending of files. Also, 
this seems suboptimal, since we are doubling that the amount of data that we 
store. What do you folks think of this idea?

Do you have any other suggestions on how we can approach this problem?

What have other people in similar situations done? Please share.

Thank you in advance!
Mark


--
Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: [email protected]

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

Re: Best practices for storing data on Hive

Reply via email to