Hey folks,

I wanted to consult you on something that has been bothering me for a while...


I have declared external tables, partitioned by date_hour. A batch Hadoop
process updates the files under the partitions, and I want the data to be
accessible via Hive as soon as it is updated.
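
For context, the tables are declared roughly like this (the table name,
column, and location below are made up for illustration):

    CREATE EXTERNAL TABLE events (
      payload STRING
    )
    PARTITIONED BY (date_hour STRING)
    LOCATION '/data/events';

The batch writes its output under /data/events/date_hour=<value> directories.
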

I came up with 3 solutions, each with its own problem:
1. Create all partitions a month in advance. This leaves empty directories on
HDFS for the future partitions, so a query that filters with ">" on the
partition column may pick them up and fail the job, since it loads empty
input files.
2. Have the batch notify Hive that a new partition has been added when it
finishes its work. The problem is that the batch then needs to "know" Hive in
order to update it, and I want the batch to stay agnostic towards Hive.
3. Have a cron job that scans HDFS for all the partitions present there,
lists all the partitions declared in Hive, computes the delta, and adds the
missing ones (see the sketch after this list).
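
For option 3, here is a minimal sketch of what that cron job could look like,
assuming the illustrative events table above with data under /data/events
(all names are made up, not from our actual setup):

    #!/bin/bash
    # Partitions physically present on HDFS.
    hdfs dfs -ls /data/events | grep -o 'date_hour=[^ ]*' \
      | cut -d= -f2 | sort > hdfs_parts.txt

    # Partitions already declared in Hive.
    hive -e 'SHOW PARTITIONS events' \
      | cut -d= -f2 | sort > hive_parts.txt

    # The delta: on HDFS but not yet declared in Hive; add each one.
    for p in $(comm -23 hdfs_parts.txt hive_parts.txt); do
      hive -e "ALTER TABLE events ADD PARTITION (date_hour='$p')
               LOCATION '/data/events/date_hour=$p'"
    done

The obvious downside is the polling interval: new data only becomes visible
on the next cron tick, and every hive -e call pays JVM startup cost.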


Do you have other solutions?
Or improvements?


Thanks.
Guy
