Hello!

Our Hive table import process uses a dynamic partition insert into a temporary
table; the resulting sequence files are then loaded into the master table with
LOAD DATA INPATH, because we want the data online for querying immediately.
The loaded files do not overwrite the files already present in the partitions,
so we are essentially doing an "append" to each partition.
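
For context, here is a simplified sketch of what we do (the table names,
columns, path, and partition value are made up for illustration, not our real
schema):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- stage the raw data with a dynamic partition insert
INSERT OVERWRITE TABLE events_staging PARTITION (dt)
SELECT col1, col2, dt
FROM events_raw;

-- then move the resulting sequence files under the master table's partition;
-- INTO (rather than OVERWRITE INTO) leaves the existing files in place
LOAD DATA INPATH '/user/hive/warehouse/events_staging/dt=2011-01-01'
INTO TABLE events_master PARTITION (dt='2011-01-01');
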
Our question is: is this a bad practice, and how does it affect table
sampling? The table sampling mechanism seems to expect exactly as many files
in a partition directory as the table has buckets. "Compacting" the table with
an INSERT OVERWRITE that rewrites the partitions fixes the sampling problem,
but we would like to avoid that expensive rewrite. Are there better ways to
get our data online quickly while preserving the ability to table sample?
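
For reference, the sampling and the compaction we fall back to look roughly
like the sketch below; again, the names, the bucket count, and the partition
value are illustrative rather than our real schema:

-- master table is declared something like:
--   CREATE TABLE events_master (col1 STRING, col2 STRING)
--   PARTITIONED BY (dt STRING)
--   CLUSTERED BY (col1) INTO 32 BUCKETS
--   STORED AS SEQUENCEFILE;

-- sampling that appears to assume one file per bucket in the partition
SELECT col1, col2
FROM events_master TABLESAMPLE (BUCKET 1 OUT OF 32 ON col1) s
WHERE dt = '2011-01-01';

-- the expensive "compaction": rewrite the partition so the file layout
-- matches the bucket definition again (enforcing bucketing on the insert
-- makes Hive produce one file per bucket)
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE events_master PARTITION (dt='2011-01-01')
SELECT col1, col2
FROM events_master
WHERE dt = '2011-01-01';
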

Thanks,
Luke Forehand
http://www.networkedinsights.com
