On a daily basis, we move large amounts of data from Hive to HBase via Phoenix.

To do this, we create an external Hive table containing the data we need to
move (all of it a subset of a single compressed ORC table), and then use the
Phoenix CsvBulkUpload utility. From everything I've read, this is the best
approach.
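In case a concrete picture helps, here is a rough sketch of the pipeline. The
table names, columns, paths, partition filter, and ZooKeeper quorum below are
made up; the bulk load step is, as I understand it, the standard
org.apache.phoenix.mapreduce.CsvBulkLoadTool entry point:

    -- Hypothetical staging table: the ORC subset is written out as
    -- comma-delimited text so the Phoenix bulk loader can read it.
    CREATE EXTERNAL TABLE staging_for_phoenix (
        id   STRING,
        col1 STRING,
        col2 BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/tmp/staging_for_phoenix';

    INSERT OVERWRITE TABLE staging_for_phoenix
    SELECT id, col1, col2
    FROM   big_orc_table        -- the single compressed ORC source table
    WHERE  ds = '2016-01-01';   -- the daily subset we move

    # Then the MapReduce bulk load into the Phoenix/HBase table:
    hadoop jar phoenix-<version>-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table MY_PHOENIX_TABLE \
        --input /tmp/staging_for_phoenix \
        --zookeeper zk1,zk2,zk3:2181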

My question is: how can I optimize my external table to make the bulk upload as 
efficient as possible?

For example, today my external table is backed by 6,020 files in HDFS, each
about 300-400 MB.

This results in a MapReduce job with 12,209 mappers that takes about 3
hours (we don't have a huge cluster: currently 13 data nodes).
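(If I have the split math right, then assuming the default ~128 MB split size,
each 300-400 MB file becomes 2-3 map tasks, and 6,020 files x ~2 splits is
roughly 12,000 mappers, which lines up with the 12,209 we see.)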

Would it be better to have more, smaller files? Fewer, larger files?
