Every day, we move large amounts of data from Hive to HBase via Phoenix.
To do this, we create an external Hive table containing the data we need to move (always a subset of a single compressed ORC table) and then run the Phoenix CsvBulkLoadTool utility. From everything I've read, this is the best approach.

My question: how can I optimize the external table so that the bulk load is as efficient as possible? For example, today the external table is backed by 6,020 files in HDFS, each about 300-400 MB. That produces a MapReduce job with 12,209 mappers and takes about 3 hours (we don't have a huge cluster: currently 13 data nodes). Would it be better to have more, smaller files? Fewer, larger files?
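
For reference, the workflow looks roughly like this (the table names, columns, paths, filter date, jar location, and ZooKeeper host below are placeholders, not our actual setup):

    # 1. Stage the subset of the ORC table as a comma-delimited, text-backed
    #    external table in Hive (placeholder names throughout).
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS export_subset (
        row_key STRING,
        col_a   STRING,
        col_b   STRING
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/tmp/export_subset';

      INSERT OVERWRITE TABLE export_subset
      SELECT row_key, col_a, col_b
      FROM   warehouse.big_orc_table
      WHERE  load_date = '2016-01-01';
    "

    # 2. Bulk-load the resulting CSV files into the Phoenix table on HBase.
    #    (Jar path and ZooKeeper quorum are placeholders.)
    hadoop jar /path/to/phoenix-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --table EXPORT_SUBSET \
      --input /tmp/export_subset \
      --zookeeper zk1.example.com:2181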