Hello everyone,
I am sure many of you have faced a similar issue. We run "insert into 'target_table' select a,b,c from x where .." style queries for a nightly load. Each insert goes into a new partition of target_table. The concern is: *this insert loads hardly any data* (I would say less than 128 MB per day), *but the data is fragmented into 1200 files*, each only a few kilobytes. This is slowing down performance.

How can we make sure this load does not generate lots of small files? I have already set *hive.merge.mapfiles* and *hive.merge.mapredfiles* to true in the custom/advanced hive-site.xml, but the load job still writes the data as 1200 small files. I know where the 1200 comes from: it is the maximum number of reducers/containers configured in one of the hive-site settings. (I do not think it is a good idea to change that number cluster wide, as it can affect other jobs that use the cluster when it has free containers.)

*What other ways/settings could ensure the hive insert does not take 1200 slots and generate lots of small files?* (I have put a sketch of the per-query settings I am considering at the end of this mail.)

I also have another question, which is partly contrary to the above (and relatively less important): when I reload this table by creating a new table with a select on the target table, the newly created table does not contain too many small files; its file count drops from 1200 to ±50. What could be the reason?

PS: I did go through http://www.openkb.info/2014/12/how-to-control-file-numbers-of-hive.html

Regards,
Arpan
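
PPS: For completeness, this is the kind of per-query/session-level override I am considering instead of a cluster-wide change. The property names are taken from the Hive configuration documentation as I understand it, and the table/column names (target_table, x, a/b/c, and the load_date partition column) are just placeholders from above; treat this as a sketch rather than something I have verified end to end:

    -- run in the same session, right before the nightly INSERT, so nothing is changed cluster wide
    SET hive.merge.mapfiles=true;                 -- merge small output files of map-only jobs
    SET hive.merge.mapredfiles=true;              -- merge small output files of map-reduce jobs
    SET hive.merge.tezfiles=true;                 -- same, if the query runs on Tez
    SET hive.merge.smallfiles.avgsize=134217728;  -- kick off a merge when average file size < 128 MB
    SET hive.merge.size.per.task=268435456;       -- aim for roughly 256 MB per merged file
    SET hive.exec.reducers.max=16;                -- cap reducers for this query only, instead of 1200

    INSERT INTO TABLE target_table PARTITION (load_date='...')
    SELECT a, b, c
    FROM x
    WHERE ...;
    -- alternatively, adding a DISTRIBUTE BY on a low-cardinality column should funnel
    -- the rows through far fewer reducers and hence far fewer output files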