Hello everyone,

I am sure many of you have faced a similar issue.

We run queries of the form "insert into 'target_table' select a,b,c from x
where .." for a nightly load. Each insert goes into a new partition of the
target_table.

Now the concern is: *this insert loads hardly any data* (I would say less
than 128 MB per day) *but the data is fragmented into 1200 files*, each only
a few kilobytes. This is slowing down performance. How can we make sure
this load does not generate a lot of small files?

I have already set *hive.merge.mapfiles* and *hive.merge.mapredfiles* to
true in the custom/advanced hive-site.xml, but the load job still writes
1200 small files.
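
For reference, a minimal sketch of what the session-level equivalent of
those two properties would look like. The last three lines are assumptions,
not values taken from my cluster: hive.merge.tezfiles only matters if the
execution engine is Tez, and the two size thresholds are shown with what I
believe are the stock defaults.

```sql
-- Session-level equivalents of the two properties already set in hive-site.xml
SET hive.merge.mapfiles=true;      -- merge small files after map-only jobs
SET hive.merge.mapredfiles=true;   -- merge small files after map-reduce jobs

-- Assumptions (not from my cluster config):
SET hive.merge.tezfiles=true;                -- only relevant if the engine is Tez
SET hive.merge.smallfiles.avgsize=16000000;  -- merge step runs when the average output file is smaller than this (bytes)
SET hive.merge.size.per.task=256000000;      -- target size of each merged file (bytes)
```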

I know where 1200 comes from: it is the maximum number of
reducers/containers configured in one of the hive-site files. (I do not
think it is a good idea to change this number cluster-wide, as that could
affect other jobs that use the cluster when it has free containers.)

*What other approach or settings would keep the Hive insert from taking
1200 slots and generating lots of small files?*
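
To make the question concrete, this is the kind of per-session override I
have in mind, run in the same session just before the nightly insert so the
cluster-wide hive-site.xml stays untouched. It is only a sketch: both values
are arbitrary illustrations that I have not tested, and I am assuming the
file count is driven by the number of reducers.

```sql
-- Session-scoped: applies only to the current Hive session,
-- not to other jobs on the cluster.
SET hive.exec.reducers.max=4;                         -- hypothetical cap instead of 1200
SET hive.exec.reducers.bytes.per.reducer=134217728;   -- hypothetical: about 128 MB of input per reducer
```

Would something along these lines be the right direction, or is there a
better way?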

I also have another question, which partly runs contrary to the above (this
is relatively less important):

When I reload this table by creating a new table with a select on the
target table, the newly created table does not contain many small files.
Its file count drops from 1200 to ±50. What could be the reason?

PS: I did go through
http://www.openkb.info/2014/12/how-to-control-file-numbers-of-hive.html


Regards,
Arpan
