Saquib Khan, to unsubscribe you need to send a message to
user-unsubscr...@hive.apache.org as described here:  Mailing Lists
<http://hive.apache.org/mailing_lists.html>.


Thanks.

-- Lefty

On Sun, Jun 25, 2017 at 7:14 PM, saquib khan <skhan...@gmail.com> wrote:

> Please remove me from the user list.
>
> On Sun, Jun 25, 2017 at 5:10 PM Db-Blog <mpp.databa...@gmail.com> wrote:
>
>> Hi Arpan,
>> Include the partition column in the DISTRIBUTE BY clause of the DML; it
>> will generate only one file per day. Hope this resolves the issue.
>>
>> "insert into 'target_table' select a,b,c from x where ... distribute by
>> (date)"
>>
>> PS: Backdated processing will generate additional file(s). One file per
>> load.
>>
>> Thanks,
>> Saurabh
>>
>> Sent from my iPhone, please avoid typos.
>>
>> On 22-Jun-2017, at 11:30 AM, Arpan Rajani <arpan.raj...@whishworks.com>
>> wrote:
>>
>> Hello everyone,
>>
>>
>> I am sure many of you might have faced a similar issue.
>>
>> We do "insert into 'target_table' select a,b,c from x where .." kind of
>> queries for a nightly load. This insert goes in a new partition of the
>> target_table.
>>
>> Now the concern is: *this insert loads hardly any data* (I would say
>> less than 128 MB per day) *but the data is fragmented into 1200 files*,
>> each only a few kilobytes. This is slowing down performance. How can we
>> make sure this load does not generate a lot of small files?
>>
>> I have already set *hive.merge.mapfiles* and *hive.merge.mapredfiles* to
>> true in the custom/advanced hive-site.xml, but the load job still writes
>> the data as 1200 small files.
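For the merge step to actually kick in, the size thresholds matter as well as the on/off flags. A sketch of session-level settings (the property names are real Hive settings; the byte values are illustrative, not recommendations):

```sql
-- Session-scoped settings; values here are illustrative.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.tezfiles=true;                 -- only relevant when running on Tez
SET hive.merge.smallfiles.avgsize=128000000;  -- merge when avg output file size is below this
SET hive.merge.size.per.task=256000000;       -- target size of each merged file
```

If the average output file size is already above hive.merge.smallfiles.avgsize, the merge stage is skipped even with the merge flags enabled, which can explain why turning the flags on alone does not help.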
>>
>> I know where the 1200 comes from: it is the maximum number of
>> reducers/containers configured in one of the hive-site files. (I do not
>> think it is a good idea to change this cluster-wide, as it can affect
>> other jobs that use the cluster when it has free containers.)
>>
>> *What other settings or approaches would keep the Hive insert from taking
>> 1200 slots and generating lots of small files?*
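One option, without touching cluster-wide configuration, is to cap the reducer count for this load alone at the session level. A sketch (property names are the classic Hive-on-MapReduce ones; the values are illustrative):

```sql
-- Session-scoped: affects only this session, not other jobs on the cluster.
SET mapred.reduce.tasks=10;                       -- force a fixed reducer count
-- or, instead, let Hive size reducers by input volume:
SET hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.reducers.max=50;                    -- session-level ceiling
```

With under 128 MB of data per day, either approach should bring the file count down from 1200 to a handful.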
>>
>> I also have another question, partly contrary to the above (this is
>> relatively less important):
>>
>> When I reload this table by creating a new table from a SELECT on the
>> target table, the newly created table does not contain as many small
>> files: the file count drops from 1200 to about 50. What could be the
>> reason?
>>
>> PS: I did go through
>> http://www.openkb.info/2014/12/how-to-control-file-numbers-of-hive.html
>>
>>
>> Regards,
>> Arpan
>>
>>
>>
