Re: Hive dynamic partitions generate multiple files

Cosmin Cătălin Sanda Tue, 28 Jan 2014 16:34:39 -0800

Hi Andre,

The issue is this: the first time the query runs, it generates one
file per dynamic partition. The next time the query runs and needs to
write to the same partition, it generates another file instead of
merging with the existing one.


E.g.:
1. The partitioned S3 path looks like this: s3://bucket/export/2014/01/23
2. I run the query on some data and ultimately end up with a file in
the above-mentioned partition.
3. I run the same query on some other data, which ends up writing to the
same partition as above. However, it doesn't take the existing file there
and merge with it; instead, it generates a second file in the same partition.
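
For reference, here is a rough sketch of the kind of query involved, with the
DISTRIBUTE BY suggestion applied. The table and column names (export_table,
staging_events, payload, yr, mth, dy) are made up for illustration and are not
the real schema. As far as I understand, this only keeps a single run from
writing several files into one partition; since INSERT INTO appends, each
later run would still add its own file next to the existing one.

```sql
-- Settings already mentioned in this thread.
SET hive.merge.mapredfiles = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hypothetical tables and columns, for illustration only.
-- The dynamic partition columns (yr, mth, dy) come last in the SELECT list.
INSERT INTO TABLE export_table PARTITION (yr, mth, dy)
SELECT payload, yr, mth, dy
FROM staging_events
-- Route all rows of a given partition to the same reducer.
DISTRIBUTE BY yr, mth, dy;
```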


------------------------------------
*Cosmin Catalin SANDA*
Software Systems Engineer
Phone: +45.27.30.60.35



On Wed, Jan 29, 2014 at 1:16 AM, Andre Araujo <ara...@pythian.com> wrote:

> Hi, Cosmin,
>
> Have you tried using DISTRIBUTE BY to distribute the query's data by the
> partitioning columns?
> That way all the data for each partition should be sent to the same
> reducer and should be written to a single file in each partition, I think.
>
> If your data is being distributed by different criteria, you will
> potentially have multiple reducers writing to the same partitions.
>
> Andre
>
>
>
> On 29 January 2014 10:51, Cosmin Cătălin Sanda <cosmincata...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a number of Hive jobs that run during the day. Each individual job
>> outputs data to Amazon S3. The Hive jobs use dynamic partitioning.
>>
>> The problem is that when different jobs need to write to the same dynamic
>> partition, they will each generate one file.
>>
>> What I would like is for the subsequent jobs to load the existing data
>> and merge it with the new data. Can this be achieved somehow? Is there an
>> option that needs to be enabled? I already set:
>>
>> SET hive.merge.mapredfiles = true;
>> SET hive.exec.dynamic.partition = true;
>> SET hive.exec.dynamic.partition.mode = nonstrict;
>>
>> I should mention that the query that actually outputs to S3 is an INSERT
>> INTO TABLE query. The Hive version is 0.8.1.
>>
>>
>> Thank you,
>> Cosmin
>>
>
>
>
> --
> André Araújo
> Big Data Consultant/Solutions Architect
> The Pythian Group - Australia - www.pythian.com
>
> Office (calls from within Australia): 1300 366 021 x1270
> Office (international): +61 2 8016 7000  x270 *OR* +1 613 565 8696   x1270
> Mobile: +61 410 323 559
> Fax: +61 2 9805 0544
> IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk
>
> “Success is not about standing at the top, it's the steps you leave behind.”
> — Iker Pou (rock climber)
>
