Re: Hive dynamic partitions generate multiple files

Cosmin Cătălin Sanda Tue, 28 Jan 2014 23:07:24 -0800

Hi Andre,

The reason is that I want those partitions to feed into other queries. If the
individual files are only a few MB, then performance will be
sub-optimal. As far as I understand, the individual files need to be at
least around 140MB for the map tasks to work efficiently.
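
For reference, a minimal sketch of the settings that, as far as I understand,
control how Hive merges small output files at the end of a job. The byte
values are illustrative assumptions, not tested recommendations, and I have
not verified how they behave on Hive 0.8.1 when writing to S3. They act on the
output of a single job, so they would not by themselves combine new output
with files left in the partition by earlier jobs.

```sql
-- Merge small output files at the end of map-only and map-reduce jobs.
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
-- Run the merge step when the average output file size is below this (bytes).
-- 134217728 = 128 MB, an illustrative value.
SET hive.merge.smallfiles.avgsize = 134217728;
-- Approximate target size of each merged file (bytes).
-- 268435456 = 256 MB, an illustrative value.
SET hive.merge.size.per.task = 268435456;
```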


------------------------------------
*Cosmin Catalin SANDA*
Software Systems Engineer
Phone: +45.27.30.60.35



On Wed, Jan 29, 2014 at 2:53 AM, Andre Araujo <ara...@pythian.com> wrote:

> Why do you need exactly one file? This is transparent to Hive and it
> should treat it seamlessly. Unless you have external requirements (reading
> files from somewhere else) it shouldn't matter.
>
> HDFS support for file append is not a solid standard afaik, and will depend
> on the distribution and version you're using. In some versions file append
> is not available and the only way to add data to an existing Hive table is
> to create an additional file under the table's directory in HDFS. I haven't
> looked at the code but it may be that Hive developers chose this to be the
> default way for appending data so it works with all HDFS distributions and
> versions.
>
> If you need to merge multiple files under the same partition you can
> select everything from that partition and INSERT OVERWRITE the data again.
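>
> A minimal sketch of that rewrite, assuming a hypothetical table
> export_table partitioned by hypothetical columns yr, mon and dy (none of
> these names come from your setup). How many files come out still depends
> on the number of tasks and on the hive.merge.* settings, so treat it as a
> starting point rather than a guaranteed single-file result.
>
> ```sql
> -- Hypothetical table, columns and partition values.
> -- Reads one partition and writes it back over itself.
> INSERT OVERWRITE TABLE export_table
>   PARTITION (yr = '2014', mon = '01', dy = '23')
> SELECT col_a, col_b
> FROM export_table
> WHERE yr = '2014' AND mon = '01' AND dy = '23';
> ```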
>
> But again, unless you have requirements external to Hive, you shouldn't be
> concerned about that.
>
>
> On 29 January 2014 11:32, Cosmin Cătălin Sanda <cosmincata...@gmail.com> wrote:
>
>> Hi Andre,
>>
>> So the thing is like this: the first time the query runs, it generates
>> one file per dynamic partition. The next time the query runs and needs
>> to write to the same partition, it generates another file instead of
>> merging with the existing one.
>>
>> E.g.:
>> 1. The partitioned S3 path looks like this: s3://bucket/export/2014/01/23
>> 2. I run the query on some data and ultimately end up with a file in
>> the above-mentioned partition.
>> 3. I run the same query on some other data, which ends up writing to the
>> same partition as above. However, it doesn't take the existing file and
>> merge with it; it generates a second file in the same partition.
>>
>> On Wed, Jan 29, 2014 at 1:16 AM, Andre Araujo <ara...@pythian.com> wrote:
>>
>>> Hi, Cosmin,
>>>
>>> Have you tried using DISTRIBUTE BY to distribute the query's data by the
>>> partitioning columns?
>>> That way all the data for each partition should be sent to the same
>>> reducer and should be written to a single file in each partition, I think.
>>>
>>> If your data is being distributed by different criteria, you will
>>> potentially have multiple reducers writing to the same partitions.
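>>>
>>> A minimal sketch of what I mean, assuming a hypothetical source table
>>> staging_events and a target table export_table partitioned by
>>> hypothetical columns yr, mon and dy (the names are placeholders, not
>>> taken from your job). The dynamic partition columns go last in the
>>> SELECT list and are repeated in DISTRIBUTE BY:
>>>
>>> ```sql
>>> -- Hypothetical table and column names.
>>> SET hive.exec.dynamic.partition = true;
>>> SET hive.exec.dynamic.partition.mode = nonstrict;
>>>
>>> INSERT INTO TABLE export_table PARTITION (yr, mon, dy)
>>> SELECT col_a, col_b, yr, mon, dy
>>> FROM staging_events
>>> -- Send all rows of a given partition to the same reducer.
>>> DISTRIBUTE BY yr, mon, dy;
>>> ```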
>>>
>>> Andre
>>>
>>>
>>>
>>> On 29 January 2014 10:51, Cosmin Cătălin Sanda
>>> <cosmincata...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a number of Hive jobs that run during a day. Each individual
>>>> job outputs data to Amazon S3. The Hive jobs use dynamic
>>>> partitioning.
>>>>
>>>> The problem is that when different jobs need to write to the same
>>>> dynamic partition, they will each generate one file.
>>>>
>>>> What I would like is for the subsequent jobs to load the existing data
>>>> and merge it with the new data. Can this be achieved somehow? Is there an
>>>> option that needs to be enabled? I already set:
>>>>
>>>> SET hive.merge.mapredfiles = true;
>>>> SET hive.exec.dynamic.partition = true;
>>>> SET hive.exec.dynamic.partition.mode = nonstrict;
>>>>
>>>> I should mention that the query that actually outputs to S3 is an INSERT
>>>> INTO TABLE query. The Hive version is 0.8.1
>>>>
>>>>
>>>> Thank you,
>>>> Cosmin
>>>>
>>>
>>>
>>>
>>> --
>>> André Araújo
>>> Big Data Consultant/Solutions Architect
>>> The Pythian Group - Australia - www.pythian.com
>>>
>>> Office (calls from within Australia): 1300 366 021 x1270
>>> Office (international): +61 2 8016 7000  x270 *OR* +1 613 565 8696
>>> x1270
>>> Mobile: +61 410 323 559
>>> Fax: +61 2 9805 0544
>>> IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk
>>>
>>> “Success is not about standing at the top, it's the steps you leave
>>> behind.” — Iker Pou (rock climber)
>>>
>>> --
>>>
>>>
>>>
>>>
>>
>
