Re: Hive dynamic partitions generate multiple files

Cosmin Cătălin Sanda Wed, 29 Jan 2014 02:18:28 -0800

Hi Andre,

I think this is indeed the direction in which I am going to go, unless
anyone else has some other ideas :)
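
A rough sketch of the periodic coalesce discussed here, in HiveQL. The table
name `export_data`, the columns `col_a` and `col_b`, and the single string
partition column `dt` are placeholders rather than the real schema, and the
date in the filter is only an example:

```sql
-- Placeholder names: export_data, col_a, col_b and dt are assumptions,
-- not the real schema.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Rewrite the recent partitions in place. DISTRIBUTE BY sends all rows of a
-- partition to the same reducer, so each rewritten partition should end up
-- with one file instead of one file per earlier job.
INSERT OVERWRITE TABLE export_data PARTITION (dt)
SELECT col_a, col_b, dt
FROM export_data
WHERE dt >= '2014-01-22'
DISTRIBUTE BY dt;
```

This would have to run while no other job is writing to the same partitions,
since files added during the rewrite could be lost.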


------------------------------------
*Cosmin Catalin SANDA*
Software Systems Engineer
Phone: +45.27.30.60.35



On Wed, Jan 29, 2014 at 10:43 AM, Andre Araujo <ara...@pythian.com> wrote:

> Hi, Cosmin,
>
> Functionally the subsequent queries will work just fine (they will
> return the correct results). But you're correct in saying that it's not
> optimal.
> If the jobs always generate very small files you might end up with a huge
> number of small files, which will have an impact on the NameNode's memory
> usage as well.
> In that case I think you could periodically "coalesce" the recent
> partitions. Once a week/month you can select from the more recent
> partitions and insert overwrite, which will convert all those small files
> into bigger ones.
>
> However, if the jobs are creating files that are already around the
> cluster block size, it should be fine to leave them as is.
>
> Maybe someone else has some other ideas...
>
>
> On 29 January 2014 18:05, Cosmin Cătălin Sanda <cosmincata...@gmail.com> wrote:
>
>> Hi Andre,
>>
>> The reason is that I want those partitions to go into other queries. If
>> the individual files are only a few MB then the performance will be
>> sub-optimal. As far as I understood, the individual files need to be at
>> least around 140MB for the Maps to work properly.
>>
>> On Wed, Jan 29, 2014 at 2:53 AM, Andre Araujo <ara...@pythian.com> wrote:
>>
>>> Why do you need exactly one file? This is transparent to Hive and it
>>> should treat it seamlessly. Unless you have external requirements (reading
>>> files from somewhere else) it shouldn't matter.
>>>
>>> HDFS support for file append is not a solid standard afaik, and will
>>> depend on the distribution and version you're using. In some versions file
>>> append is not available and the only way to add data to an existing Hive
>>> table is to create an additional file under the table's directory in HDFS.
>>> I haven't looked at the code but it may be that Hive developers chose this
>>> to be the default way for appending data so it works with all HDFS
>>> distributions and versions.
>>>
>>> If you need to merge multiple files under the same partition you can
>>> select everything from that partition and INSERT OVERWRITE the data again.
>>>
>>> But again, unless you have requirements external to Hive, you shouldn't
>>> be concerned about that.
>>>
>>>
>>> On 29 January 2014 11:32, Cosmin Cătălin Sanda
>>> <cosmincata...@gmail.com> wrote:
>>>
>>>> Hi Andre,
>>>>
>>>> So the thing is like this: the first time the query runs, it generates
>>>> one file per dynamic partition. The next time the query runs and it needs
>>>> to write to the same partition, it will generate another file instead of
>>>> merging with the existing one.
>>>>
>>>> Eg:
>>>> 1. The partitioned S3 path looks like this: s3://bucket/export/2014/01/23
>>>> 2. I run the query on some data and I ultimately end up having a file
>>>> in the above mentioned partition.
>>>> 3. I run the same query on some other data which ends up writing to the
>>>> same partition as above, but instead of taking the existing file there
>>>> and merging with it, it generates a second file in the same partition.
>>>>
>>>> On Wed, Jan 29, 2014 at 1:16 AM, Andre Araujo <ara...@pythian.com> wrote:
>>>>
>>>>> Hi, Cosmin,
>>>>>
>>>>> Have you tried using DISTRIBUTE BY to distribute the query's data by
>>>>> the partitioning columns?
>>>>> That way all the data for each partition should be sent to the same
>>>>> reducer and should be written to a single file in each partition, I think.
>>>>>
>>>>> If your data is being distributed by different criteria, you will
>>>>> potentially have multiple reducers writing to the same partitions.
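>>>>>
>>>>> A minimal sketch of the DISTRIBUTE BY idea, with made-up names: the
>>>>> tables `staging_data` and `export_data`, the columns `col_a` and
>>>>> `col_b`, and the partition column `dt` are assumptions, not taken
>>>>> from the actual job.
>>>>>
>>>>> ```sql
>>>>> -- Hypothetical tables and columns; dt is the dynamic partition column.
>>>>> SET hive.exec.dynamic.partition = true;
>>>>> SET hive.exec.dynamic.partition.mode = nonstrict;
>>>>>
>>>>> -- Rows with the same dt go to the same reducer, so a single run
>>>>> -- should write one file per partition it touches.
>>>>> INSERT INTO TABLE export_data PARTITION (dt)
>>>>> SELECT col_a, col_b, dt
>>>>> FROM staging_data
>>>>> DISTRIBUTE BY dt;
>>>>> ```
>>>>>
>>>>> This only controls the number of files written by one run; it does
>>>>> not touch files that earlier runs already left in the partition.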
>>>>>
>>>>> Andre
>>>>>
>>>>>
>>>>>
>>>>> On 29 January 2014 10:51, Cosmin Cătălin Sanda <
>>>>> cosmincata...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a number of Hive jobs that run during a day. Each individual
>>>>>> job is outputting data to Amazon S3. The Hive jobs use dynamic
>>>>>> partitioning.
>>>>>>
>>>>>> The problem is that when different jobs need to write to the same
>>>>>> dynamic partition, they will each generate one file.
>>>>>>
>>>>>> What I would like is for the subsequent jobs to load the existing
>>>>>> data and merge it with the new data. Can this be achieved somehow? Is
>>>>>> there an option that needs to be enabled? I already set:
>>>>>>
>>>>>> SET hive.merge.mapredfiles = true;
>>>>>> SET hive.exec.dynamic.partition = true;
>>>>>> SET hive.exec.dynamic.partition.mode = nonstrict;
>>>>>>
>>>>>> I should mention that the query that actually outputs to S3 is an INSERT
>>>>>> INTO TABLE query. The Hive version is 0.8.1
>>>>>>
>>>>>>
>>>>>> Thank you,
>>>>>> Cosmin
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> André Araújo
>>>>> Big Data Consultant/Solutions Architect
>>>>> The Pythian Group - Australia - www.pythian.com
>>>>>
>>>>> Office (calls from within Australia): 1300 366 021 x1270
>>>>> Office (international): +61 2 8016 7000  x270 *OR* +1 613 565 8696
>>>>> x1270
>>>>> Mobile: +61 410 323 559
>>>>> Fax: +61 2 9805 0544
>>>>> IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk
>>>>>
>>>>> “Success is not about standing at the top, it's the steps you leave
>>>>> behind.” — Iker Pou (rock climber)
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
