When hive.optimize.sort.dynamic.partition is off, Hive opens a file writer for each new partition key as it is encountered and writes records to the appropriate files. Since the Parquet writer buffers writes in memory before flushing to disk, this can lead to OOMs when you have lots of partitions and therefore lots of open files. With hive.optimize.sort.dynamic.partition on, Hive sorts the records by partition key before writing starts. This means all records for a partition are written as one contiguous chunk, and the file for one partition is closed before the file for the next partition is opened, so only one writer buffers in memory at a time.
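For context, a minimal sketch of how the property is typically enabled for a dynamic-partition insert (the table and column names here are hypothetical, not from the original thread):

    -- enable dynamic partitioning; both settings are standard Hive properties
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    -- sort rows by partition key so only one Parquet writer is open at a time
    SET hive.optimize.sort.dynamic.partition = true;

    -- hypothetical tables: pos_staging (text) feeding pos_parquet (Parquet,
    -- partitioned by sale_date); the partition column must come last in SELECT
    INSERT OVERWRITE TABLE pos_parquet PARTITION (sale_date)
    SELECT store_id, sku, amount, sale_date
    FROM pos_staging;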
The issue you're encountering is that partition creation on the metastore is slow. I think that's unavoidable at the moment. I provided a patch (see HIVE-10385), but it's not for everyone. Since your size per partition is so small, I'd recommend not partitioning by day and simply making the date a regular column (a small sketch of that layout follows at the end of this message). For queries that span months or years, you'll probably spend more time listing files and fetching partitions during query planning than actually scanning your data.

-Slava

On Fri, Oct 9, 2015 at 4:12 PM, Yogesh Keshetty <yogesh.keshe...@outlook.com> wrote:

> Has anyone tried this? Please help me if you have any knowledge of this
> kind of use case.
>
> ------------------------------
> From: yogesh.keshe...@outlook.com
> To: user@hive.apache.org
> Subject: Dynamic partitioned parquet tables
> Date: Fri, 9 Oct 2015 11:20:57 +0530
>
> Hello,
>
> I have a question regarding Parquet tables. We have POS data that we want
> to store with one partition per day. We Sqoop the data into an external
> table in text file format and then insert it into an external table
> partitioned by date; due to some requirements, we want to keep these
> files as Parquet. The average file size per day is around 2 MB. I know
> Parquet is not meant for lots of small files, but we want to keep it that
> way. The problem is that during the initial historical data load we try
> to create dynamic partitions, and no matter how much memory I give the
> job, it keeps failing with memory issues. After some research I found
> that turning on "set hive.optimize.sort.dynamic.partition = true" lets us
> create the dynamic partitions, but this is taking longer than we
> expected. Is there any way to boost the performance? Also, even with the
> property turned on, when we try to create dynamic partitions for multiple
> years of data at once we again run into a heap error. How can we handle
> this problem? Please help us.
>
> Thanks in advance!
>
> Thank you,
> Yogesh

--
Slava Markeyev | Engineering | Upsight
Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
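As promised above, a minimal sketch of the suggested unpartitioned layout, with the date kept as a plain column instead of a partition key (table, column names, and path are hypothetical):

    -- hypothetical: no daily partitions, so no per-day metastore entries
    CREATE EXTERNAL TABLE pos_parquet_flat (
      store_id  STRING,
      sku       STRING,
      amount    DECIMAL(10,2),
      sale_date STRING   -- 'yyyy-MM-dd'; a column, not a partition key
    )
    STORED AS PARQUET
    LOCATION '/warehouse/pos_parquet_flat';

    -- range queries filter on the column rather than pruning partitions
    SELECT store_id, sku, amount
    FROM pos_parquet_flat
    WHERE sale_date BETWEEN '2014-01-01' AND '2014-12-31';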