[ https://issues.apache.org/jira/browse/HIVE-16969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Colin Ma updated HIVE-16969:
----------------------------
    Attachment: HIVE-16969.001.patch

With the patch, I tested query13 of TPC-DS on my local cluster. The cluster has 6 nodes, 128G memory per node, Intel(R) Xeon(R) E5-2680 CPUs, and a 1G network. The test uses the 10G data scale with Spark as the execution engine. The tables are stored as Parquet files, and the largest table has 1825 partitions. With the patch, the execution time drops from {color:red}85s{color} to {color:#14892c}71s{color}, and the MapOperator initialization time drops from {color:red}15s{color} to {color:#14892c}less than 1s{color}.

> Improve performance of MapOperator for Parquet
> --------------------------------------------------
>
>                 Key: HIVE-16969
>                 URL: https://issues.apache.org/jira/browse/HIVE-16969
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Colin Ma
>            Assignee: Colin Ma
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16969.001.patch
>
>
> For a table with many partition files, MapOperator.cloneConfsForNestedColPruning() updates hive.io.file.readNestedColumn.paths many times. A large value of hive.io.file.readNestedColumn.paths causes poor performance in ParquetHiveSerDe.processRawPrunedPaths().
> Unnecessary paths should therefore not be appended to hive.io.file.readNestedColumn.paths.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
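For illustration only, and not the content of the attached HIVE-16969.001.patch: a minimal sketch of how duplicate nested-column pruning paths could be skipped before being appended to hive.io.file.readNestedColumn.paths when per-partition configurations are cloned. The class and method names below are hypothetical.

{code:java}
// Hypothetical sketch: append nested-column pruning paths to the job
// configuration only when they are not already present, so cloning the
// configuration for many partitions does not grow the property repeatedly.
// The property name matches hive.io.file.readNestedColumn.paths from the
// issue; the class and method names are illustrative only.
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;

public class NestedColumnPathUtil {
  public static final String READ_NESTED_COLUMN_PATHS =
      "hive.io.file.readNestedColumn.paths";

  /** Adds only the paths that are not already in the property value. */
  public static void appendIfMissing(Configuration conf, String... newPaths) {
    String existing = conf.get(READ_NESTED_COLUMN_PATHS, "");
    Set<String> paths = new LinkedHashSet<>();
    if (!existing.isEmpty()) {
      paths.addAll(Arrays.asList(existing.split(",")));
    }
    boolean changed = false;
    for (String p : newPaths) {
      if (paths.add(p)) {        // true only if the path was not present yet
        changed = true;
      }
    }
    if (changed) {
      conf.set(READ_NESTED_COLUMN_PATHS, String.join(",", paths));
    }
  }
}
{code}

With this kind of guard, repeated calls for partitions that share the same pruned columns leave the property unchanged, so its value stays small regardless of the number of partition files.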