[jira] [Updated] (HIVE-22495) Parquet count(*) read in all data

Jason Xu (Jira) Thu, 14 Nov 2019 06:38:04 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jason Xu updated HIVE-22495:
----------------------------
    Description: 
Running a Hive query on a Parquet table:

select count(*) from t

The query reads all data (all columns) instead of just the metadata.

For comparison, Hive 0.13 and Spark read much less data.

 
||engine||HDFS data read||
|Hive 2.3.4|          452.9 MB|
|Hive 0.13|            22.5 KB|
|Spark|            41.6 KB|

 

The cause seems to be that the Parquet read support falls back to the file 
schema if indexColumnsWanted is empty; this logic still exists in the master branch.

I don't know why this empty-list check was added. Please advise whether 
changing it would have any other impact.
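
The suspected fallback can be sketched as follows. This is a simplified, 
self-contained model and not Hive's actual code: the class ProjectionSketch and 
its methods are hypothetical names, and only indexColumnsWanted comes from this 
report. It contrasts treating an empty wanted-column list as "use the whole file 
schema" with treating it as "no columns needed", which is what count(*) would want.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of column projection; names are illustrative only.
public class ProjectionSketch {

    // Models the reported behavior: an empty list of wanted column indexes
    // is treated as "no projection", so every column in the file is requested.
    public static List<String> projectWithFallback(List<String> fileColumns,
                                                   List<Integer> indexColumnsWanted) {
        if (indexColumnsWanted.isEmpty()) {
            return new ArrayList<>(fileColumns);
        }
        return select(fileColumns, indexColumnsWanted);
    }

    // Models the alternative: an empty list means no columns are needed,
    // so the requested projection is empty.
    public static List<String> projectWithoutFallback(List<String> fileColumns,
                                                      List<Integer> indexColumnsWanted) {
        return select(fileColumns, indexColumnsWanted);
    }

    // Picks the columns at the wanted positions, skipping out-of-range indexes.
    private static List<String> select(List<String> fileColumns,
                                       List<Integer> indexColumnsWanted) {
        List<String> selected = new ArrayList<>();
        for (int index : indexColumnsWanted) {
            if (index >= 0 && index < fileColumns.size()) {
                selected.add(fileColumns.get(index));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> fileColumns = Arrays.asList("id", "name", "payload");
        // count(*) needs no column values, so the wanted list is empty.
        List<Integer> wanted = Collections.emptyList();
        System.out.println("with fallback:    " + projectWithFallback(fileColumns, wanted));
        System.out.println("without fallback: " + projectWithoutFallback(fileColumns, wanted));
    }
}
```

In this model, the fallback makes a query that needs no columns request all 
three, while the alternative requests none. Whether an empty projection is safe 
for every other query shape is the open question raised above.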

 

 

 

  was:
Running a Hive query on a Parquet table:

select count(*) from t

The query reads all data (all columns) instead of just the metadata.

For comparison, Hive 0.13 and Spark read much less data.

 
||engine||HDFS data read||
|Hive 2.3.4|          452.9 MB|
|Hive 0.13|            22.5 KB|
|Spark|            41.6 KB|

 

The cause seems to be that the Parquet read support falls back to the file 
schema if indexColumnsWanted is empty; this logic still exists in the master branch.

 

 

 


> Parquet count(*) read in all data
> ---------------------------------
>
>                 Key: HIVE-22495
>                 URL: https://issues.apache.org/jira/browse/HIVE-22495
>             Project: Hive
>          Issue Type: Bug
>          Components: Reader
>            Reporter: Jason Xu
>            Assignee: Jason Xu
>            Priority: Major
>         Attachments: HIVE-22495.patch
>
>
> Running a Hive query on a Parquet table:
> select count(*) from t
> The query reads all data (all columns) instead of just the metadata.
> For comparison, Hive 0.13 and Spark read much less data.
>  
> ||engine||HDFS data read||
> |Hive 2.3.4|          452.9 MB|
> |Hive 0.13|            22.5 KB|
> |Spark|            41.6 KB|
>  
> The cause seems to be that the Parquet read support falls back to the file 
> schema if indexColumnsWanted is empty; this logic still exists in the master branch.
> I don't know why this empty-list check was added. Please advise whether 
> changing it would have any other impact.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
