[ https://issues.apache.org/jira/browse/HIVE-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Xu updated HIVE-22495:
----------------------------
    Attachment: HIVE-22495.patch
        Status: Patch Available  (was: Open)

Removing the empty-list check on indexColumnsWanted would avoid reading in all data for "select count(*)". Please suggest whether this would have any other impact. Please help review, thanks!

> Parquet count(*) read in all data
> ---------------------------------
>
>                 Key: HIVE-22495
>                 URL: https://issues.apache.org/jira/browse/HIVE-22495
>             Project: Hive
>          Issue Type: Bug
>          Components: Reader
>            Reporter: Jason Xu
>            Assignee: Jason Xu
>            Priority: Major
>         Attachments: HIVE-22495.patch, HIVE-22495.patch
>
>
> Running a Hive query on a Parquet table:
> select count(*) from test_table
> The query reads in all data (all columns) instead of just metadata.
> For comparison, Hive 0.13 and Spark read in much less data with my test table.
>
> ||engine||HDFS data read||
> |Hive 2.3.4| 452.9 MB|
> |Hive 0.13| 22.5 KB|
> |Spark| 41.6 KB|
>
> The cause seems to be that Parquet read support falls back to the file schema if
> indexColumnsWanted is empty; this logic still exists in the master branch.
> I don't know why this empty-list check was added; please suggest if there are
> any other impacts.
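
To make the reported behavior concrete, here is a minimal toy model of the fallback described in this issue. It is not the Hive or Parquet code: the function projected_columns, the flag fall_back_when_empty, and the three-column schema are all hypothetical names chosen for illustration. It also assumes that, without the check, an empty indexColumnsWanted list simply yields an empty projection.

```python
from typing import List


def projected_columns(file_columns: List[str],
                      index_columns_wanted: List[int],
                      fall_back_when_empty: bool) -> List[str]:
    """Toy model of column projection; NOT the actual Hive read-support code."""
    if not index_columns_wanted and fall_back_when_empty:
        # An empty request is treated as "no projection": every column is read.
        return list(file_columns)
    # Otherwise read only the requested columns (possibly none at all).
    return [file_columns[i] for i in index_columns_wanted
            if 0 <= i < len(file_columns)]


file_columns = ["id", "name", "payload"]  # hypothetical table schema

# count(*) needs no column values, so the wanted-index list is empty.
with_check = projected_columns(file_columns, [], fall_back_when_empty=True)
without_check = projected_columns(file_columns, [], fall_back_when_empty=False)

print(with_check)     # full file schema: all columns would be read
print(without_check)  # empty projection: no column data would be read
```

In this sketch the only difference between the two calls is the empty-list check, which mirrors the gap between reading all column data and reading close to metadata only. Whether any other caller depends on the fallback is exactly the open question raised in this issue.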