[ https://issues.apache.org/jira/browse/HIVE-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jakub Havlík updated HIVE-10278:
--------------------------------
    Priority: Blocker  (was: Major)

> Hive does not use Parquet projection to access structures
> ---------------------------------------------------------
>
>                 Key: HIVE-10278
>                 URL: https://issues.apache.org/jira/browse/HIVE-10278
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, Hive, Physical Optimizer, Query Planning, Query Processor, Types
>    Affects Versions: 1.0.0
>         Environment: CentOS 6.5, Cloudera 2.5.0-cdh5.3.0, 120 nodes in a cluster.
>            Reporter: Jakub Havlík
>            Priority: Blocker
>              Labels: performance
>
> Selecting from a table stored in Parquet format with structures does not use
> projections as per the Parquet specification. This means that reading just one
> item from a structure results in reading the whole structure. This was found
> with the following test:
>
> Two tables (one flat, one with structures) were created as follows:
>
> drop table if exists test_flat;
> create table test_flat
> (urlurl string,
> urlvalid boolean,
> urlhost string,
> urldomain string,
> urlsubdomain string,
> urlprotocol string,
> urlsuffix string,
> urlmiddomain string,
> refererurl string,
> referervalid boolean,
> refererhost string,
> refererdomain string,
> referersubdomain string,
> refererprotocol string,
> referersuffix string,
> referermiddomain string)
> stored as parquet
> ;
>
> drop table if exists test_struct;
> create table test_struct
> (url struct<url:string, valid:boolean, host:string, domain:string,
> subdomain:string, protocol:string, suffix:string, middomain:string>,
> referer struct<url:string, valid:boolean, host:string, domain:string,
> subdomain:string, protocol:string, suffix:string, middomain:string>)
> stored as parquet;
>
> The sizes of these tables are:
>
> [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_flat/
> 820.4 G  1.6 T  /results/havlik/new_calibration/test_flat
> [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_struct/
> 822.6 G  1.6 T  /results/havlik/new_calibration/test_struct
>
> Struct SELECT:
>
> select
> count(*)
> from
> test_struct
> where
> url.valid = true
> and referer.valid = true;
>
> Flat SELECT:
>
> select
> count(*)
> from
> test_flat
> where
> urlvalid = true
> and referervalid = true;
>
> CPU time:
> flat: 11785 seconds
> struct: 38004 seconds
>
> HDFS bytes read:
> flat: 1 812 148 468
> struct: 883 774 856 844 (which is the total size of the table)
>
> Using our own MapReduce job, it is possible to use projections into structures
> and get results similar to the flat table. Hive clearly needs to implement this,
> as the current behavior causes unnecessary disk reads and CPU time overhead and
> cripples performance.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
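
To put the reported figures in proportion, the ratios between the struct and flat runs can be computed directly. The following is a minimal sketch in Python (chosen here only for the arithmetic). The variable names are illustrative and not part of the report. The numbers are copied from the "CPU time" and "HDFS bytes read" figures above.

```python
# Figures copied from the report; the variable names are illustrative.
flat_bytes_read = 1_812_148_468
struct_bytes_read = 883_774_856_844
flat_cpu_seconds = 11_785
struct_cpu_seconds = 38_004

# How many times more data the struct query read than the flat query.
read_amplification = struct_bytes_read / flat_bytes_read

# How many times more CPU time the struct query consumed.
cpu_ratio = struct_cpu_seconds / flat_cpu_seconds

print(f"HDFS bytes read, struct vs. flat: {read_amplification:.1f}x")
print(f"CPU time, struct vs. flat: {cpu_ratio:.1f}x")
```

Under these figures the struct query reads several hundred times more data than the flat one while using only about three times the CPU time.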