[ https://issues.apache.org/jira/browse/HIVE-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-21709: ---------------------------------- Labels: pull-request-available (was: ) > Count with expression does not work in Parquet > ---------------------------------------------- > > Key: HIVE-21709 > URL: https://issues.apache.org/jira/browse/HIVE-21709 > Project: Hive > Issue Type: Bug > Affects Versions: 2.3.2 > Reporter: Mainak Ghosh > Priority: Major > Labels: pull-request-available > > For parquet file with nested schema, count with expression as column name > does not work when you are filtering on another column in the same struct. > Here are the steps to reproduce: > {code:java} > CREATE TABLE `test_table`( `rtb_win` struct<`impression_id`:string, > `pub_id`:string>) ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'; > INSERT INTO TABLE test_table SELECT named_struct('impression_id', 'cat', > 'pub_id', '2'); > select count(rtb_win.impression_id) from test_table where rtb_win.pub_id ='2'; > WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the > future versions. Consider using a different execution engine (i.e. spark, > tez) or using Hive 1.X releases. > +------+ > | _c0 | > +------+ > | 0 | > +------+ > select count(*) from test_parquet_count_mghosh where rtb_win.pub_id ='2'; > WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the > future versions. Consider using a different execution engine (i.e. spark, > tez) or using Hive 1.X releases. > +------+ > | _c0 | > +------+ > | 1 | > +------+{code} > As you can see the first query returns the wrong result while the second one > returns the correct result. > The issue is an column order mismatch between the actual parquet file > (impression_id first and pub_id second) and the Hive prunedCols datastructure > (reverse). As a result in the filter we compare with the wrong value and the > count returns 0. I have been able to identify the cause of this mismatch. > I would love to get the code reviewed and merged. Some of the code changes > are changes to commits from Ferdinand Xu and Chao Sun. -- This message was sent by Atlassian JIRA (v7.6.3#76005)