[ https://issues.apache.org/jira/browse/HIVE-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841549#comment-16841549 ]
Mainak Ghosh commented on HIVE-21709: ------------------------------------- Thanks David. I will add the unit test and the patch after following the documentation you shared. Does the code review depend on these steps? I am not sure whether the problem occurs in the current Hive version. I would assume it does as the original code has not changed in the current version either. Can you help me test it in the new version? > Count with expression does not work in Parquet > ---------------------------------------------- > > Key: HIVE-21709 > URL: https://issues.apache.org/jira/browse/HIVE-21709 > Project: Hive > Issue Type: Bug > Affects Versions: 2.3.2 > Reporter: Mainak Ghosh > Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > For parquet file with nested schema, count with expression as column name > does not work when you are filtering on another column in the same struct. > Here are the steps to reproduce: > {code:java} > CREATE TABLE `test_table`( `rtb_win` struct<`impression_id`:string, > `pub_id`:string>) ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'; > INSERT INTO TABLE test_table SELECT named_struct('impression_id', 'cat', > 'pub_id', '2'); > select count(rtb_win.impression_id) from test_table where rtb_win.pub_id ='2'; > WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the > future versions. Consider using a different execution engine (i.e. spark, > tez) or using Hive 1.X releases. > +------+ > | _c0 | > +------+ > | 0 | > +------+ > select count(*) from test_parquet_count_mghosh where rtb_win.pub_id ='2'; > WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the > future versions. Consider using a different execution engine (i.e. spark, > tez) or using Hive 1.X releases. > +------+ > | _c0 | > +------+ > | 1 | > +------+{code} > As you can see the first query returns the wrong result while the second one > returns the correct result. > The issue is an column order mismatch between the actual parquet file > (impression_id first and pub_id second) and the Hive prunedCols datastructure > (reverse). As a result in the filter we compare with the wrong value and the > count returns 0. I have been able to identify the cause of this mismatch. > I would love to get the code reviewed and merged. Some of the code changes > are changes to commits from Ferdinand Xu and Chao Sun. -- This message was sent by Atlassian JIRA (v7.6.3#76005)