[jira] [Commented] (HIVE-21709) Count with expression does not work in Parquet

Mainak Ghosh (JIRA) Thu, 16 May 2019 10:26:39 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841549#comment-16841549
 ]


Mainak Ghosh commented on HIVE-21709:
-------------------------------------

Thanks David. I will add the unit test and the patch after following the 
documentation you shared. Does the code review depend on these steps?

I am not sure whether the problem occurs in the current Hive version. I 
assume it does, since the original code has not changed in the current version 
either. Can you help me test it in the new version?

 

> Count with expression does not work in Parquet
> ----------------------------------------------
>
>                 Key: HIVE-21709
>                 URL: https://issues.apache.org/jira/browse/HIVE-21709
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.2
>            Reporter: Mainak Ghosh
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For a Parquet file with a nested schema, count with an expression as the 
> column name does not work when the query filters on another column in the 
> same struct. Here are the steps to reproduce:
> {code:java}
> CREATE TABLE `test_table`( `rtb_win` struct<`impression_id`:string, 
> `pub_id`:string>) ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS 
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> INSERT INTO TABLE test_table SELECT named_struct('impression_id', 'cat', 
> 'pub_id', '2');
> select count(rtb_win.impression_id) from test_table where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases.
> +------+ 
> | _c0  |
> +------+ 
> | 0    | 
> +------+
> select count(*) from test_table where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases. 
> +------+ 
> | _c0  | 
> +------+ 
> | 1    | 
> +------+{code}
> As you can see, the first query returns the wrong result while the second one 
> returns the correct result.
> The issue is a column order mismatch between the actual Parquet file 
> (impression_id first and pub_id second) and the Hive prunedCols data structure 
> (the reverse order). As a result, the filter compares against the wrong value 
> and the count returns 0. I have been able to identify the cause of this mismatch.
> I would love to get the code reviewed and merged. Some of the code changes 
> are changes to commits from Ferdinand Xu and Chao Sun.
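
The mismatch described above can be illustrated with a small toy model. This is only a sketch and is not Hive's code: the class and the helpers readByPosition and readByName are hypothetical, and the lists merely stand in for the Parquet file's field order and the reversed prunedCols order mentioned in the description. It shows how resolving a struct field by its position in a differently ordered list picks up the wrong value, so a filter on pub_id = '2' never matches.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these names come from Hive.
public class PrunedColumnOrderSketch {

    // Hypothetical lookup that trusts the position of the field in the
    // pruned-column list and applies it directly to the file's values.
    static String readByPosition(List<String> fileValues,
                                 List<String> prunedCols,
                                 String wanted) {
        int idx = prunedCols.indexOf(wanted);
        return fileValues.get(idx);
    }

    // Hypothetical lookup that resolves the field by name against the
    // file's own schema order.
    static String readByName(List<String> fileSchema,
                             List<String> fileValues,
                             String wanted) {
        int idx = fileSchema.indexOf(wanted);
        return fileValues.get(idx);
    }

    public static void main(String[] args) {
        // Order in the Parquet file, per the issue description.
        List<String> fileSchema = Arrays.asList("impression_id", "pub_id");
        List<String> fileValues = Arrays.asList("cat", "2");
        // Reversed order, standing in for prunedCols.
        List<String> prunedCols = Arrays.asList("pub_id", "impression_id");

        String byPosition = readByPosition(fileValues, prunedCols, "pub_id");
        String byName = readByName(fileSchema, fileValues, "pub_id");

        // The filter pub_id = '2' only matches when the right value is read.
        System.out.println("by position matches: " + "2".equals(byPosition));
        System.out.println("by name matches: " + "2".equals(byName));
    }
}
```

In this model the positional lookup reads 'cat' for pub_id, so the filter rejects the row, which is consistent with the count of 0 reported above; the name-based lookup reads '2' and the row passes.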



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
