[ https://issues.apache.org/jira/browse/HIVE-21407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945895#comment-16945895 ]
Marta Kuczora commented on HIVE-21407:
--------------------------------------

Hi [~kgyrtkirk],

sorry, I had some other things going on, so I couldn't get back to this one earlier, but now I had some time.

As we discussed back then, I checked whether [HIVE-21316|https://issues.apache.org/jira/browse/HIVE-21316] might fix this issue as well, but unfortunately it doesn't.
When the filter predicate is created, the ConvertAstToSearchArg.getType method determines the type of the PredicateLeaf. It handles CHAR, VARCHAR and STRING the same way and returns STRING as the type. Because of this, when the predicate is pushed to Parquet, we can no longer tell whether the STRING type was originally a CHAR, VARCHAR or STRING. HIVE-21316 doesn't affect this logic.

By the way, I figured that this approach doesn't solve all use cases, so I am trying a new approach now. I uploaded the patch to run the tests, and if it doesn't break any of them, I will post it on Review Board.

> Parquet predicate pushdown is not working correctly for char column types
> -------------------------------------------------------------------------
>
>                 Key: HIVE-21407
>                 URL: https://issues.apache.org/jira/browse/HIVE-21407
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>         Attachments: HIVE-21407.2.patch, HIVE-21407.3.patch,
> HIVE-21407.4.patch, HIVE-21407.patch
>
>
> If the 'hive.optimize.index.filter' parameter is false, the filter predicate
> is not pushed to Parquet, so the filtering only happens within Hive.
> If the parameter is true, the filter is pushed to Parquet, but for a char type, the
> value which is pushed to Parquet will be padded with spaces:
> {noformat}
>   @Override
>   public void setValue(String val, int len) {
>     super.setValue(HiveBaseChar.getPaddedValue(val, len), -1);
>   }
> {noformat}
> So if we have a char(10) column which contains the value "apple" and the
> where condition looks like 'where c='apple'', the value pushed to Parquet will
> be 'apple' followed by 5 spaces. But the stored values are not padded, so no
> rows will be returned from Parquet.
> How to reproduce:
> {noformat}
> $ create table ppd (c char(10), v varchar(10), i int) stored as parquet;
> $ insert into ppd values ('apple', 'bee', 1),('apple', 'tree', 2),('hello', 'world', 1),('hello','vilag',3);
> $ set hive.optimize.ppd.storage=true;
> $ set hive.vectorized.execution.enabled=true;
> $ set hive.vectorized.execution.enabled=false;
> $ set hive.optimize.ppd=true;
> $ set hive.optimize.index.filter=true;
> $ set hive.parquet.timestamp.skip.conversion=false;
> $ select * from ppd where c='apple';
> +--------+--------+--------+
> | ppd.c  | ppd.v  | ppd.i  |
> +--------+--------+--------+
> +--------+--------+--------+
> $ set hive.optimize.index.filter=false; or set hive.optimize.ppd.storage=false;
> $ select * from ppd where c='apple';
> +-------------+--------+--------+
> |    ppd.c    | ppd.v  | ppd.i  |
> +-------------+--------+--------+
> | apple       | bee    | 1      |
> | apple       | tree   | 2      |
> +-------------+--------+--------+
> {noformat}
> The issue surfaced after the fix for
> [HIVE-21327|https://issues.apache.org/jira/browse/HIVE-21327] was uploaded
> upstream. Before the HIVE-21327 fix, setting the parameter
> 'hive.parquet.timestamp.skip.conversion' to true in the parquet_ppd_char.q
> test hid this issue.
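
The mismatch described in the issue can be illustrated with a minimal standalone sketch in plain Java. It does not use any Hive or Parquet classes and is not the patch attached to this ticket: `padTo` and `stripTrailingSpaces` are hypothetical helpers, where `padTo` only stands in for the space padding that the description attributes to HiveBaseChar.getPaddedValue. It shows why an equality check against the unpadded stored value finds nothing, and that the same check succeeds once the trailing spaces are ignored.

```java
public class CharPaddingSketch {

    // Hypothetical helper (not Hive code): right-pads val with spaces up to len
    // characters, imitating the padding applied to a char(len) literal.
    public static String padTo(String val, int len) {
        StringBuilder sb = new StringBuilder(val);
        while (sb.length() < len) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // Hypothetical helper (not Hive code): drops trailing spaces only.
    public static String stripTrailingSpaces(String val) {
        int end = val.length();
        while (end > 0 && val.charAt(end - 1) == ' ') {
            end--;
        }
        return val.substring(0, end);
    }

    public static void main(String[] args) {
        String stored = "apple";            // value as stored, without padding
        String pushed = padTo("apple", 10); // literal padded for a char(10) column

        // Exact comparison fails because of the 5 trailing spaces.
        System.out.println("padded equals stored:   " + pushed.equals(stored));

        // Comparing without the trailing padding matches again.
        System.out.println("stripped equals stored: "
            + stripTrailingSpaces(pushed).equals(stored));
    }
}
```

This only demonstrates the comparison problem. How the pushed-down predicate should be built so that CHAR, VARCHAR and STRING can be told apart is left open here, as in the comment above.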