[ https://issues.apache.org/jira/browse/HIVE-26320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610984#comment-17610984 ]
Stamatis Zampetakis commented on HIVE-26320:
--------------------------------------------

The Parquet format specification does not offer bounded character data types, so a maximum length cannot be enforced at the storage level. The best that Hive can do (and what it actually does) is pick the STRING [logical type|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string] for the actual storage. When storing data in a Parquet file (with Hive or with other tools writing directly into Parquet) we can easily end up with arbitrarily long strings. The Hive DDL allows declaring character types with a maximum length, but for the Parquet storage handler this is more of a hint for the compiler than the constraint it should be. When we write data exclusively with Hive we can probably enforce the max length constraint, and maybe we do in some places, but we cannot control what other (external) writers do, since we are talking about an external table. Anyway, if the DDL says that we have a length constraint, I guess nobody will blame us/Hive if we fail miserably while reading data that do not conform to the definition. Maybe it would be a good idea to advise people to use only the STRING data type in the DDL for Parquet, but since we claim to support char/varchar types (HIVE-7735) it is a bit too late for that now.

The Parquet reader gives back a {{Text}} instead of a {{HiveVarcharWritable}}, probably because there are no length limitations in the format itself. The code responsible for this is [ETypeConverter|https://github.com/apache/hive/blob/c19d56ec7429bfcfad92b62ac335dbf8177dab24/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java#L488]; observe that at this point we do not care much about what the actual DDL type is (we have it in hiveTypeInfo) but simply take the binary data from Parquet and wrap it in a {{Text}} object.

If we want to change the way Parquet is deserialized, I guess we have to do it inside {{ETypeConverter}}. There we have the actual DDL type, and if we do it at this level we avoid the performance overhead of multiple conversions. Other type conversions to HiveXX objects also take place in this class, so it seems like the right place. We can construct a {{HiveVarcharWritable}} at this level, but as I said before, if the data do not conform to the max length then most likely we are going to truncate them silently.
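To make that concrete, here is a minimal sketch of the kind of conversion I have in mind. The helper and its wiring are hypothetical (the real change would live inside the binary converter of {{ETypeConverter}}, where hiveTypeInfo is available); only {{Text}}, {{HiveVarcharWritable}} and {{VarcharTypeInfo}} are actual Hadoop/Hive classes:

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.hive.serde2.io.HiveVarcharWritable;
import org.apache.hadoop.hive.serde2.typeinfo.VarcharTypeInfo;
import org.apache.hadoop.io.Text;

public class VarcharConversionSketch {

  // Hypothetical helper: instead of wrapping the raw Parquet bytes in a Text,
  // build a HiveVarcharWritable and enforce the max length declared in the DDL.
  static HiveVarcharWritable toVarcharWritable(byte[] parquetUtf8Bytes, VarcharTypeInfo typeInfo) {
    String raw = new String(parquetUtf8Bytes, StandardCharsets.UTF_8);
    HiveVarcharWritable writable = new HiveVarcharWritable();
    writable.set(raw, typeInfo.getLength()); // silently truncates values longer than the declared length
    return writable;
  }

  public static void main(String[] args) {
    byte[] bytes = "BBB".getBytes(StandardCharsets.UTF_8); // data longer than varchar(2)

    Text current = new Text(bytes);                                    // what the reader produces today
    HiveVarcharWritable proposed = toVarcharWritable(bytes, new VarcharTypeInfo(2));

    System.out.println(current);                                       // BBB
    System.out.println(proposed.getHiveVarchar().getValue());          // BB  (truncated)
  }
}
{code}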
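And to see why handing back a plain {{Text}} matters for downstream comparisons, here is a tiny standalone check, purely my own illustration and not what {{GenericUDFIn}} (or any other operator) literally executes, of how the two writables compare when they carry the same logical value:

{code:java}
import org.apache.hadoop.hive.common.type.HiveVarchar;
import org.apache.hadoop.hive.serde2.io.HiveVarcharWritable;
import org.apache.hadoop.io.Text;

public class WritableMismatchSketch {
  public static void main(String[] args) {
    // What the Parquet reader currently hands back for kob = 'BB':
    Text fromParquet = new Text("BB");
    // The same value represented as a varchar(2) writable:
    HiveVarcharWritable asVarchar = new HiveVarcharWritable(new HiveVarchar("BB", 2));

    // Same logical value, but the two writables never compare equal as objects;
    // some conversion (or a string-level comparison) has to happen somewhere.
    System.out.println(fromParquet.equals(asVarchar)); // false
    System.out.println(asVarchar.equals(fromParquet)); // false
    System.out.println(fromParquet.toString()
        .equals(asVarchar.getHiveVarchar().getValue())); // true
  }
}
{code}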
Now considering the other angle that John mentioned, it may also make sense to fix this at the {{GenericUDFIn}} level. The fact that the query below works fine, while the query in the summary does not, indicates that there is specific code handling structs vs simple columns in the function itself and that something in it is problematic. I strongly suspect that the object coming from Parquet is still a {{Text}}, but there is conversion code in the function that handles it successfully when it is not inside a struct.

{code:sql}
select case when (kob='BB' or kob='BC') then 1 else 0 end as logic_check from case_test_parquet;
{code}

> Incorrect case evaluation for Parquet based table
> -------------------------------------------------
>
>                 Key: HIVE-26320
>                 URL: https://issues.apache.org/jira/browse/HIVE-26320
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, Query Planning
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Chiran Ravani
>            Assignee: John Sherman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> A query involving a case statement with two or more conditions leads to an
> incorrect result for tables stored as Parquet; the problem is not observed
> with ORC or TextFile.
> *Steps to reproduce*:
> {code:java}
> create external table case_test_parquet(kob varchar(2), enhanced_type_code int) stored as parquet;
> insert into case_test_parquet values('BB',18),('BC',18),('AB',18);
> select case when (
>     (kob='BB' and enhanced_type_code='18')
>     or (kob='BC' and enhanced_type_code='18')
>   )
>   then 1
>   else 0
> end as logic_check
> from case_test_parquet;
> {code}
> Result:
> {code}
> 0
> 0
> 0
> {code}
> Expected result:
> {code}
> 1
> 1
> 0
> {code}
> The problem does not appear when setting hive.optimize.point.lookup=false.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)