[ 
https://issues.apache.org/jira/browse/HIVE-26320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610984#comment-17610984
 ] 

Stamatis Zampetakis commented on HIVE-26320:
--------------------------------------------

The Parquet format specification does not offer bounded character data types, so 
a maximum length cannot be enforced at the storage level. The best that Hive can 
do (and what it actually does) is pick the STRING [logical 
type|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string]
 for the actual storage. When writing data to a Parquet file (whether with Hive 
or with other tools writing directly into Parquet) we can easily end up with 
arbitrarily long strings. 
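
For illustration, here is roughly what the varchar(2) column from the summary 
becomes in the Parquet schema: an annotated binary field with no notion of a 
maximum length. This is only a minimal sketch using the parquet-mr schema 
builder (the column name comes from the summary; the exact builder calls depend 
on the parquet version):

{code:java}
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class VarcharAsParquetString {
  public static void main(String[] args) {
    // What a Hive varchar(2) column effectively becomes in the Parquet schema:
    // an annotated binary field with no notion of a maximum length.
    Type kob = Types.optional(PrimitiveTypeName.BINARY)
        .as(LogicalTypeAnnotation.stringType())
        .named("kob");
    System.out.println(kob); // prints something like: optional binary kob (STRING)
  }
}
{code}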

The Hive DDL allows declaring character types with a maximum length, but for the 
Parquet storage handler this seems to be more of a hint for the compiler than 
the constraint it should be. When data is written exclusively with Hive we can 
probably enforce the max length constraint, and we may do so in some places, but 
we cannot control what other (external) writers do (since we are talking about 
an external table). Anyway, if the DDL says that we have a length constraint, I 
guess nobody will blame us/Hive if we fail miserably while reading data that 
does not conform to the definition.
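
To make the external-writer point concrete, here is a rough sketch (file path 
and values are made up, and it assumes the parquet-mr example writer API) of how 
a non-Hive writer can put a five-character value into the same column that the 
DDL declares as varchar(2):

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ExternalOverLengthWriter {
  public static void main(String[] args) throws Exception {
    // Same physical schema Hive uses for the table in the summary;
    // the varchar(2) limit from the DDL is simply not present here.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message case_test_parquet {"
            + " optional binary kob (UTF8);"
            + " optional int32 enhanced_type_code; }");
    try (ParquetWriter<Group> writer =
             ExampleParquetWriter.builder(new Path("/tmp/case_test_parquet/extra.parquet"))
                 .withType(schema)
                 .build()) {
      // Nothing stops an external writer from exceeding the declared maximum length.
      writer.write(new SimpleGroupFactory(schema).newGroup()
          .append("kob", "BBBBB")
          .append("enhanced_type_code", 18));
    }
  }
}
{code}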

Maybe it would be a good idea to advise people to use only the STRING data type 
in the DDL for Parquet, but since we claim to support char/varchar types 
(HIVE-7735) it is a bit too late to do that now.

The Parquet reader gives back a {{Text}} instead of a {{HiveVarcharWritable}}, 
probably because there are no length limitations in the format itself. The code 
responsible for this is 
[ETypeConverter|https://github.com/apache/hive/blob/c19d56ec7429bfcfad92b62ac335dbf8177dab24/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java#L488];
 observe that at this point we don't care much about what the actual DDL type is 
(we have it in {{hiveTypeInfo}}), we simply take the binary data from Parquet 
and wrap it in a {{Text}} object. 
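
A stripped-down sketch of what that conversion path boils down to (this is not 
the actual ETypeConverter code, just the essence of it):

{code:java}
import org.apache.hadoop.io.Text;
import org.apache.parquet.io.api.Binary;

public class BinaryToText {
  // Not the actual ETypeConverter code, only the essence of that path:
  // the Parquet bytes are wrapped in a Text regardless of the declared
  // char/varchar length in the DDL.
  static Text convert(Binary binary) {
    Text result = new Text();
    result.set(binary.getBytes());
    return result;
  }
}
{code}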

I guess if we want to change the way that Parquet is deserialized we probably 
have to do it inside {{ETypeConverter}}. There we have the actual DDL type, and 
doing it at this level avoids the performance overhead of multiple conversions. 
Plus, other type conversions to HiveXX objects also take place in this class, so 
it seems like the right place. We can construct a {{HiveVarcharWritable}} at 
this level, but as I said before, if the data does not conform to the max length 
then most likely we are going to truncate it silently.
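
As a rough sketch of what such a converter body could look like (hypothetical 
code, assuming the DDL type carried in hiveTypeInfo is a varchar):

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.hive.common.type.HiveVarchar;
import org.apache.hadoop.hive.serde2.io.HiveVarcharWritable;
import org.apache.hadoop.hive.serde2.typeinfo.VarcharTypeInfo;
import org.apache.parquet.io.api.Binary;

public class BinaryToHiveVarchar {
  // Hypothetical converter body, assuming hiveTypeInfo is a VarcharTypeInfo.
  static HiveVarcharWritable convert(Binary binary, VarcharTypeInfo hiveTypeInfo) {
    String value = new String(binary.getBytes(), StandardCharsets.UTF_8);
    // HiveVarchar enforces the maximum length, so an over-length value
    // coming from the file would be truncated silently at this point.
    return new HiveVarcharWritable(new HiveVarchar(value, hiveTypeInfo.getLength()));
  }
}
{code}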

Now, considering the other angle that John mentioned, it may also make sense to 
fix this at the {{GenericUDFIn}} level. The fact that the query below works 
fine, while the query in the summary does not, indicates that there is specific 
code handling structs vs. simple columns in the function itself and that it is 
somewhat problematic. I strongly suspect that the object coming from Parquet is 
still a {{Text}}, but there is conversion code in the function that handles it 
successfully when it is not wrapped in a struct (see the sketch after the query 
below for the kind of mismatch I have in mind). 

{code:sql}
select case when (kob='BB' or kob='BC' )
            then 1
            else 0
        end as logic_check
from case_test_parquet;{code}
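
To illustrate why such a mismatch would matter, here is a small sketch (this is 
not GenericUDFIn's actual code, only the kind of comparison problem that mixing 
Text with HiveVarcharWritable can cause):

{code:java}
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hive.common.type.HiveVarchar;
import org.apache.hadoop.hive.serde2.io.HiveVarcharWritable;
import org.apache.hadoop.io.Text;

public class TypeMismatchLookup {
  public static void main(String[] args) {
    // Not GenericUDFIn's actual code; just illustrating why a raw lookup fails
    // when the constants are varchar writables while the probe object is a Text.
    Set<Object> constants = new HashSet<>();
    constants.add(new HiveVarcharWritable(new HiveVarchar("BB", 2)));
    constants.add(new HiveVarcharWritable(new HiveVarchar("BC", 2)));

    Text probe = new Text("BB");
    System.out.println(constants.contains(probe)); // false: a Text never equals a HiveVarcharWritable
  }
}
{code}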


> Incorrect case evaluation for Parquet based table
> -------------------------------------------------
>
>                 Key: HIVE-26320
>                 URL: https://issues.apache.org/jira/browse/HIVE-26320
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, Query Planning
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Chiran Ravani
>            Assignee: John Sherman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> A query involving a case statement with two or more conditions leads to an 
> incorrect result for tables stored in Parquet format. The problem is not 
> observed with ORC or TextFile.
> *Steps to reproduce*:
> {code:java}
> create external table case_test_parquet(kob varchar(2),enhanced_type_code 
> int) stored as parquet;
> insert into case_test_parquet values('BB',18),('BC',18),('AB',18);
> select case when (
>                    (kob='BB' and enhanced_type_code='18')
>                    or (kob='BC' and enhanced_type_code='18')
>                  )
>             then 1
>             else 0
>         end as logic_check
> from case_test_parquet;
> {code}
> Result:
> {code}
> 0
> 0
> 0
> {code}
> Expected result:
> {code}
> 1
> 1
> 0
> {code}
> The problem does not appear when setting hive.optimize.point.lookup=false.


