Jainik Vora created HIVE-24066:
----------------------------------

             Summary: Hive query on parquet data should identify if column is not present in file schema and show NULL value instead of Exception
                 Key: HIVE-24066
                 URL: https://issues.apache.org/jira/browse/HIVE-24066
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 2.3.5
            Reporter: Jainik Vora


I created a Hive table containing columns with a struct data type:
 
{code:java}
CREATE EXTERNAL TABLE abc_dwh.table_on_parquet (
  `context` struct<
    `app`:struct<`build`:string, `name`:string, `namespace`:string, `version`:string>,
    `screen`:struct<`height`:bigint, `width`:bigint>,
    `timezone`:string>,
  `messageid` string,
  `timestamp` string,
  `userid` string)
PARTITIONED BY (year string, month string, day string, hour string)
STORED as PARQUET
LOCATION 's3://abc/xyz';
  {code}
 
All columns are nullable, so the Parquet files read by the table do not always contain every column. If any file in a partition is missing the "context.app" struct and "context.app.version" is queried, Hive fails with the exception below instead of returning NULL. The same happens for "context.screen".
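For example, a query along these lines triggers the failure (the exact failing statement is not quoted in this report, so it is only illustrative; the partition values match the queries shown later):
{code:java}
select context.app.version
from abc_dwh.table_on_parquet
where year=2020 and month='07' and day=26 and hour='03'
limit 5;
{code}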
 
{code:java}
Caused by: java.io.IOException: java.lang.RuntimeException: Primitive type appshould not doesn't match typeapp[version]
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
    ... 25 more
Caused by: java.lang.RuntimeException: Primitive type appshould not doesn't match typeapp[version]
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:330)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:322)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedSchema(DataWritableReadSupport.java:249)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:379)
    at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
    ... 26 more
 {code}
 
Querying "context.app" by itself returns NULL:
{code:java}
hive> select context.app from abc_dwh.table_on_parquet where year=2020 and month='07' and day=26 and hour='03' limit 5;
OK
NULL
NULL
NULL
NULL
NULL
  {code}
 
As a workaround, I tried querying "context.app.version" only when "context.app" is not null, but that gave the same error (an illustrative form of that attempt is sketched after the query output below). *To verify the CASE-based null check, I ran the query below, which should return "0" for every row, yet it returned "1" for all of them.* The only distinct value of context.app for this partition is NULL, which rules out the possibility that the LIMIT queries simply hit different rows. Running the same query in Spark SQL returns the correct result.
{code:java}
hive> select case when context.app is null then 0 else 1 end status from abc_dwh.table_on_parquet where year=2020 and month='07' and day=26 and hour='03' limit 5;
OK
1
1
1
1
1
{code}
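For reference, the attempted workaround looked roughly like the following (the exact statement is not quoted in this report, so the query is only illustrative). It fails with the same projectLeafTypes exception, presumably because the projection of the requested columns against the file schema happens while the record reader is being created (see the stack trace above), before any per-row null check can run:
{code:java}
select case
         when context.app is null then null
         else context.app.version
       end as app_version
from abc_dwh.table_on_parquet
where year=2020 and month='07' and day=26 and hour='03'
limit 5;
{code}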
Hive version used: 2.3.5-amzn-0 (on AWS EMR)


