I created a Hive table containing columns with struct data types:

CREATE EXTERNAL TABLE abc_dwh.table_on_parquet (
  `context` struct<`app`:struct<`build`:string, `name`:string, `namespace`:string, `version`:string>,
                   `screen`:struct<`height`:bigint, `width`:bigint>,
                   `timezone`:string>,
  `messageid` string,
  `timestamp` string,
  `userid` string)
PARTITIONED BY (year string, month string, day string, hour string)
STORED AS PARQUET
LOCATION 's3://abc/xyz';
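For illustration, a partition can end up in this state when the job that wrote it used an older schema in which `context` had no `app` member. The DDL below is a hypothetical sketch of such a writer-side schema (the `_v1` table name is made up), not my actual pipeline:

```sql
-- Hypothetical older schema: `context` has no `app` struct, so the Parquet
-- files written under it omit the context.app group entirely. Reading those
-- files through the newer table definition above is what fails.
CREATE EXTERNAL TABLE abc_dwh.table_on_parquet_v1 (
  `context` struct<`screen`:struct<`height`:bigint, `width`:bigint>,
                   `timezone`:string>,
  `messageid` string,
  `timestamp` string,
  `userid` string)
PARTITIONED BY (year string, month string, day string, hour string)
STORED AS PARQUET
LOCATION 's3://abc/xyz';
```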
All columns are nullable, so the Parquet files read by the table don't always contain every column. If any file in a partition lacks the "context.app" struct and "context.app.version" is queried, I see the exception below; the same happens for "context.screen".

Caused by: java.io.IOException: java.lang.RuntimeException: Primitive type app should not doesn't match type app[version]
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
        at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
        ... 25 more
Caused by: java.lang.RuntimeException: Primitive type app should not doesn't match type app[version]
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:330)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:322)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedSchema(DataWritableReadSupport.java:249)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:379)
        at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
        at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
        at
org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
        ... 26 more

Querying context.app shows it as NULL:

hive> select context.app from abc_dwh.table_on_parquet where year=2020 and month='07' and day=26 and hour='03' limit 5;
OK
NULL
NULL
NULL
NULL
NULL

As a workaround, I tried querying "context.app.version" only if "context.app" is not null, but that gave the same error. *To verify the CASE statement's null check, I ran the query below, which should produce "0" for every row but instead produced "1".* The distinct value of context.app for the partition is NULL, so I ruled out differences caused by select with limit. Running the same query in SparkSQL produces the correct result.

hive> select case when context.app is null then 0 else 1 end status from abc_dwh.table_on_parquet where year=2020 and month='07' and day=26 and hour='03' limit 5;
OK
1
1
1
1
1

*Is there a way to query a nullable struct column reliably?* I'd appreciate any help on this.

Hive version used: 2.3.5-amzn-0 (on AWS EMR)

--
Thanks & Regards,
*Jainik Vora*
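For comparison, this is the projection that works when run through SparkSQL on the same table (a sketch of the query I ran; I'm assuming here it's the nested-field projection that matters, and the comment reflects my reading of the behavior rather than a confirmed explanation):

```sql
-- Same nested projection via SparkSQL. Here Spark appears to resolve the
-- missing context.app group to NULL instead of failing the record reader,
-- so the query returns NULLs as expected.
SELECT context.app.version
FROM abc_dwh.table_on_parquet
WHERE year = 2020 AND month = '07' AND day = 26 AND hour = '03'
LIMIT 5;
```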