Hi, I'm new to using Parquet and ORC files and I'm hitting a problem with querying nested data. Can these file formats be used to query deeply nested data?
If yes, why am I getting an error with the SerDes for both of them?

Here's the background. I'm starting from a JSON data file like this:

{
  "ClientCode": "ABC",
  "JSONUpdateDtm": "200901011000",
  "Encounter": {
    "Number": "5555555-9999999",
    "Patient": {
      "PatientNumber": "987654321",
      "SSN": "123-45-6789"
    },
    "Payers": [
      { "SequenceNumber": "1", "Payer": "MC", "Description": "Medicaid" },
      { "SequenceNumber": "2", "Payer": "XYZ" }
    ]
  }
}

and I've created a Hive table with this schema using a JSON SerDe:

(ClientCode STRING,
 Encounter STRUCT<Number:STRING,
   Patient:STRUCT<PatientNumber:STRING, SSN:STRING>,
   Payers:ARRAY<STRUCT<Description:STRING, Payer:STRING, SequenceNumber:STRING>>>,
 JSONUpdateDtm STRING)

I can issue this query just fine:

hive> select clientcode, encounter.Number, encounter.patient.ssn from json_tbl;
OK
DEF     4444-88888        444-45-4444
ABC     5555555-9999999   123-45-6789

I then created Parquet and ORCFile versions of this data set:

CREATE TABLE parquet_tbl (
  ClientCode STRING,
  Encounter STRUCT<Number:STRING,
    Patient:STRUCT<PatientNumber:STRING, SSN:STRING>,
    Payers:ARRAY<STRUCT<Description:STRING, Payer:STRING, SequenceNumber:STRING>>>,
  JSONUpdateDtm STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';

INSERT OVERWRITE TABLE parquet_tbl SELECT * from json_tbl;

CREATE TABLE orc_tbl (
  ClientCode STRING,
  Encounter STRUCT<Number:STRING,
    Patient:STRUCT<PatientNumber:STRING, SSN:STRING>,
    Payers:ARRAY<STRUCT<Description:STRING, Payer:STRING, SequenceNumber:STRING>>>,
  JSONUpdateDtm STRING)
STORED AS orc;

INSERT OVERWRITE TABLE orc_tbl SELECT * from json_tbl;

I can query these tables when I query two levels deep, but *not three. Am I doing something wrong? Or do these data formats not support deeply nested queries?*

hive> select clientcode, encounter.Number from parquet_tbl;
OK
DEF     4444-88888
ABC     5555555-9999999

hive> select clientcode, encounter.Number from orc_tbl;
OK
DEF     4444-88888
ABC     5555555-9999999

hive> select clientcode, encounter.Number, encounter.patient.ssn from orc_tbl;

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:425)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
    ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 17 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:134)
    ... 22 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
    at org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:53)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:992)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:1018)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:64)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:377)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:453)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:409)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:188)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:377)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:425)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:377)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:113)
    ... 22 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

Thank you,
Michael