Another option is to implement a custom ParquetInputFormat that extends the current Hive MR Parquet format and handles the schema coercion at the input split/record reader level. This would be more involved, but it should work as long as you can add auxiliary jars to your Hive cluster.
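A rough, untested sketch of what that could look like (only the Hive/Hadoop classes are real; the wrapper class name and the coercion logic are mine, it only touches top-level columns, and you would want to verify the in-place rewrite plays nicely with the Parquet SerDe on your Hive version):

import java.io.IOException;

import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
import org.apache.hadoop.hive.serde2.io.DateWritable;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/**
 * Wraps Hive's Parquet input format and rewrites DateWritable cells as Text,
 * so a column declared as string in the table can read files where the same
 * column was written as a Parquet DATE.
 */
public class CoercingParquetInputFormat extends MapredParquetInputFormat {

  @Override
  public RecordReader<NullWritable, ArrayWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final RecordReader<NullWritable, ArrayWritable> inner =
        super.getRecordReader(split, job, reporter);

    return new RecordReader<NullWritable, ArrayWritable>() {
      @Override
      public boolean next(NullWritable key, ArrayWritable value) throws IOException {
        if (!inner.next(key, value)) {
          return false;
        }
        // Coerce date cells to their "yyyy-MM-dd" string form so the string
        // ObjectInspector does not fail on DateWritable.
        Writable[] cells = value.get();
        for (int i = 0; i < cells.length; i++) {
          if (cells[i] instanceof DateWritable) {
            cells[i] = new Text(((DateWritable) cells[i]).get().toString());
          }
        }
        return true;
      }

      @Override public NullWritable createKey() { return inner.createKey(); }
      @Override public ArrayWritable createValue() { return inner.createValue(); }
      @Override public long getPos() throws IOException { return inner.getPos(); }
      @Override public void close() throws IOException { inner.close(); }
      @Override public float getProgress() throws IOException { return inner.getProgress(); }
    };
  }
}

You would then ADD JAR the compiled class on the cluster and point the table, or just the partitions written by Hive 2.1.1, at the wrapper with ALTER TABLE ... SET FILEFORMAT, keeping the standard Parquet output format and SerDe.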
On Wed, Aug 29, 2018 at 8:06 AM Anup Tiwari <anupsdtiw...@gmail.com> wrote:

> Hi All,
>
> We have a use case where we have created a partitioned external table in
> hive 2.3.3 pointing to a parquet location with date-level folders; on some
> days the parquet was created by hive 2.1.1 and on other days it was created
> using Glue. Now when we try to read this data, we get the below error:
>
> Vertex failed, vertexName=Map 1, vertexId=vertex_1535191533874_0135_2_00,
> diagnostics=[Task failed, taskId=task_1535191533874_0135_2_00_000000,
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task
> ( failure ) :
> attempt_1535191533874_0135_2_00_000000_0:java.lang.RuntimeException:
> java.lang.RuntimeException:
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
> processing row [Error getting row data with exception
> java.lang.UnsupportedOperationException: Cannot inspect
> org.apache.hadoop.hive.serde2.io.DateWritable
>   at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveJavaObject(ParquetStringInspector.java:77)
>   at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:247)
>   at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:366)
>   at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:202)
>   at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:188)
>   at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:588)
>   at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:554)
>   at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:86)
>   at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:70)
>   at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ]
>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException:
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
> processing row [Error getting row data with exception
> java.lang.UnsupportedOperationException: Cannot inspect
> org.apache.hadoop.hive.serde2.io.DateWritable
>   at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveJavaObject(ParquetStringInspector.java:77)
>
> After some drilling down I looked at the column schemas inside both types
> of parquet files with parquet-tools and found different data types for some
> columns. For the columns where the data type is the same in both files we
> can query successfully through the above external table, but where there is
> a difference we get the error, so I wanted your suggestion on how to get
> out of this.
>
> Processing all the data with the same engine (hive/glue) would be very
> costly for us as we have data for the last 2-3 years. Please find below a
> table showing the data type in each kind of file for some columns, the data
> type of the same column in the external table, and a "Test Result" column
> that says whether I was able to read it or not.
>
> Parquet made with Hive 2.1.1           | Parquet made with AWS Glue            | Final Hive table (reads both) | Test Result
> optional int32 action_date (DATE);     | optional binary action_date (UTF8);   | action_date : string          | Fail
> optional int64 user_id;                | optional int64 user_id;               | user_id : bigint              | Pass
> optional binary tracking_type (UTF8);  | required binary tracking_type (UTF8); | tracking_type : string        | Fail
> optional int32 game_number;            | optional int64 game_number;           | game_number : bigint          | Pass
>
> Regards,
> Anup Tiwari

-- 
Thai