Another option is to implement a custom ParquetInputFormat that extends the current Hive MR Parquet format and handles the schema coercion at the input split/record reader level. This would be more involved, but it should work as long as you can add auxiliary jars to your Hive cluster.
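A rough, untested sketch of what that could look like (only the Hive/Hadoop classes are real; the wrapper class name and the coercion logic are mine, it only touches top-level columns, and you would want to verify the in-place rewrite plays nicely with the Parquet SerDe on your Hive version):

import java.io.IOException;

import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
import org.apache.hadoop.hive.serde2.io.DateWritable;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/**
 * Wraps Hive's Parquet input format and rewrites DateWritable cells as Text,
 * so a column declared as string in the table can read files where the same
 * column was written as a Parquet DATE.
 */
public class CoercingParquetInputFormat extends MapredParquetInputFormat {

  @Override
  public RecordReader<NullWritable, ArrayWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final RecordReader<NullWritable, ArrayWritable> inner =
        super.getRecordReader(split, job, reporter);

    return new RecordReader<NullWritable, ArrayWritable>() {
      @Override
      public boolean next(NullWritable key, ArrayWritable value) throws IOException {
        if (!inner.next(key, value)) {
          return false;
        }
        // Coerce date cells to their "yyyy-MM-dd" string form so the string
        // ObjectInspector does not fail on DateWritable.
        Writable[] cells = value.get();
        for (int i = 0; i < cells.length; i++) {
          if (cells[i] instanceof DateWritable) {
            cells[i] = new Text(((DateWritable) cells[i]).get().toString());
          }
        }
        return true;
      }

      @Override public NullWritable createKey() { return inner.createKey(); }
      @Override public ArrayWritable createValue() { return inner.createValue(); }
      @Override public long getPos() throws IOException { return inner.getPos(); }
      @Override public void close() throws IOException { inner.close(); }
      @Override public float getProgress() throws IOException { return inner.getProgress(); }
    };
  }
}

You would then ADD JAR the compiled class on the cluster and point the table, or just the partitions written by Hive 2.1.1, at the wrapper with ALTER TABLE ... SET FILEFORMAT, keeping the standard Parquet output format and SerDe.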
On Wed, Aug 29, 2018 at 8:06 AM Anup Tiwari <anupsdtiw...@gmail.com> wrote:

> Hi All,
>
> We have a use case where we have created a partitioned external table in
> hive 2.3.3 pointing to a parquet location with date-level folders; on some
> days the parquet was created by hive 2.1.1 and on other days it was created
> using Glue. Now when we try to read this data, we get the below error:
>
> Vertex failed, vertexName=Map 1, vertexId=vertex_1535191533874_0135_2_00,
> diagnostics=[Task failed, taskId=task_1535191533874_0135_2_00_000000,
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task
> ( failure ) :
> attempt_1535191533874_0135_2_00_000000_0:java.lang.RuntimeException:
> java.lang.RuntimeException:
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
> processing row [Error getting row data with exception
> java.lang.UnsupportedOperationException: Cannot inspect
> org.apache.hadoop.hive.serde2.io.DateWritable
>   at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveJavaObject(ParquetStringInspector.java:77)
>   at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:247)
>   at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:366)
>   at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:202)
>   at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:188)
>   at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:588)
>   at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:554)
>   at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:86)
>   at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:70)
>   at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ]
>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException:
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
> processing row [Error getting row data with exception
> java.lang.UnsupportedOperationException: Cannot inspect
> org.apache.hadoop.hive.serde2.io.DateWritable
>   at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveJavaObject(ParquetStringInspector.java:77)
>
> After some drilling down I looked at the column schemas inside both types
> of parquet files with parquet-tools and found different data types for some
> columns. For the columns where the data type is the same in both files we
> can query successfully through the above external table, but where there is
> a difference we get the error, so I wanted your suggestion on how to get
> out of this.
>
> Processing all the data with the same engine (hive/glue) would be very
> costly for us as we have data for the last 2-3 years. Please find below a
> table showing the data type in each kind of file for some columns, the data
> type of the same column in the external table, and a "Test Result" column
> that says whether I was able to read it or not.
>
> Parquet made with Hive 2.1.1           | Parquet made with AWS Glue            | Final Hive table (reads both) | Test Result
> optional int32 action_date (DATE);     | optional binary action_date (UTF8);   | action_date : string          | Fail
> optional int64 user_id;                | optional int64 user_id;               | user_id : bigint              | Pass
> optional binary tracking_type (UTF8);  | required binary tracking_type (UTF8); | tracking_type : string        | Fail
> optional int32 game_number;            | optional int64 game_number;           | game_number : bigint          | Pass
>
> Regards,
> Anup Tiwari

-- 
Thai