Hi Thai,

Do you have any links or examples for achieving this? I do not have much
experience with this approach.

On Thu, 30 Aug 2018 20:08 Thai Bui, <blquyt...@gmail.com> wrote:

> Another option is to implement a custom ParquetInputFormat extending the
> current Hive MR Parquet format and handle schema coercion at the input
> split/record reader level. This would be more involved, but guaranteed to
> work if you can add auxiliary jars to your Hive cluster.
>
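> A minimal (untested) sketch of the idea in Java. The class name and the
> DATE-to-string coercion are illustrative assumptions for your action_date
> case; the Hive/Hadoop types and signatures are the real ones:
>
> import java.io.IOException;
> import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
> import org.apache.hadoop.hive.serde2.io.DateWritable;
> import org.apache.hadoop.io.ArrayWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.mapred.InputSplit;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.RecordReader;
> import org.apache.hadoop.mapred.Reporter;
>
> // Hypothetical input format: delegates to Hive's stock Parquet input
> // format, then coerces values whose writer-side type differs from the
> // table schema before Hive's object inspectors see them.
> public class SchemaCoercingParquetInputFormat extends MapredParquetInputFormat {
>
>   @Override
>   public RecordReader<NullWritable, ArrayWritable> getRecordReader(
>       final InputSplit split, final JobConf job, final Reporter reporter)
>       throws IOException {
>     final RecordReader<NullWritable, ArrayWritable> inner =
>         super.getRecordReader(split, job, reporter);
>     return new RecordReader<NullWritable, ArrayWritable>() {
>       @Override
>       public boolean next(NullWritable key, ArrayWritable value)
>           throws IOException {
>         if (!inner.next(key, value)) {
>           return false;
>         }
>         // Example coercion: DATE values written by Hive 2.1.1 become
>         // strings, so the table's STRING column can inspect them.
>         Writable[] cols = value.get();
>         for (int i = 0; i < cols.length; i++) {
>           if (cols[i] instanceof DateWritable) {
>             cols[i] = new Text(((DateWritable) cols[i]).get().toString());
>           }
>         }
>         return true;
>       }
>
>       @Override public NullWritable createKey() { return inner.createKey(); }
>       @Override public ArrayWritable createValue() { return inner.createValue(); }
>       @Override public long getPos() throws IOException { return inner.getPos(); }
>       @Override public float getProgress() throws IOException { return inner.getProgress(); }
>       @Override public void close() throws IOException { inner.close(); }
>     };
>   }
> }
>
> You would put the jar on hive.aux.jars.path and point the table at the
> new input format, e.g. with ALTER TABLE ... SET FILEFORMAT.
>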
> On Wed, Aug 29, 2018 at 8:06 AM Anup Tiwari <anupsdtiw...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> We have a use case where we created a partitioned external table in Hive
>> 2.3.3 pointing to a parquet location with date-level folders. On some days
>> the parquet files were created by Hive 2.1.1, and on other days by AWS
>> Glue. Now, when we try to read this data, we get the error below:
>>
>> Vertex failed, vertexName=Map 1, vertexId=vertex_1535191533874_0135_2_00,
>> diagnostics=[Task failed, taskId=task_1535191533874_0135_2_00_000000,
>> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task (
>> failure ) :
>> attempt_1535191533874_0135_2_00_000000_0:java.lang.RuntimeException:
>> java.lang.RuntimeException:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
>> processing row [Error getting row data with exception
>> java.lang.UnsupportedOperationException: Cannot inspect
>> org.apache.hadoop.hive.serde2.io.DateWritable
>>   at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveJavaObject(ParquetStringInspector.java:77)
>>   at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:247)
>>   at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:366)
>>   at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:202)
>>   at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:188)
>>   at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:588)
>>   at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:554)
>>   at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:86)
>>   at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:70)
>>   at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
>>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>>   at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:422)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>   at java.lang.Thread.run(Thread.java:748)
>> ]
>>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>>   at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:422)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>   at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.RuntimeException:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
>> processing row [Error getting row data with exception
>> java.lang.UnsupportedOperationException: Cannot inspect
>> org.apache.hadoop.hive.serde2.io.DateWritable
>>   at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveJavaObject(ParquetStringInspector.java:77)
>>
>>
>> After drilling down, I inspected the column schemas inside both types of
>> parquet file using parquet-tools and found different data types for some
>> columns. For columns whose data type is the same in both files, we can
>> query them successfully through the external table above; where the types
>> differ, we get the error. Could you suggest how to resolve this?
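>>
>> For reference, the comparison was done with the parquet-tools CLI, along
>> these lines (the paths are placeholders):
>>
>> parquet-tools schema /path/to/hive-2.1.1-day/part-00000.parquet
>> parquet-tools schema /path/to/glue-day/part-00000.parquet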
>>
>> Reprocessing all of the data with a single engine (Hive or Glue) would be
>> very costly for us, as we have data for the last 2-3 years. The table
>> below lists, for a few columns, the data type in each kind of parquet
>> file and in the external table; "Test Result" indicates whether I was
>> able to read that column.
>>
>> Parquet made with Hive 2.1.1          | Parquet made with AWS Glue           | Final Hive table (reads both) | Test Result
>> --------------------------------------+--------------------------------------+-------------------------------+------------
>> optional int32 action_date (DATE)     | optional binary action_date (UTF8)   | action_date : string          | Fail
>> optional int64 user_id                | optional int64 user_id               | user_id : bigint              | Pass
>> optional binary tracking_type (UTF8)  | required binary tracking_type (UTF8) | tracking_type : string        | Fail
>> optional int32 game_number            | optional int64 game_number           | game_number : bigint          | Pass
>>
>> Regards,
>> Anup Tiwari
>>
> --
> Thai
>
