Here's all I can find related to this idea. ParquetHiveSerDe is where the raw Parquet data is unpacked into readable POJOs. Everything starts with the root array ObjectInspector: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java#L130. The column names and types are also available in this class, in the initialize(..) method; they give you the column types that Hive thinks the Parquet table should have.
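For reference, here is a minimal sketch of what that declared schema looks like by the time it reaches initialize(..). It assumes the Hive 2.x serde2 API on the classpath; the class name DeclaredSchemaDemo and the hard-coded property values are made up for illustration, mirroring the table discussed further down this thread.

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

// Sketch: the declared table schema arrives in initialize(conf, tbl) as two
// table properties, "columns" and "columns.types".
public class DeclaredSchemaDemo {
  public static void main(String[] args) {
    Properties tbl = new Properties();
    // Mimic what Hive would pass for the table in this thread (illustrative values).
    tbl.setProperty(serdeConstants.LIST_COLUMNS,
        "action_date,user_id,tracking_type,game_number");
    tbl.setProperty(serdeConstants.LIST_COLUMN_TYPES,
        "string:bigint:string:bigint");

    List<String> names =
        Arrays.asList(tbl.getProperty(serdeConstants.LIST_COLUMNS).split(","));
    List<TypeInfo> types = TypeInfoUtils.getTypeInfosFromTypeString(
        tbl.getProperty(serdeConstants.LIST_COLUMN_TYPES));

    for (int i = 0; i < names.size(); i++) {
      // This is Hive's view of the table -- not necessarily what the files contain.
      System.out.println(names.get(i) + " -> " + types.get(i).getTypeName());
    }
  }
}

The mismatch you are hitting is exactly the gap between this declared view (string) and what the older files physically contain (int32 DATE).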
Now, when one of the Hive tasks unpacks the raw Parquet binary types into a POJO, the actual primitive ObjectInspector is used. For example, in your stack trace, with column type string, this line is involved: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ArrayWritableObjectInspector.java#L107. What you need to do here is create a wrapper around the StringObjectInspector that deals with the UnsupportedOperationException thrown when it tries to unpack a DateWritable. A simple if..else.. should be enough: use the conversion that matches the actual type in your files, then turn the value into a String (see the sketch after the DDL example below). After figuring out all of that logic, you'll have to create a new Hive SerDe that wraps ParquetHiveSerDe with the new logic and use it in your external table. For instance:

CREATE EXTERNAL TABLE your_table
...
ROW FORMAT SERDE
  'your.custom.WrappedParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '...';
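Here is a rough, untested sketch of that if..else.. conversion, just to make the idea concrete. StringCoercion and coerceToString are hypothetical names; the real code would live inside a string ObjectInspector wrapper whose other methods simply delegate to the inspector ParquetHiveSerDe normally hands out, and your WrappedParquetHiveSerDe would substitute that wrapper for string columns.

import org.apache.hadoop.hive.serde2.io.DateWritable;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

// Illustrative helper only: the conversion a wrapped string inspector would do
// for a column that the table declares as string.
public final class StringCoercion {

  private StringCoercion() {
  }

  public static String coerceToString(Object value, StringObjectInspector stockInspector) {
    if (value == null) {
      return null;
    }
    if (value instanceof DateWritable) {
      // Files written with int32 (DATE) columns surface here as DateWritable;
      // DateWritable.get() returns a java.sql.Date, whose toString() is yyyy-MM-dd.
      return ((DateWritable) value).get().toString();
    }
    // Files that already store binary (UTF8) go through the stock inspector.
    return stockInspector.getPrimitiveJavaObject(value);
  }
}

The ObjectInspector and SerDe method signatures have shifted a bit between Hive releases, so treat this strictly as a starting point and build against the exact Hive version on your cluster.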
On Thu, Aug 30, 2018 at 9:45 AM Anup Tiwari <anupsdtiw...@gmail.com> wrote:

> Hi Thai,
>
> Any links or examples for achieving this? I do not have much experience
> with this.
>
> On Thu, 30 Aug 2018 20:08 Thai Bui <blquyt...@gmail.com> wrote:
>
>> Another option is to implement a custom ParquetInputFormat that extends
>> the current Hive MR Parquet input format and handles schema coercion at
>> the input split/record reader level. This is more involved, but guaranteed
>> to work if you can add auxiliary jars to your Hive cluster.
>>
>> On Wed, Aug 29, 2018 at 8:06 AM Anup Tiwari <anupsdtiw...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> We have a use case where we have created a partitioned external table in
>>> Hive 2.3.3 pointing to a Parquet location with date-level folders; on some
>>> days the Parquet files were created by Hive 2.1.1 and on other days by AWS
>>> Glue. Now, when we try to read this data, we get the error below:
>>>
>>> Vertex failed, vertexName=Map 1, vertexId=vertex_1535191533874_0135_2_00,
>>> diagnostics=[Task failed, taskId=task_1535191533874_0135_2_00_000000,
>>> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task
>>> ( failure ) : attempt_1535191533874_0135_2_00_000000_0:java.lang.RuntimeException:
>>> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
>>> Hive Runtime Error while processing row [Error getting row data with exception
>>> java.lang.UnsupportedOperationException: Cannot inspect
>>> org.apache.hadoop.hive.serde2.io.DateWritable
>>> at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveJavaObject(ParquetStringInspector.java:77)
>>> at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:247)
>>> at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:366)
>>> at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:202)
>>> at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:188)
>>> at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:588)
>>> at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:554)
>>> at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:86)
>>> at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:70)
>>> at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
>>> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>>> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>>> at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>>> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>>> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>>> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>>> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>>> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> at java.lang.Thread.run(Thread.java:748)
>>> ]
>>> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>>> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>>> at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>>> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>>> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>>> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>>> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>>> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
>>> Hive Runtime Error while processing row [Error getting row data with exception
>>> java.lang.UnsupportedOperationException: Cannot inspect
>>> org.apache.hadoop.hive.serde2.io.DateWritable
>>> at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveJavaObject(ParquetStringInspector.java:77)
>>>
>>> After some drilling down, I looked at the column schemas inside both types
>>> of Parquet files using parquet-tools and found different data types for
>>> some columns. It seems that for the columns where the data type is the same
>>> in both kinds of files we are able to query successfully through the
>>> external table above, but where there is a difference we get this error, so
>>> I wanted your suggestion on how to get out of this.
>>>
>>> Reprocessing all the data with a single engine (Hive or Glue) would be very
>>> costly for us, as we have data for the last 2-3 years. The table below shows
>>> the data type of each column in both kinds of files, along with the data
>>> type of the same column in the external table; "Test result" indicates
>>> whether I was able to read that column or not.
>>>
>>> Parquet made with Hive 2.1.1          | Parquet made with AWS Glue            | Final Hive table (reads both) | Test result
>>> --------------------------------------|---------------------------------------|-------------------------------|------------
>>> optional int32 action_date (DATE)     | optional binary action_date (UTF8)    | action_date : string          | Fail
>>> optional int64 user_id                | optional int64 user_id                | user_id : bigint              | Pass
>>> optional binary tracking_type (UTF8)  | required binary tracking_type (UTF8)  | tracking_type : string        | Fail
>>> optional int32 game_number            | optional int64 game_number            | game_number : bigint          | Pass
>>>
>>> Regards,
>>> Anup Tiwari
>>
>> --
>> Thai