Peter Vary created HIVE-24504:
---------------------------------
Summary: VectorFileSinkArrowOperator does not serialize complex
types correctly
Key: HIVE-24504
URL: https://issues.apache.org/jira/browse/HIVE-24504
Project: Hive
Issue Type: Bug
Components: llap
Reporter: Peter Vary
Assignee: Peter Vary
When the table has complex types and the result has 0 records theĀ
VectorFileSinkArrowOperator only serializes the primitive types correctly. For
complex types only the main type is set which causes issues for clients trying
to read data.
Got the following HWC exception:
{code:java}
Previous exception in task: Unsupported data type: Null
org.apache.spark.sql.execution.arrow.ArrowUtils$.fromArrowType(ArrowUtils.scala:71)
org.apache.spark.sql.execution.arrow.ArrowUtils$.fromArrowField(ArrowUtils.scala:106)
org.apache.spark.sql.execution.arrow.ArrowUtils$.fromArrowField(ArrowUtils.scala:98)
org.apache.spark.sql.execution.arrow.ArrowUtils.fromArrowField(ArrowUtils.scala)
org.apache.spark.sql.vectorized.ArrowColumnVector.<init>(ArrowColumnVector.java:135)
com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReader.get(HiveWarehouseDataReader.java:105)
com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReader.get(HiveWarehouseDataReader.java:29)
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.next(DataSourceRDD.scala:59)
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.datasourcev2scan_nextBatch_0$(Unknown
Source)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
org.apache.spark.scheduler.Task.run(Task.scala:109)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
at
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:139)
at
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
at org.apache.spark.scheduler.Task.run(Task.scala:119)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745) {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)