Hi, I have a question related to Spark-Avro; not sure if this is the best
place to ask.
I have the following Scala case classes, populated with data in a Spark
application, which I am trying to save in Avro format in HDFS:
case class Claim ( ......)
case class Coupon ( account_id: Long ........ claims: List[Claim])
As shown above, the Coupon case class contains a List of Claim. The RDD holds
an iterator of Coupon records, which I save to HDFS. I am using Spark 1.3.1
with Spark-Avro 1.0.0 (which matches Spark 1.3.x):
rdd.toDF.save("hdfs_location", "com.databricks.spark.avro")
I have no problem saving the data this way, but I cannot use the resulting
Avro data in Hive.
Here is the schema generated by Spark-Avro for the above data:
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "account_id",
      "type": "long"
    },
    ........
    {
      "name": "claims",
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "claims",
              "fields": [
                ......
The claims field is generated as a union containing an array, instead of an
array of structs directly. To put it more clearly, here is the schema Hive
reports when pointed at the data generated by Spark-Avro:

desc table
OK
col_name    data_type                                                       comment
account_id  bigint                                                          from deserializer
.......
claims      uniontype<array<uniontype<struct<account_id:bigint,.......>>>>  from deserializer

Obviously, this causes trouble for Hive queries over this data (at least in
Hive 0.12, which we currently use), so end users cannot write queries like
"select claims[0].account_id from table".
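What I expected is the array declared directly, with no wrapping union, roughly like this (field lists abbreviated; this is my guess at what a Hive-friendly schema for the claims field would look like, not anything Spark-Avro actually produced):

```json
{
  "name": "claims",
  "type": {
    "type": "array",
    "items": {
      "type": "record",
      "name": "claims",
      "fields": [
        {"name": "account_id", "type": "long"}
      ]
    }
  }
}
```

With that shape, Hive would report the column as a plain array<struct<account_id:bigint,...>>, which "select claims[0].account_id" can query directly.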
I wonder why Spark-Avro has to wrap a union around the array in this case,
instead of just building "array<struct>"? Or better, is there a way to
control the Avro schema that Spark-Avro generates here?

Thanks,
Yong
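P.S. To make the question concrete, here is a rough sketch of the kind of schema rewrite I have in mind, written as a standalone post-processing step over the generated .avsc JSON. This is hypothetical (not a spark-avro feature), and the `unwrap_unions` helper and the sample schema are my own illustration:

```python
import json

def unwrap_unions(node):
    """Strip "null" branches from unions in type positions and unwrap
    unions left with a single branch, so a field like claims becomes a
    plain array of records instead of union<array<union<record>>>."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key in ("type", "items") and isinstance(value, list):
                # A union in a type position: drop "null", recurse, unwrap.
                branches = [unwrap_unions(b) for b in value if b != "null"]
                out[key] = branches[0] if len(branches) == 1 else branches
            else:
                out[key] = unwrap_unions(value)
        return out
    if isinstance(node, list):
        return [unwrap_unions(item) for item in node]
    return node

# Sample field schema shaped like the spark-avro output above (abbreviated).
generated = json.loads("""
{
  "name": "claims",
  "type": ["null", {
    "type": "array",
    "items": ["null", {
      "type": "record",
      "name": "claims",
      "fields": [{"name": "account_id", "type": "long"}]
    }]
  }]
}
""")

fixed = unwrap_unions(generated)
print(json.dumps(fixed, indent=2))
```

After the rewrite, "type" maps straight to the array, and "items" straight to the record, which is the shape Hive's AvroSerDe turns into array<struct<...>>.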