Hi,

We have a PR to support fixed-length byte arrays in Parquet files:
https://github.com/apache/spark/pull/1737

Can someone help verify it? Thanks. (Two illustrative sketches, a hypothetical reproduction and a possible type mapping, are appended after the quoted thread below.)

2014-07-15 19:23 GMT+08:00 Pei-Lun Lee <pl...@appier.com>:

> Sorry, should be SPARK-2489.
>
>
> 2014-07-15 19:22 GMT+08:00 Pei-Lun Lee <pl...@appier.com>:
>
>> Filed SPARK-2446.
>>
>>
>> 2014-07-15 16:17 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
>>
>>> Oh, maybe not. Please file another JIRA.
>>>
>>>
>>> On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee <pl...@appier.com> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> Good to know it is being handled. I tried the master branch (9fe693b5)
>>>> and got another error:
>>>>
>>>> scala> sqlContext.parquetFile("/tmp/foo")
>>>> java.lang.RuntimeException: Unsupported parquet datatype optional fixed_len_byte_array(4) b
>>>>   at scala.sys.package$.error(package.scala:27)
>>>>   at org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>>>>   at org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>>>>   at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>>>>   at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
>>>>   ...
>>>>
>>>> The Avro schema I used is something like:
>>>>
>>>> protocol Test {
>>>>   fixed Bytes4(4);
>>>>
>>>>   record User {
>>>>     string name;
>>>>     int age;
>>>>     union {null, int} i;
>>>>     union {null, int} j;
>>>>     union {null, Bytes4} b;
>>>>     union {null, bytes} c;
>>>>     union {null, int} d;
>>>>   }
>>>> }
>>>>
>>>> Is this case included in SPARK-2446
>>>> <https://issues.apache.org/jira/browse/SPARK-2446>?
>>>>
>>>>
>>>> 2014-07-15 3:54 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
>>>>
>>>>> This is not supported yet, but there is a PR open to fix it:
>>>>> https://issues.apache.org/jira/browse/SPARK-2446
>>>>>
>>>>>
>>>>> On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee <pl...@appier.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am using spark-sql 1.0.1 to load Parquet files generated using the
>>>>>> method described in:
>>>>>>
>>>>>> https://gist.github.com/massie/7224868
>>>>>>
>>>>>> When I try to submit a select query on columns of type fixed-length
>>>>>> byte array, the following error pops up:
>>>>>>
>>>>>> 14/07/14 11:09:14 INFO scheduler.DAGScheduler: Failed to run take at basicOperators.scala:100
>>>>>> org.apache.spark.SparkDriverExecutionException: Execution error
>>>>>>   at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:581)
>>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:559)
>>>>>> Caused by: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3n://foo/bar/part-r-00000.snappy.parquet
>>>>>>   at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
>>>>>>   at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
>>>>>>   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
>>>>>>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>>>>>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>>>>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>>>>>>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>>>>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
>>>>>>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>>>>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>>>>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>>>>>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>>>>>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>>>>>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>>>>>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>>>>>   at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>>>>>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>>>>>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>>>>>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>>>>>   at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:989)
>>>>>>   at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:989)
>>>>>>   at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
>>>>>>   at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:574)
>>>>>>   ... 1 more
>>>>>> Caused by: java.lang.ClassCastException: Expected instance of primitive converter but got "org.apache.spark.sql.parquet.CatalystNativeArrayConverter"
>>>>>>   at parquet.io.api.Converter.asPrimitiveConverter(Converter.java:30)
>>>>>>   at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:264)
>>>>>>   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
>>>>>>   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
>>>>>>   at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
>>>>>>   at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
>>>>>>   ... 24 more
>>>>>>
>>>>>> Is fixed-length byte array supposed to work in this version? I
>>>>>> noticed that other array types like int or string already work.
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> Pei-Lun
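
For anyone trying to reproduce the failures quoted above, here is a minimal sketch, assuming a Spark 1.0.x spark-shell (so sc is already defined), parquet-avro's AvroParquetWriter as in the linked gist, and a local path /tmp/foo. The schema, field names, and values are invented for illustration; this is not the exact code from the gist or from the thread.

// Hypothetical reproduction sketch (Scala, spark-shell): write one Parquet
// record whose "b" column is an Avro fixed(4), then read it back with
// Spark SQL 1.0.x, which is where the errors quoted above show up.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter

// Avro schema with a nullable fixed(4) field, mirroring the "Bytes4" field
// in the schema quoted in the thread (names and layout are assumptions).
val schemaJson = """
{"type": "record", "name": "User", "fields": [
  {"name": "name", "type": "string"},
  {"name": "b", "type": ["null", {"type": "fixed", "name": "Bytes4", "size": 4}]}
]}
"""
val schema = new Schema.Parser().parse(schemaJson)
val fixedSchema = Schema.createFixed("Bytes4", null, null, 4)

// Write a single record to /tmp/foo (assumed path).
val writer = new AvroParquetWriter[GenericRecord](new Path("/tmp/foo"), schema)
val record = new GenericData.Record(schema)
record.put("name", "alice")
record.put("b", new GenericData.Fixed(fixedSchema, Array[Byte](1, 2, 3, 4)))
writer.write(record)
writer.close()

// Read it back; on 1.0.1 the fixed_len_byte_array column triggers the
// exceptions shown above instead of returning rows.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val users = sqlContext.parquetFile("/tmp/foo")
users.registerAsTable("users")
sqlContext.sql("SELECT b FROM users").collect()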
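
The patch in https://github.com/apache/spark/pull/1737 is not quoted in this thread, so the following is only an illustration of the kind of change the "Unsupported parquet datatype" error points at: a Parquet-to-Catalyst type conversion that also accepts FIXED_LEN_BYTE_ARRAY, mapped here to Spark SQL's BinaryType. The method name echoes the toPrimitiveDataType frame in the stack trace above, but the signature and the exact mapping are simplified assumptions, not the actual Spark code.

// Illustrative sketch only, not the real ParquetTypesConverter and not the
// patch under review: map Parquet primitive type names to Catalyst types,
// handling FIXED_LEN_BYTE_ARRAY like raw binary data.
import parquet.schema.PrimitiveType.PrimitiveTypeName
import parquet.schema.PrimitiveType.PrimitiveTypeName._
import org.apache.spark.sql.catalyst.types._

def toPrimitiveDataType(parquetType: PrimitiveTypeName): DataType =
  parquetType match {
    case BOOLEAN              => BooleanType
    case DOUBLE               => DoubleType
    case FLOAT                => FloatType
    case INT32                => IntegerType
    case INT64                => LongType
    case BINARY               => StringType   // simplified; real code also looks at annotations
    case FIXED_LEN_BYTE_ARRAY => BinaryType   // the case the 1.0.x converter is missing
    case other                => sys.error(s"Unsupported parquet datatype $other")
  }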