Hello there,

I am using Spark 2.0.0 with Scala to create a Parquet file from a text
file. The text file contains values that are mostly of type string and
long, and any of them can be null. To support nulls, all the values are
boxed in Option (e.g. Option[String], Option[Long]).
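
For illustration, a single converted record ends up looking roughly like
this ('host' is a real field from my schema; the other values are made up):

import org.apache.spark.sql.Row

// Hypothetical converted record: every cell is boxed in Option,
// with None standing in for a null in the source file.
val record: Row = Row(Some("host-01"), Some(42L), None)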
The schema for the Parquet file is based on an external metadata file, so
I use StructField to build the schema programmatically and run a code
snippet like the one below...

// Split each line on the field separator and box every column in Option
sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
  convertToRawColumns(line, schemaSeq)
}
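
For reference, a simplified version of the schema construction looks like
this ('host' is a real field; 'bytes' is only illustrative):

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Simplified stand-in for the schema derived from the metadata file;
// every field is nullable because any value in the file may be missing.
val schema = StructType(Seq(
  StructField("host", StringType, nullable = true),
  StructField("bytes", LongType, nullable = true)
))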

...

val df = sqlContext.createDataFrame(rawFileRDD, schema) //Failed line.

On a side note, the same code used to work fine with Spark 1.6.2.

Here is the error from Spark 2.0.0.

Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
Compression: SNAPPY
Jul 28, 2016 8:27:10 PM INFO:
org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size
Jul-28 20:27:10 [Executor task launch worker-483] ERROR Utils Aborting task
java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: scala.Some is not a valid external type for
schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row
object).isNullAt) null else staticinvoke(class
org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType), true) AS host#37315
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
row object).isNullAt) null else staticinvoke(class
org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType), true)
   :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row
object).isNullAt
   :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
row object)
   :  :  +- input[0, org.apache.spark.sql.Row, true]
   :  +- 0
   :- null
   +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String,
StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType), true)
      +- validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType)
         +- getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host)
            +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top
level row object)
               +- input[0, org.apache.spark.sql.Row, true]


Let me know if you would like me to try to create a more simplified
reproducer for this problem. Perhaps I should not be using Option[T] for
nullable schema values? A minimal sketch of what I mean is below.
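
Here is a rough sketch of the smallest case I can think of; on my side,
unwrapping the Options with .orNull before building the Row avoids the
error, assuming that is the intended usage in 2.0:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
val schema = StructType(Seq(StructField("host", StringType, nullable = true)))

// Fails on 2.0.0 with "scala.Some is not a valid external type for schema of string":
val boxed = spark.sparkContext.parallelize(Seq(Row(Some("h1")), Row(None)))
// spark.createDataFrame(boxed, schema).show()

// Works: store the unwrapped value (or null) instead of the Option.
val unboxed = spark.sparkContext.parallelize(
  Seq(Row(Option("h1").orNull), Row(Option.empty[String].orNull)))
spark.createDataFrame(unboxed, schema).show()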

Please advise,
Muthu
