Hello Dong Meng,

Thanks for the tip. But I do have code in place that looks like this...

StructField(columnName, getSparkDataType(dataType), nullable = true)

Maybe I am missing something else? The same code works fine with Spark 1.6.2 though.

On a side note, I could be using SparkSession, but I don't know how to split and map the row elegantly. Hence I am using it as an RDD.
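In case it helps, here is a trimmed-down, self-contained version of the flow. The column names 'host' and 'value', the input path, and the separator are made up for illustration; the real schema is driven by the external metadata file:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("option-schema-repro").getOrCreate()
val sc = spark.sparkContext

val flatFile = "/path/to/flat-file.txt" // placeholder path
val fileSeperator = "\t"                // placeholder separator

// Schema assembled programmatically, the way the metadata file drives it.
val schema = StructType(Seq(
  StructField("host", StringType, nullable = true),
  StructField("value", LongType, nullable = true)))

// Every field is boxed with Option so that a missing value can become null.
val rawFileRDD = sc.textFile(flatFile).map(_.split(fileSeperator)).map { parts =>
  Row(Option(parts(0)), scala.util.Try(parts(1).toLong).toOption)
}

// Works on 1.6.2; fails for me on 2.0.0 with
// "scala.Some is not a valid external type for schema of string".
val df = spark.createDataFrame(rawFileRDD, schema)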
Thanks,
Muthu

On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng <mengdong0...@gmail.com> wrote:

> you can specify nullable in StructField
>
> On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:
>
>> Hello there,
>>
>> I am using Spark 2.0.0 to create a parquet file from a text file using
>> Scala. I am trying to read a text file with a bunch of values of type
>> string and long (mostly), and any of them can be null. In order to
>> support nulls, all the values are boxed with Option (e.g.,
>> Option[String], Option[Long]).
>> The schema for the parquet file is based on an external metadata file,
>> so I use 'StructField' to create the schema programmatically and run a
>> code snippet like the one below...
>>
>> sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
>>   convertToRawColumns(line, schemaSeq)
>> }
>>
>> ...
>>
>> val df = sqlContext.createDataFrame(rawFileRDD, schema) //Failed line.
>>
>> On a side note, the same code used to work fine with Spark 1.6.2.
>>
>> Here is the error from Spark 2.0.0:
>>
>> Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
>> Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size
>> Jul-28 20:27:10 [Executor task launch worker-483] ERROR Utils Aborting task
>> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Some is not a valid external type for schema of string
>> if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true) AS host#37315
>> +- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true)
>>    :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
>>    :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
>>    :  :  +- input[0, org.apache.spark.sql.Row, true]
>>    :  +- 0
>>    :- null
>>    +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true)
>>       +- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType)
>>          +- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host)
>>             +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
>>                +- input[0, org.apache.spark.sql.Row, true]
>>
>> Let me know if you would like me to create a simpler reproducer for
>> this problem.
>>
>> Perhaps I should not be using Option[T] for nullable schema values?
>>
>> Please advise,
>> Muthu
>
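P.S.: Reading the error again ("scala.Some is not a valid external type for schema of string"), I wonder whether 2.0 now expects each Row field to hold the raw value or null rather than an Option. An untested sketch of what I mean, reusing the made-up columns from my snippet above:

val rowRDD = sc.textFile(flatFile).map(_.split(fileSeperator)).map { parts =>
  val host  = Option(parts(0))                          // Option[String]
  val value = scala.util.Try(parts(1).toLong).toOption  // Option[Long]
  // Unwrap before building the Row: the raw value when present, null when absent.
  Row(host.orNull, value.map(Long.box).orNull)
}
val df = spark.createDataFrame(rowRDD, schema)

Does that sound like the intended usage in 2.0?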