Hello Dong Meng,

Thanks for the tip. But I do have code in place that looks like this...

StructField(columnName, getSparkDataType(dataType), nullable = true)

Maybe I am missing something else? The same code works fine with Spark 1.6.2 though.

On a side note, I could be using SparkSession, but I don't know how to split and map the row elegantly. Hence I am using it as an RDD.
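In case it helps, here is a trimmed-down, self-contained version of the flow. The column names 'host' and 'value', the input path, and the separator are made up for illustration; the real schema is driven by the external metadata file:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("option-schema-repro").getOrCreate()
val sc = spark.sparkContext

val flatFile = "/path/to/flat-file.txt" // placeholder path
val fileSeperator = "\t"                // placeholder separator

// Schema assembled programmatically, the way the metadata file drives it.
val schema = StructType(Seq(
  StructField("host", StringType, nullable = true),
  StructField("value", LongType, nullable = true)))

// Every field is boxed with Option so that a missing value can become null.
val rawFileRDD = sc.textFile(flatFile).map(_.split(fileSeperator)).map { parts =>
  Row(Option(parts(0)), scala.util.Try(parts(1).toLong).toOption)
}

// Works on 1.6.2; fails for me on 2.0.0 with
// "scala.Some is not a valid external type for schema of string".
val df = spark.createDataFrame(rawFileRDD, schema)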
Thanks,
Muthu

On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng <mengdong0...@gmail.com> wrote:

> you can specify nullable in StructField
>
> On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:
>
>> Hello there,
>>
>> I am using Spark 2.0.0 to create a parquet file from a text file using
>> Scala. I am trying to read a text file with a bunch of values of type
>> string and long (mostly), and any of them can be null. In order to
>> support nulls, all the values are boxed with Option (e.g.,
>> Option[String], Option[Long]).
>> The schema for the parquet file is based on an external metadata file,
>> so I use 'StructField' to create the schema programmatically and run a
>> code snippet like the one below...
>>
>> sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
>>   convertToRawColumns(line, schemaSeq)
>> }
>>
>> ...
>>
>> val df = sqlContext.createDataFrame(rawFileRDD, schema) //Failed line.
>>
>> On a side note, the same code used to work fine with Spark 1.6.2.
>>
>> Here is the error from Spark 2.0.0:
>>
>> Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
>> Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size
>> Jul-28 20:27:10 [Executor task launch worker-483] ERROR Utils Aborting task
>> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Some is not a valid external type for schema of string
>> if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true) AS host#37315
>> +- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true)
>>    :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
>>    :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
>>    :  :  +- input[0, org.apache.spark.sql.Row, true]
>>    :  +- 0
>>    :- null
>>    +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true)
>>       +- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType)
>>          +- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host)
>>             +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
>>                +- input[0, org.apache.spark.sql.Row, true]
>>
>> Let me know if you would like me to create a simpler reproducer for
>> this problem.
>>
>> Perhaps I should not be using Option[T] for nullable schema values?
>>
>> Please advise,
>> Muthu
>
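P.S.: Reading the error again ("scala.Some is not a valid external type for schema of string"), I wonder whether 2.0 now expects each Row field to hold the raw value or null rather than an Option. An untested sketch of what I mean, reusing the made-up columns from my snippet above:

val rowRDD = sc.textFile(flatFile).map(_.split(fileSeperator)).map { parts =>
  val host  = Option(parts(0))                          // Option[String]
  val value = scala.util.Try(parts(1).toLong).toOption  // Option[Long]
  // Unwrap before building the Row: the raw value when present, null when absent.
  Row(host.orNull, value.map(Long.box).orNull)
}
val df = spark.createDataFrame(rowRDD, schema)

Does that sound like the intended usage in 2.0?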