[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673708#comment-15673708
 ] 

Wenchen Fan commented on SPARK-18251:
-------------------------------------

This is a tricky problem.

First, using `Option[T]` as the type of a Dataset is not well supported. If you 
try `Seq(Some(1 -> "a"), None).toDS.collect`, it returns 
`Array(Some((1,a)), Some((0,null)))` instead of `Array(Some((1,a)), None)`. The 
reason is that we explicitly forbid null as a top-level non-flat object, but we 
forgot to handle a top-level None. We should create a ticket for it.
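
To make the round trip above concrete, here is a toy model in plain Scala. It 
is only a sketch of the observed behavior, not Spark's encoder code: the `Row` 
type and the `serialize`/`deserialize` helpers are hypothetical stand-ins, and 
the model simply assumes that a top-level None is written as empty columns and 
that reading a row back always produces a `Some`.

```scala
object FlattenedOptionModel {
  // Toy row: one slot per tuple field; null marks a missing value.
  // This is a stand-in for illustration, not Spark's internal row format.
  type Row = Array[Any]

  // Assumption: a top-level None has no row of its own, so every column is null.
  def serialize(obj: Option[(Int, String)]): Row = obj match {
    case Some((i, s)) => Array[Any](i, s)
    case None         => Array[Any](null, null)
  }

  // Assumption: the top-level object is never treated as null, so the reader
  // always wraps the result in Some; a primitive Int column falls back to 0.
  def deserialize(row: Row): Option[(Int, String)] = {
    val i = if (row(0) == null) 0 else row(0).asInstanceOf[Int]
    val s = row(1).asInstanceOf[String]
    Some((i, s))
  }

  def roundTrip(data: Seq[Option[(Int, String)]]): Seq[Option[(Int, String)]] =
    data.map(serialize).map(deserialize)
}
```

In this model, `FlattenedOptionModel.roundTrip(Seq(Some(1 -> "a"), None))` 
loses the None and yields a `Some` holding a zero id and a null string, which 
mirrors the result described above.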

In Spark 2.0, typed filter is not optimized well, so the example code first 
deserializes each row to an object, applies the map function, and serializes 
the mapped object back to a row; the filter operator then deserializes the row 
to an object again and applies the filter function. In Spark 2.1, typed filter 
is pushed down through `SerializeFromObject`, so the process is optimized to: 
deserialize the row to an object, apply the map function, apply the filter 
function, then serialize the object to a row. So this bug is kind of hidden in 
Spark 2.1 (the null check exception is thrown during serialization, which now 
happens after the filter).
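
The difference between the two orderings can be sketched in plain Scala. This 
is a toy model under stated assumptions, not Spark's planner: `Row`, 
`serialize`, `deserialize`, `mapFn` and `filterFn` are hypothetical names, and 
the user functions are assumed to be a map that may produce None followed by a 
filter that drops None.

```scala
object TypedFilterOrdering {
  final case class DataRow(id: Int, value: String)

  // Toy row format: a flat pair of columns (a stand-in, not Spark's row type).
  type Row = (Int, String)

  // Toy serializer with a null check: a top-level None cannot be written.
  def serialize(obj: Option[DataRow]): Row = obj match {
    case Some(DataRow(id, value)) => (id, value)
    case None =>
      throw new RuntimeException("Null value appeared in non-nullable field")
  }

  def deserialize(row: Row): Option[DataRow] = Some(DataRow(row._1, row._2))

  // Hypothetical user functions: the map turns odd ids into None,
  // and the filter keeps only defined values.
  val mapFn: Option[DataRow] => Option[DataRow] =
    opt => opt.filter(r => r.id % 2 == 0)
  val filterFn: Option[DataRow] => Boolean =
    opt => opt.isDefined

  // Spark 2.0 ordering: deserialize, map, serialize, then the filter operator
  // deserializes each row again before applying the filter function.
  def spark20Order(rows: Seq[Row]): Seq[Row] =
    rows.map(deserialize).map(mapFn).map(serialize)
      .filter(row => filterFn(deserialize(row)))

  // Spark 2.1 ordering: deserialize, map, filter, and only then serialize.
  def spark21Order(rows: Seq[Row]): Seq[Row] =
    rows.map(deserialize).map(mapFn).filter(filterFn).map(serialize)
}
```

With input rows `Seq((1, "a"), (2, "b"))`, `spark20Order` hits the null check 
while serializing the None produced by the map, whereas `spark21Order` filters 
that None out before the serializer ever sees it.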

We have two choices here:
1. Forbid top-level None, as we already do for null.
2. Treat `Option` like a normal product, i.e. a single-field struct.
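
As a rough illustration of the two choices, again in plain Scala rather than 
Spark internals, the sketch below contrasts a serializer that rejects a 
top-level None with one that wraps the value in a single nullable field. All 
names here (`serializeStrict`, `SingleFieldRow`, and so on) are hypothetical.

```scala
object TopLevelOptionChoices {
  final case class DataRow(id: Int, value: String)

  // Choice 1: reject a top-level None up front, the same way null is rejected.
  def serializeStrict(obj: Option[DataRow]): (Int, String) = obj match {
    case Some(r) => (r.id, r.value)
    case None =>
      throw new IllegalArgumentException("top-level None is not allowed")
  }

  // Choice 2: model the Option as a struct with one nullable field,
  // so None has a representation (the field is null) and survives a round trip.
  final case class SingleFieldRow(value: DataRow) // value may be null

  def serializeAsStruct(obj: Option[DataRow]): SingleFieldRow =
    SingleFieldRow(obj.orNull)

  def deserializeFromStruct(row: SingleFieldRow): Option[DataRow] =
    Option(row.value)
}
```

The first choice fails fast with a clear error; the second keeps None 
representable at the cost of an extra level of nesting in the schema.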

Any ideas? cc [~yhuai] [~lian cheng] [~marmbrus]

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18251
>                 URL: https://issues.apache.org/jira/browse/SPARK-18251
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.1
>         Environment: OS X
>            Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet holds an empty (None) 
> instance of an Option type whose value class has a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold None. If it does, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>       at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>       at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>       at org.apache.spark.scheduler.Task.run(Task.scala:86)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduced using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e


