Sorry for the late response. Here is what I am seeing...
Schema from parquet file.

d1.printSchema()
root
 |-- task_id: string (nullable = true)
 |-- task_name: string (nullable = true)
 |-- some_histogram: struct (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- freq: array (nullable = true)
 |    |    |-- element: long (containsNull = true)

d2.printSchema() //Data created using dataframe and/or processed before writing to parquet file.
root
 |-- task_id: string (nullable = true)
 |-- task_name: string (nullable = true)
 |-- some_histogram: struct (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = false)
 |    |-- freq: array (nullable = true)
 |    |    |-- element: long (containsNull = false)

d1.union(d2).printSchema()
Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
    at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
    at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)

Please advise,
Muthu

On Thu, Oct 20, 2016 at 1:46 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> What is the issue you see when unioning?
>
> On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:
>
>> Hello Michael,
>>
>> Thank you for looking into this query. In my case there seems to be an
>> issue when I union a parquet file read from disk with another dataframe
>> that I construct in-memory. The only difference I see is the
>> containsNull = true. In fact, I do not see any errors with union on the
>> simple schema of "col1 thru col4" above. The problem seems to exist only
>> on the "some_histogram" column, which contains the mixed
>> containsNull = true/false.
>> Let me know if this helps.
>>
>> Thanks,
>> Muthu
>>
>> On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> Nullable is just a hint to the optimizer that it's impossible for there
>>> to be a null value in this column, so that it can avoid generating code
>>> for null-checks. When in doubt, we set nullable=true since it is always
>>> safer to check.
>>>
>>> Why in particular are you trying to change the nullability of the column?
>>>
>>> On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:
>>>
>>>> Hello there,
>>>>
>>>> I am trying to understand how and when DataFrame (or Dataset) sets
>>>> nullable = true vs false on a schema.
>>>>
>>>> Here is my observation from some sample code I tried...
>>>>
>>>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4", lit("bla")).printSchema()
>>>> root
>>>>  |-- col1: integer (nullable = false)
>>>>  |-- col2: string (nullable = true)
>>>>  |-- col3: double (nullable = false)
>>>>  |-- col4: string (nullable = false)
>>>>
>>>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4", lit("bla")).write.parquet("/tmp/sample.parquet")
>>>>
>>>> scala> spark.read.parquet("/tmp/sample.parquet").printSchema()
>>>> root
>>>>  |-- col1: integer (nullable = true)
>>>>  |-- col2: string (nullable = true)
>>>>  |-- col3: double (nullable = true)
>>>>  |-- col4: string (nullable = true)
>>>>
>>>> The place where this seems to get me into trouble is when I try to
>>>> union one data-structure from in-memory (notice that in the schema
>>>> below the highlighted element is represented as 'false' for the
>>>> in-memory created schema) and one from file that starts out with a
>>>> schema like below...
>>>>
>>>>  |-- some_histogram: struct (nullable = true)
>>>>  |    |-- values: array (nullable = true)
>>>>  |    |    |-- element: double (containsNull = true)
>>>>  |    |-- freq: array (nullable = true)
>>>>  |    |    |-- element: long (containsNull = true)
>>>>
>>>> Is there a way to convert this attribute from true to false without
>>>> running any mapping / udf on that column?
>>>>
>>>> Please advise,
>>>> Muthu
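
One possible workaround for the mismatch described in this thread, sketched here as an assumption rather than a confirmed fix: instead of tightening containsNull from true to false, relax both sides to nullable = true (the direction described above as always safer) and rebuild each DataFrame against the relaxed schema before the union. The object name NullableUnion and the helpers asNullable and relax are hypothetical names introduced for illustration; StructType, ArrayType, MapType and SparkSession.createDataFrame(RDD[Row], StructType) are existing Spark APIs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper object; not part of Spark.
object NullableUnion {

  // Recursively mark every struct field, array element and map value
  // as nullable, leaving names and leaf types untouched.
  def asNullable(dt: DataType): DataType = dt match {
    case StructType(fields) =>
      StructType(fields.map { f =>
        f.copy(dataType = asNullable(f.dataType), nullable = true)
      })
    case ArrayType(elementType, _) =>
      ArrayType(asNullable(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(asNullable(keyType), asNullable(valueType), valueContainsNull = true)
    case other => other
  }

  // Rebuild the DataFrame with the relaxed schema. This goes through
  // df.rdd, so it is not free: rows are converted out of and back into
  // Spark's internal format.
  def relax(spark: SparkSession, df: DataFrame): DataFrame =
    spark.createDataFrame(df.rdd, asNullable(df.schema).asInstanceOf[StructType])
}
```

With the d1 and d2 from this thread, the intended usage would be NullableUnion.relax(spark, d1).union(NullableUnion.relax(spark, d2)). No UDF or per-column mapping is involved, but the round trip through the RDD has a cost, so whether this is acceptable depends on the data size.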