Sorry for the late response. Here is what I am seeing...
Schema from parquet file.
d1.printSchema()
root
|-- task_id: string (nullable = true)
|-- task_name: string (nullable = true)
|-- some_histogram: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- freq: array (nullable = true)
| | |-- element: long (containsNull = true)
d2.printSchema() // Data created using a DataFrame and/or processed before writing to a parquet file.
root
|-- task_id: string (nullable = true)
|-- task_name: string (nullable = true)
|-- some_histogram: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: double (containsNull = false)
| |-- freq: array (nullable = true)
| | |-- element: long (containsNull = false)
d1.union(d2).printSchema()
Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
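Since the "unresolved operator 'Union" message does not say which field is at fault, a small diagnostic can help. The following is only a sketch: nullabilityDiffs is a hypothetical helper (not part of Spark) that walks two schemas position by position and reports where only the nullable / containsNull flags differ. It assumes both schemas have the same shape and field order.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not a Spark API): list the paths where two schemas
// differ in their nullability flags. Fields are compared by position.
def nullabilityDiffs(a: DataType, b: DataType, path: String = "root"): Seq[String] =
  (a, b) match {
    case (StructType(fa), StructType(fb)) if fa.length == fb.length =>
      fa.zip(fb).toSeq.flatMap { case (x, y) =>
        val here = s"$path.${x.name}"
        val self =
          if (x.nullable != y.nullable) Seq(s"$here: nullable ${x.nullable} vs ${y.nullable}")
          else Nil
        self ++ nullabilityDiffs(x.dataType, y.dataType, here)
      }
    case (ArrayType(ea, na), ArrayType(eb, nb)) =>
      val self = if (na != nb) Seq(s"$path: containsNull $na vs $nb") else Nil
      self ++ nullabilityDiffs(ea, eb, s"$path.element")
    case (MapType(ka, va, na), MapType(kb, vb, nb)) =>
      val self = if (na != nb) Seq(s"$path: valueContainsNull $na vs $nb") else Nil
      self ++ nullabilityDiffs(ka, kb, s"$path.key") ++ nullabilityDiffs(va, vb, s"$path.value")
    case _ => Nil // leaf types, or shapes that do not line up
  }
```

With the two frames above, it would be called as nullabilityDiffs(d1.schema, d2.schema).foreach(println).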
Please advise,
Muthu
On Thu, Oct 20, 2016 at 1:46 AM, Michael Armbrust
<mich...@databricks.com> wrote:
What is the issue you see when unioning?
On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar
<bablo...@gmail.com> wrote:
Hello Michael,
Thank you for looking into this query. In my case there seems
to be an issue when I union a parquet file read from disk
with another dataframe that I construct in-memory. The only
difference I see is the containsNull = true. In fact, I do
not see any errors with union on the simple schema of "col1
thru col4" above. The problem seems to exist only on the
"some_histogram" column, which contains the mixed containsNull
= true/false.
Let me know if this helps.
Thanks,
Muthu
On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust
<mich...@databricks.com> wrote:
Nullable is just a hint to the optimizer: nullable = false
tells it that a null value is impossible in this column,
so that it can avoid generating code for null checks.
When in doubt, we set nullable = true, since it is always
safer to check.
Why in particular are you trying to change the
nullability of the column?
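To illustrate the "when in doubt" rule, here is a small sketch of how the flag comes out for data built in memory. The column names follow the thread; the local session, the app name, and the Option variant are assumptions added for illustration. Scala primitives cannot hold null, so Spark can safely mark them non-nullable, whereas an Option may be None.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("nullable-demo").getOrCreate()
import spark.implicits._

// Int and Double are primitives, so null is impossible in these columns.
val primitives = Seq((1, 2.0d)).toDF("col1", "col3")

// Option[Int] may be None, so col1 has to be treated as nullable.
val optional = Seq((Option(1), 2.0d)).toDF("col1", "col3")

primitives.printSchema()
optional.printSchema()
```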
On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar
<bablo...@gmail.com> wrote:
Hello there,
I am trying to understand how and when a DataFrame
(or Dataset) sets nullable = true vs. false in a schema.
Here is what I observed from some sample code I tried...
scala> spark.createDataset(Seq((1, "a", 2.0d), (2,
"b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2",
"col3").withColumn("col4", lit("bla")).printSchema()
root
|-- col1: integer (nullable = false)
|-- col2: string (nullable = true)
|-- col3: double (nullable = false)
|-- col4: string (nullable = false)
scala> spark.createDataset(Seq((1, "a", 2.0d), (2,
"b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2",
"col3").withColumn("col4",
lit("bla")).write.parquet("/tmp/sample.parquet")
scala> spark.read.parquet("/tmp/sample.parquet").printSchema()
root
|-- col1: integer (nullable = true)
|-- col2: string (nullable = true)
|-- col3: double (nullable = true)
|-- col4: string (nullable = true)
Where this seems to get me into trouble is when I try
to union one data structure created in memory (for
which the containsNull flags shown below come out as
'false') with one read from a file, which starts out
with a schema like the one below...
|-- some_histogram: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- freq: array (nullable = true)
| | |-- element: long (containsNull = true)
Is there a way to convert this attribute from true to
false without running any mapping / udf on that column?
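One approach that avoids a per-column udf is to keep the rows and re-declare them under the schema you want via createDataFrame(rowRDD, schema). The following is only a sketch: withSchema is a hypothetical helper, the tiny two-column frame stands in for the real data, and it assumes the data really has no nulls wherever the target schema says so (Spark may not verify that for every nested element). It does round-trip through an RDD.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("schema-reapply").getOrCreate()
import spark.implicits._

// Hypothetical helper: keep the rows of `df` but declare them under `target`.
// Assumption: the data honours every non-null flag in `target`.
def withSchema(df: DataFrame, target: StructType): DataFrame =
  spark.createDataFrame(df.rdd, target)

// In-memory data: Seq[Double] yields an array with containsNull = false.
val inMemory = Seq(("t1", Seq(1.0, 2.0))).toDF("task_id", "values")

// Stand-in for the parquet-backed frame: same rows, every flag nullable.
val looseSchema = StructType(Seq(
  StructField("task_id", StringType, nullable = true),
  StructField("values", ArrayType(DoubleType, containsNull = true), nullable = true)))
val fromFile = withSchema(inMemory, looseSchema)

// Tighten the file-style frame to the in-memory schema, then union.
val tightened = withSchema(fromFile, inMemory.schema)
val combined = tightened.union(inMemory)
```

The same helper works in the other direction, and relaxing the in-memory frame to the file's all-nullable schema is the safer choice when it is not certain that the file data is free of nulls.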
Please advise,
Muthu