Thanks for the response. What do you mean by "semantically" the same?
They're both Datasets of the same type, which is a case class, so I would
expect compile-time integrity of the data. Is there a situation where this
wouldn't be the case?

Interestingly enough, if I instead create an empty RDD of the same case
class type with sparkContext.emptyRDD, it works!

So something like:
var data = spark.sparkContext.emptyRDD[SomeData]

// loop
  data = data.union(someCode.thatReturnsADataset().rdd)
// end loop

data.toDS //so I can union it to the actual Dataset I have elsewhere

On Thu, Oct 20, 2016 at 8:34 PM Agraj Mangal <agraj....@gmail.com> wrote:

I believe this normally happens when Spark is unable to perform the union
because of a "difference" in the schemas of the operands. Can you check
whether the schemas of both datasets are semantically the same?
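
A minimal sketch of how such a schema check could be done, assuming Spark 2.x in local mode. The fields of SomeData, the object name SchemaCheck, and the helper schemaDiff are hypothetical and only for illustration; the Seq(...).toDS() call stands in for someCode.thatReturnsADataset, whose real implementation is not shown in the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Hypothetical stand-in: the real fields of SomeData are elided in the thread.
case class SomeData(name: String, amount: BigDecimal)

object SchemaCheck {
  // Compare two schemas column by column (by position) and describe each difference.
  // StructField equality covers name, data type, nullability, and metadata.
  def schemaDiff(left: StructType, right: StructType): Seq[String] =
    left.fields.zipAll(right.fields, null, null).zipWithIndex.collect {
      case ((l, r), i) if l != r => s"column $i: $l vs $r"
    }.toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for illustration
      .appName("schema-check")
      .getOrCreate()
    import spark.implicits._

    val empty = spark.emptyDataset[SomeData]
    // Assumption: stand-in for someCode.thatReturnsADataset
    val loaded = Seq(SomeData("a", BigDecimal("1.50"))).toDS()

    // Print both schemas, then list any column-level differences.
    empty.printSchema()
    loaded.printSchema()
    schemaDiff(empty.schema, loaded.schema).foreach(println)

    spark.stop()
  }
}
```

If schemaDiff reports differences in data type or nullability, that would be a place to start looking, although the thread itself does not confirm the cause.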

On Tue, Oct 18, 2016 at 9:06 AM, Efe Selcuk <efema...@gmail.com> wrote:

Bump!

On Thu, Oct 13, 2016 at 8:25 PM Efe Selcuk <efema...@gmail.com> wrote:

I have a use case where I want to build a dataset based off of
conditionally available data. I thought I'd do something like this:

case class SomeData( ... ) // parameters are basic encodable types like
                           // strings and BigDecimals

var data = spark.emptyDataset[SomeData]

// loop, determining what data to ingest and process into datasets
  data = data.union(someCode.thatReturnsADataset)
// end loop

However I get a runtime exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
        at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
        at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
        at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)

Granted, I'm new to Spark, so this might be an anti-pattern, and I'm open
to suggestions. However, it doesn't seem like I'm doing anything incorrect
here: the types are correct. Searching for this error online returns
results that seem to be about working with DataFrames that have mismatched
schemas or a different order of fields, and it seems that bugfixes have
already gone in for those cases.
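
One possible alternative to the mutable var loop, offered as a sketch rather than a confirmed fix for the exception: collect the conditionally available datasets into a Seq and union them pairwise, falling back to an empty dataset only when nothing was collected. The fields of SomeData, the object name UnionAll, and the helper unionAll are hypothetical; Spark 2.x in local mode is assumed.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical stand-in: the real fields of SomeData are elided in the thread.
case class SomeData(name: String, amount: BigDecimal)

object UnionAll {
  // Union all parts pairwise. The empty dataset is created only when
  // there are no parts, so it is never itself an operand of a union.
  def unionAll(spark: SparkSession, parts: Seq[Dataset[SomeData]]): Dataset[SomeData] = {
    import spark.implicits._
    parts.reduceOption(_ union _).getOrElse(spark.emptyDataset[SomeData])
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for illustration
      .appName("union-all")
      .getOrCreate()
    import spark.implicits._

    // Assumption: stand-ins for the datasets produced inside the loop.
    val parts = Seq(
      Seq(SomeData("a", BigDecimal("1.50"))).toDS(),
      Seq(SomeData("b", BigDecimal("2.25"))).toDS()
    )

    unionAll(spark, parts).show()
    spark.stop()
  }
}
```

This keeps the result a Dataset[SomeData] throughout and avoids reassigning a var on every iteration.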

Thanks in advance.
Efe




-- 
Thanks & Regards,
Agraj Mangal
