I'm pretty sure this is a Dataset bug

Tim Gautier Fri, 27 May 2016 08:25:43 -0700

Unfortunately I can't show exactly the data I'm using, but this is what I'm
seeing:


I have a case class 'Product' that represents a table in our database. I
load that data via sqlContext.read.format("jdbc").options(...).load.as[Product]
and register it in a temp table 'product'.

For testing, I created a new Dataset that has only 3 records in it:

val ts = sqlContext.sql("select * from product where product_catalog_id in
(1, 2, 3)").as[Product]

I also created another one using the same case class and data, but from a
sequence instead.

val ds: Dataset[Product] = Seq(
      Product(Some(1), ...),
      Product(Some(2), ...),
      Product(Some(3), ...)
    ).toDS

The spark shell tells me these are exactly the same type at this point, but
they don't behave the same.

ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
$"ts2.product_catalog_id")
ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
$"ds2.product_catalog_id")

Again, spark tells me these self joins return exactly the same type, but
when I do a .show on them, only the one created from a Seq works. The one
created by reading from the database throws this error:

org.apache.spark.sql.AnalysisException: cannot resolve
'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
...];

Is this a bug? Is there anyway to make the Dataset loaded from a table
behave like the one created from a sequence?

Thanks,
Tim

I'm pretty sure this is a Dataset bug

Reply via email to