Ok, great. Well, I haven't provided a good example of what I'm doing. Let's assume that my case class is case class A (tons of fields, with subclasses):
val df = sqlContext.sql("select * from a").as[A]
val df2 = spark.emptyDataset[A]
df.union(df2)

This code will throw the exception. Is this expected? I assume that when I
do as[A] it will convert the schema to the case class schema, so it
shouldn't throw the exception, or will this be done lazily when the union
is processed?

2017-05-08 17:50 GMT-03:00 Burak Yavuz <brk...@gmail.com>:

> Yes, unfortunately. This should actually be fixed, and the union's schema
> should have the less restrictive of the DataFrames.
>
> On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hi Burak,
>> By nullability you mean that if I have exactly the same schema, but one
>> side supports null and the other doesn't, this exception (in union of
>> datasets) will be thrown?
>>
>> 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
>>
>>> I also want to add that generally these may be caused by the
>>> `nullability` field in the schema.
>>>
>>> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
>>>> This is because RDD.union doesn't check the schema, so you won't see
>>>> the problem unless you run the RDD and hit the incompatible column
>>>> problem. For RDD, you may not see any error if you don't use the
>>>> incompatible column.
>>>>
>>>> Dataset.union requires a compatible schema. You can print ds.schema
>>>> and ds1.schema and check if they are the same.
>>>>
>>>> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho <
>>>> dirceu.semigh...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>> I have a very complex case class structure, with a lot of fields.
>>>>> When I try to union two datasets of this class, it fails with the
>>>>> following error:
>>>>>
>>>>> ds.union(ds1)
>>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>>> Union can only be performed on tables with the compatible column types
>>>>>
>>>>> But when I use its RDD, the union works:
>>>>> ds.rdd.union(ds1.rdd)
>>>>> res8: org.apache.spark.rdd.RDD[
>>>>>
>>>>> Is there any reason for this to happen (besides a bug ;) )
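
For reference, here is a minimal sketch of the schema check Shixiong
suggests, plus one workaround that has helped in practice: re-encoding both
sides through the same encoder so the union sees identical schemas. The
small case class A below is a hypothetical stand-in for the real, much
larger one, and the table name `a` is assumed to exist:

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the real, much larger case class.
case class A(id: Long, name: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A dataset read from a table may come back with every column nullable,
// while emptyDataset keeps the encoder's nullability (Long is non-nullable).
val ds  = spark.sql("select * from a").as[A] // assumes a table named `a` exists
val ds2 = spark.emptyDataset[A]

// Compare the schemas, including nullability, before the union.
ds.printSchema()
ds2.printSchema()
println(ds.schema == ds2.schema)

// If only nullability differs, forcing both sides through the encoder
// (deserialize + re-serialize) normalizes the schemas to the encoder's.
val unioned = ds.map(identity).union(ds2.map(identity))

Whether the map(identity) step is needed depends on the Spark version; as
Burak says above, the union's schema should eventually just take the less
restrictive of the two sides.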