Ok, great.
Well, I haven't provided a good example of what I'm doing. Let's assume that
my case class is:
case class A(tons of fields, with sub classes)

val df = sqlContext.sql("select * from a").as[A]

val df2 = spark.emptyDataset[A]

df.union(df2)

This code will throw the exception.
Is this expected? I assumed that as[A] would convert the schema to the case
class schema, so the union shouldn't throw. Or is that conversion done
lazily, when the union is processed?
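
For reference, this is roughly how I'm checking that assumption (just a
sketch; Encoders.product is a generic way to get the schema that as[A]
implies from the case class):

import org.apache.spark.sql.Encoders

// Schema that as[A] implies, derived from the case class fields:
println(Encoders.product[A].schema.treeString)
// Schema the query actually produced (from the underlying table):
println(df.schema.treeString)
// Schema of the empty Dataset (also derived from the encoder):
println(df2.schema.treeString)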



2017-05-08 17:50 GMT-03:00 Burak Yavuz <brk...@gmail.com>:

> Yes, unfortunately. This should actually be fixed; the union's schema
> should take the less restrictive of the two DataFrames' types.
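>
> In the meantime, one possible workaround (just a sketch, untested against
> your exact classes) is to rebuild one side with the other side's schema so
> the nullability lines up before the union:
>
> // Re-apply the table side's schema to the empty side, then re-attach
> // the encoder (assumes the fields are positionally identical):
> val aligned = spark.createDataFrame(df2.toDF().rdd, df.schema).as[A]
> df.union(aligned)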
>
> On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hi Burak,
>> By nullability, do you mean that if I have exactly the same schema, but
>> one side supports null and the other doesn't, this exception (in Dataset
>> union) will be thrown?
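>>
>> I.e. (sketch) a pair of schemas differing only in the nullable flag,
>> something like:
>>
>> // left:  StructField("id", IntegerType, nullable = true)
>> // right: StructField("id", IntegerType, nullable = false)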
>>
>>
>>
>> 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
>>
>>> I also want to add that these mismatches are generally caused by the
>>> `nullability` field in the schema.
>>>
>>> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
>>>> This is because RDD.union doesn't check the schema, so you won't see
>>>> the problem until the RDD is actually run and hits the incompatible
>>>> column. For RDDs, you may never see an error if you don't touch the
>>>> incompatible column.
>>>>
>>>> Dataset.union requires compatible schemas. You can print ds.schema and
>>>> ds1.schema and check whether they are the same.
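>>>>
>>>> For example (sketch):
>>>>
>>>> // Print both schemas, including the nullability of every field:
>>>> println(ds.schema.treeString)
>>>> println(ds1.schema.treeString)
>>>> // Or list only the fields that differ:
>>>> ds.schema.fields.zip(ds1.schema.fields)
>>>>   .filter { case (f1, f2) => f1 != f2 }
>>>>   .foreach(println)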
>>>>
>>>> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho <
>>>> dirceu.semigh...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>> I have a very complex case class structure, with a lot of fields.
>>>>> When I try to union two Datasets of this class, it fails with the
>>>>> following error:
>>>>> ds.union(ds1)
>>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>>> Union can only be performed on tables with the compatible column types
>>>>>
>>>>> But when I use its RDD, the union works fine:
>>>>> ds.rdd.union(ds1.rdd)
>>>>> res8: org.apache.spark.rdd.RDD[
>>>>>
>>>>> Is there any reason for this to happen (besides a bug ;))?
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
