[Spark Core] [Feature] unionByName parameters

Daniel Davies Sat, 05 Feb 2022 09:47:58 -0800

Hello dev@,

I had a quick question about the unionByName function. It currently
accepts a parameter, "allowMissingColumns", that gives some tolerance when
merging datasets with different schemas [e.g. here
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2170>].
However, the implementation is a bit restrictive: because the second
parameter is a boolean, the only alternative to requiring matching column
sets is to keep every column from both dataframes, with the missing ones
filled with nulls. We have other use cases that would be useful to have
natively in Spark, for example keeping only the columns that are present in
both dataframes (and I assume other users will have different merge
strategies in mind as well). Are there any plans to make the second
parameter to unionByName a string denoting a column merging 'mode'? If not,
would the community be in favour of a PR that implements something like
this?
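
To make the "columns in both dataframes" case concrete, here is a rough
sketch of how such a mode can be emulated today from PySpark. The helper
names (resolve_columns, union_by_name_with_mode) and the mode strings
("union", "intersection") are hypothetical and not part of any Spark API;
the sketch relies only on the existing DataFrame members columns, select
and unionByName. It also compares column names exactly, whereas Spark
resolves them case-insensitively by default.

```python
from typing import List, Sequence


def resolve_columns(left: Sequence[str], right: Sequence[str],
                    mode: str = "union") -> List[str]:
    """Return the output column names for a (hypothetical) merge mode.

    "union":        every column from either side, left order first,
                    then the right-only columns appended.
    "intersection": only the columns present on both sides, in left order.
    """
    if mode == "intersection":
        right_set = set(right)
        return [c for c in left if c in right_set]
    if mode == "union":
        left_set = set(left)
        return list(left) + [c for c in right if c not in left_set]
    raise ValueError(f"unknown merge mode: {mode!r}")


def union_by_name_with_mode(left_df, right_df, mode: str = "union"):
    """Emulate a string 'mode' on top of the existing unionByName.

    left_df and right_df are assumed to be pyspark.sql.DataFrame objects;
    nothing here is imported from pyspark, so the helper is duck-typed.
    """
    if mode == "union":
        # Existing behaviour: missing columns are filled with nulls.
        return left_df.unionByName(right_df, allowMissingColumns=True)
    # Project both sides onto the resolved columns, then union by name.
    cols = resolve_columns(left_df.columns, right_df.columns, mode)
    return left_df.select(*cols).unionByName(right_df.select(*cols))
```

With a string parameter, further strategies (e.g. keeping only the left
side's columns) could be added as extra branches without another change to
the method signature, which is the main motivation for the question above.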


Kind Regards,

Daniel
