Hi Maciej,

Sorry for the late reply. I believe you are right. Merging nested StructTypes can be tricky. In fact, it would require complex logic, and most likely some conventions, to cover all the edge cases.

What about just exposing the existing merge
<https://github.com/apache/spark/blob/36dd531a93af55ce5c2bfd8d275814ccb2846962/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L496>
(currently private) through a public merge method? Could that add some extra flexibility to the current API?

Best,
Alex

On Sun, Aug 14, 2022 at 2:10 PM Maciej <mszymkiew...@gmail.com> wrote:

> I have mixed feelings about this proposal. Merging or diffing schemas is
> a common operation, but specific requirements differ from case to case,
> especially when complex nested data is used.
>
> Even if we put ordering of the fields aside, data type equality
> semantics (StructField in particular) are likely to result in an
> implementation that is either confusing or has limited applicability.
>
> Additionally, Scala StructType is already a Seq[StructField] and as such
> provides set-like operations (contains, diff, intersect, union) as well
> as implementations of ++ / :+ / +: so we cannot do much here without
> breaking the existing API.
>
> On 8/14/22 11:30, Alexandros Biratsis wrote:
> > Hello Rui and Tim,
> >
> > Indeed, this sounds like a good idea and quite useful. To make it more
> > formal, the fields of a StructType could be treated as a Scala/Python
> > set by providing (inheriting?) the common set functionality, i.e. add,
> > remove, concat, intersect, diff etc. The set-like functionality could
> > be part of the StructType class for both languages.
> >
> > The Scala set collection:
> > https://www.scala-lang.org/api/2.13.x/scala/collection/immutable/Set.html
> >
> > Best,
> > Alex
> >
> > On Wed, Aug 10, 2022, 08:14 Rui Wang <amaliu...@apache.org> wrote:
> >
> >     Thanks for the idea!
> >
> >     I am thinking that the usage of "combined = StructType(a.fields +
> >     b.fields)" is still good because
> >     1) it is not horrible to merge a and b in this way.
> >     2) it clarifies the intention by itself, which is to merge two
> >     structs' fields to construct a new struct.
> >     3) you also have room to apply more complicated operations on
> >     field merging, for example removing duplicate fields with the
> >     same name, or using a.fields but removing some fields if they
> >     are in b.
> >
> >     Overloading "+" could be problematic because
> >     1. it is ambiguous what this plus is doing.
> >     2. if you define + as a concatenation of the fields, then it is
> >     limited to concatenation only. How about other operations,
> >     like extracting fields from a based on b? Maybe overloading "-"?
> >     In this case the item list will grow.
> >
> >     -Rui
> >
> >     On Tue, Aug 9, 2022 at 1:10 PM Tim <bosse...@posteo.de> wrote:
> >
> >         Hi all,
> >
> >         this is my first message to the Spark mailing list, so please
> >         bear with me if I don't fully meet your communication
> >         standards.
> >         I just wanted to discuss one aspect that I've stumbled across
> >         several times over the past few weeks.
> >         When working with Spark, I often run into the problem of
> >         having to merge two (or more) existing StructTypes into a new
> >         one to define a schema.
> >         Usually this looks similar (in Python) to the following
> >         simplified example:
> >
> >             a = StructType([StructField("field_a", StringType())])
> >             b = StructType([StructField("field_b", IntegerType())])
> >
> >             combined = StructType(a.fields + b.fields)
> >
> >         My idea, which I would like to discuss, is to shorten the
> >         above example in Python as follows by supporting Python's add
> >         operator for StructTypes:
> >
> >             combined = a + b
> >
> >         What do you think of this idea? Are there any reasons why
> >         this is not yet part of StructType's functionality?
> >         If you support this idea, I could create a first PR for
> >         further and deeper discussion.
> >
> >         Best
> >         Tim
> >
> >         ---------------------------------------------------------------------
> >         To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
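
For illustration, the two readings of "+" discussed in this thread (plain concatenation versus a merge that drops duplicate field names) can be sketched in plain Python. This is only a sketch under stated assumptions: the Field class and both helper functions are hypothetical stand-ins, not PySpark API. They mimic just a field's name and data type so the snippet runs without Spark, and raising on a type conflict is one possible convention among several.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a StructField: a name and a type name only."""
    name: str
    dtype: str


def concat(a: List[Field], b: List[Field]) -> List[Field]:
    """Plain concatenation: duplicates are kept, order is a's fields then b's."""
    return a + b


def merge_by_name(a: List[Field], b: List[Field]) -> List[Field]:
    """Keep a's fields, then append those fields of b whose names are new.

    Raises ValueError if a name occurs in both inputs with different types.
    That is an assumed convention; other choices (prefer a, prefer b,
    widen the type) are equally defensible.
    """
    seen = {f.name: f for f in a}
    merged = list(a)
    for f in b:
        if f.name not in seen:
            seen[f.name] = f
            merged.append(f)
        elif seen[f.name].dtype != f.dtype:
            raise ValueError(f"conflicting types for field {f.name!r}")
    return merged


a = [Field("id", "string"), Field("field_a", "string")]
b = [Field("id", "string"), Field("field_b", "int")]

print([f.name for f in concat(a, b)])         # ['id', 'field_a', 'id', 'field_b']
print([f.name for f in merge_by_name(a, b)])  # ['id', 'field_a', 'field_b']
```

Both functions are reasonable candidates for what "a + b" should mean, and they give different results as soon as the inputs share a field name, which is the ambiguity Rui points out.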