Re: [DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

Alexandros Biratsis Tue, 23 Aug 2022 22:07:52 -0700

Hi Maciej,

Sorry for the late reply. I believe you are right. Merging nested
StructTypes can be tricky. As a matter of fact, it would require complex
logic and most likely some conventions to cover all the edge cases.


What about just exposing the existing merge
<https://github.com/apache/spark/blob/36dd531a93af55ce5c2bfd8d275814ccb2846962/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L496>
(currently private) through a public merge method? Could that add some extra
flexibility to the current API?
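
To make the edge cases concrete, here is a rough Python sketch of what
by-name merging of nested schemas might involve. It is not the Scala merge
linked above and does not use Spark at all: Field and Struct are hypothetical
stand-ins that only mirror the shape of StructField and StructType, and type
names are plain strings. The conventions it picks (keep the left-hand field
order, append new fields, treat a field as nullable if either side is, raise
on conflicting leaf types) are assumptions, and each one is a point where
real use cases could reasonably differ.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical stand-ins, NOT Spark classes: they only mirror the shape of
# StructField (name, dataType, nullable) and StructType (fields).
@dataclass(frozen=True)
class Field:
    name: str
    dataType: "Union[str, Struct]"
    nullable: bool = True


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]


def merge_types(left, right, path):
    """Merge two data types: recurse into structs, reject conflicting leaves."""
    if isinstance(left, Struct) and isinstance(right, Struct):
        return merge_structs(left, right, path)
    if left == right:
        return left
    # Assumed convention: no implicit widening (e.g. int -> long).
    raise ValueError(f"conflicting types at {path!r}: {left!r} vs {right!r}")


def merge_structs(left, right, path=""):
    """Merge by field name: keep left's order, then append fields only in right."""
    merged = {f.name: f for f in left.fields}
    for rf in right.fields:
        lf = merged.get(rf.name)
        if lf is None:
            merged[rf.name] = rf
            continue
        child = f"{path}.{rf.name}" if path else rf.name
        merged[rf.name] = Field(
            rf.name,
            merge_types(lf.dataType, rf.dataType, child),
            lf.nullable or rf.nullable,  # assumed: nullable if either side is
        )
    return Struct(tuple(merged.values()))


a = Struct((
    Field("id", "int"),
    Field("address", Struct((Field("city", "string"),))),
))
b = Struct((
    Field("address", Struct((Field("zip", "string", False),))),
    Field("name", "string"),
))

merged = merge_structs(a, b)
print([f.name for f in merged.fields])
print([f.name for f in merged.fields[1].dataType.fields])
```

Arrays, maps, field metadata and type widening are left out here; each of
them would need its own convention.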

Best,
Alex

On Sun, Aug 14, 2022 at 2:10 PM Maciej <mszymkiew...@gmail.com> wrote:

> I have mixed feelings about this proposal. Merging or diffing schemas is
> a common operation, but specific requirements differ from case to case,
> especially when complex nested data is used.
>
> Even if we put ordering of the fields aside, data type equality
> semantics (StructField in particular) are likely to result in an
> implementation that is either confusing or has limited applicability.
>
> Additionally, Scala StructType is already a Seq[StructField] and as such
> provides set-like operations (contains, diff, intersect, union) as well
> as implementations of ++ / :+ / +: so we cannot do much here, without
> breaking the existing API.
>
> On 8/14/22 11:30, Alexandros Biratsis wrote:
> > Hello Rui and Tim,
> >
> > Indeed, this sounds like a good idea and quite useful. To make it more
> > formal, the field list of a StructType could be treated as a Scala/Python
> > set by providing (inheriting?) the common set functionality, i.e. add,
> > remove, concat, intersect, diff etc. The set-like functionality could be
> > part of the StructType class for both languages.
> >
> > The Scala set collection:
> > https://www.scala-lang.org/api/2.13.x/scala/collection/immutable/Set.html
> >
> > Best,
> > Alex
> >
> > On Wed, Aug 10, 2022, 08:14 Rui Wang <amaliu...@apache.org
> > <mailto:amaliu...@apache.org>> wrote:
> >
> >     Thanks for the idea!
> >
> >     I am thinking that the usage of "combined = StructType(a.fields +
> >     b.fields)" is still good because
> >     1) it is not horrible to merge a and b in this way.
> >     2) it clarifies the intention by itself, which is to merge two
> >     structs' fields to construct a new struct.
> >     3) you also have room to apply more complicated operations when
> >     merging fields. For example, remove duplicate fields with the same
> >     name, or use a.fields but remove some fields if they are in b.
> >
> >     Overloading "+" could be problematic:
> >     1. It's ambiguous what this plus is doing.
> >     2. If you define + as a concatenation of the fields, then it's
> >     limited to concatenation only. How about other operations,
> >     like extracting fields from a based on b? Maybe overloading "-"? In
> >     this case the list of operators to overload will grow.
> >
> >     -Rui
> >
> >     On Tue, Aug 9, 2022 at 1:10 PM Tim <bosse...@posteo.de
> >     <mailto:bosse...@posteo.de>> wrote:
> >
> >         Hi all,
> >
> >         this is my first message to the Spark mailing list, so please
> >         bear with me if I don't fully meet your communication standards.
> >         I just wanted to discuss one aspect that I've stumbled across
> >         several times over the past few weeks.
> >         When working with Spark, I often run into the problem of having
> >         to merge two (or more) existing StructTypes into a new one to
> >         define a schema.
> >         Usually this looks similar (in Python) to the following
> >         simplified example:
> >
> >                   from pyspark.sql.types import IntegerType, StringType, StructField, StructType
> >
> >                   a = StructType([StructField("field_a", StringType())])
> >                   b = StructType([StructField("field_b", IntegerType())])
> >
> >                   combined = StructType(a.fields + b.fields)
> >
> >         My idea, which I would like to discuss, is to shorten the above
> >         example in Python as follows by supporting Python's add operator
> >         for StructTypes:
> >
> >                   combined = a + b
> >
> >
> >         What do you think of this idea? Are there any reasons why this
> >         is not yet part of StructType's functionality?
> >         If you support this idea, I could create a first PR for further
> >         and deeper discussion.
> >
> >         Best
> >         Tim
> >
> >
> >
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>
