Re: [DISCUSS] Variant Spec Location

Gang Wu Wed, 21 Aug 2024 23:34:29 -0700

It seems that we have largely reached consensus that there should be
a new home for the variant spec. The open question is whether Parquet
or Arrow is the better choice. As a committer in the Arrow, Parquet and
ORC communities, I have no preference between them and am happy to
help with the move once a decision has been made.


Should we start a vote to move forward?

Best,
Gang

On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <[email protected]>
wrote:

> >
> > That being said, I think the most important consideration for now is
> where
> > are the current maintainers / contributors to the variant type. If most
> of
> > them are already PMC members / committers on a project, it becomes a bit
> > easier. Otherwise if there isn't much overlap with a project's existing
> > governance, I worry there could be a bit of friction. How many active
> > contributors are there from Iceberg? And how about from Arrow?
>
>
> I think this is the key question. What are the requirements around
> governance?  I've seen some tangential messaging here but I'm not clear on
> what everyone expects.
>
> For a lot of the other concerns, my view is that the exact project
> does not really matter (and choosing a project with mature cross-language
> testing infrastructure, or committing to building it, is critical). IIUC we
> are talking about the following artifacts:
>
> 1.  A stand alone specification document (this can be hosted anyplace)
> 2.  A set of language bindings with minimal dependencies that can be
> consumed downstream (again, as long as dependencies are managed carefully,
> any project can host these)
> 3.  Potential integration where appropriate into file format libraries to
> support shredding (but as of now this is being bypassed by using
> conventions anyways).  My impression is that at least for Parquet there has
> been a proliferation of vectorized readers across different projects, so
> I'm not clear how much standardization in parquet-java could help here.
>
> To respond to some other questions:
>
> Arrow is not used as Spark's in-memory model, nor Trino and others so those
> > existing relationships aren't there. I also worry that differences in
> > approaches would make it difficult later on.
>
>
> While Arrow is not in the core memory model, for Spark I believe it is
> still used for IPC for things like Java<->Python. Trino also consumes Arrow
> libraries today to support things like Snowflake/Bigquery federation. But I
> think this is minor because as mentioned above I think the functional
> libraries would be relatively stand-alone.
>
> Do we think it could be introduced as a canonical extension arrow type?
>
>
>  I believe it can be, I think there are probably different layouts that can
> be supported:
>
> 1.  A struct with two variable width bytes columns (metadata and value data
> are stored separately and each entry has a 1:1 relationship).
> 2.  Shredded (according to the same convention as Parquet). I would need
> to double check, but I don't think Arrow would have problems here, although
> REE (run-end encoding) would likely be required to make this efficient
> (i.e. sparse value support is important).
>
> In both cases the main complexity is providing the necessary functions for
> manipulation.
>
> Thanks,
> Micah
>
> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <[email protected]>
> wrote:
>
> > In being more engine and format agnostic, I agree the Arrow project might
> > be a good host for such a specification. It seems like we want to move
> away
> > from hosting in Spark to make it engine agnostic. But moving into Iceberg
> > might make it less format agnostic, as I understand multiple formats
> might
> > want to implement this. I'm not intimately familiar with the state of
> this,
> > but I believe Delta Lake would like to be aligned with the same format as
> > Iceberg. In addition, the Lance format (which I work on), will eventually
> > be interesting as well. It seems equally bad to me to attach this
> > specification to a particular table format as it does a particular query
> > engine.
> >
> > That being said, I think the most important consideration for now is
> where
> > are the current maintainers / contributors to the variant type. If most
> of
> > them are already PMC members / committers on a project, it becomes a bit
> > easier. Otherwise if there isn't much overlap with a project's existing
> > governance, I worry there could be a bit of friction. How many active
> > contributors are there from Iceberg? And how about from Arrow?
> >
> > BTW, I'd add I'm interested in helping develop an Arrow extension type
> for
> > the binary variant type. I've been experimenting with a DataFusion
> > extension that operates on this [1], and already have some ideas on how
> > such an extension type might be defined. I'm not yet caught up on the
> > shredded specification, but I think having just the binary format would
> be
> > beneficial for in-memory analytics, which are most relevant to Arrow.
> I'll
> > be creating a separate thread on the Arrow ML about this soon.
> >
> > Best,
> >
> > Will Jones
> >
> > [1]
> >
> https://github.com/datafusion-contrib/datafusion-functions-variant/issues
> >
> >
> > On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <[email protected]> wrote:
> >
> > > + dev@arrow
> > >
> > > Thanks for all the valuable suggestions! I am inclined to Micah's idea
> > that
> > > Arrow might be a better host compared to Parquet.
> > >
> > > To give more context, I am taking the initiative to add the geometry
> type
> > > to both Parquet and ORC. I'd like to do the same thing for variant type
> > in
> > > that variant type is engine and file format agnostic. This does mean
> that
> > > Parquet might not be the neutral place to hold the variant spec.
> > >
> > > Best,
> > > Gang
> > >
> > > On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <[email protected]>
> > > wrote:
> > >
> > > > Thanks all for your discussion.
> > > >
> > > > The Apache Paimon community is also considering support for this
> > > > Variant type, without a doubt, we hope to maintain consistency with
> > > > Iceberg.
> > > >
> > > > Not only the Paimon community, but also various computing engines
> need
> > > > to adapt to this type, such as Flink and StarRocks. We also hope to
> > > > promote them to adapt to this type.
> > > >
> > > > It is worth noting that we also need to standardize many functions
> > > > related to it.
> > > >
> > > > A neutral place to maintain it is a great choice.
> > > >
> > > > - As Gang Wu said, a standalone project is good, just like
> > RoaringBitmap
> > > > [1].
> > > > - As Ryan said, the Parquet community is a neutral option too.
> > > > - As Micah said, Arrow is an option too.
> > > >
> > > > [1] https://github.com/RoaringBitmap
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <
> [email protected]
> > >
> > > > wrote:
> > > > >>
> > > > >> That's fair @Micah, so far all the discussions have been direct and
> > off
> > > > the dev list. Would you like to make the request on the public Spark
> > Dev
> > > > list? I would be glad to co-sign, I can also draft up a quick email
> if
> > > you
> > > > don't have time.
> > > > >
> > > > >
> > > > > I think once we come to consensus, if you have bandwidth, I think
> the
> > > > message might be better coming from you, as you have more context on
> > some
> > > > of the non-public conversations, the requirements from an Iceberg
> > > > perspective on governance and the blockers that were encountered.  If
> > > > details on the conversations can't be shared, (i.e. we are starting
> > from
> > > > scratch) it seems like suggesting a new project via SPIP might be the
> > way
> > > > forward.  I'm happy to help with that if it is useful but I would
> guess
> > > > Aihua or Tyler might be in a better place to start as it seems they
> > have
> > > > done more serious thinking here.
> > > > >
> > > > > If we decide to try to standardize on Parquet or Arrow I'm happy to
> > > help
> > > > support the effort in those communities.
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
> > > > [email protected]> wrote:
> > > > >>
> > > > >> That's fair @Micah, so far all the discussions have been direct and
> > off
> > > > the dev list. Would you like to make the request on the public Spark
> > Dev
> > > > list? I would be glad to co-sign, I can also draft up a quick email
> if
> > > you
> > > > don't have time.
> > > > >>
> > > > >> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
> > > [email protected]>
> > > > wrote:
> > > > >>>>
> > > > >>>> I agree that it would be beneficial to make a sub-project, the
> > main
> > > > problem is political and not logistical. I've been asking for movement
> > from
> > > > other relevant projects for a month and we simply haven't gotten
> > > anywhere.
> > > > >>>
> > > > >>>
> > > > >>> I just wanted to double check that these issues were brought
> > directly
> > > > to the spark community (i.e. a discussion thread on the Spark
> developer
> > > > mailing list) and not via backchannels.
> > > > >>>
> > > > >>> I'm not sure the outcome would be different and I don't think
> this
> > > > should block forking the spec, but we should make sure that the
> > decision
> > > is
> > > > publicly documented within both communities.
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Micah
> > > > >>>
> > > > >>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
> > > > [email protected]> wrote:
> > > > >>>>
> > > > >>>> @Gang Wu
> > > > >>>>
> > > > >>>> I agree that it would be beneficial to make a sub-project, the
> > main
> > > > problem is political and not logistical. I've been asking for movement
> > from
> > > > other relevant projects for a month and we simply haven't gotten
> > > anywhere.
> > > > I don't think there is anything that would stop us from moving to a
> > joint
> > > > project in the future and if you know of some way of encouraging that
> > > > movement from other relevant parties I would be glad to collaborate
> in
> > > > doing that. One thing that I don't want to do is have the Iceberg
> > project
> > > > stay in a holding pattern without any clear roadmap as to how to
> > proceed.
> > > > >>>>
> > > > >>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <[email protected]
> >
> > > > wrote:
> > > > >>>>>
> > > > >>>>> I’m on board with copying the spec into our repository.
> However,
> > as
> > > > we’ve talked about, it’s not just a straightforward copy—there are
> > > already
> > > > some divergences. Some of them are under discussion. Iceberg is
> > > definitely
> > > > the best place for these specs. Engines like Trino and Flink can then
> > > rely
> > > > on the Iceberg specs as a solid foundation.
> > > > >>>>>
> > > > >>>>> Yufei
> > > > >>>>>
> > > > >>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <[email protected]>
> > wrote:
> > > > >>>>>>
> > > > >>>>>> Sorry for chiming in late.
> > > > >>>>>>
> > > > >>>>>> From the discussion in
> > > > https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
> > > don't
> > > > quite understand why it is logistically complicated to create a
> > > sub-project
> > > > to hold the variant spec and impl.
> > > > >>>>>>
> > > > >>>>>> IMHO, copying the variant type spec into Apache Iceberg has
> some
> > > > deficiencies:
> > > > >>>>>> - It is a burden to update two repos if there is a variant
> type
> > > > spec change and will likely result in deviation if some changes do
> not
> > > > reach agreement from both parties.
> > > > >>>>>> - Implementers are required to keep an eye on both specs
> > > > (considering proprietary engines where both Iceberg and Delta are
> > > > supported).
> > > > >>>>>> - Putting the spec and impl of variant type in Iceberg repo
> does
> > > > lose the opportunity for better native support from file formats like
> > > > Parquet and ORC.
> > > > >>>>>>
> > > > >>>>>> I'm not sure if it is possible to create a separate project
> > (e.g.
> > > > apache/variant-type) to make it a single point of truth. We can learn
> > > from
> > > > the experience of Apache Arrow. In this fashion, different engines,
> > table
> > > > formats and file formats can follow the same spec and are free to
> > depend
> > > on
> > > > the reference implementations from apache/variant-type or implement
> > their
> > > > own.
> > > > >>>>>>
> > > > >>>>>> Best,
> > > > >>>>>> Gang
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <[email protected]
> >
> > > > wrote:
> > > > >>>>>>>
> > > > >>>>>>> +1 for copying the spec into our repository, I think we need
> to
> > > > own it fully as a part of the table spec, and we can build
> > compatibility
> > > > through tests.
> > > > >>>>>>>
> > > > >>>>>>> -Jack
> > > > >>>>>>>
> > > > >>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
> > > > [email protected]> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> I'm not really in favor of linking and annotating as that
> just
> > > > makes things more complicated and still is essentially forking just
> > with
> > > > more steps. If we just track our annotations / modifications to a
> > single
> > > > commit/version then we have the same issue again but now you have to
> go
> > > to
> > > > multiple sources to get the actual Spec. In addition, our very copy
> of
> > > the
> > > > Spec is going to require new types which don't exist in the Spark
> Spec
> > > > which necessarily means diverging. We will need to take up new
> > primitive
> > > > id's (as noted in my first email)
> > > > >>>>>>>>
> > > > >>>>>>>> The other issue I have is I don't think the Spark Spec is
> > really
> > > > going through a thorough review process from all members of the Spark
> > > > community, I believe it probably should have gone through the SPIP
> but
> > > > instead seems to have been merged without broad community
> involvement.
> > > > >>>>>>>>
> > > > >>>>>>>> The only way to truly avoid diverging is to only have a
> single
> > > > copy of the spec, in our previous discussions the vast majority of
> > Apache
> > > > Iceberg community want it to exist here.
> > > > >>>>>>>>
> > > > >>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <
> > [email protected]
> > > >
> > > > wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> I'm really excited about the introduction of variant type
> to
> > > > Iceberg, but I want to raise concerns about forking the spec.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I feel like preemptively forking would create the situation
> > > > where we end up diverging because there's little reason to work with
> > both
> > > > communities to evolve in a way that benefits everyone.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I would much rather point to a specific version of the spec
> > and
> > > > annotate any variance in Iceberg's handling.  This would allow us to
> > > > continue without dividing the communities.
> > > > >>>>>>>>>
> > > > >>>>>>>>> If at any point there are irreconcilable differences, I
> would
> > > > support forking, but I don't feel like that should be the initial
> step.
> > > > >>>>>>>>>
> > > > >>>>>>>>> No one is excited about the possibility that the physical
> > > > representations end up diverging, but it feels like we're setting
> > > ourselves
> > > > up for that exact scenario.
> > > > >>>>>>>>>
> > > > >>>>>>>>> -Dan
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <
> > > > [email protected]> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> +1 to what's already being said here. It is good to copy
> the
> > > > spec to Iceberg and add context that's specific to Iceberg, but at
> the
> > > same
> > > > time, we should maintain compatibility.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Kind regards,
> > > > >>>>>>>>>> Fokko
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Op wo 14 aug 2024 om 15:30 schreef Manu Zhang <
> > > > [email protected]>:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> +1 to copy the spec into our repository. I think the best
> > way
> > > > to keep compatibility is building integration tests.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>> Manu
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
> > > > [email protected]> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Given the differences between the supported types and
> the
> > > > lack of interest from the other project, I think it is reasonable to
> > > > duplicate the specification to our repository.
> > > > >>>>>>>>>>>> I would give very strong emphasis on sticking to the
> Spark
> > > > spec as much as possible, to keep compatibility as much as possible.
> > > Maybe
> > > > even revert to a shared specification if the situation changes.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>> Peter
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Aihua Xu <[email protected]> ezt írta (időpont: 2024.
> > aug.
> > > > 13., K, 19:52):
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks Russell for bringing this up.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> This is the main blocker to move forward with the
> Variant
> > > > support in Iceberg and hopefully we can have a consensus. To me, I
> also
> > > > feel it makes more sense to move the spec into Iceberg rather than
> > > having the Spark
> > > > engine own it while we try to stay compatible with the Spark spec.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>> Aihua
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
> > > > [email protected]> wrote:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Hi Y’all,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant
> > Proposal,
> > > > while we were hoping to move the Variant and Shredding specifications
> > > from
> > > > Spark into Iceberg there doesn’t seem to be a lot of interest in
> that.
> > > > Unfortunately, I think we have a number of issues with just linking
> to
> > > the
> > > > Spark project directly from within Iceberg and I believe we need to
> > copy
> > > > the specifications into our repository.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> First, we have a divergence of types already. The
> Spark
> > > > Specification already includes types which Iceberg has no definition
> > for
> > > > (19, 20 - Interval Types) and Iceberg already has a type which is not
> > > > included within the Spark Specification (Time) and will soon have
> more
> > > with
> > > > TimestampNS, and Geo.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a
> > > hard
> > > > dependency for other engines. We are working with several
> implementers
> > of
> > > > the Iceberg spec and it has previously been agreed that it would be
> > best
> > > if
> > > > the source of truth for Variant existed in an engine and file format
> > > > neutral location. The Iceberg project has a good open model of
> > governance
> > > > and, as we have seen so far discussing Variant, open and active
> > > > collaboration. This would also help as we can strictly version our
> > > changes
> > > > in-line with the rest of the Iceberg spec.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and
> > > > requires some group analysis and discussion before we commit it. I
> > think
> > > > again the Iceberg community is probably the right place for this to
> > > happen
> > > > as we have already started discussions here on these topics.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> For these reasons I think we should go with a direct
> > copy
> > > > of the existing specification from the Spark Project and move ahead
> > with
> > > > our discussions and modifications within Iceberg. That said, I do not
> > > want
> > > > to diverge if possible from the Spark proposal. For example, although
> > we
> > > do
> > > > not use the Interval types above, I think we should not reuse those
> > type
> > > > ids within our spec. Iceberg's Variant Spec types 19 and 20 would
> > remain
> > > > unused along with any other types we think are not applicable. We
> > should
> > > > strive whenever possible to allow for compatibility.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> In the interest of moving forward with this proposal I
> > am
> > > > hoping to see if anyone in the community objects to this plan going
> > > forward
> > > > or has a better alternative.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> As always I am thankful for your time and am eager to
> > hear
> > > > back from everyone,
> > > > >>>>>>>>>>>>>> Russ
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > >
> > >
> >
>
