Re: [DISCUSS] Variant Spec Location

Curt Hagenlocher Thu, 22 Aug 2024 08:08:26 -0700

This seems to straddle that line, in that you can also view this as a way
to represent semi-structured data in a manner that allows for more
efficient querying and computation by breaking out some of its components
into a more structured form.


(I also happen to want a canonical Arrow representation for variant data,
as this type occurs in many databases but doesn't have a great
representation today in ADBC results. That's why I filed apache/arrow issue
#42069, "[Format] Consider adding an official variant type to Arrow"
<https://github.com/apache/arrow/issues/42069>. Of course,
there's no specific reason why a canonical Arrow representation for
variants must align with Spark and/or Iceberg.)

-Curt

On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <[email protected]> wrote:

>
> Ah, thanks. I've tried to find a rationale and ended up on
> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it
> a good description of what you're after?
>
> If so, then I don't think Arrow is a good match. This seems mostly to be
> a marshalling format for semi-structured data (like Avro?). Arrow data
> types are meant to be in a representation ideal for querying and
> computation, rather than transport and storage.
>
> This could be developed separately and then be represented in Arrow
> using an extension type (perhaps a canonical one as in
> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>
> What do other Arrow developers think?
>
> Regards
>
> Antoine.
>
>
> Le 22/08/2024 à 10:45, Gang Wu a écrit :
> > Sorry for the inconvenience.
> >
> > This is the permalink for the discussion:
> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
> >
> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <[email protected]>
> wrote:
> >
> >>
> >> Hi Gang,
> >>
> >> Sorry, but can you give a pointer to the start of this discussion thread
> >> in a readable format (for example a mailing-list archive)? It appears
> >> that dev@arrow wasn't cc'ed from the start and that can make it
> >> difficult to understand what this is about.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 22/08/2024 à 08:32, Gang Wu a écrit :
> >>> It seems that we have reached a consensus to some extent that there
> >>> should be a new home for the variant spec. The pending question
> >>> is whether Parquet or Arrow is a better choice. As a committer from
> >> Arrow,
> >>> Parquet and ORC communities, I am neutral on the choice and happy to
> >>> help with the movement once a decision has been made.
> >>>
> >>> Should we start a vote to move forward?
> >>>
> >>> Best,
> >>> Gang
> >>>
> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <[email protected]
> >
> >>> wrote:
> >>>
> >>>>>
> >>>>> That being said, I think the most important consideration for now is
> >>>> where
> >>>>> are the current maintainers / contributors to the variant type. If
> most
> >>>> of
> >>>>> them are already PMC members / committers on a project, it becomes a
> >> bit
> >>>>> easier. Otherwise if there isn't much overlap with a project's
> existing
> >>>>> governance, I worry there could be a bit of friction. How many active
> >>>>> contributors are there from Iceberg? And how about from Arrow?
> >>>>
> >>>>
> >>>> I think this is the key question. What are the requirements around
> >>>> governance?  I've seen some tangential messaging here but I'm not
> clear
> >> on
> >>>> what everyone expects.
> >>>>
> >>>> I think for a lot of the other concerns my view is that the exact
> >> project
> >>>> does not really matter (and choosing a project with mature cross
> >> language
> >>>> testing infrastructure or committing to building it is critical). IIUC
> >> we
> >>>> are talking about following artifacts:
> >>>>
> >>>> 1.  A stand alone specification document (this can be hosted anyplace)
> >>>> 2.  A set of language bindings with minimal dependencies can be
> consumed
> >>>> downstream (again, as long as dependencies are managed carefully any
> >>>> project can host these)
> >>>> 3.  Potential integration where appropriate into file format libraries
> >> to
> >>>> support shredding (but as of now this is being bypassed by using
> >>>> conventions anyways).  My impression is that at least for Parquet
> there
> >> has
> >>>> been a proliferation of vectorized readers across different projects,
> so
> >>>> I'm not clear how much standardization in parquet-java could help
> here.
> >>>>
> >>>> To respond to some other questions:
> >>>>
> >>>> Arrow is not used as Spark's in-memory model, nor Trino and others so
> >> those
> >>>>> existing relationships aren't there. I also worry that differences in
> >>>>> approaches would make it difficult later on.
> >>>>
> >>>>
> >>>> While Arrow is not in the core memory model, for Spark I believe it is
> >>>> still used for IPC for things like Java<->Python. Trino also consumes
> >> Arrow
> >>>> libraries today to support things like Snowflake/BigQuery federation.
> >> But I
> >>>> think this is minor because as mentioned above I think the functional
> >>>> libraries would be relatively stand-alone.
> >>>>
> >>>> Do we think it could be introduced as a canonical extension arrow
> type?
> >>>>
> >>>>
> >>>>    I believe it can be, I think there are probably different layouts
> >> that can
> >>>> be supported:
> >>>>
> >>>> 1.  A struct with two variable width bytes columns (metadata and value
> >> data
> >>>> are stored separately and each entry has a 1:1 relationship).
> >>>> 2.  Shredded (shredded according to the same convention as parquet), I
> >>>> would need to double check but I don't think Arrow would have problems
> >> here
> >>>> but REE would likely be required to make this efficient (i.e. sparse
> >> value
> >>>> support is important).
> >>>>
> >>>> In both cases the main complexity is providing the necessary functions
> >> for
> >>>> manipulation.
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> In being more engine and format agnostic, I agree the Arrow project
> >> might
> >>>>> be a good host for such a specification. It seems like we want to
> move
> >>>> away
> >>>>> from hosting in Spark to make it engine agnostic. But moving into
> >> Iceberg
> >>>>> might make it less format agnostic, as I understand multiple formats
> >>>> might
> >>>>> want to implement this. I'm not intimately familiar with the state of
> >>>> this,
> >>>>> but I believe Delta Lake would like to be aligned with the same
> format
> >> as
> >>>>> Iceberg. In addition, the Lance format (which I work on), will
> >> eventually
> >>>>> be interesting as well. It seems equally bad to me to attach this
> >>>>> specification to a particular table format as it does a particular
> >> query
> >>>>> engine.
> >>>>>
> >>>>> That being said, I think the most important consideration for now is
> >>>> where
> >>>>> are the current maintainers / contributors to the variant type. If
> most
> >>>> of
> >>>>> them are already PMC members / committers on a project, it becomes a
> >> bit
> >>>>> easier. Otherwise if there isn't much overlap with a project's
> existing
> >>>>> governance, I worry there could be a bit of friction. How many active
> >>>>> contributors are there from Iceberg? And how about from Arrow?
> >>>>>
> >>>>> BTW, I'd add I'm interested in helping develop an Arrow extension
> type
> >>>> for
> >>>>> the binary variant type. I've been experimenting with a DataFusion
> >>>>> extension that operates on this [1], and already have some ideas on
> how
> >>>>> such an extension type might be defined. I'm not yet caught up on the
> >>>>> shredded specification, but I think having just the binary format
> would
> >>>> be
> >>>>> beneficial for in-memory analytics, which are most relevant to Arrow.
> >>>> I'll
> >>>>> be creating a separate thread on the Arrow ML about this soon.
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Will Jones
> >>>>>
> >>>>> [1]
> >>>>>
> >>>>
> >>
> https://github.com/datafusion-contrib/datafusion-functions-variant/issues
> >>>>>
> >>>>>
> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <[email protected]> wrote:
> >>>>>
> >>>>>> + dev@arrow
> >>>>>>
> >>>>>> Thanks for all the valuable suggestions! I am inclined to Micah's
> idea
> >>>>> that
> >>>>>> Arrow might be a better host compared to Parquet.
> >>>>>>
> >>>>>> To give more context, I am taking the initiative to add the geometry
> >>>> type
> >>>>>> to both Parquet and ORC. I'd like to do the same thing for variant
> >> type
> >>>>> in
> >>>>>> that variant type is engine and file format agnostic. This does mean
> >>>> that
> >>>>>> Parquet might not be the neutral place to hold the variant spec.
> >>>>>>
> >>>>>> Best,
> >>>>>> Gang
> >>>>>>
> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <
> [email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Thanks all for your discussion.
> >>>>>>>
> >>>>>>> The Apache Paimon community is also considering support for this
> >>>>>>> Variant type, without a doubt, we hope to maintain consistency with
> >>>>>>> Iceberg.
> >>>>>>>
> >>>>>>> Not only the Paimon community, but also various computing engines
> >>>> need
> >>>>>>> to adapt to this type, such as Flink and StarRocks. We also hope to
> >>>>>>> promote them to adapt to this type.
> >>>>>>>
> >>>>>>> It is worth noting that we also need to standardize many functions
> >>>>>>> related to it.
> >>>>>>>
> >>>>>>> A neutral place to maintain it is a great choice.
> >>>>>>>
> >>>>>>> - As Gang Wu said, a standalone project is good, just like
> >>>>> RoaringBitmap
> >>>>>>> [1].
> >>>>>>> - As Ryan said, Parquet community is a neutral option too.
> >>>>>>> - As Micah said, Arrow is also an option too.
> >>>>>>>
> >>>>>>> [1] https://github.com/RoaringBitmap
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jingsong
> >>>>>>>
> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <
> >>>> [email protected]
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct
> and
> >>>>> off
> >>>>>>> the dev list. Would you like to make the request on the public
> Spark
> >>>>> Dev
> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick email
> >>>> if
> >>>>>> you
> >>>>>>> don't have time.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I think once we come to consensus, if you have bandwidth, I think
> >>>> the
> >>>>>>> message might be better coming from you, as you have more context
> on
> >>>>> some
> >>>>>>> of the non-public conversations, the requirements from an Iceberg
> >>>>>>> perspective on governance and the blockers that were encountered.
> If
> >>>>>>> details on the conversations can't be shared, (i.e. we are starting
> >>>>> from
> >>>>>>> scratch) it seems like suggesting a new project via SPIP might be
> the
> >>>>> way
> >>>>>>> forward.  I'm happy to help with that if it is useful but I would
> >>>> guess
> >>>>>>> Aihua or Tyler might be in a better place to start as it seems they
> >>>>> have
> >>>>>>> done more serious thinking here.
> >>>>>>>>
> >>>>>>>> If we decide to try to standardize on Parquet or Arrow I'm happy
> to
> >>>>>> help
> >>>>>>> support the effort in those communities.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Micah
> >>>>>>>>
> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
> >>>>>>> [email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct
> and
> >>>>> off
> >>>>>>> the dev list. Would you like to make the request on the public
> Spark
> >>>>> Dev
> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick email
> >>>> if
> >>>>>> you
> >>>>>>> don't have time.
> >>>>>>>>>
> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
> >>>>>> [email protected]>
> >>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the
> >>>>> main
> >>>>>>> problem is political and not logistic. I've been asking for
> movement
> >>>>> from
> >>>>>>> other relevant projects for a month and we simply haven't gotten
> >>>>>> anywhere.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I just wanted to double check that these issues were brought
> >>>>> directly
> >>>>>>> to the spark community (i.e. a discussion thread on the Spark
> >>>> developer
> >>>>>>> mailing list) and not via backchannels.
> >>>>>>>>>>
> >>>>>>>>>> I'm not sure the outcome would be different and I don't think
> >>>> this
> >>>>>>> should block forking the spec, but we should make sure that the
> >>>>> decision
> >>>>>> is
> >>>>>>> publicly documented within both communities.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Micah
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
> >>>>>>> [email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> @Gang Wu
> >>>>>>>>>>>
> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the
> >>>>> main
> >>>>>>> problem is political and not logistic. I've been asking for
> movement
> >>>>> from
> >>>>>>> other relevant projects for a month and we simply haven't gotten
> >>>>>> anywhere.
> >>>>>>> I don't think there is anything that would stop us from moving to a
> >>>>> joint
> >>>>>>> project in the future and if you know of some way of encouraging
> that
> >>>>>>> movement from other relevant parties I would be glad to collaborate
> >>>> in
> >>>>>>> doing that. One thing that I don't want to do is have the Iceberg
> >>>>> project
> >>>>>>> stay in a holding pattern without any clear roadmap as to how to
> >>>>> proceed.
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <
> [email protected]
> >>>>>
> >>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I’m on board with copying the spec into our repository.
> >>>> However,
> >>>>> as
> >>>>>>> we’ve talked about, it’s not just a straightforward copy—there are
> >>>>>> already
> >>>>>>> some divergences. Some of them are under discussion. Iceberg is
> >>>>>> definitely
> >>>>>>> the best place for these specs. Engines like Trino and Flink can
> then
> >>>>>> rely
> >>>>>>> on the Iceberg specs as a solid foundation.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yufei
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <[email protected]>
> >>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry for chiming in late.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   From the discussion in
> >>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq,
> I
> >>>>>> don't
> >>>>>>> quite understand why it is logistically complicated to create a
> >>>>>> sub-project
> >>>>>>> to hold the variant spec and impl.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has
> >>>> some
> >>>>>>> deficiencies:
> >>>>>>>>>>>>> - It is a burden to update two repos if there is a variant
> >>>> type
> >>>>>>> spec change and will likely result in deviation if some changes do
> >>>> not
> >>>>>>> reach agreement from both parties.
> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs
> >>>>>>> (considering proprietary engines where both Iceberg and Delta are
> >>>>>>> supported).
> >>>>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg repo
> >>>> does
> >>>>>>> lose the opportunity for better native support from file formats
> like
> >>>>>>> Parquet and ORC.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate project
> >>>>> (e.g.
> >>>>>>> apache/variant-type) to make it a single point of truth. We can
> learn
> >>>>>> from
> >>>>>>> the experience of Apache Arrow. In this fashion, different engines,
> >>>>> table
> >>>>>>> formats and file formats can follow the same spec and are free to
> >>>>> depend
> >>>>>> on
> >>>>>>> the reference implementations from apache/variant-type or implement
> >>>>> their
> >>>>>>> own.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Gang
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <
> [email protected]
> >>>>>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> +1 for copying the spec into our repository, I think we need
> >>>> to
> >>>>>>> own it fully as a part of the table spec, and we can build
> >>>>> compatibility
> >>>>>>> through tests.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Jack
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
> >>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating as that
> >>>> just
> >>>>>>> makes things more complicated and still is essentially forking just
> >>>>> with
> >>>>>>> more steps. If we just track our annotations / modifications  to a
> >>>>> single
> >>>>>>> commit/version then we have the same issue again but now you have
> to
> >>>> go
> >>>>>> to
> >>>>>>> multiple sources to get the actual Spec. In addition, our very copy
> >>>> of
> >>>>>> the
> >>>>>>> Spec is going to require new types which don't exist in the Spark
> >>>> Spec
> >>>>>>> which necessarily means diverging. We will need to take up new
> >>>>> primitive
> >>>>>>> id's (as noted in my first email)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is
> >>>>> really
> >>>>>>> going through a thorough review process from all members of the
> Spark
> >>>>>>> community, I believe it probably should have gone through the SPIP
> >>>> but
> >>>>>>> instead seems to have been merged without broad community
> >>>> involvement.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to only have a
> >>>> single
> >>>>>>> copy of the spec, in our previous discussions the vast majority of
> >>>>> Apache
> >>>>>>> Iceberg community want it to exist here.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <
> >>>>> [email protected]
> >>>>>>>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant type
> >>>> to
> >>>>>>> Iceberg, but I want to raise concerns about forking the spec.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the
> situation
> >>>>>>> where we end up diverging because there's little reason to work
> with
> >>>>> both
> >>>>>>> communities to evolve in a way that benefits everyone.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I would much rather point to a specific version of the
> spec
> >>>>> and
> >>>>>>> annotate any variance in Iceberg's handling.  This would allow us
> to
> >>>>>>> continue without dividing the communities.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences, I
> >>>> would
> >>>>>>> support forking, but I don't feel like that should be the initial
> >>>> step.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> No one is excited about the possibility that the physical
> >>>>>>> representations end up diverging, but it feels like we're setting
> >>>>>> ourselves
> >>>>>>> up for that exact scenario.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -Dan
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <
> >>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy
> >>>> the
> >>>>>>> spec to Iceberg and add context that's specific to Iceberg, but at
> >>>> the
> >>>>>> same
> >>>>>>> time, we should maintain compatibility.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>> Fokko
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Op wo 14 aug 2024 om 15:30 schreef Manu Zhang <
> >>>>>>> [email protected]>:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the
> best
> >>>>> way
> >>>>>>> to keep compatibility is building integration tests.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>> Manu
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
> >>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types and
> >>>> the
> >>>>>>> lack of interest from the other project, I think it is reasonable
> to
> >>>>>>> duplicate the specification to our repository.
> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the
> >>>> Spark
> >>>>>>> spec as much as possible, to keep compatibility as much as
> possible.
> >>>>>> Maybe
> >>>>>>> even revert to a shared specification if the situation changes.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Aihua Xu <[email protected]> ezt írta (időpont: 2024.
> >>>>> aug.
> >>>>>>> 13., K, 19:52):
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the
> >>>> Variant
> >>>>>>> support in Iceberg and hopefully we can have a consensus. To me, I
> >>>> also
> >>>>>>> feel it makes more sense to move the spec into Iceberg rather than
> >>>>> have the Spark
> >>>>>>> engine own it while we try to keep it compatible with the Spark spec.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>> Aihua
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
> >>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Y’all,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant
> >>>>> Proposal,
> >>>>>>> while we were hoping to move the Variant and Shredding
> specifications
> >>>>>> from
> >>>>>>> Spark into Iceberg there doesn’t seem to be a lot of interest in
> >>>> that.
> >>>>>>> Unfortunately, I think we have a number of issues with just linking
> >>>> to
> >>>>>> the
> >>>>>>> Spark project directly from within Iceberg and I believe we need to
> >>>>> copy
> >>>>>>> the specifications into our repository.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The
> >>>> Spark
> >>>>>>> Specification already includes types which Iceberg has no
> definition
> >>>>> for
> >>>>>>> (19, 20 - Interval Types) and Iceberg already has a type which is
> not
> >>>>>>> included within the Spark Specification (Time) and will soon have
> >>>> more
> >>>>>> with
> >>>>>>> TimestampNS, and Geo.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not
> a
> >>>>>> hard
> >>>>>>> dependency for other engines. We are working with several
> >>>> implementers
> >>>>> of
> >>>>>>> the Iceberg spec and it has previously been agreed that it would be
> >>>>> best
> >>>>>> if
> >>>>>>> the source of truth for Variant existed in an engine and file
> format
> >>>>>>> neutral location. The Iceberg project has a good open model of
> >>>>> governance
> >>>>>>> and, as we have seen so far discussing Variant, open and active
> >>>>>>> collaboration. This would also help as we can strictly version our
> >>>>>> changes
> >>>>>>> in-line with the rest of the Iceberg spec.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and
> >>>>>>> requires some group analysis and discussion before we commit it. I
> >>>>> think
> >>>>>>> again the Iceberg community is probably the right place for this to
> >>>>>> happen
> >>>>>>> as we have already started discussions here on these topics.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct
> >>>>> copy
> >>>>>>> of the existing specification from the Spark Project and move ahead
> >>>>> with
> >>>>>>> our discussions and modifications within Iceberg. That said, I do
> not
> >>>>>> want
> >>>>>>> to diverge if possible from the Spark proposal. For example,
> although
> >>>>> we
> >>>>>> do
> >>>>>>> not use the Interval types above, I think we should not reuse those
> >>>>> type
> >>>>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20 would
> >>>>> remain
> >>>>>>> unused along with any other types we think are not applicable. We
> >>>>> should
> >>>>>>> strive whenever possible to allow for compatibility.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this proposal
> I
> >>>>> am
> >>>>>>> hoping to see if anyone in the community objects to this plan going
> >>>>>> forward
> >>>>>>> or has a better alternative.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to
> >>>>> hear
> >>>>>>> back from everyone,
> >>>>>>>>>>>>>>>>>>>>> Russ
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
