Re: [DISCUSS] Variant Spec Location

Aihua Xu Thu, 22 Aug 2024 21:54:05 -0700

From this thread
https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, it seems
the Spark community is leaning toward moving to Parquet.


Gang, can you help start a discussion in the Parquet community on adopting
and maintaining the Variant spec?

On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenlocher <[email protected]>
wrote:

> This seems to straddle that line, in that you can also view this as a way
> to represent semi-structured data in a manner that allows for more
> efficient querying and computation by breaking out some of its components
> into a more structured form.
>
> (I also happen to want a canonical Arrow representation for variant data,
> as this type occurs in many databases but doesn't have a great
> representation today in ADBC results. That's why I filed [Format]
> Consider adding an official variant type to Arrow · Issue #42069 ·
> apache/arrow (github.com) <https://github.com/apache/arrow/issues/42069>.
> Of course, there's no specific reason why a canonical Arrow
> representation for variants must align with Spark and/or Iceberg.)
>
> -Curt
>
> On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <[email protected]> wrote:
>
>>
>> Ah, thanks. I've tried to find a rationale and ended up on
>> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it
>> a good description of what you're after?
>>
>> If so, then I don't think Arrow is a good match. This seems mostly to be
>> a marshalling format for semi-structured data (like Avro?). Arrow data
>> types are meant to be in a representation ideal for querying and
>> computation, rather than transport and storage.
>>
>> This could be developed separately and then be represented in Arrow
>> using an extension type (perhaps a canonical one as in
>> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>>
>> What do other Arrow developers think?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> Le 22/08/2024 à 10:45, Gang Wu a écrit :
>> > Sorry for the inconvenience.
>> >
>> > This is the permalink for the discussion:
>> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
>> >
>> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <[email protected]>
>> wrote:
>> >
>> >>
>> >> Hi Gang,
>> >>
>> >> Sorry, but can you give a pointer to the start of this discussion
>> thread
>> >> in a readable format (for example a mailing-list archive)? It appears
>> >> that dev@arrow wasn't cc'ed from the start and that can make it
>> >> difficult to understand what this is about.
>> >>
>> >> Regards
>> >>
>> >> Antoine.
>> >>
>> >>
>> >> Le 22/08/2024 à 08:32, Gang Wu a écrit :
>> >>> It seems that we have reached a consensus to some extent that there
>> >>> should be a new home for the variant spec. The pending question
>> >>> is whether Parquet or Arrow is a better choice. As a committer from
>> >> Arrow,
>> >>> Parquet and ORC communities, I am neutral to choose any and happy to
>> >>> help with the movement once a decision has been made.
>> >>>
>> >>> Should we start a vote to move forward?
>> >>>
>> >>> Best,
>> >>> Gang
>> >>>
>> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <
>> [email protected]>
>> >>> wrote:
>> >>>
>> >>>>>
>> >>>>> That being said, I think the most important consideration for now is
>> >>>> where
>> >>>>> are the current maintainers / contributors to the variant type. If
>> most
>> >>>> of
>> >>>>> them are already PMC members / committers on a project, it becomes a
>> >> bit
>> >>>>> easier. Otherwise if there isn't much overlap with a project's
>> existing
>> >>>>> governance, I worry there could be a bit of friction. How many
>> active
>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>> >>>>
>> >>>>
>> >>>> I think this is the key question. What are the requirements around
>> >>>> governance?  I've seen some tangential messaging here but I'm not
>> clear
>> >> on
>> >>>> what everyone expects.
>> >>>>
>> >>>> I think for a lot of the other concerns my view is that the exact
>> >> project
>> >>>> does not really matter (and choosing a project with mature cross
>> >> language
>> >>>> testing infrastructure or committing to building it is critical).
>> IIUC
>> >> we
>> >>>> are talking about following artifacts:
>> >>>>
>> >>>> 1.  A stand alone specification document (this can be hosted
>> anyplace)
>> >>>> 2.  A set of language bindings with minimal dependencies can be
>> consumed
>> >>>> downstream (again, as long as dependencies are managed carefully any
>> >>>> project can host these)
>> >>>> 3.  Potential integration where appropriate into file format
>> libraries
>> >> to
>> >>>> support shredding (but as of now this is being bypassed by using
>> >>>> conventions anyways).  My impression is that at least for Parquet
>> there
>> >> has
>> >>>> been a proliferation of vectorized readers across different
>> projects, so
>> >>>> I'm not clear how much standardization in parquet-java could help
>> here.
>> >>>>
>> >>>> To respond to some other questions:
>> >>>>
>> >>>> Arrow is not used as Spark's in-memory model, nor Trino and others so
>> >> those
>> >>>>> existing relationships aren't there. I also worry that differences
>> in
>> >>>>> approaches would make it difficult later on.
>> >>>>
>> >>>>
>> >>>> While Arrow is not in the core memory model, for Spark I believe it
>> is
>> >>>> still used for IPC for things like Java<->Python. Trino also consumes
>> >> Arrow
>> >>>> libraries today to support things like Snowflake/Bigquery federation.
>> >> But I
>> >>>> think this is minor because as mentioned above I think the functional
>> >>>> libraries would be relatively stand-alone.
>> >>>>
>> >>>> Do we think it could be introduced as a canonical extension arrow
>> type?
>> >>>>
>> >>>>
>> >>>>    I believe it can be, I think there are probably different layouts
>> >> that can
>> >>>> be supported:
>> >>>>
>> >>>> 1.  A struct with two variable width bytes columns (metadata and
>> value
>> >> data
>> >>>> are stored separately and each entry has a 1:1 relationship).
>> >>>> 2.  Shredded (shredded according to the same convention as parquet),
>> I
>> >>>> would need to double check but I don't think Arrow would have
>> problems
>> >> here
>> >>>> but REE would likely be required to make this efficient (i.e. sparse
>> >> value
>> >>>> support is important).
>> >>>>
>> >>>> In both cases the main complexity is providing the necessary
>> functions
>> >> for
>> >>>> manipulation.
>> >>>>
>> >>>> Thanks,
>> >>>> Micah
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>>> In being more engine and format agnostic, I agree the Arrow project
>> >> might
>> >>>>> be a good host for such a specification. It seems like we want to
>> move
>> >>>> away
>> >>>>> from hosting in Spark to make it engine agnostic. But moving into
>> >> Iceberg
>> >>>>> might make it less format agnostic, as I understand multiple formats
>> >>>> might
>> >>>>> want to implement this. I'm not intimately familiar with the state
>> of
>> >>>> this,
>> >>>>> but I believe Delta Lake would like to be aligned with the same
>> format
>> >> as
>> >>>>> Iceberg. In addition, the Lance format (which I work on), will
>> >> eventually
>> >>>>> be interesting as well. It seems equally bad to me to attach this
>> >>>>> specification to a particular table format as it does a particular
>> >> query
>> >>>>> engine.
>> >>>>>
>> >>>>> That being said, I think the most important consideration for now is
>> >>>> where
>> >>>>> are the current maintainers / contributors to the variant type. If
>> most
>> >>>> of
>> >>>>> them are already PMC members / committers on a project, it becomes a
>> >> bit
>> >>>>> easier. Otherwise if there isn't much overlap with a project's
>> existing
>> >>>>> governance, I worry there could be a bit of friction. How many
>> active
>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>> >>>>>
>> >>>>> BTW, I'd add I'm interested in helping develop an Arrow extension
>> type
>> >>>> for
>> >>>>> the binary variant type. I've been experimenting with a DataFusion
>> >>>>> extension that operates on this [1], and already have some ideas on
>> how
>> >>>>> such an extension type might be defined. I'm not yet caught up on
>> the
>> >>>>> shredded specification, but I think having just the binary format
>> would
>> >>>> be
>> >>>>> beneficial for in-memory analytics, which are most relevant to
>> Arrow.
>> >>>> I'll
>> >>>>> be creating a separate thread on the Arrow ML about this soon.
>> >>>>>
>> >>>>> Best,
>> >>>>>
>> >>>>> Will Jones
>> >>>>>
>> >>>>> [1]
>> >>>>>
>> >>>>
>> >>
>> https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <[email protected]> wrote:
>> >>>>>
>> >>>>>> + dev@arrow
>> >>>>>>
>> >>>>>> Thanks for all the valuable suggestions! I am inclined to Micah's
>> idea
>> >>>>> that
>> >>>>>> Arrow might be a better host compared to Parquet.
>> >>>>>>
>> >>>>>> To give more context, I am taking the initiative to add the
>> geometry
>> >>>> type
>> >>>>>> to both Parquet and ORC. I'd like to do the same thing for variant
>> >> type
>> >>>>> in
>> >>>>>> that variant type is engine and file format agnostic. This does
>> mean
>> >>>> that
>> >>>>>> Parquet might not be the neutral place to hold the variant spec.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Gang
>> >>>>>>
>> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <
>> [email protected]>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Thanks all for your discussion.
>> >>>>>>>
>> >>>>>>> The Apache Paimon community is also considering support for this
>> >>>>>>> Variant type, without a doubt, we hope to maintain consistency
>> with
>> >>>>>>> Iceberg.
>> >>>>>>>
>> >>>>>>> Not only the Paimon community, but also various computing engines
>> >>>> need
>> >>>>>>> to adapt to this type, such as Flink and StarRocks. We also hope
>> to
>> >>>>>>> promote them to adapt to this type.
>> >>>>>>>
>> >>>>>>> It is worth noting that we also need to standardize many functions
>> >>>>>>> related to it.
>> >>>>>>>
>> >>>>>>> A neutral place to maintain it is a great choice.
>> >>>>>>>
>> >>>>>>> - As Gang Wu said, a standalone project is good, just like
>> >>>>> RoaringBitmap
>> >>>>>>> [1].
>> >>>>>>> - As Ryan said, Parquet community is a neutral option too.
>> >>>>>>> - As Micah said, Arrow is an option too.
>> >>>>>>>
>> >>>>>>> [1] https://github.com/RoaringBitmap
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> Jingsong
>> >>>>>>>
>> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <
>> >>>> [email protected]
>> >>>>>>
>> >>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct
>> and
>> >>>>> off
>> >>>>>>> the dev list. Would you like to make the request on the public
>> Spark
>> >>>>> Dev
>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick
>> email
>> >>>> if
>> >>>>>> you
>> >>>>>>> don't have time.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I think once we come to consensus, if you have bandwidth, I think
>> >>>> the
>> >>>>>>> message might be better coming from you, as you have more context
>> on
>> >>>>> some
>> >>>>>>> of the non-public conversations, the requirements from an Iceberg
>> >>>>>>> perspective on governance and the blockers that were
>> encountered.  If
>> >>>>>>> details on the conversations can't be shared, (i.e. we are
>> starting
>> >>>>> from
>> >>>>>>> scratch) it seems like suggesting a new project via SPIP might be
>> the
>> >>>>> way
>> >>>>>>> forward.  I'm happy to help with that if it is useful but I would
>> >>>> guess
>> >>>>>>> Aihua or Tyler might be in a better place to start as it seems
>> they
>> >>>>> have
>> >>>>>>> done more serious thinking here.
>> >>>>>>>>
>> >>>>>>>> If we decide to try to standardize on Parquet or Arrow I'm happy
>> to
>> >>>>>> help
>> >>>>>>> support the effort in those communities.
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Micah
>> >>>>>>>>
>> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
>> >>>>>>> [email protected]> wrote:
>> >>>>>>>>>
>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct
>> and
>> >>>>> off
>> >>>>>>> the dev list. Would you like to make the request on the public
>> Spark
>> >>>>> Dev
>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick
>> email
>> >>>> if
>> >>>>>> you
>> >>>>>>> don't have time.
>> >>>>>>>>>
>> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
>> >>>>>> [email protected]>
>> >>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the
>> >>>>> main
>> >>>>>>> problem is political and not logistic. I've been asking for
>> movement
>> >>>>> from
>> >>>>>>> other relevant projects for a month and we simply haven't gotten
>> >>>>>> anywhere.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I just wanted to double check that these issues were brought
>> >>>>> directly
>> >>>>>>> to the spark community (i.e. a discussion thread on the Spark
>> >>>> developer
>> >>>>>>> mailing list) and not via backchannels.
>> >>>>>>>>>>
>> >>>>>>>>>> I'm not sure the outcome would be different and I don't think
>> >>>> this
>> >>>>>>> should block forking the spec, but we should make sure that the
>> >>>>> decision
>> >>>>>> is
>> >>>>>>> publicly documented within both communities.
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>> Micah
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>> >>>>>>> [email protected]> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> @Gang Wu
>> >>>>>>>>>>>
>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the
>> >>>>> main
>> >>>>>>> problem is political and not logistic. I've been asking for
>> movement
>> >>>>> from
>> >>>>>>> other relevant projects for a month and we simply haven't gotten
>> >>>>>> anywhere.
>> >>>>>>> I don't think there is anything that would stop us from moving to
>> a
>> >>>>> joint
>> >>>>>>> project in the future and if you know of some way of encouraging
>> that
>> >>>>>>> movement from other relevant parties I would be glad to
>> collaborate
>> >>>> in
>> >>>>>>> doing that. One thing that I don't want to do is have the Iceberg
>> >>>>> project
>> >>>>>>> stay in a holding pattern without any clear roadmap as to how to
>> >>>>> proceed.
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <
>> [email protected]
>> >>>>>
>> >>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I’m on board with copying the spec into our repository.
>> >>>> However,
>> >>>>> as
>> >>>>>>> we’ve talked about, it’s not just a straightforward copy—there are
>> >>>>>> already
>> >>>>>>> some divergences. Some of them are under discussion. Iceberg is
>> >>>>>> definitely
>> >>>>>>> the best place for these specs. Engines like Trino and Flink can
>> then
>> >>>>>> rely
>> >>>>>>> on the Iceberg specs as a solid foundation.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Yufei
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <[email protected]>
>> >>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Sorry for chiming in late.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>   From the discussion in
>> >>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq,
>> I
>> >>>>>> don't
>> >>>>>>> quite understand why it is logistically complicated to create a
>> >>>>>> sub-project
>> >>>>>>> to hold the variant spec and impl.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has
>> >>>> some
>> >>>>>>> deficiencies:
>> >>>>>>>>>>>>> - It is a burden to update two repos if there is a variant
>> >>>> type
>> >>>>>>> spec change and will likely result in deviation if some changes do
>> >>>> not
>> >>>>>>> reach agreement from both parties.
>> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs
>> >>>>>>> (considering proprietary engines where both Iceberg and Delta are
>> >>>>>>> supported).
>> >>>>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg repo
>> >>>> does
>> >>>>>>> lose the opportunity for better native support from file formats
>> like
>> >>>>>>> Parquet and ORC.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate project
>> >>>>> (e.g.
>> >>>>>>> apache/variant-type) to make it a single point of truth. We can
>> learn
>> >>>>>> from
>> >>>>>>> the experience of Apache Arrow. In this fashion, different
>> engines,
>> >>>>> table
>> >>>>>>> formats and file formats can follow the same spec and are free to
>> >>>>> depend
>> >>>>>> on
>> >>>>>>> the reference implementations from apache/variant-type or
>> implement
>> >>>>> their
>> >>>>>>> own.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>> Gang
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <
>> [email protected]
>> >>>>>
>> >>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> +1 for copying the spec into our repository, I think we
>> need
>> >>>> to
>> >>>>>>> own it fully as a part of the table spec, and we can build
>> >>>>> compatibility
>> >>>>>>> through tests.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> -Jack
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>> >>>>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating as that
>> >>>> just
>> >>>>>>> makes things more complicated and still is essentially forking
>> just
>> >>>>> with
>> >>>>>>> more steps. If we just track our annotations / modifications  to a
>> >>>>> single
>> >>>>>>> commit/version then we have the same issue again but now you have
>> to
>> >>>> go
>> >>>>>> to
>> >>>>>>> multiple sources to get the actual Spec. In addition, our very
>> copy
>> >>>> of
>> >>>>>> the
>> >>>>>>> Spec is going to require new types which don't exist in the Spark
>> >>>> Spec
>> >>>>>>> which necessarily means diverging. We will need to take up new
>> >>>>> primitive
>> >>>>>>> id's (as noted in my first email)
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is
>> >>>>> really
>> >>>>>>> going through a thorough review process from all members of the
>> Spark
>> >>>>>>> community, I believe it probably should have gone through the SPIP
>> >>>> but
>> >>>>>>> instead seems to have been merged without broad community
>> >>>> involvement.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to only have a
>> >>>> single
>> >>>>>>> copy of the spec, in our previous discussions the vast majority of
>> >>>>> Apache
>> >>>>>>> Iceberg community want it to exist here.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <
>> >>>>> [email protected]
>> >>>>>>>
>> >>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant type
>> >>>> to
>> >>>>>>> Iceberg, but I want to raise concerns about forking the spec.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the
>> situation
>> >>>>>>> where we end up diverging because there's little reason to work
>> with
>> >>>>> both
>> >>>>>>> communities to evolve in a way that benefits everyone.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I would much rather point to a specific version of the
>> spec
>> >>>>> and
>> >>>>>>> annotate any variance in Iceberg's handling.  This would allow us
>> to
>> >>>>>>> continue without dividing the communities.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences, I
>> >>>> would
>> >>>>>>> support forking, but I don't feel like that should be the initial
>> >>>> step.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> No one is excited about the possibility that the physical
>> >>>>>>> representations end up diverging, but it feels like we're setting
>> >>>>>> ourselves
>> >>>>>>> up for that exact scenario.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> -Dan
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <
>> >>>>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy
>> >>>> the
>> >>>>>>> spec to Iceberg and add context that's specific to Iceberg, but at
>> >>>> the
>> >>>>>> same
>> >>>>>>> time, we should maintain compatibility.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Kind regards,
>> >>>>>>>>>>>>>>>>> Fokko
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Op wo 14 aug 2024 om 15:30 schreef Manu Zhang <
>> >>>>>>> [email protected]>:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the
>> best
>> >>>>> way
>> >>>>>>> to keep compatibility is building integration tests.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>> Manu
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
>> >>>>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types and
>> >>>> the
>> >>>>>>> lack of interest from the other project, I think it is reasonable
>> to
>> >>>>>>> duplicate the specification to our repository.
>> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the
>> >>>> Spark
>> >>>>>>> spec as much as possible, to keep compatibility as much as
>> possible.
>> >>>>>> Maybe
>> >>>>>>> even revert to a shared specification if the situation changes.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>> Peter
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Aihua Xu <[email protected]> ezt írta (időpont: 2024.
>> >>>>> aug.
>> >>>>>>> 13., K, 19:52):
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the
>> >>>> Variant
>> >>>>>>> support in Iceberg and hopefully we can have a consensus. To me, I
>> >>>> also
>> >>>>>>> feel it makes more sense to move the spec into Iceberg rather than
>> >>>>> have the Spark
>> >>>>>>> engine own it while we try to stay compatible with the Spark spec.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>>> Aihua
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>> >>>>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Hi Y’all,
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant
>> >>>>> Proposal,
>> >>>>>>> while we were hoping to move the Variant and Shredding
>> specifications
>> >>>>>> from
>> >>>>>>> Spark into Iceberg there doesn’t seem to be a lot of interest in
>> >>>> that.
>> >>>>>>> Unfortunately, I think we have a number of issues with just
>> linking
>> >>>> to
>> >>>>>> the
>> >>>>>>> Spark project directly from within Iceberg and I believe we need
>> to
>> >>>>> copy
>> >>>>>>> the specifications into our repository.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is
>> necessary
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The
>> >>>> Spark
>> >>>>>>> Specification already includes types which Iceberg has no
>> definition
>> >>>>> for
>> >>>>>>> (19, 20 - Interval Types) and Iceberg already has a type which is
>> not
>> >>>>>>> included within the Spark Specification (Time) and will soon have
>> >>>> more
>> >>>>>> with
>> >>>>>>> TimestampNS, and Geo.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is
>> not a
>> >>>>>> hard
>> >>>>>>> dependency for other engines. We are working with several
>> >>>> implementers
>> >>>>> of
>> >>>>>>> the Iceberg spec and it has previously been agreed that it would
>> be
>> >>>>> best
>> >>>>>> if
>> >>>>>>> the source of truth for Variant existed in an engine and file
>> format
>> >>>>>>> neutral location. The Iceberg project has a good open model of
>> >>>>> governance
>> >>>>>>> and, as we have seen so far discussing Variant, open and active
>> >>>>>>> collaboration. This would also help as we can strictly version our
>> >>>>>> changes
>> >>>>>>> in-line with the rest of the Iceberg spec.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and
>> >>>>>>> requires some group analysis and discussion before we commit it. I
>> >>>>> think
>> >>>>>>> again the Iceberg community is probably the right place for this
>> to
>> >>>>>> happen
>> >>>>>>> as we have already started discussions here on these topics.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct
>> >>>>> copy
>> >>>>>>> of the existing specification from the Spark Project and move
>> ahead
>> >>>>> with
>> >>>>>>> our discussions and modifications within Iceberg. That said, I do
>> not
>> >>>>>> want
>> >>>>>>> to diverge if possible from the Spark proposal. For example,
>> although
>> >>>>> we
>> >>>>>> do
>> >>>>>>> not use the Interval types above, I think we should not reuse
>> those
>> >>>>> type
>> >>>>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20 would
>> >>>>> remain
>> >>>>>>> unused along with any other types we think are not applicable. We
>> >>>>> should
>> >>>>>>> strive whenever possible to allow for compatibility.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this
>> proposal I
>> >>>>> am
>> >>>>>>> hoping to see if anyone in the community objects to this plan
>> going
>> >>>>>> forward
>> >>>>>>> or has a better alternative.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager
>> to
>> >>>>> hear
>> >>>>>>> back from everyone,
>> >>>>>>>>>>>>>>>>>>>>> Russ
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
