Re: [DISCUSS] Variant Spec Location

Aihua Xu Fri, 23 Aug 2024 08:46:09 -0700

Thanks Gang for initiating the discussion.

On Fri, Aug 23, 2024 at 2:22 AM Gang Wu <ust...@gmail.com> wrote:


> Thanks Aihua!
>
> I've started the discussion in dev@parquet:
> https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z
>
> Best,
> Gang
>
> On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu <aihua...@snowflake.com> wrote:
>
>> From this thread
>> https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, it seems
>> the Spark community is leaning toward moving to Parquet.
>>
>> Gang, can you help start a discussion in the parquet community on
>> adopting and maintaining the Variant spec?
>>
>> On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenlocher <c...@hagenlocher.org>
>> wrote:
>>
>>> This seems to straddle that line, in that you can also view this as a
>>> way to represent semi-structured data in a manner that allows for more
>>> efficient querying and computation by breaking out some of its components
>>> into a more structured form.
>>>
>>> (I also happen to want a canonical Arrow representation for variant
>>> data, as this type occurs in many databases but doesn't have a great
>>> representation today in ADBC results. That's why I filed apache/arrow
>>> issue #42069, "[Format] Consider adding an official variant type to
>>> Arrow" <https://github.com/apache/arrow/issues/42069>.
>>> Of course, there's no specific reason why a canonical Arrow
>>> representation for variants must align with Spark and/or Iceberg.)
>>>
>>> -Curt
>>>
>>> On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <anto...@python.org>
>>> wrote:
>>>
>>>>
>>>> Ah, thanks. I've tried to find a rationale and ended up on
>>>> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 .
>>>> Is it a good description of what you're after?
>>>>
>>>> If so, then I don't think Arrow is a good match. This seems mostly to
>>>> be a marshalling format for semi-structured data (like Avro?). Arrow
>>>> data types are meant to be in a representation ideal for querying and
>>>> computation, rather than transport and storage.
>>>>
>>>> This could be developed separately and then be represented in Arrow
>>>> using an extension type (perhaps a canonical one as in
>>>> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>>>>
>>>> What do other Arrow developers think?
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> On 22/08/2024 at 10:45, Gang Wu wrote:
>>>> > Sorry for the inconvenience.
>>>> >
>>>> > This is the permalink for the discussion:
>>>> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
>>>> >
>>>> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <anto...@python.org>
>>>> wrote:
>>>> >
>>>> >>
>>>> >> Hi Gang,
>>>> >>
>>>> >> Sorry, but can you give a pointer to the start of this discussion
>>>> >> thread in a readable format (for example a mailing-list archive)? It
>>>> >> appears that dev@arrow wasn't cc'ed from the start and that can make
>>>> >> it difficult to understand what this is about.
>>>> >>
>>>> >> Regards
>>>> >>
>>>> >> Antoine.
>>>> >>
>>>> >>
>>>> >> On 22/08/2024 at 08:32, Gang Wu wrote:
>>>> >>> It seems that we have reached a consensus, to some extent, that there
>>>> >>> should be a new home for the variant spec. The pending question
>>>> >>> is whether Parquet or Arrow is the better choice. As a committer in
>>>> >>> the Arrow, Parquet and ORC communities, I am neutral on the choice
>>>> >>> and happy to help with the move once a decision has been made.
>>>> >>>
>>>> >>> Should we start a vote to move forward?
>>>> >>>
>>>> >>> Best,
>>>> >>> Gang
>>>> >>>
>>>> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <
>>>> emkornfi...@gmail.com>
>>>> >>> wrote:
>>>> >>>
>>>> >>>>>
>>>> >>>>> That being said, I think the most important consideration for now
>>>> >>>>> is where are the current maintainers / contributors to the variant
>>>> >>>>> type. If most of them are already PMC members / committers on a
>>>> >>>>> project, it becomes a bit easier. Otherwise if there isn't much
>>>> >>>>> overlap with a project's existing governance, I worry there could
>>>> >>>>> be a bit of friction. How many active contributors are there from
>>>> >>>>> Iceberg? And how about from Arrow?
>>>> >>>>
>>>> >>>>
>>>> >>>> I think this is the key question. What are the requirements around
>>>> >>>> governance? I've seen some tangential messaging here but I'm not
>>>> >>>> clear on what everyone expects.
>>>> >>>>
>>>> >>>> For a lot of the other concerns, my view is that the exact project
>>>> >>>> does not really matter (but choosing a project with mature
>>>> >>>> cross-language testing infrastructure, or committing to building
>>>> >>>> it, is critical). IIUC we are talking about the following
>>>> >>>> artifacts:
>>>> >>>>
>>>> >>>> 1.  A standalone specification document (this can be hosted
>>>> >>>>     anyplace).
>>>> >>>> 2.  A set of language bindings with minimal dependencies that can
>>>> >>>>     be consumed downstream (again, as long as dependencies are
>>>> >>>>     managed carefully, any project can host these).
>>>> >>>> 3.  Potential integration, where appropriate, into file format
>>>> >>>>     libraries to support shredding (but as of now this is being
>>>> >>>>     bypassed by using conventions anyway). My impression is that,
>>>> >>>>     at least for Parquet, there has been a proliferation of
>>>> >>>>     vectorized readers across different projects, so I'm not clear
>>>> >>>>     how much standardization in parquet-java could help here.
>>>> >>>>
>>>> >>>> To respond to some other questions:
>>>> >>>>
>>>> >>>>> Arrow is not used as Spark's in-memory model, nor Trino and others
>>>> >>>>> so those existing relationships aren't there. I also worry that
>>>> >>>>> differences in approaches would make it difficult later on.
>>>> >>>>
>>>> >>>>
>>>> >>>> While Arrow is not in the core memory model, for Spark I believe it
>>>> >>>> is still used for IPC for things like Java<->Python. Trino also
>>>> >>>> consumes Arrow libraries today to support things like
>>>> >>>> Snowflake/BigQuery federation. But I think this is minor because, as
>>>> >>>> mentioned above, I think the functional libraries would be
>>>> >>>> relatively standalone.
>>>> >>>>
>>>> >>>> Do we think it could be introduced as a canonical extension arrow
>>>> >>>> type?
>>>> >>>>
>>>> >>>>
>>>> >>>> I believe it can be. I think there are probably different layouts
>>>> >>>> that can be supported:
>>>> >>>>
>>>> >>>> 1.  A struct with two variable-width bytes columns (metadata and
>>>> >>>>     value data are stored separately and each entry has a 1:1
>>>> >>>>     relationship).
>>>> >>>> 2.  Shredded (according to the same convention as Parquet). I would
>>>> >>>>     need to double check, but I don't think Arrow would have
>>>> >>>>     problems here; run-end encoding (REE) would likely be required
>>>> >>>>     to make this efficient (i.e. sparse value support is
>>>> >>>>     important).
>>>> >>>>
>>>> >>>> In both cases the main complexity is providing the necessary
>>>> >>>> functions for manipulation.
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Micah
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <
>>>> will.jones...@gmail.com>
>>>> >>>> wrote:
>>>> >>>>
>>>> >>>>> In being more engine and format agnostic, I agree the Arrow
>>>> >>>>> project might be a good host for such a specification. It seems
>>>> >>>>> like we want to move away from hosting in Spark to make it engine
>>>> >>>>> agnostic. But moving into Iceberg might make it less format
>>>> >>>>> agnostic, as I understand multiple formats might want to
>>>> >>>>> implement this. I'm not intimately familiar with the state of
>>>> >>>>> this, but I believe Delta Lake would like to be aligned with the
>>>> >>>>> same format as Iceberg. In addition, the Lance format (which I
>>>> >>>>> work on), will eventually be interesting as well. It seems equally
>>>> >>>>> bad to me to attach this specification to a particular table
>>>> >>>>> format as it does a particular query engine.
>>>> >>>>>
>>>> >>>>> That being said, I think the most important consideration for now
>>>> >>>>> is where are the current maintainers / contributors to the variant
>>>> >>>>> type. If most of them are already PMC members / committers on a
>>>> >>>>> project, it becomes a bit easier. Otherwise if there isn't much
>>>> >>>>> overlap with a project's existing governance, I worry there could
>>>> >>>>> be a bit of friction. How many active contributors are there from
>>>> >>>>> Iceberg? And how about from Arrow?
>>>> >>>>>
>>>> >>>>> BTW, I'd add I'm interested in helping develop an Arrow extension
>>>> >>>>> type for the binary variant type. I've been experimenting with a
>>>> >>>>> DataFusion extension that operates on this [1], and already have
>>>> >>>>> some ideas on how such an extension type might be defined. I'm not
>>>> >>>>> yet caught up on the shredded specification, but I think having
>>>> >>>>> just the binary format would be beneficial for in-memory
>>>> >>>>> analytics, which are most relevant to Arrow. I'll be creating a
>>>> >>>>> separate thread on the Arrow ML about this soon.
>>>> >>>>>
>>>> >>>>> Best,
>>>> >>>>>
>>>> >>>>> Will Jones
>>>> >>>>>
>>>> >>>>> [1]
>>>> >>>>> https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>>> + dev@arrow
>>>> >>>>>>
>>>> >>>>>> Thanks for all the valuable suggestions! I am inclined toward
>>>> >>>>>> Micah's idea that Arrow might be a better host than Parquet.
>>>> >>>>>>
>>>> >>>>>> To give more context, I am taking the initiative to add the
>>>> >>>>>> geometry type to both Parquet and ORC. I'd like to do the same
>>>> >>>>>> thing for the variant type, in that the variant type is engine
>>>> >>>>>> and file format agnostic. This does mean that Parquet might not
>>>> >>>>>> be the neutral place to hold the variant spec.
>>>> >>>>>>
>>>> >>>>>> Best,
>>>> >>>>>> Gang
>>>> >>>>>>
>>>> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <
>>>> jingsongl...@gmail.com>
>>>> >>>>>> wrote:
>>>> >>>>>>
>>>> >>>>>>> Thanks all for your discussion.
>>>> >>>>>>>
>>>> >>>>>>> The Apache Paimon community is also considering support for this
>>>> >>>>>>> Variant type. Without a doubt, we hope to maintain consistency
>>>> >>>>>>> with Iceberg.
>>>> >>>>>>>
>>>> >>>>>>> Not only the Paimon community but also various computing engines,
>>>> >>>>>>> such as Flink and StarRocks, need to adapt to this type. We also
>>>> >>>>>>> hope to encourage them to do so.
>>>> >>>>>>>
>>>> >>>>>>> It is worth noting that we also need to standardize many
>>>> >>>>>>> functions related to it.
>>>> >>>>>>>
>>>> >>>>>>> A neutral place to maintain it is a great choice.
>>>> >>>>>>>
>>>> >>>>>>> - As Gang Wu said, a standalone project is good, just like
>>>> >>>>>>>   RoaringBitmap [1].
>>>> >>>>>>> - As Ryan said, the Parquet community is a neutral option too.
>>>> >>>>>>> - As Micah said, Arrow is also an option.
>>>> >>>>>>>
>>>> >>>>>>> [1] https://github.com/RoaringBitmap
>>>> >>>>>>>
>>>> >>>>>>> Best,
>>>> >>>>>>> Jingsong
>>>> >>>>>>>
>>>> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <
>>>> >>>> emkornfi...@gmail.com
>>>> >>>>>>
>>>> >>>>>>> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct
>>>> >>>>>>>>> and off the dev list. Would you like to make the request on the
>>>> >>>>>>>>> public Spark Dev list? I would be glad to co-sign, I can also
>>>> >>>>>>>>> draft up a quick email if you don't have time.
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> Once we come to consensus, if you have bandwidth, I think the
>>>> >>>>>>>> message might be better coming from you, as you have more
>>>> >>>>>>>> context on some of the non-public conversations, the
>>>> >>>>>>>> requirements from an Iceberg perspective on governance, and the
>>>> >>>>>>>> blockers that were encountered. If details of the conversations
>>>> >>>>>>>> can't be shared (i.e. we are starting from scratch), it seems
>>>> >>>>>>>> like suggesting a new project via SPIP might be the way
>>>> >>>>>>>> forward. I'm happy to help with that if it is useful, but I
>>>> >>>>>>>> would guess Aihua or Tyler might be in a better place to start,
>>>> >>>>>>>> as it seems they have done more serious thinking here.
>>>> >>>>>>>>
>>>> >>>>>>>> If we decide to try to standardize on Parquet or Arrow, I'm
>>>> >>>>>>>> happy to help support the effort in those communities.
>>>> >>>>>>>>
>>>> >>>>>>>> Thanks,
>>>> >>>>>>>> Micah
>>>> >>>>>>>>
>>>> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
>>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct
>>>> >>>>>>>>> and off the dev list. Would you like to make the request on the
>>>> >>>>>>>>> public Spark Dev list? I would be glad to co-sign, I can also
>>>> >>>>>>>>> draft up a quick email if you don't have time.
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
>>>> >>>>>> emkornfi...@gmail.com>
>>>> >>>>>>> wrote:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the
>>>> >>>>>>>>>>> main problem is political and not logistical. I've been asking
>>>> >>>>>>>>>>> for movement from other relevant projects for a month and we
>>>> >>>>>>>>>>> simply haven't gotten anywhere.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> I just wanted to double check that these issues were brought
>>>> >>>>>>>>>> directly to the Spark community (i.e. a discussion thread on
>>>> >>>>>>>>>> the Spark developer mailing list) and not via backchannels.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> I'm not sure the outcome would be different and I don't think
>>>> >>>>>>>>>> this should block forking the spec, but we should make sure
>>>> >>>>>>>>>> that the decision is publicly documented within both
>>>> >>>>>>>>>> communities.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Thanks,
>>>> >>>>>>>>>> Micah
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> @Gang Wu
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the
>>>> >>>>>>>>>>> main problem is political and not logistical. I've been asking
>>>> >>>>>>>>>>> for movement from other relevant projects for a month and we
>>>> >>>>>>>>>>> simply haven't gotten anywhere. I don't think there is anything
>>>> >>>>>>>>>>> that would stop us from moving to a joint project in the future
>>>> >>>>>>>>>>> and if you know of some way of encouraging that movement from
>>>> >>>>>>>>>>> other relevant parties I would be glad to collaborate in doing
>>>> >>>>>>>>>>> that. One thing that I don't want to do is have the Iceberg
>>>> >>>>>>>>>>> project stay in a holding pattern without any clear roadmap as
>>>> >>>>>>>>>>> to how to proceed.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <
>>>> flyrain...@gmail.com
>>>> >>>>>
>>>> >>>>>>> wrote:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> I’m on board with copying the spec into our repository.
>>>> >>>>>>>>>>>> However, as we’ve talked about, it’s not just a
>>>> >>>>>>>>>>>> straightforward copy; there are already some divergences. Some
>>>> >>>>>>>>>>>> of them are under discussion. Iceberg is definitely the best
>>>> >>>>>>>>>>>> place for these specs. Engines like Trino and Flink can then
>>>> >>>>>>>>>>>> rely on the Iceberg specs as a solid foundation.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Yufei
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> Sorry for chiming in late.
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> From the discussion in
>>>> >>>>>>>>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq,
>>>> >>>>>>>>>>>>> I don't quite understand why it is logistically complicated
>>>> >>>>>>>>>>>>> to create a sub-project to hold the variant spec and impl.
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has
>>>> >>>>>>>>>>>>> some deficiencies:
>>>> >>>>>>>>>>>>> - It is a burden to update two repos if there is a variant
>>>> >>>>>>>>>>>>>   type spec change, and it will likely result in deviation
>>>> >>>>>>>>>>>>>   if some changes do not reach agreement from both parties.
>>>> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs
>>>> >>>>>>>>>>>>>   (consider proprietary engines where both Iceberg and
>>>> >>>>>>>>>>>>>   Delta are supported).
>>>> >>>>>>>>>>>>> - Putting the spec and impl of the variant type in the
>>>> >>>>>>>>>>>>>   Iceberg repo loses the opportunity for better native
>>>> >>>>>>>>>>>>>   support from file formats like Parquet and ORC.
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate project
>>>> >>>>>>>>>>>>> (e.g. apache/variant-type) to make it a single point of
>>>> >>>>>>>>>>>>> truth. We can learn from the experience of Apache Arrow. In
>>>> >>>>>>>>>>>>> this fashion, different engines, table formats and file
>>>> >>>>>>>>>>>>> formats can follow the same spec and are free to depend on
>>>> >>>>>>>>>>>>> the reference implementations from apache/variant-type or
>>>> >>>>>>>>>>>>> implement their own.
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> Best,
>>>> >>>>>>>>>>>>> Gang
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <
>>>> yezhao...@gmail.com
>>>> >>>>>
>>>> >>>>>>> wrote:
>>>> >>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>> +1 for copying the spec into our repository. I think we
>>>> >>>>>>>>>>>>>> need to own it fully as a part of the table spec, and we can
>>>> >>>>>>>>>>>>>> build compatibility through tests.
>>>> >>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>> -Jack
>>>> >>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating, as that
>>>> >>>>>>>>>>>>>>> just makes things more complicated and still is essentially
>>>> >>>>>>>>>>>>>>> forking, just with more steps. If we just track our
>>>> >>>>>>>>>>>>>>> annotations / modifications to a single commit/version then
>>>> >>>>>>>>>>>>>>> we have the same issue again, but now you have to go to
>>>> >>>>>>>>>>>>>>> multiple sources to get the actual Spec. In addition, our
>>>> >>>>>>>>>>>>>>> own copy of the Spec is going to require new types which
>>>> >>>>>>>>>>>>>>> don't exist in the Spark Spec, which necessarily means
>>>> >>>>>>>>>>>>>>> diverging. We will need to take up new primitive IDs (as
>>>> >>>>>>>>>>>>>>> noted in my first email).
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> The other issue I have is that I don't think the Spark Spec
>>>> >>>>>>>>>>>>>>> is really going through a thorough review process from all
>>>> >>>>>>>>>>>>>>> members of the Spark community. I believe it probably
>>>> >>>>>>>>>>>>>>> should have gone through the SPIP process but instead seems
>>>> >>>>>>>>>>>>>>> to have been merged without broad community involvement.
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to have only a
>>>> >>>>>>>>>>>>>>> single copy of the spec. In our previous discussions, the
>>>> >>>>>>>>>>>>>>> vast majority of the Apache Iceberg community wanted it to
>>>> >>>>>>>>>>>>>>> exist here.
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <
>>>> >>>>> dwe...@apache.org
>>>> >>>>>>>
>>>> >>>>>>> wrote:
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant type
>>>> >>>>>>>>>>>>>>>> to Iceberg, but I want to raise concerns about forking the
>>>> >>>>>>>>>>>>>>>> spec.
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the situation
>>>> >>>>>>>>>>>>>>>> where we end up diverging because there's little reason to
>>>> >>>>>>>>>>>>>>>> work with both communities to evolve in a way that benefits
>>>> >>>>>>>>>>>>>>>> everyone.
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> I would much rather point to a specific version of the spec
>>>> >>>>>>>>>>>>>>>> and annotate any variance in Iceberg's handling. This would
>>>> >>>>>>>>>>>>>>>> allow us to continue without dividing the communities.
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences, I
>>>> >>>>>>>>>>>>>>>> would support forking, but I don't feel like that should be
>>>> >>>>>>>>>>>>>>>> the initial step.
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> No one is excited about the possibility that the physical
>>>> >>>>>>>>>>>>>>>> representations end up diverging, but it feels like we're
>>>> >>>>>>>>>>>>>>>> setting ourselves up for that exact scenario.
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> -Dan
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <
>>>> >>>>>>> fo...@apache.org> wrote:
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy
>>>> >>>>>>>>>>>>>>>>> the spec to Iceberg and add context that's specific to
>>>> >>>>>>>>>>>>>>>>> Iceberg, but at the same time, we should maintain
>>>> >>>>>>>>>>>>>>>>> compatibility.
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>> Kind regards,
>>>> >>>>>>>>>>>>>>>>> Fokko
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the best
>>>> >>>>>>>>>>>>>>>>>> way to keep compatibility is building integration tests.
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> Thanks,
>>>> >>>>>>>>>>>>>>>>>> Manu
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
>>>> >>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types and the
>>>> >>>>>>>>>>>>>>>>>>> lack of interest from the other project, I think it is
>>>> >>>>>>>>>>>>>>>>>>> reasonable to duplicate the specification to our
>>>> >>>>>>>>>>>>>>>>>>> repository.
>>>> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark
>>>> >>>>>>>>>>>>>>>>>>> spec as much as possible, to keep compatibility as much as
>>>> >>>>>>>>>>>>>>>>>>> possible. Maybe even revert to a shared specification if
>>>> >>>>>>>>>>>>>>>>>>> the situation changes.
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>> Thanks,
>>>> >>>>>>>>>>>>>>>>>>> Peter
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>> Aihua Xu <aihu...@gmail.com> wrote (on Tue, Aug 13, 2024, 19:52):
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>> This is the main blocker to moving forward with Variant
>>>> >>>>>>>>>>>>>>>>>>>> support in Iceberg, and hopefully we can reach a
>>>> >>>>>>>>>>>>>>>>>>>> consensus. I also feel it makes more sense to move the
>>>> >>>>>>>>>>>>>>>>>>>> spec into Iceberg than to have the Spark engine own it
>>>> >>>>>>>>>>>>>>>>>>>> while we try to stay compatible with the Spark spec.
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>> Thanks,
>>>> >>>>>>>>>>>>>>>>>>>> Aihua
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> Hi Y’all,
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal.
>>>> >>>>>>>>>>>>>>>>>>>>> While we were hoping to move the Variant and Shredding
>>>> >>>>>>>>>>>>>>>>>>>>> specifications from Spark into Iceberg, there doesn’t
>>>> >>>>>>>>>>>>>>>>>>>>> seem to be a lot of interest in that. Unfortunately, I
>>>> >>>>>>>>>>>>>>>>>>>>> think we have a number of issues with just linking to the
>>>> >>>>>>>>>>>>>>>>>>>>> Spark project directly from within Iceberg, and I believe
>>>> >>>>>>>>>>>>>>>>>>>>> we need to copy the specifications into our repository.
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The Spark
>>>> >>>>>>>>>>>>>>>>>>>>> Specification already includes types which Iceberg has no
>>>> >>>>>>>>>>>>>>>>>>>>> definition for (19, 20 - Interval Types), and Iceberg
>>>> >>>>>>>>>>>>>>>>>>>>> already has a type which is not included within the Spark
>>>> >>>>>>>>>>>>>>>>>>>>> Specification (Time) and will soon have more with
>>>> >>>>>>>>>>>>>>>>>>>>> TimestampNS and Geo.
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a
>>>> >>>>>>>>>>>>>>>>>>>>> hard dependency for other engines. We are working with
>>>> >>>>>>>>>>>>>>>>>>>>> several implementers of the Iceberg spec and it has
>>>> >>>>>>>>>>>>>>>>>>>>> previously been agreed that it would be best if the
>>>> >>>>>>>>>>>>>>>>>>>>> source of truth for Variant existed in an engine and file
>>>> >>>>>>>>>>>>>>>>>>>>> format neutral location. The Iceberg project has a good
>>>> >>>>>>>>>>>>>>>>>>>>> open model of governance and, as we have seen so far
>>>> >>>>>>>>>>>>>>>>>>>>> discussing Variant, open and active collaboration. This
>>>> >>>>>>>>>>>>>>>>>>>>> would also help as we can strictly version our changes
>>>> >>>>>>>>>>>>>>>>>>>>> in-line with the rest of the Iceberg spec.
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and
>>>> >>>>>>>>>>>>>>>>>>>>> requires some group analysis and discussion before we
>>>> >>>>>>>>>>>>>>>>>>>>> commit it. I think again the Iceberg community is
>>>> >>>>>>>>>>>>>>>>>>>>> probably the right place for this to happen as we have
>>>> >>>>>>>>>>>>>>>>>>>>> already started discussions here on these topics.
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy
>>>> >>>>>>>>>>>>>>>>>>>>> of the existing specification from the Spark Project and
>>>> >>>>>>>>>>>>>>>>>>>>> move ahead with our discussions and modifications within
>>>> >>>>>>>>>>>>>>>>>>>>> Iceberg. That said, I do not want to diverge if possible
>>>> >>>>>>>>>>>>>>>>>>>>> from the Spark proposal. For example, although we do not
>>>> >>>>>>>>>>>>>>>>>>>>> use the Interval types above, I think we should not reuse
>>>> >>>>>>>>>>>>>>>>>>>>> those type ids within our spec. Iceberg's Variant Spec
>>>> >>>>>>>>>>>>>>>>>>>>> types 19 and 20 would remain unused along with any other
>>>> >>>>>>>>>>>>>>>>>>>>> types we think are not applicable. We should strive
>>>> >>>>>>>>>>>>>>>>>>>>> whenever possible to allow for compatibility.
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am
>>>> >>>>>>>>>>>>>>>>>>>>> hoping to see if anyone in the community objects to this
>>>> >>>>>>>>>>>>>>>>>>>>> plan going forward or has a better alternative.
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to
>>>> >>>>>>>>>>>>>>>>>>>>> hear back from everyone,
>>>> >>>>>>>>>>>>>>>>>>>>> Russ
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>
