Re: [DISCUSS] Variant Spec Location

Julien Le Dem Fri, 23 Aug 2024 17:50:58 -0700

Thank you Gang, that sounds like a good idea to me as well.

On Fri, Aug 23, 2024 at 8:47 AM Aihua Xu <[email protected]>
wrote:


> Thanks Gang for initiating the discussion.
>
> On Fri, Aug 23, 2024 at 2:22 AM Gang Wu <[email protected]> wrote:
>
>> Thanks Aihua!
>>
>> I've started the discussion in dev@parquet:
>> https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z
>>
>> Best,
>> Gang
>>
>> On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu <[email protected]> wrote:
>>
>>> From this thread
>>> https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj,
>>> it seems the Spark community is leaning toward moving the spec to Parquet.
>>>
>>> Gang, can you help start a discussion in the Parquet community on
>>> adopting and maintaining the Variant spec?
>>>
>>> On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenlocher <[email protected]>
>>> wrote:
>>>
>>>> This seems to straddle that line, in that you can also view this as a
>>>> way to represent semi-structured data in a manner that allows for more
>>>> efficient querying and computation by breaking out some of its components
>>>> into a more structured form.
>>>>
>>>> (I also happen to want a canonical Arrow representation for variant
>>>> data, as this type occurs in many databases but doesn't have a great
>>>> representation today in ADBC results. That's why I filed [Format]
>>>> Consider adding an official variant type to Arrow · Issue #42069 ·
>>>> apache/arrow (github.com)
>>>> <https://github.com/apache/arrow/issues/42069>. Of course, there's no
>>>> specific reason why a canonical Arrow representation for variants must
>>>> align with Spark and/or Iceberg.)
>>>>
>>>> -Curt
>>>>
>>>> On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> Ah, thanks. I've tried to find a rationale and ended up on
>>>>> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is
>>>>> it
>>>>> a good description of what you're after?
>>>>>
>>>>> If so, then I don't think Arrow is a good match. This seems mostly to
>>>>> be
>>>>> a marshalling format for semi-structured data (like Avro?). Arrow data
>>>>> types are meant to be in a representation ideal for querying and
>>>>> computation, rather than transport and storage.
>>>>>
>>>>> This could be developed separately and then be represented in Arrow
>>>>> using an extension type (perhaps a canonical one as in
>>>>> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>>>>>
>>>>> What do other Arrow developers think?
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>>
>>>>> Le 22/08/2024 à 10:45, Gang Wu a écrit :
>>>>> > Sorry for the inconvenience.
>>>>> >
>>>>> > This is the permalink for the discussion:
>>>>> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
>>>>> >
>>>>> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> >>
>>>>> >> Hi Gang,
>>>>> >>
>>>>> >> Sorry, but can you give a pointer to the start of this discussion
>>>>> thread
>>>>> >> in a readable format (for example a mailing-list archive)? It
>>>>> appears
>>>>> >> that dev@arrow wasn't cc'ed from the start and that can make it
>>>>> >> difficult to understand what this is about.
>>>>> >>
>>>>> >> Regards
>>>>> >>
>>>>> >> Antoine.
>>>>> >>
>>>>> >>
>>>>> >> Le 22/08/2024 à 08:32, Gang Wu a écrit :
>>>>> >>> It seems that we have reached a consensus to some extent that there
>>>>> >>> should be a new home for the variant spec. The pending question
>>>>> >>> is whether Parquet or Arrow is a better choice. As a committer in the
>>>>> >>> Arrow, Parquet and ORC communities, I am neutral on the choice and
>>>>> >>> happy to help with the move once a decision has been made.
>>>>> >>>
>>>>> >>> Should we start a vote to move forward?
>>>>> >>>
>>>>> >>> Best,
>>>>> >>> Gang
>>>>> >>>
>>>>> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <
>>>>> [email protected]>
>>>>> >>> wrote:
>>>>> >>>
>>>>> >>>>>
>>>>> >>>>> That being said, I think the most important consideration for
>>>>> now is
>>>>> >>>> where
>>>>> >>>>> are the current maintainers / contributors to the variant type.
>>>>> If most
>>>>> >>>> of
>>>>> >>>>> them are already PMC members / committers on a project, it
>>>>> becomes a
>>>>> >> bit
>>>>> >>>>> easier. Otherwise if there isn't much overlap with a project's
>>>>> existing
>>>>> >>>>> governance, I worry there could be a bit of friction. How many
>>>>> active
>>>>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> I think this is the key question. What are the requirements around
>>>>> >>>> governance?  I've seen some tangential messaging here but I'm not
>>>>> clear
>>>>> >> on
>>>>> >>>> what everyone expects.
>>>>> >>>>
>>>>> >>>> I think for a lot of the other concerns my view is that the exact
>>>>> >> project
>>>>> >>>> does not really matter (and choosing a project with mature cross
>>>>> >> language
>>>>> >>>> testing infrastructure or committing to building it is critical).
>>>>> IIUC
>>>>> >> we
>>>>> >>>> are talking about following artifacts:
>>>>> >>>>
>>>>> >>>> 1.  A stand alone specification document (this can be hosted
>>>>> anyplace)
>>>>> >>>> 2.  A set of language bindings with minimal dependencies that can be
>>>>> >>>> consumed downstream (again, as long as dependencies are managed
>>>>> >>>> carefully any project can host these)
>>>>> >>>> 3.  Potential integration where appropriate into file format
>>>>> libraries
>>>>> >> to
>>>>> >>>> support shredding (but as of now this is being bypassed by using
>>>>> >>>> conventions anyways).  My impression is that at least for Parquet
>>>>> there
>>>>> >> has
>>>>> >>>> been a proliferation of vectorized readers across different
>>>>> projects, so
>>>>> >>>> I'm not clear how much standardization in parquet-java could help
>>>>> here.
>>>>> >>>>
>>>>> >>>> To respond to some other questions:
>>>>> >>>>
>>>>> >>>> Arrow is not used as Spark's in-memory model, nor Trino and
>>>>> others so
>>>>> >> those
>>>>> >>>>> existing relationships aren't there. I also worry that
>>>>> differences in
>>>>> >>>>> approaches would make it difficult later on.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> While Arrow is not in the core memory model, for Spark I believe
>>>>> it is
>>>>> >>>> still used for IPC for things like Java<->Python. Trino also
>>>>> consumes
>>>>> >> Arrow
>>>>> >>>> libraries today to support things like Snowflake/Bigquery
>>>>> federation.
>>>>> >> But I
>>>>> >>>> think this is minor because as mentioned above I think the
>>>>> functional
>>>>> >>>> libraries would be relatively stand-alone.
>>>>> >>>>
>>>>> >>>> Do we think it could be introduced as a canonical extension arrow
>>>>> type?
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>    I believe it can be, I think there are probably different
>>>>> layouts
>>>>> >> that can
>>>>> >>>> be supported:
>>>>> >>>>
>>>>> >>>> 1.  A struct with two variable width bytes columns (metadata and
>>>>> value
>>>>> >> data
>>>>> >>>> are stored separately and each entry has a 1:1 relationship).
>>>>> >>>> 2.  Shredded (shredded according to the same convention as
>>>>> parquet), I
>>>>> >>>> would need to double check but I don't think Arrow would have
>>>>> problems
>>>>> >> here
>>>>> >>>> but REE would likely be required to make this efficient (i.e.
>>>>> sparse
>>>>> >> value
>>>>> >>>> support is important).
>>>>> >>>>
>>>>> >>>> In both cases the main complexity is providing the necessary
>>>>> functions
>>>>> >> for
>>>>> >>>> manipulation.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> Micah
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <
>>>>> [email protected]>
>>>>> >>>> wrote:
>>>>> >>>>
>>>>> >>>>> In being more engine and format agnostic, I agree the Arrow
>>>>> project
>>>>> >> might
>>>>> >>>>> be a good host for such a specification. It seems like we want
>>>>> to move
>>>>> >>>> away
>>>>> >>>>> from hosting in Spark to make it engine agnostic. But moving into
>>>>> >> Iceberg
>>>>> >>>>> might make it less format agnostic, as I understand multiple
>>>>> formats
>>>>> >>>> might
>>>>> >>>>> want to implement this. I'm not intimately familiar with the
>>>>> state of
>>>>> >>>> this,
>>>>> >>>>> but I believe Delta Lake would like to be aligned with the same
>>>>> format
>>>>> >> as
>>>>> >>>>> Iceberg. In addition, the Lance format (which I work on), will
>>>>> >> eventually
>>>>> >>>>> be interesting as well. It seems equally bad to me to attach this
>>>>> >>>>> specification to a particular table format as it does a
>>>>> particular
>>>>> >> query
>>>>> >>>>> engine.
>>>>> >>>>>
>>>>> >>>>> That being said, I think the most important consideration for
>>>>> now is
>>>>> >>>> where
>>>>> >>>>> are the current maintainers / contributors to the variant type.
>>>>> If most
>>>>> >>>> of
>>>>> >>>>> them are already PMC members / committers on a project, it
>>>>> becomes a
>>>>> >> bit
>>>>> >>>>> easier. Otherwise if there isn't much overlap with a project's
>>>>> existing
>>>>> >>>>> governance, I worry there could be a bit of friction. How many
>>>>> active
>>>>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>>>>> >>>>>
>>>>> >>>>> BTW, I'd add I'm interested in helping develop an Arrow
>>>>> extension type
>>>>> >>>> for
>>>>> >>>>> the binary variant type. I've been experimenting with a
>>>>> DataFusion
>>>>> >>>>> extension that operates on this [1], and already have some ideas
>>>>> on how
>>>>> >>>>> such an extension type might be defined. I'm not yet caught up
>>>>> on the
>>>>> >>>>> shredded specification, but I think having just the binary
>>>>> format would
>>>>> >>>> be
>>>>> >>>>> beneficial for in-memory analytics, which are most relevant to
>>>>> Arrow.
>>>>> >>>> I'll
>>>>> >>>>> be creating a separate thread on the Arrow ML about this soon.
>>>>> >>>>>
>>>>> >>>>> Best,
>>>>> >>>>>
>>>>> >>>>> Will Jones
>>>>> >>>>>
>>>>> >>>>> [1] https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <[email protected]>
>>>>> wrote:
>>>>> >>>>>
>>>>> >>>>>> + dev@arrow
>>>>> >>>>>>
>>>>> >>>>>> Thanks for all the valuable suggestions! I am inclined to
>>>>> Micah's idea
>>>>> >>>>> that
>>>>> >>>>>> Arrow might be a better host compared to Parquet.
>>>>> >>>>>>
>>>>> >>>>>> To give more context, I am taking the initiative to add the
>>>>> geometry
>>>>> >>>> type
>>>>> >>>>>> to both Parquet and ORC. I'd like to do the same thing for
>>>>> variant
>>>>> >> type
>>>>> >>>>> in
>>>>> >>>>>> that variant type is engine and file format agnostic. This does
>>>>> mean
>>>>> >>>> that
>>>>> >>>>>> Parquet might not be the neutral place to hold the variant spec.
>>>>> >>>>>>
>>>>> >>>>>> Best,
>>>>> >>>>>> Gang
>>>>> >>>>>>
>>>>> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <
>>>>> [email protected]>
>>>>> >>>>>> wrote:
>>>>> >>>>>>
>>>>> >>>>>>> Thanks all for your discussion.
>>>>> >>>>>>>
>>>>> >>>>>>> The Apache Paimon community is also considering support for
>>>>> this
>>>>> >>>>>>> Variant type, without a doubt, we hope to maintain consistency
>>>>> with
>>>>> >>>>>>> Iceberg.
>>>>> >>>>>>>
>>>>> >>>>>>> Not only the Paimon community, but also various computing
>>>>> engines
>>>>> >>>> need
>>>>> >>>>>>> to adapt to this type, such as Flink and StarRocks. We also
>>>>> hope to
>>>>> >>>>>>> promote them to adapt to this type.
>>>>> >>>>>>>
>>>>> >>>>>>> It is worth noting that we also need to standardize many
>>>>> functions
>>>>> >>>>>>> related to it.
>>>>> >>>>>>>
>>>>> >>>>>>> A neutral place to maintain it is a great choice.
>>>>> >>>>>>>
>>>>> >>>>>>> - As Gang Wu said, a standalone project is good, just like
>>>>> >>>>> RoaringBitmap
>>>>> >>>>>>> [1].
>>>>> >>>>>>> - As Ryan said, Parquet community is a neutral option too.
>>>>> >>>>>>> - As Micah said, Arrow is also an option too.
>>>>> >>>>>>>
>>>>> >>>>>>> [1] https://github.com/RoaringBitmap
>>>>> >>>>>>>
>>>>> >>>>>>> Best,
>>>>> >>>>>>> Jingsong
>>>>> >>>>>>>
>>>>> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <
>>>>> >>>> [email protected]
>>>>> >>>>>>
>>>>> >>>>>>> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been
>>>>> direct and
>>>>> >>>>> off
>>>>> >>>>>>> the dev list. Would you like to make the request on the public
>>>>> Spark
>>>>> >>>>> Dev
>>>>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick
>>>>> email
>>>>> >>>> if
>>>>> >>>>>> you
>>>>> >>>>>>> don't have time.
>>>>> >>>>>>>>
>>>>> >>>>>>>>
>>>>> >>>>>>>> I think once we come to consensus, if you have bandwidth, I
>>>>> think
>>>>> >>>> the
>>>>> >>>>>>> message might be better coming from you, as you have more
>>>>> context on
>>>>> >>>>> some
>>>>> >>>>>>> of the non-public conversations, the requirements from an
>>>>> Iceberg
>>>>> >>>>>>> perspective on governance and the blockers that were
>>>>> encountered.  If
>>>>> >>>>>>> details on the conversations can't be shared, (i.e. we are
>>>>> starting
>>>>> >>>>> from
>>>>> >>>>>>> scratch) it seems like suggesting a new project via SPIP might
>>>>> be the
>>>>> >>>>> way
>>>>> >>>>>>> forward.  I'm happy to help with that if it is useful but I
>>>>> would
>>>>> >>>> guess
>>>>> >>>>>>> Aihua or Tyler might be in a better place to start as it seems
>>>>> they
>>>>> >>>>> have
>>>>> >>>>>>> done more serious thinking here.
>>>>> >>>>>>>>
>>>>> >>>>>>>> If we decide to try to standardize on Parquet or Arrow I'm
>>>>> happy to
>>>>> >>>>>> help
>>>>> >>>>>>> support the effort in those communities.
>>>>> >>>>>>>>
>>>>> >>>>>>>> Thanks,
>>>>> >>>>>>>> Micah
>>>>> >>>>>>>>
>>>>> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
>>>>> >>>>>>> [email protected]> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been
>>>>> direct and
>>>>> >>>>> off
>>>>> >>>>>>> the dev list. Would you like to make the request on the public
>>>>> Spark
>>>>> >>>>> Dev
>>>>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick
>>>>> email
>>>>> >>>> if
>>>>> >>>>>> you
>>>>> >>>>>>> don't have time.
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
>>>>> >>>>>> [email protected]>
>>>>> >>>>>>> wrote:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project,
>>>>> the
>>>>> >>>>> main
>>>>> >>>>>>> problem is political and not logistic. I've been asking for
>>>>> movement
>>>>> >>>>> from
>>>>> >>>>>>> other relevant projects for a month and we simply haven't
>>>>> gotten
>>>>> >>>>>> anywhere.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> I just wanted to double check that these issues were brought
>>>>> >>>>> directly
>>>>> >>>>>>> to the spark community (i.e. a discussion thread on the Spark
>>>>> >>>> developer
>>>>> >>>>>>> mailing list) and not via backchannels.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> I'm not sure the outcome would be different and I don't
>>>>> think
>>>>> >>>> this
>>>>> >>>>>>> should block forking the spec, but we should make sure that the
>>>>> >>>>> decision
>>>>> >>>>>> is
>>>>> >>>>>>> publicly documented within both communities.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Thanks,
>>>>> >>>>>>>>>> Micah
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>>>>> >>>>>>> [email protected]> wrote:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> @Gang Wu
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project,
>>>>> the
>>>>> >>>>> main
>>>>> >>>>>>> problem is political and not logistic. I've been asking for
>>>>> movement
>>>>> >>>>> from
>>>>> >>>>>>> other relevant projects for a month and we simply haven't
>>>>> gotten
>>>>> >>>>>> anywhere.
>>>>> >>>>>>> I don't think there is anything that would stop us from moving
>>>>> to a
>>>>> >>>>> joint
>>>>> >>>>>>> project in the future and if you know of some way of
>>>>> encouraging that
>>>>> >>>>>>> movement from other relevant parties I would be glad to
>>>>> collaborate
>>>>> >>>> in
>>>>> >>>>>>> doing that. One thing that I don't want to do is have the
>>>>> Iceberg
>>>>> >>>>> project
>>>>> >>>>>>> stay in a holding pattern without any clear roadmap as to how
>>>>> to
>>>>> >>>>> proceed.
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <
>>>>> [email protected]
>>>>> >>>>>
>>>>> >>>>>>> wrote:
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> I’m on board with copying the spec into our repository.
>>>>> >>>> However,
>>>>> >>>>> as
>>>>> >>>>>>> we’ve talked about, it’s not just a straightforward copy—there
>>>>> are
>>>>> >>>>>> already
>>>>> >>>>>>> some divergences. Some of them are under discussion. Iceberg is
>>>>> >>>>>> definitely
>>>>> >>>>>>> the best place for these specs. Engines like Trino and Flink
>>>>> can then
>>>>> >>>>>> rely
>>>>> >>>>>>> on the Iceberg specs as a solid foundation.
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> Yufei
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <[email protected]
>>>>> >
>>>>> >>>>> wrote:
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> Sorry for chiming in late.
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> From the discussion in
>>>>> >>>>>>>>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>>>>> >>>>>>>>>>>>> don't quite understand why it is logistically complicated to create
>>>>> >>>>>>>>>>>>> a sub-project to hold the variant spec and impl.
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg
>>>>> has
>>>>> >>>> some
>>>>> >>>>>>> deficiencies:
>>>>> >>>>>>>>>>>>> - It is a burden to update two repos if there is a
>>>>> variant
>>>>> >>>> type
>>>>> >>>>>>> spec change and will likely result in deviation if some
>>>>> changes do
>>>>> >>>> not
>>>>> >>>>>>> reach agreement from both parties.
>>>>> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs
>>>>> >>>>>>> (considering proprietary engines where both Iceberg and Delta
>>>>> are
>>>>> >>>>>>> supported).
>>>>> >>>>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg
>>>>> repo
>>>>> >>>> does
>>>>> >>>>>>> lose the opportunity for better native support from file
>>>>> formats like
>>>>> >>>>>>> Parquet and ORC.
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate
>>>>> project
>>>>> >>>>> (e.g.
>>>>> >>>>>>> apache/variant-type) to make it a single point of truth. We
>>>>> can learn
>>>>> >>>>>> from
>>>>> >>>>>>> the experience of Apache Arrow. In this fashion, different
>>>>> engines,
>>>>> >>>>> table
>>>>> >>>>>>> formats and file formats can follow the same spec and are free
>>>>> to
>>>>> >>>>> depend
>>>>> >>>>>> on
>>>>> >>>>>>> the reference implementations from apache/variant-type or
>>>>> implement
>>>>> >>>>> their
>>>>> >>>>>>> own.
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> Best,
>>>>> >>>>>>>>>>>>> Gang
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <
>>>>> [email protected]
>>>>> >>>>>
>>>>> >>>>>>> wrote:
>>>>> >>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>> +1 for copying the spec into our repository, I think we
>>>>> need
>>>>> >>>> to
>>>>> >>>>>>> own it fully as a part of the table spec, and we can build
>>>>> >>>>> compatibility
>>>>> >>>>>>> through tests.
>>>>> >>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>> -Jack
>>>>> >>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>>>> >>>>>>> [email protected]> wrote:
>>>>> >>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating as
>>>>> that
>>>>> >>>> just
>>>>> >>>>>>> makes things more complicated and still is essentially forking
>>>>> just
>>>>> >>>>> with
>>>>> >>>>>>> more steps. If we just track our annotations / modifications
>>>>> to a
>>>>> >>>>> single
>>>>> >>>>>>> commit/version then we have the same issue again but now you
>>>>> have to
>>>>> >>>> go
>>>>> >>>>>> to
>>>>> >>>>>>> multiple sources to get the actual Spec. In addition, our very
>>>>> copy
>>>>> >>>> of
>>>>> >>>>>> the
>>>>> >>>>>>> Spec is going to require new types which don't exist in the
>>>>> Spark
>>>>> >>>> Spec
>>>>> >>>>>>> which necessarily means diverging. We will need to take up new
>>>>> >>>>> primitive
>>>>> >>>>>>> id's (as noted in my first email)
>>>>> >>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec
>>>>> is
>>>>> >>>>> really
>>>>> >>>>>>> going through a thorough review process from all members of
>>>>> the Spark
>>>>> >>>>>>> community, I believe it probably should have gone through the
>>>>> SPIP
>>>>> >>>> but
>>>>> >>>>>>> instead seems to have been merged without broad community
>>>>> >>>> involvement.
>>>>> >>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to only have a
>>>>> >>>> single
>>>>> >>>>>>> copy of the spec, in our previous discussions the vast
>>>>> majority of
>>>>> >>>>> Apache
>>>>> >>>>>>> Iceberg community want it to exist here.
>>>>> >>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <
>>>>> >>>>> [email protected]
>>>>> >>>>>>>
>>>>> >>>>>>> wrote:
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant
>>>>> type
>>>>> >>>> to
>>>>> >>>>>>> Iceberg, but I want to raise concerns about forking the spec.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the
>>>>> situation
>>>>> >>>>>>> where we end up diverging because there's little reason to
>>>>> work with
>>>>> >>>>> both
>>>>> >>>>>>> communities to evolve in a way that benefits everyone.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> I would much rather point to a specific version of
>>>>> the spec
>>>>> >>>>> and
>>>>> >>>>>>> annotate any variance in Iceberg's handling.  This would allow
>>>>> us to
>>>>> >>>>>>> continue without dividing the communities.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences,
>>>>> I
>>>>> >>>> would
>>>>> >>>>>>> support forking, but I don't feel like that should be the
>>>>> initial
>>>>> >>>> step.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> No one is excited about the possibility that the
>>>>> physical
>>>>> >>>>>>> representations end up diverging, but it feels like we're
>>>>> setting
>>>>> >>>>>> ourselves
>>>>> >>>>>>> up for that exact scenario.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> -Dan
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <
>>>>> >>>>>>> [email protected]> wrote:
>>>>> >>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to
>>>>> copy
>>>>> >>>> the
>>>>> >>>>>>> spec to Iceberg and add context that's specific to Iceberg,
>>>>> but at
>>>>> >>>> the
>>>>> >>>>>> same
>>>>> >>>>>>> time, we should maintain compatibility.
>>>>> >>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>> Kind regards,
>>>>> >>>>>>>>>>>>>>>>> Fokko
>>>>> >>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>> Op wo 14 aug 2024 om 15:30 schreef Manu Zhang <
>>>>> >>>>>>> [email protected]>:
>>>>> >>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think
>>>>> the best
>>>>> >>>>> way
>>>>> >>>>>>> to keep compatibility is building integration tests.
>>>>> >>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>> Thanks,
>>>>> >>>>>>>>>>>>>>>>>> Manu
>>>>> >>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
>>>>> >>>>>>> [email protected]> wrote:
>>>>> >>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant
>>>>> support!
>>>>> >>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types
>>>>> and
>>>>> >>>> the
>>>>> >>>>>>> lack of interest from the other project, I think it is
>>>>> reasonable to
>>>>> >>>>>>> duplicate the specification to our repository.
>>>>> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to
>>>>> the
>>>>> >>>> Spark
>>>>> >>>>>>> spec as much as possible, to keep compatibility as much as
>>>>> possible.
>>>>> >>>>>> Maybe
>>>>> >>>>>>> even revert to a shared specification if the situation changes.
>>>>> >>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>> Thanks,
>>>>> >>>>>>>>>>>>>>>>>>> Peter
>>>>> >>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>> Aihua Xu <[email protected]> ezt írta (időpont:
>>>>> 2024.
>>>>> >>>>> aug.
>>>>> >>>>>>> 13., K, 19:52):
>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the
>>>>> >>>> Variant
>>>>> >>>>>>> support in Iceberg and hopefully we can have a consensus. To
>>>>> me, I
>>>>> >>>> also
>>>>> >>>>>>> feel it makes more sense to move the spec into Iceberg rather
>>>>> than
>>>>> >>>>> Spark
>>>>> >>>>>>> engine owns it and we try to keep it compatible with Spark
>>>>> spec.
>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>> Thanks,
>>>>> >>>>>>>>>>>>>>>>>>>> Aihua
>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>>>> >>>>>>> [email protected]> wrote:
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Y’all,
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant
>>>>> >>>>> Proposal,
>>>>> >>>>>>> while we were hoping to move the Variant and Shredding
>>>>> specifications
>>>>> >>>>>> from
>>>>> >>>>>>> Spark into Iceberg there doesn’t seem to be a lot of interest
>>>>> in
>>>>> >>>> that.
>>>>> >>>>>>> Unfortunately, I think we have a number of issues with just
>>>>> linking
>>>>> >>>> to
>>>>> >>>>>> the
>>>>> >>>>>>> Spark project directly from within Iceberg and I believe we
>>>>> need to
>>>>> >>>>> copy
>>>>> >>>>>>> the specifications into our repository.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The
>>>>> >>>> Spark
>>>>> >>>>>>> Specification already includes types which Iceberg has no
>>>>> definition
>>>>> >>>>> for
>>>>> >>>>>>> (19, 20 - Interval Types) and Iceberg already has a type which
>>>>> is not
>>>>> >>>>>>> included within the Spark Specification (Time) and will soon
>>>>> have
>>>>> >>>> more
>>>>> >>>>>> with
>>>>> >>>>>>> TimestampNS, and Geo.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> Second, We would like to make sure that Spark is
>>>>> not a
>>>>> >>>>>> hard
>>>>> >>>>>>> dependency for other engines. We are working with several
>>>>> >>>> implementers
>>>>> >>>>> of
>>>>> >>>>>>> the Iceberg spec and it has previously been agreed that it
>>>>> would be
>>>>> >>>>> best
>>>>> >>>>>> if
>>>>> >>>>>>> the source of truth for Variant existed in an engine and file
>>>>> format
>>>>> >>>>>>> neutral location. The Iceberg project has a good open model of
>>>>> >>>>> governance
>>>>> >>>>>>> and, as we have seen so far discussing Variant, open and active
>>>>> >>>>>>> collaboration. This would also help as we can strictly version
>>>>> our
>>>>> >>>>>> changes
>>>>> >>>>>>> in-line with the rest of the Iceberg spec.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> Third, The Shredding spec is not quite finished
>>>>> and
>>>>> >>>>>>> requires some group analysis and discussion before we commit
>>>>> it. I
>>>>> >>>>> think
>>>>> >>>>>>> again the Iceberg community is probably the right place for
>>>>> this to
>>>>> >>>>>> happen
>>>>> >>>>>>> as we have already started discussions here on these topics.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a
>>>>> direct
>>>>> >>>>> copy
>>>>> >>>>>>> of the existing specification from the Spark Project and move
>>>>> ahead
>>>>> >>>>> with
>>>>> >>>>>>> our discussions and modifications within Iceberg. That said, I
>>>>> do not
>>>>> >>>>>> want
>>>>> >>>>>>> to diverge if possible from the Spark proposal. For example,
>>>>> although
>>>>> >>>>> we
>>>>> >>>>>> do
>>>>> >>>>>>> not use the Interval types above, I think we should not reuse
>>>>> those
>>>>> >>>>> type
>>>>> >>>>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20
>>>>> would
>>>>> >>>>> remain
>>>>> >>>>>>> unused along with any other types we think are not applicable.
>>>>> We
>>>>> >>>>> should
>>>>> >>>>>>> strive whenever possible to allow for compatibility.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this
>>>>> proposal I
>>>>> >>>>> am
>>>>> >>>>>>> hoping to see if anyone in the community objects to this plan
>>>>> going
>>>>> >>>>>> forward
>>>>> >>>>>>> or has a better alternative.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am
>>>>> eager to
>>>>> >>>>> hear
>>>>> >>>>>>> back from everyone,
>>>>> >>>>>>>>>>>>>>>>>>>>> Russ
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>>
>>>>
