I also support integrating iceberg-rust with pyiceberg rather than building something new on OpenDAL.
An OpenDAL-backed FileIO will be usable in Python once opendalfs[1], the native fsspec support for OpenDAL, is ready. Users can then use opendalfs as a FileIO class directly in pure Python. It's not an action item for our community to take.

The consensus we've reached is that iceberg-rust will be the core of PyIceberg. The main question now is "How?" How can we implement it without disrupting our valued users? This is my top priority.

*Naming is so hard! Let's refer to the new iceberg-rust based pyiceberg core as `pyiceberg-core` until we decide on a project name.*

First, we need to establish a workflow that allows us to gradually integrate new features into pyiceberg-core. Additionally, pyiceberg should be able to import and optionally use classes from pyiceberg-core in an additive manner. While developing this workflow, our community will learn how to collaborate, manage releases, and more. We will then incorporate additional Rust-backed features into pyiceberg-core. Eventually, we may make pyiceberg-core our default implementation.

My current plan is to implement pyiceberg-core in the iceberg-rust repo under `bindings/python`:

- Iceberg-rust is currently under active development. I plan to release pyiceberg-core independently of iceberg-rust's releases, as they have distinct public APIs (and languages!).
- Most of the work involves maintaining a few Python stubs and classes, with the majority related to Rust.
- The Python integration is just a start: we can expect `bindings/nodejs` to happen here too.

The setup work has already started. I will update my PR here once it's ready for review.

[1]: https://github.com/fsspec/opendalfs

On Sat, Aug 3, 2024, at 09:57, Renjie Liu wrote:
> Hi:
>
> I lean towards implementing pyiceberg's FileIO backed by iceberg-rust's
> FileIO, rather than directly using OpenDAL.
> The motivation is that we can use this as a starting point for providing
> iceberg-rust backed components for pyiceberg, and due to its simplicity,
> it's a good case to start with. I believe there will be more cases, such as
> the transforms Sung mentioned in another thread and the table scan mentioned
> by Fokko.
>
> If we want to use OpenDAL directly, we don't need iceberg-rust, since OpenDAL
> already has a Python binding:
> https://opendal.apache.org/docs/python/opendal.html
>
>> Do you have any experience with this? I see many projects having Rust and
>> Python code in a single repository. There are some exceptions like Pydantic
>> (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
>> <https://github.com/pydantic/pydantic-core>).
>
> Well, first I want to say that providing a Python binding for a library
> written in Rust is quite common practice. Just to name a few: opendal
> <https://github.com/apache/opendal>, polars
> <https://github.com/pola-rs/polars>, datafusion
> <https://github.com/apache/datafusion>, delta-rs
> <https://github.com/delta-io/delta-rs>. As far as I know, most of them put
> the Python binding in the same repo as the Rust code; only datafusion-python
> <https://github.com/apache/datafusion-python> lives in a separate one. I'm
> not sure about the reason; maybe it's too large?
>
> I haven't tried to implement one before, but pyo3 <https://github.com/PyO3>
> has great documentation, and there are many existing open source examples
> we can learn from.
>
> On Sat, Aug 3, 2024 at 2:23 AM Fokko Driesprong <fo...@apache.org> wrote:
>> One more thing,
>>
>>> About this idea, would you have a more detailed design? For example, where
>>> should the pyo3 code live, in iceberg-rust or in pyiceberg? What kind of
>>> interface should we provide to pyiceberg, FileIO or OpenDAL?
>>
>> Do you have any experience with this? I see many projects having Rust and
>> Python code in a single repository.
>> There are some exceptions like Pydantic
>> (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
>> <https://github.com/pydantic/pydantic-core>).
>>
>> Kind regards,
>> Fokko
>>
>> On Fri, Aug 2, 2024 at 20:11, Fokko Driesprong <fo...@apache.org> wrote:
>>> Thanks for driving this, Xuanwo,
>>>
>>> I already suggested this in my talk back at the Spark Summit to see if we
>>> can spark some interest, and it is exciting to see this materialize.
>>>
>>> For the IO abstraction, I think the FileIO is the best option. We already
>>> have the interface
>>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/__init__.py#L239>
>>> in PyIceberg, and also a PyArrowFileIO
>>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L327>.
>>> I must admit that the abstraction is less clear in PyIceberg: we rely so
>>> much on Arrow for reading/writing data that the two are tightly coupled. I
>>> would love to see if we can use OpenDAL for reading/writing data, and
>>> iceberg-rust for pushing down the low-level logic. A while ago I did some
>>> profiling on the code, and one of the major issues is that Arrow doesn't
>>> support proper field-ID projection. Therefore we have to read the Parquet
>>> file, and do the schema evolution and type promotion afterwards in Python
>>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L1444-L1458>,
>>> which causes a lot of contention on the GIL.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Fri, Aug 2, 2024 at 17:46, Jack Ye <yezhao...@gmail.com> wrote:
>>>> +1 for an OpenDALFileIO
>>>>
>>>> -Jack
>>>>
>>>> On Fri, Aug 2, 2024 at 8:32 AM Xuanwo <xua...@apache.org> wrote:
>>>>> Hi, Renjie
>>>>>
>>>>> Thank you for your support. I'll delve into the details and first build a
>>>>> PoC PR to make it clear.
>>>>>
>>>>> On Fri, Aug 2, 2024, at 22:51, Renjie Liu wrote:
>>>>>> Hi:
>>>>>>
>>>>>> Thanks Xuanwo for raising this.
>>>>>>
>>>>>> As mentioned in another thread, I think using iceberg-rust in pyiceberg
>>>>>> is a good idea.
>>>>>>
>>>>>> About this idea, would you have a more detailed design? For example,
>>>>>> where should the pyo3 code live, in iceberg-rust or in pyiceberg? What
>>>>>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>>>>>>
>>>>>> I think this is a good first step towards making pyiceberg backed by
>>>>>> iceberg-rust. In the future we can replace components gradually.
>>>>>>
>>>>>> On Fri, Aug 2, 2024 at 5:58 PM Xuanwo <xua...@apache.org> wrote:
>>>>>>> > Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>>>>>>> > implementations via pyo3 / fsspec bindings
>>>>>>> > <https://github.com/apache/opendal/issues/4511>?
>>>>>>>
>>>>>>> Hi, Raschkowski, good question!
>>>>>>>
>>>>>>> It's possible. There is an ongoing project developing fsspec bindings
>>>>>>> for opendal at https://github.com/fsspec/opendalfs. Once complete, we
>>>>>>> can directly use opendal through fsspec.
>>>>>>>
>>>>>>> This work is unrelated to PyIceberg or iceberg-rust. Ideally, users
>>>>>>> should be able to use opendalfs as an alternative implementation of the
>>>>>>> fsspec AbstractFileSystem class.
>>>>>>>
>>>>>>> On Fri, Aug 2, 2024, at 17:44, Will Raschkowski wrote:
>>>>>>>> Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>>>>>>>> implementations via pyo3 / fsspec bindings
>>>>>>>> <https://github.com/apache/opendal/issues/4511>?
>>>>>>>>
>>>>>>>> *From:* Joe Stein <crypt...@gmail.com>
>>>>>>>> *Sent:* Thursday, August 1, 2024 3:37 AM
>>>>>>>> *To:* dev@iceberg.apache.org <dev@iceberg.apache.org>
>>>>>>>> *Subject:* Re: [DISCUSS] Use iceberg-rust as pyiceberg file io
>>>>>>>>
>>>>>>>> Kafka did this with librdkafka and was wildly successful. Having the
>>>>>>>> underlying bindings in Rust, with a layer for access in Python, is
>>>>>>>> great. +1
>>>>>>>>
>>>>>>>> ~ Joe Stein
>>>>>>>>
>>>>>>>> On Wed, Jul 31, 2024 at 10:29 PM Xuanwo <xua...@apache.org> wrote:
>>>>>>>>> Hello everyone
>>>>>>>>>
>>>>>>>>> I'm starting this thread to discuss the idea of using iceberg-rust as
>>>>>>>>> pyiceberg file io.
>>>>>>>>>
>>>>>>>>> The idea lives at
>>>>>>>>> https://hackmd.io/@xuanwo/iceberg_rust_as_file_io
>>>>>>>>>
>>>>>>>>> In summary, we can leverage the work from iceberg-rust to help
>>>>>>>>> pyiceberg in developing a fast and compact file IO system that
>>>>>>>>> benefits users with specific constraints.
>>>>>>>>>
>>>>>>>>> Everyone is welcome to join the discussion.
>>>>>>>>>
>>>>>>>>> Xuanwo
>>>>>>>>>
>>>>>>>>> https://xuanwo.io/
>>>>>>> Xuanwo
>>>>>>>
>>>>>>> https://xuanwo.io/
>>>>>>>
>>>>> Xuanwo
>>>>>
>>>>> https://xuanwo.io/
>>>>>
Xuanwo

https://xuanwo.io/
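
A rough illustration of the "import and optionally use classes from pyiceberg-core in an additive manner" idea from the top of this message. This is only a sketch under assumptions, not a proposed API: the module name `pyiceberg_core`, its `FileIO` attribute, and the `PurePythonFileIO` stand-in are hypothetical names chosen for the example. The point it shows is that the Rust-backed class is used only when the optional package is installed, so existing users see no change.

```python
import importlib
from types import ModuleType
from typing import Optional


class PurePythonFileIO:
    """Hypothetical stand-in for an existing pure-Python FileIO."""

    def read_bytes(self, location: str) -> bytes:
        # Reads a local file; a real FileIO would also handle object stores.
        with open(location, "rb") as f:
            return f.read()


def try_import(module_name: str) -> Optional[ModuleType]:
    """Return the named module if it is installed, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def select_file_io_class(core_module_name: str = "pyiceberg_core") -> type:
    """Prefer a Rust-backed FileIO when available, else fall back."""
    core = try_import(core_module_name)
    # Assumption: the optional package would expose a class named `FileIO`.
    # If the package is missing or lacks that class, keep the Python one.
    rust_backed = getattr(core, "FileIO", None)
    return rust_backed if rust_backed is not None else PurePythonFileIO
```

Because the fallback is the default path, the optional package can ship on its own release schedule, and new Rust-backed classes can be added one at a time behind the same kind of check.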