Hi,

I lean towards implementing pyiceberg's FileIO on top of iceberg-rust's FileIO, rather than using OpenDAL directly. The motivation is that FileIO can serve as a starting point for providing iceberg-rust-backed components to pyiceberg, and its simplicity makes it a good first case. I believe more cases will follow, such as the transforms Sung mentioned in another thread and the table scan Fokko mentioned.
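To make the layering concrete, here is a minimal Python sketch of a FileIO-style interface that delegates every call to a swappable backend. Everything in it is hypothetical: the trimmed-down `FileIO` class is only loosely inspired by PyIceberg's interface and is not its actual API, and `InMemoryBackend` merely stands in for what would, under this proposal, be a pyo3 extension module wrapping iceberg-rust's FileIO.

```python
from abc import ABC, abstractmethod


class FileIO(ABC):
    """Hypothetical, trimmed-down file IO interface (not PyIceberg's real class)."""

    @abstractmethod
    def read(self, location: str) -> bytes: ...

    @abstractmethod
    def write(self, location: str, data: bytes, overwrite: bool = False) -> None: ...

    @abstractmethod
    def delete(self, location: str) -> None: ...


class InMemoryBackend:
    """Stand-in for a native backend.

    Assumption: in the proposal this role would be played by a pyo3
    extension exposing iceberg-rust's FileIO; a dict is used here so the
    sketch runs without any compiled module.
    """

    def __init__(self) -> None:
        self._objects = {}

    def exists(self, path: str) -> bool:
        return path in self._objects

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = bytes(data)

    def remove(self, path: str) -> None:
        self._objects.pop(path, None)


class BackendFileIO(FileIO):
    """FileIO implementation that forwards every operation to a backend."""

    def __init__(self, backend) -> None:
        self._backend = backend

    def read(self, location: str) -> bytes:
        if not self._backend.exists(location):
            raise FileNotFoundError(location)
        return self._backend.read(location)

    def write(self, location: str, data: bytes, overwrite: bool = False) -> None:
        # Refuse to clobber an existing object unless explicitly asked to.
        if not overwrite and self._backend.exists(location):
            raise FileExistsError(location)
        self._backend.write(location, data)

    def delete(self, location: str) -> None:
        self._backend.remove(location)


# Usage: callers only see the FileIO interface, never the backend.
io = BackendFileIO(InMemoryBackend())
io.write("mem://warehouse/metadata/v1.json", b'{"format-version": 2}')
print(io.read("mem://warehouse/metadata/v1.json"))
```

Because callers depend only on the interface, a backend like this could later be swapped for a native one without touching the rest of the Python code, which is the property that makes FileIO a convenient first component to move.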
If we want to use OpenDAL directly, we don't need iceberg-rust at all, since OpenDAL already has a Python binding: https://opendal.apache.org/docs/python/opendal.html

> Do you have any experience with this? I see many projects having Rust and
> Python code in a single repository. There are some exceptions like
> Pydantic (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
> <https://github.com/pydantic/pydantic-core>).

Well, first I want to say that providing a Python binding for a library written in Rust is quite a common practice. Just to name a few: opendal <https://github.com/apache/opendal>, polars <https://github.com/pola-rs/polars>, datafusion <https://github.com/apache/datafusion>, delta-rs <https://github.com/delta-io/delta-rs>. As far as I know, most of them keep the Python binding in the same repo as the Rust code; only datafusion-python <https://github.com/apache/datafusion-python> lives in a separate one. I'm not sure about the reason, maybe it's too large? I haven't implemented one myself, but pyo3 <https://github.com/PyO3> has great documentation, and there are many existing open source examples we can learn from.

On Sat, Aug 3, 2024 at 2:23 AM Fokko Driesprong <fo...@apache.org> wrote:

> One more thing,
>
>> About this idea, would you have a more detailed design? For example,
>> where should the pyo3 codes live, in iceberg-rust or in pyiceberg? What
>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>
> Do you have any experience with this? I see many projects having Rust and
> Python code in a single repository. There are some exceptions like
> Pydantic (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
> <https://github.com/pydantic/pydantic-core>).
>
> Kind regards,
> Fokko
>
> On Fri, Aug 2, 2024 at 20:11, Fokko Driesprong <fo...@apache.org> wrote:
>
>> Thanks for driving this Xuanwo,
>>
>> I already suggested this in my talk back at the Spark Summit to see if we
>> can spark some interest, and it is exciting to see this materialize.
>>
>> For the IO abstraction, I think the FileIO is the best option. We already
>> have the interface
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/__init__.py#L239>
>> in PyIceberg, and also a PyArrowFileIO
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L327>.
>> I must admit that the abstraction is less clear in PyIceberg since we rely
>> so much on Arrow for reading/writing data that it is tightly coupled. I
>> would love to see if we can use OpenDAL for reading/writing data, and
>> iceberg-rust for pushing down the low-level logic. A while ago I did some
>> profiling on the code, and one of the major issues is that Arrow doesn't
>> support proper field-ID projection. Therefore we have to read the Parquet
>> file, and do the schema evolution and type promotion afterwards in Python
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L1444-L1458>,
>> which causes a lot of congestion on the GIL.
>>
>> Kind regards,
>> Fokko
>>
>> On Fri, Aug 2, 2024 at 17:46, Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> +1 for an OpenDALFileIO
>>>
>>> -Jack
>>>
>>> On Fri, Aug 2, 2024 at 8:32 AM Xuanwo <xua...@apache.org> wrote:
>>>
>>>> Hi, renjie
>>>>
>>>> Thank you for your support. I'll delve into the details and first build
>>>> a PoC PR to make it clear.
>>>>
>>>> On Fri, Aug 2, 2024, at 22:51, Renjie Liu wrote:
>>>>
>>>> Hi:
>>>>
>>>> Thanks Xuanwo for raising this.
>>>>
>>>> As mentioned in another thread, I think using iceberg-rust in pyiceberg
>>>> is a good idea.
>>>>
>>>> About this idea, would you have a more detailed design? For example,
>>>> where should the pyo3 codes live, in iceberg-rust or in pyiceberg? What
>>>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>>>>
>>>> I think this is a good first step moving forward to make pyiceberg
>>>> backed by iceberg-rust. In the future we can replace components gradually.
>>>>
>>>> On Fri, Aug 2, 2024 at 5:58 PM Xuanwo <xua...@apache.org> wrote:
>>>>
>>>> > Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>>>> > implementations via pyo3 / fsspec bindings
>>>> > <https://github.com/apache/opendal/issues/4511>?
>>>>
>>>> Hi, Raschkowski, good question!
>>>>
>>>> It's possible. There is an ongoing project developing fsspec bindings
>>>> for opendal at https://github.com/fsspec/opendalfs. Once complete, we
>>>> can directly use opendal through fsspec.
>>>>
>>>> This work is unrelated to PyIceberg or iceberg-rust. Ideally, users
>>>> should be able to use opendalfs as an alternative implementation of the
>>>> fsspec AbstractFileSystem class.
>>>>
>>>> On Fri, Aug 2, 2024, at 17:44, Will Raschkowski wrote:
>>>>
>>>> Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>>>> implementations via pyo3 / fsspec bindings
>>>> <https://github.com/apache/opendal/issues/4511>?
>>>>
>>>> ------------------------------
>>>>
>>>> *From:* Joe Stein <crypt...@gmail.com>
>>>> *Sent:* Thursday, August 1, 2024 3:37 AM
>>>> *To:* dev@iceberg.apache.org <dev@iceberg.apache.org>
>>>> *Subject:* Re: [DISCUSS] Use iceberg-rust as pyiceberg file io
>>>>
>>>> Kafka did this with librdkafka and was wildly successful.
>>>> The underlying bindings being in Rust are great with a layer for access in
>>>> Python +1
>>>>
>>>> ~ Joe Stein
>>>>
>>>> On Wed, Jul 31, 2024 at 10:29 PM Xuanwo <xua...@apache.org> wrote:
>>>>
>>>> Hello everyone
>>>>
>>>> I am starting this thread to discuss the idea of using iceberg-rust as
>>>> pyiceberg's file IO.
>>>>
>>>> The idea lives at https://hackmd.io/@xuanwo/iceberg_rust_as_file_io
>>>>
>>>> In summary, we can leverage the work from iceberg-rust to help
>>>> pyiceberg in developing a fast and compact file IO system that benefits
>>>> users with specific constraints.
>>>>
>>>> Everyone is welcome to join the discussion.
>>>>
>>>> Xuanwo
>>>>
>>>> https://xuanwo.io/