Thanks for driving this, Xuanwo. I suggested this in my talk at the Spark Summit to see if it would spark some interest, and it is exciting to see it materialize.
For the IO abstraction, I think the FileIO is the best option. We already have the interface <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/__init__.py#L239> in PyIceberg, and also a PyArrowFileIO <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L327>. I must admit that the abstraction is less clear in PyIceberg, because we rely so heavily on Arrow for reading and writing data that the two are tightly coupled.

I would love to see if we can use OpenDAL for reading/writing data, and iceberg-rust for pushing down the low-level logic. A while ago I did some profiling on the code, and one of the major issues is that Arrow doesn't support proper field-ID projection. Therefore we have to read the Parquet file, and do the schema evolution and type promotion afterwards in Python <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L1444-L1458>, which causes a lot of contention on the GIL.

Kind regards,
Fokko

On Fri, Aug 2, 2024 at 17:46, Jack Ye <yezhao...@gmail.com> wrote:

> +1 for an OpenDALFileIO
>
> -Jack
>
> On Fri, Aug 2, 2024 at 8:32 AM Xuanwo <xua...@apache.org> wrote:
>
>> Hi, renjie
>>
>> Thank you for your support. I'll delve into the details and first build a
>> PoC PR to make it clear.
>>
>> On Fri, Aug 2, 2024, at 22:51, Renjie Liu wrote:
>>
>> Hi:
>>
>> Thanks Xuanwo for raising this.
>>
>> As mentioned in another thread, I think using iceberg-rust in pyiceberg
>> is a good idea.
>>
>> About this idea, would you have a more detailed design? For example,
>> where should the pyo3 code live, in iceberg-rust or in pyiceberg? What
>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>>
>> I think this is a good first step towards making pyiceberg backed by
>> iceberg-rust. In the future we can replace components gradually.
>>
>> On Fri, Aug 2, 2024 at 5:58 PM Xuanwo <xua...@apache.org> wrote:
>>
>> > Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>> > implementations via pyo3 / fsspec bindings
>> > <https://github.com/apache/opendal/issues/4511>?
>>
>> Hi, Raschkowski, good question!
>>
>> It's possible. There is an ongoing project developing fsspec bindings for
>> opendal at https://github.com/fsspec/opendalfs. Once complete, we can
>> directly use opendal through fsspec.
>>
>> This work is unrelated to PyIceberg or iceberg-rust. Ideally, users should
>> be able to use opendalfs as an alternative implementation of the fsspec
>> AbstractFileSystem class.
>>
>> On Fri, Aug 2, 2024, at 17:44, Will Raschkowski wrote:
>>
>> Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>> implementations via pyo3 / fsspec bindings
>> <https://github.com/apache/opendal/issues/4511>?
>>
>> ------------------------------
>>
>> *From:* Joe Stein <crypt...@gmail.com>
>> *Sent:* Thursday, August 1, 2024 3:37 AM
>> *To:* dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Subject:* Re: [DISCUSS] Use iceberg-rust as pyiceberg file io
>>
>> Kafka did this with librdkafka and was wildly successful. The underlying
>> bindings being in rust are great with a layer for access in Python +1
>>
>> ~ Joe Stein
>>
>> On Wed, Jul 31, 2024 at 10:29 PM Xuanwo <xua...@apache.org> wrote:
>>
>> Hello everyone
>>
>> I start this thread to discuss the idea about using iceberg-rust as
>> pyiceberg file io.
>>
>> The idea is living at https://hackmd.io/@xuanwo/iceberg_rust_as_file_io
>>
>> In summary, we can leverage the work from iceberg-rust to help pyiceberg
>> in developing a fast and compact file IO system that benefits users with
>> specific constraints.
>>
>> Welcome to join in the discussion.
>>
>> Xuanwo
>>
>> https://xuanwo.io/
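
P.S. On the field-ID projection point at the top of this message: below is a minimal sketch, in plain Python, of what resolving columns by field ID (rather than by name) involves after a file has been read. It is only an illustration under stated assumptions, not the PyIceberg or Arrow implementation. The `Field` class, the `project_by_field_id` helper, and the sample schemas are all hypothetical, and only the int to long and float to double promotions are handled.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


# Hypothetical, simplified schema field; not a PyIceberg or Arrow class.
@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type: str  # e.g. "int", "long", "float", "double", "string"


# The only type promotions this sketch handles.
ALLOWED_PROMOTIONS = {("int", "long"), ("float", "double")}


def project_by_field_id(
    file_schema: List[Field],
    table_schema: List[Field],
    columns: Dict[str, List[Any]],
) -> Dict[str, List[Any]]:
    """Map columns read under the file's names onto the table's current schema."""
    by_id = {f.field_id: f for f in file_schema}
    num_rows = len(next(iter(columns.values()), []))
    result: Dict[str, List[Any]] = {}
    for want in table_schema:
        have = by_id.get(want.field_id)
        if have is None:
            # Column was added after the file was written: fill with nulls.
            result[want.name] = [None] * num_rows
        elif have.type == want.type:
            # Same field ID and type; only the name may have changed.
            result[want.name] = columns[have.name]
        elif (have.type, want.type) in ALLOWED_PROMOTIONS:
            # Python ints and floats are already wide, so the cast is
            # symbolic here; a real reader would widen the physical type.
            cast = int if want.type == "long" else float
            result[want.name] = [
                None if v is None else cast(v) for v in columns[have.name]
            ]
        else:
            raise TypeError(
                f"cannot promote {have.type} to {want.type} "
                f"for field {want.field_id}"
            )
    return result


# The file was written before "id" was renamed and "note" was added.
file_schema = [Field(1, "id", "int"), Field(2, "amount", "float")]
table_schema = [
    Field(1, "user_id", "long"),
    Field(2, "amount", "double"),
    Field(3, "note", "string"),
]
columns = {"id": [1, 2], "amount": [1.5, 2.5]}
projected = project_by_field_id(file_schema, table_schema, columns)
```

Doing this per-value work in Python is the kind of step that holds the GIL; pushing the same resolution down into the reader would avoid that.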