Fantastic work! I think this is a great direction, and this provides a good base to start iterating.
It makes the most sense to me for the Python bindings (and others) to live in the same repo as iceberg-rust, especially at this early stage.

- Tim O'Guin

-------- Original Message --------
On 8/3/24 12:33 AM, Xuanwo wrote:
> Let's rock! You're welcome to review it:
> https://github.com/apache/iceberg-rust/pull/518
>
> On Sat, Aug 3, 2024, at 12:13, Xuanwo wrote:
>
>> I also support integrating iceberg-rust with pyiceberg rather than building
>> something new on OpenDAL.
>>
>> OpenDAL backed FileIO will be usable in Python once opendalfs[1], the native
>> fsspec support for OpenDAL, is ready. Users can use opendalfs as a FileIO
>> class directly in pure python. It's not an action item for our community to
>> take.
>>
>> The consensus we've reached is that iceberg-rust will be the core of
>> PyIceberg. The main question now is "How?" How can we implement it without
>> disrupting our valued users? This is my top priority.
>>
>> Naming is so hard! Let's refer to the new iceberg-rust based pyiceberg core
>> as `pyiceberg-core` until we decide on a project name.
>>
>> First, we need to establish a workflow that allows us to gradually integrate
>> new features into pyiceberg-core. Additionally, pyiceberg should be able to
>> import and optionally use classes from pyiceberg-core in an additive manner.
>> While developing this workflow, our community will learn how to collaborate,
>> manage releases, and more.
>>
>> We will then incorporate additional Rust-backed features into
>> pyiceberg-core. Eventually, we may make pyiceberg-core our default
>> implementation.
>>
>> My current plan is to implement this pyiceberg-core in the iceberg-rust repo
>> under `bindings/python`.
>>
>> - Iceberg-rust is currently under active development. I plan to release
>>   pyiceberg-core independently of iceberg-rust's release, as they feature
>>   distinct public APIs (and languages!).
>> - Most of the work involves maintaining a few Python stubs and classes, with
>>   the majority related to Rust.
>> - The python integration is just a start: we can expect `bindings/nodejs` to
>>   happen here too.
>>
>> The setup work has already been started. I will update my PR here once it's
>> ready to review.
>>
>> [1]: https://github.com/fsspec/opendalfs
>>
>> On Sat, Aug 3, 2024, at 09:57, Renjie Liu wrote:
>>
>>> Hi:
>>>
>>> I lean towards implementing pyiceberg's FileIO backed by iceberg-rust's
>>> FileIO, rather than directly using OpenDAL. The motivation is that we can
>>> use this as a starting point for providing iceberg-rust backed components
>>> for pyiceberg, and due to its simplicity, it's a good case. I believe there
>>> will be more cases, such as the transforms Sung mentioned in another thread
>>> and the table scan Fokko mentioned.
>>>
>>> If we want to use OpenDAL directly, we don't need iceberg-rust, since
>>> OpenDAL already has a Python binding:
>>> https://opendal.apache.org/docs/python/opendal.html
>>>
>>>> Do you have any experience with this? I see many projects having Rust and
>>>> Python code in a single repository. There are some exceptions like
>>>> Pydantic ([pydantic](https://github.com/pydantic/pydantic),
>>>> [pydantic-core](https://github.com/pydantic/pydantic-core)).
>>>
>>> Well, first I want to say that providing a Python binding for a library
>>> written in Rust is quite a common practice. Just to name a few:
>>> [opendal](https://github.com/apache/opendal),
>>> [polars](https://github.com/pola-rs/polars),
>>> [datafusion](https://github.com/apache/datafusion),
>>> [delta-rs](https://github.com/delta-io/delta-rs). As far as I know, most of
>>> them choose to put the Python binding in the same repo as the Rust code; only
>>> [datafusion-python](https://github.com/apache/datafusion-python) lives in
>>> another repo. I'm not sure about the reason, maybe it's too large?
>>>
>>> I haven't tried to implement one before, but
>>> [pyo3](https://github.com/PyO3) has great documentation, and there are many
>>> existing examples in open source we can learn from.
>>>
>>> On Sat, Aug 3, 2024 at 2:23 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>
>>>> One more thing,
>>>>
>>>>> About this idea, would you have a more detailed design? For example,
>>>>> where should the pyo3 code live, in iceberg-rust or in pyiceberg? What
>>>>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>>>>
>>>> Do you have any experience with this? I see many projects having Rust and
>>>> Python code in a single repository. There are some exceptions like
>>>> Pydantic ([pydantic](https://github.com/pydantic/pydantic),
>>>> [pydantic-core](https://github.com/pydantic/pydantic-core)).
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> On Fri, Aug 2, 2024 at 20:11, Fokko Driesprong <fo...@apache.org> wrote:
>>>>
>>>>> Thanks for driving this Xuanwo,
>>>>>
>>>>> I already suggested this in my talk back at the Spark Summit to see if we
>>>>> can spark some interest, and it is exciting to see this materialize.
>>>>>
>>>>> For the IO abstraction, I think the FileIO is the best option. We already
>>>>> have the
>>>>> [interface](https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/__init__.py#L239)
>>>>> in PyIceberg, and also a
>>>>> [PyArrowFileIO](https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L327).
>>>>> I must admit that the abstraction is less clear in PyIceberg since we
>>>>> rely so much on Arrow for reading/writing data that it is tightly
>>>>> coupled. I would love to see if we can use OpenDAL for reading/writing
>>>>> data, and Iceberg-rust for pushing down the low-level logic. A while ago
>>>>> I did some profiling on the code, and one of the major issues is that
>>>>> Arrow doesn't support proper field-ID projection.
>>>>> Therefore we have to read the Parquet file, and do the schema-evolution
>>>>> and type promotion afterwards
>>>>> [in Python](https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L1444-L1458),
>>>>> which causes a lot of congestion on the GIL.
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>> On Fri, Aug 2, 2024 at 17:46, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> +1 for an OpenDALFileIO
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Fri, Aug 2, 2024 at 8:32 AM Xuanwo <xua...@apache.org> wrote:
>>>>>>
>>>>>>> Hi, Renjie
>>>>>>>
>>>>>>> Thank you for your support. I'll delve into the details and first build
>>>>>>> a PoC PR to make it clear.
>>>>>>>
>>>>>>> On Fri, Aug 2, 2024, at 22:51, Renjie Liu wrote:
>>>>>>>
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> Thanks Xuanwo for raising this.
>>>>>>>>
>>>>>>>> As mentioned in another thread, I think using iceberg-rust in
>>>>>>>> pyiceberg is a good idea.
>>>>>>>>
>>>>>>>> About this idea, would you have a more detailed design? For example,
>>>>>>>> where should the pyo3 code live, in iceberg-rust or in pyiceberg?
>>>>>>>> What kind of interface should we provide to pyiceberg, FileIO or
>>>>>>>> OpenDAL?
>>>>>>>>
>>>>>>>> I think this is a good first step moving forward to make pyiceberg
>>>>>>>> backed by iceberg-rust. In the future we can replace components
>>>>>>>> gradually.
>>>>>>>>
>>>>>>>> On Fri, Aug 2, 2024 at 5:58 PM Xuanwo <xua...@apache.org> wrote:
>>>>>>>>
>>>>>>>>>> Xuanwo, would PyIceberg and iceberg-rust share the underlying
>>>>>>>>>> OpenDAL implementations via pyo3 /
>>>>>>>>>> [fsspec bindings](https://github.com/apache/opendal/issues/4511)?
>>>>>>>>>
>>>>>>>>> Hi, Raschkowski, good question!
>>>>>>>>>
>>>>>>>>> It's possible. There is an ongoing project developing fsspec bindings
>>>>>>>>> for opendal at https://github.com/fsspec/opendalfs. Once complete, we
>>>>>>>>> can directly use opendal through fsspec.
>>>>>>>>>
>>>>>>>>> This work is unrelated to PyIceberg or Iceberg-rust. Ideally, users
>>>>>>>>> should be able to use opendalfs as an alternative implementation of
>>>>>>>>> the fsspec AbstractFileSystem class.
>>>>>>>>>
>>>>>>>>> On Fri, Aug 2, 2024, at 17:44, Will Raschkowski wrote:
>>>>>>>>>
>>>>>>>>>> Xuanwo, would PyIceberg and iceberg-rust share the underlying
>>>>>>>>>> OpenDAL implementations via pyo3 /
>>>>>>>>>> [fsspec bindings](https://github.com/apache/opendal/issues/4511)?
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> From: Joe Stein <crypt...@gmail.com>
>>>>>>>>>> Sent: Thursday, August 1, 2024 3:37 AM
>>>>>>>>>> To: dev@iceberg.apache.org <dev@iceberg.apache.org>
>>>>>>>>>> Subject: Re: [DISCUSS] Use iceberg-rust as pyiceberg file io
>>>>>>>>>>
>>>>>>>>>> Kafka did this with librdkafka and was wildly successful. The
>>>>>>>>>> underlying bindings being in rust are great with a layer for access
>>>>>>>>>> in Python +1
>>>>>>>>>>
>>>>>>>>>> ~ Joe Stein
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 31, 2024 at 10:29 PM Xuanwo <xua...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello everyone
>>>>>>>>>>>
>>>>>>>>>>> I'm starting this thread to discuss the idea of using iceberg-rust
>>>>>>>>>>> as pyiceberg file io.
>>>>>>>>>>>
>>>>>>>>>>> The idea is written up at
>>>>>>>>>>> https://hackmd.io/@xuanwo/iceberg_rust_as_file_io
>>>>>>>>>>>
>>>>>>>>>>> In summary, we can leverage the work from iceberg-rust to help
>>>>>>>>>>> pyiceberg in developing a fast and compact file IO system that
>>>>>>>>>>> benefits users with specific constraints.
>>>>>>>>>>>
>>>>>>>>>>> Everyone is welcome to join the discussion.
>>>>>>>>>>>
>>>>>>>>>>> Xuanwo
>>>>>>>>>>>
>>>>>>>>>>> https://xuanwo.io/
>>>>>>>>>
>>>>>>>>> Xuanwo
>>>>>>>>>
>>>>>>>>> https://xuanwo.io/
>>>>>>>
>>>>>>> Xuanwo
>>>>>>>
>>>>>>> https://xuanwo.io/
>>
>> Xuanwo
>>
>> https://xuanwo.io/
>
> Xuanwo
>
> https://xuanwo.io/