Hi:

I lean towards implementing pyiceberg's FileIO backed by iceberg-rust's
FileIO, rather than using OpenDAL directly. The motivation is that this can
serve as a starting point for providing iceberg-rust backed components
for pyiceberg, and its simplicity makes it a good first case. I believe there
will be more cases, such as the transforms Sung mentioned in another thread
and the table scan Fokko mentioned.

If we want to use OpenDAL directly, we don't need iceberg-rust, since
OpenDAL already has a Python binding:
https://opendal.apache.org/docs/python/opendal.html
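
To make the first option a bit more concrete, here is a minimal sketch of
what a FileIO that delegates to a native backend could look like on the
Python side. The method names new_input / new_output / delete, open / exists
and create(overwrite) loosely mirror the existing pyiceberg FileIO interface.
Everything else (Backend, InMemoryBackend, BackedFileIO, and so on) is a
hypothetical name I made up for illustration, and the in-memory backend only
stands in for whatever object a pyo3 binding would actually expose.

```python
from __future__ import annotations

import io
from typing import Protocol


class Backend(Protocol):
    """Hypothetical surface a Rust-backed (pyo3) object might expose."""

    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...
    def exists(self, path: str) -> bool: ...
    def delete(self, path: str) -> None: ...


class InMemoryBackend:
    """Stand-in for the native backend so the sketch runs without Rust."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._files[path]

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def exists(self, path: str) -> bool:
        return path in self._files

    def delete(self, path: str) -> None:
        del self._files[path]


class _WriteBuffer(io.BytesIO):
    """Buffers writes in memory and hands them to the backend on close."""

    def __init__(self, backend: Backend, location: str) -> None:
        super().__init__()
        self._backend = backend
        self._location = location

    def close(self) -> None:
        if not self.closed:
            self._backend.write(self._location, self.getvalue())
        super().close()


class BackedInputFile:
    def __init__(self, backend: Backend, location: str) -> None:
        self._backend = backend
        self.location = location

    def exists(self) -> bool:
        return self._backend.exists(self.location)

    def open(self) -> io.BytesIO:
        # Reads the whole object; a real binding would likely stream.
        return io.BytesIO(self._backend.read(self.location))

    def __len__(self) -> int:
        return len(self._backend.read(self.location))


class BackedOutputFile:
    def __init__(self, backend: Backend, location: str) -> None:
        self._backend = backend
        self.location = location

    def create(self, overwrite: bool = False) -> _WriteBuffer:
        if not overwrite and self._backend.exists(self.location):
            raise FileExistsError(self.location)
        return _WriteBuffer(self._backend, self.location)


class BackedFileIO:
    """FileIO-shaped adapter; all storage calls go through the backend."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def new_input(self, location: str) -> BackedInputFile:
        return BackedInputFile(self._backend, location)

    def new_output(self, location: str) -> BackedOutputFile:
        return BackedOutputFile(self._backend, location)

    def delete(self, location: str) -> None:
        self._backend.delete(location)
```

The point of the sketch is only that pyiceberg would keep talking to a
FileIO-shaped object, so swapping the backend for a Rust-backed one later
would not change the calling code.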

> Do you have any experience with this? I see many projects having Rust and
> Python code in a single repository. There are some exceptions like
> Pydantic (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
> <https://github.com/pydantic/pydantic-core>).


Well, first I want to say that providing a Python binding for a library
written in Rust is quite common practice. To name a few: opendal
<https://github.com/apache/opendal>, polars
<https://github.com/pola-rs/polars>, datafusion
<https://github.com/apache/datafusion>, delta-rs
<https://github.com/delta-io/delta-rs>. As far as I know, most of them
keep the Python binding in the same repo as the Rust code. Only
datafusion-python <https://github.com/apache/datafusion-python> lives in
a separate repo. I'm not sure about the reason; maybe it's too large?

I haven't tried to implement one before, but pyo3 <https://github.com/PyO3> has
great documentation, and there are many existing open source examples we
can learn from.

On Sat, Aug 3, 2024 at 2:23 AM Fokko Driesprong <fo...@apache.org> wrote:

> One more thing,
>
>> About this idea, would you have a more detailed design? For example,
>> where should the pyo3 codes live, in iceberg-rust or in pyiceberg? What
>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>
>
> Do you have any experience with this? I see many projects having Rust and
> Python code in a single repository. There are some exceptions like
> Pydantic (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
> <https://github.com/pydantic/pydantic-core>).
>
> Kind regards,
> Fokko
>
>
>
> On Fri, Aug 2, 2024 at 20:11, Fokko Driesprong <fo...@apache.org> wrote:
>
>> Thanks for driving this Xuanwo,
>>
>> I already suggested this in my talk back at the Spark Summit to see if we
>> can spark some interest, and it is exciting to see this materialize.
>>
>> For the IO abstraction, I think the FileIO is the best option. We already
>> have the interface
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/__init__.py#L239>
>> in PyIceberg, and also a PyArrowFileIO
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L327>.
>> I must admit that the abstraction is less clear in PyIceberg since we rely
>> so much on Arrow for reading/writing data that it is tightly coupled. I
>> would love to see if we can use OpenDAL for reading/writing data, and
>> Iceberg-rust for pushing down the low-level logic. A while ago I did some
>> profiling on the code, and one of the major issues is that Arrow doesn't
>> support proper field-ID projection. Therefore we have to read the Parquet file,
>> and do the schema evolution and type promotion afterwards in Python
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L1444-L1458>,
>> which causes a lot of congestion on the GIL.
>>
>> Kind regards,
>> Fokko
>>
>> On Fri, Aug 2, 2024 at 17:46, Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> +1 for an OpenDALFileIO
>>>
>>> -Jack
>>>
>>> On Fri, Aug 2, 2024 at 8:32 AM Xuanwo <xua...@apache.org> wrote:
>>>
>>>> Hi, Renjie
>>>>
>>>> Thank you for your support. I'll delve into the details and first build
>>>> a PoC PR to make it clear.
>>>>
>>>> On Fri, Aug 2, 2024, at 22:51, Renjie Liu wrote:
>>>>
>>>> Hi:
>>>>
>>>> Thanks Xuanwo for raising this.
>>>>
>>>> As mentioned in another thread, I think using iceberg-rust in pyiceberg
>>>> is a good idea.
>>>>
>>>> About this idea, would you have a more detailed design? For example,
>>>> where should the pyo3 codes live, in iceberg-rust or in pyiceberg? What
>>>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>>>>
>>>> I think this is a good first step towards making pyiceberg
>>>> backed by iceberg-rust. In the future we can replace components gradually.
>>>>
>>>> On Fri, Aug 2, 2024 at 5:58 PM Xuanwo <xua...@apache.org> wrote:
>>>>
>>>>
>>>> > Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>>>> implementations via pyo3 / fsspec bindings
>>>> <https://github.com/apache/opendal/issues/4511>?
>>>>
>>>> Hi, Raschkowski, good question!
>>>>
>>>> It's possible. There is an ongoing project developing fsspec bindings
>>>> for opendal at https://github.com/fsspec/opendalfs. Once complete, we
>>>> can directly use opendal through fsspec.
>>>>
>>>> This work is unrelated to PyIceberg or iceberg-rust. Ideally, users
>>>> should be able to use opendalfs as an alternative implementation of the
>>>> fsspec AbstractFileSystem class.
>>>>
>>>> On Fri, Aug 2, 2024, at 17:44, Will Raschkowski wrote:
>>>>
>>>> Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>>>> implementations via pyo3 / fsspec bindings
>>>> <https://github.com/apache/opendal/issues/4511>?
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> *From:* Joe Stein <crypt...@gmail.com>
>>>> *Sent:* Thursday, August 1, 2024 3:37 AM
>>>> *To:* dev@iceberg.apache.org <dev@iceberg.apache.org>
>>>> *Subject:* Re: [DISCUSS] Use iceberg-rust as pyiceberg file io
>>>>
>>>>
>>>> Kafka did this with librdkafka and was wildly successful. Having the
>>>> underlying bindings in Rust with a layer for access in Python is
>>>> great. +1
>>>>
>>>>
>>>> ~ Joe Stein
>>>>
>>>>
>>>> On Wed, Jul 31, 2024 at 10:29 PM Xuanwo <xua...@apache.org> wrote:
>>>>
>>>> Hello everyone
>>>>
>>>> I'm starting this thread to discuss the idea of using iceberg-rust as
>>>> pyiceberg's file IO.
>>>>
>>>> The idea is written up at https://hackmd.io/@xuanwo/iceberg_rust_as_file_io
>>>>
>>>> In summary, we can leverage the work from iceberg-rust to help
>>>> pyiceberg in developing a fast and compact file IO system that benefits
>>>> users with specific constraints.
>>>>
>>>> Welcome to join in the discussion.
>>>>
>>>> Xuanwo
>>>>
>>>> https://xuanwo.io/
>>>>
>>>>
