Re: [DISCUSS] PyIceberg 0.6.0 release

Daniel Weeks Fri, 26 Jan 2024 13:56:03 -0800

I'm also strongly in favor of getting this release out even with the
limitations as it's still a huge step forward and we can build
incrementally on the write support.


Incredible work everyone, I'm really excited about the progress here.

-Dan

On Fri, Jan 26, 2024 at 11:16 AM Fokko Driesprong <fo...@apache.org> wrote:

> Thanks everyone for the responses and great to see everyone is as excited
> as I am :D
>
> I have some good news. The guys from Eventual have been working on
> integrating PyIceberg into their Daft dataframe
> <https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/data_catalogs.html#apache-iceberg>.
> They are integrating on the scan-tasks level where they leverage their own
> Parquet reader to read in a distributed fashion. Feel free to join the
> #daft channel on the Iceberg Slack
> <https://iceberg.apache.org/community/#slack> if you're interested in
> this. We're in the process of making sure that all the Iceberg features
> work well (schema and partition evolution, projection, etc.). Query
> planning is done in PyIceberg in a single process (we do use
> multi-threading), and we're profiling the PyIceberg code to identify
> bottlenecks so that it scales to 1M+ partitions.
>
> Similar to the read path, we're designing the write API in such a way
> that it can also be distributed.
>
> As I mentioned, I created issues
> <https://github.com/apache/iceberg-python/issues> around the gaps. There
> is a good discussion going on around the partitioned writes
> <https://github.com/apache/iceberg-python/issues/208>, and writing using
> a sort order <https://github.com/apache/iceberg-python/issues/271> is
> still up for grabs.
>
> Kind regards,
> Fokko
>
> On Fri, 26 Jan 2024 at 19:45, Ryan Blue <b...@tabular.io> wrote:
>
>> Like the Java implementation, we've been building toward a library that
>> can be used in distributed applications as well as directly on a single
>> node. For example, job planning can produce a set of file scan tasks or a
>> scan can be pushed to duckdb (to_duckdb) or pandas (to_pandas). The write
>> side is similar where we have methods that accept Arrow dataframes and
>> write files and an API for committing those files to a table. The write
>> side isn't as well developed yet (no support for partitions, for example),
>> but the basics are there and we would love to work with Ray and other
>> communities to add native Iceberg support!
>>
>> On Fri, Jan 26, 2024 at 10:40 AM Pucheng Yang <py...@pinterest.com.invalid>
>> wrote:
>>
>>> I have similar questions to Yufei's. My organization is interested in a
>>> Ray-Iceberg integration, and from our conversations with the Ray team, we
>>> know they would like to have an Iceberg integration as well. I think this
>>> is a good opportunity for both projects to collaborate.
>>>
>>> On Fri, Jan 26, 2024 at 10:32 AM Sung Yun <sy...@cornell.edu> wrote:
>>>
>>>> It’s so exciting to see the project take another step forward, Fokko!
>>>>
>>>> Really great job to everyone involved.
>>>>
>>>> Best,
>>>> Sung
>>>>
>>>> On Jan 26, 2024, at 11:48 AM, Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>> 
>>>> It's great to see all the progress in PyIceberg. Thanks to everyone
>>>> that's been contributing!
>>>>
>>>> I'm all for getting a release out as soon as possible and following up
>>>> with more features in the write path in 0.7.0.
>>>>
>>>> On Fri, Jan 26, 2024 at 5:22 AM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I want to discuss the 0.6.0 release that will bring a lot of
>>>>> functionality to the public:
>>>>>
>>>>>    - Write support for writing to unpartitioned tables
>>>>>       - Includes snapshot generation
>>>>>       - Constructing Avro writer trees
>>>>>    - Support for writing metadata, which enables commit support for the
>>>>>    Hive, SQL, and Glue catalogs.
>>>>>    - Support for name-mapping
>>>>>    - Easy evolution of schema using the union_by_name method
>>>>>    - And a lot of bug fixes and improvements
>>>>>
>>>>> The write support is still limited; for example, partitioned writes and
>>>>> tables with sort orders are not supported. Also, as Ryan mentioned during
>>>>> the last community sync, we're doing fast appends by default, and we're
>>>>> unable to compact yet. I've created issues on GitHub
>>>>> <https://github.com/apache/iceberg-python/issues> to track all these
>>>>> limitations. However, I think it is good to get the current work out to
>>>>> the public so they can try it and we can uncover any impediments as soon
>>>>> as possible. And we can follow up with 0.7.0.
>>>>>
>>>>> Kind regards,
>>>>> Fokko Driesprong
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
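
The scan-task level integration described in the thread (plan in a single PyIceberg process, then read the files elsewhere) could look roughly like the sketch below. It relies on `table.scan().plan_files()`, which yields one file scan task per data file. The helper names `load_example_table` and `assign_scan_tasks`, the catalog name `default`, the table `db.events`, and the round-robin assignment are all assumptions made for illustration; engines such as Daft or Ray would apply their own scheduling, and this sketch ignores any delete files attached to a task.

```python
from itertools import cycle
from typing import Any, Dict, List


def load_example_table(catalog_name: str = "default", identifier: str = "db.events") -> Any:
    """Load a table through PyIceberg. Catalog and table names are hypothetical."""
    # Imported lazily so the rest of this sketch runs without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog(catalog_name).load_table(identifier)


def assign_scan_tasks(table: Any, num_workers: int) -> Dict[int, List[str]]:
    """Plan the scan in one process, then deal data file paths out round-robin."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    assignments: Dict[int, List[str]] = {worker: [] for worker in range(num_workers)}
    workers = cycle(range(num_workers))

    # plan_files() yields file scan tasks; each task points at one data file.
    for task in table.scan().plan_files():
        assignments[next(workers)].append(task.file.file_path)

    return assignments
```

With a catalog configured, `assign_scan_tasks(load_example_table(), num_workers=4)` would return a mapping from worker index to data file paths, which each worker could then read with its own Parquet reader.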

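On the write side, the 0.6.0 features listed in the thread (appending to unpartitioned tables and schema evolution via `union_by_name`) might be combined as in the following sketch. The helper name `evolve_and_append` is hypothetical, and the sketch assumes that `df` is a PyArrow table and that `union_by_name` accepts its Arrow schema directly; if it does not in your version, convert the schema to an Iceberg schema first.

```python
from typing import Any


def evolve_and_append(table: Any, df: Any) -> None:
    """Add columns that are new in `df` to the table schema, then append `df`."""
    # update_schema() is used as a context manager; the change commits on exit.
    with table.update_schema() as update:
        update.union_by_name(df.schema)

    # Per the thread, 0.6.0 supports this for unpartitioned tables only.
    table.append(df)
```

Because each call to `append` produces a new snapshot and compaction is not available yet, batching rows into fewer, larger appends is probably the safer pattern for now.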