I'm also strongly in favor of getting this release out even with the limitations as it's still a huge step forward and we can build incrementally on the write support.
Incredible work everyone, I'm really excited about the progress here.

-Dan

On Fri, Jan 26, 2024 at 11:16 AM Fokko Driesprong <fo...@apache.org> wrote:

> Thanks everyone for the responses and great to see everyone is as excited
> as I am :D
>
> I have some good news. The guys from Eventual have been working on
> integrating PyIceberg into their Daft dataframe
> <https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/data_catalogs.html#apache-iceberg>.
> They are integrating at the scan-task level, where they leverage their own
> Parquet reader to read in a distributed fashion. Feel free to join the
> #daft channel on the Iceberg Slack
> <https://iceberg.apache.org/community/#slack> if you're interested in
> this. We're in the process of making sure that all the Iceberg features
> work well (schema and partition evolution, projection, etc.). The query
> planning is done in PyIceberg in a single process (we do use
> multi-threading), and we're profiling the PyIceberg code to identify
> bottlenecks so that it scales to at least 1M partitions.
>
> Similar to the read path, for writing, we're designing the API in such a
> way that it can also be distributed.
>
> As I mentioned, I created issues
> <https://github.com/apache/iceberg-python/issues> around the gaps. There
> is a good discussion going on around partitioned writes
> <https://github.com/apache/iceberg-python/issues/208>, and writing using
> a sort order <https://github.com/apache/iceberg-python/issues/271> is
> still up for grabs.
>
> Kind regards,
> Fokko
>
> On Fri, Jan 26, 2024 at 19:45, Ryan Blue <b...@tabular.io> wrote:
>
>> Like the Java implementation, we've been building toward a library that
>> can be used in distributed applications as well as directly on a single
>> node. For example, job planning can produce a set of file scan tasks, or
>> a scan can be pushed to duckdb (to_duckdb) or pandas (to_pandas). The
>> write side is similar: we have methods that accept Arrow dataframes and
>> write files, and an API for committing those files to a table. The write
>> side isn't as well developed yet (no support for partitions, for
>> example), but the basics are there and we would love to work with Ray
>> and other communities to add native Iceberg support!
>>
>> On Fri, Jan 26, 2024 at 10:40 AM Pucheng Yang <py...@pinterest.com.invalid>
>> wrote:
>>
>>> I have questions similar to Yufei's. My organization is interested in a
>>> Ray Iceberg integration, and in conversation with the Ray team we
>>> learned that they would also like to have an Iceberg integration. I
>>> think this is a good opportunity for both projects to collaborate.
>>>
>>> On Fri, Jan 26, 2024 at 10:32 AM Sung Yun <sy...@cornell.edu> wrote:
>>>
>>>> It's so exciting to see the project take another step forward, Fokko!
>>>>
>>>> Really great job to everyone involved.
>>>>
>>>> Best,
>>>> Sung
>>>>
>>>> On Jan 26, 2024, at 11:48 AM, Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>> It's great to see all the progress in PyIceberg. Thanks to everyone
>>>> that's been contributing!
>>>>
>>>> I'm all for getting a release out as soon as possible and following up
>>>> with more features in the write path in 0.7.0.
>>>>
>>>> On Fri, Jan 26, 2024 at 5:22 AM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I want to discuss the 0.6.0 release that will bring a lot of
>>>>> functionality to the public:
>>>>>
>>>>> - Write support for writing to unpartitioned tables
>>>>> - Includes snapshot generation
>>>>> - Constructing Avro writer trees
>>>>> - Support for writing metadata, which enables commit support for the
>>>>> Hive, SQL, and Glue catalogs.
>>>>> - Support for name-mapping
>>>>> - Easy evolution of schema using the union_by_name method
>>>>> - And a lot of bug fixes and improvements
>>>>>
>>>>> The write support is still limited; for example, partitioned writes
>>>>> and tables with sort orders are not supported. Also, as Ryan mentioned
>>>>> during the last community sync, we're doing fast appends by default,
>>>>> and we're unable to compact yet. I've created issues on GitHub
>>>>> <https://github.com/apache/iceberg-python/issues> to track all these
>>>>> limitations. However, I think it is good to get the current work out
>>>>> to the public so they can try it and we can uncover any impediments as
>>>>> soon as possible. And we can follow up with 0.7.0.
>>>>>
>>>>> Kind regards,
>>>>> Fokko Driesprong
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>