Re: [DISCUSS] Filesystem in PyIceberg

2024-08-13 Thread André Luis Anastácio
I believe that now I understand how to leverage the metadata tables to deal with removing orphan files. I didn't know that the DELETE_FILES metadata table existed, so I believe this is what Fokko meant. Fokko, was your idea to use the DELETE_FILES and ALL_FILES metadata tables? Do you know why

Re: [DISCUSS] Filesystem in PyIceberg

2024-08-13 Thread Steve Loughran
On Tue, 13 Aug 2024 at 03:50, Xuanwo wrote:
> Hi, André
>
> Thanks a lot for starting this thread.
>
> List operations on storage services are expensive and slow. That's why
> Iceberg is designed to store metadata in files and avoid using list
> operations in FileIO. However, `orphan file removal

Re: [DISCUSS] Filesystem in PyIceberg

2024-08-12 Thread Xuanwo
Hi, André

Thanks a lot for starting this thread.

List operations on storage services are expensive and slow. That's why Iceberg is designed to store metadata in files and avoid using list operations in FileIO. However, `orphan file removal` or `garbage cleanup` are special tasks that do requir
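(Editor's illustration, not part of the thread.) A minimal sketch of why orphan file removal differs from normal reads: it generally compares what storage actually contains against what the table metadata references. The function name `find_orphan_files`, the example paths, and the three-day threshold are all hypothetical, and no PyIceberg API is used. In practice the `listed` mapping would come from a storage listing and the `referenced` set from the table's metadata.

```python
from datetime import datetime, timedelta, timezone


def find_orphan_files(listed, referenced, older_than):
    """Return paths present in storage but not referenced by table metadata.

    listed:     dict mapping file path -> last-modified time (from a storage listing)
    referenced: set of file paths reachable from the table metadata
    older_than: only files modified before this time are candidates, so that
                files belonging to in-progress writes are left alone
    """
    return sorted(
        path
        for path, modified in listed.items()
        if path not in referenced and modified < older_than
    )


# Hypothetical data standing in for a real listing and real metadata.
now = datetime(2024, 8, 13, tzinfo=timezone.utc)
listed = {
    "s3://bucket/tbl/data/a.parquet": now - timedelta(days=10),
    "s3://bucket/tbl/data/b.parquet": now - timedelta(days=10),
    "s3://bucket/tbl/data/c.parquet": now - timedelta(hours=1),
}
referenced = {"s3://bucket/tbl/data/a.parquet"}

orphans = find_orphan_files(listed, referenced, older_than=now - timedelta(days=3))
print(orphans)  # ['s3://bucket/tbl/data/b.parquet']
```

The age threshold is a design choice in this sketch: `c.parquet` is unreferenced but recent, so it is skipped in case a writer has not committed yet.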

Re: [DISCUSS] Filesystem in PyIceberg

2024-08-12 Thread André Luis Anastácio
Thank you, Fokko, for the context! This blog post helped me a lot! I understand that in the Iceberg Java implementation the maintenance procedures are just [interfaces](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/actions/DeleteOrphanFiles.java#L34), and the

Re: [DISCUSS] Filesystem in PyIceberg

2024-08-12 Thread Fokko Driesprong
Hi André, First of all, thanks for raising this. Maintenance routines are a long-awaited functionality in PyIceberg. The FileIO concept is not limited to PyIceberg, but is also present in Java