Hi André,

Thanks a lot for starting this thread.

List operations on storage services are expensive and slow. That's why Iceberg is designed to store metadata in files and to avoid list operations in FileIO. However, `orphan file removal` and `garbage cleanup` are special tasks that do require scanning the entire storage location and comparing the result with our existing metadata files.

I believe that if there is a way to ensure all engines use list operations correctly (don't abuse list!), it would be beneficial for us to introduce a list-files operation in FileIO. I would prefer to have this in FileIO and eventually expose it through the public API of PyIceberg and iceberg-rust, instead of having users call OpenDAL directly. The public API could be a metadata table or something similar; I haven't given it much thought yet.

FileIO is now a widely shared design across the different language implementations, and we have built a mature mechanism that lets users implement and provide their own FileIO. By adding a new API to FileIO, we can ensure that we are not favoring any specific FileIO implementation.

On Tue, Aug 13, 2024, at 07:01, André Luis Anastácio wrote:
> Thank you Fokko about the context! This blog post helped me a lot!
> I understand that in the Iceberg Java implementation the maintenance
> procedures are just interfaces
> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/actions/DeleteOrphanFiles.java#L34>,
> and the implementation is done on the engine side
> <https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L103>.
> What do you think about this for PyIceberg?
>
>> I was hoping to leverage the metadata tables for that.
>>
> I’m not sure if I understand correctly. Do you mean that the idea would be to
> access the metadata using the metadata tables through the table public API
> instead of reading the metadata files directly?
>
> If I understood correctly, and following what was done in the Java
> implementation, what are your thoughts on having the procedures module using
> only the PyIceberg public API and OpenDAL to handle with filesystem? With
> that, we would have something that is not coupled with the PyIceberg
> internals.
>
> André Anastácio
>
> On Monday, August 12th, 2024 at 5:03 PM, Fokko Driesprong <fo...@apache.org>
> wrote:
>> Hi André,
>>
>> First of all, thanks for raising this. Maintenance routines are a
>> long-awaited functionality in PyIceberg.
>>
>> The FileIO concept <https://iceberg.apache.org/fileio/> is not limited to
>> PyIceberg, but is also present in Java
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/io/FileIO.java>
>> and Iceberg-Rust
>> <https://github.com/apache/iceberg-rust/blob/bbbea9751439dea6afb85f5acf0f3689357cf3de/crates/iceberg/src/io/file_io.rs#L40>.
>> The main focus of FileIO is to provide object-store native operations to
>> the Iceberg client (an excellent blog can be found here
>> <https://tabular.io/blog/iceberg-fileio-cloud-native-tables/>). Based on
>> this, I don't think we want to create a first-class citizen for
>> FileSystem-like operations, because Iceberg is designed to work with object
>> stores native operations.
>>
>> That said, in PyIceberg the abstraction between the engine and the FileIO is
>> not as clear as in other implementations. This is mostly because the
>> ArrowFileIO
>> <https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L328>
>> returns Arrow buffers, and therefore we ended up with a more closely
>> related implementation than desired. It would be good to see if we can
>> untangle that, and I'm sure that once we get OpenDAL or Iceberg-Rust in
>> there, there will be a strong need to do that.
>>
>> Orphan files is quite a resource-intensive operation since it requires
>> listing all the files under the location, and comparing this with all the
>> files in the metadata (I was hoping to leverage the metadata tables for
>> that).
>>
>> Hope this helps!
>>
>> Kind regards,
>> Fokko
>>
>> On Mon, 12 Aug 2024 at 14:38, André Luis Anastácio
>> <ndrl...@proton.me.invalid> wrote:
>>>
>>> Hello everyone,
>>>
>>> I’ve been studying the Java implementation of orphan file removal to
>>> replicate it in PyIceberg. During this process, I noticed a key difference:
>>> in Java, we use the Hadoop Filesystem[1], while in PyIceberg, we use the
>>> Filesystem provided by FileIO[2][3].
>>>
>>> Currently, we support two FileIO implementations: Fsspec and PyArrow.
>>> However, there is a hard requirement to use PyArrow for the reading
>>> process, and when we instantiate the FileSystem, we wrap Fsspec with the
>>> PyArrow interface[4][5].
>>>
>>> Thus, we can say that the default filesystem interface is the PyArrow one.
>>>
>>> In the future, we aim to use the FileIO from rust-iceberg, which leverages
>>> OpenDAL, a tool that doesn’t have wrappers for the Fsspec or Arrow
>>> interfaces.
>>>
>>> For the FileIO context (write/read/delete operations), I believe we are in
>>> good shape. The challenge arises when we need to access the Filesystem
>>> object to handle tasks like listing files.
>>>
>>> With this in mind, I want to open a discussion about how we should
>>> standardize an interface for file listing.
>>>
>>> What should be our default interface for listing files?
>>>
>>> - Create our own definition (e.g., extend FileIO or create a new Filesystem
>>>   interface)
>>> - Use Fsspec
>>> - Use Arrow
>>> - Use OpenDAL
>>> - Other?
>>>
>>> Could we move the implementation for retrieving and wrapping the
>>> Filesystem[4][5] to another location, so it can be reused elsewhere?
>>>
>>> Any other suggestions?
>>>
>>> [1] https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
>>> [2] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/fsspec.py#L350-L354
>>> [3] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L346-L401
>>> [4] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1335-L1349
>>> [5] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1429-L1443
>>>
>>> André Anastácio

Xuanwo
https://xuanwo.io/
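
P.S. To make the idea of listing through FileIO a bit more concrete, here is a rough Python sketch. Everything in it is hypothetical: `FileEntry`, `ListableFileIO`, `list_prefix`, `LocalListableFileIO` and `find_orphan_files` are names made up for illustration and are not part of PyIceberg, iceberg-rust or OpenDAL. The local-filesystem implementation only stands in for a real object store, and the `older_than` cutoff is my assumption about how one might avoid touching files that belong to commits still in progress.

```python
import os
import tempfile
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class FileEntry:
    """One listed file: its location and last-modified time (epoch seconds)."""
    location: str
    modified_at: float


class ListableFileIO(ABC):
    """Hypothetical FileIO extension; NOT an existing PyIceberg interface."""

    @abstractmethod
    def list_prefix(self, prefix: str) -> Iterator[FileEntry]:
        """Lazily yield every file stored under `prefix`."""


class LocalListableFileIO(ListableFileIO):
    """Toy implementation on the local filesystem, for illustration only."""

    def list_prefix(self, prefix: str) -> Iterator[FileEntry]:
        for root, _dirs, files in os.walk(prefix):
            for name in files:
                path = os.path.join(root, name)
                yield FileEntry(path, os.path.getmtime(path))


def find_orphan_files(
    io: ListableFileIO,
    location: str,
    referenced: Iterable[str],
    older_than: float,
) -> List[str]:
    """Return files under `location` that are not in `referenced`.

    Files modified at or after `older_than` are skipped, since they may
    belong to a commit that has not finished yet.
    """
    known = set(referenced)
    return sorted(
        entry.location
        for entry in io.list_prefix(location)
        if entry.location not in known and entry.modified_at < older_than
    )


def demo() -> List[str]:
    """Create three files, reference two of them, and report the orphan."""
    with tempfile.TemporaryDirectory() as table_dir:
        data_dir = os.path.join(table_dir, "data")
        os.makedirs(data_dir)
        paths = {}
        for name in ("a.parquet", "b.parquet", "stale.parquet"):
            paths[name] = os.path.join(data_dir, name)
            with open(paths[name], "w") as f:
                f.write("x")
        # In a real table this set would come from the metadata
        # (for example via metadata tables), not from a hand-built list.
        referenced = [paths["a.parquet"], paths["b.parquet"]]
        cutoff = time.time() + 60  # treat everything written so far as old enough
        orphans = find_orphan_files(
            LocalListableFileIO(), table_dir, referenced, cutoff
        )
        return [os.path.basename(p) for p in orphans]


if __name__ == "__main__":
    print(demo())
```

The orphan-file logic depends only on the abstract listing method, so any user-provided FileIO could plug in without the procedure knowing which backend is used. Listing is kept lazy (an iterator) so a large location is not materialized at once, although the set of referenced files is still held in memory in this sketch.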