Hi Karuppayya,

Wanted to check: would a regex suffice for this use case (i.e., match /data/*, /metadata/*), so that it stays more general? The idea came from Dan in a one-off chat.

Thanks,
Szehon

On Wed, Feb 26, 2025 at 1:41 PM Pucheng Yang <py...@pinterest.com.invalid> wrote:

> Yes, the Iceberg spec does not define where the data and metadata should be
> located. /data and /metadata are the default paths, but users can override
> this behavior by using a customized location provider or by setting
> write.metadata.path explicitly.
>
> On Wed, Feb 26, 2025 at 1:24 PM karuppayya <karuppayya1...@gmail.com>
> wrote:
>
>> Hello Team,
>>
>> I'm writing to propose a change to the orphan file removal logic in this
>> PR <https://github.com/apache/iceberg/pull/12278>.
>>
>> Currently, the orphan file removal process lists files at the root of the
>> table to figure out which files are orphans.
>> This can lead to unintended consequences in scenarios where multiple
>> tables share a common root directory.
>> Example:
>> *tbl1* -> */dir1/*tbl1
>> *tbl2* -> */dir1*
>> Orphan removal of tbl2 can clean up the tbl1 directory, since the listing
>> happens at *dir1.*
>>
>> I propose modifying the orphan file removal logic to list specifically
>> within the `data` and `metadata` directories of the target table. This
>> would ensure that only files within those directories, and therefore
>> directly associated with the table (in most cases), are considered for
>> removal.
>>
>> Are there any potential drawbacks or edge cases that I haven't considered?
>>
>> *Note:*
>> 1. This does not address scenarios where tables are nested within the
>> `data` or `metadata` directories of another table.
>> Example:
>> *tbl1* -> dir/tbl1
>> *tbl2* -> dir/tbl1/data/tbl2
>> 2. This does not address the case where two tables have the same location.
>> Some related discussions on location ownership are here
>> <https://github.com/apache/iceberg/issues/4159> and here
>> <https://github.com/apache/iceberg/issues/9133>.
>>
>> Eager to hear your feedback here or on the PR. Thank you!
>>
>> - Karuppayya
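
To make the regex idea at the top of this thread concrete, here is a minimal sketch of how a pattern could restrict which listed files are even considered as orphan candidates. It is an illustration only, not the code in the PR and not an existing Iceberg API: the function name `orphan_candidates`, the `DEFAULT_PATTERN` value, and the sample paths are all assumptions made up for this example. The pattern is matched against each path relative to the table root, so with the default it reproduces the "only `data` and `metadata`" behavior, while a table with a custom location provider could pass a different pattern.

```python
import re
from typing import Iterable, List

# Hypothetical default: only files under <table_root>/data/ or
# <table_root>/metadata/ are considered. Not an Iceberg setting.
DEFAULT_PATTERN = r"(data|metadata)/.+"


def orphan_candidates(
    table_root: str,
    listed_files: Iterable[str],
    pattern: str = DEFAULT_PATTERN,
) -> List[str]:
    """Return the listed files whose path relative to table_root matches pattern."""
    root = table_root.rstrip("/") + "/"
    matcher = re.compile(pattern)
    candidates = []
    for path in listed_files:
        if not path.startswith(root):
            continue  # outside this table's root entirely
        relative = path[len(root):]
        # fullmatch anchors the pattern at the table root, so
        # "tbl1/data/..." does not match "(data|metadata)/.+".
        if matcher.fullmatch(relative):
            candidates.append(path)
    return candidates


# The shared-root example from the proposal: tbl2 lives at /dir1,
# tbl1 lives at /dir1/tbl1.
listing = [
    "/dir1/data/part-0.parquet",
    "/dir1/metadata/v1.metadata.json",
    "/dir1/tbl1/data/part-0.parquet",
    "/dir1/tbl1/metadata/v1.metadata.json",
]
tbl2_candidates = orphan_candidates("/dir1", listing)

# The nested case from Note 1: tbl2 lives inside tbl1's data directory.
nested_candidates = orphan_candidates(
    "/dir/tbl1", ["/dir/tbl1/data/tbl2/data/part-0.parquet"]
)
```

With the default pattern, `tbl2_candidates` contains only the two files directly under `/dir1/data/` and `/dir1/metadata/`, so tbl1's files are never considered. `nested_candidates` still contains tbl2's file, which shows that a pattern alone does not solve the nested-table case in Note 1, and it does nothing for two tables sharing the same location (Note 2).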