Hi Karuppayya

I wanted to check: would a regex suffice for this use case (i.e., match
/data/*, /metadata/*), while keeping it more general? The idea came from Dan
in a one-off chat.
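
As a rough sketch of the idea (plain Python for brevity, not Iceberg's actual
API; the function name, its parameters, and the file names are hypothetical),
a configurable pattern could restrict which listed files are even considered
as orphan candidates:

```python
import re


def scoped_candidates(listed_paths, table_location, pattern=r"(data|metadata)/.+"):
    """Keep only the listed paths whose location relative to the table root
    matches the pattern. Hypothetical helper, for illustration only."""
    root = table_location.rstrip("/") + "/"
    matcher = re.compile(pattern)
    return [
        path
        for path in listed_paths
        # fullmatch anchors the pattern at the table root, so files under
        # another table's directory (e.g. tbl1/data/...) do not match.
        if path.startswith(root) and matcher.fullmatch(path[len(root):])
    ]


# Listing at /dir1 (tbl2's location) also returns tbl1's files.
listed = [
    "/dir1/data/part-0.parquet",
    "/dir1/metadata/v1.metadata.json",
    "/dir1/tbl1/data/part-0.parquet",
    "/dir1/tbl1/metadata/v1.metadata.json",
]
candidates = scoped_candidates(listed, "/dir1")
```

With the layout from the example quoted below, only tbl2's own files under
/dir1/data and /dir1/metadata stay in the candidate set, and everything under
/dir1/tbl1 is skipped. A pattern like this would not help with the nested
case in note 1 below, since those files still sit under the outer table's
data directory.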

Thanks
Szehon

On Wed, Feb 26, 2025 at 1:41 PM Pucheng Yang <py...@pinterest.com.invalid>
wrote:

> Yes, the Iceberg spec does not define where the data and metadata should be
> located. /data and /metadata are the default paths, but users can override
> this behavior by using a customized location provider or by setting
> write.metadata.path explicitly.
>
> On Wed, Feb 26, 2025 at 1:24 PM karuppayya <karuppayya1...@gmail.com>
> wrote:
>
>> Hello Team,
>>
>> I'm writing to propose a change to the orphan file removal logic in this
>> PR <https://github.com/apache/iceberg/pull/12278>.
>>
>> Currently, the orphan file removal process lists files at the root of the
>> table to figure out which files are orphans.
>> This can lead to unintended consequences in scenarios where multiple
>> tables share a common root directory.
>> Example:
>> *tbl1* -> */dir1/tbl1*
>> *tbl2* -> */dir1*
>> Orphan removal of tbl2 can clean up the tbl1 directory since the listing
>> happens at */dir1*.
>>
>> I propose modifying the orphan file removal logic to list specifically
>> within the `data` and `metadata` directories of the target table. This
>> would ensure that only files within those directories, and therefore
>> directly associated with the table (in most cases), are considered for
>> removal.
>>
>> Are there any potential drawbacks or edge cases that I haven't considered?
>>
>> *Note:*
>> 1. This does not address scenarios where tables are nested within the
>> `data` or `metadata` directories of another table.
>> Example:
>> *tbl1* -> dir/tbl1
>> *tbl2* -> dir/tbl1/data/tbl2
>> 2. This also does not address the case where two tables have the same
>> location. Some discussions related to location ownership are here
>> <https://github.com/apache/iceberg/issues/4159> and here
>> <https://github.com/apache/iceberg/issues/9133>.
>>
>> Eager to hear your feedback here or on the PR. Thank you!
>>
>> - Karuppayya
>>
>>
