Hi,

If you come from a storage perspective, the current design of 'not
owning' the location makes sense.

However, if you come from a SQL perspective, this is an impractical
limitation. Analysts and other SQL users want to be able to delete their
data and must have confidence that all of it is removed.
Failing to do so may expose them to GDPR-related liabilities.

Therefore we should work towards (2). For starters, we should be able to
assume that tables with an implicit location own their location.
Then we should have an option to validate location ownership for tables
with an explicit location.
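
To make the two cases concrete, here is a rough sketch of what such a check
could look like. It is not Iceberg code: `TableInfo`, `owns_location` and the
"no overlap with other known table locations" rule are assumptions made up for
illustration, and a real validation would probably need catalog support.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TableInfo:
    # Hypothetical model for illustration only, not an Iceberg class.
    location: str                # table root, e.g. "s3://bucket/db/t1"
    location_is_implicit: bool   # True if the catalog picked the location
    referenced_files: List[str]  # files reachable from any metadata file


def _under(prefix: str, path: str) -> bool:
    # Compare against "prefix/" so "s3://b/t1" does not match "s3://b/t10".
    return path.startswith(prefix.rstrip("/") + "/")


def owns_location(table: TableInfo, other_locations: Iterable[str]) -> bool:
    """Return True if a recursive delete of table.location looks safe."""
    # Case 1: implicit location, assumed to be owned by the table.
    if table.location_is_implicit:
        return True

    # Case 2: explicit location, validated before trusting it.
    root = table.location.rstrip("/")
    for other in other_locations:
        other = other.rstrip("/")
        # Any overlap with another table's location means shared ownership.
        if other == root or _under(root, other) or _under(other, root):
            return False

    # Files outside the root would survive a recursive delete of the root.
    return all(_under(root, f) for f in table.referenced_files)
```

With this, a table at `s3://b/db/t1` is treated as owning its location when
the only other table lives at `s3://b/db/t2`, but not when another table lives
at `s3://b/db/t1/nested` or when one of its data files sits in another bucket.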

I don't know yet how tables with multiple locations fit into this picture,
or tables with manifests in one place and data files in other places.
SQL users wouldn't create such tables though.

BR
PF





On Tue, Nov 23, 2021 at 4:32 AM Jack Ye <yezhao...@gmail.com> wrote:

> +1 for item 1, the fact that we do not remove all data referenced by all
> metadata files seems like a bug to me that should be fixed. The table's
> pointer is already removed in the catalog with no way to rollback, so there
> is no reason for keeping those files around. I don't know if there is any
> historical context for us to only remove data in the latest metadata, maybe
> someone with context could provide more details.
>
> For item 2, I think this aligns with the discussion conclusions in the
> linked issues. At least we can have some flag saying the table location,
> data location and metadata location do not have other table data (a.k.a.
> the table owns those 3 prefixes), and then we can safely do a recursive
> deletion. This also seems to align with the intention for having a LIST API
> in FileIO discussed in https://github.com/apache/iceberg/issues/3212 for
> remove_orphan_files.
>
> -Jack
>
>
> On Mon, Nov 22, 2021 at 6:42 PM Yan Yan <yyany...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> Does anyone know why, across catalog implementations, when we drop tables
>> with *purge=true* we only drop the last metadata file and the files it
>> refers to, but not any of the previous metadata? e.g.
>>
>> *create iceberg table1*; <--- metadata.json-1
>> *insert into table1* ...; <--- metadata.json-2
>>
>> when I do *drop table1* after these two commands, `metadata.json-1` will
>> not be deleted. This also means that if we rollback/compact the table and
>> then drop it, data files referred to by some of the previous metadata files
>> will also not be deleted.
>>
>> I know the community used to talk about table location ownership for file
>> cleanup after dropping table (e.g.
>> https://github.com/apache/iceberg/issues/1764
>> https://github.com/trinodb/trino/issues/5616 ) but I'm not sure if they
>> could completely solve the problem since we can customize metadata/data
>> location, and I think we should still delete the past metadata.json even if
>> the table doesn't own any location.
>>
>> I was thinking about the following items:
>> 1. to make a change to delete past metadata.json files as well when the
>> table is dropped with *purge=true* (small change, doesn't tackle
>> rollback/compaction data files)
>> 2. add configuration regarding table's location ownership, and delete
>> underlying files in drop table if table owns location (more complicated)
>>
>> I think 1 should be relatively safe to do even though it's a behavior
>> change, but want to run it through the community first.
>>
>> Thanks!
>> Yan
>>
>
