+1 for item 1. The fact that we do not remove all data referenced by all metadata files seems like a bug to me that should be fixed. The table's pointer has already been removed from the catalog with no way to roll back, so there is no reason to keep those files around. I don't know if there is any historical context for removing only the data referenced by the latest metadata; maybe someone with context could provide more details.
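To make the gap concrete, here is a minimal sketch in Python. It is not the Iceberg API: the dictionary model, the file names, and both function names are hypothetical, and it simplifies a metadata file to "the set of data files it references" (in reality the references go through snapshots and manifests).

```python
from typing import Dict, List, Set

# Hypothetical model, not the Iceberg API: each metadata file version maps
# to the set of data files reachable from it.
MetadataLog = Dict[str, Set[str]]


def purge_latest_only(log: MetadataLog, order: List[str]) -> Set[str]:
    """Files removed if the purge follows only the latest metadata file."""
    latest = order[-1]
    return {latest} | log[latest]


def purge_all_versions(log: MetadataLog, order: List[str]) -> Set[str]:
    """Files removed if the purge follows every metadata file ever written."""
    removed: Set[str] = set(order)
    for name in order:
        removed |= log[name]
    return removed


# create table -> insert -> compaction that rewrites data-a into data-b
log: MetadataLog = {
    "metadata.json-1": set(),
    "metadata.json-2": {"data-a.parquet"},
    "metadata.json-3": {"data-b.parquet"},
}
order = list(log)  # oldest first

# Files left behind after the drop when only the latest metadata is followed.
leaked = purge_all_versions(log, order) - purge_latest_only(log, order)
print(sorted(leaked))
```

Under these assumptions, the two older metadata files and the pre-compaction data file are what stays behind after the drop.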
For item 2, I think this aligns with the discussion conclusions in the linked issues. At least we can have some flag saying the table location, data location and metadata location do not have other table data (a.k.a. the table owns those 3 prefixes), and then we can safely do a recursive deletion. This also seems to align with the intention for having a LIST API in FileIO discussed in https://github.com/apache/iceberg/issues/3212 for remove_orphan_files.

-Jack

On Mon, Nov 22, 2021 at 6:42 PM Yan Yan <yyany...@gmail.com> wrote:

> Hi everyone,
>
> Does anyone know across catalog implementations, when we drop tables with
> *purge=true*, why do we only drop last metadata and files referred by it,
> but not any of the previous metadata? e.g.
>
> *create iceberg table1*; <--- metadata.json-1
> *insert into table1* ...; <--- metadata.json-2
>
> when I do *drop table1* after these two commands, `metadata.json-1` will
> not be deleted. This will also mean if we rollback/compact table and then
> drop, data files referred by some of the previous metadata files will also
> not be deleted.
>
> I know the community used to talk about table location ownership for file
> cleanup after dropping table (e.g.
> https://github.com/apache/iceberg/issues/1764
> https://github.com/trinodb/trino/issues/5616 ) but I'm not sure if they
> could completely solve the problem since we can customize metadata/data
> location, and I think we should still delete the past metadata.json even if
> the table doesn't own any location.
>
> I was thinking about the following items:
> 1. to make a change to delete past metadata.json files as well when the
> table is dropped with *purge=true* (small change, doesn't tackle
> rollback/compaction data files)
> 2. add configuration regarding table's location ownership, and delete
> underlying files in drop table if table owns location (more complicated)
>
> I think 1 should be relatively safe to do despite that it's a behavior
> change, but want to run it through the community first.
>
> Thanks!
> Yan