Hello Dev list,

I want to update the community on the current thread for the proposal "Partially Loading Metadata - LoadTable V2" after hearing more perspectives from the community. In general, there is still some distance to go before we reach a general consensus, so I hope to foster more conversation and hear new input.

*Previous Discussions* (https://docs.google.com/document/d/1Nv7_9XqS8EyR30_mrrqkwbZx9pw34i3HYIwuDDXnOY4/edit?tab=t.0)

*10/28/2024, quick google meet discussion*

Thanks, Christian, Dmitri, Eric, JB, Szehon, Yufei for your time and for voicing your opinions this morning. Here is a quick summary of what we discussed (detailed meeting notes are also included in the link above):

Folks agreed that having a REST endpoint allowing clients to filter for what they need from LoadTableResult is a useful feature. The preliminary use cases that were brought up:
1. Load only current snapshot and current schema
2. Load only metadata file location
3. Load only credentials to access table
4. Query historical status of the table when time traveling

Meanwhile, it is also important for this endpoint to be extensible enough that it can handle similar future use cases that only require a portion of LoadTableResult (metadata included).

What the group has no strong preference on, or needs further input on:
1. Whether to modify the existing loadTable endpoint for partial loading or to create a new endpoint. The possible concern here is backward compatibility.
2. Whether to add bulk support for cases like loading the current schema of all tables belonging to the same namespace.

*10/23/2024, Iceberg community sync*

Thanks, Ryan, Dan, Yufei, JB, Russel and Szehon for your input here. Folks are divided on two questions:
1. Can we use table maintenance work to keep metadata size in check, thus removing the need to slice metadata at all?
2. Is bulk loading part of the information for many tables the same use case as loading part of the information for one table?

*10/09/2024, Dev list*

Thanks, Dan, Eduard for your input here. Folks are aligned on extending the existing "refs" mode to other fields (i.e. metadata-log, snapshot-log, schemas), so that we can lazily load those fields when they are not needed.
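
To make the lazy-loading idea concrete, here is a minimal client-side sketch in Python. It is only an illustration, not part of the proposal or the REST spec: `LazyTableMetadata`, `split_metadata`, the `fetch_field` callback and the `LAZY_FIELDS` set are all hypothetical names, and the callback stands in for whatever follow-up request a catalog might eventually support.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

# Hypothetical set of deferrable fields, taken from the fields named in this thread.
LAZY_FIELDS = frozenset({"metadata-log", "snapshot-log", "schemas"})


def split_metadata(
    full: Dict[str, Any], lazy: Iterable[str] = LAZY_FIELDS
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split full metadata into the part sent eagerly and the part deferred."""
    lazy_set = set(lazy)
    eager = {k: v for k, v in full.items() if k not in lazy_set}
    deferred = {k: v for k, v in full.items() if k in lazy_set}
    return eager, deferred


class LazyTableMetadata:
    """Holds eagerly loaded fields and fetches deferred fields on first access."""

    def __init__(self, eager: Dict[str, Any], fetch_field: Callable[[str], Any]):
        self._loaded = dict(eager)
        # Assumption: fetch_field performs the follow-up request for one field.
        self._fetch_field = fetch_field

    def get(self, name: str) -> Any:
        if name not in self._loaded:
            if name not in LAZY_FIELDS:
                raise KeyError(name)
            # Fetch once, then cache so later reads do not repeat the request.
            self._loaded[name] = self._fetch_field(name)
        return self._loaded[name]
```

A client that only reads eagerly loaded fields never triggers `fetch_field`; a client that needs `schemas` pays for one extra fetch and then reads the cached value.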

There are other parties from the community I have discussed this topic with. I appreciate your input. I have not mentioned those discussions here because I forgot to keep a written record of their context. If you fall into this category, I do apologize.

*Summary of perspectives*

The original proposal aimed to tackle the growing metadata problem and proposed a loadTable V2 endpoint. As the last thread mentioned, the conclusion at the time was that *extending the existing "refs" loading mode to more fields is preferable as it introduces less complexity and is more feasible to implement*.

The later threads were where the community divided. On the one side, *there's a general scepticism on the concept of partial metadata* (i.e. unioning results from different requests has been a problem in the past, even for "refs" lazy loading); on the other side, *there's a push to generalize partial metadata concept to "LoadTableResult" as a whole* (e.g. to only return metadata file location, or only return table access creds based on client filter).

Related is the concept of a bulk API. The community has raised this use case more than once, typically in relation to data warehouse management features, such as:
1) querying current schemas of all the tables belonging to a namespace;
2) querying certain table properties of many tables to see if any maintenance (downstream) jobs should be triggered;
3) querying ownership information of all tables to check security compliance of all the tables in the data warehouse, etc.

I want to lay everything out and foster more discussion toward a good direction:
1. extend the current "refs" lazy loading mechanism to be a more generic solution
2. avoid partial metadata at all cost, and try to contain metadata size so that it can always (or most of the time) be loaded in full
3. generalize the partial loading concept to the entire "LoadTableResult" (e.g. a generic loadTable V2 endpoint), so that users can use the same endpoint whether they want part of the metadata or another part of the "LoadTableResult" (e.g. metadata file location; table creds)
4. repurpose the last direction to make a bulk API for the REST spec, where loading pieces of information from many tables is permitted

Please also let me know if there are other directions I failed to account for here.

Looking forward to feedback/discussion from the community, thanks!

Haizhou