Thanks for this breakdown, Dan. I share your concerns about the complexity this might impose on the client. On some of your other notes, I have some thoughts below:
Several Apache Polaris (Incubating) committers were in the recent sync on this proposal, so I want to share one perspective related to the last point, re: *Partial metadata impedes adoption*.

Personally, I feel better about the prospect of Polaris supporting a flexible loadTableV2-type API than about having to keep adding more endpoints to support new use cases that really just boil down to partial metadata. Gabor gives the example of isLatest above, and a recent proposal <https://docs.google.com/document/d/1acCkaPCO7WsLtvYugrayurbef4zCnD2rb3ZPBKeaYoo/edit?tab=t.0#heading=h.hs6r9d26w1y2> described an endpoint for credentials. I can't speak for every REST catalog implementation, but I am worried that Polaris will have to keep adding APIs that really just expose different slices of the loadTable response.

I also like that loadTableV2 gives us the option to "partially implement" the partial metadata response, as you noted. Compared to something like a credential endpoint that either works or doesn't, the loadTableV2 endpoint can be trivially implemented to just return all metadata, like loadTable "V1" does. In my view, this makes the road to adoption easier.

With respect to your section titled *Partial metadata doesn't align with primary use cases*: it's certainly true that many use cases require a significant amount of the metadata returned by loadTable today. However, I would guess that very few truly require 100% of it. If we are evaluating endpoints based on how consistently useful the response will be, this argument turns into a stronger one against loadTableV1 than against loadTableV2. In other words -- if it's true that "partial metadata doesn't align with primary use cases", it seems equally true that "full metadata doesn't align with *almost all* use cases". Even if most use cases need 90% of the metadata, it seems like a useful optimization for the client not to have to request whatever it doesn't need.
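To make the "trivially implemented" point concrete, here is a rough sketch. All names and payload shapes below are invented for illustration -- they are not from the REST spec or the proposal -- but they show how a loadTableV2-style handler can either filter the response or fall back to full-metadata "V1" behavior:

```python
# Hypothetical sketch only: field names and the response shape below are
# illustrative, not taken from the REST spec or any proposal draft.

# A full loadTable-style response, as loadTable "V1" would return it.
FULL_RESPONSE = {
    "metadata-location": "s3://bucket/db/tbl/metadata/00005.metadata.json",
    "metadata": {
        "current-schema-id": 1,
        "schemas": [{"schema-id": 1, "fields": []}],
        "snapshots": [{"snapshot-id": 42}],
        "properties": {"owner": "someone"},
    },
}

def load_table_v2(selected_fields=None):
    """Return only the requested slices of the table metadata.

    A minimal server can ignore `selected_fields` and return the full
    payload -- exactly the loadTable "V1" behavior -- and still be a
    valid (if unoptimized) implementation.
    """
    if not selected_fields:
        return FULL_RESPONSE  # trivial fallback: behave like V1
    return {
        "metadata-location": FULL_RESPONSE["metadata-location"],
        "metadata": {k: v for k, v in FULL_RESPONSE["metadata"].items()
                     if k in selected_fields},
    }

# A client that only needs the schema avoids pulling snapshot history:
partial = load_table_v2(["current-schema-id", "schemas"])
```

The interesting branch is the fallback: a catalog that isn't ready to implement filtering can return the full payload unchanged, which is what makes the incremental road to adoption plausible.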
This also gives us the flexibility to make table metadata richer in the future without having to worry about the cost a heavier metadata payload might incur for existing use cases.

Eric M.

On Thu, Oct 31, 2024 at 10:37 AM Daniel Weeks <dwe...@apache.org> wrote:

> I'd like to clarify my concerns here because I think there are more
> aspects to this than we've captured.
>
> *Partial metadata loads add significant complexity to the protocol*
> Iceberg metadata is a complicated structure, and finding a way to represent
> how and what we want to piece apart is non-trivial. There are nested
> structures and references between different fields that would all need
> custom ways to return through a response. This also makes it difficult for
> clients to process and services to implement. Adding this (even with an
> option to return full metadata with requirements that reflect the table
> spec) necessitates a v2 endpoint. If catalogs are required to support all
> partial load semantics, then the catalog becomes complicated. If the
> catalog can opt to always return the full metadata, it makes the client
> more complicated, since it may have to handle two very different-looking
> response objects for any load request.
>
> *Partial metadata doesn't address the underlying issue, but pushes it
> somewhere else*
> From a client perspective, I can see that this feels like an optimization
> because I can just grab what I want from the metadata (e.g. schema, or
> properties). However, all we've done is push that complexity to the server,
> which either has to parse the metadata and return a subset of it, or needs
> to have a more complicated way of representing and storing independent
> pieces of metadata (all while still being required to produce new json
> metadata). All we've done here is make the service more complicated, and
> the underlying issue of maintenance of the metadata still needs to be
> addressed.
>
> *Partial metadata doesn't align with primary use cases*
> The vast majority of use cases require a significant amount of the
> metadata returned in the load table response. While some pieces may be
> discarded, much of the information is necessary to read or update a table.
> The ref loading was an effort to limit the overall size of the response and
> include the vast majority of relevant information for read-only use cases,
> but even our most complete implementations still need the full metadata to
> properly construct a new commit and resolve conflicts.
>
> Even the example of Impala trying to load the location to determine if the
> table has changed is less than ideal, because to accurately answer that
> question, you need to load the metadata. For example, if there was a
> background compaction that resulted in a rewrite operation, or a property
> change that doesn't affect the underlying data, it may not be necessary to
> invalidate the cache. This problem is further exacerbated if the
> community decides to remove the location requirement, because the location
> would then not be available to signify the state of the table.
>
> *Partial metadata impedes adoption*
> My biggest concern is that the added complexity here impedes adoption of
> the REST specification. There are a large number of engines and catalog
> implementations that are still in the early stages of the adoption curve.
> Partial metadata loads would split these groups into the catalogs willing
> to implement it and the engines that start requiring it in order to
> function. While I think partial metadata loads are an interesting technical
> challenge, I don't believe they are necessary, and our effort should go
> into producing good solutions for metadata management and implementations
> of catalogs that can return the table metadata quickly to clients.
>
> I feel like focusing on table metadata maintenance addresses all of the
> issues except the most extreme edge cases, and good catalog implementations
> can return a metadata payload faster than most object stores can even load
> the metadata json file (in practice, single-digit millisecond responses are
> achievable here), so performance is not the tradeoff.
>
> - Dan
>
> On Tue, Oct 29, 2024 at 1:31 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Hi Iceberg Community,
>>
>> I just wanted to mention that I was also going to start a discussion
>> about getting partial information from LoadTableResponse through the REST
>> API. My motivation is a bit different here, though:
>> Impala currently has strong integration with HMS and in turn with the
>> HiveCatalog. Nowadays there are efforts in the project to make it work
>> with the REST catalog for Iceberg tables, and there is one piece that we
>> are missing now with the REST API. Impala caches table metadata, and we
>> need a way to decide whether we have to reload the metadata for a
>> particular table or not. Currently, with HMS we have a push-based solution
>> where every change to the table is pushed to Impala from HMS as
>> notifications/events, and with the REST catalog we were thinking of a
>> pull-based approach where Impala occasionally asks the REST catalog
>> whether a particular table is up-to-date or not.
>>
>> *Use-case*: So in Impala's case what would be important is to have a
>> REST Catalog API to answer a question like:
>> "I cached this version of this particular table, is it up-to-date or do I
>> have to reload it?"
>>
>> *Possible solutions*:
>> 1) This could be achieved by an API like this:
>>   boolean isLatest(TableIdentifier ident, String metadataLocation);
>> 2) Another approach could be to get the latest metadata location and let
>> the engine compare it to the one it holds:
>>   String metadataLocation(TableIdentifier ident);
>> 3) Similarly to 2), querying the metadata location could also be achieved
>> via the current proposal of partial metadata, like this (I just made up
>> some types here):
>>   Table loadTable(TableIdentifier ident, SomeFilterClass.MetadataLocation);
>>
>> Either way is fine for Impala, I think; I just wanted to share our
>> use-case that could also leverage getting partial metadata.
>> Now that I have written this mail, it seems to hijack the original
>> conversation a bit. Let me know if I should raise this in a separate
>> [discuss] thread.
>>
>> Regards,
>> Gabor
>>
>> On Tue, Oct 29, 2024 at 2:16 AM Haizhou Zhao <zhaohaizhou940...@gmail.com>
>> wrote:
>>
>>> Hello Dev list,
>>>
>>> I want to update the community on the current thread for the proposal
>>> "Partially Loading Metadata - LoadTable V2" after hearing more
>>> perspectives from the community. In general, there is still some distance
>>> to go toward a general consensus, so I hope to foster more conversations
>>> and hear new inputs.
>>>
>>> *Previous Discussions* (
>>> https://docs.google.com/document/d/1Nv7_9XqS8EyR30_mrrqkwbZx9pw34i3HYIwuDDXnOY4/edit?tab=t.0
>>> )
>>>
>>> *10/28/2024, quick Google Meet discussion*
>>>
>>> Thanks, Christian, Dmitri, Eric, JB, Szehon, and Yufei for your time and
>>> for voicing your opinions this morning. Here's a quick summary of what we
>>> discussed (detailed meeting notes are also included in the link above):
>>>
>>> Folks agreed that having a REST endpoint allowing clients to filter for
>>> what they need from LoadTableResult is a useful feature. The preliminary
>>> use cases that were brought up:
>>> 1. Load only the current snapshot and current schema
>>> 2. Load only the metadata file location
>>> 3. Load only the credentials to access the table
>>> 4. Query the historical status of the table when time traveling
>>> Meanwhile, it is also important for this endpoint to be extensible
>>> enough that it can cover similar future use cases that only require a
>>> portion of LoadTableResult (metadata included).
>>>
>>> Where the group has no strong preference or needs further input:
>>> 1. Whether to modify the existing loadTable endpoint for partial loading
>>> or to create a new endpoint. The possible concern here is backward
>>> compatibility.
>>> 2. Whether to add bulk support for cases like loading the current
>>> schema of all tables belonging to the same namespace.
>>>
>>> *10/23/2024, Iceberg community sync*
>>>
>>> Thanks, Ryan, Dan, Yufei, JB, Russell, and Szehon for your inputs here.
>>>
>>> Folks are divided on two questions:
>>> 1. Can we use table maintenance work to keep metadata size in check,
>>> thus preventing the need to slice metadata at all?
>>> 2. Is bulk loading part of the information for many tables the same use
>>> case as loading part of the information for one table?
>>>
>>> *10/09/2024, Dev list*
>>>
>>> Thanks, Dan and Eduard for your inputs here.
>>>
>>> Folks are aligned here on extending the existing "refs" mode to other
>>> fields (i.e. metadata-log, snapshot-log, schemas), so that we can lazily
>>> load those fields when they are not needed.
>>>
>>> There are other parties in the community I discussed this topic with. I
>>> appreciate your input, and I failed to mention those discussions here
>>> because I forgot to keep a written record of their context. If you fall
>>> into this category, I do apologize.
>>>
>>> *Summary of perspectives*
>>>
>>> The original proposal aimed to tackle the growing metadata problem and
>>> proposed a loadTable V2 endpoint. As the last thread mentioned, the
>>> conclusion at the time was that *extending the existing "refs" loading
>>> mode to more fields is preferable, as it introduces less complexity and
>>> is more feasible to implement*.
>>>
>>> The later threads were where the community divided. On the one side,
>>> *there's general scepticism about the concept of partial metadata* (i.e.
>>> unioning results from different requests has been a problem, even for
>>> "refs" lazy loading in the past); on the other side, *there's a push to
>>> generalize the partial metadata concept to "LoadTableResult" as a whole*
>>> (e.g. to only return the metadata file location, or only return table
>>> access creds based on a client filter).
>>>
>>> Related is the concept of a bulk API, a use case the community has
>>> raised more than once, typically in connection with data warehouse
>>> management features, such as: 1) querying the current schemas of all the
>>> tables belonging to a namespace; 2) querying certain table properties of
>>> many tables to see if any maintenance (downstream) jobs should be
>>> triggered; 3) querying ownership information of all tables to check
>>> security compliance across the data warehouse, etc.
>>>
>>> I want to lay everything down and foster more discussion toward a good
>>> direction:
>>> 1. extend the current "refs" lazy loading mechanism into a more generic
>>> solution
>>> 2. prevent partial metadata at all cost, and try to contain metadata
>>> size so that it can always (or most of the time) be loaded in full
>>> 3. generalize the partial loading concept to the entire "LoadTableResult"
>>> (e.g. a generic loadTable V2 endpoint), so that users can use the same
>>> endpoint whether they want part of the metadata or another part of the
>>> "LoadTableResult" (e.g. metadata file location; table creds)
>>> 4. repurpose the last direction into a bulk API for the REST spec, where
>>> loading pieces of information from many tables is permitted
>>> Or let me know if there are other directions I failed to account for
>>> here.
>>>
>>> Looking forward to feedback/discussion from the community, thanks!
>>> Haizhou
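On the client side, Gabor's three options above all reduce to the same check: compare a cached metadata location with the catalog's current one. A minimal sketch of that cache-invalidation logic follows; `fetch_metadata_location` is a placeholder for whichever endpoint shape the community settles on, not a real API:

```python
# Sketch of the client-side check behind Gabor's options: cache the metadata
# location at load time, then ask the catalog only for the current location
# to decide whether a reload is needed. `fetch_metadata_location` is a
# placeholder for whichever endpoint shape the community settles on.

class TableCache:
    def __init__(self, fetch_metadata_location):
        self._fetch = fetch_metadata_location
        self._cache = {}  # table ident -> (metadata_location, cached metadata)

    def put(self, ident, location, metadata):
        self._cache[ident] = (location, metadata)

    def is_stale(self, ident):
        """True if the catalog's current metadata location differs from ours."""
        cached = self._cache.get(ident)
        if cached is None:
            return True  # never loaded: treat as stale
        return self._fetch(ident) != cached[0]

# Example: a fake catalog whose table has advanced from v1 to v2.
cache = TableCache(lambda ident: "s3://warehouse/db/tbl/metadata/v2.json")
cache.put("db.tbl", "s3://warehouse/db/tbl/metadata/v1.json", {"schema": {}})
stale = cache.is_stale("db.tbl")  # True: the location moved, so reload
```

Dan's caveat above still applies to any variant of this: the location can move for reasons (background compaction, a property change) that don't invalidate the data a reader cares about, so the check is conservative and may trigger unnecessary reloads.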