There seem to be many opinions here, but one of the main objections is the
complexity added to the REST spec, which could impede adoption by newer catalogs.

Looking through the actual REST API change proposal
<https://docs.google.com/document/d/1eXnT0ZiFvdm_Zvk6fLGT_UxVWO-HsiqVywqu1Uk8s7E/edit?tab=t.0#heading=h.zfurj8lnnibk>,
some of these are indeed a bit advanced to implement, like metadata
property filtering, or time-range filtering, for potentially small gain, so
I can understand that argument.

There is definitely value in trimming TableMetadata wire traffic though,
and I would love to see this work proceed.  TableMetadata maintenance only
works to a point: if a user wants to keep data across many different schemas,
partition specs, etc., maintenance alone cannot fix the problem.  Going
back to the previous discussion thread, I think Eduard's proposal in
https://lists.apache.org/thread/r9fgq4yz1oy5bow09zhhmcm66t6kgbh7 to
extend refs-style loading to the other table-metadata array fields, beyond
snapshots, is a good compromise to at least get the ball rolling without too
much change to the API.
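
For illustration, here is a rough sketch of how that extension might look
from the client side, modelled on the existing snapshots=refs loading mode.
The extra query parameters and names below are hypothetical, just to make
the idea concrete:

    // Today: GET .../namespaces/{ns}/tables/{table}?snapshots=refs|all
    // Hypothetical sibling parameters covering the other metadata arrays:
    //   GET .../tables/{table}?snapshots=refs&schemas=current&partition-specs=current
    public class LoadTableParamsSketch {
      // "all" mirrors today's full response; "current"/"refs" trim to what is reachable
      enum Mode { ALL, CURRENT, REFS }

      static String queryString(Mode snapshots, Mode schemas, Mode specs) {
        return "snapshots=" + snapshots.name().toLowerCase()
            + "&schemas=" + schemas.name().toLowerCase()
            + "&partition-specs=" + specs.name().toLowerCase();
      }

      public static void main(String[] args) {
        // A read-only client that needs refs plus the current schema and spec:
        System.out.println(queryString(Mode.REFS, Mode.CURRENT, Mode.CURRENT));
      }
    }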

Thanks
Szehon

On Fri, Nov 1, 2024 at 9:04 AM Dmitri Bourlatchkov
<dmitri.bourlatch...@dremio.com.invalid> wrote:

> Hello All,
>
> This is an interesting discussion and I'd like to offer my perspective.
>
> When a REST Catalog is involved, the metadata is loaded and modified via
> the catalog API. So control over the metadata is delegated to the catalog.
>
> I'd argue that in this situation, catalogs should have the flexibility to
> optimize metadata operations internally. In other words, if a particular
> use case does not require access to some pieces of metadata, the catalog
> should not have to provide them. For example, querying a particular snapshot
> does not require knowledge of other snapshots.
>
> I understand that the current metadata representation evolved to support
> certain use cases. Still, as far as API v2 is concerned, would it have to
> match what was happening in API v1? I think this is an opportunity to
> design API v2 in a more flexible and extensible manner.
>
> On the point of complexity (and I think adoption concerns are but a
> consequence of complexity): I believe that if the API is modelled to supply
> information required for particular use cases, as opposed to representing a
> particular state of the table as a whole, the complexity can be reduced.
>
> In other words, I propose to make API v2 focus on what
> clients (engines) require for operation, as opposed to what the table
> metadata contains in its totality at any moment in time. In a way, API v2
> outputs do not have to be exact chunks of metadata carved out of physical
> files, but may be defined differently, linking to server-side metadata only
> conceptually.
>
> More specifically, if the client queries a table, it declares this intent
> in the API and receives the information required for the query. The client
> should be prepared to receive more information than it needs (in case the
> server does not support metadata slicing), but that should not add
> complexity, as discarding unused data should not be hard if the data
> structures allow for slicing. In effect, actual runtime efficiencies will
> be defined by the combined efforts of the client (engine) and catalog. At
> the same time, neither the client nor the catalog is required to implement
> advanced use cases.
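>
> As a purely illustrative sketch (the names here are hypothetical, not
> proposed spec text), such an intent-declaring request might look like:
>
>   public class IntentSketch {
>     // The client declares why it is loading the table; the catalog may
>     // still return a superset of the requested information.
>     enum AccessIntent { READ_CURRENT, TIME_TRAVEL, COMMIT, METADATA_LOCATION_ONLY }
>
>     record LoadTableV2Request(String namespace, String table, AccessIntent intent) {}
>
>     public static void main(String[] args) {
>       // A plain scan needs only the current schema, spec, and snapshot:
>       var req = new LoadTableV2Request("db", "events", AccessIntent.READ_CURRENT);
>       System.out.println(req);
>     }
>   }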
>
> Similarly, if the client is only interested in knowing whether a table has
> changed since point X (time or snapshot), that is also expressed in the API
> request. It may be a separate endpoint, or it may be possible to implement
> it as, for example, returning the latest snapshot ID.
>
> I understand there are use cases where engines want to operate directly
> on metadata files in storage. That is fine too, IMO; I am not proposing to
> change the Iceberg file format spec. At the same time, catalogs do not have
> to be limited to fetching data for the REST API from those files. Catalogs
> may choose to have additional storage partitioned and indexed differently
> than plain files.
>
> This is all very high level, of course, and it requires a lot of
> additional thinking about how to design API v2, but I believe we could
> achieve a more supportable and adoptable API v2 this way.
>
> Cheers,
> Dmitri.
>
> On Thu, Oct 31, 2024 at 2:41 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Eric,
>>
>> With respect to the credential endpoint, I believe there is important
>> context missing that probably should have been captured in the doc.  The
>> credential endpoint is unlike other use cases because the fundamental issue
>> is that refresh is an operation that happens across distributed workers.
>> Workers in spark/flink/trino/etc. all need to refresh credentials for
>> long-running operations, which results in orders of magnitude higher
>> request rates than a table load.  We originally expected to use the table
>> load even for this, but the concern was it would effectively DDoS the catalog.
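>>
>> (To put made-up numbers on the scale difference: a planner issues one table
>> load per query, but a job running on 1,000 workers that refreshes vended
>> credentials hourly over a 10-hour run issues on the order of 10,000
>> credential requests against that single table.)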
>>
>> If there are specific cases that have solid justification like the above,
>> I think we should add specific endpoints, but those should be used
>> sparingly.
>>
>> > In other words -- if it's true that "partial metadata doesn't align
>> > with primary use cases", it seems true that "full metadata doesn't align
>> > with *almost all* use cases".
>>
>> I don't find this argument compelling.  Are you saying that in any case
>> where everything from a response isn't fully used, we should optimize that
>> request so that a client can request only the specific information it will
>> use?  Generally, we want a surface area that can address most use cases and,
>> as a consequence, not every request is going to perfectly match the
>> specific needs of the client.
>>
>>  -Dan
>>
>>
>> On Thu, Oct 31, 2024 at 11:03 AM Eric Maynard <eric.w.mayn...@gmail.com>
>> wrote:
>>
>>> Thanks for this breakdown, Dan.
>>>
>>> I share your concerns about the complexity this might impose on the
>>> client. On some of your other notes, I have some thoughts below:
>>>
>>>
>>> Several Apache Polaris (Incubating) committers were in the recent sync
>>> on this proposal, so I want to share one perspective related to the last
>>> point re: *Partial metadata impedes adoption*.
>>>
>>> Personally, I feel better about the prospect of Polaris supporting a
>>> flexible loadTableV2-type API as opposed to having to keep adding more
>>> endpoints to support new use cases that really just boil down to partial
>>> metadata. Gabor gives the example of isLatest above, and a recent proposal
>>> <https://docs.google.com/document/d/1acCkaPCO7WsLtvYugrayurbef4zCnD2rb3ZPBKeaYoo/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>> described an endpoint for credentials. I can't speak for every REST catalog
>>> implementation, but I am worried that Polaris will have to keep adding more
>>> APIs that really just expose various different slices of the loadTable
>>> response.
>>>
>>> I also like that loadTableV2 gives us the option to "partially
>>> implement" the partial metadata response like you noted. Compared to
>>> something like a credential endpoint that either works or doesn't work, the
>>> loadTableV2 endpoint can be trivially implemented to just return all
>>> metadata like loadTable "V1" does. In my view, this makes the road to
>>> adoption easier.
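>>>
>>> As a minimal sketch of that fallback (all names hypothetical), a catalog
>>> that has not implemented slicing could simply ignore the requested fields:
>>>
>>>   public class LoadTableV2FallbackSketch {
>>>     // A degenerate V2 implementation: ignore the requested slice and fall
>>>     // back to the full V1 response; clients must be prepared to discard
>>>     // whatever they did not ask for.
>>>     static String loadTableV2(String table, java.util.Set<String> requestedFields) {
>>>       return loadTableV1(table); // full metadata is a superset of any slice
>>>     }
>>>
>>>     static String loadTableV1(String table) {
>>>       return "{ /* full LoadTableResult JSON for " + table + " */ }";
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>       System.out.println(loadTableV2("events", java.util.Set.of("metadata-location")));
>>>     }
>>>   }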
>>>
>>>
>>> With respect to your section titled *Partial metadata doesn't align
>>> with primary use cases*:
>>>
>>> It's certainly true that many use cases do require a significant amount
>>> of the metadata returned by loadTable today. However, I would guess that
>>> very few truly require 100% of the metadata. If we are evaluating endpoints
>>> based on how consistently useful the response will be, I feel like this
>>> argument turns into a stronger one against loadTableV1 than loadTableV2.
>>>
>>> In other words -- if it's true that "partial metadata doesn't align with
>>> primary use cases", it seems true that "full metadata doesn't align with
>>> *almost all* use cases".
>>>
>>> Even if most use cases do need 90% of the metadata, it seems like a
>>> useful optimization for the client to be able to skip whatever it
>>> doesn't need. This also gives us the flexibility to make table metadata
>>> richer in the future without having to worry about the cost a heavier
>>> metadata payload might incur for existing use cases.
>>>
>>>
>>> Eric M.
>>>
>>>
>>> On Thu, Oct 31, 2024 at 10:37 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> I'd like to clarify my concerns here because I think there are more
>>>> aspects to this than we've captured.
>>>>
>>>> *Partial metadata loading adds significant complexity to the protocol*
>>>> Iceberg metadata is a complicated structure and finding a way to
>>>> represent how and what we want to piece apart is non-trivial.  There are
>>>> nested structures and references between different fields that would all
>>>> need custom ways to return through a response.  This also makes it
>>>> difficult for clients to process and services to implement.  Adding this
>>>> (even with an option to return full metadata with requirements that reflect
>>>> the table spec) necessitates a v2 endpoint.  If catalogs are required to
>>>> support all partial load semantics, then the catalog becomes complicated.
>>>> If the catalog can opt to always return the full metadata, it makes the
>>>> client more complicated, since it may have to handle two very different
>>>> looking response objects for any load request.
>>>>
>>>> *Partial metadata doesn't address the underlying issue, but pushes it
>>>> somewhere else*
>>>> From a client perspective, I can see that this feels like an
>>>> optimization because I can just grab what I want from the metadata (e.g.
>>>> schema, or properties).  However, all we've done is push that complexity to
>>>> the server, which either has to parse the metadata and return a subset of
>>>> it, or needs to have a more complicated way of representing and storing
>>>> independent pieces of metadata (all while still being required to produce
>>>> new json metadata).  All we've done here is make the service more
>>>> complicated, and the underlying issue of maintenance of the metadata still
>>>> needs to be addressed.
>>>>
>>>> *Partial metadata doesn't align with primary use cases*
>>>> The vast majority of use cases require a significant amount of the
>>>> metadata returned in the load table response.  While some pieces may be
>>>> discarded, much of the information is necessary to read or update a table.
>>>> The ref loading was an effort to limit the overall size of the response and
>>>> include the vast majority of relevant information for read-only use cases,
>>>> but even our most complete implementations still need the full metadata to
>>>> properly construct a new commit and resolve conflicts.
>>>>
>>>> Even the example of Impala trying to load the location to determine if
>>>> the table has changed is less than ideal because to accurately answer that
>>>> question, you need to load the metadata.  For example, if there was a
>>>> background compaction that resulted in a rewrite operation or a property
>>>> change that doesn't affect the underlying data, it may not be necessary to
>>>> invalidate the cache.  This problem is further exacerbated if the
>>>> community decides to remove the location requirement, because the location
>>>> would then not be available to signify the state of the table.
>>>>
>>>> *Partial metadata impedes adoption*
>>>> My biggest concern is that the added complexity here impedes adoption
>>>> of the REST specification.  There are a large number of engines and catalog
>>>> implementations that are still in the early stages of the adoption curve.
>>>> Partial metadata loading splits these groups into the catalogs willing to
>>>> implement it and the engines that start requiring it in order to function.
>>>> While I think partial metadata loading is an interesting technical challenge,
>>>> I don't believe that it's necessary, and our effort should go into producing
>>>> good solutions for metadata management and implementations of catalogs that
>>>> can return the table metadata quickly to clients.
>>>>
>>>> I feel like focusing on table metadata maintenance addresses all of the
>>>> issues except the most extreme edge cases, and good catalog implementations
>>>> can return a metadata payload faster than most object stores can even load
>>>> the metadata json file (in practice, single-digit millisecond responses are
>>>> achievable here), so performance is not the tradeoff.
>>>>
>>>> - Dan
>>>>
>>>>
>>>> On Tue, Oct 29, 2024 at 1:31 AM Gabor Kaszab <gaborkas...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Iceberg Community,
>>>>>
>>>>> I just wanted to mention that I was also going to start a discussion
>>>>> about getting partial information from LoadTableResponse through the REST
>>>>> API.
>>>>> My motivation is a bit different here, though:
>>>>> Impala currently has strong integration with HMS and, in turn, with the
>>>>> HiveCatalog. Nowadays there are efforts in the project to make it work
>>>>> with the REST catalog for Iceberg tables, and there is one piece that we
>>>>> miss with the REST API. Impala caches table metadata, and we need a way
>>>>> to decide whether we have to reload the metadata for a particular table or
>>>>> not. Currently, with HMS we have a push-based solution where every change
>>>>> to the table is pushed to Impala from HMS as notifications/events; with
>>>>> the REST catalog we were thinking of a pull-based approach where Impala
>>>>> occasionally asks the REST catalog whether a particular table is
>>>>> up-to-date or not.
>>>>>
>>>>> *Use-case*: So in Impala's case, what would be important is to have a
>>>>> REST Catalog API that can answer a question like:
>>>>> "I cached this version of this particular table, is it up-to-date or
>>>>> do I have to reload it?"
>>>>>
>>>>> *Possible solutions*:
>>>>> 1) This could either be achieved by an API like this:
>>>>>     boolean isLatest(TableIdentifier ident, String metadataLocation);
>>>>> 2) Another approach could be to get the latest metadata location and
>>>>> let the engine compare it to the one it holds:
>>>>>     String metadataLocation(TableIdentifier ident);
>>>>> 3) Similarly to 2), querying the metadata location could also be achieved
>>>>> with the current partial metadata proposal, like this (I just made up some
>>>>> types here):
>>>>>     Table loadTable(TableIdentifier ident, SomeFilterClass.MetadataLocation);
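>>>>>
>>>>> For illustration, with option 2) above, Impala's staleness check would
>>>>> boil down to a comparison like this (a sketch with made-up names):
>>>>>
>>>>>   public class CacheCheckSketch {
>>>>>     // Compare the cached metadata location against the latest one
>>>>>     // returned by the catalog; a mismatch means the table must be reloaded.
>>>>>     static boolean needsReload(String cachedLocation, String latestLocation) {
>>>>>       return !cachedLocation.equals(latestLocation);
>>>>>     }
>>>>>
>>>>>     public static void main(String[] args) {
>>>>>       System.out.println(needsReload(
>>>>>           "s3://bucket/tbl/metadata/00001.metadata.json",
>>>>>           "s3://bucket/tbl/metadata/00002.metadata.json")); // true -> reload
>>>>>     }
>>>>>   }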
>>>>>
>>>>> Either way is fine for Impala, I think; I just wanted to share our
>>>>> use-case that could also leverage getting partial metadata.
>>>>> Now that I have written this mail, it seems to hijack the original
>>>>> conversation a bit. Let me know if I should raise this in a separate
>>>>> [discuss] thread.
>>>>>
>>>>> Regards,
>>>>> Gabor
>>>>>
>>>>> On Tue, Oct 29, 2024 at 2:16 AM Haizhou Zhao <
>>>>> zhaohaizhou940...@gmail.com> wrote:
>>>>>
>>>>>> Hello Dev list,
>>>>>>
>>>>>> I want to update the community on the current thread for the proposal
>>>>>> "Partially Loading Metadata - LoadTable V2" after hearing more perspectives
>>>>>> from the community. In general, there is still some distance to go to reach
>>>>>> a general consensus, so I hope to foster more conversations and hear new
>>>>>> inputs.
>>>>>>
>>>>>> *Previous Discussions*
>>>>>> (https://docs.google.com/document/d/1Nv7_9XqS8EyR30_mrrqkwbZx9pw34i3HYIwuDDXnOY4/edit?tab=t.0)
>>>>>>
>>>>>>
>>>>>> *10/28/2024, quick google meet discussion*
>>>>>>
>>>>>> Thanks, Christian, Dmitri, Eric, JB, Szehon, and Yufei for your time and
>>>>>> for voicing your opinions this morning. Here's a quick summary of what we
>>>>>> discussed (detailed meeting notes are also included in the link above):
>>>>>>
>>>>>> Folks agreed that having a REST endpoint allowing clients to filter
>>>>>> for what they need from LoadTableResult is a useful feature. The
>>>>>> preliminary use cases that were brought up:
>>>>>> 1. Load only the current snapshot and current schema
>>>>>> 2. Load only the metadata file location
>>>>>> 3. Load only the credentials to access the table
>>>>>> 4. Query the historical state of the table when time traveling
>>>>>> Meanwhile, it is also important for this endpoint to be extensible
>>>>>> enough that it could take care of similar use cases that only require a
>>>>>> portion of LoadTableResult (metadata included) in the future.
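>>>>>>
>>>>>> As a purely illustrative sketch, with hypothetical field names, the four
>>>>>> use cases above might map onto filter sets like:
>>>>>>
>>>>>>   public class LoadTableFilterSketch {
>>>>>>     enum Field { CURRENT_SNAPSHOT, CURRENT_SCHEMA, METADATA_LOCATION,
>>>>>>                  CREDENTIALS, SNAPSHOT_HISTORY }
>>>>>>
>>>>>>     public static void main(String[] args) {
>>>>>>       var uc1 = java.util.EnumSet.of(Field.CURRENT_SNAPSHOT, Field.CURRENT_SCHEMA);
>>>>>>       var uc2 = java.util.EnumSet.of(Field.METADATA_LOCATION);
>>>>>>       var uc3 = java.util.EnumSet.of(Field.CREDENTIALS);
>>>>>>       var uc4 = java.util.EnumSet.of(Field.SNAPSHOT_HISTORY); // time travel
>>>>>>       System.out.println(uc1 + " " + uc2 + " " + uc3 + " " + uc4);
>>>>>>     }
>>>>>>   }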
>>>>>>
>>>>>> Points where the group had no strong preference or needs further input:
>>>>>> 1. Whether to modify the existing loadTable endpoint for partial
>>>>>> loading or to create a new endpoint. The possible concern here is backward
>>>>>> compatibility.
>>>>>> 2. Whether to add bulk support to support cases like loading the
>>>>>> current schema of all tables belonging to the same namespace.
>>>>>>
>>>>>>
>>>>>> *10/23/2024, Iceberg community sync*
>>>>>>
>>>>>> Thanks, Ryan, Dan, Yufei, JB, Russel and Szehon for your inputs here.
>>>>>>
>>>>>> Folks are divided on two aspects:
>>>>>> 1. Can we use table maintenance work to keep metadata size in check,
>>>>>> thus preventing the need to slice metadata at all?
>>>>>> 2. Is it the same use case to bulk load part of the information for
>>>>>> many tables and to load part of the information for one table?
>>>>>>
>>>>>>
>>>>>> *10/09/2024, Dev list*
>>>>>>
>>>>>> Thanks, Dan, Eduard for your inputs here.
>>>>>>
>>>>>> Folks are aligned here on extending the existing "refs" mode to other
>>>>>> fields (i.e. metadata-log, snapshot-log, schemas), so that we can load
>>>>>> those fields lazily, only when needed.
>>>>>>
>>>>>>
>>>>>> There are other parties in the community I had discussions on this
>>>>>> topic with. I appreciate your input, and I failed to mention those
>>>>>> discussions here only because I did not keep a written record of their
>>>>>> context. If you fall into this category, I apologize.
>>>>>>
>>>>>>
>>>>>> *Summary of perspectives*
>>>>>>
>>>>>> The original proposal aimed to tackle the growing metadata
>>>>>> problem and proposed a loadTable V2 endpoint. As the last thread
>>>>>> mentioned, the conclusion at the time was that *extending the
>>>>>> existing "refs" loading mode to more fields is preferable, as it
>>>>>> introduces less complexity and is more feasible to implement*.
>>>>>>
>>>>>> The later threads were where the community divided. On the one side, 
>>>>>> *there's
>>>>>> a general scepticism on the concept of partial metadata* (i.e. union
>>>>>> results from different requests has been a problem, even for "refs" lazy
>>>>>> loading in the past); on the other side, *there's a push to
>>>>>> generalize partial metadata concept to "LoadTableResult" as a whole*
>>>>>> (e.g. to only return metadata file location, or only return table access
>>>>>> creds based on client filter).
>>>>>>
>>>>>> Related is the concept of bulk API, where the community has raised
>>>>>> this use case more than once, which are typically related to data 
>>>>>> warehouse
>>>>>> management features, such as: 1) querying current schemas of all the 
>>>>>> tables
>>>>>> belonging to a namespace; 2) querying certain table properties of many
>>>>>> tables to see if any maintenance (downstream) jobs should be triggered; 
>>>>>> 3)
>>>>>> querying ownership information of all tables to check security compliance
>>>>>> of all the tables in data warehouse, etc.
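>>>>>>
>>>>>> As a purely hypothetical illustration, such a bulk call could be a
>>>>>> namespace-level variant of loadTable that returns one slice per table
>>>>>> instead of N round trips:
>>>>>>
>>>>>>   public class BulkLoadSketch {
>>>>>>     // One hypothetical request, e.g.
>>>>>>     //   GET .../namespaces/{ns}/tables?fields=current-schema
>>>>>>     // returns the requested slice for every table in the namespace.
>>>>>>     record TableSlice(String table, String currentSchemaJson) {}
>>>>>>
>>>>>>     static java.util.List<TableSlice> loadNamespaceSlices(String ns, String field) {
>>>>>>       // a real catalog would assemble this from its metadata store
>>>>>>       return java.util.List.of(new TableSlice("events", "{...}"),
>>>>>>                                new TableSlice("orders", "{...}"));
>>>>>>     }
>>>>>>
>>>>>>     public static void main(String[] args) {
>>>>>>       loadNamespaceSlices("db", "current-schema").forEach(System.out::println);
>>>>>>     }
>>>>>>   }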
>>>>>>
>>>>>> I want to lay everything down and foster more discussion toward a good
>>>>>> direction:
>>>>>> 1. extend the current "refs" lazy loading mechanism into a more
>>>>>> generic solution
>>>>>> 2. prevent partial metadata at all costs, and try to contain metadata
>>>>>> size so that it can always (or most of the time) be loaded in full
>>>>>> 3. generalize the partial loading concept to the entire "LoadTableResult"
>>>>>> (e.g. a generic loadTable V2 endpoint), so that users can use the same
>>>>>> endpoint whether they want part of the metadata or another part of the
>>>>>> "LoadTableResult" (e.g. metadata file location; table creds)
>>>>>> 4. repurpose the previous direction into a bulk API for the REST
>>>>>> spec, where loading pieces of information from many tables is permitted
>>>>>> There may also be other directions I failed to account for here.
>>>>>>
>>>>>> Looking forward to feedback/discussion from the community, thanks!
>>>>>> Haizhou
>>>>>>
>>>>>
