Hi Szehon, I agree with you there.
I think it's better to move forward step by step, so Eduard's proposal is a good idea. However, I think it's worth keeping the discussion going, at least to shape a good proposal.

Regards
JB

On Wed, Nov 6, 2024 at 3:23 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
> There seem to be many opinions here, but one of the main objections seems to be the complexity added to the REST spec impeding newer catalogs.
>
> Looking through the actual REST API change proposal, some of these are indeed a bit advanced to implement, like metadata property filtering or time-range filtering, for potentially small gain, so I can understand that argument.
>
> There is definitely value in trimming TableMetadata wire traffic, though, and I would love to see this work proceed. TableMetadata maintenance only works to a point: if a user wants to keep data of many different schemas, partition specs, etc., maintenance cannot fix the problem alone. Going back to the previous discussion thread, I think Eduard's proposal in https://lists.apache.org/thread/r9fgq4yz1oy5bow09zhhmcm66t6kgbh7 of extending refs to the other table-metadata array fields, beyond snapshots, is a good compromise to at least get the ball rolling without too much change to the API.
>
> Thanks
> Szehon
>
> On Fri, Nov 1, 2024 at 9:04 AM Dmitri Bourlatchkov <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>
>> Hello All,
>>
>> This is an interesting discussion and I'd like to offer my perspective.
>>
>> When a REST Catalog is involved, the metadata is loaded and modified via the catalog API. So control over the metadata is delegated to the catalog.
>>
>> I'd argue that in this situation, catalogs should have the flexibility to optimize metadata operations internally. In other words, if a particular use case does not require access to some pieces of metadata, the catalog should not have to provide them. For example, querying a particular snapshot does not require knowledge of other snapshots.
>>
>> I understand that the current metadata representation evolved to support certain use cases. Still, as far as API v2 is concerned, would it have to match what was happening in API v1? I think this is an opportunity to design API v2 in a more flexible and extensible manner.
>>
>> On the point of complexity (and I think adoption concerns are but a consequence of complexity): I believe that if the API is modelled to supply the information required for particular use cases, as opposed to representing a particular state of the table as a whole, the complexity can be reduced.
>>
>> In other words, I propose to make API v2 such that it focuses on what clients (engines) require for operation, as opposed to what the table metadata has in its totality at any moment in time. In a way, API v2 outputs do not have to be exact chunks of metadata carved out of physical files, but may be defined differently, linking to server-side metadata only conceptually.
>>
>> More specifically, if the client queries a table, it declares this intent in the API and receives the information required for the query. The client should be prepared to receive more information than it needs (in case the server does not support metadata slicing), but that should not add complexity, as discarding unused data should not be hard if the data structures allow for slicing. In effect, actual runtime efficiencies will be defined by the combined efforts of the client (engine) and catalog. At the same time, neither the client nor the catalog is required to implement advanced use cases.
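>>
>> To sketch what I mean (all names here are hypothetical, not a concrete spec proposal), a client could declare the pieces it needs and simply discard anything extra that a non-slicing server returns:
>>
>>   import java.util.HashMap;
>>   import java.util.List;
>>   import java.util.Map;
>>
>>   // Hypothetical catalog call: the field list is a hint, and a server
>>   // that does not support slicing may return the full metadata instead.
>>   interface SlicingCatalog {
>>     Map<String, Object> loadTable(String table, List<String> fields);
>>   }
>>
>>   class PartialLoadSketch {
>>     static Map<String, Object> loadForRead(SlicingCatalog catalog, String table) {
>>       // Declare intent: a plain read only needs these pieces.
>>       List<String> wanted = List.of("current-schema", "current-snapshot");
>>       Map<String, Object> response = catalog.loadTable(table, wanted);
>>       // If the server returned a superset, dropping the unused fields
>>       // is cheap as long as the data structures slice cleanly.
>>       Map<String, Object> sliced = new HashMap<>(response);
>>       sliced.keySet().retainAll(wanted);
>>       return sliced;
>>     }
>>   }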
>>
>> Similarly, if the client is only interested in knowing whether a table changed since point X (a time or a snapshot), that is also expressed in the API request. It may be a separate endpoint, or it may be possible to implement it as, for example, returning the latest snapshot ID.
>>
>> I understand there are use cases where engines want to operate directly on metadata files in storage. That is fine too, IMO; I am not proposing to change the Iceberg file format spec. At the same time, catalogs do not have to be limited to fetching data for the REST API from those files. Catalogs may choose to have additional storage partitioned and indexed differently than plain files.
>>
>> This is all very high level, of course, and it requires a lot of additional thinking about how to design API v2, but I believe we could achieve a more supportable and adoptable API v2 this way.
>>
>> Cheers,
>> Dmitri.
>>
>> On Thu, Oct 31, 2024 at 2:41 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>> Eric,
>>>
>>> With respect to the credential endpoint, I believe there is important context missing that probably should have been captured in the doc. The credential endpoint is unlike other use cases because the fundamental issue is that refresh is an operation that happens across distributed workers. Workers in Spark/Flink/Trino/etc. all need to refresh credentials for long-running operations, which results in orders of magnitude higher request rates than a table load. We originally expected to use the table load even for this, but the concern was that it would effectively DDoS the catalog.
>>>
>>> If there are specific cases that have solid justification like the above, I think we should add specific endpoints, but those should be used sparingly.
>>>
>>> > In other words -- if it's true that "partial metadata doesn't align with primary use cases", it seems true that "full metadata doesn't align with almost all use cases".
>>>
>>> I don't find this argument compelling. Are you saying that in any case where everything from a response isn't fully used, we should optimize that request so that a client can request only the specific information it will use? Generally, we want a surface area that can address most use cases, and as a consequence, not every request is going to perfectly match the specific needs of the client.
>>>
>>> -Dan
>>>
>>>
>>> On Thu, Oct 31, 2024 at 11:03 AM Eric Maynard <eric.w.mayn...@gmail.com> wrote:
>>>>
>>>> Thanks for this breakdown, Dan.
>>>>
>>>> I share your concerns about the complexity this might impose on the client. On some of your other notes, I have some thoughts below:
>>>>
>>>> Several Apache Polaris (Incubating) committers were in the recent sync on this proposal, so I want to share one perspective related to the last point re: Partial metadata impedes adoption.
>>>>
>>>> Personally, I feel better about the prospect of Polaris supporting a flexible loadTableV2-type API, as opposed to having to keep adding more endpoints to support new use cases that really just boil down to partial metadata. Gabor gives the example of isLatest above, and a recent proposal described an endpoint for credentials. I can't speak for every REST catalog implementation, but I am worried that Polaris will have to keep adding more APIs that really just expose various different slices of the loadTable response.
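>>>>
>>>> To make that concrete (every name below is hypothetical, only meant to show the shape of the two roads):
>>>>
>>>>   import java.util.Map;
>>>>   import java.util.Set;
>>>>
>>>>   // The road I'm worried about: one endpoint per slice of loadTable.
>>>>   interface PerSliceCatalog {
>>>>     boolean isLatest(String table, String metadataLocation);
>>>>     String metadataLocation(String table);
>>>>     Map<String, String> credentials(String table);
>>>>     // ...plus another endpoint for every future slice.
>>>>   }
>>>>
>>>>   // The alternative: one parameterized load. A minimal implementation
>>>>   // may ignore the filter and return everything, like loadTable "V1".
>>>>   interface LoadTableV2Catalog {
>>>>     Map<String, Object> loadTableV2(String table, Set<String> fields);
>>>>   }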
>>>>
>>>> I also like that loadTableV2 gives us the option to "partially implement" the partial metadata response, like you noted. Compared to something like a credential endpoint that either works or doesn't work, the loadTableV2 endpoint can be trivially implemented to just return all metadata like loadTable "V1" does. In my view, this makes the road to adoption easier.
>>>>
>>>> With respect to your section titled Partial metadata doesn't align with primary use cases:
>>>>
>>>> It's certainly true that many use cases do require a significant amount of the metadata returned by loadTable today. However, I would guess that very few truly require 100% of the metadata. If we are evaluating endpoints based on how consistently useful the response will be, I feel like this argument turns into a stronger one against loadTableV1 than loadTableV2.
>>>>
>>>> In other words -- if it's true that "partial metadata doesn't align with primary use cases", it seems true that "full metadata doesn't align with almost all use cases".
>>>>
>>>> Even if most use cases do need 90% of the metadata, it seems like a useful optimization for the client to not have to request whatever it doesn't need. This also gives us the flexibility to make table metadata richer in the future without having to worry about the cost a heavier metadata payload might incur for existing use cases.
>>>>
>>>> Eric M.
>>>>
>>>>
>>>> On Thu, Oct 31, 2024 at 10:37 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>
>>>>> I'd like to clarify my concerns here because I think there are more aspects to this than we've captured.
>>>>>
>>>>> Partial metadata loads add significant complexity to the protocol
>>>>> Iceberg metadata is a complicated structure, and finding a way to represent how and what we want to piece apart is non-trivial. There are nested structures and references between different fields that would all need custom ways to return through a response. This also makes it difficult for clients to process and for services to implement. Adding this (even with an option to return full metadata with requirements that reflect the table spec) necessitates a v2 endpoint. If catalogs are required to support all partial load semantics, then the catalog becomes complicated. If the catalog can opt to always return the full metadata, it makes the client more complicated, since it may have to handle two very different-looking response objects for any load request.
>>>>>
>>>>> Partial metadata doesn't address the underlying issue, but pushes it somewhere else
>>>>> From a client perspective, I can see that this feels like an optimization, because I can just grab what I want from the metadata (e.g. schema, or properties). However, all we've done is push that complexity to the server, which either has to parse the metadata and return a subset of it, or needs to have a more complicated way of representing and storing independent pieces of metadata (all while still being required to produce new json metadata). All we've done here is make the service more complicated, and the underlying issue of maintenance of the metadata still needs to be addressed.
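>>>>>
>>>>> As a rough sketch of that first option (hypothetical names, just to show where the work lands), the service still has to materialize the full metadata before it can slice it:
>>>>>
>>>>>   import java.util.HashMap;
>>>>>   import java.util.Map;
>>>>>   import java.util.Set;
>>>>>
>>>>>   class PartialLoadService {
>>>>>     // The server parses the complete metadata document either way;
>>>>>     // filtering trims the response, not the server-side work.
>>>>>     static Map<String, Object> loadTable(String table, Set<String> fields) {
>>>>>       Map<String, Object> full = parseFullMetadataJson(table);
>>>>>       if (fields.isEmpty()) {
>>>>>         return full; // no filter requested: behave like the v1 endpoint
>>>>>       }
>>>>>       Map<String, Object> subset = new HashMap<>(full);
>>>>>       subset.keySet().retainAll(fields);
>>>>>       return subset;
>>>>>     }
>>>>>
>>>>>     static Map<String, Object> parseFullMetadataJson(String table) {
>>>>>       return new HashMap<>(); // stand-in for reading metadata.json
>>>>>     }
>>>>>   }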
>>>>>
>>>>> Partial metadata doesn't align with primary use cases
>>>>> The vast majority of use cases require a significant amount of the metadata returned in the load table response. While some pieces may be discarded, much of the information is necessary to read or update a table. The ref loading was an effort to limit the overall size of the response while including the vast majority of relevant information for read-only use cases, but even our most complete implementations still need the full metadata to properly construct a new commit and resolve conflicts.
>>>>>
>>>>> Even the example of Impala trying to load the location to determine if the table has changed is less than ideal, because to accurately answer that question, you need to load the metadata. For example, if there was a background compaction that resulted in a rewrite operation, or a property change that doesn't affect the underlying data, it may not be necessary to invalidate the cache. This approach becomes even more problematic if the community decides to remove the location requirement, because the location would then not be available to signify the state of the table.
>>>>>
>>>>> Partial metadata impedes adoption
>>>>> My biggest concern is that the added complexity here impedes adoption of the REST specification. There are a large number of engines and catalog implementations that are still in the early stages of the adoption curve. Partial metadata loads split these groups into the catalogs willing to implement them and the engines that start requiring them in order to function. While I think partial metadata loads are an interesting technical challenge, I don't believe they are necessary, and our effort should go into producing good solutions for metadata management and implementations of catalogs that can return the table metadata quickly to clients.
>>>>>
>>>>> I feel like focusing on table metadata maintenance addresses all of the issues except the most extreme edge cases, and good catalog implementations can return a metadata payload faster than most object stores can even load the metadata json file (in practice, single-digit millisecond responses are achievable here), so performance is not the tradeoff.
>>>>>
>>>>> - Dan
>>>>>
>>>>>
>>>>> On Tue, Oct 29, 2024 at 1:31 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
>>>>>>
>>>>>> Hi Iceberg Community,
>>>>>>
>>>>>> I just wanted to mention that I was also going to start a discussion about getting partial information from LoadTableResponse through the REST API. My motivation is a bit different here, though:
>>>>>> Impala currently has strong integration with HMS and, in turn, with the HiveCatalog. Nowadays there are efforts in the project to make it work with the REST catalog for Iceberg tables, and there is one piece that we miss with the REST API. Impala caches table metadata, and we need a way to decide whether we have to reload the metadata for a particular table or not. Currently, with HMS we have a push-based solution where every change to the table is pushed to Impala from HMS as notifications/events; with the REST catalog we were thinking of a pull-based approach where Impala occasionally asks the REST catalog whether a particular table is up-to-date or not.
>>>>>>
>>>>>> Use-case: So in Impala's case what would be important is to have a REST Catalog API to answer a question like:
>>>>>> "I cached this version of this particular table, is it up-to-date or do I have to reload it?"
>>>>>>
>>>>>> Possible solutions:
>>>>>> 1) This could be achieved by an API like this:
>>>>>> boolean isLatest(TableIdentifier ident, String metadataLocation);
>>>>>> 2) Another approach could be to get the latest metadata location and let the engine compare it to the one it holds (see the sketch after this list):
>>>>>> String metadataLocation(TableIdentifier ident);
>>>>>> 3) Similarly to 2), querying the metadata location could also be achieved via the current proposal of partial metadata, like this (I just made up some types here):
>>>>>> Table loadTable(TableIdentifier ident, SomeFilterClass.MetadataLocation);
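>>>>>>
>>>>>> For illustration, with option 2) the engine-side check would reduce to a comparison like this (a sketch only; I use String for the identifier to keep it self-contained):
>>>>>>
>>>>>>   // Stand-in for the hypothetical metadataLocation() API from option 2).
>>>>>>   interface StalenessCheckCatalog {
>>>>>>     String metadataLocation(String ident);
>>>>>>   }
>>>>>>
>>>>>>   class CachedTable {
>>>>>>     final String ident;
>>>>>>     String cachedLocation;
>>>>>>
>>>>>>     CachedTable(String ident, String cachedLocation) {
>>>>>>       this.ident = ident;
>>>>>>       this.cachedLocation = cachedLocation;
>>>>>>     }
>>>>>>
>>>>>>     // Pull-based check: reload only when the latest metadata
>>>>>>     // location differs from the one we cached.
>>>>>>     boolean refreshIfStale(StalenessCheckCatalog catalog) {
>>>>>>       String latest = catalog.metadataLocation(ident);
>>>>>>       if (latest.equals(cachedLocation)) {
>>>>>>         return false; // cache is still up-to-date
>>>>>>       }
>>>>>>       cachedLocation = latest; // ...and reload the table metadata here
>>>>>>       return true;
>>>>>>     }
>>>>>>   }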
>>>>>>
>>>>>> Either way is fine for Impala I think, I just wanted to share our use-case that could also leverage getting partial metadata. Now that I have written this mail, it seems to hijack the original conversation a bit. Let me know if I should raise this in a separate [discuss] thread.
>>>>>>
>>>>>> Regards,
>>>>>> Gabor
>>>>>>
>>>>>> On Tue, Oct 29, 2024 at 2:16 AM Haizhou Zhao <zhaohaizhou940...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hello Dev list,
>>>>>>>
>>>>>>> I want to update the community on the current thread for the proposal "Partially Loading Metadata - LoadTable V2" after hearing more perspectives from the community. In general, there is still some distance to go to reach a general consensus, so I hope to foster more conversations and hear new inputs.
>>>>>>>
>>>>>>> Previous Discussions (https://docs.google.com/document/d/1Nv7_9XqS8EyR30_mrrqkwbZx9pw34i3HYIwuDDXnOY4/edit?tab=t.0)
>>>>>>>
>>>>>>>
>>>>>>> 10/28/2024, quick google meet discussion
>>>>>>>
>>>>>>> Thanks, Christian, Dmitri, Eric, JB, Szehon, Yufei for your time and for voicing your opinions this morning. Here's a quick summary of what we discussed (detailed meeting notes are also included in the link above):
>>>>>>>
>>>>>>> Folks agreed that having a REST endpoint allowing clients to filter for what they need from LoadTableResult is a useful feature. The preliminary use cases that were brought up:
>>>>>>> 1. Load only the current snapshot and current schema
>>>>>>> 2. Load only the metadata file location
>>>>>>> 3. Load only the credentials to access the table
>>>>>>> 4. Query the historical state of the table when time traveling
>>>>>>> Meanwhile, it is also important for this endpoint to be extensible enough that it can accommodate similar future use cases that only require a portion of LoadTableResult (metadata included).
>>>>>>>
>>>>>>> What the group has no strong preference on, or needs further input on:
>>>>>>> 1. Whether to modify the existing loadTable endpoint for partial loading or to create a new endpoint. The possible concern here is backward compatibility.
>>>>>>> 2. Whether to add bulk support for cases like loading the current schema of all tables belonging to the same namespace (see the sketch below).
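>>>>>>>
>>>>>>> For 2., the bulk shape would roughly be the following (a hypothetical sketch, not a settled design):
>>>>>>>
>>>>>>>   import java.util.Map;
>>>>>>>
>>>>>>>   // One request per namespace instead of one loadTable call per
>>>>>>>   // table: returns the requested slice (here the current schema,
>>>>>>>   // e.g. as its JSON representation) keyed by table name.
>>>>>>>   interface BulkCatalog {
>>>>>>>     Map<String, String> currentSchemas(String namespace);
>>>>>>>   }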
>>>>>>>
>>>>>>> 10/23/2024, Iceberg community sync
>>>>>>>
>>>>>>> Thanks, Ryan, Dan, Yufei, JB, Russel and Szehon for your inputs here.
>>>>>>>
>>>>>>> Folks are divided on two aspects:
>>>>>>> 1. Can we use table maintenance work to keep metadata size in check, thus preventing the necessity to slice metadata at all?
>>>>>>> 2. Is it the same use case to bulk load part of the information for many tables and to load part of the information for one table?
>>>>>>>
>>>>>>>
>>>>>>> 10/09/2024, Dev list
>>>>>>>
>>>>>>> Thanks, Dan, Eduard for your inputs here.
>>>>>>>
>>>>>>> Folks are aligned here on extending the existing "refs" mode to other fields (i.e. metadata-log, snapshot-log, schemas), so that we can lazily load those fields if they are not needed.
>>>>>>>
>>>>>>>
>>>>>>> There are other parties from the community I had discussions on this topic with. I appreciate your input; I failed to mention those discussions here because I forgot to keep a written record of their context. In case you fall into this category, I do apologize.
>>>>>>>
>>>>>>>
>>>>>>> Summary of perspectives
>>>>>>>
>>>>>>> The original proposal was aimed at tackling the growing metadata problem, and proposed a loadTable V2 endpoint. As the last thread mentioned, the conclusion at the time was that extending the existing "refs" loading mode to more fields is preferable, as it introduces less complexity and is more feasible to implement.
>>>>>>>
>>>>>>> The later threads were where the community divided. On the one side, there's a general scepticism about the concept of partial metadata (i.e. unioning results from different requests has been a problem, even for "refs" lazy loading in the past); on the other side, there's a push to generalize the partial metadata concept to "LoadTableResult" as a whole (e.g. to only return the metadata file location, or only return table access creds, based on a client filter).
>>>>>>>
>>>>>>> Related is the concept of a bulk API. The community has raised this use case more than once, typically in connection with data warehouse management features, such as: 1) querying the current schemas of all the tables belonging to a namespace; 2) querying certain table properties of many tables to see if any maintenance (downstream) jobs should be triggered; 3) querying ownership information of all tables to check the security compliance of all the tables in the data warehouse, etc.
>>>>>>>
>>>>>>> I want to lay everything down and foster more discussion toward a good direction:
>>>>>>> 1. extend the current "refs" lazy loading mechanism into a more generic solution
>>>>>>> 2. prevent partial metadata at all cost, and try to contain metadata size so that it can always (or most of the time) be loaded in full
>>>>>>> 3. generalize the partial loading concept to the entire "LoadTableResult" (e.g. a generic loadTable V2 endpoint), so that users can use the same endpoint whether they want part of the metadata or another part of the "LoadTableResult" (e.g. metadata file location; table creds)
>>>>>>> 4. repurpose the last direction into a bulk API for the REST spec, where loading pieces of information from many tables is permitted
>>>>>>> Or let me know if there are other directions I failed to account for here.
>>>>>>>
>>>>>>> Looking forward to feedback/discussion from the community, thanks!
>>>>>>> Haizhou