Hi, Vladimir:

Thanks for raising this. I think your proposal combines two separate things:
1. Add an endpoint for loading a catalog object by name without knowing its
type. This seems reasonable to me.
2. Make the endpoint a bulk load operation. I'm hesitant about this option
since it makes error handling difficult. As you mentioned in the doc, once we
have introduced 1, the number of requests will drop from m * n to m,
where m is the number of object names and n is the number of object types. As
for the problem of request bursts and latency, would a client cache plus
parallel fetching solve your problem?
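
To make the request counts concrete, here is a rough sketch of what I have in mind. It is not real Iceberg client code: the in-memory CATALOG, the CountingClient class, and the load_typed / load_any methods are all hypothetical stand-ins, and load_any only assumes that an option-1 endpoint returns the object's type together with the object.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a REST catalog (not a real Iceberg client).
CATALOG = {"orders": "table", "customers": "table", "daily_sales": "view"}
OBJECT_TYPES = ("table", "view")  # n = 2


class CountingClient:
    """Counts the requests that would be sent to the catalog."""

    def __init__(self):
        self.requests = 0
        self._cache = {}
        self._lock = threading.Lock()

    def _send(self):
        with self._lock:
            self.requests += 1

    def load_typed(self, name, object_type):
        # Type-specific load: a wrong guess comes back as "not found" (None).
        self._send()
        return object_type if CATALOG.get(name) == object_type else None

    def load_any(self, name):
        # Assumed option-1 endpoint: one request returns the object's type.
        with self._lock:
            if name in self._cache:
                return self._cache[name]
        self._send()
        found = CATALOG.get(name)
        with self._lock:
            self._cache[name] = found
        return found


def resolve_by_probing(client, names):
    # Probe every type for every name in parallel: m * n requests.
    pairs = [(name, t) for name in names for t in OBJECT_TYPES]
    with ThreadPoolExecutor(max_workers=8) as pool:
        hits = pool.map(lambda p: (p[0], client.load_typed(*p)), pairs)
        return {name: t for name, t in hits if t is not None}


def resolve_type_agnostic(client, names):
    # One request per distinct name, fetched in parallel: m requests.
    unique = list(dict.fromkeys(names))
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(zip(unique, pool.map(client.load_any, unique)))
```

With the three names above and two object types, probing sends 6 requests while the type-agnostic path sends 3, and a repeated call on the same client is served from the cache without any further requests. Error handling also stays per object, which is the part I would like to keep simple.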

On Fri, Jan 3, 2025 at 7:33 PM Vladimir Ozerov <voze...@querifylabs.com>
wrote:

> A motivating example: Trino recently had to implement parallel table
> metadata fetching (https://github.com/trinodb/trino/pull/23909) because
> otherwise metadata queries (e.g., INFORMATION_SCHEMA) were slow. Parallel
> metadata retrieval boosted metadata query performance significantly. But
> this solution is far from ideal:
>
>    1. Catalogs will now experience request bursts whenever a user or a
>    tool attempts to list Iceberg objects in Trino. This may induce
>    unpredictable latency spikes, especially for large schemas.
>    2. Each such request imposes a constant catalog overhead for
>    request dispatching, serde, security checks, etc., which could easily be
>    avoided with bulk metadata lookup.
>    3. The aforementioned fix addresses only parallel table retrieval. The
>    engine will then have to support the same thing for views and
>    materialized views, producing even more request bursts, with a considerable
>    number of requests returning error responses because we cannot get an
>    object's type and its metadata in one shot.
>
>
> On Tue, Dec 24, 2024 at 10:29 PM Vladimir Ozerov <voze...@querifylabs.com>
> wrote:
>
>> Hi,
>>
>> Following the discussion [1], I'd like to formally propose an extension to
>> the REST catalog API that allows efficient lookup of multiple catalog
>> objects without knowing their types in advance.
>>
>> When a query is submitted, the engine needs to resolve referenced
>> objects. The current REST API requires multiple catalog calls per query,
>> because it (1) assumes prior knowledge of the object type (not the case
>> for virtually all query engines), and (2) lacks a bulk object lookup
>> operation. This leads to increased query latency and increased REST catalog
>> load.
>>
>> The proposal aims to solve the problem by introducing an optional endpoint
>> that returns information about several catalog objects, including their
>> type (table, view) and metadata.
>>
>> Note that the proposal attempts to solve two distinct issues via a single
>> endpoint:
>>
>>    1. Inability to look up an object without knowing its type
>>    2. Inability to look up multiple objects in a single request
>>
>> If the community finds the proposal too complicated, we can reduce the
>> scope to point 1 and introduce an endpoint for object lookup without
>> knowing its type. Even without bulk lookup, this can help engine developers
>> minimize SQL query planning latency.
>>
>> Proposal:
>> https://docs.google.com/document/d/1KfzdQT8Q2xiV_yPNvICROCepz-Qqpm0npob7hmb40Fc/edit?usp=sharing
>>
>> [1] https://lists.apache.org/thread/g44czzpjqqhdvronqfyckw4mnxvlpn3s
>>
>> Regards,
>> --
>> *Vladimir Ozerov*
>>
>>
>
> --
> *Vladimir Ozerov*
> Founder
> querifylabs.com
>
