A motivating example: Trino recently had to implement parallel table metadata
fetching (https://github.com/trinodb/trino/pull/23909) because
otherwise metadata queries (e.g., INFORMATION_SCHEMA) were slow. Parallel
metadata retrieval boosted metadata query performance significantly. But
this solution is far from ideal:

   1. Catalogs will now experience request bursts whenever a user or a tool
   attempts to list Iceberg objects in Trino. This may induce
   unpredictable latency spikes, especially for large schemas.
   2. Each such request imposes a constant per-request catalog overhead for
   request dispatching, serde, security checks, etc., which could easily be
   avoided with bulk metadata lookup.
   3. The aforementioned fix addresses only parallel table retrieval. The
   engine will then have to support the same thing for views and
   materialized views, producing even more request bursts, with a considerable
   number of requests returning error responses because we cannot get an
   object's type and its metadata in one shot.
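
To make the request arithmetic concrete, here is a rough sketch using an
in-memory stand-in for a catalog. Everything in it is hypothetical and for
illustration only: FakeCatalog, load_table, load_view and load_objects are
made-up names, not the REST catalog API or Trino code, and load_objects merely
mimics the kind of bulk, type-agnostic lookup being proposed.

```python
from dataclasses import dataclass


@dataclass
class FakeCatalog:
    """Hypothetical in-memory catalog that counts the requests it serves."""
    tables: dict
    views: dict
    requests: int = 0
    errors: int = 0

    def load_table(self, name):
        # Typed lookup: fails if the object is not a table.
        self.requests += 1
        if name not in self.tables:
            self.errors += 1
            return None
        return self.tables[name]

    def load_view(self, name):
        # Typed lookup: fails if the object is not a view.
        self.requests += 1
        if name not in self.views:
            self.errors += 1
            return None
        return self.views[name]

    def load_objects(self, names):
        # Hypothetical bulk lookup: one request, returns type and metadata.
        self.requests += 1
        found = {}
        for name in names:
            if name in self.tables:
                found[name] = ("table", self.tables[name])
            elif name in self.views:
                found[name] = ("view", self.views[name])
        return found


def resolve_one_by_one(catalog, names):
    """Resolve objects of unknown type by probing table first, then view."""
    resolved = {}
    for name in names:
        meta = catalog.load_table(name)
        if meta is not None:
            resolved[name] = ("table", meta)
            continue
        meta = catalog.load_view(name)
        if meta is not None:
            resolved[name] = ("view", meta)
    return resolved


def resolve_in_bulk(catalog, names):
    """Resolve the same objects with a single bulk request."""
    return catalog.load_objects(names)
```

With three tables and two views, the one-by-one path in this sketch issues
7 requests, 2 of which come back as errors (the failed table probes for the
views), while the bulk path issues 1 request and no errors. Running the
per-object lookups in parallel hides the latency but does not reduce these
counts.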


On Tue, Dec 24, 2024 at 10:29 PM Vladimir Ozerov <voze...@querifylabs.com>
wrote:

> Hi,
>
> Following the discussion [1], I'd like to formally propose an extension to
> the REST catalog API that allows efficient lookup of multiple catalog objects
> without knowing their types in advance.
>
> When a query is submitted, the engine needs to resolve referenced objects.
> The current REST API requires multiple catalog calls per query, because it
> (1) assumes prior knowledge of the object type (not the case for
> virtually all query engines), and (2) lacks a bulk object lookup operation.
> This leads to increased query latency and increased REST catalog load.
>
> The proposal aims to solve the problem by introducing an optional endpoint
> that returns information about several catalog objects, including their
> type (table, view) and metadata.
>
> Note that the proposal attempts to solve two distinct issues via a single
> endpoint:
>
>    1. Inability to look up an object without knowing its type
>    2. Inability to look up multiple objects in a single request
>
> If the community finds the proposal too complicated, we can reduce the
> scope to point 1 and introduce an endpoint for object lookup without
> knowing its type. Even without bulk lookup, this can help engine developers
> minimize SQL query planning latency.
>
> Proposal:
> https://docs.google.com/document/d/1KfzdQT8Q2xiV_yPNvICROCepz-Qqpm0npob7hmb40Fc/edit?usp=sharing
>
> [1] https://lists.apache.org/thread/g44czzpjqqhdvronqfyckw4mnxvlpn3s
>
> Regards,
> --
> *Vladimir Ozerov*
>
>

-- 
*Vladimir Ozerov*
Founder
querifylabs.com
