I mostly agree with Denys's viewpoint. That is, to query Iceberg and Hudi tables registered in HMS, each engine needs to implement and configure its own connectors. These connectors are specific to each engine and have nothing to do with HMS itself. HMS serves as a neutral, unified metadata management service, responsible only for managing the lifecycle of catalogs (such as creation and deletion) and providing unified metadata authorization.
Some extra information in response to lisoda:

1) Q1: HMS may store various types of tables (e.g., Iceberg, Hudi), and some engines may not be able to query certain types of tables stored in HMS.

First, this issue seems unrelated to the multi-catalog or federated-catalog approach I proposed. It is essentially a problem of multiple table formats (Iceberg, Hudi, etc.) being mixed within a single HMS catalog. When a compute engine is configured with such an HMS catalog, it may see all tables via `SHOW TABLES` but only be able to query a subset of them. This should be handled at the compute-engine level: the engine can decide whether a table is visible, or whether it can be queried, based on table attributes such as `table_type`. For instance, StarRocks provides a catalog/connector called the Unified Catalog (https://docs.starrocks.io/docs/data_source/catalog/unified_catalog/), which can query multiple table formats (such as Iceberg and Hudi) stored in the same HMS. If users only want to query one specific table format stored in that HMS, such as Iceberg tables, they can create a dedicated catalog/connector, like the Iceberg Catalog (https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/). With that catalog/connector, `SHOW TABLES` lists only Iceberg tables; any other table format is invisible. Additionally, based on my tests, when using `org.apache.iceberg.spark.SparkSessionCatalog`, Spark should be able to query both Hive tables and Iceberg tables through the HMS catalog.

2) Q2: Regarding the issue of circular catalogs, I believe it does not exist. When a compute engine is configured with an HMS catalog, that catalog can only see its own catalog namespace (databases and tables). The engine cannot see information from other catalogs through this HMS catalog.
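For reference, the Spark behavior mentioned above can typically be reproduced with a configuration along these lines. This is a sketch based on the standard Iceberg-on-Spark session-catalog settings; the metastore URI is a placeholder:

```properties
# Wrap Spark's built-in session catalog so it can resolve both
# Hive tables and Iceberg tables registered in the same HMS.
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive
# Placeholder metastore endpoint; replace with your HMS host.
spark.hadoop.hive.metastore.uris=thrift://metastore-host:9083
```

With this in place, non-Iceberg tables fall through to the underlying Hive session catalog, which is what allows both formats to be queried side by side.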
Thanks,
Butao Zhang

---- Replied Message ----
From: lisoda <[email protected]>
Date: 3/20/2026 22:53
To: dev <[email protected]>
Subject: Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive

I understand your concern, but I may not have expressed myself clearly: I don't intend to tightly couple the catalog with specific engine runtime configurations either. What I'm suggesting is a lightweight convention mechanism, not deep integration.

My idea is actually quite simple: engines could report just a few boolean flags upon connection (e.g., `supports_iceberg: true/false`), or we could push the filtering logic down to the engine side via an SDK. This is less about "coupling" and more about a declarative contract. From an engineering perspective, convention over configuration is generally the better path:

- Convention (auto-reporting/filtering): the engine declares its capabilities, and HMS or the SDK automatically masks incompatible metadata. This maintains a single source of truth: the physical properties of the table (format, location) directly determine its visibility.
- Configuration (manual access control): administrators manually maintain a separate set of ACL rules outside of HMS to hide certain tables. This essentially creates duplicate definitions: the metadata layer already defines "this is an Iceberg table," and then the permission layer has to define "this engine shouldn't see this Iceberg table." As the number of tables or engines scales, this manual synchronization overhead becomes unmanageable.

In other words, I'm not asking HMS to understand "what connectors Spark 3.4 has installed." I'm simply suggesting that the physical properties of the metadata (the format type) should automatically determine its distribution scope. If HMS remains completely agnostic and relies on external permission systems to retroactively hide visibility, doesn't that actually increase operational complexity?
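To make the "push the filtering logic down to the engine side via an SDK" idea above concrete, here is a minimal hypothetical sketch; the class and method names are illustrative and not part of any existing HMS API. It relies only on the fact that Iceberg records `table_type=ICEBERG` in the table parameters it stores in HMS, while plain Hive tables have no such parameter:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of an SDK-side capability filter: the engine declares
// which table formats it supports, and the client library hides the rest.
// This is NOT an existing HMS API; names here are made up for illustration.
public class CapabilityFilter {

    // Formats the connecting engine declared support for, e.g. {"HIVE", "ICEBERG"}.
    private final Set<String> supportedFormats;

    public CapabilityFilter(Set<String> supportedFormats) {
        this.supportedFormats = supportedFormats;
    }

    // Iceberg sets table_type=ICEBERG in HMS table parameters; a table
    // without that parameter is treated as a plain Hive table.
    public boolean isVisible(Map<String, String> tableParams) {
        String format = tableParams.getOrDefault("table_type", "HIVE");
        return supportedFormats.contains(format.toUpperCase());
    }

    // Filters a table listing down to the formats the engine can actually read.
    public List<Map<String, String>> filter(List<Map<String, String>> tables) {
        return tables.stream().filter(this::isVisible).collect(Collectors.toList());
    }
}
```

The same check could equally run server-side against the declared flags; the point of the sketch is only that a single table property is enough to drive visibility, with no per-engine configuration in HMS.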
---- Replied Message ----
From: Denys Kuzmenko <[email protected]>
Date: 03/20/2026 19:12
To: [email protected]
Subject: Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive

I don't think tying catalog behavior to engine capabilities is a good direction. A catalog should remain engine-agnostic and focus purely on metadata management and discovery, not on the execution capabilities of specific query engines.

Hive Metastore is intentionally designed as a neutral metadata service. It exposes table definitions, while each engine (e.g., Apache Spark, Trino, etc.) decides whether it can actually process those tables based on its configured connectors or format support. Introducing capability negotiation would effectively couple the catalog to specific engines and their runtime configuration, which breaks that separation of concerns and makes the catalog responsible for execution-layer logic.

If a particular engine does not support a given format or catalog (for example, it does not have the appropriate client/connector installed), the cleaner solution is access control, not metadata filtering. In practice, permissions can simply be removed for users of that engine on catalogs or tables they are not expected to query. Keeping the catalog engine-agnostic preserves interoperability and avoids embedding engine-specific behavior into the metadata layer.
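As an illustration of the access-control route described above: with Hive's SQL-standard-based authorization enabled, an administrator could simply revoke access for the role used by an engine that cannot read a given format. The database, table, and role names here are made up for illustration:

```sql
-- Hypothetical names; assumes SQL-standard-based authorization is enabled.
-- Users of the non-Iceberg-capable engine are mapped to this role.
REVOKE SELECT ON TABLE iceberg_db.events FROM ROLE legacy_engine_users;
```

This keeps the visibility decision in the permission layer, at the cost of the manual synchronization lisoda points out.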
