Re: [PR] Design: AWS Glue Data Catalog connector design document [gravitino]

via GitHub Wed, 25 Mar 2026 21:39:17 -0700


markhoerth commented on code in PR #10539:
URL: https://github.com/apache/gravitino/pull/10539#discussion_r2992420670



##########
design/aws-glue-catalog-connector.md:
##########
@@ -0,0 +1,592 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Design: AWS Glue Data Catalog Support for Apache Gravitino
+
+## 1. Problem Statement and Goals
+
+### 1.1 Problem
+
+**Gravitino currently cannot federate AWS Glue Data Catalog.** This is a 
significant gap because:
+
+1. **Large user base on AWS**: The majority of cloud-native data lakes run on 
AWS with Glue Data Catalog as the central metadata service (default for Athena, 
Redshift Spectrum, EMR, Lake Formation). These organizations cannot bring their 
Glue metadata into Gravitino's unified management layer.
+2. **No native integration path**: The only workaround is pointing Gravitino's 
Hive catalog at Glue's HMS-compatible Thrift endpoint (`metastore.uris = 
thrift://...`), which is undocumented, region-limited, and cannot leverage 
Glue-native features (catalog ID, cross-account access, VPC endpoints).
+3. **Competitive landscape**: Trino, Spark, and other engines all have 
first-class Glue support with dedicated configuration. Users expect the same 
from Gravitino.
+
+### 1.2 Goals
+
+After this feature is implemented:
+
+1. **Register AWS Glue Data Catalog in Gravitino**:
+   ```bash
+   # Hive-format tables
+   gcli catalog create --name hive_on_glue --provider hive \
+     --properties metastore-type=glue,s3-region=us-east-1
+
+   # Iceberg-format tables
+   gcli catalog create --name iceberg_on_glue --provider lakehouse-iceberg \
+     --properties catalog-backend=glue,warehouse=s3://bucket/iceberg,s3-region=us-east-1
+   ```
+
+2. **Standard Gravitino API works against Glue catalogs**:
+   ```bash
+   gcli schema list --catalog hive_on_glue
+   gcli table list --catalog hive_on_glue --schema my_database
+   gcli table details --catalog iceberg_on_glue --schema analytics --table events
+   ```
+
+3. **Trino and Spark connect transparently** — Trino uses 
`hive.metastore=glue` / `iceberg.catalog.type=glue`; Spark uses 
`AWSGlueDataCatalogHiveClientFactory` / `GlueCatalog`. Users query Glue tables 
through Gravitino without knowing the underlying mechanism.
+
+4. **AWS-native authentication** (reuses existing S3 properties): static 
credentials, STS AssumeRole, or default credential chain (environment 
variables, instance profile).
+
+## 2. Background
+
+### 2.1 AWS Glue Data Catalog
+
+AWS Glue Data Catalog is a managed metadata repository storing:
+- **Databases** — logical groupings, equivalent to Gravitino schemas.
+- **Tables** — metadata records containing column definitions, storage 
descriptors, partition keys, and user-defined parameters.
+
+Tables come in two formats:
+
+| Format | How Glue Stores It |
+|---|---|
+| **Hive** | Full metadata in `StorageDescriptor` (columns, SerDe, InputFormat, OutputFormat, location). The majority of tables in most Glue catalogs (legacy ETL, Athena CTAS, Redshift Spectrum). |
+| **Iceberg** | `Parameters["table_type"] = "ICEBERG"` and `Parameters["metadata_location"]` pointing to Iceberg metadata JSON on S3. `StorageDescriptor.Columns` is typically empty. Growing rapidly. |
+
+A complete Glue integration must handle both table formats.
+
+### 2.2 How Query Engines Use Glue
+
+Trino and Spark both have native Glue support — they call the AWS Glue SDK 
directly, not via HMS Thrift:
+
+| Engine | Hive Tables on Glue | Iceberg Tables on Glue |
+|---|---|---|
+| **Trino** | Hive connector with `hive.metastore=glue` | Iceberg connector with `iceberg.catalog.type=glue` |
+| **Spark** | Hive catalog with `AWSGlueDataCatalogHiveClientFactory` | Iceberg catalog with `catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog` |
+
+Both engines use a **one-catalog-to-one-connector** model — a single catalog 
handles either Hive-format or Iceberg-format tables, not both. This is 
consistent with Gravitino's existing catalog model.
+
+### 2.3 Gravitino's Current Architecture
+
+Gravitino's catalog plugin system provides:
+- **Hive catalog** (`provider=hive`): Connects to HMS via Thrift. Client 
chain: `HiveCatalogOperations` → `CachedClientPool` → `HiveClientImpl` → 
`HiveShimV2/V3` → `IMetaStoreClient`.
+- **Iceberg catalog** (`provider=lakehouse-iceberg`): Supports pluggable 
backends (`catalog-backend=hive|jdbc|rest|memory|custom`). Each backend maps to 
a different Iceberg `Catalog` implementation.
+- **Trino/Spark connectors**: Property converters translate Gravitino catalog 
properties into engine-specific properties.
+
+## 3. Design Alternatives
+
+### Alternative A: New `catalog-glue` Module
+
+Create a standalone `catalogs/catalog-glue/` with its own 
`GlueCatalogOperations`, type converters, and entity classes. Directly call the 
AWS Glue SDK for both Hive and Iceberg tables.
+
+**Pros**: Full control over Glue-specific behavior. Single catalog for mixed 
table formats.
+**Cons**:
+- Duplicates logic already in Hive catalog (type conversion, partition 
handling, SerDe parsing) and Iceberg catalog (schema conversion, metadata 
loading).
+- Trino/Spark integration requires a "Composite Connector" that routes queries 
based on table type — a significant architectural change.
+- Larger implementation surface area and maintenance burden.
+
+### Alternative B: Glue as a Metastore Type (Chosen)
+
+Extend the existing Hive and Iceberg catalogs with Glue as a backend option.
+
+**Pros**:
+- Reuses all existing catalog logic, type conversion, property handling, and 
entity models.
+- Trino/Spark integration works almost for free — both engines already have 
native Glue support.
+- Much smaller change set (~15 files modified, 1 new file vs. ~15 new files).
+- Consistent with how Trino and Spark model Glue (as a metastore variant, not 
a separate catalog type).
+
+**Cons**:
+- Users must create two Gravitino catalogs to cover both Hive and Iceberg 
tables from the same Glue Data Catalog.
+- Cannot add Glue-only features (e.g., Glue crawlers) without extending the 
generic interfaces.
+
+**Decision**: Alternative B — the reuse benefits and Trino/Spark alignment 
outweigh the minor UX cost of two catalogs.
+
+## 4. Detailed Design
+
+### 4.1 Configuration Properties
+
+Gravitino already defines standardized AWS/S3 properties in 
`S3Properties.java`:
+
+| Existing Property | Used By |
+|---|---|
+| `s3-access-key-id` / `s3-secret-access-key` | Iceberg, Hive (S3 storage + Glue auth) |
+| `s3-region` | Iceberg, Hive (S3 storage + Glue region) |
+| `s3-role-arn` / `s3-external-id` | Iceberg, Hive (STS AssumeRole) |
+| `s3-endpoint` | Iceberg, Hive (custom S3 endpoint) |
+
+We **reuse `s3-region` as the default AWS region for both Glue and S3** and 
**reuse `s3-access-key-id` / `s3-secret-access-key` for authentication**. These 
properties already exist in `S3Properties.java` and are already handled by both 
the Hive and Iceberg catalogs — no new code is required for credential plumbing.
+
+Only two new Glue-specific properties are needed (prefixed with `aws-glue-` to 
clearly indicate they are AWS Glue Data Catalog settings, distinct from the 
generic `s3-` storage properties):
+
+| New Property | Required | Default | Description |

Review Comment:
   The `aws-glue-catalog-id` property is marked as optional, defaulting to the
caller's AWS account ID. It should be required, because an AWS account can have
multiple Glue catalogs, for example a default catalog and a federated S3 Tables
catalog. Without an explicit ID, there is no way to determine which catalog to
connect to.
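
A minimal sketch of what treating the property as required could look like. Only the `aws-glue-catalog-id` property name comes from the design document; the function name `resolve_glue_catalog_id` and the error text are hypothetical, and Python is used purely for brevity rather than to mirror Gravitino's Java code.

```python
# Property name taken from the design document under review.
GLUE_CATALOG_ID = "aws-glue-catalog-id"


def resolve_glue_catalog_id(properties: dict[str, str]) -> str:
    """Return the configured Glue catalog ID (hypothetical helper).

    Rejects a missing or blank value instead of silently falling back
    to the caller's AWS account ID.
    """
    catalog_id = (properties.get(GLUE_CATALOG_ID) or "").strip()
    if not catalog_id:
        raise ValueError(
            f"'{GLUE_CATALOG_ID}' is required: "
            "an AWS account can hold more than one Glue catalog"
        )
    return catalog_id
```

With a check like this, a catalog created with only `s3-region` set would fail at creation time rather than connect to whichever catalog the account default happens to be.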



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
