Re: [PR] Design: AWS Glue Data Catalog connector design document [gravitino]

via GitHub Wed, 25 Mar 2026 21:51:37 -0700


markhoerth commented on code in PR #10539:
URL: https://github.com/apache/gravitino/pull/10539#discussion_r2992452182



##########
design/aws-glue-catalog-connector.md:
##########
@@ -0,0 +1,592 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Design: AWS Glue Data Catalog Support for Apache Gravitino
+
+## 1. Problem Statement and Goals
+
+### 1.1 Problem
+
+**Gravitino currently cannot federate AWS Glue Data Catalog.** This is a 
significant gap because:
+
+1. **Large user base on AWS**: The majority of cloud-native data lakes run on 
AWS with Glue Data Catalog as the central metadata service (default for Athena, 
Redshift Spectrum, EMR, Lake Formation). These organizations cannot bring their 
Glue metadata into Gravitino's unified management layer.
+2. **No native integration path**: The only workaround is pointing Gravitino's 
Hive catalog at Glue's HMS-compatible Thrift endpoint (`metastore.uris = 
thrift://...`), which is undocumented, region-limited, and cannot leverage 
Glue-native features (catalog ID, cross-account access, VPC endpoints).
+3. **Competitive landscape**: Trino, Spark, and other engines all have 
first-class Glue support with dedicated configuration. Users expect the same 
from Gravitino.
+
+### 1.2 Goals
+
+After this feature is implemented:
+
+1. **Register AWS Glue Data Catalog in Gravitino**:
+   ```bash
+   # Hive-format tables
+   gcli catalog create --name hive_on_glue --provider hive \
+     --properties metastore-type=glue,s3-region=us-east-1
+
+   # Iceberg-format tables
+   gcli catalog create --name iceberg_on_glue --provider lakehouse-iceberg \
+     --properties 
catalog-backend=glue,warehouse=s3://bucket/iceberg,s3-region=us-east-1
+   ```
+
+2. **Standard Gravitino API works against Glue catalogs**:
+   ```bash
+   gcli schema list --catalog hive_on_glue
+   gcli table list --catalog hive_on_glue --schema my_database
+   gcli table details --catalog iceberg_on_glue --schema analytics --table 
events
+   ```
+
+3. **Trino and Spark connect transparently** — Trino uses 
`hive.metastore=glue` / `iceberg.catalog.type=glue`; Spark uses 
`AWSGlueDataCatalogHiveClientFactory` / `GlueCatalog`. Users query Glue tables 
through Gravitino without knowing the underlying mechanism.
+
+4. **AWS-native authentication** (reuses existing S3 properties): static 
credentials, STS AssumeRole, or default credential chain (environment 
variables, instance profile).
+
+## 2. Background

Review Comment:
   The design does not address access control or governance. Gravitino's full 
governance model should apply to Glue catalog contents just as it does for 
other catalogs — including properties, tags, policies, statistics, audit, and 
comments. At the same time, Glue has its own permission model through IAM and 
AWS Lake Formation. The design should clarify how Gravitino's governance layer 
interacts with AWS permissions — specifically, whether Gravitino RBAC is 
enforced on top of whatever AWS permissions allow, or whether there is a 
conflict between the two models.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Design: AWS Glue Data Catalog connector design document [gravitino]

Reply via email to