Hello Samrat,

Sorry for the late response.

+1 for a native Glue Data Catalog integration. We have internally developed a Glue Data Catalog implementation that shims Hive. We have been meaning to contribute it, but this solution can replace our internal one.

+1 for putting this in flink-connector-aws. With regards to configuration, we have a flink-connector-aws-base [1] module where all the common configurations should go. Please reuse anything common, such as authentication providers. Additionally, for any new configurations you need to add, please consider putting them into aws-base if they might be reusable for other AWS integrations.

> We will create e2e integration test cases capturing all the implementation in a mock environment.

It looks like LocalStack supports Glue [2]. We already use LocalStack for integration tests, so we can follow suit here.

Thanks,
Danny

[1] https://github.com/apache/flink-connector-aws/tree/main/flink-connector-aws-base
[2] https://docs.localstack.cloud/user-guide/aws/glue/

On Mon, Dec 12, 2022 at 12:18 PM Samrat Deb <decordea...@gmail.com> wrote:

> Hi Konstantin Knauf,
>
>> Can you explain how users are expected to authenticate with AWS Glue? I don't see any catalog options regarding authx. So I assume the credentials are taken from the environment?
>
> We are planning to put GlueCatalog in flink-connector-aws [1]. flink-connector-aws already provides a base module with ready-made AWS configs [2]. These configs can be reused for the catalog as well. I will update FLIP-277 [3] with the auth-related configs in the Configuration section.
>
> Users can pass these values as part of the config when creating the catalog; if they are not provided, the catalog will try to fetch them from the environment. This will allow users to create multiple catalog instances in the same session pointing to different accounts. (I haven't tested multi-account Glue catalog instances during the POC.)
>
> [1] https://github.com/apache/flink-connector-aws
> [2] https://github.com/apache/flink-connector-aws/blob/main/flink-connector-aws-base/src/main/java/org/apache/flink/connector/aws/config/AWSConfigConstants.java
> [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>
> Bests,
> Samrat
>
> On Mon, Dec 12, 2022 at 5:32 PM Samrat Deb <decordea...@gmail.com> wrote:
>
>> Hi Jark,
>>
>> Apologies for the late reply. Thank you for your valuable input.
>>
>>> Besides, I have a question about Glue Namespace. Could you share the documentation of the Glue Namespaces? (Sorry, I didn't find it.) According to the "Flink Glue Metaspace Mapping" section, if there is a database "mydb" under namespace "ns1", does that mean the database name in Flink is "ns1.mydb"?
>>
>> There is no concept of a namespace in the Glue Data Catalog. There are three levels in the Glue Data Catalog:
>> - catalog
>> - database
>> - table
>>
>> I have added the mapping to FLIP-277 [1] and updated it. A database name in Flink maps directly to a database name in Glue. Please ignore the typo previously left over in the doc.
>>
>> Best,
>> Samrat
>>
>> On Fri, Dec 9, 2022 at 8:38 PM Jark Wu <imj...@gmail.com> wrote:
>>
>>> Hi Samrat,
>>>
>>> Thanks a lot for driving the new catalog, and sorry for jumping into the discussion late.
>>>
>>> As Flink SQL is becoming the first-class citizen of the Flink API, we are planning to push Catalog to become the first-class citizen of the connector instead of Source & Sink. For Flink SQL users, using Catalog is as natural and user-friendly as working with databases, rather than having to define DDL and schemas over and over again. This is also how Trino/Presto does it.
>>>
>>> Regarding the repo for the Glue catalog, I think we can add it to flink-connector-aws. We don't need separate repos for catalogs because a catalog is a kind of connector (the others are sources and sinks). For example, MySqlCatalog [1] and PostgresCatalog [2] are in flink-connector-jdbc, and HiveCatalog is in flink-connector-hive. This can reduce repository maintenance, and I think some common AWS utils could be shared there. cc @Danny Cranmer <dannycran...@apache.org>, what do you think about this?
>>>
>>> Besides, I have a question about Glue Namespace. Could you share the documentation of the Glue Namespaces? (Sorry, I didn't find it.) According to the "Flink Glue Metaspace Mapping" section, if there is a database "mydb" under namespace "ns1", does that mean the database name in Flink is "ns1.mydb"?
>>>
>>> Best,
>>> Jark
>>>
>>> [1]: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/MySqlCatalog.java
>>> [2]: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/PostgresCatalog.java
>>>
>>> On Fri, 9 Dec 2022 at 08:51, Dong Lin <lindon...@gmail.com> wrote:
>>>
>>> > Hi Samrat,
>>> >
>>> > Sorry for the late reply. Yeah, I am referring to creating a similar external repo such as flink-catalog-glue. flink-connector-aws is already named with `connector`, so it seems a bit weird to put a catalog there.
>>> >
>>> > Thanks!
>>> > Dong
>>> >
>>> > On Wed, Dec 7, 2022 at 1:04 PM Samrat Deb <decordea...@gmail.com> wrote:
>>> >
>>> > > Hi Dong Lin,
>>> > >
>>> > > > Since this is the first proposal for adding a vendor-specific catalog library in Flink, I think maybe we should also externalize those catalog libraries similar to how we are externalizing connector libraries.
>>> > > > It is likely that we might want to add catalogs for other vendors in the future. Externalizing those catalogs can make Flink development more scalable in the long term.
>>> > >
>>> > > Initially I misinterpreted "externalizing the catalogs". There already exists an externalized connector for AWS [1]. Are you referring to creating a similar external repo for catalogs, or would it be better to add it to flink-connector-aws [1]?
>>> > >
>>> > > [1] https://github.com/apache/flink-connector-aws
>>> > >
>>> > > Samrat
>>> > >
>>> > > On Tue, Dec 6, 2022 at 6:52 PM Samrat Deb <decordea...@gmail.com> wrote:
>>> > >
>>> > > > Hi Dong Lin,
>>> > > >
>>> > > > The AWS Glue Data Catalog is vendor-specific, and in the future we will get this type of implementation from different providers. We should definitely externalize these catalog libraries, similar to the Flink connectors. I am thinking of creating flink-catalog, similar to flink-connector, under the root (flink). The Glue catalog can be one of the modules under flink-catalog. Please suggest if there is a better structure we can create for catalogs.
>>> > > >
>>> > > >> It is mentioned in the FLIP that there will be two types of SdkHttpClient supported based on the catalog option http-client.type. Is http-client.type a public config for the GlueCatalog? If yes, can we add this config to the "Configurations" section and explain how users should choose the client type?
>>> > > >
>>> > > > Yes, http-client.type is a public config for the GlueCatalog. By default the client type will be `urlconnection` if the user doesn't specify any connection type.
>>> > > > I have updated the Configuration section of FLIP-277 [1] with all the configs. Please review it again.
>>> > > >
>>> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>> > > >
>>> > > > Samrat
>>> > > >
>>> > > > On Tue, Dec 6, 2022 at 5:50 PM Samrat Deb <decordea...@gmail.com> wrote:
>>> > > >
>>> > > >> Hi Yuxia,
>>> > > >>
>>> > > >> Thank you for reviewing the FLIP and putting forward your observations and comments.
>>> > > >>
>>> > > >>> 1: I noticed there's a YAML part in the section of "Using the Catalog", what do you mean by that? Do you mean how to use glue catalog in sql client? If so, just for your information, it's not supported to use yaml environment file in sql client[2].
>>> > > >>
>>> > > >> Thank you for attaching the Jira ticket [1]. I missed those changes. There is a provision to register a catalog directly through factory resources:
>>> > > >> - GenericInMemoryCatalog is defined through `flink/flink-table/flink-table-api-java/src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory`
>>> > > >> - HiveCatalog is defined through `flink-connectors/flink-connector-hive/src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory`
>>> > > >> Similarly, we can define it in the vendor-specific module for AWS Glue.
>>> > > >>
>>> > > >>> 2: Seems there's a typo in "Design#views" part, it contains "listTables" which I think shouldn't be contained.
>>> > > >>
>>> > > >> Oh yes 😅! Fixed it now, thanks for pointing it out.
>>> > > >>
>>> > > >>> Also, I'm curious about how to list views using Glue API.
>>> > > >>> Is there an on-hand API to list views directly, or do we need to list the tables and then filter the views using the table kind?
>>> > > >>
>>> > > >> Yes, there is no on-hand API to list views directly. We need to list all tables and then filter the views based on the tableKind attribute, which is part of the table object in the API response.
>>> > > >>
>>> > > >>> 3: In "Flink Glue DataType Mapping" part, CharType is mapped to String. It seems the char's size will be lost. Is it possible to have a better mapping which won't lose the size of the char type?
>>> > > >>
>>> > > >> Thanks for pointing this out! I have updated the FLIP with the correct type. Initially I mapped the char and varchar types to string, but I have updated them to map directly to the same types.
>>> > > >>
>>> > > >>> 4: About the "Flink CatalogFunction mapping with Glue Function" part, how do we map the function language in Flink's CatalogFunction.
>>> > > >>
>>> > > >> The Glue API (UserDefinedFunctionInput) doesn't support a specific attribute for the function language. Here is how the AWS Hive-compatible metastore maps a Hive function to a Glue function [2]. We will add a language prefix to the function name itself to indicate the language. I see this has already been done for the Hive catalog [3]. We are thinking of implementing it in the same way.
>>> > > >>
>>> > > >> [1] https://issues.apache.org/jira/browse/FLINK-22540
>>> > > >> [2] https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-client-common/src/main/java/com/amazonaws/glue/catalog/converters/GlueInputConverter.java#L83
>>> > > >> [3] https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/catalog/hive/HiveCatalog.java#L1415
>>> > > >>
>>> > > >> Samrat
>>> > > >>
>>> > > >> On Mon, Dec 5, 2022 at 4:33 PM Dong Lin <lindon...@gmail.com> wrote:
>>> > > >>
>>> > > >>> Hi Samrat,
>>> > > >>>
>>> > > >>> Thanks for the FLIP!
>>> > > >>>
>>> > > >>> Since this is the first proposal for adding a vendor-specific catalog library in Flink, I think maybe we should also externalize those catalog libraries similar to how we are externalizing connector libraries. It is likely that we might want to add catalogs for other vendors in the future. Externalizing those catalogs can make Flink development more scalable in the long term.
>>> > > >>>
>>> > > >>> It is mentioned in the FLIP that there will be two types of SdkHttpClient supported based on the catalog option http-client.type. Is http-client.type a public config for the GlueCatalog? If yes, can we add this config to the "Configurations" section and explain how users should choose the client type?
>>> > > >>>
>>> > > >>> Regards,
>>> > > >>> Dong
>>> > > >>>
>>> > > >>> On Sat, Dec 3, 2022 at 12:31 PM Samrat Deb <decordea...@gmail.com> wrote:
>>> > > >>>
>>> > > >>> > Hi everyone,
>>> > > >>> >
>>> > > >>> > I would like to open a discussion [1] on providing GlueCatalog support in Flink. Currently, Flink offers 3 major types of catalog [2], of which only HiveCatalog is a persistent catalog, backed by the Hive Metastore. We would like to introduce GlueCatalog in Flink, offering users another option that is persistent in nature. The AWS Glue Data Catalog is a centralized data catalog in the AWS cloud that provides integrations with many different connectors [3]. Flink's GlueCatalog can use the features provided by Glue and create strong integration with other services in the cloud.
>>> > > >>> >
>>> > > >>> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>> > > >>> > [2] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/catalogs/
>>> > > >>> > [3] https://docs.aws.amazon.com/glue/latest/dg/components-overview.html#data-catalog-intro
>>> > > >>> > [4] https://issues.apache.org/jira/browse/FLINK-29549
>>> > > >>> >
>>> > > >>> > Bests
>>> > > >>> > Samrat
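
The view-listing approach described in the reply to Yuxia (list all tables, then filter by table kind) could be sketched roughly as follows. This is a minimal Python sketch over plain dictionaries, not Glue SDK code and not the FLIP's implementation: the `Name` and `TableType` keys, the `VIRTUAL_VIEW` marker, and the `list_views` helper are assumptions made for illustration and should be checked against the actual Glue API response.

```python
from typing import Dict, Iterable, List

# Assumption: each table record exposes its kind under "TableType" and
# views are marked "VIRTUAL_VIEW". Both names are illustrative only.
VIEW_KIND = "VIRTUAL_VIEW"


def list_views(tables: Iterable[Dict[str, str]]) -> List[str]:
    """Return the names of all records whose table kind marks them as views."""
    return [t["Name"] for t in tables if t.get("TableType") == VIEW_KIND]


# Hypothetical stand-in for the table objects a list-tables call would return.
sample_tables = [
    {"Name": "orders", "TableType": "EXTERNAL_TABLE"},
    {"Name": "orders_daily", "TableType": "VIRTUAL_VIEW"},
    {"Name": "customers"},  # kind missing: treated as "not a view"
]

views = list_views(sample_tables)
print(views)
```

A real implementation would also have to page through the full table listing before filtering, since the filter only sees the records it is given.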