Thank you, Danny, for the additional insights on flink-connector-aws-base [1].

> It looks like localstack supports glue [2], we already use localstack for
> integration tests so we can follow suit here.

As GlueCatalog will be a part of flink-connector-aws-base, we will, as suggested, reuse code and resources as much as possible and add the extra pieces required in an extensible manner.

Bests,
Samrat

[1] https://github.com/apache/flink-connector-aws/tree/main/flink-connector-aws-base
[2] https://docs.localstack.cloud/user-guide/aws/glue/

On Tue, Dec 13, 2022 at 9:32 PM Danny Cranmer <dannycran...@apache.org> wrote:

> Hello Samrat,
>
> Sorry for the late response.
>
> +1 for a native Glue Data Catalog integration. We have internally developed
> a Glue Data Catalog catalog implementation that shims hive. We have been
> meaning to contribute, but this solution can replace our internal one.
>
> +1 for putting this in the flink-connector-aws. With regards to
> configuration, we have a flink-connector-aws-base [1] module where all the
> common configurations should go. Anything common, such as authentication
> providers, please use. Additionally any new configurations you need to add
> please consider them going into aws-base if they might be reusable for
> other AWS integrations.
>
> > We will create e2e integration test cases capturing all the
> > implementation in a mock environment.
>
> It looks like localstack supports glue [2], we already use localstack for
> integration tests so we can follow suit here.
>
> Thanks,
> Danny
>
> [1] https://github.com/apache/flink-connector-aws/tree/main/flink-connector-aws-base
> [2] https://docs.localstack.cloud/user-guide/aws/glue/
>
> On Mon, Dec 12, 2022 at 12:18 PM Samrat Deb <decordea...@gmail.com> wrote:
>
>> Hi Konstantin Knauf,
>>
>>> Can you explain how users are expected to authenticate with AWS Glue? I
>>> don't see any catalog options regarding authx. So I assume the credentials
>>> are taken from the environment?
>>
>> We are planning to put GlueCatalog in flink-connector-aws[1].
>> flink-connector-aws already provides base and already built AwsConfigs[2].
>> These configs can be reused for the catalog as well.
>> I will update FLIP-277[3] with the auth-related configs in the
>> Configuration section.
>>
>> Users can pass these values as part of the config at catalog creation;
>> if they are not provided, the catalog will try to fetch them from the
>> environment.
>> This will allow users to create multiple catalog instances in the same
>> session pointing to different accounts. (I haven't tested multi-account
>> Glue catalog instances during the POC.)
>>
>> [1] https://github.com/apache/flink-connector-aws
>> [2] https://github.com/apache/flink-connector-aws/blob/main/flink-connector-aws-base/src/main/java/org/apache/flink/connector/aws/config/AWSConfigConstants.java
>> [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>
>> Bests,
>> Samrat
>>
>> On Mon, Dec 12, 2022 at 5:32 PM Samrat Deb <decordea...@gmail.com> wrote:
>>
>>> Hi Jark,
>>> Apologies for the late reply.
>>> Thank you for your valuable input.
>>>
>>>> Besides, I have a question about Glue Namespace. Could you share the
>>>> documentation of the Glue Namespaces? (Sorry, I didn't find it.)
>>>> According to the "Flink Glue Metaspace Mapping" section, if there is a
>>>> database "mydb" under namespace "ns1", does that mean the database name
>>>> in Flink is "ns1.mydb"?
>>>
>>> There is no concept of namespace in the Glue Data Catalog.
>>> There are 3 levels in the Glue Data Catalog:
>>> - catalog
>>> - database
>>> - table
>>>
>>> I have added the mapping in FLIP-277[1] and updated it.
>>> A database name in Flink maps directly to a database name in Glue.
>>> Please ignore the typo left over in the doc previously.
>>>
>>> Best,
>>> Samrat
>>>
>>>
>>> On Fri, Dec 9, 2022 at 8:38 PM Jark Wu <imj...@gmail.com> wrote:
>>>
>>>> Hi Samrat,
>>>>
>>>> Thanks a lot for driving the new catalog, and sorry for jumping into the
>>>> discussion late.
>>>>
>>>> As Flink SQL is becoming the first-class citizen of the Flink API, we are
>>>> planning to push Catalog to become the first-class citizen of the
>>>> connector instead of Source & Sink. For Flink SQL users, using Catalog is
>>>> as natural and user-friendly as working with databases, rather than
>>>> having to define DDL and schemas over and over again. This is also how
>>>> Trino/Presto does it.
>>>>
>>>> Regarding the repo for the Glue catalog, I think we can add it to
>>>> flink-connector-aws. We don't need separate repos for Catalogs because
>>>> Catalog is a kind of connector (others are sources & sinks).
>>>> For example, MySqlCatalog[1] and PostgresCatalog[2] are in
>>>> flink-connector-jdbc, and HiveCatalog is in flink-connector-hive. This
>>>> can reduce repository maintenance, and I think maybe some common AWS
>>>> utils can be shared there. cc @Danny Cranmer <dannycran...@apache.org>,
>>>> what do you think about this?
>>>>
>>>> Besides, I have a question about Glue Namespace. Could you share the
>>>> documentation of the Glue Namespaces? (Sorry, I didn't find it.)
>>>> According to the "Flink Glue Metaspace Mapping" section, if there is a
>>>> database "mydb" under namespace "ns1", does that mean the database name
>>>> in Flink is "ns1.mydb"?
>>>>
>>>> Best,
>>>> Jark
>>>>
>>>>
>>>> [1]: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/MySqlCatalog.java
>>>> [2]: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/PostgresCatalog.java
>>>>
>>>> On Fri, 9 Dec 2022 at 08:51, Dong Lin <lindon...@gmail.com> wrote:
>>>>
>>>> > Hi Samrat,
>>>> >
>>>> > Sorry for the late reply. Yeah, I am referring to creating a similar
>>>> > external repo such as flink-catalog-glue. flink-connector-aws is already
>>>> > named with `connector` so it seems a bit weird to put a catalog there.
>>>> >
>>>> > Thanks!
>>>> > Dong
>>>> >
>>>> > On Wed, Dec 7, 2022 at 1:04 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>> >
>>>> > > Hi Dong Lin,
>>>> > >
>>>> > > > Since this is the first proposal for adding a vendor-specific catalog
>>>> > > > library in Flink, I think maybe we should also externalize those catalog
>>>> > > > libraries similar to how we are externalizing connector libraries. It is
>>>> > > > likely that we might want to add catalogs for other vendors in the future.
>>>> > > > Externalizing those catalogs can make Flink development more scalable in
>>>> > > > the long term.
>>>> > >
>>>> > > Initially I misinterpreted externalising the catalogs. There already
>>>> > > exists an externalised connector for AWS [1].
>>>> > > Are you referring to creating a similar external repo for catalogs, or will
>>>> > > it be better to add it in flink-connector-aws[1]?
>>>> > >
>>>> > > [1] https://github.com/apache/flink-connector-aws
>>>> > >
>>>> > > Samrat
>>>> > >
>>>> > > On Tue, Dec 6, 2022 at 6:52 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>> > >
>>>> > > > Hi Dong Lin,
>>>> > > >
>>>> > > > AWS Glue Data Catalog is vendor specific, and in future we will get
>>>> > > > implementations of this type from different providers. We should
>>>> > > > definitely externalize these catalog libraries, similar to Flink
>>>> > > > connectors.
>>>> > > > I am thinking of creating flink-catalog, similar to flink-connector,
>>>> > > > under the root (flink). The Glue catalog can be one of the modules under
>>>> > > > flink-catalog. Please suggest if there is a better structure we can
>>>> > > > create for catalogs.
>>>> > > >
>>>> > > >
>>>> > > >> It is mentioned in the FLIP that there will be two types of SdkHttpClient
>>>> > > >> supported based on the catalog option http-client.type. Is http-client.type
>>>> > > >> a public config for the GlueCatalog? If yes, can we add this config to the
>>>> > > >> "Configurations" section and explain how users should choose the client
>>>> > > >> type?
>>>> > > >
>>>> > > >
>>>> > > > Yes, http-client.type is a public config for the GlueCatalog. By default
>>>> > > > the client type will be `urlconnection` if the user doesn't specify any
>>>> > > > connection type.
>>>> > > > I have updated the FLIP-277[1] #configuration section with all the
>>>> > > > configs. Please review it again.
>>>> > > >
>>>> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>>> > > >
>>>> > > > Samrat
>>>> > > >
>>>> > > > On Tue, Dec 6, 2022 at 5:50 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>> > > >
>>>> > > >> Hi Yuxia,
>>>> > > >>
>>>> > > >> Thank you for reviewing the FLIP and putting forward your observations
>>>> > > >> and comments.
>>>> > > >>
>>>> > > >>> 1: I noticed there's a YAML part in the section of "Using the Catalog",
>>>> > > >>> what do you mean by that? Do you mean how to use glue catalog in sql
>>>> > > >>> client? If so, just for your information, it's not supported to use yaml
>>>> > > >>> environment file in sql client[2].
>>>> > > >>
>>>> > > >>
>>>> > > >> Thank you for attaching the Jira ticket [1]. I missed the changes.
>>>> > > >> There is a provision to register a catalog directly through factory
>>>> > > >> resources.
>>>> > > >> - GenericInMemoryCatalog is defined through
>>>> > > >> `flink/flink-table/flink-table-api-java/src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory`
>>>> > > >> - HiveCatalog is defined through
>>>> > > >> `flink-connectors/flink-connector-hive/src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory`
>>>> > > >> Similarly, we can define it in the vendor-specific module for AWS Glue.
>>>> > > >>
>>>> > > >>> 2: Seems there's a typo in "Design#views" part, it contains "listTables"
>>>> > > >>> which I think shouldn't be contained.
>>>> > > >>
>>>> > > >>
>>>> > > >> Oh yes 😅! Fixed it now, thanks for pointing it out.
>>>> > > >>
>>>> > > >>
>>>> > > >>> Also, I'm curious about how to list views using Glue API. Is there an
>>>> > > >>> on-hand api to list views directly or we need to list the tables and then
>>>> > > >>> filter the views using the table-kind?
>>>> > > >>
>>>> > > >>
>>>> > > >> Yes, there is no on-hand API to list views directly; we need to list all
>>>> > > >> tables and then filter the views based on the tableKind attribute, which
>>>> > > >> is part of the table object in the API response.
>>>> > > >>
>>>> > > >>
>>>> > > >>> 3: In "Flink Glue DataType Mapping" part, CharType is mapped to String.
>>>> > > >>> It seems the char's size will be lost. Is it possible to have a better
>>>> > > >>> mapping which won't lose the size of the char type?
>>>> > > >>
>>>> > > >>
>>>> > > >> Thanks for pointing this out! I have updated the FLIP with the correct
>>>> > > >> type. Initially I mapped char type and varchar type to string, but updated
>>>> > > >> them to map directly to the same type.
>>>> > > >>
>>>> > > >>
>>>> > > >>
>>>> > > >>> 4: About the "Flink CatalogFunction mapping with Glue Function" part,
>>>> > > >>> how do we map the function language in Flink's CatalogFunction.
>>>> > > >>
>>>> > > >>
>>>> > > >> The Glue API (UserDefinedFunctionInput) doesn't support a specific
>>>> > > >> attribute for function language. Here is how the AWS Hive-compatible
>>>> > > >> metastore maps a Hive function to a Glue function[2]. We will add a
>>>> > > >> language prefix to the function name itself to indicate the language. I
>>>> > > >> see this has already been done for the Hive Catalog [3]. We are thinking
>>>> > > >> of implementing it in the same way.
>>>> > > >>
>>>> > > >> [1] https://issues.apache.org/jira/browse/FLINK-22540
>>>> > > >> [2] https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-client-common/src/main/java/com/amazonaws/glue/catalog/converters/GlueInputConverter.java#L83
>>>> > > >> [3] https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/catalog/hive/HiveCatalog.java#L1415
>>>> > > >>
>>>> > > >> Samrat
>>>> > > >>
>>>> > > >> On Mon, Dec 5, 2022 at 4:33 PM Dong Lin <lindon...@gmail.com> wrote:
>>>> > > >>
>>>> > > >>> Hi Samrat,
>>>> > > >>>
>>>> > > >>> Thanks for the FLIP!
>>>> > > >>>
>>>> > > >>> Since this is the first proposal for adding a vendor-specific catalog
>>>> > > >>> library in Flink, I think maybe we should also externalize those catalog
>>>> > > >>> libraries similar to how we are externalizing connector libraries. It is
>>>> > > >>> likely that we might want to add catalogs for other vendors in the future.
>>>> > > >>> Externalizing those catalogs can make Flink development more scalable in
>>>> > > >>> the long term.
>>>> > > >>>
>>>> > > >>> It is mentioned in the FLIP that there will be two types of SdkHttpClient
>>>> > > >>> supported based on the catalog option http-client.type. Is http-client.type
>>>> > > >>> a public config for the GlueCatalog? If yes, can we add this config to the
>>>> > > >>> "Configurations" section and explain how users should choose the client
>>>> > > >>> type?
>>>> > > >>>
>>>> > > >>> Regards,
>>>> > > >>> Dong
>>>> > > >>>
>>>> > > >>>
>>>> > > >>> On Sat, Dec 3, 2022 at 12:31 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>> > > >>>
>>>> > > >>> > Hi everyone,
>>>> > > >>> >
>>>> > > >>> > I would like to open a discussion[1] on providing GlueCatalog support
>>>> > > >>> > in Flink.
>>>> > > >>> > Currently, Flink offers 3 major types of catalog[2], of which only
>>>> > > >>> > HiveCatalog is a persistent catalog backed by Hive Metastore. We would
>>>> > > >>> > like to introduce GlueCatalog in Flink, offering users another option
>>>> > > >>> > that is persistent in nature. AWS Glue Data Catalog is a centralized
>>>> > > >>> > data catalog in AWS cloud that provides integrations with many different
>>>> > > >>> > connectors[3].
>>>> > > >>> > Flink GlueCatalog can use the features provided by glue and
>>>> > > >>> > create strong integration with other services in the cloud.
>>>> > > >>> >
>>>> > > >>> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>>> > > >>> >
>>>> > > >>> > [2] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/catalogs/
>>>> > > >>> >
>>>> > > >>> > [3] https://docs.aws.amazon.com/glue/latest/dg/components-overview.html#data-catalog-intro
>>>> > > >>> >
>>>> > > >>> > [4] https://issues.apache.org/jira/browse/FLINK-29549
>>>> > > >>> >
>>>> > > >>> > Bests
>>>> > > >>> > Samrat
>>>> > > >>> >
>>>> > > >>>
>>>> > > >>
>>>> > >
>>>> >
>>>>
>>>
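
For reference, the view-listing approach described in the thread (list all tables, then keep only the entries whose table kind marks them as views) can be sketched roughly as below. This is only an illustration in plain Java, not the FLIP's implementation and not the AWS SDK: the TableEntry class, the listViews method, and the "VIEW" marker value are hypothetical stand-ins for whatever the Glue table object and its tableKind attribute actually look like.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ViewListingSketch {

    /** Hypothetical stand-in for one table object returned by a "list all tables" call. */
    public static final class TableEntry {
        final String name;
        final String tableKind; // assumed attribute; "VIEW" is a placeholder value

        public TableEntry(String name, String tableKind) {
            this.name = name;
            this.tableKind = tableKind;
        }
    }

    /** Keeps only the names of entries whose kind marks them as views. */
    public static List<String> listViews(List<TableEntry> allTables) {
        return allTables.stream()
                .filter(t -> "VIEW".equals(t.tableKind))
                .map(t -> t.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample data standing in for the response of a table listing call.
        List<TableEntry> tables = List.of(
                new TableEntry("orders", "TABLE"),
                new TableEntry("orders_daily", "VIEW"),
                new TableEntry("customers", "TABLE"));
        System.out.println(listViews(tables));
    }
}
```

Because the filter runs on the client side, every table in the database is still fetched once; a real implementation would also have to deal with paginated listing responses, which this sketch leaves out.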