Hi All,

Thank you for all your valuable suggestions and questions regarding the proposal.
If there are more queries or questions from the community, I will keep this
discussion thread open for a couple more days and then proceed with the next steps.

Bests,
Samrat

On Wed, Dec 14, 2022 at 9:41 PM Samrat Deb <decordea...@gmail.com> wrote:

> Thank you Danny for more insights on the flink-connector-aws-base [1].
>
>> It looks like localstack supports glue [2], we already use localstack for
>> integration tests so we can follow suit here.
>
> As GlueCatalog will be a part of flink-connector-aws-base, as per the
> suggestion we will reuse code and resources as much as possible and add
> the extra things required in an extensible manner.
>
> Bests,
> Samrat
>
> [1] https://github.com/apache/flink-connector-aws/tree/main/flink-connector-aws-base
> [2] https://docs.localstack.cloud/user-guide/aws/glue/
>
> On Tue, Dec 13, 2022 at 9:32 PM Danny Cranmer <dannycran...@apache.org> wrote:
>
>> Hello Samrat,
>>
>> Sorry for the late response.
>>
>> +1 for a native Glue Data Catalog integration. We have internally
>> developed a Glue Data Catalog catalog implementation that shims hive. We
>> have been meaning to contribute, but this solution can replace our
>> internal one.
>>
>> +1 for putting this in the flink-connector-aws. With regards to
>> configuration, we have a flink-connector-aws-base [1] module where all the
>> common configurations should go. Anything common, such as authentication
>> providers, please use. Additionally, any new configurations you need to add,
>> please consider putting them into aws-base if they might be reusable for
>> other AWS integrations.
>>
>> > We will create an e2e integration test cases capturing all the
>> > implementation in a mock environment.
>>
>> It looks like localstack supports glue [2], we already use localstack for
>> integration tests so we can follow suit here.
>>
>> Thanks,
>> Danny
>>
>> [1] https://github.com/apache/flink-connector-aws/tree/main/flink-connector-aws-base
>> [2] https://docs.localstack.cloud/user-guide/aws/glue/
>>
>> On Mon, Dec 12, 2022 at 12:18 PM Samrat Deb <decordea...@gmail.com> wrote:
>>
>>> Hi Konstantin Knauf,
>>>
>>>> Can you explain how users are expected to authenticate with AWS Glue? I
>>>> don't see any catalog options regarding authx. So I assume the credentials
>>>> are taken from the environment?
>>>
>>> We are planning to put GlueCatalog in flink-connector-aws [1].
>>> flink-connector-aws already provides a base module and prebuilt AwsConfigs [2].
>>> These configs can be reused for the Catalog as well.
>>> I will update FLIP-277 [3] with the auth related configs in the
>>> Configuration section.
>>>
>>> Users can pass these values as a part of the config at catalog creation, and
>>> if they are not provided the catalog will try to fetch them from the environment.
>>> This will allow users to create multiple catalog instances in the same
>>> session pointing to different accounts. (I haven't tested multi-account
>>> glue catalog instances during the POC.)
>>>
>>> [1] https://github.com/apache/flink-connector-aws
>>> [2] https://github.com/apache/flink-connector-aws/blob/main/flink-connector-aws-base/src/main/java/org/apache/flink/connector/aws/config/AWSConfigConstants.java
>>> [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>>
>>> Bests,
>>> Samrat
>>>
>>> On Mon, Dec 12, 2022 at 5:32 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>
>>>> Hi Jark,
>>>> Apologies for the late reply.
>>>> Thank you for your valuable input.
>>>>
>>>>> Besides, I have a question about Glue Namespace. Could you share the
>>>>> documentation of the Glue Namespaces? (Sorry, I didn't find it.)
>>>>> According to the "Flink Glue Metaspace Mapping" section,
>>>>> if there is a database "mydb" under namespace "ns1", does that mean the
>>>>> database name in Flink is "ns1.mydb"?
>>>>
>>>> There is no concept of namespace in glue data catalog.
>>>> There are 3 levels in glue data catalog:
>>>> - catalog
>>>> - database
>>>> - table
>>>>
>>>> I have updated FLIP-277 [1] with the mapping.
>>>> A database name in Flink maps directly to a database name in Glue.
>>>> Please ignore the typo previously left over in the doc.
>>>>
>>>> Best,
>>>> Samrat
>>>>
>>>> On Fri, Dec 9, 2022 at 8:38 PM Jark Wu <imj...@gmail.com> wrote:
>>>>
>>>>> Hi Samrat,
>>>>>
>>>>> Thanks a lot for driving the new catalog, and sorry for jumping into the
>>>>> discussion late.
>>>>>
>>>>> As Flink SQL is becoming the first-class citizen of the Flink API, we are
>>>>> planning to push Catalog to become the first-class citizen of the
>>>>> connector instead of Source & Sink. For Flink SQL users, using Catalog is
>>>>> as natural and user-friendly as working with databases, rather than
>>>>> having to define DDL and schemas over and over again. This is also how
>>>>> Trino/Presto does it.
>>>>>
>>>>> Regarding the repo for the Glue catalog, I think we can add it to
>>>>> flink-connector-aws. We don't need separate repos for Catalogs because
>>>>> Catalog is a kind of connector (others are sources & sinks).
>>>>> For example, MySqlCatalog[1] and PostgresCatalog[2] are in
>>>>> flink-connector-jdbc, and HiveCatalog is in flink-connector-hive. This can
>>>>> reduce repository maintenance, and I think maybe some common
>>>>> AWS utils can be shared there. cc @Danny Cranmer <dannycran...@apache.org>
>>>>> what do you think about this?
>>>>>
>>>>> Besides, I have a question about Glue Namespace. Could you share the
>>>>> documentation of the Glue Namespaces? (Sorry, I didn't find it.)
>>>>> According to the "Flink Glue Metaspace Mapping" section,
>>>>> if there is a database "mydb" under namespace "ns1", does that mean the
>>>>> database name in Flink is "ns1.mydb"?
>>>>>
>>>>> Best,
>>>>> Jark
>>>>>
>>>>> [1]: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/MySqlCatalog.java
>>>>> [2]: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/PostgresCatalog.java
>>>>>
>>>>> On Fri, 9 Dec 2022 at 08:51, Dong Lin <lindon...@gmail.com> wrote:
>>>>>
>>>>> > Hi Samrat,
>>>>> >
>>>>> > Sorry for the late reply. Yeah I am referring to creating a similar
>>>>> > external repo such as flink-catalog-glue. flink-connector-aws is already
>>>>> > named with `connector` so it seems a bit weird to put a catalog there.
>>>>> >
>>>>> > Thanks!
>>>>> > Dong
>>>>> >
>>>>> > On Wed, Dec 7, 2022 at 1:04 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>>> >
>>>>> > > Hi Dong Lin,
>>>>> > >
>>>>> > > > Since this is the first proposal for adding a vendor-specific catalog
>>>>> > > > library in Flink, I think maybe we should also externalize those catalog
>>>>> > > > libraries similar to how we are externalizing connector libraries. It is
>>>>> > > > likely that we might want to add catalogs for other vendors in the future.
>>>>> > > > Externalizing those catalogs can make Flink development more scalable in
>>>>> > > > the long term.
>>>>> > >
>>>>> > > Initially I misinterpreted externalising the catalogs. There already
>>>>> > > exists an externalised connector for aws [1].
>>>>> > > Are you referring to creating a similar external repo for catalogs, or
>>>>> > > will it be better to add it in flink-connector-aws [1]?
>>>>> > >
>>>>> > > [1] https://github.com/apache/flink-connector-aws
>>>>> > >
>>>>> > > Samrat
>>>>> > >
>>>>> > > On Tue, Dec 6, 2022 at 6:52 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>>> > >
>>>>> > > > Hi Dong Lin,
>>>>> > > >
>>>>> > > > Aws Glue Data catalog is vendor specific, and in future we will get this
>>>>> > > > type of implementation from different providers. We should
>>>>> > > > definitely externalize these catalog libraries similar to flink connectors.
>>>>> > > > I am thinking of creating flink-catalog, similar to flink-connector, under
>>>>> > > > the root (flink). The glue catalog can be one of the modules under
>>>>> > > > flink-catalog. Please suggest if there is a better structure we can
>>>>> > > > create for catalogs.
>>>>> > > >
>>>>> > > >> It is mentioned in the FLIP that there will be two types of SdkHttpClient
>>>>> > > >> supported based on the catalog option http-client.type. Is http-client.type
>>>>> > > >> a public config for the GlueCatalog? If yes, can we add this config to the
>>>>> > > >> "Configurations" section and explain how users should choose the client
>>>>> > > >> type?
>>>>> > > >
>>>>> > > > Yes, http-client.type is a public config for the GlueCatalog. By default the
>>>>> > > > client type will be `urlconnection` if the user doesn't specify any
>>>>> > > > connection type.
>>>>> > > > I have updated the FLIP-277 [1] #configuration section with all the configs.
>>>>> > > > Please review it again.
>>>>> > > >
>>>>> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>>>> > > >
>>>>> > > > Samrat
>>>>> > > >
>>>>> > > > On Tue, Dec 6, 2022 at 5:50 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>>> > > >
>>>>> > > >> Hi Yuxia,
>>>>> > > >>
>>>>> > > >> Thank you for reviewing the flip and putting forward your observations
>>>>> > > >> and comments.
>>>>> > > >>
>>>>> > > >>> 1: I noticed there's a YAML part in the section of "Using the Catalog",
>>>>> > > >>> what do you mean by that? Do you mean how to use glue catalog in sql
>>>>> > > >>> client? If so, just for your information, it's not supported to use yaml
>>>>> > > >>> environment file in sql client[2].
>>>>> > > >>
>>>>> > > >> Thank you for attaching the jira ticket [1]. I missed the changes.
>>>>> > > >> There is a provision to register a catalog directly through factory resources.
>>>>> > > >> - GenericInMemoryCatalog is defined through
>>>>> > > >>   `flink/flink-table/flink-table-api-java/src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory`
>>>>> > > >> - HiveCatalog is defined through path
>>>>> > > >>   `flink-connectors/flink-connector-hive/src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory`
>>>>> > > >> Similarly, we can define it in the vendor specific module for Aws Glue.
>>>>> > > >>
>>>>> > > >>> 2: Seems there's a typo in "Design#views" part, it contains "listTables"
>>>>> > > >>> which I think shouldn't be contained.
>>>>> > > >>
>>>>> > > >> Oh yes 😅! Fixed it now, thanks for pointing it out.
>>>>> > > >>
>>>>> > > >>> Also, I'm curious about how to list views using Glue API. Is there an
>>>>> > > >>> on-hand api to list views directly or we need to list the tables and then
>>>>> > > >>> filter the views using the table-kind?
>>>>> > > >>
>>>>> > > >> Yes, there is no on-hand api to list views directly; we need to list all
>>>>> > > >> tables and then filter the views based on the attribute tableKind, which is
>>>>> > > >> a part of the table object in the api response.
>>>>> > > >>
>>>>> > > >>> 3: In "Flink Glue DataType Mapping" part, CharType is mapped to String.
>>>>> > > >>> It seems the char's size will be lost, is it possible to have a better
>>>>> > > >>> mapping which won't lose the size of char type?
>>>>> > > >>
>>>>> > > >> Thanks for pointing this out! I have updated the flip with the correct
>>>>> > > >> type. Initially I mapped chartype and varchar type to string, but updated
>>>>> > > >> it to directly map to the same type.
>>>>> > > >>
>>>>> > > >>> 4: About the "Flink CatalogFunction mapping with Glue Function" part,
>>>>> > > >>> how do we map the function language in Flink's CatalogFunction.
>>>>> > > >>
>>>>> > > >> Glue Api (UserDefinedFunctionInput) doesn't support a specific attribute
>>>>> > > >> for function language. Here is how the aws hive compatible metastore maps
>>>>> > > >> a hive function to a glue function[2]. We will add a language prefix to
>>>>> > > >> the function name itself to indicate the language. I see this has
>>>>> > > >> already been done for the Hive Catalog [3]. We are thinking of implementing
>>>>> > > >> it in the same way.
>>>>> > > >>
>>>>> > > >> [1] https://issues.apache.org/jira/browse/FLINK-22540
>>>>> > > >> [2] https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-client-common/src/main/java/com/amazonaws/glue/catalog/converters/GlueInputConverter.java#L83
>>>>> > > >> [3] https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/catalog/hive/HiveCatalog.java#L1415
>>>>> > > >>
>>>>> > > >> Samrat
>>>>> > > >>
>>>>> > > >> On Mon, Dec 5, 2022 at 4:33 PM Dong Lin <lindon...@gmail.com> wrote:
>>>>> > > >>
>>>>> > > >>> Hi Samrat,
>>>>> > > >>>
>>>>> > > >>> Thanks for the FLIP!
>>>>> > > >>>
>>>>> > > >>> Since this is the first proposal for adding a vendor-specific catalog
>>>>> > > >>> library in Flink, I think maybe we should also externalize those catalog
>>>>> > > >>> libraries similar to how we are externalizing connector libraries. It is
>>>>> > > >>> likely that we might want to add catalogs for other vendors in the future.
>>>>> > > >>> Externalizing those catalogs can make Flink development more scalable in
>>>>> > > >>> the long term.
>>>>> > > >>>
>>>>> > > >>> It is mentioned in the FLIP that there will be two types of SdkHttpClient
>>>>> > > >>> supported based on the catalog option http-client.type. Is http-client.type
>>>>> > > >>> a public config for the GlueCatalog? If yes, can we add this config to
>>>>> > > >>> the "Configurations" section and explain how users should choose the
>>>>> > > >>> client type?
>>>>> > > >>>
>>>>> > > >>> Regards,
>>>>> > > >>> Dong
>>>>> > > >>>
>>>>> > > >>> On Sat, Dec 3, 2022 at 12:31 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>>> > > >>>
>>>>> > > >>> > Hi everyone,
>>>>> > > >>> >
>>>>> > > >>> > I would like to open a discussion[1] on providing GlueCatalog support
>>>>> > > >>> > in Flink.
>>>>> > > >>> > Currently, Flink offers 3 major types of catalog[2], of which only
>>>>> > > >>> > HiveCatalog is a persistent catalog backed by Hive Metastore. We would
>>>>> > > >>> > like to introduce GlueCatalog in Flink, offering users another option
>>>>> > > >>> > which will be persistent in nature. Aws Glue data catalog is a centralized
>>>>> > > >>> > data catalog in AWS cloud that provides integrations with many different
>>>>> > > >>> > connectors[3]. Flink GlueCatalog can use the features provided by glue
>>>>> > > >>> > and create strong integration with other services in the cloud.
>>>>> > > >>> >
>>>>> > > >>> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>>>> > > >>> > [2] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/catalogs/
>>>>> > > >>> > [3] https://docs.aws.amazon.com/glue/latest/dg/components-overview.html#data-catalog-intro
>>>>> > > >>> > [4] https://issues.apache.org/jira/browse/FLINK-29549
>>>>> > > >>> >
>>>>> > > >>> > Bests
>>>>> > > >>> > Samrat
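
The thread describes listing views by fetching all tables and then filtering on a table-kind attribute, since there is no direct call for views. A minimal sketch of that filtering step is below. It makes no AWS SDK calls: `TableEntry`, its `table_kind` field, and the "TABLE"/"VIEW" values are hypothetical stand-ins for whatever the table object in the API response actually carries, and the real attribute name and values should be checked against the Glue API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TableEntry:
    """Hypothetical stand-in for one table object from a list-tables response."""
    name: str
    table_kind: str  # assumed values for illustration: "TABLE" or "VIEW"


def list_views(tables: Iterable[TableEntry]) -> List[str]:
    """Return the names of the entries whose kind marks them as views.

    Mirrors the approach from the thread: list everything, then filter.
    """
    return [t.name for t in tables if t.table_kind == "VIEW"]


# Illustrative data only; not taken from any real catalog.
tables = [
    TableEntry("orders", "TABLE"),
    TableEntry("orders_daily", "VIEW"),
    TableEntry("customers", "TABLE"),
]
views = list_views(tables)
```

The filter runs client-side over the full table listing, so in a real implementation its cost would grow with the number of tables in the database rather than the number of views.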