Thank you, Danny, for the additional insights on flink-connector-aws-base [1].

> It looks like localstack supports glue [2], we already use localstack for
> integration tests so we can follow suit here.


Since GlueCatalog will be part of flink-connector-aws-base, we will, as
suggested, reuse existing code and resources as much as possible and add
anything extra that is required in an extensible manner.
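
As a rough illustration of that reuse (a sketch only, not the planned
implementation): the catalog could read a shared AWS option first and fall
back to the environment when it is not set, as discussed earlier in the
thread. The class name GlueCatalogOptionsSketch and its method are
hypothetical, and the option key `aws.region` is an assumption modelled on
the shared AWS config constants; AWS_REGION is the usual environment
variable for the region.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of flink-connector-aws. */
final class GlueCatalogOptionsSketch {

    // Assumed option key, modelled on the shared AWS config constants.
    static final String REGION_OPTION = "aws.region";
    // Environment variable commonly used to carry the AWS region.
    static final String REGION_ENV = "AWS_REGION";

    private GlueCatalogOptionsSketch() {}

    /**
     * Returns the region from the catalog options when it is set and not blank,
     * otherwise whatever the supplied environment map provides (possibly nothing).
     */
    static Optional<String> resolveRegion(
            Map<String, String> catalogOptions, Map<String, String> environment) {
        String explicit = catalogOptions.get(REGION_OPTION);
        if (explicit != null && !explicit.trim().isEmpty()) {
            return Optional.of(explicit.trim());
        }
        return Optional.ofNullable(environment.get(REGION_ENV));
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        // An explicit catalog option takes precedence over the environment.
        System.out.println(resolveRegion(Map.of(REGION_OPTION, "eu-west-1"), env));
        // Without the option, the result depends on the process environment.
        System.out.println(resolveRegion(Map.of(), env));
    }
}
```

Passing the environment in as a map keeps the fallback easy to test, and
two catalog instances with different options would resolve independently.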

Bests,
Samrat


[1]
https://github.com/apache/flink-connector-aws/tree/main/flink-connector-aws-base
[2] https://docs.localstack.cloud/user-guide/aws/glue/




On Tue, Dec 13, 2022 at 9:32 PM Danny Cranmer <dannycran...@apache.org>
wrote:

> Hello Samrat,
>
> Sorry for the late response.
>
> +1 for a native Glue Data Catalog integration. We have
> internally developed a Glue Data Catalog catalog implementation that shims
> hive. We have been meaning to contribute, but this solution can replace our
> internal one.
>
> +1 for putting this in the flink-connector-aws. With regards to
> configuration, we have a flink-connector-aws-base [1] module where all the
> common configurations should go. Please reuse anything common, such as the
> authentication providers. Additionally, if any new configurations you need
> to add might be reusable for other AWS integrations, please consider
> putting them into aws-base.
>
> > We will create e2e integration test cases capturing all the
> > implementation in a mock environment.
>
> It looks like localstack supports glue [2], we already use localstack for
> integration tests so we can follow suit here.
>
> Thanks,
> Danny
>
> [1]
> https://github.com/apache/flink-connector-aws/tree/main/flink-connector-aws-base
> [2] https://docs.localstack.cloud/user-guide/aws/glue/
>
> On Mon, Dec 12, 2022 at 12:18 PM Samrat Deb <decordea...@gmail.com> wrote:
>
>> Hi Konstantin Knauf,
>>
>>> Can you explain how users are expected to authenticate with AWS Glue? I
>>> don't see any catalog options regarding authx. So I assume the credentials
>>> are taken from the environment?
>>
>>
>> We are planning to put GlueCatalog in flink-connector-aws [1].
>> flink-connector-aws already provides a base module with ready-made AWS
>> configs [2], which can be reused for the catalog as well.
>> I will update FLIP-277 [3] with the auth-related configs in the
>> Configuration section.
>>
>> Users can pass these values as part of the config when creating the catalog;
>> if they are not provided, the catalog will try to fetch them from the
>> environment. This will allow users to create multiple catalog instances in
>> the same session pointing to different accounts. (I haven't tested
>> multi-account Glue catalog instances during the POC.)
>>
>> [1] https://github.com/apache/flink-connector-aws
>> [2]
>> https://github.com/apache/flink-connector-aws/blob/main/flink-connector-aws-base/src/main/java/org/apache/flink/connector/aws/config/AWSConfigConstants.java
>> [3]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>
>> Bests,
>> Samrat
>>
>> On Mon, Dec 12, 2022 at 5:32 PM Samrat Deb <decordea...@gmail.com> wrote:
>>
>>> Hi Jark,
>>> Apologies for the late reply.
>>> Thank you for your valuable input.
>>>
>>>> Besides, I have a question about Glue Namespace. Could you share the
>>>> documentation of the Glue Namespaces? (Sorry, I didn't find it.) According
>>>> to the "Flink Glue Metaspace Mapping" section, if there is a database
>>>> "mydb" under namespace "ns1", does that mean the database name in Flink
>>>> is "ns1.mydb"?
>>>
>>> There is no concept of a namespace in the Glue Data Catalog.
>>> It has three levels:
>>> - catalog
>>> - database
>>> - table
>>>
>>> I have updated the mapping in FLIP-277 [1]: a Flink database name maps
>>> directly to a Glue database name.
>>> Please ignore the typo that was left over in the doc previously.
>>>
>>> Best,
>>> Samrat
>>>
>>>
>>> On Fri, Dec 9, 2022 at 8:38 PM Jark Wu <imj...@gmail.com> wrote:
>>>
>>>> Hi Samrat,
>>>>
>>>> Thanks a lot for driving the new catalog, and sorry for jumping into the
>>>> discussion late.
>>>>
>>>> As Flink SQL is becoming the first-class citizen of the Flink API, we are
>>>> planning to push Catalog to become the first-class citizen of the
>>>> connector instead of Source & Sink. For Flink SQL users, using Catalog is
>>>> as natural and user-friendly as working with databases, rather than having
>>>> to define DDL and schemas over and over again. This is also how
>>>> Trino/Presto do it.
>>>>
>>>> Regarding the repo for the Glue catalog, I think we can add it to
>>>> flink-connector-aws. We don't need separate repos for Catalogs because
>>>> Catalog is a kind of connector (others are sources & sinks).
>>>> For example, MySqlCatalog[1] and PostgresCatalog[2] are in
>>>> flink-connector-jdbc, and HiveCatalog is in flink-connector-hive. This can
>>>> reduce repository maintenance, and I think maybe some common AWS utils can
>>>> be shared there. cc @Danny Cranmer <dannycran...@apache.org>, what do you
>>>> think about this?
>>>>
>>>> Besides, I have a question about Glue Namespace. Could you share the
>>>> documentation of the Glue Namespaces? (Sorry, I didn't find it.) According
>>>> to the "Flink Glue Metaspace Mapping" section, if there is a database
>>>> "mydb" under namespace "ns1", does that mean the database name in Flink
>>>> is "ns1.mydb"?
>>>>
>>>> Best,
>>>> Jark
>>>>
>>>>
>>>> [1]:
>>>>
>>>> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/MySqlCatalog.java
>>>> [2]:
>>>>
>>>> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/PostgresCatalog.java
>>>>
>>>> On Fri, 9 Dec 2022 at 08:51, Dong Lin <lindon...@gmail.com> wrote:
>>>>
>>>> > Hi Samrat,
>>>> >
>>>> > Sorry for the late reply. Yeah I am referring to creating a similar
>>>> > external repo such as flink-catalog-glue. flink-connector-aws is already
>>>> > named with `connector` so it seems a bit weird to put a catalog there.
>>>> >
>>>> > Thanks!
>>>> > Dong
>>>> >
>>>> > On Wed, Dec 7, 2022 at 1:04 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>> >
>>>> > > Hi Dong Lin,
>>>> > >
>>>> > > > Since this is the first proposal for adding a vendor-specific catalog
>>>> > > > library in Flink, I think maybe we should also externalize those catalog
>>>> > > > libraries similar to how we are externalizing connector libraries. It is
>>>> > > > likely that we might want to add catalogs for other vendors in the future.
>>>> > > > Externalizing those catalogs can make Flink development more scalable in
>>>> > > > the long term.
>>>> > >
>>>> > > Initially I misinterpreted what externalising the catalogs meant. There
>>>> > > already exists an externalised connector for AWS [1].
>>>> > > Are you referring to creating a similar external repo for catalogs, or
>>>> > > would it be better to add it to flink-connector-aws [1]?
>>>> > >
>>>> > > [1] https://github.com/apache/flink-connector-aws
>>>> > >
>>>> > > Samrat
>>>> > >
>>>> > > On Tue, Dec 6, 2022 at 6:52 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>> > >
>>>> > > > Hi Dong Lin,
>>>> > > >
>>>> > > > The AWS Glue Data Catalog is vendor-specific, and in the future we will
>>>> > > > get similar implementations from different providers. We should
>>>> > > > definitely externalize these catalog libraries, similar to the Flink
>>>> > > > connectors. I am thinking of creating flink-catalog, similar to
>>>> > > > flink-connector, under the root (flink). The Glue catalog can be one of
>>>> > > > the modules under flink-catalog. Please suggest if there is a better
>>>> > > > structure we can create for catalogs.
>>>> > > >
>>>> > > >
>>>> > > >> It is mentioned in the FLIP that there will be two types of SdkHttpClient
>>>> > > >> supported based on the catalog option http-client.type. Is
>>>> > > >> http-client.type a public config for the GlueCatalog? If yes, can we add
>>>> > > >> this config to the "Configurations" section and explain how users should
>>>> > > >> choose the client type?
>>>> > > >
>>>> > > >
>>>> > > > Yes, http-client.type is a public config for the GlueCatalog. If the
>>>> > > > user doesn't specify a client type, it defaults to `urlconnection`.
>>>> > > > I have updated the Configuration section of FLIP-277 [1] with all the
>>>> > > > configs. Please review it again.
>>>> > > >
>>>> > > > [1]
>>>> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>>> > > >
>>>> > > > Samrat
>>>> > > >
>>>> > > > On Tue, Dec 6, 2022 at 5:50 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>> > > >
>>>> > > >> Hi Yuxia,
>>>> > > >>
>>>> > > >> Thank you for reviewing the FLIP and putting forward your observations
>>>> > > >> and comments.
>>>> > > >>
>>>> > > >>> 1: I noticed there's a YAML part in the section of "Using the Catalog",
>>>> > > >>> what do you mean by that? Do you mean how to use glue catalog in sql
>>>> > > >>> client? If so, just for your information, it's not supported to use yaml
>>>> > > >>> environment file in sql client[2].
>>>> > > >>
>>>> > > >>
>>>> > > >> Thank you for attaching the Jira ticket [1]. I missed those changes.
>>>> > > >> There is a provision to register a catalog directly through factory
>>>> > > >> resources:
>>>> > > >> - GenericInMemoryCatalog is defined through
>>>> > > >> `flink/flink-table/flink-table-api-java/src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory`
>>>> > > >> - HiveCatalog is defined through
>>>> > > >> `flink-connectors/flink-connector-hive/src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory`
>>>> > > >> Similarly, we can define it in the vendor-specific module for AWS Glue.
>>>> > > >>
>>>> > > >>> 2: Seems there's a typo in "Design#views" part, it contains "listTables"
>>>> > > >>> which I think shouldn't be contained.
>>>> > > >>
>>>> > > >>
>>>> > > >> Oh yes 😅! Fixed it now, thanks for pointing it out.
>>>> > > >>
>>>> > > >>
>>>> > > >>> Also, I'm curious about how to list views using Glue API. Is there an
>>>> > > >>> on-hand api to list views directly or we need to list the tables and then
>>>> > > >>> filter the views using the table-kind?
>>>> > > >>
>>>> > > >>
>>>> > > >> Right, there is no on-hand API to list views directly. We need to list all
>>>> > > >> tables and then filter the views based on the tableKind attribute, which
>>>> > > >> is part of the table object in the API response.
>>>> > > >>
>>>> > > >>
>>>> > > >>> 3: In "Flink Glue DataType Mapping" part, CharType is mapped to String.
>>>> > > >>> It seems the char's size will be lost, is it possible to have a better
>>>> > > >>> mapping which won't lose the size of char type?
>>>> > > >>
>>>> > > >>
>>>> > > >> Thanks for pointing this out! I have updated the FLIP with the correct
>>>> > > >> type. Initially I mapped the char and varchar types to string, but I have
>>>> > > >> updated them to map directly to the same types.
>>>> > > >>
>>>> > > >>
>>>> > > >>
>>>> > > >>> 4: About the "Flink CatalogFunction mapping with Glue Function" part,
>>>> > > >>> how do we map the function language in Flink's CatalogFunction.
>>>> > > >>
>>>> > > >>
>>>> > > >> The Glue API (UserDefinedFunctionInput) doesn't support a specific
>>>> > > >> attribute for the function language. Here is how the AWS Hive-compatible
>>>> > > >> metastore maps a Hive function to a Glue function [2]. We will add a
>>>> > > >> language prefix to the function name itself to indicate the language. I
>>>> > > >> see this has already been done for the Hive Catalog [3]. We are thinking
>>>> > > >> of implementing it in the same way.
>>>> > > >>
>>>> > > >> [1] https://issues.apache.org/jira/browse/FLINK-22540
>>>> > > >> [2]
>>>> > > >> https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-client-common/src/main/java/com/amazonaws/glue/catalog/converters/GlueInputConverter.java#L83
>>>> > > >> [3]
>>>> > > >> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/catalog/hive/HiveCatalog.java#L1415
>>>> > > >>
>>>> > > >> Samrat
>>>> > > >>
>>>> > > >> On Mon, Dec 5, 2022 at 4:33 PM Dong Lin <lindon...@gmail.com> wrote:
>>>> > > >>
>>>> > > >>> Hi Samrat,
>>>> > > >>>
>>>> > > >>> Thanks for the FLIP!
>>>> > > >>>
>>>> > > >>> Since this is the first proposal for adding a vendor-specific catalog
>>>> > > >>> library in Flink, I think maybe we should also externalize those catalog
>>>> > > >>> libraries similar to how we are externalizing connector libraries. It is
>>>> > > >>> likely that we might want to add catalogs for other vendors in the future.
>>>> > > >>> Externalizing those catalogs can make Flink development more scalable in
>>>> > > >>> the long term.
>>>> > > >>>
>>>> > > >>> It is mentioned in the FLIP that there will be two types of SdkHttpClient
>>>> > > >>> supported based on the catalog option http-client.type. Is
>>>> > > >>> http-client.type a public config for the GlueCatalog? If yes, can we add
>>>> > > >>> this config to the "Configurations" section and explain how users should
>>>> > > >>> choose the client type?
>>>> > > >>>
>>>> > > >>> Regards,
>>>> > > >>> Dong
>>>> > > >>>
>>>> > > >>>
>>>> > > >>> On Sat, Dec 3, 2022 at 12:31 PM Samrat Deb <decordea...@gmail.com> wrote:
>>>> > > >>>
>>>> > > >>> > Hi everyone,
>>>> > > >>> >
>>>> > > >>> > I would like to open a discussion[1] on providing GlueCatalog support
>>>> > > >>> > in Flink.
>>>> > > >>> > Currently, Flink offers 3 major types of catalog[2], of which only
>>>> > > >>> > HiveCatalog is a persistent catalog backed by Hive Metastore. We would
>>>> > > >>> > like to introduce GlueCatalog in Flink, offering users another option
>>>> > > >>> > that is persistent in nature. AWS Glue Data Catalog is a centralized
>>>> > > >>> > data catalog in the AWS cloud that provides integrations with many
>>>> > > >>> > different connectors[3]. Flink GlueCatalog can use the features provided
>>>> > > >>> > by Glue and create strong integration with other services in the cloud.
>>>> > > >>> >
>>>> > > >>> > [1]
>>>> > > >>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
>>>> > > >>> >
>>>> > > >>> > [2]
>>>> > > >>> > https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/catalogs/
>>>> > > >>> >
>>>> > > >>> > [3]
>>>> > > >>> > https://docs.aws.amazon.com/glue/latest/dg/components-overview.html#data-catalog-intro
>>>> > > >>> >
>>>> > > >>> > [4] https://issues.apache.org/jira/browse/FLINK-29549
>>>> > > >>> >
>>>> > > >>> > Bests
>>>> > > >>> > Samrat
>>>> > > >>> >
>>>> > > >>>
>>>> > > >>
>>>> > >
>>>> >
>>>>
>>>
