Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Timo Walther Mon, 07 Jan 2019 05:59:47 -0800

Hi everyone,

Xuefu and I had multiple iterations over the catalog design document [1]. I believe it is now in good shape to be converted into a FLIP. We may need a bit more explanation in some places, but the general design is ready.


The design document covers the following changes:

- Unify external catalog interface and Flink's internal catalog in TableEnvironment
- Clearly define a hierarchy of reference objects, namely: "catalog.database.table"
- Enable a tight integration with Hive + Hive data connectors as well as a broad integration with existing TableFactories and discovery mechanism
- Make the catalog interfaces more feature complete by adding views and functions
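
To make the "catalog.database.table" hierarchy concrete, here is a minimal Python sketch of how a partially qualified name might be resolved against a current catalog and database. The `ObjectPath` type, the `resolve` function, and the dot-splitting rule are hypothetical illustrations only; they are not taken from the design document or from Flink's API.

```python
from typing import NamedTuple


class ObjectPath(NamedTuple):
    """Fully qualified reference: catalog.database.table (hypothetical type)."""
    catalog: str
    database: str
    table: str


def resolve(name: str, current_catalog: str, current_database: str) -> ObjectPath:
    """Expand a 1-, 2-, or 3-part name using the session defaults.

    Assumption: parts are separated by '.' and need no quoting or escaping.
    """
    parts = name.split(".")
    if any(not p for p in parts):
        raise ValueError(f"empty name component in {name!r}")
    if len(parts) == 1:
        # Bare table name: fill in both defaults.
        return ObjectPath(current_catalog, current_database, parts[0])
    if len(parts) == 2:
        # database.table: fill in the default catalog only.
        return ObjectPath(current_catalog, parts[0], parts[1])
    if len(parts) == 3:
        return ObjectPath(*parts)
    raise ValueError(f"too many name components in {name!r}")
```

With this rule a bare table name inherits both defaults, a two-part name inherits only the catalog, and a three-part name is taken as given.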

If you have any further feedback, it would be great to give it now before we convert it into a FLIP.


Thanks,
Timo

[1] https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#




On 07.01.19 at 13:51, Timo Walther wrote:

Hi Eron,

thank you very much for the contributions. I merged the first little bug fixes. For the remaining PRs, I think we can review and merge them soon. As you said, the code is agnostic to the details of the ExternalCatalog interface, and I don't expect bigger merge conflicts in the near future.

However, exposing the current external catalog interfaces to SQL Client users would make it even more difficult to change the interfaces in the future. So maybe we should first wait until the general catalog discussion is over and the FLIP has been created. This should happen shortly.

We should definitely coordinate the efforts better in the future to avoid duplicate work.


Thanks,
Timo


On 07.01.19 at 00:24, Eron Wright wrote:

Thanks Timo for merging a couple of the PRs. Are you also able to review the others that I mentioned? Xuefu, I would like to incorporate your feedback too.


Check out this short demonstration of using a catalog in SQL Client:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo

Thanks again!

On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <[email protected]<mailto:[email protected]>> wrote:


    Would a couple folks raise their hand to make a review pass thru
    the 6 PRs listed above?  It is a lovely stack of PRs that is 'all
    green' at the moment.   I would be happy to open follow-on PRs to
    rapidly align with other efforts.

    Note that the code is agnostic to the details of the
    ExternalCatalog interface; the code would not be obsolete if/when
    the catalog interface is enhanced as per the design doc.



    On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <[email protected]
    <mailto:[email protected]>> wrote:

        I propose that the community review and merge the PRs that I
        posted, and then evolve the design thru 1.8 and beyond.  I
        think having a basic infrastructure in place now will
        accelerate the effort, do you agree?

        Thanks again!

        On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu
        <[email protected] <mailto:[email protected]>> wrote:


            Hi Eron,

            Happy New Year!

            Thank you very much for your contribution, especially
            during the holidays. While I'm encouraged by your work, I'd
            also like to share my thoughts on how to move forward.

            First, please note that the design discussion is still
            being finalized, and we expect some moderate changes,
            especially around TableFactories. Another pending change
            is our decision to shy away from Scala, which will also
            affect our work.

            Secondly, while your work seems to be about plugging
            catalog definitions into the execution environment, which
            is less affected by the TableFactory change, I did notice
            some duplication between your work and ours. This is no
            big deal, but going forward we should probably communicate
            better about work assignments so as to avoid any
            duplication of work. On the other hand, I think
            some of your work is interesting and valuable for
            inclusion once we finalize the overall design.

            Thus, please continue your research and experiments and
            let us know when you start working on anything so we can
            better coordinate.

            Thanks again for your interest and contributions.

            Thanks,
            Xuefu



------------------------------------------------------------------
                From:Eron Wright <[email protected]
                <mailto:[email protected]>>
                Sent At:2019 Jan. 1 (Tue.) 18:39
                To:dev <[email protected]
                <mailto:[email protected]>>; Xuefu

<[email protected]<mailto:[email protected]>>

                Cc:Xiaowei Jiang <[email protected]
                <mailto:[email protected]>>; twalthr
                <[email protected] <mailto:[email protected]>>;
                piotr <[email protected]
                <mailto:[email protected]>>; Fabian Hueske
                <[email protected] <mailto:[email protected]>>;
                suez1224 <[email protected]
                <mailto:[email protected]>>; Bowen Li
                <[email protected] <mailto:[email protected]>>
                Subject:Re: [DISCUSS] Integrate Flink SQL well with
                Hive ecosystem

                Hi folks, there are clearly some incremental steps to
                be taken to introduce catalog support to SQL Client,
                complementary to what is proposed in the Flink-Hive
                Metastore design doc.  I was quietly working on this
                over the holidays.   I posted some new sub-tasks, PRs,
                and sample code to FLINK-10744.

                What inspired me to get involved is that the catalog
                interface seems like a great way to encapsulate a
                'library' of Flink tables and functions. For example,
                the NYC Taxi dataset (TaxiRides, TaxiFares, various
                UDFs) may be nicely encapsulated as a catalog
                (TaxiData).  Such a library should be fully consumable
                in SQL Client.

                I implemented the above. Some highlights:
                1. A fully-worked example of using the Taxi dataset in
                SQL Client via an environment file.
                - an ASCII video showing the SQL Client in action:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo

                - the corresponding environment file (will be even
                more concise once 'FLINK-10696 Catalog UDFs' is merged):
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml

                - the typed API for standalone table applications:
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50

                2. Implementation of the core catalog descriptor and
                factory.  I realize that some renames may later occur
                as per the design doc, and would be happy to do that
                as a follow-up.
                https://github.com/apache/flink/pull/7390

                3. Implementation of a connect-style API on
                TableEnvironment to use catalog descriptor.
                https://github.com/apache/flink/pull/7392

                4. Integration into SQL-Client's environment file:
                https://github.com/apache/flink/pull/7393

                I realize that the overall Hive integration is still
                evolving, but I believe that these PRs are a good
                stepping stone. Here's the list (in bottom-up order):
                - https://github.com/apache/flink/pull/7386
                - https://github.com/apache/flink/pull/7388
                - https://github.com/apache/flink/pull/7389
                - https://github.com/apache/flink/pull/7390
                - https://github.com/apache/flink/pull/7392
                - https://github.com/apache/flink/pull/7393

                Thanks and enjoy 2019!
                Eron W

                On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu
                <[email protected]
                <mailto:[email protected]>> wrote:
                Hi Xiaowei,

                Thanks for bringing up the question. In the current
                design, the properties for meta objects are meant to
                cover anything that's specific to a particular catalog
                and agnostic to Flink. Anything that is common (such
                as the schema for tables, the query text for views, and
                the UDF class name) is abstracted as a member of the
                respective class. However, this is still in discussion,
                and Timo and I will go over this and provide an update.
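
The split described above, with common attributes as typed members and catalog-specific details as free-form properties, could look roughly like the following Python sketch. The class and field names here (`CatalogTable`, `CatalogFunction`, `schema`, `query_text`, `class_name`, `properties`) are assumptions made for illustration; the design doc may define them differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CatalogTable:
    # Common to every catalog: the schema as (column name, type name) pairs.
    schema: List[Tuple[str, str]]
    # Catalog-specific details that Flink itself does not interpret.
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class CatalogView:
    # Common to every catalog: the query text that defines the view.
    query_text: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class CatalogFunction:
    # Common to every catalog: the class name of the UDF implementation.
    class_name: str
    properties: Dict[str, str] = field(default_factory=dict)
```

In this sketch, a Hive-backed table could carry Hive-only settings in `properties` while any consumer still reads `schema` in a uniform way.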

                Please note that UDF is a little more involved than
                what the current design doc shows. I'm still refining
                this part.

                Thanks,
                Xuefu

------------------------------------------------------------------
                Sender:Xiaowei Jiang <[email protected]
                <mailto:[email protected]>>
                Sent at:2018 Nov 18 (Sun) 15:17
                Recipient:dev <[email protected]
                <mailto:[email protected]>>
                Cc:Xuefu <[email protected]
                <mailto:[email protected]>>; twalthr
                <[email protected] <mailto:[email protected]>>;
                piotr <[email protected]
                <mailto:[email protected]>>; Fabian Hueske
                <[email protected] <mailto:[email protected]>>;

suez1224 <[email protected]<mailto:[email protected]>>

                Subject:Re: [DISCUSS] Integrate Flink SQL well with
                Hive ecosystem

                Thanks Xuefu for the detailed design doc! One question
                on the properties associated with the catalog objects:
                are we going to leave them completely free-form, or are
                we going to set some standard for them? I think that
                the answer may depend on whether we want to explore
                catalog-specific optimization opportunities. In any
                case, I think that it might be helpful to standardize
                as much as possible into strongly typed classes and
                leave these properties for catalog-specific things. But
                I think that we can do it in steps.

                Xiaowei
                On Fri, Nov 16, 2018 at 4:00 AM Bowen Li
                <[email protected] <mailto:[email protected]>> wrote:

                 Thanks for continuing to improve the overall design,
                 Xuefu! It looks quite good to me now.

                 It would be nice if the cc-ed Flink committers could
                 help review and confirm!



                 One minor suggestion: since the last section of the
                 design doc already touches on some new SQL statements,
                 shall we add another section to our doc and formalize
                 the new SQL statements in SQL Client and
                 TableEnvironment that are gonna come along naturally
                 with our design? Here are some that the design doc
                 mentioned and some that I came up with:

                 To be added:

                    - USE <catalog> - set default catalog
                    - USE <catalog.schema> - set default schema
                    - SHOW CATALOGS - show all registered catalogs
                    - SHOW SCHEMAS [FROM catalog] - list schemas in the
                      current default catalog or the specified catalog
                    - DESCRIBE VIEW view - show the view's definition in
                      CatalogView
                    - SHOW VIEWS [FROM schema/catalog.schema] - show views
                      from the current or a specified schema

                    (DDLs that can be addressed by either our design or
                    Shuyi's DDL design)

                    - CREATE/DROP/ALTER SCHEMA schema
                    - CREATE/DROP/ALTER CATALOG catalog

                 To be modified:

                    - SHOW TABLES [FROM schema/catalog.schema] - show
                      tables from the current or a specified schema. Add
                      'from schema' to the existing 'SHOW TABLES' statement
                    - SHOW FUNCTIONS [FROM schema/catalog.schema] - show
                      functions from the current or a specified schema. Add
                      'from schema' to the existing 'SHOW FUNCTIONS'
                      statement
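
The intended behavior of these statements can be sketched with a small in-memory model in Python. Everything here is hypothetical: the `Session` class, its method names, and the nested-dict catalog layout are illustration only, and the rule that `USE <catalog>` leaves the default schema unchanged is an assumption, since the thread does not specify it.

```python
from typing import Dict, List, Optional


class Session:
    """Toy stand-in for session state: tracks the default catalog and schema."""

    def __init__(self, catalogs: Dict[str, Dict[str, List[str]]],
                 catalog: str, schema: str) -> None:
        # catalogs maps catalog name -> schema name -> list of table names.
        self.catalogs = catalogs
        self.catalog = catalog
        self.schema = schema

    def use(self, target: str) -> None:
        """USE <catalog> or USE <catalog.schema>."""
        catalog, _, schema = target.partition(".")
        if catalog not in self.catalogs:
            raise KeyError(f"unknown catalog: {catalog}")
        if schema and schema not in self.catalogs[catalog]:
            raise KeyError(f"unknown schema: {catalog}.{schema}")
        self.catalog = catalog
        if schema:
            self.schema = schema  # assumption: otherwise left unchanged

    def show_catalogs(self) -> List[str]:
        """SHOW CATALOGS."""
        return sorted(self.catalogs)

    def show_schemas(self, catalog: Optional[str] = None) -> List[str]:
        """SHOW SCHEMAS [FROM catalog]."""
        return sorted(self.catalogs[catalog or self.catalog])

    def show_tables(self, source: Optional[str] = None) -> List[str]:
        """SHOW TABLES [FROM schema | catalog.schema]."""
        catalog, schema = self.catalog, self.schema
        if source:
            head, sep, tail = source.partition(".")
            catalog, schema = (head, tail) if sep else (catalog, head)
        return sorted(self.catalogs[catalog][schema])
```

A one-part argument to `show_tables` is read as a schema in the current catalog, and a two-part argument as `catalog.schema`, mirroring the `[FROM schema/catalog.schema]` form above.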


                 Thanks, Bowen



                 On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu
                 <[email protected] <mailto:[email protected]>> wrote:

                 > Thanks, Bowen, for catching the error. I have granted
                 > comment permission to anyone with the link.
                 >
                 > I also updated the doc with the latest class
                 > definitions. Everyone is encouraged to review and comment.
                 >
                 > Thanks,
                 > Xuefu
                 >
                 >
------------------------------------------------------------------
                 > Sender:Bowen Li <[email protected]
                <mailto:[email protected]>>
                 > Sent at:2018 Nov 14 (Wed) 06:44
                 > Recipient:Xuefu <[email protected]
                <mailto:[email protected]>>
                 > Cc:piotr <[email protected]
                <mailto:[email protected]>>; dev
                <[email protected] <mailto:[email protected]>>;
                Shuyi
                 > Chen <[email protected] <mailto:[email protected]>>
                 > Subject:Re: [DISCUSS] Integrate Flink SQL well with
                Hive ecosystem
                 >
                 > Hi Xuefu,
                 >
                 > Currently the new design doc
                 >
<https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
                 > is in "view only" mode, and people cannot leave
                 > comments. Can you please change it to "can comment" or
                 > "can edit" mode?
                 >
                 > Thanks, Bowen
                 >
                 >
                 > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu

<[email protected]<mailto:[email protected]>>

                 > wrote:
                 > Hi Piotr
                 >
                 > I have extracted the API portion of the design, and
                 > the Google doc is here
                 >
<https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
                 > Please review and provide your feedback.
                 >
                 > Thanks,
                 > Xuefu
                 >
                 >
------------------------------------------------------------------
                 > Sender:Xuefu <[email protected]
                <mailto:[email protected]>>
                 > Sent at:2018 Nov 12 (Mon) 12:43
                 > Recipient:Piotr Nowojski <[email protected]
                <mailto:[email protected]>>; dev <
                 > [email protected] <mailto:[email protected]>>
                 > Cc:Bowen Li <[email protected]
                <mailto:[email protected]>>; Shuyi Chen
                <[email protected] <mailto:[email protected]>>
                 > Subject:Re: [DISCUSS] Integrate Flink SQL well with
                Hive ecosystem
                 >
                 > Hi Piotr,
                 >
                 > That sounds good to me. Let's close all the open
                 > questions (there are a couple of them) in the Google
                 > doc, and I should be able to quickly split it into the
                 > three proposals as you suggested.
                 >
                 > Thanks,
                 > Xuefu
                 >
                 >
------------------------------------------------------------------
                 > Sender:Piotr Nowojski <[email protected]
                <mailto:[email protected]>>
                 > Sent at:2018 Nov 9 (Fri) 22:46
                 > Recipient:dev <[email protected]
                <mailto:[email protected]>>; Xuefu

<[email protected]<mailto:[email protected]>>

                 > Cc:Bowen Li <[email protected]
                <mailto:[email protected]>>; Shuyi Chen
                <[email protected] <mailto:[email protected]>>
                 > Subject:Re: [DISCUSS] Integrate Flink SQL well with
                Hive ecosystem
                 >
                 > Hi,
                 >
                 >
                 > Yes, it seems like the best solution. Maybe someone
                 > else can also suggest whether we can split it further?
                 > Maybe changes to the interfaces in one doc, reading
                 > from the Hive Metastore in another, and finally storing
                 > our meta information in the Hive Metastore in a third?
                 >
                 > Piotrek
                 >
                 > > On 9 Nov 2018, at 01:44, Zhang, Xuefu
                <[email protected]
                <mailto:[email protected]>> wrote:
                 > >
                 > > Hi Piotr,
                 > >
                 > > That seems to be a good idea!
                 > >
                 >
                 > > Since the Google doc for the design is currently
                 > > under extensive review, I will leave it as it is for
                 > > now. However, I'll convert it to two different FLIPs
                 > > when the time comes.
                 > >
                 > > How does it sound to you?
                 > >
                 > > Thanks,
                 > > Xuefu
                 > >
                 > >
                 > >
------------------------------------------------------------------
                 > > Sender:Piotr Nowojski <[email protected]
                <mailto:[email protected]>>
                 > > Sent at:2018 Nov 9 (Fri) 02:31
                 > > Recipient:dev <[email protected]
                <mailto:[email protected]>>
                 > > Cc:Bowen Li <[email protected]
                <mailto:[email protected]>>; Xuefu

<[email protected]<mailto:[email protected]>

                 > >; Shuyi Chen <[email protected]
                <mailto:[email protected]>>
                 > > Subject:Re: [DISCUSS] Integrate Flink SQL well
                with Hive ecosystem
                 > >
                 > > Hi,
                 > >
                 >
                 > > Maybe we should split this topic (and the design
                 > > doc) into a couple of smaller ones, hopefully
                 > > independent. The questions that you asked, Fabian,
                 > > have for example very little to do with reading
                 > > metadata from the Hive Metastore.
                 > >
                 > > Piotrek
                 > >
                 > >> On 7 Nov 2018, at 14:27, Fabian Hueske
                <[email protected] <mailto:[email protected]>> wrote:
                 > >>
                 > >> Hi Xuefu and all,
                 > >>
                 > >> Thanks for sharing this design document!
                 > >> I'm very much in favor of restructuring / reworking
                 > >> the catalog handling in Flink SQL as outlined in
                 > >> the document. Most changes described in the design
                 > >> document seem to be rather general and not
                 > >> specifically related to the Hive integration.
                 > >>
                 > >> IMO, there are some aspects, especially those at the
                 > >> boundary of Hive and Flink, that need a bit more
                 > >> discussion. For example:

                 > >>
                 > >> * What does it take to make the Flink schema
                 > >>   compatible with the Hive schema?
                 > >> * How will Flink tables (descriptors) be stored in HMS?
                 > >> * How do both Hive catalogs differ? Could they be
                 > >>   integrated into a single one? When to use which one?
                 > >> * What meta information is provided by HMS? What of
                 > >>   this can be leveraged by Flink?
                 > >>
                 > >> Thank you,
                 > >> Fabian
                 > >>
                 > >> On Fri, Nov 2, 2018 at 00:31, Bowen Li
                 > >> <[email protected] <mailto:[email protected]>> wrote:
                 > >>
                 > >>> After taking a look at how other discussion threads
                 > >>> work, I think it's actually fine to just keep our
                 > >>> discussion here. It's up to you, Xuefu.
                 > >>>
                 > >>> The google doc LGTM. I left some minor comments.
                 > >>>
                 > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li
                 > >>> <[email protected] <mailto:[email protected]>> wrote:

                 > >>>
                 > >>>> Hi all,
                 > >>>>
                 > >>>> As Xuefu has published the design doc on Google, I
                 > >>>> agree with Shuyi's suggestion that we should
                 > >>>> probably start a new email thread like "[DISCUSS]
                 > >>>> ... Hive integration design ..." on only the dev
                 > >>>> mailing list for community devs to review. The
                 > >>>> current thread goes to both the dev and user lists.
                 > >>>>
                 > >>>> This email thread is more about validating the
                 > >>>> general idea and direction with the community, and
                 > >>>> it's been pretty long and crowded so far. Since
                 > >>>> everyone is in favor of the idea, we can move
                 > >>>> forward with another thread to discuss and finalize
                 > >>>> the design.
                 > >>>>
                 > >>>> Thanks,
                 > >>>> Bowen
                 > >>>>
                 > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
                 > [email protected]
                <mailto:[email protected]>>
                 > >>>> wrote:
                 > >>>>
                 > >>>>> Hi Shuyi,
                 > >>>>>
                 > >>>>> Good idea. Actually, the PDF was converted from a
                 > >>>>> Google doc. Here is its link:
                 > >>>>>
https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
                 > >>>>> Once we reach an agreement, I can convert it to a FLIP.
                 > >>>>>
                 > >>>>> Thanks,
                 > >>>>> Xuefu
                 > >>>>>
                 > >>>>>
                 > >>>>>
                 > >>>>>
------------------------------------------------------------------
                 > >>>>> Sender:Shuyi Chen <[email protected]
                <mailto:[email protected]>>
                 > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
                 > >>>>> Recipient:Xuefu <[email protected]
                <mailto:[email protected]>>
                 > >>>>> Cc:vino yang <[email protected]
                <mailto:[email protected]>>; Fabian Hueske <
                 > [email protected] <mailto:[email protected]>>;
                 > >>>>> dev <[email protected]
                <mailto:[email protected]>>; user
                <[email protected] <mailto:[email protected]>>
                 > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
                well with Hive ecosystem
                 > >>>>>
                 > >>>>> Hi Xuefu,
                 > >>>>>
                 > >>>>> Thanks a lot for driving this big effort. I would
                 > >>>>> suggest converting your proposal and design doc
                 > >>>>> into a Google doc and sharing it on the dev mailing
                 > >>>>> list for the community to review and comment on,
                 > >>>>> with a title like "[DISCUSS] ... Hive integration
                 > >>>>> design ...". Once approved, we can document it as a
                 > >>>>> FLIP (Flink Improvement Proposal) and use JIRAs to
                 > >>>>> track the implementation.
                 > >>>>> What do you think?
                 > >>>>>
                 > >>>>> Shuyi
                 > >>>>>
                 > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
                 > [email protected]
                <mailto:[email protected]>>
                 > >>>>> wrote:
                 > >>>>> Hi all,
                 > >>>>>
                 > >>>>> I have also shared a design doc on Hive metastore
                 > >>>>> integration that is attached here and also to
                 > >>>>> FLINK-10556 [1]. Please kindly review and share
                 > >>>>> your feedback.
                 > >>>>>
                 > >>>>>
                 > >>>>> Thanks,
                 > >>>>> Xuefu
                 > >>>>>
                 > >>>>> [1]
https://issues.apache.org/jira/browse/FLINK-10556
                 > >>>>>
------------------------------------------------------------------
                 > >>>>> Sender:Xuefu <[email protected]
                <mailto:[email protected]>>
                 > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
                 > >>>>> Recipient:Xuefu <[email protected]
                <mailto:[email protected]>>; Shuyi Chen <
                 > >>>>> [email protected] <mailto:[email protected]>>
                 > >>>>> Cc:yanghua1127 <[email protected]
                <mailto:[email protected]>>; Fabian Hueske <
                 > [email protected] <mailto:[email protected]>>;
                 > >>>>> dev <[email protected]
                <mailto:[email protected]>>; user
                <[email protected] <mailto:[email protected]>>
                 > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
                well with Hive ecosystem
                 > >>>>>
                 > >>>>> Hi all,
                 > >>>>>
                 > >>>>> To wrap up the discussion, I have attached a PDF
                 > >>>>> describing the proposal, which is also attached to
                 > >>>>> FLINK-10556 [1]. Please feel free to watch that
                 > >>>>> JIRA to track the progress.
                 > >>>>>
                 > >>>>> Please also let me know if you have additional
                 > >>>>> comments or questions.
                 > >>>>>
                 > >>>>> Thanks,
                 > >>>>> Xuefu
                 > >>>>>
                 > >>>>> [1]
https://issues.apache.org/jira/browse/FLINK-10556
                 > >>>>>
                 > >>>>>
                 > >>>>>
------------------------------------------------------------------
                 > >>>>> Sender:Xuefu <[email protected]
                <mailto:[email protected]>>
                 > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
                 > >>>>> Recipient:Shuyi Chen <[email protected]
                <mailto:[email protected]>>
                 > >>>>> Cc:yanghua1127 <[email protected]
                <mailto:[email protected]>>; Fabian Hueske <
                 > [email protected] <mailto:[email protected]>>;
                 > >>>>> dev <[email protected]
                <mailto:[email protected]>>; user
                <[email protected] <mailto:[email protected]>>
                 > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
                well with Hive ecosystem
                 > >>>>>
                 > >>>>> Hi Shuyi,
                 > >>>>>
                 > >>>>> Thank you for your input. Yes, I agree with a
                 > >>>>> phased approach and would like to move forward
                 > >>>>> fast. :) We did some work internally on DDL
                 > >>>>> utilizing the babel parser in Calcite. While babel
                 > >>>>> makes Calcite's grammar extensible, at first
                 > >>>>> impression it still seems too cumbersome for a
                 > >>>>> project when too many extensions are made. It's
                 > >>>>> even challenging to find where an extension is
                 > >>>>> needed! It would certainly be better if Calcite
                 > >>>>> could magically support Hive QL by just turning on
                 > >>>>> a flag, such as that for MYSQL_5. I can also see
                 > >>>>> that this could mean a lot of work on Calcite.
                 > >>>>> Nevertheless, I will bring up the discussion over
                 > >>>>> there and see what their community thinks.
                 > >>>>>
                 > >>>>> Would you mind sharing more info about the proposal
                 > >>>>> on DDL that you mentioned? We can certainly
                 > >>>>> collaborate on this.

                 > >>>>>
                 > >>>>> Thanks,
                 > >>>>> Xuefu
                 > >>>>>
                 > >>>>>
------------------------------------------------------------------
                 > >>>>> Sender:Shuyi Chen <[email protected]
                <mailto:[email protected]>>
                 > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
                 > >>>>> Recipient:Xuefu <[email protected]
                <mailto:[email protected]>>
                 > >>>>> Cc:yanghua1127 <[email protected]
                <mailto:[email protected]>>; Fabian Hueske <
                 > [email protected] <mailto:[email protected]>>;
                 > >>>>> dev <[email protected]
                <mailto:[email protected]>>; user
                <[email protected] <mailto:[email protected]>>
                 > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
                well with Hive ecosystem
                 > >>>>>
                 > >>>>> Welcome to the community and thanks for the great
                 > >>>>> proposal, Xuefu! I think the proposal can be
                 > >>>>> divided into 2 stages: making Flink support Hive
                 > >>>>> features, and making Hive work with Flink. I agree
                 > >>>>> with Timo on starting with a smaller scope, so we
                 > >>>>> can make progress faster. As for [6], a proposal
                 > >>>>> for DDL is already in progress, and will come after
                 > >>>>> the unified SQL connector API is done. For
                 > >>>>> supporting Hive syntax, we might need to work with
                 > >>>>> the Calcite community, and a recent effort called
                 > >>>>> babel
                 > >>>>> (https://issues.apache.org/jira/browse/CALCITE-2280)
                 > >>>>> in Calcite might help here.
                 > >>>>>
                 > >>>>> Thanks
                 > >>>>> Shuyi
                 > >>>>>
                 > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <[email protected]>
                 > >>>>> wrote:
                 > >>>>> Hi Fabian/Vino,
                 > >>>>>
                 > >>>>> Thank you very much for your encouragement and inquiry. Sorry that I didn't
                 > >>>>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
                 > >>>>> went to the spam folder.)
                 > >>>>>
                 >
                 > >>>>> My proposal contains long-term and short-term goals. Nevertheless, the
                 > >>>>> effort will focus on the following areas, including Fabian's list:
                 > >>>>>
                 > >>>>> 1. Hive metastore connectivity - This covers both read/write access,
                 > >>>>> which means Flink can make full use of Hive's metastore as its catalog (at
                 > >>>>> least for batch, but this can be extended to streaming as well).
                 > >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
                 > >>>>> created by Hive can be understood by Flink, and the reverse direction is
                 > >>>>> true also.
                 > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
                 > >>>>> consumed by Flink and vice versa.
                 > >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides
                 > >>>>> its own implementation or makes Hive's implementation work in Flink.
                 > >>>>> Further, for user-created UDFs in Hive, Flink SQL should provide a
                 > >>>>> mechanism allowing users to import them into Flink without any code change.
                 > >>>>> 5. Data types - Flink SQL should support all data types that are
                 > >>>>> available in Hive.
                 > >>>>> 6. SQL Language - Flink SQL should support the SQL standard (such as
                 > >>>>> SQL2003) with extensions to support Hive's syntax and language features,
                 > >>>>> around DDL, DML, and SELECT queries.
                 > >>>>> 7. SQL CLI - This is currently under development in Flink, but more effort
                 > >>>>> is needed.
                 > >>>>> 8. Server - Provide a server that's compatible with Hive's HiveServer2
                 > >>>>> in its Thrift APIs, such that HiveServer2 users can reuse their existing
                 > >>>>> client (such as beeline) but connect to Flink's Thrift server instead.
                 > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
                 > >>>>> other applications to use to connect to its Thrift server.
                 > >>>>> 10. Support other user customizations in Hive, such as Hive SerDes,
                 > >>>>> storage handlers, etc.
                 > >>>>> 11. Better task failure tolerance and task scheduling in the Flink runtime.
                 > >>>>>
                 > >>>>> As you can see, achieving all of this requires significant effort
                 > >>>>> across all layers in Flink. However, a short-term goal could include only
                 > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as
                 > >>>>> #3, #6).
                 > >>>>>
                 > >>>>> Please share your further thoughts. If we generally agree that this is
                 > >>>>> the right direction, I could come up with a formal proposal quickly and
                 > >>>>> then we can follow up with broader discussions.
                 > >>>>>
                 > >>>>> Thanks,
                 > >>>>> Xuefu
                 > >>>>>
                 > >>>>>
                 > >>>>>
                 > >>>>>
------------------------------------------------------------------
                 > >>>>> Sender: vino yang <[email protected]>
                 > >>>>> Sent at: 2018 Oct 11 (Thu) 09:45
                 > >>>>> Recipient: Fabian Hueske <[email protected]>
                 > >>>>> Cc: dev <[email protected]>; Xuefu <[email protected]>;
                 > >>>>> user <[email protected]>
                 > >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
                 > >>>>>
                 > >>>>> Hi Xuefu,
                 > >>>>>
                 >
                 > >>>>> I appreciate this proposal and, like Fabian, think it would be better if
                 > >>>>> you could give more details of the plan.
                 > >>>>>
                 > >>>>> Thanks, vino.
                 > >>>>>
                 > >>>>> Fabian Hueske <[email protected]> wrote on Wed, Oct 10, 2018 at 5:27 PM:
                 > >>>>> Hi Xuefu,
                 > >>>>>
                 >
                 > >>>>> Welcome to the Flink community and thanks for starting this discussion!
                 > >>>>> Better Hive integration would be really great!
                 > >>>>> Can you go into details of what you are proposing? I can think of a
                 > >>>>> couple ways to improve Flink in that regard:
                 > >>>>>
                 > >>>>> * Support for Hive UDFs
                 > >>>>> * Support for Hive metadata catalog
                 > >>>>> * Support for HiveQL syntax
                 > >>>>> * ???
                 > >>>>>
                 > >>>>> Best, Fabian
                 > >>>>>
                 > >>>>> On Tue, Oct 9, 2018 at 7:22 PM, Zhang, Xuefu <[email protected]> wrote:
                 > >>>>> Hi all,
                 > >>>>>
                 > >>>>> Along with the community's effort, inside Alibaba we have explored
                 > >>>>> Flink's potential as an execution engine not just for stream processing but
                 > >>>>> also for batch processing. We are encouraged by our findings and have
                 > >>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
                 > >>>>> comparing what's available in Flink to the offerings from competing data
                 > >>>>> processing engines, we identified a major gap in Flink: good integration
                 > >>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL and
                 > >>>>> batch due to the well-established data ecosystem around Hive. Therefore, we
                 > >>>>> have done some initial work in this direction, but there is still a lot of
                 > >>>>> effort needed.
                 > >>>>>
                 > >>>>> We have two strategies in mind. The first one is to make Flink SQL
                 > >>>>> full-fledged and well-integrated with the Hive ecosystem. This is a similar
                 > >>>>> approach to what Spark SQL adopted. The second strategy is to make Hive
                 > >>>>> itself work with Flink, similar to the proposal in [1]. Each approach has
                 > >>>>> its pros and cons, but they don't need to be mutually exclusive, with each
                 > >>>>> targeting different users and use cases. We believe that both will
                 > >>>>> promote a much greater adoption of Flink beyond stream processing.
                 > >>>>>
                 > >>>>> We have been focused on the first approach and would like to showcase
                 > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
                 > >>>>> planned to start strategy #2 as the follow-up effort.
                 > >>>>>
                 >
                 > >>>>> I'm completely new to Flink (a short bio [2] is below), though many
                 > >>>>> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
                 > >>>>> I'd like to share our thoughts and invite your early feedback. At the same
                 > >>>>> time, I am working on a detailed proposal on Flink SQL's integration with
                 > >>>>> the Hive ecosystem, which will also be shared when ready.
                 > >>>>>
                 > >>>>> While the ideas are simple, each approach will demand significant
                 > >>>>> effort, more than what we can afford. Thus, input and contributions
                 > >>>>> from the communities are very welcome and appreciated.
                 > >>>>>
                 > >>>>> Regards,
                 > >>>>>
                 > >>>>>
                 > >>>>> Xuefu
                 > >>>>>
                 > >>>>> References:
                 > >>>>>
                 > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
                 >
                 > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked or is
                 > >>>>> working on many projects under the Apache Foundation, of which he is also an
                 > >>>>> honored member. About 10 years ago he worked in the Hadoop team at Yahoo
                 > >>>>> where the projects just got started. Later he worked at Cloudera, initiating
                 > >>>>> and leading the development of the Hive on Spark project in the communities
                 > >>>>> and across many organizations. Prior to joining Alibaba, he worked at Uber,
                 > >>>>> where he promoted Hive on Spark to all of Uber's SQL-on-Hadoop workloads and
                 > >>>>> significantly improved Uber's cluster efficiency.
                 > >>>>>
                 > >>>>>
                 > >>>>>
                 > >>>>>
                 > >>>>> --
                 > >>>>> "So you have to trust that the dots will somehow connect in your future."
                 > >>>>>
                 > >>>>>
                 > >>>>>
                 >
                 >
