Thank you, Xuefu and Timo, for putting together the FLIP! I like that both its scope and implementation plan are clear. I look forward to feedback from the group.
I also added a few more complementary details in the doc. Thanks, Bowen On Mon, Jan 7, 2019 at 8:37 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote: > Thanks, Timo! > > I have started putting the content from the google doc into FLIP-30 [1]. > However, please still keep the discussion along this thread. > > Thanks, > Xuefu > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs > > > ------------------------------------------------------------------ > From:Timo Walther <twal...@apache.org> > Sent At:2019 Jan. 7 (Mon.) 05:59 > To:dev <dev@flink.apache.org> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem > > Hi everyone, > > Xuefu and I had multiple iterations over the catalog design document > [1]. I believe that it is in good shape now to be converted into a FLIP. > Maybe we need a bit more explanation in some places but the general > design is ready now. > > The design document covers the following changes: > - Unify the external catalog interface and Flink's internal catalog in > TableEnvironment > - Clearly define a hierarchy of reference objects, namely: > "catalog.database.table" > - Enable a tight integration with Hive + Hive data connectors as well as > a broad integration with existing TableFactories and the discovery mechanism > - Make the catalog interfaces more feature-complete by adding views and > functions > > If you have any further feedback, it would be great to give it now > before we convert the document into a FLIP. > > Thanks, > Timo > > [1] > > https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit# > > > > Am 07.01.19 um 13:51 schrieb Timo Walther: > > Hi Eron, > > > > thank you very much for the contributions. I merged the first little > > bug fixes. For the remaining PRs I think we can review and merge them > > soon. As you said, the code is agnostic to the details of the > > ExternalCatalog interface and I don't expect bigger merge conflicts in > > the near future. 
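The "catalog.database.table" hierarchy and the unified catalog interface that Timo lists above can be sketched in plain Java. This is only an illustrative sketch under assumed names (ObjectPath, ReadableCatalog, InMemoryCatalog), not the actual FLIP-30 interfaces:

```java
import java.util.*;

// Hypothetical sketch of the "catalog.database.table" hierarchy;
// the names below are illustrative, not the actual FLIP-30 API.
final class ObjectPath {
    final String database;
    final String table;
    ObjectPath(String database, String table) {
        this.database = database;
        this.table = table;
    }
    @Override public String toString() { return database + "." + table; }
}

interface ReadableCatalog {
    List<String> listDatabases();
    List<String> listTables(String database);
    String getTable(ObjectPath path); // a table descriptor, simplified to String here
}

// Minimal in-memory implementation for demonstration.
class InMemoryCatalog implements ReadableCatalog {
    private final Map<String, Map<String, String>> databases = new HashMap<>();

    void createTable(String database, String table, String descriptor) {
        databases.computeIfAbsent(database, d -> new HashMap<>()).put(table, descriptor);
    }
    @Override public List<String> listDatabases() { return new ArrayList<>(databases.keySet()); }
    @Override public List<String> listTables(String database) {
        return new ArrayList<>(databases.getOrDefault(database, Map.of()).keySet());
    }
    @Override public String getTable(ObjectPath path) {
        return databases.getOrDefault(path.database, Map.of()).get(path.table);
    }
}

public class CatalogSketch {
    public static void main(String[] args) {
        InMemoryCatalog hive = new InMemoryCatalog();
        hive.createTable("default", "taxi_rides", "kafka-source");
        // A fully qualified reference would be hive.default.taxi_rides
        System.out.println(hive.getTable(new ObjectPath("default", "taxi_rides")));
    }
}
```

The point of the sketch is that a catalog registered in TableEnvironment contributes the first path segment, so both Flink's internal catalog and external ones (e.g. Hive) resolve through the same interface.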
> > > > However, exposing the current external catalog interfaces to SQL > > Client users would make it even more difficult to change the > > interfaces in the future. So maybe I would first wait until the > > general catalog discussion is over and the FLIP has been created. This > > should happen shortly. > > > > We should definitely coordinate the efforts better in the future to > > avoid duplicate work. > > > > Thanks, > > Timo > > > > > > Am 07.01.19 um 00:24 schrieb Eron Wright: > >> Thanks Timo for merging a couple of the PRs. Are you also able to > >> review the others that I mentioned? Xuefu I would like to incorporate > >> your feedback too. > >> > >> Check out this short demonstration of using a catalog in SQL Client: > >> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo > >> > >> Thanks again! > >> > >> On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <eronwri...@gmail.com > >> <mailto:eronwri...@gmail.com>> wrote: > >> > >> Would a couple folks raise their hand to make a review pass thru > >> the 6 PRs listed above? It is a lovely stack of PRs that is 'all > >> green' at the moment. I would be happy to open follow-on PRs to > >> rapidly align with other efforts. > >> > >> Note that the code is agnostic to the details of the > >> ExternalCatalog interface; the code would not be obsolete if/when > >> the catalog interface is enhanced as per the design doc. > >> > >> > >> > >> On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <eronwri...@gmail.com > >> <mailto:eronwri...@gmail.com>> wrote: > >> > >> I propose that the community review and merge the PRs that I > >> posted, and then evolve the design thru 1.8 and beyond. I > >> think having a basic infrastructure in place now will > >> accelerate the effort, do you agree? > >> > >> Thanks again! > >> > >> On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu > >> <xuef...@alibaba-inc.com <mailto:xuef...@alibaba-inc.com>> > >> wrote: > >> > >> Hi Eron, > >> > >> Happy New Year! 
> >> > >> Thank you very much for your contribution, especially > >> during the holidays. While I'm encouraged by your work, I'd > >> also like to share my thoughts on how to move forward. > >> > >> First, please note that the design discussion is still > >> being finalized, and we expect some moderate changes, > >> especially around TableFactories. Another pending change > >> is our decision to shy away from Scala, which will > >> impact our work. > >> > >> Secondly, while your work seems to be about plugging > >> catalog definitions into the execution environment, which > >> is less impacted by the TableFactory change, I did notice some > >> duplication between your work and ours. This is no big deal, > >> but going forward, we should probably communicate better > >> on work assignments so as to avoid any > >> possible duplication of work. On the other hand, I think > >> some of your work is interesting and valuable for > >> inclusion once we finalize the overall design. > >> > >> Thus, please continue your research and experiments, and let > >> us know when you start working on anything so we can > >> better coordinate. > >> > >> Thanks again for your interest and contributions. > >> > >> Thanks, > >> Xuefu > >> > >> > >> > >> ------------------------------------------------------------------ > >> From:Eron Wright <eronwri...@gmail.com > >> <mailto:eronwri...@gmail.com>> > >> Sent At:2019 Jan. 1 (Tue.) 
18:39 > >> To:dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> Cc:Xiaowei Jiang <xiaow...@gmail.com > >> <mailto:xiaow...@gmail.com>>; twalthr > >> <twal...@apache.org <mailto:twal...@apache.org>>; > >> piotr <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>>; Fabian Hueske > >> <fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> suez1224 <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>>; Bowen Li > >> <bowenl...@gmail.com <mailto:bowenl...@gmail.com>> > >> Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > >> Hi folks, there's clearly some incremental steps to be > >> taken to introduce catalog support to SQL Client, > >> complementary to what is proposed in the Flink-Hive > >> Metastore design doc. I was quietly working on this > >> over the holidays. I posted some new sub-tasks, PRs, > >> and sample code to FLINK-10744. > >> > >> What inspired me to get involved is that the catalog > >> interface seems like a great way to encapsulate a > >> 'library' of Flink tables and functions. For example, > >> the NYC Taxi dataset (TaxiRides, TaxiFares, various > >> UDFs) may be nicely encapsulated as a catalog > >> (TaxiData). Such a library should be fully consumable > >> in SQL Client. > >> > >> I implemented the above. Some highlights: > >> 1. A fully-worked example of using the Taxi dataset in > >> SQL Client via an environment file. 
> >> - an ASCII video showing the SQL Client in action: > >> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo > >> > >> - the corresponding environment file (will be even > >> more concise once 'FLINK-10696 Catalog UDFs' is merged): > >> _ > https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml_ > >> > >> - the typed API for standalone table applications: > >> _ > https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50_ > >> > >> 2. Implementation of the core catalog descriptor and > >> factory. I realize that some renames may later occur > >> as per the design doc, and would be happy to do that > >> as a follow-up. > >> https://github.com/apache/flink/pull/7390 > >> > >> 3. Implementation of a connect-style API on > >> TableEnvironment to use catalog descriptor. > >> https://github.com/apache/flink/pull/7392 > >> > >> 4. Integration into SQL-Client's environment file: > >> https://github.com/apache/flink/pull/7393 > >> > >> I realize that the overall Hive integration is still > >> evolving, but I believe that these PRs are a good > >> stepping stone. Here's the list (in bottom-up order): > >> - https://github.com/apache/flink/pull/7386 > >> - https://github.com/apache/flink/pull/7388 > >> - https://github.com/apache/flink/pull/7389 > >> - https://github.com/apache/flink/pull/7390 > >> - https://github.com/apache/flink/pull/7392 > >> - https://github.com/apache/flink/pull/7393 > >> > >> Thanks and enjoy 2019! > >> Eron W > >> > >> > >> On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> wrote: > >> Hi Xiaowei, > >> > >> Thanks for bringing up the question. 
In the current > >> design, the properties for meta objects are meant to > >> cover anything that's specific to a particular catalog > >> and agnostic to Flink. Anything that is common (such > >> as schema for tables, query text for views, and UDF > >> classname) is abstracted as members of the respective > >> classes. However, this is still in discussion, and > >> Timo and I will go over this and provide an update. > >> > >> Please note that UDF is a little more involved than > >> what the current design doc shows. I'm still refining > >> this part. > >> > >> Thanks, > >> Xuefu > >> > >> > >> ------------------------------------------------------------------ > >> Sender:Xiaowei Jiang <xiaow...@gmail.com > >> <mailto:xiaow...@gmail.com>> > >> Sent at:2018 Nov 18 (Sun) 15:17 > >> Recipient:dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>> > >> Cc:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>>; twalthr > >> <twal...@apache.org <mailto:twal...@apache.org>>; > >> piotr <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>>; Fabian Hueske > >> <fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> suez1224 <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > >> Thanks Xuefu for the detailed design doc! One question > >> on the properties associated with the catalog objects. > >> Are we going to leave them completely free-form, or are > >> we going to set some standard for that? I think that > >> the answer may depend on whether we want to explore catalog > >> specific optimization opportunities. In any case, I > >> think that it might be helpful to standardize as much > >> as possible into strongly typed classes and leave > >> these properties for catalog-specific things. But I > >> think that we can do it in steps. 
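The split Xuefu describes above (common attributes as typed members, catalog-specific settings in free-form properties) could look roughly like this in Java. The class name CatalogTable and its fields are assumptions for the example, not the actual design-doc classes:

```java
import java.util.*;

// Illustrative sketch: common attributes (schema, comment) are typed members,
// while catalog-specific settings live in a free-form properties map.
// All names here are assumptions for the example.
class CatalogTable {
    private final LinkedHashMap<String, String> schema; // column name -> type, common to all catalogs
    private final String comment;                       // common to all catalogs
    private final Map<String, String> properties;       // catalog-specific, agnostic to Flink

    CatalogTable(LinkedHashMap<String, String> schema, String comment,
                 Map<String, String> properties) {
        this.schema = schema;
        this.comment = comment;
        this.properties = properties;
    }

    LinkedHashMap<String, String> getSchema() { return schema; }
    String getComment() { return comment; }
    Map<String, String> getProperties() { return properties; }
}

public class PropertiesSketch {
    public static void main(String[] args) {
        LinkedHashMap<String, String> schema = new LinkedHashMap<>();
        schema.put("ride_id", "BIGINT");
        schema.put("fare", "DOUBLE");
        // Catalog-specific details (e.g. a Hive SerDe class) stay in the
        // property map rather than becoming typed fields.
        Map<String, String> props =
                Map.of("hive.storage.serde", "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe");
        CatalogTable t = new CatalogTable(schema, "NYC taxi fares", props);
        System.out.println(t.getSchema().keySet());
    }
}
```

This mirrors Xiaowei's suggestion: strongly typed classes for what every catalog shares, a property bag for the rest, adoptable in steps.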
> >> Xiaowei > >> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li > >> <bowenl...@gmail.com <mailto:bowenl...@gmail.com>> > >> wrote: > >> Thanks for keeping on improving the overall design, > >> Xuefu! It looks quite > >> good to me now. > >> > >> It would be nice if the cc-ed Flink committers could help > >> review and confirm! > >> > >> > >> > >> One minor suggestion: Since the last section of the > >> design doc already touches > >> some new SQL statements, shall we add another section > >> in our doc and > >> formalize the new SQL statements in SQL Client and > >> TableEnvironment that > >> will come along naturally with our design? Here > >> are some that the > >> design doc mentioned and some that I came up with: > >> > >> To be added: > >> > >> - USE <catalog> - set default catalog > >> - USE <catalog.schema> - set default schema > >> - SHOW CATALOGS - show all registered catalogs > >> - SHOW SCHEMAS [FROM catalog] - list schemas in > >> the current default > >> catalog or the specified catalog > >> - DESCRIBE VIEW view - show the view's definition > >> in CatalogView > >> - SHOW VIEWS [FROM schema/catalog.schema] - show > >> views from current or a > >> specified schema. > >> > >> (DDLs that can be addressed by either our design > >> or Shuyi's DDL design) > >> > >> - CREATE/DROP/ALTER SCHEMA schema > >> - CREATE/DROP/ALTER CATALOG catalog > >> > >> To be modified: > >> > >> - SHOW TABLES [FROM schema/catalog.schema] - show > >> tables from current or > >> a specified schema. Add 'from schema' to the existing > >> 'SHOW TABLES' statement > >> - SHOW FUNCTIONS [FROM schema/catalog.schema] - > >> show functions from > >> current or a specified schema. Add 'from schema' > >> to the existing 'SHOW FUNCTIONS' > >> statement > >> > >> > >> Thanks, Bowen > >> > >> > >> > >> On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> wrote: > >> > >> > Thanks, Bowen, for catching the error. 
I have > >> granted comment permission > >> > with the link. > >> > > >> > I also updated the doc with the latest class > >> definitions. Everyone is > >> > encouraged to review and comment. > >> > > >> > Thanks, > >> > Xuefu > >> > > >> > > >> ------------------------------------------------------------------ > >> > Sender:Bowen Li <bowenl...@gmail.com > >> <mailto:bowenl...@gmail.com>> > >> > Sent at:2018 Nov 14 (Wed) 06:44 > >> > Recipient:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > Cc:piotr <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>>; dev > >> <dev@flink.apache.org <mailto:dev@flink.apache.org>>; > >> Shuyi > >> > Chen <suez1...@gmail.com <mailto:suez1...@gmail.com > >> > >> > Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > > >> > Hi Xuefu, > >> > > >> > Currently the new design doc > >> > > >> < > https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit > > > >> > is on “view only" mode, and people cannot leave > >> comments. Can you please > >> > change it to "can comment" or "can edit" mode? > >> > > >> > Thanks, Bowen > >> > > >> > > >> > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > wrote: > >> > Hi Piotr > >> > > >> > I have extracted the API portion of the design and > >> the google doc is here > >> > > >> < > https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing > >. > >> > Please review and provide your feedback. 
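Bowen's proposed USE <catalog> / USE <catalog.schema> statements above imply a resolution rule for partially qualified table names. A rough sketch of that rule, with all class and default names assumed purely for illustration:

```java
// Hypothetical sketch of how USE <catalog> / USE <catalog.schema> could drive
// resolution of partially qualified table references; not actual Flink code.
public class NameResolver {
    private String currentCatalog = "builtin";
    private String currentSchema = "default";

    // USE catalog  or  USE catalog.schema
    void use(String target) {
        String[] parts = target.split("\\.");
        currentCatalog = parts[0];
        currentSchema = parts.length > 1 ? parts[1] : currentSchema;
    }

    // Expand "t", "schema.t", or "catalog.schema.t" to a fully qualified name.
    String resolve(String reference) {
        String[] parts = reference.split("\\.");
        switch (parts.length) {
            case 1:  return currentCatalog + "." + currentSchema + "." + parts[0];
            case 2:  return currentCatalog + "." + parts[0] + "." + parts[1];
            default: return reference; // already fully qualified
        }
    }

    public static void main(String[] args) {
        NameResolver r = new NameResolver();
        r.use("hive.sales");
        System.out.println(r.resolve("orders"));        // hive.sales.orders
        System.out.println(r.resolve("hr.employees"));  // hive.hr.employees
    }
}
```

Statements like SHOW TABLES [FROM schema/catalog.schema] would then list against the same current defaults unless a FROM clause overrides them.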
> > > Thanks, > >> > Xuefu > >> > > >> > > >> ------------------------------------------------------------------ > >> > Sender:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > Sent at:2018 Nov 12 (Mon) 12:43 > >> > Recipient:Piotr Nowojski <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>>; dev < > >> > dev@flink.apache.org <mailto:dev@flink.apache.org>> > >> > Cc:Bowen Li <bowenl...@gmail.com > >> <mailto:bowenl...@gmail.com>>; Shuyi Chen > >> <suez1...@gmail.com <mailto:suez1...@gmail.com>> > >> > Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > > >> > Hi Piotr, > >> > > >> > That sounds good to me. Let's close all the open > >> questions (there are a > >> > couple of them) in the Google doc and I should be > >> able to quickly split > >> > it into the three proposals as you suggested. > >> > > >> > Thanks, > >> > Xuefu > >> > > >> > > >> ------------------------------------------------------------------ > >> > Sender:Piotr Nowojski <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>> > >> > Sent at:2018 Nov 9 (Fri) 22:46 > >> > Recipient:dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > Cc:Bowen Li <bowenl...@gmail.com > >> <mailto:bowenl...@gmail.com>>; Shuyi Chen > >> <suez1...@gmail.com <mailto:suez1...@gmail.com>> > >> > Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > > >> > Hi, > >> > > >> > > >> > Yes, it seems like the best solution. Maybe someone > >> else can also suggest whether we can split it further? > >> Maybe interface changes in one doc, reading > >> from the Hive metastore in another, and finally storing our > >> meta information in the Hive metastore in a third? 
> > Piotrek > >> > > > >> > > > On 9 Nov 2018, at 01:44, Zhang, Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> wrote: > >> > > > >> > > Hi Piotr, > >> > > > >> > > That seems to be a good idea! > >> > > > >> > > >> > > Since the google doc for the design is currently > >> under extensive review, I will leave it as it is for > >> now. However, I'll convert it to two different FLIPs > >> when the time comes. > >> > > > >> > > How does that sound to you? > >> > > > >> > > Thanks, > >> > > Xuefu > >> > > > >> > > > >> ------------------------------------------------------------------ > >> > > Sender:Piotr Nowojski <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>> > >> > > Sent at:2018 Nov 9 (Fri) 02:31 > >> > > Recipient:dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>> > >> > > Cc:Bowen Li <bowenl...@gmail.com > >> <mailto:bowenl...@gmail.com>>; Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com> > >> > >; Shuyi Chen <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> > > Subject:Re: [DISCUSS] Integrate Flink SQL well > >> with Hive ecosystem > >> > > > >> > > Hi, > >> > > > >> > > >> > > Maybe we should split this topic (and the design > >> doc) into a couple of smaller ones, hopefully > >> independent. The questions that you have asked Fabian, > >> for example, have very little to do with reading > >> metadata from the Hive Metastore. > >> > > > >> > > Piotrek > >> > > > >> > >> On 7 Nov 2018, at 14:27, Fabian Hueske > >> <fhue...@gmail.com <mailto:fhue...@gmail.com>> wrote: > >> > >> > >> > >> Hi Xuefu and all, > >> > >> > >> > >> Thanks for sharing this design document! > >> > > >> > >> I'm very much in favor of restructuring / > >> reworking the catalog handling in > >> > >> Flink SQL as outlined in the document. > >> > > >> > >> Most changes described in the design document > >> seem to be rather general and > >> > >> not specifically related to the Hive integration. 
> >> > >> > >> > > >> > >> IMO, there are some aspects, especially those at > >> the boundary of Hive and > >> > >> Flink, that need a bit more discussion. For > >> example: > >> > >> > >> > >> * What does it take to make Flink schema > >> compatible with Hive schema? > >> > >> * How will Flink tables (descriptors) be stored > >> in HMS? > >> > >> * How do both Hive catalogs differ? Could they > >> be integrated into a > >> > >> single one? When to use which one? > >> > > >> > >> * What meta information is provided by HMS? What > >> of this can be leveraged > >> > >> by Flink? > >> > >> > >> > >> Thank you, > >> > >> Fabian > >> > >> > >> > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen > >> Li <bowenl...@gmail.com <mailto:bowenl...@gmail.com> > >> > >: > >> > >> > >> > >>> After taking a look at how other discussion > >> threads work, I think it's > >> > >>> actually fine to just keep our discussion here. > >> It's up to you, Xuefu. > >> > >>> > >> > >>> The google doc LGTM. I left some minor comments. > >> > >>> > >> > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li > >> <bowenl...@gmail.com <mailto:bowenl...@gmail.com>> > >> wrote: > >> > >>> > >> > >>>> Hi all, > >> > >>>> > >> > >>>> As Xuefu has published the design doc on > >> google, I agree with Shuyi's > >> > > >> > >>>> suggestion that we probably should start a new > >> email thread like "[DISCUSS] > >> > > >> > >>>> ... Hive integration design ..." on only the dev > >> mailing list for community > >> > >>>> devs to review. The current thread goes to > >> both the dev and user lists. > >> > >>>> > >> > > >> > >>>> This email thread is more about validating the > >> general idea and direction > >> > > >> > >>>> with the community, and it's been pretty long > >> and crowded so far. Since > >> > > >> > >>>> everyone is in favor of the idea, we can move > >> forward with another thread to > >> > >>>> discuss and finalize the design. 
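Fabian's first question above (making Flink schema compatible with Hive schema) largely reduces to a type mapping. A first-cut sketch of such a mapping, using type-name strings purely for illustration; a real implementation would map Hive's TypeInfo to Flink's TypeInformation rather than strings:

```java
import java.util.*;

// Illustrative first-cut mapping from Hive type names to Flink SQL type names.
// The string-to-string form is a simplification for the example; a real
// implementation would bridge Hive TypeInfo and Flink TypeInformation.
public class HiveTypeMapping {
    static final Map<String, String> HIVE_TO_FLINK = Map.of(
            "tinyint", "TINYINT",
            "smallint", "SMALLINT",
            "int", "INT",
            "bigint", "BIGINT",
            "float", "FLOAT",
            "double", "DOUBLE",
            "boolean", "BOOLEAN",
            "string", "VARCHAR",
            "timestamp", "TIMESTAMP",
            "binary", "VARBINARY");

    static String toFlinkType(String hiveType) {
        String flink = HIVE_TO_FLINK.get(hiveType.toLowerCase(Locale.ROOT));
        if (flink == null) {
            // Complex types (array, map, struct) and parameterized types
            // (decimal, char, varchar) need dedicated handling.
            throw new IllegalArgumentException("Unsupported Hive type: " + hiveType);
        }
        return flink;
    }

    public static void main(String[] args) {
        System.out.println(toFlinkType("string")); // VARCHAR
    }
}
```

Even this toy version shows where the discussion points live: the easy primitives map one-to-one, while parameterized and nested types are exactly the "boundary of Hive and Flink" cases Fabian calls out.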
> >> > >>>> > >> > >>>> Thanks, > >> > >>>> Bowen > >> > >>>> > >> > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu < > >> > xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>> wrote: > >> > >>>> > >> > >>>>> Hi Shuyi, > >> > >>>>> > >> > > >> > >>>>> Good idea. Actually the PDF was converted > >> from a google doc. Here is its > >> > >>>>> link: > >> > >>>>> > >> > >>>>> > >> > > >> > https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing > >> > >>>>> Once we reach an agreement, I can convert it > >> to a FLIP. > >> > >>>>> > >> > >>>>> Thanks, > >> > >>>>> Xuefu > >> > >>>>> > >> > >>>>> > >> > >>>>> > >> > >>>>> > >> ------------------------------------------------------------------ > >> > >>>>> Sender:Shuyi Chen <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> > >>>>> Sent at:2018 Nov 1 (Thu) 02:47 > >> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> Cc:vino yang <yanghua1...@gmail.com > >> <mailto:yanghua1...@gmail.com>>; Fabian Hueske < > >> > fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> > >>>>> dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; user > >> <u...@flink.apache.org <mailto:u...@flink.apache.org>> > >> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL > >> well with Hive ecosystem > >> > >>>>> > >> > >>>>> Hi Xuefu, > >> > >>>>> > >> > > >> > >>>>> Thanks a lot for driving this big effort. I > >> would suggest converting your > >> > > >> > >>>>> proposal and design doc into a google doc, > >> and share it on the dev mailing > >> > > >> > >>>>> list for the community to review and comment > >> with a title like "[DISCUSS] ... > >> > > >> > >>>>> Hive integration design ..." . Once > >> approved, we can document it as a FLIP > >> > > >> > >>>>> (Flink Improvement Proposal), and use JIRAs > >> to track the implementations. > >> > >>>>> What do you think? 
> >> > >>>>> > >> > >>>>> Shuyi > >> > >>>>> > >> > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu < > >> > xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> wrote: > >> > >>>>> Hi all, > >> > >>>>> > >> > >>>>> I have also shared a design doc on Hive > >> metastore integration that is > >> > > >> > >>>>> attached here and also to FLINK-10556[1]. > >> Please kindly review and share > >> > >>>>> your feedback. > >> > >>>>> > >> > >>>>> > >> > >>>>> Thanks, > >> > >>>>> Xuefu > >> > >>>>> > >> > >>>>> [1] > >> https://issues.apache.org/jira/browse/FLINK-10556 > >> > >>>>> > >> ------------------------------------------------------------------ > >> > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> Sent at:2018 Oct 25 (Thu) 01:08 > >> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>>; Shuyi Chen < > >> > >>>>> suez1...@gmail.com <mailto:suez1...@gmail.com > >> > >> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com > >> <mailto:yanghua1...@gmail.com>>; Fabian Hueske < > >> > fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> > >>>>> dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; user > >> <u...@flink.apache.org <mailto:u...@flink.apache.org>> > >> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL > >> well with Hive ecosystem > >> > >>>>> > >> > >>>>> Hi all, > >> > >>>>> > >> > >>>>> To wrap up the discussion, I have attached a > >> PDF describing the > >> > > >> > >>>>> proposal, which is also attached to > >> FLINK-10556 [1]. Please feel free to > >> > >>>>> watch that JIRA to track the progress. > >> > >>>>> > >> > >>>>> Please also let me know if you have > >> additional comments or questions. 
> >> > >>>>> > >> > >>>>> Thanks, > >> > >>>>> Xuefu > >> > >>>>> > >> > >>>>> [1] > >> https://issues.apache.org/jira/browse/FLINK-10556 > >> > >>>>> > >> > >>>>> > >> > >>>>> > >> ------------------------------------------------------------------ > >> > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> Sent at:2018 Oct 16 (Tue) 03:40 > >> > >>>>> Recipient:Shuyi Chen <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com > >> <mailto:yanghua1...@gmail.com>>; Fabian Hueske < > >> > fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> > >>>>> dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; user > >> <u...@flink.apache.org <mailto:u...@flink.apache.org>> > >> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL > >> well with Hive ecosystem > >> > >>>>> > >> > >>>>> Hi Shuyi, > >> > >>>>> > >> > > >> > >>>>> Thank you for your input. Yes, I agree with > >> a phased approach and would like > >> > > >> > >>>>> to move forward fast. :) We did some work > >> internally on DDL utilizing the babel > >> > >>>>> parser in Calcite. While babel makes > >> Calcite's grammar extensible, at > >> > >>>>> first impression it still seems too > >> cumbersome for a project when too > >> > > >> > >>>>> many extensions are made. It's even > >> challenging to find where the extension > >> > > >> > >>>>> is needed! It would certainly be better if > >> Calcite could magically support > >> > > >> > >>>>> Hive QL by just turning on a flag, such as > >> that for MYSQL_5. I can also > >> > > >> > >>>>> see that this could mean a lot of work on > >> Calcite. Nevertheless, I will > >> > > >> > >>>>> bring up the discussion over there and see > >> what their community thinks. > >> > >>>>> > >> > >>>>> Would you mind sharing more info about the > >> proposal on DDL that you > >> > >>>>> mentioned? We can certainly collaborate on > >> this. 
> >> > >>>>> > >> > >>>>> Thanks, > >> > >>>>> Xuefu > >> > >>>>> > >> > >>>>> > >> ------------------------------------------------------------------ > >> > >>>>> Sender:Shuyi Chen <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> > >>>>> Sent at:2018 Oct 14 (Sun) 08:30 > >> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com > >> <mailto:yanghua1...@gmail.com>>; Fabian Hueske < > >> > fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> > >>>>> dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; user > >> <u...@flink.apache.org <mailto:u...@flink.apache.org>> > >> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL > >> well with Hive ecosystem > >> > >>>>> > >> > >>>>> Welcome to the community and thanks for the > >> great proposal, Xuefu! I > >> > > >> > >>>>> think the proposal can be divided into 2 > >> stages: making Flink support > >> > > >> > >>>>> Hive features, and making Hive work with > >> Flink. I agree with Timo on > >> > > >> > >>>>> starting with a smaller scope, so we can make > >> progress faster. As for [6], > >> > > >> > >>>>> a proposal for DDL is already in progress, > >> and will come after the unified > >> > > >> > >>>>> SQL connector API is done. For supporting > >> Hive syntax, we might need to > >> > >>>>> work with the Calcite community, and a recent > >> effort called babel ( > >> > >>>>> > >> https://issues.apache.org/jira/browse/CALCITE-2280) in > >> Calcite might > >> > >>>>> help here. > >> > >>>>> > >> > >>>>> Thanks > >> > >>>>> Shuyi > >> > >>>>> > >> > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu < > >> > xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> wrote: > >> > >>>>> Hi Fabian/Vino, > >> > >>>>> > >> > > >> > >>>>> Thank you very much for your encouragement > >> and inquiry. Sorry that I didn't > >> > > >> > >>>>> see Fabian's email until I read Vino's > >> response just now. 
(Somehow Fabian's > >> > >>>>> went to the spam folder.) > >> > >>>>> > >> > > >> > >>>>> My proposal contains long-term and > >> short-term goals. Nevertheless, the > >> > >>>>> effort will focus on the following areas, > >> including Fabian's list: > >> > >>>>> > >> > >>>>> 1. Hive metastore connectivity - This covers > >> both read/write access, > >> > > >> > >>>>> which means Flink can make full use of Hive's > >> metastore as its catalog (at > >> > >>>>> least for batch, but this can be extended to > >> streaming as well). > >> > > >> > >>>>> 2. Metadata compatibility - Objects > >> (databases, tables, partitions, etc.) > >> > > >> > >>>>> created by Hive can be understood by Flink, > >> and the reverse is > >> > >>>>> true as well. > >> > >>>>> 3. Data compatibility - Similar to #2, data > >> produced by Hive can be > >> > >>>>> consumed by Flink and vice versa. > >> > > >> > >>>>> 4. Support Hive UDFs - For all of Hive's native > >> UDFs, Flink either provides > >> > >>>>> its own implementation or makes Hive's > >> implementation work in Flink. > >> > >>>>> Further, for user-created UDFs in Hive, Flink > >> SQL should provide a > >> > > >> > >>>>> mechanism allowing users to import them into > >> Flink without any code change > >> > >>>>> required. > >> > >>>>> 5. Data types - Flink SQL should support all > >> data types that are > >> > >>>>> available in Hive. > >> > >>>>> 6. SQL Language - Flink SQL should support the > >> SQL standard (such as > >> > > >> > >>>>> SQL:2003) with extensions to support Hive's > >> syntax and language features, > >> > >>>>> around DDL, DML, and SELECT queries. > >> > > >> > >>>>> 7. SQL CLI - this is currently being developed in > >> Flink, but more effort is > >> > >>>>> needed. > >> > > >> > >>>>> 8. 
Server - provide a server that's > >> compatible with Hive's HiveServer2 > >> > > >> > >>>>> in its Thrift APIs, such that HiveServer2 users > >> can reuse their existing client > >> > >>>>> (such as beeline) but connect to Flink's > >> Thrift server instead. > >> > > >> > >>>>> 9. JDBC/ODBC drivers - Flink may provide its > >> own JDBC/ODBC drivers for > >> > >>>>> other applications to use to connect to its > >> Thrift server > >> > >>>>> 10. Support other user customizations in > >> Hive, such as Hive SerDes, > >> > >>>>> storage handlers, etc. > >> > > >> > >>>>> 11. Better task failure tolerance and task > >> scheduling at Flink runtime. > >> > >>>>> > >> > >>>>> As you can see, achieving all those requires > >> significant effort > >> > > >> > >>>>> across all layers in Flink. However, a > >> short-term goal could include only > >> > > >> > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or > >> start at a smaller scope (such as > >> > >>>>> #3, #6). > >> > >>>>> > >> > > >> > >>>>> Please share your further thoughts. If we > >> generally agree that this is > >> > > >> > >>>>> the right direction, I could come up with a > >> formal proposal quickly and > >> > >>>>> then we can follow up with broader discussions. 
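Point 4 in Xuefu's list (reusing existing Hive UDFs without code changes) can be approximated with a reflective adapter: simple Hive UDFs expose public evaluate(...) methods, so a wrapper can locate and invoke them. This is a hedged sketch with no Flink or Hive dependencies; the HiveUdfAdapter name and the toy LengthUdf are assumptions for the example, and real integration would bridge to Flink's ScalarFunction and Hive's UDF/GenericUDF classes:

```java
import java.lang.reflect.Method;

// Sketch of wrapping a Hive-style UDF (which exposes public evaluate(...)
// methods) behind a generic adapter, so users could import existing UDFs
// without code changes. All names here are illustrative.
public class HiveUdfAdapter {
    private final Object udf;

    public HiveUdfAdapter(Class<?> udfClass) {
        try {
            this.udf = udfClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate UDF " + udfClass, e);
        }
    }

    // Find an evaluate(...) method matching the argument count and invoke it.
    public Object eval(Object... args) {
        for (Method m : udf.getClass().getMethods()) {
            if (m.getName().equals("evaluate") && m.getParameterCount() == args.length) {
                try {
                    return m.invoke(udf, args);
                } catch (ReflectiveOperationException e) {
                    throw new RuntimeException("UDF invocation failed", e);
                }
            }
        }
        throw new IllegalArgumentException("No matching evaluate() on " + udf.getClass());
    }

    // Toy stand-in for a user's existing Hive UDF class.
    public static class LengthUdf {
        public Integer evaluate(String s) { return s == null ? null : s.length(); }
    }

    public static void main(String[] args) {
        HiveUdfAdapter adapter = new HiveUdfAdapter(LengthUdf.class);
        System.out.println(adapter.eval("flink")); // 5
    }
}
```

A production version would also need overload resolution by argument types and Hive's ObjectInspector handling, which is why the design doc treats UDFs as "a little more involved".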
Thanks,
Xuefu

------------------------------------------------------------------
Sender: vino yang <yanghua1...@gmail.com>
Sent at: 2018 Oct 11 (Thu) 09:45
Recipient: Fabian Hueske <fhue...@gmail.com>
Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <u...@flink.apache.org>
Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Appreciate this proposal, and like Fabian, I think it would be better if you could give more details of the plan.

Thanks, vino.

Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018, 5:27 PM:

Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great! Can you go into details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best,
Fabian

On Tue, Oct 9, 2018 at 7:22 PM, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:

Hi all,

Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing, due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.

We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote much greater adoption of Flink beyond stream processing.

We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as a follow-up effort.
I'm completely new to Flink (with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcome and appreciated.

Regards,
Xuefu

References:

[1] https://issues.apache.org/jira/browse/HIVE-10712
[2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo when those projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he promoted Hive on Spark for all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.

--
"So you have to trust that the dots will somehow connect in your future."
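[Editor's note] To make item 9 above concrete: the appeal of a Flink JDBC/ODBC driver is that client applications could stay on plain `java.sql` while connecting to Flink's thrift endpoint, and reusing the HiveServer2 `jdbc:hive2://` URL scheme is what would let existing clients such as beeline connect unchanged (item 8). A minimal sketch, under the assumption of a HiveServer2-style endpoint; the class and endpoint below are hypothetical, since no such Flink driver existed at the time of this thread:

```java
// Hypothetical sketch of item 9 (a Flink JDBC driver): applications would use
// plain java.sql against Flink's thrift endpoint. The hive2 URL scheme mirrors
// HiveServer2; this is an illustration of the idea, not a real Flink API.
public class FlinkJdbcSketch {

    // Build a HiveServer2-style JDBC URL. Reusing this scheme is what would
    // let existing clients such as beeline connect unchanged (item 8).
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        String url = jdbcUrl("localhost", 10000, "default");
        System.out.println(url); // prints jdbc:hive2://localhost:10000/default

        // With a compatible server actually running, a standard JDBC session
        // would look like this (not executed here):
        //
        // try (Connection conn = DriverManager.getConnection(url, "user", "");
        //      Statement stmt = conn.createStatement();
        //      ResultSet rs = stmt.executeQuery("SELECT 1")) {
        //     rs.next();
        // }
    }
}
```

A beeline user would point at the same hypothetical endpoint with `beeline -u jdbc:hive2://localhost:10000/default`, which is exactly the client reuse item 8 aims for.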