Thanks, Timo, for merging a couple of the PRs. Are you also able to review the others that I mentioned? Xuefu, I would like to incorporate your feedback too.
Check out this short demonstration of using a catalog in SQL Client: https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo Thanks again! On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <eronwri...@gmail.com> wrote: > Would a couple folks raise their hand to make a review pass thru the 6 PRs > listed above? It is a lovely stack of PRs that is 'all green' at the > moment. I would be happy to open follow-on PRs to rapidly align with > other efforts. > > Note that the code is agnostic to the details of the ExternalCatalog > interface; the code would not be obsolete if/when the catalog interface is > enhanced as per the design doc. > > > > On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <eronwri...@gmail.com> wrote: > >> I propose that the community review and merge the PRs that I posted, and >> then evolve the design thru 1.8 and beyond. I think having a basic >> infrastructure in place now will accelerate the effort, do you agree? >> >> Thanks again! >> >> On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu <xuef...@alibaba-inc.com> >> wrote: >> >>> Hi Eron, >>> >>> Happy New Year! >>> >>> Thank you very much for your contribution, especially during the >>> holidays. While I'm encouraged by your work, I'd also like to share my >>> thoughts on how to move forward. >>> >>> First, please note that the design discussion is still finalizing, and >>> we expect some moderate changes, especially around TableFactories. Another >>> pending change is our decision to shy away from Scala, which will impact >>> our work. >>> >>> Secondly, while your work seems to be about plugging catalog definitions >>> into the execution environment, which is less impacted by the TableFactory >>> change, I did notice some duplication between your work and ours. This is no big >>> deal, but going forward, we should probably have better communication on >>> work assignments so as to avoid any possible duplication of work.
On the >>> other hand, I think some of your work is interesting and valuable for >>> inclusion once we finalize the overall design. >>> >>> Thus, please continue your research and experiments and let us know when >>> you start working on anything so we can better coordinate. >>> >>> Thanks again for your interest and contributions. >>> >>> Thanks, >>> Xuefu >>> >>> >>> >>> ------------------------------------------------------------------ >>> From:Eron Wright <eronwri...@gmail.com> >>> Sent At:2019 Jan. 1 (Tue.) 18:39 >>> To:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com> >>> Cc:Xiaowei Jiang <xiaow...@gmail.com>; twalthr <twal...@apache.org>; >>> piotr <pi...@data-artisans.com>; Fabian Hueske <fhue...@gmail.com>; >>> suez1224 <suez1...@gmail.com>; Bowen Li <bowenl...@gmail.com> >>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> >>> Hi folks, there are clearly some incremental steps to be taken to >>> introduce catalog support to SQL Client, complementary to what is proposed >>> in the Flink-Hive Metastore design doc. I was quietly working on this over >>> the holidays. I posted some new sub-tasks, PRs, and sample code >>> to FLINK-10744. >>> >>> What inspired me to get involved is that the catalog interface seems >>> like a great way to encapsulate a 'library' of Flink tables and functions. >>> For example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may >>> be nicely encapsulated as a catalog (TaxiData). Such a library should be >>> fully consumable in SQL Client. >>> >>> I implemented the above. Some highlights: >>> >>> 1. A fully-worked example of using the Taxi dataset in SQL Client via an >>> environment file.
>>> - an ASCII video showing the SQL Client in action: >>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo >>> >>> - the corresponding environment file (will be even more concise once >>> 'FLINK-10696 Catalog UDFs' is merged): >>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml >>> >>> - the typed API for standalone table applications: >>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50 >>> >>> 2. Implementation of the core catalog descriptor and factory. I realize >>> that some renames may later occur as per the design doc, and would be happy >>> to do that as a follow-up. >>> https://github.com/apache/flink/pull/7390 >>> >>> 3. Implementation of a connect-style API on TableEnvironment to use the >>> catalog descriptor. >>> https://github.com/apache/flink/pull/7392 >>> >>> 4. Integration into SQL Client's environment file: >>> https://github.com/apache/flink/pull/7393 >>> >>> I realize that the overall Hive integration is still evolving, but I >>> believe that these PRs are a good stepping stone. Here's the list (in >>> bottom-up order): >>> - https://github.com/apache/flink/pull/7386 >>> - https://github.com/apache/flink/pull/7388 >>> - https://github.com/apache/flink/pull/7389 >>> - https://github.com/apache/flink/pull/7390 >>> - https://github.com/apache/flink/pull/7392 >>> - https://github.com/apache/flink/pull/7393 >>> >>> Thanks and enjoy 2019!
>>> Eron W >>> >>> >>> On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <xuef...@alibaba-inc.com> >>> wrote: >>> Hi Xiaowei, >>> >>> Thanks for bringing up the question. In the current design, the >>> properties for meta objects are meant to cover anything that's specific to >>> a particular catalog and agnostic to Flink. Anything that is common (such >>> as the schema for tables, query text for views, and UDF classnames) is >>> abstracted as members of the respective classes. However, this is still in >>> discussion, and Timo and I will go over this and provide an update. >>> >>> Please note that UDF is a little more involved than what the current >>> design doc shows. I'm still refining this part. >>> >>> Thanks, >>> Xuefu >>> >>> >>> ------------------------------------------------------------------ >>> Sender:Xiaowei Jiang <xiaow...@gmail.com> >>> Sent at:2018 Nov 18 (Sun) 15:17 >>> Recipient:dev <dev@flink.apache.org> >>> Cc:Xuefu <xuef...@alibaba-inc.com>; twalthr <twal...@apache.org>; piotr >>> <pi...@data-artisans.com>; Fabian Hueske <fhue...@gmail.com>; suez1224 < >>> suez1...@gmail.com> >>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> >>> Thanks Xuefu for the detailed design doc! One question on the properties >>> associated with the catalog objects. Are we going to leave them completely >>> free-form, or are we going to set some standard for them? I think that the >>> answer may depend on whether we want to explore catalog-specific optimization >>> opportunities. In any case, I think that it might be helpful to >>> standardize as much as possible into strongly typed classes and leave >>> these properties for catalog-specific things. But I think that we can do it >>> in steps. >>> >>> Xiaowei >>> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bowenl...@gmail.com> wrote: >>> Thanks for continuing to improve the overall design, Xuefu! It looks quite >>> good to me now.
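Xiaowei's suggestion above, to standardize common attributes into strongly typed classes while leaving a free-form properties map for catalog-specific things, can be illustrated with a small sketch. This is illustrative only: these are not Flink's actual classes, whose names and fields the design doc was still settling at the time.

```python
# Illustrative sketch of the split Xiaowei describes: common, Flink-agnostic
# attributes live as typed fields, while anything specific to a particular
# catalog stays in a free-form key/value properties map.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CatalogTable:
    name: str
    # Common attribute, standardized for every catalog: the table schema
    # as (column name, type) pairs.
    schema: List[Tuple[str, str]]
    # Catalog-specific, free-form properties (e.g. SerDe settings or
    # storage format hints) that Flink would pass through untouched.
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class CatalogView(CatalogTable):
    # Views additionally standardize the query text, per the design doc.
    query: str = ""


rides = CatalogTable(
    name="TaxiRides",
    schema=[("rideId", "BIGINT"), ("startTime", "TIMESTAMP")],
    properties={"connector.type": "kafka", "format.type": "json"},
)
```

The point of the split is that an optimizer can rely on the typed fields regardless of which catalog produced the object, while the properties map stays opaque.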
>>> >>> It would be nice if the cc-ed Flink committers could help review and >>> confirm! >>> >>> >>> >>> One minor suggestion: Since the last section of the design doc already >>> touches >>> on some new SQL statements, shall we add another section in our doc and >>> formalize the new SQL statements in SQL Client and TableEnvironment that >>> will come along naturally with our design? Here are some that the >>> design doc mentioned and some that I came up with: >>> >>> To be added: >>> >>> - USE <catalog> - set default catalog >>> - USE <catalog.schema> - set default schema >>> - SHOW CATALOGS - show all registered catalogs >>> - SHOW SCHEMAS [FROM catalog] - list schemas in the current default >>> catalog or the specified catalog >>> - DESCRIBE VIEW view - show the view's definition in CatalogView >>> - SHOW VIEWS [FROM schema/catalog.schema] - show views from the current >>> or a >>> specified schema. >>> >>> (DDLs that can be addressed by either our design or Shuyi's DDL >>> design) >>> >>> - CREATE/DROP/ALTER SCHEMA schema >>> - CREATE/DROP/ALTER CATALOG catalog >>> >>> To be modified: >>> >>> - SHOW TABLES [FROM schema/catalog.schema] - show tables from the >>> current or >>> a specified schema. Add 'from schema' to the existing 'SHOW TABLES' >>> statement >>> - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from the >>> current or a specified schema. Add 'from schema' to the existing 'SHOW >>> FUNCTIONS' >>> statement >>> >>> >>> Thanks, Bowen >>> >>> >>> >>> On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xuef...@alibaba-inc.com> >>> wrote: >>> >>> > Thanks, Bowen, for catching the error. I have granted comment >>> permission >>> > with the link. >>> > >>> > I also updated the doc with the latest class definitions. Everyone is >>> > encouraged to review and comment.
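To make Bowen's proposed statements above concrete, a SQL Client session using them might look like the following sketch. Note this is proposed syntax from the discussion, not something implemented in Flink at the time of these emails, and the catalog and schema names are made up:

```sql
-- Set the default catalog, then the default schema within it
USE hive_catalog;
USE hive_catalog.sales_db;

-- Discovery statements
SHOW CATALOGS;
SHOW SCHEMAS FROM hive_catalog;
SHOW TABLES FROM hive_catalog.sales_db;
SHOW FUNCTIONS FROM hive_catalog.sales_db;
SHOW VIEWS FROM hive_catalog.sales_db;

-- Inspect a view's definition (backed by CatalogView)
DESCRIBE VIEW daily_totals;

-- DDL covered by either this design or Shuyi's DDL design
CREATE CATALOG another_catalog;
CREATE SCHEMA another_catalog.staging;
DROP SCHEMA another_catalog.staging;
DROP CATALOG another_catalog;
```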
>>> > >>> > Thanks, >>> > Xuefu >>> > >>> > ------------------------------------------------------------------ >>> > Sender:Bowen Li <bowenl...@gmail.com> >>> > Sent at:2018 Nov 14 (Wed) 06:44 >>> > Recipient:Xuefu <xuef...@alibaba-inc.com> >>> > Cc:piotr <pi...@data-artisans.com>; dev <dev@flink.apache.org>; Shuyi >>> > Chen <suez1...@gmail.com> >>> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> > >>> > Hi Xuefu, >>> > >>> > Currently the new design doc >>> > < >>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit >>> > >>> > is in "view only" mode, and people cannot leave comments. Can you >>> please >>> > change it to "can comment" or "can edit" mode? >>> > >>> > Thanks, Bowen >>> > >>> > >>> > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xuef...@alibaba-inc.com >>> > >>> > wrote: >>> > Hi Piotr, >>> > >>> > I have extracted the API portion of the design and the Google doc is >>> here >>> > < >>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing >>> >. >>> > Please review and provide your feedback. >>> > >>> > Thanks, >>> > Xuefu >>> > >>> > ------------------------------------------------------------------ >>> > Sender:Xuefu <xuef...@alibaba-inc.com> >>> > Sent at:2018 Nov 12 (Mon) 12:43 >>> > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev < >>> > dev@flink.apache.org> >>> > Cc:Bowen Li <bowenl...@gmail.com>; Shuyi Chen <suez1...@gmail.com> >>> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> > >>> > Hi Piotr, >>> > >>> > That sounds good to me. Let's close all the open questions (there >>> are a >>> > couple of them) in the Google doc and I should be able to quickly >>> split >>> > it into the three proposals as you suggested.
>>> > >>> > Thanks, >>> > Xuefu >>> > >>> > ------------------------------------------------------------------ >>> > Sender:Piotr Nowojski <pi...@data-artisans.com> >>> > Sent at:2018 Nov 9 (Fri) 22:46 >>> > Recipient:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com> >>> > Cc:Bowen Li <bowenl...@gmail.com>; Shuyi Chen <suez1...@gmail.com> >>> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> > >>> > Hi, >>> > >>> > >>> > Yes, it seems like the best solution. Maybe someone else can also >>> suggest whether we can split it further? Maybe changes to the interface in one >>> doc, reading from the Hive metastore in another, and finally storing our meta >>> information in the Hive metastore? >>> > >>> > Piotrek >>> > >>> > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xuef...@alibaba-inc.com> >>> wrote: >>> > > >>> > > Hi Piotr, >>> > > >>> > > That seems to be a good idea! >>> > > >>> > >>> > > Since the Google doc for the design is currently under extensive >>> review, I will leave it as it is for now. However, I'll convert it to two >>> different FLIPs when the time comes. >>> > > >>> > > How does that sound to you? >>> > > >>> > > Thanks, >>> > > Xuefu >>> > > >>> > > >>> > > ------------------------------------------------------------------ >>> > > Sender:Piotr Nowojski <pi...@data-artisans.com> >>> > > Sent at:2018 Nov 9 (Fri) 02:31 >>> > > Recipient:dev <dev@flink.apache.org> >>> > > Cc:Bowen Li <bowenl...@gmail.com>; Xuefu <xuef...@alibaba-inc.com >>> > >; Shuyi Chen <suez1...@gmail.com> >>> > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> > > >>> > > Hi, >>> > > >>> > >>> > > Maybe we should split this topic (and the design doc) into a couple >>> of smaller ones, hopefully independent. The questions that you have asked >>> Fabian, for example, have very little to do with reading metadata from the Hive >>> Metastore.
>>> > > >>> > > Piotrek >>> > > >>> > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fhue...@gmail.com> wrote: >>> > >> >>> > >> Hi Xuefu and all, >>> > >> >>> > >> Thanks for sharing this design document! >>> > >>> > >> I'm very much in favor of restructuring / reworking the catalog >>> handling in >>> > >> Flink SQL as outlined in the document. >>> > >>> > >> Most changes described in the design document seem to be rather >>> general and >>> > >> not specifically related to the Hive integration. >>> > >> >>> > >>> > >> IMO, there are some aspects, especially those at the boundary of >>> Hive and >>> > >> Flink, that need a bit more discussion. For example >>> > >> >>> > >> * What does it take to make a Flink schema compatible with a Hive >>> schema? >>> > >> * How will Flink tables (descriptors) be stored in HMS? >>> > >> * How do the two Hive catalogs differ? Could they be integrated into a >>> > >> single one? When to use which one? >>> > >>> > >> * What meta information is provided by HMS? Which of it can be >>> leveraged >>> > >> by Flink? >>> > >> >>> > >> Thank you, >>> > >> Fabian >>> > >> >>> > >> On Fri, Nov 2, 2018 at 00:31, Bowen Li < >>> bowenl...@gmail.com >>> > > wrote: >>> > >> >>> > >>> After taking a look at how other discussion threads work, I think >>> it's >>> > >>> actually fine to just keep our discussion here. It's up to you, >>> Xuefu. >>> > >>> >>> > >>> The Google doc LGTM. I left some minor comments. >>> > >>> >>> > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bowenl...@gmail.com> >>> wrote: >>> > >>> >>> > >>>> Hi all, >>> > >>>> >>> > >>>> As Xuefu has published the design doc on Google, I agree with >>> Shuyi's >>> > >>> > >>>> suggestion that we probably should start a new email thread like >>> "[DISCUSS] >>> > >>> > >>>> ... Hive integration design ..." on only the dev mailing list for >>> community >>> > >>>> devs to review. The current thread sends to both dev and user >>> list.
>>> > >>>> >>> > >>> > >>>> This email thread is more like validating the general idea and >>> direction >>> > >>> > >>>> with the community, and it's been pretty long and crowded so >>> far. Since >>> > >>> > >>>> everyone is in favor of the idea, we can move forward with another >>> thread to >>> > >>>> discuss and finalize the design. >>> > >>>> >>> > >>>> Thanks, >>> > >>>> Bowen >>> > >>>> >>> > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu < >>> > xuef...@alibaba-inc.com> >>> > >>>> wrote: >>> > >>>> >>> > >>>>> Hi Shuyi, >>> > >>>>> >>> > >>> > >>>>> Good idea. Actually, the PDF was converted from a Google doc. >>> Here is its >>> > >>>>> link: >>> > >>>>> >>> > >>>>> >>> > >>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing >>> > >>>>> Once we reach an agreement, I can convert it to a FLIP. >>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:Shuyi Chen <suez1...@gmail.com> >>> > >>>>> Sent at:2018 Nov 1 (Thu) 02:47 >>> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com> >>> > >>>>> Cc:vino yang <yanghua1...@gmail.com>; Fabian Hueske < >>> > fhue...@gmail.com>; >>> > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Hi Xuefu, >>> > >>>>> >>> > >>> > >>>>> Thanks a lot for driving this big effort. I would suggest >>> converting your >>> > >>> > >>>>> proposal and design doc into a Google doc and sharing it on the >>> dev mailing >>> > >>> > >>>>> list for the community to review and comment on, with a title like >>> "[DISCUSS] ... >>> > >>> > >>>>> Hive integration design ..." . Once approved, we can document >>> it as a FLIP >>> > >>> > >>>>> (Flink Improvement Proposal), and use JIRAs to track the >>> implementations. >>> > >>>>> What do you think?
>>> > >>>>> >>> > >>>>> Shuyi >>> > >>>>> >>> > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu < >>> > xuef...@alibaba-inc.com> >>> > >>>>> wrote: >>> > >>>>> Hi all, >>> > >>>>> >>> > >>>>> I have also shared a design doc on Hive metastore integration >>> that is >>> > >>> > >>>>> attached here and also to FLINK-10556[1]. Please kindly review >>> and share >>> > >>>>> your feedback. >>> > >>>>> >>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556 >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com> >>> > >>>>> Sent at:2018 Oct 25 (Thu) 01:08 >>> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com>; Shuyi Chen < >>> > >>>>> suez1...@gmail.com> >>> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske < >>> > fhue...@gmail.com>; >>> > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Hi all, >>> > >>>>> >>> > >>>>> To wrap up the discussion, I have attached a PDF describing the >>> > >>> > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please >>> feel free to >>> > >>>>> watch that JIRA to track the progress. >>> > >>>>> >>> > >>>>> Please also let me know if you have additional comments or >>> questions. 
>>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556 >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com> >>> > >>>>> Sent at:2018 Oct 16 (Tue) 03:40 >>> > >>>>> Recipient:Shuyi Chen <suez1...@gmail.com> >>> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske < >>> > fhue...@gmail.com>; >>> > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Hi Shuyi, >>> > >>>>> >>> > >>> > >>>>> Thank you for your input. Yes, I agree with a phased approach >>> and would like >>> > >>> > >>>>> to move forward fast. :) We did some work internally on DDL >>> utilizing the Babel >>> > >>>>> parser in Calcite. While Babel makes Calcite's grammar >>> extensible, at >>> > >>>>> first impression it still seems too cumbersome for a project >>> when too >>> > >>> > >>>>> many extensions are made. It's even challenging to find where >>> the extension >>> > >>> > >>>>> is needed! It would certainly be better if Calcite could >>> magically support >>> > >>> > >>>>> HiveQL just by turning on a flag, such as the one for MYSQL_5. I >>> can also >>> > >>> > >>>>> see that this could mean a lot of work on Calcite. >>> Nevertheless, I will >>> > >>> > >>>>> bring up the discussion over there and see what their >>> community thinks. >>> > >>>>> >>> > >>>>> Would you mind sharing more info about the DDL proposal that you >>> > >>>>> mentioned? We can certainly collaborate on this.
>>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:Shuyi Chen <suez1...@gmail.com> >>> > >>>>> Sent at:2018 Oct 14 (Sun) 08:30 >>> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com> >>> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske < >>> > fhue...@gmail.com>; >>> > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Welcome to the community and thanks for the great proposal, >>> Xuefu! I >>> > >>> > >>>>> think the proposal can be divided into two stages: making Flink >>> support >>> > >>> > >>>>> Hive features, and making Hive work with Flink. I agree with >>> Timo on >>> > >>> > >>>>> starting with a smaller scope, so we can make progress faster. >>> As for [6], >>> > >>> > >>>>> a proposal for DDL is already in progress, and will come after >>> the unified >>> > >>> > >>>>> SQL connector API is done. For supporting Hive syntax, we might >>> need to >>> > >>>>> work with the Calcite community, and a recent effort called >>> Babel ( >>> > >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite >>> might >>> > >>>>> help here. >>> > >>>>> >>> > >>>>> Thanks >>> > >>>>> Shuyi >>> > >>>>> >>> > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu < >>> > xuef...@alibaba-inc.com> >>> > >>>>> wrote: >>> > >>>>> Hi Fabian/Vino, >>> > >>>>> >>> > >>> > >>>>> Thank you very much for your encouragement and inquiry. Sorry that >>> I didn't >>> > >>> > >>>>> see Fabian's email until I read Vino's response just now. >>> (Somehow Fabian's >>> > >>>>> went to the spam folder.) >>> > >>>>> >>> > >>> > >>>>> My proposal contains long-term and short-term goals. >>> Nevertheless, the >>> > >>>>> effort will focus on the following areas, including Fabian's >>> list: >>> > >>>>> >>> > >>>>> 1.
Hive metastore connectivity - This covers both read/write >>> access, >>> > >>> > >>>>> which means Flink can make full use of Hive's metastore as its >>> catalog (at >>> > >>>>> least for batch, but this can be extended for streaming as well). >>> > >>> > >>>>> 2. Metadata compatibility - Objects (databases, tables, >>> partitions, etc) >>> > >>> > >>>>> created by Hive can be understood by Flink, and the reverse >>> direction is >>> > >>>>> true also. >>> > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive >>> can be >>> > >>>>> consumed by Flink and vice versa. >>> > >>> > >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either >>> provides >>> > >>>>> its own implementation or makes Hive's implementation work in >>> Flink. >>> > >>>>> Further, for user-created UDFs in Hive, Flink SQL should >>> provide a >>> > >>> > >>>>> mechanism allowing users to import them into Flink without any >>> code change >>> > >>>>> required. >>> > >>>>> 5. Data types - Flink SQL should support all data types that >>> are >>> > >>>>> available in Hive. >>> > >>>>> 6. SQL language - Flink SQL should support the SQL standard (such as >>> > >>> > >>>>> SQL:2003) with extensions to support Hive's syntax and language >>> features, >>> > >>>>> around DDL, DML, and SELECT queries. >>> > >>> > >>>>> 7. SQL CLI - This is currently being developed in Flink but more >>> effort is >>> > >>>>> needed. >>> > >>> > >>>>> 8. Server - Provide a server that's compatible with Hive's >>> HiveServer2 >>> > >>> > >>>>> in its Thrift APIs, such that HiveServer2 users can reuse their >>> existing clients >>> > >>>>> (such as Beeline) but connect to Flink's Thrift server instead. >>> > >>> > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC >>> drivers for >>> > >>>>> other applications to use to connect to its Thrift server. >>> > >>>>> 10. Support other user customizations in Hive, such as Hive SerDes, >>> > >>>>> storage handlers, etc. >>> > >>> > >>>>> 11.
Better task failure tolerance and task scheduling at Flink >>> runtime. >>> > >>>>> >>> > >>>>> As you can see, achieving all of those requires significant effort >>> > >>> > >>>>> across all layers of Flink. However, a short-term goal could >>> include only >>> > >>> > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller >>> scope (such as >>> > >>>>> #3, #6). >>> > >>>>> >>> > >>> > >>>>> Please share your further thoughts. If we generally agree that >>> this is >>> > >>> > >>>>> the right direction, I could come up with a formal proposal >>> quickly and >>> > >>>>> then we can follow up with broader discussions. >>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:vino yang <yanghua1...@gmail.com> >>> > >>>>> Sent at:2018 Oct 11 (Thu) 09:45 >>> > >>>>> Recipient:Fabian Hueske <fhue...@gmail.com> >>> > >>>>> Cc:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com >>> > >; user < >>> > >>>>> u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Hi Xuefu, >>> > >>>>> >>> > >>> > >>>>> I appreciate this proposal, and like Fabian, I think it would be >>> better if you >>> > >>>>> could give more details of the plan. >>> > >>>>> >>> > >>>>> Thanks, vino. >>> > >>>>> >>> > >>>>> Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM: >>> > >>>>> Hi Xuefu, >>> > >>>>> >>> > >>> > >>>>> Welcome to the Flink community and thanks for starting this >>> discussion! >>> > >>>>> Better Hive integration would be really great! >>> > >>>>> Can you go into details of what you are proposing? I can think >>> of a >>> > >>>>> couple of ways to improve Flink in that regard: >>> > >>>>> >>> > >>>>> * Support for Hive UDFs >>> > >>>>> * Support for the Hive metadata catalog >>> > >>>>> * Support for HiveQL syntax >>> > >>>>> * ???
>>> > >>>>> >>> > >>>>> Best, Fabian >>> > >>>>> >>> > >>>>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu < >>> > >>>>> xuef...@alibaba-inc.com> wrote: >>> > >>>>> Hi all, >>> > >>>>> >>> > >>>>> Along with the community's effort, inside Alibaba we have >>> explored >>> > >>> > >>>>> Flink's potential as an execution engine not just for stream >>> processing but >>> > >>>>> also for batch processing. We are encouraged by our findings >>> and have >>> > >>> > >>>>> initiated our effort to make Flink's SQL capabilities >>> full-fledged. When >>> > >>> > >>>>> comparing what's available in Flink to the offerings from >>> competitive data >>> > >>> > >>>>> processing engines, we identified a major gap in Flink: good >>> integration >>> > >>> > >>>>> with the Hive ecosystem. This is crucial to the success of Flink >>> SQL and batch >>> > >>> > >>>>> processing due to the well-established data ecosystem around Hive. >>> Therefore, we have >>> > >>> > >>>>> done some initial work along this direction, but there is still >>> a lot of >>> > >>>>> effort needed. >>> > >>>>> >>> > >>>>> We have two strategies in mind. The first one is to make Flink >>> SQL >>> > >>> > >>>>> full-fledged and well-integrated with the Hive ecosystem. This is a >>> similar >>> > >>> > >>>>> approach to what Spark SQL adopted. The second strategy is to >>> make Hive >>> > >>> > >>>>> itself work with Flink, similar to the proposal in [1]. Each >>> approach bears >>> > >>> > >>>>> its pros and cons, but they don’t need to be mutually exclusive, >>> with each >>> > >>>>> targeting different users and use cases. We believe that >>> both will >>> > >>>>> promote a much greater adoption of Flink beyond stream >>> processing. >>> > >>>>> >>> > >>>>> We have been focused on the first approach and would like to >>> showcase >>> > >>> > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we >>> have also >>> > >>>>> planned to start strategy #2 as the follow-up effort.
>>> > >>>>> >>> > >>> > >>>>> I'm completely new to Flink (with a short bio [2] below), >>> though many >>> > >>> > >>>>> of my colleagues here at Alibaba are long-time contributors. >>> Nevertheless, >>> > >>> > >>>>> I'd like to share our thoughts and invite your early feedback. >>> At the same >>> > >>> > >>>>> time, I am working on a detailed proposal on Flink SQL's >>> integration with >>> > >>>>> the Hive ecosystem, which will also be shared when ready. >>> > >>>>> >>> > >>>>> While the ideas are simple, each approach will demand >>> significant >>> > >>> > >>>>> effort, more than what we can afford. Thus, the input and >>> contributions >>> > >>>>> from the community are greatly welcomed and appreciated. >>> > >>>>> >>> > >>>>> Regards, >>> > >>>>> >>> > >>>>> >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> References: >>> > >>>>> >>> > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712 >>> > >>> > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked or >>> is working on >>> > >>>>> many projects under the Apache Foundation, of which he is also an >>> honored >>> > >>> > >>>>> member. About 10 years ago he worked in the Hadoop team at >>> Yahoo when the >>> > >>> > >>>>> projects had just gotten started. Later he worked at Cloudera, >>> initiating and >>> > >>> > >>>>> leading the development of the Hive on Spark project in the >>> communities and >>> > >>> > >>>>> across many organizations. Prior to joining Alibaba, he worked >>> at Uber, >>> > >>> > >>>>> where he promoted Hive on Spark to all of Uber's SQL-on-Hadoop >>> workloads and >>> > >>>>> significantly improved Uber's cluster efficiency. >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> -- >>> > >>> > >>>>> "So you have to trust that the dots will somehow connect in >>> your future." >>> > >>>>> >>> > >>>>> >>> > >>>>> -- >>> > >>> > >>>>> "So you have to trust that the dots will somehow connect in >>> your future." >>> > >>>>> >>> > >>> > >>> >>>