Thank you, Xuefu and Timo, for putting together the FLIP! I like that both its scope and implementation plan are clear. I look forward to feedback from the group.
I also added a few more complementary details in the doc. Thanks, Bowen On Mon, Jan 7, 2019 at 8:37 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote: > Thanks, Timo! > > I have started putting the content from the google doc into FLIP-30 [1]. > However, please still keep the discussion along this thread. > > Thanks, > Xuefu > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs > > > ------------------------------------------------------------------ > From:Timo Walther <twal...@apache.org> > Sent At:2019 Jan. 7 (Mon.) 05:59 > To:dev <dev@flink.apache.org> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem > > Hi everyone, > > Xuefu and I had multiple iterations over the catalog design document > [1]. I believe that it is in good shape now to be converted into a FLIP. > Maybe we need a bit more explanation in some places but the general > design is ready now. > > The design document covers the following changes: > - Unify the external catalog interface and Flink's internal catalog in > TableEnvironment > - Clearly define a hierarchy of reference objects, namely: > "catalog.database.table" > - Enable a tight integration with Hive + Hive data connectors as well as > a broad integration with existing TableFactories and the discovery mechanism > - Make the catalog interfaces more feature-complete by adding views and > functions > > If you have any further feedback, it would be great to give it now > before we convert the document into a FLIP. > > Thanks, > Timo > > [1] > > https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit# > > > > Am 07.01.19 um 13:51 schrieb Timo Walther: > > Hi Eron, > > > > thank you very much for the contributions. I merged the first little > > bug fixes. For the remaining PRs I think we can review and merge them > > soon. As you said, the code is agnostic to the details of the > > ExternalCatalog interface and I don't expect bigger merge conflicts in > > the near future. 
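The "catalog.database.table" hierarchy and the unified catalog interface that Timo lists above can be sketched in plain Java. This is only an illustrative sketch under assumed names (ObjectPath, ReadableCatalog, InMemoryCatalog), not the actual FLIP-30 interfaces:

```java
import java.util.*;

// Hypothetical sketch of the "catalog.database.table" hierarchy;
// the names below are illustrative, not the actual FLIP-30 API.
final class ObjectPath {
    final String database;
    final String table;
    ObjectPath(String database, String table) {
        this.database = database;
        this.table = table;
    }
    @Override public String toString() { return database + "." + table; }
}

interface ReadableCatalog {
    List<String> listDatabases();
    List<String> listTables(String database);
    String getTable(ObjectPath path); // a table descriptor, simplified to String here
}

// Minimal in-memory implementation for demonstration.
class InMemoryCatalog implements ReadableCatalog {
    private final Map<String, Map<String, String>> databases = new HashMap<>();

    void createTable(String database, String table, String descriptor) {
        databases.computeIfAbsent(database, d -> new HashMap<>()).put(table, descriptor);
    }
    @Override public List<String> listDatabases() { return new ArrayList<>(databases.keySet()); }
    @Override public List<String> listTables(String database) {
        return new ArrayList<>(databases.getOrDefault(database, Map.of()).keySet());
    }
    @Override public String getTable(ObjectPath path) {
        return databases.getOrDefault(path.database, Map.of()).get(path.table);
    }
}

public class CatalogSketch {
    public static void main(String[] args) {
        InMemoryCatalog hive = new InMemoryCatalog();
        hive.createTable("default", "taxi_rides", "kafka-source");
        // A fully qualified reference would be hive.default.taxi_rides
        System.out.println(hive.getTable(new ObjectPath("default", "taxi_rides")));
    }
}
```

The point of the sketch is that a catalog registered in TableEnvironment contributes the first path segment, so both Flink's internal catalog and external ones (e.g. Hive) resolve through the same interface.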
> > > > However, exposing the current external catalog interfaces to SQL > > Client users would make it even more difficult to change the > > interfaces in the future. So maybe I would first wait until the > > general catalog discussion is over and the FLIP has been created. This > > should happen shortly. > > > > We should definitely coordinate the efforts better in the future to > > avoid duplicate work. > > > > Thanks, > > Timo > > > > > > Am 07.01.19 um 00:24 schrieb Eron Wright: > >> Thanks Timo for merging a couple of the PRs. Are you also able to > >> review the others that I mentioned? Xuefu I would like to incorporate > >> your feedback too. > >> > >> Check out this short demonstration of using a catalog in SQL Client: > >> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo > >> > >> Thanks again! > >> > >> On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <eronwri...@gmail.com > >> <mailto:eronwri...@gmail.com>> wrote: > >> > >> Would a couple folks raise their hand to make a review pass thru > >> the 6 PRs listed above? It is a lovely stack of PRs that is 'all > >> green' at the moment. I would be happy to open follow-on PRs to > >> rapidly align with other efforts. > >> > >> Note that the code is agnostic to the details of the > >> ExternalCatalog interface; the code would not be obsolete if/when > >> the catalog interface is enhanced as per the design doc. > >> > >> > >> > >> On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <eronwri...@gmail.com > >> <mailto:eronwri...@gmail.com>> wrote: > >> > >> I propose that the community review and merge the PRs that I > >> posted, and then evolve the design thru 1.8 and beyond. I > >> think having a basic infrastructure in place now will > >> accelerate the effort, do you agree? > >> > >> Thanks again! > >> > >> On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu > >> <xuef...@alibaba-inc.com <mailto:xuef...@alibaba-inc.com>> > >> wrote: > >> > >> Hi Eron, > >> > >> Happy New Year! 
> >> > >> Thank you very much for your contribution, especially > >> during the holidays. While I'm encouraged by your work, I'd > >> also like to share my thoughts on how to move forward. > >> > >> First, please note that the design discussion is still > >> being finalized, and we expect some moderate changes, > >> especially around TableFactories. Another pending change > >> is our decision to shy away from Scala, which will > >> impact our work. > >> > >> Secondly, while your work seems to be about plugging > >> catalog definitions into the execution environment, which > >> is less impacted by the TableFactory change, I did notice some > >> duplication between your work and ours. This is no big deal, > >> but going forward, we should probably communicate better > >> on work assignments so as to avoid any > >> possible duplication of work. On the other hand, I think > >> some of your work is interesting and valuable for > >> inclusion once we finalize the overall design. > >> > >> Thus, please continue your research and experiments, and let > >> us know when you start working on anything so we can > >> better coordinate. > >> > >> Thanks again for your interest and contributions. > >> > >> Thanks, > >> Xuefu > >> > >> > >> > >> ------------------------------------------------------------------ > >> From:Eron Wright <eronwri...@gmail.com > >> <mailto:eronwri...@gmail.com>> > >> Sent At:2019 Jan. 1 (Tue.) 
18:39 > >> To:dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> Cc:Xiaowei Jiang <xiaow...@gmail.com > >> <mailto:xiaow...@gmail.com>>; twalthr > >> <twal...@apache.org <mailto:twal...@apache.org>>; > >> piotr <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>>; Fabian Hueske > >> <fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> suez1224 <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>>; Bowen Li > >> <bowenl...@gmail.com <mailto:bowenl...@gmail.com>> > >> Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > >> Hi folks, there's clearly some incremental steps to be > >> taken to introduce catalog support to SQL Client, > >> complementary to what is proposed in the Flink-Hive > >> Metastore design doc. I was quietly working on this > >> over the holidays. I posted some new sub-tasks, PRs, > >> and sample code to FLINK-10744. > >> > >> What inspired me to get involved is that the catalog > >> interface seems like a great way to encapsulate a > >> 'library' of Flink tables and functions. For example, > >> the NYC Taxi dataset (TaxiRides, TaxiFares, various > >> UDFs) may be nicely encapsulated as a catalog > >> (TaxiData). Such a library should be fully consumable > >> in SQL Client. > >> > >> I implemented the above. Some highlights: > >> 1. A fully-worked example of using the Taxi dataset in > >> SQL Client via an environment file. 
> >> - an ASCII video showing the SQL Client in action: > >> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo > >> > >> - the corresponding environment file (will be even > >> more concise once 'FLINK-10696 Catalog UDFs' is merged): > >> _ > https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml_ > >> > >> - the typed API for standalone table applications: > >> _ > https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50_ > >> > >> 2. Implementation of the core catalog descriptor and > >> factory. I realize that some renames may later occur > >> as per the design doc, and would be happy to do that > >> as a follow-up. > >> https://github.com/apache/flink/pull/7390 > >> > >> 3. Implementation of a connect-style API on > >> TableEnvironment to use catalog descriptor. > >> https://github.com/apache/flink/pull/7392 > >> > >> 4. Integration into SQL-Client's environment file: > >> https://github.com/apache/flink/pull/7393 > >> > >> I realize that the overall Hive integration is still > >> evolving, but I believe that these PRs are a good > >> stepping stone. Here's the list (in bottom-up order): > >> - https://github.com/apache/flink/pull/7386 > >> - https://github.com/apache/flink/pull/7388 > >> - https://github.com/apache/flink/pull/7389 > >> - https://github.com/apache/flink/pull/7390 > >> - https://github.com/apache/flink/pull/7392 > >> - https://github.com/apache/flink/pull/7393 > >> > >> Thanks and enjoy 2019! > >> Eron W > >> > >> > >> On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> wrote: > >> Hi Xiaowei, > >> > >> Thanks for bringing up the question. 
In the current > >> design, the properties for meta objects are meant to > >> cover anything that's specific to a particular catalog > >> and agnostic to Flink. Anything that is common (such > >> as schema for tables, query text for views, and UDF > >> classname) is abstracted as members of the respective > >> classes. However, this is still in discussion, and > >> Timo and I will go over this and provide an update. > >> > >> Please note that UDF is a little more involved than > >> what the current design doc shows. I'm still refining > >> this part. > >> > >> Thanks, > >> Xuefu > >> > >> > >> ------------------------------------------------------------------ > >> Sender:Xiaowei Jiang <xiaow...@gmail.com > >> <mailto:xiaow...@gmail.com>> > >> Sent at:2018 Nov 18 (Sun) 15:17 > >> Recipient:dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>> > >> Cc:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>>; twalthr > >> <twal...@apache.org <mailto:twal...@apache.org>>; > >> piotr <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>>; Fabian Hueske > >> <fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> suez1224 <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > >> Thanks Xuefu for the detailed design doc! One question > >> on the properties associated with the catalog objects. > >> Are we going to leave them completely free-form, or are > >> we going to set some standard for that? I think that > >> the answer may depend on whether we want to explore catalog > >> specific optimization opportunities. In any case, I > >> think that it might be helpful to standardize as much > >> as possible into strongly typed classes and leave > >> these properties for catalog-specific things. But I > >> think that we can do it in steps. 
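The split Xuefu describes above (common attributes as typed members, catalog-specific settings in free-form properties) could look roughly like this in Java. The class name CatalogTable and its fields are assumptions for the example, not the actual design-doc classes:

```java
import java.util.*;

// Illustrative sketch: common attributes (schema, comment) are typed members,
// while catalog-specific settings live in a free-form properties map.
// All names here are assumptions for the example.
class CatalogTable {
    private final LinkedHashMap<String, String> schema; // column name -> type, common to all catalogs
    private final String comment;                       // common to all catalogs
    private final Map<String, String> properties;       // catalog-specific, agnostic to Flink

    CatalogTable(LinkedHashMap<String, String> schema, String comment,
                 Map<String, String> properties) {
        this.schema = schema;
        this.comment = comment;
        this.properties = properties;
    }

    LinkedHashMap<String, String> getSchema() { return schema; }
    String getComment() { return comment; }
    Map<String, String> getProperties() { return properties; }
}

public class PropertiesSketch {
    public static void main(String[] args) {
        LinkedHashMap<String, String> schema = new LinkedHashMap<>();
        schema.put("ride_id", "BIGINT");
        schema.put("fare", "DOUBLE");
        // Catalog-specific details (e.g. a Hive SerDe class) stay in the
        // property map rather than becoming typed fields.
        Map<String, String> props =
                Map.of("hive.storage.serde", "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe");
        CatalogTable t = new CatalogTable(schema, "NYC taxi fares", props);
        System.out.println(t.getSchema().keySet());
    }
}
```

This mirrors Xiaowei's suggestion: strongly typed classes for what every catalog shares, a property bag for the rest, adoptable in steps.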
> >> Xiaowei > >> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li > >> <bowenl...@gmail.com <mailto:bowenl...@gmail.com>> > >> wrote: > >> Thanks for keeping on improving the overall design, > >> Xuefu! It looks quite > >> good to me now. > >> > >> It would be nice if the cc-ed Flink committers could help > >> review and confirm! > >> > >> > >> > >> One minor suggestion: Since the last section of the > >> design doc already touches > >> some new SQL statements, shall we add another section > >> in our doc and > >> formalize the new SQL statements in SQL Client and > >> TableEnvironment that > >> will come along naturally with our design? Here > >> are some that the > >> design doc mentioned and some that I came up with: > >> > >> To be added: > >> > >> - USE <catalog> - set default catalog > >> - USE <catalog.schema> - set default schema > >> - SHOW CATALOGS - show all registered catalogs > >> - SHOW SCHEMAS [FROM catalog] - list schemas in > >> the current default > >> catalog or the specified catalog > >> - DESCRIBE VIEW view - show the view's definition > >> in CatalogView > >> - SHOW VIEWS [FROM schema/catalog.schema] - show > >> views from current or a > >> specified schema. > >> > >> (DDLs that can be addressed by either our design > >> or Shuyi's DDL design) > >> > >> - CREATE/DROP/ALTER SCHEMA schema > >> - CREATE/DROP/ALTER CATALOG catalog > >> > >> To be modified: > >> > >> - SHOW TABLES [FROM schema/catalog.schema] - show > >> tables from current or > >> a specified schema. Add 'from schema' to the existing > >> 'SHOW TABLES' statement > >> - SHOW FUNCTIONS [FROM schema/catalog.schema] - > >> show functions from > >> current or a specified schema. Add 'from schema' > >> to the existing 'SHOW FUNCTIONS' > >> statement > >> > >> > >> Thanks, Bowen > >> > >> > >> > >> On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> wrote: > >> > >> > Thanks, Bowen, for catching the error. 
I have > >> granted comment permission > >> > with the link. > >> > > >> > I also updated the doc with the latest class > >> definitions. Everyone is > >> > encouraged to review and comment. > >> > > >> > Thanks, > >> > Xuefu > >> > > >> > > >> ------------------------------------------------------------------ > >> > Sender:Bowen Li <bowenl...@gmail.com > >> <mailto:bowenl...@gmail.com>> > >> > Sent at:2018 Nov 14 (Wed) 06:44 > >> > Recipient:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > Cc:piotr <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>>; dev > >> <dev@flink.apache.org <mailto:dev@flink.apache.org>>; > >> Shuyi > >> > Chen <suez1...@gmail.com <mailto:suez1...@gmail.com > >> > >> > Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > > >> > Hi Xuefu, > >> > > >> > Currently the new design doc > >> > > >> < > https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit > > > >> > is on “view only" mode, and people cannot leave > >> comments. Can you please > >> > change it to "can comment" or "can edit" mode? > >> > > >> > Thanks, Bowen > >> > > >> > > >> > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > wrote: > >> > Hi Piotr > >> > > >> > I have extracted the API portion of the design and > >> the google doc is here > >> > > >> < > https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing > >. > >> > Please review and provide your feedback. 
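Bowen's proposed USE <catalog> / USE <catalog.schema> statements above imply a resolution rule for partially qualified table names. A rough sketch of that rule, with all class and default names assumed purely for illustration:

```java
// Hypothetical sketch of how USE <catalog> / USE <catalog.schema> could drive
// resolution of partially qualified table references; not actual Flink code.
public class NameResolver {
    private String currentCatalog = "builtin";
    private String currentSchema = "default";

    // USE catalog  or  USE catalog.schema
    void use(String target) {
        String[] parts = target.split("\\.");
        currentCatalog = parts[0];
        currentSchema = parts.length > 1 ? parts[1] : currentSchema;
    }

    // Expand "t", "schema.t", or "catalog.schema.t" to a fully qualified name.
    String resolve(String reference) {
        String[] parts = reference.split("\\.");
        switch (parts.length) {
            case 1:  return currentCatalog + "." + currentSchema + "." + parts[0];
            case 2:  return currentCatalog + "." + parts[0] + "." + parts[1];
            default: return reference; // already fully qualified
        }
    }

    public static void main(String[] args) {
        NameResolver r = new NameResolver();
        r.use("hive.sales");
        System.out.println(r.resolve("orders"));        // hive.sales.orders
        System.out.println(r.resolve("hr.employees"));  // hive.hr.employees
    }
}
```

Statements like SHOW TABLES [FROM schema/catalog.schema] would then list against the same current defaults unless a FROM clause overrides them.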
> > > Thanks, > >> > Xuefu > >> > > >> > > >> ------------------------------------------------------------------ > >> > Sender:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > Sent at:2018 Nov 12 (Mon) 12:43 > >> > Recipient:Piotr Nowojski <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>>; dev < > >> > dev@flink.apache.org <mailto:dev@flink.apache.org>> > >> > Cc:Bowen Li <bowenl...@gmail.com > >> <mailto:bowenl...@gmail.com>>; Shuyi Chen > >> <suez1...@gmail.com <mailto:suez1...@gmail.com>> > >> > Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > > >> > Hi Piotr, > >> > > >> > That sounds good to me. Let's close all the open > >> questions (there are a > >> > couple of them) in the Google doc and I should be > >> able to quickly split > >> > it into the three proposals as you suggested. > >> > > >> > Thanks, > >> > Xuefu > >> > > >> > > >> ------------------------------------------------------------------ > >> > Sender:Piotr Nowojski <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>> > >> > Sent at:2018 Nov 9 (Fri) 22:46 > >> > Recipient:dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > Cc:Bowen Li <bowenl...@gmail.com > >> <mailto:bowenl...@gmail.com>>; Shuyi Chen > >> <suez1...@gmail.com <mailto:suez1...@gmail.com>> > >> > Subject:Re: [DISCUSS] Integrate Flink SQL well with > >> Hive ecosystem > >> > > >> > Hi, > >> > > >> > > >> > Yes, it seems like the best solution. Maybe someone > >> else can also suggest whether we can split it further? > >> Maybe interface changes in one doc, reading > >> from the Hive metastore in another, and finally storing our > >> meta information in the Hive metastore in a third? 
> > Piotrek > >> > > > >> > > > On 9 Nov 2018, at 01:44, Zhang, Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> wrote: > >> > > > >> > > Hi Piotr, > >> > > > >> > > That seems to be a good idea! > >> > > > >> > > >> > > Since the google doc for the design is currently > >> under extensive review, I will leave it as it is for > >> now. However, I'll convert it to two different FLIPs > >> when the time comes. > >> > > > >> > > How does that sound to you? > >> > > > >> > > Thanks, > >> > > Xuefu > >> > > > >> > > > >> ------------------------------------------------------------------ > >> > > Sender:Piotr Nowojski <pi...@data-artisans.com > >> <mailto:pi...@data-artisans.com>> > >> > > Sent at:2018 Nov 9 (Fri) 02:31 > >> > > Recipient:dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>> > >> > > Cc:Bowen Li <bowenl...@gmail.com > >> <mailto:bowenl...@gmail.com>>; Xuefu > >> <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com> > >> > >; Shuyi Chen <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> > > Subject:Re: [DISCUSS] Integrate Flink SQL well > >> with Hive ecosystem > >> > > > >> > > Hi, > >> > > > >> > > >> > > Maybe we should split this topic (and the design > >> doc) into a couple of smaller ones, hopefully > >> independent. The questions that you have asked Fabian, > >> for example, have very little to do with reading > >> metadata from the Hive Metastore. > >> > > > >> > > Piotrek > >> > > > >> > >> On 7 Nov 2018, at 14:27, Fabian Hueske > >> <fhue...@gmail.com <mailto:fhue...@gmail.com>> wrote: > >> > >> > >> > >> Hi Xuefu and all, > >> > >> > >> > >> Thanks for sharing this design document! > >> > > >> > >> I'm very much in favor of restructuring / > >> reworking the catalog handling in > >> > >> Flink SQL as outlined in the document. > >> > > >> > >> Most changes described in the design document > >> seem to be rather general and > >> > >> not specifically related to the Hive integration. 
> >> > >> > >> > > >> > >> IMO, there are some aspects, especially those at > >> the boundary of Hive and > >> > >> Flink, that need a bit more discussion. For > >> example: > >> > >> > >> > >> * What does it take to make Flink schema > >> compatible with Hive schema? > >> > >> * How will Flink tables (descriptors) be stored > >> in HMS? > >> > >> * How do both Hive catalogs differ? Could they > >> be integrated into a > >> > >> single one? When to use which one? > >> > > >> > >> * What meta information is provided by HMS? What > >> of this can be leveraged > >> > >> by Flink? > >> > >> > >> > >> Thank you, > >> > >> Fabian > >> > >> > >> > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen > >> Li <bowenl...@gmail.com <mailto:bowenl...@gmail.com> > >> > >: > >> > >> > >> > >>> After taking a look at how other discussion > >> threads work, I think it's > >> > >>> actually fine to just keep our discussion here. > >> It's up to you, Xuefu. > >> > >>> > >> > >>> The google doc LGTM. I left some minor comments. > >> > >>> > >> > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li > >> <bowenl...@gmail.com <mailto:bowenl...@gmail.com>> > >> wrote: > >> > >>> > >> > >>>> Hi all, > >> > >>>> > >> > >>>> As Xuefu has published the design doc on > >> google, I agree with Shuyi's > >> > > >> > >>>> suggestion that we probably should start a new > >> email thread like "[DISCUSS] > >> > > >> > >>>> ... Hive integration design ..." on only the dev > >> mailing list for community > >> > >>>> devs to review. The current thread goes to > >> both the dev and user lists. > >> > >>>> > >> > > >> > >>>> This email thread is more about validating the > >> general idea and direction > >> > > >> > >>>> with the community, and it's been pretty long > >> and crowded so far. Since > >> > > >> > >>>> everyone is in favor of the idea, we can move > >> forward with another thread to > >> > >>>> discuss and finalize the design. 
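Fabian's first question above (making Flink schema compatible with Hive schema) largely reduces to a type mapping. A first-cut sketch of such a mapping, using type-name strings purely for illustration; a real implementation would map Hive's TypeInfo to Flink's TypeInformation rather than strings:

```java
import java.util.*;

// Illustrative first-cut mapping from Hive type names to Flink SQL type names.
// The string-to-string form is a simplification for the example; a real
// implementation would bridge Hive TypeInfo and Flink TypeInformation.
public class HiveTypeMapping {
    static final Map<String, String> HIVE_TO_FLINK = Map.of(
            "tinyint", "TINYINT",
            "smallint", "SMALLINT",
            "int", "INT",
            "bigint", "BIGINT",
            "float", "FLOAT",
            "double", "DOUBLE",
            "boolean", "BOOLEAN",
            "string", "VARCHAR",
            "timestamp", "TIMESTAMP",
            "binary", "VARBINARY");

    static String toFlinkType(String hiveType) {
        String flink = HIVE_TO_FLINK.get(hiveType.toLowerCase(Locale.ROOT));
        if (flink == null) {
            // Complex types (array, map, struct) and parameterized types
            // (decimal, char, varchar) need dedicated handling.
            throw new IllegalArgumentException("Unsupported Hive type: " + hiveType);
        }
        return flink;
    }

    public static void main(String[] args) {
        System.out.println(toFlinkType("string")); // VARCHAR
    }
}
```

Even this toy version shows where the discussion points live: the easy primitives map one-to-one, while parameterized and nested types are exactly the "boundary of Hive and Flink" cases Fabian calls out.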
> >> > >>>> > >> > >>>> Thanks, > >> > >>>> Bowen > >> > >>>> > >> > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu < > >> > xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>> wrote: > >> > >>>> > >> > >>>>> Hi Shuyi, > >> > >>>>> > >> > > >> > >>>>> Good idea. Actually the PDF was converted > >> from a google doc. Here is its > >> > >>>>> link: > >> > >>>>> > >> > >>>>> > >> > > >> > https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing > >> > >>>>> Once we reach an agreement, I can convert it > >> to a FLIP. > >> > >>>>> > >> > >>>>> Thanks, > >> > >>>>> Xuefu > >> > >>>>> > >> > >>>>> > >> > >>>>> > >> > >>>>> > >> ------------------------------------------------------------------ > >> > >>>>> Sender:Shuyi Chen <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> > >>>>> Sent at:2018 Nov 1 (Thu) 02:47 > >> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> Cc:vino yang <yanghua1...@gmail.com > >> <mailto:yanghua1...@gmail.com>>; Fabian Hueske < > >> > fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> > >>>>> dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; user > >> <u...@flink.apache.org <mailto:u...@flink.apache.org>> > >> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL > >> well with Hive ecosystem > >> > >>>>> > >> > >>>>> Hi Xuefu, > >> > >>>>> > >> > > >> > >>>>> Thanks a lot for driving this big effort. I > >> would suggest converting your > >> > > >> > >>>>> proposal and design doc into a google doc, > >> and share it on the dev mailing > >> > > >> > >>>>> list for the community to review and comment > >> with a title like "[DISCUSS] ... > >> > > >> > >>>>> Hive integration design ..." . Once > >> approved, we can document it as a FLIP > >> > > >> > >>>>> (Flink Improvement Proposal), and use JIRAs > >> to track the implementations. > >> > >>>>> What do you think? 
> >> > >>>>> > >> > >>>>> Shuyi > >> > >>>>> > >> > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu < > >> > xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> wrote: > >> > >>>>> Hi all, > >> > >>>>> > >> > >>>>> I have also shared a design doc on Hive > >> metastore integration that is > >> > > >> > >>>>> attached here and also to FLINK-10556[1]. > >> Please kindly review and share > >> > >>>>> your feedback. > >> > >>>>> > >> > >>>>> > >> > >>>>> Thanks, > >> > >>>>> Xuefu > >> > >>>>> > >> > >>>>> [1] > >> https://issues.apache.org/jira/browse/FLINK-10556 > >> > >>>>> > >> ------------------------------------------------------------------ > >> > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> Sent at:2018 Oct 25 (Thu) 01:08 > >> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>>; Shuyi Chen < > >> > >>>>> suez1...@gmail.com <mailto:suez1...@gmail.com > >> > >> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com > >> <mailto:yanghua1...@gmail.com>>; Fabian Hueske < > >> > fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> > >>>>> dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; user > >> <u...@flink.apache.org <mailto:u...@flink.apache.org>> > >> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL > >> well with Hive ecosystem > >> > >>>>> > >> > >>>>> Hi all, > >> > >>>>> > >> > >>>>> To wrap up the discussion, I have attached a > >> PDF describing the > >> > > >> > >>>>> proposal, which is also attached to > >> FLINK-10556 [1]. Please feel free to > >> > >>>>> watch that JIRA to track the progress. > >> > >>>>> > >> > >>>>> Please also let me know if you have > >> additional comments or questions. 
> >> > >>>>> > >> > >>>>> Thanks, > >> > >>>>> Xuefu > >> > >>>>> > >> > >>>>> [1] > >> https://issues.apache.org/jira/browse/FLINK-10556 > >> > >>>>> > >> > >>>>> > >> > >>>>> > >> ------------------------------------------------------------------ > >> > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> Sent at:2018 Oct 16 (Tue) 03:40 > >> > >>>>> Recipient:Shuyi Chen <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com > >> <mailto:yanghua1...@gmail.com>>; Fabian Hueske < > >> > fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> > >>>>> dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; user > >> <u...@flink.apache.org <mailto:u...@flink.apache.org>> > >> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL > >> well with Hive ecosystem > >> > >>>>> > >> > >>>>> Hi Shuyi, > >> > >>>>> > >> > > >> > >>>>> Thank you for your input. Yes, I agree with > >> a phased approach and would like > >> > > >> > >>>>> to move forward fast. :) We did some work > >> internally on DDL utilizing the babel > >> > >>>>> parser in Calcite. While babel makes > >> Calcite's grammar extensible, at > >> > >>>>> first impression it still seems too > >> cumbersome for a project when too > >> > > >> > >>>>> many extensions are made. It's even > >> challenging to find where the extension > >> > > >> > >>>>> is needed! It would certainly be better if > >> Calcite could magically support > >> > > >> > >>>>> Hive QL by just turning on a flag, such as > >> that for MYSQL_5. I can also > >> > > >> > >>>>> see that this could mean a lot of work on > >> Calcite. Nevertheless, I will > >> > > >> > >>>>> bring up the discussion over there and see > >> what their community thinks. > >> > >>>>> > >> > >>>>> Would you mind sharing more info about the > >> proposal on DDL that you > >> > >>>>> mentioned? We can certainly collaborate on > >> this. 
> >> > >>>>> > >> > >>>>> Thanks, > >> > >>>>> Xuefu > >> > >>>>> > >> > >>>>> > >> ------------------------------------------------------------------ > >> > >>>>> Sender:Shuyi Chen <suez1...@gmail.com > >> <mailto:suez1...@gmail.com>> > >> > >>>>> Sent at:2018 Oct 14 (Sun) 08:30 > >> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com > >> <mailto:yanghua1...@gmail.com>>; Fabian Hueske < > >> > fhue...@gmail.com <mailto:fhue...@gmail.com>>; > >> > >>>>> dev <dev@flink.apache.org > >> <mailto:dev@flink.apache.org>>; user > >> <u...@flink.apache.org <mailto:u...@flink.apache.org>> > >> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL > >> well with Hive ecosystem > >> > >>>>> > >> > >>>>> Welcome to the community and thanks for the > >> great proposal, Xuefu! I > >> > > >> > >>>>> think the proposal can be divided into 2 > >> stages: making Flink support > >> > > >> > >>>>> Hive features, and making Hive work with > >> Flink. I agree with Timo on > >> > > >> > >>>>> starting with a smaller scope, so we can make > >> progress faster. As for [6], > >> > > >> > >>>>> a proposal for DDL is already in progress, > >> and will come after the unified > >> > > >> > >>>>> SQL connector API is done. For supporting > >> Hive syntax, we might need to > >> > >>>>> work with the Calcite community, and a recent > >> effort called babel ( > >> > >>>>> > >> https://issues.apache.org/jira/browse/CALCITE-2280) in > >> Calcite might > >> > >>>>> help here. > >> > >>>>> > >> > >>>>> Thanks > >> > >>>>> Shuyi > >> > >>>>> > >> > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu < > >> > xuef...@alibaba-inc.com > >> <mailto:xuef...@alibaba-inc.com>> > >> > >>>>> wrote: > >> > >>>>> Hi Fabian/Vino, > >> > >>>>> > >> > > >> > >>>>> Thank you very much for your encouragement > >> and inquiry. Sorry that I didn't > >> > > >> > >>>>> see Fabian's email until I read Vino's > >> response just now. 
(Somehow Fabian's > >> > >>>>> went to the spam folder.) > >> > >>>>> > >> > > >> > >>>>> My proposal contains long-term and > >> short-term goals. Nevertheless, the > >> > >>>>> effort will focus on the following areas, > >> including Fabian's list: > >> > >>>>> > >> > >>>>> 1. Hive metastore connectivity - This covers > >> both read/write access, > >> > > >> > >>>>> which means Flink can make full use of Hive's > >> metastore as its catalog (at > >> > >>>>> least for batch, but this can be extended to > >> streaming as well). > >> > > >> > >>>>> 2. Metadata compatibility - Objects > >> (databases, tables, partitions, etc.) > >> > > >> > >>>>> created by Hive can be understood by Flink, > >> and the reverse is > >> > >>>>> true as well. > >> > >>>>> 3. Data compatibility - Similar to #2, data > >> produced by Hive can be > >> > >>>>> consumed by Flink and vice versa. > >> > > >> > >>>>> 4. Support Hive UDFs - For all of Hive's native > >> UDFs, Flink either provides > >> > >>>>> its own implementation or makes Hive's > >> implementation work in Flink. > >> > >>>>> Further, for user-created UDFs in Hive, Flink > >> SQL should provide a > >> > > >> > >>>>> mechanism allowing users to import them into > >> Flink without any code change > >> > >>>>> required. > >> > >>>>> 5. Data types - Flink SQL should support all > >> data types that are > >> > >>>>> available in Hive. > >> > >>>>> 6. SQL Language - Flink SQL should support the > >> SQL standard (such as > >> > > >> > >>>>> SQL:2003) with extensions to support Hive's > >> syntax and language features, > >> > >>>>> around DDL, DML, and SELECT queries. > >> > > >> > >>>>> 7. SQL CLI - this is currently being developed in > >> Flink, but more effort is > >> > >>>>> needed. > >> > > >> > >>>>> 8. 
Server - provide a server that's > >> compatible with Hive's HiveServer2 > >> > > >> > >>>>> in its Thrift APIs, such that HiveServer2 users > >> can reuse their existing client > >> > >>>>> (such as beeline) but connect to Flink's > >> Thrift server instead. > >> > > >> > >>>>> 9. JDBC/ODBC drivers - Flink may provide its > >> own JDBC/ODBC drivers for > >> > >>>>> other applications to use to connect to its > >> Thrift server > >> > >>>>> 10. Support other user customizations in > >> Hive, such as Hive SerDes, > >> > >>>>> storage handlers, etc. > >> > > >> > >>>>> 11. Better task failure tolerance and task > >> scheduling at Flink runtime. > >> > >>>>> > >> > >>>>> As you can see, achieving all those requires > >> significant effort > >> > > >> > >>>>> across all layers in Flink. However, a > >> short-term goal could include only > >> > > >> > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or > >> start at a smaller scope (such as > >> > >>>>> #3, #6). > >> > >>>>> > >> > > >> > >>>>> Please share your further thoughts. If we > >> generally agree that this is > >> > > >> > >>>>> the right direction, I could come up with a > >> formal proposal quickly and > >> > >>>>> then we can follow up with broader discussions. 
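Point 4 in Xuefu's list (reusing existing Hive UDFs without code changes) can be approximated with a reflective adapter: simple Hive UDFs expose public evaluate(...) methods, so a wrapper can locate and invoke them. This is a hedged sketch with no Flink or Hive dependencies; the HiveUdfAdapter name and the toy LengthUdf are assumptions for the example, and real integration would bridge to Flink's ScalarFunction and Hive's UDF/GenericUDF classes:

```java
import java.lang.reflect.Method;

// Sketch of wrapping a Hive-style UDF (which exposes public evaluate(...)
// methods) behind a generic adapter, so users could import existing UDFs
// without code changes. All names here are illustrative.
public class HiveUdfAdapter {
    private final Object udf;

    public HiveUdfAdapter(Class<?> udfClass) {
        try {
            this.udf = udfClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate UDF " + udfClass, e);
        }
    }

    // Find an evaluate(...) method matching the argument count and invoke it.
    public Object eval(Object... args) {
        for (Method m : udf.getClass().getMethods()) {
            if (m.getName().equals("evaluate") && m.getParameterCount() == args.length) {
                try {
                    return m.invoke(udf, args);
                } catch (ReflectiveOperationException e) {
                    throw new RuntimeException("UDF invocation failed", e);
                }
            }
        }
        throw new IllegalArgumentException("No matching evaluate() on " + udf.getClass());
    }

    // Toy stand-in for a user's existing Hive UDF class.
    public static class LengthUdf {
        public Integer evaluate(String s) { return s == null ? null : s.length(); }
    }

    public static void main(String[] args) {
        HiveUdfAdapter adapter = new HiveUdfAdapter(LengthUdf.class);
        System.out.println(adapter.eval("flink")); // 5
    }
}
```

A production version would also need overload resolution by argument types and Hive's ObjectInspector handling, which is why the design doc treats UDFs as "a little more involved".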
Thanks,
Xuefu

------------------------------------------------------------------
Sender: vino yang <yanghua1...@gmail.com>
Sent at: 2018 Oct 11 (Thu) 09:45
Recipient: Fabian Hueske <fhue...@gmail.com>
Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <u...@flink.apache.org>
Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Appreciate this proposal, and like Fabian, I think it would be better if you could give more details of the plan.

Thanks, vino.

Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018, 5:27 PM:

Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great! Can you go into details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best,
Fabian

On Tue, Oct 9, 2018 at 7:22 PM, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:

Hi all,

Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing, due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.

We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote much greater adoption of Flink beyond stream processing.

We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as a follow-up effort.
I'm completely new to Flink (with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcome and appreciated.

Regards,
Xuefu

References:

[1] https://issues.apache.org/jira/browse/HIVE-10712
[2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo when those projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he promoted Hive on Spark for all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.

--
"So you have to trust that the dots will somehow connect in your future."
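[Editor's note] To make item 9 above concrete: the appeal of a Flink JDBC/ODBC driver is that client applications could stay on plain `java.sql` while connecting to Flink's thrift endpoint, and reusing the HiveServer2 `jdbc:hive2://` URL scheme is what would let existing clients such as beeline connect unchanged (item 8). A minimal sketch, under the assumption of a HiveServer2-style endpoint; the class and endpoint below are hypothetical, since no such Flink driver existed at the time of this thread:

```java
// Hypothetical sketch of item 9 (a Flink JDBC driver): applications would use
// plain java.sql against Flink's thrift endpoint. The hive2 URL scheme mirrors
// HiveServer2; this is an illustration of the idea, not a real Flink API.
public class FlinkJdbcSketch {

    // Build a HiveServer2-style JDBC URL. Reusing this scheme is what would
    // let existing clients such as beeline connect unchanged (item 8).
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        String url = jdbcUrl("localhost", 10000, "default");
        System.out.println(url); // prints jdbc:hive2://localhost:10000/default

        // With a compatible server actually running, a standard JDBC session
        // would look like this (not executed here):
        //
        // try (Connection conn = DriverManager.getConnection(url, "user", "");
        //      Statement stmt = conn.createStatement();
        //      ResultSet rs = stmt.executeQuery("SELECT 1")) {
        //     rs.next();
        // }
    }
}
```

A beeline user would point at the same hypothetical endpoint with `beeline -u jdbc:hive2://localhost:10000/default`, which is exactly the client reuse item 8 aims for.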