Thanks, Timo, for merging a couple of the PRs. Are you also able to review the others that I mentioned? Xuefu, I would like to incorporate your feedback too.
Check out this short demonstration of using a catalog in SQL Client: https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo Thanks again! On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <eronwri...@gmail.com> wrote: > Would a couple folks raise their hand to make a review pass thru the 6 PRs > listed above? It is a lovely stack of PRs that is 'all green' at the > moment. I would be happy to open follow-on PRs to rapidly align with > other efforts. > > Note that the code is agnostic to the details of the ExternalCatalog > interface; the code would not be obsolete if/when the catalog interface is > enhanced as per the design doc. > > > > On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <eronwri...@gmail.com> wrote: > >> I propose that the community review and merge the PRs that I posted, and >> then evolve the design thru 1.8 and beyond. I think having a basic >> infrastructure in place now will accelerate the effort, do you agree? >> >> Thanks again! >> >> On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu <xuef...@alibaba-inc.com> >> wrote: >> >>> Hi Eron, >>> >>> Happy New Year! >>> >>> Thank you very much for your contribution, especially during the >>> holidays. While I'm encouraged by your work, I'd also like to share my >>> thoughts on how to move forward. >>> >>> First, please note that the design discussion is still finalizing, and >>> we expect some moderate changes, especially around TableFactories. Another >>> pending change is our decision to shy away from Scala, which will impact >>> our work. >>> >>> Secondly, while your work seems to be about plugging catalog definitions >>> into the execution environment, which is less impacted by the TableFactory >>> change, I did notice some duplication between your work and ours. This is no big >>> deal, but going forward, we should probably have better communication on >>> work assignments so as to avoid any possible duplication of work.
On the >>> other hand, I think some of your work is interesting and valuable for >>> inclusion once we finalize the overall design. >>> >>> Thus, please continue your research and experiments and let us know when >>> you start working on anything so we can better coordinate. >>> >>> Thanks again for your interest and contributions. >>> >>> Thanks, >>> Xuefu >>> >>> >>> >>> ------------------------------------------------------------------ >>> From:Eron Wright <eronwri...@gmail.com> >>> Sent At:2019 Jan. 1 (Tue.) 18:39 >>> To:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com> >>> Cc:Xiaowei Jiang <xiaow...@gmail.com>; twalthr <twal...@apache.org>; >>> piotr <pi...@data-artisans.com>; Fabian Hueske <fhue...@gmail.com>; >>> suez1224 <suez1...@gmail.com>; Bowen Li <bowenl...@gmail.com> >>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> >>> Hi folks, there are clearly some incremental steps to be taken to >>> introduce catalog support to SQL Client, complementary to what is proposed >>> in the Flink-Hive Metastore design doc. I was quietly working on this over >>> the holidays. I posted some new sub-tasks, PRs, and sample code >>> to FLINK-10744. >>> >>> What inspired me to get involved is that the catalog interface seems >>> like a great way to encapsulate a 'library' of Flink tables and functions. >>> For example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may >>> be nicely encapsulated as a catalog (TaxiData). Such a library should be >>> fully consumable in SQL Client. >>> >>> I implemented the above. Some highlights: >>> >>> 1. A fully-worked example of using the Taxi dataset in SQL Client via an >>> environment file.
>>> - an ASCII video showing the SQL Client in action: >>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo >>> >>> - the corresponding environment file (will be even more concise once >>> 'FLINK-10696 Catalog UDFs' is merged): >>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml >>> >>> - the typed API for standalone table applications: >>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50 >>> >>> 2. Implementation of the core catalog descriptor and factory. I realize >>> that some renames may later occur as per the design doc, and would be happy >>> to do that as a follow-up. >>> https://github.com/apache/flink/pull/7390 >>> >>> 3. Implementation of a connect-style API on TableEnvironment to use the >>> catalog descriptor. >>> https://github.com/apache/flink/pull/7392 >>> >>> 4. Integration into SQL Client's environment file: >>> https://github.com/apache/flink/pull/7393 >>> >>> I realize that the overall Hive integration is still evolving, but I >>> believe that these PRs are a good stepping stone. Here's the list (in >>> bottom-up order): >>> - https://github.com/apache/flink/pull/7386 >>> - https://github.com/apache/flink/pull/7388 >>> - https://github.com/apache/flink/pull/7389 >>> - https://github.com/apache/flink/pull/7390 >>> - https://github.com/apache/flink/pull/7392 >>> - https://github.com/apache/flink/pull/7393 >>> >>> Thanks and enjoy 2019!
>>> Eron W >>> >>> >>> On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <xuef...@alibaba-inc.com> >>> wrote: >>> Hi Xiaowei, >>> >>> Thanks for bringing up the question. In the current design, the >>> properties for meta objects are meant to cover anything that's specific to >>> a particular catalog and agnostic to Flink. Anything that is common (such >>> as the schema for tables, query text for views, and UDF classnames) is >>> abstracted as members of the respective classes. However, this is still in >>> discussion, and Timo and I will go over this and provide an update. >>> >>> Please note that UDF is a little more involved than what the current >>> design doc shows. I'm still refining this part. >>> >>> Thanks, >>> Xuefu >>> >>> >>> ------------------------------------------------------------------ >>> Sender:Xiaowei Jiang <xiaow...@gmail.com> >>> Sent at:2018 Nov 18 (Sun) 15:17 >>> Recipient:dev <dev@flink.apache.org> >>> Cc:Xuefu <xuef...@alibaba-inc.com>; twalthr <twal...@apache.org>; piotr >>> <pi...@data-artisans.com>; Fabian Hueske <fhue...@gmail.com>; suez1224 < >>> suez1...@gmail.com> >>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> >>> Thanks Xuefu for the detailed design doc! One question on the properties >>> associated with the catalog objects. Are we going to leave them completely >>> free-form, or are we going to set some standard for them? I think that the >>> answer may depend on whether we want to explore catalog-specific optimization >>> opportunities. In any case, I think that it might be helpful to >>> standardize as much as possible into strongly typed classes and leave >>> these properties for catalog-specific things. But I think that we can do it >>> in steps. >>> >>> Xiaowei >>> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bowenl...@gmail.com> wrote: >>> Thanks for continuing to improve the overall design, Xuefu! It looks quite >>> good to me now.
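Xiaowei's suggestion above, to standardize common attributes into strongly typed classes while leaving a free-form properties map for catalog-specific things, can be illustrated with a small sketch. This is illustrative only: these are not Flink's actual classes, whose names and fields the design doc was still settling at the time.

```python
# Illustrative sketch of the split Xiaowei describes: common, Flink-agnostic
# attributes live as typed fields, while anything specific to a particular
# catalog stays in a free-form key/value properties map.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CatalogTable:
    name: str
    # Common attribute, standardized for every catalog: the table schema
    # as (column name, type) pairs.
    schema: List[Tuple[str, str]]
    # Catalog-specific, free-form properties (e.g. SerDe settings or
    # storage format hints) that Flink would pass through untouched.
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class CatalogView(CatalogTable):
    # Views additionally standardize the query text, per the design doc.
    query: str = ""


rides = CatalogTable(
    name="TaxiRides",
    schema=[("rideId", "BIGINT"), ("startTime", "TIMESTAMP")],
    properties={"connector.type": "kafka", "format.type": "json"},
)
```

The point of the split is that an optimizer can rely on the typed fields regardless of which catalog produced the object, while the properties map stays opaque.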
>>> >>> It would be nice if the cc-ed Flink committers could help review and >>> confirm! >>> >>> >>> >>> One minor suggestion: Since the last section of the design doc already >>> touches >>> on some new SQL statements, shall we add another section in our doc and >>> formalize the new SQL statements in SQL Client and TableEnvironment that >>> will come along naturally with our design? Here are some that the >>> design doc mentioned and some that I came up with: >>> >>> To be added: >>> >>> - USE <catalog> - set default catalog >>> - USE <catalog.schema> - set default schema >>> - SHOW CATALOGS - show all registered catalogs >>> - SHOW SCHEMAS [FROM catalog] - list schemas in the current default >>> catalog or the specified catalog >>> - DESCRIBE VIEW view - show the view's definition in CatalogView >>> - SHOW VIEWS [FROM schema/catalog.schema] - show views from the current >>> or a >>> specified schema. >>> >>> (DDLs that can be addressed by either our design or Shuyi's DDL >>> design) >>> >>> - CREATE/DROP/ALTER SCHEMA schema >>> - CREATE/DROP/ALTER CATALOG catalog >>> >>> To be modified: >>> >>> - SHOW TABLES [FROM schema/catalog.schema] - show tables from the >>> current or >>> a specified schema. Add 'from schema' to the existing 'SHOW TABLES' >>> statement >>> - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from the >>> current or a specified schema. Add 'from schema' to the existing 'SHOW >>> FUNCTIONS' >>> statement >>> >>> >>> Thanks, Bowen >>> >>> >>> >>> On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xuef...@alibaba-inc.com> >>> wrote: >>> >>> > Thanks, Bowen, for catching the error. I have granted comment >>> permission >>> > with the link. >>> > >>> > I also updated the doc with the latest class definitions. Everyone is >>> > encouraged to review and comment.
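To make Bowen's proposed statements above concrete, a SQL Client session using them might look like the following sketch. Note this is proposed syntax from the discussion, not something implemented in Flink at the time of these emails, and the catalog and schema names are made up:

```sql
-- Set the default catalog, then the default schema within it
USE hive_catalog;
USE hive_catalog.sales_db;

-- Discovery statements
SHOW CATALOGS;
SHOW SCHEMAS FROM hive_catalog;
SHOW TABLES FROM hive_catalog.sales_db;
SHOW FUNCTIONS FROM hive_catalog.sales_db;
SHOW VIEWS FROM hive_catalog.sales_db;

-- Inspect a view's definition (backed by CatalogView)
DESCRIBE VIEW daily_totals;

-- DDL covered by either this design or Shuyi's DDL design
CREATE CATALOG another_catalog;
CREATE SCHEMA another_catalog.staging;
DROP SCHEMA another_catalog.staging;
DROP CATALOG another_catalog;
```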
>>> > >>> > Thanks, >>> > Xuefu >>> > >>> > ------------------------------------------------------------------ >>> > Sender:Bowen Li <bowenl...@gmail.com> >>> > Sent at:2018 Nov 14 (Wed) 06:44 >>> > Recipient:Xuefu <xuef...@alibaba-inc.com> >>> > Cc:piotr <pi...@data-artisans.com>; dev <dev@flink.apache.org>; Shuyi >>> > Chen <suez1...@gmail.com> >>> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> > >>> > Hi Xuefu, >>> > >>> > Currently the new design doc >>> > < >>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit >>> > >>> > is in "view only" mode, and people cannot leave comments. Can you >>> please >>> > change it to "can comment" or "can edit" mode? >>> > >>> > Thanks, Bowen >>> > >>> > >>> > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xuef...@alibaba-inc.com >>> > >>> > wrote: >>> > Hi Piotr, >>> > >>> > I have extracted the API portion of the design and the Google doc is >>> here >>> > < >>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing >>> >. >>> > Please review and provide your feedback. >>> > >>> > Thanks, >>> > Xuefu >>> > >>> > ------------------------------------------------------------------ >>> > Sender:Xuefu <xuef...@alibaba-inc.com> >>> > Sent at:2018 Nov 12 (Mon) 12:43 >>> > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev < >>> > dev@flink.apache.org> >>> > Cc:Bowen Li <bowenl...@gmail.com>; Shuyi Chen <suez1...@gmail.com> >>> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> > >>> > Hi Piotr, >>> > >>> > That sounds good to me. Let's close all the open questions (there >>> are a >>> > couple of them) in the Google doc and I should be able to quickly >>> split >>> > it into the three proposals as you suggested.
>>> > >>> > Thanks, >>> > Xuefu >>> > >>> > ------------------------------------------------------------------ >>> > Sender:Piotr Nowojski <pi...@data-artisans.com> >>> > Sent at:2018 Nov 9 (Fri) 22:46 >>> > Recipient:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com> >>> > Cc:Bowen Li <bowenl...@gmail.com>; Shuyi Chen <suez1...@gmail.com> >>> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> > >>> > Hi, >>> > >>> > >>> > Yes, it seems like the best solution. Maybe someone else can also >>> suggest whether we can split it further? Maybe changes to the interface in one >>> doc, reading from the Hive metastore in another, and finally storing our meta >>> information in the Hive metastore? >>> > >>> > Piotrek >>> > >>> > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xuef...@alibaba-inc.com> >>> wrote: >>> > > >>> > > Hi Piotr, >>> > > >>> > > That seems to be a good idea! >>> > > >>> > >>> > > Since the Google doc for the design is currently under extensive >>> review, I will leave it as it is for now. However, I'll convert it to two >>> different FLIPs when the time comes. >>> > > >>> > > How does that sound to you? >>> > > >>> > > Thanks, >>> > > Xuefu >>> > > >>> > > >>> > > ------------------------------------------------------------------ >>> > > Sender:Piotr Nowojski <pi...@data-artisans.com> >>> > > Sent at:2018 Nov 9 (Fri) 02:31 >>> > > Recipient:dev <dev@flink.apache.org> >>> > > Cc:Bowen Li <bowenl...@gmail.com>; Xuefu <xuef...@alibaba-inc.com >>> > >; Shuyi Chen <suez1...@gmail.com> >>> > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >>> > > >>> > > Hi, >>> > > >>> > >>> > > Maybe we should split this topic (and the design doc) into a couple >>> of smaller ones, hopefully independent. The questions that you have asked >>> Fabian, for example, have very little to do with reading metadata from the Hive >>> Metastore.
>>> > > >>> > > Piotrek >>> > > >>> > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fhue...@gmail.com> wrote: >>> > >> >>> > >> Hi Xuefu and all, >>> > >> >>> > >> Thanks for sharing this design document! >>> > >>> > >> I'm very much in favor of restructuring / reworking the catalog >>> handling in >>> > >> Flink SQL as outlined in the document. >>> > >>> > >> Most changes described in the design document seem to be rather >>> general and >>> > >> not specifically related to the Hive integration. >>> > >> >>> > >>> > >> IMO, there are some aspects, especially those at the boundary of >>> Hive and >>> > >> Flink, that need a bit more discussion. For example >>> > >> >>> > >> * What does it take to make a Flink schema compatible with a Hive >>> schema? >>> > >> * How will Flink tables (descriptors) be stored in HMS? >>> > >> * How do the two Hive catalogs differ? Could they be integrated into a >>> > >> single one? When to use which one? >>> > >>> > >> * What meta information is provided by HMS? Which of it can be >>> leveraged >>> > >> by Flink? >>> > >> >>> > >> Thank you, >>> > >> Fabian >>> > >> >>> > >> On Fri, Nov 2, 2018 at 00:31, Bowen Li < >>> bowenl...@gmail.com >>> > > wrote: >>> > >> >>> > >>> After taking a look at how other discussion threads work, I think >>> it's >>> > >>> actually fine to just keep our discussion here. It's up to you, >>> Xuefu. >>> > >>> >>> > >>> The Google doc LGTM. I left some minor comments. >>> > >>> >>> > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bowenl...@gmail.com> >>> wrote: >>> > >>> >>> > >>>> Hi all, >>> > >>>> >>> > >>>> As Xuefu has published the design doc on Google, I agree with >>> Shuyi's >>> > >>> > >>>> suggestion that we probably should start a new email thread like >>> "[DISCUSS] >>> > >>> > >>>> ... Hive integration design ..." on only the dev mailing list for >>> community >>> > >>>> devs to review. The current thread sends to both dev and user >>> list.
>>> > >>>> >>> > >>> > >>>> This email thread is more like validating the general idea and >>> direction >>> > >>> > >>>> with the community, and it's been pretty long and crowded so >>> far. Since >>> > >>> > >>>> everyone is in favor of the idea, we can move forward with another >>> thread to >>> > >>>> discuss and finalize the design. >>> > >>>> >>> > >>>> Thanks, >>> > >>>> Bowen >>> > >>>> >>> > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu < >>> > xuef...@alibaba-inc.com> >>> > >>>> wrote: >>> > >>>> >>> > >>>>> Hi Shuyi, >>> > >>>>> >>> > >>> > >>>>> Good idea. Actually, the PDF was converted from a Google doc. >>> Here is its >>> > >>>>> link: >>> > >>>>> >>> > >>>>> >>> > >>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing >>> > >>>>> Once we reach an agreement, I can convert it to a FLIP. >>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:Shuyi Chen <suez1...@gmail.com> >>> > >>>>> Sent at:2018 Nov 1 (Thu) 02:47 >>> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com> >>> > >>>>> Cc:vino yang <yanghua1...@gmail.com>; Fabian Hueske < >>> > fhue...@gmail.com>; >>> > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Hi Xuefu, >>> > >>>>> >>> > >>> > >>>>> Thanks a lot for driving this big effort. I would suggest >>> converting your >>> > >>> > >>>>> proposal and design doc into a Google doc and sharing it on the >>> dev mailing >>> > >>> > >>>>> list for the community to review and comment on, with a title like >>> "[DISCUSS] ... >>> > >>> > >>>>> Hive integration design ..." . Once approved, we can document >>> it as a FLIP >>> > >>> > >>>>> (Flink Improvement Proposal), and use JIRAs to track the >>> implementations. >>> > >>>>> What do you think?
>>> > >>>>> >>> > >>>>> Shuyi >>> > >>>>> >>> > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu < >>> > xuef...@alibaba-inc.com> >>> > >>>>> wrote: >>> > >>>>> Hi all, >>> > >>>>> >>> > >>>>> I have also shared a design doc on Hive metastore integration >>> that is >>> > >>> > >>>>> attached here and also to FLINK-10556[1]. Please kindly review >>> and share >>> > >>>>> your feedback. >>> > >>>>> >>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556 >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com> >>> > >>>>> Sent at:2018 Oct 25 (Thu) 01:08 >>> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com>; Shuyi Chen < >>> > >>>>> suez1...@gmail.com> >>> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske < >>> > fhue...@gmail.com>; >>> > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Hi all, >>> > >>>>> >>> > >>>>> To wrap up the discussion, I have attached a PDF describing the >>> > >>> > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please >>> feel free to >>> > >>>>> watch that JIRA to track the progress. >>> > >>>>> >>> > >>>>> Please also let me know if you have additional comments or >>> questions. 
>>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556 >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com> >>> > >>>>> Sent at:2018 Oct 16 (Tue) 03:40 >>> > >>>>> Recipient:Shuyi Chen <suez1...@gmail.com> >>> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske < >>> > fhue...@gmail.com>; >>> > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Hi Shuyi, >>> > >>>>> >>> > >>> > >>>>> Thank you for your input. Yes, I agree with a phased approach >>> and would like >>> > >>> > >>>>> to move forward fast. :) We did some work internally on DDL >>> utilizing the Babel >>> > >>>>> parser in Calcite. While Babel makes Calcite's grammar >>> extensible, at >>> > >>>>> first impression it still seems too cumbersome for a project >>> when too >>> > >>> > >>>>> many extensions are made. It's even challenging to find where >>> the extension >>> > >>> > >>>>> is needed! It would certainly be better if Calcite could >>> magically support >>> > >>> > >>>>> HiveQL just by turning on a flag, such as the one for MYSQL_5. I >>> can also >>> > >>> > >>>>> see that this could mean a lot of work on Calcite. >>> Nevertheless, I will >>> > >>> > >>>>> bring up the discussion over there and see what their >>> community thinks. >>> > >>>>> >>> > >>>>> Would you mind sharing more info about the DDL proposal that you >>> > >>>>> mentioned? We can certainly collaborate on this.
>>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:Shuyi Chen <suez1...@gmail.com> >>> > >>>>> Sent at:2018 Oct 14 (Sun) 08:30 >>> > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com> >>> > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske < >>> > fhue...@gmail.com>; >>> > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Welcome to the community and thanks for the great proposal, >>> Xuefu! I >>> > >>> > >>>>> think the proposal can be divided into two stages: making Flink >>> support >>> > >>> > >>>>> Hive features, and making Hive work with Flink. I agree with >>> Timo on >>> > >>> > >>>>> starting with a smaller scope, so we can make progress faster. >>> As for [6], >>> > >>> > >>>>> a proposal for DDL is already in progress, and will come after >>> the unified >>> > >>> > >>>>> SQL connector API is done. For supporting Hive syntax, we might >>> need to >>> > >>>>> work with the Calcite community, and a recent effort called >>> Babel ( >>> > >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite >>> might >>> > >>>>> help here. >>> > >>>>> >>> > >>>>> Thanks >>> > >>>>> Shuyi >>> > >>>>> >>> > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu < >>> > xuef...@alibaba-inc.com> >>> > >>>>> wrote: >>> > >>>>> Hi Fabian/Vino, >>> > >>>>> >>> > >>> > >>>>> Thank you very much for your encouragement and inquiry. Sorry that >>> I didn't >>> > >>> > >>>>> see Fabian's email until I read Vino's response just now. >>> (Somehow Fabian's >>> > >>>>> went to the spam folder.) >>> > >>>>> >>> > >>> > >>>>> My proposal contains long-term and short-term goals. >>> Nevertheless, the >>> > >>>>> effort will focus on the following areas, including Fabian's >>> list: >>> > >>>>> >>> > >>>>> 1.
Hive metastore connectivity - This covers both read/write >>> access, >>> > >>> > >>>>> which means Flink can make full use of Hive's metastore as its >>> catalog (at >>> > >>>>> least for batch, but this can be extended for streaming as well). >>> > >>> > >>>>> 2. Metadata compatibility - Objects (databases, tables, >>> partitions, etc) >>> > >>> > >>>>> created by Hive can be understood by Flink, and the reverse >>> direction is >>> > >>>>> true also. >>> > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive >>> can be >>> > >>>>> consumed by Flink and vice versa. >>> > >>> > >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either >>> provides >>> > >>>>> its own implementation or makes Hive's implementation work in >>> Flink. >>> > >>>>> Further, for user-created UDFs in Hive, Flink SQL should >>> provide a >>> > >>> > >>>>> mechanism allowing users to import them into Flink without any >>> code change >>> > >>>>> required. >>> > >>>>> 5. Data types - Flink SQL should support all data types that >>> are >>> > >>>>> available in Hive. >>> > >>>>> 6. SQL language - Flink SQL should support the SQL standard (such as >>> > >>> > >>>>> SQL:2003) with extensions to support Hive's syntax and language >>> features, >>> > >>>>> around DDL, DML, and SELECT queries. >>> > >>> > >>>>> 7. SQL CLI - This is currently being developed in Flink but more >>> effort is >>> > >>>>> needed. >>> > >>> > >>>>> 8. Server - Provide a server that's compatible with Hive's >>> HiveServer2 >>> > >>> > >>>>> in its Thrift APIs, such that HiveServer2 users can reuse their >>> existing clients >>> > >>>>> (such as Beeline) but connect to Flink's Thrift server instead. >>> > >>> > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC >>> drivers for >>> > >>>>> other applications to use to connect to its Thrift server. >>> > >>>>> 10. Support other user customizations in Hive, such as Hive SerDes, >>> > >>>>> storage handlers, etc. >>> > >>> > >>>>> 11.
Better task failure tolerance and task scheduling at Flink >>> runtime. >>> > >>>>> >>> > >>>>> As you can see, achieving all of those requires significant effort >>> > >>> > >>>>> across all layers of Flink. However, a short-term goal could >>> include only >>> > >>> > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller >>> scope (such as >>> > >>>>> #3, #6). >>> > >>>>> >>> > >>> > >>>>> Please share your further thoughts. If we generally agree that >>> this is >>> > >>> > >>>>> the right direction, I could come up with a formal proposal >>> quickly and >>> > >>>>> then we can follow up with broader discussions. >>> > >>>>> >>> > >>>>> Thanks, >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> ------------------------------------------------------------------ >>> > >>>>> Sender:vino yang <yanghua1...@gmail.com> >>> > >>>>> Sent at:2018 Oct 11 (Thu) 09:45 >>> > >>>>> Recipient:Fabian Hueske <fhue...@gmail.com> >>> > >>>>> Cc:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com >>> > >; user < >>> > >>>>> u...@flink.apache.org> >>> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive >>> ecosystem >>> > >>>>> >>> > >>>>> Hi Xuefu, >>> > >>>>> >>> > >>> > >>>>> I appreciate this proposal, and like Fabian, I think it would be >>> better if you >>> > >>>>> could give more details of the plan. >>> > >>>>> >>> > >>>>> Thanks, vino. >>> > >>>>> >>> > >>>>> Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM: >>> > >>>>> Hi Xuefu, >>> > >>>>> >>> > >>> > >>>>> Welcome to the Flink community and thanks for starting this >>> discussion! >>> > >>>>> Better Hive integration would be really great! >>> > >>>>> Can you go into details of what you are proposing? I can think >>> of a >>> > >>>>> couple of ways to improve Flink in that regard: >>> > >>>>> >>> > >>>>> * Support for Hive UDFs >>> > >>>>> * Support for the Hive metadata catalog >>> > >>>>> * Support for HiveQL syntax >>> > >>>>> * ???
>>> > >>>>> >>> > >>>>> Best, Fabian >>> > >>>>> >>> > >>>>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu < >>> > >>>>> xuef...@alibaba-inc.com> wrote: >>> > >>>>> Hi all, >>> > >>>>> >>> > >>>>> Along with the community's effort, inside Alibaba we have >>> explored >>> > >>> > >>>>> Flink's potential as an execution engine not just for stream >>> processing but >>> > >>>>> also for batch processing. We are encouraged by our findings >>> and have >>> > >>> > >>>>> initiated our effort to make Flink's SQL capabilities >>> full-fledged. When >>> > >>> > >>>>> comparing what's available in Flink to the offerings from >>> competitive data >>> > >>> > >>>>> processing engines, we identified a major gap in Flink: good >>> integration >>> > >>> > >>>>> with the Hive ecosystem. This is crucial to the success of Flink >>> SQL and batch >>> > >>> > >>>>> processing due to the well-established data ecosystem around Hive. >>> Therefore, we have >>> > >>> > >>>>> done some initial work along this direction, but there is still >>> a lot of >>> > >>>>> effort needed. >>> > >>>>> >>> > >>>>> We have two strategies in mind. The first one is to make Flink >>> SQL >>> > >>> > >>>>> full-fledged and well-integrated with the Hive ecosystem. This is a >>> similar >>> > >>> > >>>>> approach to what Spark SQL adopted. The second strategy is to >>> make Hive >>> > >>> > >>>>> itself work with Flink, similar to the proposal in [1]. Each >>> approach bears >>> > >>> > >>>>> its pros and cons, but they don’t need to be mutually exclusive, >>> with each >>> > >>>>> targeting different users and use cases. We believe that >>> both will >>> > >>>>> promote a much greater adoption of Flink beyond stream >>> processing. >>> > >>>>> >>> > >>>>> We have been focused on the first approach and would like to >>> showcase >>> > >>> > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we >>> have also >>> > >>>>> planned to start strategy #2 as the follow-up effort.
>>> > >>>>> >>> > >>> > >>>>> I'm completely new to Flink (with a short bio [2] below), >>> though many >>> > >>> > >>>>> of my colleagues here at Alibaba are long-time contributors. >>> Nevertheless, >>> > >>> > >>>>> I'd like to share our thoughts and invite your early feedback. >>> At the same >>> > >>> > >>>>> time, I am working on a detailed proposal on Flink SQL's >>> integration with >>> > >>>>> the Hive ecosystem, which will also be shared when ready. >>> > >>>>> >>> > >>>>> While the ideas are simple, each approach will demand >>> significant >>> > >>> > >>>>> effort, more than what we can afford. Thus, the input and >>> contributions >>> > >>>>> from the community are greatly welcomed and appreciated. >>> > >>>>> >>> > >>>>> Regards, >>> > >>>>> >>> > >>>>> >>> > >>>>> Xuefu >>> > >>>>> >>> > >>>>> References: >>> > >>>>> >>> > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712 >>> > >>> > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked or >>> is working on >>> > >>>>> many projects under the Apache Foundation, of which he is also an >>> honored >>> > >>> > >>>>> member. About 10 years ago he worked in the Hadoop team at >>> Yahoo when the >>> > >>> > >>>>> projects had just gotten started. Later he worked at Cloudera, >>> initiating and >>> > >>> > >>>>> leading the development of the Hive on Spark project in the >>> communities and >>> > >>> > >>>>> across many organizations. Prior to joining Alibaba, he worked >>> at Uber, >>> > >>> > >>>>> where he promoted Hive on Spark to all of Uber's SQL-on-Hadoop >>> workloads and >>> > >>>>> significantly improved Uber's cluster efficiency. >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> -- >>> > >>> > >>>>> "So you have to trust that the dots will somehow connect in >>> your future." >>> > >>>>> >>> > >>>>> >>> > >>>>> -- >>> > >>> > >>>>> "So you have to trust that the dots will somehow connect in >>> your future." >>> > >>>>> >>> > >>> > >>> >>>