After taking a look at how other discussion threads work, I think it's actually fine to just keep our discussion here. It's up to you, Xuefu.
The google doc LGTM. I left some minor comments.

On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bowenl...@gmail.com> wrote:

> Hi all,
>
> As Xuefu has published the design doc on google, I agree with Shuyi's
> suggestion that we should probably start a new email thread like "[DISCUSS]
> ... Hive integration design ..." on the dev mailing list only, for community
> devs to review. The current thread goes to both the dev and user lists.
>
> This email thread was more about validating the general idea and direction
> with the community, and it has grown pretty long and crowded. Since
> everyone is in favor of the idea, we can move forward with another thread to
> discuss and finalize the design.
>
> Thanks,
> Bowen
>
> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
>> Hi Shuyi,
>>
>> Good idea. Actually, the PDF was converted from a google doc. Here is its
>> link:
>>
>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>
>> Once we reach an agreement, I can convert it to a FLIP.
>>
>> Thanks,
>> Xuefu
>>
>> ------------------------------------------------------------------
>> Sender: Shuyi Chen <suez1...@gmail.com>
>> Sent at: 2018 Nov 1 (Thu) 02:47
>> Recipient: Xuefu <xuef...@alibaba-inc.com>
>> Cc: vino yang <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> Thanks a lot for driving this big effort. I would suggest converting your
>> proposal and design doc into a google doc and sharing it on the dev mailing
>> list for the community to review and comment on, with a title like
>> "[DISCUSS] ... Hive integration design ...". Once approved, we can document
>> it as a FLIP (Flink Improvement Proposal) and use JIRAs to track the
>> implementation. What do you think?
>>
>> Shuyi
>>
>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>>
>> Hi all,
>>
>> I have also shared a design doc on Hive metastore integration that is
>> attached here and also to FLINK-10556 [1]. Please kindly review and share
>> your feedback.
>>
>> Thanks,
>> Xuefu
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>
>> ------------------------------------------------------------------
>> Sender: Xuefu <xuef...@alibaba-inc.com>
>> Sent at: 2018 Oct 25 (Thu) 01:08
>> Recipient: Xuefu <xuef...@alibaba-inc.com>; Shuyi Chen <suez1...@gmail.com>
>> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi all,
>>
>> To wrap up the discussion, I have attached a PDF describing the proposal,
>> which is also attached to FLINK-10556 [1]. Please feel free to watch that
>> JIRA to track the progress.
>>
>> Please also let me know if you have additional comments or questions.
>>
>> Thanks,
>> Xuefu
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>
>> ------------------------------------------------------------------
>> Sender: Xuefu <xuef...@alibaba-inc.com>
>> Sent at: 2018 Oct 16 (Tue) 03:40
>> Recipient: Shuyi Chen <suez1...@gmail.com>
>> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Shuyi,
>>
>> Thank you for your input. Yes, I agree with a phased approach and would
>> like to move forward fast. :) We did some work internally on DDL utilizing
>> the babel parser in Calcite. While babel makes Calcite's grammar extensible,
>> at first impression it still seems too cumbersome for a project when too
>> many extensions are made. It's even challenging to find where an extension
>> is needed!
It would certainly be better if Calcite could magically support
>> Hive QL just by turning on a flag, such as the one for MYSQL_5. I can also
>> see that this could mean a lot of work on the Calcite side. Nevertheless,
>> I will bring up the discussion over there and see what their community
>> thinks.
>>
>> Would you mind sharing more info about the proposal on DDL that you
>> mentioned? We can certainly collaborate on this.
>>
>> Thanks,
>> Xuefu
>>
>> ------------------------------------------------------------------
>> Sender: Shuyi Chen <suez1...@gmail.com>
>> Sent at: 2018 Oct 14 (Sun) 08:30
>> Recipient: Xuefu <xuef...@alibaba-inc.com>
>> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Welcome to the community and thanks for the great proposal, Xuefu! I
>> think the proposal can be divided into 2 stages: making Flink support
>> Hive features, and making Hive work with Flink. I agree with Timo on
>> starting with a smaller scope, so we can make progress faster. As for [6],
>> a proposal for DDL is already in progress, and will come after the unified
>> SQL connector API is done. For supporting Hive syntax, we might need to
>> work with the Calcite community, and a recent effort called babel
>> (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
>> help here.
>>
>> Thanks,
>> Shuyi
>>
>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>>
>> Hi Fabian/Vino,
>>
>> Thank you very much for your encouragement and inquiry. Sorry that I
>> didn't see Fabian's email until I read Vino's response just now. (Somehow
>> Fabian's went to the spam folder.)
>>
>> My proposal contains long-term and short-term goals. Nevertheless, the
>> effort will focus on the following areas, including Fabian's list:
>>
>> 1.
Hive metastore connectivity - This covers both read and write access,
>> which means Flink can make full use of Hive's metastore as its catalog (at
>> least for batch, but this can be extended to streaming as well).
>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
>> created by Hive can be understood by Flink, and vice versa.
>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>> consumed by Flink and vice versa.
>> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either
>> provides its own implementation or makes Hive's implementation work in
>> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
>> mechanism allowing users to import them into Flink without any code change
>> required.
>> 5. Data types - Flink SQL should support all data types that are
>> available in Hive.
>> 6. SQL language - Flink SQL should support the SQL standard (such as
>> SQL:2003) with extensions to support Hive's syntax and language features,
>> around DDL, DML, and SELECT queries.
>> 7. SQL CLI - this is currently under development in Flink, but more
>> effort is needed.
>> 8. Server - provide a server that's compatible with HiveServer2's thrift
>> APIs, such that HiveServer2 users can reuse their existing clients (such
>> as beeline) but connect to Flink's thrift server instead.
>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>> other applications to connect to its thrift server.
>> 10. Support for other user customizations in Hive, such as Hive SerDes,
>> storage handlers, etc.
>> 11. Better task failure tolerance and task scheduling in the Flink
>> runtime.
>>
>> As you can see, achieving all of this requires significant effort across
>> all layers of Flink. However, a short-term goal could include only core
>> areas (such as 1, 2, 4, 5, 6, 7) or start with a smaller scope (such as
>> #3, #6).
>>
>> Please share your further thoughts.
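Point 5 in the list above (data types) amounts to a translation table between Hive's type system and Flink SQL's. The sketch below is a hypothetical illustration of such a mapping: the type names are real Hive and Flink SQL type names, but the table and helper function are an example, not Flink's actual implementation.

```python
# Illustrative sketch only: a hypothetical mapping from Hive data type
# names to Flink SQL type names. Not Flink's actual code.
import re

# Scalar types with a direct one-to-one correspondence.
_SIMPLE = {
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "BOOLEAN": "BOOLEAN",
    "STRING": "STRING",
    "BINARY": "BYTES",      # Hive BINARY maps to Flink's variable-length BYTES
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}

def hive_to_flink_type(hive_type: str) -> str:
    """Translate a Hive type string into a Flink SQL type string."""
    t = hive_type.strip().upper()
    if t in _SIMPLE:
        return _SIMPLE[t]
    # Parameterized types keep their precision/length arguments.
    m = re.fullmatch(r"(DECIMAL|VARCHAR|CHAR)\((\d+(?:,\s*\d+)?)\)", t)
    if m:
        return f"{m.group(1)}({m.group(2)})"
    # Nested types are translated recursively, e.g. ARRAY<INT>.
    m = re.fullmatch(r"ARRAY<(.+)>", t)
    if m:
        return f"ARRAY<{hive_to_flink_type(m.group(1))}>"
    raise ValueError(f"unmapped Hive type: {hive_type}")
```

For example, `hive_to_flink_type("decimal(10,2)")` would return `"DECIMAL(10,2)"`. A real implementation would also have to cover MAP, STRUCT, precision limits, and the cases with no clean one-to-one counterpart, which is where most of the effort in point 5 lies.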
If we generally agree that this is
>> the right direction, I could come up with a formal proposal quickly, and
>> then we can follow up with broader discussions.
>>
>> Thanks,
>> Xuefu
>>
>> ------------------------------------------------------------------
>> Sender: vino yang <yanghua1...@gmail.com>
>> Sent at: 2018 Oct 11 (Thu) 09:45
>> Recipient: Fabian Hueske <fhue...@gmail.com>
>> Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> I appreciate this proposal, and like Fabian said, it would look better if
>> you could give more details of the plan.
>>
>> Thanks, vino.
>>
>> Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
>>
>> Hi Xuefu,
>>
>> Welcome to the Flink community and thanks for starting this discussion!
>> Better Hive integration would be really great!
>> Can you go into the details of what you are proposing? I can think of a
>> couple of ways to improve Flink in that regard:
>>
>> * Support for Hive UDFs
>> * Support for Hive metadata catalog
>> * Support for HiveQL syntax
>> * ???
>>
>> Best, Fabian
>>
>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>>
>> Hi all,
>>
>> Along with the community's effort, inside Alibaba we have explored
>> Flink's potential as an execution engine not just for stream processing
>> but also for batch processing. We are encouraged by our findings and have
>> initiated an effort to make Flink's SQL capabilities full-fledged. When
>> comparing what's available in Flink to the offerings of competing data
>> processing engines, we identified a major gap in Flink: good integration
>> with the Hive ecosystem. This is crucial to the success of Flink SQL and
>> batch processing due to the well-established data ecosystem around Hive.
>> Therefore, we have done some initial work in this direction, but a lot of
>> effort is still needed.
>>
>> We have two strategies in mind. The first is to make Flink SQL
>> full-fledged and well integrated with the Hive ecosystem. This is a
>> similar approach to the one Spark SQL adopted. The second strategy is to
>> make Hive itself work with Flink, similar to the proposal in [1]. Each
>> approach has its pros and cons, but they don't need to be mutually
>> exclusive, with each targeting different users and use cases. We believe
>> that both will promote a much greater adoption of Flink beyond stream
>> processing.
>>
>> We have been focused on the first approach and would like to showcase
>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
>> planned to start strategy #2 as a follow-up effort.
>>
>> I'm completely new to Flink (with a short bio [2] below), though many of
>> my colleagues here at Alibaba are long-time contributors. Nevertheless,
>> I'd like to share our thoughts and invite your early feedback. At the same
>> time, I am working on a detailed proposal for Flink SQL's integration with
>> the Hive ecosystem, which will also be shared when ready.
>>
>> While the ideas are simple, each approach will demand significant effort,
>> more than what we can afford. Thus, input and contributions from the
>> community are greatly welcome and appreciated.
>>
>> Regards,
>> Xuefu
>>
>> References:
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran who has worked on many
>> projects under the Apache Foundation, of which he is also an honored
>> member. About 10 years ago he worked on the Hadoop team at Yahoo, where
>> the projects had just gotten started. Later he worked at Cloudera,
>> initiating and leading the development of the Hive on Spark project in the
>> community and across many organizations. Prior to joining Alibaba, he
>> worked at Uber, where he rolled out Hive on Spark to all of Uber's
>> SQL-on-Hadoop workload and significantly improved Uber's cluster
>> efficiency.
>>
>> --
>> "So you have to trust that the dots will somehow connect in your future."