Hi Xuefu,

Thanks a lot for driving this big effort. I would suggest converting your proposal and design doc into a Google doc and sharing it on the dev mailing list for the community to review and comment, with a title like "[DISCUSS] ... Hive integration design ...". Once approved, we can document it as a FLIP (Flink Improvement Proposal) and use JIRAs to track the implementation. What do you think?
Shuyi

On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:

> Hi all,
>
> I have also shared a design doc on Hive metastore integration that is attached here and also to FLINK-10556 [1]. Please kindly review and share your feedback.
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
>
> ------------------------------------------------------------------
> Sender: Xuefu <xuef...@alibaba-inc.com>
> Sent at: 2018 Oct 25 (Thu) 01:08
> Recipient: Xuefu <xuef...@alibaba-inc.com>; Shuyi Chen <suez1...@gmail.com>
> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <d...@flink.apache.org>; user <user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi all,
>
> To wrap up the discussion, I have attached a PDF describing the proposal, which is also attached to FLINK-10556 [1]. Please feel free to watch that JIRA to track the progress.
>
> Please also let me know if you have additional comments or questions.
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
>
> ------------------------------------------------------------------
> Sender: Xuefu <xuef...@alibaba-inc.com>
> Sent at: 2018 Oct 16 (Tue) 03:40
> Recipient: Shuyi Chen <suez1...@gmail.com>
> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <d...@flink.apache.org>; user <user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Shuyi,
>
> Thank you for your input. Yes, I agree with a phased approach and would like to move forward fast. :) We did some work internally on DDL utilizing the Babel parser in Calcite. While Babel makes Calcite's grammar extensible, at first impression it still seems too cumbersome for a project when too many extensions are made. It's even challenging to find where the extension is needed!
> It would certainly be better if Calcite could magically support Hive QL by just turning on a flag, such as that for MYSQL_5. I can also see that this could mean a lot of work on Calcite. Nevertheless, I will bring up the discussion over there and see what their community thinks.
>
> Would you mind sharing more info about the proposal on DDL that you mentioned? We can certainly collaborate on this.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: Shuyi Chen <suez1...@gmail.com>
> Sent at: 2018 Oct 14 (Sun) 08:30
> Recipient: Xuefu <xuef...@alibaba-inc.com>
> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <d...@flink.apache.org>; user <user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Welcome to the community and thanks for the great proposal, Xuefu! I think the proposal can be divided into 2 stages: making Flink support Hive features, and making Hive work with Flink. I agree with Timo on starting with a smaller scope, so we can make progress faster. As for [6], a proposal for DDL is already in progress and will come after the unified SQL connector API is done. For supporting Hive syntax, we might need to work with the Calcite community, and a recent effort called Babel (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.
>
> Thanks
> Shuyi
>
> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> Hi Fabian/Vino,
>
> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>
> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can extend to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse direction is true as well.
> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change required.
> 5. Data types - Flink SQL should support all data types that are available in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
> 7. SQL CLI - this is currently under development in Flink, but more effort is needed.
> 8. Server - provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its Thrift server.
> 10. Support other users' customizations in Hive, such as Hive SerDes, storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>
> As you can see, achieving all of this requires significant effort across all layers in Flink. However, a short-term goal could include only core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
>
> Please share your further thoughts.
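To make item 5 (data types) in the list above concrete: most Hive primitive types have direct Flink SQL counterparts, while a few (STRING, STRUCT) map onto differently named types. The sketch below is purely illustrative; the `map_hive_type` helper and the exact mapping table are hypothetical, not part of any Flink or Hive API.

```python
# Illustrative toy mapping from Hive data types to Flink SQL types,
# as the "data types" work item might require. The real integration
# would live inside Flink's catalog/type-system code, not here.
import re

# Hypothetical mapping table; names follow common Hive and Flink SQL
# type names, but this is an editor's sketch, not the project's code.
HIVE_TO_FLINK = {
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "BOOLEAN": "BOOLEAN",
    "STRING": "VARCHAR",   # Hive STRING is unbounded text
    "BINARY": "VARBINARY",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}

def map_hive_type(hive_type: str) -> str:
    """Map a Hive type string to a (hypothetical) Flink SQL type string."""
    t = hive_type.strip().upper()
    # Parameterized types such as DECIMAL(10,2) keep their parameters.
    m = re.fullmatch(r"(DECIMAL|CHAR|VARCHAR)\s*(\(.*\))", t)
    if m:
        return m.group(1) + m.group(2)
    # Nested types map element-wise.
    m = re.fullmatch(r"ARRAY\s*<(.+)>", t)
    if m:
        return f"ARRAY<{map_hive_type(m.group(1))}>"
    if t not in HIVE_TO_FLINK:
        raise ValueError(f"unsupported Hive type: {hive_type}")
    return HIVE_TO_FLINK[t]

print(map_hive_type("string"))          # VARCHAR
print(map_hive_type("decimal(10,2)"))   # DECIMAL(10,2)
print(map_hive_type("array<int>"))      # ARRAY<INT>
```

A real mapping would also have to handle MAP and STRUCT (which Flink expresses as ROW) and Hive-specific corner cases; the point here is only that metadata compatibility (item 2) hinges on such a faithful, bidirectional type translation.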
> If we generally agree that this is the right direction, I could come up with a formal proposal quickly, and then we can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: vino yang <yanghua1...@gmail.com>
> Sent at: 2018 Oct 11 (Thu) 09:45
> Recipient: Fabian Hueske <fhue...@gmail.com>
> Cc: dev <d...@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> I appreciate this proposal, and like Fabian, I think it would be better if you could give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great! Can you go into the details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> On Tue., Oct 9, 2018 at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction, but a lot of effort is still needed.
>
> We have two strategies in mind.
> The first one is to make Flink SQL full-fledged and well-integrated with the Hive ecosystem. This is a similar approach to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as a follow-up effort.
>
> I'm completely new to Flink (with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
>
> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcome and appreciated.
>
> Regards,
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo, when the projects were just getting started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber, where he promoted Hive on Spark for all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.
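Item 4 of the list earlier in the thread (importing user-created Hive UDFs into Flink without code changes) essentially calls for an adapter layer between the two projects' function interfaces. The sketch below is a conceptual toy, not real Flink or Hive code: Hive UDFs conventionally expose an `evaluate` method, while Flink scalar functions expose `eval`; every class and registry name here is hypothetical.

```python
# Conceptual sketch only: adapting a Hive-style UDF (an `evaluate` method)
# to a Flink-style scalar function (an `eval` method) by delegation, so an
# existing Hive UDF class could be reused unmodified. Nothing below is
# actual Flink or Hive API; it only illustrates the adapter idea.

class UpperUDF:
    """Stands in for a user's existing Hive UDF class (unchanged)."""
    def evaluate(self, s):
        return s.upper() if s is not None else None

class HiveUdfAdapter:
    """Wraps a Hive-style UDF so it looks like a Flink-style scalar function."""
    def __init__(self, hive_udf_class):
        # Instantiate the user's class; in Java this would be reflection.
        self._udf = hive_udf_class()

    def eval(self, *args):
        # Delegate the Flink-side entry point to the Hive-side one.
        return self._udf.evaluate(*args)

# "Importing" the Hive UDF into a (hypothetical) function registry
# without touching the user's UDF code:
registry = {}
registry["my_upper"] = HiveUdfAdapter(UpperUDF)
print(registry["my_upper"].eval("flink"))   # FLINK
```

The real work is, of course, much harder than this delegation: Hive's GenericUDF variants, object inspectors, and type coercion all have to be bridged, which is why item 4 is listed as a core area on its own.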
>
> --
> "So you have to trust that the dots will somehow connect in your future."

--
"So you have to trust that the dots will somehow connect in your future."