Thank you, Xuefu, for bringing up this awesome, detailed proposal! It will resolve a lot of existing pain for users like me.

In general, I totally agree that improving FlinkSQL's completeness would be a much better starting point than building 'Hive on Flink', as the Hive community is concerned about Flink's SQL incompleteness and lack of proven batch performance, as shown in https://issues.apache.org/jira/browse/HIVE-10712. Improving FlinkSQL seems a more natural direction to start with in order to achieve the integration.

Xuefu and Timo have laid out quite a clear path of what to tackle next. Given that there are already some efforts going on for items 1, 2, 5, 3, 4, 6 in Xuefu's list, shall we:

* identify gaps between a) Xuefu's proposal/discussion result in this thread and b) all the ongoing work/discussions?
* then, create some new top-level JIRA tickets to keep track of and start more detailed discussions with?

It's going to be a great and influential project, and I'd love to participate in it to move FlinkSQL's adoption and ecosystem even further.

Thanks,
Bowen

> On Oct 12, 2018, at 3:37 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Thank you, very nice. I fully agree with that.
>
>> On Oct 11, 2018, at 19:31, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>>
>> Hi Jörn,
>>
>> Thanks for your feedback. Yes, I think Hive on Flink makes sense, and in
>> fact it is one of the two approaches that I named in the beginning of the
>> thread. As also pointed out there, this isn't mutually exclusive with the
>> work we proposed inside Flink, and they target different user groups and
>> use cases. Further, what we proposed to do in Flink should be a good
>> showcase that demonstrates Flink's capabilities in batch processing and
>> convinces the Hive community of the worth of a new engine. As you might
>> know, the idea encountered some doubt and resistance. Nevertheless, we do
>> have a solid plan for Hive on Flink, which we will execute once Flink SQL
>> is in good shape.
>>
>> I also agree with you that Flink SQL shouldn't be closely coupled with Hive.
>> While we mentioned Hive in many of the proposed items, most of them are
>> coupled only in concepts and functionality rather than code or libraries.
>> We are taking advantage of the connector framework in Flink. The only thing
>> that might be exceptional is support for Hive built-in UDFs, which we may
>> not make work out of the box, to avoid the coupling. We could, for example,
>> require users to bring in the Hive library and register the UDFs
>> themselves. This is subject to further discussion.
>>
>> #11 is about Flink runtime enhancements that are meant to make task
>> failures more tolerable (so that the job doesn't have to start from the
>> beginning in case of task failures) and to make task scheduling more
>> resource-efficient. Flink's current design in those two aspects leans more
>> toward stream processing, which may not be good enough for batch
>> processing. We will provide a more detailed design when we get to them.
>>
>> Please let me know if you have further thoughts or feedback.
>>
>> Thanks,
>> Xuefu
>>
>>
>> ------------------------------------------------------------------
>> Sender:Jörn Franke <jornfra...@gmail.com>
>> Sent at:2018 Oct 11 (Thu) 13:54
>> Recipient:Xuefu <xuef...@alibaba-inc.com>
>> Cc:vino yang <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev
>> <dev@flink.apache.org>; user <u...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Would it maybe make sense to provide Flink as an engine on Hive
>> („flink-on-Hive“)? E.g., to address 4,5,6,8,9,10. This could be more
>> loosely coupled than integrating Hive in all possible Flink core modules
>> and thus introducing a very tight dependency on Hive in the core.
>> 1,2,3 could be achieved via a connector based on the Flink Table API.
>> Just as a proposal to start this endeavour as independent projects (Hive
>> engine, connector) to avoid too tight coupling with Flink.
>> Maybe in a more distant future, if the Hive integration is heavily
>> demanded, one could then integrate it more tightly if needed.
>>
>> What is meant by 11?
>> On Oct 11, 2018, at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>>
>> Hi Fabian/Vino,
>>
>> Thank you very much for your encouragement and inquiry. Sorry that I didn't
>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
>> went to the spam folder.)
>>
>> My proposal contains long-term and short-term goals. Nevertheless, the
>> effort will focus on the following areas, including Fabian's list:
>>
>> 1. Hive metastore connectivity - This covers both read/write access, which
>> means Flink can make full use of Hive's metastore as its catalog (at least
>> for batch, but it can be extended to streaming as well).
>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
>> created by Hive can be understood by Flink, and the reverse direction is
>> true also.
>> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed
>> by Flink and vice versa.
>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides
>> its own implementation or makes Hive's implementation work in Flink.
>> Further, for user-created UDFs in Hive, Flink SQL should provide a
>> mechanism allowing users to import them into Flink without any code change
>> required.
>> 5. Data types - Flink SQL should support all data types that are available
>> in Hive.
>> 6. SQL Language - Flink SQL should support the SQL standard (such as
>> SQL2003) with extensions to support Hive's syntax and language features,
>> around DDL, DML, and SELECT queries.
>> 7. SQL CLI - this is currently being developed in Flink, but more effort is
>> needed.
>> 8. Server - provide a server that's compatible with Hive's HiveServer2 in
>> thrift APIs, such that HiveServer2 users can reuse their existing clients
>> (such as beeline) but connect to Flink's thrift server instead.
>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other
>> applications to use to connect to its thrift server.
>> 10. Support other user customizations in Hive, such as Hive SerDes,
>> storage handlers, etc.
>> 11. Better task failure tolerance and task scheduling at Flink runtime.
>>
>> As you can see, achieving all of those requires significant effort across
>> all layers in Flink. However, a short-term goal could include only core
>> areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3,
>> #6).
>>
>> Please share your further thoughts. If we generally agree that this is the
>> right direction, I could come up with a formal proposal quickly and then we
>> can follow up with broader discussions.
>>
>> Thanks,
>> Xuefu
>>
>>
>>
>> ------------------------------------------------------------------
>> Sender:vino yang <yanghua1...@gmail.com>
>> Sent at:2018 Oct 11 (Thu) 09:45
>> Recipient:Fabian Hueske <fhue...@gmail.com>
>> Cc:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user
>> <u...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> I appreciate this proposal and, like Fabian, think it would be better if
>> you could give more details of the plan.
>>
>> Thanks, vino.
>>
>> On Wed, Oct 10, 2018 at 5:27 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>> Hi Xuefu,
>>
>> Welcome to the Flink community and thanks for starting this discussion!
>> Better Hive integration would be really great!
>> Can you go into details of what you are proposing? I can think of a couple
>> of ways to improve Flink in that regard:
>>
>> * Support for Hive UDFs
>> * Support for Hive metadata catalog
>> * Support for HiveQL syntax
>> * ???
>>
>> Best, Fabian
>>
>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>> Hi all,
>>
>> Along with the community's effort, inside Alibaba we have explored Flink's
>> potential as an execution engine not just for stream processing but also
>> for batch processing. We are encouraged by our findings and have initiated
>> our effort to make Flink's SQL capabilities full-fledged. When comparing
>> what's available in Flink to the offerings from competing data processing
>> engines, we identified a major gap in Flink: good integration with the Hive
>> ecosystem. This is crucial to the success of Flink SQL and batch due to the
>> well-established data ecosystem around Hive. Therefore, we have done some
>> initial work in this direction, but there is still a lot of effort needed.
>>
>> We have two strategies in mind. The first one is to make Flink SQL
>> full-fledged and well-integrated with the Hive ecosystem. This is a similar
>> approach to what Spark SQL adopted. The second strategy is to make Hive
>> itself work with Flink, similar to the proposal in [1]. Each approach has
>> its pros and cons, but they don't need to be mutually exclusive, with each
>> targeting different users and use cases. We believe that both will
>> promote a much greater adoption of Flink beyond stream processing.
>>
>> We have been focused on the first approach and would like to showcase
>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
>> planned to start strategy #2 as the follow-up effort.
>>
>> I'm completely new to Flink (with a short bio [2] below), though many of my
>> colleagues here at Alibaba are long-time contributors. Nevertheless, I'd
>> like to share our thoughts and invite your early feedback. At the same
>> time, I am working on a detailed proposal on Flink SQL's integration with
>> the Hive ecosystem, which will also be shared when ready.
>>
>> While the ideas are simple, each approach will demand significant effort,
>> more than what we can afford.
>> Thus, the input and contributions from the communities are greatly welcome
>> and appreciated.
>>
>> Regards,
>>
>>
>> Xuefu
>>
>> References:
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran who has worked or is
>> working on many projects under the Apache Foundation, of which he is also
>> an honored member. About 10 years ago he worked in the Hadoop team at Yahoo
>> where the projects had just gotten started. Later he worked at Cloudera,
>> initiating and leading the development of the Hive on Spark project in the
>> communities and across many organizations. Prior to joining Alibaba, he
>> worked at Uber, where he promoted Hive on Spark to all of Uber's SQL on
>> Hadoop workloads and significantly improved Uber's cluster efficiency.