Thank you, very nice. I fully agree with that.
> On 11 Oct 2018, at 19:31, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
> Hi Jörn,
>
> Thanks for your feedback. Yes, I think Hive on Flink makes sense, and in fact it is one of the two approaches that I named at the beginning of the thread. As also pointed out there, this isn't mutually exclusive from the work we proposed inside Flink; the two target different user groups and use cases. Further, what we proposed to do in Flink should be a good showcase that demonstrates Flink's capabilities in batch processing and convinces the Hive community of the worth of a new engine. As you might know, the idea encountered some doubt and resistance. Nevertheless, we do have a solid plan for Hive on Flink, which we will execute once Flink SQL is in good shape.
>
> I also agree with you that Flink SQL shouldn't be closely coupled with Hive. While we mentioned Hive in many of the proposed items, most of them are coupled only in concepts and functionality rather than in code or libraries. We are taking advantage of the connector framework in Flink. The only possible exception is support for Hive built-in UDFs, which we may not make work out of the box in order to avoid the coupling. We could, for example, require users to bring in the Hive library and register the UDFs themselves. This is subject to further discussion.
>
> #11 is about Flink runtime enhancements meant to make task failures more tolerable (so that the job doesn't have to start from the beginning in case of a task failure) and to make task scheduling more resource-efficient. Flink's current design in those two aspects leans more toward stream processing, which may not be good enough for batch processing. We will provide a more detailed design when we get to them.
>
> Please let me know if you have further thoughts or feedback.
>
> Thanks,
> Xuefu
>
>
> ------------------------------------------------------------------
> Sender: Jörn Franke <jornfra...@gmail.com>
> Sent at: 2018 Oct 11 (Thu) 13:54
> Recipient: Xuefu <xuef...@alibaba-inc.com>
> Cc: vino yang <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Would it maybe make sense to provide Flink as an engine on Hive ("Flink-on-Hive"), e.g. to address 4, 5, 6, 8, 9, 10? This could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
> 1, 2, 3 could be achieved via a connector based on the Flink Table API.
> This is just a proposal to start this endeavour as independent projects (Hive engine, connector) to avoid too tight a coupling with Flink. Maybe in a more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.
>
> What is meant by 11?
>
> On 11 Oct 2018, at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
> Hi Fabian/Vino,
>
> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>
> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse direction holds as well.
> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without requiring any code change.
> 5. Data types - Flink SQL should support all data types that are available in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
> 7. SQL CLI - This is currently being developed in Flink, but more effort is needed.
> 8. Server - Provide a server that is compatible with Hive's HiveServer2 Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to connect to its Thrift server.
> 10. Support other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>
> As you can see, achieving all of this requires significant effort across all layers of Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start with a smaller scope (such as #3, #6).
>
> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly, and then we can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
>
> ------------------------------------------------------------------
> Sender: vino yang <yanghua1...@gmail.com>
> Sent at: 2018 Oct 11 (Thu) 09:45
> Recipient: Fabian Hueske <fhue...@gmail.com>
> Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> I appreciate this proposal, and like Fabian, I think it would be better if you could give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fhue...@gmail.com> wrote on Wed, 10 Oct 2018 at 5:27 PM:
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
> Can you go into the details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for the Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> On Tue, 9 Oct 2018 at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what is available in Flink to the offerings of competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive.
> Therefore, we have done some initial work in this direction, but a lot of effort is still needed.
>
> We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is an approach similar to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as a follow-up effort.
>
> I'm completely new to Flink (a short bio is given in [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
>
> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the communities are greatly welcome and appreciated.
>
> Regards,
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked, or is working, on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo, where those projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he promoted Hive on Spark for all of Uber's SQL-on-Hadoop workloads and significantly improved Uber's cluster efficiency.
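To make item #1 (and #2/#3) a bit more concrete, below is a minimal sketch of what metastore-backed catalog registration could look like from the Table API. It is purely illustrative: the HiveCatalog class, its package, its constructor arguments, and the hive-site.xml directory are assumptions about a connector the proposal only outlines, not an existing, agreed-upon Flink API.

    // Illustrative sketch only: HiveCatalog and its constructor are assumed here.
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog; // assumed package for the sketched connector

    public class HiveCatalogSketch {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inBatchMode().build());

            // Assumed connector class: points Flink at an existing Hive metastore
            // via the directory that contains hive-site.xml.
            HiveCatalog hiveCatalog = new HiveCatalog(
                    "myhive",           // catalog name inside Flink
                    "default",          // default Hive database
                    "/etc/hive/conf");  // directory containing hive-site.xml

            tableEnv.registerCatalog("myhive", hiveCatalog);
            tableEnv.useCatalog("myhive");

            // Tables created by Hive would then be directly queryable from Flink SQL.
            tableEnv.executeSql("SELECT COUNT(*) FROM some_hive_table").print();
        }
    }

The design point this sketch tries to capture is the one raised above: the Hive coupling stays inside a connector-style catalog implementation rather than inside Flink's core, in line with the connector-framework approach Xuefu describes.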