> Thanks for your feedback. Yes, I think Hive on Flink makes sense and in fact
> it is one of the two approaches that I named at the beginning of the thread.
> As also pointed out there, this isn't mutually exclusive with the work we
> proposed inside Flink; the two target different user groups and use
> cases. Further, what we proposed to do in Flink should be a good showcase
> that demonstrates Flink's capabilities in batch processing and convinces the
> Hive community of the worth of a new engine. As you might know, the idea
> encountered some doubt and resistance. Nevertheless, we do have a solid plan
> for Hive on Flink, which we will execute once Flink SQL is in good shape.
> I also agree with you that Flink SQL shouldn't be closely coupled with Hive.
> While we mentioned Hive in many of the proposed items, most of them are
> coupled only in concepts and functionality rather than code or libraries. We
> are taking advantage of the connector framework in Flink. The only thing
> that might be an exception is support for Hive built-in UDFs, which we may
> choose not to make work out of the box, to avoid the coupling. We could, for
> example, require users to bring in the Hive library and register the
> functions themselves. This is subject to further discussion.
> #11 is about a Flink runtime enhancement that is meant to make task failures
> more tolerable (so that the job doesn't have to start from the beginning in
> case of task failures) and to make task scheduling more resource-efficient.
> Flink's current design in those two aspects leans more toward stream
> processing, which may not be good enough for batch processing. We will
> provide a more detailed design when we get to them.
> Please let me know if you have further thoughts or feedback.
> Would it maybe make sense to provide Flink as an engine on Hive
> ("Flink-on-Hive")? E.g., to address 4, 5, 6, 8, 9, 10. This could be more
> loosely coupled than integrating Hive into all possible Flink core modules
> and thus introducing a very tight dependency on Hive in the core.
> 1, 2, 3 could be achieved via a connector based on the Flink Table API.
> This is just a proposal to start this endeavour as independent projects (Hive
> engine, connector) to avoid too tight a coupling with Flink. Maybe in the
> more distant future, if the Hive integration is in heavy demand, one could
> then integrate it more tightly if needed.
> What is meant by 11?
> My proposal contains long-term and short-term goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
> 1. Hive metastore connectivity - This covers both read and write access,
> which means Flink can make full use of Hive's metastore as its catalog (at
> least for batch, though this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
> created by Hive can be understood by Flink and vice versa.
> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed
> by Flink and vice versa.
> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides
> its own implementation or makes Hive's implementation work in Flink. Further,
> for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing
> users to import them into Flink without any code change.
> 5. Data types - Flink SQL should support all data types that are available
> in Hive.
> 6. SQL Language - Flink SQL should support the SQL standard (such as
> SQL:2003) with extensions to support Hive's syntax and language features,
> around DDL, DML, and SELECT queries.
> 7. SQL CLI - This is currently under development in Flink, but more effort
> is needed.
> 8. Server - Provide a server that's compatible with Hive's HiveServer2 in its
> Thrift APIs, such that HiveServer2 users can reuse their existing clients
> (such as beeline) but connect to Flink's Thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other
> applications to use to connect to its Thrift server.
> 10. Support other user customizations in Hive, such as Hive SerDes, storage
> handlers, etc.
> 11. Better task failure tolerance and task scheduling at Flink runtime.
> As you can see, achieving all of those requires significant effort across
> all layers in Flink. However, a short-term goal could include only core
> areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3,
> #6).
> Please share your further thoughts. If we generally agree that this is the 
> right direction, I could come up with a formal proposal quickly and then we 
> can follow up with broader discussions.
> Can you go into the details of what you are proposing? I can think of a
> couple of ways to improve Flink in that regard:
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
> Hi all,
> Along with the community's effort, inside Alibaba we have explored Flink's
> potential as an execution engine not just for stream processing but also for
> batch processing. We are encouraged by our findings and have initiated our
> effort to make Flink's SQL capabilities full-fledged. When comparing what's
> available in Flink to the offerings from competing data processing engines,
> we identified a major gap in Flink: good integration with the Hive ecosystem.
> This is crucial to the success of Flink SQL and batch due to the
> well-established data ecosystem around Hive. Therefore, we have done some
> initial work in this direction, but a lot of effort is still needed.
> We have two strategies in mind. The first one is to make Flink SQL
> full-fledged and well-integrated with the Hive ecosystem. This is similar
> to the approach that Spark SQL adopted. The second strategy is to make Hive
> itself work with Flink, similar to the proposal in [1]. Each approach has
> its pros and cons, but they don't need to be mutually exclusive, with each
> targeting different users and use cases. We believe that both will promote
> a much greater adoption of Flink beyond stream processing.
> We have been focused on the first approach and would like to showcase Flink's
> batch and SQL capabilities with Flink SQL. However, we also plan to
> start strategy #2 as a follow-up effort.
> I'm completely new to Flink (a short bio [2] is below), though many of my
> colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like
> to share our thoughts and invite your early feedback. At the same time, I am
> working on a detailed proposal on Flink SQL's integration with the Hive
> ecosystem, which will also be shared when ready.
> While the ideas are simple, each approach will demand significant effort,
> more than what we can afford. Thus, input and contributions from the
> communities are very welcome and appreciated.
> [1]
> [2] Xuefu Zhang is a long-time open source veteran who has worked or is
> working on many projects under the Apache Foundation, of which he is also an
> honored member. About 10 years ago he worked in the Hadoop team at Yahoo,
> where the projects had just gotten started. Later he worked at Cloudera,
> initiating and leading the development of the Hive on Spark project in the
> communities and across many organizations. Prior to joining Alibaba, he
> worked at Uber, where he promoted Hive on Spark to all of Uber's SQL on
> Hadoop workloads and significantly improved Uber's cluster efficiency.

