Thank you, very nice. I fully agree with that.
> On 11 Oct 2018, at 19:31, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
> Hi Jörn,
>
> Thanks for your feedback. Yes, I think Hive on Flink makes sense, and in fact it is one of the two approaches that I named at the beginning of the thread. As also pointed out there, this isn't mutually exclusive from the work we proposed inside Flink; the two target different user groups and use cases. Further, what we proposed to do in Flink should be a good showcase that demonstrates Flink's capabilities in batch processing and convinces the Hive community of the worth of a new engine. As you might know, the idea encountered some doubt and resistance. Nevertheless, we do have a solid plan for Hive on Flink, which we will execute once Flink SQL is in good shape.
>
> I also agree with you that Flink SQL shouldn't be closely coupled with Hive. While we mentioned Hive in many of the proposed items, most of them are coupled only in concepts and functionality rather than in code or libraries. We are taking advantage of the connector framework in Flink. The only possible exception is support for Hive built-in UDFs, which we may not make work out of the box in order to avoid the coupling. We could, for example, require users to bring in the Hive library and register the UDFs themselves. This is subject to further discussion.
>
> #11 is about Flink runtime enhancements meant to make task failures more tolerable (so that the job doesn't have to start from the beginning in case of a task failure) and to make task scheduling more resource-efficient. Flink's current design in those two aspects leans more toward stream processing, which may not be good enough for batch processing. We will provide a more detailed design when we get to them.
>
> Please let me know if you have further thoughts or feedback.
>
> Thanks,
> Xuefu
>
>
> ------------------------------------------------------------------
> Sender: Jörn Franke <jornfra...@gmail.com>
> Sent at: 2018 Oct 11 (Thu) 13:54
> Recipient: Xuefu <xuef...@alibaba-inc.com>
> Cc: vino yang <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Would it maybe make sense to provide Flink as an engine on Hive ("Flink-on-Hive"), e.g. to address 4, 5, 6, 8, 9, 10? This could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
> 1, 2, 3 could be achieved via a connector based on the Flink Table API.
> This is just a proposal to start this endeavour as independent projects (Hive engine, connector) to avoid too tight a coupling with Flink. Maybe in a more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.
>
> What is meant by 11?
>
> On 11 Oct 2018, at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
> Hi Fabian/Vino,
>
> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>
> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse direction holds as well.
> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without requiring any code change.
> 5. Data types - Flink SQL should support all data types that are available in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
> 7. SQL CLI - This is currently being developed in Flink, but more effort is needed.
> 8. Server - Provide a server that is compatible with Hive's HiveServer2 Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to connect to its Thrift server.
> 10. Support other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>
> As you can see, achieving all of this requires significant effort across all layers of Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start with a smaller scope (such as #3, #6).
>
> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly, and then we can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
>
> ------------------------------------------------------------------
> Sender: vino yang <yanghua1...@gmail.com>
> Sent at: 2018 Oct 11 (Thu) 09:45
> Recipient: Fabian Hueske <fhue...@gmail.com>
> Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> I appreciate this proposal, and like Fabian, I think it would be better if you could give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fhue...@gmail.com> wrote on Wed, 10 Oct 2018 at 5:27 PM:
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
> Can you go into the details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for the Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> On Tue, 9 Oct 2018 at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what is available in Flink to the offerings of competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive.
> Therefore, we have done some initial work in this direction, but a lot of effort is still needed.
>
> We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is an approach similar to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as a follow-up effort.
>
> I'm completely new to Flink (a short bio is given in [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
>
> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the communities are greatly welcome and appreciated.
>
> Regards,
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked, or is working, on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo, where those projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he promoted Hive on Spark for all of Uber's SQL-on-Hadoop workloads and significantly improved Uber's cluster efficiency.
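To make item #1 (and #2/#3) a bit more concrete, below is a minimal sketch of what metastore-backed catalog registration could look like from the Table API. It is purely illustrative: the HiveCatalog class, its package, its constructor arguments, and the hive-site.xml directory are assumptions about a connector the proposal only outlines, not an existing, agreed-upon Flink API.

    // Illustrative sketch only: HiveCatalog and its constructor are assumed here.
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog; // assumed package for the sketched connector

    public class HiveCatalogSketch {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inBatchMode().build());

            // Assumed connector class: points Flink at an existing Hive metastore
            // via the directory that contains hive-site.xml.
            HiveCatalog hiveCatalog = new HiveCatalog(
                    "myhive",           // catalog name inside Flink
                    "default",          // default Hive database
                    "/etc/hive/conf");  // directory containing hive-site.xml

            tableEnv.registerCatalog("myhive", hiveCatalog);
            tableEnv.useCatalog("myhive");

            // Tables created by Hive would then be directly queryable from Flink SQL.
            tableEnv.executeSql("SELECT COUNT(*) FROM some_hive_table").print();
        }
    }

The design point this sketch tries to capture is the one raised above: the Hive coupling stays inside a connector-style catalog implementation rather than inside Flink's core, in line with the connector-framework approach Xuefu describes.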