Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Zhang, Xuefu Mon, 15 Oct 2018 12:11:00 -0700

Hi Bowen,

Thank you for your feedback and interest in the project. Your contribution is 
certainly welcome. Per your suggestion, I have created an umbrella JIRA 
(https://issues.apache.org/jira/browse/FLINK-10556) to track our overall effort 
on this. For each subtask, we'd like to see a short description of the status 
quo and of what we plan to add or change. A design doc should be provided when 
it's deemed necessary.


I'm looking forward to seeing your contributions!

Thanks,
Xuefu





------------------------------------------------------------------
Sender:Bowen <bowenl...@gmail.com>
Sent at:2018 Oct 13 (Sat) 21:55
Recipient:Xuefu <xuef...@alibaba-inc.com>; Fabian Hueske <fhue...@gmail.com>
Cc:dev <dev@flink.apache.org>; user <u...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem


Thank you Xuefu, for bringing up this awesome, detailed proposal! It will 
resolve lots of existing pain for users like me.

In general, I totally agree that improving FlinkSQL's completeness would be a 
much better starting point than building 'Hive on Flink', as the Hive community 
is concerned about Flink's SQL incompleteness and lack of proven batch 
performance, as shown in https://issues.apache.org/jira/browse/HIVE-10712. 
Improving FlinkSQL seems a more natural direction to start with in order to 
achieve the integration.

Xuefu and Timo have laid out quite a clear path of what to tackle next. Given 
that some efforts are already under way for items 1, 2, 5, 3, 4, and 6 in 
Xuefu's list, shall we:

1. Identify gaps between a) Xuefu's proposal and the discussion results in this 
thread, and b) all the ongoing work and discussions?
2. Then create some new top-level JIRA tickets to keep track of them and to 
start more detailed discussions?

It's going to be a great and influential project, and I'd love to participate 
in it to move FlinkSQL's adoption and ecosystem even further.

Thanks,
Bowen


On Oct 12, 2018, at 3:37 PM, Jörn Franke <jornfra...@gmail.com> wrote:


Thank you, very nice. I fully agree with that.

On 11.10.2018 at 19:31, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:

Hi Jörn,

Thanks for your feedback. Yes, I think Hive on Flink makes sense, and in fact 
it is one of the two approaches that I named at the beginning of the thread. As 
also pointed out there, it isn't mutually exclusive with the work we proposed 
inside Flink; the two target different user groups and use cases. Further, 
what we proposed to do in Flink should be a good showcase that demonstrates 
Flink's capabilities in batch processing and convinces the Hive community of 
the worth of a new engine. As you might know, the idea encountered some doubt 
and resistance. Nevertheless, we do have a solid plan for Hive on Flink, which 
we will execute once Flink SQL is in good shape.

I also agree with you that Flink SQL shouldn't be closely coupled with Hive. 
While we mentioned Hive in many of the proposed items, most of them are coupled 
only in concepts and functionality rather than in code or libraries. We are 
taking advantage of the connector framework in Flink. The only possible 
exception is support for Hive's built-in UDFs, which we may choose not to make 
work out of the box in order to avoid the coupling. We could, for example, 
require users to bring in the Hive library and register the UDFs themselves. 
This is subject to further discussion.
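
The opt-in idea can be pictured with a toy function registry. This is only a 
sketch under stated assumptions: FunctionRegistry, its methods, and 
hive_style_initcap are hypothetical names invented for illustration, and none 
of this is Flink's or Hive's actual API. The point is simply that built-ins are 
always present, while anything from an external library becomes callable only 
after the user registers it.

```python
class FunctionRegistry:
    """Toy function catalog: built-ins are always present, others are opt-in."""

    def __init__(self):
        # Hypothetical engine built-ins, available out of the box.
        self._functions = {"upper": str.upper}

    def register(self, name, func):
        """Explicitly register a user-supplied function under `name`."""
        key = name.lower()
        if key in self._functions:
            raise ValueError(f"function already registered: {name}")
        self._functions[key] = func

    def call(self, name, *args):
        """Look up a function by case-insensitive name and invoke it."""
        key = name.lower()
        if key not in self._functions:
            raise KeyError(f"unknown function: {name}")
        return self._functions[key](*args)


def hive_style_initcap(text):
    """Stand-in for a function that would come from a user-provided library."""
    return text.title()


registry = FunctionRegistry()
# The engine does not ship this function; the user opts in explicitly.
registry.register("initcap", hive_style_initcap)
```

With this shape, the engine core carries no dependency on the external 
library; a call to an unregistered name fails with a clear error instead.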

#11 is about Flink runtime enhancements that are meant to make task failures 
more tolerable (so that the job doesn't have to start from the beginning in 
case of task failures) and to make task scheduling more resource-efficient. 
Flink's current design in those two aspects leans more toward stream 
processing, which may not be good enough for batch processing. We will provide 
a more detailed design when we get to them.
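
As a rough illustration of why this matters for batch jobs, the toy model below 
counts task executions when one task fails once, under two recovery strategies: 
restarting the whole job versus rerunning only the failed task. It is a 
simplified sketch, not Flink's scheduler. The function names are hypothetical, 
tasks are assumed to run one after another, and the second strategy assumes the 
outputs of finished tasks are kept and can be reused.

```python
def executions_with_full_restart(num_tasks, failing_task):
    """Count task executions when any failure restarts the whole job.

    Tasks run in order 0..num_tasks-1; `failing_task` fails on its first
    attempt only and succeeds on the second.
    """
    if not 0 <= failing_task < num_tasks:
        raise ValueError("failing_task must be a valid task index")
    # First attempt: tasks 0..failing_task run, then the job aborts.
    first_attempt = failing_task + 1
    # Second attempt: the whole job reruns from the beginning.
    second_attempt = num_tasks
    return first_attempt + second_attempt


def executions_with_task_retry(num_tasks, failing_task):
    """Count task executions when only the failed task is rerun.

    Assumes outputs of already finished tasks remain available.
    """
    if not 0 <= failing_task < num_tasks:
        raise ValueError("failing_task must be a valid task index")
    # Every task runs once, plus one extra attempt for the failed task.
    return num_tasks + 1
```

For example, with 100 tasks where task 90 fails once, the full restart costs 
191 task executions while the per-task retry costs 101. The later the failure, 
the more work a full restart throws away.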

Please let me know if you have further thoughts or feedback.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Jörn Franke <jornfra...@gmail.com>
Sent at:2018 Oct 11 (Thu) 13:54
Recipient:Xuefu <xuef...@alibaba-inc.com>
Cc:vino yang <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev 
<dev@flink.apache.org>; user <u...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Would it maybe make sense to provide Flink as an engine on Hive 
("flink-on-Hive")? E.g., to address 4, 5, 6, 8, 9, 10. This could be more 
loosely coupled than integrating Hive into all possible Flink core modules and 
thus introducing a very tight dependency on Hive in the core.
1, 2, 3 could be achieved via a connector based on the Flink Table API.
This is just a proposal to start this endeavour as independent projects (Hive 
engine, connector) to avoid too tight a coupling with Flink. Maybe in the more 
distant future, if the Hive integration is heavily demanded, one could then 
integrate it more tightly if needed.

What is meant by 11?
On 11.10.2018 at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:

Hi Fabian/Vino,

Thank you very much for your encouragement and inquiry. Sorry that I didn't see 
Fabian's email until I read Vino's response just now. (Somehow Fabian's went to 
the spam folder.)

My proposal contains long-term and short-term goals. Nevertheless, the effort 
will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read and write access, which 
means Flink can make full use of Hive's metastore as its catalog (at least for 
batch, though this can be extended to streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc.) 
created by Hive can be understood by Flink, and the reverse is true as well.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by 
Flink and vice versa.
4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its 
own implementation or makes Hive's implementation work in Flink. Further, for 
user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users 
to import them into Flink without any code change required.
5. Data types - Flink SQL should support all data types that are available in 
Hive.
6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) 
with extensions to support Hive's syntax and language features around DDL, DML, 
and SELECT queries.
7. SQL CLI - This is currently under development in Flink, but more effort is 
needed.
8. Server - Provide a server that's compatible with Hive's HiveServer2 in its 
Thrift APIs, such that HiveServer2 users can reuse their existing clients (such 
as Beeline) but connect to Flink's Thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other 
applications to use to connect to its Thrift server.
10. Support other user customizations in Hive, such as Hive SerDes, storage 
handlers, etc.
11. Better task failure tolerance and task scheduling in the Flink runtime.

As you can see, achieving all of those requires significant effort across all 
layers in Flink. However, a short-term goal could include only core areas 
(such as 1, 2, 4, 5, 6, 7) or start with a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the 
right direction, I could come up with a formal proposal quickly and then we can 
follow up with broader discussions.

Thanks,
Xuefu



------------------------------------------------------------------
Sender:vino yang <yanghua1...@gmail.com>
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske <fhue...@gmail.com>
Cc:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user 
<u...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

I appreciate this proposal and, like Fabian, think it would be better if you 
could give more details of the plan.

Thanks, vino.
On Wed, Oct 10, 2018 at 5:27 PM, Fabian Hueske <fhue...@gmail.com> wrote:
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better 
Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways 
to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
Hi all,

 Along with the community's effort, inside Alibaba we have explored Flink's 
potential as an execution engine not just for stream processing but also for 
batch processing. We are encouraged by our findings and have initiated our 
effort to make Flink's SQL capabilities full-fledged. When comparing what's 
available in Flink to the offerings from competing data processing engines, 
we identified a major gap in Flink: good integration with the Hive ecosystem. 
This is crucial to the success of Flink SQL and batch due to the 
well-established data ecosystem around Hive. Therefore, we have done some 
initial work in this direction, but a lot of effort is still needed.

 We have two strategies in mind. The first one is to make Flink SQL 
full-fledged and well-integrated with the Hive ecosystem. This is similar to 
the approach that Spark SQL adopted. The second strategy is to make Hive itself 
work with Flink, similar to the proposal in [1]. Each approach has its pros 
and cons, but they don't need to be mutually exclusive, as each targets 
different users and use cases. We believe that both will promote a much greater 
adoption of Flink beyond stream processing.

 We have been focused on the first approach and would like to showcase Flink's 
batch and SQL capabilities with Flink SQL. However, we also plan to start 
strategy #2 as a follow-up effort.

 I'm completely new to Flink (a short bio [2] is below), though many of my 
colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like 
to share our thoughts and invite your early feedback. At the same time, I am 
working on a detailed proposal on Flink SQL's integration with the Hive 
ecosystem, which will also be shared when ready.

 While the ideas are simple, each approach will demand significant effort, more 
than what we can afford. Thus, the input and contributions from the communities 
are greatly welcome and appreciated.

 Regards,


 Xuefu

 References:

 [1] https://issues.apache.org/jira/browse/HIVE-10712
 [2] Xuefu Zhang is a long-time open source veteran who has worked or is 
working on many projects under the Apache Foundation, of which he is also an 
honored member. About 10 years ago he worked in the Hadoop team at Yahoo, where 
the projects had just gotten started. Later he worked at Cloudera, initiating 
and leading the development of the Hive on Spark project in the communities and 
across many organizations. Prior to joining Alibaba, he worked at Uber, where 
he promoted Hive on Spark for all of Uber's SQL-on-Hadoop workloads and 
significantly improved Uber's cluster efficiency.
