On Oct 11, 2018 at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
Hi Fabian/Vino,
Thank you very much for your encouragement and inquiry. Sorry that I didn't see
Fabian's email until I read Vino's response just now. (Somehow Fabian's went to
the spam folder.)
My proposal contains long-term and short-term goals. Nevertheless, the effort
will focus on the following areas, including Fabian's list:
1. Hive metastore connectivity - This covers both read and write access, which
means Flink can make full use of Hive's metastore as its catalog (at least for
batch, though this can be extended to streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
created by Hive can be understood by Flink, and vice versa.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by
Flink and vice versa.
4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its
own implementation or makes Hive's implementation work in Flink. Further, for
user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users
to import them into Flink without any code change required.
5. Data types - Flink SQL should support all data types that are available in
Hive.
6. SQL Language - Flink SQL should support the SQL standard (such as SQL:2003)
with extensions to support Hive's syntax and language features, around DDL,
DML, and SELECT queries.
7. SQL CLI - This is currently under development in Flink, but more effort is
needed.
8. Server - Provide a server that's compatible with Hive's HiveServer2 in its
Thrift APIs, such that HiveServer2 users can reuse their existing clients (such
as beeline) but connect to Flink's Thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other
applications to use to connect to its Thrift server.
10. Support for other user customizations in Hive, such as Hive SerDes, storage
handlers, etc.
11. Better task failure tolerance and task scheduling in the Flink runtime.
As you can see, achieving all of these requires significant effort across all
layers of Flink. However, a short-term goal could include only core areas
(such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
Please share your further thoughts. If we generally agree that this is the
right direction, I could come up with a formal proposal quickly and then we can
follow up with broader discussions.
Thanks,
Xuefu
------------------------------------------------------------------
Sender: vino yang <yanghua1...@gmail.com>
Sent at: 2018 Oct 11 (Thu) 09:45
Recipient: Fabian Hueske <fhue...@gmail.com>
Cc: dev <d...@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user
<user@flink.apache.org>
Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
Hi Xuefu,
I appreciate this proposal and, like Fabian, think it would be better if you
could give more details of the plan.
Thanks, vino.
On Wed, Oct 10, 2018 at 5:27 PM, Fabian Hueske <fhue...@gmail.com> wrote:
Hi Xuefu,
Welcome to the Flink community and thanks for starting this discussion! Better
Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways
to improve Flink in that regard:
* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???
Best, Fabian
On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu
<xuef...@alibaba-inc.com> wrote:
Hi all,
Along with the community's effort, inside Alibaba we have explored Flink's
potential as an execution engine not just for stream processing but also for
batch processing. We are encouraged by our findings and have initiated our
effort to make Flink's SQL capabilities full-fledged. When comparing what's
available in Flink to the offerings from competing data processing engines,
we identified a major gap in Flink: good integration with the Hive ecosystem.
This is crucial to the success of Flink SQL and batch due to the
well-established data ecosystem around Hive. Therefore, we have done some
initial work in this direction, but there is still a lot of effort needed.
We have two strategies in mind. The first one is to make Flink SQL full-fledged
and well-integrated with the Hive ecosystem. This is similar to the approach
Spark SQL adopted. The second strategy is to make Hive itself work with Flink,
similar to the proposal in [1]. Each approach has its pros and cons, but they
need not be mutually exclusive, since each targets different users and use
cases. We believe that both will promote much greater adoption of Flink
beyond stream processing.
We have been focused on the first approach and would like to showcase Flink's
batch and SQL capabilities with Flink SQL. However, we also plan to
start strategy #2 as a follow-up effort.
I'm completely new to Flink (a short bio [2] is below), though many of my
colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like
to share our thoughts and invite your early feedback. At the same time, I am
working on a detailed proposal on Flink SQL's integration with the Hive
ecosystem, which will also be shared when ready.
While the ideas are simple, each approach will demand significant effort, more
than what we can afford. Thus, input and contributions from the communities
are very welcome and appreciated.
Regards,
Xuefu
References:
[1] https://issues.apache.org/jira/browse/HIVE-10712
[2] Xuefu Zhang is a long-time open source veteran who has worked or is working
on many projects under the Apache Software Foundation, of which he is also an
honored member. About 10 years ago he worked in the Hadoop team at Yahoo, where
the projects had just gotten started. Later he worked at Cloudera, initiating
and leading the development of the Hive on Spark project in the communities and
across many organizations. Prior to joining Alibaba, he worked at Uber, where
he promoted Hive on Spark to all of Uber's SQL-on-Hadoop workloads and
significantly improved Uber's cluster efficiency.