After taking a look at how other discussion threads work, I think it's actually fine to just keep our discussion here. It's up to you, Xuefu.
The google doc LGTM. I left some minor comments.

On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bowenl...@gmail.com> wrote:

> Hi all,
>
> As Xuefu has published the design doc on google, I agree with Shuyi's
> suggestion that we should probably start a new email thread like "[DISCUSS]
> ... Hive integration design ..." on the dev mailing list only, for community
> devs to review. The current thread goes to both the dev and user lists.
>
> This email thread was more about validating the general idea and direction
> with the community, and it has grown pretty long and crowded. Since
> everyone is in favor of the idea, we can move forward with another thread to
> discuss and finalize the design.
>
> Thanks,
> Bowen
>
> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
>> Hi Shuyi,
>>
>> Good idea. Actually, the PDF was converted from a google doc. Here is its
>> link:
>>
>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>
>> Once we reach an agreement, I can convert it to a FLIP.
>>
>> Thanks,
>> Xuefu
>>
>> ------------------------------------------------------------------
>> Sender: Shuyi Chen <suez1...@gmail.com>
>> Sent at: 2018 Nov 1 (Thu) 02:47
>> Recipient: Xuefu <xuef...@alibaba-inc.com>
>> Cc: vino yang <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> Thanks a lot for driving this big effort. I would suggest converting your
>> proposal and design doc into a google doc and sharing it on the dev mailing
>> list for the community to review and comment on, with a title like
>> "[DISCUSS] ... Hive integration design ...". Once approved, we can document
>> it as a FLIP (Flink Improvement Proposal) and use JIRAs to track the
>> implementation. What do you think?
>>
>> Shuyi
>>
>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>>
>> Hi all,
>>
>> I have also shared a design doc on Hive metastore integration that is
>> attached here and also to FLINK-10556 [1]. Please kindly review and share
>> your feedback.
>>
>> Thanks,
>> Xuefu
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>
>> ------------------------------------------------------------------
>> Sender: Xuefu <xuef...@alibaba-inc.com>
>> Sent at: 2018 Oct 25 (Thu) 01:08
>> Recipient: Xuefu <xuef...@alibaba-inc.com>; Shuyi Chen <suez1...@gmail.com>
>> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi all,
>>
>> To wrap up the discussion, I have attached a PDF describing the proposal,
>> which is also attached to FLINK-10556 [1]. Please feel free to watch that
>> JIRA to track the progress.
>>
>> Please also let me know if you have additional comments or questions.
>>
>> Thanks,
>> Xuefu
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>
>> ------------------------------------------------------------------
>> Sender: Xuefu <xuef...@alibaba-inc.com>
>> Sent at: 2018 Oct 16 (Tue) 03:40
>> Recipient: Shuyi Chen <suez1...@gmail.com>
>> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Shuyi,
>>
>> Thank you for your input. Yes, I agree with a phased approach and would
>> like to move forward fast. :) We did some work internally on DDL utilizing
>> the babel parser in Calcite. While babel makes Calcite's grammar extensible,
>> at first impression it still seems too cumbersome for a project when too
>> many extensions are made. It's even challenging to find where an extension
>> is needed!
It would certainly be better if Calcite could magically support
>> Hive QL just by turning on a flag, such as the one for MYSQL_5. I can also
>> see that this could mean a lot of work on the Calcite side. Nevertheless,
>> I will bring up the discussion over there and see what their community
>> thinks.
>>
>> Would you mind sharing more info about the proposal on DDL that you
>> mentioned? We can certainly collaborate on this.
>>
>> Thanks,
>> Xuefu
>>
>> ------------------------------------------------------------------
>> Sender: Shuyi Chen <suez1...@gmail.com>
>> Sent at: 2018 Oct 14 (Sun) 08:30
>> Recipient: Xuefu <xuef...@alibaba-inc.com>
>> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Welcome to the community and thanks for the great proposal, Xuefu! I
>> think the proposal can be divided into 2 stages: making Flink support
>> Hive features, and making Hive work with Flink. I agree with Timo on
>> starting with a smaller scope, so we can make progress faster. As for [6],
>> a proposal for DDL is already in progress, and will come after the unified
>> SQL connector API is done. For supporting Hive syntax, we might need to
>> work with the Calcite community, and a recent effort called babel
>> (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
>> help here.
>>
>> Thanks,
>> Shuyi
>>
>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>>
>> Hi Fabian/Vino,
>>
>> Thank you very much for your encouragement and inquiry. Sorry that I
>> didn't see Fabian's email until I read Vino's response just now. (Somehow
>> Fabian's went to the spam folder.)
>>
>> My proposal contains long-term and short-term goals. Nevertheless, the
>> effort will focus on the following areas, including Fabian's list:
>>
>> 1.
Hive metastore connectivity - This covers both read and write access,
>> which means Flink can make full use of Hive's metastore as its catalog (at
>> least for batch, but this can be extended to streaming as well).
>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
>> created by Hive can be understood by Flink, and vice versa.
>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>> consumed by Flink and vice versa.
>> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either
>> provides its own implementation or makes Hive's implementation work in
>> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
>> mechanism allowing users to import them into Flink without any code change
>> required.
>> 5. Data types - Flink SQL should support all data types that are
>> available in Hive.
>> 6. SQL language - Flink SQL should support the SQL standard (such as
>> SQL:2003) with extensions to support Hive's syntax and language features,
>> around DDL, DML, and SELECT queries.
>> 7. SQL CLI - this is currently under development in Flink, but more
>> effort is needed.
>> 8. Server - provide a server that's compatible with HiveServer2's thrift
>> APIs, such that HiveServer2 users can reuse their existing clients (such
>> as beeline) but connect to Flink's thrift server instead.
>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>> other applications to connect to its thrift server.
>> 10. Support for other user customizations in Hive, such as Hive SerDes,
>> storage handlers, etc.
>> 11. Better task failure tolerance and task scheduling in the Flink
>> runtime.
>>
>> As you can see, achieving all of this requires significant effort across
>> all layers of Flink. However, a short-term goal could include only core
>> areas (such as 1, 2, 4, 5, 6, 7) or start with a smaller scope (such as
>> #3, #6).
>>
>> Please share your further thoughts.
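Point 5 in the list above (data types) amounts to a translation table between Hive's type system and Flink SQL's. The sketch below is a hypothetical illustration of such a mapping: the type names are real Hive and Flink SQL type names, but the table and helper function are an example, not Flink's actual implementation.

```python
# Illustrative sketch only: a hypothetical mapping from Hive data type
# names to Flink SQL type names. Not Flink's actual code.
import re

# Scalar types with a direct one-to-one correspondence.
_SIMPLE = {
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "BOOLEAN": "BOOLEAN",
    "STRING": "STRING",
    "BINARY": "BYTES",      # Hive BINARY maps to Flink's variable-length BYTES
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}

def hive_to_flink_type(hive_type: str) -> str:
    """Translate a Hive type string into a Flink SQL type string."""
    t = hive_type.strip().upper()
    if t in _SIMPLE:
        return _SIMPLE[t]
    # Parameterized types keep their precision/length arguments.
    m = re.fullmatch(r"(DECIMAL|VARCHAR|CHAR)\((\d+(?:,\s*\d+)?)\)", t)
    if m:
        return f"{m.group(1)}({m.group(2)})"
    # Nested types are translated recursively, e.g. ARRAY<INT>.
    m = re.fullmatch(r"ARRAY<(.+)>", t)
    if m:
        return f"ARRAY<{hive_to_flink_type(m.group(1))}>"
    raise ValueError(f"unmapped Hive type: {hive_type}")
```

For example, `hive_to_flink_type("decimal(10,2)")` would return `"DECIMAL(10,2)"`. A real implementation would also have to cover MAP, STRUCT, precision limits, and the cases with no clean one-to-one counterpart, which is where most of the effort in point 5 lies.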
If we generally agree that this is
>> the right direction, I could come up with a formal proposal quickly, and
>> then we can follow up with broader discussions.
>>
>> Thanks,
>> Xuefu
>>
>> ------------------------------------------------------------------
>> Sender: vino yang <yanghua1...@gmail.com>
>> Sent at: 2018 Oct 11 (Thu) 09:45
>> Recipient: Fabian Hueske <fhue...@gmail.com>
>> Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <u...@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> I appreciate this proposal, and like Fabian said, it would look better if
>> you could give more details of the plan.
>>
>> Thanks, vino.
>>
>> Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
>>
>> Hi Xuefu,
>>
>> Welcome to the Flink community and thanks for starting this discussion!
>> Better Hive integration would be really great!
>> Can you go into the details of what you are proposing? I can think of a
>> couple of ways to improve Flink in that regard:
>>
>> * Support for Hive UDFs
>> * Support for Hive metadata catalog
>> * Support for HiveQL syntax
>> * ???
>>
>> Best, Fabian
>>
>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>>
>> Hi all,
>>
>> Along with the community's effort, inside Alibaba we have explored
>> Flink's potential as an execution engine not just for stream processing
>> but also for batch processing. We are encouraged by our findings and have
>> initiated an effort to make Flink's SQL capabilities full-fledged. When
>> comparing what's available in Flink to the offerings of competing data
>> processing engines, we identified a major gap in Flink: good integration
>> with the Hive ecosystem. This is crucial to the success of Flink SQL and
>> batch processing due to the well-established data ecosystem around Hive.
>> Therefore, we have done some initial work in this direction, but a lot of
>> effort is still needed.
>>
>> We have two strategies in mind. The first is to make Flink SQL
>> full-fledged and well integrated with the Hive ecosystem. This is a
>> similar approach to the one Spark SQL adopted. The second strategy is to
>> make Hive itself work with Flink, similar to the proposal in [1]. Each
>> approach has its pros and cons, but they don't need to be mutually
>> exclusive, with each targeting different users and use cases. We believe
>> that both will promote a much greater adoption of Flink beyond stream
>> processing.
>>
>> We have been focused on the first approach and would like to showcase
>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
>> planned to start strategy #2 as a follow-up effort.
>>
>> I'm completely new to Flink (with a short bio [2] below), though many of
>> my colleagues here at Alibaba are long-time contributors. Nevertheless,
>> I'd like to share our thoughts and invite your early feedback. At the same
>> time, I am working on a detailed proposal for Flink SQL's integration with
>> the Hive ecosystem, which will also be shared when ready.
>>
>> While the ideas are simple, each approach will demand significant effort,
>> more than what we can afford. Thus, input and contributions from the
>> community are greatly welcome and appreciated.
>>
>> Regards,
>> Xuefu
>>
>> References:
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran who has worked on many
>> projects under the Apache Foundation, of which he is also an honored
>> member. About 10 years ago he worked on the Hadoop team at Yahoo, where
>> the projects had just gotten started. Later he worked at Cloudera,
>> initiating and leading the development of the Hive on Spark project in the
>> community and across many organizations. Prior to joining Alibaba, he
>> worked at Uber, where he rolled out Hive on Spark to all of Uber's
>> SQL-on-Hadoop workload and significantly improved Uber's cluster
>> efficiency.
>>
>> --
>> "So you have to trust that the dots will somehow connect in your future."