Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Bowen Li Tue, 13 Nov 2018 14:45:01 -0800

Hi Xuefu,

Currently the new design doc
<https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
is in "view only" mode, and people cannot leave comments. Can you please
change it to "can comment" or "can edit" mode?


Thanks, Bowen


On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <[email protected]>
wrote:

> Hi Piotr
>
> I have extracted the API portion of the design and the google doc is here
> <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
> Please review and provide your feedback.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Xuefu <[email protected]>
> Sent at:2018 Nov 12 (Mon) 12:43
> Recipient:Piotr Nowojski <[email protected]>; dev <[email protected]>
> Cc:Bowen Li <[email protected]>; Shuyi Chen <[email protected]>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Piotr,
>
> That sounds good to me. Let's close all the open questions (there are a
> couple of them) in the Google doc and I should be able to quickly split
> it into the three proposals as you suggested.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Piotr Nowojski <[email protected]>
> Sent at:2018 Nov 9 (Fri) 22:46
> Recipient:dev <[email protected]>; Xuefu <[email protected]>
> Cc:Bowen Li <[email protected]>; Shuyi Chen <[email protected]>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi,
>
>
> Yes, it seems like the best solution. Maybe someone else can also suggest whether
> we can split it further? Maybe changes in the interface in one doc, reading
> from Hive Meta Store in another, and finally storing our meta information in
> Hive Meta Store in a third?
>
> Piotrek
>
> > On 9 Nov 2018, at 01:44, Zhang, Xuefu <[email protected]> wrote:
> >
> > Hi Piotr,
> >
> > That seems to be a good idea!
> >
> > Since the google doc for the design is currently under extensive review, I
> > will leave it as it is for now. However, I'll convert it to two different
> > FLIPs when the time comes.
> >
> > How does it sound to you?
> >
> > Thanks,
> > Xuefu
> >
> >
> > ------------------------------------------------------------------
> > Sender:Piotr Nowojski <[email protected]>
> > Sent at:2018 Nov 9 (Fri) 02:31
> > Recipient:dev <[email protected]>
> > Cc:Bowen Li <[email protected]>; Xuefu <[email protected]>; Shuyi Chen <[email protected]>
> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >
> > Hi,
> >
> > Maybe we should split this topic (and the design doc) into a couple of
> > smaller ones, hopefully independent. The questions that you have asked
> > Fabian have for example very little to do with reading metadata from Hive
> > Meta Store?
> >
> > Piotrek
> >
> >> On 7 Nov 2018, at 14:27, Fabian Hueske <[email protected]> wrote:
> >>
> >> Hi Xuefu and all,
> >>
> >> Thanks for sharing this design document!
> >> I'm very much in favor of restructuring / reworking the catalog handling in
> >> Flink SQL as outlined in the document.
> >> Most changes described in the design document seem to be rather general and
> >> not specifically related to the Hive integration.
> >>
> >> IMO, there are some aspects, especially those at the boundary of Hive and
> >> Flink, that need a bit more discussion. For example:
> >>
> >> * What does it take to make Flink schema compatible with Hive schema?
> >> * How will Flink tables (descriptors) be stored in HMS?
> >> * How do both Hive catalogs differ? Could they be integrated into a
> >> single one? When to use which one?
> >> * What meta information is provided by HMS? What of this can be leveraged
> >> by Flink?
> >>
> >> Thank you,
> >> Fabian
> >>
> >> On Fri, Nov 2, 2018 at 00:31, Bowen Li <[email protected]> wrote:
> >>
> >>> After taking a look at how other discussion threads work, I think it's
> >>> actually fine to just keep our discussion here. It's up to you, Xuefu.
> >>>
> >>> The google doc LGTM. I left some minor comments.
> >>>
> >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> As Xuefu has published the design doc on google, I agree with Shuyi's
> >>>> suggestion that we probably should start a new email thread like
> >>>> "[DISCUSS] ... Hive integration design ..." on only the dev mailing list
> >>>> for community devs to review. The current thread is sent to both the dev
> >>>> and user lists.
> >>>>
> >>>> This email thread is more like validating the general idea and direction
> >>>> with the community, and it's been pretty long and crowded so far. Since
> >>>> everyone is in favor of the idea, we can move forward with another thread to
> >>>> discuss and finalize the design.
> >>>>
> >>>> Thanks,
> >>>> Bowen
> >>>>
> >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <[email protected]> wrote:
> >>>>
> >>>>> Hi Shuyi,
> >>>>>
> >>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
> >>>>> link:
> >>>>>
> >>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
> >>>>> Once we reach an agreement, I can convert it to a FLIP.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Shuyi Chen <[email protected]>
> >>>>> Sent at:2018 Nov 1 (Thu) 02:47
> >>>>> Recipient:Xuefu <[email protected]>
> >>>>> Cc:vino yang <[email protected]>; Fabian Hueske <[email protected]>;
> >>>>> dev <[email protected]>; user <[email protected]>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Xuefu,
> >>>>>
> >>>>> Thanks a lot for driving this big effort. I would suggest converting your
> >>>>> proposal and design doc into a google doc, and sharing it on the dev
> >>>>> mailing list for the community to review and comment, with a title like
> >>>>> "[DISCUSS] ... Hive integration design ...". Once approved, we can
> >>>>> document it as a FLIP (Flink Improvement Proposal), and use JIRAs to
> >>>>> track the implementations.
> >>>>> What do you think?
> >>>>>
> >>>>> Shuyi
> >>>>>
> >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <[email protected]> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I have also shared a design doc on Hive metastore integration that is
> >>>>> attached here and also to FLINK-10556 [1]. Please kindly review and share
> >>>>> your feedback.
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Xuefu <[email protected]>
> >>>>> Sent at:2018 Oct 25 (Thu) 01:08
> >>>>> Recipient:Xuefu <[email protected]>; Shuyi Chen <[email protected]>
> >>>>> Cc:yanghua1127 <[email protected]>; Fabian Hueske <[email protected]>;
> >>>>> dev <[email protected]>; user <[email protected]>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> To wrap up the discussion, I have attached a PDF describing the
> >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
> >>>>> watch that JIRA to track the progress.
> >>>>>
> >>>>> Please also let me know if you have additional comments or questions.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Xuefu <[email protected]>
> >>>>> Sent at:2018 Oct 16 (Tue) 03:40
> >>>>> Recipient:Shuyi Chen <[email protected]>
> >>>>> Cc:yanghua1127 <[email protected]>; Fabian Hueske <[email protected]>;
> >>>>> dev <[email protected]>; user <[email protected]>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Shuyi,
> >>>>>
> >>>>> Thank you for your input. Yes, I agree with a phased approach and would
> >>>>> like to move forward fast. :) We did some work internally on DDL using the
> >>>>> babel parser in Calcite. While babel makes Calcite's grammar extensible, at
> >>>>> first impression it still seems too cumbersome for a project when too
> >>>>> many extensions are made. It's even challenging to find where the
> >>>>> extension is needed! It would certainly be better if Calcite could
> >>>>> magically support Hive QL by just turning on a flag, such as that for
> >>>>> MYSQL_5. I can also see that this could mean a lot of work on Calcite.
> >>>>> Nevertheless, I will bring up the discussion over there and see what
> >>>>> their community thinks.
> >>>>>
> >>>>> Would you mind sharing more info about the proposal on DDL that you
> >>>>> mentioned? We can certainly collaborate on this.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Shuyi Chen <[email protected]>
> >>>>> Sent at:2018 Oct 14 (Sun) 08:30
> >>>>> Recipient:Xuefu <[email protected]>
> >>>>> Cc:yanghua1127 <[email protected]>; Fabian Hueske <[email protected]>;
> >>>>> dev <[email protected]>; user <[email protected]>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
> >>>>> think the proposal can be divided into 2 stages: making Flink support
> >>>>> Hive features, and making Hive work with Flink. I agree with Timo on
> >>>>> starting with a smaller scope, so we can make progress faster. As for
> >>>>> [6], a proposal for DDL is already in progress, and will come after the
> >>>>> unified SQL connector API is done. For supporting Hive syntax, we might
> >>>>> need to work with the Calcite community, and a recent effort called babel
> >>>>> (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
> >>>>> help here.
> >>>>>
> >>>>> Thanks
> >>>>> Shuyi
> >>>>>
> >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <[email protected]> wrote:
> >>>>> Hi Fabian/Vino,
> >>>>>
> >>>>> Thank you very much for your encouragement and inquiry. Sorry that I didn't
> >>>>> see Fabian's email until I read Vino's response just now. (Somehow
> >>>>> Fabian's went to the spam folder.)
> >>>>>
> >>>>> My proposal contains long-term and short-term goals. Nevertheless, the
> >>>>> effort will focus on the following areas, including Fabian's list:
> >>>>>
> >>>>> 1. Hive metastore connectivity - This covers both read/write access,
> >>>>> which means Flink can make full use of Hive's metastore as its catalog
> >>>>> (at least for batch, but it can extend to streaming as well).
> >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
> >>>>> created by Hive can be understood by Flink, and the reverse direction is
> >>>>> true also.
> >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
> >>>>> consumed by Flink and vice versa.
> >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides
> >>>>> its own implementation or makes Hive's implementation work in Flink.
> >>>>> Further, for user-created UDFs in Hive, Flink SQL should provide a
> >>>>> mechanism allowing users to import them into Flink without any code
> >>>>> change required.
> >>>>> 5. Data types - Flink SQL should support all data types that are
> >>>>> available in Hive.
> >>>>> 6. SQL Language - Flink SQL should support the SQL standard (such as
> >>>>> SQL2003) with extensions to support Hive's syntax and language features,
> >>>>> around DDL, DML, and SELECT queries.
> >>>>> 7. SQL CLI - This is currently under development in Flink, but more effort
> >>>>> is needed.
> >>>>> 8. Server - Provide a server that's compatible with Hive's HiveServer2
> >>>>> in thrift APIs, such that HiveServer2 users can reuse their existing
> >>>>> clients (such as beeline) but connect to Flink's thrift server instead.
> >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> >>>>> other applications to use to connect to its thrift server.
> >>>>> 10. Support other user customizations in Hive, such as Hive Serdes,
> >>>>> storage handlers, etc.
> >>>>> 11. Better task failure tolerance and task scheduling at Flink runtime.
> >>>>>
> >>>>> As you can see, achieving all of those requires significant effort
> >>>>> across all layers in Flink. However, a short-term goal could include
> >>>>> only core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope
> >>>>> (such as #3, #6).
> >>>>>
> >>>>> Please share your further thoughts. If we generally agree that this is
> >>>>> the right direction, I could come up with a formal proposal quickly and
> >>>>> then we can follow up with broader discussions.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:vino yang <[email protected]>
> >>>>> Sent at:2018 Oct 11 (Thu) 09:45
> >>>>> Recipient:Fabian Hueske <[email protected]>
> >>>>> Cc:dev <[email protected]>; Xuefu <[email protected]>; user <[email protected]>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Xuefu,
> >>>>>
> >>>>> I appreciate this proposal and, like Fabian, think it would be better if
> >>>>> you could give more details of the plan.
> >>>>>
> >>>>> Thanks, vino.
> >>>>>
> >>>>> Fabian Hueske <[email protected]> wrote on Wed, Oct 10, 2018 at 5:27 PM:
> >>>>> Hi Xuefu,
> >>>>>
> >>>>> Welcome to the Flink community and thanks for starting this discussion!
> >>>>> Better Hive integration would be really great!
> >>>>> Can you go into details of what you are proposing? I can think of a
> >>>>> couple of ways to improve Flink in that regard:
> >>>>>
> >>>>> * Support for Hive UDFs
> >>>>> * Support for Hive metadata catalog
> >>>>> * Support for HiveQL syntax
> >>>>> * ???
> >>>>>
> >>>>> Best, Fabian
> >>>>>
> >>>>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <[email protected]> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Along with the community's effort, inside Alibaba we have explored
> >>>>> Flink's potential as an execution engine not just for stream processing
> >>>>> but also for batch processing. We are encouraged by our findings and have
> >>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
> >>>>> comparing what's available in Flink to the offerings from competitive
> >>>>> data processing engines, we identified a major gap in Flink: good
> >>>>> integration with the Hive ecosystem. This is crucial to the success of
> >>>>> Flink SQL and batch due to the well-established data ecosystem around
> >>>>> Hive. Therefore, we have done some initial work along this direction, but
> >>>>> there is still a lot of effort needed.
> >>>>>
> >>>>> We have two strategies in mind. The first one is to make Flink SQL
> >>>>> full-fledged and well-integrated with the Hive ecosystem. This is a similar
> >>>>> approach to what Spark SQL adopted. The second strategy is to make Hive
> >>>>> itself work with Flink, similar to the proposal in [1]. Each approach
> >>>>> bears its pros and cons, but they don’t need to be mutually exclusive,
> >>>>> with each targeting different users and use cases. We believe that both
> >>>>> will promote a much greater adoption of Flink beyond stream processing.
> >>>>>
> >>>>> We have been focused on the first approach and would like to showcase
> >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> >>>>> planned to start strategy #2 as the follow-up effort.
> >>>>>
> >>>>> I'm completely new to Flink (a short bio [2] is below), though many
> >>>>> of my colleagues here at Alibaba are long-time contributors.
> >>>>> Nevertheless, I'd like to share our thoughts and invite your early
> >>>>> feedback. At the same time, I am working on a detailed proposal on Flink
> >>>>> SQL's integration with the Hive ecosystem, which will also be shared when
> >>>>> ready.
> >>>>>
> >>>>> While the ideas are simple, each approach will demand significant
> >>>>> effort, more than what we can afford. Thus, input and contributions
> >>>>> from the communities are greatly welcome and appreciated.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>>
> >>>>> Xuefu
> >>>>>
> >>>>> References:
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
> >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked or is
> >>>>> working on many projects under the Apache Foundation, of which he is also
> >>>>> an honored member. About 10 years ago he worked in the Hadoop team at
> >>>>> Yahoo, where the projects had just gotten started. Later he worked at
> >>>>> Cloudera, initiating and leading the development of the Hive on Spark
> >>>>> project in the communities and across many organizations. Prior to
> >>>>> joining Alibaba, he worked at Uber, where he promoted Hive on Spark to all
> >>>>> of Uber's SQL on Hadoop workload and significantly improved Uber's cluster
> >>>>> efficiency.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> "So you have to trust that the dots will somehow connect in your
> >>>>> future."
