Hi Xuefu,

Currently the new design doc <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit> is in "view only" mode, and people cannot leave comments. Can you please change it to "can comment" or "can edit" mode?

Thanks,
Bowen

On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:

> Hi Piotr
>
> I have extracted the API portion of the design and the google doc is here
> <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
> Please review and provide your feedback.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: Xuefu <xuef...@alibaba-inc.com>
> Sent at: 2018 Nov 12 (Mon) 12:43
> Recipient: Piotr Nowojski <pi...@data-artisans.com>; dev <dev@flink.apache.org>
> Cc: Bowen Li <bowenl...@gmail.com>; Shuyi Chen <suez1...@gmail.com>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Piotr,
>
> That sounds good to me. Let's close all the open questions (there are a
> couple of them) in the Google doc and I should be able to quickly split
> it into the three proposals as you suggested.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: Piotr Nowojski <pi...@data-artisans.com>
> Sent at: 2018 Nov 9 (Fri) 22:46
> Recipient: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>
> Cc: Bowen Li <bowenl...@gmail.com>; Shuyi Chen <suez1...@gmail.com>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi,
>
> Yes, it seems like the best solution. Maybe someone else can also suggest
> whether we can split it further? Maybe changes in the interface in one doc,
> reading from Hive meta store in another, and finally storing our meta
> information in Hive meta store?
>
> Piotrek
>
> > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> >
> > Hi Piotr,
> >
> > That seems to be a good idea!
> >
> > Since the google doc for the design is currently under extensive review, I
> > will leave it as it is for now. However, I'll convert it to two different
> > FLIPs when the time comes.
> >
> > How does it sound to you?
> >
> > Thanks,
> > Xuefu
> >
> > ------------------------------------------------------------------
> > Sender: Piotr Nowojski <pi...@data-artisans.com>
> > Sent at: 2018 Nov 9 (Fri) 02:31
> > Recipient: dev <dev@flink.apache.org>
> > Cc: Bowen Li <bowenl...@gmail.com>; Xuefu <xuef...@alibaba-inc.com>; Shuyi Chen <suez1...@gmail.com>
> > Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >
> > Hi,
> >
> > Maybe we should split this topic (and the design doc) into a couple of
> > smaller ones, hopefully independent. The questions that you have asked
> > Fabian have for example very little to do with reading metadata from Hive
> > Meta Store?
> >
> > Piotrek
> >
> >> On 7 Nov 2018, at 14:27, Fabian Hueske <fhue...@gmail.com> wrote:
> >>
> >> Hi Xuefu and all,
> >>
> >> Thanks for sharing this design document!
> >> I'm very much in favor of restructuring / reworking the catalog handling in
> >> Flink SQL as outlined in the document.
> >> Most changes described in the design document seem to be rather general and
> >> not specifically related to the Hive integration.
> >>
> >> IMO, there are some aspects, especially those at the boundary of Hive and
> >> Flink, that need a bit more discussion. For example
> >>
> >> * What does it take to make Flink schema compatible with Hive schema?
> >> * How will Flink tables (descriptors) be stored in HMS?
> >> * How do both Hive catalogs differ? Could they be integrated into a
> >>   single one? When to use which one?
> >> * What meta information is provided by HMS? What of this can be leveraged
> >>   by Flink?
> >>
> >> Thank you,
> >> Fabian
> >>
> >> On Fri, Nov 2, 2018 at 00:31, Bowen Li <bowenl...@gmail.com> wrote:
> >>
> >>> After taking a look at how other discussion threads work, I think it's
> >>> actually fine to just keep our discussion here. It's up to you, Xuefu.
> >>>
> >>> The google doc LGTM. I left some minor comments.
> >>>
> >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bowenl...@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> As Xuefu has published the design doc on google, I agree with Shuyi's
> >>>> suggestion that we probably should start a new email thread like
> >>>> "[DISCUSS] ... Hive integration design ..." on only the dev mailing list
> >>>> for community devs to review. The current thread sends to both the dev
> >>>> and user lists.
> >>>>
> >>>> This email thread is more like validating the general idea and direction
> >>>> with the community, and it's been pretty long and crowded so far. Since
> >>>> everyone is in favor of the idea, we can move forward with another thread
> >>>> to discuss and finalize the design.
> >>>>
> >>>> Thanks,
> >>>> Bowen
> >>>>
> >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xuef...@alibaba-inc.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Shuyi,
> >>>>>
> >>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
> >>>>> link:
> >>>>>
> >>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
> >>>>> Once we reach an agreement, I can convert it to a FLIP.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender: Shuyi Chen <suez1...@gmail.com>
> >>>>> Sent at: 2018 Nov 1 (Thu) 02:47
> >>>>> Recipient: Xuefu <xuef...@alibaba-inc.com>
> >>>>> Cc: vino yang <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>;
> >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Xuefu,
> >>>>>
> >>>>> Thanks a lot for driving this big effort. I would suggest converting your
> >>>>> proposal and design doc into a google doc, and sharing it on the dev
> >>>>> mailing list for the community to review and comment, with a title like
> >>>>> "[DISCUSS] ... Hive integration design ...". Once approved, we can
> >>>>> document it as a FLIP (Flink Improvement Proposals), and use JIRAs to
> >>>>> track the implementations.
> >>>>> What do you think?
> >>>>>
> >>>>> Shuyi
> >>>>>
> >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xuef...@alibaba-inc.com>
> >>>>> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I have also shared a design doc on Hive metastore integration that is
> >>>>> attached here and also to FLINK-10556 [1]. Please kindly review and share
> >>>>> your feedback.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> >>>>> ------------------------------------------------------------------
> >>>>> Sender: Xuefu <xuef...@alibaba-inc.com>
> >>>>> Sent at: 2018 Oct 25 (Thu) 01:08
> >>>>> Recipient: Xuefu <xuef...@alibaba-inc.com>; Shuyi Chen <suez1...@gmail.com>
> >>>>> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>;
> >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> To wrap up the discussion, I have attached a PDF describing the
> >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
> >>>>> watch that JIRA to track the progress.
> >>>>>
> >>>>> Please also let me know if you have additional comments or questions.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender: Xuefu <xuef...@alibaba-inc.com>
> >>>>> Sent at: 2018 Oct 16 (Tue) 03:40
> >>>>> Recipient: Shuyi Chen <suez1...@gmail.com>
> >>>>> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>;
> >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Shuyi,
> >>>>>
> >>>>> Thank you for your input. Yes, I agree with a phased approach and would
> >>>>> like to move forward fast. :) We did some work internally on DDL
> >>>>> utilizing the babel parser in Calcite. While babel makes Calcite's
> >>>>> grammar extensible, at first impression it still seems too cumbersome
> >>>>> for a project when too many extensions are made. It's even challenging
> >>>>> to find where the extension is needed! It would certainly be better if
> >>>>> Calcite could magically support Hive QL by just turning on a flag, such
> >>>>> as that for MYSQL_5. I can also see that this could mean a lot of work
> >>>>> on Calcite. Nevertheless, I will bring up the discussion over there to
> >>>>> see what their community thinks.
> >>>>>
> >>>>> Would you mind sharing more info about the proposal on DDL that you
> >>>>> mentioned? We can certainly collaborate on this.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender: Shuyi Chen <suez1...@gmail.com>
> >>>>> Sent at: 2018 Oct 14 (Sun) 08:30
> >>>>> Recipient: Xuefu <xuef...@alibaba-inc.com>
> >>>>> Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>;
> >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
> >>>>> think the proposal can be divided into 2 stages: making Flink support
> >>>>> Hive features, and making Hive work with Flink. I agree with Timo on
> >>>>> starting with a smaller scope, so we can make progress faster. As for
> >>>>> [6], a proposal for DDL is already in progress, and will come after the
> >>>>> unified SQL connector API is done. For supporting Hive syntax, we might
> >>>>> need to work with the Calcite community, and a recent effort called babel
> >>>>> (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
> >>>>> help here.
> >>>>>
> >>>>> Thanks
> >>>>> Shuyi
> >>>>>
> >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xuef...@alibaba-inc.com>
> >>>>> wrote:
> >>>>> Hi Fabian/Vino,
> >>>>>
> >>>>> Thank you very much for your encouragement and inquiry. Sorry that I
> >>>>> didn't see Fabian's email until I read Vino's response just now.
> >>>>> (Somehow Fabian's went to the spam folder.)
> >>>>>
> >>>>> My proposal contains long-term and short-term goals. Nevertheless, the
> >>>>> effort will focus on the following areas, including Fabian's list:
> >>>>>
> >>>>> 1. Hive metastore connectivity - This covers both read/write access,
> >>>>> which means Flink can make full use of Hive's metastore as its catalog
> >>>>> (at least for the batch but can extend for streaming as well).
> >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc)
> >>>>> created by Hive can be understood by Flink and the reverse direction is
> >>>>> true also.
> >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
> >>>>> consumed by Flink and vice versa.
> >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either
> >>>>> provides its own implementation or makes Hive's implementation work in
> >>>>> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide
> >>>>> a mechanism allowing users to import them into Flink without any code
> >>>>> change required.
> >>>>> 5. Data types - Flink SQL should support all data types that are
> >>>>> available in Hive.
> >>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
> >>>>> SQL2003) with extension to support Hive's syntax and language features,
> >>>>> around DDL, DML, and SELECT queries.
> >>>>> 7. SQL CLI - this is currently being developed in Flink but more effort
> >>>>> is needed.
> >>>>> 8. Server - provide a server that's compatible with Hive's HiveServer2
> >>>>> in thrift APIs, such that HiveServer2 users can reuse their existing
> >>>>> client (such as beeline) but connect to Flink's thrift server instead.
> >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> >>>>> other applications to use to connect to its thrift server.
> >>>>> 10. Support other user customizations in Hive, such as Hive Serdes,
> >>>>> storage handlers, etc.
> >>>>> 11. Better task failure tolerance and task scheduling at Flink runtime.
> >>>>>
> >>>>> As you can see, achieving all those requires significant effort across
> >>>>> all layers in Flink. However, a short-term goal could include only
> >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope
> >>>>> (such as #3, #6).
> >>>>>
> >>>>> Please share your further thoughts.
> >>>>> If we generally agree that this is
> >>>>> the right direction, I could come up with a formal proposal quickly and
> >>>>> then we can follow up with broader discussions.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender: vino yang <yanghua1...@gmail.com>
> >>>>> Sent at: 2018 Oct 11 (Thu) 09:45
> >>>>> Recipient: Fabian Hueske <fhue...@gmail.com>
> >>>>> Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>;
> >>>>> user <u...@flink.apache.org>
> >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Xuefu,
> >>>>>
> >>>>> I appreciate this proposal and, like Fabian, think it would be better
> >>>>> if you could give more details of the plan.
> >>>>>
> >>>>> Thanks, vino.
> >>>>>
> >>>>> On Wed, Oct 10, 2018 at 5:27 PM, Fabian Hueske <fhue...@gmail.com> wrote:
> >>>>> Hi Xuefu,
> >>>>>
> >>>>> Welcome to the Flink community and thanks for starting this discussion!
> >>>>> Better Hive integration would be really great!
> >>>>> Can you go into details of what you are proposing? I can think of a
> >>>>> couple ways to improve Flink in that regard:
> >>>>>
> >>>>> * Support for Hive UDFs
> >>>>> * Support for Hive metadata catalog
> >>>>> * Support for HiveQL syntax
> >>>>> * ???
> >>>>>
> >>>>> Best, Fabian
> >>>>>
> >>>>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Along with the community's effort, inside Alibaba we have explored
> >>>>> Flink's potential as an execution engine not just for stream processing
> >>>>> but also for batch processing. We are encouraged by our findings and have
> >>>>> initiated our effort to make Flink's SQL capabilities full-fledged.
> >>>>> When
> >>>>> comparing what's available in Flink to the offerings from competitive
> >>>>> data processing engines, we identified a major gap in Flink: good
> >>>>> integration with the Hive ecosystem. This is crucial to the success of
> >>>>> Flink SQL and batch due to the well-established data ecosystem around
> >>>>> Hive. Therefore, we have done some initial work along this direction
> >>>>> but there is still a lot of effort needed.
> >>>>>
> >>>>> We have two strategies in mind. The first one is to make Flink SQL
> >>>>> full-fledged and well-integrated with the Hive ecosystem. This is a
> >>>>> similar approach to what Spark SQL adopted. The second strategy is to
> >>>>> make Hive itself work with Flink, similar to the proposal in [1]. Each
> >>>>> approach bears its pros and cons, but they don't need to be mutually
> >>>>> exclusive, with each targeting different users and use cases. We
> >>>>> believe that both will promote a much greater adoption of Flink beyond
> >>>>> stream processing.
> >>>>>
> >>>>> We have been focused on the first approach and would like to showcase
> >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> >>>>> planned to start strategy #2 as the follow-up effort.
> >>>>>
> >>>>> I'm completely new to Flink (with a short bio [2] below), though many
> >>>>> of my colleagues here at Alibaba are long-time contributors.
> >>>>> Nevertheless, I'd like to share our thoughts and invite your early
> >>>>> feedback. At the same time, I am working on a detailed proposal on
> >>>>> Flink SQL's integration with the Hive ecosystem, which will also be
> >>>>> shared when ready.
> >>>>>
> >>>>> While the ideas are simple, each approach will demand significant
> >>>>> effort, more than what we can afford. Thus, the input and contributions
> >>>>> from the communities are greatly welcomed and appreciated.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Xuefu
> >>>>>
> >>>>> References:
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
> >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked or is
> >>>>> working on many projects under the Apache Foundation, of which he is
> >>>>> also an honored member. About 10 years ago he worked in the Hadoop team
> >>>>> at Yahoo where the projects had just gotten started. Later he worked at
> >>>>> Cloudera, initiating and leading the development of the Hive on Spark
> >>>>> project in the communities and across many organizations. Prior to
> >>>>> joining Alibaba, he worked at Uber where he promoted Hive on Spark to
> >>>>> all of Uber's SQL on Hadoop workloads and significantly improved Uber's
> >>>>> cluster efficiency.
> >>>>>
> >>>>> --
> >>>>> "So you have to trust that the dots will somehow connect in your
> >>>>> future."