Sounds smashing; I think the initial integration will help roughly 60% of Flink SQL users, and many other use cases will emerge once we solve the first one.

Thanks,
Taher Koitawala
On Fri 12 Oct, 2018, 10:13 AM Zhang, Xuefu, <xuef...@alibaba-inc.com> wrote:

> Hi Taher,
>
> Thank you for your input. I think you emphasized two important points:
>
> 1. Hive metastore could be used for storing Flink metadata
> 2. There are some usability issues around Flink SQL configuration
>
> I think we all agree on #1. #2 may well be true, and the usability should
> be improved. However, I'm afraid that this is orthogonal to Hive
> integration, and the proposed solution might be just one of the possible
> solutions. On the surface, the extensions you proposed seem to go beyond
> the syntax and semantics of the SQL language in general.
>
> I don't disagree on the value of your proposal. I guess it's better to
> solve #1 first and leave #2 for follow-up discussions. How does this sound
> to you?
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: Taher Koitawala <taher.koitaw...@gslab.com>
> Sent at: 2018 Oct 12 (Fri) 10:06
> Recipient: Xuefu <xuef...@alibaba-inc.com>
> Cc: Rong Rong <walter...@gmail.com>; Timo Walther <twal...@apache.org>;
> dev <dev@flink.apache.org>; jornfranke <jornfra...@gmail.com>; vino yang
> <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> One other thought along the same lines was to use Hive tables to store Kafka
> information for processing streaming tables. Something like:
>
> create table streaming_table (
>     bootstrapServers string,
>     topic string,
>     keySerialiser string,
>     valueSerialiser string);
>
> insert into streaming_table values(
>     "10.17.1.1:9092,10.17.2.2:9092,10.17.3.3:9092",
>     "KafkaTopicName", "SimpleStringSchema", "SimpleStringSchema");
>
> create table processingtable(
>     -- enter fields here which match the schema of the Kafka records
> );
>
> Now we make a custom clause called something like "using". The way we use it is:
>
> using streaming_table as configuration select count(*) from processingtable as streaming;
>
> This way users can pass Flink SQL configuration easily and get rid of the
> Flink SQL configuration file altogether (a sketch of this follows below).
> This is simple and easy to understand, and I think most users would follow this.
>
> Thanks,
> Taher Koitawala
>
> On Fri 12 Oct, 2018, 7:24 AM Taher Koitawala, <taher.koitaw...@gslab.com> wrote:
>
> I think integrating Flink with Hive would be an amazing option, and getting
> Flink's SQL up to pace would also be amazing.
>
> The current Flink SQL syntax to prepare and process a table is too verbose:
> users manually need to retype table definitions, and that's a pain. Hive
> metastore integration should be done, though; many users are okay with
> defining their table schemas in Hive, as it is easy to maintain, change, or
> even migrate.
>
> Also, we could simply choose between batch and stream with something like a
> "process as" clause:
>
> select count(*) from flink_mailing_list process as stream;
>
> select count(*) from flink_mailing_list process as batch;
>
> This way we could completely get rid of Flink SQL configuration files.
>
> Thanks,
> Taher Koitawala
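For perspective, here is a minimal sketch of what Taher's "using streaming_table
as configuration" example would have to desugar to with the Flink 1.6-era APIs.
The Kafka connector, Table API, and SQL calls below are existing APIs; the class
name is made up, the connector values are copied from the example above, and the
hand-coded wiring stands in for the proposed clause, which does not exist:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class UsingClauseSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        // In the proposal, these values would be looked up from the row stored
        // in the Hive-backed "streaming_table" rather than hard-coded here.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers",
                "10.17.1.1:9092,10.17.2.2:9092,10.17.3.3:9092");

        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer010<>("KafkaTopicName", new SimpleStringSchema(), props));

        // Register the stream under the table name used in the example query.
        tEnv.registerDataStream("processingtable", stream, "record");

        // The query part of "using streaming_table as configuration
        // select count(*) from processingtable as streaming".
        Table result = tEnv.sqlQuery("SELECT COUNT(*) FROM processingtable");
        tEnv.toRetractStream(result, Row.class).print();

        env.execute("using-clause sketch");
    }
}

The point of the proposal is precisely that everything between the Properties
object and registerDataStream() would be derived automatically from the
Hive-stored configuration table.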
On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu, <xuef...@alibaba-inc.com> wrote:

> Hi Rong,
>
> Thanks for your feedback. Some of my earlier comments might have addressed
> some of your points, so here I'd like to cover some specifics.
>
> 1. Yes, I expect that table stats stored in Hive will be used in Flink plan
> optimization, but that's not part of the compatibility concern (yet).
> 2. Both implementing Hive UDFs in Flink natively and making Hive UDFs work
> in Flink are being considered.
> 3. I am aware of FLIP-24, but here the proposal is to make the remote server
> compatible with HiveServer2. They are not mutually exclusive either.
> 4. The JDBC/ODBC driver in question is for the remote server that Flink
> provides. It's usually the service owner who provides drivers for their
> services. We weren't talking about JDBC/ODBC drivers for external DB systems.
>
> Let me know if you have further questions.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: Rong Rong <walter...@gmail.com>
> Sent at: 2018 Oct 12 (Fri) 01:52
> Recipient: Timo Walther <twal...@apache.org>
> Cc: dev <dev@flink.apache.org>; jornfranke <jornfra...@gmail.com>; Xuefu
> <xuef...@alibaba-inc.com>; vino yang <yanghua1...@gmail.com>; Fabian Hueske
> <fhue...@gmail.com>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> Thanks for putting together the overview. I would like to add some more on
> top of Timo's comments.
>
> 1,2. I agree with Timo that proper catalog support should also address the
> metadata compatibility issues. I was actually wondering if you are referring
> to something like utilizing table stats for plan optimization?
> 4. If the key is to have users integrate Hive UDFs without code changes to
> Flink UDFs, it shouldn't be a problem, as Timo mentioned. Is your concern
> mostly about the support of Hive UDFs that should be implemented natively in
> flink-table?
> 7,8. Correct me if I am wrong, but I feel like some of the related
> components might have already been discussed in the longer-term roadmap of
> FLIP-24 [1]?
> 9. Per Jörn's comment to stay clear of a tight dependency on Hive and treat
> it as one "connector" system: should we also consider treating the JDBC/ODBC
> driver as part of the connector system instead of having Flink provide them?
>
> Thanks,
> Rong
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
>
> On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <twal...@apache.org> wrote:
>
> Hi Xuefu,
>
> thanks for your proposal, it is a nice summary. Here are my thoughts on
> your list:
>
> 1. I think this is also on our current mid-term roadmap. Flink has lacked
> proper catalog support for a very long time. Before we can connect
> catalogs, we need to define how to map all the information from a catalog
> to Flink's representation. This is why the work on the unified connector
> API [1] has been going on for quite some time, as it is the first approach
> to discuss and represent the pure characteristics of connectors.
> 2. It would be helpful to figure out what is missing in [1] to ensure this
> point. I guess we will need a new design document just for a proper Hive
> catalog integration.
> 3. This is already work in progress. ORC has been merged; Parquet is on its
> way [2].
> 4. This should be easy. There was a PR in the past that I reviewed, but it
> was not maintained anymore (see the sketch after this list).
> 5. The type system of Flink SQL is very flexible. Only the UNION type is
> missing.
> 6. A Flink SQL DDL is on the roadmap soon, once we are done with [1].
> Support for Hive syntax also needs cooperation with Apache Calcite.
> 7-11. Long-term goals.
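As a rough illustration of what point 4 involves (and of Xuefu's point 2
above): an existing Hive UDF could be bridged into Flink's ScalarFunction
along the lines below. No such bridge existed in Flink at the time; the Hive
UDF base class and Flink's reflective eval() contract are real, but the
wrapper class is hypothetical and simplified to a single String overload:

import java.lang.reflect.Method;

import org.apache.flink.table.functions.ScalarFunction;
import org.apache.hadoop.hive.ql.exec.UDF;

// Hypothetical bridge: exposes an existing Hive UDF (the classic
// org.apache.hadoop.hive.ql.exec.UDF style with an evaluate() method)
// as a Flink ScalarFunction, whose eval() is resolved reflectively.
public class HiveUdfWrapper extends ScalarFunction {

    private final Class<? extends UDF> udfClass;
    private transient UDF udf;
    private transient Method evaluate;

    public HiveUdfWrapper(Class<? extends UDF> udfClass) {
        this.udfClass = udfClass;
    }

    // Simplified: assumes a String -> String evaluate() method. A real
    // bridge would inspect all evaluate() overloads and map Hive's type
    // system onto Flink's.
    public String eval(String input) throws Exception {
        if (udf == null) {
            udf = udfClass.newInstance();
            evaluate = udfClass.getMethod("evaluate", String.class);
        }
        return (String) evaluate.invoke(udf, input);
    }
}

Registration would then be the usual tEnv.registerFunction("my_hive_udf",
new HiveUdfWrapper(MyHiveUdf.class)), with MyHiveUdf being the user's
unmodified Hive UDF class.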
> I would also propose to start with a smaller scope where current Flink SQL
> users can also profit: 1, 2, 5, 3. This would allow the Flink SQL ecosystem
> to grow. After that we can aim for full compatibility, including syntax and
> UDFs (4, 6, etc.). Once the core is ready, we can work on the tooling
> (7, 8, 9) and performance (10, 11).
>
> @Jörn: Yes, we should not have a tight dependency on Hive. It should be
> treated as one "connector" system out of many.
>
> Thanks,
> Timo
>
> [1] https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
> [2] https://github.com/apache/flink/pull/6483
>
> On 11.10.18 at 07:54, Jörn Franke wrote:
> > Would it maybe make sense to provide Flink as an engine on Hive
> > ("flink-on-Hive")? E.g., to address 4, 5, 6, 8, 9, 10. This could be more
> > loosely coupled than integrating Hive into all possible Flink core
> > modules, which would introduce a very tight dependency on Hive in the core.
> > 1, 2, 3 could be achieved via a connector based on the Flink Table API.
> > Just as a proposal: start this endeavour as independent projects (Hive
> > engine, connector) to avoid too tight a coupling with Flink. Maybe in a
> > more distant future, if the Hive integration is heavily demanded, one
> > could then integrate it more tightly if needed.
> >
> > What is meant by 11?
> >
> >> On 11.10.2018 at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> >>
> >> Hi Fabian/Vino,
> >>
> >> Thank you very much for your encouragement and inquiry. Sorry that I
> >> didn't see Fabian's email until I read Vino's response just now. (Somehow
> >> Fabian's went to the spam folder.)
> >>
> >> My proposal contains long-term and short-term goals. Nevertheless, the
> >> effort will focus on the following areas, including Fabian's list:
> >>
> >> 1. Hive metastore connectivity - This covers both read and write access,
> >> which means Flink can make full use of Hive's metastore as its catalog
> >> (at least for batch, but this can be extended to streaming as well).
> >> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
> >> created by Hive can be understood by Flink, and the reverse direction is
> >> true as well.
> >> 3. Data compatibility - Similar to #2, data produced by Hive can be
> >> consumed by Flink and vice versa.
> >> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either
> >> provides its own implementation or makes Hive's implementation work in
> >> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
> >> mechanism allowing users to import them into Flink without any code
> >> change required.
> >> 5. Data types - Flink SQL should support all data types that are
> >> available in Hive.
> >> 6. SQL language - Flink SQL should support the SQL standard (such as
> >> SQL:2003) with extensions to support Hive's syntax and language features,
> >> around DDL, DML, and SELECT queries.
> >> 7. SQL CLI - This is currently in development in Flink, but more effort
> >> is needed.
> >> 8. Server - Provide a server that's compatible with Hive's HiveServer2 in
> >> its thrift APIs, such that HiveServer2 users can reuse their existing
> >> clients (such as beeline) but connect to Flink's thrift server instead
> >> (see the JDBC sketch below).
> >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> >> other applications to use to connect to its thrift server.
> >> 10. Support for other user customizations in Hive, such as Hive SerDes,
> >> storage handlers, etc.
> >> 11. Better task failure tolerance and task scheduling in the Flink
> >> runtime.
> >>
> >> As you can see, achieving all of this requires significant effort across
> >> all layers of Flink. However, a short-term goal could include only the
> >> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such
> >> as #3, #6).
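To make points 8 and 9 concrete: the goal is that an unmodified HiveServer2
client can talk to Flink. A minimal sketch using the standard Hive JDBC driver
follows; the driver class and URL scheme are the real ones beeline uses, while
the gateway host is made up, since Flink provided no such endpoint at the time:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftClientSketch {

    public static void main(String[] args) throws Exception {
        // Same driver and URL scheme that beeline uses for HiveServer2;
        // only the host changes if Flink offers a compatible server.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://flink-gateway.example.com:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM flink_mailing_list")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}

From the shell, the equivalent would be beeline -u
jdbc:hive2://flink-gateway.example.com:10000/default, exactly as against
HiveServer2 itself.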
> >> Please share your further thoughts. If we generally agree that this is
> >> the right direction, I could come up with a formal proposal quickly, and
> >> then we can follow up with broader discussions.
> >>
> >> Thanks,
> >> Xuefu
> >>
> >> ------------------------------------------------------------------
> >> Sender: vino yang <yanghua1...@gmail.com>
> >> Sent at: 2018 Oct 11 (Thu) 09:45
> >> Recipient: Fabian Hueske <fhue...@gmail.com>
> >> Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <u...@flink.apache.org>
> >> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>
> >> Hi Xuefu,
> >>
> >> I appreciate this proposal, and like Fabian, I think it would look better
> >> if you could give more details of the plan.
> >>
> >> Thanks, vino.
> >>
> >> On Wed, Oct 10, 2018, at 5:27 PM, Fabian Hueske <fhue...@gmail.com> wrote:
> >> Hi Xuefu,
> >>
> >> Welcome to the Flink community and thanks for starting this discussion!
> >> Better Hive integration would be really great!
> >> Can you go into the details of what you are proposing? I can think of a
> >> couple of ways to improve Flink in that regard:
> >>
> >> * Support for Hive UDFs
> >> * Support for the Hive metadata catalog (sketched below)
> >> * Support for HiveQL syntax
> >> * ???
> >>
> >> Best, Fabian
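On the metadata catalog point: Flink 1.6 can already register an
ExternalCatalog with a TableEnvironment, so a metastore-backed catalog would
slot into an existing hook. In the sketch below, registerExternalCatalog and
InMemoryExternalCatalog are existing Flink 1.6 APIs, while HiveMetastoreCatalog
(in the comments) is hypothetical:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.ExternalCatalog;
import org.apache.flink.table.catalog.InMemoryExternalCatalog;

public class CatalogSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        // Today: an in-memory catalog (existing Flink 1.6 API).
        ExternalCatalog demo = new InMemoryExternalCatalog("hive_ms");
        tEnv.registerExternalCatalog("hive_ms", demo);

        // The proposal would swap in an implementation backed by the Hive
        // metastore, e.g. (hypothetical class and constructor):
        //   tEnv.registerExternalCatalog("hive_ms",
        //       new HiveMetastoreCatalog("thrift://metastore:9083"));

        // Tables would then be addressable with catalog-qualified names:
        //   tEnv.sqlQuery("SELECT COUNT(*) FROM hive_ms.db.some_table");
    }
}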
> >> On Tue, Oct 9, 2018, at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> >> Hi all,
> >>
> >> Along with the community's effort, inside Alibaba we have explored
> >> Flink's potential as an execution engine not just for stream processing
> >> but also for batch processing. We are encouraged by our findings and have
> >> initiated an effort to make Flink's SQL capabilities full-fledged. When
> >> comparing what's available in Flink to the offerings of competing data
> >> processing engines, we identified a major gap in Flink: good integration
> >> with the Hive ecosystem. This is crucial to the success of Flink SQL and
> >> batch, due to the well-established data ecosystem around Hive. Therefore,
> >> we have done some initial work in this direction, but a lot of effort is
> >> still needed.
> >>
> >> We have two strategies in mind. The first one is to make Flink SQL
> >> full-fledged and well integrated with the Hive ecosystem. This is a
> >> similar approach to the one Spark SQL adopted. The second strategy is to
> >> make Hive itself work with Flink, similar to the proposal in [1]. Each
> >> approach has its pros and cons, but they don't need to be mutually
> >> exclusive, with each targeting different users and use cases. We believe
> >> that both will promote a much greater adoption of Flink beyond stream
> >> processing.
> >>
> >> We have been focused on the first approach and would like to showcase
> >> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> >> planned to start strategy #2 as a follow-up effort.
> >>
> >> I'm completely new to Flink (with a short bio [2] below), though many of
> >> my colleagues here at Alibaba are long-time contributors. Nevertheless,
> >> I'd like to share our thoughts and invite your early feedback. At the
> >> same time, I am working on a detailed proposal for Flink SQL's
> >> integration with the Hive ecosystem, which will also be shared when ready.
> >>
> >> While the ideas are simple, each approach will demand significant effort,
> >> more than what we can afford. Thus, input and contributions from the
> >> community are greatly welcome and appreciated.
> >>
> >> Regards,
> >>
> >> Xuefu
> >>
> >> References:
> >>
> >> [1] https://issues.apache.org/jira/browse/HIVE-10712
> >> [2] Xuefu Zhang is a long-time open source veteran who has worked or is
> >> working on many projects under the Apache Foundation, of which he is also
> >> an honored member. About 10 years ago he worked in the Hadoop team at
> >> Yahoo, where those projects had just gotten started. Later he worked at
> >> Cloudera, initiating and leading the development of the Hive on Spark
> >> project in the community and across many organizations. Prior to joining
> >> Alibaba, he worked at Uber, where he rolled out Hive on Spark to all of
> >> Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster
> >> efficiency.