One other thought along the same lines was to use Hive tables to store Kafka connection information for processing streaming tables. Something like:
"create table streaming_table ( bootstrapServers string, topic string, keySerialiser string, ValueSerialiser string)" Insert into streaming_table values(,"10.17.1.1:9092,10.17.2.2:9092, 10.17.3.3:9092", "KafkaTopicName", "SimpleStringSchema", "SimpleSchemaString"); Create table processingtable( //Enter fields here which match the kafka records schema); Now we make a custom clause called something like "using" The way we use this is: Using streaming_table as configuration select count(*) from processingtable as streaming; This way users can now pass Flink SQL info easily and get rid of the Flink SQL configuration file all together. This is simple and easy to understand and I think most users would follow this. Thanks, Taher Koitawala On Fri 12 Oct, 2018, 7:24 AM Taher Koitawala, <taher.koitaw...@gslab.com> wrote: > I think integrating Flink with Hive would be an amazing option and also to > get Flink's SQL up to pace would be amazing. > > Current Flink Sql syntax to prepare and process a table is too verbose, > users manually need to retype table definitions and that's a pain. Hive > metastore integration should be done through, many users are okay defining > their table schemas in Hive as it is easy to main, change or even migrate. > > Also we could simply choosing batch and stream there with simply something > like a "process as" clause. > > select count(*) from flink_mailing_list process as stream; > > select count(*) from flink_mailing_list process as batch; > > This way we could completely get rid of Flink SQL configuration files. > > Thanks, > Taher Koitawala > > Integrating > On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu, <xuef...@alibaba-inc.com> > wrote: > >> Hi Rong, >> >> Thanks for your feedback. Some of my earlier comments might have >> addressed some of your points, so here I'd like to cover some specifics. >> >> 1. Yes, I expect that table stats stored in Hive will be used in Flink >> plan optimization, but it's not part of compatibility concern (yet). >> 2. Both implementing Hive UDFs in Flink natively and making Hive UDFs >> work in Flink are considered. >> 3. I am aware of FLIP-24, but here the proposal is to make remote server >> compatible with HiveServer2. They are not mutually exclusive either. >> 4. The JDBC/ODBC driver in question is for the remote server that Flink >> provides. It's usually the servicer owner who provides drivers to their >> services. We weren't talking about JDBC/ODBC driver to external DB systems. >> >> Let me know if you have further questions. >> >> Thanks, >> Xuefu >> >> ------------------------------------------------------------------ >> Sender:Rong Rong <walter...@gmail.com> >> Sent at:2018 Oct 12 (Fri) 01:52 >> Recipient:Timo Walther <twal...@apache.org> >> Cc:dev <dev@flink.apache.org>; jornfranke <jornfra...@gmail.com>; Xuefu < >> xuef...@alibaba-inc.com>; vino yang <yanghua1...@gmail.com>; Fabian >> Hueske <fhue...@gmail.com>; user <u...@flink.apache.org> >> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem >> >> Hi Xuefu, >> >> Thanks for putting together the overview. I would like to add some more >> on top of Timo's comments. >> 1,2. I agree with Timo that a proper catalog support should also address >> the metadata compatibility issues. I was actually wondering if you are >> referring to something like utilizing table stats for plan optimization? >> 4. If the key is to have users integrate Hive UDF without code changes to >> Flink UDF, it shouldn't be a problem as Timo mentioned. 
>> Is your concern mostly about support for Hive UDFs that should be
>> implemented natively in flink-table?
>> 7, 8. Correct me if I am wrong, but I feel like some of the related
>> components might have already been discussed in the longer-term roadmap
>> of FLIP-24 [1]?
>> 9. Per Jörn's comment about staying clear of a tight dependency on Hive
>> and treating it as one "connector" system: should we also consider
>> treating the JDBC/ODBC drivers as part of the connector system instead
>> of having Flink provide them?
>>
>> Thanks,
>> Rong
>>
>> [1].
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
>>
>> On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <twal...@apache.org> wrote:
>>
>> Hi Xuefu,
>>
>> thanks for your proposal, it is a nice summary. Here are my thoughts on
>> your list:
>>
>> 1. I think this is also on our current mid-term roadmap. Flink has
>> lacked proper catalog support for a very long time. Before we can
>> connect catalogs, we need to define how to map all the information from
>> a catalog to Flink's representation. This is why the work on the unified
>> connector API [1] has been going on for quite some time; it is the first
>> approach to discuss and represent the pure characteristics of connectors.
>> 2. It would be helpful to figure out what is missing in [1] to ensure
>> this point. I guess we will need a new design document just for a proper
>> Hive catalog integration.
>> 3. This is already work in progress. ORC has been merged; Parquet is on
>> its way [2].
>> 4. This should be easy. There was a PR in the past that I reviewed, but
>> it was not maintained anymore.
>> 5. The type system of Flink SQL is very flexible. Only the UNION type is
>> missing.
>> 6. A Flink SQL DDL is on the roadmap soon, once we are done with [1].
>> Support for Hive syntax also needs cooperation with Apache Calcite.
>> 7-11. Long-term goals.
>>
>> I would also propose to start with a smaller scope from which current
>> Flink SQL users can also profit: 1, 2, 5, 3. This would allow the Flink
>> SQL ecosystem to grow. After that we can aim to be fully compatible,
>> including syntax and UDFs (4, 6, etc.). Once the core is ready, we can
>> work on the tooling (7, 8, 9) and performance (10, 11).
>>
>> @Jörn: Yes, we should not have a tight dependency on Hive. It should be
>> treated as one "connector" system out of many.
>>
>> Thanks,
>> Timo
>>
>> [1]
>> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
>> [2] https://github.com/apache/flink/pull/6483
>>
>> On 11.10.18 at 07:54, Jörn Franke wrote:
>> > Would it maybe make sense to provide Flink as an engine on Hive
>> > ("flink-on-Hive")? E.g., to address 4, 5, 6, 8, 9, 10, this could be
>> > more loosely coupled than integrating Hive in all possible Flink core
>> > modules and thus introducing a very tight dependency on Hive in the core.
>> > 1, 2, 3 could be achieved via a connector based on the Flink Table API.
>> > Just as a proposal: start this endeavour as independent projects (Hive
>> > engine, connector) to avoid too tight a coupling with Flink. Maybe in a
>> > more distant future, if the Hive integration is heavily demanded, one
>> > could then integrate it more tightly if needed.
>> >
>> > What is meant by 11?
>>
>> On 11.10.2018 at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>> >>
>> >> Hi Fabian/Vino,
>> >>
>> >> Thank you very much for your encouragement and inquiry. Sorry that I
>> >> didn't see Fabian's email until I read Vino's response just now.
>> >> (Somehow Fabian's email went to the spam folder.)
>> >>
>> >> My proposal contains long-term and short-term goals. Nevertheless, the
>> >> effort will focus on the following areas, including Fabian's list:
>> >>
>> >> 1. Hive metastore connectivity - This covers both read and write
>> >> access, which means Flink can make full use of Hive's metastore as its
>> >> catalog (at least for batch, but this can be extended to streaming as
>> >> well).
>> >> 2. Metadata compatibility - Objects (databases, tables, partitions,
>> >> etc.) created by Hive can be understood by Flink, and the reverse is
>> >> true as well.
>> >> 3. Data compatibility - Similar to #2, data produced by Hive can be
>> >> consumed by Flink and vice versa.
>> >> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either
>> >> provides its own implementation or makes Hive's implementation work in
>> >> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide
>> >> a mechanism allowing users to import them into Flink without any code
>> >> change required.
>> >> 5. Data types - Flink SQL should support all data types that are
>> >> available in Hive.
>> >> 6. SQL language - Flink SQL should support the SQL standard (such as
>> >> SQL:2003) with extensions to support Hive's syntax and language
>> >> features, around DDL, DML, and SELECT queries.
>> >> 7. SQL CLI - This is currently being developed in Flink, but more
>> >> effort is needed.
>> >> 8. Server - Provide a server that's compatible with Hive's HiveServer2
>> >> in its thrift APIs, such that HiveServer2 users can reuse their
>> >> existing clients (such as beeline) but connect to Flink's thrift server
>> >> instead.
>> >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>> >> other applications to use to connect to its thrift server (see the
>> >> connection sketch further down this thread).
>> >> 10. Support for other user customizations in Hive, such as Hive
>> >> SerDes, storage handlers, etc.
>> >> 11. Better task failure tolerance and task scheduling in the Flink
>> >> runtime.
>> >>
>> >> As you can see, achieving all of this requires significant effort
>> >> across all layers of Flink. However, a short-term goal could include
>> >> only the core areas (such as 1, 2, 4, 5, 6, 7) or start with a smaller
>> >> scope (such as #3, #6).
>> >>
>> >> Please share your further thoughts. If we generally agree that this is
>> >> the right direction, I could come up with a formal proposal quickly,
>> >> and then we can follow up with broader discussions.
>> >>
>> >> Thanks,
>> >> Xuefu
>> >>
>> >> ------------------------------------------------------------------
>> >> Sender: vino yang <yanghua1...@gmail.com>
>> >> Sent at: 2018 Oct 11 (Thu) 09:45
>> >> Recipient: Fabian Hueske <fhue...@gmail.com>
>> >> Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user
>> >> <u...@flink.apache.org>
>> >> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>> >>
>> >> Hi Xuefu,
>> >>
>> >> I appreciate this proposal, and like Fabian, I think it would be
>> >> better if you could give more details of the plan.
>> >>
>> >> Thanks, vino.
>> >>
>> >> Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
>> >>
>> >> Hi Xuefu,
>> >>
>> >> Welcome to the Flink community and thanks for starting this
>> >> discussion! Better Hive integration would be really great!
>> >> Can you go into details of what you are proposing? I can think of a
>> >> couple of ways to improve Flink in that regard:
>> >>
>> >> * Support for Hive UDFs
>> >> * Support for Hive metadata catalog
>> >> * Support for HiveQL syntax
>> >> * ???
>> >>
>> >> Best, Fabian
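>> >>
>> >> To make points 8 and 9 above concrete, a minimal, hypothetical sketch
>> >> (no such Flink endpoint exists yet; the gateway host, port, and table
>> >> name are placeholders): an existing HiveServer2 client would keep the
>> >> stock Hive JDBC driver and only change its connection URL.
>> >>
>> >>     import java.sql.Connection;
>> >>     import java.sql.DriverManager;
>> >>     import java.sql.ResultSet;
>> >>     import java.sql.Statement;
>> >>
>> >>     public class FlinkHs2Client {
>> >>         public static void main(String[] args) throws Exception {
>> >>             // Standard Hive JDBC driver, unchanged; only the URL
>> >>             // points at the (hypothetical) HiveServer2-compatible
>> >>             // Flink thrift endpoint instead of a Hive server.
>> >>             Class.forName("org.apache.hive.jdbc.HiveDriver");
>> >>             try (Connection conn = DriverManager.getConnection(
>> >>                      "jdbc:hive2://flink-gateway-host:10000/default",
>> >>                      "user", "");
>> >>                  Statement stmt = conn.createStatement();
>> >>                  ResultSet rs = stmt.executeQuery(
>> >>                      "SELECT COUNT(*) FROM flink_mailing_list")) {
>> >>                 while (rs.next()) {
>> >>                     System.out.println(rs.getLong(1));
>> >>                 }
>> >>             }
>> >>         }
>> >>     }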
>> >> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com>
>> >> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> Along with the community's effort, inside Alibaba we have explored
>> >> Flink's potential as an execution engine not just for stream
>> >> processing but also for batch processing. We are encouraged by our
>> >> findings and have initiated our effort to make Flink's SQL
>> >> capabilities full-fledged. When comparing what's available in Flink to
>> >> the offerings from competing data processing engines, we identified a
>> >> major gap in Flink: good integration with the Hive ecosystem. This is
>> >> crucial to the success of Flink SQL and batch processing due to the
>> >> well-established data ecosystem around Hive. Therefore, we have done
>> >> some initial work in this direction, but a lot of effort is still
>> >> needed.
>> >>
>> >> We have two strategies in mind. The first one is to make Flink SQL
>> >> full-fledged and well-integrated with the Hive ecosystem. This is
>> >> similar to the approach Spark SQL adopted. The second strategy is to
>> >> make Hive itself work with Flink, similar to the proposal in [1]. Each
>> >> approach has its pros and cons, but they don't need to be mutually
>> >> exclusive, with each targeting different users and use cases. We
>> >> believe that both will promote a much greater adoption of Flink beyond
>> >> stream processing.
>> >>
>> >> We have been focused on the first approach and would like to showcase
>> >> Flink's batch and SQL capabilities with Flink SQL. However, we have
>> >> also planned to start strategy #2 as a follow-up effort.
>> >>
>> >> I'm completely new to Flink (with a short bio [2] below), though many
>> >> of my colleagues here at Alibaba are long-time contributors.
>> >> Nevertheless, I'd like to share our thoughts and invite your early
>> >> feedback. At the same time, I am working on a detailed proposal on
>> >> Flink SQL's integration with the Hive ecosystem, which will also be
>> >> shared when ready.
>> >>
>> >> While the ideas are simple, each approach will demand significant
>> >> effort, more than what we can afford. Thus, input and contributions
>> >> from the community are greatly welcomed and appreciated.
>> >>
>> >> Regards,
>> >>
>> >> Xuefu
>> >>
>> >> References:
>> >>
>> >> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> >> [2] Xuefu Zhang is a long-time open-source veteran who has worked or
>> >> is working on many projects under the Apache Foundation, of which he
>> >> is also an honored member. About 10 years ago he worked on the Hadoop
>> >> team at Yahoo, when those projects had just gotten started. Later he
>> >> worked at Cloudera, initiating and leading the development of the Hive
>> >> on Spark project in the community and across many organizations. Prior
>> >> to joining Alibaba, he worked at Uber, where he promoted Hive on Spark
>> >> for all of Uber's SQL-on-Hadoop workload and significantly improved
>> >> Uber's cluster efficiency.