Sounds smashing; I think the initial integration will help roughly 60% of Flink SQL users, and many other use cases will emerge once we solve the first one.

Thanks,
Taher Koitawala
On Fri 12 Oct, 2018, 10:13 AM Zhang, Xuefu, <xuef...@alibaba-inc.com> wrote:

> Hi Taher,
>
> Thank you for your input. I think you emphasized two important points:
>
> 1. Hive metastore could be used for storing Flink metadata
> 2. There are some usability issues around Flink SQL configuration
>
> I think we all agree on #1. #2 may well be true, and the usability should
> be improved. However, I'm afraid that this is orthogonal to Hive
> integration, and the proposed solution might be just one of the possible
> solutions. On the surface, the extensions you proposed seem to go beyond
> the syntax and semantics of the SQL language in general.
>
> I don't disagree on the value of your proposal. I guess it's better to
> solve #1 first and leave #2 for follow-up discussions. How does this sound
> to you?
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: Taher Koitawala <taher.koitaw...@gslab.com>
> Sent at: 2018 Oct 12 (Fri) 10:06
> Recipient: Xuefu <xuef...@alibaba-inc.com>
> Cc: Rong Rong <walter...@gmail.com>; Timo Walther <twal...@apache.org>;
> dev <dev@flink.apache.org>; jornfranke <jornfra...@gmail.com>; vino yang
> <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> One other thought along the same lines was to use Hive tables to store Kafka
> information for processing streaming tables. Something like:
>
> create table streaming_table (
>     bootstrapServers string,
>     topic string,
>     keySerialiser string,
>     valueSerialiser string);
>
> insert into streaming_table values(
>     "10.17.1.1:9092,10.17.2.2:9092,10.17.3.3:9092",
>     "KafkaTopicName", "SimpleStringSchema", "SimpleStringSchema");
>
> create table processingtable(
>     -- enter fields here which match the schema of the Kafka records
> );
>
> Now we make a custom clause called something like "using". The way we use it is:
>
> using streaming_table as configuration select count(*) from processingtable as streaming;
>
> This way users can pass Flink SQL configuration easily and get rid of the
> Flink SQL configuration file altogether (a sketch of this follows below).
> This is simple and easy to understand, and I think most users would follow this.
>
> Thanks,
> Taher Koitawala
>
> On Fri 12 Oct, 2018, 7:24 AM Taher Koitawala, <taher.koitaw...@gslab.com> wrote:
>
> I think integrating Flink with Hive would be an amazing option, and getting
> Flink's SQL up to pace would also be amazing.
>
> The current Flink SQL syntax to prepare and process a table is too verbose:
> users manually need to retype table definitions, and that's a pain. Hive
> metastore integration should be done, though; many users are okay with
> defining their table schemas in Hive, as it is easy to maintain, change, or
> even migrate.
>
> Also, we could simply choose between batch and stream with something like a
> "process as" clause:
>
> select count(*) from flink_mailing_list process as stream;
>
> select count(*) from flink_mailing_list process as batch;
>
> This way we could completely get rid of Flink SQL configuration files.
>
> Thanks,
> Taher Koitawala
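For perspective, here is a minimal sketch of what Taher's "using streaming_table
as configuration" example would have to desugar to with the Flink 1.6-era APIs.
The Kafka connector, Table API, and SQL calls below are existing APIs; the class
name is made up, the connector values are copied from the example above, and the
hand-coded wiring stands in for the proposed clause, which does not exist:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class UsingClauseSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        // In the proposal, these values would be looked up from the row stored
        // in the Hive-backed "streaming_table" rather than hard-coded here.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers",
                "10.17.1.1:9092,10.17.2.2:9092,10.17.3.3:9092");

        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer010<>("KafkaTopicName", new SimpleStringSchema(), props));

        // Register the stream under the table name used in the example query.
        tEnv.registerDataStream("processingtable", stream, "record");

        // The query part of "using streaming_table as configuration
        // select count(*) from processingtable as streaming".
        Table result = tEnv.sqlQuery("SELECT COUNT(*) FROM processingtable");
        tEnv.toRetractStream(result, Row.class).print();

        env.execute("using-clause sketch");
    }
}

The point of the proposal is precisely that everything between the Properties
object and registerDataStream() would be derived automatically from the
Hive-stored configuration table.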
On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu, <xuef...@alibaba-inc.com> wrote:

> Hi Rong,
>
> Thanks for your feedback. Some of my earlier comments might have addressed
> some of your points, so here I'd like to cover some specifics.
>
> 1. Yes, I expect that table stats stored in Hive will be used in Flink plan
> optimization, but that's not part of the compatibility concern (yet).
> 2. Both implementing Hive UDFs in Flink natively and making Hive UDFs work
> in Flink are being considered.
> 3. I am aware of FLIP-24, but here the proposal is to make the remote server
> compatible with HiveServer2. They are not mutually exclusive either.
> 4. The JDBC/ODBC driver in question is for the remote server that Flink
> provides. It's usually the service owner who provides drivers for their
> services. We weren't talking about JDBC/ODBC drivers for external DB systems.
>
> Let me know if you have further questions.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: Rong Rong <walter...@gmail.com>
> Sent at: 2018 Oct 12 (Fri) 01:52
> Recipient: Timo Walther <twal...@apache.org>
> Cc: dev <dev@flink.apache.org>; jornfranke <jornfra...@gmail.com>; Xuefu
> <xuef...@alibaba-inc.com>; vino yang <yanghua1...@gmail.com>; Fabian Hueske
> <fhue...@gmail.com>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> Thanks for putting together the overview. I would like to add some more on
> top of Timo's comments.
>
> 1,2. I agree with Timo that proper catalog support should also address the
> metadata compatibility issues. I was actually wondering if you are referring
> to something like utilizing table stats for plan optimization?
> 4. If the key is to have users integrate Hive UDFs without code changes to
> Flink UDFs, it shouldn't be a problem, as Timo mentioned. Is your concern
> mostly about the support of Hive UDFs that should be implemented natively in
> flink-table?
> 7,8. Correct me if I am wrong, but I feel like some of the related
> components might have already been discussed in the longer-term roadmap of
> FLIP-24 [1]?
> 9. Per Jörn's comment to stay clear of a tight dependency on Hive and treat
> it as one "connector" system: should we also consider treating the JDBC/ODBC
> driver as part of the connector system instead of having Flink provide them?
>
> Thanks,
> Rong
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
>
> On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <twal...@apache.org> wrote:
>
> Hi Xuefu,
>
> thanks for your proposal, it is a nice summary. Here are my thoughts on
> your list:
>
> 1. I think this is also on our current mid-term roadmap. Flink has lacked
> proper catalog support for a very long time. Before we can connect
> catalogs, we need to define how to map all the information from a catalog
> to Flink's representation. This is why the work on the unified connector
> API [1] has been going on for quite some time, as it is the first approach
> to discuss and represent the pure characteristics of connectors.
> 2. It would be helpful to figure out what is missing in [1] to ensure this
> point. I guess we will need a new design document just for a proper Hive
> catalog integration.
> 3. This is already work in progress. ORC has been merged; Parquet is on its
> way [2].
> 4. This should be easy. There was a PR in the past that I reviewed, but it
> was not maintained anymore (see the sketch after this list).
> 5. The type system of Flink SQL is very flexible. Only the UNION type is
> missing.
> 6. A Flink SQL DDL is on the roadmap soon, once we are done with [1].
> Support for Hive syntax also needs cooperation with Apache Calcite.
> 7-11. Long-term goals.
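As a rough illustration of what point 4 involves (and of Xuefu's point 2
above): an existing Hive UDF could be bridged into Flink's ScalarFunction
along the lines below. No such bridge existed in Flink at the time; the Hive
UDF base class and Flink's reflective eval() contract are real, but the
wrapper class is hypothetical and simplified to a single String overload:

import java.lang.reflect.Method;

import org.apache.flink.table.functions.ScalarFunction;
import org.apache.hadoop.hive.ql.exec.UDF;

// Hypothetical bridge: exposes an existing Hive UDF (the classic
// org.apache.hadoop.hive.ql.exec.UDF style with an evaluate() method)
// as a Flink ScalarFunction, whose eval() is resolved reflectively.
public class HiveUdfWrapper extends ScalarFunction {

    private final Class<? extends UDF> udfClass;
    private transient UDF udf;
    private transient Method evaluate;

    public HiveUdfWrapper(Class<? extends UDF> udfClass) {
        this.udfClass = udfClass;
    }

    // Simplified: assumes a String -> String evaluate() method. A real
    // bridge would inspect all evaluate() overloads and map Hive's type
    // system onto Flink's.
    public String eval(String input) throws Exception {
        if (udf == null) {
            udf = udfClass.newInstance();
            evaluate = udfClass.getMethod("evaluate", String.class);
        }
        return (String) evaluate.invoke(udf, input);
    }
}

Registration would then be the usual tEnv.registerFunction("my_hive_udf",
new HiveUdfWrapper(MyHiveUdf.class)), with MyHiveUdf being the user's
unmodified Hive UDF class.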
> I would also propose to start with a smaller scope where current Flink SQL
> users can also profit: 1, 2, 5, 3. This would allow the Flink SQL ecosystem
> to grow. After that we can aim for full compatibility, including syntax and
> UDFs (4, 6, etc.). Once the core is ready, we can work on the tooling
> (7, 8, 9) and performance (10, 11).
>
> @Jörn: Yes, we should not have a tight dependency on Hive. It should be
> treated as one "connector" system out of many.
>
> Thanks,
> Timo
>
> [1] https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
> [2] https://github.com/apache/flink/pull/6483
>
> On 11.10.18 at 07:54, Jörn Franke wrote:
> > Would it maybe make sense to provide Flink as an engine on Hive
> > ("flink-on-Hive")? E.g., to address 4, 5, 6, 8, 9, 10. This could be more
> > loosely coupled than integrating Hive into all possible Flink core
> > modules, which would introduce a very tight dependency on Hive in the core.
> > 1, 2, 3 could be achieved via a connector based on the Flink Table API.
> > Just as a proposal: start this endeavour as independent projects (Hive
> > engine, connector) to avoid too tight a coupling with Flink. Maybe in a
> > more distant future, if the Hive integration is heavily demanded, one
> > could then integrate it more tightly if needed.
> >
> > What is meant by 11?
> >
> >> On 11.10.2018 at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> >>
> >> Hi Fabian/Vino,
> >>
> >> Thank you very much for your encouragement and inquiry. Sorry that I
> >> didn't see Fabian's email until I read Vino's response just now. (Somehow
> >> Fabian's went to the spam folder.)
> >>
> >> My proposal contains long-term and short-term goals. Nevertheless, the
> >> effort will focus on the following areas, including Fabian's list:
> >>
> >> 1. Hive metastore connectivity - This covers both read and write access,
> >> which means Flink can make full use of Hive's metastore as its catalog
> >> (at least for batch, but this can be extended to streaming as well).
> >> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
> >> created by Hive can be understood by Flink, and the reverse direction is
> >> true as well.
> >> 3. Data compatibility - Similar to #2, data produced by Hive can be
> >> consumed by Flink and vice versa.
> >> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either
> >> provides its own implementation or makes Hive's implementation work in
> >> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
> >> mechanism allowing users to import them into Flink without any code
> >> change required.
> >> 5. Data types - Flink SQL should support all data types that are
> >> available in Hive.
> >> 6. SQL language - Flink SQL should support the SQL standard (such as
> >> SQL:2003) with extensions to support Hive's syntax and language features,
> >> around DDL, DML, and SELECT queries.
> >> 7. SQL CLI - This is currently in development in Flink, but more effort
> >> is needed.
> >> 8. Server - Provide a server that's compatible with Hive's HiveServer2 in
> >> its thrift APIs, such that HiveServer2 users can reuse their existing
> >> clients (such as beeline) but connect to Flink's thrift server instead
> >> (see the JDBC sketch below).
> >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> >> other applications to use to connect to its thrift server.
> >> 10. Support for other user customizations in Hive, such as Hive SerDes,
> >> storage handlers, etc.
> >> 11. Better task failure tolerance and task scheduling in the Flink
> >> runtime.
> >>
> >> As you can see, achieving all of this requires significant effort across
> >> all layers of Flink. However, a short-term goal could include only the
> >> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such
> >> as #3, #6).
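To make points 8 and 9 concrete: the goal is that an unmodified HiveServer2
client can talk to Flink. A minimal sketch using the standard Hive JDBC driver
follows; the driver class and URL scheme are the real ones beeline uses, while
the gateway host is made up, since Flink provided no such endpoint at the time:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftClientSketch {

    public static void main(String[] args) throws Exception {
        // Same driver and URL scheme that beeline uses for HiveServer2;
        // only the host changes if Flink offers a compatible server.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://flink-gateway.example.com:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM flink_mailing_list")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}

From the shell, the equivalent would be beeline -u
jdbc:hive2://flink-gateway.example.com:10000/default, exactly as against
HiveServer2 itself.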
> >> Please share your further thoughts. If we generally agree that this is
> >> the right direction, I could come up with a formal proposal quickly, and
> >> then we can follow up with broader discussions.
> >>
> >> Thanks,
> >> Xuefu
> >>
> >> ------------------------------------------------------------------
> >> Sender: vino yang <yanghua1...@gmail.com>
> >> Sent at: 2018 Oct 11 (Thu) 09:45
> >> Recipient: Fabian Hueske <fhue...@gmail.com>
> >> Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <u...@flink.apache.org>
> >> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>
> >> Hi Xuefu,
> >>
> >> I appreciate this proposal, and like Fabian, I think it would look better
> >> if you could give more details of the plan.
> >>
> >> Thanks, vino.
> >>
> >> On Wed, Oct 10, 2018, at 5:27 PM, Fabian Hueske <fhue...@gmail.com> wrote:
> >> Hi Xuefu,
> >>
> >> Welcome to the Flink community and thanks for starting this discussion!
> >> Better Hive integration would be really great!
> >> Can you go into the details of what you are proposing? I can think of a
> >> couple of ways to improve Flink in that regard:
> >>
> >> * Support for Hive UDFs
> >> * Support for the Hive metadata catalog (sketched below)
> >> * Support for HiveQL syntax
> >> * ???
> >>
> >> Best, Fabian
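On the metadata catalog point: Flink 1.6 can already register an
ExternalCatalog with a TableEnvironment, so a metastore-backed catalog would
slot into an existing hook. In the sketch below, registerExternalCatalog and
InMemoryExternalCatalog are existing Flink 1.6 APIs, while HiveMetastoreCatalog
(in the comments) is hypothetical:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.ExternalCatalog;
import org.apache.flink.table.catalog.InMemoryExternalCatalog;

public class CatalogSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        // Today: an in-memory catalog (existing Flink 1.6 API).
        ExternalCatalog demo = new InMemoryExternalCatalog("hive_ms");
        tEnv.registerExternalCatalog("hive_ms", demo);

        // The proposal would swap in an implementation backed by the Hive
        // metastore, e.g. (hypothetical class and constructor):
        //   tEnv.registerExternalCatalog("hive_ms",
        //       new HiveMetastoreCatalog("thrift://metastore:9083"));

        // Tables would then be addressable with catalog-qualified names:
        //   tEnv.sqlQuery("SELECT COUNT(*) FROM hive_ms.db.some_table");
    }
}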
> >> On Tue, Oct 9, 2018, at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
> >> Hi all,
> >>
> >> Along with the community's effort, inside Alibaba we have explored
> >> Flink's potential as an execution engine not just for stream processing
> >> but also for batch processing. We are encouraged by our findings and have
> >> initiated an effort to make Flink's SQL capabilities full-fledged. When
> >> comparing what's available in Flink to the offerings of competing data
> >> processing engines, we identified a major gap in Flink: good integration
> >> with the Hive ecosystem. This is crucial to the success of Flink SQL and
> >> batch, due to the well-established data ecosystem around Hive. Therefore,
> >> we have done some initial work in this direction, but a lot of effort is
> >> still needed.
> >>
> >> We have two strategies in mind. The first one is to make Flink SQL
> >> full-fledged and well integrated with the Hive ecosystem. This is a
> >> similar approach to the one Spark SQL adopted. The second strategy is to
> >> make Hive itself work with Flink, similar to the proposal in [1]. Each
> >> approach has its pros and cons, but they don't need to be mutually
> >> exclusive, with each targeting different users and use cases. We believe
> >> that both will promote a much greater adoption of Flink beyond stream
> >> processing.
> >>
> >> We have been focused on the first approach and would like to showcase
> >> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> >> planned to start strategy #2 as a follow-up effort.
> >>
> >> I'm completely new to Flink (with a short bio [2] below), though many of
> >> my colleagues here at Alibaba are long-time contributors. Nevertheless,
> >> I'd like to share our thoughts and invite your early feedback. At the
> >> same time, I am working on a detailed proposal for Flink SQL's
> >> integration with the Hive ecosystem, which will also be shared when ready.
> >>
> >> While the ideas are simple, each approach will demand significant effort,
> >> more than what we can afford. Thus, input and contributions from the
> >> community are greatly welcome and appreciated.
> >>
> >> Regards,
> >>
> >> Xuefu
> >>
> >> References:
> >>
> >> [1] https://issues.apache.org/jira/browse/HIVE-10712
> >> [2] Xuefu Zhang is a long-time open source veteran who has worked or is
> >> working on many projects under the Apache Foundation, of which he is also
> >> an honored member. About 10 years ago he worked in the Hadoop team at
> >> Yahoo, where those projects had just gotten started. Later he worked at
> >> Cloudera, initiating and leading the development of the Hive on Spark
> >> project in the community and across many organizations. Prior to joining
> >> Alibaba, he worked at Uber, where he rolled out Hive on Spark to all of
> >> Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster
> >> efficiency.