Hi Yun,

Since I was the main driver behind FLINK-9641 and FLINK-9168, let me try to
add more context on this.

FLINK-9641 and FLINK-9168 were created to bring Pulsar in as a source and
sink for Flink. The integration was done against Flink 1.6.0. We sent out
the pull requests about a year ago and ended up maintaining those
connectors in the Pulsar repository so that Pulsar users can use Flink to
process event streams in Pulsar
(see https://github.com/apache/pulsar/tree/master/pulsar-flink). The Flink
1.6 integration is fairly simple and has no schema considerations.

In the past year, we have made a lot of changes in Pulsar and made Pulsar
schema a first-class citizen. We have also integrated Pulsar with other
computing engines so that they can process Pulsar event streams using
Pulsar schema.

This led us to rethink how best to integrate with Flink. We then
reimplemented the pulsar-flink connectors from the ground up around schema,
making the Table API and Catalog API first-class citizens in the
integration. As a result, in the new pulsar-flink implementation you can
register Pulsar as a Flink catalog and query / process the event streams
using Flink SQL.

Here is an example of how to use Pulsar as a Flink catalog:
https://github.com/streamnative/pulsar-flink/blob/3eeddec5625fc7dddc3f8a3ec69f72e1614ca9c9/README.md#use-pulsar-catalog
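
To give a rough idea of what this looks like, here is a minimal sketch
against the Flink 1.9 Table API. Note that the PulsarCatalog class name,
its constructor arguments, and the topic/field names below are only
illustrative assumptions; please refer to the README above for the actual
API and configuration.

  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
  import org.apache.flink.table.api.Table;
  import org.apache.flink.table.api.java.StreamTableEnvironment;
  import org.apache.flink.table.catalog.Catalog;

  public class PulsarCatalogExample {
    public static void main(String[] args) throws Exception {
      StreamExecutionEnvironment env =
          StreamExecutionEnvironment.getExecutionEnvironment();
      StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

      // Hypothetical catalog implementation from the connector: Pulsar
      // namespaces map to databases and topics map to tables.
      Catalog pulsarCatalog = new PulsarCatalog(
          "pulsar",                    // catalog name
          "public/default",            // default database (Pulsar namespace)
          "pulsar://localhost:6650",   // Pulsar service URL
          "http://localhost:8080");    // Pulsar admin URL
      tableEnv.registerCatalog("pulsar", pulsarCatalog);
      tableEnv.useCatalog("pulsar");

      // Topics then show up as tables and can be queried with Flink SQL
      // ("my_topic" and "word" are made-up names for this sketch).
      Table wordCounts = tableEnv.sqlQuery(
          "SELECT word, COUNT(*) AS cnt FROM my_topic GROUP BY word");
    }
  }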

Yijie has also written a blog post explaining why we re-implemented the
Flink connector against Flink 1.9 and what changes we made in the new
connector:
https://medium.com/streamnative/use-apache-pulsar-as-streaming-table-with-8-lines-of-code-39033a93947f

We believe Pulsar is not just a simple data sink or source for Flink. It
can actually serve as fully integrated streaming data storage for Flink in
many areas (sink, source, schema/catalog and state). The combination of
Flink and Pulsar can create a great streaming warehouse architecture for
streaming-first, unified data processing. Since we are proposing to
contribute the Pulsar integration to Flink here, we are also committed to
maintaining, improving and evolving the integration with Flink to help
users who use both Flink and Pulsar.

Hope this gives you a bit more background on the Pulsar Flink
integration. Let me know your thoughts.

Thanks,
Sijie


On Tue, Sep 3, 2019 at 11:54 AM Yun Tang <myas...@live.com> wrote:

> Hi Yijie
>
> I can see that Pulsar has become more and more popular recently, and I am
> very glad to see more people willing to contribute to the Flink ecosystem.
>
> Before any further discussion, would you please explain the relationship
> between this thread and the existing JIRAs for the Pulsar source [1] and
> sink [2] connectors? Will the contribution contain parts of those PRs, or
> will it be a totally different implementation?
>
> [1] https://issues.apache.org/jira/browse/FLINK-9641
> [2] https://issues.apache.org/jira/browse/FLINK-9168
>
> Best
> Yun Tang
> ________________________________
> From: Yijie Shen <henry.yijies...@gmail.com>
> Sent: Tuesday, September 3, 2019 13:57
> To: dev@flink.apache.org <dev@flink.apache.org>
> Subject: [DISCUSS] Contribute Pulsar Flink connector back to Flink
>
> Dear Flink Community!
>
> I would like to open the discussion of contributing Pulsar Flink
> connector [0] back to Flink.
>
> ## A brief introduction to Apache Pulsar
>
> Apache Pulsar[1] is a multi-tenant, high-performance distributed
> pub-sub messaging system. Pulsar includes multiple features such as
> native support for multiple clusters in a Pulsar instance, with
> seamless geo-replication of messages across clusters, very low publish
> and end-to-end latency, seamless scalability to over a million topics,
> and guaranteed message delivery with persistent message storage
> provided by Apache BookKeeper. Pulsar is now being adopted by more and
> more companies[2].
>
> ## The status of Pulsar Flink connector
>
> The Pulsar Flink connector we are planning to contribute is built upon
> Flink 1.9.0 and Pulsar 2.4.0. The main features are:
> - Pulsar as a streaming source with an exactly-once guarantee.
> - Sink streaming results to Pulsar with at-least-once semantics. (We
> will update this to exactly-once as well once Pulsar gets all of its
> transaction features ready in version 2.5.0.)
> - Built upon Flink's new Table API type system (FLIP-37[3]); it can
> automatically (de)serialize messages with the help of Pulsar schema.
> - Integrated with Flink's new Catalog API (FLIP-30[4]), which enables the
> use of Pulsar topics as tables in the Table API as well as the SQL client.
>
> ## Reference
> [0] https://github.com/streamnative/pulsar-flink
> [1] https://pulsar.apache.org/
> [2] https://pulsar.apache.org/en/powered-by/
> [3]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-37%3A+Rework+of+the+Table+API+Type+System
> [4]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs
>
>
> Best,
> Yijie Shen
>
