Thanks Shengkai! +1 to start voting.
Best,
Jark

On Fri, 23 Oct 2020 at 15:02, Shengkai Fang <fskm...@gmail.com> wrote:

Add one more message: I have already updated the FLIP [1].

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-149%3A+Introduce+the+upsert-kafka+Connector

On Fri, Oct 23, 2020 at 2:55 PM, Shengkai Fang <fskm...@gmail.com> wrote:

Hi, all.
It seems we have reached a consensus on the FLIP. If no one has other objections, I would like to start the vote for FLIP-149.

Best,
Shengkai

On Fri, Oct 23, 2020 at 2:25 PM, Jingsong Li <jingsongl...@gmail.com> wrote:

Thanks for the explanation.

I am OK with `upsert`. Yes, its concept has been accepted by many systems.

Best,
Jingsong

On Fri, Oct 23, 2020 at 12:38 PM Jark Wu <imj...@gmail.com> wrote:

Hi Timo,

I have some concerns about `kafka-cdc`:
1) "cdc" is an abbreviation of Change Data Capture, which is commonly used for databases, not for message queues.
2) Usually, CDC produces the full content of the changelog, including UPDATE_BEFORE; however, "upsert kafka" doesn't.
3) `kafka-cdc` sounds like native support for the `debezium-json` format; however, it is not, and we don't even want "upsert kafka" to support "debezium-json".

Hi Jingsong,

I think the terminology of "upsert" is fine, because Kafka also uses "upsert" to define such behavior in their official documentation [1]:

> a data record in a changelog stream is interpreted as an UPSERT aka INSERT/UPDATE

Materialize uses the "UPSERT" keyword to define such behavior too [2]. Users have been requesting such a feature using "upsert kafka" terminology on the user mailing lists [3][4]. Many other systems support an "UPSERT" statement natively, such as Impala [5], SAP [6], Phoenix [7], Oracle NoSQL [8], etc.

Therefore, I think we don't need to be afraid of introducing the "upsert" terminology; it is widely accepted by users.

Best,
Jark

[1]: https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html#streams_concepts_ktable
[2]: https://materialize.io/docs/sql/create-source/text-kafka/#upsert-on-a-kafka-topic
[3]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/SQL-materialized-upsert-tables-td18482.html#a18503
[4]: http://apache-flink.147419.n8.nabble.com/Kafka-Sink-AppendStreamTableSink-doesn-t-support-consuming-update-changes-td5959.html
[5]: https://impala.apache.org/docs/build/html/topics/impala_upsert.html
[6]: https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/ea8b6773be584203bcd99da76844c5ed.html
[7]: https://phoenix.apache.org/atomic_upsert.html
[8]: https://docs.oracle.com/en/database/other-databases/nosql-database/18.3/sqlfornosql/adding-table-rows-using-insert-and-upsert-statements.html

On Fri, 23 Oct 2020 at 10:36, Jingsong Li <jingsongl...@gmail.com> wrote:

The `kafka-cdc` looks good to me.
We can even give options to indicate whether to turn on compaction, because compaction is just an optimization?

- ktable: makes me think of KSQL.
- kafka-compacted: it is not just compacted; more than that, it still has the ability of CDC.
- upsert-kafka: upsert is back, and I don't really want to see it again since we have CDC.

Best,
Jingsong

On Fri, Oct 23, 2020 at 2:21 AM Timo Walther <twal...@apache.org> wrote:

Hi Jark,

I would be fine with `connector=upsert-kafka`. Another idea would be to align the name with other available Flink connectors [1]:

`connector=kafka-cdc`

Regards,
Timo

[1] https://github.com/ververica/flink-cdc-connectors

On 22.10.20 17:17, Jark Wu wrote:

Another name is 'connector=upsert-kafka'; I think this can solve Timo's concern about the "compacted" word.

Materialize also uses the "ENVELOPE UPSERT" [1] keyword to identify such Kafka sources. I think "upsert" is a well-known terminology widely used in many systems and matches the behavior of how we handle the Kafka messages.

What do you think?

Best,
Jark

[1]: https://materialize.io/docs/sql/create-source/text-kafka/#upsert-on-a-kafka-topic

On Thu, 22 Oct 2020 at 22:53, Kurt Young <ykt...@gmail.com> wrote:

Good validation messages can't solve the broken user experience, especially since such an update-mode option would implicitly make half of the current Kafka options invalid or nonsensical.

Best,
Kurt

On Thu, Oct 22, 2020 at 10:31 PM Jark Wu <imj...@gmail.com> wrote:

Hi Timo, Seth,

The default value "inserting" for "mode" might not be suitable, because "debezium-json" emits changelog messages which include updates.

On Thu, 22 Oct 2020 at 22:10, Seth Wiesman <s...@ververica.com> wrote:

+1 for supporting upsert results into Kafka.

I have no comments on the implementation details.

As far as configuration goes, I tend to favor Timo's option where we add a "mode" property to the existing Kafka table with default value "inserting". If the mode is set to "updating" then the validation changes to the new requirements. I personally find it more intuitive than a separate connector; my fear is users won't understand it's the same physical Kafka sink under the hood, and it will lead to other confusion, like: does it offer the same persistence guarantees? I think we are capable of adding good validation messaging that solves Jark's and Kurt's concerns.

On Thu, Oct 22, 2020 at 8:51 AM Timo Walther <twal...@apache.org> wrote:

Hi Jark,

"calling it 'kafka-compacted' can even remind users to enable log compaction"

But sometimes users like to store a lineage of changes in their topics, independent of any ktable/kstream interpretation.

I'll let the majority decide on this topic so as not to further block this effort. But we might find a better name, like:

connector = kafka
mode = updating/inserting

OR

connector = kafka-updating

...

Regards,
Timo

On 22.10.20 15:24, Jark Wu wrote:

Hi Timo,

Thanks for your opinions.

1) Implementation
We will have a stateful operator to generate INSERT and UPDATE_BEFORE. This operator is keyby-ed (primary key as the shuffle key) after the source operator. The implementation of this operator is very similar to the existing `DeduplicateKeepLastRowFunction`. The operator will register a value state using the primary key fields as keys. When the value state is empty under the current key, we will emit INSERT for the input row. When the value state is not empty under the current key, we will emit UPDATE_BEFORE using the row in state, and emit UPDATE_AFTER using the input row. When the input row is a DELETE, we will clear the state and emit a DELETE row.

2) New option vs. new connector
> We recently simplified the table options to a minimum amount of characters to be as concise as possible in the DDL.
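(The stateful operator Jark describes in point 1 above can be sketched as follows. This is an illustrative Python sketch of the logic only, not Flink's actual `DeduplicateKeepLastRowFunction`; the record encoding is hypothetical.)

```python
# Sketch of the changelog-generating operator: a value state per primary
# key decides which changelog row kinds to emit for each upsert/delete.

def changelog_normalize(records):
    """records: iterable of (op, key, row), where op is 'U' (upsert)
    or 'D' (delete; row is ignored). Yields (row_kind, row) pairs."""
    state = {}  # stands in for Flink's keyed value state
    for op, key, row in records:
        if op == 'D':
            old = state.pop(key, None)
            if old is not None:
                # emit DELETE with all columns filled from the stored row
                yield ('DELETE', old)
        elif key not in state:
            state[key] = row
            yield ('INSERT', row)  # value state was empty under this key
        else:
            yield ('UPDATE_BEFORE', state[key])  # previous image from state
            state[key] = row
            yield ('UPDATE_AFTER', row)

out = list(changelog_normalize([('U', 'k1', 'a'), ('U', 'k1', 'b'), ('D', 'k1', None)]))
# [('INSERT', 'a'), ('UPDATE_BEFORE', 'a'), ('UPDATE_AFTER', 'b'), ('DELETE', 'b')]
```

Note the state holds the last row per key indefinitely, which is the state cost Timo asks about below in the thread.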
I think this is the reason why we want to introduce a new connector: because we can simplify the options in the DDL. For example, if using a new option, the DDL may look like this:

CREATE TABLE users (
  user_id BIGINT,
  user_name STRING,
  user_level STRING,
  region STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'model' = 'table',
  'topic' = 'pageviews_per_region',
  'properties.bootstrap.servers' = '...',
  'properties.group.id' = 'testGroup',
  'scan.startup.mode' = 'earliest',
  'key.format' = 'csv',
  'key.fields' = 'user_id',
  'value.format' = 'avro',
  'sink.partitioner' = 'hash'
);

If using a new connector, we can have different default values for the options and remove unnecessary options; the DDL can look like this, which is much more concise:

CREATE TABLE pageviews_per_region (
  user_id BIGINT,
  user_name STRING,
  user_level STRING,
  region STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'kafka-compacted',
  'topic' = 'pageviews_per_region',
  'properties.bootstrap.servers' = '...',
  'key.format' = 'csv',
  'value.format' = 'avro'
);

> When people read `connector=kafka-compacted` they might not know that it has ktable semantics. You don't need to enable log compaction in order to use a KTable as far as I know.

We don't need to let users know it has ktable semantics; as Konstantin mentioned, this may carry more implicit meaning than we want to imply here. I agree users don't need to enable log compaction, but from the production perspective, log compaction should always be enabled if the topic is used for this purpose. Calling it "kafka-compacted" can even remind users to enable log compaction.

I don't agree with introducing a "model = table/stream" option, or "connector=kafka-table", because this means we are introducing the Table vs. Stream concept from KSQL. However, we don't have such a top-level concept in Flink SQL now; this would further confuse users. In Flink SQL, everything is a STREAM; the differences are whether it is bounded or unbounded, and whether it is insert-only or changelog.

Best,
Jark

On Thu, 22 Oct 2020 at 20:39, Timo Walther <twal...@apache.org> wrote:

Hi Shengkai, Hi Jark,

thanks for this great proposal. It is time to finally connect the changelog processor with a compacted Kafka topic.

"The operator will produce INSERT rows, or additionally generate UPDATE_BEFORE rows for the previous image, or produce DELETE rows with all columns filled with values."

Could you elaborate a bit on the implementation details in the FLIP? How are UPDATE_BEFOREs generated? How much state is required to perform this operation?

From a conceptual and semantical point of view, I'm fine with the proposal. But I would like to share my opinion about how we expose this feature:

ktable vs. kafka-compacted

I'm against having an additional connector like `ktable` or `kafka-compacted`. We recently simplified the table options to a minimum amount of characters to be as concise as possible in the DDL. Therefore, I would keep `connector=kafka` and introduce an additional option, because a user wants to read "from Kafka", and the "how" should be determined in the lower options.

When people read `connector=ktable` they might not know that this is Kafka. Or they wonder where `kstream` is?

When people read `connector=kafka-compacted` they might not know that it has ktable semantics. You don't need to enable log compaction in order to use a KTable as far as I know. Log compaction and table semantics are orthogonal topics.

In the end we will need 3 types of information when declaring a Kafka connector:

CREATE TABLE ... WITH (
  connector = kafka      -- Some information about the connector
  end-offset = XXXX      -- Some information about the boundedness
  model = table/stream   -- Some information about interpretation
)

We can still apply all the constraints mentioned in the FLIP when `model` is set to `table`.

What do you think?

Regards,
Timo

On 21.10.20 14:19, Jark Wu wrote:

Hi,

IMO, if we are going to mix them in one connector:

1) Either users need to set some options to a specific value explicitly, e.g. "scan.startup.mode=earliest", "sink.partitioner=hash", etc. This makes the connector awkward to use: users may have to fix options one by one according to the exceptions. Besides, in the future it is still possible to use "sink.partitioner=fixed" (to reduce network cost) if users are aware of the partition routing; however, it's error-prone to have "fixed" as the default for compacted mode.

2) Or we make those options take different default values when "compacted=true". This would be more confusing and unpredictable if the default values of options change according to other options. What happens if we have a third mode in the future?

In terms of usage and options, it's very different from the original "kafka" connector. It would be handier to use and less error-prone if we separate them into two connectors. In the implementation layer, we can reuse code as much as possible.

Therefore, I'm still +1 to have a new connector. The "kafka-compacted" name sounds good to me.

Best,
Jark

On Wed, 21 Oct 2020 at 17:58, Konstantin Knauf <kna...@apache.org> wrote:

Hi Kurt, Hi Shengkai,

thanks for answering my questions and the additional clarifications. I don't have a strong opinion on whether to extend the "kafka" connector or to introduce a new connector. So, from my perspective, feel free to go with a separate connector. If we do introduce a new connector, I wouldn't call it "ktable" for the aforementioned reasons (in addition, we might suggest that there is also a "kstreams" connector, for symmetry reasons). I don't have a good alternative name, though; maybe "kafka-compacted" or "compacted-kafka".

Thanks,

Konstantin

On Wed, Oct 21, 2020 at 4:43 AM Kurt Young <ykt...@gmail.com> wrote:

Hi all,

I want to describe the discussion process which drove us to this conclusion; it might make some of the design choices easier to understand and keep everyone on the same page.

Back to the motivation: what functionality do we want to provide in the first place? We got a lot of feedback and questions from the mailing lists that people want to write not-insert-only messages into Kafka. They might do so intentionally or by accident, e.g. having written a non-windowed aggregate query or a non-windowed left outer join. And some users from the KSQL world also asked why Flink didn't leverage the Key concept of every Kafka topic and treat Kafka as a dynamically changing keyed table.

To work with Kafka better, we were thinking of extending the functionality of the current Kafka connector by letting it accept updates and deletions. But due to the limitations of Kafka, the update has to be "update by key", i.e. a table with a primary key.

This introduces a couple of conflicts with the current Kafka table's options:
1. key.fields: as said above, we need the Kafka table to have a primary key constraint, and users can also configure key.fields freely; this might cause friction. (Sure, we can do some sanity checks on this, but that also creates friction.)
2. sink.partitioner: to make the semantics right, we need to make sure all the updates on the same key are written to the same Kafka partition, so we should force a hash-by-key partitioner inside such a table. Again, this conflicts and creates friction with current user options.

The above things are solvable, though not perfect or most user friendly.
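(Kurt's point 2 above, that all updates on the same key must land in the same partition, can be illustrated with a toy partitioner. This is hypothetical illustration code, not the Kafka client; Kafka's real default partitioner uses a murmur2 hash, but any deterministic key hash gives the property needed here.)

```python
# Why the sink must hash-partition by primary key: a deterministic hash
# of the key routes every change for one key to the same partition, so
# consumers observe that key's changes in order.

def hash_partition(key: str, num_partitions: int) -> int:
    # toy deterministic stand-in for a real key hash
    return sum(key.encode()) % num_partitions

changes = [('user_1', 'v1'), ('user_2', 'v1'), ('user_1', 'v2')]
routed = [hash_partition(k, 4) for k, _ in changes]
assert routed[0] == routed[2]  # both updates to user_1 hit one partition
```

With a "fixed" or round-robin partitioner, the two updates to `user_1` could land in different partitions and be consumed out of order, breaking the upsert semantics.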
Let's take a look at the reading side. The keyed Kafka table contains two kinds of messages: upserts and deletions. What upsert means is: "If the key doesn't exist yet, it's an insert record; otherwise it's an update record." For the sake of correctness and simplicity, the Flink SQL engine also needs this information. If we interpret all messages as "update records", some queries or operators may not work properly; it's weird to see an update record when you haven't seen the insert record before.

So what Flink should do is: after reading the records from such a table, it needs to create a state recording which keys have been seen, and then generate the correct row kind accordingly. This couples the state with the data of the message queue, and it also creates conflicts with the current Kafka connector.

Think about a user suspending a running job (which now contains some reading state) and then changing the start offset of the reader. By changing the reading offset, they actually change the whole story of which records should be insert messages and which records should be update messages.
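(The start-offset hazard described above can be made concrete with a small sketch. Illustrative Python only, not Flink code; the log encoding is hypothetical.)

```python
# Whether a record reads as INSERT or UPDATE depends on everything read
# before it, so changing the start offset rewrites the interpretation
# of the whole stream.

def interpret(log, start_offset=0):
    seen, kinds = set(), []
    for op, key in log[start_offset:]:
        if op == 'D':
            # a delete for a never-seen key is the "weird situation"
            kinds.append('DELETE' if key in seen else 'DELETE(unseen key!)')
            seen.discard(key)
        else:
            kinds.append('UPDATE' if key in seen else 'INSERT')
            seen.add(key)
    return kinds

log = [('U', 'a'), ('U', 'a'), ('D', 'a')]
assert interpret(log, 0) == ['INSERT', 'UPDATE', 'DELETE']
assert interpret(log, 1) == ['INSERT', 'DELETE']       # the update now looks like an insert
assert interpret(log, 2) == ['DELETE(unseen key!)']    # a delete with no prior insert
```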
It will also force Flink to deal with another weird situation: it might receive a deletion for a non-existing message.

We were unsatisfied with all the friction and conflicts it would create if we enabled "upsert & deletion" support in the current Kafka connector. And later we began to realize that we shouldn't treat it as a normal message queue, but should treat it as a changing keyed table. We should be able to always get the whole data of such a table (by disabling the start offset option), and we can also read the changelog out of such a table. It's like an HBase table with binlog support but without random access capability (which can be fulfilled by Flink's state).

So our intention was: instead of telling and persuading users what kinds of options they should or should not use by extending the current Kafka connector with upsert support, we actually create a whole new and different connector that has totally different abstractions in the SQL layer, and it should be treated totally differently from the current Kafka connector.

Hope this can clarify some of the concerns.
> >> > > > >>>>>>>>>> > >> > > > >>>>>>>>>> Best, > >> > > > >>>>>>>>>> Kurt > >> > > > >>>>>>>>>> > >> > > > >>>>>>>>>> > >> > > > >>>>>>>>>> On Tue, Oct 20, 2020 at 5:20 PM Shengkai Fang < > >> > > > >> fskm...@gmail.com > >> > > > >>>> > >> > > > >>>>>>> wrote: > >> > > > >>>>>>>>>> > >> > > > >>>>>>>>>>> Hi devs, > >> > > > >>>>>>>>>>> > >> > > > >>>>>>>>>>> As many people are still confused about the difference > >> > option > >> > > > >>>>>>>>> behaviours > >> > > > >>>>>>>>>>> between the Kafka connector and KTable connector, Jark > >> and > >> > I > >> > > > >>> list > >> > > > >>>>> the > >> > > > >>>>>>>>>>> differences in the doc[1]. > >> > > > >>>>>>>>>>> > >> > > > >>>>>>>>>>> Best, > >> > > > >>>>>>>>>>> Shengkai > >> > > > >>>>>>>>>>> > >> > > > >>>>>>>>>>> [1] > >> > > > >>>>>>>>>>> > >> > > > >>>>>>>>>>> > >> > > > >>>>>>>>>> > >> > > > >>>>>>>>> > >> > > > >>>>>>> > >> > > > >>>>> > >> > > > >>>> > >> > > > >>> > >> > > > >> > >> > > > > >> > > > >> > > >> > https://docs.google.com/document/d/13oAWAwQez0lZLsyfV21BfTEze1fc2cz4AZKiNOyBNPk/edit > >> > > > >>>>>>>>>>> > >> > > > >>>>>>>>>>> Shengkai Fang <fskm...@gmail.com> 于2020年10月20日周二 > >> > 下午12:05写道: > >> > > > >>>>>>>>>>> > >> > > > >>>>>>>>>>>> Hi Konstantin, > >> > > > >>>>>>>>>>>> > >> > > > >>>>>>>>>>>> Thanks for your reply. > >> > > > >>>>>>>>>>>> > >> > > > >>>>>>>>>>>>> It uses the "kafka" connector and does not specify a > >> > > primary > >> > > > >>>> key. > >> > > > >>>>>>>>>>>> The dimensional table `users` is a ktable connector > >> and we > >> > > > >> can > >> > > > >>>>>>>>> specify > >> > > > >>>>>>>>>>> the > >> > > > >>>>>>>>>>>> pk on the KTable. > >> > > > >>>>>>>>>>>> > >> > > > >>>>>>>>>>>>> Will it possible to use a "ktable" as a dimensional > >> table > >> > > in > >> > > > >>>>>>>>> FLIP-132 > >> > > > >>>>>>>>>>>> Yes. 
We can specify the watermark on the KTable and > it > >> can > >> > > be > >> > > > >>>> used > >> > > > >>>>>>>>> as a > >> > > > >>>>>>>>>>>> dimension table in temporal join. > >> > > > >>>>>>>>>>>> > >> > > > >>>>>>>>>>>>> Introduce a new connector vs introduce a new > property > >> > > > >>>>>>>>>>>> The main reason behind is that the KTable connector > >> almost > >> > > > >> has > >> > > > >>> no > >> > > > >>>>>>>>>> common > >> > > > >>>>>>>>>>>> options with the Kafka connector. The options that > can > >> be > >> > > > >>> reused > >> > > > >>>> by > >> > > > >>>>>>>>>>> KTable > >> > > > >>>>>>>>>>>> connectors are 'topic', > 'properties.bootstrap.servers' > >> and > >> > > > >>>>>>>>>>>> 'value.fields-include' . We can't set cdc format for > >> > > > >>> 'key.format' > >> > > > >>>>> and > >> > > > >>>>>>>>>>>> 'value.format' in KTable connector now, which is > >> > available > >> > > > >> in > >> > > > >>>>> Kafka > >> > > > >>>>>>>>>>>> connector. Considering the difference between the > >> options > >> > we > >> > > > >>> can > >> > > > >>>>> use, > >> > > > >>>>>>>>>>> it's > >> > > > >>>>>>>>>>>> more suitable to introduce an another connector > rather > >> > than > >> > > a > >> > > > >>>>>>>>> property. > >> > > > >>>>>>>>>>>> > >> > > > >>>>>>>>>>>> We are also fine to use "compacted-kafka" as the name > >> of > >> > the > >> > > > >>> new > >> > > > >>>>>>>>>>>> connector. What do you think? > >> > > > >>>>>>>>>>>> > >> > > > >>>>>>>>>>>> Best, > >> > > > >>>>>>>>>>>> Shengkai > >> > > > >>>>>>>>>>>> > >> > > > >>>>>>>>>>>> Konstantin Knauf <kna...@apache.org> 于2020年10月19日周一 > >> > > > >> 下午10:15写道: > >> > > > >>>>>>>>>>>> > >> > > > >>>>>>>>>>>>> Hi Shengkai, > >> > > > >>>>>>>>>>>>> > >> > > > >>>>>>>>>>>>> Thank you for driving this effort. I believe this a > >> very > >> > > > >>>> important > >> > > > >>>>>>>>>>> feature > >> > > > >>>>>>>>>>>>> for many users who use Kafka and Flink SQL > together. 
A > >> > few > >> > > > >>>>> questions > >> > > > >>>>>>>>>> and > >> > > > >>>>>>>>>>>>> thoughts: > >> > > > >>>>>>>>>>>>> > >> > > > >>>>>>>>>>>>> * Is your example "Use KTable as a > reference/dimension > >> > > > >> table" > >> > > > >>>>>>>>> correct? > >> > > > >>>>>>>>>>> It > >> > > > >>>>>>>>>>>>> uses the "kafka" connector and does not specify a > >> primary > >> > > > >> key. > >> > > > >>>>>>>>>>>>> > >> > > > >>>>>>>>>>>>> * Will it be possible to use a "ktable" table > directly > >> > as a > >> > > > >>>>>>>>>> dimensional > >> > > > >>>>>>>>>>>>> table in temporal join (*based on event time*) > >> > (FLIP-132)? > >> > > > >>> This > >> > > > >>>> is > >> > > > >>>>>>>>> not > >> > > > >>>>>>>>>>>>> completely clear to me from the FLIP. > >> > > > >>>>>>>>>>>>> > >> > > > >>>>>>>>>>>>> * I'd personally prefer not to introduce a new > >> connector > >> > > and > >> > > > >>>>> instead > >> > > > >>>>>>>>>> to > >> > > > >>>>>>>>>>>>> extend the Kafka connector. We could add an > additional > >> > > > >>> property > >> > > > >>>>>>>>>>>>> "compacted" > >> > > > >>>>>>>>>>>>> = "true"|"false". If it is set to "true", we can add > >> > > > >>> additional > >> > > > >>>>>>>>>>> validation > >> > > > >>>>>>>>>>>>> logic (e.g. "scan.startup.mode" can not be set, > >> primary > >> > key > >> > > > >>>>>>>>> required, > >> > > > >>>>>>>>>>>>> etc.). If we stick to a separate connector I'd not > >> call > >> > it > >> > > > >>>>> "ktable", > >> > > > >>>>>>>>>> but > >> > > > >>>>>>>>>>>>> rather "compacted-kafka" or similar. KTable seems to > >> > carry > >> > > > >>> more > >> > > > >>>>>>>>>> implicit > >> > > > >>>>>>>>>>>>> meaning than we want to imply here. > >> > > > >>>>>>>>>>>>> > >> > > > >>>>>>>>>>>>> * I agree that this is not a bounded source. 
If we want to support a bounded mode, that is an orthogonal concern that also applies to other unbounded sources.

Best,

Konstantin

On Mon, Oct 19, 2020 at 3:26 PM Jark Wu <imj...@gmail.com> wrote:

Hi Danny,

First of all, we didn't introduce any concepts from KSQL (e.g. the Stream vs Table notion). This new connector will produce a changelog stream, so it's still a dynamic table and doesn't conflict with Flink core concepts.

The "ktable" is just a connector name; we can also call it "compacted-kafka" or something else. Calling it "ktable" is just because KSQL users can migrate to Flink SQL easily.

Regarding why we introduce a new connector vs a new property in the existing kafka connector: I think the main reason is that we want a clear separation between the two use cases, because they are very different.
We also listed reasons in the FLIP, including:

1) It's hard to explain what the behavior is when users specify a start offset at a middle position (e.g. how to process a delete event whose key does not exist). It's dangerous if users do that, so we don't provide the offset option in the new connector at the moment.
2) It's a different perspective/abstraction on the same kafka topic (append vs. upsert). It would be easier to understand if we separate them instead of mixing them in one connector. The new connector requires a hash sink partitioner, a declared primary key, and a regular format. If we mix them in one connector, it might be confusing how to use the options correctly.
3) The semantics of the KTable connector are just the same as KTable in Kafka Streams, so it's very handy for Kafka Streams and KSQL users. We have seen several questions in the mailing list asking how to model a KTable and how to join a KTable in Flink SQL.
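To make the separation concrete, a DDL for the new connector could look roughly like the following sketch. The connector name matches the `upsert-kafka` name the FLIP eventually adopted (see the vote message at the top of the thread); the table schema and option values are hypothetical illustrations, and the exact option set is whatever the FLIP finally specifies:

```sql
-- Hypothetical upsert table: primary key required, key and value
-- formats declared, and no 'scan.startup.mode' option at all, so the
-- source always reads the compacted topic from the earliest offset.
CREATE TABLE users (
  user_id BIGINT,
  user_name STRING,
  region STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',  -- a separate connector, not a flag on 'kafka'
  'topic' = 'users',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
```

Under this model, a record with a non-null value is interpreted as an upsert on its key, and a record with a null value as a delete, matching the KTable interpretation in Kafka Streams.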
Best,
Jark

On Mon, 19 Oct 2020 at 19:53, Jark Wu <imj...@gmail.com> wrote:

Hi Jingsong,

As the FLIP describes, "KTable connector produces a changelog stream, where each data record represents an update or delete event.". Therefore, a ktable source is an unbounded stream source. Selecting from a ktable source is similar to selecting from a kafka source with the debezium-json format, in that it never ends and the results are continuously updated.

It's possible to have a bounded ktable source in the future, for example by adding an option 'bounded=true' or 'end-offset=xxx'. In this way, the ktable will produce a bounded changelog stream. So I think this can be added as a compatible feature in the future.

I don't think we should associate it with KSQL-related concepts. Actually, we didn't introduce any concepts from KSQL (e.g. the Stream vs Table notion). The "ktable" is just a connector name; we can also call it "compacted-kafka" or something else.
Calling it "ktable" is just because KSQL users can migrate to Flink SQL easily.

Regarding "value.fields-include", this is an option introduced in FLIP-107 for the Kafka connector. I think we should keep the same behavior as the Kafka connector. I'm not sure what the default behavior of KSQL is, but I guess it also stores the keys in the value, judging from this example in the docs (see the "users_original" table) [1].

Best,
Jark

[1]: https://docs.confluent.io/current/ksqldb/tutorials/basics-local.html#create-a-stream-and-table

On Mon, 19 Oct 2020 at 18:17, Danny Chan <yuzhao....@gmail.com> wrote:

The concept seems to conflict with the Flink abstraction "dynamic table": in Flink we see both "stream" and "table" as a dynamic table.

I think we should first make clear how to express stream- and table-specific features on one "dynamic table". It is more natural for KSQL because KSQL takes stream and table as different abstractions for representing collections; in KSQL, only a table is mutable and can have a primary key.

Does this connector belong to the "table" scope or the "stream" scope?

Some of the concepts (such as a primary key on a stream) should be suitable for all connectors, not just Kafka. Shouldn't this be an extension of the existing Kafka connector instead of a totally new connector? What about the other connectors?

Because this touches the core abstraction of Flink, we had better have a top-down overall design; following KSQL directly is not the answer.

P.S. For the source:
> Shouldn't this be an extension of the existing Kafka connector instead of a totally new connector?
How could we achieve that (e.g. set up the parallelism correctly)?

Best,
Danny Chan

Jingsong Li <jingsongl...@gmail.com> wrote on Oct 19, 2020 at 5:17 PM (+0800):

Thanks Shengkai for your proposal.

+1 for this feature.

> Future Work: Support bounded KTable source

I don't think it should be future work; I think it is one of the important concepts of this FLIP. We need to understand it now.

Intuitively, a ktable in my opinion is a bounded table rather than a stream, so a select should produce a bounded table by default.

I think we should list the related Kafka knowledge, because the word `ktable` is easy to associate with KSQL-related concepts. (If possible, it's better to unify with it.)

What do you think?

> value.fields-include

What about the default behavior of KSQL?
Best,
Jingsong

On Mon, Oct 19, 2020 at 4:33 PM Shengkai Fang <fskm...@gmail.com> wrote:

Hi, devs.

Jark and I want to start a new FLIP to introduce the KTable connector. The KTable is a shortcut for "Kafka Table"; it also has the same semantics as the KTable notion in Kafka Streams.

FLIP-149:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-149%3A+Introduce+the+KTable+Connector

Currently many users have expressed their need for upsert Kafka on the mailing lists and in issues. The KTable connector has several benefits for users:

1. Users are able to interpret a compacted Kafka topic as an upsert stream in Apache Flink.
And they are also able to write a changelog stream to Kafka (into a compacted topic).
2. As part of a real-time pipeline, store a join or aggregate result (which may contain updates) into a Kafka topic for further calculation.
3. The semantics of the KTable connector are just the same as KTable in Kafka Streams, so it's very handy for Kafka Streams and KSQL users. We have seen several questions in the mailing list asking how to model a KTable and how to join a KTable in Flink SQL.

We hope it can expand the usage of Flink with Kafka.

I'm looking forward to your feedback.
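For the dimension-table use case raised in the thread, a query against such a table might be sketched as follows. The table and column names are hypothetical, the event-time temporal join syntax follows FLIP-132, and the sketch assumes watermarks are defined on both tables:

```sql
-- Hypothetical enrichment query: join an append-only order stream
-- against the version of each user row valid as of the order's
-- event time, rather than only the latest version.
SELECT o.order_id, o.amount, u.region
FROM orders AS o
JOIN users FOR SYSTEM_TIME AS OF o.order_time AS u
  ON o.user_id = u.user_id;
```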
Best,
Shengkai

--
Best, Jingsong Lee

--
Konstantin Knauf
https://twitter.com/snntrable
https://github.com/knaufk

--
Seth Wiesman | Solutions Architect
<https://www.ververica.com/>