Hi Jingsong,

Thanks for bringing this up. Generally, it's a very good proposal.

About the datagen source, do you think we need to add more columns with
various types?
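
For example, a table covering a few more types could look roughly like this
(just a sketch; the generator option keys for these extra types are my
assumption, not part of the current proposal):

CREATE TABLE orders (
    order_id BIGINT,
    price DOUBLE,
    is_vip BOOLEAN,
    order_time TIMESTAMP(3)
) WITH (
    'connector.type' = 'datagen',
    'connector.rows-per-second' = '100',

    -- sequence generator for the key column
    'schema.order_id.generator' = 'sequence',
    'schema.order_id.generator.start' = '1',

    -- hypothetical min/max options for a numeric column
    'schema.price.generator' = 'random',
    'schema.price.generator.min' = '1',
    'schema.price.generator.max' = '1000'
)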

About the print sink, do we need to specify the schema?

Jingsong Li <jingsongl...@gmail.com> wrote on Mon, Mar 23, 2020 at 1:51 PM:

> Thanks Bowen, Jark and Dian for your feedback and suggestions.
>
> I have reorganized the proposal based on your suggestions and tried to
> sketch the DDLs:
>
> 1. datagen source:
> - easy startup/testing for streaming jobs
> - performance testing
>
> DDL:
> CREATE TABLE user (
>     id BIGINT,
>     age INT,
>     description STRING
> ) WITH (
>     'connector.type' = 'datagen',
>     'connector.rows-per-second'='100',
>     'connector.total-records'='1000000',
>
>     'schema.id.generator' = 'sequence',
>     'schema.id.generator.start' = '1',
>
>     'schema.age.generator' = 'random',
>     'schema.age.generator.min' = '0',
>     'schema.age.generator.max' = '100',
>
>     'schema.description.generator' = 'random',
>     'schema.description.generator.length' = '100'
> )
>
> The default is the random generator.
> Hi Jark, I don't want to introduce complicated patterns, because they can be
> expressed through computed columns. It is also hard to define standard
> patterns, so I think we can leave that to the future.
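>
> For example, a patterned value could be derived from a generated column
> roughly like this (just a sketch to illustrate the idea; the expression is
> not part of the proposal):
>
> CREATE TABLE users (
>     id BIGINT,
>     -- computed column deriving a patterned name from the generated id
>     user_name AS CONCAT('User_', CAST(id AS STRING))
> ) WITH (
>     'connector.type' = 'datagen',
>     'schema.id.generator' = 'sequence',
>     'schema.id.generator.start' = '1'
> )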
>
> 2. print sink:
> - easy testing for streaming jobs
> - very useful in production debugging
>
> DDL:
> CREATE TABLE print_table (
>     ...
> ) WITH (
>     'connector.type' = 'print'
> )
>
> 3. blackhole sink
> - very useful for high-performance testing of Flink
> - I've also run into users who use a UDF to output results instead of a sink,
> so they need this sink as well.
>
> DDL:
> CREATE TABLE blackhole_table (
>     ...
> ) WITH (
>     'connector.type' = 'blackhole'
> )
>
> What do you think?
>
> Best,
> Jingsong Lee
>
> On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <dian0511...@gmail.com> wrote:
>
> > Thanks Jingsong for bringing up this discussion. +1 to this proposal.
> > Bowen's proposal makes a lot of sense to me.
> >
> > This is also a painful problem for PyFlink users. Currently there is no
> > built-in easy-to-use table source/sink, and users have to write a lot of
> > code just to try out PyFlink. This is especially painful for new users who
> > are not familiar with PyFlink/Flink. I have also gone through the tedious
> > process Bowen described, e.g. writing a random source connector, a print
> > sink and a blackhole sink, as there are no built-in ones to use.
> >
> > Regards,
> > Dian
> >
> > > On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> > >
> > > +1 to Bowen's proposal. I have also seen many requests for such built-in
> > > connectors.
> > >
> > > I will leave some of my thoughts here:
> > >
> > >> 1. datagen source (random source)
> > > I think we can merge the functionality of the sequence source into the
> > > random source to allow users to customize their data values.
> > > Flink can generate random data according to the field types, and users
> > > can customize the values to be more domain specific, e.g.
> > > 'field.user'='User_[1-9]{0,1}'
> > > This will be similar to kafka-connect-datagen [1].
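> > >
> > > As a rough DDL sketch of what that might look like (the option keys and
> > > pattern syntax are only an illustration borrowed from
> > > kafka-connect-datagen, not a settled design):
> > >
> > > CREATE TABLE users (
> > >     `user` STRING,
> > >     age INT
> > > ) WITH (
> > >     'connector.type' = 'datagen',
> > >     -- hypothetical: constrain generated values with a simple pattern
> > >     'field.user' = 'User_[1-9]{0,1}',
> > >     -- hypothetical: other fields fall back to type-based random generation
> > >     'field.age.min' = '0',
> > >     'field.age.max' = '100'
> > > )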
> > >
> > >> 2. console sink (print sink)
> > > This will be very useful in production debugging, to easily output an
> > > intermediate view or result view to a `.out` file,
> > > so that we can look into the data representation or check for dirty data.
> > > This should work out of the box without manual DDL registration.
> > >
> > >> 3. blackhole sink (no output sink)
> > > This is very useful for high-performance testing of Flink, to measure
> > > the throughput of the whole pipeline without a sink.
> > > Presto also provides this as a built-in connector [2].
> > >
> > > Best,
> > > Jark
> > >
> > > [1]: https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > >
> > >
> > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> > >
> > >> +1.
> > >>
> > >> I would suggest taking a step even further and seeing what users really
> > >> need to test/try/play with the Table API and Flink SQL. Besides this one,
> > >> here are some more sources and sinks that I have developed or used
> > >> previously to facilitate building Flink table/SQL pipelines.
> > >>
> > >>
> > >>   1. random input data source
> > >>      - should generate random data at a specified rate according to schema
> > >>      - purposes
> > >>         - test that the Flink pipeline works and data ends up in external
> > >>           storage correctly
> > >>         - stress test the Flink sink as well as tune the external storage
> > >>   2. print data sink
> > >>      - should print data in row format to the console
> > >>      - purposes
> > >>         - make it easier to test a Flink SQL job end-to-end in the IDE
> > >>         - test the Flink pipeline and ensure the output data format/value
> > >>           is correct
> > >>   3. no output data sink
> > >>      - just swallow output data without doing anything
> > >>      - purpose
> > >>         - evaluate and tune the performance of the Flink source and the
> > >>           whole pipeline; users don't need to worry about sink back pressure
> > >>
> > >> These may be taken into consideration all together as an effort to lower
> > >> the threshold of running Flink SQL/Table API, and facilitate users'
> > >> daily work.
> > >>
> > >> Cheers,
> > >> Bowen
> > >>
> > >>
> > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> I heard some users complain that the Table API is difficult to test. Now
> > >>> with the SQL client, users are more and more inclined to use it to test
> > >>> rather than to write programs.
> > >>> The most common example is the Kafka source. If users need to test their
> > >>> SQL output and checkpointing, they need to:
> > >>>
> > >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> > >>> - 2. Write a program to mock input records and produce them to the Kafka
> > >>> topic.
> > >>> - 3. Then test in Flink.
> > >>>
> > >>> Steps 1 and 2 are annoying, although this test is E2E.
> > >>>
> > >>> Then I found StatefulSequenceSource. It is very good because it already
> > >>> deals with checkpointing, so it fits the checkpoint mechanism well.
> > >>> Usually, users have checkpointing turned on in production.
> > >>>
> > >>> With computed columns, users can easily create a sequence source DDL with
> > >>> the same schema as the Kafka DDL. Then they can test inside Flink without
> > >>> launching anything else.
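> > >>>
> > >>> For example, roughly something like this (only a sketch; the connector
> > >>> name and options are illustrative, assuming a built-in sequence source):
> > >>>
> > >>> -- same schema as the Kafka table, but backed by a sequence source
> > >>> CREATE TABLE test_orders (
> > >>>     order_id BIGINT,
> > >>>     -- computed column providing a time attribute, like the Kafka DDL
> > >>>     order_time AS PROCTIME()
> > >>> ) WITH (
> > >>>     -- hypothetical connector name and options
> > >>>     'connector.type' = 'sequence',
> > >>>     'sequence.start' = '1',
> > >>>     'sequence.end' = '1000000'
> > >>> )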
> > >>>
> > >>> Have you considered this? What do you think?
> > >>>
> > >>> CC: @Aljoscha Krettek <aljos...@apache.org> the author
> > >>> of StatefulSequenceSource.
> > >>>
> > >>> Best,
> > >>> Jingsong Lee
> > >>>
> > >>
> >
> >
>
> --
> Best, Jingsong Lee
>


-- 

Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: libenc...@gmail.com; libenc...@pku.edu.cn
