Thanks Bowen, Jark and Dian for your feedback and suggestions.
I have reorganized the proposal based on your suggestions and tried to sketch out the DDLs:
1. datagen source:
- easy startup/testing for streaming jobs
- performance testing
DDL:
CREATE TABLE user (
id BIGINT,
age INT,
description STRING
) WITH (
'connector.type' = 'datagen',
'connector.rows-per-second'='100',
'connector.total-records'='1000000',
'schema.id.generator' = 'sequence',
'schema.id.generator.start' = '1',
'schema.age.generator' = 'random',
'schema.age.generator.min' = '0',
'schema.age.generator.max' = '100',
'schema.description.generator' = 'random',
'schema.description.generator.length' = '100'
)
The default generator is 'random'.
Hi Jark, I don't want to bring in complicated patterns, because they can be
expressed through computed columns (see the sketch below). It is also hard to
define standard patterns, so I think we can leave that to the future.
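As a rough sketch (reusing the datagen options proposed above; the table name,
computed column and expression are only illustrative), a domain-specific value
can be derived from a generated field with a computed column instead of a
connector-level pattern:
CREATE TABLE user_source (
id BIGINT,
-- computed column: derive a user-readable name from the generated id
user_name AS CONCAT('User_', CAST(id AS STRING))
) WITH (
'connector.type' = 'datagen',
'connector.rows-per-second' = '100',
'schema.id.generator' = 'sequence',
'schema.id.generator.start' = '1'
)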
2. print sink:
- easy testing for streaming jobs
- very useful in production debugging
DDL:
CREATE TABLE print_table (
...
) WITH (
'connector.type' = 'print'
)
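Usage would be a plain insert, for example (only a sketch, assuming print_table
is declared with the same schema as the datagen table from 1., and using
backticks since `user` may be a reserved keyword):
INSERT INTO print_table SELECT id, age, description FROM `user`;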
3. blackhole sink
- very useful for high-performance testing of Flink
- I've also run into users who do their output in a UDF rather than in a sink,
so they need this sink as well.
DDL:
CREATE TABLE blackhole_table (
...
) WITH (
'connector.type' = 'blackhole'
)
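For a throughput test, the pipeline can simply be pointed at the blackhole
table so that sink back pressure is out of the picture (again only a sketch;
kafka_source stands for whatever real source table is being measured, and
blackhole_table is assumed to have a matching schema):
INSERT INTO blackhole_table SELECT * FROM kafka_source;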
What do you think?
Best,
Jingsong Lee
On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <[email protected]> wrote:
> Thanks Jingsong for bringing up this discussion. +1 to this proposal. I
> think Bowen's proposal makes a lot of sense to me.
>
> This is also a painful problem for PyFlink users. Currently there is no
> built-in easy-to-use table source/sink, and it requires users to write a lot
> of code to try out PyFlink. This is especially painful for new users who
> are not familiar with PyFlink/Flink. I have also gone through the tedious
> process Bowen described, e.g. writing a random source connector, a print sink
> and also a blackhole sink, as there are no built-in ones to use.
>
> Regards,
> Dian
>
> > On Mar 22, 2020, at 11:24 AM, Jark Wu <[email protected]> wrote:
> >
> > +1 to Bowen's proposal. I have also seen many requests for such built-in
> > connectors.
> >
> > I will leave some of my thoughts here:
> >
> >> 1. datagen source (random source)
> > I think we can merge the functionality of the sequence source into the
> > random source to allow users to customize their data values.
> > Flink can generate random data according to the field types, and users
> > can customize their values to be more domain specific, e.g.
> > 'field.user'='User_[1-9]{0,1}'.
> > This would be similar to kafka-connect-datagen [1].
> >
> >> 2. console sink (print sink)
> > This will be very useful in production debugging, to easily output an
> > intermediate view or result view to a `.out` file,
> > so that we can look into the data representation or check dirty data.
> > This should work out of the box without manual DDL registration.
> >
> >> 3. blackhole sink (no output sink)
> > This is very useful for high-performance testing of Flink, to measure the
> > throughput of the whole pipeline without a sink.
> > Presto also provides this as a built-in connector [2].
> >
> > Best,
> > Jark
> >
> > [1]:
> >
> https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> >
> >
> > On Sat, 21 Mar 2020 at 12:31, Bowen Li <[email protected]> wrote:
> >
> >> +1.
> >>
> >> I would suggest taking a step even further and seeing what users really
> >> need to test/try/play with the Table API and Flink SQL. Besides this one,
> >> here are some more sources and sinks that I have developed or used
> >> previously to facilitate building Flink table/SQL pipelines.
> >>
> >>
> >>   1. random input data source
> >>      - should generate random data at a specified rate according to the
> >>        schema
> >>      - purposes
> >>        - test the Flink pipeline and that data ends up in external
> >>          storage correctly
> >>        - stress test the Flink sink as well as tune the external storage
> >>   2. print data sink
> >>      - should print data in row format to the console
> >>      - purposes
> >>        - make it easier to test Flink SQL jobs e2e in an IDE
> >>        - test the Flink pipeline and ensure the output data format/value
> >>          is correct
> >>   3. no output data sink
> >>      - just swallows output data without doing anything
> >>      - purpose
> >>        - evaluate and tune the performance of the Flink source and the
> >>          whole pipeline; users don't need to worry about sink back
> >>          pressure
> >>
> >> These may all be taken into consideration as an effort to lower the
> >> barrier to running Flink SQL/Table API and to facilitate users' daily
> >> work.
> >>
> >> Cheers,
> >> Bowen
> >>
> >>
> >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <[email protected]>
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I heard some users complain that Table API/SQL jobs are difficult to
> >>> test. Now with the SQL client, users are more and more inclined to use
> >>> it for testing rather than writing programs.
> >>> The most common example is the Kafka source. If users need to test their
> >>> SQL output and checkpointing, they need to:
> >>>
> >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> >>> - 2. Write a program to mock input records and produce them to the Kafka
> >>>   topic.
> >>> - 3. Then test in Flink.
> >>>
> >>> Steps 1 and 2 are annoying, even though this test is E2E.
> >>>
> >>> Then I found StatefulSequenceSource. It is very good because it already
> >>> deals with checkpointing, so it plays well with the checkpoint mechanism.
> >>> Usually, users have checkpointing turned on in production.
> >>>
> >>> With computed columns, users can easily create a sequence source DDL with
> >>> the same schema as the Kafka DDL. Then they can test inside Flink without
> >>> needing to launch anything else.
> >>>
> >>> Have you considered this? What do you think?
> >>>
> >>> CC: @Aljoscha Krettek <[email protected]> the author
> >>> of StatefulSequenceSource.
> >>>
> >>> Best,
> >>> Jingsong Lee
> >>>
> >>
>
>
--
Best, Jingsong Lee