Thanks Bowen, Jark and Dian for your feedback and suggestions.
I have reorganized the proposal based on your suggestions and tried to sketch out the DDLs:
1. datagen source:
- easy startup/testing for streaming jobs
- performance testing
DDL:
CREATE TABLE user (
id BIGINT,
age INT,
description STRING
) WITH (
'connector.type' = 'datagen',
'connector.rows-per-second'='100',
'connector.total-records'='1000000',
'schema.id.generator' = 'sequence',
'schema.id.generator.start' = '1',
'schema.age.generator' = 'random',
'schema.age.generator.min' = '0',
'schema.age.generator.max' = '100',
'schema.description.generator' = 'random',
'schema.description.generator.length' = '100'
)
The default generator is 'random'.
Hi Jark, I don't want to bring in complicated patterns, because they can be
expressed through computed columns (see the sketch below). It is also hard to
define standard patterns, so I think we can leave that to the future.
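As a rough sketch (reusing the datagen options proposed above; the table name,
computed column and expression are only illustrative), a domain-specific value
can be derived from a generated field with a computed column instead of a
connector-level pattern:
CREATE TABLE user_source (
id BIGINT,
-- computed column: derive a user-readable name from the generated id
user_name AS CONCAT('User_', CAST(id AS STRING))
) WITH (
'connector.type' = 'datagen',
'connector.rows-per-second' = '100',
'schema.id.generator' = 'sequence',
'schema.id.generator.start' = '1'
)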
2. print sink:
- easy testing for streaming jobs
- very useful in production debugging
DDL:
CREATE TABLE print_table (
...
) WITH (
'connector.type' = 'print'
)
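Usage would be a plain insert, for example (only a sketch, assuming print_table
is declared with the same schema as the datagen table from 1., and using
backticks since `user` may be a reserved keyword):
INSERT INTO print_table SELECT id, age, description FROM `user`;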
3. blackhole sink
- very useful for high-performance testing of Flink
- I've also run into users who do their output in a UDF rather than in a sink,
so they need this sink as well.
DDL:
CREATE TABLE blackhole_table (
...
) WITH (
'connector.type' = 'blackhole'
)
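For a throughput test, the pipeline can simply be pointed at the blackhole
table so that sink back pressure is out of the picture (again only a sketch;
kafka_source stands for whatever real source table is being measured, and
blackhole_table is assumed to have a matching schema):
INSERT INTO blackhole_table SELECT * FROM kafka_source;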
What do you think?
Best,
Jingsong Lee
On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <[email protected]> wrote:
> Thanks Jingsong for bringing up this discussion. +1 to this proposal. I
> think Bowen's proposal makes a lot of sense to me.
>
> This is also a painful problem for PyFlink users. Currently there is no
> built-in easy-to-use table source/sink, and it requires users to write a lot
> of code to try out PyFlink. This is especially painful for new users who
> are not familiar with PyFlink/Flink. I have also gone through the tedious
> process Bowen described, e.g. writing a random source connector, a print sink
> and also a blackhole sink, as there are no built-in ones to use.
>
> Regards,
> Dian
>
> > On Mar 22, 2020, at 11:24 AM, Jark Wu <[email protected]> wrote:
> >
> > +1 to Bowen's proposal. I have also seen many requests for such built-in
> > connectors.
> >
> > I will leave some of my thoughts here:
> >
> >> 1. datagen source (random source)
> > I think we can merge the functionality of the sequence source into the
> > random source to allow users to customize their data values.
> > Flink can generate random data according to the field types, and users
> > can customize their values to be more domain specific, e.g.
> > 'field.user'='User_[1-9]{0,1}'.
> > This would be similar to kafka-connect-datagen [1].
> >
> >> 2. console sink (print sink)
> > This will be very useful in production debugging, to easily output an
> > intermediate view or result view to a `.out` file,
> > so that we can look into the data representation or check dirty data.
> > This should work out of the box without manual DDL registration.
> >
> >> 3. blackhole sink (no output sink)
> > This is very useful for high-performance testing of Flink, to measure the
> > throughput of the whole pipeline without a sink.
> > Presto also provides this as a built-in connector [2].
> >
> > Best,
> > Jark
> >
> > [1]:
> >
> https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> >
> >
> > On Sat, 21 Mar 2020 at 12:31, Bowen Li <[email protected]> wrote:
> >
> >> +1.
> >>
> >> I would suggest taking a step even further and seeing what users really
> >> need to test/try/play with the Table API and Flink SQL. Besides this one,
> >> here are some more sources and sinks that I have developed or used
> >> previously to facilitate building Flink table/SQL pipelines.
> >>
> >>
> >>   1. random input data source
> >>      - should generate random data at a specified rate according to the
> >>        schema
> >>      - purposes
> >>        - test the Flink pipeline and that data ends up in external
> >>          storage correctly
> >>        - stress test the Flink sink as well as tune the external storage
> >>   2. print data sink
> >>      - should print data in row format to the console
> >>      - purposes
> >>        - make it easier to test Flink SQL jobs e2e in an IDE
> >>        - test the Flink pipeline and ensure the output data format/value
> >>          is correct
> >>   3. no output data sink
> >>      - just swallows output data without doing anything
> >>      - purpose
> >>        - evaluate and tune the performance of the Flink source and the
> >>          whole pipeline; users don't need to worry about sink back
> >>          pressure
> >>
> >> These may all be taken into consideration as an effort to lower the
> >> barrier to running Flink SQL/Table API and to facilitate users' daily
> >> work.
> >>
> >> Cheers,
> >> Bowen
> >>
> >>
> >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <[email protected]>
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I heard some users complain that Table API/SQL jobs are difficult to
> >>> test. Now with the SQL client, users are more and more inclined to use
> >>> it for testing rather than writing programs.
> >>> The most common example is the Kafka source. If users need to test their
> >>> SQL output and checkpointing, they need to:
> >>>
> >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> >>> - 2. Write a program to mock input records and produce them to the Kafka
> >>>   topic.
> >>> - 3. Then test in Flink.
> >>>
> >>> Steps 1 and 2 are annoying, even though this test is E2E.
> >>>
> >>> Then I found StatefulSequenceSource. It is very good because it already
> >>> deals with checkpointing, so it plays well with the checkpoint mechanism.
> >>> Usually, users have checkpointing turned on in production.
> >>>
> >>> With computed columns, users can easily create a sequence source DDL with
> >>> the same schema as the Kafka DDL. Then they can test inside Flink without
> >>> needing to launch anything else.
> >>>
> >>> Have you considered this? What do you think?
> >>>
> >>> CC: @Aljoscha Krettek <[email protected]> the author
> >>> of StatefulSequenceSource.
> >>>
> >>> Best,
> >>> Jingsong Lee
> >>>
> >>
>
>
--
Best, Jingsong Lee