Thanks Bowen, Jark and Dian for your feedback and suggestions. I have reorganized the proposal based on your suggestions and will try to sketch the DDLs here:

1. datagen source:
- easy startup/testing for a streaming job
- performance testing

DDL:

CREATE TABLE user (
  id BIGINT,
  age INT,
  description STRING
) WITH (
  'connector.type' = 'datagen',
  'connector.rows-per-second' = '100',
  'connector.total-records' = '1000000',

  'schema.id.generator' = 'sequence',
  'schema.id.generator.start' = '1',

  'schema.age.generator' = 'random',
  'schema.age.generator.min' = '0',
  'schema.age.generator.max' = '100',

  'schema.description.generator' = 'random',
  'schema.description.generator.length' = '100'
)

The default generator is random.

Hi Jark, I don't want to bring in complicated pattern-based generation, because it can be achieved through computed columns, and it is hard to define a standard set of patterns. I think we can leave that to the future.
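As a rough illustration of the computed-column approach (only a sketch: the connector options follow the proposal above and are not an existing implementation, and the table/column names are placeholders), something like this could cover Jark's 'User_[1-9]{0,1}' example:

CREATE TABLE user_gen (
  id BIGINT,
  -- derive a domain-specific value from the generated id
  user_name AS CONCAT('User_', CAST(MOD(id, 10) AS STRING))
) WITH (
  'connector.type' = 'datagen',
  'connector.rows-per-second' = '100',
  'schema.id.generator' = 'sequence',
  'schema.id.generator.start' = '1'
)

This way the connector itself only needs simple sequence/random generators, and users shape the values with expressions.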
2. print sink:
- easy testing for a streaming job
- very useful in production debugging

DDL:

CREATE TABLE print_table (
  ...
) WITH (
  'connector.type' = 'print'
)

3. blackhole sink:
- very useful for high-performance testing of Flink
- I've also run into users who use a UDF for output instead of a sink, so they need this sink as well.

DDL:

CREATE TABLE blackhole_table (
  ...
) WITH (
  'connector.type' = 'blackhole'
)
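To show how the three connectors could work together in a quick test job, here is a rough end-to-end sketch (table and column names are placeholders, and all options follow the proposed DDLs above rather than an existing implementation):

CREATE TABLE orders_gen (
  order_id BIGINT,
  amount INT
) WITH (
  'connector.type' = 'datagen',
  'connector.rows-per-second' = '100',
  'schema.order_id.generator' = 'sequence',
  'schema.order_id.generator.start' = '1',
  'schema.amount.generator' = 'random',
  'schema.amount.generator.min' = '1',
  'schema.amount.generator.max' = '1000'
)

CREATE TABLE print_orders (
  order_id BIGINT,
  amount INT
) WITH (
  'connector.type' = 'print'
)

-- check the query logic by looking at the printed rows
INSERT INTO print_orders SELECT order_id, amount FROM orders_gen WHERE amount > 500

To measure throughput only, the same INSERT could target a table with the same schema and 'connector.type' = 'blackhole'.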
What do you think?

Best,
Jingsong Lee

On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <dian0511...@gmail.com> wrote:

> Thanks Jingsong for bringing up this discussion. +1 to this proposal.
> Bowen's proposal makes much sense to me.
>
> This is also a painful problem for PyFlink users. Currently there is no
> built-in easy-to-use table source/sink, and it requires users to write a
> lot of code to try out PyFlink. This is especially painful for new users
> who are not familiar with PyFlink/Flink. I have also gone through the
> tedious process Bowen described, e.g. writing a random source connector, a
> print sink and also a blackhole sink, as there are no built-in ones to use.
>
> Regards,
> Dian
>
> > On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> >
> > +1 to Bowen's proposal. I have also seen many requests for such built-in
> > connectors.
> >
> > I will leave some of my thoughts here:
> >
> >> 1. datagen source (random source)
> > I think we can merge the functionality of the sequence source into the
> > random source to allow users to customize their data values.
> > Flink can generate random data according to the field types, and users
> > can customize the values to be more domain specific, e.g.
> > 'field.user'='User_[1-9]{0,1}'
> > This would be similar to kafka-connect-datagen [1].
> >
> >> 2. console sink (print sink)
> > This will be very useful in production debugging, to easily output an
> > intermediate view or result view to a `.out` file.
> > That way we can look into the data representation, or check for dirty data.
> > This should work out of the box without manual DDL registration.
> >
> >> 3. blackhole sink (no output sink)
> > This is very useful for high-performance testing of Flink, to measure the
> > throughput of the whole pipeline without a sink.
> > Presto also provides this as a built-in connector [2].
> >
> > Best,
> > Jark
> >
> > [1]: https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> >
> >
> > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> >
> >> +1.
> >>
> >> I would suggest taking a step even further and seeing what users really
> >> need to test/try/play with the Table API and Flink SQL. Besides this one,
> >> here are some more sources and sinks that I have developed or used
> >> previously to facilitate building Flink table/SQL pipelines.
> >>
> >> 1. random input data source
> >>    - should generate random data at a specified rate according to the schema
> >>    - purposes
> >>      - test that the Flink pipeline works and data ends up in external
> >>        storage correctly
> >>      - stress test the Flink sink as well as tune the external storage
> >> 2. print data sink
> >>    - should print data in row format to the console
> >>    - purposes
> >>      - make it easier to test a Flink SQL job e2e in the IDE
> >>      - test the Flink pipeline and ensure the output data format/value is
> >>        correct
> >> 3. no output data sink
> >>    - just swallows output data without doing anything
> >>    - purpose
> >>      - evaluate and tune the performance of the Flink source and the whole
> >>        pipeline; users don't need to worry about sink back pressure
> >>
> >> These may be taken into consideration all together as an effort to lower
> >> the threshold of running Flink SQL / Table API, and to facilitate users'
> >> daily work.
> >>
> >> Cheers,
> >> Bowen
> >>
> >>
> >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com>
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I heard some users complain that the Table API is difficult to test. Now
> >>> with the SQL client, users are more and more inclined to use it to test
> >>> rather than writing programs.
> >>> The most common example is the Kafka source. If users need to test their
> >>> SQL output and checkpointing, they need to:
> >>>
> >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> >>> - 2. Write a program, mock input records, and produce the records to the
> >>>   Kafka topic.
> >>> - 3. Then test in Flink.
> >>>
> >>> Steps 1 and 2 are annoying, although this test is E2E.
> >>>
> >>> Then I found StatefulSequenceSource. It is very good because it already
> >>> deals with checkpointing, so it works well with the checkpoint mechanism.
> >>> Usually, users have checkpointing turned on in production.
> >>>
> >>> With computed columns, users can easily create a sequence source DDL in
> >>> the same way as a Kafka DDL. Then they can test inside Flink and don't
> >>> need to launch anything else.
> >>>
> >>> Have you considered this? What do you think?
> >>>
> >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author
> >>> of StatefulSequenceSource.
> >>>
> >>> Best,
> >>> Jingsong Lee
> >>>
> >>

--
Best,
Jingsong Lee