Hi Konstantin,

Thanks for the link to Java Faker. It's an interesting project and could be a good basis for a comprehensive datagen source.
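For illustration, a Faker-backed datagen table might look something like this (just a sketch: the `faker` connector and its per-field option keys are hypothetical, assuming Java Faker's `#{...}` expression syntax):

CREATE TABLE fake_users (
  name STRING,
  email STRING,
  age INT
) WITH (
  'connector' = 'faker',  -- hypothetical connector name
  'fields.name.expression' = '#{Name.fullName}',
  'fields.email.expression' = '#{Internet.emailAddress}',
  'fields.age.expression' = '#{number.numberBetween ''1'',''99''}'
);

Each field would then be populated by evaluating its Java Faker expression per row.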
What would the discarding and printing sinks look like in your mind?

1) Manually create a table with a `blackhole` or `print` connector, e.g.

CREATE TABLE my_sink (
  a INT,
  b STRING,
  c DOUBLE
) WITH (
  'connector' = 'print'
);

INSERT INTO my_sink SELECT a, b, c FROM my_source;

2) A system built-in table named `blackhole` or `print` that needs no manual schema work, e.g.

INSERT INTO print SELECT a, b, c, d FROM my_source;

Best,
Jark

On Thu, 30 Apr 2020 at 21:19, Konstantin Knauf <kna...@apache.org> wrote:

> Hi everyone,
>
> sorry for reviving this thread at this point in time. Generally, I think
> this is a very valuable effort. Have we considered only providing a very
> basic data generator (+ discarding and printing sink tables) in Apache
> Flink, and moving a more comprehensive data-generating table source to an
> ecosystem project promoted on flink-packages.org? I think this has a lot
> of potential (e.g. in combination with Java Faker [1]), but it would
> probably be better served in a small, separately maintained repository.
>
> Cheers,
>
> Konstantin
>
> [1] https://github.com/DiUS/java-faker
>
> On Tue, Mar 24, 2020 at 9:10 AM Jingsong Li <jingsongl...@gmail.com> wrote:
>
> > Hi all,
> >
> > I created https://issues.apache.org/jira/browse/FLINK-16743 for
> > follow-up discussion. FYI.
> >
> > Best,
> > Jingsong Lee
> >
> > On Tue, Mar 24, 2020 at 2:20 PM Bowen Li <bowenl...@gmail.com> wrote:
> >
> > > I agree with Jingsong that sink schema inference and system tables can
> > > be considered later. I wouldn't recommend tackling them for the sake
> > > of simplifying the user experience to the extreme. Providing the above
> > > handy source and sink implementations already offers users a ton of
> > > immediate value.
> > >
> > > On Mon, Mar 23, 2020 at 20:20 Jingsong Li <jingsongl...@gmail.com>
> > > wrote:
> > >
> > > > Hi Benchao,
> > > >
> > > > > do you think we need to add more columns with various types?
> > > >
> > > > I didn't list all types, but we should support primitive types,
> > > > varchar, decimal, timestamp, etc. This can be done continuously.
> > > >
> > > > Hi Benchao, Jark,
> > > > About console and blackhole: yes, they can have no schema; the
> > > > schema can be inferred from the upstream node.
> > > > - But we don't have such a mechanism for these configurable sinks
> > > > yet.
> > > > - If we want to support this, we need a single way to support both
> > > > sinks.
> > > > - And users can use `CREATE TABLE ... LIKE` and other ways to
> > > > simplify the DDL.
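> > > > For example, a print sink could reuse an existing table's schema
> > > > (a sketch, assuming the proposed `CREATE TABLE ... LIKE` syntax):
> > > >
> > > > -- schema is copied from my_source; its options are not inherited
> > > > CREATE TABLE my_print_sink
> > > > WITH ('connector.type' = 'print')
> > > > LIKE my_source (EXCLUDING OPTIONS);
> > > >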
> > > > And for providing system/registered tables (`console` and
> > > > `blackhole`):
> > > > - I have no strong opinion on these system tables. In SQL, it would
> > > > be "INSERT INTO blackhole SELECT a /*int*/, b /*string*/ FROM
> > > > tableA" and "INSERT INTO blackhole SELECT a /*double*/, b /*map*/,
> > > > c /*string*/ FROM tableB". It seems that blackhole is a universal
> > > > thing, which makes me feel bad intuitively.
> > > > - Can users override these tables? If they can, we need to ensure
> > > > they can be overridden by catalog tables.
> > > >
> > > > So I think we can leave these system tables to the future too.
> > > > What do you think?
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > > > On Mon, Mar 23, 2020 at 4:44 PM Jark Wu <imj...@gmail.com> wrote:
> > > >
> > > > > Hi Jingsong,
> > > > >
> > > > > Regarding (2) and (3), I was thinking of skipping the manual DDL
> > > > > work, so users can use them directly:
> > > > >
> > > > > # this will log results to `.out` files
> > > > > INSERT INTO console
> > > > > SELECT ...
> > > > >
> > > > > # this will drop all received records
> > > > > INSERT INTO blackhole
> > > > > SELECT ...
> > > > >
> > > > > Here `console` and `blackhole` are system sinks, similar to system
> > > > > functions.
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Mon, 23 Mar 2020 at 16:33, Benchao Li <libenc...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > Thanks for bringing this up. Generally, it's a very good
> > > > > > proposal.
> > > > > >
> > > > > > About the datagen source: do you think we need to add more
> > > > > > columns with various types?
> > > > > >
> > > > > > About the print sink: do we need to specify the schema?
> > > > > >
> > > > > > On Mon, Mar 23, 2020 at 1:51 PM, Jingsong Li
> > > > > > <jingsongl...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Bowen, Jark and Dian for your feedback and suggestions.
> > > > > > >
> > > > > > > I have reorganized it with your suggestions, and try to expose
> > > > > > > the DDLs:
> > > > > > >
> > > > > > > 1. datagen source:
> > > > > > > - easy startup/testing for streaming jobs
> > > > > > > - performance testing
> > > > > > >
> > > > > > > DDL:
> > > > > > > CREATE TABLE user (
> > > > > > >   id BIGINT,
> > > > > > >   age INT,
> > > > > > >   description STRING
> > > > > > > ) WITH (
> > > > > > >   'connector.type' = 'datagen',
> > > > > > >   'connector.rows-per-second' = '100',
> > > > > > >   'connector.total-records' = '1000000',
> > > > > > >
> > > > > > >   'schema.id.generator' = 'sequence',
> > > > > > >   'schema.id.generator.start' = '1',
> > > > > > >
> > > > > > >   'schema.age.generator' = 'random',
> > > > > > >   'schema.age.generator.min' = '0',
> > > > > > >   'schema.age.generator.max' = '100',
> > > > > > >
> > > > > > >   'schema.description.generator' = 'random',
> > > > > > >   'schema.description.generator.length' = '100'
> > > > > > > )
> > > > > > >
> > > > > > > The default is the random generator.
> > > > > > > Hi Jark, I don't want to bring in complicated regularities,
> > > > > > > because they can be done through computed columns. And it is
> > > > > > > hard to define standard regularities; I think we can leave it
> > > > > > > to the future.
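> > > > > > > For example, a patterned user name can be expressed as a
> > > > > > > computed column over a random field (a sketch reusing the
> > > > > > > options above):
> > > > > > >
> > > > > > > CREATE TABLE user (
> > > > > > >   id BIGINT,
> > > > > > >   -- "regularity" derived from the random id, no new generator
> > > > > > >   user_name AS CONCAT('User_', CAST(MOD(id, 10) AS STRING))
> > > > > > > ) WITH (
> > > > > > >   'connector.type' = 'datagen',
> > > > > > >   'schema.id.generator' = 'random',
> > > > > > >   'schema.id.generator.min' = '0',
> > > > > > >   'schema.id.generator.max' = '100'
> > > > > > > )
> > > > > > >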
> > > > > > > 2. print sink:
> > > > > > > - easy testing for streaming jobs
> > > > > > > - very useful in production debugging
> > > > > > >
> > > > > > > DDL:
> > > > > > > CREATE TABLE print_table (
> > > > > > >   ...
> > > > > > > ) WITH (
> > > > > > >   'connector.type' = 'print'
> > > > > > > )
> > > > > > >
> > > > > > > 3. blackhole sink:
> > > > > > > - very useful for high-performance testing of Flink
> > > > > > > - I've also run into users trying to use a UDF to output, not
> > > > > > > a sink, so they need this sink as well.
> > > > > > >
> > > > > > > DDL:
> > > > > > > CREATE TABLE blackhole_table (
> > > > > > >   ...
> > > > > > > ) WITH (
> > > > > > >   'connector.type' = 'blackhole'
> > > > > > > )
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > Best,
> > > > > > > Jingsong Lee
> > > > > > >
> > > > > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu
> > > > > > > <dian0511...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Jingsong for bringing up this discussion. +1 to this
> > > > > > > > proposal. I think Bowen's proposal makes much sense to me.
> > > > > > > >
> > > > > > > > This is also a painful problem for PyFlink users. Currently
> > > > > > > > there is no built-in, easy-to-use table source/sink, and it
> > > > > > > > requires users to write a lot of code to try out PyFlink.
> > > > > > > > This is especially painful for new users who are not
> > > > > > > > familiar with PyFlink/Flink. I have also gone through the
> > > > > > > > tedious process Bowen encountered, e.g. writing a random
> > > > > > > > source connector, a print sink and also a blackhole sink, as
> > > > > > > > there are no built-in ones to use.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Dian
> > > > > > > >
> > > > > > > > On Sun, Mar 22, 2020 at 11:24 AM, Jark Wu <imj...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 to Bowen's proposal. I have also seen many requirements
> > > > > > > > > for such built-in connectors.
> > > > > > > > >
> > > > > > > > > I will leave some of my thoughts here:
> > > > > > > > >
> > > > > > > > >> 1. datagen source (random source)
> > > > > > > > > I think we can merge the functionality of the sequence
> > > > > > > > > source into the random source to allow users to customize
> > > > > > > > > their data values.
> > > > > > > > > Flink can generate random data according to the field
> > > > > > > > > types, and users can customize the values to be more
> > > > > > > > > domain specific, e.g.
> > > > > > > > > 'field.user'='User_[1-9]{0,1}'
> > > > > > > > > This would be similar to kafka-connect-datagen [1].
> > > > > > > > >
> > > > > > > > >> 2. console sink (print sink)
> > > > > > > > > This will be very useful in production debugging, to
> > > > > > > > > easily output an intermediate view or result view to a
> > > > > > > > > `.out` file, so that we can look into the data
> > > > > > > > > representation or check for dirty data.
> > > > > > > > > This should work out of the box, without manual DDL
> > > > > > > > > registration.
> > > > > > > > >
> > > > > > > > >> 3. blackhole sink (no-output sink)
> > > > > > > > > This is very useful for high-performance testing of Flink,
> > > > > > > > > to measure the throughput of the whole pipeline without
> > > > > > > > > the sink.
> > > > > > > > > Presto also provides this as a built-in connector [2].
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jark
> > > > > > > > >
> > > > > > > > > [1]:
> > > > > > > > > https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > > > > > > >
> > > > > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li
> > > > > > > > > <bowenl...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >> +1.
> > > > > > > > >>
> > > > > > > > >> I would suggest taking a step even further and seeing
> > > > > > > > >> what users really need to test/try/play with the Table
> > > > > > > > >> API and Flink SQL.
> > > > > > > > >> Besides this one, here are some more sources and sinks
> > > > > > > > >> that I have developed or used previously to facilitate
> > > > > > > > >> building Flink Table/SQL pipelines.
> > > > > > > > >>
> > > > > > > > >> 1. random input data source
> > > > > > > > >>    - should generate random data at a specified rate
> > > > > > > > >>      according to the schema
> > > > > > > > >>    - purposes
> > > > > > > > >>      - test a Flink pipeline and that data ends up in
> > > > > > > > >>        external storage correctly
> > > > > > > > >>      - stress test the Flink sink as well as tune the
> > > > > > > > >>        external storage
> > > > > > > > >> 2. print data sink
> > > > > > > > >>    - should print data in row format to the console
> > > > > > > > >>    - purposes
> > > > > > > > >>      - make it easier to test Flink SQL jobs e2e in an IDE
> > > > > > > > >>      - test a Flink pipeline and ensure the output data
> > > > > > > > >>        format/values are correct
> > > > > > > > >> 3. no-output data sink
> > > > > > > > >>    - just swallows output data without doing anything
> > > > > > > > >>    - purpose
> > > > > > > > >>      - evaluate and tune the performance of the Flink
> > > > > > > > >>        source and the whole pipeline; users don't need to
> > > > > > > > >>        worry about sink back pressure
> > > > > > > > >>
> > > > > > > > >> These may be taken into consideration all together as an
> > > > > > > > >> effort to lower the threshold of running Flink SQL/Table
> > > > > > > > >> API, and to facilitate users' daily work.
> > > > > > > > >>
> > > > > > > > >> Cheers,
> > > > > > > > >> Bowen
> > > > > > > > >>
> > > > > > > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li
> > > > > > > > >> <jingsongl...@gmail.com> wrote:
> > > > > > > > >>
> > > > > > > > >>> Hi all,
> > > > > > > > >>>
> > > > > > > > >>> I heard some users complain that Table is difficult to
> > > > > > > > >>> test. Now, with the SQL client, users are more and more
> > > > > > > > >>> inclined to use it to test rather than write a program.
> > > > > > > > >>> The most common example is the Kafka source. If users
> > > > > > > > >>> need to test their SQL output and checkpointing, they
> > > > > > > > >>> need to:
> > > > > > > > >>>
> > > > > > > > >>> - 1. Launch a Kafka standalone cluster and create a
> > > > > > > > >>> Kafka topic.
> > > > > > > > >>> - 2. Write a program, mock input records, and produce
> > > > > > > > >>> the records to the Kafka topic.
> > > > > > > > >>> - 3. Then test in Flink.
> > > > > > > > >>>
> > > > > > > > >>> Steps 1 and 2 are annoying, although the test is E2E.
> > > > > > > > >>>
> > > > > > > > >>> Then I found StatefulSequenceSource. It is very good
> > > > > > > > >>> because it already deals with checkpointing, so it plays
> > > > > > > > >>> well with the checkpoint mechanism. Usually, users have
> > > > > > > > >>> checkpointing turned on in production.
> > > > > > > > >>>
> > > > > > > > >>> With computed columns, users can easily create a
> > > > > > > > >>> sequence source DDL that looks the same as a Kafka DDL.
> > > > > > > > >>> Then they can test inside Flink and don't need to launch
> > > > > > > > >>> anything else.
> > > > > > > > >>>
> > > > > > > > >>> Have you considered this? What do you think?
> > > > > > > > >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author
> > > > > > > > >>> of StatefulSequenceSource.
> > > > > > > > >>>
> > > > > > > > >>> Best,
> > > > > > > > >>> Jingsong Lee
> > > > > > >
> > > > > > > --
> > > > > > > Best, Jingsong Lee
> > > > > >
> > > > > > --
> > > > > > Benchao Li
> > > > > > School of Electronics Engineering and Computer Science, Peking University
> > > > > > Tel: +86-15650713730
> > > > > > Email: libenc...@gmail.com; libenc...@pku.edu.cn
> > > >
> > > > --
> > > > Best, Jingsong Lee
> >
> > --
> > Best, Jingsong Lee
>
> --
> Konstantin Knauf
> https://twitter.com/snntrable
> https://github.com/knaufk