I agree with Jingsong that sink schema inference and system tables can be considered later. I wouldn't recommend tackling them just for the sake of simplifying the user experience to the extreme. Providing the handy source and sink implementations above already offers users a ton of immediate value.
On Mon, Mar 23, 2020 at 20:20 Jingsong Li <jingsongl...@gmail.com> wrote:

> Hi Benchao,
>
> > do you think we need to add more columns with various types?
>
> I didn't list all types, but we should support primitive types, VARCHAR,
> DECIMAL, TIMESTAMP, etc. This can be done incrementally.
>
> Hi Benchao, Jark,
> About console and blackhole: yes, they can have no schema; the schema can
> be inferred from the upstream node.
> - But we don't currently have a mechanism for such configurable sinks.
> - If we want to support this, we need a single way to cover both sinks.
> - And users can use "CREATE TABLE ... LIKE" and other ways to simplify
> the DDL.
>
> And for providing system/registered tables (`console` and `blackhole`):
> - I have no strong opinion on these system tables. In SQL, it would be
> "insert into blackhole select a /*int*/, b /*string*/ from tableA" and
> "insert into blackhole select a /*double*/, b /*Map*/, c /*string*/ from
> tableB". It seems that blackhole is a universal thing, which intuitively
> feels wrong to me.
> - Can users override these tables? If they can, we need to ensure they
> can be overwritten by catalog tables.
>
> So I think we can leave these system tables to the future too.
> What do you think?
>
> Best,
> Jingsong Lee
>
> On Mon, Mar 23, 2020 at 4:44 PM Jark Wu <imj...@gmail.com> wrote:
>
> > Hi Jingsong,
> >
> > Regarding (2) and (3), I was thinking of avoiding the manual DDL work,
> > so users can use them directly:
> >
> > # this will log results to `.out` files
> > INSERT INTO console
> > SELECT ...
> >
> > # this will drop all received records
> > INSERT INTO blackhole
> > SELECT ...
> >
> > Here `console` and `blackhole` are system sinks, similar to system
> > functions.
> >
> > Best,
> > Jark
> >
> > On Mon, 23 Mar 2020 at 16:33, Benchao Li <libenc...@gmail.com> wrote:
> >
> > > Hi Jingsong,
> > >
> > > Thanks for bringing this up. Generally, it's a very good proposal.
> > >
> > > About the datagen source, do you think we need to add more columns
> > > with various types?
> > >
> > > About the print sink, do we need to specify the schema?
> > >
> > > Jingsong Li <jingsongl...@gmail.com> wrote on Mon, Mar 23, 2020 at
> > > 1:51 PM:
> > >
> > > > Thanks Bowen, Jark and Dian for your feedback and suggestions.
> > > >
> > > > I have reorganized the proposal with your suggestions, and propose
> > > > to expose the following DDLs:
> > > >
> > > > 1. datagen source:
> > > > - easy startup/testing for streaming jobs
> > > > - performance testing
> > > >
> > > > DDL:
> > > > CREATE TABLE user (
> > > >     id BIGINT,
> > > >     age INT,
> > > >     description STRING
> > > > ) WITH (
> > > >     'connector.type' = 'datagen',
> > > >     'connector.rows-per-second' = '100',
> > > >     'connector.total-records' = '1000000',
> > > >
> > > >     'schema.id.generator' = 'sequence',
> > > >     'schema.id.generator.start' = '1',
> > > >
> > > >     'schema.age.generator' = 'random',
> > > >     'schema.age.generator.min' = '0',
> > > >     'schema.age.generator.max' = '100',
> > > >
> > > >     'schema.description.generator' = 'random',
> > > >     'schema.description.generator.length' = '100'
> > > > )
> > > >
> > > > The default is the random generator.
> > > > Hi Jark, I don't want to bring in complicated value patterns,
> > > > because that can be done through computed columns, and it is hard
> > > > to define standard patterns. I think we can leave that to the
> > > > future.
> > > >
> > > > 2. print sink:
> > > > - easy testing for streaming jobs
> > > > - very useful in production debugging
> > > >
> > > > DDL:
> > > > CREATE TABLE print_table (
> > > >     ...
> > > > ) WITH (
> > > >     'connector.type' = 'print'
> > > > )
> > > >
> > > > 3. blackhole sink:
> > > > - very useful for high-performance testing of Flink
> > > > - I've also run into users trying to output from a UDF instead of
> > > > a sink, so they need this sink as well.
> > > >
> > > > DDL:
> > > > CREATE TABLE blackhole_table (
> > > >     ...
> > > > ) WITH (
> > > >     'connector.type' = 'blackhole'
> > > > )
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Jingsong Lee
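Taken together, the three DDLs above would make an end-to-end smoke test only a few statements long. A minimal sketch, assuming the proposed option names and the random-generator default are adopted as-is:

    -- bounded stream of random rows (random generator is the proposed default)
    CREATE TABLE users (
      id BIGINT,
      age INT
    ) WITH (
      'connector.type' = 'datagen',
      'connector.rows-per-second' = '100',
      'connector.total-records' = '1000000'
    );

    -- prints every result row to the task managers' .out files
    CREATE TABLE print_table (
      age INT,
      cnt BIGINT
    ) WITH (
      'connector.type' = 'print'
    );

    INSERT INTO print_table
    SELECT age, COUNT(*) FROM users GROUP BY age;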
> > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <dian0511...@gmail.com> wrote:
> > > >
> > > > > Thanks Jingsong for bringing up this discussion. +1 to this
> > > > > proposal. Bowen's proposal also makes much sense to me.
> > > > >
> > > > > This is also a painful problem for PyFlink users. Currently there
> > > > > is no built-in easy-to-use table source/sink, and it requires
> > > > > users to write a lot of code just to try out PyFlink. This is
> > > > > especially painful for new users who are not familiar with
> > > > > PyFlink/Flink. I have also gone through the tedious process Bowen
> > > > > describes, e.g. writing a random source connector, a print sink,
> > > > > and a blackhole sink, as there are no built-in ones to use.
> > > > >
> > > > > Regards,
> > > > > Dian
> > > > >
> > > > > On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> > > > > >
> > > > > > +1 to Bowen's proposal. I have also seen many requests for such
> > > > > > built-in connectors.
> > > > > >
> > > > > > I will leave some of my thoughts here:
> > > > > >
> > > > > >> 1. datagen source (random source)
> > > > > > I think we can merge the functionality of the sequence source
> > > > > > into the random source to allow users to customize their data
> > > > > > values.
> > > > > > Flink can generate random data according to the field types,
> > > > > > and users can customize their values to be more
> > > > > > domain-specific, e.g.
> > > > > > 'field.user'='User_[1-9]{0,1}'
> > > > > > This would be similar to kafka-connect-datagen [1].
> > > > > >
> > > > > >> 2. console sink (print sink)
> > > > > > This will be very useful in production debugging, to easily
> > > > > > output an intermediate view or result view to a `.out` file,
> > > > > > so that we can look into the data representation or check for
> > > > > > dirty data. This should work out of the box, without manual
> > > > > > DDL registration.
> > > > > >
> > > > > >> 3. blackhole sink (no-output sink)
> > > > > > This is very useful for high-performance testing of Flink, to
> > > > > > measure the throughput of the whole pipeline without a real
> > > > > > sink. Presto also provides this as a built-in connector [2].
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > > [1]:
> > > > > > https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
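In DDL form, Jark's value-customization idea might look like the sketch below. The pattern option name is hypothetical, since the thread does not settle on one; only the 'User_[1-9]{0,1}' value specification style is borrowed from kafka-connect-datagen:

    CREATE TABLE user_names (
      user_name STRING
    ) WITH (
      'connector.type' = 'datagen',
      'schema.user_name.generator' = 'random',
      -- hypothetical option: a regex-like value pattern per field, in the
      -- style of kafka-connect-datagen's schema specification
      'schema.user_name.generator.pattern' = 'User_[1-9]{0,1}'
    )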
> > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> > > > > >
> > > > > >> +1.
> > > > > >>
> > > > > >> I would suggest taking it a step further and looking at what
> > > > > >> users really need to test/try/play with the Table API and
> > > > > >> Flink SQL. Besides this one, here are some more sources and
> > > > > >> sinks that I have developed or used previously to facilitate
> > > > > >> building Flink table/SQL pipelines.
> > > > > >>
> > > > > >> 1. random input data source
> > > > > >>    - should generate random data at a specified rate according
> > > > > >>      to the schema
> > > > > >>    - purposes
> > > > > >>      - test that the Flink pipeline works and data ends up in
> > > > > >>        external storage correctly
> > > > > >>      - stress-test the Flink sink as well as tune the external
> > > > > >>        storage
> > > > > >> 2. print data sink
> > > > > >>    - should print data in row format to the console
> > > > > >>    - purposes
> > > > > >>      - make it easier to test a Flink SQL job e2e in the IDE
> > > > > >>      - test the Flink pipeline and ensure the output data
> > > > > >>        format/values are correct
> > > > > >> 3. no-output data sink
> > > > > >>    - just swallows output data without doing anything
> > > > > >>    - purpose
> > > > > >>      - evaluate and tune the performance of the Flink source
> > > > > >>        and the whole pipeline; users don't need to worry about
> > > > > >>        sink back pressure
> > > > > >>
> > > > > >> These may be taken into consideration all together as an
> > > > > >> effort to lower the threshold of running Flink SQL/Table API,
> > > > > >> and to facilitate users' daily work.
> > > > > >>
> > > > > >> Cheers,
> > > > > >> Bowen
> > > > > >>
> > > > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com> wrote:
> > > > > >>
> > > > > >>> Hi all,
> > > > > >>>
> > > > > >>> I have heard some users complain that the Table API is
> > > > > >>> difficult to test. Now, with the SQL client, users are more
> > > > > >>> and more inclined to test with it rather than write programs.
> > > > > >>> The most common example is the Kafka source. If users need to
> > > > > >>> test their SQL output and checkpointing, they need to:
> > > > > >>>
> > > > > >>> 1. Launch a standalone Kafka and create a Kafka topic.
> > > > > >>> 2. Write a program, mock input records, and produce the
> > > > > >>>    records to the Kafka topic.
> > > > > >>> 3. Then test in Flink.
> > > > > >>>
> > > > > >>> Steps 1 and 2 are annoying, even though this test is E2E.
> > > > > >>>
> > > > > >>> Then I found StatefulSequenceSource. It is very good because
> > > > > >>> it already handles checkpointing, so it plays well with the
> > > > > >>> checkpoint mechanism. Usually, users have checkpointing
> > > > > >>> turned on in production.
> > > > > >>>
> > > > > >>> With computed columns, users can easily create a sequence
> > > > > >>> source DDL with the same schema as their Kafka DDL. Then they
> > > > > >>> can test inside Flink without launching anything else.
> > > > > >>>
> > > > > >>> Have you considered this? What do you think?
> > > > > >>>
> > > > > >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author of
> > > > > >>> StatefulSequenceSource.
> > > > > >>>
> > > > > >>> Best,
> > > > > >>> Jingsong Lee
> > > >
> > > > --
> > > > Best, Jingsong Lee
> > >
> > > --
> > > Benchao Li
> > > School of Electronics Engineering and Computer Science, Peking University
> > > Tel: +86-15650713730
> > > Email: libenc...@gmail.com; libenc...@pku.edu.cn
>
> --
> Best, Jingsong Lee
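For completeness, the computed-column approach from the original mail could look roughly as follows, using the sequence options proposed later in the thread; the derived expressions are illustrative only, not part of any proposal:

    -- a sequence source shaped like a Kafka table, for testing without Kafka
    CREATE TABLE mock_kafka_input (
      id BIGINT,
      -- computed columns derive Kafka-like fields from the generated sequence
      payload AS CONCAT('msg-', CAST(id AS STRING)),
      ts AS PROCTIME()
    ) WITH (
      'connector.type' = 'datagen',
      'schema.id.generator' = 'sequence',
      'schema.id.generator.start' = '1'
    )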