Hi all,

I created https://issues.apache.org/jira/browse/FLINK-16743 for follow-up discussion. FYI.
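As a small illustration of the computed-column idea discussed further down in this
thread: the sketch below shows how a patterned or sequence-like column could be
expressed on top of the proposed datagen options, instead of adding dedicated
"regularity" options to the connector. The option names are only the ones proposed
in this thread and the table/column names are made up, so please treat this as a
rough sketch, not final syntax:

CREATE TABLE user_gen (
    id BIGINT,
    age INT,
    -- computed column: derive a patterned value from the generated id
    user_name AS CONCAT('User_', CAST(MOD(id, 10) AS STRING))
) WITH (
    'connector.type' = 'datagen',
    'connector.rows-per-second' = '100',

    'schema.id.generator' = 'sequence',
    'schema.id.generator.start' = '1',

    'schema.age.generator' = 'random',
    'schema.age.generator.min' = '0',
    'schema.age.generator.max' = '100'
)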
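Similarly, once "CREATE TABLE ... LIKE" is available, the DDL for print and
blackhole tables could be simplified by copying the schema from an existing table
and only overriding the connector options. Roughly (the LIKE clause is still under
discussion and `orders` is just a placeholder table, so the exact syntax may
differ):

CREATE TABLE print_orders
WITH (
    'connector.type' = 'print'
)
LIKE orders (EXCLUDING OPTIONS)

That way users would not have to repeat the column list just to print or discard a
query result.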
Best,
Jingsong Lee

On Tue, Mar 24, 2020 at 2:20 PM Bowen Li <bowenl...@gmail.com> wrote:

> I agree with Jingsong that sink schema inference and system tables can be
> considered later. I wouldn't recommend tackling them just for the sake of
> simplifying the user experience to the extreme. Providing the handy source
> and sink implementations above already offers users a ton of immediate value.
>
>
> On Mon, Mar 23, 2020 at 20:20 Jingsong Li <jingsongl...@gmail.com> wrote:
>
> > Hi Benchao,
> >
> > > do you think we need to add more columns with various types?
> >
> > I didn't list all of the types, but we should support primitive types,
> > VARCHAR, DECIMAL, TIMESTAMP, etc. This can be done incrementally.
> >
> > Hi Benchao, Jark,
> > About console and blackhole: yes, they can have no schema; the schema can
> > be inferred from the upstream node.
> > - But we don't yet have a mechanism for such configurable sinks.
> > - If we want to support it, we need a single way that covers both sinks.
> > - And users can use "CREATE TABLE ... LIKE" and other means to simplify
> > the DDL.
> >
> > And for providing system/registered tables (`console` and `blackhole`):
> > - I have no strong opinion on these system tables. In SQL it would be
> > "insert into blackhole select a /*int*/, b /*string*/ from tableA" and
> > "insert into blackhole select a /*double*/, b /*Map*/, c /*string*/ from
> > tableB". That makes blackhole a universal thing, which feels wrong to me
> > intuitively.
> > - Can users override these tables? If so, we need to ensure they can be
> > overridden by catalog tables.
> >
> > So I think we can leave these system tables to the future too.
> > What do you think?
> >
> > Best,
> > Jingsong Lee
> >
> > On Mon, Mar 23, 2020 at 4:44 PM Jark Wu <imj...@gmail.com> wrote:
> >
> > > Hi Jingsong,
> > >
> > > Regarding (2) and (3), I was thinking of skipping the manual DDL work,
> > > so users can use them directly:
> > >
> > > # this will log results to `.out` files
> > > INSERT INTO console
> > > SELECT ...
> > >
> > > # this will drop all received records
> > > INSERT INTO blackhole
> > > SELECT ...
> > >
> > > Here `console` and `blackhole` are system sinks, similar to system
> > > functions.
> > >
> > > Best,
> > > Jark
> > >
> > > On Mon, 23 Mar 2020 at 16:33, Benchao Li <libenc...@gmail.com> wrote:
> > >
> > > > Hi Jingsong,
> > > >
> > > > Thanks for bringing this up. Generally, it's a very good proposal.
> > > >
> > > > About the data gen source, do you think we need to add more columns
> > > > with various types?
> > > >
> > > > About the print sink, do we need to specify the schema?
> > > >
> > > > On Mon, Mar 23, 2020 at 1:51 PM, Jingsong Li <jingsongl...@gmail.com> wrote:
> > > >
> > > > > Thanks Bowen, Jark and Dian for your feedback and suggestions.
> > > > >
> > > > > I have reorganized the proposal with your suggestions and will try to
> > > > > sketch the DDLs:
> > > > >
> > > > > 1. datagen source:
> > > > > - easy startup/testing for streaming jobs
> > > > > - performance testing
> > > > >
> > > > > DDL:
> > > > > CREATE TABLE user (
> > > > >     id BIGINT,
> > > > >     age INT,
> > > > >     description STRING
> > > > > ) WITH (
> > > > >     'connector.type' = 'datagen',
> > > > >     'connector.rows-per-second' = '100',
> > > > >     'connector.total-records' = '1000000',
> > > > >
> > > > >     'schema.id.generator' = 'sequence',
> > > > >     'schema.id.generator.start' = '1',
> > > > >
> > > > >     'schema.age.generator' = 'random',
> > > > >     'schema.age.generator.min' = '0',
> > > > >     'schema.age.generator.max' = '100',
> > > > >
> > > > >     'schema.description.generator' = 'random',
> > > > >     'schema.description.generator.length' = '100'
> > > > > )
> > > > >
> > > > > The default is the random generator.
> > > > > Hi Jark, I don't want to bring in complicated value patterns, because
> > > > > they can be expressed through computed columns. And it is hard to
> > > > > define standard patterns, so I think we can leave that to the future.
> > > > >
> > > > > 2. print sink:
> > > > > - easy testing for streaming jobs
> > > > > - very useful in production debugging
> > > > >
> > > > > DDL:
> > > > > CREATE TABLE print_table (
> > > > >     ...
> > > > > ) WITH (
> > > > >     'connector.type' = 'print'
> > > > > )
> > > > >
> > > > > 3. blackhole sink:
> > > > > - very useful for high-performance testing of Flink
> > > > > - I've also run into users who output from a UDF instead of a sink,
> > > > > so they need this sink as well.
> > > > >
> > > > > DDL:
> > > > > CREATE TABLE blackhole_table (
> > > > >     ...
> > > > > ) WITH (
> > > > >     'connector.type' = 'blackhole'
> > > > > )
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Best,
> > > > > Jingsong Lee
> > > > >
> > > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <dian0511...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Jingsong for bringing up this discussion. +1 to this proposal.
> > > > > > I think Bowen's proposal makes much sense to me.
> > > > > >
> > > > > > This is also a painful problem for PyFlink users. Currently there is
> > > > > > no built-in, easy-to-use table source/sink, and it requires users to
> > > > > > write a lot of code just to try out PyFlink. This is especially
> > > > > > painful for new users who are not familiar with PyFlink/Flink. I have
> > > > > > also gone through the tedious process Bowen encountered, e.g. writing
> > > > > > a random source connector, a print sink and also a blackhole sink, as
> > > > > > there are no built-in ones to use.
> > > > > >
> > > > > > Regards,
> > > > > > Dian
> > > > > >
> > > > > > > On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> > > > > > >
> > > > > > > +1 to Bowen's proposal. I have also seen many requests for such
> > > > > > > built-in connectors.
> > > > > > >
> > > > > > > I will leave some of my thoughts here:
> > > > > > >
> > > > > > >> 1. datagen source (random source)
> > > > > > > I think we can merge the functionality of the sequence source into
> > > > > > > the random source to allow users to customize their data values.
> > > > > > > Flink can generate random data according to the field types, and
> > > > > > > users can customize the values to be more domain-specific, e.g.
> > > > > > > 'field.user'='User_[1-9]{0,1}'
> > > > > > > This will be similar to kafka-connect-datagen [1].
> > > > > > >
> > > > > > >> 2. console sink (print sink)
> > > > > > > This will be very useful in production debugging, to easily output
> > > > > > > an intermediate view or result view to a `.out` file, so that we
> > > > > > > can look into the data representation or check for dirty data.
> > > > > > > This should work out of the box, without manual DDL registration.
> > > > > > >
> > > > > > >> 3. blackhole sink (no output sink)
> > > > > > > This is very useful for high-performance testing of Flink, to
> > > > > > > measure the throughput of the whole pipeline without a sink.
> > > > > > > Presto also provides this as a built-in connector [2].
> > > > > > >
> > > > > > > Best,
> > > > > > > Jark
> > > > > > >
> > > > > > > [1]: https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > > > > >
> > > > > > >
> > > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> > > > > > >
> > > > > > >> +1.
> > > > > > >>
> > > > > > >> I would suggest taking a step even further and seeing what users
> > > > > > >> really need to test/try/play with the Table API and Flink SQL.
> > > > > > >> Besides this one, here are some more sources and sinks that I have
> > > > > > >> developed or used previously to facilitate building Flink
> > > > > > >> table/SQL pipelines.
> > > > > > >>
> > > > > > >> 1. random input data source
> > > > > > >>     - should generate random data at a specified rate according to
> > > > > > >>       the schema
> > > > > > >>     - purposes
> > > > > > >>         - test that the Flink pipeline works and that data ends up
> > > > > > >>           in external storage correctly
> > > > > > >>         - stress test the Flink sink as well as tune the external
> > > > > > >>           storage
> > > > > > >> 2. print data sink
> > > > > > >>     - should print data in row format to the console
> > > > > > >>     - purposes
> > > > > > >>         - make it easier to test a Flink SQL job end-to-end in the IDE
> > > > > > >>         - test the Flink pipeline and ensure the output data
> > > > > > >>           format/values are correct
> > > > > > >> 3. no-output data sink
> > > > > > >>     - just swallows output data without doing anything
> > > > > > >>     - purpose
> > > > > > >>         - evaluate and tune the performance of the Flink source and
> > > > > > >>           the whole pipeline; users don't need to worry about sink
> > > > > > >>           backpressure
> > > > > > >>
> > > > > > >> These could all be taken into consideration together as an effort
> > > > > > >> to lower the barrier to running Flink SQL / Table API and to
> > > > > > >> facilitate users' daily work.
> > > > > > >>
> > > > > > >> Cheers,
> > > > > > >> Bowen
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi all,
> > > > > > >>>
> > > > > > >>> I have heard some users complain that the Table API is difficult to
> > > > > > >>> test. Now, with the SQL client, users are more and more inclined to
> > > > > > >>> use it for testing rather than writing programs.
> > > > > > >>> The most common example is the Kafka source. If users want to test
> > > > > > >>> their SQL output and checkpointing, they need to:
> > > > > > >>>
> > > > > > >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> > > > > > >>> - 2. Write a program that mocks input records and produces them to
> > > > > > >>>   the Kafka topic.
> > > > > > >>> - 3. Then test in Flink.
> > > > > > >>>
> > > > > > >>> Steps 1 and 2 are annoying, even though this test is end-to-end.
> > > > > > >>>
> > > > > > >>> Then I found StatefulSequenceSource. It is very good because it
> > > > > > >>> already deals with checkpointing, so it works well with the
> > > > > > >>> checkpoint mechanism. Usually, users have checkpointing turned on
> > > > > > >>> in production.
> > > > > > >>>
> > > > > > >>> With computed columns, users can easily create a sequence-source
> > > > > > >>> DDL with the same schema as the Kafka DDL. Then they can test
> > > > > > >>> inside Flink and don't need to launch anything else.
> > > > > > >>>
> > > > > > >>> Have you considered this? What do you think?
> > > > > > >>>
> > > > > > >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author of
> > > > > > >>> StatefulSequenceSource.
> > > > > > >>>
> > > > > > >>> Best,
> > > > > > >>> Jingsong Lee
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > > > --
> > > > > Best, Jingsong Lee
> > > >
> > > >
> > > > --
> > > > Benchao Li
> > > > School of Electronics Engineering and Computer Science, Peking University
> > > > Tel: +86-15650713730
> > > > Email: libenc...@gmail.com; libenc...@pku.edu.cn
> >
> > --
> > Best, Jingsong Lee
>

--
Best, Jingsong Lee