Hi Konstantin,

Thanks for the link to Java Faker. It's an interesting project and could be a good basis for a comprehensive datagen source.
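For illustration, a Faker-backed datagen table might look something like this (just a sketch: the `faker` connector and its per-field option keys are hypothetical, assuming Java Faker's `#{...}` expression syntax):

CREATE TABLE fake_users (
  name STRING,
  email STRING,
  age INT
) WITH (
  'connector' = 'faker',  -- hypothetical connector name
  'fields.name.expression' = '#{Name.fullName}',
  'fields.email.expression' = '#{Internet.emailAddress}',
  'fields.age.expression' = '#{number.numberBetween ''1'',''99''}'
);

Each field would then be populated by evaluating its Java Faker expression per row.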
What would the discarding and printing sinks look like in your mind?

1) Manually create a table with a `blackhole` or `print` connector, e.g.

CREATE TABLE my_sink (
  a INT,
  b STRING,
  c DOUBLE
) WITH (
  'connector' = 'print'
);

INSERT INTO my_sink SELECT a, b, c FROM my_source;

2) A system built-in table named `blackhole` or `print` that needs no manual schema work, e.g.

INSERT INTO print SELECT a, b, c, d FROM my_source;

Best,
Jark

On Thu, 30 Apr 2020 at 21:19, Konstantin Knauf <kna...@apache.org> wrote:

> Hi everyone,
>
> sorry for reviving this thread at this point in time. Generally, I think
> this is a very valuable effort. Have we considered only providing a very
> basic data generator (+ discarding and printing sink tables) in Apache
> Flink, and moving a more comprehensive data-generating table source to an
> ecosystem project promoted on flink-packages.org? I think this has a lot
> of potential (e.g. in combination with Java Faker [1]), but it would
> probably be better served in a small, separately maintained repository.
>
> Cheers,
>
> Konstantin
>
> [1] https://github.com/DiUS/java-faker
>
> On Tue, Mar 24, 2020 at 9:10 AM Jingsong Li <jingsongl...@gmail.com> wrote:
>
> > Hi all,
> >
> > I created https://issues.apache.org/jira/browse/FLINK-16743 for
> > follow-up discussion. FYI.
> >
> > Best,
> > Jingsong Lee
> >
> > On Tue, Mar 24, 2020 at 2:20 PM Bowen Li <bowenl...@gmail.com> wrote:
> >
> > > I agree with Jingsong that sink schema inference and system tables can
> > > be considered later. I wouldn't recommend tackling them for the sake
> > > of simplifying the user experience to the extreme. Providing the above
> > > handy source and sink implementations already offers users a ton of
> > > immediate value.
> > >
> > > On Mon, Mar 23, 2020 at 20:20 Jingsong Li <jingsongl...@gmail.com>
> > > wrote:
> > >
> > > > Hi Benchao,
> > > >
> > > > > do you think we need to add more columns with various types?
> > > >
> > > > I didn't list all types, but we should support primitive types,
> > > > varchar, decimal, timestamp, etc. This can be done continuously.
> > > >
> > > > Hi Benchao, Jark,
> > > > About console and blackhole: yes, they can have no schema; the
> > > > schema can be inferred from the upstream node.
> > > > - But we don't have such a mechanism for these configurable sinks
> > > > yet.
> > > > - If we want to support this, we need a single way to support both
> > > > sinks.
> > > > - And users can use `CREATE TABLE ... LIKE` and other ways to
> > > > simplify the DDL.
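> > > > For example, a print sink could reuse an existing table's schema
> > > > (a sketch, assuming the proposed `CREATE TABLE ... LIKE` syntax):
> > > >
> > > > -- schema is copied from my_source; its options are not inherited
> > > > CREATE TABLE my_print_sink
> > > > WITH ('connector.type' = 'print')
> > > > LIKE my_source (EXCLUDING OPTIONS);
> > > >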
> > > > And for providing system/registered tables (`console` and
> > > > `blackhole`):
> > > > - I have no strong opinion on these system tables. In SQL, it would
> > > > be "INSERT INTO blackhole SELECT a /*int*/, b /*string*/ FROM
> > > > tableA" and "INSERT INTO blackhole SELECT a /*double*/, b /*map*/,
> > > > c /*string*/ FROM tableB". It seems that blackhole is a universal
> > > > thing, which makes me feel bad intuitively.
> > > > - Can users override these tables? If they can, we need to ensure
> > > > they can be overridden by catalog tables.
> > > >
> > > > So I think we can leave these system tables to the future too.
> > > > What do you think?
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > > > On Mon, Mar 23, 2020 at 4:44 PM Jark Wu <imj...@gmail.com> wrote:
> > > >
> > > > > Hi Jingsong,
> > > > >
> > > > > Regarding (2) and (3), I was thinking of skipping the manual DDL
> > > > > work, so users can use them directly:
> > > > >
> > > > > # this will log results to `.out` files
> > > > > INSERT INTO console
> > > > > SELECT ...
> > > > >
> > > > > # this will drop all received records
> > > > > INSERT INTO blackhole
> > > > > SELECT ...
> > > > >
> > > > > Here `console` and `blackhole` are system sinks, similar to system
> > > > > functions.
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Mon, 23 Mar 2020 at 16:33, Benchao Li <libenc...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > Thanks for bringing this up. Generally, it's a very good
> > > > > > proposal.
> > > > > >
> > > > > > About the datagen source: do you think we need to add more
> > > > > > columns with various types?
> > > > > >
> > > > > > About the print sink: do we need to specify the schema?
> > > > > >
> > > > > > On Mon, Mar 23, 2020 at 1:51 PM, Jingsong Li
> > > > > > <jingsongl...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Bowen, Jark and Dian for your feedback and suggestions.
> > > > > > >
> > > > > > > I have reorganized it with your suggestions, and try to expose
> > > > > > > the DDLs:
> > > > > > >
> > > > > > > 1. datagen source:
> > > > > > > - easy startup/testing for streaming jobs
> > > > > > > - performance testing
> > > > > > >
> > > > > > > DDL:
> > > > > > > CREATE TABLE user (
> > > > > > >   id BIGINT,
> > > > > > >   age INT,
> > > > > > >   description STRING
> > > > > > > ) WITH (
> > > > > > >   'connector.type' = 'datagen',
> > > > > > >   'connector.rows-per-second' = '100',
> > > > > > >   'connector.total-records' = '1000000',
> > > > > > >
> > > > > > >   'schema.id.generator' = 'sequence',
> > > > > > >   'schema.id.generator.start' = '1',
> > > > > > >
> > > > > > >   'schema.age.generator' = 'random',
> > > > > > >   'schema.age.generator.min' = '0',
> > > > > > >   'schema.age.generator.max' = '100',
> > > > > > >
> > > > > > >   'schema.description.generator' = 'random',
> > > > > > >   'schema.description.generator.length' = '100'
> > > > > > > )
> > > > > > >
> > > > > > > The default is the random generator.
> > > > > > > Hi Jark, I don't want to bring in complicated regularities,
> > > > > > > because they can be done through computed columns. And it is
> > > > > > > hard to define standard regularities; I think we can leave it
> > > > > > > to the future.
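> > > > > > > For example, a patterned user name can be expressed as a
> > > > > > > computed column over a random field (a sketch reusing the
> > > > > > > options above):
> > > > > > >
> > > > > > > CREATE TABLE user (
> > > > > > >   id BIGINT,
> > > > > > >   -- "regularity" derived from the random id, no new generator
> > > > > > >   user_name AS CONCAT('User_', CAST(MOD(id, 10) AS STRING))
> > > > > > > ) WITH (
> > > > > > >   'connector.type' = 'datagen',
> > > > > > >   'schema.id.generator' = 'random',
> > > > > > >   'schema.id.generator.min' = '0',
> > > > > > >   'schema.id.generator.max' = '100'
> > > > > > > )
> > > > > > >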
> > > > > > > 2. print sink:
> > > > > > > - easy testing for streaming jobs
> > > > > > > - very useful in production debugging
> > > > > > >
> > > > > > > DDL:
> > > > > > > CREATE TABLE print_table (
> > > > > > >   ...
> > > > > > > ) WITH (
> > > > > > >   'connector.type' = 'print'
> > > > > > > )
> > > > > > >
> > > > > > > 3. blackhole sink:
> > > > > > > - very useful for high-performance testing of Flink
> > > > > > > - I've also run into users trying to use a UDF to output, not
> > > > > > > a sink, so they need this sink as well.
> > > > > > >
> > > > > > > DDL:
> > > > > > > CREATE TABLE blackhole_table (
> > > > > > >   ...
> > > > > > > ) WITH (
> > > > > > >   'connector.type' = 'blackhole'
> > > > > > > )
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > Best,
> > > > > > > Jingsong Lee
> > > > > > >
> > > > > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu
> > > > > > > <dian0511...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Jingsong for bringing up this discussion. +1 to this
> > > > > > > > proposal. I think Bowen's proposal makes much sense to me.
> > > > > > > >
> > > > > > > > This is also a painful problem for PyFlink users. Currently
> > > > > > > > there is no built-in, easy-to-use table source/sink, and it
> > > > > > > > requires users to write a lot of code to try out PyFlink.
> > > > > > > > This is especially painful for new users who are not
> > > > > > > > familiar with PyFlink/Flink. I have also gone through the
> > > > > > > > tedious process Bowen encountered, e.g. writing a random
> > > > > > > > source connector, a print sink and also a blackhole sink, as
> > > > > > > > there are no built-in ones to use.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Dian
> > > > > > > >
> > > > > > > > On Sun, Mar 22, 2020 at 11:24 AM, Jark Wu <imj...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 to Bowen's proposal. I have also seen many requirements
> > > > > > > > > for such built-in connectors.
> > > > > > > > >
> > > > > > > > > I will leave some of my thoughts here:
> > > > > > > > >
> > > > > > > > >> 1. datagen source (random source)
> > > > > > > > > I think we can merge the functionality of the sequence
> > > > > > > > > source into the random source to allow users to customize
> > > > > > > > > their data values.
> > > > > > > > > Flink can generate random data according to the field
> > > > > > > > > types, and users can customize the values to be more
> > > > > > > > > domain specific, e.g.
> > > > > > > > > 'field.user'='User_[1-9]{0,1}'
> > > > > > > > > This would be similar to kafka-connect-datagen [1].
> > > > > > > > >
> > > > > > > > >> 2. console sink (print sink)
> > > > > > > > > This will be very useful in production debugging, to
> > > > > > > > > easily output an intermediate view or result view to a
> > > > > > > > > `.out` file, so that we can look into the data
> > > > > > > > > representation or check for dirty data.
> > > > > > > > > This should work out of the box, without manual DDL
> > > > > > > > > registration.
> > > > > > > > >
> > > > > > > > >> 3. blackhole sink (no-output sink)
> > > > > > > > > This is very useful for high-performance testing of Flink,
> > > > > > > > > to measure the throughput of the whole pipeline without
> > > > > > > > > the sink.
> > > > > > > > > Presto also provides this as a built-in connector [2].
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jark
> > > > > > > > >
> > > > > > > > > [1]:
> > > > > > > > > https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > > > > > > >
> > > > > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li
> > > > > > > > > <bowenl...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >> +1.
> > > > > > > > >>
> > > > > > > > >> I would suggest taking a step even further and seeing
> > > > > > > > >> what users really need to test/try/play with the Table
> > > > > > > > >> API and Flink SQL.
> > > > > > > > >> Besides this one, here are some more sources and sinks
> > > > > > > > >> that I have developed or used previously to facilitate
> > > > > > > > >> building Flink Table/SQL pipelines.
> > > > > > > > >>
> > > > > > > > >> 1. random input data source
> > > > > > > > >>    - should generate random data at a specified rate
> > > > > > > > >>      according to the schema
> > > > > > > > >>    - purposes
> > > > > > > > >>      - test a Flink pipeline and that data ends up in
> > > > > > > > >>        external storage correctly
> > > > > > > > >>      - stress test the Flink sink as well as tune the
> > > > > > > > >>        external storage
> > > > > > > > >> 2. print data sink
> > > > > > > > >>    - should print data in row format to the console
> > > > > > > > >>    - purposes
> > > > > > > > >>      - make it easier to test Flink SQL jobs e2e in an IDE
> > > > > > > > >>      - test a Flink pipeline and ensure the output data
> > > > > > > > >>        format/values are correct
> > > > > > > > >> 3. no-output data sink
> > > > > > > > >>    - just swallows output data without doing anything
> > > > > > > > >>    - purpose
> > > > > > > > >>      - evaluate and tune the performance of the Flink
> > > > > > > > >>        source and the whole pipeline; users don't need to
> > > > > > > > >>        worry about sink back pressure
> > > > > > > > >>
> > > > > > > > >> These may be taken into consideration all together as an
> > > > > > > > >> effort to lower the threshold of running Flink SQL/Table
> > > > > > > > >> API, and to facilitate users' daily work.
> > > > > > > > >>
> > > > > > > > >> Cheers,
> > > > > > > > >> Bowen
> > > > > > > > >>
> > > > > > > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li
> > > > > > > > >> <jingsongl...@gmail.com> wrote:
> > > > > > > > >>
> > > > > > > > >>> Hi all,
> > > > > > > > >>>
> > > > > > > > >>> I heard some users complain that Table is difficult to
> > > > > > > > >>> test. Now, with the SQL client, users are more and more
> > > > > > > > >>> inclined to use it to test rather than write a program.
> > > > > > > > >>> The most common example is the Kafka source. If users
> > > > > > > > >>> need to test their SQL output and checkpointing, they
> > > > > > > > >>> need to:
> > > > > > > > >>>
> > > > > > > > >>> - 1. Launch a Kafka standalone cluster and create a
> > > > > > > > >>> Kafka topic.
> > > > > > > > >>> - 2. Write a program, mock input records, and produce
> > > > > > > > >>> the records to the Kafka topic.
> > > > > > > > >>> - 3. Then test in Flink.
> > > > > > > > >>>
> > > > > > > > >>> Steps 1 and 2 are annoying, although the test is E2E.
> > > > > > > > >>>
> > > > > > > > >>> Then I found StatefulSequenceSource. It is very good
> > > > > > > > >>> because it already deals with checkpointing, so it plays
> > > > > > > > >>> well with the checkpoint mechanism. Usually, users have
> > > > > > > > >>> checkpointing turned on in production.
> > > > > > > > >>>
> > > > > > > > >>> With computed columns, users can easily create a
> > > > > > > > >>> sequence source DDL that looks the same as a Kafka DDL.
> > > > > > > > >>> Then they can test inside Flink and don't need to launch
> > > > > > > > >>> anything else.
> > > > > > > > >>>
> > > > > > > > >>> Have you considered this? What do you think?
> > > > > > > > >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author
> > > > > > > > >>> of StatefulSequenceSource.
> > > > > > > > >>>
> > > > > > > > >>> Best,
> > > > > > > > >>> Jingsong Lee
> > > > > > >
> > > > > > > --
> > > > > > > Best, Jingsong Lee
> > > > > >
> > > > > > --
> > > > > > Benchao Li
> > > > > > School of Electronics Engineering and Computer Science, Peking University
> > > > > > Tel: +86-15650713730
> > > > > > Email: libenc...@gmail.com; libenc...@pku.edu.cn
> > > >
> > > > --
> > > > Best, Jingsong Lee
> >
> > --
> > Best, Jingsong Lee
>
> --
> Konstantin Knauf
> https://twitter.com/snntrable
> https://github.com/knaufk