Hi Becket,

It is true that concepts such as projection and filtering are worth having in the DataStream API as well, and a SourceFunction can provide interfaces for those concepts. In the table-related classes, we will generate runtime classes that adhere to those interfaces and deal with RowData (possibly code-generated and working on binary data).

We consider everything in the DataStream API as "runtime" classes. Which components a SourceFunction needs is an orthogonal discussion.

Currently, I could imagine something like:

SourceFunction(FlatMapFunction, WatermarkAssigner)

Once we have completed this discussion, the table sources and sinks will generate a FlatMapFunction (out of projection, filter, and computed columns) and a WatermarkAssigner (out of computed columns). Until then, we will keep using the old interfaces.
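
Just to sketch the idea in code (all names here are placeholders, not an agreed API):

import org.apache.flink.api.common.functions.FlatMapFunction;

// Rough sketch only; "WatermarkAssigner" is a placeholder here,
// not an existing Flink class.
interface WatermarkAssigner<T> {
    long extractTimestamp(T record);
}

interface ComposedSource<RAW, OUT> {
    // code-generated from projection, filter, and computed columns
    FlatMapFunction<RAW, OUT> recordTransformation();
    // code-generated from computed columns (e.g. a rowtime expression)
    WatermarkAssigner<OUT> watermarkAssigner();
}

The planner would provide code-generated implementations of both components; DataStream users could pass hand-written ones.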

Regards,
Timo


On 24.03.20 15:03, Becket Qin wrote:
Hey Kurt,

I don't think DataStream should see some SQL-specific concepts such as Filtering or ComputedColumn.


Projectable and Filterable are not necessarily SQL concepts; they could be applicable to a DataStream source as well to reduce the network load. For example, ORC and Parquet should probably also be readable from DataStream, right?

ComputedColumn is not part of the Source; it is an interface that extends the Deserializer, which is a pluggable component of the Source. From SQL's perspective there is the concept of a computed column, but from the Source's perspective it is essentially a Deserializer that also converts the records internally, assuming we allow some conversion to be embedded in the source in addition to just deserialization.
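
A rough sketch of what I mean (both interface names are hypothetical):

import java.io.IOException;

// Hypothetical: the basic pluggable deserializer of a Source.
interface Deserializer<T> {
    T deserialize(byte[] raw) throws IOException;
}

// Hypothetical: from SQL's perspective this backs computed columns,
// but from the Source's perspective it is just another Deserializer
// that additionally converts the record internally.
interface ComputedColumnDeserializer<T> extends Deserializer<T> {
    T appendComputedFields(T record);
}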

Thanks,

Jiangjie (Becket) Qin

On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <imj...@gmail.com> wrote:

Thanks Timo for updating the formats section. That would be very helpful for changelog support (FLIP-105).

I just left two minor comments about some method names. In general, I'm +1 to start a vote.


--------------------------------------------------------------------------------------------------

Hi Becket,

I agree we shouldn't duplicate code, especially the runtime implementations.
However, the interfaces proposed by FLIP-95 are mainly used during optimization (compilation), not at runtime. I don't think there is much to share there, because Table/SQL is declarative while DataStream is imperative.
Take filter pushdown as an example: a DataStream FilterableSource may accept a FilterFunction (which is a black box for the source). Table sources, however, should pick up the pushed-down filter expressions, and some sources may only support "=", "<", ">" conditions. Pushing a FilterFunction doesn't work in the table ecosystem. That means the connectors have to have some table-specific implementations.
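
To make the contrast concrete, a sketch of both shapes (the second is simplified from what the FLIP proposes):

import java.util.List;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.table.expressions.ResolvedExpression;

// DataStream style: the source only gets an opaque function that it
// can evaluate but not inspect (sketch, not a real API).
interface FilterableSource<T> {
    void applyFilter(FilterFunction<T> filter);
}

// Table style: the source sees the actual expressions and returns the
// ones it cannot consume (simplified from the FLIP-95 proposal).
interface SupportsFilterPushDown {
    List<ResolvedExpression> applyFilters(List<ResolvedExpression> filters);
}

A source that only supports "=", "<", ">" can pick exactly those expressions and return the rest to the planner; with an opaque FilterFunction that is impossible.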


Best,
Jark

On Tue, 24 Mar 2020 at 20:41, Kurt Young <ykt...@gmail.com> wrote:

Hi Becket,

I don't think DataStream should see some SQL-specific concepts such as Filtering or ComputedColumn. It's better to stay within the SQL area and translate to more generic concepts when moving to the DataStream/runtime layer, such as using a MapFunction to represent computed column logic.
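
For example, a computed column like `cost AS price * quantity` could be translated into a plain MapFunction over Rows before handing it to the runtime (just a sketch; the field layout is made up):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.types.Row;

// Sketch: the computed column `cost AS price * quantity` translated
// into a generic MapFunction (field positions are made up).
MapFunction<Row, Row> computedColumn = row -> {
    Row out = new Row(3);
    out.setField(0, row.getField(0)); // price
    out.setField(1, row.getField(1)); // quantity
    out.setField(2, ((Number) row.getField(0)).doubleValue()
            * ((Number) row.getField(1)).doubleValue()); // cost
    return out;
};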

Best,
Kurt


On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <becket....@gmail.com> wrote:

Hi Timo and Dawid,

It's really great that we have the same goal. I am actually wondering if we can go one step further and avoid some of the interfaces in Table as well.

For example, if we have the FilterableSource, do we still need the FilterableTableSource? Should DynamicTableSource just become a Source<Row, SourceSplitT, EnumChkT>?

Can you help me understand a bit more the reason we need the following relational representation / wrapper interfaces vs. the interfaces that we could put on the Source in FLIP-27?

DynamicTableSource vs. Source<Row, SourceSplitT, EnumChkT>
SupportsFilterablePushDown vs. FilterableSource
SupportsProjectablePushDown vs. ProjectableSource
SupportsWatermarkPushDown vs. WithWatermarkAssigner
SupportsComputedColumnPushDown vs. ComputedColumnDeserializer
ScanTableSource vs. ChangeLogDeserializer
LookUpTableSource vs. LookUpSource

Assuming we have all the interfaces on the right side, do we still need the interfaces on the left side? Note that the interfaces on the right can be used by both DataStream and Table. If we do this, there will only be one set of Source interfaces for Table and DataStream; the only difference is that the Source for Table will have some specific plugins and configurations. An omnipotent Source can implement all of the above interfaces and take a Deserializer that implements both ComputedColumnDeserializer and ChangeLogDeserializer.
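
In code, such an omnipotent source might look like this (the FLIP-27 Source plus the hypothetical mixins from the list above):

class OmnipotentSource<SplitT, EnumChkT>
        implements Source<Row, SplitT, EnumChkT>,
                   FilterableSource<Row>,
                   ProjectableSource<Row>,
                   WithWatermarkAssigner {
    ...
}

// and a Deserializer covering both computed columns and changelogs
class OmniDeserializer implements ComputedColumnDeserializer<Row>,
                                  ChangeLogDeserializer<Row> {
    ...
}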

Would the SQL planner work with that?

Thanks,

Jiangjie (Becket) Qin

On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <jingsongl...@gmail.com> wrote:

+1. Thanks Timo for the design doc.

We could also consider @Experimental. But I am +1 to @PublicEvolving; we

Best,
Jingsong Lee

On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <twal...@apache.org> wrote:

@Becket: We totally agree that we don't need table-specific connectors during runtime. As Dawid said, the interfaces proposed here are just for communication with the planner. Once the properties (watermarks, computed columns, filters, projection, etc.) are negotiated, we can configure a regular Flink connector.

E.g. setting the watermark assigner and the deserialization schema of a Kafka connector.
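
In DataStream terms, roughly (a simplified sketch; `deserializationSchema` and `watermarkAssigner` stand for the planner-generated instances):

// Simplified: the negotiated pieces become regular configuration
// of the existing Kafka connector.
FlinkKafkaConsumer<RowData> consumer = new FlinkKafkaConsumer<>(
    "topic",
    deserializationSchema,  // generated: deserialization + computed columns
    kafkaProperties);
// watermark assigner generated from the WATERMARK clause in the DDL
consumer.assignTimestampsAndWatermarks(watermarkAssigner);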

For better separation of concerns, Flink connectors should not include relational interfaces or depend on flink-table. This is the responsibility of the table source/sink.

@Kurt: I would like to mark them @PublicEvolving already because we need to deprecate the old interfaces as early as possible. We cannot redirect to @Internal interfaces. They are not marked @Public, so we can still evolve them. But a core design shift should not happen again; it would leave a bad impression if we redesign over and over again. Instead, we should be confident in the current change.

Regards,
Timo


On 24.03.20 09:20, Dawid Wysakowicz wrote:
Hi Becket,

Answering your question: we have the same intention not to duplicate connectors between the DataStream and Table APIs. The interfaces proposed in the FLIP are a way to describe the relational properties of a source. The intention is, as you described, to translate all of those expressed as expressions or other Table-specific structures into a DataStream source. In other words, I think what we are doing here is in line with what you described.

Best,

Dawid

On 24/03/2020 02:23, Becket Qin wrote:
Hi Timo,

Thanks for the proposal. I completely agree that the current Table connectors could be simplified quite a bit. I haven't finished reading everything, but here are some quick thoughts.

Actually, to me the biggest question is: why should there be two different connector systems for DataStream and Table? What is the fundamental reason that prevents us from merging them into one?

The basic functionality of a connector is to provide the capabilities to do IO and serde. Conceptually, Table connectors should just be DataStream connectors that deal with Rows. It seems that quite a few of the special connector requirements are just a specific way to do IO/serde. Taking SupportsFilterPushDown as an example, imagine we have the following interface:

interface FilterableSource<PREDICATE> {
    void applyFilterable(Supplier<PREDICATE> predicate);
}

And if a ParquetSource would like to support filtering, it would become:

class ParquetSource implements Source, FilterableSource<FilterPredicate> {
    ...
}

For Table, one just needs to provide a predicate supplier that converts an Expression into the specified predicate type. This has a few benefits:
1. The same unified filter API for sources, regardless of DataStream or Table.
2. DataStream users can now also use the ExpressionToPredicate supplier if they want to.
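
Such a supplier could look roughly like this (a sketch; the per-expression translation is connector-specific and left abstract):

import java.util.List;
import java.util.function.Supplier;
import org.apache.flink.table.expressions.Expression;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

// Sketch: lazily converts Table expressions into Parquet's native
// predicate type; the translation of a single expression is abstract.
abstract class ExpressionToPredicate implements Supplier<FilterPredicate> {

    private final List<Expression> expressions;

    protected ExpressionToPredicate(List<Expression> expressions) {
        this.expressions = expressions;
    }

    // connector-specific translation of a single expression
    protected abstract FilterPredicate translate(Expression expression);

    @Override
    public FilterPredicate get() {
        // AND all translated predicates together
        FilterPredicate result = null;
        for (Expression e : expressions) {
            FilterPredicate p = translate(e);
            result = (result == null) ? p : FilterApi.and(result, p);
        }
        return result;
    }
}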

To summarize, my main point is that I am wondering if it is possible to have a single set of connector interfaces for both Table and DataStream, rather than having two hierarchies. I am not 100% sure if this would work, but if it does, it would be a huge win from both a code maintenance and a user experience perspective.

Thanks,

Jiangjie (Becket) Qin



On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

Hi Timo,

Thank you for the proposal. I think it is an important improvement that will benefit many parts of the Table API. The proposal looks really good to me, and personally I would be comfortable with voting on the current state.

Best,

Dawid

On 23/03/2020 18:53, Timo Walther wrote:
Hi everyone,

I received some questions around how the new interfaces play together with formats and their factories.

Furthermore, for MySQL or Postgres CDC logs, the format should be able to return a `ChangelogMode`.
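
For example, a Debezium-style format could declare the row kinds it produces via the proposed ChangelogMode (sketch):

import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.types.RowKind;

// Simplified: a CDC format declares which kinds of changes it emits.
ChangelogMode mode = ChangelogMode.newBuilder()
    .addContainedKind(RowKind.INSERT)
    .addContainedKind(RowKind.UPDATE_BEFORE)
    .addContainedKind(RowKind.UPDATE_AFTER)
    .addContainedKind(RowKind.DELETE)
    .build();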

Also, I incorporated the feedback around the factory design in general.

I added a new section `Factory Interfaces` to the design document. This should be helpful for understanding the big picture and connecting the concepts.

Please let me know what you think.

Thanks,
Timo


On 18.03.20 13:43, Timo Walther wrote:
Hi Benchao,

this is a very good question. I will update the FLIP about this.

The legacy planner will not support the new interfaces. It will only support the old interfaces. With the next release, I think the Blink planner is stable enough to be the default one as well.

Regards,
Timo

On 18.03.20 08:45, Benchao Li wrote:
Hi Timo,

Thank you and others for the efforts to prepare this FLIP.

The FLIP LGTM generally.

+1 for moving the Blink data structures to table-common; they will be useful for UDFs too in the future.
A little question: do we plan to support the new interfaces and data types in the legacy planner, or only in the Blink planner?

And using primary keys from the DDL instead of key information derived from each query is also a good idea; we met some use cases where this did not work very well before.

This FLIP also makes the dependencies of the table modules clearer; I like it very much.

Timo Walther <twal...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:

Hi everyone,

I'm happy to present the results of long discussions that we had internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many more have contributed to this design document.

We would like to propose new long-term table source and table sink interfaces:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces

This is a requirement for FLIP-105 and finalizing FLIP-32.

The goals of this FLIP are:

- Simplify the current interface architecture:
      - Merge upsert, retract, and append sinks.
      - Unify batch and streaming sources.
      - Unify batch and streaming sinks.

- Allow sources to produce a changelog:
      - UpsertTableSources have been requested a lot by users. Now is the time to open the internal planner capabilities via the new interfaces.
      - According to FLIP-105, we would like to support changelogs for processing formats such as Debezium.

- Don't rely on DataStream API for sources and sinks:
      - According to FLIP-32, the Table API and SQL should be independent of the DataStream API, which is why the `table-common` module has no dependencies on `flink-streaming-java`.
      - Source and sink implementations should only depend on the `table-common` module after FLIP-27.
      - Until FLIP-27 is ready, we still put most of the interfaces in `table-common` and strictly separate interfaces that communicate with the planner from the actual runtime readers/writers.

- Implement efficient sources and sinks without planner dependencies:
      - Make Blink's internal data structures available to connectors.
      - Introduce stable interfaces for data structures that can be marked as `@PublicEvolving`.
      - Only require dependencies on `flink-table-common` in the future.

It finalizes the concept of dynamic tables and considers how all source/sink related classes play together.

We look forward to your feedback.

Regards,
Timo

--
Best, Jingsong Lee
