[ https://issues.apache.org/jira/browse/FLINK-15775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17328079#comment-17328079 ]
Flink Jira Bot commented on FLINK-15775:
----------------------------------------

This major issue is unassigned, and it and all of its sub-tasks have not been updated for 30 days, so it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update, then remove the label. In 7 days the issue will be deprioritized.

> SourceFunctions are instantiated twice when pulled on from 2 Sinks
> ------------------------------------------------------------------
>
>                 Key: FLINK-15775
>                 URL: https://issues.apache.org/jira/browse/FLINK-15775
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Planner
>    Affects Versions: 1.9.1, 1.10.0
>            Reporter: Benoît Paris
>            Priority: Major
>              Labels: stale-major
>         Attachments: flink-test-duplicated-sources.zip
>
> When pulled on by two sinks, the SourceFunctions of a TableSource are instantiated twice (each instance is then opened once per parallel subtask, which is expected behavior).
>
> The following instantiates the FooTableSource's SourceFunction once (OK behavior, but not the processing we want):
>
> {code:java}
> tEnv.registerTableSource("foo_table", new FooTableSource());
> Table out0 = tEnv.sqlQuery("SELECT * FROM foo_table WHERE field_1 = 0");
> tEnv.registerTableSink("syso_sink_0", new SysoSink());
> out0.insertInto("syso_sink_0");
> {code}
>
> This instantiates the FooTableSource's SourceFunction twice (not OK, as we're missing half the inputs in each SysoSink):
>
> {code:java}
> tEnv.registerTableSource("foo_table", new FooTableSource());
> Table out0 = tEnv.sqlQuery("SELECT * FROM foo_table WHERE field_1 = 0");
> Table out1 = tEnv.sqlQuery("SELECT * FROM foo_table WHERE field_1 = 1");
> tEnv.registerTableSink("syso_sink_0", new SysoSink());
> tEnv.registerTableSink("syso_sink_1", new SysoSink());
> out0.insertInto("syso_sink_0");
> out1.insertInto("syso_sink_1");
> {code}
>
> This might not be a problem for Kafka's SourceFunctions, as we can always reread from a log; but it is a data-loss problem when the source data can't be reproduced.
>
> Actually, this might be me not understanding the API. Is there a way to make the runtime read from the same opened SourceFunctions?
>
> Attached is Java code that logs the faulty opening of the SourceFunctions, a pom.xml, and the logical execution plans for the duplicated case and for the workaround.
>
> ----
>
> Workaround: convert the table to an append stream and back. Somehow this makes the planner put a materialization barrier after the source and read from that:
>
> {code:java}
> tEnv.registerTableSource("foo_table_source", new FooTableSource());
> Table sourceTable = tEnv.sqlQuery("SELECT * FROM foo_table_source");
> Table appendingSourceTable = tEnv.fromDataStream(
>     tEnv.toAppendStream(sourceTable, Types.ROW(new String[]{"field_1"}, new TypeInformation[]{Types.LONG()}))
> );
> tEnv.registerTable("foo_table", appendingSourceTable);
> {code}
>
> Best Regards,
> Ben

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
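For readers without the attached reproduction, the reported behavior can be illustrated outside Flink with a plain-Java sketch: a source class that counts its own constructor calls, and a "planner" that naively builds one independent pipeline per sink. All names here (`CountingSource`, `buildPipelineForSink`) are hypothetical stand-ins, not Flink API; this only mirrors the instantiation counting that the attached reproduction logs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SourceDuplicationSketch {

    // Stand-in for a SourceFunction; counts how many times it is instantiated.
    static class CountingSource {
        static final AtomicInteger INSTANTIATIONS = new AtomicInteger();
        CountingSource() {
            INSTANTIATIONS.incrementAndGet();
        }
    }

    // Stand-in for a planner translating one sink's pipeline: it constructs
    // its own copy of the source instead of sharing a single instance.
    static CountingSource buildPipelineForSink(String sinkName, List<String> plan) {
        plan.add(sinkName + " <- new source instance");
        return new CountingSource();
    }

    public static void main(String[] args) {
        List<String> plan = new ArrayList<>();
        buildPipelineForSink("syso_sink_0", plan);
        buildPipelineForSink("syso_sink_1", plan);
        // Two sinks lead to two source instantiations, so each sink's copy
        // sees only the elements its own instance happens to produce.
        System.out.println("instantiations = " + CountingSource.INSTANTIATIONS.get());
    }
}
```

If the source were shared (as the workaround forces via the append-stream materialization), both sinks would read from the same instance and the counter would stay at 1.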