Hi everyone,
Thanks for the proposal.
At our company, we have run into the same situation as @liu shouwei.
We developed some features on top of Flink, such as configurable parallelism
for SQL source/sink connectors, and a Kafka delayed consumer, which adds a
flatMap and a keyBy transformation after the source DataStream.
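A minimal sketch of that pattern (all the class names below are placeholders
for our internal code):

    // attach a target emit time to each record, key the stream so timers
    // are available, then hold every record until its delay has elapsed
    DataStream<Record> source = env.addSource(new FlinkKafkaConsumer<>(topic, schema, props));
    source
        .flatMap(new AssignDelayDeadline())
        .keyBy(Record::getKey)
        .process(new EmitAfterDelayFunction());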
What puts us in an awkward position is that when we migrated these features to
Flink 1.11, we found that the DataStream was no longer accessible, so we
modified the Blink planner's code to support parallelism. But the Kafka
delayed consumer remains unsolved until now.
From a user's perspective, it is necessary to either manipulate the DataStream
directly or have interoperability between the Table API and DataStream.
Best
> On Sep 25, 2020, at 4:18 PM, Rui Li wrote:
>
> Hi Jingsong,
>
> Thanks for driving this effort. I have two minor comments.
>
>
> 1. IMHO, parallelism is a concept that applies to every ScanTableSource.
> So instead of defining a new interface, wouldn't it be more natural to
> incorporate parallelism inference into existing interfaces, e.g. ScanTableSource
> or ScanRuntimeProvider?
> 2. `scan.infer-parallelism.enabled` doesn't seem very useful to me. From
> a user's perspective, parallelism is either set by `scan.parallelism`, or
> automatically decided by Flink. If a user doesn't want the connector to
> infer parallelism, he/she can simply set `scan.parallelism`, no?
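> For example, something like (option names as discussed in this thread; the
> connector and schema are just for illustration):
>
>    CREATE TABLE my_source (
>      id BIGINT,
>      name STRING
>    ) WITH (
>      'connector' = 'filesystem',
>      'path' = '/tmp/data',
>      'format' = 'csv',
>      -- fix the source parallelism; omit this to let it be inferred
>      'scan.parallelism' = '4'
>    );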
>
>
> On Fri, Sep 25, 2020 at 3:33 PM Jingsong Li wrote:
>
>> Hi Aljoscha,
>>
>> Thank you for your feedback,
>>
>> ## Connector parallelism
>>
>> Requirement:
>> Parallelism can be set explicitly by the user or inferred by the connector.
>>
>> How to configure parallelism in DataStream:
>> In the DataStream world, the only way to configure parallelism is
>> `SingleOutputStreamOperator.setParallelism`. Actually, users need to have
>> access to DataStream when using a connector, not just the `SourceFunction`
>> / `Source` interface.
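>> For example (`MySourceFunction` here is just a placeholder):
>>
>>    // DataStreamSource extends SingleOutputStreamOperator, so the user can
>>    // set the parallelism right where the connector is instantiated
>>    DataStreamSource<String> source = env.addSource(new MySourceFunction());
>>    source.setParallelism(4);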
>> Is parallelism related to connectors? I think yes: many connectors expose
>> information from which a suitable parallelism can be derived, and users do
>> exactly that. This is what parallelism inference (from connectors) means.
>> The key is that `DataStream` is an open programming API, and users can
>> freely program to set parallelism.
>>
>> How to configure parallelism in Table/SQL:
>> But Table/SQL is not an open programming API: every feature needs a
>> corresponding mechanism, because the user can no longer program freely. Our
>> current connector interfaces, SourceFunctionProvider and SinkFunctionProvider,
>> offer no way to supply connector-related parallelism.
>> Back to our original intention: to avoid users directly manipulating
>> `DataStream`. Since we want to avoid it, we need to provide corresponding
>> features.
>>
>> And parallelism is runtime information of a connector, so it fits the name
>> of `ScanRuntimeProvider`.
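>> A sketch of what this could look like (just for discussion, not a final
>> interface):
>>
>>    public interface ParallelismProvider {
>>        // Optional.empty() means: let the planner decide the parallelism
>>        default Optional<Integer> getParallelism() {
>>            return Optional.empty();
>>        }
>>    }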
>>
>>> If we wanted to add a "get parallelism" it would be in those underlying
>> connectors but I'm also skeptical about adding such a method there because
>> it is a static assignment and would preclude clever optimizations about the
>> parallelism of a connector at runtime.
>>
>> I think that when a job is submitted, we are at compile time. Compile time
>> should only provide static parallelism.
>>
>> ## DataStream in table connector
>>
>> As I said before, if we want to completely remove DataStream from the table
>> connectors, we need to provide corresponding capabilities in the
>> `xxRuntimeProvider`s.
>> Otherwise, we and our users may not be able to migrate the old connectors,
>> including the Hive/FileSystem connectors and the user cases I mentioned above.
>> CC: @liu shouwei
>>
>> We really need to consider these cases.
>> If there is no alternative in the short term, users will have to keep using
>> the old table connector API, which has already been deprecated, for a long
>> time.
>>
>> Why not use StreamTableEnvironment fromDataStream/toDataStream?
>> - These tables are only temporary tables; they cannot be integrated/stored
>> into a Catalog.
>> - CREATE TABLE DDL cannot work...
>> - We would lose the useful Table/SQL features of the connector, for example
>> projection pushdown, filter pushdown, partitions, etc.
>>
>> But I believe you are right in the long run. The source and sink APIs
>> should be powerful enough to cover all reasonable cases.
>> Maybe we can just introduce them in a minimal way. For example, we could only
>> introduce `DataStreamSinkProvider` in the planner as an internal API.
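>> Roughly along these lines (only a sketch of the shape):
>>
>>    @Internal
>>    public interface DataStreamSinkProvider extends DynamicTableSink.SinkRuntimeProvider {
>>        // the planner passes in the input DataStream; the connector builds
>>        // and returns the sink transformation itself
>>        DataStreamSink<?> consumeDataStream(DataStream<RowData> dataStream);
>>    }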
>>
>> Your points are very meaningful; I hope to hear back from you.
>>
>> Best,
>> Jingsong
>>
>> On Fri, Sep 25, 2020 at 10:55 AM wenlong.lwl
>> wrote:
>>
>>> Hi Aljoscha, I would like to share a use case that seconds setting the
>>> parallelism of a table sink (or limiting the parallelism range of a table
>>> sink): when writing data to databases, there are limits on the number of
>>> JDBC connections and on query TPS. We would get too-many-connections errors,
>>> high load on the DB, and poor performance because of too many smal