RE: Re: Re: Re: [DISCUSS] FLIP-239: Port JDBC Connector Source to FLIP-27

Roc Marshal Fri, 01 Jul 2022 21:27:38 -0700

Hi, Weike.

Thank you for your reply.
As you said, storing too many splits in the SourceEnumerator will increase the 
load on the JM.
What do you think about introducing a capacity limit on the number of splits 
held in the SourceEnumerator, together with a reject or callback mechanism in 
the on-demand split generation strategy for when that limit is reached?
Looking forward to a better solution.
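
To make the capacity idea concrete, here is a minimal sketch in plain Java. It 
assumes splits can be produced lazily by an Iterator. BoundedSplitBuffer and 
its methods are hypothetical names used only for illustration; they are not 
part of Flink or of FLIP-239, and a real enumerator would still have to 
checkpoint the generator position.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.Optional;
import java.util.Queue;

/**
 * Hypothetical sketch (not a Flink API): holds at most `capacity` pending
 * splits and pulls more from a lazy generator only when there is room.
 */
class BoundedSplitBuffer<T> {
    private final int capacity;
    private final Queue<T> pending = new ArrayDeque<>();
    private final Iterator<T> generator;

    BoundedSplitBuffer(int capacity, Iterator<T> generator) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.generator = generator;
    }

    /** "Reject" path: returns false instead of growing past the capacity. */
    boolean offer(T split) {
        if (pending.size() >= capacity) {
            return false;
        }
        pending.add(split);
        return true;
    }

    /** Pulls from the generator until the buffer is full or it is exhausted. */
    int refill() {
        int added = 0;
        while (pending.size() < capacity && generator.hasNext()) {
            pending.add(generator.next());
            added++;
        }
        return added;
    }

    /** Hands out the next split, refilling on demand when the buffer is empty. */
    Optional<T> poll() {
        if (pending.isEmpty()) {
            refill();
        }
        return Optional.ofNullable(pending.poll());
    }

    int size() {
        return pending.size();
    }

    /** True once no split is buffered and the generator has nothing left. */
    boolean isExhausted() {
        return pending.isEmpty() && !generator.hasNext();
    }
}
```

With this shape, the memory held for unassigned splits is bounded by the 
capacity, and a caller that gets false from offer() can retry later or 
register a callback instead of queueing more splits.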


Best regards,
Roc Marshal

On 2022/07/01 07:58:22 Dong Weike wrote:
> Hi,
> 
> Thank you for bringing this up, and I am +1 for this feature.
> 
> IMO, one important thing that I would like to mention is that an 
> improperly-designed FLIP-27 connector could impose very severe memory 
> pressure on the JobManager, especially when there is an enormous number of 
> splits for the source tables, e.g. when there are billions of records to read. 
> Frankly speaking, we have been haunted by this problem for a long time when 
> using the Flink CDC Connectors to read large tables.
> 
> Therefore, to prevent the JobManager from experiencing frequent OOM 
> faults, JdbcSourceEnumerator should avoid keeping too many JdbcSourceSplits in 
> the unassigned list. It would be better if all splits were computed on the 
> fly.
> 
> Best,
> Weike
> 
> -----Original Message-----
> From: Lijie Wang <wa...@gmail.com> 
> Sent: July 1, 2022, 10:25 AM
> To: dev@flink.apache.org
> Subject: Re: Re: [DISCUSS] FLIP-239: Port JDBC Connector Source to FLIP-27
> 
> Hi Roc,
> 
> Thanks for driving the discussion.
> 
> Could you describe in detail what the JdbcSourceSplit represents? It looks 
> like something is wrong with the comment on JdbcSourceSplit in the FLIP (it is 
> described as "A {@link SourceSplit} that represents a file, or a region of a 
> file....").
> 
> Best,
> Lijie
> 
> 
> On Thu, Jun 30, 2022 at 21:41, Roc Marshal <fl...@126.com> wrote:
> 
> > Hi, Boto.
> >     Thanks for your reply.
> >
> >    +1 from me on defining a watermark strategy for the 'streaming' & table 
> > source. I'm not sure if FLIP-202[1] is suitable for a separate 
> > discussion, but I think your proposal is very helpful to the new 
> > source. It would be great if the new source could be compatible with this 
> > abstraction.
> >
> >    In addition, do we need to support the following special bounded 
> > scenario?
> >    The number of JdbcSourceSplits is fixed, but the time needed to 
> > generate all of them is not known in advance in a user-defined 
> > implementation. Once the user-defined end condition of the split 
> > generation process is met, no more JdbcSourceSplits are generated.
> > After all JdbcSourceSplits have been processed, the 
> > JdbcSourceSplitEnumerator notifies the readers that there are no more 
> > JdbcSourceSplits.
> >
> > - [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-202%3A+Introduce+ClickHouse+Connector
> >
> > Best regards,
> > Roc Marshal
> >
> > On 2022/06/30 09:02:23 João Boto wrote:
> > > Hi,
> > >
> > > On the source, we could improve the JdbcParameterValuesProvider so that
> > > it can be defined as one or more queries, or something more dynamic.
> > > Most of the time, if your job is dynamic or has some condition to be
> > > met (based on data in the table), you have to create a connection and
> > > get that info from the database.
> > >
> > > If we are going to create/allow a "streaming" JDBC source, we should
> > > be able to define a watermark and fetch new data from the table using
> > > that watermark.
> > >
> > >
> > > For the sink (though it could also apply to the source), it would be
> > > great to be able to set your own implementation of the connection
> > > type. For example, if you are connecting to ClickHouse, you could set
> > > an implementation based on "BalancedClickhouseDataSource" (this[1]
> > > implementation has an example), or set an extended version of an
> > > implementation for debugging purposes.
> > >
> > > Regards
> > >
> > >
> > > [1] https://github.com/apache/flink/pull/20097/files#diff-8b36e3403381dc14c748aeb5de0b4ceb7d7daec39594b1eacff1694b5266419d
> > >
> > > On 2022/06/27 13:09:51 Roc Marshal wrote:
> > > > Hi, all,
> > > >
> > > >
> > > >
> > > >
> > > > I would like to open a discussion on porting the JDBC Source to the
> > > > new Source API (FLIP-27[1]).
> > > >
> > > > Martijn Visser, Jing Ge and I had a preliminary discussion on the
> > > > JIRA FLINK-25420[2] and planned to start the discussion with the
> > > > source part first.
> > > >
> > > >
> > > >
> > > > Please let me know:
> > > >
> > > > - Any issues you have encountered with the old JDBC source;
> > > > - Any new features or design you would like to see;
> > > > - Any other suggestions from other perspectives...
> > > >
> > > >
> > > >
> > > > You could find more details in FLIP-239[3].
> > > >
> > > > Looking forward to your feedback.
> > > >
> > > >
> > > >
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
> > > >
> > > > [2] https://issues.apache.org/jira/browse/FLINK-25420
> > > >
> > > > [3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386271
> > > >
> > > >
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Roc Marshal
> > >
>
