Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-29 Thread Alexander Fedulov
A quick update: - Since there is currently another initiative ongoing for building rate limiting that potentially covers a wider range of use cases, it was decided not to expose the RateLimiter API publicly in this FLIP. It now has a package private visibility and can later be swapped with a more u

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-18 Thread Alexander Fedulov
Hi all, I updated the FLIP [1] to make it more extensible with the introduction of *SourceReaderFactory. *It gives users the ability to further customize the data generation and emission process if needed. I also incorporated the suggestion from Qingsheng and moved to the generator function design

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-06 Thread Alexander Fedulov
Hi Becket, interesting points about the discrepancies in the *RuntimeContext* "wrapping" throughout the framework, but I agree - this is something that needs to be tackled separately. For now, I adjusted the FLIP and the PoC implementation to only expose the parallelism. Best, Alexander Fedulov

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-05 Thread Becket Qin
Hi Alex, Personally I prefer the latter option, i.e. just add the currentParallelism() method. It is easy to add more stuff to the SourceReaderContext in the future, and it is likely that most of the stuff in the RuntimeContext is not required by the SourceReader implementations. For the purpose o

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-05 Thread Alexander Fedulov
Hi Becket, I agree with you. We could introduce a *ReadOnlyRuntimeContext* that would act as a holder for the *RuntimeContext* data. This would also require read-only wrappers for the exposed fields, such as *ExecutionConfig*. Alternatively, we just add the *currentParallelism()* method for now an

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-04 Thread Becket Qin
Hi Alex, While it is true that the RuntimeContext gives access to all the stuff the framework can provide, it seems a little overkilling for the SourceReader. It is probably OK to expose all the read-only information in the RuntimeContext to the SourceReader, but we may want to hide the "write" me

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-04 Thread Alexander Fedulov
Hi Becket, I updated and extended FLIP-238 accordingly. Here is also my POC branch [1]. DataGeneratorSourceV3 is the class that I currently converged on [2]. It is based on the expanded SourceReaderContext. A couple more relevant classes [3] [4] Would appreciate it if you could take a quick look

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-04 Thread Alexander Fedulov
Hi Becket, Exposing the RuntimeContext is potentially even more useful. Do you think it is worth having both currentParallelism() and getRuntimeContext() methods? One can always call getNumberOfParallelSubtasks() on the RuntimeContext directly if we expose it. Best, Alexander Fedulov On Mon, J

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-03 Thread Becket Qin
Hi Alex, Yes, that is what I had in mind. We need to add the method getRuntimeContext() to the SourceReaderContext interface as well. Thanks, Jiangjie (Becket) Qin On Mon, Jul 4, 2022 at 3:01 AM Alexander Fedulov wrote: > Hi Becket, > > thanks for your input. I like the idea of adding the par

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-03 Thread Alexander Fedulov
Hi Becket, thanks for your input. I like the idea of adding the parallelism to the SourceReaderContext. My understanding is that any change of parallelism causes recreation of all readers, so it should be safe to consider it "fixed" after the readers' initialization. In that case, it should be as

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-01 Thread Becket Qin
Hi Alex, In FLIP-27 source, the SourceReader can get a SourceReaderContext. This is passed in by the TM in Source#createReader(). And supposedly the Source should pass this to the SourceReader if needed. In the SourceReaderContext, currently only the index of the current subtask is available, but

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-30 Thread Alexander Fedulov
Hi all, getting back to the idea of reusing FlinkConnectorRateLimiter: it is designed for the SourceFunction API and has an open() method that takes a RuntimeContext. Therefore, we need to add a different interface for the new Source API. This is where I see a certain limitation for the rate-limi

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-20 Thread David Anderson
I'm very happy with this. +1 A lot of SourceFunction implementations used in demos/POC implementations include a call to sleep(), so adding rate limiting is a good idea, in my opinion. Best, David On Mon, Jun 20, 2022 at 10:10 AM Qingsheng Ren wrote: > Hi Alexander, > > Thanks for creating thi

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-20 Thread Qingsheng Ren
Hi Alexander, Thanks for creating this FLIP! I’d like to share some thoughts. 1. About the “generatorFunction” I’m expecting an initializer on it because it’s hard to require all fields in the generator function are serializable in user’s implementation. Providing a function like “open” in the

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-16 Thread Alexander Fedulov
Hi Jing, thanks for your thorough analysis. I agree with the points you make and also with the idea to approach the larger task of providing a universal (DataStream + SQL) data generator base iteratively. Regarding the name, the SourceFunction-based *DataGeneratorSource* resides in the *org.apache

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-14 Thread Steven Wu
Alex, thanks a lot for capturing the checkpoint lockstep emitting source. It doesn't have to be the same wrapper class. It could be another wrapper source (like ManualSource) where users supply the exact records emitted by the source per checkpoint. On Tue, Jun 14, 2022 at 10:18 AM Jing Ge wrote:

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-14 Thread Jing Ge
Hi, After reading all discussions posted in this thread and the source code of DataGeneratorSource which unfortunately used "Source" instead of "SourceFunction" in its name, issues could summarized as following: 1. The current DataGeneratorSource based on SourceFunction is a stateful source conne

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-14 Thread Alexander Fedulov
Hi Steven, FYI, I've added your requirement to the list of subtasks for deprecating the SourceFunction API [1] [2]. [1] https://issues.apache.org/jira/browse/FLINK-28045 [2] https://issues.apache.org/jira/browse/FLINK-28054 Best, Alexander Fedulov On Tue, Jun 7, 2022 at 6:03 PM Steven Wu wrot

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-09 Thread Martijn Visser
Hey Alex, Yes, I think we need to make sure that we're not causing confusion (I know I already was confused). I think the DataSupplierSource is already better, but perhaps there are others who have an even better idea. Thanks, Martijn Op do 9 jun. 2022 om 14:28 schreef Alexander Fedulov < alexa

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-09 Thread Alexander Fedulov
Hi Martijn, It seems that they serve a bit different purposes though. The DataGenTableSource is for generating random data described by the Table DDL and is tied into the RowDataGenerator/DataGenerator concept which is implemented as an Iterator. The proposed API in contrast is supposed to provid

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-09 Thread Martijn Visser
Hi Alex, Thanks for creating the FLIP and opening up the discussion. +1 overall for getting this in place. One question: you've already mentioned that this focussed on the DataStream API. I think it would be a bit confusing that we have a Datagen connector (on the Table side) that wouldn't levera

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-08 Thread Alexander Fedulov
Hi Xianxun, Thanks for bringing it up. I do believe it would be useful to have such a CDC data generator but I see the efforts to provide one a bit orthogonal to the DataSourceGenerator proposed in the FLIP. FLIP-238 focuses on the DataStream API and I could see integration into the Table/SQL ecos

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-07 Thread Alexander Fedulov
I looked a bit further and it seems it should actually be easier than I initially thought: SourceReader extends CheckpointListener interface and with its custom implementation it should be possible to achieve similar results. A prototype that I have for the generator uses an IteratorSourceReader u

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-07 Thread Alexander Fedulov
Hi Steven, This is going to be tricky since in the new Source API the checkpointing aspects that you based your logic on are pushed further away from the low-level interfaces responsible for handling data and splits [1]. At the same time, the SourceCoordinatorProvider is hardwired into the interna

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-06-07 Thread Steven Wu
In Iceberg source, we have a data generator source that can control the records per checkpoint cycle. Can we support sth like this in the DataGeneratorSource? https://github.com/apache/iceberg/blob/master/flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/source/BoundedTestSource.java public