A quick update:
- Since there is currently another initiative ongoing for building rate
limiting that potentially covers a wider range of use cases, it was decided
not to expose the RateLimiter API publicly in this FLIP. It now has a
package private visibility and can later be swapped with a more u
Hi all,
I updated the FLIP [1] to make it more extensible with the
introduction of *SourceReaderFactory.
*It gives users the ability to further customize the data generation and
emission process if needed. I also incorporated the suggestion from
Qingsheng and moved to the generator function design
Hi Becket,
interesting points about the discrepancies in the *RuntimeContext*
"wrapping" throughout the framework, but I agree - this is something that
needs to be tackled separately.
For now, I adjusted the FLIP and the PoC implementation to only expose the
parallelism.
Best,
Alexander Fedulov
Hi Alex,
Personally I prefer the latter option, i.e. just add the
currentParallelism() method. It is easy to add more stuff to the
SourceReaderContext in the future, and it is likely that most of the stuff
in the RuntimeContext is not required by the SourceReader implementations.
For the purpose o
Hi Becket,
I agree with you. We could introduce a *ReadOnlyRuntimeContext* that would
act as a holder for the *RuntimeContext* data. This would also require
read-only wrappers for the exposed fields, such as *ExecutionConfig*.
Alternatively, we just add the *currentParallelism()* method for now an
Hi Alex,
While it is true that the RuntimeContext gives access to all the stuff the
framework can provide, it seems a little overkilling for the SourceReader.
It is probably OK to expose all the read-only information in the
RuntimeContext to the SourceReader, but we may want to hide the "write"
me
Hi Becket,
I updated and extended FLIP-238 accordingly.
Here is also my POC branch [1].
DataGeneratorSourceV3 is the class that I currently converged on [2]. It is
based on the expanded SourceReaderContext.
A couple more relevant classes [3] [4]
Would appreciate it if you could take a quick look
Hi Becket,
Exposing the RuntimeContext is potentially even more useful.
Do you think it is worth having both currentParallelism() and
getRuntimeContext() methods?
One can always call getNumberOfParallelSubtasks() on the RuntimeContext
directly if we expose it.
Best,
Alexander Fedulov
On Mon, J
Hi Alex,
Yes, that is what I had in mind. We need to add the method
getRuntimeContext() to the SourceReaderContext interface as well.
Thanks,
Jiangjie (Becket) Qin
On Mon, Jul 4, 2022 at 3:01 AM Alexander Fedulov
wrote:
> Hi Becket,
>
> thanks for your input. I like the idea of adding the par
Hi Becket,
thanks for your input. I like the idea of adding the parallelism to the
SourceReaderContext. My understanding is that any change of parallelism
causes recreation of all readers, so it should be safe to consider it
"fixed" after the readers' initialization. In that case, it should be as
Hi Alex,
In FLIP-27 source, the SourceReader can get a SourceReaderContext. This is
passed in by the TM in Source#createReader(). And supposedly the Source
should pass this to the SourceReader if needed.
In the SourceReaderContext, currently only the index of the current subtask
is available, but
Hi all,
getting back to the idea of reusing FlinkConnectorRateLimiter: it is
designed for the SourceFunction API and has an open() method that takes a
RuntimeContext. Therefore, we need to add a different interface for
the new Source
API.
This is where I see a certain limitation for the rate-limi
I'm very happy with this. +1
A lot of SourceFunction implementations used in demos/POC implementations
include a call to sleep(), so adding rate limiting is a good idea, in my
opinion.
Best,
David
On Mon, Jun 20, 2022 at 10:10 AM Qingsheng Ren wrote:
> Hi Alexander,
>
> Thanks for creating thi
Hi Alexander,
Thanks for creating this FLIP! I’d like to share some thoughts.
1. About the “generatorFunction” I’m expecting an initializer on it because
it’s hard to require all fields in the generator function are serializable in
user’s implementation. Providing a function like “open” in the
Hi Jing,
thanks for your thorough analysis. I agree with the points you make and
also with the idea to approach the larger task of providing a universal
(DataStream + SQL) data generator base iteratively.
Regarding the name, the SourceFunction-based *DataGeneratorSource* resides
in the *org.apache
Alex, thanks a lot for capturing the checkpoint lockstep emitting source.
It doesn't have to be the same wrapper class. It could be another wrapper
source (like ManualSource) where users supply the exact records emitted by
the source per checkpoint.
On Tue, Jun 14, 2022 at 10:18 AM Jing Ge wrote:
Hi,
After reading all discussions posted in this thread and the source code of
DataGeneratorSource which unfortunately used "Source" instead of
"SourceFunction" in its name, issues could summarized as following:
1. The current DataGeneratorSource based on SourceFunction is a stateful
source conne
Hi Steven,
FYI, I've added your requirement to the list of subtasks for
deprecating the SourceFunction API [1] [2].
[1] https://issues.apache.org/jira/browse/FLINK-28045
[2] https://issues.apache.org/jira/browse/FLINK-28054
Best,
Alexander Fedulov
On Tue, Jun 7, 2022 at 6:03 PM Steven Wu wrot
Hey Alex,
Yes, I think we need to make sure that we're not causing confusion (I know
I already was confused). I think the DataSupplierSource is already better,
but perhaps there are others who have an even better idea.
Thanks,
Martijn
Op do 9 jun. 2022 om 14:28 schreef Alexander Fedulov <
alexa
Hi Martijn,
It seems that they serve a bit different purposes though. The
DataGenTableSource is for generating random data described by the Table DDL
and is tied into the RowDataGenerator/DataGenerator concept which is
implemented as an Iterator. The proposed API in contrast is supposed to
provid
Hi Alex,
Thanks for creating the FLIP and opening up the discussion. +1 overall for
getting this in place.
One question: you've already mentioned that this focussed on the DataStream
API. I think it would be a bit confusing that we have a Datagen connector
(on the Table side) that wouldn't levera
Hi Xianxun,
Thanks for bringing it up. I do believe it would be useful to have such a
CDC data generator but I see the
efforts to provide one a bit orthogonal to the DataSourceGenerator proposed
in the FLIP. FLIP-238 focuses
on the DataStream API and I could see integration into the Table/SQL
ecos
I looked a bit further and it seems it should actually be easier than I
initially thought: SourceReader extends CheckpointListener interface and
with its custom implementation it should be possible to achieve similar
results. A prototype that I have for the generator uses an IteratorSourceReader
u
Hi Steven,
This is going to be tricky since in the new Source API the checkpointing
aspects that you based your logic on are pushed further away from the
low-level interfaces responsible for handling data and splits [1]. At the
same time, the SourceCoordinatorProvider is hardwired into the interna
In Iceberg source, we have a data generator source that can control the
records per checkpoint cycle. Can we support sth like this in the
DataGeneratorSource?
https://github.com/apache/iceberg/blob/master/flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/source/BoundedTestSource.java
public
25 matches
Mail list logo