[
https://issues.apache.org/jira/browse/FLINK-31192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692508#comment-17692508
]
xzw0223 commented on FLINK-31192:
---------------------------------
[~mapohl] I created a new issue to discuss the problem of slow datagen
initialization. We can discuss how this problem can be better solved.
> dataGen takes too long to initialize under sequence
> ---------------------------------------------------
>
> Key: FLINK-31192
> URL: https://issues.apache.org/jira/browse/FLINK-31192
> Project: Flink
> Issue Type: Improvement
> Affects Versions: 1.16.0, 1.16.1
> Reporter: xzw0223
> Priority: Major
> Fix For: 1.16.0, 1.16.1
>
>
> The SequenceGenerator preloads all sequence values in open. If the
> total number of elements is large, initialization takes too long.
> [https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/source/datagen/SequenceGenerator.java#L91]
> The reason is that the capacity of the Deque is doubled whenever it
> becomes full, and each expansion requires an array copy, which is
> time-consuming.
>
> Here's what I think:
> 1. Do not preload the full sequence. Instead, generate one value each
> time next is called, which avoids the slow initialization caused by
> loading all the data up front.
> 2. Record the current position in the sequence in the checkpoint, and
> after an abnormal restart resume sending from the recorded position to
> ensure fault tolerance.
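
The proposal above can be sketched in plain Java as follows. This is only a
minimal illustration of the idea, not Flink's SequenceGenerator: the class
LazySequence and its methods snapshotPosition and restore are hypothetical
names, the sketch ignores how a sequence is split across parallel subtasks,
and it does not use Flink's checkpointing API. It assumes the upper bound is
below Long.MAX_VALUE so that the position counter cannot overflow.

```java
import java.util.NoSuchElementException;

/** Hypothetical lazy sequence; an illustration only, not Flink's SequenceGenerator. */
final class LazySequence {
    private final long end; // inclusive upper bound of the sequence
    private long next;      // next value to emit; the only state a checkpoint needs

    LazySequence(long start, long end) {
        // Assumption of this sketch: end < Long.MAX_VALUE, so next++ never overflows.
        if (end == Long.MAX_VALUE) {
            throw new IllegalArgumentException("end must be below Long.MAX_VALUE");
        }
        this.next = start;
        this.end = end;
    }

    /** True while there are values left to emit. */
    boolean hasNext() {
        return next <= end;
    }

    /** Produces one value per call instead of preloading the whole range. */
    long next() {
        if (!hasNext()) {
            throw new NoSuchElementException("sequence exhausted");
        }
        return next++;
    }

    /** The position to record in a checkpoint: the next value that has not been emitted yet. */
    long snapshotPosition() {
        return next;
    }

    /** Rebuilds the sequence from a recorded position after a restart. */
    static LazySequence restore(long position, long end) {
        return new LazySequence(position, end);
    }
}
```

With this shape, construction takes constant time regardless of the range
size, and the recorded state is a single long rather than a queue of all
remaining values.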
--
This message was sent by Atlassian Jira
(v8.20.10#820010)