Yes, we faced the same issue when running a pipeline that joins two
collections. It was deployed on AWS KDA (Kinesis Data Analytics), which
uses the Flink runner; the source was Kinesis streams.

It looks like join operations do not manage state size efficiently when
run on the Flink runner.

We had to rewrite our pipeline to avoid these joins.
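To make the failure mode concrete, here is a minimal plain-Python sketch (not Beam code; `NaiveJoin` is a hypothetical illustration, not anything from the Beam API) of why a per-key join can grow state without bound: each side is buffered per key until its match arrives, so keys that never get matched are never cleared, and the buffered state (hence the checkpoint) only grows.

```python
# Hypothetical model of a keyed two-sided join with discarding semantics.
from collections import defaultdict

class NaiveJoin:
    def __init__(self):
        self.left = defaultdict(list)   # buffered left-side elements per key
        self.right = defaultdict(list)  # buffered right-side elements per key

    def on_left(self, key, value):
        self.left[key].append(value)
        return self._try_emit(key)

    def on_right(self, key, value):
        self.right[key].append(value)
        return self._try_emit(key)

    def _try_emit(self, key):
        # Emit and clear the key only once both sides are present
        # (analogous to discarding mode after the trigger fires).
        if self.left[key] and self.right[key]:
            return (key, self.left.pop(key), self.right.pop(key))
        return None

    def buffered_keys(self):
        # Keys still held in state; these survive into every checkpoint.
        return len(set(self.left) | set(self.right))

j = NaiveJoin()
j.on_left("k1", "a")
print(j.on_right("k1", 1))   # both sides arrived: ('k1', ['a'], [1])
j.on_left("k2", "b")         # never matched: stays buffered indefinitely
print(j.buffered_keys())     # 1 key still pinned in state
```

In a real pipeline the session window's garbage collection is supposed to clear such keys, but if the window never closes (e.g. the watermark stalls or one side never arrives), the buffered keys accumulate exactly like `"k2"` above.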

Thanks
Sachin


On Tue, 29 Aug 2023 at 7:00 PM, Ruben Vargas <ruben.var...@metova.com>
wrote:

> Hello
>
> I'm experiencing an issue with my Beam pipeline.
>
> I have a pipeline in which I split the work into different branches, then
> join them using CoGroupByKey; each message has its own unique key.
>
> For the join, I use a session window and discard the messages after the
> trigger fires.
>
> I'm using the Flink runner, deployed as a Kinesis Data Analytics
> application, but I'm seeing unbounded growth of the checkpoint data size.
> In the Flink console, the following task has the largest checkpoint:
>
> join_results/GBK -> ToGBKResult ->
> join_results/ConstructCoGbkResultFn/ParMultiDo(ConstructCoGbkResult) -> V
>
>
> Any advice?
>
> Thank you very much!
>
>
