It depends on why you're using a fan-out approach in the first place. You might actually be better off doing all the work in one place instead of splitting into branches and joining them back.
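To illustrate the suggestion above, here is a minimal sketch (plain Python, not Beam API; all names are illustrative) of collapsing a fan-out-plus-join into a single processing function, so no per-key join state ever needs to be stored:

```python
# Hypothetical sketch: instead of fanning out into two branches keyed by a
# unique ID and re-joining with CoGroupByKey, run both computations inline
# and emit one merged record. Placeholder logic throughout.

def enrich_a(event):
    # work of the first branch (placeholder)
    return {"a": event["value"] * 2}

def enrich_b(event):
    # work of the second branch (placeholder)
    return {"b": event["value"] + 1}

def process(event):
    """Single-pass equivalent of fan-out + join: both enrichments are
    computed inline, so nothing accumulates in keyed state."""
    out = {"id": event["id"]}
    out.update(enrich_a(event))
    out.update(enrich_b(event))
    return out

print(process({"id": "k1", "value": 10}))
# {'id': 'k1', 'a': 20, 'b': 11}
```

Whether this is feasible depends on whether both branches can see the full input element; if each half of the pair arrives as a separate message, some form of stateful matching is still needed.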
On Fri, Sep 1, 2023 at 6:43 AM Ruben Vargas <ruben.var...@metova.com> wrote:
> Ohh, I see.
>
> That makes sense. I'm wondering if there is a strategy for my use case,
> where I have a unique ID per pair of messages.
>
> Thanks for all your help!
>
> On Fri, Sep 1, 2023 at 6:51 AM Sachin Mittal <sjmit...@gmail.com> wrote:
>> Yes, a very high and non-deterministic cardinality can make the stored
>> state of a join operation unbounded.
>> In our case we knew the cardinality and it was not very high, so we
>> could go with a lookup-based approach using Redis to enrich the stream
>> and avoid joins.
>>
>> On Wed, Aug 30, 2023 at 5:04 AM Ruben Vargas <ruben.var...@metova.com> wrote:
>>> Thanks for the reply and the advice.
>>>
>>> One more thing: do you know if the key-space cardinality impacts this?
>>> I'm assuming it does, but in my case every message from the sources
>>> has a unique ID, which makes my key space huge, and that is not under
>>> my control.
>>>
>>> On Tue, Aug 29, 2023 at 9:29 AM Sachin Mittal <sjmit...@gmail.com> wrote:
>>>> For the smaller collection, which does not grow in size for certain
>>>> keys, we stored the data in Redis, and instead of a Beam join our
>>>> DoFn just did the lookup and fetched the data we needed.
>>>>
>>>> On Tue, 29 Aug 2023 at 8:50 PM, Ruben Vargas <ruben.var...@metova.com> wrote:
>>>>> Hello,
>>>>>
>>>>> Thanks for the reply. What strategy did you follow to avoid joins
>>>>> when you rewrote your pipeline?
>>>>>
>>>>> On Tue, Aug 29, 2023 at 9:15 AM Sachin Mittal <sjmit...@gmail.com> wrote:
>>>>>> Yes, we faced the same issue when trying to run a pipeline
>>>>>> involving a join of two collections. It was deployed using AWS KDA,
>>>>>> which uses the Flink runner. The source was Kinesis streams.
>>>>>>
>>>>>> It looks like join operations are not very efficient in terms of
>>>>>> state size management when run on Flink.
>>>>>>
>>>>>> We had to rewrite our pipeline to avoid these joins.
>>>>>>
>>>>>> Thanks,
>>>>>> Sachin
>>>>>>
>>>>>> On Tue, 29 Aug 2023 at 7:00 PM, Ruben Vargas <ruben.var...@metova.com> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm experiencing an issue with my Beam pipeline.
>>>>>>>
>>>>>>> I have a pipeline in which I split the work into different
>>>>>>> branches and then join them using CoGroupByKey; each message has
>>>>>>> its own unique key.
>>>>>>>
>>>>>>> For the join I use a session window, discarding the messages after
>>>>>>> the trigger fires.
>>>>>>>
>>>>>>> I'm using the Flink runner, deployed as a Kinesis Data Analytics
>>>>>>> application, but I'm seeing unbounded growth of the checkpoint
>>>>>>> data size. In the Flink console, the following task has the
>>>>>>> largest checkpoint:
>>>>>>>
>>>>>>> join_results/GBK -> ToGBKResult ->
>>>>>>> join_results/ConstructCoGbkResultFn/ParMultiDo(ConstructCoGbkResult) -> V
>>>>>>>
>>>>>>> Any advice?
>>>>>>>
>>>>>>> Thank you very much!
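The lookup-based enrichment Sachin describes in the quoted thread can be sketched roughly as below. This is plain Python, not Beam API: a dict stands in for the Redis client, and the class only mirrors the shape of a DoFn (in a real pipeline the client would be created in `setup()` and queried in `process()`). All names here are illustrative.

```python
# Sketch of replacing a CoGroupByKey join with a side lookup: instead of
# holding both streams in join state, each element is enriched on the fly
# from an external key-value store.

class EnrichDoFn:
    """Hypothetical DoFn-shaped class; not a real beam.DoFn subclass."""

    def __init__(self, kv_store):
        self.kv_store = kv_store  # stand-in for a Redis connection

    def process(self, element):
        key, payload = element
        extra = self.kv_store.get(key)  # e.g. redis_client.get(key)
        if extra is None:
            return  # no match: drop, or route to a dead-letter output
        yield (key, {**payload, **extra})

store = {"u1": {"country": "GT"}}
fn = EnrichDoFn(store)
print(list(fn.process(("u1", {"amount": 5}))))
# [('u1', {'amount': 5, 'country': 'GT'})]
```

Note this pattern fits when one side of the join is a reference dataset that can be materialized into the store; with a huge key space of one-shot unique IDs (as in Ruben's case), the store would still need an eviction policy such as a TTL per key.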