Ohh I see, that makes sense. I'm wondering if there is a strategy for my use case, where I have an ID that is unique per pair of messages.
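For context, this is roughly what my join looks like today (a simplified sketch; the step names, payload types, and session gap are illustrative, and the Kinesis sources are elided):

import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;

public class PairJoin {
  static final TupleTag<String> BRANCH_A = new TupleTag<String>() {};
  static final TupleTag<String> BRANCH_B = new TupleTag<String>() {};

  // Both inputs are keyed by the ID that is unique per pair of messages,
  // so each key should see exactly one element from each branch.
  static PCollection<KV<String, CoGbkResult>> joinPair(
      PCollection<KV<String, String>> branchA,
      PCollection<KV<String, String>> branchB) {

    // Session window that fires once both elements of a pair have arrived,
    // then discards the fired pane (gap duration is illustrative).
    Window<KV<String, String>> pairWindow =
        Window.<KV<String, String>>into(
                Sessions.withGapDuration(Duration.standardMinutes(5)))
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(2)))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes();

    return KeyedPCollectionTuple.of(BRANCH_A, branchA.apply("WindowA", pairWindow))
        .and(BRANCH_B, branchB.apply("WindowB", pairWindow))
        .apply("join_results", CoGroupByKey.create());
  }
}

Since each key only ever sees its pair of elements, I expected the per-key state to be droppable as soon as the pane fires, but the checkpoint size keeps growing anyway.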
Thanks for all your help!

On Fri, Sep 1, 2023 at 6:51 AM Sachin Mittal <sjmit...@gmail.com> wrote:

> Yes, a very high and non-deterministic cardinality can make the stored
> state of a join operation unbounded.
> In my case we knew the cardinality, and it was not very high, so we could
> go with a lookup-based approach using Redis to enrich the stream and
> avoid joins.
>
> On Wed, Aug 30, 2023 at 5:04 AM Ruben Vargas <ruben.var...@metova.com> wrote:
>
>> Thanks for the reply and the advice.
>>
>> One more thing: do you know if the key-space cardinality impacts this?
>> I'm assuming it does, but the thing is, in my case all the messages from
>> the sources have a unique ID, which makes my key-space huge, and that is
>> not under my control.
>>
>> On Tue, Aug 29, 2023 at 9:29 AM Sachin Mittal <sjmit...@gmail.com> wrote:
>>
>>> So for the smaller collection, which does not grow in size for certain
>>> keys, we stored the data in Redis, and instead of a Beam join, in our
>>> DoFn we just did the lookup and got the data we needed.
>>>
>>> On Tue, 29 Aug 2023 at 8:50 PM, Ruben Vargas <ruben.var...@metova.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thanks for the reply. Is there any strategy you followed to avoid
>>>> joins when you rewrote your pipeline?
>>>>
>>>> On Tue, Aug 29, 2023 at 9:15 AM Sachin Mittal <sjmit...@gmail.com> wrote:
>>>>
>>>>> Yes, we faced the same issue too when trying to run a pipeline
>>>>> involving a join of two collections. It was deployed using AWS KDA,
>>>>> which uses the Flink runner. The source was Kinesis streams.
>>>>>
>>>>> It looks like join operations are not very efficient in terms of
>>>>> state size management when run on Flink.
>>>>>
>>>>> We had to rewrite our pipeline to avoid these joins.
>>>>>
>>>>> Thanks,
>>>>> Sachin
>>>>>
>>>>> On Tue, 29 Aug 2023 at 7:00 PM, Ruben Vargas <ruben.var...@metova.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm experiencing an issue with my Beam pipeline.
>>>>>>
>>>>>> I have a pipeline in which I split the work into different branches,
>>>>>> then do a join using CoGroupByKey; each message has its own unique
>>>>>> key.
>>>>>>
>>>>>> For the join, I used a session window, discarding the messages after
>>>>>> the trigger fires.
>>>>>>
>>>>>> I'm using the Flink runner, deployed as a Kinesis Data Analytics
>>>>>> application, but I'm experiencing unbounded growth of the checkpoint
>>>>>> data size. When I look at the Flink console, the following task has
>>>>>> the largest checkpoint:
>>>>>>
>>>>>> join_results/GBK -> ToGBKResult ->
>>>>>> join_results/ConstructCoGbkResultFn/ParMultiDo(ConstructCoGbkResult) -> V
>>>>>>
>>>>>> Any advice?
>>>>>>
>>>>>> Thank you very much!
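P.S. For anyone who finds this thread later, this is roughly how I understood the lookup-based enrichment Sachin describes. A minimal sketch, assuming a Java pipeline and Redis accessed through the Jedis client; the host, port, and "dim:" key layout are made up for illustration:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class RedisEnrichFn extends DoFn<KV<String, String>, KV<String, String>> {
  // Transient: JedisPool is not serializable, so it is rebuilt per instance.
  private transient JedisPool pool;

  @Setup
  public void setup() {
    // Host and port are assumptions for this sketch.
    pool = new JedisPool("redis-host", 6379);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    KV<String, String> element = c.element();
    try (Jedis jedis = pool.getResource()) {
      // Fetch the small, bounded "side" data by key instead of joining.
      String sideData = jedis.get("dim:" + element.getKey());
      if (sideData != null) {
        c.output(KV.of(element.getKey(), element.getValue() + "|" + sideData));
      }
    }
  }

  @Teardown
  public void teardown() {
    if (pool != null) {
      pool.close();
    }
  }
}

The point being that the small side of the join lives in Redis and is fetched per element inside the DoFn, so nothing accumulates in the pipeline's keyed state.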