I have a running average problem. As I understand it, the traditional Beam
solution is state in a global window, but I'm not quite sure how to
approach it for my use case, which is a bit more complex.

I have a "score" output every 5 minutes based on a timer, up to a maximum
of 1 hour after some time, depending on the arrival times of a few input
events per day.

The output of this initial part of the pipeline is 1) versioned, so when
running the pipeline in batch mode, or dealing with up to 3-day late
inputs, the score in the output system is continuously updated (and outputs
from out-of-order inputs are ignored) and 2) aggregated into a daily score
along with inputs coming from other pipeline branches, which is also
continuously updated part-way through the day with early and late triggers.

Now, I need to calculate the running average of the individual scores
output every 5 minutes multiple times per day, and factor those into the
overall aggregated daily score. The running average should consider only
the highest version score for each day on which there are scores.

I don't see how I can do this with global windows without keeping a full
history of the latest score and version on every previous day, which will
grow without bound. Or am I missing something?

Thanks,
Raman

Reply via email to