Hi,

I'm writing a pipeline in Python/Beam 2.19 and would like the opinion of the
folks on this list on the best way to implement the following logic,
specifically which window + trigger combination to use.

HourlySource := a PCollection that receives one event per hour from another
pipeline (hourly aggregated data). This PCollection doesn't produce data on
weekends, because there is no upstream data available on weekends.

DesiredOutput := a PCollection whose elements are a list of the last 100
events, updated hourly.
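
To make the target concrete, here is a minimal plain-Python sketch (not Beam
code) of the behaviour I am after. The function name rolling_last_n and the
integer stand-in events are made up purely for illustration.

```python
from collections import deque


def rolling_last_n(events, n=100):
    """Yield a list of the most recent n events after each new event arrives."""
    buffer = deque(maxlen=n)  # the oldest event is dropped once n is exceeded
    for event in events:
        buffer.append(event)
        yield list(buffer)


# Stand-in data: 250 hourly events represented by consecutive integers.
snapshots = list(rolling_last_n(range(250), n=100))
```

Each new event produces one updated list, and the list never holds more than
100 events regardless of how much wall-clock time those events span.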

What is the correct windowing and triggering combination to go from 
HourlySource to DesiredOutput? Let's assume I'm not worried about late data.

My initial thinking was SlidingWindows(size=100*60*60, period=60*60) with a
Repeatedly(AfterProcessingTime(60*60)) trigger in accumulating mode. If I
understand correctly, this should create 100-hour windows that emit results
hourly. Is my understanding correct? Do I need an orFinally? (I assume not,
since I want the trigger to fire forever and always give me the last 100
hours.) The biggest gap is that this seems like it would give me the past
100 hours, not the past 100 events.
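
To show the size of that gap, here is a small plain-Python sketch (again not
Beam code) that counts how many hourly weekday events land in a trailing
100-hour span. The helper names and the March 2020 dates are assumptions
chosen only for this example.

```python
from datetime import datetime, timedelta


def weekday_hourly_timestamps(start, end):
    """Yield one timestamp per hour in [start, end), skipping Sat and Sun."""
    current = start
    while current < end:
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            yield current
        current += timedelta(hours=1)


def events_in_trailing_window(timestamps, window_end, hours=100):
    """Return the timestamps that fall in [window_end - hours, window_end)."""
    window_start = window_end - timedelta(hours=hours)
    return [t for t in timestamps if window_start <= t < window_end]


# Two weeks of hourly weekday events, starting on Monday 2020-03-02.
timestamps = list(
    weekday_hourly_timestamps(datetime(2020, 3, 2), datetime(2020, 3, 16)))

# A 100-hour span that stays within one working week (Mon 08:00 to Fri 12:00).
midweek = events_in_trailing_window(timestamps, datetime(2020, 3, 6, 12))

# A 100-hour span that crosses a weekend (Fri 20:00 to Wed 00:00).
across_weekend = events_in_trailing_window(timestamps, datetime(2020, 3, 11))

print(len(midweek), len(across_weekend))
```

The span that stays inside the working week holds 100 events, while the one
that crosses the weekend holds far fewer, so a fixed 100-hour window would
only match "last 100 events" some of the time.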

I also thought of using the same window with the default AfterWatermark
trigger plus early firings of Repeatedly(AfterProcessingTime(60*60)).
However, this doesn't solve the 100-hours vs. 100-events issue either, and
it may well behave the same as the option above?

The other idea was to use the GlobalWindow() and a composite trigger of
Repeatedly(AfterCount(1)).orFinally(AfterCount(100)) in accumulating mode.
To be honest, though, I don't think I have a strong enough handle on how the
different trigger types, window types and accumulation modes interact to
figure out the best path forward.

So rather than a bunch of trial and error (or, perhaps more correctly, in
addition to that), I was hoping the folks here could give me some insight
into how to approach this. Thanks!

-Pradip
