Hi Beam community,

I have a quick question about the GroupByKey operator. According to this 
doc<https://beam.apache.org/documentation/programming-guide/#groupbykey>, if 
we are using an unbounded PCollection, it is required to specify either non-global 
windowing<https://beam.apache.org/documentation/programming-guide/#setting-your-pcollections-windowing-function>
 or an aggregation 
trigger<https://beam.apache.org/documentation/programming-guide/#triggers> in 
order to perform a GroupByKey operation.

In comparison, the 
KeyBy<https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/>
 operator from Flink does not have such a hard requirement for streamed data.

In our use case, we need to query all historical streamed data, grouped by 
key. Flink's KeyBy satisfies this need, but Beam's GroupByKey does not. I thought 
about applying a sliding window with a very large size (say, one year), so that we 
could query the past year's data, but I'm not sure whether this is feasible or 
good practice.

So what would the Beam solution be for implementing this business logic? Does 
Beam support processing a relatively long history of an unbounded PCollection?

Thanks so much!
