You can definitely group by processing time. The way to do this in Beam is as follows:
    Window.<T>into(new GlobalWindows())
        .triggering(
            AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(30))))
        .discardingFiredPanes();

The syntax is unfortunately a bit wordy, but the idea is that you are creating a single event-time window that encompasses all time, and "triggering" an aggregation every 30 seconds based on processing time. (A fuller sketch of this pattern appears at the end of this thread.)

On Fri, Apr 23, 2021 at 8:14 AM Tao Li <t...@zillow.com> wrote:

> Thanks @Kenneth Knowles <k...@apache.org>. I understand we need to
> specify a window for groupby so that the app knows when processing is
> “done” to output the result.
>
> Is it possible to specify an event arrival/processing-time-based window
> for groupby? The purpose is to avoid dropping late events. With a
> processing-time-based window, the app will periodically output the result
> based on all events that arrived in that window, and a late-arriving event
> will fall into whatever window covers its arrival time, so that late data
> will not get lost.
>
> Does Beam support this kind of mechanism? Thanks.
>
> *From: *Kenneth Knowles <k...@apache.org>
> *Reply-To: *"user@beam.apache.org" <user@beam.apache.org>
> *Date: *Thursday, April 22, 2021 at 1:49 PM
> *To: *user <user@beam.apache.org>
> *Cc: *Kelly Smith <kell...@zillowgroup.com>, Lian Jiang <li...@zillowgroup.com>
> *Subject: *Re: Question on late data handling in Beam streaming mode
>
> Hello!
>
> In a streaming app, you have two choices: wait forever and never have any
> output, OR use some method to decide that aggregation is "done".
>
> In Beam, the way you decide that aggregation is "done" is the watermark.
> When the watermark predicts no more data for an aggregation, then the
> aggregation is done. For example, GROUP BY <minute> is "done" when no more
> data will arrive for that minute. At this point, your result is produced.
> More data may arrive, and it is ignored. The watermark is determined by
> the IO connector to be the best heuristic available. You can configure
> "allowed lateness" for an aggregation to allow out-of-order data.
>
> Kenn
>
> On Thu, Apr 22, 2021 at 1:26 PM Tao Li <t...@zillow.com> wrote:
>
> Hi Beam community,
>
> I am wondering if there is a risk of losing late data from a Beam
> streaming app due to watermarking?
>
> I just went through this design doc and noticed the “droppable”
> definition there:
> https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#
>
> Can you please confirm if it’s possible for us to lose some data in a
> streaming app in practice? If that’s possible, what would be the best
> practice to avoid data loss? Thanks!
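
For reference, a minimal, self-contained sketch of the pattern from the reply at the top of this thread, written as a complete Java pipeline. The class name, the GenerateSequence test source, the "key-N" keys, the one-element-per-second rate, and the Sum aggregation are all illustrative placeholders, not part of the thread; a real pipeline would read from an unbounded connector such as Kafka or Pub/Sub. It also adds .withAllowedLateness(Duration.ZERO), which the Java SDK requires whenever a non-default trigger is set.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;

    public class ProcessingTimeFiringsExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Stand-in for a real unbounded source (Kafka, Pub/Sub, ...): one element per
        // second, spread across three illustrative keys.
        PCollection<KV<String, Long>> events =
            p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)))
                .apply(
                    MapElements.into(
                            TypeDescriptors.kvs(
                                TypeDescriptors.strings(), TypeDescriptors.longs()))
                        .via(n -> KV.of("key-" + (n % 3), n)));

        PCollection<KV<String, Long>> periodicSums =
            events
                // Single event-time window covering all time; output is driven purely by
                // the processing-time early firings below.
                .apply(
                    Window.<KV<String, Long>>into(new GlobalWindows())
                        .triggering(
                            AfterWatermark.pastEndOfWindow()
                                .withEarlyFirings(
                                    AfterProcessingTime.pastFirstElementInPane()
                                        .plusDelayOf(Duration.standardSeconds(30))))
                        // Required by the Java SDK whenever a non-default trigger is set.
                        .withAllowedLateness(Duration.ZERO)
                        // Each ~30-second pane contains only the elements that arrived
                        // since the previous firing.
                        .discardingFiredPanes())
                // The aggregation that each firing releases downstream.
                .apply(Sum.longsPerKey());

        // Runs until cancelled, since the source is unbounded.
        p.run().waitUntilFinish();
      }
    }

Because the panes are discarding, each firing effectively behaves like a "group by processing time" bucket: whatever arrived in roughly the last 30 seconds of wall-clock time, regardless of event timestamps, so nothing is dropped as late. Switching to accumulatingFiredPanes() would instead emit a running total over all data seen so far.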