You can definitely group by processing time. The way to do this in Beam is
as follows

Window.into<T>(new GlobalWindows())
    .triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))
    .discardingFiredPanes());

The syntax is a bit unfortunately wordy, but the idea is that you are
creating a single event-time window that encompasses all time, and
"triggering" an aggregation every 30 seconds based on processing time.

On Fri, Apr 23, 2021 at 8:14 AM Tao Li <t...@zillow.com> wrote:

> Thanks @Kenneth Knowles <k...@apache.org>. I understand we need to
> specify a window for groupby so that the app knowns when processing is
> “done” to output result.
>
>
>
> Is it possible to specify a event arrival/processing time based window for
> groupby? The purpose is to avoid dropping of late events. With a event
> processing time based window, the app will periodically output the result
> based on all events that arrived in that window, and a late arriving event
> will fall into whatever window covers its arrival time and thus that late
> data will not get lost.
>
>
>
> Does Beam support this kind of mechanism? Thanks.
>
>
>
> *From: *Kenneth Knowles <k...@apache.org>
> *Reply-To: *"user@beam.apache.org" <user@beam.apache.org>
> *Date: *Thursday, April 22, 2021 at 1:49 PM
> *To: *user <user@beam.apache.org>
> *Cc: *Kelly Smith <kell...@zillowgroup.com>, Lian Jiang <
> li...@zillowgroup.com>
> *Subject: *Re: Question on late data handling in Beam streaming mode
>
>
>
> Hello!
>
>
>
> In a streaming app, you have two choices: wait forever and never have any
> output OR use some method to decide that aggregation is "done".
>
>
>
> In Beam, the way you decide that aggregation is "done" is the watermark.
> When the watermark predicts no more data for an aggregation, then the
> aggregation is done. For example GROUP BY <minute> is "done" when no more
> data will arrive for that minute. At this point, your result is produced.
> More data may arrive, and it is ignored. The watermark is determined by the
> IO connector to be the best heuristic available. You can configure "allowed
> lateness" for an aggregation to allow out of order data.
>
>
>
> Kenn
>
>
>
> On Thu, Apr 22, 2021 at 1:26 PM Tao Li <t...@zillow.com> wrote:
>
> Hi Beam community,
>
>
>
> I am wondering if there is a risk of losing late data from a Beam stream
> app due to watermarking?
>
>
>
> I just went through this design doc and noticed the “droppable” definition
> there:
> https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y%2Fedit%23&data=04%7C01%7Ctaol%40zillow.com%7C5f68c051a16843dc6e5f08d905d016dc%7C033464830d1840e7a5883784ac50e16f%7C0%7C0%7C637547213557227210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=2Gjz8DNW5JDbFUie010%2FhrEiKajPR7sMMb67lC8vHrU%3D&reserved=0>
>
>
>
> Can you please confirm if it’s possible for us to lose some data in a
> stream app in practice? If that’s possible, what would be the best practice
> to avoid data loss? Thanks!
>
>
>
>

Reply via email to