Thanks @Kenneth Knowles<mailto:k...@apache.org>. I understand we need to specify a window for groupby so that the app knowns when processing is “done” to output result.
Is it possible to specify a event arrival/processing time based window for groupby? The purpose is to avoid dropping of late events. With a event processing time based window, the app will periodically output the result based on all events that arrived in that window, and a late arriving event will fall into whatever window covers its arrival time and thus that late data will not get lost. Does Beam support this kind of mechanism? Thanks. From: Kenneth Knowles <k...@apache.org> Reply-To: "user@beam.apache.org" <user@beam.apache.org> Date: Thursday, April 22, 2021 at 1:49 PM To: user <user@beam.apache.org> Cc: Kelly Smith <kell...@zillowgroup.com>, Lian Jiang <li...@zillowgroup.com> Subject: Re: Question on late data handling in Beam streaming mode Hello! In a streaming app, you have two choices: wait forever and never have any output OR use some method to decide that aggregation is "done". In Beam, the way you decide that aggregation is "done" is the watermark. When the watermark predicts no more data for an aggregation, then the aggregation is done. For example GROUP BY <minute> is "done" when no more data will arrive for that minute. At this point, your result is produced. More data may arrive, and it is ignored. The watermark is determined by the IO connector to be the best heuristic available. You can configure "allowed lateness" for an aggregation to allow out of order data. Kenn On Thu, Apr 22, 2021 at 1:26 PM Tao Li <t...@zillow.com<mailto:t...@zillow.com>> wrote: Hi Beam community, I am wondering if there is a risk of losing late data from a Beam stream app due to watermarking? I just went through this design doc and noticed the “droppable” definition there: https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#<https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y%2Fedit%23&data=04%7C01%7Ctaol%40zillow.com%7C5f68c051a16843dc6e5f08d905d016dc%7C033464830d1840e7a5883784ac50e16f%7C0%7C0%7C637547213557227210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=2Gjz8DNW5JDbFUie010%2FhrEiKajPR7sMMb67lC8vHrU%3D&reserved=0> Can you please confirm if it’s possible for us to lose some data in a stream app in practice? If that’s possible, what would be the best practice to avoid data loss? Thanks!