Accumulation mode doesn't affect late data dropping. I see your pipeline
specifies an allowed lateness of zero, which means all data beyond the
watermark is dropped: is this intentional?
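
For example, the same global-window setup with a non-zero allowed lateness
(a minimal sketch based on the quoted code below; the two-minute bound is an
illustrative value, and the variable names are taken from that code):

    // assumes the usual imports: org.apache.beam.sdk.transforms.windowing.*,
    // org.joda.time.Duration
    PCollection<KV<String, String>> customerSubset =
        customerInput.apply("CustomerWindowParDo",
            Window.<KV<String, String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(delayDuration))))
                .accumulatingFiredPanes()
                // keep elements that arrive behind the watermark for up to two
                // minutes instead of dropping them immediately
                .withAllowedLateness(Duration.standardMinutes(2)));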

On Fri, Dec 8, 2017, 1:23 AM Nishu <nishuta...@gmail.com> wrote:

> Below is the code I use for windows and triggers:
>
> Trigger trigger = Repeatedly.forever(
>     AfterProcessingTime.pastFirstElementInPane()
>         .plusDelayOf(Duration.standardSeconds(delayDuration)));
>
> // Define windows for Customer Topic
> PCollection<KV<String, String>> customerSubset =
>     customerInput.apply("CustomerWindowParDo",
>         Window.<KV<String, String>>into(new GlobalWindows())
>             .triggering(trigger)
>             .accumulatingFiredPanes()
>             .withAllowedLateness(Duration.standardMinutes(0)));
>
>
>
> Thanks & regards,
> Nishu
>
> On Fri, Dec 8, 2017 at 10:19 AM, Nishu <nishuta...@gmail.com> wrote:
>
>> Hi Eugene,
>>
>> In my use case, I use the GlobalWindow (
>> https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions
>> ) and specify the trigger. In the global window, the entire data set is
>> accumulated every time the trigger fires, so that we can avoid the late data issue.
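>>
>> (For illustration: with accumulatingFiredPanes(), each firing re-emits the
>> full pane contents seen so far for a key, e.g. a first firing of [a, b]
>> followed by a second firing of [a, b, c]; with discardingFiredPanes(), the
>> second firing would emit only [c].)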
>>
>> I found a JIRA issue (https://issues.apache.org/jira/browse/BEAM-3225)
>> that describes the same problem in Beam.
>>
>> Today I am going to try to write a similar implementation in Flink.
>>
>> Thanks,
>> Nishu
>>
>>
>> On Fri, Dec 8, 2017 at 12:08 AM, Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>> Most likely this is late data:
>>> https://beam.apache.org/documentation/programming-guide/#watermarks-and-late-data
>>> Try configuring a trigger with a late-data behavior that is more
>>> appropriate for your particular use case.
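>>>
>>> For example, with event-time windows the late-data handling can be made
>>> explicit. A minimal sketch (the one-minute window and ten-minute allowed
>>> lateness are illustrative values, and "input" is a placeholder collection):
>>>
>>>     // on-time pane fires at the watermark; late elements trigger extra
>>>     // panes for up to ten minutes instead of being dropped
>>>     input.apply(Window.<KV<String, String>>into(
>>>             FixedWindows.of(Duration.standardMinutes(1)))
>>>         .triggering(AfterWatermark.pastEndOfWindow()
>>>             .withLateFirings(AfterPane.elementCountAtLeast(1)))
>>>         .withAllowedLateness(Duration.standardMinutes(10))
>>>         .accumulatingFiredPanes());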
>>>
>>> On Thu, Dec 7, 2017 at 3:03 PM Nishu <nishuta...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am running a streaming pipeline with the Flink runner.
>>>> The *operator sequence* is: read the JSON data, parse the JSON string
>>>> into an object, and group the objects by a common key. I noticed that the
>>>> GroupByKey operator throws away some data in between, and hence I don't
>>>> get all the keys as output.
>>>>
>>>> In the screenshot below, 1001 records are read from the Kafka topic, each
>>>> record with a unique ID. After grouping, it returns only 857 unique IDs.
>>>> Ideally, GroupByKey should return all 1001 records.
>>>>
>>>>
>>>> [image: Inline image 3]
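>>>>
>>>> For reference, a minimal sketch of the operator sequence described above
>>>> (the Kafka settings, ParseJsonFn, and the window/trigger configuration are
>>>> placeholders, not the actual pipeline code; "trigger" is the
>>>> Repeatedly/AfterProcessingTime trigger shown earlier in the thread):
>>>>
>>>>     PCollection<KV<String, Iterable<String>>> grouped = pipeline
>>>>         .apply("ReadFromKafka", KafkaIO.<String, String>read()
>>>>             .withBootstrapServers("kafka:9092")            // placeholder
>>>>             .withTopic("customer-topic")                   // placeholder
>>>>             .withKeyDeserializer(StringDeserializer.class)
>>>>             .withValueDeserializer(StringDeserializer.class)
>>>>             .withoutMetadata())
>>>>         .apply(Values.<String>create())
>>>>         .apply("ParseJson", ParDo.of(new ParseJsonFn()))   // emits KV<uniqueId, record>
>>>>         .apply(Window.<KV<String, String>>into(new GlobalWindows())
>>>>             .triggering(trigger)
>>>>             .accumulatingFiredPanes()
>>>>             .withAllowedLateness(Duration.ZERO))
>>>>         .apply(GroupByKey.<String, String>create());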
>>>>
>>>> Any idea what could be the issue? Thanks in advance!
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Nishu Tayal
>>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Nishu Tayal
>>
>
>
>
> --
> Thanks & Regards,
> Nishu Tayal
>
