The data source is generated by an application that monitors some sort of
sessions, with the EVENT_TIME column being the session end time.

It is possible that the files contain out-of-order data because of the async
nature of the application writing the files. While EVENT_TIME is monotonically
increasing in general, some lateness is possible. However, I used
allowedLateness on my stream and still got the inconsistencies.
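
For context, here is a simplified sketch of what the job does - the CSV layout,
class names, window size, out-of-orderness bound, paths and the aggregation are
placeholders, not the real values:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Read the pre-existing test files and parse each CSV line into
        // (sessionKey, EVENT_TIME in epoch millis). Parsing is simplified.
        DataStream<Tuple2<String, Long>> sessions = env
            .readTextFile("file:///path/to/test/files")
            .map(line -> {
                String[] fields = line.split(",");
                return Tuple2.of(fields[0], Long.parseLong(fields[1]));
            })
            .returns(Types.TUPLE(Types.STRING, Types.LONG));

        sessions
            // Watermark trails the largest EVENT_TIME seen so far by 1 minute
            // to tolerate some out-of-order records.
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.minutes(1)) {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> record) {
                        return record.f1;
                    }
                })
            .keyBy(0)
            .window(TumblingEventTimeWindows.of(Time.minutes(10)))
            // Keep window state for 5 more minutes so late records still update the result.
            .allowedLateness(Time.minutes(5))
            // Placeholder aggregation: max session end time per key and window.
            .max(1)
            .print();

        env.execute("tumbling event time window test");
    }
}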

Although the real-life use case is generically reading files from a folder, the
testing env has a set of files prepared in advance - these should be read and
produce the result.

You mentioned the “right” order of the files. Is it sorted by update time?
When running in parallel, is it possible that 2 files will be read in parallel,
and in case the later one is smaller, that its later timestamps will be handled
first?


BTW, I tried to use a ContinuousEventTimeTrigger to make sure the window is
calculated, and got the processing to trigger multiple times, so I'm not sure
exactly how this type of trigger works.
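
For reference, this is roughly how I attached it; the 30-second interval is
arbitrary. From what I understand, this trigger fires the window every interval
of event time while it is open and once more at the window end, each time
emitting the accumulated result without purging - which would explain the
multiple outputs I saw:

// Same pipeline as in the sketch above, with only the trigger added after .window(...).
// (needs: import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger;)
    .window(TumblingEventTimeWindows.of(Time.minutes(10)))
    // Fires every 30 seconds of event time while the window is open, and once more
    // when the watermark passes the end of the window. Each firing emits the current
    // accumulated result (FIRE, not FIRE_AND_PURGE), so multiple outputs per window
    // are expected with this trigger.
    .trigger(ContinuousEventTimeTrigger.of(Time.seconds(30)))
    .allowedLateness(Time.minutes(5))
    .max(1)
    .print();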

Thanks




From: Fabian Hueske <fhue...@gmail.com>
Sent: Monday, August 26, 2019 11:06 AM
To: Hanan Yehudai <hanan.yehu...@radcom.com>
Cc: user@flink.apache.org
Subject: Re: tumbling event time window , parallel

Hi,

Can you share a few more details about the data source?
Are you continuously ingesting files from a folder?

You are correct that the parallelism should not affect the results, but there
are a few things that can cause such differences:
1) non-deterministic keys
2) out-of-order data with inappropriate watermarks (see the sketch below)
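
For 2), a minimal sketch of a timestamp/watermark assigner with bounded
out-of-orderness - the SessionEvent type, its field and the 1-minute bound are
placeholders that have to match your data:

// import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
// Watermarks trail the largest timestamp seen so far by 1 minute.
// Note that each parallel instance emits its own watermarks and a downstream
// window operator only advances with the minimum of its input watermarks, so a
// parallel source instance that reads no (or only old) data holds back all windows.
DataStream<SessionEvent> withTimestamps = events.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<SessionEvent>(Time.minutes(1)) {
        @Override
        public long extractTimestamp(SessionEvent event) {
            return event.eventTime; // EVENT_TIME in epoch millis
        }
    });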

Note that watermark configuration for file ingests can be difficult and that 
you need to ensure that files are read in the "right" order.
AFAIK, Flink's continuous file source uses the modification timestamp of files 
to determine the read order.
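
For reference, a sketch of how the continuous file source is typically used,
assuming env is the StreamExecutionEnvironment; the path and scan interval are
placeholders:

// Imports: org.apache.flink.api.java.io.TextInputFormat,
//          org.apache.flink.core.fs.Path,
//          org.apache.flink.streaming.api.functions.source.FileProcessingMode
// Directory monitoring runs with parallelism 1; the discovered splits are handed
// to the parallel readers, AFAIK ordered by the files' modification timestamps.
// With parallelism > 1, several files/splits can still be read concurrently.
TextInputFormat format = new TextInputFormat(new Path("file:///path/to/input"));
DataStream<String> lines = env.readFile(
    format,
    "file:///path/to/input",
    FileProcessingMode.PROCESS_CONTINUOUSLY, // or PROCESS_ONCE for a fixed set of files
    10_000L);                                // scan the directory every 10 seconds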

Best, Fabian

Am So., 25. Aug. 2019 um 19:32 Uhr schrieb Hanan Yehudai 
<hanan.yehu...@radcom.com<mailto:hanan.yehu...@radcom.com>>:
I have an issue with tumbling windows running in parallel.

I run a job on a set of CSV files.

When the parallelism is set to 1, I get the proper results.
When it runs in parallel, I get no output.
Is it due to the fact that the parallel streams take the MAX(watermark) from all
the parallel sources, and only one of the streams advances the watermark?

It seems wrong that the result is not deterministic and depends on the
parallelism level.
What am I doing wrong?
