It sounds like you would like to have something like event-time-based windowing, but with independent watermarking for every key. An approach that can work, but it is somewhat cumbersome, is to not use watermarks or windows, but instead put all of the logic in a KeyedProcessFunction (or RichFlatMap). In this way you are free to implement your own policy for deciding when a given "window" for a specific key (device) is ready for processing, based solely on observing the events for that specific key.
Semantically I think this is similar to running a separate instance of the job for each source, but with multi-tenancy, and with an impoverished API (no watermarks, no event time timers, no event time windows). Note that it is already the case that each parallel instance of an operator has its own, independent notion of the current watermark. I believe your problems arise from the fact that this current watermark is applied to all events processed by that instance, regardless of their keys. I believe you would like each key to maintain its own current watermark (event time clock), so if one key (device) is idle, its watermark will wait for further events to arrive. As it is now, events for other keys processed by the same operator instance (or subtask) will advance the shared watermark, causing an idle device's events to become late. Regards, David On Fri, Jul 31, 2020 at 1:42 PM Sush Bankapura < sushrutha.bankap...@man-es.com> wrote: > Hi, > > We have a single Flink job that works on data from multiple data sources. > These data sources are not aligned in time and also have intermittent > connectivity lasting for days, due to which data will arrive late > > We attempted to use the event time and watermarks with parallel streams > using keyby for the data source > > In case of parallel streams, for certain operators, the event time clock > across all the subtasks of the operator is the minimum value of the > watermark among all its input streams. > > Reference: > https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/event_time.html#watermarks > in-parallel-streams > > While this seems to be a fundamental concept of Flink, are there any plans > of having event time clock per operator per subtask for such operators? > > This is causing us, not to use watermarks and to fallback on processing > time semantics or in the worst case running the same Flink job for each and > every different data source from which we are collecting data through Kafka > > Thanks, > Sush >