Hi,

Which file system are you reading from? If you are reading from S3, this
might be caused by S3's eventual consistency. Have a look at FLINK-9940 [1]
for a more detailed discussion. There is also an open PR [2] that you could
try to patch the source operator with.
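For reference, here is a minimal sketch of how the continuous file source is
usually wired up (the bucket path and scan interval below are placeholders,
not from your setup). Since `env.readFile()` instantiates
`ContinuousFileMonitoringFunction` internally, a patched build of that class
would take effect here automatically once it is on the classpath:

```java
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Placeholder directory; replace with the folder you are monitoring.
TextInputFormat format = new TextInputFormat(new Path("s3://my-bucket/incoming/"));

DataStream<String> lines = env.readFile(
        format,
        "s3://my-bucket/incoming/",              // directory to monitor
        FileProcessingMode.PROCESS_CONTINUOUSLY, // keep scanning for new files
        60_000L);                                // re-scan interval in milliseconds
```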
Best,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9940
[2] https://github.com/apache/flink/pull/6613

On Fri, Oct 12, 2018 at 8:41 PM Juan Miguel Cejuela <jua...@tagtog.net> wrote:

> Dear flinksters,
>
> I'm using the class `ContinuousFileMonitoringFunction` as a source to
> monitor a folder for new incoming files. *I have the problem that not all
> the files that are sent to the folder get processed / triggered by the
> function.* Specifically, I send up to 1k files per minute to the folder and
> consume the stream as an `AsyncDataStream`.
>
> I raised an unrelated issue with the `ContinuousFileMonitoringFunction`
> class some time ago (https://issues.apache.org/jira/browse/FLINK-8046):
> if two or more files shared the very same timestamp, only the first one
> (non-deterministically chosen) would be processed. I patched the class
> myself to fix that problem by using a LinkedHashMap to remember which
> files had already been processed. My patch is working fine as far as I
> can tell.
>
> The problem seems rather to be that some files (when many are sent at once
> to the same folder) do not even get triggered/activated/registered by the
> class.
>
> Am I properly explaining my problem?
>
> Any hints to solve this challenge would be greatly appreciated! ❤ THANK
> YOU
>
> --
> Juanmi, CEO and co-founder @ 🍃 tagtog.net
>
> Follow tagtog updates on 🐦 Twitter: @tagtog_net
> <https://twitter.com/tagtog_net>
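For readers of the archive: the de-duplication Juan describes could look
roughly like the sketch below, i.e. a bounded LinkedHashMap of already-seen
paths kept inside the monitoring function. The class and method names here
are illustrative only and are not the actual FLINK-8046 patch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper: remembers which file paths were already forwarded,
// with access-order eviction so the cache stays bounded.
public class SeenFilesCache {

    private static final int MAX_ENTRIES = 10_000;

    private final Map<String, Long> seen =
            new LinkedHashMap<String, Long>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    /** Returns true if the file was not seen before and should be processed. */
    public boolean markIfNew(String path, long modificationTime) {
        return seen.putIfAbsent(path, modificationTime) == null;
    }
}
```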