Hi,

The problem is that Flink tracks which files it has read by remembering the modification time of the file that was added (or modified) last. We use the modification time to avoid having to remember the names of all files that were ever consumed, which would be expensive to check and store over time.
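For illustration, here is a minimal sketch of a hybrid tracker of the kind suggested below: a modification-time watermark plus a bounded set of file names whose mod time falls within an offset of that watermark. All names here are hypothetical, not Flink's actual implementation.

```python
class HybridFileTracker:
    """Sketch only: watermark on max modification time, plus a small set of
    recently seen file names to catch late files within `offset` seconds."""

    def __init__(self, offset):
        self.offset = offset              # lateness allowance in seconds
        self.max_mod_time = float("-inf") # watermark: largest mod time seen
        self.seen = {}                    # name -> mod_time, within the window

    def should_read(self, name, mod_time):
        # Below (watermark - offset): assume the file was already consumed.
        if mod_time <= self.max_mod_time - self.offset:
            return False
        # Within the window: read only if this name has not been seen yet.
        if name in self.seen:
            return False
        self.seen[name] = mod_time
        self.max_mod_time = max(self.max_mod_time, mod_time)
        self._prune()
        return True

    def _prune(self):
        # Drop names that have fallen out of the offset window, so the
        # set of remembered names stays bounded.
        cutoff = self.max_mod_time - self.offset
        self.seen = {n: t for n, t in self.seen.items() if t > cutoff}
```

With an offset of 60 seconds, a file whose mod time lags the newest file by less than 60 seconds is still picked up exactly once, while names older than the window are forgotten rather than stored forever.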
One could change this logic to a hybrid approach that keeps the names of all files whose modification timestamp is larger than the max mod time minus an offset. It would be great if you could open a Jira issue for this problem.

Thanks,
Fabian

2018-07-24 14:58 GMT+02:00 Averell <lvhu...@gmail.com>:
> Hello Jörn.
>
> Thanks for your help.
> "/Probably the system is putting them to the folder and Flink is triggered
> before they are consistent./" <<< yes, I also guess so. However, if Flink
> is triggered before they are consistent, either (a) there should be some
> error messages, or (b) Flink should be able to identify those files in the
> subsequent triggers. But in my case, those files are missed forever.
>
> Right now those files on S3 are to be consumed by Flink only. The flow is
> as follows:
> Existing system >>> S3 >>> Flink >>> Elastic Search.
> If I cannot find a solution to the mentioned problem, I might need to
> change to:
> Existing system >>> Kinesis >>> Flink >>> Elastic Search
> Or
> Existing system >>> S3 >>> Kinesis >>> Flink >>> Elastic Search
> Or
> Existing system >>> S3 >>> Custom File Source + Flink >>> Elastic Search
> However, all those solutions would take much more effort.
>
> Thanks!
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/