[ https://issues.apache.org/jira/browse/FLINK-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284383#comment-17284383 ]
Huyen Levan commented on FLINK-9940:
------------------------------------

Thank you [~maguowei]

> File source continuous monitoring mode: S3 files sometimes missed
> -----------------------------------------------------------------
>
>                 Key: FLINK-9940
>                 URL: https://issues.apache.org/jira/browse/FLINK-9940
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream
>    Affects Versions: 1.5.1
>         Environment: Flink 1.5, EMRFS
>            Reporter: Huyen Levan
>            Assignee: Huyen Levan
>            Priority: Major
>              Labels: EMRFS, Flink, S3, pull-request-available
>
> When using StreamExecutionEnvironment.readFile() with
> FileProcessingMode.PROCESS_CONTINUOUSLY mode to monitor an S3 prefix, if
> a large number of files are added or modified at the same time, the directory
> monitoring process might miss some files. The number of missing files depends
> on the monitoring interval.
> Cause: Flink tracks which files it has read by remembering the modification
> time of the file that was added (or modified) last. So when multiple files
> share the same last-modified timestamp, some of them can be skipped.
> Suggested solution (thanks to [Fabian
> Hueske|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=25]):
> a hybrid approach that keeps the names of all files whose modification
> timestamp is larger than the maximum modification time minus an offset.
> _org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction_
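For readers following the suggested solution, here is a minimal sketch of the hybrid bookkeeping in plain Java. It is not the code from the pull request and not part of Flink's API: the class `RecentFileTracker`, its methods, and the `offsetMillis` parameter are hypothetical names chosen for illustration. The sketch assumes that a file older than (max modification time minus offset) can safely be treated as already handled, and that a file inside that window is identified by its path and modification time.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not a Flink class): remembers the largest modification
 * time seen so far plus the names of all files modified within
 * offsetMillis of that maximum.
 */
public class RecentFileTracker {
    private final long offsetMillis;
    private long maxModTime = Long.MIN_VALUE;
    // path -> modification time, only for files newer than (maxModTime - offsetMillis)
    private final Map<String, Long> recentFiles = new HashMap<>();

    public RecentFileTracker(long offsetMillis) {
        if (offsetMillis < 0) {
            throw new IllegalArgumentException("offsetMillis must be >= 0");
        }
        this.offsetMillis = offsetMillis;
    }

    /** Lower bound of the window; files at or below it are considered already handled. */
    private long threshold() {
        return maxModTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxModTime - offsetMillis;
    }

    /** Returns true if the file should be read now, and records it as seen. */
    public boolean shouldProcess(String path, long modTime) {
        if (modTime <= threshold()) {
            return false; // older than the window: assumed to be processed already
        }
        Long seen = recentFiles.get(path);
        if (seen != null && seen >= modTime) {
            return false; // this version of the file was already processed
        }
        recentFiles.put(path, modTime);
        if (modTime > maxModTime) {
            maxModTime = modTime;
            // Drop names that fell out of the window so the state stays bounded.
            final long limit = threshold();
            recentFiles.values().removeIf(t -> t <= limit);
        }
        return true;
    }

    /** Number of file names currently kept in the window. */
    public int trackedCount() {
        return recentFiles.size();
    }
}
```

With an offset of 1000 ms, two files that share the timestamp 5000 are both accepted, whereas a tracker that stores only the maximum timestamp would drop the second one. The offset trades state size for tolerance: a larger window catches files that show up late in the S3 listing but keeps more names in memory.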