[ 
https://issues.apache.org/jira/browse/FLINK-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183087#comment-17183087
 ] 

Kostas Kloudas commented on FLINK-9940:
---------------------------------------

Hi [~Averell]and [~maguowei]

Thanks for the discussion! My contribution to the discussion is the following:
1. with the 
[FLIP-27|https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface]
 effort now being part of the master branch, soon many of the sources 
(including the File Source) will be re-implemented based on the new APIs. This 
may be a point to consider when deciding how much time we want to spend on the 
problem.
2. I think that the solution of the {{READ_CONSISTENCY_OFFSET_INTERVAL}} which 
stores files within this interval in state is actually a good solution. It 
exposes the tradeoff between accuracy and required storage space to the user 
and with good documentation it can actually lead to a good user experience. In 
addition, it makes no assumptions about the user's set-up.

I have not looked at the code yet. But if we all agree that this is the 
solution we want to go with until a new File Source comes, I can have a look.

> File source continuous monitoring mode: S3 files sometimes missed
> -----------------------------------------------------------------
>
>                 Key: FLINK-9940
>                 URL: https://issues.apache.org/jira/browse/FLINK-9940
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream
>    Affects Versions: 1.5.1
>         Environment: Flink 1.5, EMRFS
>            Reporter: Huyen Levan
>            Assignee: Huyen Levan
>            Priority: Major
>              Labels: EMRFS, Flink, S3, pull-request-available
>
> When using StreamExecutionEnvironment.readFile() with 
> FileProcessingMode.PROCESS_CONTINUOUSLY mode to monitor an S3 prefix, if 
> there is a high amount of new/modified files at the same time, the directory 
> monitoring process might miss some files. The number of missing files depends 
> on the monitoring interval.
> Cause: Flink tracks which files it has read by remembering the modification 
> time of the file that was added (or modified) last. So when there are 
> multiple files having a same last-modified timestamp.
> Suggested solution (thanks to [[Fabian 
> Hueske|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=25]|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=25]):
>  a hybrid approach that keeps the names of all files that have a mod 
> timestamp that is larger than the max mod time minus an offset. 
> _org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction_



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to