Huyen Levan created FLINK-10518:
-----------------------------------

             Summary: Inefficient design in ContinuousFileMonitoringFunction
                 Key: FLINK-10518
                 URL: https://issues.apache.org/jira/browse/FLINK-10518
             Project: Flink
          Issue Type: Improvement
          Components: filesystem-connector
   Affects Versions: 1.5.2
            Reporter: Huyen Levan

The ContinuousFileMonitoringFunction class keeps track of the latest file modification time so that it can rule out all files it has already processed in previous cycles. For a long-running job, the list of eligible files will be much smaller than the list of all files in the monitored folder.

In the current implementation of the getInputSplitsSortedByModTime method, a list of all available splits is created first, and then every single split is checked against the list of eligible files.
{quote}
for (FileInputSplit split: format.createInputSplits(readerParallelism)) {
    FileStatus fileStatus = eligibleFiles.get(split.getPath());
    if (fileStatus != null) {
{quote}
The improvement can be done as follows:
 * The listing of all files should be done only once, in _ContinuousFileMonitoringFunction.listEligibleFiles()_ (as of now it is done a second time in _FileInputFormat.createInputSplits()_).
 * The list of file splits should then be created from the list of paths in eligibleFiles.
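
The proposed flow could be sketched roughly as below. This is only an illustration of the idea, not Flink code: SplitSketch, FileInfo, Split and splitsFromEligible are hypothetical stand-ins, each file is assumed to map to exactly one split, and the eligibility rule (modification time strictly greater than the last processed one) is an assumption made for the example. The directory is listed once, filtered, and splits are built only from the eligible files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Flink classes.
final class SplitSketch {

    // A file in the monitored folder, with its modification time.
    static final class FileInfo {
        final String path;
        final long modTime;

        FileInfo(String path, long modTime) {
            this.path = path;
            this.modTime = modTime;
        }
    }

    // Simplifying assumption: one split per file.
    static final class Split {
        final String path;
        final long modTime;

        Split(String path, long modTime) {
            this.path = path;
            this.modTime = modTime;
        }
    }

    // The single pass over the directory listing: keep only files that are
    // newer than the last processed modification time (assumed rule).
    static List<FileInfo> listEligibleFiles(List<FileInfo> directory, long maxProcessedModTime) {
        List<FileInfo> eligible = new ArrayList<>();
        for (FileInfo file : directory) {
            if (file.modTime > maxProcessedModTime) {
                eligible.add(file);
            }
        }
        return eligible;
    }

    // Build splits from the eligible files only, sorted by modification time,
    // instead of creating splits for every file and filtering afterwards.
    static List<Split> splitsFromEligible(List<FileInfo> eligible) {
        List<Split> splits = new ArrayList<>();
        for (FileInfo file : eligible) {
            splits.add(new Split(file.path, file.modTime));
        }
        splits.sort(Comparator.comparingLong((Split s) -> s.modTime));
        return splits;
    }

    public static void main(String[] args) {
        List<FileInfo> directory = new ArrayList<>();
        directory.add(new FileInfo("/data/a", 100L));
        directory.add(new FileInfo("/data/b", 300L));
        directory.add(new FileInfo("/data/c", 200L));

        // Pretend everything up to modification time 150 was already processed.
        List<Split> splits = splitsFromEligible(listEligibleFiles(directory, 150L));
        for (Split split : splits) {
            System.out.println(split.path + " @ " + split.modTime);
        }
    }
}
```

With this shape, the cost of split creation depends on the number of new files rather than on the total number of files in the folder, which is the saving the issue is asking for.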