[ https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883829#comment-16883829 ]
Chesnay Schepler commented on FLINK-13245:
------------------------------------------

With the extensive help of [~azagrebin] we tracked this down to an issue in the {{PartitionRequestQueue}}, where readers are only properly released if they are marked as available (i.e., having credit). If a consumer attempted to close a connection for which the producer had no credit, the actual release was skipped.

This issue only surfaced now because the new {{BoundedBlockingSubpartition}}, contrary to the previous implementation, only releases its data once all readers have been released.

[~zjwang] [~pnowojski] [~NicoK] [~StephanEwen] Our proposal would be to modify {{PartitionRequestQueue#userEventTriggered}} as follows:

{code}
// cancel the request for the input channel
int size = availableReaders.size();
final NetworkSequenceViewReader toRelease = allReaders.remove(toCancel);
if (toRelease == null) {
	// reader was already removed; nothing to release
	return;
}

// remove the reader from the queue of available readers
for (int i = 0; i < size; i++) {
	NetworkSequenceViewReader reader = pollAvailableReader();
	if (reader != toRelease) {
		registerAvailableReader(reader);
	}
}

toRelease.releaseAllResources();
markAsReleased(toRelease.getReceiverId());
{code}

> Network stack is leaking files
> ------------------------------
>
>                 Key: FLINK-13245
>                 URL: https://issues.apache.org/jira/browse/FLINK-13245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: Chesnay Schepler
>            Priority: Blocker
>             Fix For: 1.9.0
>
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows, a large number of {{.channel}} files continue to reside in a {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far, these files are still being used by a {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses ref-counting to ensure we don't release data while a reader is still present.
> However, at the end of the job this count has not reached 0, and thus nothing is being released.
> The same issue is also present at the {{ResultPartition}} level; a {{ReleaseOnConsumptionResultPartition}} is also being released while its ref-count is greater than 0.
> Overall, it appears that there is some issue with the notifications for partitions being consumed.
> It is feasible that this issue has recently caused problems on Travis, where builds were failing due to a lack of disk space.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
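For illustration, the ref-counting release pattern that the issue describes can be sketched as follows. This is a minimal, self-contained sketch, not Flink's actual classes; {{RefCountedSubpartition}}, {{onReaderCreated}}, and {{onReaderReleased}} are hypothetical names. It shows why a single skipped release notification leaks the backing file: the data is deleted only when the count reaches 0.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a ref-counted, file-backed subpartition.
// The on-disk data is deleted only by the *last* release, so if any
// reader's release is skipped, the file stays on disk forever.
class RefCountedSubpartition {
    private final Path dataFile;
    private int refCount; // one per registered reader

    RefCountedSubpartition(Path dataFile) {
        this.dataFile = dataFile;
    }

    synchronized void onReaderCreated() {
        refCount++;
    }

    synchronized void onReaderReleased() {
        if (--refCount == 0) {
            try {
                // only the final release removes the on-disk data
                Files.deleteIfExists(dataFile);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    synchronized boolean isLeaked() {
        // a positive count with the file still present is exactly the
        // end-of-job state described in the issue
        return refCount > 0 && Files.exists(dataFile);
    }
}
```

With two readers, releasing only one of them leaves the file in place; the fix proposed in the comment ensures the second release actually happens so the count can reach 0.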