[ https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889022#comment-16889022 ]
zhijiang commented on FLINK-13245:
----------------------------------

After confirming the comments from [~Zentol] in the PR, I found that in the `SlotCountExceedingParallelismTest` case no `ReleaseOnConsumptionResultPartition` is created, because the partition is of the blocking type. The reference counter is therefore never used in `ResultPartition`, and the files for the bounded blocking partition are eventually released via `TaskExecutorGateway#releasePartitions`, driven by the `RegionPartitionReleaseStrategy`. The description of this JIRA ticket might therefore not be accurate.

Running this test locally on macOS, I see no file leaks after it finishes. I am not sure why files leak on Windows; my guess is that it is related to the internal mmap mechanisms of the different operating systems. I will verify this test on Windows again.

My PR modifications seem to apply only to the pipelined-partition case, which uses `ReleaseOnConsumptionResultPartition`; there, the calls to `notifySubpartitionConsumed` eventually bring the reference counter to 0 and trigger the release. For pipelined partitions, however, there is no issue with persistent files.

> Network stack is leaking files
> ------------------------------
>
>                 Key: FLINK-13245
>                 URL: https://issues.apache.org/jira/browse/FLINK-13245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: zhijiang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows, a large
> number of {{.channel}} files continue to reside in a
> {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far, these files are still being used by a
> {{BoundedBlockingSubpartition}}.
> The cleanup logic in this class uses ref-counting to ensure we don't
> release data while a reader is still present. However, at the end of the
> job this count has not reached 0, and thus nothing is being released.
> The same issue is also present at the {{ResultPartition}} level; the
> {{ReleaseOnConsumptionResultPartition}} instances are also being released
> while the ref-count is greater than 0.
> Overall it appears that there is some issue with the notifications for
> partitions being consumed.
> It is feasible that this issue has recently caused problems on Travis,
> where builds were failing due to a lack of disk space.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
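For readers unfamiliar with the release mechanism discussed in this ticket, the following is a minimal sketch of the ref-counting pattern: one reference per subpartition consumer, with `notifySubpartitionConsumed` decrementing the counter and the backing files being released when it reaches 0. This is NOT Flink's actual `ReleaseOnConsumptionResultPartition` implementation; the class name and the release body here are hypothetical simplifications.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified, hypothetical illustration of consumption-based release;
// not Flink's actual implementation.
class RefCountedPartition {
    // One outstanding reference per subpartition consumer.
    private final AtomicInteger pendingConsumers;
    private volatile boolean released = false;

    RefCountedPartition(int numSubpartitions) {
        this.pendingConsumers = new AtomicInteger(numSubpartitions);
    }

    // Called once per subpartition when its consumer has finished reading.
    // The bug described in this ticket corresponds to some of these
    // notifications never arriving, so the counter never reaches 0.
    void notifySubpartitionConsumed() {
        if (pendingConsumers.decrementAndGet() == 0) {
            release();
        }
    }

    // Releases backing resources (e.g. deletes spill files) exactly once.
    private synchronized void release() {
        if (!released) {
            released = true;
            System.out.println("partition released, files deleted");
        }
    }

    boolean isReleased() {
        return released;
    }

    public static void main(String[] args) {
        RefCountedPartition p = new RefCountedPartition(2);
        p.notifySubpartitionConsumed();
        System.out.println("after 1st consumer: released=" + p.isReleased());
        p.notifySubpartitionConsumed();
        System.out.println("after 2nd consumer: released=" + p.isReleased());
    }
}
```

If any consumer fails to call `notifySubpartitionConsumed`, the counter stays above 0 and `release()` never runs, which is exactly the file-leak symptom reported here for the blocking spill files on Windows.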