[ https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886116#comment-16886116 ]
Andrey Zagrebin commented on FLINK-13245:
-----------------------------------------

Thanks for the clarification [~zjwang]! I have some more questions.

First, some comments on the second point, `notifySubpartitionConsumed`. At the moment, the partition lifecycle semantics distinguish two cases, release on consumption or not:
* If release-on-consumption is delegated to the shuffle service, the job master never tries to reuse any partitions. The shuffle service should then just monitor the first consumption attempt for each subpartition and, independently of how it ends (consumed or failed), release all subpartitions once one attempt is done for each of them. The job master will always restart everything, and partitions not released from previous attempts, even successfully produced ones, will just linger around.
* If release is done outside, then again independently of how any consumption attempt ends, the job master will decide when to release the successfully produced partitions. A partition should be auto-released only if its production fails, which happens in Task.

Basically, it looks like a (sub)partition needs only some kind of "view is released/done for any reason" notification to clean up readers and count consumption attempts for auto-release.

Second, some questions about the described difference between releaseAllResources/notifySubpartitionConsumed in PartitionRequestQueue, if it is still needed. I think the confusion comes from the name of _CancelPartitionRequest_, which actually seems to be, at the same time, a confirmation of consumption from the consumer to the producer (_NettyPartitionRequestClient#close_). Then we should notify it as an end of consumption to the whole partition, and it is always expected to happen, right? This sounds more like an _Acknowledge-_ or _ConfirmPartitionRequest_. At the same time, it also serves as a cancellation of producer/consumer communication in case of an internal channel failure on the consumer side.
At least that is how I read _PartitionRequestClientHandler#decodeMsg_ calling _cancelRequestFor_. This case sounds more like a cancellation. But does it actually mean that we should notify the end of consumption to the whole partition on the producer side? Is it not similar to the channel-inactive/exception case? The consumption might not have been successful, but the end-of-consumption notification will lead to a full release of the subpartition. Could the job master reuse this subpartition again for the recovered consumer if it tried?

Also, looking more into `CloseRequest`: it will release and confirm consumption for all channels, but does it actually mean that consumption is done for all of them? Could it be that some of them failed internally in the meantime?

> Network stack is leaking files
> ------------------------------
>
> Key: FLINK-13245
> URL: https://issues.apache.org/jira/browse/FLINK-13245
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.9.0
> Reporter: Chesnay Schepler
> Assignee: zhijiang
> Priority: Blocker
> Fix For: 1.9.0
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows, a large number of {{.channel}} files continue to reside in a {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far, these files are still being used by a {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses ref-counting to ensure we don't release data while a reader is still present. However, at the end of the job this count has not reached 0, and thus nothing is being released.
> The same issue is also present on the {{ResultPartition}} level; the {{ReleaseOnConsumptionResultPartition}} is also being released while the ref-count is greater than 0.
> Overall it appears there is some issue with the notifications for partitions being consumed.
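The ref-counting release pattern described in the issue, and the "view is released/done for any reason" notification proposed in the comment, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not actual Flink code; the class and method names (_RefCountedPartition_, _onReaderReleased_, etc.) are hypothetical. The point is that if any reader's end-of-consumption notification never arrives at the producer (for example, after an internal channel failure on the consumer side), the count never reaches zero and the backing {{.channel}} file is leaked:

```java
/**
 * Hypothetical, simplified sketch of a ref-counted subpartition release.
 * Every reader registers on creation; a single "released for any reason"
 * callback (consumed, cancelled, or failed) decrements the count, and the
 * backing file is deleted only when the count reaches zero.
 */
public class RefCountedPartition {

    private int pendingReaders = 0;
    private boolean released = false;

    /** A consumer attempt created a view over this subpartition. */
    synchronized void onReaderCreated() {
        pendingReaders++;
    }

    /**
     * The view ended, for ANY reason (consumed, cancelled, or failed).
     * If this notification is lost for even one reader, pendingReaders
     * stays above zero and the file is never cleaned up -- the leak
     * reported in FLINK-13245.
     */
    synchronized void onReaderReleased() {
        pendingReaders--;
        if (pendingReaders == 0) {
            // In real code: delete the backing .channel file here.
            released = true;
        }
    }

    synchronized boolean isReleased() {
        return released;
    }

    public static void main(String[] args) {
        RefCountedPartition partition = new RefCountedPartition();
        partition.onReaderCreated();
        partition.onReaderCreated();

        // First reader finishes normally.
        partition.onReaderReleased();
        System.out.println("released=" + partition.isReleased());

        // If the second reader's channel fails and no notification reaches
        // the producer, this call never happens and the file leaks. With
        // the unified notification, it always happens:
        partition.onReaderReleased();
        System.out.println("released=" + partition.isReleased());
    }
}
```

The design choice this sketch argues for is collapsing the consumed/cancelled/failed distinction into one mandatory release callback, so that cleanup does not depend on how a consumption attempt ended.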
> It is feasible that this issue has recently caused problems on Travis, where builds were failing due to a lack of disk space.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)