[ https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886116#comment-16886116 ]
Andrey Zagrebin commented on FLINK-13245:
-----------------------------------------

Thanks for the clarification [~zjwang]! I have some more questions.

First, some comments on the second point, `notifySubpartitionConsumed`. At the moment, the partition lifecycle semantics distinguish two cases, release on consumption or not:
* If release-on-consumption is delegated to the shuffle service, the job master never tries to reuse any partitions. The shuffle service should then just monitor the first consumption attempt for each subpartition and, independently of how it ends (consumed or failed), release all subpartitions once one attempt is done for each of them. The job master will always restart everything, and partitions not released from previous attempts, even successfully produced ones, will just linger around.
* If release is done outside, then again independently of how any consumption attempt ends, the job master will decide when to release the successfully produced partitions. A partition should be auto-released only if its production fails, which happens in Task.

Basically, it looks like a (sub)partition needs only some kind of "view is released/done for any reason" notification to clean up readers and count consumption attempts for auto-release.

Second, some questions about the described difference between releaseAllResources/notifySubpartitionConsumed in PartitionRequestQueue, if it is still needed. I think the confusion comes from the name of _CancelPartitionRequest_, which actually seems to be, at the same time, a confirmation of consumption from the consumer to the producer (_NettyPartitionRequestClient#close_). Then we should notify it as an end of consumption to the whole partition, and it is always expected to happen, right? This sounds more like an _Acknowledge-_ or _ConfirmPartitionRequest_. At the same time, it also serves as a cancellation of producer/consumer communication in case of an internal channel failure on the consumer side.
At least that is how I read _PartitionRequestClientHandler#decodeMsg_ calling _cancelRequestFor_. This case sounds more like a cancellation. But does it actually mean that we should notify the end of consumption to the whole partition on the producer side? Is it not similar to the channel-inactive/exception case? The consumption might not have been successful, but the end-of-consumption notification will lead to a full release of the subpartition. Could the job master reuse this subpartition again for the recovered consumer if it tried?

Also, looking more into `CloseRequest`: it will release and confirm consumption for all channels, but does it actually mean that consumption is done for all of them? Could it be that some of them failed internally in the meantime?

> Network stack is leaking files
> ------------------------------
>
> Key: FLINK-13245
> URL: https://issues.apache.org/jira/browse/FLINK-13245
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.9.0
> Reporter: Chesnay Schepler
> Assignee: zhijiang
> Priority: Blocker
> Fix For: 1.9.0
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows, a large number of {{.channel}} files continue to reside in a {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far, these files are still being used by a {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses ref-counting to ensure we don't release data while a reader is still present. However, at the end of the job this count has not reached 0, and thus nothing is being released.
> The same issue is also present on the {{ResultPartition}} level; the {{ReleaseOnConsumptionResultPartition}} is also being released while the ref-count is greater than 0.
> Overall it appears there is some issue with the notifications for partitions being consumed.
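The ref-counting release pattern described in the issue, and the "view is released/done for any reason" notification proposed in the comment, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not actual Flink code; the class and method names (_RefCountedPartition_, _onReaderReleased_, etc.) are hypothetical. The point is that if any reader's end-of-consumption notification never arrives at the producer (for example, after an internal channel failure on the consumer side), the count never reaches zero and the backing {{.channel}} file is leaked:

```java
/**
 * Hypothetical, simplified sketch of a ref-counted subpartition release.
 * Every reader registers on creation; a single "released for any reason"
 * callback (consumed, cancelled, or failed) decrements the count, and the
 * backing file is deleted only when the count reaches zero.
 */
public class RefCountedPartition {

    private int pendingReaders = 0;
    private boolean released = false;

    /** A consumer attempt created a view over this subpartition. */
    synchronized void onReaderCreated() {
        pendingReaders++;
    }

    /**
     * The view ended, for ANY reason (consumed, cancelled, or failed).
     * If this notification is lost for even one reader, pendingReaders
     * stays above zero and the file is never cleaned up -- the leak
     * reported in FLINK-13245.
     */
    synchronized void onReaderReleased() {
        pendingReaders--;
        if (pendingReaders == 0) {
            // In real code: delete the backing .channel file here.
            released = true;
        }
    }

    synchronized boolean isReleased() {
        return released;
    }

    public static void main(String[] args) {
        RefCountedPartition partition = new RefCountedPartition();
        partition.onReaderCreated();
        partition.onReaderCreated();

        // First reader finishes normally.
        partition.onReaderReleased();
        System.out.println("released=" + partition.isReleased());

        // If the second reader's channel fails and no notification reaches
        // the producer, this call never happens and the file leaks. With
        // the unified notification, it always happens:
        partition.onReaderReleased();
        System.out.println("released=" + partition.isReleased());
    }
}
```

The design choice this sketch argues for is collapsing the consumed/cancelled/failed distinction into one mandatory release callback, so that cleanup does not depend on how a consumption attempt ended.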
> It is feasible that this issue has recently caused problems on Travis, where builds were failing due to a lack of disk space.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)