[jira] [Commented] (FLINK-13245) Network stack is leaking files

Stephan Ewen (JIRA) Wed, 24 Jul 2019 10:25:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892023#comment-16892023
 ]


Stephan Ewen commented on FLINK-13245:
--------------------------------------

An observation on the current state of the implementation of 
{{ReleaseOnConsumptionResultPartition}}:

We have a weird, inconsistent situation where we
  1. have a countdown on the ResultPartition level that counts the 
{{notifySubpartitionConsumed()}} calls from the subpartitions, to release the 
partition when all subpartitions are released.
  2. have results that can have multiple readers/views and can hence receive 
multiple release/consumed calls.

This can probably lead to counting two notifications from one subpartition, and 
then releasing too early, unless there is super careful accounting when to 
notify about consumption and when not to.
This careful accounting seems super fragile to me, given the state of the netty 
stack. I expect that we will have issues where we either notify too often (early 
release) or not often enough (lingering files).

I would suggest the following:
  - Change {{ReleaseOnConsumptionResultPartition}} to have a flag per 
subpartition that tracks whether there was a {{notifySubpartitionConsumed()}} 
call or not, so that multiple calls are idempotent.
  - Always send {{notifySubpartitionConsumed()}} calls with every 
{{releaseAllResources()}} call.
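
A minimal sketch of what I mean by the two points above, not the actual Flink 
code. The class {{IdempotentReleaseSketch}}, its nested {{View}} and the 
{{isReleased()}} method are hypothetical names made up for illustration. The 
real {{ReleaseOnConsumptionResultPartition}} and the subpartition views have 
different signatures and also need to release buffers and files.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a release-on-consumption partition; not Flink's class. */
final class IdempotentReleaseSketch {

    /** One flag per subpartition: set once the first "consumed" notification arrived. */
    private final BitSet consumedSubpartitions;

    private final int numSubpartitions;

    /** Number of subpartitions that have not yet reported consumption. */
    private int numUnconsumedSubpartitions;

    private boolean released;

    IdempotentReleaseSketch(int numSubpartitions) {
        if (numSubpartitions <= 0) {
            throw new IllegalArgumentException("numSubpartitions must be positive");
        }
        this.numSubpartitions = numSubpartitions;
        this.consumedSubpartitions = new BitSet(numSubpartitions);
        this.numUnconsumedSubpartitions = numSubpartitions;
    }

    /** Idempotent: a second notification for the same subpartition is ignored. */
    synchronized void notifySubpartitionConsumed(int subpartitionIndex) {
        if (subpartitionIndex < 0 || subpartitionIndex >= numSubpartitions) {
            throw new IllegalArgumentException("bad index: " + subpartitionIndex);
        }
        if (consumedSubpartitions.get(subpartitionIndex)) {
            return; // duplicate notification, already counted
        }
        consumedSubpartitions.set(subpartitionIndex);
        if (--numUnconsumedSubpartitions == 0) {
            released = true; // the real code would release the partition here
        }
    }

    synchronized boolean isReleased() {
        return released;
    }

    /** Hypothetical reader/view; several views may exist for one subpartition. */
    static final class View {
        private final IdempotentReleaseSketch partition;
        private final int subpartitionIndex;

        View(IdempotentReleaseSketch partition, int subpartitionIndex) {
            this.partition = partition;
            this.subpartitionIndex = subpartitionIndex;
        }

        /** Always notifies; this is safe because the partition side is idempotent. */
        void releaseAllResources() {
            partition.notifySubpartitionConsumed(subpartitionIndex);
        }
    }
}
```

With this shape, two views on subpartition 0 can both call 
{{releaseAllResources()}} without the partition being released before 
subpartition 1 has reported, and no caller has to decide whether a 
notification is still "needed".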

> Network stack is leaking files
> ------------------------------
>
>                 Key: FLINK-13245
>                 URL: https://issues.apache.org/jira/browse/FLINK-13245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: zhijiang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows a large 
> number of {{.channel}} files continue to reside in a 
> {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far these files are still being used by a 
> {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses 
> ref-counting to ensure we don't release data while a reader is still present. 
> However, at the end of the job this count has not reached 0, and thus nothing 
> is being released.
> The same issue is also present on the {{ResultPartition}} level; the 
> {{ReleaseOnConsumptionResultPartition}} also are being released while the 
> ref-count is greater than 0.
> Overall it appears like there's some issue with the notifications for 
> partitions being consumed.
> It is feasible that this issue has recently caused issues on Travis where the 
> builds were failing due to a lack of disk space.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
