[jira] [Comment Edited] (FLINK-13245) Network stack is leaking files

zhijiang (JIRA) Mon, 15 Jul 2019 09:23:26 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885371#comment-16885371
 ]


zhijiang edited comment on FLINK-13245 at 7/15/19 4:22 PM:
-----------------------------------------------------------

Thanks for finding and investigating this potential issue, [~Zentol] and 
[~azagrebin]!

I think the idea behind the above modifications makes sense: because 
`availableReader` is not always equivalent to `allReaders`, it is proper to 
look up the canceled view reader in `allReaders` instead.

This issue also existed in the previous {{SpillableSubpartition}}, but that 
class actually used the memory type in {{SlotCountExceedingParallelismTest}}, 
so we could not find this potential bug then.

In detail, we should also call `toRelease.notifySubpartitionConsumed` before 
calling `toRelease.releaseAllResources` in the above modifications. Otherwise 
the reference counter in {{ReleaseOnConsumptionResultPartition}} would not 
decrease to zero, and the partition would never actually be released via 
{{ResultPartitionManager}}.
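
To illustrate the ordering, here is a minimal sketch. It uses simplified, hypothetical stand-in classes ({{SketchPartition}}, {{SketchReader}}, {{SketchRequestQueue}}) and integer receiver ids, so it is not the actual Flink code; only the names `allReaders`, `toRelease`, `notifySubpartitionConsumed` and `releaseAllResources` are taken from the discussion above.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class CancelReleaseSketch {
    public static void main(String[] args) {
        SketchPartition partition = new SketchPartition(1);
        SketchRequestQueue queue = new SketchRequestQueue();
        // The reader is registered but never becomes "available" (no data, no credit).
        queue.register(7, new SketchReader(partition));
        queue.cancel(7);
        System.out.println("partition released: " + partition.isReleased());
    }
}

/** Hypothetical stand-in for a partition that is released once all subpartitions are consumed. */
class SketchPartition {
    private int pendingReferences;
    private boolean released;

    SketchPartition(int numSubpartitions) {
        this.pendingReferences = numSubpartitions;
    }

    /** Decrements the reference counter; the partition is released when it reaches zero. */
    void onConsumedSubpartition() {
        if (--pendingReferences == 0) {
            released = true; // assumption: real code would free buffers / delete files here
        }
    }

    boolean isReleased() {
        return released;
    }
}

/** Hypothetical stand-in for a subpartition view reader. */
class SketchReader {
    private final SketchPartition partition;
    private boolean resourcesReleased;

    SketchReader(SketchPartition partition) {
        this.partition = partition;
    }

    void notifySubpartitionConsumed() {
        partition.onConsumedSubpartition();
    }

    void releaseAllResources() {
        resourcesReleased = true;
    }

    boolean isResourcesReleased() {
        return resourcesReleased;
    }
}

/** Hypothetical stand-in for the queue that tracks readers per receiver. */
class SketchRequestQueue {
    final Map<Integer, SketchReader> allReaders = new HashMap<>();
    final Queue<SketchReader> availableReaders = new ArrayDeque<>();

    void register(int receiverId, SketchReader reader) {
        allReaders.put(receiverId, reader);
    }

    /** Looks up the canceled reader in allReaders, not only among the available ones. */
    void cancel(int receiverId) {
        SketchReader toRelease = allReaders.remove(receiverId);
        if (toRelease != null) {
            availableReaders.remove(toRelease);
            // Notify first so the partition's reference counter can reach zero ...
            toRelease.notifySubpartitionConsumed();
            // ... then release the reader's own resources.
            toRelease.releaseAllResources();
        }
    }
}
```

In this sketch, a lookup restricted to `availableReaders` would miss the reader entirely, and skipping the notification would leave the counter at one, so the partition would stay unreleased.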

I will submit the PR and add some unit tests tomorrow.


> Network stack is leaking files
> ------------------------------
>
>                 Key: FLINK-13245
>                 URL: https://issues.apache.org/jira/browse/FLINK-13245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: zhijiang
>            Priority: Blocker
>             Fix For: 1.9.0
>
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows a large 
> number of {{.channel}} files continue to reside in a 
> {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far these files are still being used by a 
> {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses 
> ref-counting to ensure we don't release data while a reader is still present. 
> However, at the end of the job this count has not reached 0, and thus nothing 
> is being released.
> The same issue is also present on the {{ResultPartition}} level; the 
> {{ReleaseOnConsumptionResultPartition}} instances are also being released 
> while the ref-count is greater than 0.
> Overall it appears that there's some issue with the notifications for 
> partitions being consumed.
> It is plausible that this issue has recently caused failures on Travis, where 
> builds were failing due to a lack of disk space.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
