[ 
https://issues.apache.org/jira/browse/FLINK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854919#comment-16854919
 ] 

Stephan Ewen commented on FLINK-12070:
--------------------------------------

@Yingjie Cao Thank you for sharing these benchmarks. That is very valuable.
It is a bit unfortunate that we see these regressions.

Can you share the following details:
  - Was memory swapping to a page file activated in the setup? 
  - Can you find out why the machine froze up? Was it because of lazy swapping 
from the mmapped file?

For future benchmarks, I would also suggest comparing two revisions from the 
flink-1.9 branch: one immediately before the merge and one immediately after. 
That way we rule out that some of the performance differences stem from other 
unrelated changes between 1.8 and 1.9 (such as possibly slower scheduling).

I would suggest proceeding as follows: let's add two more implementations 
(small changes to the existing implementation):

  1. Write directly to a file and then mmap the finished file, rather than 
writing into an mmapped region. That way the data is persisted eagerly, i.e., 
no write-back I/O is needed when memory pages are evicted.
  2. Write directly to the file and read directly from the file.

Based on how these perform, we can decide which implementation to use.
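For illustration, the first variant above could be sketched roughly like this (a hypothetical, simplified sketch using plain Java NIO, not Flink's actual subpartition code): write the produced bytes through a regular FileChannel so the data is persisted eagerly, then memory-map the completed file read-only for consumption.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class EagerFileThenMmap {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("partition-data", ".bin");
        file.toFile().deleteOnExit();
        byte[] payload = "record-1|record-2|record-3".getBytes();

        // Variant 1: write through a regular FileChannel first, so the data
        // is persisted eagerly and no write-back I/O is triggered later when
        // the OS evicts memory pages.
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(payload));
            ch.force(true); // flush file contents to disk
        }

        // Then mmap the finished file read-only for the consumer side; the
        // same file could be mapped again for repeated consumption.
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] readBack = new byte[mapped.remaining()];
            mapped.get(readBack);
            System.out.println(new String(readBack));
        }
    }
}
```

The second variant would simply replace the mmap read with `FileChannel.read` into a heap or direct buffer, avoiding page-cache mapping entirely.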

> Make blocking result partitions consumable multiple times
> ---------------------------------------------------------
>
>                 Key: FLINK-12070
>                 URL: https://issues.apache.org/jira/browse/FLINK-12070
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Till Rohrmann
>            Assignee: Stephan Ewen
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>         Attachments: image-2019-04-18-17-38-24-949.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In order to avoid writing produced results multiple times for multiple 
> consumers and in order to speed up batch recoveries, we should make 
> blocking result partitions consumable multiple times. At the moment a 
> blocking result partition is released once its consumers have processed 
> all data. Instead, the result partition should be released once the next 
> blocking result has been produced and all consumers of the blocking result 
> partition have terminated. Moreover, blocking results should not hold on to 
> slot resources like network buffers or memory, as is currently the case with 
> {{SpillableSubpartitions}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
