[GitHub] flink pull request: [FLINK-1730]Persist operator on Data Sets

StephanEwen Wed, 09 Sep 2015 03:02:44 -0700

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1083#issuecomment-138861709
  
    I don't think this really works. It violates so many assumptions, like the 
fact that memory is available after a job ends. The accounting for that depends 
on task slots - a free slot must have the necessary memory.
    
    When you call collect twice, the whole program get re-executed and some 
parts are simply discarded for the sake of using the persisted data. That makes 
sure the results are the same, but does not safe any work. For that, you need 
backtracking through the graph to check whether results are already available.
    
    Which brings us to the point:
    
    Making the system's level of data streams aware of this (both on 
TaskManager and JobManager side) solves two issues with one approach: Caching 
for reuse in incrementally constructed programs, and caching for recovery. Both 
cases simply need persistent streams and backtracking in the execution graph.
    
    Sorry that you made this effort and it cannot be merged now, but I do not 
understand why you keep ignoring the discussions, inputs, and the request to 
plan and describe such highly involved features before writing them.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-1730]Persist operator on Data Sets

Reply via email to