[GitHub] flink pull request: [FLINK-2549] add topK operator to DataSet.

StephanEwen Wed, 23 Sep 2015 10:32:41 -0700

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1161#issuecomment-142672766
  
    This looks super impressive and very well tested.
    
    The way that the operator is integrated into the system needs some 
improvement, though. The problem is mainly how the managed memory is obtained.
    
    The MemoryManager's memory is shared among all concurrently running tasks. 
This implementation takes up to half the total memory, which will cause 
programs to crash that have other memory consumers in the same pipeline. The 
tests here run, because the operator is executed in isolation, with no other 
memory consuming operators in the test program.
    
    Memory consumers need to be known to the Optimizer (in the program 
generation) to compute what maximal fraction of memory a certain consumer may 
request. That value is part of the Task's configuration and used by the memory 
consumer to obtain the right maximum amount.
    
    Integrating operators into the optimizer's planning is a bit tedious and 
not as easy as it could be (we did not get around to refactoring this so far, 
unfortunately). Maybe we can add some tooling that would mark a UDF as 
MemoryConsuming and would in that case expose a Memory Allocator that returns 
the right amount of memory.
    
    What we could do is the following: I will try to get to refactoring some of 
the Managed Memory Allocation abstractions (we need this anyways for more 
components) and then expose a MemoryAllocator in the runtime context, which is 
accessible if a user-defined function has been annotated as a memory consumer.
    
    This may take me two weeks (I am currently in the mids of working on the 
streaming windows), but if you don't mind letting this rest for some days, I 
think that is the cleanest approach.
    
    The other parts of the code look good, so after I finish my part, it should 
be a simple rebase of the TopKMapPartition function and the TopKReducer, and 
then this is good to merge.
    
    What do you think?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2549] add topK operator to DataSet.

Reply via email to