[ 
https://issues.apache.org/jira/browse/FLINK-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723160#comment-14723160
 ] 

ASF GitHub Bot commented on FLINK-1725:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1069#issuecomment-136290092
  
    @anisnasir, good to know that most real world data sets can be handled by 
just splitting keys into two components. But what about the rest? Wouldn't it 
be nice to have a partitioner which works for all? How hard would it be to 
generalize your approach? We could set the default number of distributing 
channels to 2 to mimic your initial implementation.
    
    Concerning the test, you could for example create a `DataStream` which only 
contains a single key. Then you group on this key and then apply some other 
operation where you use the `PartialPartitioner`. In this latter operation you 
can assign the sub index of the task which processes the elements. Having this 
index, you should be able to calculate the distribution of the data. If you 
execute this test on 2 TMs with a single slot or a single TM with 2 slots, then 
you should get a 50/50 distribution if I'm not mistaken.


> New Partitioner for better load balancing for skewed data
> ---------------------------------------------------------
>
>                 Key: FLINK-1725
>                 URL: https://issues.apache.org/jira/browse/FLINK-1725
>             Project: Flink
>          Issue Type: Improvement
>          Components: New Components
>    Affects Versions: 0.8.1
>            Reporter: Anis Nasir
>            Assignee: Anis Nasir
>              Labels: LoadBalancing, Partitioner
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Hi,
> We have recently studied the problem of load balancing in Storm [1].
> In particular, we focused on key distribution of the stream for skewed data.
> We developed a new stream partitioning scheme (which we call Partial Key 
> Grouping). It achieves better load balancing than key grouping while being 
> more scalable than shuffle grouping in terms of memory.
> In the paper we show a number of mining algorithms that are easy to implement 
> with partial key grouping, and whose performance can benefit from it. We 
> think that it might also be useful for a larger class of algorithms.
> Partial key grouping is very easy to implement: it requires just a few lines 
> of code in Java when implemented as a custom grouping in Storm [2].
> For all these reasons, we believe it will be a nice addition to the standard 
> Partitioners available in Flink. If the community thinks it's a good idea, we 
> will be happy to offer support in the porting.
> References:
> [1]. 
> https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> [2]. https://github.com/gdfm/partial-key-grouping



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to