Hi Gianmarco,

Thanks for the pointer!

I had a quick look at the paper, but unfortunately I don’t see a connection to 
my problem. I have a batch job whose dataset elements need processing time 
that grows quadratically with their size. The largest ones, which cause 
higher-than-average load, should be split up, and the splits should be 
distributed among the workers. Your paper says “In principle, depending on the 
application, two different messages might impose a different load on workers. 
However, in most cases these differences even out and modeling such 
application-specific differences is not necessary.” Maybe I am missing 
something, but doesn’t this assumption render PKG inapplicable to my case? 
Objections are of course welcome :)
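
To make the load skew concrete, here is a minimal sketch of how the number of 
splits per element could be planned. It assumes a cost of size squared and 
assumes that an element's work can be cut into evenly sized partial tasks; the 
names estimated_cost and plan_splits are hypothetical and not part of Flink.

```python
import math


def estimated_cost(size):
    """Assumed cost model: processing time grows quadratically with size."""
    return size * size


def plan_splits(sizes, num_workers):
    """Return, per element, how many partial tasks to cut it into.

    An element is split only if its estimated cost exceeds the ideal
    per-worker share of the total estimated cost.
    """
    total = sum(estimated_cost(s) for s in sizes)
    target = total / num_workers  # ideal load per worker
    if target == 0:
        return [1 for _ in sizes]
    # Elements cheaper than the target stay whole (one task each).
    return [max(1, math.ceil(estimated_cost(s) / target)) for s in sizes]
```

With four small elements and one ten times larger, only the large element is 
split, which is the case where per-element differences do not even out.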

Cheers,
Sebastian

From: Gianmarco De Francisci Morales [mailto:g...@apache.org]
Sent: Wednesday, 10 June 2015 15:40
To: user@flink.apache.org
Subject: Re: Load balancing

We have been working on an adaptive load balancing strategy that would address 
exactly the issue you point out.
FLINK-1725 is the starting point for the integration.

Cheers,

--
Gianmarco

On 9 June 2015 at 20:31, Fabian Hueske <fhue...@gmail.com> wrote:
Hi Sebastian,
I agree, shuffling only specific elements would be a very useful feature, but 
unfortunately it's not supported (yet).
Would you like to open a JIRA for that?
Cheers, Fabian

2015-06-09 17:22 GMT+02:00 Kruse, Sebastian <sebastian.kr...@hpi.de>:
Hi folks,

I would like to do some load balancing within one of my Flink jobs to achieve 
good scalability. The rebalance() method is not applicable in my case, as the 
runtime is dominated by the processing of very few larger elements in my 
dataset. Hence, I need to distribute the processing work for these elements 
among the nodes in the cluster. To do so, I subdivide those elements into 
partial tasks and want to distribute these partial tasks to other nodes by 
employing a custom partitioner.
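
The routing idea described above can be sketched roughly as follows. This is a 
conceptual illustration only, not Flink API: make_router and route are 
hypothetical names, and it assumes each record knows the index of the 
partition it currently lives on and whether it is a partial task.

```python
from itertools import count


def make_router(num_partitions):
    """Build a routing function in the spirit of a custom partitioner."""
    next_target = count()  # round-robin counter, used for partial tasks only

    def route(source_partition, is_partial_task):
        if not is_partial_task:
            # Ordinary element: keep it on the partition it came from.
            return source_partition
        # Partial task of a large element: spread across all partitions.
        return next(next_target) % num_partitions

    return route
```

For example, with route = make_router(3), an ordinary element from partition 2 
is routed back to 2, while successive partial tasks go to 0, 1, 2, 0, and so 
on. Whether routing a record to its own partition index actually avoids moving 
it depends on the runtime, which this sketch does not model.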

Now, my question is the following: I do not actually need to shuffle the 
complete dataset, only a few elements. So is there a way to tell, from within 
the partitioner, that data should stay on the same task manager? Thanks!

Cheers,
Sebastian

