Hi Pieter! Interesting, but good :-)
I don't think we did much on the hash functions since 0.9.1. I am a bit
surprised that it made such a difference. Well, as long as it improves with
the newer version :-)

Greetings,
Stephan

On Wed, Jan 27, 2016 at 9:42 PM, Pieter Hameete <phame...@gmail.com> wrote:

> Hi Till,
>
> I've upgraded to Flink 0.10.1 and ran the job again without any changes to
> the code, to see the bytes input and output of the operators for the
> different workers. To my surprise it is very well balanced between all
> workers, and because of this the job completed much faster.
>
> Are there any changes/fixes between Flink 0.9.1 and 0.10.1 that could
> cause this to be better for me now?
>
> Thanks,
>
> Pieter
>
> 2016-01-27 14:10 GMT+01:00 Pieter Hameete <phame...@gmail.com>:
>
>> Cheers for the quick reply Till.
>>
>> That would be very useful information to have! I'll upgrade my project to
>> Flink 0.10.1 tonight and let you know if I can find out if there's a skew
>> in the data :-)
>>
>> - Pieter
>>
>> 2016-01-27 13:49 GMT+01:00 Till Rohrmann <trohrm...@apache.org>:
>>
>>> Could it be that your data is skewed? This could lead to different loads
>>> on different task managers.
>>>
>>> With the latest Flink version, the web interface should show you how
>>> many bytes each operator has written and received. There you could see
>>> if one operator receives more elements than the others.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Jan 27, 2016 at 1:35 PM, Pieter Hameete <phame...@gmail.com>
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> Currently I am running a job in the GCloud in a configuration with 4
>>>> task managers that each have 4 CPUs (for a total parallelism of 16).
>>>>
>>>> However, I noticed my job is running much slower than expected, and
>>>> after some more investigation I found that one of the workers is doing
>>>> the majority of the work (its CPU load was at 100% while the others
>>>> were almost idle).
>>>>
>>>> My job execution plan can be found here: http://i.imgur.com/fHKhVFf.png
>>>>
>>>> The input is split into multiple files, so loading the data is properly
>>>> distributed over the workers.
>>>>
>>>> I am wondering if you can provide me with some tips on how to figure
>>>> out what is going wrong here:
>>>>
>>>>    - Could this imbalance in workload be the result of an imbalance in
>>>>    the hash partitioning?
>>>>    - Is there a convenient way to see how many elements each worker
>>>>    gets to process? Would it work to write the output of the CoGroup
>>>>    to disk, since each worker writes to its own output file, and
>>>>    investigate the differences?
>>>>    - Is there something strange about the execution plan that could
>>>>    cause this?
>>>>
>>>> Thanks and kind regards,
>>>>
>>>> Pieter
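
[Editor's note: on the question of whether hash partitioning alone can cause such an imbalance — one way to check this offline is to simulate the key-to-worker assignment. The sketch below is a minimal, hypothetical illustration in Python (it uses Python's built-in `hash`, not Flink's actual hash function) showing how a single hot key routes most records to one worker while evenly distributed keys balance out.]

```python
from collections import Counter

def partition(key, parallelism=16):
    # Simple modular hash partitioning; illustrative only, not
    # Flink's real partitioner.
    return hash(key) % parallelism

def load_per_worker(keys, parallelism=16):
    # Count how many records each simulated worker would receive.
    return Counter(partition(k, parallelism) for k in keys)

# Balanced input: many distinct keys spread evenly.
balanced = [f"user-{i}" for i in range(10_000)]
# Skewed input: one hot key dominates the data set.
skewed = ["hot-key"] * 9_000 + [f"user-{i}" for i in range(1_000)]

bal = load_per_worker(balanced)
skw = load_per_worker(skewed)

print("balanced max/min per worker:", max(bal.values()), min(bal.values()))
print("skewed   max/min per worker:", max(skw.values()), min(skw.values()))
```

With the skewed input, the worker that owns `hot-key` receives at least 9,000 of the 10,000 records, which mirrors the one-worker-at-100%-CPU symptom described above; upgrading alone would not fix that kind of key skew.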