Cheers for the quick reply, Till. That would be very useful information to have! I'll upgrade my project to Flink 0.10.1 tonight and let you know whether I can find a skew in the data :-)
- Pieter

2016-01-27 13:49 GMT+01:00 Till Rohrmann <trohrm...@apache.org>:

> Could it be that your data is skewed? This could lead to different loads
> on different task managers.
>
> With the latest Flink version, the web interface should show you how many
> bytes each operator has written and received. There you could see if one
> operator receives more elements than the others.
>
> Cheers,
> Till
>
> On Wed, Jan 27, 2016 at 1:35 PM, Pieter Hameete <phame...@gmail.com>
> wrote:
>
>> Hi guys,
>>
>> Currently I am running a job in the GCloud in a configuration with 4 task
>> managers that each have 4 CPUs (for a total parallelism of 16).
>>
>> However, I noticed my job is running much slower than expected and after
>> some more investigation I found that one of the workers is doing a majority
>> of the work (its CPU load was at 100% while the others were almost idle).
>>
>> My job execution plan can be found here: http://i.imgur.com/fHKhVFf.png
>>
>> The input is split into multiple files so loading the data is properly
>> distributed over the workers.
>>
>> I am wondering if you can provide me with some tips on how to figure out
>> what is going wrong here:
>>
>> - Could this imbalance in workload be the result of an imbalance in
>> the hash partitioning?
>> - Is there a convenient way to see how many elements each worker gets
>> to process? Would it work to write the output of the CoGroup to disk
>> because each worker writes to its own output file and investigate the
>> differences?
>> - Is there something strange about the execution plan that could
>> cause this?
>>
>> Thanks and kind regards,
>>
>> Pieter
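
As a rough illustration of the hash partitioning question in the quoted message, here is a minimal Python sketch of how a single frequent key can overload one of 16 partitions. It is not Flink code: `partition_loads` and `skew_ratio` are hypothetical helper names, and `zlib.crc32` is only a deterministic stand-in for whatever hash function the real partitioner uses. The point is just that every record with the same key ends up in the same partition, so one hot key means one busy worker.

```python
from collections import Counter
import zlib


def partition_loads(keys, parallelism=16):
    """Count how many records land in each partition under hash partitioning."""
    loads = Counter()
    for key in keys:
        # crc32 is an assumed stand-in hash, not Flink's actual partitioner.
        partition = zlib.crc32(str(key).encode("utf-8")) % parallelism
        loads[partition] += 1
    # Return a dense list so empty partitions show up as 0.
    return [loads.get(p, 0) for p in range(parallelism)]


def skew_ratio(loads):
    """Busiest partition divided by the mean load (1.0 means perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


if __name__ == "__main__":
    # Made-up data: one "hot" key accounts for 90% of the records.
    keys = ["hot"] * 900 + ["k%d" % i for i in range(100)]
    loads = partition_loads(keys, parallelism=16)
    print(loads)
    print("skew ratio:", round(skew_ratio(loads), 2))
```

Running the same kind of count over the real join keys (for example on a sample of the input) would show whether the key distribution alone could explain one worker sitting at 100% CPU.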