Yes, if that's the case, I think you should go with option (2) and run with the checksums.

On Thu, Oct 6, 2016 at 10:32 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> The problem is that data is very large and usually cannot run on a single
> machine :(
>
> On Thu, Oct 6, 2016 at 10:11 AM, Ufuk Celebi <u...@apache.org> wrote:
>>
>> On Wed, Oct 5, 2016 at 7:08 PM, Tarandeep Singh <tarand...@gmail.com> wrote:
>> > @Stephan my flink cluster setup- 5 nodes, each running 1 TaskManager.
>> > Slots per task manager: 2-4 (I tried varying this to see if this has
>> > any impact).
>> > Network buffers: 5k - 20k (tried different values for it).
>>
>> Could you run the job first on a single task manager to see if the
>> error occurs even if no network shuffle is involved? That should be
>> less overhead for you than running the custom build (which might be
>> buggy ;)).
>>
>> – Ufuk