Re: Batch jobs with a very large number of input splits

2016-08-23 Thread Fabian Hueske
Hi Niels, yes, in YARN mode the default parallelism is the number of available slots. You can change the default task parallelism like this:
1) Use the -p parameter when submitting a job via the CLI client [1]
2) Set the parallelism on the execution environment: env.setParallelism()
Best, Fabian
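
A minimal sketch of option 2 in the Java DataSet API (the job body is just a placeholder; the -p flag, e.g. bin/flink run -p 100 job.jar, achieves the same at submit time):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class SetParallelismExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Override the cluster default (the slot count in YARN mode) for this job.
            env.setParallelism(100);

            // Placeholder job body; any DataSet program works the same way.
            env.fromElements(1, 2, 3)
               .map(new MapFunction<Integer, Integer>() {
                   @Override
                   public Integer map(Integer value) {
                       return value * 2;
                   }
               })
               .print();
        }
    }

Note that a parallelism set in the program takes precedence over the -p flag from the CLI.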

Re: Batch jobs with a very large number of input splits

2016-08-23 Thread Niels Basjes
I did more digging and finally understand what goes wrong. I create a yarn-session with 50 slots. Then I run my job, which (because my HBase table has hundreds of regions) has a lot of input splits. The job then runs with parallelism 50 because I did not specify the value. As a consequence th…
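
A small probe of that default, assuming the Java DataSet API; ExecutionConfig.PARALLELISM_DEFAULT (-1) marks "not set on the client", in which case the cluster falls back to the slot count:

    import org.apache.flink.api.common.ExecutionConfig;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class DefaultParallelismProbe {
        public static void main(String[] args) {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            int p = env.getParallelism();
            if (p == ExecutionConfig.PARALLELISM_DEFAULT) {
                // Nothing set on the client; a job submitted to a YARN session
                // then defaults to the number of available slots (50 in this thread).
                System.out.println("Parallelism not set; cluster default (slot count) applies.");
            } else {
                System.out.println("Explicit parallelism: " + p);
            }
        }
    }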

Re: Batch jobs with a very large number of input splits

2016-08-19 Thread Robert Metzger
Hi Niels, in Flink you don't need one task per file, since splits are assigned lazily to the reading tasks. What exactly is the error you are getting when trying to read that many input splits? (Is it on the JobManager?) Regards, Robert
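
A sketch of the lazy assignment Robert describes, assuming the Java DataSet API (the HDFS path is hypothetical): four reader subtasks pull splits one at a time from the JobManager, however many splits the input produces:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class LazySplitAssignment {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Only 4 reader subtasks, no matter how many splits exist.
            env.setParallelism(4);

            // readTextFile on a directory yields one split per file (or HDFS block);
            // the splits are handed out to the 4 readers on demand, so thousands
            // of splits do not require thousands of tasks.
            DataSet<String> lines = env.readTextFile("hdfs:///data/many-files");

            System.out.println("Lines read: " + lines.count());
        }
    }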

Batch jobs with a very large number of input splits

2016-08-18 Thread Niels Basjes
Hi, I'm working on a batch process using Flink and I ran into an interesting problem. The number of input splits in my job is really, really large. I currently have an HBase input (with more than 1000 regions), and in the past I have worked with MapReduce jobs reading 2000+ files. The problem I have…
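
For context, a sketch of this kind of read with the flink-hbase addon's TableInputFormat (the table name "mytable", column family "cf", and qualifier "q" are made up): the format creates one input split per HBase region, so a table with 1000+ regions yields 1000+ splits regardless of the job's parallelism:

    import org.apache.flink.addons.hbase.TableInputFormat;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseBatchRead {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // One input split per HBase region; the reader tasks pick up the
            // splits lazily, as described earlier in the thread.
            DataSet<Tuple2<String, String>> rows = env.createInput(
                new TableInputFormat<Tuple2<String, String>>() {
                    @Override
                    protected Scan getScanner() {
                        return new Scan().addFamily(Bytes.toBytes("cf"));
                    }

                    @Override
                    protected String getTableName() {
                        return "mytable";
                    }

                    @Override
                    protected Tuple2<String, String> mapResultToTuple(Result r) {
                        return new Tuple2<>(
                            Bytes.toString(r.getRow()),
                            Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"))));
                    }
                });

            rows.first(10).print();
        }
    }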