Hi Niels,

In Flink you don't need one task per input split (file or HBase region),
since splits are assigned lazily to the reading tasks.
What exactly is the error you are getting when trying to read that many
input splits? (Is it on the JobManager?)
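
For illustration, a batch job roughly like the following (an untested sketch
using the flink-hbase TableInputFormat; the table name, column family and the
parallelism of 10 are just placeholders) would read all 1000+ region splits
with only 10 parallel source tasks, because each source task requests the next
unassigned split from the JobManager as soon as it has finished its current one:

import org.apache.flink.addons.hbase.TableInputFormat;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseBatchJob {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // 10 parallel source tasks are enough even for 1000+ region splits:
        // splits are pulled lazily by the tasks instead of being assigned up front.
        env.setParallelism(10);

        DataSet<Tuple2<String, String>> rows = env.createInput(
            new TableInputFormat<Tuple2<String, String>>() {
                @Override
                protected String getTableName() {
                    return "my_table";                                 // placeholder table name
                }

                @Override
                protected Scan getScanner() {
                    return new Scan().addFamily(Bytes.toBytes("cf"));  // placeholder column family
                }

                @Override
                protected Tuple2<String, String> mapResultToTuple(Result r) {
                    // Keep only the row key for this sketch.
                    return new Tuple2<>(Bytes.toString(r.getRow()), "");
                }
            });

        rows.first(10).print();
    }
}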

Regards,
Robert

On Thu, Aug 18, 2016 at 1:56 PM, Niels Basjes <ni...@basjes.nl> wrote:

> Hi,
>
> I'm working on a batch process using Flink and I ran into an interesting
> problem.
> The number of input splits in my job is really really large.
>
> I currently have an HBase input (with more than 1000 regions) and in the
> past I have worked with MapReduce jobs reading 2000+ files.
>
> The problem I have is that if I run such a job in a "small" yarn-session
> (i.e. with fewer than 1000 task slots) I get a fatal error indicating that
> there are not enough resources.
> For a continuous streaming job this makes sense, yet for a batch job like
> mine this is an undesirable error.
>
> For my HBase situation I currently have a workaround: I override the
> createInputSplits method of the TableInputFormat and thus control the
> input splits that are created.
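>
> Roughly along these lines (a simplified sketch; the actual merging of the
> per-region splits into fewer, wider splits depends on my data and is left
> out here):
>
> import java.io.IOException;
> import org.apache.flink.addons.hbase.TableInputFormat;
> import org.apache.flink.addons.hbase.TableInputSplit;
> import org.apache.flink.api.java.tuple.Tuple2;
>
> public abstract class CoarseTableInputFormat
>         extends TableInputFormat<Tuple2<String, String>> {
>
>     @Override
>     public TableInputSplit[] createInputSplits(int minNumSplits) throws IOException {
>         // One split per HBase region, as computed by the parent class.
>         TableInputSplit[] regionSplits = super.createInputSplits(minNumSplits);
>         // Reduce the per-region splits to a smaller number of splits covering
>         // wider row-key ranges here, before handing them to the JobManager.
>         return regionSplits;
>     }
>
>     // getTableName(), getScanner() and mapResultToTuple() are omitted in this
>     // sketch, hence the class is declared abstract here.
> }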
>
> What is the correct way to solve this (no, my cluster is NOT big enough to
> run that many parallel tasks)?
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>
