Hi Greg,

I don't think we have discussed a parallelized collection input format
yet. Thanks for bringing this up.

I think it should be possible to implement a generic parallel collection
input format. However, I have two questions here:

1. Is it really a problem for users that their job exceeds the akka frame
size limit when using the collection input format? The collection input
format should be used primarily for testing and, thus, the data should be
rather small.

2. Which message is causing the frame size problem? If it is the task
deployment descriptor, then a parallelized collection input format which
works with input splits can solve the problem. If the problem is rather the
`SubmitJob` message, then we have to tackle the problem differently. The
reason is that the input splits are only created on the `JobManager`.
Before that, the collection is simply written into the task config of the
data source `JobVertex`, because we don't know the number of subtasks yet.
In the latter case, which can also be caused by large closure objects, we
should send the job to the `JobManager` via the blob manager to solve the
problem.
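
To make the split-based idea a bit more concrete, here is a rough sketch in
plain Java of how a collection could be partitioned into contiguous chunks,
one per input split. It deliberately does not use any Flink classes:
`CollectionSplitter` and `split` are hypothetical names for illustration only,
and a real input format would additionally have to serialize each chunk into
its own split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollectionSplitter {

    /**
     * Hypothetical helper (not part of Flink): partitions the elements into
     * numSplits contiguous chunks whose sizes differ by at most one.
     */
    public static <T> List<List<T>> split(List<T> elements, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be at least 1");
        }
        int n = elements.size();
        List<List<T>> splits = new ArrayList<>(numSplits);
        for (int i = 0; i < numSplits; i++) {
            // Use long arithmetic so that i * n cannot overflow for large collections.
            int from = (int) ((long) i * n / numSplits);
            int to = (int) ((long) (i + 1) * n / numSplits);
            // Copy the range so that each chunk can be serialized independently.
            splits.add(new ArrayList<>(elements.subList(from, to)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = split(Arrays.asList(1, 2, 3, 4, 5, 6, 7), 3);
        System.out.println(splits); // prints [[1, 2], [3, 4], [5, 6, 7]]
    }
}
```

Note that this would only help with the task deployment descriptor. If the
`SubmitJob` message itself is too large, the whole collection still travels
to the `JobManager` in one piece.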

Cheers,
Till

On Mon, Apr 25, 2016 at 3:45 PM, Greg Hogan <c...@greghogan.com> wrote:

> Hi,
>
> CollectionInputFormat currently enforces a parallelism of 1 by implementing
> NonParallelInput and serializing the entire Collection. If my understanding
> is correct this serialized InputFormat is often the cause of a new job
> exceeding the akka message size limit.
>
> As an alternative the Collection elements could be serialized into multiple
> InputSplits. Has this idea been considered and rejected?
>
> Thanks,
> Greg
>
