Hi Till, I appreciate the detailed explanation. My specific case has been with the graph generators. I think it is possible to implement some random sources using SplittableIterator rather than building a Collection, so it might be best to rework the graph generator API to better fit the Flink model. For LCGs we can simply build a skip-ahead table.
Greg On Mon, Apr 25, 2016 at 10:05 AM, Till Rohrmann <trohrm...@apache.org> wrote: > Hi Greg, > > I think we haven't discussed the opportunity for a parallelized collection > input format, yet. Thanks for bringing this up. > > I think it should be possible to implement a generic parallel collection > input format. However, I have two questions here: > > 1. Is it really a problem for users that their job exceeds the akka frame > size limit when using the collection input format? The collection input > format should be used primarily for testing and, thus, the data should be > rather small. > > 2. Which message is causing the frame size problem? If it is the task > deployment descriptor, then a parallelized collection input format which > works with input splits can solve the problem. If the problem is rather the > `SubmitJob` message, then we have to tackle the problem differently. The > reason is that the input splits are only created on the `JobManager`. > Before, the collection is simply written into the task config of the data > source `JobVertex`, because we don't know the number of sub tasks yet. In > the latter case, which can also be cause by large closure objects, we > should send the job via the blob manager to the `JobManager` to solve the > problem. > > Cheers, > Till > > On Mon, Apr 25, 2016 at 3:45 PM, Greg Hogan <c...@greghogan.com> wrote: > > > Hi, > > > > CollectionInputFormat currently enforces a parallelism of 1 by > implementing > > NonParallelInput and serializing the entire Collection. If my > understanding > > is correct this serialized InputFormat is often the cause of a new job > > exceeding the akka message size limit. > > > > As an alternative the Collection elements could be serialized into > multiple > > InputSplits. Has this idea been considered and rejected? > > > > Thanks, > > Greg > > >