Interesting question; you are the second to ask that. Batching in user code is one way, as Matthias said (a rough sketch follows below). We have on the roadmap a way to transform a stream into a set of batches, but it will be a while until this is available. See https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams
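In the meantime, batching in user code via mapPartition could look roughly like the following. This is a minimal sketch, not a definitive implementation: the name FixedSizeBatcher and the Double element type are placeholders (substitute your own tuple type), and the last batch of each partition may be smaller than the requested size.

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.util.Collector;

// Groups the elements of each partition into fixed-size batches.
// Note: Flink sees each emitted List as a single opaque record.
public class FixedSizeBatcher implements MapPartitionFunction<Double, List<Double>> {

    private final int batchSize;

    public FixedSizeBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    @Override
    public void mapPartition(Iterable<Double> values, Collector<List<Double>> out) {
        List<Double> batch = new ArrayList<Double>(batchSize);
        for (Double value : values) {
            batch.add(value);
            if (batch.size() == batchSize) {
                out.collect(batch);
                batch = new ArrayList<Double>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            out.collect(batch); // emit the trailing partial batch
        }
    }
}

Used as, e.g., data.mapPartition(new FixedSizeBatcher(1000)), this gives you batches of a controlled size regardless of how large the partitions themselves are.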
What kind of operation do you want to do on the batches? Will the batches communicate (repartition, group, join), or will they only work partition-local?

On Fri, Sep 4, 2015 at 1:12 PM, Matthias J. Sax <mj...@apache.org> wrote:
> Hi Andres,
>
> you could do this by using your own data type, for example
>
> public class MyBatch {
>
>     private ArrayList<MyTupleType> data = new ArrayList<MyTupleType>();
>
> }
>
> In the DataSource, you need to specify your own InputFormat that reads
> multiple tuples into a batch and emits the whole batch at once.
>
> However, be aware that this POJO type hides the batch nature from
> Flink, i.e., Flink does not know anything about the tuples in the batch.
> To Flink, a batch is a single tuple. If you want to perform key-based
> operations, this might become a problem.
>
> -Matthias
>
> On 09/04/2015 01:00 PM, Andres R. Masegosa wrote:
> > Hi,
> >
> > I'm trying to code some machine learning algorithms on top of Flink,
> > such as variational Bayes learning algorithms. Instead of working at
> > the data element level (i.e., using map transformations), it would be
> > far more efficient to work at the "batch of elements" level (i.e., I
> > get a batch of elements and I produce some output).
> >
> > I could code that using the "mapPartition" function, but I cannot
> > control the size of the partitions, can I?
> >
> > Is there any way to transform a stream (or DataSet) of elements into a
> > stream (or DataSet) of data batches of the same size?
> >
> > Thanks for your support,
> > Andres
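For completeness, the custom-InputFormat approach Matthias describes could be sketched as a generic wrapper around an existing format. To be clear, this is a hypothetical sketch, not an existing Flink class: BatchingInputFormat is an invented name, and it assumes the wrapped format can create its own record instances when the reuse object is null.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.io.InputFormat;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.io.InputSplit;
import org.apache.flink.core.io.InputSplitAssigner;

// Wraps an arbitrary InputFormat and emits its records in batches of
// `batchSize` (the last batch of a split may be smaller).
public class BatchingInputFormat<T, S extends InputSplit> implements InputFormat<List<T>, S> {

    private final InputFormat<T, S> wrapped;
    private final int batchSize;

    public BatchingInputFormat(InputFormat<T, S> wrapped, int batchSize) {
        this.wrapped = wrapped;
        this.batchSize = batchSize;
    }

    @Override
    public void configure(Configuration parameters) {
        wrapped.configure(parameters);
    }

    @Override
    public BaseStatistics getStatistics(BaseStatistics cachedStatistics) throws IOException {
        return wrapped.getStatistics(cachedStatistics);
    }

    @Override
    public S[] createInputSplits(int minNumSplits) throws IOException {
        return wrapped.createInputSplits(minNumSplits);
    }

    @Override
    public InputSplitAssigner getInputSplitAssigner(S[] inputSplits) {
        return wrapped.getInputSplitAssigner(inputSplits);
    }

    @Override
    public void open(S split) throws IOException {
        wrapped.open(split);
    }

    @Override
    public boolean reachedEnd() throws IOException {
        return wrapped.reachedEnd();
    }

    @Override
    public List<T> nextRecord(List<T> reuse) throws IOException {
        List<T> batch = new ArrayList<T>(batchSize);
        while (batch.size() < batchSize && !wrapped.reachedEnd()) {
            // Assumes the wrapped format creates a new record when reuse is null.
            T record = wrapped.nextRecord(null);
            if (record != null) {
                batch.add(record);
            }
        }
        return batch;
    }

    @Override
    public void close() throws IOException {
        wrapped.close();
    }
}

As Matthias points out, the emitted List is a single opaque record to Flink, so key-based operations on the contained tuples are not possible without unpacking the batch first. You will likely also need to pass an explicit TypeInformation when creating the data source, since Flink cannot infer the batch's element type through the generic List.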