Interesting question; you are the second person to ask that.

Batching in user code is one way, as Matthias said. We have on the roadmap a
way to transform a stream into a set of batches, but it will be a while
before this is available. See
https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams
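
To batch in user code on a DataStream, something like the following sketch
works (class name and details are made up; note that a trailing partial
batch is not emitted here, and due to type erasure you may need to give
Flink a type hint for the List output):

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.flink.api.common.functions.FlatMapFunction;
  import org.apache.flink.util.Collector;

  public class BatchingFlatMap<T> implements FlatMapFunction<T, List<T>> {

    private final int batchSize;
    private List<T> buffer = new ArrayList<T>();

    public BatchingFlatMap(int batchSize) {
      this.batchSize = batchSize;
    }

    @Override
    public void flatMap(T value, Collector<List<T>> out) {
      // each parallel instance buffers elements and emits them
      // as one batch once batchSize elements have arrived
      buffer.add(value);
      if (buffer.size() >= batchSize) {
        out.collect(buffer);
        buffer = new ArrayList<T>();
      }
    }
  }

You would use it like "stream.flatMap(new BatchingFlatMap<MyTupleType>(100))".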

What kind of operation do you want to perform on the batches? Will the
batches communicate (repartition, group, join), or will they only work
partition-local?

On Fri, Sep 4, 2015 at 1:12 PM, Matthias J. Sax <mj...@apache.org> wrote:

> Hi Andres,
>
> you could do this by using your own data type, for example
>
> > import java.util.ArrayList;
> >
> > public class MyBatch {
> >   private ArrayList<MyTupleType> data = new ArrayList<MyTupleType>();
> > }
>
> In the DataSource, you need to specify your own InputFormat that reads
> multiple tuples into a batch and emits the whole batch at once.
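>
> A rough sketch of such an InputFormat could look like this (the file
> path, the parse() helper, and the getData() accessor are placeholders;
> error handling, close(), and proper split handling are omitted, so
> every parallel instance would read the same file):
>
> > import java.io.BufferedReader;
> > import java.io.FileReader;
> > import java.io.IOException;
> >
> > import org.apache.flink.api.common.io.GenericInputFormat;
> > import org.apache.flink.core.io.GenericInputSplit;
> >
> > public class BatchInputFormat extends GenericInputFormat<MyBatch> {
> >
> >   private static final int BATCH_SIZE = 100;
> >   private transient BufferedReader reader;
> >   private transient String nextLine;
> >
> >   @Override
> >   public void open(GenericInputSplit split) throws IOException {
> >     super.open(split);
> >     reader = new BufferedReader(new FileReader("/path/to/input"));
> >     nextLine = reader.readLine();
> >   }
> >
> >   @Override
> >   public boolean reachedEnd() {
> >     return nextLine == null;
> >   }
> >
> >   @Override
> >   public MyBatch nextRecord(MyBatch reuse) throws IOException {
> >     // fill one MyBatch with up to BATCH_SIZE parsed lines
> >     MyBatch batch = new MyBatch();
> >     for (int i = 0; i < BATCH_SIZE && nextLine != null; i++) {
> >       batch.getData().add(parse(nextLine));
> >       nextLine = reader.readLine();
> >     }
> >     return batch;
> >   }
> >
> >   private MyTupleType parse(String line) {
> >     return new MyTupleType(); // placeholder parsing logic
> >   }
> > }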
>
> However, be aware that this POJO type hides the batch nature from
> Flink, i.e., Flink does not know anything about the tuples inside the
> batch. To Flink, a batch is a single tuple. If you want to perform
> key-based operations on the individual tuples, this might become a
> problem.
>
> -Matthias
>
> On 09/04/2015 01:00 PM, Andres R. Masegosa wrote:
> > Hi,
> >
> > I'm trying to code some machine learning algorithms on top of Flink,
> > such as variational Bayes learning algorithms. Instead of working at
> > the level of single data elements (i.e., using map transformations),
> > it would be far more efficient to work at the level of batches of
> > elements (i.e., I get a batch of elements and produce some output).
> >
> > I could code that using the "mapPartition" function, but I cannot
> > control the size of the partitions, can I?
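> >
> > Something like this sketch is what I have in mind (where "data" is a
> > DataSet<MyTupleType>; "Result" and "processBatch" are placeholders):
> >
> > > data.mapPartition(new MapPartitionFunction<MyTupleType, Result>() {
> > >   @Override
> > >   public void mapPartition(Iterable<MyTupleType> values,
> > >                            Collector<Result> out) {
> > >     // the whole partition arrives as one "batch", but its size
> > >     // depends on the parallelism, not on a parameter I can set
> > >     out.collect(processBatch(values));
> > >   }
> > > });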
> >
> > Is there any way to transform a stream (or DataSet) of elements into
> > a stream (or DataSet) of batches of the same size?
> >
> >
> > Thanks for your support,
> > Andres
> >
>
>
