Yes, this is a streaming use case, so broadcast variables are not an option. If I understand correctly, with connected streams I would emulate a side input by "streaming" the matrix with a key that all incoming vector records match on, something like the sketch below?
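Rough, untested sketch of what I have in mind (Flink 1.x Java DataStream API; the constant key and the raw double[][]/double[] element types are just placeholders for illustration):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class MatrixVectorSketch {

    public static DataStream<double[]> multiplyStream(
            DataStream<double[][]> matrixStream,  // the matrix, "streamed" once (or on updates)
            DataStream<double[]> vectorStream) {  // the actual stream of vectors

        return matrixStream
            .connect(vectorStream)
            .keyBy(
                new KeySelector<double[][], Integer>() {
                    @Override
                    public Integer getKey(double[][] m) { return 0; }  // constant key
                },
                new KeySelector<double[], Integer>() {
                    @Override
                    public Integer getKey(double[] v) { return 0; }    // same constant key
                })
            .flatMap(new CoFlatMapFunction<double[][], double[], double[]>() {

                private double[][] matrix;  // kept per parallel instance of this operator

                @Override
                public void flatMap1(double[][] m, Collector<double[]> out) {
                    matrix = m;  // store/update the matrix when it arrives
                }

                @Override
                public void flatMap2(double[] v, Collector<double[]> out) {
                    if (matrix == null) {
                        return;  // vector arrived before the matrix; would need buffering
                    }
                    double[] result = new double[matrix.length];
                    for (int i = 0; i < matrix.length; i++) {
                        double sum = 0.0;
                        for (int j = 0; j < v.length; j++) {
                            sum += matrix[i][j] * v[j];
                        }
                        result[i] = sum;
                    }
                    out.collect(result);
                }
            });
    }
}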
Wouldn't that create multiple copies of the matrix in memory?

On Thu, Aug 4, 2016 at 6:56 PM, Sameer W <sam...@axiomine.com> wrote:

> Theodore,
>
> Broadcast variables do that when using the DataSet API -
> http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/
>
> See the following lines in the article:
>
> "To support the above presented algorithm efficiently we had to improve
> Flink's broadcasting mechanism since it easily becomes the bottleneck of
> the implementation. The enhanced Flink version can share broadcast
> variables among multiple tasks running on the same machine. *Sharing
> avoids having to keep for each task an individual copy of the broadcasted
> variable on the heap. This increases the memory efficiency significantly,
> especially if the broadcasted variables can grow up to several GBs of size.*"
>
> If you are using the DataStream API, then side inputs (not yet
> implemented) would achieve the same as broadcast variables
> (https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#).
> I use keyed connected streams for one of my use cases (propagating rule
> changes to the data) where I could have used side inputs.
>
> Sameer
>
>
> On Thu, Aug 4, 2016 at 8:56 PM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
>> Hello all,
>>
>> For a prototype we are looking into, we would like to read a big matrix
>> from HDFS, and for every element that comes in a stream of vectors do a
>> multiplication with the matrix. The matrix should fit in the memory of
>> one machine.
>>
>> We can read in the matrix using a RichMapFunction, but that would mean
>> that a copy of the matrix is made for each Task Slot AFAIK, if the
>> RichMapFunction is instantiated once per Task Slot.
>>
>> So I'm wondering how we should try to address this problem: is it
>> possible to have just one copy of the object in memory per TM?
>>
>> As a follow-up, if we have more than one TM per node, is it possible to
>> share memory between them? My guess is that we have to look at some
>> external store for that.
>>
>> Cheers,
>> Theo
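P.S. For reference, the RichMapFunction approach I mentioned would look roughly like the sketch below (the HDFS path and loadMatrixFromHdfs() are placeholders for whatever reader/format we end up using); this is exactly what gives one matrix copy per parallel task instance:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class MultiplyWithMatrix extends RichMapFunction<double[], double[]> {

    private transient double[][] matrix;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Each parallel instance loads its own copy of the matrix on startup.
        matrix = loadMatrixFromHdfs("hdfs:///path/to/matrix");  // placeholder path
    }

    @Override
    public double[] map(double[] vector) {
        double[] result = new double[matrix.length];
        for (int i = 0; i < matrix.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < vector.length; j++) {
                sum += matrix[i][j] * vector[j];
            }
            result[i] = sum;
        }
        return result;
    }

    // Placeholder: read and parse the matrix file from HDFS.
    private static double[][] loadMatrixFromHdfs(String path) throws Exception {
        throw new UnsupportedOperationException("not part of this sketch");
    }
}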