You mean "Connected Streams"? I use that for the same requirement. The way it works, it looks like it creates multiple copies per co-map operation. I use the keyed version to match side inputs with the data.
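For reference, the keyed co-map pattern discussed in this thread can be sketched in plain Java (no Flink dependencies, so it stays runnable here; the class `SideInputCoMap` and its two methods are made-up stand-ins — in an actual Flink job this role would be played by a `CoFlatMapFunction` applied to keyed `ConnectedStreams`):

```java
// Plain-Java sketch of the keyed co-map idea: one input carries the
// "side input" (the matrix), the other carries the data (vectors).
// Hypothetical stand-in for a Flink CoFlatMapFunction on ConnectedStreams.
class SideInputCoMap {
    private double[][] matrix; // held per parallel instance, like operator state

    // flatMap1: receives the side input (the matrix) and stores it
    void flatMap1(double[][] sideInput) {
        this.matrix = sideInput;
    }

    // flatMap2: receives a data vector and multiplies the matrix with it
    double[] flatMap2(double[] vector) {
        if (matrix == null) {
            return null; // side input not yet seen; a real job would buffer
        }
        double[] result = new double[matrix.length];
        for (int i = 0; i < matrix.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < vector.length; j++) {
                sum += matrix[i][j] * vector[j];
            }
            result[i] = sum;
        }
        return result;
    }
}
```

Note that each parallel instance of such a co-map keeps its own copy of the matrix, which is exactly the duplication concern raised below.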
Sent from my iPhone

> On Aug 5, 2016, at 12:36 PM, Theodore Vasiloudis
> <theodoros.vasilou...@gmail.com> wrote:
>
> Yes, this is a streaming use case, so broadcast is not an option.
>
> If I get it correctly, with connected streams I would emulate a side input by
> "streaming" the matrix with a key that all incoming vector records match on?
>
> Wouldn't that create multiple copies of the matrix in memory?
>
>> On Thu, Aug 4, 2016 at 6:56 PM, Sameer W <sam...@axiomine.com> wrote:
>> Theodore,
>>
>> Broadcast variables do that when using the DataSet API -
>> http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/
>>
>> See the following lines in the article:
>> "To support the above presented algorithm efficiently we had to improve
>> Flink's broadcasting mechanism since it easily becomes the bottleneck of the
>> implementation. The enhanced Flink version can share broadcast variables
>> among multiple tasks running on the same machine. Sharing avoids having to
>> keep for each task an individual copy of the broadcasted variable on the
>> heap. This increases the memory efficiency significantly, especially if the
>> broadcasted variables can grow up to several GBs in size."
>>
>> If you are using the DataStream API, then side inputs (not yet
>> implemented) would achieve the same as broadcast variables
>> (https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#).
>> I use keyed Connected Streams in one of my use cases (propagating rule
>> changes to the data) where I could have used side inputs.
>>
>> Sameer
>>
>>> On Thu, Aug 4, 2016 at 8:56 PM, Theodore Vasiloudis
>>> <theodoros.vasilou...@gmail.com> wrote:
>>> Hello all,
>>>
>>> For a prototype we are looking into, we would like to read a big matrix
>>> from HDFS, and for every element that comes in a stream of vectors do one
>>> multiplication with the matrix. The matrix should fit in the memory of one
>>> machine.
>>>
>>> We can read in the matrix using a RichMapFunction, but that would mean
>>> that a copy of the matrix is made for each task slot, AFAIK, since the
>>> RichMapFunction is instantiated once per task slot.
>>>
>>> So I'm wondering how we should try to address this problem: is it possible
>>> to have just one copy of the object in memory per TM?
>>>
>>> As a follow-up, if we have more than one TM per node, is it possible to
>>> share memory between them? My guess is that we would have to look at some
>>> external store for that.
>>>
>>> Cheers,
>>> Theo
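On the "one copy per TM" question: a TaskManager is a single JVM and all of its task slots run inside that JVM, so one common workaround is a static holder that loads the matrix once per JVM and is shared by all operator instances in it. A minimal sketch, assuming this setup (the class `MatrixHolder` and its `load()` placeholder are made up for illustration; a real job would read from HDFS inside a RichMapFunction's `open()` and call the holder from there):

```java
// Sketch of sharing one matrix copy per JVM, i.e. per TaskManager, since all
// task slots of a TaskManager run in the same JVM. Double-checked locking
// ensures the matrix is loaded exactly once even with concurrent callers.
class MatrixHolder {
    private static volatile double[][] matrix;

    static double[][] get() {
        if (matrix == null) {
            synchronized (MatrixHolder.class) {
                if (matrix == null) {
                    matrix = load(); // runs once per JVM, not once per task slot
                }
            }
        }
        return matrix;
    }

    private static double[][] load() {
        // Placeholder for reading the matrix from HDFS.
        return new double[][]{{1.0, 0.0}, {0.0, 1.0}};
    }
}
```

With this pattern, only the first caller in a given JVM pays the load cost; every other task slot gets the same reference. Sharing across multiple TaskManagers on the same node would still require an external store, as the thread suggests.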