Hi Stefan Just a clarification : The output corresponding to an element based on the whole data will be a union of the outputs based on the two halves. Is this what you're trying to achieve? [It appears like that since every flatMap task will independently produce outputs.]
In that case, one solution could be to simply have two flatMap operations based on parts of the *broadcast* data set, and take a union. Cheers Sachin On Sep 13, 2015 7:04 PM, "Stefan Bunk" <stefan.b...@googlemail.com> wrote: > Hi! > > Following problem: I have 10 nodes on which I want to execute a flatMap > operator on a DataSet. In the open method of the operator, some data is > read from disk and preprocessed, which is necessary for the operator. > Problem is, the data does not fit in memory on one node, however, half of > the data does. > So in five out of ten nodes, I stored one half of the data to be read in > the open method, and the other half on the other five nodes. > > Now my question: How can I distribute my DataSet, so that each element is > sent once to a node with the first half of my data and once to a node with > the other half? > > I looked at implementing a custom partitioner, however my problems were: > (i) I have no mapping from the number I am supposed to return to the nodes > to the data. How do I know, that index 5 contains one half of the data, and > index 6 the other half? > (ii) I do not know the current index. Obviously, I want to send my DataSet > element only once over the network. > > Best, > Stefan >