Hi Stefan
Just a clarification : The output corresponding to an element based on the
whole data will be a union of the outputs based on the two halves. Is this
what you're trying to achieve? [It appears  like that since every flatMap
task will independently produce outputs.]

In that case, one solution could be to simply have two flatMap operations
based on parts of the *broadcast* data set, and take a union.

Cheers
Sachin
On Sep 13, 2015 7:04 PM, "Stefan Bunk" <stefan.b...@googlemail.com> wrote:

> Hi!
>
> Following problem: I have 10 nodes on which I want to execute a flatMap
> operator on a DataSet. In the open method of the operator, some data is
> read from disk and preprocessed, which is necessary for the operator.
> Problem is, the data does not fit in memory on one node, however, half of
> the data does.
> So in five out of ten nodes, I stored one half of the data to be read in
> the open method, and the other half on the other five nodes.
>
> Now my question: How can I distribute my DataSet, so that each element is
> sent once to a node with the first half of my data and once to a node with
> the other half?
>
> I looked at implementing a custom partitioner, however my problems were:
> (i) I have no mapping from the number I am supposed to return to the nodes
> to the data. How do I know, that index 5 contains one half of the data, and
> index 6 the other half?
> (ii) I do not know the current index. Obviously, I want to send my DataSet
> element only once over the network.
>
> Best,
> Stefan
>

Reply via email to