Hi!

I have the following problem: I have 10 nodes on which I want to execute a
flatMap operator on a DataSet. In the open method of the operator, some
data that the operator needs is read from disk and preprocessed.
The problem is that the full data does not fit in memory on one node, but
half of it does.
So on five of the ten nodes I stored one half of the data to be read in
the open method, and on the other five nodes I stored the other half.

Now my question: how can I distribute my DataSet so that each element is
sent once to a node with the first half of my data and once to a node with
the other half?

I looked at implementing a custom partitioner, but ran into two problems:
(i) I have no mapping from the index I am supposed to return to the node,
and therefore to the data stored on it. How do I know that index 5 holds
one half of the data and index 6 the other half?
(ii) I do not know the current index. Obviously, I want to send my DataSet
element only once over the network.
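
To make the routing I have in mind concrete, here is a rough sketch in
plain Java, without any Flink classes. It assumes that partition indices
0..n/2-1 sit on the nodes with the first half of the data and n/2..n-1 on
the nodes with the second half, which is exactly the mapping I do not have
(problem (i)). The names HalfRouter, Tagged, duplicate and targetPartition
are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class HalfRouter {

    /** Hypothetical wrapper: an element plus the half of the side data it must meet. */
    public static final class Tagged<T> {
        public final int half;   // 0 = first half, 1 = second half
        public final T value;

        public Tagged(int half, T value) {
            this.half = half;
            this.value = value;
        }
    }

    /** Emits two copies of an element, one per half (the job of a flatMap before partitioning). */
    public static <T> List<Tagged<T>> duplicate(T value) {
        List<Tagged<T>> out = new ArrayList<>(2);
        out.add(new Tagged<>(0, value));
        out.add(new Tagged<>(1, value));
        return out;
    }

    /**
     * Picks a target partition for a tagged copy.
     * ASSUMPTION: indices [0, n/2) hold half 0 and [n/2, n) hold half 1.
     * Requires numPartitions >= 2.
     */
    public static int targetPartition(Tagged<?> t, int numPartitions) {
        int perHalf = numPartitions / 2;
        // Spread the copies of one half over the partitions that hold that half.
        int offset = Math.floorMod(Objects.hashCode(t.value), perHalf);
        return t.half * perHalf + offset;
    }

    public static void main(String[] args) {
        for (Tagged<String> t : duplicate("element")) {
            System.out.println("half " + t.half + " -> partition " + targetPartition(t, 10));
        }
    }
}
```

With 10 partitions, the copy tagged 0 lands somewhere in 0..4 and the copy
tagged 1 in 5..9, so each element would meet both halves once. Whether
those indices really correspond to the right nodes is the open question.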

Best,
Stefan
