Hi! Following problem: I have 10 nodes on which I want to execute a flatMap operator on a DataSet. In the open method of the operator, some data is read from disk and preprocessed, which is necessary for the operator. Problem is, the data does not fit in memory on one node, however, half of the data does. So in five out of ten nodes, I stored one half of the data to be read in the open method, and the other half on the other five nodes.
Now my question: How can I distribute my DataSet, so that each element is sent once to a node with the first half of my data and once to a node with the other half? I looked at implementing a custom partitioner, however my problems were: (i) I have no mapping from the number I am supposed to return to the nodes to the data. How do I know, that index 5 contains one half of the data, and index 6 the other half? (ii) I do not know the current index. Obviously, I want to send my DataSet element only once over the network. Best, Stefan