Re: Distribute DataSet to subset of nodes

Sachin Goel Sun, 13 Sep 2015 12:52:15 -0700

Of course, someone else might have better ideas regarding the partitioner. :)
On Sep 14, 2015 1:12 AM, "Sachin Goel" <sachingoel0...@gmail.com> wrote:


> Hi Stefan
> Just a clarification: the output corresponding to an element based on the
> whole data will be a union of the outputs based on the two halves. Is this
> what you're trying to achieve? [It appears so, since every flatMap
> task will independently produce outputs.]
>
> In that case, one solution could be to simply have two flatMap operations
> based on parts of the *broadcast* data set, and take a union.
>
> Cheers
> Sachin
> On Sep 13, 2015 7:04 PM, "Stefan Bunk" <stefan.b...@googlemail.com> wrote:
>
>> Hi!
>>
>> Following problem: I have 10 nodes on which I want to execute a flatMap
>> operator on a DataSet. In the open method of the operator, some data is
>> read from disk and preprocessed, which is necessary for the operator.
>> The problem is that the data does not fit in memory on one node; half of
>> the data does, however.
>> So on five of the ten nodes, I stored one half of the data to be read in
>> the open method, and the other half on the other five nodes.
>>
>> Now my question: How can I distribute my DataSet, so that each element is
>> sent once to a node with the first half of my data and once to a node with
>> the other half?
>>
>> I looked at implementing a custom partitioner, however my problems were:
>> (i) I have no mapping from the number I am supposed to return to the
>> node holding each part of the data. How do I know that index 5 contains
>> one half of the data, and index 6 the other half?
>> (ii) I do not know the current index. Obviously, I want to send my
>> DataSet element only once over the network.
>>
>> Best,
>> Stefan
>>
>
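
The clarification in Sachin's reply (the output on the whole data is the union of the outputs on the two halves) can be checked with a small sketch. This is plain Java without any Flink dependency; the class name, the method names, and the toy matching rule (same first character) are all hypothetical and only stand in for whatever the real flatMap does with the data it loads in open. In Flink itself the same shape would presumably be two flatMap operators, one per half, combined with DataSet.union.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitFlatMapSketch {

    // Stand-in for one flatMap pass: emits "element->entry" for every
    // reference entry that starts with the same character as the element.
    // The matching rule is an assumption made purely for illustration.
    static List<String> flatMapAgainst(List<String> elements, List<String> reference) {
        List<String> out = new ArrayList<>();
        for (String element : elements) {
            for (String entry : reference) {
                if (entry.charAt(0) == element.charAt(0)) {
                    out.add(element + "->" + entry);
                }
            }
        }
        return out;
    }

    // Two independent passes, one per half of the reference data,
    // followed by a union (concatenation) of their outputs.
    static List<String> unionOfHalves(List<String> elements,
                                      List<String> firstHalf,
                                      List<String> secondHalf) {
        List<String> out = new ArrayList<>(flatMapAgainst(elements, firstHalf));
        out.addAll(flatMapAgainst(elements, secondHalf));
        return out;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("apple", "bear");
        List<String> firstHalf = Arrays.asList("avocado", "banana");
        List<String> secondHalf = Arrays.asList("apricot", "cherry");

        // Reference run against the whole data, which the original
        // poster cannot hold in memory on a single node.
        List<String> whole = new ArrayList<>(firstHalf);
        whole.addAll(secondHalf);

        Set<String> expected = new HashSet<>(flatMapAgainst(elements, whole));
        Set<String> actual = new HashSet<>(unionOfHalves(elements, firstHalf, secondHalf));

        // Compared as sets because the union does not preserve output order.
        System.out.println(expected.equals(actual));
    }
}
```

The equality only holds when each output depends on a single reference entry at a time; if the operator needs to see both halves together to produce one output, splitting into two passes would not be equivalent.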
