Re: Distribute DataSet to subset of nodes

Fabian Hueske Mon, 14 Sep 2015 13:29:30 -0700

Hi Stefan,

Forcing the scheduling of tasks to certain nodes and reading files from the
local file system in a multi-node setup is actually quite tricky and
requires some understanding of the internals.
It is possible, and I can help you with that, but I would recommend using a
shared file system such as HDFS if that is an option.


Best, Fabian

2015-09-14 19:16 GMT+02:00 Stefan Bunk <stefan.b...@googlemail.com>:

> Hi,
>
> actually, I am distributing my data before the program starts, without
> using broadcast sets.
>
> However, the approach should still work, under one condition:
>
>> DataSet mapped1 =
>> data.flatMap(yourMap).withBroadcastSet(smallData1,"data").setParallelism(5);
>> DataSet mapped2 =
>> data.flatMap(yourMap).withBroadcastSet(smallData2,"data").setParallelism(5);
>>
> Is it guaranteed that this selects a disjoint set of nodes, i.e. five
> nodes for mapped1 and five other nodes for mapped2?
>
> Is there any way to select the five nodes explicitly? Currently, I have
> stored the first half of the data on nodes 1-5 and the second half on nodes
> 6-10. With this approach, I guess nodes are selected randomly, so I would
> have to copy both halves to all of the nodes.
>
> Best,
> Stefan
>
>
