Hi Stefan,

forcing tasks to be scheduled on specific nodes and reading files from the local file system in a multi-node setup is actually quite tricky and requires some understanding of the internals. It is possible and I can help you with that, but I would recommend using a shared file system such as HDFS if that is possible.

Best, Fabian

2015-09-14 19:16 GMT+02:00 Stefan Bunk <stefan.b...@googlemail.com>:

> Hi,
>
> actually, I am distributing my data before the program starts, without
> using broadcast sets.
>
> However, the approach should still work, under one condition:
>
>> DataSet mapped1 =
>> data.flatMap(yourMap).withBroadcastSet(smallData1,"data").setParallelism(5);
>> DataSet mapped2 =
>> data.flatMap(yourMap).withBroadcastSet(smallData2,"data").setParallelism(5);
>>
> Is it guaranteed, that this selects a disjoint set of nodes, i.e. five
> nodes for mapped1 and five other nodes for mapped2?
>
> Is there any way of selecting the five nodes concretely? Currently, I have
> stored the first half of the data on nodes 1-5 and the second half on nodes
> 6-10. With this approach, I guess, nodes are selected randomly so I would
> have to copy both halves to all of the nodes.
>
> Best,
> Stefan