So, on my example, when I execute: val usersMap = contacts.collectAsMap() --> Map goes to the driver and just lives there in the beginning. contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect
When I execute usersMap(v._1), Does driver has to send to the executorX the "value" which it needs? I guess I'm missing something. How does the data transfer among usersMap(just in the driver) and executors work? On this case it looks like better to use broadcasting like: val usersMap = contacts.collectAsMap() val bc = sc.broadcast(usersMap) contacts.map(v => (v._1, (bc.value(v._1), v._2))).collect() 2015-02-26 11:16 GMT+01:00 Sean Owen <so...@cloudera.com>: > No, it exists only on the driver, not the executors. Executors don't > retain partitions unless they are supposed to be persisted. > > Generally, broadcasting a small Map to accomplish a join 'manually' is > more efficient than a join, but you are right that this is mostly > because joins usually involve shuffles. If not, it's not as clear > which way is best. I suppose that if the Map is large-ish, it's safer > to not keep pulling it to the driver. > > On Thu, Feb 26, 2015 at 10:00 AM, Guillermo Ortiz <konstt2...@gmail.com> > wrote: >> I have a question, >> >> If I execute this code, >> >> val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map( >> v => (v(0), v(1))) >> val contacts = sc.textFile("/tmp/contacts.log").map(y => >> y.split(",")).map( v => (v(0), v(1))) >> val usersMap = contacts.collectAsMap() >> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect() >> >> When I execute collectAsMap, where is data? in each Executor?? I guess >> than each executor has data that it proccesed. The result is sent to >> the driver, but I guess that each executor keeps its piece of >> processed data. >> >> I guess that it's more efficient that to use a join in this case >> because there's not shuffle but If I save usersMap as a broadcast >> variable, wouldn't it be less efficient because I'm sending data to >> executors and don't need it? >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org