So, in my example, when I execute:
val usersMap = contacts.collectAsMap() // the Map goes to the driver and initially lives only there
contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()

When I execute usersMap(v._1),
does the driver have to send executorX the "value" it needs? I
guess I'm missing something.
How does the data transfer between usersMap (which lives only in the driver) and
the executors work?

In this case it looks better to use broadcasting, like:
val usersMap = contacts.collectAsMap()
val bc = sc.broadcast(usersMap)
contacts.map(v => (v._1, (bc.value(v._1), v._2))).collect()
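
To make the two variants concrete, here is a minimal sketch that runs both side by side. It rests on assumptions that are not in the original code: a local master ("local[2]"), a small in-memory dataset standing in for /tmp/contacts.log, and the object name BroadcastLookup, all of which are only for illustration. As far as I understand it, in the first variant usersMap is serialized as part of the task closure and sent to the executors that run the tasks, so executors do not call back to the driver for each lookup. In the second variant the map is distributed to each executor and cached there, so it can be reused across tasks and stages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, for illustration only.
    val conf = new SparkConf().setAppName("broadcast-lookup").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical in-memory data standing in for /tmp/contacts.log.
      val contacts = sc.parallelize(Seq(("u1", "c1"), ("u2", "c2"), ("u3", "c3")))

      // collectAsMap() pulls every pair to the driver; the result is a local Map.
      val usersMap = contacts.collectAsMap()

      // Variant 1: usersMap is captured by the closure and serialized with it.
      val viaClosure = contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()

      // Variant 2: usersMap is wrapped in a broadcast variable and read via .value.
      val bc = sc.broadcast(usersMap)
      val viaBroadcast = contacts.map(v => (v._1, (bc.value(v._1), v._2))).collect()

      // Both variants should yield the same pairs in the same order.
      assert(viaClosure.sameElements(viaBroadcast))
      viaBroadcast.foreach(println)

      // Release the cached copies on the executors once they are no longer needed.
      bc.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Both variants give the same result. The difference is only in how the map reaches the executors, which should matter more as the map grows or is reused in several stages.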

2015-02-26 11:16 GMT+01:00 Sean Owen <so...@cloudera.com>:
> No, it exists only on the driver, not the executors. Executors don't
> retain partitions unless they are supposed to be persisted.
>
> Generally, broadcasting a small Map to accomplish a join 'manually' is
> more efficient than a join, but you are right that this is mostly
> because joins usually involve shuffles. If not, it's not as clear
> which way is best. I suppose that if the Map is large-ish, it's safer
> to not keep pulling it to the driver.
>
> On Thu, Feb 26, 2015 at 10:00 AM, Guillermo Ortiz <konstt2...@gmail.com> 
> wrote:
>> I have a question,
>>
>> If I execute this code,
>>
>> val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map(
>> v => (v(0), v(1)))
>> val contacts = sc.textFile("/tmp/contacts.log").map(y =>
>> y.split(",")).map( v => (v(0), v(1)))
>> val usersMap = contacts.collectAsMap()
>> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()
>>
>> When I execute collectAsMap, where is the data? In each executor? I guess
>> that each executor has the data that it processed. The result is sent to
>> the driver, but I guess that each executor keeps its piece of
>> processed data.
>>
>> I guess that it's more efficient than using a join in this case
>> because there's no shuffle, but if I save usersMap as a broadcast
>> variable, wouldn't it be less efficient because I'm sending data to
>> executors that don't need it?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
