Re: map vs foreach for sending data to external system

2015-07-02 Thread Alexandre Rodrigues
What I'm doing in the RDD is parsing a text file and sending things to the external system.. I guess that it does that immediately when the action (count) is triggered instead of being a two step process. So I guess I should have parsing logic + sending to external system inside the foreach (with

Re: map vs foreach for sending data to external system

2015-07-02 Thread Eugen Cepoi
Heh, an actions or materializaiton, means that it will trigger the computation over the RDD. A transformation like map, means that it will create the transformation chain that must be applied on the data, but it is actually not executed. It is executed only when an action is triggered over that RDD

Re: map vs foreach for sending data to external system

2015-07-02 Thread Alexandre Rodrigues
Foreach is listed as an action[1]. I guess an *action* just means that it forces materialization of the RDD. I just noticed much faster executions with map although I don't like the map approach. I'll look at it with new eyes if foreach is the way to go. [1] – https://spark.apache.org/docs/latest

Re: map vs foreach for sending data to external system

2015-07-02 Thread Silvio Fiorito
foreach absolutely runs on the executors. For sending data to an external system you should likely use foreachPartition in order to batch the output. Also if you want to limit the parallelism of the output action then you can use coalesce. What makes you think foreach is running on the driver?

Re: map vs foreach for sending data to external system

2015-07-02 Thread Eugen Cepoi
*"The thing is that foreach forces materialization of the RDD and it seems to be executed on the driver program"* What makes you think that? No, foreach is run in the executors (distributed) and not in the driver. 2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues < alex.jose.rodrig...@gmail.com>: >