rdd.collect() brings all of the data back to the driver, which should be avoided as much as possible, so it is best to do the inserts from the executors. However, connection objects cannot be serialised and shipped over the network, so you have to instantiate the connection on the executors. That should be done once per partition (for example inside foreachPartition) rather than once per row.

On 26 Jun 2015 09:05, "Bill Milan" <bill.milan2...@gmail.com> wrote:
> Hi all,
>
> I am running a program which connects to Amazon RDS and generates some data
> from S3 into an RDD. When I run rdd.collect and insert the results into RDS
> using JDBC, I get "communication link failure". I tried to insert results
> into RDS using both python and the mysql client on the master machine and
> everything went well. However, when I used Spark, the insertion was not
> successful. My questions are:
>
> 1) When I establish a connection with RDS before the RDD is generated, is
> this done in master?
>
> 2) When I call rdd.collect, is the returned array in master or slave
> nodes?
>
> 3) When I insert the results of rdd.collect, where does the insertion
> happen?
>
> Thanks!
>
> Bill
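
To make the once-per-partition advice at the top of this reply concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: sqlite3 stands in for the RDS/MySQL connection so that it runs without any external service, and DB_PATH, the results table and insert_partition are hypothetical names, not anything from your program. The point is only where the connection gets opened.

```python
import os
import sqlite3
import tempfile

# Hypothetical target database. sqlite3 stands in for the RDS/MySQL
# connection so the sketch runs without external services.
DB_PATH = os.path.join(tempfile.gettempdir(), "rdd_insert_sketch.db")


def insert_partition(rows):
    """Write one partition's rows using a single connection.

    Meant to be passed to rdd.foreachPartition, so it runs on an
    executor: the connection is created there, once per partition,
    and is never serialised from the driver.
    """
    conn = sqlite3.connect(DB_PATH)  # opened inside the function, per partition
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS results (k TEXT, v INTEGER)")
        # rows is an iterator over the partition; executemany consumes it
        conn.executemany("INSERT INTO results (k, v) VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()


# With a live SparkContext and an RDD of (k, v) pairs, the call would be:
#   rdd.foreachPartition(insert_partition)
```

With a real database you would replace the sqlite3.connect call with your MySQL or JDBC client's connect call and keep everything else inside the function, so that nothing connection-related is created on the driver.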