Re: Scaling spark jobs returning large amount of data

2015-06-04 Thread Richard Marscher
It is possible to start multiple concurrent drivers. Spark dynamically allocates ports per "spark application" on the driver, master, and workers from a port range. When you collect results back to the driver, they do not go through the master; the master is mostly there as a coordinator between the driver and the workers.
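As a sketch of the per-application port allocation mentioned above: the property names below are real Spark settings, but the host, app name, and port values are illustrative, and the snippet assumes spark-core on the classpath and a running standalone master, so it is not directly runnable here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical configuration for one of several concurrent drivers.
val conf = new SparkConf()
  .setAppName("concurrent-driver-1")
  .setMaster("spark://master-host:7077")
  .set("spark.driver.port", "40000")        // pin the driver port (default 0 = random)
  .set("spark.blockManager.port", "40010")  // port used for block transfers between nodes
  .set("spark.port.maxRetries", "32")       // retry on the next port if taken, giving each app a range

val sc = new SparkContext(conf)

// collect()/sum() results flow from the executors straight back to this
// driver process, not through the master.
val total = sc.parallelize(1 to 1000).sum()

sc.stop()
```

Each concurrent driver would use its own base ports (or simply leave the defaults random and rely on `spark.port.maxRetries`).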

Re: Scaling spark jobs returning large amount of data

2015-06-04 Thread Igor Berman
Hi, as far as I understand, you shouldn't send data to the driver. Suppose your input is a file in HDFS/S3 or a Cassandra partition: you should structure your job so that every executor/worker of Spark handles part of your input, transforms and filters it, and at the end writes back to Cassandra as output (once again, each executor writing its own part).
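A minimal sketch of that pattern, writing from the executors instead of collecting to the driver. The input path is an example, and `saveRow` is a hypothetical stand-in for whatever sink client you use (e.g. a session from the spark-cassandra-connector); the snippet assumes spark-core on the classpath, so it is not directly runnable here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sink: replace with a real Cassandra/S3 write.
def saveRow(row: String): Unit = ???

val sc = new SparkContext(new SparkConf().setAppName("partition-wise-output"))

sc.textFile("hdfs:///input/path")   // each executor reads its own split
  .filter(_.nonEmpty)               // transform/filter on the executors
  .map(_.toUpperCase)
  .foreachPartition { rows =>       // runs on the executors: write directly
    rows.foreach(saveRow)           // to the sink, never collecting to the driver
  }

sc.stop()
```

With the spark-cassandra-connector you would typically use its `saveToCassandra`-style API instead of a hand-rolled `foreachPartition`, but the data flow is the same: output leaves each executor directly.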

Scaling spark jobs returning large amount of data

2015-06-04 Thread Giuseppe Sarno
Hello, I am relatively new to Spark and I am currently trying to understand how to scale large numbers of jobs with it. I understand that the Spark architecture is split into "Driver", "Master" and "Workers". The Master has a standby node in case of failure, and Workers can scale out. All the examples I…