Re: Scaling spark jobs returning large amount of data

2015-06-04 Thread Richard Marscher
It is possible to start multiple concurrent drivers: Spark dynamically allocates ports per "Spark application" on the driver, master, and workers from a port range. When you collect results back to the driver, they do not go through the master. The master is mostly there as a coordinator between the drivers and the workers.
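For reference, the port behavior described above can be pinned down with Spark's standard configuration properties. This is a sketch of a standalone-mode setup; the specific port numbers are illustrative, not recommendations:

```
# spark-defaults.conf (illustrative values)
spark.driver.port        7078    # fixed port for the driver, instead of a random one
spark.blockManager.port  7079    # base port for block managers on driver and executors
spark.port.maxRetries    16      # how many successive ports Spark tries if one is taken
```

With `spark.port.maxRetries`, a second concurrent driver on the same host will bind to the next free port in the range rather than failing, which is what allows multiple applications to run side by side.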

Re: Scaling spark jobs returning large amount of data

2015-06-04 Thread Igor Berman
Hi, as far as I understand, you shouldn't send the data to the driver. Suppose your input is a file in HDFS/S3 or a Cassandra partition: structure your job so that every Spark executor/worker handles part of the input, transforms and filters it, and at the end writes its part of the result back to Cassandra as output.
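The pattern above, reading, transforming, and writing entirely on the executors with nothing collected to the driver, might look like this. This is a sketch assuming the spark-cassandra-connector is on the classpath; the paths, keyspace, table, and column names are hypothetical:

```scala
// Sketch: executor-side write-back, nothing funneled through the driver.
import org.apache.spark.sql.SparkSession

object EtlJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("etl").getOrCreate()

    // Each executor reads only its own splits of the input.
    val df = spark.read.json("s3a://my-bucket/input/")   // hypothetical path

    // Transform and filter in parallel on the executors.
    val out = df.filter(df("status") === "active")       // hypothetical column
                .select("id", "payload")

    // Each executor writes its own partitions directly to Cassandra;
    // no collect() back to the driver.
    out.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "results")) // hypothetical
      .save()
  }
}
```

The key design point is that the only time data crosses to the driver is if you explicitly call an action like `collect()` or `take()`; write actions such as `save()` run on the executors themselves.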