Re: Getting number of physical machines in Spark

2015-08-28 Thread Alexey Grishchenko
As far as I understand, there is no canonical way to do this. For instance, when running under YARN, you have no idea where your containers will be started. Moreover, if one of the containers fails, it might be restarted on another machine, so the number of machines might change at runtime. To c
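
For a rough point-in-time count, one common workaround is to look at the block manager addresses the driver currently knows about. The sketch below is only an illustration, not a canonical method: `HostCount` and `distinctHosts` are hypothetical names, and the sample addresses are made up. It assumes the keys of `sc.getExecutorMemoryStatus` have the form `host:port`. As noted above, the result is only a snapshot and can change when containers are restarted elsewhere.

```scala
// Hypothetical helper (not part of Spark): collapse "host:port" block manager
// addresses into the set of distinct host names.
object HostCount {
  def distinctHosts(blockManagers: Iterable[String]): Set[String] =
    blockManagers.map { addr =>
      // Split on the last ':' so that only the port is stripped.
      val i = addr.lastIndexOf(':')
      if (i >= 0) addr.substring(0, i) else addr
    }.toSet

  def main(args: Array[String]): Unit = {
    // In a Spark application the input would come from
    //   sc.getExecutorMemoryStatus.keys
    // Note that this map also contains the driver's own block manager.
    val sample = Seq("node1:40001", "node1:40002", "node2:40001") // made-up addresses
    println(distinctHosts(sample).size) // prints 2
  }
}
```

Because the driver is included in that map, a count taken this way may be one higher than the number of worker machines when the driver runs on a separate host.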

Re: Getting number of physical machines in Spark

2015-08-28 Thread Jason
I've wanted similar functionality too: when network IO bound (for me, I was trying to pull things from S3 to HDFS), I wish there was a `.mapMachines` API where I wouldn't have to try to guess at the proper partitioning of a 'driver' RDD for `sc.parallelize(1 to N, N).map( i=> pull the i'th chunk from S3