Re: Data locality across jobs

2015-04-03 Thread Ajay Srivastava
You can read the same partition from every hour's output, union those RDDs, and then repartition the union into a single partition. Doing this for each partition in turn rebuilds the combined dataset one partition at a time. It may not necessarily improve performance; that will depend on how much the job spilled when all the data was processed together.
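A minimal sketch of this approach, assuming each hourly job wrote its output under a hypothetical layout like /data/hour=HH/ with one part file per partition (part-00000, part-00001, ...), and assuming 24 hours and 8 partitions per hourly output:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MergeHourlyPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MergeHourlyPartitions"))

    // Assumed layout: each hourly job wrote its output under /data/hour=HH/
    // with one part file per partition (part-00000, part-00001, ...).
    val hours = (0 until 24).map(h => f"/data/hour=$h%02d")
    val numPartitions = 8 // assumed number of partitions per hourly output

    for (p <- 0 until numPartitions) {
      // Read the same part file from every hour's output...
      val slices = hours.map(dir => sc.textFile(f"$dir/part-$p%05d"))
      // ...union them and collapse the result into a single partition.
      val merged = sc.union(slices).coalesce(1)
      merged.saveAsTextFile(f"/data/merged/partition=$p%05d")
    }

    sc.stop()
  }
}
```

Note the tradeoff: each iteration of the loop launches its own job over a small slice of the data, so you avoid one giant shuffle at the cost of many small sequential jobs.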

Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records written out by one partition aren't guaranteed to come back as a single partition when the data is read in again.
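A small sketch illustrating the limitation SPARK-1061 describes: partitioner information is lost across a save/load boundary, so a subsequent job cannot take advantage of the existing layout. Paths and data here are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerNotPreserved {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionerNotPreserved"))

    // Job 1: hash-partition some pairs and save them to disk.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
      .partitionBy(new HashPartitioner(4))
    println(pairs.partitioner) // Some(HashPartitioner)
    pairs.saveAsObjectFile("/tmp/pairs")

    // Job 2: read the data back. The on-disk layout still reflects the
    // hash partitioning, but Spark has no way to know that, so the
    // loaded RDD carries no partitioner and a join or groupByKey on it
    // will shuffle everything again.
    val reloaded = sc.objectFile[(String, Int)]("/tmp/pairs")
    println(reloaded.partitioner) // None

    sc.stop()
  }
}
```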