You can read the same partition from every hour's output, union those RDDs, and
then repartition the result into a single partition. Repeat this for each
partition, one at a time. It will not necessarily improve performance; that
depends on how much the job spilled when all the data was processed together.
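
A minimal sketch of the approach above, in PySpark. It assumes each hourly job
wrote text output with the same partitioner and the same number of partitions,
so that file part-0000N holds the same keys in every hour. The directory
layout, the helper names partition_paths and merge_partition, and the use of
textFile are illustrative assumptions, not details from this thread.
coalesce(1) is used here to collapse the union without a shuffle.

```python
def partition_paths(hourly_dirs, partition_index):
    """Return the path of one partition's file in each hourly output dir.

    Assumes the default old-API Hadoop naming (part-00000, part-00001, ...).
    """
    return ["%s/part-%05d" % (d.rstrip("/"), partition_index)
            for d in hourly_dirs]


def merge_partition(sc, hourly_dirs, partition_index):
    """Union one partition from every hour and collapse it to one partition.

    `sc` is an existing SparkContext. textFile accepts a comma-separated
    list of paths, so the union happens while reading.
    """
    paths = partition_paths(hourly_dirs, partition_index)
    return sc.textFile(",".join(paths)).coalesce(1)


# Hypothetical usage: process the partitions one by one.
# hourly_dirs = ["/data/2015-04-01/%02d" % h for h in range(24)]
# for i in range(num_partitions):
#     merged = merge_partition(sc, hourly_dirs, i)
#     ...  # per-partition daily processing
```

Whether this beats a single job with a shuffle would need to be measured on
the actual data.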
Regards,
Ajay
 


On Friday, April 3, 2015 2:01 AM, Sandy Ryza <[email protected]> wrote:
   

 This isn't currently a capability that Spark has, though it has definitely 
been discussed: https://issues.apache.org/jira/browse/SPARK-1061.  The primary 
obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that 
each file corresponds to a single split, so the records corresponding to a 
particular partition at the end of the first job can end up split across 
multiple partitions in the second job.
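
To illustrate the obstacle, here is a deliberately simplified model of how a
splittable file maps to input splits. It ignores details of the real
FileInputFormat logic (such as its tolerance for a slightly oversized last
split), and the function name and sizes are hypothetical.

```python
MB = 1024 * 1024


def estimate_splits(file_size_bytes, split_size_bytes):
    """Simplified model: a splittable file yields ceil(size / split_size)
    input splits, and at least one."""
    # Integer ceiling division, avoiding floating point.
    return max(1, -(-file_size_bytes // split_size_bytes))


# A 300 MB partition file read with a 128 MB split size becomes several
# input partitions in the next job, so that job cannot assume the file's
# records stay together in one partition.
print(estimate_splits(300 * MB, 128 * MB))
```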
-Sandy
On Wed, Apr 1, 2015 at 9:09 PM, kjsingh <[email protected]> wrote:

Hi,

We are running an hourly job using Spark 1.2 on YARN. It saves an RDD of
Tuple2. At the end of the day, a daily job is launched, which works on the
outputs of the hourly jobs.

For data locality and speed, we would like the daily job to find all
instances of a given key on a single executor rather than fetching them from
other executors during the shuffle.

Is it possible to maintain key partitioning across jobs? We can control
partitioning within one job. But how do we send keys to executors on the same
node manager across jobs? And when saving data to HDFS, are the blocks
allocated on the same data node machine as the executor for a partition?









  
