Re: balancing RDDs

2014-06-26 Thread Mayur Rustagi
I would imagine this would be an extension of SchemaRDD (for Spark SQL) or a new RDD altogether. An RDD's location is determined by where the task generating it is scheduled, and the scheduler places tasks on the basis of the input RDD/source data location. So ideally the RDD codebase needs to check the location of inp

Re: balancing RDDs

2014-06-25 Thread Sean McNamara
Yep, exactly! I’m not sure how complicated it would be to pull off. If someone wouldn’t mind pointing me in the right direction, I would be happy to look into this and contribute the functionality. I imagine this would be implemented in the scheduler codebase and there would be some s

Re: balancing RDDs

2014-06-24 Thread Mayur Rustagi
This would be really useful, especially for Shark, where a shift in partitioning affects all subsequent queries unless task scheduling time beats spark.locality.wait. This can cause low overall performance for all subsequent tasks.
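
To make the spark.locality.wait point concrete, here is a toy sketch of the delay-scheduling idea: a task prefers a host that already holds its input, and only falls back to any free host once it has waited at least as long as the locality wait. This is not Spark's scheduler code; `pick_host`, its parameters, and the host names are hypothetical, and the 3000 ms default is assumed to match the spark.locality.wait default.

```python
def pick_host(preferred_hosts, free_hosts, waited_ms, locality_wait_ms=3000):
    """Toy model of delay scheduling (hypothetical helper, not a Spark API).

    preferred_hosts:  hosts that already hold the task's input partition
    free_hosts:       hosts with a free slot right now, in offer order
    waited_ms:        how long the task has been waiting for a local slot
    locality_wait_ms: assumed stand-in for spark.locality.wait

    Returns the host to launch on, or None to keep waiting.
    """
    # A free host that holds the data always wins.
    for host in free_hosts:
        if host in preferred_hosts:
            return host
    # No local slot: give up on locality only after the wait has expired.
    if free_hosts and waited_ms >= locality_wait_ms:
        return free_hosts[0]
    return None


if __name__ == "__main__":
    cached_on = {"node-a"}  # where the partition lives
    print(pick_host(cached_on, ["node-b", "node-a"], waited_ms=0))
    print(pick_host(cached_on, ["node-b"], waited_ms=1000))
    print(pick_host(cached_on, ["node-b"], waited_ms=3000))
```

In this model, once a task falls back to a non-local host, the data it produces lives on that host, which is one way a partitioning shift could carry over to later queries as described above.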