I would imagine this would be an extension of SchemaRDD (for Spark SQL) or
a new RDD altogether.
The RDD's location is determined by where the task generating the RDD is
scheduled; the scheduler schedules on the basis of the input RDD/source data
location. So ideally the RDD codebase needs to check the location of inp
Yep, exactly! I'm not sure how complicated it would be to pull off. If someone
wouldn't mind pointing me in the right direction, I would be happy to look
into it and contribute this functionality. I imagine this would be
implemented in the scheduler codebase and there would be some s
This would be really useful, especially for Shark, where a shift in
partitioning affects all subsequent queries unless task scheduling time
beats spark.locality.wait. This can cause low overall performance for all
subsequent tasks.
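
To make the spark.locality.wait point concrete, here is a rough toy model of
the trade-off in plain Python. It is only a sketch and not Spark's actual
scheduler code: the names Task and pick_host are made up for illustration, and
the 3-second default is assumed to match the documented default of
spark.locality.wait. A task runs on its preferred host if that host is free,
otherwise it keeps waiting until the locality wait expires and then falls back
to any free host.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Task:
    """Hypothetical task record, used only for this illustration."""
    preferred_host: str   # host that holds the task's input partition
    submitted_at: float   # submission time, in seconds


def pick_host(task: Task, free_hosts: Set[str], now: float,
              locality_wait: float = 3.0) -> Optional[str]:
    """Return the host to run on, or None to keep waiting for locality."""
    # Best case: the host holding the data has a free slot.
    if task.preferred_host in free_hosts:
        return task.preferred_host
    # Once the wait has expired, give up locality and take any free host.
    if free_hosts and now - task.submitted_at >= locality_wait:
        return sorted(free_hosts)[0]
    # Otherwise keep waiting for the preferred host.
    return None
```

In this model, once a task falls back to a non-preferred host its data has to
be read remotely, which is the slowdown described above when the partitioning
has shifted away from where the data lives.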
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur