You might want to have a look at using a PartitionPruningRDD to select
a subset of partitions by ID. This approach worked very well for
multi-key lookups for us [1].

A major advantage over scan-based operations such as filter() is that,
if your source RDD has an existing partitioner, only the relevant
partitions are accessed.

[1] 
https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/MatrixIndexingSPInstruction.java#L603
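
As a minimal sketch of the idea in Scala (the 500-partition setup,
`lookupKey`, and the other names below are just illustrative
assumptions, not taken from your job):

  import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
  import org.apache.spark.rdd.PartitionPruningRDD

  val sc = new SparkContext(
    new SparkConf().setAppName("pruned-lookup").setMaster("local[*]"))

  // Source RDD with a known partitioner (500 partitions).
  val rdd = sc.parallelize(0 until 1000000)
    .map(i => (i, i.toString))
    .partitionBy(new HashPartitioner(500))
    .cache()

  // Determine which partition can contain the key, then prune to it.
  val lookupKey = 42
  val target = rdd.partitioner.get.getPartition(lookupKey)
  val pruned = PartitionPruningRDD.create(rdd, pid => pid == target)

  // Only one task is launched; the other 499 partitions are never read.
  val result = pruned.filter { case (k, _) => k == lookupKey }.collect()

Note that PartitionPruningRDD is a public (DeveloperApi) class in
org.apache.spark.rdd, so you can use it from user code without
modifying Spark itself.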

Regards,
Matthias

On Sat, Apr 14, 2018 at 3:12 PM, Thodoris Zois <z...@ics.forth.gr> wrote:
> Hello list,
>
> I am sorry for sending this message here, but I could not manage to get any
> response in “users”. For specific purposes, I would like to isolate 1
> partition of the RDD and perform computations only on that partition.
>
> For instance, suppose that a user asks Spark to create 500 partitions for the
> RDD. I would like Spark to create the partitions but perform computations on
> only one of those 500 partitions, ignoring the other 499.
>
> At first I tried to modify the executor so that it runs only 1 partition
> (task), but I didn’t manage to make it work. Then I tried the DAG scheduler,
> but I think I should modify the code at a higher level: let Spark do the
> partitioning, but in the end keep only one partition and throw away all the
> others.
>
> My question is: which file should I modify in order to isolate 1 partition
> of the RDD? Where is the actual partitioning done?
>
> I hope it is clear!
>
> Thank you very much,
> Thodoris

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
