Hi all,

I think this is doable using the mapPartitionsWithIndex method of RDD. Example:

  val partitionIndex = 0 // your favorite partition index here
  val rdd = spark.sparkContext.parallelize(Array.range(0, 1000))

  // Replace the elements of partition partitionIndex with [-10, ..., -1]
  val fixed = rdd.mapPartitionsWithIndex { case (idx, iter) =>
    if (idx == partitionIndex) Array.range(-10, 0).toIterator else iter
  }
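If the goal is to run the computation on one partition only and ignore all the others, a minimal variant of the above (untested sketch, reusing rdd and partitionIndex from the example) is to return an empty iterator for every other partition:

  // Keep only the chosen partition; every other partition yields no elements.
  // Note: Spark still launches a (trivial) task for each empty partition.
  val onlyOne = rdd.mapPartitionsWithIndex { case (idx, iter) =>
    if (idx == partitionIndex) iter else Iterator.empty
  }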
Best regards,
Anastasios

On Sun, Apr 15, 2018 at 12:59 AM, Thodoris Zois <z...@ics.forth.gr> wrote:

> I forgot to mention that I would like my approach to be independent from
> the application that the user is going to submit to Spark.
>
> Assume that I don’t know anything about the user’s application… I expected
> to find a simpler approach. I saw in RDD.scala that an RDD is characterized
> by a list of partitions. If I modify this list and keep only one partition,
> is it going to work?
>
> - Thodoris
>
>
> > On 15 Apr 2018, at 01:40, Matthias Boehm <mboe...@gmail.com> wrote:
> >
> > you might wanna have a look into using a PartitionPruningRDD to select
> > a subset of partitions by ID. This approach worked very well for
> > multi-key lookups for us [1] (see the sketch at the end of this mail).
> >
> > A major advantage compared to scan-based operations is that, if your
> > source RDD has an existing partitioner, only the relevant partitions are
> > accessed.
> >
> > [1] https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/MatrixIndexingSPInstruction.java#L603
> >
> > Regards,
> > Matthias
> >
> > On Sat, Apr 14, 2018 at 3:12 PM, Thodoris Zois <z...@ics.forth.gr> wrote:
> >> Hello list,
> >>
> >> I am sorry for sending this message here, but I could not manage to get
> >> any response in “users”. For specific purposes I would like to isolate 1
> >> partition of the RDD and perform computations only on it.
> >>
> >> For instance, suppose that a user asks Spark to create 500 partitions
> >> for the RDD. I would like Spark to create the partitions but perform
> >> computations only in one partition from those 500, ignoring the other 499.
> >>
> >> At first I tried to modify the executor in order to run only 1 partition
> >> (task) but I didn’t manage to make it work. Then I tried the DAG
> >> Scheduler, but I think that I should modify the code at a higher level,
> >> let Spark do the partitioning, and at the end see only one partition and
> >> throw away all the others.
> >>
> >> My question is: which file should I modify in order to achieve isolating
> >> 1 partition of the RDD? Where is the actual partitioning made?
> >>
> >> I hope it is clear!
> >>
> >> Thank you very much,
> >> Thodoris

--
Anastasios Zouzias <a...@zurich.ibm.com>
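For reference, here is a rough sketch of the PartitionPruningRDD approach that Matthias mentions above (untested; PartitionPruningRDD is a developer API in org.apache.spark.rdd, and rdd/partitionIndex are the values from the example at the top):

  import org.apache.spark.rdd.PartitionPruningRDD

  // Build an RDD that exposes only the chosen parent partition;
  // actions on it schedule a task for that partition alone.
  val pruned = PartitionPruningRDD.create(rdd, idx => idx == partitionIndex)
  pruned.collect() // touches only partition partitionIndex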