Hello,

Thank you very much for your response, Anastasios! Today I think I managed it by dropping partitions in runJob or submitJob (I don’t remember exactly which) in the DAGScheduler.
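(For illustration only, not the DAGScheduler change itself: a similar effect can be sketched from the driver side with the partitions argument that SparkContext.runJob already exposes. The partition index 0 and the 500-way split below are placeholders.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("single-partition-job").getOrCreate()
val sc = spark.sparkContext

// 500 partitions are created, but only one of them is ever turned into a task.
val rdd = sc.parallelize(Array.range(0, 1000), numSlices = 500)
val partitionIndex = 0 // placeholder: the single partition to compute

// runJob takes an explicit list of partition ids, so the other 499
// partitions are never scheduled.
val sizes = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size, Seq(partitionIndex))
println(sizes.mkString(", "))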
If it doesn’t work properly after some tests, I will follow your approach.

Thank you,
Thodoris

> On 16 Apr 2018, at 20:11, Anastasios Zouzias <zouz...@gmail.com> wrote:
>
> Hi all,
>
> I think this is doable using the mapPartitionsWithIndex method of RDD.
>
> Example:
>
> val partitionIndex = 0 // Your favorite partition index here
> val rdd = spark.sparkContext.parallelize(Array.range(0, 1000))
> // Replace the elements of partition partitionIndex with [-10, ..., -1]
> val fixed = rdd.mapPartitionsWithIndex { case (idx, iter) =>
>   if (idx == partitionIndex) Array.range(-10, 0).toIterator else iter
> }
>
> Best regards,
> Anastasios
>
>> On Sun, Apr 15, 2018 at 12:59 AM, Thodoris Zois <z...@ics.forth.gr> wrote:
>>
>> I forgot to mention that I would like my approach to be independent of the application that the user is going to submit to Spark.
>>
>> Assume that I don’t know anything about the user’s application… I expected to find a simpler approach. I saw in RDD.scala that an RDD is characterized by a list of partitions. If I modify this list and keep only one partition, is it going to work?
>>
>> - Thodoris
>>
>>> On 15 Apr 2018, at 01:40, Matthias Boehm <mboe...@gmail.com> wrote:
>>>
>>> You might want to have a look at using a PartitionPruningRDD to select a subset of partitions by ID. This approach worked very well for multi-key lookups for us [1].
>>>
>>> A major advantage compared to scan-based operations is that, if your source RDD has an existing partitioner, only the relevant partitions are accessed.
>>>
>>> [1] https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/MatrixIndexingSPInstruction.java#L603
>>>
>>> Regards,
>>> Matthias
>>>
>>> On Sat, Apr 14, 2018 at 3:12 PM, Thodoris Zois <z...@ics.forth.gr> wrote:
>>>>
>>>> Hello list,
>>>>
>>>> I am sorry for sending this message here, but I could not manage to get any response in “users”. For specific purposes I would like to isolate one partition of the RDD and perform computations only on it.
>>>>
>>>> For instance, suppose that a user asks Spark to create 500 partitions for the RDD. I would like Spark to create the partitions but perform computations on only one of those 500, ignoring the other 499.
>>>>
>>>> At first I tried to modify the executor in order to run only one partition (task), but I didn’t manage to make it work. Then I tried the DAGScheduler, but I think I should modify the code at a higher level: let Spark do the partitioning, but in the end keep only one partition and throw away all the others.
>>>>
>>>> My question is: which file should I modify in order to isolate one partition of the RDD? Where is the actual partitioning done?
>>>>
>>>> I hope it is clear!
>>>>
>>>> Thank you very much,
>>>> Thodoris
>
> --
> -- Anastasios Zouzias
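P.S. For completeness, a minimal, untested sketch of the PartitionPruningRDD approach Matthias suggested (the partition index 0 and the 500-way split are placeholders; PartitionPruningRDD is marked as a developer API in Spark):

import org.apache.spark.rdd.PartitionPruningRDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-pruning").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Array.range(0, 1000), numSlices = 500)

// Keep only partition 0; the other 499 partitions are dropped from the
// lineage and are never read or scheduled.
val pruned = PartitionPruningRDD.create(rdd, partitionId => partitionId == 0)

println(pruned.partitions.length) // 1
println(pruned.collect().mkString(", "))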