Hello,

Thank you very much for your response, Anastasios! Today I think I managed it by dropping partitions in runJob or submitJob (I don’t remember exactly which) in the DAGScheduler.
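(For illustration only, not the DAGScheduler change itself: a similar effect can be sketched from the driver side with the partitions argument that SparkContext.runJob already exposes. The partition index 0 and the 500-way split below are placeholders.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("single-partition-job").getOrCreate()
val sc = spark.sparkContext

// 500 partitions are created, but only one of them is ever turned into a task.
val rdd = sc.parallelize(Array.range(0, 1000), numSlices = 500)
val partitionIndex = 0 // placeholder: the single partition to compute

// runJob takes an explicit list of partition ids, so the other 499
// partitions are never scheduled.
val sizes = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size, Seq(partitionIndex))
println(sizes.mkString(", "))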
If it doesn’t work properly after some tests, I will follow your approach.

Thank you,
Thodoris

> On 16 Apr 2018, at 20:11, Anastasios Zouzias <zouz...@gmail.com> wrote:
>
> Hi all,
>
> I think this is doable using the mapPartitionsWithIndex method of RDD.
>
> Example:
>
> val partitionIndex = 0 // Your favorite partition index here
> val rdd = spark.sparkContext.parallelize(Array.range(0, 1000))
> // Replace the elements of partition partitionIndex with [-10, ..., -1]
> val fixed = rdd.mapPartitionsWithIndex { case (idx, iter) =>
>   if (idx == partitionIndex) Array.range(-10, 0).toIterator else iter
> }
>
> Best regards,
> Anastasios
>
>> On Sun, Apr 15, 2018 at 12:59 AM, Thodoris Zois <z...@ics.forth.gr> wrote:
>>
>> I forgot to mention that I would like my approach to be independent of the application that the user is going to submit to Spark.
>>
>> Assume that I don’t know anything about the user’s application… I expected to find a simpler approach. I saw in RDD.scala that an RDD is characterized by a list of partitions. If I modify this list and keep only one partition, is it going to work?
>>
>> - Thodoris
>>
>>> On 15 Apr 2018, at 01:40, Matthias Boehm <mboe...@gmail.com> wrote:
>>>
>>> You might want to have a look at using a PartitionPruningRDD to select a subset of partitions by ID. This approach worked very well for multi-key lookups for us [1].
>>>
>>> A major advantage compared to scan-based operations is that, if your source RDD has an existing partitioner, only the relevant partitions are accessed.
>>>
>>> [1] https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/MatrixIndexingSPInstruction.java#L603
>>>
>>> Regards,
>>> Matthias
>>>
>>> On Sat, Apr 14, 2018 at 3:12 PM, Thodoris Zois <z...@ics.forth.gr> wrote:
>>>>
>>>> Hello list,
>>>>
>>>> I am sorry for sending this message here, but I could not manage to get any response in “users”. For specific purposes I would like to isolate one partition of the RDD and perform computations only on it.
>>>>
>>>> For instance, suppose that a user asks Spark to create 500 partitions for the RDD. I would like Spark to create the partitions but perform computations on only one of those 500, ignoring the other 499.
>>>>
>>>> At first I tried to modify the executor in order to run only one partition (task), but I didn’t manage to make it work. Then I tried the DAGScheduler, but I think I should modify the code at a higher level: let Spark do the partitioning, but in the end keep only one partition and throw away all the others.
>>>>
>>>> My question is: which file should I modify in order to isolate one partition of the RDD? Where is the actual partitioning done?
>>>>
>>>> I hope it is clear!
>>>>
>>>> Thank you very much,
>>>> Thodoris
>
> --
> -- Anastasios Zouzias
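P.S. For completeness, a minimal, untested sketch of the PartitionPruningRDD approach Matthias suggested (the partition index 0 and the 500-way split are placeholders; PartitionPruningRDD is marked as a developer API in Spark):

import org.apache.spark.rdd.PartitionPruningRDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-pruning").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Array.range(0, 1000), numSlices = 500)

// Keep only partition 0; the other 499 partitions are dropped from the
// lineage and are never read or scheduled.
val pruned = PartitionPruningRDD.create(rdd, partitionId => partitionId == 0)

println(pruned.partitions.length) // 1
println(pruned.collect().mkString(", "))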