Why would you want to use spark to sequentially process your entire data
set?  The entire purpose is to let you do distributed processing -- which
means letting partitions get processed simultaneously by different cores /
nodes.

That being said, occasionally in a bigger pipeline with a lot of
distributed operations, you might need to do one segment in a completely
sequential manner.  You have a few options -- just be aware that with all
of them, you are working *around* the idea of an RDD, so make sure you have
a really good reason.

1) rdd.toLocalIterator.  Still pulls all of the data to the driver, just
like rdd.collect(), but it's slightly more scalable since it won't store
*all* of the data in memory on the driver (it does still hold one entire
partition's worth of data in memory at a time, though).
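
For example, a minimal sketch applied to the already-sorted rdd from the
quoted code below (reusing that rdd name):

rdd.toLocalIterator.foreach(println)

Because sortBy range-partitions the data, the partitions come back in
order and each one is internally sorted, so this prints the numbers in
globally sorted order while only ever holding one partition on the driver.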

2) write the rdd to some external data storage (e.g. hdfs), and then read
the data back sequentially off of hdfs on your driver.  This still needs to
pull all of the data to the driver eventually, but you can avoid ever
pulling an entire partition into memory and make it streaming.

3) create a number of rdds that consist of just one partition of your
original rdd, and then execute actions on them sequentially:

import org.apache.spark.rdd.PartitionPruningRDD

val originalRDD = ... // this should be cached to make sure you don't recompute it
(0 until originalRDD.partitions.size).foreach { partitionIdx =>
  val prunedRDD = new PartitionPruningRDD(originalRDD, x => x == partitionIdx)
  prunedRDD.runSomeActionHere()
}

Note that PartitionPruningRDD is a developer api, however.  This will run
your action on one partition at a time, and ideally the tasks will be
scheduled on the same node where the partitions have been cached, so you
don't need to move the data around.  But again, because you're turning it
into a sequential program, most of your cluster is sitting idle, and you're
not really leveraging spark ...
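
Applied to the quoted example below, a sketch would look something like
this (names taken from that code; collect() is used here only so the
output shows up on the driver's console, which gives up some of the
data-locality benefit):

import org.apache.spark.rdd.PartitionPruningRDD

val sorted = rdd.cache()  // rdd is already sorted via sortBy in the quoted code
(0 until sorted.partitions.size).foreach { partitionIdx =>
  PartitionPruningRDD.create(sorted, _ == partitionIdx)
    .collect()
    .foreach(println)
}

Since sortBy range-partitions the data, walking the partitions in index
order prints the numbers in globally sorted order.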


imran

On Fri, Feb 27, 2015 at 1:38 AM, Wush Wu <w...@bridgewell.com> wrote:

> Dear all,
>
> I want to implement some sequential algorithm on RDD.
>
> For example:
>
> val conf = new SparkConf()
>   conf.setMaster("local[2]").
>   setAppName("SequentialSuite")
> val sc = new SparkContext(conf)
> val rdd = sc.
>    parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).
>    sortBy(x => x, true)
> rdd.foreach(println)
>
> I want to see the ordered number on my screen, but it shows unordered
> integers. The two partitions execute the println simultaneously.
>
> How do I make the RDD execute a function globally sequential?
>
> Best,
> Wush
>
