On Thu, Mar 12, 2015 at 1:45 AM, <raghav0110...@gmail.com> wrote:

> In your response you say "When you call reduce and *similar* methods, each
> partition can be reduced in parallel. Then the results of that can be
> transferred across the network and reduced to the final result". By similar
> methods do you mean all actions within Spark? Does transfer of data from
> worker nodes to driver nodes happen only when an action is performed?
There is reduceByKey, along with some other methods in more generalized forms. Other transformations and actions may work somewhat differently, but they will generally be parallelized as much as possible. If you look at the UI while the job is running, you will see some number of tasks; each task corresponds to a single partition. Not all actions cause data to be transferred from the worker nodes to the driver (of which there is only one) - saveAsTextFile, for example. In any case, no processing is done and no data is transferred anywhere until an action is invoked, since transformations are lazy.

> I am assuming that in Spark, you typically have a set of transformations
> followed by some sort of action. The RDD is partitioned and sent to
> different worker nodes (assuming this is a cluster setup), the
> transformations are applied to the RDD partitions at the various worker
> nodes, and then when an action is performed, you perform the action on the
> worker nodes, aggregate the partial results at the driver, and then perform
> another reduction at the driver to obtain the overall results. I would also
> assume that deciding whether the action should be done on a worker node
> depends on the type of action. For example, performing reduce at the worker
> node makes sense, while it doesn't make sense to save the file at the
> worker node. Does that sound correct, or am I misinterpreting something?

On the contrary, saving files would typically be done at the worker nodes. If you are handling anything approaching "Big Data" you will be keeping it in a distributed store (typically HDFS, though it depends on your use case), and each worker will write into this store. For example, if you call saveAsTextFile you will see multiple "part-*" files in the output directory, one from each partition. (Don't worry about combining them for future processing; sc.textFile can read them all back in as a single RDD, with the data appropriately partitioned.)
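To make the two-step reduce described above concrete - each partition reduced in parallel, then the partial results combined - here is a plain-Python simulation. It is a conceptual sketch only: the partition layout and the use of addition are my own illustrative choices, not Spark code.

```python
from functools import reduce

# Hypothetical data already split into partitions, as Spark would
# distribute an RDD across workers.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Step 1: each partition is reduced locally. In Spark this is one task
# per partition, and the tasks run in parallel on the workers.
partial_results = [reduce(lambda a, b: a + b, p) for p in partitions]

# Step 2: the small partial results are combined into the final answer.
# In Spark, this is the part that involves transfer to the driver.
total = reduce(lambda a, b: a + b, partial_results)
print(total)  # 45
```

The point of the two steps is that only the tiny per-partition results (here three numbers) cross the network, not the full data.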
Some actions do pull data back to the driver - collect, for example. When using such methods you need to be sure the amount of data won't be too large for your driver. In general you should avoid pulling data back to the driver, or doing any processing on the driver, as that will not scale. In some cases it can be useful, though; for example, if you wanted to join a large data set with a small one, it might perform better to collect the small data set and broadcast it.

Take this for example:

  sc.textFile(inputPath).map(...).filter(...).saveAsTextFile(outputPath)

None of that will execute on the driver. The map and filter will be collapsed into a single stage (which you will see in the UI). Each worker will prefer to read its local data (though data may be transferred if there is none locally to work on), transform the data, and write it out.

If instead you had this:

  sc.textFile(inputPath).map(...).reduceByKey(...).saveAsTextFile(outputPath)

you would have two stages, since reduceByKey causes a shuffle - data will be transferred across the network and formed into a new set of partitions. But after the reduce, each worker will still save the data it is responsible for. (You can provide a custom partitioner, but don't do that unless you feel you have a good reason to.)

Hope that helps.
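P.S. In case it helps to see the shuffle concretely, here is a rough plain-Python simulation of what reduceByKey does conceptually. The data and the regrouping are illustrative only - real Spark also combines values on the map side before shuffling, and partitions keys by hash rather than collecting them in one place.

```python
from collections import defaultdict

# Key-value pairs spread across two partitions, as they might look
# after a map step. The keys and values here are made up.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]

# The "shuffle": records are regrouped so that all values for a given
# key end up together in a new partition. In Spark this is the step
# where data crosses the network.
shuffled = defaultdict(list)
for part in partitions:
    for key, value in part:
        shuffled[key].append(value)

# Each worker then reduces the values for the keys it now owns, and
# saves its own share of the output.
reduced = {key: sum(values) for key, values in shuffled.items()}
print(reduced)  # {'a': 4, 'b': 2, 'c': 4}
```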