On Thu, Mar 12, 2015 at 1:45 AM, <raghav0110...@gmail.com> wrote:

> In your response you say "When you call reduce and *similar* methods, each
> partition can be reduced in parallel. Then the results of that can be
> transferred across the network and reduced to the final result". By similar
> methods do you mean all actions within Spark? Does transfer of data from
> worker nodes to driver nodes happen only when an action is performed?
There is reduceByKey, along with some other methods in more generalized forms. Other transformations and actions may work somewhat differently, but they will generally be parallelized as much as possible. If you look at the UI while the job is running, you will see some number of tasks; each task corresponds to a single partition. Not all actions cause data to be transferred from the worker nodes to the driver (of which there is only one) - saveAsTextFile, for example. In any case, no processing is done and no data is transferred anywhere until an action is invoked, since transformations are lazy.

> I am assuming that in Spark, you typically have a set of transformations
> followed by some sort of action. The RDD is partitioned and sent to
> different worker nodes (assuming this is a cluster setup), the
> transformations are applied to the RDD partitions at the various worker
> nodes, and then when an action is performed, you perform the action on the
> worker nodes, aggregate the partial results at the driver, and then perform
> another reduction at the driver to obtain the overall results. I would also
> assume that deciding whether the action should be done on a worker node
> depends on the type of action. For example, performing reduce at the worker
> node makes sense, while it doesn't make sense to save the file at the
> worker node. Does that sound correct, or am I misinterpreting something?

On the contrary, saving files would typically be done at the worker nodes. If you are handling anything approaching "Big Data" you will be keeping it in a distributed store (typically HDFS, though it depends on your use case), and each worker will write into this store. For example, if you call saveAsTextFile you will see multiple "part-*" files in the output directory, one from each partition. (Don't worry about combining them for future processing; sc.textFile can read them all back in as a single RDD, with the data appropriately partitioned.)
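To make the two-step reduce described above concrete - each partition reduced in parallel, then the partial results combined - here is a plain-Python simulation. It is a conceptual sketch only: the partition layout and the use of addition are my own illustrative choices, not Spark code.

```python
from functools import reduce

# Hypothetical data already split into partitions, as Spark would
# distribute an RDD across workers.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Step 1: each partition is reduced locally. In Spark this is one task
# per partition, and the tasks run in parallel on the workers.
partial_results = [reduce(lambda a, b: a + b, p) for p in partitions]

# Step 2: the small partial results are combined into the final answer.
# In Spark, this is the part that involves transfer to the driver.
total = reduce(lambda a, b: a + b, partial_results)
print(total)  # 45
```

The point of the two steps is that only the tiny per-partition results (here three numbers) cross the network, not the full data.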
Some actions do pull data back to the driver - collect, for example. When using such methods you need to be sure the amount of data won't be too large for your driver. In general you should avoid pulling data back to the driver, or doing any processing on the driver, as that will not scale. In some cases it can be useful, though; for example, if you wanted to join a large data set with a small one, it might perform better to collect the small data set and broadcast it.

Take this for example:

  sc.textFile(inputPath).map(...).filter(...).saveAsTextFile(outputPath)

None of that will execute on the driver. The map and filter will be collapsed into a single stage (which you will see in the UI). Each worker will prefer to read its local data (though data may be transferred if there is none locally to work on), transform the data, and write it out.

If instead you had this:

  sc.textFile(inputPath).map(...).reduceByKey(...).saveAsTextFile(outputPath)

you would have two stages, since reduceByKey causes a shuffle - data will be transferred across the network and formed into a new set of partitions. But after the reduce, each worker will still save the data it is responsible for. (You can provide a custom partitioner, but don't do that unless you feel you have a good reason to.)

Hope that helps.
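P.S. In case it helps to see the shuffle concretely, here is a rough plain-Python simulation of what reduceByKey does conceptually. The data and the regrouping are illustrative only - real Spark also combines values on the map side before shuffling, and partitions keys by hash rather than collecting them in one place.

```python
from collections import defaultdict

# Key-value pairs spread across two partitions, as they might look
# after a map step. The keys and values here are made up.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]

# The "shuffle": records are regrouped so that all values for a given
# key end up together in a new partition. In Spark this is the step
# where data crosses the network.
shuffled = defaultdict(list)
for part in partitions:
    for key, value in part:
        shuffled[key].append(value)

# Each worker then reduces the values for the keys it now owns, and
# saves its own share of the output.
reduced = {key: sum(values) for key, values in shuffled.items()}
print(reduced)  # {'a': 4, 'b': 2, 'c': 4}
```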