Hi, you may refer to http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism and http://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections, both of which cover RDD partitioning. Since you are going to load data from HDFS, you may also want to read http://spark.apache.org/docs/latest/programming-guide.html#external-datasets .
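For example, something along these lines in the Scala shell (a rough sketch; the HDFS path and partition counts below are just placeholders for your setup):

// Assumes a SparkContext `sc` is already available (as in spark-shell);
// the path "hdfs:///data/sample.tsv" is hypothetical.

// textFile lets you request a minimum number of partitions up front;
// by default Spark creates roughly one partition per HDFS block.
val lines = sc.textFile("hdfs:///data/sample.tsv", 24)

println(lines.partitions.length)   // at least 24 input partitions

// repartition(n) reshuffles the whole dataset over the network into exactly
// n partitions; useful when the initial split is too coarse or too fine for
// your cores (3 workers * 4 cores = 12 in your cluster).
val rebalanced = lines.repartition(12)

// coalesce(n) can reduce the partition count without a full shuffle,
// which is cheaper when you only need fewer partitions.
val fewer = rebalanced.coalesce(6)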
On Thu, Jul 31, 2014 at 1:07 PM, Sameer Tilak <ssti...@live.com> wrote:
> Hi All,
>
> From the documentation, RDDs are already partitioned and distributed. However,
> there is a way to repartition a given RDD using the following function. Can
> someone please point out the best practices for using this? I have a 10 GB
> TSV file stored in HDFS, and I have a 4-node cluster with 1 master and 3
> workers. Each worker has 15 GB of memory and 4 cores. My processing pipeline
> is not very deep as of now. Can someone please tell me when repartitioning
> is recommended? When the documentation says "balance", does it refer to memory
> usage, compute load, or I/O?
>
> *repartition*(*numPartitions*): Reshuffle the data in the RDD randomly to
> create either more or fewer partitions and balance it across them. This
> always shuffles all data over the network.