Thanks for the answer. I have tried this solution, but it did not work...
This is the involved part of the code:

    data_aux = cruce.groupBy(lambda x: (x[0], x[1]), 80)
    # data_aux = cruce.keyBy(lambda x: (x[0], x[1])).groupByKey(80)
    exit_1 = data_aux.filter(lambda (a, b): len(b) > 1).values()
    exit_2 = data_aux.filter(lambda (a, b): len(b) == 1).values()
    exit_1.map(lambda x: x.data).saveAsTextFile(exit1_path)
    exit_2.map(lambda x: x.data).saveAsTextFile(exit2_path)

As you can see, I am using 80 partitions in the groupBy (and I have also tried keyBy followed by groupByKey). However, the saving is still carried out in only 4 tasks. What am I doing wrong?

On Wed, Sep 17, 2014 at 6:20 PM, Davies Liu <dav...@databricks.com> wrote:

> On Wed, Sep 17, 2014 at 5:21 AM, Luis Guerra <luispelay...@gmail.com> wrote:
> > Hi everyone,
> >
> > Is it possible to fix the number of tasks related to a saveAsTextFile in
> > PySpark?
> >
> > I am loading several files from HDFS, fixing the number of partitions to X
> > (let's say 40, for instance). Then some transformations, like joins and
> > filters, are carried out. The weird thing here is that the number of tasks
> > involved in these transformations is 80, i.e. double the fixed number of
> > partitions. However, when the saveAsTextFile action is carried out, there
> > are only 4 tasks to do it (and I have not been able to increase that
> > number). My problem here is that those 4 tasks rapidly increase the used
> > memory and take too long to finish.
> >
> > I am launching my process from Windows to a cluster in Ubuntu, with 13
> > computers (4 cores each) with 32 GB of memory, and using PySpark 1.0.2.
>
> The RDD behind saveAsTextFile() is a mapped RDD, so its number of
> partitions is determined by the previous RDD.
>
> In Spark 1.0.2, groupByKey() or reduceByKey() will take the number of CPUs
> on the driver (locally) as the default number of partitions, so it's 4. You
> need to change it to 40 or 80 in this case.
>
> BTW, in Spark 1.1, groupByKey() and reduceByKey() will use the number of
> partitions of the previous RDD as the default value.
>
> Davies
>
> > Any clue with this?
> >
> > Thanks in advance
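
For reference, here is a minimal, self-contained sketch of the fix Davies suggests: pass the partition count explicitly to groupByKey(), and, if some intermediate step still collapses the partitioning, repartition right before the save. The RDD contents, the cruce name, and the output path are placeholders standing in for the original data, and the snippet assumes a recent PySpark where getNumPartitions() is available:

    from pyspark import SparkContext

    sc = SparkContext(appName="save-partition-sketch")

    # Placeholder for the joined/filtered 'cruce' RDD from the thread:
    # tuples whose first two fields form the grouping key.
    cruce = sc.parallelize(
        [(i % 10, i % 7, "row-%d" % i) for i in range(1000)], 40)

    # Pass numPartitions explicitly; in Spark 1.0.2 the default would be
    # the driver's CPU count (4 in this thread), not the parent RDD's
    # partition count.
    data_aux = cruce.keyBy(lambda x: (x[0], x[1])).groupByKey(numPartitions=80)

    # Keep only keys with more than one row, as in the original code.
    exit_1 = data_aux.filter(lambda kv: len(kv[1]) > 1).values()

    # Belt and braces: force 80 partitions immediately before the save,
    # so saveAsTextFile() runs as 80 tasks.
    exit_1 = exit_1.repartition(80)
    print(exit_1.getNumPartitions())  # expect 80

    # Flatten each group back to its payload strings and save; the path
    # is hypothetical.
    exit_1.flatMap(lambda rows: [r[2] for r in rows]) \
          .saveAsTextFile("/tmp/exit1_sketch")

Because saveAsTextFile() writes one file per partition, whatever getNumPartitions() reports on the final RDD is exactly how many save tasks (and output part-files) you should see.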