Hi everyone,

Is it possible to fix the number of tasks used by saveAsTextFile in
PySpark?

I am loading several files from HDFS, fixing the number of partitions to X
(say 40, for instance). Then some transformations, such as joins and
filters, are carried out. The odd thing is that the number of tasks
involved in these transformations is 80, i.e. double the fixed number of
partitions. However, when the saveAsTextFile action runs, there are only 4
tasks for it (and I have not been able to increase that number). My problem
is that those 4 tasks make memory usage grow rapidly and take too long to
finish.
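For reference, this is roughly what the job looks like (the HDFS paths and
the key extraction below are just placeholders, not my real code):

from pyspark import SparkContext

sc = SparkContext(appName="save_as_textfile_test")

# Load the input files, asking for 40 partitions (placeholder paths).
left = sc.textFile("hdfs:///data/input_a", 40) \
         .map(lambda line: tuple(line.split("\t")[:2]))
right = sc.textFile("hdfs:///data/input_b", 40) \
          .map(lambda line: tuple(line.split("\t")[:2]))

# Join and filter: these are the stages where I see ~80 tasks.
joined = left.join(right, 40) \
             .filter(lambda kv: kv[1][0] != kv[1][1])

# This is the stage that runs with only 4 tasks.
joined.saveAsTextFile("hdfs:///data/output")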

I am launching the job from Windows to a cluster running Ubuntu, with 13
machines (4 cores and 32 GB of memory each), using PySpark 1.0.2.

Any ideas on what might be going on here?

Thanks in advance
