Hi everyone, is it possible to fix the number of tasks used by saveAsTextFile in PySpark?
I am loading several files from HDFS, fixing the number of partitions to X (say 40, for instance). Then some transformations, like joins and filters, are carried out. The odd thing is that the number of tasks involved in these transformations is 80, i.e., double the fixed number of partitions. However, when the saveAsTextFile action runs, there are only 4 tasks doing the work, and I have not been able to increase that number.

My problem is that those 4 tasks make memory usage grow rapidly and take too long to finish.

I am launching the job from Windows to a cluster running Ubuntu, with 13 machines (4 cores each, 32 GB of memory), using PySpark 1.0.2. Any clue about this? Thanks in advance.
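For reference, here is a stripped-down sketch of what the job does; the paths, the join key extraction, and the filter are placeholders rather than my real code, but the structure (fixed partitions on load, then join/filter, then save) is the same:

```python
from pyspark import SparkContext

sc = SparkContext(appName="SaveAsTextFileTasks")

# Load the inputs from HDFS with a fixed number of partitions (40).
rdd_a = sc.textFile("hdfs:///data/input_a/*", minPartitions=40) \
          .map(lambda line: (line.split(",")[0], line))
rdd_b = sc.textFile("hdfs:///data/input_b/*", minPartitions=40) \
          .map(lambda line: (line.split(",")[0], line))

# These join/filter stages show up with 80 tasks in the UI ...
joined = rdd_a.join(rdd_b) \
              .filter(lambda kv: kv[0] != "")

# ... but this final save stage runs with only 4 tasks.
joined.saveAsTextFile("hdfs:///data/output")
```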