The temp file creation is controlled by a Hadoop OutputCommitter, which is FileOutputCommitter by default. It's used in SparkHadoopWriter (which in turn is used by PairRDDFunctions.saveAsHadoopDataset).
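For what it's worth, here is roughly the path Gil's snippet takes before it reaches that committer (a sketch only; the explicit JobConf and the output path are illustrative, saveAsTextFile builds an equivalent conf internally):

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

// Roughly what distd.saveAsTextFile("...") expands to: a pair RDD saved through
// PairRDDFunctions.saveAsHadoopFile, which hands the JobConf to SparkHadoopWriter.
val data  = Array(1, 2, 3, 4)
val distd = sc.parallelize(data)

val conf = new JobConf(sc.hadoopConfiguration)
// conf.getOutputCommitter is FileOutputCommitter unless you override it; that
// committer is what writes under _temporary and performs the renames that end
// up as the final part-XXXX files (once at task commit, once at job commit).
distd
  .map(x => (NullWritable.get(), new Text(x.toString)))
  .saveAsHadoopFile(
    "/tmp/distd-out",                                // illustrative output path
    classOf[NullWritable],
    classOf[Text],
    classOf[TextOutputFormat[NullWritable, Text]],
    conf)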
You could change the output committer to one that does not use tmp files (e.g. this one from Aaron Davidson: https://gist.github.com/aarondav/c513916e72101bbe14ec); a rough sketch of that idea follows the quoted mail below.

On Wed, Apr 15, 2015 at 12:33 AM, Gil Vernik <g...@il.ibm.com> wrote:
> Hi,
>
> I ran a very simple operation via ./spark-shell (version 1.3.0):
>
> val data = Array(1, 2, 3, 4)
> val distd = sc.parallelize(data)
> distd.saveAsTextFile(.. )
>
> When I executed it, I saw that 4 tasks were created in Spark. Each task
> created 2 temp files at different stages: there was a 1st tmp file (with
> some long name) that at some point was renamed to a 2nd tmp file with
> another name.
> On task completion the 2nd tmp file was renamed to a part-XXXX file. So in
> total, for 4 tasks I had about 8 tmp files.
>
> I have some questions about how those tmp files are generated.
> What is the logic and algorithm tasks use to generate those tmp files? Can
> someone explain it to me? Why were there 2 tmp files (one after another)
> and not a single tmp file?
> Is this something configurable in Spark? I mean, can I run saveAsTextFile
> so that tasks run without creating tmp files? Can this tmp data be
> created in memory?
>
> And the last one: where is the code responsible for this?
>
> Thanks a lot,
> Gil Vernik.
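And a minimal sketch of the committer swap, in the spirit of the gist above (this is not the gist itself; the class name is mine, and note you give up the cleanup/atomicity that FileOutputCommitter's temp-file dance buys you on task failure or speculative execution):

import org.apache.hadoop.mapred.{JobConf, JobContext, OutputCommitter, TaskAttemptContext}

// A "direct" committer: tasks write straight to the final output location and the
// temp-file/rename steps become no-ops. Because it is not a FileOutputCommitter,
// FileOutputFormat falls back to handing record writers the real output path
// rather than a _temporary attempt directory.
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}

// Install it on the JobConf you pass to saveAsHadoopFile / saveAsHadoopDataset
// (setOutputCommitter just sets mapred.output.committer.class, so a
// spark.hadoop.mapred.output.committer.class conf entry works too):
val conf = new JobConf(sc.hadoopConfiguration)
conf.setOutputCommitter(classOf[DirectOutputCommitter])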