Imagine a simple Spark job that stores each line of an RDD in a separate file.
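The writeToFile function below relies on a small using helper that simply closes the output stream after the block runs. A minimal sketch of such a helper (the exact signature of ours may differ, this is just to show the intent) would be:

    // Loan-pattern helper: run the block, then close the resource.
    // Sketch only - our real helper may look slightly different.
    def using[A <: java.io.Closeable, B](resource: A)(f: A => B): B =
      try f(resource) finally resource.close()

The job itself: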
    import java.io.{File, FileOutputStream}
    import java.net.URI

    val lines = sc.parallelize(1 to 100).map(n => s"this is line $n")
    lines.foreach(line => writeToFile(line))

    def writeToFile(line: String) = {
      // path of the file for this particular line (elided here)
      val filePath = "file://..."
      val file = new File(new URI(filePath).getPath)
      // using simply closes the output stream once the block is done
      using(new FileOutputStream(file)) { output =>
        output.write(line.getBytes)
      }
    }

Now, the example above works 99.9% of the time: a file is generated for each line, and each file contains that particular line. However, when dealing with large amounts of data, we run into situations where some of the files are empty! The files are created, but there is no content inside them (0 bytes).

So the questions are: can a Spark job have side effects? Is it even legal to write such code? If not, what other choice do we have when we want to save data from our RDD? If yes, do you guys see what could make this job behave in this strange manner 0.1% of the time?

Disclaimer: we are fully aware of the .saveAsTextFile method in the API; however, the example above is a simplification of our code - normally we produce PDF files.

Best regards,
Paweł Szulc