Hi, I’m trying to save about a million lines of statistics data, something like:
    233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404691200 1404691200 1402316275 46 0 0 7 0 0 0
    233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404694800 1404694800 1402316275 46 0 0 7 0 0 0
    233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404698400 1404698400 1402316275 46 0 0 7 0 0 0
    233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404702000 1404702000 1402316275 46 0 0 7 0 0 0

I’m using the standard saveAsTextFile with an optional codec (GzipCodec):

    postsStats.saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])

The resulting task is taking really long: about 3 hours to save 2 GB of data.

I found some references and blog posts about increasing the number of RDD partitions to improve processing when READING from a source. Would the opposite help the WRITE operation? That is, if I reduce the partitioning level, can I avoid the small-files problem? (See the P.S. below for a sketch of what I mean.)

Is it possible that the GzipCodec is affecting the parallelism level and reducing overall performance?

My setup: 4 m1.xlarge nodes (1 master + 3 workers) on EC2, Spark 1.1.0 in standalone mode, launched using the spark-ec2 script.

Thanks a lot!

- gustavo
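
P.S. To make the question concrete, here is roughly what I mean by reducing the partition count before the write. This is just a sketch, not what I'm running today: the coalesce count of 16 is a made-up placeholder I'd still have to tune, and I'm assuming postsStats is an RDD[String] holding the lines shown above.

    import org.apache.spark.rdd.RDD
    import org.apache.hadoop.io.compress.GzipCodec

    // Collapse the RDD into fewer, larger partitions before writing, so the
    // job produces fewer (bigger) .gz files on S3 instead of one small file
    // per partition. coalesce(n, shuffle = false) merges existing partitions
    // without a full shuffle.
    val fewerParts: RDD[String] = postsStats.coalesce(16, shuffle = false)

    // Same write as before, just over fewer partitions (path elided as in
    // my real job).
    fewerParts.saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])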