Hi,
I’m trying to save about a million lines of statistics data,
something like:
233815212529_10152316612422530 233815212529_10152316612422530 1328569332
1404691200 1404691200 1402316275 46 0 0 7
0 0 0
233815212529_10152316612422530 233815212529_10152316612422530 1328569332
1404694800 1404694800 1402316275 46 0 0 7
0 0 0
233815212529_10152316612422530 233815212529_10152316612422530 1328569332
1404698400 1404698400 1402316275 46 0 0 7
0 0 0
233815212529_10152316612422530 233815212529_10152316612422530 1328569332
1404702000 1404702000 1402316275 46 0 0 7
0 0 0
I'm using the standard saveAsTextFile with an optional codec (GzipCodec):
postsStats.saveAsTextFile(s"s3n://smx-spark/...../raw_data",
classOf[GzipCodec])
The resulting task takes a really long time: about 3 hours to save 2 GB of
data. I found some references and blog posts about increasing the number of
RDD partitions to improve processing when READING from a source.
Would the opposite improve the WRITE operation? I mean, if I reduce the
number of partitions, can I avoid the small-files problem?
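To make that concrete, here is a minimal sketch of what I have in mind, not something I have benchmarked. The helper name saveWithFewerPartitions and its parameters numPartitions and outputPath are hypothetical placeholders, and I'm assuming the RDD holds already-formatted String lines:

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD

// Hypothetical helper: numPartitions and outputPath are placeholders,
// not values taken from my actual job.
def saveWithFewerPartitions(stats: RDD[String],
                            numPartitions: Int,
                            outputPath: String): Unit = {
  // coalesce merges existing partitions without a full shuffle;
  // each remaining partition is written as one gzipped part file.
  stats
    .coalesce(numPartitions)
    .saveAsTextFile(outputPath, classOf[GzipCodec])
}
```

My understanding is that this trades fewer, larger output files for fewer concurrent write tasks, so I'm not sure whether it would help or hurt here.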
Is it possible that GzipCodec is affecting the level of parallelism and
reducing the overall performance?
I have 4 m1.xlarge nodes (1 master + 3 workers) on EC2 in standalone mode,
launched using the spark-ec2 script with Spark 1.1.0.
Thanks a lot!
- gustavo