Hi,
I’m trying to save about a million lines of statistics data, something like:

233815212529_10152316612422530  233815212529_10152316612422530  1328569332      
1404691200      1404691200      1402316275      46      0       0       7       
0       0       0
233815212529_10152316612422530  233815212529_10152316612422530  1328569332      
1404694800      1404694800      1402316275      46      0       0       7       
0       0       0
233815212529_10152316612422530  233815212529_10152316612422530  1328569332      
1404698400      1404698400      1402316275      46      0       0       7       
0       0       0
233815212529_10152316612422530  233815212529_10152316612422530  1328569332      
1404702000      1404702000      1402316275      46      0       0       7       
0       0       0

I’m using the standard saveAsTextFile with an optional codec (GzipCodec):

    import org.apache.hadoop.io.compress.GzipCodec

    postsStats.saveAsTextFile(s"s3n://smx-spark/...../raw_data",
      classOf[GzipCodec])

The resulting job takes really long, e.g. 3 hours to save 2 GB of data. I 
found some references and blog posts about increasing the number of RDD 
partitions to improve performance when READING from a source.
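For reference, the read-side pattern from those posts looked roughly like 
this; the input path and the partition count of 64 below are placeholders, 
not my real values:

    // Hint a larger minimum partition count when READING
    // (64 is only an illustrative value):
    val input = sc.textFile("s3n://smx-spark/...../input", minPartitions = 64)
    // or spread an existing RDD over more partitions (this shuffles):
    val spread = input.repartition(64)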

Would the opposite help the WRITE operation? I mean, if I reduce the 
partitioning level, can I avoid the small-files problem (see the sketch below)?
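What I have in mind is a minimal sketch like this; the partition count of 24 
is just a guess (roughly two per core on my 3 workers), not a tested value:

    import org.apache.hadoop.io.compress.GzipCodec

    // Fewer partitions means fewer, larger output files.
    // shuffle = false merges partitions without a full shuffle:
    postsStats
      .coalesce(24, shuffle = false)
      .saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])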
Is it possible that the GzipCodec is affecting the parallelism level and 
reducing overall performance?
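If it matters, my plan to test that was just to write the same RDD with 
different codecs and compare wall-clock times. The output paths below are 
placeholders, and BZip2Codec is simply one alternative codec that ships 
with Hadoop:

    import org.apache.hadoop.io.compress.{BZip2Codec, GzipCodec}

    // Same data written three ways, to isolate the codec's cost
    // (paths are placeholders):
    postsStats.saveAsTextFile("s3n://smx-spark/...../raw_plain")
    postsStats.saveAsTextFile("s3n://smx-spark/...../raw_gz", classOf[GzipCodec])
    postsStats.saveAsTextFile("s3n://smx-spark/...../raw_bz2", classOf[BZip2Codec])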

I have 4 m1.xlarge nodes (1 master + 3 workers) on EC2 in standalone mode, 
launched using the spark-ec2 script with Spark 1.1.0.

Thanks a lot!
- gustavo
