Hi, I found the answer to my problem and am writing it down here for future reference. It turns out the problem wasn't related to S3 performance: my SOURCE was not fast enough. Because Spark evaluates lazily, the dashboard showed saveAsTextFile at FacebookProcessor.scala:46 instead of the load method, so the time spent reading the input was reported under the save.
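
The effect of lazy evaluation described above can be illustrated with a minimal sketch. It is not the code from FacebookProcessor.scala: the object name `SaveTiming`, the use of `sc.textFile` as a stand-in for the slow source, the input and output paths taken from `args`, and the `persist` call are all assumptions made for illustration. The idea is to force the load with an action (`count()`) before saving, so the read and the write are timed separately.

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical driver: args(0) is the input path, args(1) the output path.
object SaveTiming {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-timing"))

    // Stand-in for the real (slow) source. Nothing is read yet:
    // textFile and persist only record what should happen later.
    // persist is an assumption here; without it, the save below
    // would read the source a second time.
    val postsStats = sc.textFile(args(0)).persist(StorageLevel.MEMORY_AND_DISK)

    // count() is an action, so the load actually runs here and
    // appears in the UI as its own "count" stage.
    val t0 = System.nanoTime()
    val lines = postsStats.count()
    val t1 = System.nanoTime()

    // With the data already materialized, this stage mostly
    // reflects compression and the write itself.
    postsStats.saveAsTextFile(args(1), classOf[GzipCodec])
    val t2 = System.nanoTime()

    println(s"lines=$lines load=${(t1 - t0) / 1e9}s save=${(t2 - t1) / 1e9}s")
    sc.stop()
  }
}
```

If the load time dominates, the bottleneck is on the input side rather than in the write to S3.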
When I ran count() on my dataset before trying to save it to S3, I could identify the input bottleneck.

- gustavo

On Sep 30, 2014, at 10:03 PM, Gustavo Arjones <[email protected]> wrote:

> Hi,
> I'm trying to save about a million lines containing statistics data,
> something like:
>
> 233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404691200 1404691200 1402316275 46 0 0 7 0 0 0
> 233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404694800 1404694800 1402316275 46 0 0 7 0 0 0
> 233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404698400 1404698400 1402316275 46 0 0 7 0 0 0
> 233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404702000 1404702000 1402316275 46 0 0 7 0 0 0
>
> Using the standard saveAsTextFile with an optional codec (GzipCodec):
>
> postsStats.saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])
>
> The resulting task is taking really long, i.e. 3 hours to save 2Gb of data.
> I found some references and blog posts about increasing the number of RDD
> partitions to improve processing when READING from the source.
>
> Would the opposite operation improve the WRITE operation? I mean, if I
> reduce the partitioning level, can I avoid the small-file problem?
> Is it possible that GzipCodec is affecting the parallelism level and
> reducing the overall performance?
>
> I have 4 m1.xlarge nodes (1 master + 3 workers) on EC2, in standalone mode,
> launched using the spark-ec2 script with Spark 1.1.0.
>
> Thanks a lot!
> - gustavo
