Hi,
I found the answer to my problem, and I’m writing it up here to keep it as a KB entry.

It turns out the problem wasn’t related to S3 performance: my SOURCE was not 
fast enough. Because Spark evaluates lazily, the dashboard attributed the time 
to saveAsTextFile at FacebookProcessor.scala:46 instead of to the load method.

When I ran count() on my dataset before trying to save it to S3, I could see 
that the input was the bottleneck.
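
For the archive, here is a minimal sketch of that check, not my actual job. The 
names SaveTiming, timed, inputPath and outputPath are hypothetical, and 
sc.textFile only stands in for whatever your real source is. It assumes the 
dataset fits in memory once cached; if it does not, Spark will recompute the 
evicted partitions during the save and the two timings will overlap again.

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SaveTiming {
  // Runs `body`, prints the elapsed wall-clock time, and returns its result.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"$label took $seconds%.1f s")
    result
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: where to read from and where to write to.
    val inputPath = args(0)
    val outputPath = args(1)

    val sc = new SparkContext(new SparkConf().setAppName("SaveTiming"))
    try {
      // Stand-in for the real source. cache() keeps the loaded partitions
      // so the save below does not have to read the source a second time.
      val postsStats: RDD[String] = sc.textFile(inputPath).cache()

      // count() is an action, so it forces the load to run now. The time
      // measured here is the input side only.
      val lines = timed("load (count)") { postsStats.count() }
      println(s"loaded $lines lines")

      // With the data already cached, this mostly measures compression
      // and the write itself.
      timed("save") {
        postsStats.saveAsTextFile(outputPath, classOf[GzipCodec])
      }
    } finally {
      sc.stop()
    }
  }
}
```

If the first timing dominates, the source is the slow part and tuning the 
write side will not help much.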

- gustavo


On Sep 30, 2014, at 10:03 PM, Gustavo Arjones <[email protected]> wrote:

> Hi,
> I’m trying to save about a million lines of statistics data, 
> something like:
> 
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404691200  1404691200  1402316275  46  0  0  7  0  0  0
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404694800  1404694800  1402316275  46  0  0  7  0  0  0
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404698400  1404698400  1402316275  46  0  0  7  0  0  0
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404702000  1404702000  1402316275  46  0  0  7  0  0  0
> 
> I’m using the standard saveAsTextFile with an optional codec (GzipCodec):
> 
>     postsStats.saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])
> 
> The resulting task is taking really long, i.e. 3 hours to save 2 GB of data. 
> I found some references and blog posts about increasing the number of RDD 
> partitions to improve processing when READING from the source.
> 
> Would the opposite improve the WRITE operation? I mean, if I reduce the 
> partitioning level, can I avoid the small-files problem?
> Is it possible that GzipCodec is affecting the parallelism level and reducing 
> the overall performance?
> 
> I have 4 m1.xlarge nodes (1 master + 3 workers) on EC2 in standalone mode, 
> launched using the spark-ec2 script with Spark 1.1.0.
> 
> Thanks a lot!
> - gustavo
