Spark's processing of the _temporary folder on S3 is very slow and always causes failure

Shuai Zheng Fri, 13 Mar 2015 15:53:23 -0700

Hi All,


I am trying to run a sort on an r3.2xlarge instance on AWS. For testing, I
run it as a single-node cluster. The data to sort is around 4 GB and sits on
S3; the output also goes to S3.

 

I just connect spark-shell to the local cluster and run the code from the
script (because I only want a benchmark for now).

 

My job is as simple as:

val parquetFile = sqlContext.parquetFile("s3n://...,s3n://...,s3n://...,s3n://...,s3n://...,s3n://...,s3n://...,")

parquetFile.registerTempTable("Test")

val sortedResult = sqlContext.sql("SELECT * FROM Test order by time").map { row => row.mkString("\t") }

sortedResult.saveAsTextFile("s3n://myplace,");

 

The job takes around 6 minutes to finish the sort while I am monitoring the
process. After that, I notice the process stalls at:

 

15/03/13 22:38:27 INFO DAGScheduler: Job 2 finished: saveAsTextFile at <console>:31, took 581.304992 s

 

At that point, Spark has actually written all the data to the _temporary
folder first. After all sub-tasks have finished, it tries to move all the
finished results from the _temporary folder to the final location. This step
might be quick locally (because it is just a cut/paste), but it looks very
slow on S3: it takes a few seconds to move one file (usually there are 200
partitions). It then raises exceptions after moving maybe 40-50 files.
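
To make the two-phase pattern described above concrete, here is a minimal local sketch, assuming a plain local filesystem and using only java.nio. It is not Spark's or Hadoop's actual committer code: the object name TemporaryCommitSketch, the helper names, and the file contents are made up for illustration. On a local disk each move is a cheap rename; on S3 through s3n, as far as I understand, a rename is carried out as a copy followed by a delete of each object, which would explain why the second phase takes seconds per file there.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.collection.JavaConverters._

// Hypothetical illustration of "write to _temporary, then move to final".
object TemporaryCommitSketch {

  // Phase 1: every task writes its part file under <output>/_temporary.
  def writeParts(output: Path, numParts: Int): Path = {
    val tmp = output.resolve("_temporary")
    Files.createDirectories(tmp)
    for (i <- 0 until numParts) {
      val part = tmp.resolve(f"part-$i%05d")
      Files.write(part, s"rows of partition $i\n".getBytes("UTF-8"))
    }
    tmp
  }

  // Phase 2: once all tasks are done, move the part files to the final
  // location one at a time, then remove the empty _temporary folder.
  // Returns the number of files moved.
  def commit(output: Path): Int = {
    val tmp = output.resolve("_temporary")
    val stream = Files.list(tmp)
    val parts = try stream.iterator().asScala.toList finally stream.close()
    parts.foreach { p =>
      Files.move(p, output.resolve(p.getFileName), StandardCopyOption.REPLACE_EXISTING)
    }
    Files.delete(tmp)
    parts.size
  }

  def main(args: Array[String]): Unit = {
    val output = Files.createTempDirectory("sorted-output")
    writeParts(output, 200) // 200 partitions, as in the job above
    val moved = commit(output)
    println(s"moved $moved part files into $output")
  }
}
```

The point of the sketch is only that the second phase is a sequential loop over all part files, so its cost is the number of partitions times the cost of one move.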

 

org.apache.http.NoHttpResponseException: The target server failed to respond
        at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:101)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:281)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247)
        at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:219)

 



 

I have tried several times but never got the full job to finish. I am not
sure whether anything is wrong here; I am using something very basic, and I
can see that the job has finished and all the results are on S3 under the
_temporary folder, but then it raises the exception and fails.

 

Is there any special setting I should use when dealing with S3?

 

I don't know what the issue is here. I have never seen MapReduce have a
similar issue, so it should not be a problem with S3 itself.

 

Regards,

 

Shuai
