[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

Steve Loughran (JIRA) Thu, 15 Dec 2016 06:21:42 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15751492#comment-15751492
 ]


Steve Loughran commented on SPARK-18512:
----------------------------------------

yes, file a SPARK one and that can be a base for blame assignment. 

looking at the trace, it's happening because the actual job attempt path is 
mssing, which is generally considered traumatic. And you are following a 
codepath which only exists if you are using the v1 "all renames at the end" 
algorithm. Try to use the v2 algorithm, 
mapreduce.fileoutputcommitter.algorithm.version = 2 , to see if that helps

> FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and 
> S3A
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-18512
>                 URL: https://issues.apache.org/jira/browse/SPARK-18512
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.0.1
>         Environment: AWS EMR 5.0.1
> Spark 2.0.1
> S3 EU-West-1 (S3A)
>            Reporter: Giuseppe Bonaccorso
>
> After a few hours of streaming processing and data saving in Parquet format, 
> I got always this exception:
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3a://xxx/_temporary/0/task_xxxx
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
>       at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
>       at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
>       at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
>       at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
>       at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
>       at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
>       at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>       at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>       at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>       at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>       at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>       at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>       at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>       at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>       at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>       at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>       at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>       at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>       at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>       at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>       at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>       at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
>       at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>       at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>       at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
> {code}
> I've tried also s3:// and s3n:// but it always happens after a 3-5 hours. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

Reply via email to