Hi,
Does anybody use DirectFileOutputCommitter or something similar,
especially when working with S3?
I know there is one implementation in the Spark distribution for the
Parquet format, but not for files in general. Why?

IMHO, it can bring a huge performance boost.
Using the default FileOutputCommitter with S3 has a big overhead at the
commit stage, when all parts are copied one by one from _temporary to the
destination dir. This becomes a bottleneck when the number of partitions
is high.
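
To make the overhead concrete, here is a toy model of the two commit
strategies. It is only a sketch: a local directory stands in for S3, the
function names are mine and not Hadoop APIs, and the per-file move stands
in for what on S3 would, as far as I understand, be a copy followed by a
delete.

```python
import os
import shutil
import tempfile


def write_with_temporary(dest, parts):
    """Model of the default flow: write under _temporary, then move each part."""
    tmp = os.path.join(dest, "_temporary")
    os.makedirs(tmp)
    for name, data in parts.items():
        with open(os.path.join(tmp, name), "w") as f:
            f.write(data)
    moves = 0
    # Commit stage: one move per part file (assumed to be copy + delete on S3).
    for name in sorted(os.listdir(tmp)):
        shutil.move(os.path.join(tmp, name), os.path.join(dest, name))
        moves += 1
    os.rmdir(tmp)
    return moves


def write_direct(dest, parts):
    """Model of a direct committer: write parts straight to the destination."""
    os.makedirs(dest, exist_ok=True)
    for name, data in parts.items():
        with open(os.path.join(dest, name), "w") as f:
            f.write(data)
    return 0  # nothing left to move at commit time


# 1000 hypothetical partitions, one small part file each.
parts = {f"part-{i:05d}": f"rows of partition {i}\n" for i in range(1000)}
root = tempfile.mkdtemp()
moves_default = write_with_temporary(os.path.join(root, "default"), parts)
moves_direct = write_direct(os.path.join(root, "direct"), parts)
files_default = sorted(os.listdir(os.path.join(root, "default")))
files_direct = sorted(os.listdir(os.path.join(root, "direct")))
print(moves_default, moves_direct)
shutil.rmtree(root)
```

Both variants end up with the same part files in the destination; the
difference is the extra per-file step at commit time, which grows with
the number of partitions.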

Also, are there any known problems with using
DirectFileOutputCommitter?
If a direct write of one partition fails in the middle, will Spark
notice this and fail the job (say, after all retries)?
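
The failure case I have in mind can be shown with a small toy model.
Again this is only a sketch under my own assumptions: a local directory
stands in for S3, and write_part is a hypothetical helper, not a Spark or
Hadoop API. If a task dies halfway through a direct write, the partial
file is already sitting in the destination, whereas with _temporary it
never gets there.

```python
import os
import shutil
import tempfile


def write_part(path, rows, fail_after=None):
    """Write rows to path; raise midway if fail_after is set (simulated failure)."""
    with open(path, "w") as f:
        for i, row in enumerate(rows):
            if fail_after is not None and i == fail_after:
                raise IOError("simulated task failure")
            f.write(row + "\n")


rows = [f"row {i}" for i in range(10)]
root = tempfile.mkdtemp()

# Direct write: the part file lives in the destination from the first byte.
direct_dest = os.path.join(root, "direct")
os.makedirs(direct_dest)
try:
    write_part(os.path.join(direct_dest, "part-00000"), rows, fail_after=5)
except IOError:
    pass  # nothing in this model cleans up the partial file
direct_visible = os.listdir(direct_dest)

# Write via _temporary: the move to the destination only happens on success.
staged_dest = os.path.join(root, "staged")
tmp = os.path.join(staged_dest, "_temporary")
os.makedirs(tmp)
try:
    write_part(os.path.join(tmp, "part-00000"), rows, fail_after=5)
    shutil.move(os.path.join(tmp, "part-00000"),
                os.path.join(staged_dest, "part-00000"))
except IOError:
    shutil.rmtree(tmp)  # abort: discard the partial attempt
staged_visible = os.listdir(staged_dest)

print(direct_visible, staged_visible)
shutil.rmtree(root)
```

Whether Spark detects and cleans up the leftover partial file in the
direct case, or simply fails the job, is exactly what I am asking.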

Thanks in advance.




