Hi,

Our Spark writes to GCS are slow. The cause, as far as I can tell, is that the output is first written to a staging directory and then copied to the actual directory in GCS. Below are the relevant configs and code. Any suggestions on how to speed this up would be great.
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") sparkSession.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") sparkSession.conf.set("spark.hadoop.mapreduce.use.directfileoutputcommitter", "true") sparkSession.conf.set( "spark.hadoop.mapred.output.committer.class", "org.apache.hadoop.mapred.DirectFileOutputCommitter" ) sparkSession.sparkContext.hadoopConfiguration .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") sparkSession.sparkContext.hadoopConfiguration .set("spark.speculation", "false") snapshotInGCS.write .option("header", "true") .option("emptyValue", "") .option("delimiter", "^") .mode(SaveMode.Overwrite) .format("csv") .partitionBy("date", "id") .option("compression", "gzip") .save(s"gs://${bucketName}/${folderName}") Thank you, SK -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org