Hi,

Our Spark writes to GCS are slow. From what I can see, the output is first
written to a staging directory and then copied/renamed into the final
directory in GCS, and on GCS a rename is a per-object copy, so that second
step is expensive. Below are the configs and the write code we use. Any
suggestions on how to speed this up would be great.

    sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode",
"dynamic")
   
sparkSession.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version",
"2")
   
sparkSession.conf.set("spark.hadoop.mapreduce.use.directfileoutputcommitter",
"true")
    sparkSession.conf.set(
      "spark.hadoop.mapred.output.committer.class",
      "org.apache.hadoop.mapred.DirectFileOutputCommitter"
    )

    sparkSession.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    sparkSession.sparkContext.hadoopConfiguration
      .set("spark.speculation", "false")


    snapshotInGCS.write
      .option("header", "true")
      .option("emptyValue", "")
      .option("delimiter", "^")
      .mode(SaveMode.Overwrite)
      .format("csv")
      .partitionBy("date", "id")
      .option("compression", "gzip")
      .save(s"gs://${bucketName}/${folderName}")



Thank you,
SK


