The Spark event log writer expects a Hadoop-compatible file system that supports syncing [1] previously written data, making it immediately durable and visible to other clients. The log message is warning that the S3A file system does not provide this capability: even if it's asked to sync, the operation is a no-op, and the data won't be visible until the stream is closed.

To address this, you can switch spark.eventLog.dir to a file system that does offer this capability, like HDFS.
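For example, in spark-defaults.conf (the HDFS URI here is just a placeholder; substitute your own namenode and path):

    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs://namenode:8020/spark-history

If you read these logs through the Spark History Server, point spark.history.fs.logDirectory at the same location.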
You could also ignore the warning, but the consequence is that event log data won't be visible until the job completes, and if the job terminates unexpectedly before closing the stream, you might not get any event data at all.

Different cloud storage providers handle this differently. GCS supports an option to enable a sync capability [2] (fs.gs.outputstream.type set to SYNCABLE_COMPOSITE, if I recall correctly). The implementation works by rolling to a new hidden file whenever a sync is requested and composing all such files to present a single stream to readers. The additional GCS API calls required to do this mean that latencies will be longer compared to HDFS, and rate limiting can cause individual syncs to revert to no-ops, so the guarantee is not as strong as with HDFS.

For more background, see HADOOP-13327 [3] and HADOOP-17597 [4]. Also note from those issues that these checks get stricter starting in Hadoop 3.3.1: instead of a warning, the application will fail with an exception to alert users to the potential misbehavior. You can opt back in to the old warning behavior with an additional configuration property (fs.s3a.downgrade.syncable.exceptions, added in HADOOP-17597).

[1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/outputstream.html#org.apache.hadoop.fs.Syncable
[2] https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.8/gcs/CONFIGURATION.md#io-configuration
[3] https://issues.apache.org/jira/browse/HADOOP-13327
[4] https://issues.apache.org/jira/browse/HADOOP-17597

Chris Nauroth


On Wed, Nov 9, 2022 at 11:04 PM second_co...@yahoo.com.INVALID
<second_co...@yahoo.com.invalid> wrote:

> When running a Spark job, I used:
>
>     "spark.eventLog.dir": "s3a://_some_bucket_on_prem/spark-history",
>     "spark.eventLog.enabled": true
>
> and the log of the job shows:
>
>     22/11/10 06:42:30 INFO SingleEventLogFileWriter: Logging events to
>     s3a://_some_bucket_on_prem/spark-history/spark-a2befd8cb9134190982a35663b61294b.inprogress
>     22/11/10 06:42:30 WARN S3ABlockOutputStream: Application invoked the
>     Syncable API against stream writing to
>     _some_bucket_on_prem/spark-history/a2befd8cb9134190982a35663b61294b.inprogress.
>     This is unsupported
>
> Does Spark 3.3.0 support writing the event log to an s3a bucket? I can't
> write the log. It is an on-premises s3a store. Am I missing a jar library?
> Does it support any cloud blob storage providers?