Hello. In Spark 4, loading a DataFrame from a path that contains a wildcard
produces a warning and a stack trace that don't appear in Spark 3.
>>> spark.read.load('s3a://ullswater-dev/uw01/temp/test_parquet/*.parquet')
25/07/22 08:33:38 WARN org.apache.spark.sql.execution.streaming.FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://ullswater-dev/uw01/temp/test_parquet/*.parquet.
java.io.FileNotFoundException: No such file or directory: s3a://ullswater-dev/uw01/temp/test_parquet/*.parquet
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:4156)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:4007)
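The read itself still succeeds; it's just noisy. In the meantime I can avoid the warning by keeping the wildcard out of the path and filtering on file names with the generic pathGlobFilter option instead, so FileStreamSink only ever sees a real directory (a sketch, not heavily tested):

df = (spark.read
      .format('parquet')
      .option('pathGlobFilter', '*.parquet')   # match on file name, not in the path
      .load('s3a://ullswater-dev/uw01/temp/test_parquet/'))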
I think it's due to this change in FileStreamSink.scala between Spark 3.5.6 and 4.0.0:
Spark 3.5.6: https://github.com/apache/spark/blob/v3.5.6/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L54
Spark 4.0.0: https://github.com/apache/spark/blob/v4.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L56
S3AFileSystem.isDirectory(hdfsPath) does not throw an exception when hdfsPath
contains a wildcard, because Hadoop's FileSystem.isDirectory catches the
FileNotFoundException internally and returns false, whereas
S3AFileSystem.getFileStatus(hdfsPath).isDirectory lets it propagate, and
FileStreamSink now logs it together with the stack trace.
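The difference is easy to confirm from PySpark through the JVM gateway (spark._jvm and _jsc are PySpark internals, so this is just a diagnostic sketch, not something to rely on):

jvm = spark._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
glob_path = jvm.org.apache.hadoop.fs.Path(
    's3a://ullswater-dev/uw01/temp/test_parquet/*.parquet')
fs = glob_path.getFileSystem(hadoop_conf)

# Spark 3 code path: FileSystem.isDirectory swallows the
# FileNotFoundException and just returns False.
print(fs.isDirectory(glob_path))             # False, no warning

# Spark 4 code path: getFileStatus throws on the literal glob string,
# reaching Python as a Py4JJavaError wrapping FileNotFoundException.
fs.getFileStatus(glob_path).isDirectory()    # raises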
Is this a bug? Thanks.