I am trying to read an S3 object from a local, Ceph-based S3 store using Spark 3.5.1. Spark can access the bucket and list the files (I have verified this on the Ceph side by checking its logs), and it even returns the correct size of the object, but the content is never read.
The object URL is s3a://input/testfile.csv (I have also tested a nested path: s3a://test1/test2/test3/testfile.csv).

The object's content:
=====================
name int1 int2
first 1 2
second 3 4
=====================

Here is the config I have set so far:

("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.6")
("spark.hadoop.fs.s3a.access.key", "R*************6")
("spark.hadoop.fs.s3a.secret.key", "1***************e")
("spark.hadoop.fs.s3a.endpoint", "192.168.52.63:8000")
("spark.hadoop.fs.s3a.path.style.access", "true")
("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

And here is the output of the following PySpark application:

df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv("s3a://input/testfile.csv", sep=' ')
df.show(n=1)

==================================
24/05/20 02:35:00 INFO MetricsSystemImpl: s3a-file-system metrics system started
24/05/20 02:35:01 INFO MetadataLogFileIndex: Reading streaming file log from s3a://input/testfile.csv/_spark_metadata
24/05/20 02:35:01 INFO FileStreamSinkLog: BatchIds found from listing:
24/05/20 02:35:03 INFO FileSourceStrategy: Pushed Filters:
24/05/20 02:35:03 INFO FileSourceStrategy: Post-Scan Filters:
24/05/20 02:35:03 INFO CodeGenerator: Code generated in 176.139675 ms
24/05/20 02:35:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 496.6 KiB, free 4.1 GiB)
24/05/20 02:35:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 54.4 KiB, free 4.1 GiB)
24/05/20 02:35:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on master:38197 (size: 54.4 KiB, free: 4.1 GiB)
24/05/20 02:35:03 INFO SparkContext: Created broadcast 0 from showString at NativeMethodAccessorImpl.java:0
24/05/20 02:35:03 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
+----+----+----+
|name|int1|int2|
+----+----+----+
+----+----+----+
24/05/20 02:35:04 INFO SparkContext: Invoking stop() from shutdown hook
24/05/20 02:35:04 INFO SparkContext: SparkContext is stopping with exitCode 0
=========================================

Am I missing something here?

P.S. I see OP_IS_DIRECTORY is set to 1. Is that correct behavior?

Thanks in advance!
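For completeness, here is a minimal standalone version of the script. The SparkSession builder assembly and the schema definition are my reconstruction (the post above only lists the config tuples and the read call); the schema fields match the file's header, and the app name is arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema matching the file's header: name (string), int1/int2 (integers).
schema = StructType([
    StructField("name", StringType(), True),
    StructField("int1", IntegerType(), True),
    StructField("int2", IntegerType(), True),
])

spark = (
    SparkSession.builder
    .appName("ceph-s3a-read")  # arbitrary app name
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.6")
    .config("spark.hadoop.fs.s3a.access.key", "R*************6")    # redacted
    .config("spark.hadoop.fs.s3a.secret.key", "1***************e")  # redacted
    .config("spark.hadoop.fs.s3a.endpoint", "192.168.52.63:8000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Space-separated file with a header row; schema supplied explicitly.
df = (
    spark.read
    .option("header", "true")
    .schema(schema)
    .csv("s3a://input/testfile.csv", sep=" ")
)
df.show(n=1)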
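Regarding the OP_IS_DIRECTORY question: to check outside Spark whether the key is stored as a plain object rather than being resolved as a directory/prefix, I can probe the same endpoint with boto3 (a sketch, assuming boto3 is installed and the RGW endpoint speaks plain HTTP on port 8000):

import boto3

# Same Ceph RGW endpoint as in the Spark config; the http scheme is an assumption.
s3 = boto3.client(
    "s3",
    endpoint_url="http://192.168.52.63:8000",
    aws_access_key_id="R*************6",        # redacted
    aws_secret_access_key="1***************e",  # redacted
)

# head_object raises a 404 ClientError if "testfile.csv" exists only as a prefix.
meta = s3.head_object(Bucket="input", Key="testfile.csv")
print(meta["ContentLength"], meta.get("ContentType"))

# Fetch the raw bytes to confirm the content itself is retrievable.
body = s3.get_object(Bucket="input", Key="testfile.csv")["Body"].read()
print(body.decode("utf-8"))

If this prints the expected three lines, the object itself is fine and the problem is on the Spark/s3a side of the path resolution.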