I am trying to read an S3 object from local, Ceph-based S3 storage
using Spark 3.5.1. Spark can evidently access the bucket and list the
files (I have verified this on the Ceph side by checking its logs),
and it even returns the correct size of the object, but the object's
content is never read.
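
(As a sanity check outside Spark, a minimal boto3 snippet of this kind
can list the bucket and fetch the object body; it assumes plain HTTP on
the RGW endpoint and uses the same redacted credentials as the Spark
config below:)

import boto3

# Hypothetical sanity check outside Spark; the http:// scheme is an
# assumption, and the key values are redacted placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://192.168.52.63:8000",
    aws_access_key_id="R*************6",
    aws_secret_access_key="1***************e",
)
print(s3.list_objects_v2(Bucket="input").get("Contents", []))
print(s3.get_object(Bucket="input", Key="testfile.csv")["Body"].read().decode())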

The object URL is:
s3a://input/testfile.csv (I have also tested a nested path:
s3a://test1/test2/test3/testfile.csv)


Object's content:

=====================
name int1 int2
first 1 2
second 3 4
=====================


Here is the config I have set so far:

("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.6")
("spark.hadoop.fs.s3a.access.key", "R*************6")
("spark.hadoop.fs.s3a.secret.key", "1***************e")
("spark.hadoop.fs.s3a.endpoint", "192.168.52.63:8000")
("spark.hadoop.fs.s3a.path.style.access", "true")
("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")


Here is my PySpark application:
df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv("s3a://input/testfile.csv", sep=' ')

df.show(n=1)
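
(For reference, the schema is defined along these lines; this is a
sketch with types assumed from the sample file above, since I haven't
pasted the exact definition:)

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Assumed to match the sample file: one string column and two integer columns.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("int1", IntegerType(), True),
    StructField("int2", IntegerType(), True),
])

And the output: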
==================================
24/05/20 02:35:00 INFO MetricsSystemImpl: s3a-file-system metrics system started
24/05/20 02:35:01 INFO MetadataLogFileIndex: Reading streaming file log from s3a://input/testfile.csv/_spark_metadata
24/05/20 02:35:01 INFO FileStreamSinkLog: BatchIds found from listing:
24/05/20 02:35:03 INFO FileSourceStrategy: Pushed Filters:
24/05/20 02:35:03 INFO FileSourceStrategy: Post-Scan Filters:
24/05/20 02:35:03 INFO CodeGenerator: Code generated in 176.139675 ms
24/05/20 02:35:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 496.6 KiB, free 4.1 GiB)
24/05/20 02:35:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 54.4 KiB, free 4.1 GiB)
24/05/20 02:35:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on master:38197 (size: 54.4 KiB, free: 4.1 GiB)
24/05/20 02:35:03 INFO SparkContext: Created broadcast 0 from showString at NativeMethodAccessorImpl.java:0
24/05/20 02:35:03 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
+----+----+----+
|name|int1|int2|
+----+----+----+
+----+----+----+
24/05/20 02:35:04 INFO SparkContext: Invoking stop() from shutdown hook
24/05/20 02:35:04 INFO SparkContext: SparkContext is stopping with exitCode 0
=========================================

Am I missing something here?

P.S. I see OP_IS_DIRECTORY is set to 1. Is that correct behavior?


Thanks in advance!
