I am reading from a single file:
df = spark.read.text("s3a://test-bucket/testfile.csv")
On Fri, May 31, 2024 at 5:26 AM Mich Talebzadeh wrote:
> Tell Spark to read from a single file
>
> data = spark.read.text("s3a://test-bucket/testfile.csv")
>
> This clarifies to Spark that you are dealing with a single file and avoids
> any bucket-like interpretation.
Tell Spark to read from a single file
data = spark.read.text("s3a://test-bucket/testfile.csv")
This clarifies to Spark that you are dealing with a single file and avoids
any bucket-like interpretation.
HTH
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD
I will work on the first two possible causes.
For the third one, which I guess is the real problem: Spark treats the
testfile.csv object (url s3a://test-bucket/testfile.csv) as a bucket and
tries to access _spark_metadata at the url
s3a://test-bucket/testfile.csv/_spark_metadata
testfile.csv is an object.
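For reference, here is a simplified sketch of that probe. It mirrors what Spark's streaming file-sink check effectively does when deciding whether a path is the output of a structured-streaming sink (the helper name is mine, not Spark's):

```python
def metadata_probe_path(path: str) -> str:
    # Spark appends "_spark_metadata" to the input path to test whether
    # the path is the output directory of a streaming file sink.
    return path.rstrip("/") + "/_spark_metadata"

# Produces exactly the URL seen in the logs above
print(metadata_probe_path("s3a://test-bucket/testfile.csv"))
```

So the _spark_metadata URL in the logs is expected probing behaviour, not by itself proof that Spark misclassified the object.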
OK, some observations:
- Spark job successfully lists the S3 bucket containing testfile.csv.
- Spark job can retrieve the file size (33 Bytes) for testfile.csv.
- Spark job fails to read the actual data from testfile.csv.
- The printed content from testfile.csv is an empty list.
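The LIST-but-no-GET observation can be checked mechanically. The sketch below uses made-up, AWS-style access-log lines (real Ceph RGW ops logs have a different format) just to illustrate the kind of filter described above:

```python
# Made-up, AWS-style access-log lines; Ceph RGW ops logs differ in format.
log_lines = [
    "2024-05-30T01:23:40 REST.GET.BUCKET test-bucket prefix=testfile.csv",
    "2024-05-30T01:23:41 REST.GET.BUCKET test-bucket prefix=testfile.csv/_spark_metadata",
]

def ops_seen(lines):
    """Collect the distinct S3 operations appearing in the log lines."""
    return {line.split()[1] for line in lines}

ops = ops_seen(log_lines)
# A healthy read would also show a GET-object operation for testfile.csv
print("REST.GET.OBJECT" in ops)
```

If the object-GET operation never shows up while the job claims success, the reader listed the path but never fetched the bytes, which matches the empty-list output.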
The code should read the testfile.csv file from S3 and print its content. It
only prints an empty list, although the file has content.
I have also checked our custom S3 storage (Ceph based) logs, and I see only
LIST operations coming from Spark; there is no GET object operation for
testfile.csv.
Hello,
Overall, the exit code of 0 suggests a successful run of your Spark job.
Check the job's actual output, and verify it in the Spark UI for further
confirmation.
24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0.
what to check
1. Verify Output
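Note that exit code 0 only means the driver process shut down cleanly; a job that read zero rows still exits 0. A small plain-Python illustration of the distinction:

```python
import subprocess
import sys

# A process that "succeeds" (exit code 0) while producing an empty
# result -- analogous to the Spark job printing [] and exiting cleanly.
proc = subprocess.run(
    [sys.executable, "-c", "print([])"],
    capture_output=True, text=True,
)
print(proc.returncode)       # 0: clean shutdown...
print(proc.stdout.strip())   # ...yet the "data" is an empty list
```

This is why verifying the output itself matters more than the exit code.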
Hi Mich,
Thank you for the help and sorry about the late reply.
I ran the code you provided, but I got "exitCode 0". Here is the complete output:
===
24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0
24/05/30 01:23:38 INFO SparkContext: OS info Linux, 5.4.0-182
There could be a number of reasons.
First, test reading the file with the CLI (note that the aws CLI uses the
s3:// scheme, and a custom endpoint needs --endpoint-url):
aws s3 cp s3://input/testfile.csv . --endpoint-url <your-ceph-rgw-endpoint>
cat testfile.csv
Then try this code with a debug option to diagnose the problem:
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException
try:
    # Initialize Spark session and attempt the read
    spark = SparkSession.builder.getOrCreate()
    spark.read.text("s3a://test-bucket/testfile.csv").show(truncate=False)
except AnalysisException as e:
    print(f"Read failed: {e}")
I am trying to read an s3 object from a local S3 storage (Ceph based)
using Spark 3.5.1. I see it can access the bucket and list the files (I
have verified it on Ceph side by checking its logs), even returning the
correct size of the object. But the content is not read.
The object url is:
s3a://i
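For what it's worth, one frequent cause of this list-works-but-read-fails symptom on Ceph RGW is addressing style: S3A defaults to virtual-hosted-style requests (bucket in the hostname, fs.s3a.path.style.access=false), while many Ceph deployments only serve path-style (bucket in the path). A pure-Python sketch of the difference (the endpoint host is made up):

```python
def s3_http_url(endpoint: str, bucket: str, key: str, path_style: bool) -> str:
    """Build the HTTP URL an S3 client would request (illustrative only)."""
    scheme, host = endpoint.split("://", 1)
    if path_style:
        # What fs.s3a.path.style.access=true produces; typical for Ceph RGW
        return f"{scheme}://{host}/{bucket}/{key}"
    # Virtual-hosted style: the bucket becomes part of the hostname
    return f"{scheme}://{bucket}.{host}/{key}"

print(s3_http_url("http://rgw.example:8080", "test-bucket", "testfile.csv", True))
print(s3_http_url("http://rgw.example:8080", "test-bucket", "testfile.csv", False))
```

If the virtual-hosted hostname does not resolve or routes oddly, some operations can still succeed while the object GET fails, so it is worth checking which style your endpoint expects and setting fs.s3a.path.style.access accordingly.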