I am reading from a single file:
df = spark.read.text("s3a://test-bucket/testfile.csv")
On Fri, May 31, 2024 at 5:26 AM Mich Talebzadeh wrote:
> Tell Spark to read from a single file
>
> data = spark.read.text("s3a://test-bucket/testfile.csv")
>
> This clarifies to Spark that you are dealing with a single file and avoids
> any bucket-like interpretation.
Tell Spark to read from a single file
data = spark.read.text("s3a://test-bucket/testfile.csv")
This clarifies to Spark that you are dealing with a single file and avoids
any bucket-like interpretation.
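A quick way to confirm the read worked, as a minimal sketch (assuming the
session is already configured with s3a credentials):

data = spark.read.text("s3a://test-bucket/testfile.csv")
data.printSchema()         # a single string column named value
data.show(truncate=False)  # each line of the object becomes one row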
HTH
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD
I will work on the first two possible causes.
For the third one, which I guess is the real problem, Spark treats the
testfile.csv object at the URL s3a://test-bucket/testfile.csv as a bucket
and tries to access _spark_metadata at the URL
s3a://test-bucket/testfile.csv/_spark_metadata
testfile.csv is an object, not a bucket.
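One way to check how the s3a connector classifies the path is to stat it
through Spark's JVM gateway (a sketch; spark._jvm and spark._jsc are py4j
internals rather than a public API):

hconf = spark._jsc.hadoopConfiguration()
path = spark._jvm.org.apache.hadoop.fs.Path("s3a://test-bucket/testfile.csv")
fs = path.getFileSystem(hconf)
status = fs.getFileStatus(path)
print(status.isFile(), status.getLen())  # expected: True 33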
OK, some observations:
- Spark job successfully lists the S3 bucket containing testfile.csv.
- Spark job can retrieve the file size (33 bytes) for testfile.csv.
- Spark job fails to read the actual data from testfile.csv.
- The printed content from testfile.csv is an empty list (the failing
pattern is sketched just after this list).
- S3 logs show only LIST operations from Spark, no GET for testfile.csv.
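For reference, the failing pattern reduces to this minimal sketch (paths as
reported earlier in the thread):

df = spark.read.text("s3a://test-bucket/testfile.csv")
print(df.collect())  # prints [] even though the object is 33 bytes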
The code should read the testfile.csv file from S3 and print its content. It
only prints an empty list, although the file has content.
I have also checked our custom S3 storage (Ceph-based) logs, and I see only
LIST operations coming from Spark; there is no GET object operation for
testfile.csv.
The only
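A direct GET outside Spark can confirm the object itself is readable, for
example with boto3 (a sketch; the endpoint URL is a placeholder for your
Ceph gateway, and credentials are assumed to come from the environment):

import boto3

s3 = boto3.client("s3", endpoint_url="https://ceph-gw.example.com")
obj = s3.get_object(Bucket="test-bucket", Key="testfile.csv")
print(obj["Body"].read())  # should print the 33-byte content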
Hello,
Overall, the exit code of 0 suggests a successful run of your Spark job.
Check the output, or the Spark UI, to confirm that the job did what you
intended.
24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0.
What to check:
1. Verify the output
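A minimal sketch of that verification (assuming df is the DataFrame from
your read):

df = spark.read.text("s3a://test-bucket/testfile.csv")
print(df.count())  # a non-empty 33-byte file should yield at least one row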
Hi Mich,
Thank you for the help and sorry about the late reply.
I ran your provided code, but I got "exitCode 0". Here is the complete output:
===
24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0
24/05/30 01:23:38 INFO SparkContext: OS info Linux, 5.4.0-182
It could be one of a number of reasons.
First, test reading the file with the AWS CLI (note that the aws tool expects
the s3:// scheme, not s3a://):
aws s3 cp s3://test-bucket/testfile.csv .
cat testfile.csv
Try this code, raising the log level to DEBUG if needed, to diagnose the
problem:
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

try:
    # Initialize a Spark session, read the object, and print the rows
    spark = SparkSession.builder.getOrCreate()
    print(spark.read.text("s3a://test-bucket/testfile.csv").collect())
except AnalysisException as e:
    print(f"Error reading file: {e}")