Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Amin Mosayyebzadeh
I am reading from a single file:

    df = spark.read.text("s3a://test-bucket/testfile.csv")

On Fri, May 31, 2024 at 5:26 AM Mich Talebzadeh wrote:
> Tell Spark to read from a single file
>
> data = spark.read.text("s3a://test-bucket/testfile.csv")
>
> This clarifies to Spark that you are dealing w

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Mich Talebzadeh
Tell Spark to read from a single file:

    data = spark.read.text("s3a://test-bucket/testfile.csv")

This clarifies to Spark that you are dealing with a single file and avoids any bucket-like interpretation.

HTH

Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
I will work on the first two possible causes. As for the third, which I suspect is the real problem: Spark treats the testfile.csv object, whose URL is s3a://test-bucket/testfile.csv, as if it were a bucket, and tries to access _spark_metadata at the URL s3a://test-bucket/testfile.csv/_spark_metadata. testfile.csv is an objec
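
To make the two URLs above concrete, here is a minimal sketch in plain Python (no Spark needed) that splits an s3a:// URL into bucket and key and builds the `_spark_metadata` path by appending to the key. The helper names `split_s3a_url` and `metadata_probe_url` are made up for illustration. The point is only that `test-bucket` is the bucket and `testfile.csv` is a key inside it, so the `_spark_metadata` URL is a lookup under that key, not under a second bucket.

```python
import posixpath
from urllib.parse import urlparse


def split_s3a_url(url):
    """Split an s3a:// URL into (bucket, key). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != "s3a":
        raise ValueError(f"not an s3a URL: {url}")
    # netloc is the bucket; the path, minus its leading slash, is the object key
    return parsed.netloc, parsed.path.lstrip("/")


def metadata_probe_url(url):
    """Return the URL formed by appending _spark_metadata to the key of `url`."""
    bucket, key = split_s3a_url(url)
    return f"s3a://{bucket}/{posixpath.join(key, '_spark_metadata')}"


url = "s3a://test-bucket/testfile.csv"
print(split_s3a_url(url))
print(metadata_probe_url(url))
```

Comparing these two values against the storage-side logs should show whether the requests Spark issues use `testfile.csv` as a key prefix, as expected, or something else.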

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
OK, some observations:

- Spark job successfully lists the S3 bucket containing testfile.csv.
- Spark job can retrieve the file size (33 bytes) for testfile.csv.
- Spark job fails to read the actual data from testfile.csv.
- The printed content from testfile.csv is an empty list.
- S

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
The code should read the testfile.csv file from S3 and print its content. It only prints an empty list, although the file has content. I have also checked the logs of our custom S3 storage (Ceph-based), and I see only LIST operations coming from Spark; there is no GET object operation for testfile.csv. The only

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
Hello,

Overall, the exit code of 0 suggests a successful run of your Spark job. Analyze the intended purpose of your code and verify the output or Spark UI for further confirmation.

    24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0.

What to check:

1. Verify Output

Re: [s3a] Spark is not reading s3 object content

2024-05-29 Thread Amin Mosayyebzadeh
Hi Mich,

Thank you for the help, and sorry about the late reply. I ran the code you provided, but I got "exitCode 0". Here is the complete output:

===
24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0
24/05/30 01:23:38 INFO SparkContext: OS info Linux, 5.4.0-182

Re: [s3a] Spark is not reading s3 object content

2024-05-23 Thread Mich Talebzadeh
Could be a number of reasons.

First, test reading the file with a CLI (note that the AWS CLI takes s3:// URLs, not s3a://):

    aws s3 cp s3://input/testfile.csv .
    cat testfile.csv

Then try this code with the debug option to diagnose the problem:

    from pyspark.sql import SparkSession
    from pyspark.sql.utils import AnalysisException

    try:
        # Initialize Spark se
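
For the CLI check, two details are easy to miss with a non-AWS store: the AWS CLI accepts `s3://` URLs rather than Hadoop's `s3a://` scheme, and it needs `--endpoint-url` to reach a local gateway. A small sketch that builds the command from an s3a URL follows. The helper name `aws_cp_command` and the endpoint `http://ceph-gateway.example:8080` are placeholders, not values from this thread.

```python
from urllib.parse import urlparse


def aws_cp_command(s3a_url, endpoint_url, dest="."):
    """Build an `aws s3 cp` argument list for an object given as an s3a:// URL.

    Hypothetical helper: rewrites the scheme to s3:// and adds --endpoint-url
    so the CLI talks to a non-AWS S3 endpoint.
    """
    parsed = urlparse(s3a_url)
    if parsed.scheme != "s3a":
        raise ValueError(f"not an s3a URL: {s3a_url}")
    s3_url = f"s3://{parsed.netloc}{parsed.path}"
    return ["aws", "s3", "cp", s3_url, dest, "--endpoint-url", endpoint_url]


# The endpoint below is an assumed placeholder; substitute your gateway address.
cmd = aws_cp_command("s3a://input/testfile.csv", "http://ceph-gateway.example:8080")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the endpoint and credentials are set for your environment. If the copy succeeds, the object itself is readable and the problem is more likely on the Spark/S3A side.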

[s3a] Spark is not reading s3 object content

2024-05-23 Thread Amin Mosayyebzadeh
I am trying to read an S3 object from a local S3 storage (Ceph-based) using Spark 3.5.1. I can see that it accesses the bucket and lists the files (I have verified this on the Ceph side by checking its logs), and it even returns the correct size of the object. But the content is not read. The object URL is: s3a://i
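
For readers with a similar setup, below is a rough sketch of the S3A settings that usually matter when pointing Spark at a non-AWS, S3-compatible store. The configuration actually used in this thread is not shown in the message above, so this is an assumption-laden illustration and not a claimed fix. The helper names, endpoint and credentials are placeholders. The property names themselves are standard Hadoop S3A keys, prefixed with `spark.hadoop.` so Spark passes them through.

```python
def s3a_options(endpoint, access_key, secret_key, use_ssl=False):
    """Return Spark conf entries for an S3-compatible endpoint (hypothetical helper)."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style URLs (endpoint/bucket/key) are commonly required by
        # gateways that do not resolve bucket names as DNS subdomains.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


def spark_submit_args(options):
    """Turn the options dict into --conf arguments for spark-submit."""
    args = []
    for key, value in sorted(options.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder values only; none of these come from the thread.
opts = s3a_options("http://ceph-gateway.example:8080", "ACCESS", "SECRET")
print(spark_submit_args(opts))
```

The same dictionary can be applied in code by calling `SparkSession.builder.config(key, value)` for each entry before `getOrCreate()`.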