Hello,

Overall, an exit code of 0 indicates that the Spark job completed without
error, but it does not by itself confirm the job did what you intended.
Check the output (and the Spark UI) against the intended purpose of your
code for further confirmation.

24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0.

What to check:


   1. Verify Output: if your Spark job was meant to read data from S3 and
   process it, confirm the data was actually handled. In your log, df.show()
   prints only the header row and no data rows, so check whether any rows
   were read and whether any results were written to the designated location.
   2. Review Code: double-check the path, delimiter, header and schema
   options passed to spark.read; a mismatch there can silently produce an
   empty DataFrame even though the job exits with code 0.
   3. Check Spark UI: while the job runs, open the UI (port 4040 by default)
   and look at the SQL and Stages tabs to confirm that input records and
   bytes were actually read from S3. A minimal sketch of the first two
   checks is below.
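
As a rough sketch of the first two checks (this assumes the same s3a
credentials/endpoint settings and the same bucket and file name as in my
earlier snippet; adjust the path, delimiter and header option for your file):

from pyspark.sql import SparkSession

# Assumes the s3a access key, secret key and endpoint are already configured,
# e.g. via spark-defaults.conf or the builder as in the earlier snippet.
spark = SparkSession.builder.appName("S3ReadCheck").getOrCreate()

# Read the same object as before; adjust path, delimiter and header as needed
df = (spark.read
      .option("header", "true")
      .option("delimiter", " ")
      .csv("s3a://input/testfile.csv"))

# Verify output: did any rows actually come back?
row_count = df.count()
print(f"Rows read: {row_count}")

# Review code: confirm the schema Spark produced matches the file layout
df.printSchema()
df.show(5, truncate=False)

if row_count == 0:
    print("No rows read - re-check the path, delimiter and header settings, "
          "and inspect the SQL/Stages tabs in the Spark UI (port 4040).")

spark.stop()

If the count comes back as zero while a CLI copy of the object shows data,
the problem is likely in how the read is specified rather than in access to
the bucket.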


HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but cannot be guaranteed. As with any advice, remember that "one
test result is worth one-thousand expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Thu, 30 May 2024 at 11:56, Amin Mosayyebzadeh <mosayyebza...@gmail.com>
wrote:

> Hi Mich,
>
> Thank you for the help and sorry about the late reply.
> I ran your provided code, but I got "exitCode 0". Here is the complete
> output:
>
> ===============================
>
>
> 24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0
> 24/05/30 01:23:38 INFO SparkContext: OS info Linux, 5.4.0-182-generic,
> amd64
> 24/05/30 01:23:38 INFO SparkContext: Java version 11.0.22
> 24/05/30 01:23:38 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 24/05/30 01:23:38 INFO ResourceUtils:
> ==============================================================
> 24/05/30 01:23:38 INFO ResourceUtils: No custom resources configured for
> spark.driver.
> 24/05/30 01:23:38 INFO ResourceUtils:
> ==============================================================
> 24/05/30 01:23:38 INFO SparkContext: Submitted application: S3ReadTest
> 24/05/30 01:23:38 INFO ResourceProfile: Default ResourceProfile created,
> executor resources: Map(cores -> name: cores, amount: 1, script: , vendor:
> , memory -> name: memory, amount: 1024, script: , vendor: , offHeap ->
> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus ->
> name: cpus, amount: 1.0)
> 24/05/30 01:23:38 INFO ResourceProfile: Limiting resource is cpu
> 24/05/30 01:23:38 INFO ResourceProfileManager: Added ResourceProfile id: 0
> 24/05/30 01:23:38 INFO SecurityManager: Changing view acls to: ubuntu
> 24/05/30 01:23:38 INFO SecurityManager: Changing modify acls to: ubuntu
> 24/05/30 01:23:38 INFO SecurityManager: Changing view acls groups to:
> 24/05/30 01:23:38 INFO SecurityManager: Changing modify acls groups to:
> 24/05/30 01:23:38 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: ubuntu; groups
> with view permissions: EMPTY; users with modify permissions: ubuntu; groups
> with modify permissions: EMPTY
> 24/05/30 01:23:38 INFO Utils: Successfully started service 'sparkDriver'
> on port 46321.
> 24/05/30 01:23:38 INFO SparkEnv: Registering MapOutputTracker
> 24/05/30 01:23:38 INFO SparkEnv: Registering BlockManagerMaster
> 24/05/30 01:23:38 INFO BlockManagerMasterEndpoint: Using
> org.apache.spark.storage.DefaultTopologyMapper for getting topology
> information
> 24/05/30 01:23:38 INFO BlockManagerMasterEndpoint:
> BlockManagerMasterEndpoint up
> 24/05/30 01:23:38 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
> 24/05/30 01:23:38 INFO DiskBlockManager: Created local directory at
> /tmp/blockmgr-a1fc37d5-885a-4ed0-b8f2-4eeb930c69ee
> 24/05/30 01:23:38 INFO MemoryStore: MemoryStore started with capacity 2.8
> GiB
> 24/05/30 01:23:38 INFO SparkEnv: Registering OutputCommitCoordinator
> 24/05/30 01:23:39 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
> 24/05/30 01:23:39 INFO Utils: Successfully started service 'SparkUI' on
> port 4040.
> 24/05/30 01:23:39 INFO Executor: Starting executor ID driver on host
> MOC-R4PAC08U33-S1C
> 24/05/30 01:23:39 INFO Executor: OS info Linux, 5.4.0-182-generic, amd64
> 24/05/30 01:23:39 INFO Executor: Java version 11.0.22
> 24/05/30 01:23:39 INFO Executor: Starting executor with user classpath
> (userClassPathFirst = false): ''
> 24/05/30 01:23:39 INFO Executor: Created or updated repl class loader
> org.apache.spark.util.MutableURLClassLoader@a45f4d6 for default.
> 24/05/30 01:23:39 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39343.
> 24/05/30 01:23:39 INFO NettyBlockTransferService: Server created on
> MOC-R4PAC08U33-S1C:39343
> 24/05/30 01:23:39 INFO BlockManager: Using
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication
> policy
> 24/05/30 01:23:39 INFO BlockManagerMaster: Registering BlockManager
> BlockManagerId(driver, MOC-R4PAC08U33-S1C, 39343, None)
> 24/05/30 01:23:39 INFO BlockManagerMasterEndpoint: Registering block
> manager MOC-R4PAC08U33-S1C:39343 with 2.8 GiB RAM, BlockManagerId(driver,
> MOC-R4PAC08U33-S1C, 39343, None)
> 24/05/30 01:23:39 INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, MOC-R4PAC08U33-S1C, 39343, None)
> 24/05/30 01:23:39 INFO BlockManager: Initialized BlockManager:
> BlockManagerId(driver, MOC-R4PAC08U33-S1C, 39343, None)
> 24/05/30 01:23:39 INFO SharedState: Setting hive.metastore.warehouse.dir
> ('null') to the value of spark.sql.warehouse.dir.
> 24/05/30 01:23:39 INFO SharedState: Warehouse path is
> 'file:/home/ubuntu/tpch-spark/spark-warehouse'.
> 24/05/30 01:23:40 WARN MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> 24/05/30 01:23:40 INFO MetricsSystemImpl: Scheduled Metric snapshot period
> at 10 second(s).
> 24/05/30 01:23:40 INFO MetricsSystemImpl: s3a-file-system metrics system
> started
> 24/05/30 01:23:41 INFO MetadataLogFileIndex: Reading streaming file log
> from s3a://test-bucket/testfile.csv/_spark_metadata
> 24/05/30 01:23:41 INFO FileStreamSinkLog: BatchIds found from listing:
> 24/05/30 01:23:43 INFO FileSourceStrategy: Pushed Filters:
> 24/05/30 01:23:43 INFO FileSourceStrategy: Post-Scan Filters:
> 24/05/30 01:23:43 INFO CodeGenerator: Code generated in 188.932153 ms
> 24/05/30 01:23:43 INFO MemoryStore: Block broadcast_0 stored as values in
> memory (estimated size 201.3 KiB, free 2.8 GiB)
> 24/05/30 01:23:43 INFO MemoryStore: Block broadcast_0_piece0 stored as
> bytes in memory (estimated size 34.8 KiB, free 2.8 GiB)
> 24/05/30 01:23:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on MOC-R4PAC08U33-S1C:39343 (size: 34.8 KiB, free: 2.8 GiB)
> 24/05/30 01:23:43 INFO SparkContext: Created broadcast 0 from showString
> at NativeMethodAccessorImpl.java:0
> 24/05/30 01:23:43 INFO FileSourceScanExec: Planning scan with bin packing,
> max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
> +----+----+----+
> |name|int1|int2|
> +----+----+----+
> +----+----+----+
>
> 24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with
> exitCode 0.
> 24/05/30 01:23:43 INFO SparkUI: Stopped Spark web UI at
> http://MOC-R4PAC08U33-S1C:4040
> 24/05/30 01:23:43 INFO MapOutputTrackerMasterEndpoint:
> MapOutputTrackerMasterEndpoint stopped!
> 24/05/30 01:23:43 INFO MemoryStore: MemoryStore cleared
> 24/05/30 01:23:43 INFO BlockManager: BlockManager stopped
> 24/05/30 01:23:43 INFO BlockManagerMaster: BlockManagerMaster stopped
> 24/05/30 01:23:43 INFO
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
> OutputCommitCoordinator stopped!
> 24/05/30 01:23:43 INFO SparkContext: Successfully stopped SparkContext
> 24/05/30 01:23:44 INFO ShutdownHookManager: Shutdown hook called
> 24/05/30 01:23:44 INFO ShutdownHookManager: Deleting directory
> /tmp/spark-f26cd915-aeb6-4efc-8960-56ca51ac1a7d
> 24/05/30 01:23:44 INFO ShutdownHookManager: Deleting directory
> /tmp/spark-91b984d6-62fe-4c6b-9996-36d6873ff5d6
> 24/05/30 01:23:44 INFO ShutdownHookManager: Deleting directory
> /tmp/spark-91b984d6-62fe-4c6b-9996-36d6873ff5d6/pyspark-bcca58a9-38dc-4359-85d1-81b728d6cf82
>
>
> Best,
> Amin
>
>
>
> On Thu, May 23, 2024 at 4:20 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Could be a number of reasons
>>
>> First test reading the file with a cli
>>
>> # note: the AWS CLI uses the s3:// scheme (not s3a://); with a local Ceph
>> # endpoint you will likely need --endpoint-url
>> aws s3 cp s3://input/testfile.csv . --endpoint-url http://192.168.52.63:8000
>> cat testfile.csv
>>
>>
>> Try this code, which catches and prints exceptions, to help diagnose the
>> problem:
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.utils import AnalysisException
>>
>> try:
>>     # Initialize Spark session
>>     spark = SparkSession.builder \
>>         .appName("S3ReadTest") \
>>         .config("spark.jars.packages",
>> "org.apache.hadoop:hadoop-aws:3.3.6") \
>>         .config("spark.hadoop.fs.s3a.access.key", "R*************6") \
>>         .config("spark.hadoop.fs.s3a.secret.key", "1***************e") \
>>         .config("spark.hadoop.fs.s3a.endpoint", "192.168.52.63:8000") \
>>         .config("spark.hadoop.fs.s3a.path.style.access", "true") \
>>         .config("spark.hadoop.fs.s3a.impl",
>> "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>         .getOrCreate()
>>
>>     # Read the CSV file from S3
>>     df = spark.read \
>>         .option("header", "true") \
>>         .option("inferSchema", "true") \
>>         .option("delimiter", " ") \  # ensure this is apace
>>         .csv("s3a://input/testfile.csv")
>>
>>     # Show the data
>>     df.show(n=1)
>>
>> except AnalysisException as e:
>>     print(f"AnalysisException: {e}")
>> except Exception as e:
>>     print(f"Error: {e}")
>> finally:
>>     # Stop the Spark session
>>     spark.stop()
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but cannot be guaranteed. As with any advice, remember that "one
>> test result is worth one-thousand expert opinions" (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>
>>
>> On Thu, 23 May 2024 at 20:14, Amin Mosayyebzadeh <mosayyebza...@gmail.com>
>> wrote:
>>
>>> I am trying to read an S3 object from a local S3 storage (Ceph based)
>>> using Spark 3.5.1. I can see that it accesses the bucket and lists the
>>> files (I have verified this on the Ceph side by checking its logs), and
>>> it even returns the correct size of the object. But the content is not
>>> read.
>>>
>>> The object url is:
>>> s3a://input/testfile.csv (I have also tested a nested bucket:
>>> s3a://test1/test2/test3/testfile.csv)
>>>
>>>
>>> Object's content:
>>>
>>> =====================
>>> name int1 int2
>>> first 1 2
>>> second 3 4
>>> =====================
>>>
>>>
>>> Here is the config I have set so far:
>>>
>>> ("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.6")
>>> ("spark.hadoop.fs.s3a.access.key", "R*************6")
>>> ("spark.hadoop.fs.s3a.secret.key", "1***************e")
>>> ("spark.hadoop.fs.s3a.endpoint", "192.168.52.63:8000")
>>> ("spark.hadoop.fs.s3a.path.style.access", "true")
>>> ("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>>>
>>>
>>> The output of my following PySpark application:
>>> df = spark.read \
>>>     .option("header", "true") \
>>>     .schema(schema) \
>>>     .csv("s3a://input/testfile.csv", sep=' ')
>>>
>>> df.show(n=1)
>>> ==================================
>>> 24/05/20 02:35:00 INFO MetricsSystemImpl: s3a-file-system metrics system started
>>> 24/05/20 02:35:01 INFO MetadataLogFileIndex: Reading streaming file log from s3a://input/testfile.csv/_spark_metadata
>>> 24/05/20 02:35:01 INFO FileStreamSinkLog: BatchIds found from listing:
>>> 24/05/20 02:35:03 INFO FileSourceStrategy: Pushed Filters:
>>> 24/05/20 02:35:03 INFO FileSourceStrategy: Post-Scan Filters:
>>> 24/05/20 02:35:03 INFO CodeGenerator: Code generated in 176.139675 ms
>>> 24/05/20 02:35:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 496.6 KiB, free 4.1 GiB)
>>> 24/05/20 02:35:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 54.4 KiB, free 4.1 GiB)
>>> 24/05/20 02:35:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on master:38197 (size: 54.4 KiB, free: 4.1 GiB)
>>> 24/05/20 02:35:03 INFO SparkContext: Created broadcast 0 from showString at NativeMethodAccessorImpl.java:0
>>> 24/05/20 02:35:03 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
>>> +----+----+----+
>>> |name|int1|int2|
>>> +----+----+----+
>>> +----+----+----+
>>> 24/05/20 02:35:04 INFO SparkContext: Invoking stop() from shutdown hook
>>> 24/05/20 02:35:04 INFO SparkContext: SparkContext is stopping with exitCode 0
>>> =========================================
>>>
>>> Am I missing something here?
>>>
>>> P.S. I see OP_IS_DIRECTORY is set to 1. Is that the correct behavior?
>>>
>>>
>>> Thanks in advance!
>>>
>>>
