Hi Mich,

Thank you for the help, and sorry about the late reply. I ran the code you provided, but the result is the same: the DataFrame comes back empty and the job finishes with "exitCode 0". Here is the complete output:
===============================
24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0
24/05/30 01:23:38 INFO SparkContext: OS info Linux, 5.4.0-182-generic, amd64
24/05/30 01:23:38 INFO SparkContext: Java version 11.0.22
24/05/30 01:23:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/05/30 01:23:38 INFO ResourceUtils: ==============================================================
24/05/30 01:23:38 INFO ResourceUtils: No custom resources configured for spark.driver.
24/05/30 01:23:38 INFO ResourceUtils: ==============================================================
24/05/30 01:23:38 INFO SparkContext: Submitted application: S3ReadTest
24/05/30 01:23:38 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/05/30 01:23:38 INFO ResourceProfile: Limiting resource is cpu
24/05/30 01:23:38 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/05/30 01:23:38 INFO SecurityManager: Changing view acls to: ubuntu
24/05/30 01:23:38 INFO SecurityManager: Changing modify acls to: ubuntu
24/05/30 01:23:38 INFO SecurityManager: Changing view acls groups to:
24/05/30 01:23:38 INFO SecurityManager: Changing modify acls groups to:
24/05/30 01:23:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: ubuntu; groups with view permissions: EMPTY; users with modify permissions: ubuntu; groups with modify permissions: EMPTY
24/05/30 01:23:38 INFO Utils: Successfully started service 'sparkDriver' on port 46321.
24/05/30 01:23:38 INFO SparkEnv: Registering MapOutputTracker
24/05/30 01:23:38 INFO SparkEnv: Registering BlockManagerMaster
24/05/30 01:23:38 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/05/30 01:23:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
24/05/30 01:23:38 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
24/05/30 01:23:38 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-a1fc37d5-885a-4ed0-b8f2-4eeb930c69ee
24/05/30 01:23:38 INFO MemoryStore: MemoryStore started with capacity 2.8 GiB
24/05/30 01:23:38 INFO SparkEnv: Registering OutputCommitCoordinator
24/05/30 01:23:39 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
24/05/30 01:23:39 INFO Utils: Successfully started service 'SparkUI' on port 4040.
24/05/30 01:23:39 INFO Executor: Starting executor ID driver on host MOC-R4PAC08U33-S1C
24/05/30 01:23:39 INFO Executor: OS info Linux, 5.4.0-182-generic, amd64
24/05/30 01:23:39 INFO Executor: Java version 11.0.22
24/05/30 01:23:39 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
24/05/30 01:23:39 INFO Executor: Created or updated repl class loader org.apache.spark.util.MutableURLClassLoader@a45f4d6 for default.
24/05/30 01:23:39 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39343.
24/05/30 01:23:39 INFO NettyBlockTransferService: Server created on MOC-R4PAC08U33-S1C:39343
24/05/30 01:23:39 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
24/05/30 01:23:39 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, MOC-R4PAC08U33-S1C, 39343, None)
24/05/30 01:23:39 INFO BlockManagerMasterEndpoint: Registering block manager MOC-R4PAC08U33-S1C:39343 with 2.8 GiB RAM, BlockManagerId(driver, MOC-R4PAC08U33-S1C, 39343, None)
24/05/30 01:23:39 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, MOC-R4PAC08U33-S1C, 39343, None)
24/05/30 01:23:39 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, MOC-R4PAC08U33-S1C, 39343, None)
24/05/30 01:23:39 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
24/05/30 01:23:39 INFO SharedState: Warehouse path is 'file:/home/ubuntu/tpch-spark/spark-warehouse'.
24/05/30 01:23:40 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
24/05/30 01:23:40 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
24/05/30 01:23:40 INFO MetricsSystemImpl: s3a-file-system metrics system started
24/05/30 01:23:41 INFO MetadataLogFileIndex: Reading streaming file log from s3a://test-bucket/testfile.csv/_spark_metadata
24/05/30 01:23:41 INFO FileStreamSinkLog: BatchIds found from listing:
24/05/30 01:23:43 INFO FileSourceStrategy: Pushed Filters:
24/05/30 01:23:43 INFO FileSourceStrategy: Post-Scan Filters:
24/05/30 01:23:43 INFO CodeGenerator: Code generated in 188.932153 ms
24/05/30 01:23:43 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 201.3 KiB, free 2.8 GiB)
24/05/30 01:23:43 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 34.8 KiB, free 2.8 GiB)
24/05/30 01:23:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on MOC-R4PAC08U33-S1C:39343 (size: 34.8 KiB, free: 2.8 GiB)
24/05/30 01:23:43 INFO SparkContext: Created broadcast 0 from showString at NativeMethodAccessorImpl.java:0
24/05/30 01:23:43 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
+----+----+----+
|name|int1|int2|
+----+----+----+
+----+----+----+
24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/05/30 01:23:43 INFO SparkUI: Stopped Spark web UI at http://MOC-R4PAC08U33-S1C:4040
24/05/30 01:23:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/05/30 01:23:43 INFO MemoryStore: MemoryStore cleared
24/05/30 01:23:43 INFO BlockManager: BlockManager stopped
24/05/30 01:23:43 INFO BlockManagerMaster: BlockManagerMaster stopped
24/05/30 01:23:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/05/30 01:23:43 INFO SparkContext: Successfully stopped SparkContext
24/05/30 01:23:44 INFO ShutdownHookManager: Shutdown hook called
24/05/30 01:23:44 INFO ShutdownHookManager: Deleting directory /tmp/spark-f26cd915-aeb6-4efc-8960-56ca51ac1a7d
24/05/30 01:23:44 INFO ShutdownHookManager: Deleting directory /tmp/spark-91b984d6-62fe-4c6b-9996-36d6873ff5d6
24/05/30 01:23:44 INFO ShutdownHookManager: Deleting directory /tmp/spark-91b984d6-62fe-4c6b-9996-36d6873ff5d6/pyspark-bcca58a9-38dc-4359-85d1-81b728d6cf82
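The line that stands out to me is "MetadataLogFileIndex: Reading streaming file log from s3a://test-bucket/testfile.csv/_spark_metadata". As far as I understand, Spark is treating the path as the output directory of a structured-streaming sink (I believe it checks for a _spark_metadata child under the path) rather than as a plain CSV object, and since "FileStreamSinkLog: BatchIds found from listing:" comes back empty, the scan presumably returns no rows. As a next step I put together the small check below. It is only a sketch on my side: it goes through PySpark's internal py4j gateway (sc._jvm / sc._jsc) to ask the Hadoop S3A filesystem directly whether it sees the key as a file or a directory, and it assumes the same masked credentials and endpoint as the earlier config (the app name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("S3PathCheck") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.6") \
    .config("spark.hadoop.fs.s3a.access.key", "R*************6") \
    .config("spark.hadoop.fs.s3a.secret.key", "1***************e") \
    .config("spark.hadoop.fs.s3a.endpoint", "192.168.52.63:8000") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()

sc = spark.sparkContext

# Ask the S3A FileSystem directly what it thinks the key is
# (uses PySpark's internal _jvm/_jsc handles into the JVM)
path = sc._jvm.org.apache.hadoop.fs.Path("s3a://test-bucket/testfile.csv")
fs = path.getFileSystem(sc._jsc.hadoopConfiguration())

status = fs.getFileStatus(path)
print("isDirectory:", status.isDirectory(), "length:", status.getLen())

# If S3A reports a directory, list what it believes is inside it
if status.isDirectory():
    for child in fs.listStatus(path):
        print(child.getPath().toString(), child.isDirectory(), child.getLen())

# Also pull the raw bytes with the text source, bypassing CSV parsing
spark.read.text("s3a://test-bucket/testfile.csv").show(truncate=False)

spark.stop()

If getFileStatus reports isDirectory = true for a key that should be a plain object (which would line up with the OP_IS_DIRECTORY=1 I mentioned earlier), that would point at how Ceph/RGW presents the key to S3A rather than at the CSV options themselves.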
Best,
Amin

On Thu, May 23, 2024 at 4:20 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Could be a number of reasons.
>
> First, test reading the file with the CLI:
>
> aws s3 cp s3a://input/testfile.csv .
> cat testfile.csv
>
> Then try this code with a debug option to diagnose the problem:
>
> from pyspark.sql import SparkSession
> from pyspark.sql.utils import AnalysisException
>
> try:
>     # Initialize Spark session
>     spark = SparkSession.builder \
>         .appName("S3ReadTest") \
>         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.6") \
>         .config("spark.hadoop.fs.s3a.access.key", "R*************6") \
>         .config("spark.hadoop.fs.s3a.secret.key", "1***************e") \
>         .config("spark.hadoop.fs.s3a.endpoint", "192.168.52.63:8000") \
>         .config("spark.hadoop.fs.s3a.path.style.access", "true") \
>         .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>         .getOrCreate()
>
>     # Read the CSV file from S3
>     df = spark.read \
>         .option("header", "true") \
>         .option("inferSchema", "true") \
>         .option("delimiter", " ") \
>         .csv("s3a://input/testfile.csv")  # ensure the delimiter is a space
>
>     # Show the data
>     df.show(n=1)
>
> except AnalysisException as e:
>     print(f"AnalysisException: {e}")
> except Exception as e:
>     print(f"Error: {e}")
> finally:
>     # Stop the Spark session
>     spark.stop()
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Thu, 23 May 2024 at 20:14, Amin Mosayyebzadeh <mosayyebza...@gmail.com> wrote:
>
>> I am trying to read an S3 object from a local S3 storage (Ceph based)
>> using Spark 3.5.1. I see it can access the bucket and list the files (I
>> have verified it on the Ceph side by checking its logs), and it even
>> returns the correct size of the object. But the content is not read.
>>
>> The object URL is:
>> s3a://input/testfile.csv (I have also tested a nested bucket:
>> s3a://test1/test2/test3/testfile.csv)
>>
>> Object's content:
>>
>> =====================
>> name int1 int2
>> first 1 2
>> second 3 4
>> =====================
>>
>> Here is the config I have set so far:
>>
>> ("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.6")
>> ("spark.hadoop.fs.s3a.access.key", "R*************6")
>> ("spark.hadoop.fs.s3a.secret.key", "1***************e")
>> ("spark.hadoop.fs.s3a.endpoint", "192.168.52.63:8000")
>> ("spark.hadoop.fs.s3a.path.style.access", "true")
>> ("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>>
>> The output of the following PySpark application:
>>
>> df = spark.read \
>>     .option("header", "true") \
>>     .schema(schema) \
>>     .csv("s3a://input/testfile.csv", sep=' ')
>>
>> df.show(n=1)
>>
>> ==================================
>> 24/05/20 02:35:00 INFO MetricsSystemImpl: s3a-file-system metrics system started
>> 24/05/20 02:35:01 INFO MetadataLogFileIndex: Reading streaming file log from s3a://input/testfile.csv/_spark_metadata
>> 24/05/20 02:35:01 INFO FileStreamSinkLog: BatchIds found from listing:
>> 24/05/20 02:35:03 INFO FileSourceStrategy: Pushed Filters:
>> 24/05/20 02:35:03 INFO FileSourceStrategy: Post-Scan Filters:
>> 24/05/20 02:35:03 INFO CodeGenerator: Code generated in 176.139675 ms
>> 24/05/20 02:35:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 496.6 KiB, free 4.1 GiB)
>> 24/05/20 02:35:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 54.4 KiB, free 4.1 GiB)
>> 24/05/20 02:35:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on master:38197 (size: 54.4 KiB, free: 4.1 GiB)
>> 24/05/20 02:35:03 INFO SparkContext: Created broadcast 0 from showString at NativeMethodAccessorImpl.java:0
>> 24/05/20 02:35:03 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
>> +----+----+----+
>> |name|int1|int2|
>> +----+----+----+
>> +----+----+----+
>> 24/05/20 02:35:04 INFO SparkContext: Invoking stop() from shutdown hook
>> 24/05/20 02:35:04 INFO SparkContext: SparkContext is stopping with exitCode 0
>> =========================================
>>
>> Am I missing something here?
>>
>> P.S. I see OP_IS_DIRECTORY is set to 1. Is that the correct behavior?
>>
>> Thanks in advance!