I run my Spark jobs in GCP with Google Dataproc using GCS buckets.
I've not used AWS, but its EMR product offers similar functionality to
Dataproc. The title of your post implies your Spark cluster runs on EKS.
You might be better off using EMR; see the links below:
EMR: https://medium.com/big-data-on-amazon-elastic-mapreduce/run-a-spark-job-within-amazon-emr-in-15-minutes-68b02af1ae16
EKS: https://medium.com/@vikas.navlani/running-spark-on-aws-eks-1cd4c31786c
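That said, whichever platform you end up on, a JAR in S3 can usually be passed to spark-submit directly with the s3a:// scheme, as long as the hadoop-aws module and credentials are available. I haven't tried this myself (I work with GCS rather than S3), so treat the following as a rough sketch; the bucket, paths, and class names are placeholders:

    // Submission side (placeholder names), assuming hadoop-aws is on the classpath:
    //   spark-submit --class com.example.Main \
    //     --packages org.apache.hadoop:hadoop-aws:3.3.4 \
    //     s3a://my-bucket/jars/my-app.jar
    //
    // Application side: supplying S3A credentials through the Hadoop
    // configuration (an IAM role / instance profile is preferable in
    // production to passing keys explicitly):
    import org.apache.spark.sql.SparkSession

    object S3JarSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("s3a-access-sketch")
          // Standard Hadoop S3A settings; values pulled from the environment here.
          .config("spark.hadoop.fs.s3a.access.key",
                  sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
          .config("spark.hadoop.fs.s3a.secret.key",
                  sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
          .getOrCreate()

        // Read something from the same bucket just to confirm S3A access works.
        spark.read.text("s3a://my-bucket/some-input.txt").show(5)

        spark.stop()
      }
    }

On the performance question, the JAR is only fetched from remote storage once at application startup, so in my experience (with GCS, at least) that overhead is negligible compared with the job itself.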
Richard
On 19/02/2024 13:36, Jagannath Majhi wrote:
Dear Spark Community,
I hope this email finds you well. I am reaching out to seek assistance
and guidance regarding a task I'm currently working on involving
Apache Spark.
I have developed a JAR file that contains some Spark applications and
functionality, and I need to run this JAR file within a Spark cluster.
However, the JAR file is located in an AWS S3 bucket. I'm facing some
challenges in configuring Spark to access and execute this JAR file
directly from the S3 bucket.
I would greatly appreciate any advice, best practices, or pointers on
how to achieve this integration effectively. Specifically, I'm looking
for insights on:
1. Configuring Spark to access and retrieve the JAR file from an AWS
S3 bucket.
2. Setting up the necessary permissions and authentication mechanisms
to ensure seamless access to the S3 bucket.
3. Any potential performance considerations or optimizations when
running Spark applications with dependencies stored in remote
storage like AWS S3.
If anyone in the community has prior experience or knowledge in this
area, I would be extremely grateful for your guidance. Additionally,
if there are any relevant resources, documentation, or tutorials that
you could recommend, it would be incredibly helpful.
Thank you very much for considering my request. I look forward to
hearing from you and benefiting from the collective expertise of the
Spark community.
Best regards, Jagannath Majhi