Hello!

I have been tearing my hair out trying to solve this problem. Here is my
setup:

1. I have Spark running on a server in standalone mode with data on the
filesystem of the server itself (/opt/data/).
2. I have an instance of a Hive Metastore server running (backed by
MariaDB) on the same server.
3. I have a laptop where I am developing my Spark jobs (in Scala).

I have configured Spark to use the metastore and set the warehouse
directory to /opt/data/warehouse/.
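
Roughly, the SparkSession setup in my jobs looks like this (the host
name, ports, and app name below are placeholders, not my real values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("transactions-job")
  // standalone master running on the server
  .master("spark://spark-server:7077")
  // remote Hive Metastore on the server (thrift URI is a placeholder)
  .config("hive.metastore.uris", "thrift://spark-server:9083")
  // warehouse directory on the server's filesystem
  .config("spark.sql.warehouse.dir", "/opt/data/warehouse/")
  .enableHiveSupport()
  .getOrCreate()
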
What I am trying to accomplish is a couple of things:

1. I am trying to submit Spark jobs (via JARs) using spark-submit, but have
the driver run on my local machine (my laptop). I want the jobs to use the
data ON THE SERVER and not try to reference it from my local machine. If I
do something like this:

val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")

I get an error that the path doesn't exist (because Spark is trying to
find it on my laptop). If I run the same thing in a spark-shell on the
Spark server itself, there isn't an issue, because the driver has access
to the data. If I submit the job with --deploy-mode cluster, it works
too, because the driver runs on the cluster. But I don't want that; I
want to get the results back on my laptop.
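
For example, I'd like to be able to run something like this from the
laptop driver and see the output locally (the column name is just an
illustration):

// "merchant" is a made-up column name
df.groupBy("merchant").count().show()  // output should land on my laptop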

How can I force Spark to read the data from the cluster's filesystem and
not the driver's?

2. I have set up a Hive Metastore and created a table (in the spark-shell
on the Spark server itself). The data in the warehouse is on the server's
local filesystem. When I build a Spark application JAR and try to run it
from my laptop, I get the same problem as in #1, namely that it tries to
find the warehouse directory on my laptop itself.
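
Roughly, this is what I did and what fails; the table name and the query
are simplified examples, not my exact code:

// In spark-shell on the server: create a managed table in the warehouse
// ("transactions" is just an example table name)
spark.sql(
  "CREATE TABLE transactions USING parquet AS " +
  "SELECT * FROM parquet.`/opt/data/transactions.parquet`")

// From the JAR whose driver runs on my laptop, even a simple query fails,
// because Spark looks for the warehouse path on the laptop's filesystem:
val result = spark.sql("SELECT count(*) FROM transactions")
result.show()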

Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or
insights are much appreciated!

-Ryan Victory
