Hello! I have been tearing my hair out trying to solve this problem. Here is my setup:
1. I have Spark running on a server in standalone mode, with data on the filesystem of the server itself (/opt/data/).
2. I have a Hive Metastore server (backed by MariaDB) running on the same server.
3. I have a laptop where I develop my Spark jobs (Scala).

I have configured Spark to use the metastore and set the warehouse directory to /opt/data/warehouse/. What I am trying to accomplish is a couple of things:

1. I am trying to submit Spark jobs (as JARs) via spark-submit, but have the driver run on my local machine (my laptop). I want the jobs to use the data ON THE SERVER and not try to reference it from my local machine. If I do something like this:

    val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")

I get an error that the path doesn't exist (because Spark is trying to resolve it on my laptop). If I run the same thing in a spark-shell on the Spark server itself, there is no issue, because the driver has access to the data. If I submit the job with --deploy-mode cluster it works too, because the driver then runs on the cluster. But I don't want that; I want the results back on my laptop. How can I force Spark to read the data from the cluster's filesystem and not the driver's?

2. I have set up a Hive Metastore and created a table (from the spark-shell on the Spark server itself). The warehouse data lives on the server's local filesystem. When I build a Spark application JAR and run it from my laptop, I hit the same problem as #1, namely that it tries to find the warehouse directory on my laptop itself.

Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or insights are much appreciated!

-Ryan Victory
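
P.S. In case it helps, here is roughly how I am building the SparkSession in the JAR I submit from my laptop. The hostnames, ports, and app/object names below are placeholders, and I may end up moving some of these settings into spark-defaults.conf or hive-site.xml instead of the builder:

    import org.apache.spark.sql.SparkSession

    object TransactionsJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("TransactionsJob")
          // Standalone master running on the server (placeholder host/port)
          .master("spark://my-server:7077")
          // Warehouse directory that only exists on the server's filesystem
          .config("spark.sql.warehouse.dir", "/opt/data/warehouse")
          // Metastore service on the server (placeholder host/port)
          .config("hive.metastore.uris", "thrift://my-server:9083")
          .enableHiveSupport()
          .getOrCreate()

        // This is the read that fails when the driver runs on my laptop,
        // since the unqualified path gets resolved on the driver's local filesystem.
        val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")
        df.show()

        spark.stop()
      }
    }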