I'm also curious whether this is possible, so while I can't offer a definitive solution, maybe you could try the following.
The driver and executor nodes need access to the same (distributed) file system, so you could try to mount that file system locally on your laptop and then submit jobs and/or use the spark-shell while connected to the same system. A quick Google search turned up this article, where someone shows how to mount HDFS locally; it appears that Cloudera supports some kind of FUSE-based library, which may be useful for your use case:

https://idata.co.il/2018/10/how-to-connect-hdfs-to-local-filesystem/

There's a rough sketch of what that might look like after the quoted message below.

On Wed, 2020-11-25 at 08:51 -0600, Ryan Victory wrote:
> Hello!
>
> I have been tearing my hair out trying to solve this problem. Here is
> my setup:
>
> 1. I have Spark running on a server in standalone mode with data on
>    the filesystem of the server itself (/opt/data/).
> 2. I have an instance of a Hive Metastore server running (backed by
>    MariaDB) on the same server.
> 3. I have a laptop where I am developing my Spark jobs (Scala).
>
> I have configured Spark to use the metastore and set the warehouse
> directory to /opt/data/warehouse/. I am trying to accomplish a couple
> of things:
>
> 1. I am trying to submit Spark jobs (via JARs) using spark-submit,
>    but have the driver run on my local machine (my laptop). I want the
>    jobs to use the data ON THE SERVER and not try to reference it from
>    my local machine. If I do something like this:
>
>      val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")
>
>    I get an error that the path doesn't exist (because Spark is trying
>    to find it on my laptop). If I run the same thing in a spark-shell
>    on the Spark server itself, there isn't an issue, because the driver
>    has access to the data. If I submit the job with --deploy-mode
>    cluster, it works too, because the driver is on the cluster. But I
>    don't want that; I want to get the results on my laptop.
>
>    How can I force Spark to read the data from the cluster's
>    filesystem and not the driver's?
>
> 2. I have set up a Hive Metastore and created a table (in the
>    spark-shell on the Spark server itself). The data in the warehouse
>    is on the server's local filesystem. When I create a Spark
>    application JAR and try to run it from my laptop, I get the same
>    problem as in #1, namely that it tries to find the warehouse
>    directory on my laptop itself.
>
> Am I crazy? Perhaps this isn't a supported way to use Spark? Any help
> or insights are much appreciated!
>
> -Ryan Victory
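For reference, here is a rough sketch of the idea, with plenty of assumptions baked in: the mount command mentioned in the comments is the hadoop-fuse-dfs wrapper that ships with the Hadoop FUSE packages (Cloudera packages it as hadoop-hdfs-fuse), the namenode host/port are placeholders you'd replace with your own, and I'm assuming the mount point on the laptop mirrors the server's /opt/data path. Once the server's data is visible at the same path locally, a spark-shell on the laptop should be able to resolve the example path on the driver side:

    // Hypothetical sanity check from a local spark-shell, assuming the
    // server's file system has been FUSE-mounted at /opt/data on the laptop
    // (e.g. something like: hadoop-fuse-dfs dfs://<namenode>:<port> /opt/data).
    // `spark` is the SparkSession that spark-shell creates for you.
    val df = spark.read.parquet("/opt/data/transactions.parquet")
    df.printSchema()     // the driver reads the Parquet schema from the local mount
    println(df.count())  // executors on the server read the same path on their side

The key point is just that every node that touches the data, driver included, has to be able to resolve the same path, whether through a shared mount like this or a proper distributed file system URI.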