Re: Running the driver on a laptop but data is on the Spark server

Apostolos N. Papadopoulos Wed, 25 Nov 2020 06:56:45 -0800

Hi Ryan,

Since the driver runs on your laptop, I guess you need to specify the full URL of the remote file in order to access it.

For example, when I am using Spark over HDFS, I specify the file like hdfs://blablabla, which contains the URL where the namenode can answer. I believe that something similar must be done here.
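
A minimal sketch of what I mean, in Scala, assuming the data is reachable through HDFS (or some other shared filesystem) rather than only through the server's local disk. The namenode address namenode.example.com:8020 and the names RemotePaths and qualify are made up for illustration; they are not part of Spark.

```scala
import java.net.URI

object RemotePaths {
  // Hypothetical namenode address (an assumption, not from this thread);
  // replace it with the filesystem URL your cluster actually uses.
  val defaultFs: URI = new URI("hdfs://namenode.example.com:8020")

  /** Prefix an absolute path with the scheme and authority of `fs`,
    * so the path no longer depends on the driver's default filesystem. */
  def qualify(path: String, fs: URI = defaultFs): String = {
    require(path.startsWith("/"), s"expected an absolute path, got: $path")
    new URI(fs.getScheme, fs.getAuthority, path, null, null).toString
  }
}
```

With a SparkSession named spark, the qualified path would then be passed to the reader, e.g. spark.read.parquet(RemotePaths.qualify("/opt/data/transactions.parquet")). A bare path such as /opt/data/... is resolved against the driver's default filesystem, which on a laptop is normally the local disk, so spelling out the scheme and host removes that ambiguity.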

all the best,

Apostolos


On 25/11/20 16:51, Ryan Victory wrote:

Hello!
I have been tearing my hair out trying to solve this problem. Here is my setup:
1. I have Spark running on a server in standalone mode with data on the filesystem of the server itself (/opt/data/).
2. I have an instance of a Hive Metastore server running (backed by MariaDB) on the same server
3. I have a laptop where I am developing my spark jobs (Scala)
I have configured Spark to use the metastore and set the warehouse directory to be in /opt/data/warehouse/. What I am trying to accomplish are a couple of things:
1. I am trying to submit Spark jobs (via JARs) using spark-submit, but have the driver run on my local machine (my laptop). I want the jobs to use the data ON THE SERVER and not try to reference it from my local machine. If I do something like this:
val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")
I get an error that the path doesn't exist (because it's trying to find it on my laptop). If I run the same thing in a spark-shell on the spark server itself, there isn't an issue because the driver has access to the data. If I submit the job with submit-mode=cluster then it works too because the driver is on the cluster. I don't want this, I want to get the results on my laptop.
How can I force Spark to read the data from the cluster's filesystem and not the driver's?
2. I have setup a Hive Metastore and created a table (in the spark shell on the spark server itself). The data in the warehouse is in the local filesystem. When I create a spark application JAR and try to run it from my laptop, I get the same problem as #1, namely that it tries to find the warehouse directory on my laptop itself.
Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or insights are much appreciated!
-Ryan Victory


--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: [email protected]
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol


