Thanks Apostolos,

I'm trying to avoid standing up HDFS just for this use case (single node).

-Ryan

On Wed, Nov 25, 2020 at 8:56 AM Apostolos N. Papadopoulos <
papad...@csd.auth.gr> wrote:

> Hi Ryan,
>
> since the driver is on your laptop, to access a remote file you
> need to specify a full URL for it, I guess.
>
> For example, when I am using Spark over HDFS I specify the file like
> hdfs://blablabla, which contains the URL where the namenode can answer.
> I believe that something similar must be done here.
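>
> For instance, something along these lines (the namenode host and port
> are just placeholders for your own setup):
>
> // a fully-qualified URI makes the driver resolve the file against the
> // remote filesystem instead of its own local one
> val df = spark.read.parquet(
>   "hdfs://namenode-host:8020/opt/data/transactions.parquet")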
>
> all the best,
>
> Apostolos
>
>
> On 25/11/20 16:51, Ryan Victory wrote:
> > Hello!
> >
> > I have been tearing my hair out trying to solve this problem. Here is
> > my setup:
> >
> > 1. I have Spark running on a server in standalone mode with data on
> > the filesystem of the server itself (/opt/data/).
> > 2. I have an instance of a Hive Metastore server running (backed by
> > MariaDB) on the same server.
> > 3. I have a laptop where I am developing my Spark jobs (Scala).
> >
> > I have configured Spark to use the metastore and set the warehouse
> > directory to /opt/data/warehouse/. I am trying to accomplish a couple
> > of things:
> >
> > 1. I am trying to submit Spark jobs (via JARs) using spark-submit, but
> > have the driver run on my local machine (my laptop). I want the jobs
> > to use the data ON THE SERVER and not try to reference it from my
> > local machine. If I do something like this:
> >
> > val df = spark.sql("SELECT * FROM
> > parquet.`/opt/data/transactions.parquet`")
> >
> > I get an error that the path doesn't exist (because it's trying to
> > find it on my laptop). If I run the same thing in a spark-shell on the
> > spark server itself, there isn't an issue because the driver has
> > access to the data. If I submit the job with --deploy-mode cluster,
> > it works too because the driver is on the cluster. But I don't want
> > that; I want to get the results on my laptop.
> >
> > How can I force Spark to read the data from the cluster's filesystem
> > and not the driver's?
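> >
> > For reference, I'm launching the job roughly like this (the host and
> > class names here are placeholders for my actual ones):
> >
> > # client mode: the driver runs on my laptop, so /opt/data/ is
> > # resolved against the laptop's filesystem and the read fails
> > spark-submit --master spark://spark-server:7077 --deploy-mode client \
> >   --class com.example.TransactionJob transactions.jar
> >
> > # cluster mode: the driver runs on the cluster, so the path resolves,
> > # but then the results end up there instead of on my laptop
> > spark-submit --master spark://spark-server:7077 --deploy-mode cluster \
> >   --class com.example.TransactionJob transactions.jar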
> >
> > 2. I have set up a Hive Metastore and created a table (in the
> > spark-shell on the Spark server itself). The data in the warehouse is in the
> > local filesystem. When I create a spark application JAR and try to run
> > it from my laptop, I get the same problem as #1, namely that it tries
> > to find the warehouse directory on my laptop itself.
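> >
> > For what it's worth, the session in the JAR is built roughly like
> > this (the host names are placeholders for my actual server):
> >
> > import org.apache.spark.sql.SparkSession
> >
> > // point the session at the remote metastore; the warehouse dir is a
> > // server-local path, which the driver on my laptop then tries to open
> > val spark = SparkSession.builder()
> >   .appName("TransactionJob")
> >   .master("spark://spark-server:7077")
> >   .config("hive.metastore.uris", "thrift://spark-server:9083")
> >   .config("spark.sql.warehouse.dir", "/opt/data/warehouse")
> >   .enableHiveSupport()
> >   .getOrCreate()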
> >
> > Am I crazy? Perhaps this isn't a supported way to use Spark? Any help
> > or insights are much appreciated!
> >
> > -Ryan Victory
>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: +0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
