NFS is a simple option for this kind of usage, yes. But note that --files makes N copies of the data, one per executor; you may not want that for large data, or for data that you need to mutate.
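For readers following along, a minimal sketch of the --files approach under discussion; the master URL, file path, class name, and JAR name below are all hypothetical:

```shell
# Hypothetical invocation: ship a small local file to every executor's
# working directory -- this is the "N copies" behavior noted above.
spark-submit \
  --master spark://spark-server:7077 \
  --files /opt/data/lookup.csv \
  --class com.example.MyJob \
  my-job.jar
# Inside the job, executors resolve their shipped copy with
# SparkFiles.get("lookup.csv") rather than a hard-coded path.
```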
On Wed, Nov 25, 2020 at 9:16 PM Artemis User <[email protected]> wrote:

> Ah, I almost forgot that there is an even easier solution for your
> problem, namely to use the --files option in spark-submit. Usage as
> follows:
>
>   --files FILES   Comma-separated list of files to be placed in the
>                   working directory of each executor. File paths of
>                   these files in executors can be accessed via
>                   SparkFiles.get(fileName).
>
> -- ND
>
> On 11/25/20 9:51 PM, Artemis User wrote:
>
>> This is a typical file-sharing problem in Spark. Just setting up HDFS
>> won't solve the problem unless you make your local machine part of the
>> cluster. The Spark server doesn't share files with your local machine
>> without mounting drives to each other. The best/easiest way to share
>> data between your local machine and the Spark server is to use NFS (as
>> the Spark manual suggests). You can use a common NFS server and mount
>> the /opt/data drive on both the local and the server machine, or run
>> NFS on either machine and mount /opt/data on the other. Regardless,
>> you have to ensure that /opt/data on both the local and the server
>> machine points to the same physical drive. Also don't forget to relax
>> the read/write permissions for all on the drive, or map the user IDs
>> on both machines.
>>
>> Using FUSE may be an option on Mac, but NFS is the standard solution
>> for this type of problem (Mac supports NFS as well).
>>
>> -- ND
>>
>> On 11/25/20 10:34 AM, Ryan Victory wrote:
>>
>>> A key part of what I'm trying to do involves NOT having to bring the
>>> data "through" the driver in order to get the cluster to work on it
>>> (which would involve a network hop from server to laptop and another
>>> from laptop to server). I'd rather have the data stay on the server
>>> and the driver stay on my laptop if possible, but I'm guessing the
>>> Spark APIs/topology weren't designed that way.
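As a concrete sketch of the NFS suggestion above (the hostname spark-server and the export options are assumptions; adjust for your environment):

```shell
# On the server (assumed hostname spark-server): export /opt/data over NFS.
# Assumed /etc/exports entry -- relax or tighten options to taste:
#   /opt/data  *(rw,sync,no_subtree_check)
sudo exportfs -ra

# On the laptop: mount the export at the SAME path, so that /opt/data
# resolves to the same physical data on both machines.
sudo mkdir -p /opt/data
sudo mount -t nfs spark-server:/opt/data /opt/data
```

Mounting at an identical path on both machines matters because Spark resolves bare filesystem paths wherever the code happens to run, driver or executor.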
>>> What I was hoping for was some way to be able to say
>>>
>>>   val df = spark.sql("SELECT * FROM parquet.`local:///opt/data/transactions.parquet`")
>>>
>>> or similar to convince Spark to not move the data. I'd imagine if I
>>> used HDFS, data locality would kick in anyway to prevent the network
>>> shuffles between the driver and the cluster, but even then I wonder
>>> (based on what you guys are saying) if I'm wrong.
>>>
>>> Perhaps I'll just have to modify the workflow to move the JAR to the
>>> server and execute it from there. This isn't ideal, but it's better
>>> than nothing.
>>>
>>> -Ryan
>>>
>>> On Wed, Nov 25, 2020 at 9:13 AM Chris Coutinho <[email protected]> wrote:
>>>
>>>> I'm also curious if this is possible, so while I can't offer a
>>>> solution, maybe you could try the following.
>>>>
>>>> The driver and executor nodes need to have access to the same
>>>> (distributed) file system, so you could try to mount the file system
>>>> on your laptop, locally, and then try to submit jobs and/or use the
>>>> spark-shell while connected to the same system.
>>>>
>>>> A quick Google search led me to this article, where someone shows how
>>>> to mount an HDFS locally. It appears that Cloudera supports some kind
>>>> of FUSE-based library, which may be useful for your use case.
>>>>
>>>> https://idata.co.il/2018/10/how-to-connect-hdfs-to-local-filesystem/
>>>>
>>>> On Wed, 2020-11-25 at 08:51 -0600, Ryan Victory wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I have been tearing my hair out trying to solve this problem. Here
>>>>> is my setup:
>>>>>
>>>>> 1. I have Spark running on a server in standalone mode, with data on
>>>>>    the filesystem of the server itself (/opt/data/).
>>>>> 2. I have an instance of a Hive Metastore server running (backed by
>>>>>    MariaDB) on the same server.
>>>>> 3. I have a laptop where I am developing my Spark jobs (Scala).
>>>>>
>>>>> I have configured Spark to use the metastore and set the warehouse
>>>>> directory to /opt/data/warehouse/. What I am trying to accomplish is
>>>>> a couple of things:
>>>>>
>>>>> 1. I am trying to submit Spark jobs (via JARs) using spark-submit,
>>>>>    but have the driver run on my local machine (my laptop). I want
>>>>>    the jobs to use the data ON THE SERVER and not try to reference
>>>>>    it from my local machine. If I do something like this:
>>>>>
>>>>>      val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")
>>>>>
>>>>>    I get an error that the path doesn't exist (because it's trying
>>>>>    to find it on my laptop). If I run the same thing in a
>>>>>    spark-shell on the Spark server itself, there isn't an issue,
>>>>>    because the driver has access to the data. If I submit the job
>>>>>    with --deploy-mode cluster then it works too, because the driver
>>>>>    is on the cluster. I don't want this; I want to get the results
>>>>>    on my laptop.
>>>>>
>>>>>    How can I force Spark to read the data from the cluster's
>>>>>    filesystem and not the driver's?
>>>>>
>>>>> 2. I have set up a Hive Metastore and created a table (in the spark
>>>>>    shell on the Spark server itself). The data in the warehouse is
>>>>>    in the local filesystem. When I create a Spark application JAR
>>>>>    and try to run it from my laptop, I get the same problem as #1,
>>>>>    namely that it tries to find the warehouse directory on my laptop
>>>>>    itself.
>>>>>
>>>>> Am I crazy? Perhaps this isn't a supported way to use Spark? Any
>>>>> help or insights are much appreciated!
>>>>>
>>>>> -Ryan Victory
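One way to get the behavior asked for in point 1, without moving the driver, is to reference the data by a URI whose scheme is resolved on the cluster side rather than against the driver's local filesystem. A sketch, assuming an HDFS namenode runs on the server at port 9000; the hostname, port, path, class, and JAR names are all assumptions:

```shell
# From the laptop: connect to the standalone master, but read the data
# through an hdfs:// URI so the path is not treated as a driver-local file.
spark-shell --master spark://spark-server:7077 <<'EOF'
val df = spark.sql(
  "SELECT * FROM parquet.`hdfs://spark-server:9000/data/transactions.parquet`")
df.show()
EOF

# Alternatively, run the driver on the cluster itself -- the workaround
# mentioned above. Results then land on the server, not the laptop:
spark-submit --master spark://spark-server:7077 \
  --deploy-mode cluster --class com.example.MyJob my-job.jar
```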
