NFS is a simple option for this kind of usage, yes. But note that --files makes N copies of the data, one per executor; you may not want that for large data, or for data that you need to mutate.
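For readers following along, a minimal sketch of the --files approach under discussion; the master URL, file path, class name, and JAR name below are all hypothetical:

```shell
# Hypothetical invocation: ship a small local file to every executor's
# working directory -- this is the "N copies" behavior noted above.
spark-submit \
  --master spark://spark-server:7077 \
  --files /opt/data/lookup.csv \
  --class com.example.MyJob \
  my-job.jar
# Inside the job, executors resolve their shipped copy with
# SparkFiles.get("lookup.csv") rather than a hard-coded path.
```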
On Wed, Nov 25, 2020 at 9:16 PM Artemis User <[email protected]> wrote:

> Ah, I almost forgot that there is an even easier solution for your
> problem, namely to use the --files option in spark-submit. Usage as
> follows:
>
>   --files FILES   Comma-separated list of files to be placed in the
>                   working directory of each executor. File paths of
>                   these files in executors can be accessed via
>                   SparkFiles.get(fileName).
>
> -- ND
>
> On 11/25/20 9:51 PM, Artemis User wrote:
>
>> This is a typical file-sharing problem in Spark. Just setting up HDFS
>> won't solve the problem unless you make your local machine part of the
>> cluster. The Spark server doesn't share files with your local machine
>> without mounting drives to each other. The best/easiest way to share
>> data between your local machine and the Spark server is to use NFS (as
>> the Spark manual suggests). You can use a common NFS server and mount
>> the /opt/data drive on both the local and the server machine, or run
>> NFS on either machine and mount /opt/data on the other. Regardless,
>> you have to ensure that /opt/data on both the local and the server
>> machine points to the same physical drive. Also don't forget to relax
>> the read/write permissions for all on the drive, or map the user IDs
>> on both machines.
>>
>> Using FUSE may be an option on Mac, but NFS is the standard solution
>> for this type of problem (Mac supports NFS as well).
>>
>> -- ND
>>
>> On 11/25/20 10:34 AM, Ryan Victory wrote:
>>
>>> A key part of what I'm trying to do involves NOT having to bring the
>>> data "through" the driver in order to get the cluster to work on it
>>> (which would involve a network hop from server to laptop and another
>>> from laptop to server). I'd rather have the data stay on the server
>>> and the driver stay on my laptop if possible, but I'm guessing the
>>> Spark APIs/topology weren't designed that way.
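As a concrete sketch of the NFS suggestion above (the hostname spark-server and the export options are assumptions; adjust for your environment):

```shell
# On the server (assumed hostname spark-server): export /opt/data over NFS.
# Assumed /etc/exports entry -- relax or tighten options to taste:
#   /opt/data  *(rw,sync,no_subtree_check)
sudo exportfs -ra

# On the laptop: mount the export at the SAME path, so that /opt/data
# resolves to the same physical data on both machines.
sudo mkdir -p /opt/data
sudo mount -t nfs spark-server:/opt/data /opt/data
```

Mounting at an identical path on both machines matters because Spark resolves bare filesystem paths wherever the code happens to run, driver or executor.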
>>> What I was hoping for was some way to be able to say
>>>
>>>   val df = spark.sql("SELECT * FROM parquet.`local:///opt/data/transactions.parquet`")
>>>
>>> or similar to convince Spark to not move the data. I'd imagine if I
>>> used HDFS, data locality would kick in anyway to prevent the network
>>> shuffles between the driver and the cluster, but even then I wonder
>>> (based on what you guys are saying) if I'm wrong.
>>>
>>> Perhaps I'll just have to modify the workflow to move the JAR to the
>>> server and execute it from there. This isn't ideal, but it's better
>>> than nothing.
>>>
>>> -Ryan
>>>
>>> On Wed, Nov 25, 2020 at 9:13 AM Chris Coutinho <[email protected]> wrote:
>>>
>>>> I'm also curious if this is possible, so while I can't offer a
>>>> solution, maybe you could try the following.
>>>>
>>>> The driver and executor nodes need to have access to the same
>>>> (distributed) file system, so you could try to mount the file system
>>>> on your laptop, locally, and then try to submit jobs and/or use the
>>>> spark-shell while connected to the same system.
>>>>
>>>> A quick Google search led me to this article, where someone shows how
>>>> to mount an HDFS locally. It appears that Cloudera supports some kind
>>>> of FUSE-based library, which may be useful for your use case.
>>>>
>>>> https://idata.co.il/2018/10/how-to-connect-hdfs-to-local-filesystem/
>>>>
>>>> On Wed, 2020-11-25 at 08:51 -0600, Ryan Victory wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I have been tearing my hair out trying to solve this problem. Here
>>>>> is my setup:
>>>>>
>>>>> 1. I have Spark running on a server in standalone mode, with data on
>>>>>    the filesystem of the server itself (/opt/data/).
>>>>> 2. I have an instance of a Hive Metastore server running (backed by
>>>>>    MariaDB) on the same server.
>>>>> 3. I have a laptop where I am developing my Spark jobs (Scala).
>>>>>
>>>>> I have configured Spark to use the metastore and set the warehouse
>>>>> directory to /opt/data/warehouse/. What I am trying to accomplish is
>>>>> a couple of things:
>>>>>
>>>>> 1. I am trying to submit Spark jobs (via JARs) using spark-submit,
>>>>>    but have the driver run on my local machine (my laptop). I want
>>>>>    the jobs to use the data ON THE SERVER and not try to reference
>>>>>    it from my local machine. If I do something like this:
>>>>>
>>>>>      val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")
>>>>>
>>>>>    I get an error that the path doesn't exist (because it's trying
>>>>>    to find it on my laptop). If I run the same thing in a
>>>>>    spark-shell on the Spark server itself, there isn't an issue,
>>>>>    because the driver has access to the data. If I submit the job
>>>>>    with --deploy-mode cluster then it works too, because the driver
>>>>>    is on the cluster. I don't want this; I want to get the results
>>>>>    on my laptop.
>>>>>
>>>>>    How can I force Spark to read the data from the cluster's
>>>>>    filesystem and not the driver's?
>>>>>
>>>>> 2. I have set up a Hive Metastore and created a table (in the spark
>>>>>    shell on the Spark server itself). The data in the warehouse is
>>>>>    in the local filesystem. When I create a Spark application JAR
>>>>>    and try to run it from my laptop, I get the same problem as #1,
>>>>>    namely that it tries to find the warehouse directory on my laptop
>>>>>    itself.
>>>>>
>>>>> Am I crazy? Perhaps this isn't a supported way to use Spark? Any
>>>>> help or insights are much appreciated!
>>>>>
>>>>> -Ryan Victory
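One way to get the behavior asked for in point 1, without moving the driver, is to reference the data by a URI whose scheme is resolved on the cluster side rather than against the driver's local filesystem. A sketch, assuming an HDFS namenode runs on the server at port 9000; the hostname, port, path, class, and JAR names are all assumptions:

```shell
# From the laptop: connect to the standalone master, but read the data
# through an hdfs:// URI so the path is not treated as a driver-local file.
spark-shell --master spark://spark-server:7077 <<'EOF'
val df = spark.sql(
  "SELECT * FROM parquet.`hdfs://spark-server:9000/data/transactions.parquet`")
df.show()
EOF

# Alternatively, run the driver on the cluster itself -- the workaround
# mentioned above. Results then land on the server, not the laptop:
spark-submit --master spark://spark-server:7077 \
  --deploy-mode cluster --class com.example.MyJob my-job.jar
```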
