You'd probably do best to ask that project, but from a scan of the source code, that looks like how it's meant to work: it downloads to a temp file on the driver, then copies that file to distributed storage, then returns a DataFrame over the copy. I can't see how it could be implemented directly over SFTP, as too many pieces would be missing - locality, splitting into blocks, etc.
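To illustrate, the flow appears to be roughly the following. This is a hedged sketch of the pattern, not the library's actual code; the `sftpClient.download` call and all paths are illustrative placeholders, and it assumes a running `SparkSession` named `spark`:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// 1. On the driver: download the remote file to a local temp location.
//    `sftpClient.download` stands in for whatever SFTP client the library
//    actually uses - it is a placeholder, not a real API.
val localPath = "/srv/spark/tmp/data.csv"
sftpClient.download("remote/path/data.csv", localPath)

// 2. Still on the driver: copy the local file into distributed storage
//    (the hdfsTempLocation) so every executor can reach it.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.copyFromLocalFile(new Path(localPath), new Path("/srv/spark/tmp/data.csv"))

// 3. Return a DataFrame over the distributed copy. Only this step is
//    lazy and executed on the workers.
val df = spark.read
  .option("inferSchema", "true")
  .csv("hdfs:///srv/spark/tmp/data.csv")
```

If something like this is what the library does, steps 1 and 2 are plain driver-side I/O, so the download necessarily happens on the driver; only the final read of the distributed copy runs lazily on the workers.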
On Wed, Jul 22, 2020 at 4:48 PM Mark Bidewell <mbide...@gmail.com> wrote:
>
> Sorry if this is the wrong place for this. I am trying to debug an issue
> with this library:
> https://github.com/springml/spark-sftp
>
> When I attempt to create a dataframe:
>
>   spark.read.
>     format("com.springml.spark.sftp").
>     option("host", "...").
>     option("username", "...").
>     option("password", "...").
>     option("fileType", "csv").
>     option("inferSchema", "true").
>     option("tempLocation", "/srv/spark/tmp").
>     option("hdfsTempLocation", "/srv/spark/tmp").
>     load("...")
>
> what I am seeing is that the download is occurring on the Spark driver, not
> the Spark worker. This leads to a failure when Spark tries to create the
> DataFrame on the worker.
>
> I'm confused by this behavior. My understanding was that load() was lazily
> executed on the Spark worker. Why would some elements be executing on the
> driver?
>
> Thanks for your help
> --
> Mark Bidewell
> http://www.linkedin.com/in/markbidewell