You'd probably do best to ask that project, but from scanning the source
code, that looks like how it's meant to work: it downloads to a temp
file on the driver, copies that to distributed storage, and then returns
a DataFrame over the copy. I can't see how it could be implemented
directly over SFTP, as too many pieces would be missing - locality,
splitting into blocks, etc.
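To make the flow above concrete, here is a rough Python sketch of the staging pattern described (fetch to a driver-local temp file, then copy to storage every executor can read). The function name `stage_via_driver` and its parameters are illustrative, not the library's actual API; plain file copies stand in for the SFTP GET and the HDFS put.

```python
import shutil
import tempfile
from pathlib import Path


def stage_via_driver(fetch, filename, shared_dir):
    """Mimic the connector's apparent flow: fetch remote bytes into a
    driver-local temp file, then copy that file into storage that all
    executors can read, and return the path to hand to spark.read."""
    # Step 1: download happens only on the driver (stands in for SFTP GET).
    local_tmp = Path(tempfile.mkdtemp()) / filename
    local_tmp.write_bytes(fetch())
    # Step 2: copy to shared/distributed storage (stands in for an HDFS put).
    target = Path(shared_dir) / filename
    shutil.copy(local_tmp, target)
    # Step 3: the caller would now do spark.read...load(str(target)).
    return str(target)
```

One practical consequence of this design: the temp/staging locations must be on storage the workers can actually see; if they point at driver-local disk, the later read on the workers would fail in the way described below.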

On Wed, Jul 22, 2020 at 4:48 PM Mark Bidewell <mbide...@gmail.com> wrote:
>
> Sorry if this is the wrong place for this.  I am trying to debug an issue 
> with this library:
> https://github.com/springml/spark-sftp
>
> When I attempt to create a dataframe:
>
> spark.read.
>     format("com.springml.spark.sftp").
>     option("host", "...").
>     option("username", "...").
>     option("password", "...").
>     option("fileType", "csv").
>     option("inferSchema", "true").
>     option("tempLocation", "/srv/spark/tmp").
>     option("hdfsTempLocation", "/srv/spark/tmp").
>     load("...")
>
> What I am seeing is that the download occurs on the Spark driver, not 
> a Spark worker. This leads to a failure when Spark tries to create the 
> DataFrame on the worker.
>
> I'm confused by this behavior. My understanding was that load() was 
> lazily executed on the Spark workers. Why would some elements execute 
> on the driver?
>
> Thanks for your help
> --
> Mark Bidewell
> http://www.linkedin.com/in/markbidewell

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
