What error do you get? FileNotFoundException?
Please paste the stack trace here.

Yong

________________________________
From: Peter Figliozzi <pete.figlio...@gmail.com>
Sent: Wednesday, September 7, 2016 10:18 AM
To: ayan guha
Cc: Lydia Ickler; user.spark
Subject: Re: distribute work (files)

That's failing for me. Can someone please try this -- is this even supposed to work?

* Create a directory somewhere and add two text files to it.
* Mount that directory on the Spark worker machines with sshfs.
* Read the text files into one data structure using a file URL with a wildcard.

Thanks,
Pete

On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote:

To access a local file, try a file:// URI.

On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:

This is a great question. Basically you don't have to worry about the details -- just give a wildcard in your call to textFile. See the section entitled "External Datasets" in the Programming Guide (http://spark.apache.org/docs/latest/programming-guide.html). The Spark framework will distribute your data across the workers. Note that:

    If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

In your case this would mean the directory of files.

Curiously, I cannot get this to work when I mount a directory with sshfs on all of my worker nodes. It says "file not found" even though the file clearly exists at the specified path on all workers. Anyone care to try this and comment?

Thanks,
Pete

On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <ickle...@googlemail.com> wrote:

Hi,

maybe this is a stupid question:

I have a list of files. I want to use each file as input to an ML algorithm. All files are independent of one another.

My question now is: how do I distribute the work so that each worker takes a block of files and runs the algorithm on them one by one?

I hope somebody can point me in the right direction! :)

Best regards,
Lydia

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Best Regards,
Ayan Guha
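
For reference, a minimal sketch of the wildcard-plus-file:// approach Peter and ayan describe above. The directory /data/input is a hypothetical stand-in; per the Programming Guide note quoted in the thread, it must exist at the same path on the driver and on every worker (copied there or network-mounted).

import org.apache.spark.{SparkConf, SparkContext}

object WildcardRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wildcard-read"))

    // file:// URI plus a glob; Spark expands the wildcard and splits the
    // matching files across partitions.
    val lines = sc.textFile("file:///data/input/*.txt")
    println(s"total lines: ${lines.count()}")

    // wholeTextFiles yields (path, content) pairs, one per file, which is
    // useful when each file must be processed as a unit.
    val files = sc.wholeTextFiles("file:///data/input/*.txt")
    files.mapValues(_.length).collect().foreach(println)

    sc.stop()
  }
}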
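
One way to narrow down the sshfs "file not found" failure Peter reports is to check from the driver whether the executors themselves can see the mounted path. This diagnostic is not from the thread, and /data/input is again hypothetical:

// Runs a tiny task on (roughly) every executor and reports whether the
// mount is visible there; a mismatch points at a mount or permission
// problem on the workers rather than at Spark itself.
val visibility = sc.parallelize(1 to 1000, 100)
  .map { _ =>
    val host = java.net.InetAddress.getLocalHost.getHostName
    val seen = new java.io.File("/data/input").exists
    (host, seen)
  }
  .distinct()
  .collect()

visibility.foreach { case (host, seen) => println(s"$host -> $seen") }

One possible culprit worth checking: sshfs mounts are by default visible only to the user who created them, so if the Spark executors run as a different user, the files will appear missing unless the directory was mounted with -o allow_other.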
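
And for Lydia's original question, one common pattern is to parallelize the list of file names and let each task work through its block of files sequentially. A sketch under the same assumptions (paths visible at the same location on every worker; runAlgorithm is a hypothetical stand-in for the actual ML routine):

import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object DistributeFiles {
  // Hypothetical placeholder for the real per-file ML algorithm.
  def runAlgorithm(path: String): Double = {
    val source = Source.fromFile(path)
    try source.getLines().size.toDouble  // placeholder computation
    finally source.close()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distribute-files"))

    val fileList = Seq("/data/input/a.txt", "/data/input/b.txt")  // hypothetical paths

    // Parallelize the file *names*; each partition is a block of files
    // that one task processes one by one, on whichever worker it runs.
    val results = sc.parallelize(fileList, numSlices = 4)
      .map(path => (path, runAlgorithm(path)))
      .collect()

    results.foreach { case (path, score) => println(s"$path -> $score") }
    sc.stop()
  }
}

Since the files are independent, there is no shuffle here; raising numSlices beyond the number of files just leaves some partitions empty.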