So, can you try to simulate the same without sshfs? I.e., create a folder at /tmp/datashare, copy your files into it on all the machines, and point sc.textFile to that folder?
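Roughly, a sketch of that test (assuming the same two .txt files end up under /tmp/datashare on the driver and on every worker; the source path is the one from your example below):

// on each machine first, outside Spark, something like:
//   mkdir -p /tmp/datashare && cp /home/peter/datashare/*.txt /tmp/datashare/
// then, from the Spark shell:
val data = sc.textFile("file:///tmp/datashare/*.txt")
data.collect()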
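It might also help to see what the executor JVMs themselves find at that path. A rough diagnostic sketch (not something from this thread; it assumes the f1.txt path from your stack trace below, and simply reports, per executor host, whether the local file is visible):

import java.io.File
// run a handful of tasks so every executor gets some,
// and report (hostname, file visible?) from each one
sc.parallelize(1 to 12, 12).map { _ =>
  (java.net.InetAddress.getLocalHost.getHostName,
   new File("/home/peter/datashare/f1.txt").exists)
}.collect().distinct.foreach(println)

If a worker prints false even though the file looks present in a login shell, the sshfs mount may not be visible to the user or session the executor runs under.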
On Thu, Sep 8, 2016 at 11:26 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:

> All (three) of them. It's kind of cool-- when I re-run collect() a
> different executor will show up as first to encounter the error.
>
> On Wed, Sep 7, 2016 at 8:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> Is it happening on all executors or one?
>>
>> On Thu, Sep 8, 2016 at 10:46 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>>
>>> Yes indeed (see below). Just to reiterate, I am not running Hadoop.
>>> The "curly" node name mentioned in the stacktrace is the name of one of
>>> the worker nodes. I've mounted the same directory "datashare" with two
>>> text files to all worker nodes with sshfs. The Spark documentation
>>> suggests that this should work:
>>>
>>> *If using a path on the local filesystem, the file must also be
>>> accessible at the same path on worker nodes. Either copy the file to all
>>> workers or use a network-mounted shared file system.*
>>>
>>> I was hoping someone else could try this and see if it works.
>>>
>>> Here's what I did to generate the error:
>>>
>>> val data = sc.textFile("file:///home/peter/datashare/*.txt")
>>> data.collect()
>>>
>>> It's working to some extent because if I put a bogus path in, I'll get
>>> a different (correct) error (InvalidInputException: Input Pattern
>>> file:/home/peter/ddatashare/*.txt matches 0 files).
>>>
>>> Here's the stack trace when I use a valid path:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 1 in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage
>>> 18.0 (TID 792, curly): java.io.FileNotFoundException: File
>>> file:/home/peter/datashare/f1.txt does not exist
>>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
>>> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>>> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>>> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
>>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
>>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
>>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> On Wed, Sep 7, 2016 at 9:50 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>
>>>> What error do you get? FileNotFoundException?
>>>>
>>>> Please paste the stacktrace here.
>>>>
>>>> Yong
>>>>
>>>> ------------------------------
>>>> *From:* Peter Figliozzi <pete.figlio...@gmail.com>
>>>> *Sent:* Wednesday, September 7, 2016 10:18 AM
>>>> *To:* ayan guha
>>>> *Cc:* Lydia Ickler; user.spark
>>>> *Subject:* Re: distribute work (files)
>>>>
>>>> That's failing for me. Can someone please try this-- is this even
>>>> supposed to work:
>>>>
>>>> - create a directory somewhere and add two text files to it
>>>> - mount that directory on the Spark worker machines with sshfs
>>>> - read the text files into one data structure using a file URL with
>>>>   a wildcard
>>>>
>>>> Thanks,
>>>>
>>>> Pete
>>>>
>>>> On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> To access a local file, try a file:// URI.
>>>>>
>>>>> On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>>>>>
>>>>>> This is a great question. Basically you don't have to worry about
>>>>>> the details-- just give a wildcard in your call to textFile. See
>>>>>> the Programming Guide
>>>>>> <http://spark.apache.org/docs/latest/programming-guide.html> section
>>>>>> entitled "External Datasets". The Spark framework will distribute your
>>>>>> data across the workers. Note that:
>>>>>>
>>>>>> *If using a path on the local filesystem, the file must also be
>>>>>> accessible at the same path on worker nodes. Either copy the file to all
>>>>>> workers or use a network-mounted shared file system.*
>>>>>>
>>>>>> In your case this would mean the directory of files.
>>>>>>
>>>>>> Curiously, I cannot get this to work when I mount a directory with
>>>>>> sshfs on all of my worker nodes. It says "file not found" even
>>>>>> though the file clearly exists in the specified path on all workers.
>>>>>> Anyone care to try and comment on this?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Pete
>>>>>>
>>>>>> On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <ickle...@googlemail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> maybe this is a stupid question:
>>>>>>>
>>>>>>> I have a list of files. Each file I want to take as an input for an
>>>>>>> ML algorithm. All files are independent from one another.
>>>>>>> My question now is how do I distribute the work so that each worker
>>>>>>> takes a block of files and just runs the algorithm on them one by one.
>>>>>>> I hope somebody can point me in the right direction! :)
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Lydia

--
Best Regards,
Ayan Guha