Hi,

Is it happening on all executors, or just one?
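One quick way to narrow it down is to ask every executor whether it can actually read the file. A rough, untested sketch (the path is taken from your stack trace; bump the partition count if you have more cores so every executor gets a task):

import java.net.InetAddress

// Spread a trivial job over many partitions so every executor runs at least
// one task, then have each task report its hostname and whether the file is
// readable from that JVM.
val path = "/home/peter/datashare/f1.txt"

val visibility = sc.parallelize(1 to 1000, 100)
  .map { _ =>
    val host = InetAddress.getLocalHost.getHostName
    (host, new java.io.File(path).canRead)
  }
  .distinct()
  .collect()

visibility.foreach { case (host, ok) =>
  println(s"$host -> ${if (ok) "readable" else "NOT readable"}")
}

If some host comes back NOT readable, that points at the mount rather than at Spark.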
On Thu, Sep 8, 2016 at 10:46 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:

> Yes indeed (see below). Just to reiterate, I am not running Hadoop. The
> "curly" node name mentioned in the stack trace is the name of one of the
> worker nodes. I've mounted the same directory, "datashare", containing two
> text files, on all worker nodes with sshfs. The Spark documentation
> suggests that this should work:
>
> *If using a path on the local filesystem, the file must also be accessible
> at the same path on worker nodes. Either copy the file to all workers or
> use a network-mounted shared file system.*
>
> I was hoping someone else could try this and see if it works.
>
> Here's what I did to generate the error:
>
> val data = sc.textFile("file:///home/peter/datashare/*.txt")
> data.collect()
>
> It's working to some extent, because if I put in a bogus path I get a
> different (correct) error (InvalidInputException: Input Pattern
> file:/home/peter/ddatashare/*.txt matches 0 files).
>
> Here's the stack trace when I use a valid path:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 18.0 (TID 792, curly): java.io.FileNotFoundException: File
> file:/home/peter/datashare/f1.txt does not exist
>   at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
>   at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
>   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
>   at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>   at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>   at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
>   at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
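While you chase down the mount issue: if the files are small, you can sidestep the worker-side path requirement entirely by reading them on the driver and handing the lines to Spark. An untested sketch, assuming the datashare directory is also mounted on the driver machine:

import scala.io.Source

// Read every .txt file on the driver, so only the driver needs to see the path.
val dir = new java.io.File("/home/peter/datashare")
val lines = dir.listFiles
  .filter(_.getName.endsWith(".txt"))
  .toSeq
  .flatMap(f => Source.fromFile(f).getLines().toList)

// Hand the collected lines to the cluster as an ordinary RDD.
val data = sc.parallelize(lines)
println(data.count())

Not a fix for the sshfs problem, just a way to keep moving while it gets debugged.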
> On Wed, Sep 7, 2016 at 9:50 AM, Yong Zhang <java8...@hotmail.com> wrote:
>
>> What error do you get? FileNotFoundException?
>>
>> Please paste the stack trace here.
>>
>> Yong
>>
>> ------------------------------
>> *From:* Peter Figliozzi <pete.figlio...@gmail.com>
>> *Sent:* Wednesday, September 7, 2016 10:18 AM
>> *To:* ayan guha
>> *Cc:* Lydia Ickler; user.spark
>> *Subject:* Re: distribute work (files)
>>
>> That's failing for me. Can someone please try this -- is this even
>> supposed to work:
>>
>> - create a directory somewhere and add two text files to it
>> - mount that directory on the Spark worker machines with sshfs
>> - read the text files into one data structure using a file URL with a
>> wildcard
>>
>> Thanks,
>>
>> Pete
>>
>> On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> To access a local file, try with a file:// URI.
>>>
>>> On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>>>
>>>> This is a great question. Basically you don't have to worry about the
>>>> details -- just give a wildcard in your call to textFile. See the
>>>> Programming Guide <http://spark.apache.org/docs/latest/programming-guide.html>
>>>> section entitled "External Datasets". The Spark framework will
>>>> distribute your data across the workers. Note that:
>>>>
>>>>> *If using a path on the local filesystem, the file must also be
>>>>> accessible at the same path on worker nodes. Either copy the file to all
>>>>> workers or use a network-mounted shared file system.*
>>>>
>>>> In your case this would mean the directory of files.
>>>>
>>>> Curiously, I cannot get this to work when I mount a directory with
>>>> sshfs on all of my worker nodes. It says "file not found" even though
>>>> the file clearly exists at the specified path on all workers. Anyone
>>>> care to try this and comment?
>>>>
>>>> Thanks,
>>>>
>>>> Pete
>>>>
>>>> On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <ickle...@googlemail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> maybe this is a stupid question:
>>>>>
>>>>> I have a list of files. I want to take each file as an input for an
>>>>> ML algorithm. All files are independent of one another.
>>>>> My question now is how do I distribute the work so that each worker
>>>>> takes a block of files and just runs the algorithm on them one by one.
>>>>> I hope somebody can point me in the right direction! :)
>>>>>
>>>>> Best regards,
>>>>> Lydia
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>

--
Best Regards,
Ayan Guha
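P.S. Going back to Lydia's original question: when each file is processed independently, one common pattern is to let Spark hand whole files to the executors with wholeTextFiles and run the algorithm once per file. A rough, untested sketch (the per-file "algorithm" below is just a stand-in word count; the same path-visibility caveat applies to file:// URLs, so an HDFS or S3 path avoids that issue):

// Each element is (path, full file contents); Spark spreads the files across
// the executors, which then process them independently.
val files = sc.wholeTextFiles("file:///home/peter/datashare/*.txt")

val results = files.map { case (path, contents) =>
  // Stand-in for the real per-file ML algorithm.
  (path, contents.split("\\s+").length)
}

results.collect().foreach(println)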