So, can you try to simulate the same setup without sshfs? I.e., create a folder
at /tmp/datashare on all the machines, copy your files into it, and point
sc.textFile at that folder?
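
For example, a minimal sketch (this assumes the same files have already been
copied to /tmp/datashare on the driver and on every worker, so no network
mount is involved):

val data = sc.textFile("file:///tmp/datashare/*.txt")
data.count()   // forces the read on the executors

If that works, the problem is probably with the sshfs mount rather than with
Spark itself.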


On Thu, Sep 8, 2016 at 11:26 AM, Peter Figliozzi <pete.figlio...@gmail.com>
wrote:

> All (three) of them.  It's kind of cool: each time I re-run collect(), a
> different executor shows up as the first to encounter the error.
>
> On Wed, Sep 7, 2016 at 8:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> Is it happening on all executors or one?
>>
>> On Thu, Sep 8, 2016 at 10:46 AM, Peter Figliozzi <
>> pete.figlio...@gmail.com> wrote:
>>
>>>
>>> Yes indeed (see below).  Just to reiterate, I am not running Hadoop.
>>> The node name "curly" mentioned in the stack trace is one of the worker
>>> nodes.  I've mounted the same directory, "datashare", containing two text
>>> files, on all worker nodes with sshfs.  The Spark documentation suggests
>>> that this should work:
>>>
>>> *If using a path on the local filesystem, the file must also be
>>> accessible at the same path on worker nodes. Either copy the file to all
>>> workers or use a network-mounted shared file system.*
>>>
>>> I was hoping someone else could try this and see if it works.
>>>
>>> Here's what I did to generate the error:
>>>
>>> val data = sc.textFile("file:///home/peter/datashare/*.txt")
>>> data.collect()
>>>
>>> It's working to some extent because if I put a bogus path in, I'll get a
>>> different (correct) error (InvalidInputException: Input Pattern
>>> file:/home/peter/ddatashare/*.txt matches 0 files).
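>>>
>>> (As a side check, I suppose one could verify whether the executors
>>> themselves can see the file with something like this untested sketch; the
>>> partition count of 12 is arbitrary, hopefully enough to land tasks on every
>>> worker:)
>>>
>>> sc.parallelize(1 to 12, 12).map { _ =>
>>>   val host = java.net.InetAddress.getLocalHost.getHostName
>>>   val visible = new java.io.File("/home/peter/datashare/f1.txt").exists
>>>   (host, visible)
>>> }.collect().distinct.foreach(println)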
>>>
>>> Here's the stack trace when I use a valid path:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage 18.0 (TID 792, curly): java.io.FileNotFoundException: File file:/home/peter/datashare/f1.txt does not exist
>>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
>>> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>>> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>>> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
>>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
>>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
>>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> On Wed, Sep 7, 2016 at 9:50 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>
>>>> What error do you get? FileNotFoundException?
>>>>
>>>>
>>>> Please paste the stacktrace here.
>>>>
>>>>
>>>> Yong
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* Peter Figliozzi <pete.figlio...@gmail.com>
>>>> *Sent:* Wednesday, September 7, 2016 10:18 AM
>>>> *To:* ayan guha
>>>> *Cc:* Lydia Ickler; user.spark
>>>> *Subject:* Re: distribute work (files)
>>>>
>>>> That's failing for me.  Can someone please try this and confirm whether
>>>> it's even supposed to work:
>>>>
>>>>    - create a directory somewhere and add two text files to it
>>>>    - mount that directory on the Spark worker machines with sshfs
>>>>    - read the text files into one data structure using a file URL with
>>>>    a wildcard
>>>>
>>>> Thanks,
>>>>
>>>> Pete
>>>>
>>>> On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> To access a local file, try a file:// URI.
>>>>>
>>>>> On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <
>>>>> pete.figlio...@gmail.com> wrote:
>>>>>
>>>>>> This is a great question.  Basically you don't have to worry about
>>>>>> the details: just use a wildcard in your call to textFile.  See
>>>>>> the Programming Guide
>>>>>> <http://spark.apache.org/docs/latest/programming-guide.html> section
>>>>>> entitled "External Datasets".  The Spark framework will distribute your
>>>>>> data across the workers.  Note that:
>>>>>>
>>>>>> *If using a path on the local filesystem, the file must also be
>>>>>>> accessible at the same path on worker nodes. Either copy the file to all
>>>>>>> workers or use a network-mounted shared file system.*
>>>>>>
>>>>>>
>>>>>> In your case this would mean the directory of files.
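>>>>>>
>>>>>> If each file needs to be processed as a single unit (your algorithm runs
>>>>>> once per file), sc.wholeTextFiles may be a better fit than textFile,
>>>>>> since it gives one (path, contents) record per file.  A rough, untested
>>>>>> sketch, with runMyAlgorithm standing in for your real ML code and the
>>>>>> path purely illustrative:
>>>>>>
>>>>>> def runMyAlgorithm(text: String): Double = text.length.toDouble  // placeholder
>>>>>>
>>>>>> val perFile = sc.wholeTextFiles("file:///home/peter/datashare/*.txt")
>>>>>> val results = perFile.map { case (path, contents) => (path, runMyAlgorithm(contents)) }
>>>>>> results.collect().foreach(println)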
>>>>>>
>>>>>> Curiously, I cannot get this to work when I mount a directory with
>>>>>> sshfs on all of my worker nodes.  It says "file not found" even
>>>>>> though the file clearly exists at the specified path on all workers.
>>>>>> Anyone care to try and comment on this?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Pete
>>>>>>
>>>>>> On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <ickle...@googlemail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> maybe this is a stupid question:
>>>>>>>
>>>>>>> I have a list of files. I want to use each file as an input to an
>>>>>>> ML algorithm. All files are independent of one another.
>>>>>>> My question is: how do I distribute the work so that each worker
>>>>>>> takes a block of files and runs the algorithm on them one by one?
>>>>>>> I hope somebody can point me in the right direction! :)
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Lydia
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha
