Hi,

Is it happening on all executors, or just one?
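One quick way to narrow it down is to ask every executor whether it can actually read the file. A rough, untested sketch (the path is taken from your stack trace; bump the partition count if you have more cores so every executor gets a task):

import java.net.InetAddress

// Spread a trivial job over many partitions so every executor runs at least
// one task, then have each task report its hostname and whether the file is
// readable from that JVM.
val path = "/home/peter/datashare/f1.txt"

val visibility = sc.parallelize(1 to 1000, 100)
  .map { _ =>
    val host = InetAddress.getLocalHost.getHostName
    (host, new java.io.File(path).canRead)
  }
  .distinct()
  .collect()

visibility.foreach { case (host, ok) =>
  println(s"$host -> ${if (ok) "readable" else "NOT readable"}")
}

If some host comes back NOT readable, that points at the mount rather than at Spark.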
On Thu, Sep 8, 2016 at 10:46 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:

> Yes indeed (see below). Just to reiterate, I am not running Hadoop. The
> "curly" node name mentioned in the stack trace is the name of one of the
> worker nodes. I've mounted the same directory, "datashare", containing two
> text files, on all worker nodes with sshfs. The Spark documentation
> suggests that this should work:
>
> *If using a path on the local filesystem, the file must also be accessible
> at the same path on worker nodes. Either copy the file to all workers or
> use a network-mounted shared file system.*
>
> I was hoping someone else could try this and see if it works.
>
> Here's what I did to generate the error:
>
> val data = sc.textFile("file:///home/peter/datashare/*.txt")
> data.collect()
>
> It's working to some extent, because if I put in a bogus path I get a
> different (correct) error (InvalidInputException: Input Pattern
> file:/home/peter/ddatashare/*.txt matches 0 files).
>
> Here's the stack trace when I use a valid path:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 18.0 (TID 792, curly): java.io.FileNotFoundException: File
> file:/home/peter/datashare/f1.txt does not exist
>   at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
>   at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
>   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
>   at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>   at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>   at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
>   at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
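While you chase down the mount issue: if the files are small, you can sidestep the worker-side path requirement entirely by reading them on the driver and handing the lines to Spark. An untested sketch, assuming the datashare directory is also mounted on the driver machine:

import scala.io.Source

// Read every .txt file on the driver, so only the driver needs to see the path.
val dir = new java.io.File("/home/peter/datashare")
val lines = dir.listFiles
  .filter(_.getName.endsWith(".txt"))
  .toSeq
  .flatMap(f => Source.fromFile(f).getLines().toList)

// Hand the collected lines to the cluster as an ordinary RDD.
val data = sc.parallelize(lines)
println(data.count())

Not a fix for the sshfs problem, just a way to keep moving while it gets debugged.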
> On Wed, Sep 7, 2016 at 9:50 AM, Yong Zhang <java8...@hotmail.com> wrote:
>
>> What error do you get? FileNotFoundException?
>>
>> Please paste the stack trace here.
>>
>> Yong
>>
>> ------------------------------
>> *From:* Peter Figliozzi <pete.figlio...@gmail.com>
>> *Sent:* Wednesday, September 7, 2016 10:18 AM
>> *To:* ayan guha
>> *Cc:* Lydia Ickler; user.spark
>> *Subject:* Re: distribute work (files)
>>
>> That's failing for me. Can someone please try this -- is this even
>> supposed to work:
>>
>> - create a directory somewhere and add two text files to it
>> - mount that directory on the Spark worker machines with sshfs
>> - read the text files into one data structure using a file URL with a
>> wildcard
>>
>> Thanks,
>>
>> Pete
>>
>> On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> To access a local file, try with a file:// URI.
>>>
>>> On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>>>
>>>> This is a great question. Basically you don't have to worry about the
>>>> details -- just give a wildcard in your call to textFile. See the
>>>> Programming Guide <http://spark.apache.org/docs/latest/programming-guide.html>
>>>> section entitled "External Datasets". The Spark framework will
>>>> distribute your data across the workers. Note that:
>>>>
>>>>> *If using a path on the local filesystem, the file must also be
>>>>> accessible at the same path on worker nodes. Either copy the file to all
>>>>> workers or use a network-mounted shared file system.*
>>>>
>>>> In your case this would mean the directory of files.
>>>>
>>>> Curiously, I cannot get this to work when I mount a directory with
>>>> sshfs on all of my worker nodes. It says "file not found" even though
>>>> the file clearly exists at the specified path on all workers. Anyone
>>>> care to try this and comment?
>>>>
>>>> Thanks,
>>>>
>>>> Pete
>>>>
>>>> On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <ickle...@googlemail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> maybe this is a stupid question:
>>>>>
>>>>> I have a list of files. I want to take each file as an input for an
>>>>> ML algorithm. All files are independent of one another.
>>>>> My question now is how do I distribute the work so that each worker
>>>>> takes a block of files and just runs the algorithm on them one by one.
>>>>> I hope somebody can point me in the right direction! :)
>>>>>
>>>>> Best regards,
>>>>> Lydia
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>

--
Best Regards,
Ayan Guha
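P.S. Going back to Lydia's original question: when each file is processed independently, one common pattern is to let Spark hand whole files to the executors with wholeTextFiles and run the algorithm once per file. A rough, untested sketch (the per-file "algorithm" below is just a stand-in word count; the same path-visibility caveat applies to file:// URLs, so an HDFS or S3 path avoids that issue):

// Each element is (path, full file contents); Spark spreads the files across
// the executors, which then process them independently.
val files = sc.wholeTextFiles("file:///home/peter/datashare/*.txt")

val results = files.map { case (path, contents) =>
  // Stand-in for the real per-file ML algorithm.
  (path, contents.split("\\s+").length)
}

results.collect().foreach(println)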