Would it be more robust to use the Path when creating the FileSystem? https://github.com/graphframes/graphframes/issues/160
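For illustration, a minimal sketch of that suggestion, reusing the variable names from the checkpoint-cleanup snippet quoted further down the thread (checkpointDir, iteration, checkpointInterval); an untested sketch, not a committed patch:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Resolve the FileSystem from the checkpoint path itself rather than
    // via FileSystem.get(conf), which returns whatever fs.defaultFS points
    // at (the local filesystem in this report). Path.getFileSystem picks
    // the implementation matching the path's scheme, e.g. s3n://.
    val oldCheckpoint = new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}")
    val fs: FileSystem = oldCheckpoint.getFileSystem(sc.hadoopConfiguration)
    fs.delete(oldCheckpoint, true) // recursive delete of the stale checkpoint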
On Thu, Jan 5, 2017 at 4:57 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

> This is likely a factor of your hadoop config and Spark rather than
> anything specific with GraphFrames.
>
> You might have better luck getting assistance if you could isolate the
> code to a simple case that manifests the problem (without GraphFrames),
> and repost.
>
> ------------------------------
> From: Ankur Srivastava <ankur.srivast...@gmail.com>
> Sent: Thursday, January 5, 2017 3:45:59 PM
> To: Felix Cheung; dev@spark.apache.org
> Cc: u...@spark.apache.org
> Subject: Re: Spark GraphFrame ConnectedComponents
>
> Adding the DEV mailing list to see if this is a defect with
> ConnectedComponents or if they can recommend any solution.
>
> Thanks
> Ankur
>
> On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava
> <ankur.srivast...@gmail.com> wrote:
>
>> Yes, I did try it out, and it chooses the local file system even though
>> my checkpoint location starts with s3n://
>>
>> I am not sure how I can make it load the S3FileSystem.
>>
>> On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung
>> <felixcheun...@hotmail.com> wrote:
>>
>>> Right, I'd agree, it seems to be only with delete.
>>>
>>> Could you by chance run just the delete to see if it fails?
>>>
>>> FileSystem.get(sc.hadoopConfiguration)
>>>   .delete(new Path(somepath), true)
>>>
>>> ------------------------------
>>> From: Ankur Srivastava <ankur.srivast...@gmail.com>
>>> Sent: Thursday, January 5, 2017 10:05:03 AM
>>> To: Felix Cheung
>>> Cc: u...@spark.apache.org
>>> Subject: Re: Spark GraphFrame ConnectedComponents
>>>
>>> Yes, it works to read the vertices and edges data from the S3 location
>>> and is also able to write the checkpoint files to S3. It only fails when
>>> deleting the data, and that is because it tries to use the default file
>>> system. I tried looking up how to update the default file system but
>>> could not find anything in that regard.
>>>
>>> Thanks
>>> Ankur
>>>
>>> On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung
>>> <felixcheun...@hotmail.com> wrote:
>>>
>>>> From the stack it looks to be an error from the explicit call to
>>>> hadoop.fs.FileSystem.
>>>>
>>>> Is the URL scheme for s3n registered?
>>>> Does it work when you try to read from s3 from Spark?
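A quick way to test that scheme question, as a hedged sketch (the bucket name is a placeholder; this assumes the s3n implementation from the Hadoop S3 support jars is on the classpath):

    import java.net.URI
    import org.apache.hadoop.fs.FileSystem

    // Ask Hadoop for the FileSystem bound to the s3n scheme explicitly.
    // If the scheme is registered, this prints the implementation class;
    // if it is not, the call throws an IOException ("No FileSystem for
    // scheme" on Hadoop 2.x).
    val s3nFs = FileSystem.get(new URI("s3n://some-bucket/"), sc.hadoopConfiguration)
    println(s3nFs.getClass.getName)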
>>>>
>>>> _____________________________
>>>> From: Ankur Srivastava <ankur.srivast...@gmail.com>
>>>> Sent: Wednesday, January 4, 2017 9:23 PM
>>>> Subject: Re: Spark GraphFrame ConnectedComponents
>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>> Cc: <u...@spark.apache.org>
>>>>
>>>> This is the exact trace from the driver logs:
>>>>
>>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>>>> s3n://<checkpoint-folder>/8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
>>>> expected: file:///
>>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
>>>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
>>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
>>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
>>>> at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
>>>> at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
>>>> at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
>>>> at GraphTest.main(GraphTest.java:31) ----------- Application Class
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>
>>>> And I am running Spark v1.6.2 and GraphFrames v0.3.0-spark1.6-s_2.10.
>>>>
>>>> Thanks
>>>> Ankur
>>>>
>>>> On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava
>>>> <ankur.srivast...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am rerunning the pipeline to generate the exact trace; I have below
>>>>> part of the trace from the last run:
>>>>>
>>>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>>>>> s3n://<folder-path>, expected: file:///
>>>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>>>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
>>>>> at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
>>>>>
>>>>> Also, I think the error is happening in this part of the code
>>>>> (ConnectedComponents.scala:339). I am referring to the code at
>>>>> https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala
>>>>>
>>>>> if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
>>>>>   // TODO: remove this after DataFrame.checkpoint is implemented
>>>>>   val out = s"${checkpointDir.get}/$iteration"
>>>>>   ee.write.parquet(out)
>>>>>   // may hit S3 eventually consistent issue
>>>>>   ee = sqlContext.read.parquet(out)
>>>>>
>>>>>   // remove previous checkpoint
>>>>>   if (iteration > checkpointInterval) {
>>>>>     FileSystem.get(sc.hadoopConfiguration)
>>>>>       .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
>>>>>   }
>>>>>
>>>>>   System.gc() // hint Spark to clean shuffle directories
>>>>> }
>>>>>
>>>>> Thanks
>>>>> Ankur
>>>>>
>>>>> On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung
>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>
>>>>>> Do you have more of the exception stack?
>>>>>>
>>>>>> ------------------------------
>>>>>> From: Ankur Srivastava <ankur.srivast...@gmail.com>
>>>>>> Sent: Wednesday, January 4, 2017 4:40:02 PM
>>>>>> To: u...@spark.apache.org
>>>>>> Subject: Spark GraphFrame ConnectedComponents
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to use the ConnectedComponents algorithm of GraphFrames,
>>>>>> but by default it needs a checkpoint directory. As I am running my
>>>>>> Spark cluster with S3 as the DFS and do not have access to an HDFS
>>>>>> file system, I tried using an S3 directory as the checkpoint
>>>>>> directory, but I run into the below exception:
>>>>>>
>>>>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>>>>>> s3n://<folder-path>, expected: file:///
>>>>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>>>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>>>>>>
>>>>>> If I set the checkpoint interval to -1 to avoid checkpointing, the
>>>>>> driver just hangs after 3 or 4 iterations.
>>>>>>
>>>>>> Is there some way I can set the default FileSystem to S3 for Spark,
>>>>>> or any other option?
>>>>>>
>>>>>> Thanks
>>>>>> Ankur

--
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.
http://databricks.com
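On the question at the root of the thread ("Is there some way I can set the default FileSystem to S3 for Spark?"): one possible workaround, sketched here with a placeholder bucket name and untested against this setup, is to point the default filesystem at S3 so that the scheme-less FileSystem.get(conf) call inside ConnectedComponents resolves to S3 instead of file:///

    // Hedged sketch: make s3n the default filesystem in the Hadoop
    // configuration Spark hands to FileSystem.get. Note this changes the
    // default for every scheme-less path in the job, not just the
    // checkpoint directory.
    sc.hadoopConfiguration.set("fs.defaultFS", "s3n://some-bucket")    // Hadoop 2.x key
    sc.hadoopConfiguration.set("fs.default.name", "s3n://some-bucket") // deprecated 1.x alias

The Path-based delete sketched at the top of the thread is arguably the more robust fix, since it resolves the filesystem per path and leaves the default alone.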