Hi, I am rerunning the pipeline to generate the exact trace; below is the part of the trace I have from the last run:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: s3n://<folder-path>, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
        at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also, I think the error is happening in this part of the code (ConnectedComponents.scala:339). I am referring to the code at
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala

    if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
      // TODO: remove this after DataFrame.checkpoint is implemented
      val out = s"${checkpointDir.get}/$iteration"
      ee.write.parquet(out)
      // may hit S3 eventually consistent issue
      ee = sqlContext.read.parquet(out)

      // remove previous checkpoint
      if (iteration > checkpointInterval) {
        FileSystem.get(sc.hadoopConfiguration)
          .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
      }

      System.gc() // hint Spark to clean shuffle directories
    }

The FileSystem.get(sc.hadoopConfiguration).delete(...) call is the part I suspect: from the stack trace, the delete goes through ChecksumFileSystem/RawLocalFileSystem (i.e. the default file:/// filesystem) even though the checkpoint path is on s3n. I have put a small sketch of what I mean below the quoted thread.

Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

> Do you have more of the exception stack?
>
> ------------------------------
> *From:* Ankur Srivastava <ankur.srivast...@gmail.com>
> *Sent:* Wednesday, January 4, 2017 4:40:02 PM
> *To:* user@spark.apache.org
> *Subject:* Spark GraphFrame ConnectedComponents
>
> Hi,
>
> I am trying to use the ConnectedComponents algorithm of GraphFrames, but
> by default it needs a checkpoint directory. As I am running my Spark
> cluster with S3 as the DFS and do not have access to an HDFS file system,
> I tried using an S3 directory as the checkpoint directory, but I run into
> the exception below:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
> s3n://<folder-path>, expected: file:///
>
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>
>         at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>
> If I set the checkpoint interval to -1 to avoid checkpointing, the driver
> just hangs after 3 or 4 iterations.
>
> Is there some way I can set the default FileSystem to S3 for Spark, or any
> other option?
>
> Thanks
> Ankur
>
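
PS: Here is a minimal sketch of the kind of change I mean, resolving the FileSystem from the checkpoint path itself rather than from the default configuration. This is only my illustration written against the snippet above (it reuses checkpointDir, iteration, checkpointInterval and sc from there); it is not the actual GraphFrames code.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Checkpoint written checkpointInterval iterations ago, on whatever
    // filesystem the checkpoint directory actually lives on (s3n in my case).
    val oldCheckpoint = new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}")

    // Path.getFileSystem resolves the FileSystem from the path's own scheme
    // (s3n, hdfs, file, ...) instead of from fs.defaultFS, so the local
    // filesystem's checkPath should never see the s3n:// path.
    val fs: FileSystem = oldCheckpoint.getFileSystem(sc.hadoopConfiguration)
    fs.delete(oldCheckpoint, true)

If that reading is wrong, the other thing I can think of is pointing the default filesystem (fs.defaultFS) at the S3 bucket, which is what I was asking about in the earlier mail.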