Another way is to configure S3 as Tachyon's under storage system, and then run Spark on Tachyon.
More info: http://tachyon-project.org/Setup-UFS.html
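For concreteness, a minimal Scala sketch of what the Spark side of this route could look like. It assumes a Tachyon master at tachyon-master:19998 whose under storage has already been pointed at your bucket (e.g. TACHYON_UNDERFS_ADDRESS=s3n://your-bucket/tachyon in tachyon-env.sh); the host, bucket, and path names are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("spark-on-tachyon"))

    // Tell Hadoop's FileSystem machinery how to resolve tachyon:// URIs
    // (some distributions already set this in core-site.xml).
    sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

    // Reads hit Tachyon first and fall through to the S3 under storage on a
    // miss, so the executors never have to handle the s3n scheme themselves.
    val lines = sc.textFile("tachyon://tachyon-master:19998/data/input.txt")
    println(lines.count())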
Best,
Haoyuan

On Wed, May 13, 2015 at 10:52 AM, Stephen Carman <scar...@coldlight.com> wrote:

> Thank you for the suggestions. The problem is that we need to initialize the
> VFS s3 driver, so what you suggested, Akhil, wouldn't fix it.
>
> Basically, a job is submitted to the cluster and tries to pull the data down
> from S3, but it fails because the s3 URI scheme hasn't been initialized in
> the VFS, so it doesn't know how to handle the URI.
>
> What I'm asking is: how do we, before the job is run, run some bootstrapping
> or setup code that performs this initialization or configuration step for
> the VFS, so that when the job executes it has the information it needs to
> handle the s3 URI?
>
> Thanks,
> Steve
>
> On May 13, 2015, at 12:35 PM, jay vyas <jayunit100.apa...@gmail.com> wrote:
>
> Might I ask why VFS? I'm new to VFS and not sure whether or not it predates
> the Hadoop file system interfaces (HCFS).
>
> After all, Spark natively supports any HCFS by leveraging the Hadoop
> FileSystem API, class loaders, and so on. So simply putting those resources
> on your classpath should be sufficient to connect directly to S3, using the
> sc.hadoopFile(...) commands.
>
> On May 13, 2015 12:16 PM, "Akhil Das" <ak...@sigmoidanalytics.com> wrote:
>
> Did you happen to have a look at this? https://github.com/abashev/vfs-s3
>
> Thanks
> Best Regards
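One way to get the bootstrapping step Stephen describes, sketched in Scala under the assumption that the vfs-s3 jar linked above is shipped to every slave (e.g. via spark-submit --jars vfs-s3.jar or spark.executor.extraClassPath): wrap the one-time VFS setup in a lazy val on a singleton object and touch it at the start of each partition. A Scala object is instantiated at most once per executor JVM, so the provider gets registered before any task tries to resolve the URI. The S3FileProvider import reflects the 2015-era vfs-s3 layout and may differ in other versions; rdd and processRecord are placeholders:

    import org.apache.commons.vfs2.impl.DefaultFileSystemManager
    import com.intridea.io.vfs.provider.s3.S3FileProvider // package may differ by vfs-s3 version

    object VfsBootstrap {
      // Evaluated at most once per executor JVM, the first time a task touches it.
      lazy val manager: DefaultFileSystemManager = {
        val m = new DefaultFileSystemManager()
        // Register under the scheme your URIs actually use ("s3n" in the trace below).
        m.addProvider("s3n", new S3FileProvider())
        m.init()
        m
      }
    }

    rdd.mapPartitions { iter =>
      VfsBootstrap.manager      // forces the one-time setup on this executor
      iter.map(processRecord)   // placeholder for the real per-record work
    }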
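And a sketch of jay's alternative, which sidesteps VFS entirely: with the hadoop-client and jets3t jars on the executor classpath, Hadoop's built-in NativeS3FileSystem handles the s3n scheme through the ordinary sc.textFile / sc.hadoopFile path. The bucket, path, and credentials below are placeholders:

    // Credentials can live in core-site.xml instead of code if you prefer.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    // No VFS involved: the s3n:// scheme resolves via Hadoop's FileSystem API.
    val data = sc.textFile("s3n://your-bucket/path/to/data")
    println(data.count())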
> On Tue, May 12, 2015 at 11:33 PM, Stephen Carman <scar...@coldlight.com> wrote:
>
> > We have a small Mesos cluster, and these slaves need a VFS set up on them
> > so that they can pull down the data they need from S3 when Spark runs.
> >
> > There doesn't seem to be any obvious guidance online on how to do this or
> > how to accomplish it easily. Does anyone have some best practices or ideas
> > about how to accomplish this?
> >
> > An example stack trace when a job is run on the Mesos cluster...
> >
> > Any idea how to get this going? Like somehow bootstrapping Spark on run or
> > something?
> >
> > Thanks,
> > Steve
> >
> > java.io.IOException: Unsupported scheme s3n for URI s3n://removed
> >         at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
> >         at com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
> >         at com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
> >         at com.coldlight.neuron.data.ClquetPartitionedData$Iter.<init>(ClquetPartitionedData.java:330)
> >         at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> >         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> >         at org.apache.spark.scheduler.Task.run(Task.scala:64)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >         at java.lang.Thread.run(Thread.java:745)
> > 15/05/12 13:57:51 ERROR Executor: Exception in task 0.1 in stage 0.0 (TID 1)
> > java.lang.RuntimeException: java.io.IOException: Unsupported scheme s3n for URI s3n://removed
> >         at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:307)
> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> >         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> >         at org.apache.spark.scheduler.Task.run(Task.scala:64)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >         at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.io.IOException: Unsupported scheme s3n for URI s3n://removed
> >         at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
> >         at com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
> >         at com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
> >         at com.coldlight.neuron.data.ClquetPartitionedData$Iter.<init>(ClquetPartitionedData.java:330)
> >         at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
> >         ... 8 more
--
Haoyuan Li
CEO, Tachyon Nexus <http://www.tachyonnexus.com/>
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/