I faced a similar issue with the "wholeTextFiles" function due to version compatibility; Spark 1.0 with Hadoop 2.4.1 worked for me. Did you try another function, such as "textFile", to check whether the issue is specific to "wholeTextFiles"?
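For example, something along these lines (just a rough, untested sketch, reusing the s3n://wiki-dump/wikiinput path from your code) would show whether plain "textFile" can read the same data:

from pyspark import SparkContext

sc = SparkContext(appName="s3-read-check")

# Line-based reading of the same S3 prefix; if this works but
# wholeTextFiles() does not, the problem is specific to wholeTextFiles().
lines = sc.textFile("s3n://wiki-dump/wikiinput/")  # note the trailing "/"
print lines.take(1)

# The whole-file variant on exactly the same path, for comparison.
pairs = sc.wholeTextFiles("s3n://wiki-dump/wikiinput/")
print pairs.map(lambda kv: kv[0]).take(1)

sc.stop()

If the first call works and the second fails with the same FileNotFoundException, that narrows it down to "wholeTextFiles".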
Spark needs to be re-compiled for different Hadoop versions. However, you can keep multiple Spark directories on your system, each compiled against a different Hadoop version. (See also the sanity check sketched below, after the quoted thread.)

On Thu, Oct 9, 2014 at 1:15 PM, <jan.zi...@centrum.cz> wrote:
> I've tried to add / at the end of the path, but the result was exactly the
> same. I also guess that there will be some problem at the level of the
> Hadoop - S3 communication. Do you know if there is some possibility of how
> to run scripts from Spark on, for example, a different Hadoop version than
> the one in the standard EC2 installation?
>
> ______________________________________________________________
>
> From: Sean Owen <so...@cloudera.com>
> To: <jan.zi...@centrum.cz>
> Date: 08.10.2014 18:05
> Subject: Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:
> CC: "user@spark.apache.org"
>
> Take this as a bit of a guess, since I don't use S3 much and am only a
> bit aware of the Hadoop+S3 integration issues. But I know that S3's
> lack of proper directories causes a few issues when used with Hadoop,
> which wants to list directories.
>
> According to
> http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
> ... I wonder if you simply need to end the path with "/" to make it
> clear you mean it as a directory. Hadoop S3 OutputFormats are going to
> append ..._$folder$ files to mark directories too, although I don't
> think it's required necessarily to read them as dirs.
>
> I still imagine there could be some problem between Hadoop and Spark in
> this regard, but worth trying the path thing first. You do need s3n://
> for sure.
>
> On Wed, Oct 8, 2014 at 4:54 PM, <jan.zi...@centrum.cz> wrote:
> > One more update: I've realized that this problem is not only Python
> > related. I've tried it also in Scala, but I'm still getting the same
> > error. My Scala code:
> >
> > val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()
> >
> > ______________________________________________________________
> >
> > My additional question is whether this problem could be caused by the
> > fact that my file is bigger than the RAM memory across the whole
> > cluster?
> >
> > ______________________________________________________________
> >
> > Hi,
> >
> > I'm trying to use sc.wholeTextFiles() on a file that is stored on
> > Amazon S3, and I'm getting the following error:
> >
> > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
> > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
> >
> > Traceback (most recent call last):
> >   File "/root/distributed_rdd_test.py", line 27, in <module>
> >     result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
> >   File "/root/spark/python/pyspark/rdd.py", line 1126, in take
> >     totalParts = self._jrdd.partitions().size()
> >   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
> >   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> > py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
> > : java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
> >   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
> >   at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
> >   at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
> >   at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
> >   at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
> >   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> >   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> >   at scala.Option.getOrElse(Option.scala:120)
> >   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> >   at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
> >   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> >   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> >   at scala.Option.getOrElse(Option.scala:120)
> >   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> >   at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
> >   at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >   at java.lang.reflect.Method.invoke(Method.java:606)
> >   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> >   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> >   at py4j.Gateway.invoke(Gateway.java:259)
> >   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> >   at py4j.commands.CallCommand.execute(CallCommand.java:79)
> >   at py4j.GatewayConnection.run(GatewayConnection.java:207)
> >   at java.lang.Thread.run(Thread.java:745)
> >
> > My code is the following:
> >
> > sc = SparkContext(appName="Process wiki")
> > distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
> > result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
> > for item in result:
> >     print item.getvalue()
> > sc.stop()
> >
> > So my question is, is it possible to read whole files from S3? Based on
> > the documentation it should be possible, but it seems that it does not
> > work for me.
> >
> > When I do just:
> >
> > sc = SparkContext(appName="Process wiki")
> > distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10)
> > print distData
> >
> > Then the error that I'm getting is exactly the same.
> >
> > Thank you in advance for any advice.
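One sanity check that might help narrow this down (just a sketch, untested, and it assumes the object named in the stack trace lives at s3n://wiki-dump/wikiinput/wiki.xml.gz): point wholeTextFiles() at that single object instead of the directory. If the single-file read works while the directory path fails, the problem is in how the S3 "directory" is being listed rather than in reading the file itself.

from pyspark import SparkContext

sc = SparkContext(appName="s3-single-file-check")

# Read the one object named in the error directly; with a file this large
# this will be slow and memory-hungry, but the point is only whether the
# path resolves through the s3n:// filesystem at all.
pairs = sc.wholeTextFiles("s3n://wiki-dump/wikiinput/wiki.xml.gz")
print pairs.map(lambda kv: kv[0]).take(1)

sc.stop()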