Take this as a bit of a guess, since I don't use S3 much and am only a bit aware of the Hadoop+S3 integration issues. But I know that S3's lack of proper directories causes a few issues when used with Hadoop, which wants to list directories.
According to http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html ... I wonder if you simply need to end the path with "/" to make it clear you mean it as a directory. Hadoop S3 OutputFormats are going to append ..._$folder$ files to mark directories too, although I don't think it's required necessarily to read them as dirs. I still imagine there could be some problem between Hadoop in Spark in this regard, but worth trying the path thing first. You do need s3n:// for sure. On Wed, Oct 8, 2014 at 4:54 PM, <[email protected]> wrote: > One more update: I've realized that this problem is not only Python related. > I've tried it also in Scala, but I'm still getting the same error, my scala > code: val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first() > > ______________________________________________________________ > > > My additional question is if this problem can be possibly caused by the fact > that my file is bigger than RAM memory across the whole cluster? > > > > ______________________________________________________________ > > Hi > > I'm trying to use sc.wholeTextFiles() on file that is stored amazon S3 I'm > getting following Error: > > > > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : > 1 > > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : > 1 > > Traceback (most recent call last): > > File "/root/distributed_rdd_test.py", line 27, in <module> > > result = > distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) > > File "/root/spark/python/pyspark/rdd.py", line 1126, in take > > totalParts = self._jrdd.partitions().size() > > File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > > File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line > 300, in get_return_value > > py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions. > > : java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz > > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) > > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489) > > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280) > > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240) > > at > org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220) > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > > at scala.Option.getOrElse(Option.scala:120) > > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > > at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56) > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > > at scala.Option.getOrElse(Option.scala:120) > > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > > at > org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50) > > at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:606) > > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > > at py4j.Gateway.invoke(Gateway.java:259) > > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > > at py4j.commands.CallCommand.execute(CallCommand.java:79) > > at py4j.GatewayConnection.run(GatewayConnection.java:207) > > > > at java.lang.Thread.run(Thread.java:745) > > > > My code is following: > > > > sc = SparkContext(appName="Process wiki") > > distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput') > > result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) > > for item in result: > > print item.getvalue() > > sc.stop() > > > > So my question is, is it possible to read whole files from S3? Based on the > documentation it shouold be possible, but it seems that it does not work for > me. > > > > When I do just: > > > > sc = SparkContext(appName="Process wiki") > > distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10) > > print distData > > > > Then the error that I'm getting is exactly the same. > > > > Thank you in advance for any advice. > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
