Take this as a bit of a guess, since I don't use S3 much and am only a
bit aware of the Hadoop+S3 integration issues. But I know that S3's
lack of proper directories causes a few issues when used with Hadoop,
which wants to list directories.

According to 
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
... I wonder if you simply need to end the path with "/" to make it
clear you mean it as a directory. Hadoop S3 OutputFormats are going to
append ..._$folder$ files to mark directories too, although I don't
think it's required necessarily to read them as dirs.

I still imagine there could be some problem between Hadoop in Spark in
this regard, but worth trying the path thing first. You do need s3n://
for sure.

On Wed, Oct 8, 2014 at 4:54 PM,  <[email protected]> wrote:
> One more update: I've realized that this problem is not only Python related.
> I've tried it also in Scala, but I'm still getting the same error, my scala
> code: val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()
>
> ______________________________________________________________
>
>
> My additional question is if this problem can be possibly caused by the fact
> that my file is bigger than RAM memory across the whole cluster?
>
>
>
> ______________________________________________________________
>
> Hi
>
> I'm trying to use sc.wholeTextFiles() on file that is stored amazon S3 I'm
> getting following Error:
>
>
>
> 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process :
> 1
>
> 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process :
> 1
>
> Traceback (most recent call last):
>
>   File "/root/distributed_rdd_test.py", line 27, in <module>
>
>     result =
> distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
>
>   File "/root/spark/python/pyspark/rdd.py", line 1126, in take
>
>     totalParts = self._jrdd.partitions().size()
>
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
> 300, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
>
> : java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
>
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>
> at
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
>
> at
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
>
> at
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
>
> at
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>
> at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>
> at
> org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
>
> at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>
> at py4j.Gateway.invoke(Gateway.java:259)
>
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
>
>
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> My code is following:
>
>
>
> sc = SparkContext(appName="Process wiki")
>
> distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
>
> result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
>
> for item in result:
>
>         print item.getvalue()
>
> sc.stop()
>
>
>
> So my question is, is it possible to read whole files from S3? Based on the
> documentation it shouold be possible, but it seems that it does not work for
> me.
>
>
>
> When I do just:
>
>
>
> sc = SparkContext(appName="Process wiki")
>
> distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10)
>
> print distData
>
>
>
> Then the error that I'm getting is exactly the same.
>
>
>
> Thank you in advance for any advice.
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to