51,000 files at about 1/2 MB per file. I am wondering if I need this
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
Although, if I am understanding you correctly, even if I copy the S3
files to HDFS on EMR and use wholeTextFiles, I am still only going to
be able to u
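
(For reference, a minimal s3-dist-cp invocation on the EMR master node looks
roughly like the sketch below; the bucket and paths are placeholders, not
from this thread. The doc linked above also describes --groupBy/--targetSize
options for combining many small .gz files into fewer, larger ones.)

    # Copy the input from S3 onto the cluster's HDFS before running the Spark job.
    s3-dist-cp --src s3://my-bucket/logs/ --dest hdfs:///data/logs/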
Can you post more information about the number of files, their sizes, and the
executor logs?
A gzipped file is not splittable, i.e. only one executor can gunzip it (the
unzipped data can then be processed in parallel).
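
(A quick PySpark sketch of that behaviour, with placeholder paths: each .gz
file arrives in a single task, but repartitioning right after the read lets
the rest of the job run in parallel.)

    # One task per .gz file, because gzip is not splittable.
    rdd = sc.textFile("hdfs:///data/logs/*.gz")

    # Redistribute the decompressed lines so downstream stages use the whole cluster.
    rdd = rdd.repartition(200)
    print(rdd.filter(lambda line: "ERROR" in line).count())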
wholeTextFiles was designed to be executed only on one executor (e.g. for
proce
I've been working on this problem for several days (I am doing it more to
increase my knowledge of Spark). The code you linked to hangs because,
after reading in the file, I have to gunzip it.
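
(A minimal sketch of that step, assuming PySpark; the path and helper name
are mine, not from this thread. Reading the raw bytes with sc.binaryFiles and
gunzipping inside the map keeps each file on one executor, but lets many
files be decompressed in parallel across the cluster.)

    import gzip
    import io

    def gunzip_lines(pair):
        # pair is (path, raw bytes) as returned by sc.binaryFiles.
        path, data = pair
        with gzip.GzipFile(fileobj=io.BytesIO(bytes(data))) as f:
            for line in f:
                yield line.decode("utf-8")

    lines = sc.binaryFiles("hdfs:///data/logs/*.gz").flatMap(gunzip_lines)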
Another way that seems to be working is reading each file in using
sc.textFile, and then writing it to the H
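
(Roughly what that per-file loop might look like in PySpark, with placeholder
paths; sc.textFile handles the gunzip itself, and the re-saved copies on HDFS
are splittable for later jobs.)

    # Placeholder list of inputs; in practice this would come from listing the bucket.
    paths = [
        "s3a://my-bucket/logs/part-000.gz",
        "s3a://my-bucket/logs/part-001.gz",
    ]

    for i, p in enumerate(paths):
        rdd = sc.textFile(p)  # one task per file, transparently gunzipped
        rdd.saveAsTextFile("hdfs:///data/uncompressed/%05d" % i)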
Strange that it's working for some directories but not others. Looks like
wholeTextFiles maybe doesn't work with S3?
https://issues.apache.org/jira/browse/SPARK-4414.
If it's possible to load the data into EMR and run Spark from there, that
may be a workaround. This blog post shows a Python worka
I've actually been able to trace the problem to the files being read in.
If I change to a different directory, then I don't get the error. Is one
of the executors running out of memory?
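
(If memory is the suspect, one thing worth checking is the executor sizing;
a placeholder sketch for a submitted script, since wholeTextFiles holds each
entire file as a single record, so each executor has to fit the largest
decompressed file comfortably.)

    from pyspark import SparkConf, SparkContext

    # Placeholder sizes; tune to the cluster and to the largest file.
    conf = (SparkConf()
            .setAppName("wholeTextFiles-memory-check")
            .set("spark.executor.memory", "6g")
            .set("spark.executor.instances", "10"))
    sc = SparkContext(conf=conf)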
On 02/06/2017 02:35 PM, Paul Tremblay wrote:
When I try to create an rdd using wholeTextFiles, I get an
i