As a follow-up to all of this: when I try increasing the number of
partitions (sc.wholeTextFiles("/MySource/dir1/*", 8)) I get the
out-of-memory error much faster.

Eran

On Wed, Dec 16, 2015 at 5:23 PM Eran Witkon <[email protected]> wrote:

> Hi,
> I have about 8K files in about 10 directories on HDFS, and I need to add a
> column to every file containing that file's name (e.g. file1.txt gets a
> column with "file1.txt", file2.txt with "file2.txt", etc.).
>
> The current approach is to read all files using sc.wholeTextFiles("myPath"),
> which yields the file name as the key, and add it as a column to each file.
>
> 1) I run this on 5 servers, each with 24 cores and 24GB RAM, with a config
> of:
> spark-shell --master yarn-client --executor-cores 5 --executor-memory 5G
> But when I run this on all directories at once
> (sc.wholeTextFiles("/MySource/*/*")) I get java.lang.OutOfMemoryError:
> Java heap space.
> When running on a single directory all works well:
> sc.wholeTextFiles("/MySource/dir1/*").
>
> 2) One other option is not to use wholeTextFiles but to read each line with
> sc.textFile, but how can I get the file name with textFile?
>
> Eran
>
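[For reference, a minimal sketch of the wholeTextFiles approach described
above, assuming a spark-shell session where sc is a live SparkContext; the
path and the partition count mirror the ones quoted in the thread, and the
second argument is the minPartitions hint being increased in the follow-up:]

```scala
// Sketch only: requires a running Spark cluster; `sc` is the spark-shell
// SparkContext. wholeTextFiles returns an RDD of (fileName, fileContent)
// pairs, so the file name is already available as the key. Note that each
// whole file is materialized as a single String in executor memory, which
// is why many files at once can trigger java.lang.OutOfMemoryError.
val files = sc.wholeTextFiles("/MySource/dir1/*", 8) // 8 = minPartitions hint

// Tag every line of every file with the file it came from, producing
// (fileName, line) rows -- i.e. the "added column" with the file name.
val withName = files.flatMap { case (name, content) =>
  content.split("\n").map(line => (name, line))
}
```

[Note that minPartitions is only a hint: it does not split individual files,
so raising it does not reduce the per-file memory needed to hold each whole
file as one String.]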
