As a follow-up to all of this: when I try increasing the number of partitions (sc.wholeTextFiles("/MySource/dir1/*", 8)) I hit the out-of-memory error even faster.
Eran

On Wed, Dec 16, 2015 at 5:23 PM Eran Witkon <[email protected]> wrote:

> Hi,
> I have about 8K files in about 10 directories on HDFS, and I need to add a
> column to every file containing that file's name (e.g. file1.txt gets a
> column with "file1.txt", file2.txt with "file2.txt", etc.).
>
> The current approach is to read all files using sc.wholeTextFiles("myPath"),
> which gives the file name as the key, and to add that key as a column to
> each file.
>
> 1) I run this on 5 servers, each with 24 cores and 24GB RAM, with this
> config:
> spark-shell --master yarn-client --executor-cores 5 --executor-memory 5G
> But when we run this on all directories at once
> (sc.wholeTextFiles("/MySource/*/*")) I get java.lang.OutOfMemoryError:
> Java heap space.
> When running on a single directory it all works well:
> sc.wholeTextFiles("/MySource/dir1/*").
>
> 2) Another option is not to use wholeTextFiles but to read each line with
> sc.textFile — but how can I get the file name with textFile?
>
> Eran
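For reference, the transformation described in the quoted message — take the (path, contents) pairs that wholeTextFiles returns and append the file name as a column to every line — can be sketched in plain Python without a Spark cluster. The function name, the comma delimiter, and the demo file names below are illustrative assumptions, not part of the original thread:

```python
# Minimal sketch of the "add the file name as a column" step.
# `files` plays the role of the (path, contents) pairs produced by
# sc.wholeTextFiles; in Spark this logic would run inside a flatMap.
import os

def add_filename_column(files):
    """files: iterable of (path, contents) pairs.
    Yields each line with the base file name appended as a column."""
    for path, contents in files:
        name = os.path.basename(path)  # e.g. "file1.txt"
        for line in contents.splitlines():
            yield f"{line},{name}"

# Hypothetical demo input mirroring the directory layout in the thread.
demo = [
    ("/MySource/dir1/file1.txt", "a,b\nc,d"),
    ("/MySource/dir1/file2.txt", "e,f"),
]
print(list(add_filename_column(demo)))
# → ['a,b,file1.txt', 'c,d,file1.txt', 'e,f,file2.txt']
```

Note that this per-line structure also hints at why wholeTextFiles runs out of memory here: it materializes each file's entire contents as a single value, whereas a line-oriented read only ever holds one record at a time.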
