Forgot to mention, I tried with providing the number of partitions while reading:
1.
val logRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200)
val mappedRDD = logRDD.flatMap { x => x._2.split("[^A-Za-z']+") }.map { x => x.replaceAll("""\n""", " ") }
mappedRDD.collect()

2.
val textlogRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hduser-org.apache.spark.deploy.master.Master-1-chetan-ThinkPad-E460.out", 200)
val textMappedRDD = textlogRDD.flatMap { x => x.split("[^A-Za-z']+") }.map { y => y.replaceAll("""\n""", " ") }
textMappedRDD.collect()

3.
val tempRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200)
  .flatMap(files => files._2.split("[^A-Za-z']+"))
  .map(_.replaceAll("""\n""", " "))

Thanks

On Sat, Aug 25, 2018 at 3:38 PM Chetan Khatri <chetan.opensou...@gmail.com> wrote:

> Hello Spark Dev Community,
>
> A friend of mine is facing an issue while reading 20 GB of log files from a
> directory on the cluster. The approaches tried so far are below.
>
> *1. This gives an out-of-memory error.*
> val logRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
> val mappedRDD = logRDD.flatMap { x => x._2.split("[^A-Za-z']+") }.map { x => x.replaceAll("""\n""", " ") }
>
> *2. Individual files can be processed with the approach below.*
> val textlogRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hduser-org.apache.spark.deploy.master.Master-1-chetan-ThinkPad-E460.out")
> val textMappedRDD = textlogRDD.flatMap { x => x.split("[^A-Za-z']+") }.map { y => y.replaceAll("""\n""", " ") }
>
> *3. Could be worth a try.*
> val tempRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
>   .flatMap(files => files._2.split("[^A-Za-z']+"))
>   .map(_.replaceAll("""\n""", " "))
>
> *Thoughts:*
> 1. What I am thinking is: if OutOfMemory is the issue, then increasing
> driver-memory at spark-submit can help, because *collect()* pulls
> everything onto the driver node.
> 2. Or persist the RDD with StorageLevel.MEMORY_AND_DISK_SER_2 and then
> proceed further.
>
> Any suggestions, please?
>
> Thanks
> -Chetan
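
A note on why approach 1 runs out of memory while approach 2 does not: wholeTextFiles returns one (path, content) pair per file, so each file's entire content must fit into a single record no matter how many partitions are requested, whereas textFile reads input line by line. textFile accepts the same glob, so a sketch along these lines (untested, reusing the paths above; lineRDD and wordRDD are illustrative names) should cover the whole directory without materializing any file in one piece:

val lineRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200)
// Individual lines contain no newlines, so the replaceAll step is not needed here.
val wordRDD = lineRDD.flatMap(_.split("[^A-Za-z']+")).filter(_.nonEmpty)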
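
On thought 1: raising driver memory only postpones the problem, because collect() still ships the entire result set to the driver. Keeping the action distributed avoids that; a minimal sketch, assuming the wordRDD above and a made-up output path:

// Instead of wordRDD.collect(), write the result out in parallel:
wordRDD.saveAsTextFile("file:///tmp/log-words")   // hypothetical output path
// Or inspect only a small sample on the driver:
wordRDD.take(20).foreach(println)

If collect() is truly required, the driver heap can be raised at submit time (e.g. spark-submit --driver-memory 8g ...), but it would then need to hold all ~20 GB at once.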
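
On thought 2: persisting helps when the RDD is reused across several actions. A minimal sketch, again using the hypothetical wordRDD (note the _2 suffix replicates each partition on two nodes, which aids fault tolerance but doubles storage; plain MEMORY_AND_DISK_SER may be sufficient):

import org.apache.spark.storage.StorageLevel

val persisted = wordRDD.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
persisted.count()    // first action materializes the persisted data
persisted.take(10)   // later actions reuse it instead of re-reading the logs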