Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
Hello Spark Dev Community, A friend of mine is facing an issue while reading 20 GB of log files from a directory on the cluster. The approaches are as below: *1. This gives an out of memory error.* val logRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*") val mappedRDD = logRDD.flatMa
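
A minimal sketch (not the poster's full code) of one way to avoid this: sc.wholeTextFiles materializes each file as a single (path, content) record, so a handful of very large log files can blow up executor memory, whereas sc.textFile reads line by line and spreads records across partitions. The path and the tokenization pattern are taken from the preview above; everything else is an assumption.

// Sketch under the assumptions stated above: read the same logs line by line
// instead of as whole-file records, so no single record holds an entire file.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("log-word-split")
val sc = new SparkContext(conf)

// textFile yields individual lines, so the replaceAll on embedded newlines
// from the wholeTextFiles version is no longer needed.
val logRDD = sc.textFile("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val mappedRDD = logRDD.flatMap(line => line.split("[^A-Za-z']+"))
mappedRDD.take(10).foreach(println)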

Re: Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
+ Add: I forgot to mention that I also tried providing the number of partitions while reading: 1. val logRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200) val mappedRDD = logRDD.flatMap { x => x._2.split("[^A-Za-z']+") }.map { x => x.replaceAll("""\n""", " ")} mappedRD
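
Worth noting: the second argument to wholeTextFiles is only a minPartitions hint for grouping input files; each file still becomes one in-memory record, so passing 200 does not split a single large log file. A small spark-shell style sketch, reusing the same path as above, of how a partition hint does help once the read is line-based:

// Sketch assuming the same log path: with line-based reading, a minPartitions
// hint (or an explicit repartition) genuinely spreads the work, because no
// single record has to hold a whole file.
val logRDD = sc.textFile(
  "file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*",
  minPartitions = 200)

val mappedRDD = logRDD
  .flatMap(line => line.split("[^A-Za-z']+"))
  .repartition(200) // optional: rebalance if some log files are far larger than others

mappedRDD.count()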

Handling Very Large volume(500TB) data using spark

2018-08-25 Thread Great Info
Hi All, I have a large volume of data, nearly 500 TB (from 2016 to 2018, till date), and I have to do some ETL on it. The data is in AWS S3, so I am planning to use an AWS EMR setup to process it, but I am not sure what config I should select. 1. Do I need to process monthly o
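
A hedged sketch of the incremental approach, not a cluster-sizing recommendation: one common pattern is to process the S3 data one month at a time so that no single EMR job has to shuffle anywhere near 500 TB. The bucket name, prefix layout, input format, and filter below are hypothetical placeholders standing in for the real ETL logic.

// Sketch: loop over year/month prefixes and run the ETL per month.
// All paths and the filter are illustrative assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("monthly-etl").getOrCreate()

for (year <- 2016 to 2018; month <- 1 to 12) {
  val monthPrefix = f"s3://my-log-bucket/raw/year=$year/month=$month%02d/"
  val df = spark.read.json(monthPrefix)            // or .parquet / .csv, depending on the source format
  val transformed = df.filter("status IS NOT NULL") // stand-in for the actual ETL logic
  transformed.write
    .mode("overwrite")
    .parquet(f"s3://my-log-bucket/etl/year=$year/month=$month%02d/")
}

Processing per month also keeps each EMR cluster run short-lived, which tends to be cheaper than sizing one cluster for the whole 500 TB at once.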