Hello Spark Dev Community,
A friend of mine is facing an issue while reading 20 GB of log files from a
directory on the cluster.
The approaches tried so far are below:
*1. This gives an out-of-memory error.*
val logRDD =
  sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val mappedRDD = logRDD.flatMap { x => x._2.split("[^A-Za-z']+") }
  .map { x => x.replaceAll("""\n""", " ") }
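As a side note, the following is only a sketch of the usual workaround, not the original code: wholeTextFiles materialises each file as a single (filename, contents) record, so a few multi-GB log files can overflow a single task, while sc.textFile reads the same files line by line. The path and the tokenising regex are taken from the snippet above, `sc` is assumed to be the shell's predefined SparkContext as in that snippet, and the count() at the end is only an illustrative action.

// line-based read: each record is one log line, not one whole file
val lines = sc.textFile("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
// same tokenisation as the wholeTextFiles version above
val words = lines.flatMap(_.split("[^A-Za-z']+"))
println(words.count())  // illustrative action to trigger the job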
Forgot to mention: I also tried providing the number of partitions while
reading.
1.
val logRDD =
  sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200)
val mappedRDD = logRDD.flatMap { x => x._2.split("[^A-Za-z']+") }
  .map { x => x.replaceAll("""\n""", " ") }
mappedRD
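Again only a hedged note, not the friend's code: the second argument to wholeTextFiles is a minimum-partitions hint, and each file still arrives as one in-memory string, so asking for 200 partitions does not shrink the largest record. It may still be worth checking how many partitions were actually created, and whether the executors simply need more memory; the spark-submit values below are illustrative assumptions, not a recommendation.

println(logRDD.getNumPartitions)         // how many partitions Spark actually created
val spread = mappedRDD.repartition(200)  // redistribute records before any heavy action

// Illustrative memory settings on submit (values are assumptions, tune for the cluster):
// ./bin/spark-submit --master <master-url> \
//   --driver-memory 4g --executor-memory 8g \
//   your-app.jar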
Hi All,
I have a large volume of data, nearly 500 TB (from 2016 to 2018, till date),
and I have to do some ETL on that data.
This data is in AWS S3, so I am planning to use an AWS EMR setup to
process it, but I am not sure what configuration I should select.
1. Do I need to process monthly o
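For reference, a minimal month-at-a-time sketch, assuming Spark on EMR and an S3 layout with year/month prefixes; the bucket name, prefixes, input format, and output path are placeholders, since the actual layout is not described above.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-etl").getOrCreate()

// read one month at a time so the cluster never holds all 500 TB at once
val oneMonth = spark.read.text("s3://my-bucket/logs/year=2017/month=03/")  // placeholder path

// ... ETL transformations would go here ...

// write the processed slice back to S3 in a columnar format
oneMonth.write.mode("overwrite").parquet("s3://my-bucket/etl-output/year=2017/month=03/")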