Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
Hello Spark Dev Community,

A friend of mine is facing an out-of-memory error while reading 20 GB of log
files from a directory on the cluster. The approaches tried so far are listed below:

*1. This gives an out-of-memory error.*
val logRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val mappedRDD = logRDD.flatMap { x => x._2.split("[^A-Za-z']+") }
  .map { x => x.replaceAll("""\n""", " ") }

*2. Individual files can be processed with the approach below.*
val textlogRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hduser-org.apache.spark.deploy.master.Master-1-chetan-ThinkPad-E460.out")
val textMappedRDD = textlogRDD.flatMap { x => x.split("[^A-Za-z']+") }
  .map { y => y.replaceAll("""\n""", " ") }

*3. Another variant that could be tried.*
val tempRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
  .flatMap(files => files._2.split("[^A-Za-z']+").map(_.replaceAll("""\n""", " ")))
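
For comparison, a minimal sketch of a line-based read of the same directory,
using sc.textFile with a glob so that no single record has to hold an entire
file in memory; take(20) is only an illustrative action in place of collect():

val lineRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val tokenRDD = lineRDD.flatMap(_.split("[^A-Za-z']+"))
tokenRDD.take(20).foreach(println)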

*Thoughts:*
1. If OutOfMemory is the issue, then increasing driver-memory at spark-submit
might help, because *collect()* would be causing the problem by pulling
everything onto the driver node.
2. Or persist the RDD with StorageLevel.MEMORY_AND_DISK_SER_2 (spilling to
disk) and then proceed further.
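
A minimal sketch of option 2, assuming the mappedRDD from approach 1; the
storage level and the count() action are only illustrative:

import org.apache.spark.storage.StorageLevel

// Persist serialized, spilling partitions to disk when memory is short.
mappedRDD.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

// Trigger evaluation with an action that does not ship all data to the driver.
val tokenCount = mappedRDD.count()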

Any suggestions, please?

Thanks
-Chetan


Re: Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
To add to the above: I forgot to mention that I also tried providing the
number of partitions while reading.


1.
val logRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200)
val mappedRDD = logRDD.flatMap { x => x._2.split("[^A-Za-z']+") }
  .map { x => x.replaceAll("""\n""", " ") }
mappedRDD.collect()

2.
val textlogRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hduser-org.apache.spark.deploy.master.Master-1-chetan-ThinkPad-E460.out", 200)
val textMappedRDD = textlogRDD.flatMap { x => x.split("[^A-Za-z']+") }
  .map { y => y.replaceAll("""\n""", " ") }
textMappedRDD.collect()

3.
val tempRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200)
  .flatMap(files => files._2.split("[^A-Za-z']+").map(_.replaceAll("""\n""", " ")))

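A minimal sketch of checking the result without collect(), assuming the
mappedRDD from 1. above; the output path is only a placeholder:

// Count and sample instead of pulling all of the tokens back to the driver.
println(s"token count: ${mappedRDD.count()}")
mappedRDD.take(10).foreach(println)

// Or write the results out in parallel (placeholder path).
mappedRDD.saveAsTextFile("file:///tmp/log-tokens-output")
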
Thanks


Handling a very large volume (500 TB) of data using Spark

2018-08-25 Thread Great Info
Hi All,
I have a large volume of data, nearly 500 TB (from 2016 through 2018, till
date), and I have to do some ETL on that data.

The data is in AWS S3, so I am planning to use an AWS EMR setup to process
it, but I am not sure what configuration I should select.

1. Do I need to process the data month by month, or can I process it all at
once? (A rough sketch of the month-by-month option is given below.)
2. What should the master and slave (executor) memory be, in both RAM and storage?
3. What kind of processor (speed) do I need?
4. How many slaves do we need?

Based on this I want to estimate the cost of AWS EMR and start processing
the data.
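
A minimal sketch of the month-by-month option, assuming the S3 data is laid
out by year and month; the bucket name, layout, and Parquet format are all
hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("monthly-etl").getOrCreate()

for (year <- 2016 to 2018; month <- 1 to 12) {
  // Hypothetical layout: s3://my-bucket/raw/year=YYYY/month=MM/
  val path = f"s3://my-bucket/raw/year=$year/month=$month%02d/"
  val df = spark.read.parquet(path)  // or .json / .csv, depending on the format
  // ... ETL transformations go here ...
  df.write.mode("overwrite")
    .parquet(f"s3://my-bucket/processed/year=$year/month=$month%02d/")
}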

Regards
Indra